CN109992576A - A government data quality assessment and abnormal data repair technique based on big data technology - Google Patents

Publication number
CN109992576A
Authority
CN
China
Prior art keywords
data
quality
rule
library
inspection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910156894.6A
Other languages
Chinese (zh)
Inventor
练海荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Longshi Information Technology Co Ltd
Original Assignee
Suzhou Longshi Information Technology Co Ltd
Application filed by Suzhou Longshi Information Technology Co Ltd
Priority to CN201910156894.6A
Publication of CN109992576A

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a government data quality assessment and abnormal data repair technique based on big data technology, in the field of data analysis. A database is established first, data quality assessment is carried out next, and data quality repair is performed last. The invention applies null-value, value-range, normalization, logic, referential-integrity, and duplicate-data checks to data fields; comprehensively assesses data quality along six dimensions (integrity, relevance, uniqueness, accuracy, consistency, and normativity); and generates a data quality assessment report. Users then repair the data manually, by rule, or by deep learning. This helps government break down internal data barriers, put data assets to use, raise the value of data, provide unified intelligent data services externally, and further mine and release the value of big data.

Description

A government data quality assessment and abnormal data repair technique based on big data technology
Technical field
The invention relates to the field of data analysis, and in particular to a government data quality assessment and abnormal data repair technique based on big data technology.
Background
The project is based on the PDCA quality control method (Plan, Do, Check, Act; proposed by the American quality control expert Dr. Shewhart and later adopted and popularized by Deming), the DQAF data quality assessment model (Data Quality Assessment Framework, an international data quality assessment framework published by the IMF together with the World Bank), the DAMA (Data Management Association International) data management functional framework, and deep-learning-based abnormal data repair. It establishes a complete, scientific big data governance system and set of standards, guarantees data quality, improves the efficiency with which government uses data, and keeps the data infrastructure "highway" efficient and unobstructed, laying a foundation for building smart government.
At present each government department has its own database system and manages it separately, so government information does not flow smoothly. The data in government databases is disorganized, and large volumes of data contain all kinds of problems that are hard to look up and discover, which easily leads to missing and inaccurate data: for example, ID card numbers left blank or filled in incorrectly in the population library, or incomplete or wrong information in the legal-person library. With this big data analysis and assessment method, the government information platforms are brought together and data quality is comprehensively assessed along six dimensions: integrity, relevance, uniqueness, accuracy, consistency, and normativity. On this basis, the invention provides a government data quality assessment and abnormal data repair technique based on big data technology to solve the above problems.
Summary of the invention
The purpose of the invention is to provide a government data quality assessment and abnormal data repair technique based on big data technology, to solve the problems raised in the background above.
To achieve this purpose, the invention provides the following technical solution: a government data quality assessment and abnormal data repair technique based on big data technology, with the following specific steps:
Step 1: Establish the database
The database includes a base library and theme libraries. The base library solution addresses the problems present in current government data and follows the approach of overall planning with unified construction and sharing. Starting from the departments that are the key data sources, it uses data acquisition and exchange, processing, information integration, and mining analysis to integrate data from human resources and social security, civil affairs, credit, public security, industry and commerce, health, education, transportation, and other commissions, offices, and bureaus; plans a system of standards and specifications; builds the base library; and on top of it provides data sharing services to government departments and the public. Typical clients include the Development and Reform Commission, the Economy and Information Commission, and the Big Data Bureau;
The theme libraries are founded on overall strategic planning theory and object-oriented methodology. In line with the client's business characteristics, data acquisition and exchange, data integration, and association analysis are used to build distinctive theme libraries that put data assets to use and lay a foundation for innovative thematic applications. Examples are the legal-person, special-equipment, food, drug, and license libraries of the Market Supervision Administration; the population, license, criminal-investigation, public-security-administration, and entry-exit libraries of the Public Security Bureau; and the population, social-organization, elderly, welfare, and marriage libraries of the Civil Affairs Bureau;
Step 2: Data quality assessment
(1) General rule management
General rule management covers five groups of rules: general, network, date, character, and numeric. The general group includes ID card number, mobile phone number, email address, postcode, and fixed-line telephone number. The regular expression for the ID card number is
^[1-9]\d{7}((0\d)|(1[0-2]))(([0|1|2]\d)|3[0-1])\d{3}$|^[1-9]\d{5}[1-9]\d{3}((0\d)|(1[0-2]))(([0|1|2]\d)|3[0-1])\d{3}([0-9]|X)$
The rule for the ID card number is described as:
Chinese second-generation ID card, e.g. 420106198311136666: fixed length of 18, the first 17 characters are digits, the last character is a digit or the letter X, and the value must be a legally valid ID card number. First-generation ID card, e.g. 420106831113666: fixed length of 15, with characters 7 to 12 forming a six-digit date;
The regular expression for the mobile phone number is
^1([38][0-9]|4[579]|5[0-3,5-9]|6[6]|7[0135678]|9[89])\d{8}$
The rule for the mobile phone number is described as:
e.g. 13666666666: starts with the digit 1 and has a fixed length of 11;
The regular expression for the email address is
^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$
The rule for the email address is described as:
e.g. 123@mail.com: the mailbox name may contain only English letters, digits, and underscores, must not begin with an underscore, and the address ends with characters such as .com, .cn, or .edu;
The regular expression for the postcode is
[1-9]\d{5}(?!\d)
The rule for the postcode is described as:
six digits in total, the first of which cannot be 0;
The regular expression for the fixed-line telephone number is
\d{3}-\d{8}|\d{4}-\d{7}
The rule for the fixed-line telephone number is described as:
e.g. 027-88880808-1, where 027 is the area code and 1 is the extension number, separated by "-"; the area code and the extension number may be omitted;
The network group includes IPv4 address, IPv6 address, and MAC address. The regular expression for the IPv4 address is
^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$
The rule for the IPv4 address is described as:
e.g. 000.000.000.000: four values in the range 0 to 255, separated by ".";
The regular expression for the IPv6 address is
^([\da-fA-F]{1,4}:){7}[\da-fA-F]{1,4}$
The rule for the IPv6 address is described as:
e.g. CDCD:910A:222:9:8475:11:390:2020: eight hexadecimal values of up to four digits each, separated by ":"; abbreviated and mixed notations are also supported, but the standard notation is recommended;
The regular expression for the MAC address is
[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}
The rule for the MAC address is described as:
e.g. 00-00-00-00-00-00: six two-digit hexadecimal numbers, separated by "-";
The date group includes the formats YYYY.MM.DD, YYYYMMDD, YYYY/MM/DD, YYYY年MM月DD日, YYYY, and YYYYMM, where YYYY is the year, MM the month, and DD the day. The regular expression for YYYY/MM/DD is
(\d{4})\/(\d{1,2})\/(\d{1,2})
The numeric group includes non-negative integer, integer, non-negative floating-point number, floating-point number, integer with percent sign, floating-point number with percent sign, integer with per-mille sign, and floating-point number with per-mille sign. The regular expression for the non-negative integer is
^[1-9]\d*|0$
The rule for the non-negative integer is described as: a string in non-negative integer format, e.g. 28;
The regular expression for the integer is
^-?[1-9]\d*$
The rule for the integer is described as: a string in integer format;
The regular expression for the non-negative floating-point number is
^\d+(\.\d+)?$
The rule for the non-negative floating-point number is described as: a string in non-negative floating-point format;
The regular expression for the floating-point number is
^(-?\d+)(\.\d+)?$
The rule for the floating-point number is described as: a string in floating-point format;
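The general-group rules above can be exercised with a short Python sketch. The patterns are transcribed from the rules as given; the dictionary keys, the `conforms` and `count_violations` helpers, and the choice to require a whole-string match are assumptions made for illustration, not part of the patent text.

```python
import re

# Patterns transcribed from the general group of rules. The key names are
# hypothetical labels chosen for this sketch.
GENERAL_RULES = {
    "id_card": (
        r"^[1-9]\d{7}((0\d)|(1[0-2]))(([0|1|2]\d)|3[0-1])\d{3}$"
        r"|^[1-9]\d{5}[1-9]\d{3}((0\d)|(1[0-2]))(([0|1|2]\d)|3[0-1])\d{3}([0-9]|X)$"
    ),
    "mobile": r"^1([38][0-9]|4[579]|5[0-3,5-9]|6[6]|7[0135678]|9[89])\d{8}$",
    "email": r"^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$",
    "postcode": r"[1-9]\d{5}(?!\d)",
    "landline": r"\d{3}-\d{8}|\d{4}-\d{7}",
}
COMPILED = {name: re.compile(pattern) for name, pattern in GENERAL_RULES.items()}


def conforms(rule: str, value: str) -> bool:
    """Return True when the entire value matches the named rule.

    Assumption: a field is valid only if the whole string matches, so
    fullmatch is used even for patterns written without ^ and $.
    """
    return COMPILED[rule].fullmatch(value) is not None


def count_violations(rule: str, values) -> int:
    """Count non-null values that break the rule (nulls are a separate check)."""
    return sum(1 for v in values if v is not None and not conforms(rule, v))
```

For example, `conforms("id_card", "420106198311136666")` accepts the 18-character sample from the rule description, and `count_violations("mobile", column_values)` gives the kind of per-column violation count that a normalization check reports.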
(2) Data quality model
A data quality model is established according to the association relations under general rule management. The data quality model is a DQAF-based data quality assessment model comprising entity tables, association relations, and rule descriptions. Entity table names are selected from the database; an association relation is the relation between a master table and a child table; and rule descriptions fall into six rule types: null-value check, value-range check, normalization check, logic check, duplicate-data check, and referential-integrity check;
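One way to picture the three parts of the model (entity tables, association relations, rule descriptions) is as plain data classes. This is a minimal sketch: every class and field name is hypothetical, and the `level` and `weight` fields are borrowed from the problem level and weight values used in the worked example of the embodiment.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RuleType(Enum):
    """The six rule types named in the rule description."""
    NULL_CHECK = "null-value check"
    RANGE_CHECK = "value-range check"
    FORMAT_CHECK = "normalization check"
    LOGIC_CHECK = "logic check"
    DUPLICATE_CHECK = "duplicate-data check"
    REFERENCE_CHECK = "referential-integrity check"


@dataclass
class Relation:
    """Association between a master table and a child table (names assumed)."""
    master_table: str
    child_table: str
    master_key: str
    child_key: str


@dataclass
class Rule:
    """One rule description attached to a table and its check columns."""
    rule_type: RuleType
    table: str
    columns: List[str]
    level: int   # problem level value, e.g. 5, 3, or 1
    weight: int


@dataclass
class QualityModel:
    """A data quality model: entity tables, relations, and rules."""
    name: str
    entity_tables: List[str]
    relations: List[Relation] = field(default_factory=list)
    rules: List[Rule] = field(default_factory=list)

    def rules_of(self, rule_type: RuleType) -> List[Rule]:
        """Return the rules of one type, e.g. all null-value checks."""
        return [r for r in self.rules if r.rule_type is rule_type]
```

Grouping rules by type in this way mirrors how the assessment later reports one weighted result per rule type.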
(3) Quality monitoring task
All data of a data quality model is exported by the name of the model, and the model is assessed according to its quality model name, quality model description, execution strategy, most recent execution status, most recent execution time, and so on, completing the quality monitoring task;
(4) Quality monitoring report
A quality monitoring report is generated from the quality monitoring task;
(5) Quality assessment report
Based on the content of the quality monitoring report, data quality is comprehensively assessed along six dimensions (integrity, relevance, uniqueness, accuracy, consistency, and normativity), and a quality assessment report organized by database category and database name is generated for the data in the database. The quality assessment report includes a quality score, quality score charts, and a data quality model score ranking. The quality score includes the overall quality score, the number of quality models, and the number of model rules; the quality score charts include a quality score trend chart and a data aggregate distribution chart; and the ranking orders the data quality models by score.
Preferably, in the data quality assessment, a fixed-character rule can carry a user-defined length range. The fixed-character rule supports the wildcards "*" and "?": "*" stands for any number of arbitrary characters and "?" stands for exactly one arbitrary character. For example, ABC* is matched by ABC, ABCD, and ABCDE; A?C is matched by strings such as ABC and ADC, whereas ABDC does not match.
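The wildcard semantics can be sketched by translating a fixed-character pattern into a regular expression. The function names are hypothetical, and treating "*" as "zero or more characters" is an assumption taken from the example in which ABC itself matches ABC*.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a fixed-character pattern with * and ? into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # any number of arbitrary characters
        elif ch == "?":
            parts.append(".")            # exactly one arbitrary character
        else:
            parts.append(re.escape(ch))  # every other character is literal
    return re.compile("".join(parts), re.DOTALL)


def wildcard_match(pattern: str, value: str) -> bool:
    """Return True when the whole value matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(value) is not None
```

With this sketch, `wildcard_match("A?C", "ADC")` holds while `wildcard_match("A?C", "ABDC")` does not, matching the examples in the rule description.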
Preferably, in the data quality model a detection period is set manually, and data quality is checked on schedule according to that period.
Preferably, the DQAF-based data quality assessment model draws on DQAF and on Part 12 (data quality model) and Part 24 (data quality measurement) of China's GB/T 25000.24-2017, "Systems and software engineering - Systems and software Quality Requirements and Evaluation (SQuaRE)", to establish a general data quality assessment model and thematic data quality assessment models for the urban data center. It supports advanced, comprehensive, and extensible quality evaluation algorithms such as null-value, value-range, normalization, logic, duplicate-data, referential-integrity, outlier, timeliness, missing-data, fluctuation, and balance checks; accommodates the definition of the various rules that arise in data center construction and data governance; establishes a scientific assessment model; and finally assesses the urban data center comprehensively along six dimensions: integrity, normativity, consistency, accuracy, uniqueness, and relevance.
Preferably, abnormal data is identified from the data quality assessment and reviewed, and data quality repair is performed on it. The repair methods are manual repair, rule-based repair, and deep repair. In manual repair, a person corrects the abnormal data in the database and enters the correct information from the computer keyboard. In rule-based repair, correct data is entered according to the general rules. Deep repair is the deep-learning abnormal data repair technique: it makes full use of mature Hadoop/Spark big data technology, uses deep learning algorithms to automate large-scale data governance, screens out abnormal data, and repairs it. Abnormal data is repaired with deep learning methods, which mainly include mean filling, the K-nearest-distance method, regression, maximum likelihood estimation, and multiple imputation; combined with manual review, they achieve efficient and accurate data governance and improve data quality.
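Two of the repair methods named above, mean filling and the K-nearest-distance method, can be illustrated in plain Python. This is a small in-memory sketch under stated assumptions: it is not a deep learning model and does not use Hadoop or Spark, missing values are represented as `None`, and the function names and the squared-Euclidean distance are choices made for the example.

```python
from statistics import mean
from typing import List, Optional


def mean_fill(column: List[Optional[float]]) -> List[Optional[float]]:
    """Replace missing values with the mean of the known values."""
    known = [v for v in column if v is not None]
    if not known:
        return list(column)  # nothing to learn from, leave unchanged
    m = mean(known)
    return [m if v is None else v for v in column]


def knn_fill(rows: List[List[Optional[float]]], target: int, k: int = 2):
    """Fill a missing value in column `target` from the k nearest complete rows.

    Distance is the squared Euclidean distance over the other columns that
    are present in the row being repaired (an assumption of this sketch).
    """
    complete = [r for r in rows if all(v is not None for v in r)]
    filled = []
    for r in rows:
        if r[target] is not None or not complete:
            filled.append(list(r))
            continue

        def dist(c, r=r):
            return sum((c[i] - r[i]) ** 2 for i in range(len(r))
                       if i != target and r[i] is not None)

        nearest = sorted(complete, key=dist)[:k]
        repaired = list(r)
        repaired[target] = mean(c[target] for c in nearest)
        filled.append(repaired)
    return filled
```

Mean filling ignores the other columns entirely, whereas the nearest-neighbour variant borrows values from similar records, which is why a manual review step remains useful for either method.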
Compared with the prior art, the beneficial effects of the invention are:
(1) By binding the database systems that are opened to this data platform, the invention applies null-value, value-range, normalization, logic, referential-integrity, and duplicate-data checks to data fields; comprehensively assesses data quality along six dimensions (integrity, relevance, uniqueness, accuracy, consistency, and normativity); generates a data quality assessment report; and lets users repair the data manually, by rule, or by deep learning.
(2) The invention provides an integrated solution for building and governing smart-city big data centers. From data origination, standards formulation, secure storage, and exchange and sharing, through application analysis, to scientific decision-making and precise prediction, it forms a complete urban big data governance and management system and builds an urban big data platform. From data standards, management, and operation through to decision-making, it forms a complete data ecosystem chain, continuously extends data boundaries, raises data quality and data utilization, and builds a healthy urban ecosystem. It helps government break down internal data barriers, put data assets to use, raise the value of data, provide unified intelligent data services externally, and further mine and release the value of big data.
Brief description of the drawings
To explain the technical solutions of the embodiments of the invention more clearly, the drawings needed to describe the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic table of the general rules of the invention.
Fig. 2 is a schematic table of the data quality assessment model of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Referring to Figs. 1-2, the invention provides a technical solution: a government data quality assessment and abnormal data repair technique based on big data technology, with the following specific steps:
Step 1: Establish the database
The database includes a base library and theme libraries. The base library solution addresses the problems present in current government data and follows the approach of overall planning with unified construction and sharing. Starting from the departments that are the key data sources, it uses data acquisition and exchange, processing, information integration, and mining analysis to integrate data from human resources and social security, civil affairs, credit, public security, industry and commerce, health, education, transportation, and other commissions, offices, and bureaus; plans a system of standards and specifications; builds the base library; and on top of it provides data sharing services to government departments and the public. Typical clients include the Development and Reform Commission, the Economy and Information Commission, and the Big Data Bureau;
The theme libraries are founded on overall strategic planning theory and object-oriented methodology. In line with the client's business characteristics, data acquisition and exchange, data integration, and association analysis are used to build distinctive theme libraries that put data assets to use and lay a foundation for innovative thematic applications. Examples are the legal-person, special-equipment, food, drug, and license libraries of the Market Supervision Administration; the population, license, criminal-investigation, public-security-administration, and entry-exit libraries of the Public Security Bureau; and the population, social-organization, elderly, welfare, and marriage libraries of the Civil Affairs Bureau;
Step 2: Data quality assessment
(1) General rule management
General rule management covers five groups of rules: general, network, date, character, and numeric. The general group includes ID card number, mobile phone number, email address, postcode, and fixed-line telephone number. The regular expression for the ID card number is
^[1-9]\d{7}((0\d)|(1[0-2]))(([0|1|2]\d)|3[0-1])\d{3}$|^[1-9]\d{5}[1-9]\d{3}((0\d)|(1[0-2]))(([0|1|2]\d)|3[0-1])\d{3}([0-9]|X)$
The rule for the ID card number is described as:
Chinese second-generation ID card, e.g. 420106198311136666: fixed length of 18, the first 17 characters are digits, the last character is a digit or the letter X, and the value must be a legally valid ID card number. First-generation ID card, e.g. 420106831113666: fixed length of 15, with characters 7 to 12 forming a six-digit date;
The regular expression for the mobile phone number is
^1([38][0-9]|4[579]|5[0-3,5-9]|6[6]|7[0135678]|9[89])\d{8}$
The rule for the mobile phone number is described as:
e.g. 13666666666: starts with the digit 1 and has a fixed length of 11;
The regular expression for the email address is
^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$
The rule for the email address is described as:
e.g. 123@mail.com: the mailbox name may contain only English letters, digits, and underscores, must not begin with an underscore, and the address ends with characters such as .com, .cn, or .edu;
The regular expression for the postcode is
[1-9]\d{5}(?!\d)
The rule for the postcode is described as:
six digits in total, the first of which cannot be 0;
The regular expression for the fixed-line telephone number is
\d{3}-\d{8}|\d{4}-\d{7}
The rule for the fixed-line telephone number is described as:
e.g. 027-88880808-1, where 027 is the area code and 1 is the extension number, separated by "-"; the area code and the extension number may be omitted;
The network group includes IPv4 address, IPv6 address, and MAC address. The regular expression for the IPv4 address is
^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$
The rule for the IPv4 address is described as:
e.g. 000.000.000.000: four values in the range 0 to 255, separated by ".";
The regular expression for the IPv6 address is
^([\da-fA-F]{1,4}:){7}[\da-fA-F]{1,4}$
The rule for the IPv6 address is described as:
e.g. CDCD:910A:222:9:8475:11:390:2020: eight hexadecimal values of up to four digits each, separated by ":"; abbreviated and mixed notations are also supported, but the standard notation is recommended;
The regular expression for the MAC address is
[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}
The rule for the MAC address is described as:
e.g. 00-00-00-00-00-00: six two-digit hexadecimal numbers, separated by "-";
The date group includes the formats YYYY.MM.DD, YYYYMMDD, YYYY/MM/DD, YYYY年MM月DD日, YYYY, and YYYYMM, where YYYY is the year, MM the month, and DD the day. The regular expression for YYYY/MM/DD is
(\d{4})\/(\d{1,2})\/(\d{1,2})
The numeric group includes non-negative integer, integer, non-negative floating-point number, floating-point number, integer with percent sign, floating-point number with percent sign, integer with per-mille sign, and floating-point number with per-mille sign. The regular expression for the non-negative integer is
^[1-9]\d*|0$
The rule for the non-negative integer is described as: a string in non-negative integer format, e.g. 28;
The regular expression for the integer is
^-?[1-9]\d*$
The rule for the integer is described as: a string in integer format;
The regular expression for the non-negative floating-point number is
^\d+(\.\d+)?$
The rule for the non-negative floating-point number is described as: a string in non-negative floating-point format;
The regular expression for the floating-point number is
^(-?\d+)(\.\d+)?$
The rule for the floating-point number is described as: a string in floating-point format;
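The network, date, and numeric rules can be checked the same way. In this sketch the patterns are transcribed from the rules above, while the key names and the `check` helper are hypothetical. Using `re.fullmatch` is a deliberate assumption: in a pattern such as `^[1-9]\d*|0$` the anchors bind to separate branches of the alternation, so a plain search would accept the prefix of a longer string.

```python
import re

# Patterns transcribed from the network, date, and numeric groups.
RULES = {
    "ipv4": r"^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$",
    "ipv6": r"^([\da-fA-F]{1,4}:){7}[\da-fA-F]{1,4}$",
    "date_slash": r"(\d{4})\/(\d{1,2})\/(\d{1,2})",
    "non_negative_integer": r"^[1-9]\d*|0$",
    "integer": r"^-?[1-9]\d*$",
    "non_negative_float": r"^\d+(\.\d+)?$",
    "float": r"^(-?\d+)(\.\d+)?$",
}


def check(rule: str, text: str) -> bool:
    """Return True when the whole text matches the named rule."""
    return re.fullmatch(RULES[rule], text) is not None


# With search instead of fullmatch, the first branch matches the prefix "12",
# which is why whole-string matching is used in check().
loose = re.search(RULES["non_negative_integer"], "12abc") is not None
strict = check("non_negative_integer", "12abc")
```

Here `loose` is true and `strict` is false, which shows the difference between the two matching modes on the same pattern.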
(2) Data quality model
A data quality model is established according to the association relations under general rule management. The data quality model is a DQAF-based data quality assessment model comprising entity tables, association relations, and rule descriptions. Entity table names are selected from the database; an association relation is the relation between a master table and a child table; and rule descriptions fall into six rule types: null-value check, value-range check, normalization check, logic check, duplicate-data check, and referential-integrity check;
(3) Quality monitoring task
All data of a data quality model is exported by the name of the model, and the model is assessed according to its quality model name, quality model description, execution strategy, most recent execution status, most recent execution time, and so on, completing the quality monitoring task;
(4) Quality monitoring report
A quality monitoring report is generated from the quality monitoring task;
(5) Quality assessment report
Based on the content of the quality monitoring report, data quality is comprehensively assessed along six dimensions (integrity, relevance, uniqueness, accuracy, consistency, and normativity), and a quality assessment report organized by database category and database name is generated for the data in the database. The quality assessment report includes a quality score, quality score charts, and a data quality model score ranking. The quality score includes the overall quality score, the number of quality models, and the number of model rules; the quality score charts include a quality score trend chart and a data aggregate distribution chart; and the ranking orders the data quality models by score;
In the data quality assessment, a fixed-character rule can carry a user-defined length range. The fixed-character rule supports the wildcards "*" and "?": "*" stands for any number of arbitrary characters and "?" stands for exactly one arbitrary character. For example, ABC* is matched by ABC, ABCD, and ABCDE; A?C is matched by strings such as ABC and ADC, whereas ABDC does not match.
In the data quality model, a detection period is set manually, and null-value, value-range, normalization, logic, referential-integrity, and duplicate-data checks are run on the data on schedule according to that period.
The DQAF-based data quality assessment model draws on DQAF and on Part 12 (data quality model) and Part 24 (data quality measurement) of China's GB/T 25000.24-2017, "Systems and software engineering - Systems and software Quality Requirements and Evaluation (SQuaRE)", to establish a general data quality assessment model and thematic data quality assessment models for the urban data center. It supports advanced, comprehensive, and extensible quality evaluation algorithms such as null-value, value-range, normalization, logic, duplicate-data, referential-integrity, outlier, timeliness, missing-data, fluctuation, and balance checks; accommodates the definition of the various rules that arise in data center construction and data governance; establishes a scientific assessment model; and finally assesses the urban data center comprehensively along six dimensions: integrity, normativity, consistency, accuracy, uniqueness, and relevance.
Abnormal data is identified from the data quality assessment and reviewed, and data quality repair is performed on it. The repair methods are manual repair, rule-based repair, and deep repair. In manual repair, a person corrects the abnormal data in the database and enters the correct information from the computer keyboard. In rule-based repair, correct data is entered according to the general rules. Deep repair is the deep-learning abnormal data repair technique: it makes full use of mature Hadoop/Spark big data technology, uses deep learning algorithms to automate large-scale data governance, screens out abnormal data, and repairs it. Abnormal data is repaired with deep learning methods, which mainly include mean filling, the K-nearest-distance method, regression, maximum likelihood estimation, and multiple imputation; combined with manual review, they achieve efficient and accurate data governance and improve data quality.
A concrete application of this embodiment is as follows:
The data quality assessment model comprises entity tables, association relations, and rule descriptions, and the entity table names are selected from the database. Suppose the database contains table1, table2, table3, table4, and table5. table1 contains columns C11, C12, C13, C14, C15, C16, C17, and C18; table2 contains C21, C22, C23, C24, C25, C26, C27, and C28; table3 contains C31, C32, C33, C34, C35, C36, C37, and C38; table4 contains C41, C42, C43, C44, C45, C46, C47, and C48; and table5 contains C51, C52, C53, C54, C55, C56, C57, and C58. The data volumes of table1, table2, table3, and table4 are 1000, 800, 1500, and 2000 rows respectively;
An association relation is the relation between a master table and a child table, and rule descriptions fall into six rule types: null-value check, value-range check, normalization check, logic check, duplicate-data check, and referential-integrity check;
Let the null-value check have 3 rules, which check columns C11, C12, and C13 of table1, columns C24 and C25 of table2, and column C38 of table3. Running "select count(*) from table1 where C11 is null or C12 is null or C13 is null" gives, for check columns C11, C12, and C13 of table1: problem level "important", level value 5, weight 9, check condition "all not null", and 100 rows violating the rule. Running "select count(*) from table2 where C24 is null and C25 is null" gives, for check columns C24 and C25 of table2: problem level "serious", level value 3, weight 7, check condition "at least one not null", and 80 rows violating the rule. Running "select count(*) from table3 where C38 is null" gives, for check column C38 of table3: problem level "general", level value 1, weight 3, check condition "all not null", and 50 rows violating the rule. The weighted value of the null-value check is calculated as 2400, the total weight as 28, and the weighted average as 85.71428571;
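The difference between the two null-check conditions ("all not null" counted with `or`, "at least one not null" counted with `and`) can be reproduced with an in-memory SQLite table. The four sample rows are invented for illustration; only the table name, the column names, and the shape of the two queries come from the example above.

```python
import sqlite3

# In-memory database with made-up rows covering every null combination.
conn = sqlite3.connect(":memory:")
conn.execute("create table table2 (C24 text, C25 text)")
conn.executemany(
    "insert into table2 values (?, ?)",
    [("a", "b"), (None, "b"), ("a", None), (None, None)],
)

# Condition "all not null": a row violates it when ANY check column is null.
ALL_NOT_NULL = "select count(*) from table2 where C24 is null or C25 is null"
# Condition "at least one not null": a row violates it only when ALL are null.
AT_LEAST_ONE = "select count(*) from table2 where C24 is null and C25 is null"


def violations(sql: str) -> int:
    """Run a counting query and return the number of violating rows."""
    return conn.execute(sql).fetchone()[0]
```

On the sample rows, the stricter "all not null" condition flags three rows and the "at least one not null" condition flags one.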
The regular number for enabling codomain inspection is 2, respectively to the inspection column C18 of the database auditing table table1 of codomain inspection It is checked with the inspection column C34 of table3, inputs " selectcount (*) fromtable1where!(C18 > minimum value AndC18≤maximum value) ", the problem of obtaining the inspection column C18 of table1 rank be serious, rank 3, weight 6 violates Regular data amount is 300;Input " selectcount (*) fromtable3where!(C34 > minimum value andC34≤maximum Value) ", the problem of obtaining the inspection column C34 of table3 rank be general, rank 1, weight 2, the data volume that breaks the rules is 350;The weighted value for calculating codomain inspection is 3750, and total weight is 12, weighted average 312.5;
Enable normalized checking regular number be 3, respectively to the inspection column C15 of the database auditing table table1 of normalized checking, The inspection column C48 of the inspection column C44 and table4 of table4 is checked, " selectcount (*) is inputted fromtable1where!C15regrxp' identity card regular expression ' ", the problem of obtaining the inspection column C15 of table1 rank To be serious, rank 3, weight 8, inspection condition is identity card, and the data volume that breaks the rules is 7;Input " selectcount (*) from table1where!C15regrxp' phone number regular expression ' ", the problem of obtaining the inspection column C44 of table4 Rank is general, rank 1, and weight 3, inspection condition is phone number, and the data volume that breaks the rules is 9;Input "selectcount(*)fromtable1where!C15regrxp ' mailbox regular expression ' ", obtain the inspection column of table4 The problem of C48, rank was general, rank 1, and weight 4, inspection condition is mailbox, and the data volume that breaks the rules is 20;Calculate rule The weighted value of model inspection is 213, and total weight is 20, weighted average 10.65;
Let the number of logical check rules be 2, applied respectively to table1 and table2. The check formula for table1 is "CASE WHEN C14 IS NULL THEN C15 IN ('A', 'B') END"; entering "SELECT COUNT(*) FROM table1 WHERE !(CASE WHEN C14 IS NULL THEN C15 IN ('A', 'B') END)" gives, for table1, problem level "serious" (level 3), weight 7, and 10 rows violating the rule. The check formula for table2 is "IF(C24 IS NOT NULL, C24 IN ('1', '2'), 1=1)"; entering "SELECT COUNT(*) FROM table2 WHERE !IF(C24 IS NOT NULL, C24 IN ('1', '2'), 1=1)" gives, for table2, problem level "general" (level 1), weight 1, and 30 rows violating the rule. The weighted value of the logical check is calculated as 160, the total weight as 12, and the weighted average as 13.3333333;
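The conditional rule for table1 can be tried out with Python's sqlite3 module, as a rough sketch. SQLite has no prefix "!" operator, so NOT is used instead, and the sample rows are invented. The sketch mainly shows how SQL's three-valued logic decides which rows are counted: a CASE with no matching branch yields NULL, and NOT NULL is still NULL, so such rows are not treated as violations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (C14 TEXT, C15 TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [
        (None, "A"),   # C14 null, C15 allowed      -> passes
        (None, "Z"),   # C14 null, C15 not allowed  -> violation
        ("x", "Z"),    # C14 present: CASE yields NULL, rule does not apply
        (None, None),  # C15 null: IN yields NULL, so the row is not counted
    ],
)

logic_violations = conn.execute(
    """
    SELECT COUNT(*) FROM table1
    WHERE NOT (CASE WHEN C14 IS NULL THEN C15 IN ('A', 'B') END)
    """
).fetchone()[0]

print(logic_violations)
```

Whether a null C15 should count as a violation is a policy decision the example does not settle; if it should, an explicit "OR C15 IS NULL" branch would be needed.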
Let the number of repeated-data check rules be 3, applied respectively to check columns C11, C12 and C13 of table2, check columns C34 and C35 of table3, and check columns C41 and C42 of table4. Entering "SELECT COUNT(*) FROM table2 GROUP BY C11, C12, C13 HAVING COUNT(*) > 1" gives, for check columns C11, C12 and C13 of table2, problem level "serious" (level 3), weight 6, and 100 rows violating the rule. Entering "SELECT COUNT(*) FROM table3 GROUP BY C34, C35 HAVING COUNT(*) > 1" gives, for check columns C34 and C35 of table3, problem level "serious" (level 3), weight 7, and 150 rows violating the rule. Entering "SELECT COUNT(*) FROM table4 GROUP BY C41, C42 HAVING COUNT(*) > 1" gives, for check columns C41 and C42 of table4, problem level "general" (level 1), weight 3, and 130 rows violating the rule. The weighted value of the repeated-data check is calculated as 2920, the total weight as 23, and the weighted average as 126.9565217;
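One detail of the repeated-data query is worth a small sketch: a GROUP BY ... HAVING query returns one row per duplicated key rather than a single number. The sketch below (Python sqlite3, invented sample rows) shows the per-group result and one possible way to collapse it into a single figure. Treating "violating rows" as all rows belonging to a duplicated key is an assumption; the example does not say how the single figure is derived.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table3 (C34 TEXT, C35 TEXT)")
conn.executemany(
    "INSERT INTO table3 VALUES (?, ?)",
    [("a", "1"), ("a", "1"), ("a", "1"), ("b", "2"), ("b", "2"), ("c", "3")],
)

# The query as written: one result row per duplicated (C34, C35) key,
# each holding the size of that group.
group_sizes = [
    row[0]
    for row in conn.execute(
        "SELECT COUNT(*) FROM table3 GROUP BY C34, C35 "
        "HAVING COUNT(*) > 1 ORDER BY COUNT(*) DESC"
    )
]

# Assumed aggregation: total number of rows that share a key with another row.
duplicate_rows = conn.execute(
    """
    SELECT COALESCE(SUM(n), 0) FROM (
        SELECT COUNT(*) AS n FROM table3
        GROUP BY C34, C35 HAVING COUNT(*) > 1
    )
    """
).fetchone()[0]

print(group_sizes, duplicate_rows)
```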
Let the number of referential integrity check rules be 3, applied respectively to check column C18 of table1, check column C28 of table2 and check column C38 of table3. Entering "SELECT COUNT(*) FROM table1 WHERE C18 NOT IN ('A', 'B', 'C')" gives, for check column C18 of table1, problem level "important" (level 5), weight 9, and 130 rows violating the rule. Entering "SELECT COUNT(*) FROM table2 WHERE C28 NOT IN ('A', 'B', 'C')" gives, for check column C28 of table2, problem level "serious" (level 3), weight 6, and 170 rows violating the rule. Entering "SELECT COUNT(*) FROM table3 WHERE C38 NOT IN ('A', 'B', 'C')" gives, for check column C38 of table3, problem level "general" (level 1), weight 3, and 190 rows violating the rule. The weighted value of the referential integrity check is calculated as 4110, the total weight as 27, and the weighted average as 152.2222222;
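The comprehensive score is calculated as 100 - (SUM(weighted average of the null-value check : weighted average of the referential integrity check) / SUM(data volume of table1 : data volume of table4)) * 100 = 100 - (SUM(85.71428571 : 152.2222222) / SUM(1000 : 2000)) * 100 = 86.76648372, where SUM(a : b) denotes the sum over the range running from a to b, that is, over the six weighted averages in the numerator and over the data volumes of the four tables in the denominator.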
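The arithmetic of the worked example can be replayed in a few lines of Python. Two points in this sketch are assumptions rather than statements from the text: each rule is taken to contribute (level + weight) × violating rows, which reproduces every weighted value and total weight quoted above, and the total data volume of table1 to table4 is taken as 5300, which is the value that reproduces the stated result (the individual table sizes are not restated in this passage). The function names are hypothetical.

```python
# (level, weight, violating_rows) for every rule quoted in the worked example.
RULES = {
    "null value": [(5, 9, 100), (3, 7, 80), (1, 3, 50)],
    "value range": [(3, 6, 300), (1, 2, 350)],
    "normative": [(3, 8, 7), (1, 3, 9), (1, 4, 20)],
    "logical": [(3, 7, 10), (1, 1, 30)],
    "repeated data": [(3, 6, 100), (3, 7, 150), (1, 3, 130)],
    "referential integrity": [(5, 9, 130), (3, 6, 170), (1, 3, 190)],
}


def weighted_average(rules):
    """Assumed weighting: each rule counts (level + weight) * violating_rows."""
    weighted_value = sum((level + weight) * rows for level, weight, rows in rules)
    total_weight = sum(level + weight for level, weight, _ in rules)
    return weighted_value / total_weight


def comprehensive_score(rule_groups, total_rows):
    """100 - (sum of the per-check weighted averages / total rows) * 100."""
    total = sum(weighted_average(rules) for rules in rule_groups.values())
    return 100 - total / total_rows * 100


averages = {name: weighted_average(rules) for name, rules in RULES.items()}

# total_rows=5300 is an assumption inferred from the stated final score.
score = comprehensive_score(RULES, total_rows=5300)
print(averages, score)
```

Because the numerator sums weighted averages rather than raw violation counts, the score is sensitive to the choice of levels and weights as well as to the number of bad rows.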
By binding the database systems opened to this data platform, null-value, value-range, normative, logical, referential and repeated-data checks are performed on the data fields; data quality is assessed comprehensively along six dimensions (integrity, relevance, uniqueness, accuracy, consistency and normativity); a data quality assessment report is produced; and the user repairs the data by manual repair, rule-based repair or deep-learning repair. Built on a smart-city big data center, the invention provides an integrated governance solution that runs from data origination, standard formulation, secure storage, exchange and sharing, and applied analysis through to scientific decision-making and accurate prediction, forming a complete urban big data governance and management system. It constructs an urban big data platform and a complete data ecosystem chain from data standards, management and operation to decision-making, continuously extends data boundaries, improves data quality and data utilization, and builds a healthy urban ecosystem. It helps government break down internal data barriers, revitalize data assets, increase data value and provide unified intelligent data services externally, further mining and releasing the value dividend of big data.
In the description of this specification, reference to the terms "one embodiment", "example", "specific example" and the like means that a particular feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended only to help illustrate the invention. The preferred embodiments do not describe every detail exhaustively, nor do they limit the invention to the specific embodiments described. Obviously, many modifications and variations can be made in light of the content of this specification. These embodiments were chosen and described in detail in order to better explain the principles and practical application of the invention, so that those skilled in the art can better understand and use it. The invention is limited only by the claims and their full scope and equivalents.

Claims (5)

1. A government data quality assessment technique based on big data technology, characterized in that the specific steps are as follows:
Step 1: establish the database
The database includes a base library and a theme library. The base library is built to address the problems present in current government data: following the approach of overall planning and of building once and sharing, and relying on the departments that are the key data sources, it integrates the data of committees, offices and bureaus such as human resources and social security, civil affairs, credit, public security, industry and commerce, health, education and transportation by means of data acquisition and exchange, processing, information integration and mining analysis; plans the standards and specification system; constructs the base library; and on this base library provides data sharing services for government departments and the public. Corresponding clients include the Development and Reform Commission, the Economy and Information Commission and the Big Data Bureau;
The theme library is based on the overall strategic plan and on object-oriented methodology. In combination with the client's business characteristics, distinctive theme libraries are established by means of data acquisition and exchange, data integration and association analysis, revitalizing data assets and laying the foundation for innovative thematic applications. Examples include the legal-person, special-equipment, food, drug and license libraries of the Market Supervision Administration; the population, license, criminal-investigation, public-security-administration and entry-exit libraries of the Public Security Bureau; and the population, social-organization, elderly, welfare and marriage libraries of the Civil Affairs Bureau;
Step 2: data quality assessment
(1) General rule management
General rule management includes five groups of rules: general, network, date, character and numerical. The general group includes identity card, mobile phone number, mailbox, postcode and fixed-line telephone. The regular expression for the identity card is
^[1-9]\d{7}((0\d)|(1[0-2]))(([0|1|2]\d)|3[0-1])\d{3}$|^[1-9]\d{5}[1-9]\d{3}((0\d)|(1[0-2]))(([0|1|2]\d)|3[0-1])\d{3}([0-9]|X)$,
and the rule for the identity card is described as:
Chinese second-generation identity card, e.g. 420106198311136666: fixed length of 18, the first 17 characters are digits and the last is a digit or the letter X, and it must be a legally valid identity card number; first-generation identity card, e.g. 420106831113666: fixed length of 15, with positions 7 to 12 forming a six-digit date;
The regular expression for the mobile phone number is
^1([38][0-9]|4[579]|5[0-3,5-9]|6[6]|7[0135678]|9[89])\d{8}$,
and the rule for the mobile phone number is described as:
e.g. 13666666666: begins with the digit 1 and has a fixed length of 11;
The regular expression for the mailbox is
^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$,
and the rule for the mailbox is described as:
e.g. 123@mail.com: the mailbox name may contain only English letters, digits and underscores (with a restriction on the position of the underscore), and the address ends with a suffix such as .com, .cn or .edu;
The regular expression for the postcode is [1-9]\d{5}(?!\d), and the rule for the postcode is described as:
it cannot begin with 0 and consists of 6 digits in total;
The regular expression for the fixed-line telephone is \d{3}-\d{8}|\d{4}-\d{7}, and the rule for the fixed-line telephone is described as:
e.g. 027-88880808-1, where 027 is the area code and 1 is the extension number, separated by "-"; the area code and the extension number may be omitted;
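For illustration, the general-group expressions can be exercised with Python's re module, assuming they are read as standard regular expressions (with the usual \d, \w and \. escapes). The helper name conforms is hypothetical, and whole-string matching via fullmatch is a choice made for this sketch. Two things the sketch makes visible: the identity card expression checks only the format and the date fields, not the check digit, and the fixed-line expression as given covers neither the extension number in the 027-88880808-1 example nor a missing area code.

```python
import re

ID_CARD = re.compile(
    r"^[1-9]\d{7}((0\d)|(1[0-2]))(([0|1|2]\d)|3[0-1])\d{3}$"
    r"|^[1-9]\d{5}[1-9]\d{3}((0\d)|(1[0-2]))(([0|1|2]\d)|3[0-1])\d{3}([0-9]|X)$"
)
MOBILE = re.compile(r"^1([38][0-9]|4[579]|5[0-3,5-9]|6[6]|7[0135678]|9[89])\d{8}$")
EMAIL = re.compile(r"^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$")
POSTCODE = re.compile(r"[1-9]\d{5}(?!\d)")
LANDLINE = re.compile(r"\d{3}-\d{8}|\d{4}-\d{7}")


def conforms(pattern, value):
    """Return True when the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None


samples = {
    "18-digit id": conforms(ID_CARD, "420106198311136666"),
    "15-digit id": conforms(ID_CARD, "420106831113666"),
    "id with month 13": conforms(ID_CARD, "420106198313136666"),
    "mobile": conforms(MOBILE, "13666666666"),
    "email": conforms(EMAIL, "123@mail.com"),
    "postcode": conforms(POSTCODE, "430000"),
    "landline": conforms(LANDLINE, "027-88880808"),
    "landline with extension": conforms(LANDLINE, "027-88880808-1"),
}
print(samples)
```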
The network group includes IPv4 address, IPv6 address and MAC address. The regular expression for the IPv4 address is
^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$,
and the rule for the IPv4 address is described as:
e.g. 000.000.000.000: composed of 4 values from 0 to 255, separated by ".";
The regular expression for the IPv6 address is
^([\da-fA-F]{1,4}:){7}[\da-fA-F]{1,4}$,
and the rule for the IPv6 address is described as:
e.g. CDCD:910A:222:9:8475:11:390:2020: composed of 8 groups of four-digit hexadecimal values, separated by ":"; abbreviated and mixed notations are also supported, but the standard notation is recommended;
The regular expression for the MAC address is
[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5},
and the rule for the MAC address is described as:
e.g. 00-00-00-00-00-00: composed of 6 two-digit hexadecimal numbers, separated by "-";
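The network-group expressions can be checked the same way, again assuming they are read as standard regular expressions. The sketch shows two gaps between the expressions and their descriptions: the IPv6 expression accepts only the full eight-group form (a shorthand such as ::1 is rejected), and the MAC expression as given uses ":" while the example uses "-". MAC_EITHER is a hypothetical variant, not part of the text, that accepts both separators.

```python
import re

IPV4 = re.compile(r"^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$")
IPV6 = re.compile(r"^([\da-fA-F]{1,4}:){7}[\da-fA-F]{1,4}$")
MAC = re.compile(r"[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}")

# Hypothetical variant (an assumption): accept either ":" or "-" as separator.
MAC_EITHER = re.compile(r"[0-9a-fA-F]{2}([:-][0-9a-fA-F]{2}){5}")


def conforms(pattern, value):
    """Return True when the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None


checks = {
    "ipv4 zeros": conforms(IPV4, "000.000.000.000"),
    "ipv4 out of range": conforms(IPV4, "256.1.1.1"),
    "ipv6 full form": conforms(IPV6, "CDCD:910A:222:9:8475:11:390:2020"),
    "ipv6 shorthand": conforms(IPV6, "::1"),
    "mac with colons": conforms(MAC, "00:00:00:00:00:00"),
    "mac with hyphens": conforms(MAC, "00-00-00-00-00-00"),
    "mac with hyphens, variant": conforms(MAC_EITHER, "00-00-00-00-00-00"),
}
print(checks)
```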
The date group includes YYYY.MM.DD, YYYYMMDD, YYYY/MM/DD, YYYY年MM月DD日 (year, month and day written with Chinese date characters), YYYY and YYYYMM, where YYYY is the year, MM is the month and DD is the day; the regular expression for YYYY/MM/DD is (\d{4})/(\d{1,2})/(\d{1,2});
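The YYYY/MM/DD expression constrains only the number of digits, so a value such as 2019/13/01 still matches it. A possible sketch in Python pairs the expression (read with the usual \d escapes) with a calendar check from the standard datetime module. The calendar check and the function name parse_slash_date are additions for illustration, not part of the rule as stated.

```python
import re
from datetime import date

DATE_SLASH = re.compile(r"(\d{4})/(\d{1,2})/(\d{1,2})")


def parse_slash_date(text):
    """Return a date for a well-formed, real YYYY/MM/DD value, else None."""
    match = DATE_SLASH.fullmatch(text)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        # date() rejects impossible values such as month 13 or 30 February.
        return date(year, month, day)
    except ValueError:
        return None


print(parse_slash_date("2019/03/01"), parse_slash_date("2019/13/01"))
```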
The numerical group includes non-negative integer, integer, non-negative floating-point number, floating-point number, integer with percent sign, floating-point number with percent sign, integer with per-mille sign and floating-point number with per-mille sign. The regular expression for the non-negative integer is ^[1-9]\d*|0$, and the rule for the non-negative integer is described as a character string in non-negative integer format, e.g. 28;
The regular expression for the integer is ^-?[1-9]\d*$, and the rule for the integer is described as a character string in integer format;
The regular expression for the non-negative floating-point number is ^\d+(\.\d+)?$, and the rule for the non-negative floating-point number is described as a character string in non-negative floating-point format;
The regular expression for the floating-point number is ^(-?\d+)(\.\d+)?$, and the rule for the floating-point number is described as a character string in floating-point format;
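A short Python sketch of the numerical expressions, read as standard regular expressions, brings out two behaviours that are easy to miss. In ^[1-9]\d*|0$ the alternation binds more loosely than the anchors, so a plain search accepts a value such as 28abc; matching the whole string avoids this. Also, the integer expression as given does not accept 0. Using fullmatch is a choice made for this sketch, not something the rule prescribes.

```python
import re

NON_NEGATIVE_INT = re.compile(r"^[1-9]\d*|0$")
INTEGER = re.compile(r"^-?[1-9]\d*$")
NON_NEGATIVE_FLOAT = re.compile(r"^\d+(\.\d+)?$")
FLOAT = re.compile(r"^(-?\d+)(\.\d+)?$")


def conforms(pattern, value):
    """Return True when the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None


# With search(), only the first alternative is anchored at the start and only
# the second at the end, so trailing junk slips through.
loose_match = NON_NEGATIVE_INT.search("28abc") is not None
strict_match = conforms(NON_NEGATIVE_INT, "28abc")

results = {
    "28": conforms(NON_NEGATIVE_INT, "28"),
    "0": conforms(NON_NEGATIVE_INT, "0"),
    "007": conforms(NON_NEGATIVE_INT, "007"),
    "-5 as integer": conforms(INTEGER, "-5"),
    "0 as integer": conforms(INTEGER, "0"),
    "0.5 as non-negative float": conforms(NON_NEGATIVE_FLOAT, "0.5"),
    "-0.5 as non-negative float": conforms(NON_NEGATIVE_FLOAT, "-0.5"),
    "-3.14 as float": conforms(FLOAT, "-3.14"),
    "3. as float": conforms(FLOAT, "3."),
}
print(loose_match, strict_match, results)
```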
(2) Data quality model
A data quality model is established according to the association relationships of the general rule management. The data quality model is a data quality assessment model based on DQAF and comprises entity tables, association relationships and rule descriptions. The entity table names are selected from the database; the association relationship is the relationship between a main table and its sub-tables; and the rule descriptions are divided into six rule types: null-value check, value-range check, normative check, logical check, repeated-data check and referential integrity check;
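One way to picture the three parts of such a model (entity tables, association relationships, rule descriptions) is as a small data structure. The Python sketch below is purely illustrative: every class and field name is hypothetical, and the level and weight fields are borrowed from the worked example in the description rather than from this claim.

```python
from dataclasses import dataclass, field
from enum import Enum


class RuleType(Enum):
    """The six rule types named in the text."""
    NULL_VALUE = "null-value check"
    VALUE_RANGE = "value-range check"
    NORMATIVE = "normative check"
    LOGICAL = "logical check"
    REPEATED_DATA = "repeated-data check"
    REFERENTIAL_INTEGRITY = "referential integrity check"


@dataclass
class Rule:
    rule_type: RuleType
    table: str
    columns: tuple
    level: int    # problem level, e.g. 1, 3 or 5 in the worked example
    weight: int


@dataclass
class DataQualityModel:
    name: str
    entity_tables: list
    relations: list = field(default_factory=list)  # (main_table, sub_table) pairs
    rules: list = field(default_factory=list)

    def rules_of(self, rule_type):
        """Return the rules of one type, in the order they were added."""
        return [r for r in self.rules if r.rule_type is rule_type]


model = DataQualityModel(
    name="demo model",
    entity_tables=["table1", "table2"],
    relations=[("table1", "table2")],
    rules=[
        Rule(RuleType.NULL_VALUE, "table1", ("C11", "C12", "C13"), 5, 9),
        Rule(RuleType.VALUE_RANGE, "table1", ("C18",), 3, 6),
    ],
)
print(model.rules_of(RuleType.NULL_VALUE))
```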
(3) Quality monitoring task
According to the name of the data quality model, all data of the data quality model are exported, and the data quality model is evaluated according to the quality model name, quality model description, execution strategy, most recent execution status, most recent execution time and so on, completing the quality monitoring task;
(4) Quality monitoring report
A quality monitoring report is generated according to the quality monitoring task;
(5) Quality assessment report
According to the content of the quality monitoring report, data quality is assessed comprehensively along six dimensions (integrity, relevance, uniqueness, accuracy, consistency and normativity), and a quality assessment report for the data in the database is generated on the basis of database category and database name. The quality assessment report includes quality scores, quality score charts and a data quality model score ranking. The quality scores include the overall quality score, the number of quality models and the number of model rules; the quality score charts include a quality score trend chart and a data aggregate distribution chart; and the data quality model score ranking ranks the models by their data quality model scores.
2. The government data quality assessment technique based on big data technology according to claim 1, characterized in that: in the data quality assessment, a fixed-character rule may have a user-defined length range, and the fixed-character rule is described as supporting the wildcards "*" and "?", where "*" stands for multiple arbitrary characters and "?" stands for one arbitrary character. For example, for ABC*, the strings ABC, ABCD and ABCDE all conform to the expression; for A?C, ABC and ADC conform, whereas ABDC does not.
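A minimal Python sketch of the wildcard rule translates each pattern into a regular expression. Since ABC is listed as conforming to ABC*, "*" is taken here to mean zero or more characters; the function names are hypothetical, and the user-defined length range mentioned in the claim is not modelled.

```python
import re


def wildcard_to_regex(pattern):
    """Translate '*' (any run of characters) and '?' (exactly one character)."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # every other character is literal
    return re.compile("".join(parts), re.DOTALL)


def matches(pattern, value):
    """Return True when the whole value conforms to the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(value) is not None


star_results = [matches("ABC*", s) for s in ("ABC", "ABCD", "ABCDE")]
question_results = [matches("A?C", s) for s in ("ABC", "ADC", "ABDC")]
print(star_results, question_results)
```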
3. The government data quality assessment technique based on big data technology according to claim 1, characterized in that: in the data quality model, a check cycle is set manually, and null-value, value-range, normative, logical, referential and repeated-data checks are run on the data on a schedule according to that check cycle.
4. The government data quality assessment technique based on big data technology according to claim 1, characterized in that: the DQAF-based data quality assessment model is based on DQAF and on the Chinese standard GB/T 25000.24-2017 "Systems and software engineering: systems and software quality requirements and evaluation (SQuaRE)", Part 12 (data quality model) and Part 24 (data quality measurement); it establishes a general data quality assessment model and thematic data quality assessment models for the urban data center; it supports advanced, comprehensive and extensible quality assessment algorithms such as null-value check, value-range check, normative check, logical check, repeated-data check, referential integrity check, outlier check, timeliness check, missing-data check, fluctuation check and balance check; it satisfies the definition of each class of rule in data center construction and data governance; it establishes a scientific assessment model; and it finally assesses the urban data center comprehensively along six dimensions: integrity, normativity, consistency, accuracy, uniqueness and relevance.
5. A government abnormal-data repair technique based on big data technology according to claim 1, characterized in that: according to the data quality assessment, abnormal data are handled, that is, the abnormal data are inspected and data quality repair is performed on them. The data quality repair methods include manual repair, rule-based repair and deep repair. Manual repair means a person using the computer keyboard to modify the abnormal data in the database and enter the correct data; rule-based repair means entering the correct data according to the general rules; and deep repair is a deep-learning abnormal-data repair technique that makes full use of mature Hadoop/Spark big data technology, realizes automated governance of large-scale data through deep learning algorithms, and screens out and repairs abnormal data. The methods applied to repair abnormal data mainly include mean filling, the K-nearest-distance method, regression, maximum likelihood estimation and multiple imputation, combined with manual review, so as to achieve efficient and accurate data governance and improve data quality.
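The claim names the repair methods without giving algorithms. As a rough sketch of the two simplest ones, the Python below fills missing values with the column mean and with the mean of the k nearest complete records. It reads the "K-nearest-distance method" as k-nearest-neighbour imputation, which is an interpretation; all function names are hypothetical, and nothing here involves deep learning or Hadoop/Spark.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace None entries with the mean of the known entries."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)
    average = mean(known)
    return [average if v is None else v for v in values]


def fill_with_knn(rows, target, k=2):
    """Fill a missing `target` from the k closest complete rows.

    rows: list of dicts whose other fields are numeric and always present.
    Distance is squared Euclidean distance over the non-target fields.
    """
    complete = [r for r in rows if r[target] is not None]
    features = [key for key in rows[0] if key != target]
    filled = []
    for row in rows:
        if row[target] is not None or not complete:
            filled.append(dict(row))
            continue
        nearest = sorted(
            complete,
            key=lambda c: sum((c[f] - row[f]) ** 2 for f in features),
        )[:k]
        filled.append({**row, target: mean(n[target] for n in nearest)})
    return filled


mean_filled = fill_with_mean([1.0, None, 3.0])
knn_filled = fill_with_knn(
    [
        {"age": 20, "income": 100.0},
        {"age": 22, "income": 120.0},
        {"age": 60, "income": 300.0},
        {"age": 21, "income": None},
    ],
    target="income",
)
print(mean_filled, knn_filled[-1])
```

In practice the features would need scaling before distances are compared, and filled values would normally be flagged for the manual review that the claim mentions.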
CN201910156894.6A 2019-03-01 2019-03-01 A kind of government data quality evaluation and abnormal data recovery technique based on big data technology Pending CN109992576A (en)

Publications (1)

Publication Number Publication Date
CN109992576A true CN109992576A (en) 2019-07-09

Family

ID=67130091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910156894.6A Pending CN109992576A (en) 2019-03-01 2019-03-01 A kind of government data quality evaluation and abnormal data recovery technique based on big data technology

Country Status (1)

Country Link
CN (1) CN109992576A (en)

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190709)