CN106202347A - A device for data quality management and useful data mining - Google Patents

Info

Publication number
CN106202347A
Authority
CN
China
Prior art keywords
data
user
submodule
quality
useful
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201610524328.2A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201610524328.2A
Publication of CN106202347A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a device for data quality management and useful data mining, comprising a data quality management module and a useful data mining module. The data quality management module comprises a preliminary processing submodule, a data description submodule, a data quality evaluation submodule, and a data quality grading management submodule. The useful data mining module comprises a data preprocessing submodule, a useful data construction submodule, a useful data correction submodule, and a useful data hierarchical mining submodule.

Description

A device for data quality management and useful data mining
Technical field
The present invention relates to the field of data management, and in particular to a device for data quality management and useful data mining.
Background
Data are values: the results obtained through observation, experiment, or calculation. Data take many forms. The simplest is numbers, but data can also be text, images, sound, and so on. Data can be used for scientific research, design, verification, and similar purposes. The data background is the information a recipient holds in preparation for particular data: only when the recipient understands the rules governing a sequence of physical symbols, and knows what each symbol and each symbol combination refers to or means, can the information carried by a set of data be obtained. Because data are carriers of information, analyzing data means analyzing the main information they contain and their main characteristics. Data are physical symbols, or combinations of physical symbols arranged according to certain rules, that carry or record information.
Among the data in current use, a substantial portion is published by an administrator and revised by the administrator, either in response to user suggestions or according to the administrator's own needs. How to manage the quality of this mass of information more effectively, mine it, and find useful information in it is a problem in urgent need of a solution.
Summary of the invention
In view of the above problem, the present invention provides a device for data quality management and useful data mining.
The object of the present invention is achieved by the following technical solution:
A device for data quality management and useful data mining, comprising a data quality management module and a useful data mining module, wherein the data quality management module comprises a preliminary processing submodule, a data description submodule, a data quality evaluation submodule, and a data quality grading management submodule, and the useful data mining module comprises a data preprocessing submodule, a useful data construction submodule, a useful data correction submodule, and a useful data hierarchical mining submodule;
The preliminary processing submodule comprises a bottom-level data management unit, a middle-level data management unit, a high-level data management unit, a data template, and a global database.
Preferably, the global database comprises a bottom-level database, a middle-level database, and a high-level database.
Preferably, the bottom-level data management unit allows bottom-level users to enter equipment information data according to the specification of the data template, checks the entered equipment information data against that specification, issues a revision prompt for equipment information data that do not meet the requirements, and saves conforming equipment information data to the bottom-level database;
The middle-level data management unit maps the equipment information data in the bottom-level database onto the middle-level data management unit, where middle-level users establish virtual cross-link relationships between equipment information data. The middle-level data management unit analyzes these virtual cross-link relationships and produces a cross-link relationship report. It prompts middle-level users to re-establish incorrectly defined virtual cross-link relationships, and it feeds errors in the bottom-level database's equipment information data that are found while establishing virtual cross-link relationships back to bottom-level users for correction. Correct virtual cross-link relationship data are saved to the middle-level database. The middle-level data management unit can also map virtual cross-link relationship data already in the middle-level database onto itself for middle-level users to view and revise;
The high-level data management unit maps the equipment information data in the bottom-level database, and the virtual cross-link relationship data between equipment information data in the middle-level database, onto the high-level data management unit. Real airborne equipment data and actual airborne network connection relationship data, collected and entered by high-level users, are converted to the formats of the bottom-level database and the middle-level database. The unit compares the mapped equipment information data with the collected real airborne equipment data, compares the mapped virtual cross-link relationship data with the collected connection relationship data of the actual airborne network, and produces a data analysis report that guides users in troubleshooting.
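The template check performed by the bottom-level data management unit can be sketched in Python as follows. This is only an illustration under stated assumptions, not the device's actual implementation: the template representation (a mapping from field name to expected type), the function names `check_against_template` and `enter_device_record`, and the sample field names are all hypothetical.

```python
def check_against_template(record, template):
    """Return a list of problems; an empty list means the record conforms.

    `template` is assumed to map each required field name to a Python type.
    """
    problems = []
    for field, expected_type in template.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"field {field} should be {expected_type.__name__}")
    return problems


def enter_device_record(record, template, bottom_db):
    """Save a conforming record to the bottom-level store, else return revision prompts."""
    problems = check_against_template(record, template)
    if not problems:
        bottom_db.append(record)  # conforming data are saved
    return problems               # non-empty list acts as the revision prompt


# Hypothetical template and records, for illustration only.
template = {"device_id": str, "power_w": float}
bottom_db = []
ok = enter_device_record({"device_id": "D1", "power_w": 12.5}, template, bottom_db)
bad = enter_device_record({"device_id": "D2"}, template, bottom_db)
```

Here `bottom_db` is a plain list standing in for the bottom-level database; a real system would use persistent storage.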
Preferably,
(1) Data description submodule
Data are described by introducing both the attributes of the data themselves and the attributes of the data influencers. The attributes of the data themselves are represented by data size, creation date, number of pictures contained, and related data amount, where the related data amount is the total number of other data items that the current data item points to plus other data items that point to the current data item. The attributes of the data influencers are represented by the influencer network clustering coefficient $\bar{K}$, which is obtained as follows:
A data influencer description network is built. For each data item, the influencers comprise multiple users and one administrator, and each influencer is represented by one node. A user can browse the data and can also propose revisions to the data; the administrator can revise the data either on the administrator's own initiative or according to user suggestions.
The influencer network clustering coefficient $\bar{K}$ is then defined as:
$$\bar{K} = \frac{m\sigma_1 + l\sigma_2 + n(\delta_1 \times \sigma_3 + \delta_2 \times \sigma_4)}{m + l + n} \times \left[1 - \left(\frac{m - l}{m}\right)^3\right]$$
In the formula, $\sigma_1$ is the influence factor applied each time a user browses the data, and $m$ is the total number of user browses; $\sigma_2$ is the influence factor applied each time a user proposes a revision, and $l$ is the total number of user suggestions; $\sigma_3$ is the influence factor applied each time the administrator revises the data on the administrator's own initiative, and $\sigma_4$ is the influence factor applied each time the administrator revises the data according to a user suggestion; $\delta_1$ and $\delta_2$ are the weights of $\sigma_3$ and $\sigma_4$ respectively, and $n$ is the total number of administrator revisions. The factor $1 - \left(\frac{m - l}{m}\right)^3$ is the user revision frequency coefficient, which represents user satisfaction with the data; the larger this coefficient, the more frequently users revise the data;
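As a worked illustration, the clustering coefficient can be computed as in the Python sketch below. It assumes the second factor is $1 - ((m - l)/m)^3$, which is one reading of the formula as printed, and the function name `influencer_clustering_coefficient` and the sample values are hypothetical.

```python
def influencer_clustering_coefficient(m, l, n, sigma1, sigma2, sigma3, sigma4,
                                      delta1, delta2):
    """Compute the influencer network clustering coefficient for one data item.

    m: total user browses, l: total user suggestions, n: total administrator
    revisions; sigma1..sigma4 are influence factors, delta1/delta2 weight
    sigma3/sigma4. Assumes the frequency factor is 1 - ((m - l) / m) ** 3.
    """
    if m <= 0:
        raise ValueError("m (total user browses) must be positive")
    weighted_influence = (m * sigma1 + l * sigma2
                          + n * (delta1 * sigma3 + delta2 * sigma4))
    average_influence = weighted_influence / (m + l + n)
    revision_frequency = 1 - ((m - l) / m) ** 3
    return average_influence * revision_frequency


# Illustrative values only: 10 browses, 5 suggestions, 2 administrator revisions.
k_bar = influencer_clustering_coefficient(
    m=10, l=5, n=2, sigma1=1.0, sigma2=1.0, sigma3=1.0, sigma4=1.0,
    delta1=0.5, delta2=0.5)
```

With these values the average influence is 17/17 = 1 and the revision frequency factor is 1 − 0.5³ = 0.875, so `k_bar` is 0.875.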
(2) Data quality evaluation submodule
A "three-level evaluation model" is used to evaluate data quality: data are first divided into three classes by data size, and then all attributes of the data other than data size are combined to evaluate their quality. The method is as follows:
Sample data are divided into high-quality data, medium-quality data, and low-quality data. If the data size is greater than threshold $T_1$, the data are high-quality data; if the data size is greater than threshold $T_2$ but less than $T_1$, the data are medium-quality data; if the data size is less than $T_2$, the data are low-quality data. Here $T_1 > T_2$, and $T_1$ and $T_2$ take values in [1 KB, 1 MB]. High-quality data and low-quality data are further divided into grades. All attributes of the data other than data size are chosen to form a vector, the mean of each data attribute for each grade is computed from the sample data, and a corresponding mean vector is built for each grade. A new data item is represented by the vector $X = (x_1, \ldots, x_N)$ and the mean vector of a given grade by $Y = (y_1, \ldots, y_N)$, where $N$ is the number of attributes other than data size. The similarity of the two vectors is expressed by the similarity function $R(X, Y)$:
$$R(X, Y) = \sum_{i=1}^{N} \left|\frac{x_i - y_i}{x_i}\right|^2 + \sum_{i=1}^{N} \left|\frac{x_i - y_i}{y_i}\right|^2$$
The smaller the value of $R(X, Y)$, the greater the similarity, and vice versa. For each data item, the similarity to the mean vector of each grade is calculated, and its quality grade is determined accordingly;
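The two stages of the three-level evaluation model can be sketched in Python as follows. This is a hedged illustration rather than the patented implementation: the function names, the grade labels, and the sample thresholds are hypothetical. Two details are assumptions, since the text does not specify them: sizes exactly equal to a threshold are placed in the lower class, and every attribute value is nonzero (otherwise the relative errors are undefined).

```python
def size_class(size_bytes, t1, t2):
    """Stage 1: classify by data size using thresholds T1 > T2."""
    if not t1 > t2:
        raise ValueError("T1 must be greater than T2")
    if size_bytes > t1:
        return "high"
    if size_bytes > t2:
        return "medium"
    return "low"  # assumption: ties with a threshold fall to the lower class


def similarity(x, y):
    """R(X, Y): sum of squared relative errors, relative to x and to y.

    Smaller values mean the vectors are more similar.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    rel_to_x = sum(abs((xi - yi) / xi) ** 2 for xi, yi in zip(x, y))
    rel_to_y = sum(abs((xi - yi) / yi) ** 2 for xi, yi in zip(x, y))
    return rel_to_x + rel_to_y


def assign_grade(x, grade_means):
    """Stage 2: pick the grade whose mean vector minimizes R(X, Y)."""
    return min(grade_means, key=lambda grade: similarity(x, grade_means[grade]))


# Illustrative thresholds within [1 KB, 1 MB] and hypothetical grade means.
T1, T2 = 500 * 1024, 10 * 1024
cls = size_class(600 * 1024, T1, T2)
grade = assign_grade([1.0, 10.0], {"A": [1.0, 9.0], "B": [5.0, 2.0]})
```

Because each relative error is squared, a single attribute that deviates strongly dominates the sum, which matches the stated aim of amplifying larger relative errors.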
(3) Data quality grading management submodule
After the data quality evaluation submodule has divided the data into different quality grades, the data are managed by grade according to their quality grade;
Preferably,
(1) Data preprocessing submodule
Data are divided into different domains. The data domain required by the customer is determined from the user's needs, and the three-level evaluation model described above is used to screen the high-quality, high-grade data in that domain, forming a new data table K;
(2) Useful data construction submodule
In the preprocessed data, each data domain contains different categories. A correlation coefficient $P$ is introduced to screen for useful data categories:
$$P = \frac{\dfrac{Z_s}{Z} - \rho}{1 - \rho}$$
In the formula, $Z_s$ is the number of bidirectional links among the data in one category of the new data table K, that is, for data items A and B, A can point to B and B can also point to A; $Z$ is the related data amount in that category of data table K; and $\rho$ is computed from $N$, where $N$ is the total number of data items in the category;
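A minimal Python sketch of the correlation coefficient follows. Because the defining expression for ρ is not legible in the text, the sketch takes ρ as an input rather than computing it; the function name `correlation_coefficient` and the sample numbers are hypothetical.

```python
def correlation_coefficient(z_s, z, rho):
    """Compute P = (Z_s / Z - rho) / (1 - rho) for one category.

    z_s: number of bidirectional links in the category
    z:   related data amount in the category
    rho: supplied by the caller (its definition is not reproduced here)
    """
    if z <= 0:
        raise ValueError("Z must be positive")
    if rho == 1:
        raise ValueError("rho must not equal 1")
    return (z_s / z - rho) / (1 - rho)


# Illustrative values: 30 bidirectional links out of 100 related items.
p = correlation_coefficient(z_s=30, z=100, rho=0.1)
```

With these values P = (0.3 − 0.1) / 0.9, roughly 0.222. P is positive when the share of bidirectional links exceeds ρ and negative when it falls below ρ.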
(3) Useful data correction submodule
In use, useful data are affected by two factors, human sabotage and user voting; the correlation coefficient corrected for these two factors is $P'$. A threshold $T$ is also set, with $T \in (0, 0.1]$; if $P' > T$, the category is useful data. When no qualifying useful data can be obtained from the high-quality data, the medium-quality data and then the low-quality data are searched in turn for qualifying useful data. After all data have been searched, if the maximum value of $P'$ finally obtained is less than $T$, or if the maximum of $P'$ is greater than $T$ but the absolute value of its difference from $T$ is less than a set value $C$, this indicates that useful data cannot be found, or that useful data can be found but their degree of association is already below expectation. In that case a prompt is automatically sent to the administrator to revise or add related data. $C = T/5$ is taken;
(4) Useful data hierarchical mining submodule
Data table K is first scanned. Let the maximum and minimum values of $P'$ be $P'_{\max}$ and $P'_{\min}$ respectively. Data table K is partitioned into a number of non-overlapping regions, the number of regions being obtained from $P'_{\max}$ and $P'_{\min}$ by means of int, the rounding function, and local frequent itemsets are mined from the regions in parallel. Then, using the Apriori property, the local frequent itemsets are joined to obtain global candidate itemsets. K is scanned again to count the actual support of each candidate itemset and thereby determine the global frequent itemsets.
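The two-scan, region-based mining described above resembles the classic partition approach to frequent itemset mining, which can be sketched in Python as follows. This is an illustration under assumptions, not the patented procedure: the regions are taken as given (the rule for choosing their number is not reproduced), local mining is done by brute force and sequentially rather than in parallel, and the function names and sample transactions are hypothetical.

```python
from itertools import combinations


def local_frequent_itemsets(transactions, min_support, max_size):
    """Brute-force frequent itemsets within one region (adequate for a small sketch)."""
    n = len(transactions)
    frequent = set()
    if n == 0:
        return frequent
    items = sorted({item for t in transactions for item in t})
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            count = sum(1 for t in transactions if cset <= t)
            if count / n >= min_support:
                frequent.add(cset)
    return frequent


def partition_mine(regions, min_support, max_size=3):
    """Return {itemset: global support} for globally frequent itemsets."""
    # Phase 1: an itemset that is globally frequent must be locally frequent
    # in at least one region, so the union of local results is a complete
    # candidate set.
    candidates = set()
    for region in regions:
        candidates |= local_frequent_itemsets(region, min_support, max_size)
    # Phase 2: rescan the whole table to count each candidate's actual support.
    all_transactions = [t for region in regions for t in region]
    total = len(all_transactions)
    if total == 0:
        return {}
    result = {}
    for cset in candidates:
        support = sum(1 for t in all_transactions if cset <= t) / total
        if support >= min_support:
            result[cset] = support
    return result


# Two hypothetical regions of data table K, each a list of item sets.
regions = [[{"a", "b"}, {"a", "b", "c"}], [{"a"}, {"b", "c"}]]
frequent = partition_mine(regions, min_support=0.5)
```

In this example the local phase proposes seven candidates, and the second scan keeps five of them: {a}, {b}, {c}, {a, b}, and {b, c}.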
The specific formula by which the useful data correction submodule applies the correction for human sabotage and user voting is:
$$P' = P \times (1 - Y) \times (1 + H)$$
In the formula, $Y$ is the probability that the data are subject to human sabotage, and $H$ is the proportion of voting users among all users.
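The correction formula and the threshold rule of the useful data correction submodule can be combined in a short Python sketch. The function names `corrected_coefficient` and `needs_administrator_prompt` are hypothetical, and treating a maximum exactly equal to $T$ as "not found" is an assumption that follows from the criterion $P' > T$.

```python
def corrected_coefficient(p, y, h):
    """P' = P * (1 - Y) * (1 + H).

    y: probability that the data are subject to human sabotage
    h: proportion of voting users among all users
    """
    return p * (1 - y) * (1 + h)


def needs_administrator_prompt(p_prime_max, t, c=None):
    """Decide whether to prompt the administrator after all data are searched.

    A prompt is sent when no category exceeds T, or when the best category
    exceeds T by less than the margin C (C = T / 5 unless given).
    """
    if not 0 < t <= 0.1:
        raise ValueError("T must lie in (0, 0.1]")
    if c is None:
        c = t / 5
    if p_prime_max <= t:  # assumption: equality counts as not useful
        return True
    return abs(p_prime_max - t) < c


# Illustrative values only.
p_prime = corrected_coefficient(p=0.5, y=0.2, h=0.1)   # 0.5 * 0.8 * 1.1
prompt_weak = needs_administrator_prompt(0.11, t=0.1)  # exceeds T by only 0.01
prompt_strong = needs_administrator_prompt(0.2, t=0.1)
```

With T = 0.1 the margin C is 0.02, so a maximum of 0.11 still triggers a prompt, while 0.2 does not.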
The beneficial effects are as follows. Data are described by introducing a network clustering coefficient that takes into account both the attributes of the data themselves and the attributes of the data influencers, which improves classification accuracy; the introduction of the user revision frequency coefficient also reduces manual intervention, achieving the goal of detecting data quality efficiently. The three-level evaluation model saves storage space and improves computational efficiency. The new similarity function amplifies the effect of larger relative errors, making quality grading more scientific and accurate. The useful data correction submodule corrects the correlation coefficient and can fully overcome the influence of human sabotage and user voting on the data. Association rule mining based on region partitioning is combined with the classification of useful data, so hierarchical mining needs to be carried out in only one of the data tables produced by the three-level classification; mining moves to the next data table only when the current table contains no data meeting the requirements. The amount of computation drops substantially, and because the mining is tied to the useful data classification, it is more purposeful.
Brief description of the drawings
The invention is further described with reference to the accompanying drawing. The embodiment shown in the drawing does not limit the invention in any way, and a person of ordinary skill in the art can obtain other drawings from the following drawing without creative effort.
Fig. 1 is a structural block diagram of a device for data quality management and useful data mining.
Reference numerals: data quality management module 1; useful data mining module 2; preliminary processing submodule 11; data description submodule 12; data quality evaluation submodule 13; data quality grading management submodule 14; data preprocessing submodule 21; useful data construction submodule 22; useful data correction submodule 23; useful data hierarchical mining submodule 24.
Detailed description of the embodiments
The invention is further described with reference to the following embodiments.
Embodiment 1:
As shown in Fig. 1, a device for data quality management and useful data mining comprises a data quality management module 1 and a useful data mining module 2. The data quality management module 1 comprises a preliminary processing submodule 11, a data description submodule 12, a data quality evaluation submodule 13, and a data quality grading management submodule 14. The useful data mining module 2 comprises a data preprocessing submodule 21, a useful data construction submodule 22, a useful data correction submodule 23, and a useful data hierarchical mining submodule 24.
The preliminary processing submodule 11 comprises a bottom-level data management unit, a middle-level data management unit, a high-level data management unit, a data template, and a global database.
Preferably, the global database comprises a bottom-level database, a middle-level database, and a high-level database.
Preferably, the bottom-level data management unit allows bottom-level users to enter equipment information data according to the specification of the data template, checks the entered equipment information data against that specification, issues a revision prompt for equipment information data that do not meet the requirements, and saves conforming equipment information data to the bottom-level database;
The middle-level data management unit maps the equipment information data in the bottom-level database onto the middle-level data management unit, where middle-level users establish virtual cross-link relationships between equipment information data. The middle-level data management unit analyzes these virtual cross-link relationships and produces a cross-link relationship report. It prompts middle-level users to re-establish incorrectly defined virtual cross-link relationships, and it feeds errors in the bottom-level database's equipment information data that are found while establishing virtual cross-link relationships back to bottom-level users for correction. Correct virtual cross-link relationship data are saved to the middle-level database. The middle-level data management unit can also map virtual cross-link relationship data already in the middle-level database onto itself for middle-level users to view and revise;
The high-level data management unit maps the equipment information data in the bottom-level database, and the virtual cross-link relationship data between equipment information data in the middle-level database, onto the high-level data management unit. Real airborne equipment data and actual airborne network connection relationship data, collected and entered by high-level users, are converted to the formats of the bottom-level database and the middle-level database. The unit compares the mapped equipment information data with the collected real airborne equipment data, compares the mapped virtual cross-link relationship data with the collected connection relationship data of the actual airborne network, and produces a data analysis report that guides users in troubleshooting.
Preferably,
(1) Data description submodule 12:
Data are described by introducing both the attributes of the data themselves and the attributes of the data influencers. The attributes of the data themselves are represented by data size, creation date, number of pictures contained, and related data amount, where the related data amount is the total number of other data items that the current data item points to plus other data items that point to the current data item. The attributes of the data influencers are represented by the influencer network clustering coefficient $\bar{K}$, which is obtained as follows:
A data influencer description network is built. For each data item, the influencers comprise multiple users and one administrator, and each influencer is represented by one node. A user can browse the data and can also propose revisions to the data; the administrator can revise the data either on the administrator's own initiative or according to user suggestions.
The influencer network clustering coefficient $\bar{K}$ is then defined as:
$$\bar{K} = \frac{m\sigma_1 + l\sigma_2 + n(\delta_1 \times \sigma_3 + \delta_2 \times \sigma_4)}{m + l + n} \times \left[1 - \left(\frac{m - l}{m}\right)^3\right]$$
In the formula, $\sigma_1$ is the influence factor applied each time a user browses the data, and $m$ is the total number of user browses; $\sigma_2$ is the influence factor applied each time a user proposes a revision, and $l$ is the total number of user suggestions; $\sigma_3$ is the influence factor applied each time the administrator revises the data on the administrator's own initiative, and $\sigma_4$ is the influence factor applied each time the administrator revises the data according to a user suggestion; $\delta_1$ and $\delta_2$ are the weights of $\sigma_3$ and $\sigma_4$ respectively, and $n$ is the total number of administrator revisions. The factor $1 - \left(\frac{m - l}{m}\right)^3$ is the user revision frequency coefficient, which represents user satisfaction with the data; the larger this coefficient, the more frequently users revise the data.
(2) Data quality evaluation submodule 13:
A "three-level evaluation model" is used to evaluate data quality: data are first divided into three classes by data size, and then all attributes of the data other than data size are combined to evaluate their quality. The method is as follows:
Sample data are divided into high-quality data, medium-quality data, and low-quality data. If the data size is greater than threshold $T_1$, the data are high-quality data; if the data size is greater than threshold $T_2$ but less than $T_1$, the data are medium-quality data; if the data size is less than $T_2$, the data are low-quality data. Here $T_1 > T_2$, and $T_1$ and $T_2$ take values in [1 KB, 1 MB]. High-quality data and low-quality data are further divided into grades. All attributes of the data other than data size are chosen to form a vector, the mean of each data attribute for each grade is computed from the sample data, and a corresponding mean vector is built for each grade. A new data item is represented by the vector $X = (x_1, \ldots, x_N)$ and the mean vector of a given grade by $Y = (y_1, \ldots, y_N)$, where $N$ is the number of attributes other than data size. The similarity of the two vectors is expressed by the similarity function $R(X, Y)$:
$$R(X, Y) = \sum_{i=1}^{N} \left|\frac{x_i - y_i}{x_i}\right|^2 + \sum_{i=1}^{N} \left|\frac{x_i - y_i}{y_i}\right|^2$$
The smaller the value of $R(X, Y)$, the greater the similarity, and vice versa. For each data item, the similarity to the mean vector of each grade is calculated, and its quality grade is determined accordingly.
(3) Data quality grading management submodule 14:
After the data quality evaluation submodule has divided the data into different quality grades, the data are managed by grade according to their quality grade.
Preferably,
(1) Data preprocessing submodule
Data are divided into different domains. The data domain required by the customer is determined from the user's needs, and the three-level evaluation model described above is used to screen the high-quality, high-grade data in that domain, forming a new data table K;
(2) Useful data construction submodule
In the preprocessed data, each data domain contains different categories. A correlation coefficient $P$ is introduced to screen for useful data categories:
$$P = \frac{\dfrac{Z_s}{Z} - \rho}{1 - \rho}$$
In the formula, $Z_s$ is the number of bidirectional links among the data in one category of the new data table K, that is, for data items A and B, A can point to B and B can also point to A; $Z$ is the related data amount in that category of data table K; and $\rho$ is computed from $N$, where $N$ is the total number of data items in the category;
(3) Useful data correction submodule
In use, useful data are affected by two factors, human sabotage and user voting; the correlation coefficient corrected for these two factors is $P'$. A threshold $T$ is also set, with $T \in (0, 0.1]$; if $P' > T$, the category is useful data. When no qualifying useful data can be obtained from the high-quality data, the medium-quality data and then the low-quality data are searched in turn for qualifying useful data. After all data have been searched, if the maximum value of $P'$ finally obtained is less than $T$, or if the maximum of $P'$ is greater than $T$ but the absolute value of its difference from $T$ is less than a set value $C$, this indicates that useful data cannot be found, or that useful data can be found but their degree of association is already below expectation. In that case a prompt is automatically sent to the administrator to revise or add related data. $C = T/5$ is taken;
(4) Useful data hierarchical mining submodule
Data table K is first scanned. Let the maximum and minimum values of $P'$ be $P'_{\max}$ and $P'_{\min}$ respectively. Data table K is partitioned into a number of non-overlapping regions, the number of regions being obtained from $P'_{\max}$ and $P'_{\min}$ by means of int, the rounding function, and local frequent itemsets are mined from the regions in parallel. Then, using the Apriori property, the local frequent itemsets are joined to obtain global candidate itemsets. K is scanned again to count the actual support of each candidate itemset and thereby determine the global frequent itemsets.
The specific formula by which the useful data correction submodule applies the correction for human sabotage and user voting is:
$$P' = P \times (1 - Y) \times (1 + H)$$
In the formula, $Y$ is the probability that the data are subject to human sabotage, and $H$ is the proportion of voting users among all users.
In this embodiment, data are described by introducing a network clustering coefficient that takes into account both the attributes of the data themselves and the attributes of the data influencers, which improves classification accuracy; the introduction of the user revision frequency coefficient also reduces manual intervention, achieving the goal of detecting data quality efficiently. The three-level evaluation model saves storage space and improves computational efficiency. The new similarity function amplifies the effect of larger relative errors, making quality grading more scientific and accurate. The useful data correction submodule corrects the correlation coefficient and can fully overcome the influence of human sabotage and user voting on the data; with $C = T/5$, the range of data that triggers a prompt increases by 5%, while the amount of computation increases by 3.7%. Association rule mining based on region partitioning is combined with the classification of useful data, so hierarchical mining needs to be carried out in only one of the data tables produced by the three-level classification; mining moves to the next data table only when the current table contains no data meeting the requirements. The amount of computation drops substantially, and because the mining is tied to the useful data classification, it is more purposeful.
Embodiment 2:
As shown in Fig. 1, a device for data quality management and useful data mining comprises a data quality management module 1 and a useful data mining module 2. The data quality management module 1 comprises a preliminary processing submodule 11, a data description submodule 12, a data quality evaluation submodule 13, and a data quality grading management submodule 14. The useful data mining module 2 comprises a data preprocessing submodule 21, a useful data construction submodule 22, a useful data correction submodule 23, and a useful data hierarchical mining submodule 24.
The preliminary processing submodule 11 comprises a bottom-level data management unit, a middle-level data management unit, a high-level data management unit, a data template, and a global database.
Preferably, the global database comprises a bottom-level database, a middle-level database, and a high-level database.
Preferably, the bottom-level data management unit allows bottom-level users to enter equipment information data according to the specification of the data template, checks the entered equipment information data against that specification, issues a revision prompt for equipment information data that do not meet the requirements, and saves conforming equipment information data to the bottom-level database;
The middle-level data management unit maps the equipment information data in the bottom-level database onto the middle-level data management unit, where middle-level users establish virtual cross-link relationships between equipment information data. The middle-level data management unit analyzes these virtual cross-link relationships and produces a cross-link relationship report. It prompts middle-level users to re-establish incorrectly defined virtual cross-link relationships, and it feeds errors in the bottom-level database's equipment information data that are found while establishing virtual cross-link relationships back to bottom-level users for correction. Correct virtual cross-link relationship data are saved to the middle-level database. The middle-level data management unit can also map virtual cross-link relationship data already in the middle-level database onto itself for middle-level users to view and revise;
The high-level data management unit maps the equipment information data in the bottom-level database, and the virtual cross-link relationship data between equipment information data in the middle-level database, onto the high-level data management unit. Real airborne equipment data and actual airborne network connection relationship data, collected and entered by high-level users, are converted to the formats of the bottom-level database and the middle-level database. The unit compares the mapped equipment information data with the collected real airborne equipment data, compares the mapped virtual cross-link relationship data with the collected connection relationship data of the actual airborne network, and produces a data analysis report that guides users in troubleshooting.
Preferably,
(1) Data description submodule 12:
Data are described by introducing both the attributes of the data themselves and the attributes of the data influencers. The attributes of the data themselves are represented by data size, creation date, number of pictures contained, and related data amount, where the related data amount is the total number of other data items that the current data item points to plus other data items that point to the current data item. The attributes of the data influencers are represented by the influencer network clustering coefficient $\bar{K}$, which is obtained as follows:
A data influencer description network is built. For each data item, the influencers comprise multiple users and one administrator, and each influencer is represented by one node. A user can browse the data and can also propose revisions to the data; the administrator can revise the data either on the administrator's own initiative or according to user suggestions.
The influencer network clustering coefficient $\bar{K}$ is then defined as:
$$\bar{K} = \frac{m\sigma_1 + l\sigma_2 + n(\delta_1 \times \sigma_3 + \delta_2 \times \sigma_4)}{m + l + n} \times \left[1 - \left(\frac{m - l}{m}\right)^3\right]$$
In the formula, $\sigma_1$ is the influence factor applied each time a user browses the data, and $m$ is the total number of user browses; $\sigma_2$ is the influence factor applied each time a user proposes a revision, and $l$ is the total number of user suggestions; $\sigma_3$ is the influence factor applied each time the administrator revises the data on the administrator's own initiative, and $\sigma_4$ is the influence factor applied each time the administrator revises the data according to a user suggestion; $\delta_1$ and $\delta_2$ are the weights of $\sigma_3$ and $\sigma_4$ respectively, and $n$ is the total number of administrator revisions. The factor $1 - \left(\frac{m - l}{m}\right)^3$ is the user revision frequency coefficient, which represents user satisfaction with the data; the larger this coefficient, the more frequently users revise the data.
(2) Quality testing submodule 13:
A "three-level evaluation model" is used to evaluate data quality: data are first divided into three classes by data size, and their quality is then evaluated by combining all attributes of the data other than size. The specific method is as follows:
Sample data are divided into high-quality data, medium-quality data, and low-quality data. If the data size is greater than threshold $T_1$, the data are high-quality data; if the data size is greater than threshold $T_2$ but less than $T_1$, the data are medium-quality data; if the data size is less than $T_2$, the data are low-quality data. Here $T_1 > T_2$, and $T_1$ and $T_2$ take values in [1 KB, 1 MB]. The high-quality data and low-quality data are further divided into grades. All other attributes of the data are chosen to form a vector, the mean of each attribute within each grade is computed from the sample data, and a mean vector is built for each grade. A new piece of data is represented by the vector $X = (x_1, \ldots, x_N)$ and the mean vector of a given grade by $Y = (y_1, \ldots, y_N)$, where $N$ is the number of attributes other than data size. The similarity of the two vectors is expressed by the similarity function $R(X, Y)$:
$$R(X, Y) = \sum_{i=1}^{N} \left|\frac{x_i - y_i}{x_i}\right|^2 + \sum_{i=1}^{N} \left|\frac{x_i - y_i}{y_i}\right|^2$$
The smaller the value of $R(X, Y)$, the greater the similarity, and vice versa. For each piece of data, the similarity to the mean vector of each grade is computed, and its quality grade is determined accordingly.
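The grading step can be sketched as follows, reading $R(X, Y)$ as the sum of squared relative differences taken once with respect to $x_i$ and once with respect to $y_i$. The function names and the sample grade means are hypothetical, and the sketch assumes every attribute value is non-zero, since zero values would make the relative differences undefined.

```python
def similarity(x, y):
    """R(X, Y): a smaller value means the two vectors are more similar."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Squared relative difference, measured against x_i and against y_i.
    return sum(((a - b) / a) ** 2 + ((a - b) / b) ** 2 for a, b in zip(x, y))


def assign_grade(x, grade_means):
    """Return the grade whose mean vector has the smallest R value to x."""
    return min(grade_means, key=lambda grade: similarity(x, grade_means[grade]))


# Hypothetical mean vectors for two grades of one size class.
grade_means = {"A": (10.0, 5.5), "B": (3.0, 1.0)}
grade = assign_grade((10.0, 5.0), grade_means)
```

Because each term is squared, one attribute with a large relative deviation dominates the sum, which matches the stated aim of amplifying larger relative errors.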
(3) Quality testing submodule 14:
After the quality testing submodule has divided the data into quality grades, the data are managed hierarchically according to their grade.
Preferably,
(1) Data preprocessing submodule
Data are divided into different fields, the data field required by the customer is determined from the user's requirements, and the three-level evaluation model described above is used to select the high-quality, high-grade data in that field, forming a new data table K;
(2) Useful data construction submodule
After preprocessing, each data field contains different categories. A correlation coefficient $P$ is introduced to select the categories of useful data:
$$P = \frac{\dfrac{Z_s}{Z} - \rho}{1 - \rho}$$
where $Z_s$ is the number of bidirectional links among the data in one category of the new data table K, that is, pairs of data A and B such that A points to B and B also points to A; $Z$ is the related data amount in that category of data table K; and $\rho$ = [expression missing in the source], where $N$ is the total number of data in the category.
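A small sketch of the correlation coefficient, reading the formula as $P = (Z_s / Z - \rho) / (1 - \rho)$. Because the expression for $\rho$ is not available here, the sketch takes $\rho$ as an input rather than computing it from $N$. The helper that counts bidirectional links, and the way links are stored as ordered pairs, are illustrative assumptions.

```python
def count_bidirectional(links):
    """Count unordered pairs {A, B} where A points to B and B points to A.

    links: a set of (source, target) tuples (hypothetical representation).
    """
    return len({frozenset((a, b)) for a, b in links if a != b and (b, a) in links})


def correlation_coefficient(z_s, z, rho):
    """P = (Z_s / Z - rho) / (1 - rho); rho is supplied by the caller."""
    if z <= 0:
        raise ValueError("z must be positive")
    if rho >= 1:
        raise ValueError("rho must be below 1")
    return (z_s / z - rho) / (1 - rho)


# Hypothetical link set: A and B point to each other, A also points to C.
links = {("A", "B"), ("B", "A"), ("A", "C")}
z_s = count_bidirectional(links)
```

When every related datum in a category is bidirectionally linked (`z_s == z`), the coefficient equals 1 for any admissible `rho`.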
(3) Useful data correction submodule
During use, useful data are affected by two factors, deliberate damage and user voting; the correlation coefficient corrected for these two factors is $P'$. A threshold $T$ is set, with $T \in (0, 0.1]$; if $P' > T$, the category is useful data. When no qualifying useful data can be obtained from the high-quality data, qualifying useful data are searched for in the medium-quality data and then in the low-quality data. After all data have been searched, if the maximum $P'$ obtained is less than $T$, or the maximum $P'$ is greater than $T$ but the absolute value of its difference from $T$ is less than a set value $C$, then either no useful data can be found, or useful data can be found but their degree of association is already below expectation. In that case a prompt is automatically sent to the administrator to revise or add related data. Take $C = T/5$;
(4) Useful data layered mining submodule
Data table K is first scanned. Let the maximum and minimum of $P'$ be $P'_{max}$ and $P'_{min}$. Data table K is split into int([expression missing in the source]) non-overlapping regions, where int is the integer rounding function, and local frequent itemsets are mined in parallel. The Apriori property is then used to join the local frequent itemsets into global candidate itemsets. K is scanned again to count the actual support of each candidate itemset and so determine the global frequent itemsets.
In the useful data correction submodule, the specific formula for correcting for deliberate damage and user voting is:
$$P' = P \times (1 - Y) \times (1 + H)$$
where $Y$ is the probability that the data are deliberately damaged and $H$ is the proportion of voting users in the total number of users.
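The correction and the threshold logic of the useful data correction submodule can be sketched as below. The function names, the tier representation, and the sample numbers are hypothetical. The default `c = t / 5` follows the value given in the step description, and a maximum $P'$ exactly equal to $T$ is treated as "not found", which is an assumption because the text only covers the strict cases.

```python
def corrected_coefficient(p, y, h):
    """P' = P * (1 - Y) * (1 + H)."""
    return p * (1 - y) * (1 + h)


def first_useful_tier(tiers, t):
    """Search tiers in order (high, medium, low) for categories with P' > T.

    tiers: list of (tier_name, {category: p_prime}) pairs.
    """
    for name, scores in tiers:
        useful = [cat for cat, p_prime in scores.items() if p_prime > t]
        if useful:
            return name, useful
    return None, []


def needs_admin_prompt(p_primes, t, c=None):
    """True if no useful data exist or the best P' is only marginally above T."""
    if not 0 < t <= 0.1:
        raise ValueError("T must lie in (0, 0.1]")
    if c is None:
        c = t / 5
    best = max(p_primes)
    return best <= t or abs(best - t) < c


# Hypothetical tiers: nothing useful in the high tier, one category in medium.
tiers = [("high", {"a": 0.05}), ("medium", {"b": 0.3, "c": 0.01})]
tier, categories = first_useful_tier(tiers, t=0.1)
```

A smaller `c` narrows the band just above $T$ in which the administrator is prompted even though useful data were found.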
In this embodiment, the network clustering coefficient is introduced to describe data, taking into account both the attributes of the data itself and the attributes of the data influencers, which improves classification accuracy; introducing the user revision frequency coefficient also reduces manual intervention, achieving the goal of efficiently evaluating data quality. The three-level evaluation model saves storage space and improves computational efficiency. The new similarity function amplifies the effect of larger relative errors, making the quality grading more rigorous and accurate. The useful data correction submodule corrects the correlation coefficient and so overcomes the influence of deliberate damage and user voting on the data; taking C = T/6, the range of data prompted increases by 4%, while the amount of computation increases by 3.3%. Region-based association rule mining is combined with the classification of useful data, so layered mining only needs to be carried out within the data tables produced by the three-level classification. Mining moves to the next data table only when the current one contains no qualifying data, which greatly reduces computation, and the mining can be tied to the useful data categories, making it more targeted.
Embodiment 3:
A device for data quality management and useful data mining, as shown in Figure 1, comprises a data quality management module 1 and a useful data mining module 2. The quality management module 1 comprises a preliminary processing submodule 11, a data description submodule 12, a quality testing submodule 13, and a quality testing submodule 14; the useful data mining module 2 comprises a data preprocessing submodule 21, a useful data construction submodule 22, a useful data correction submodule 23, and a useful data layered mining submodule 24.
The preliminary processing submodule 11 comprises:
a bottom-level data management unit, a middle-level data management unit, a high-level data management unit, a data template, and a global database.
Preferably, the global database comprises a bottom-level database, a middle-level database, and a high-level database.
Preferably, the bottom-level data management unit lets bottom-level users enter device information data according to the specification of the data template, checks the entered device information data against that specification, issues a modification prompt for device information data that do not meet the requirements, and saves compliant device information data in the bottom-level database;
The middle-level data management unit maps the device information data in the bottom-level database into the middle-level data management unit, where middle-level users establish virtual cross-link relations between device information data. The unit analyzes these virtual cross-link relations and produces a cross-link relation report; it prompts middle-level users to re-establish incorrectly defined virtual cross-link relations, and feeds errors in the bottom-level device information data found while establishing virtual cross-link relations back to bottom-level users for correction. Correct virtual cross-link relation data are saved in the middle-level database. The middle-level data management unit can also map virtual cross-link relation data already in the middle-level database onto itself for middle-level users to view and modify;
The high-level data management unit maps the device information data in the bottom-level database, and the virtual cross-link relation data between device information data in the middle-level database, into the high-level data management unit. The real airborne device data and the connection data of the actual onboard network, collected and entered by a high-level user, are converted into the formats of the bottom-level database and the middle-level database. The unit compares the mapped device information data with the collected real airborne device data, and compares the mapped virtual cross-link relation data with the collected connection data of the actual onboard network. It then produces a data analysis report that guides the user in troubleshooting.
Preferably,
(1) Data description submodule 12:
Data are described by the attributes of the data itself and the attributes of the data influencers. The attributes of the data itself are data size, creation date, number of pictures contained, and related data amount, where the related data amount is the total of the other data that the current data points to and the other data that point to the current data. The attributes of the data influencers are represented by the influencer network clustering coefficient $\bar{K}$, which is obtained as follows:
A data influencer description network is built. For each piece of data, the influencers comprise multiple users and one administrator, and each influencer is a node. A user can browse the data and can propose modifications to it; the administrator can modify the data either independently or according to user suggestions.
The influencer network clustering coefficient $\bar{K}$ is then defined as:
$$\bar{K} = \frac{m\sigma_1 + l\sigma_2 + n(\delta_1 \times \sigma_3 + \delta_2 \times \sigma_4)}{m + l + n} \times \left[1 - \left(\frac{m - l}{m}\right)^3\right]$$
where $\sigma_1$ is the influence factor applied each time a user browses the data and $m$ is the total number of user views; $\sigma_2$ is the influence factor applied each time a user proposes a modification and $l$ is the total number of user suggestions; $\sigma_3$ is the influence factor applied each time the administrator modifies the data independently, and $\sigma_4$ is the influence factor applied each time the administrator modifies the data according to a user suggestion; $\delta_1$ and $\delta_2$ are the weights of $\sigma_3$ and $\sigma_4$ respectively, and $n$ is the total number of administrator modifications. The second factor in the formula is the user revision frequency coefficient, which reflects user satisfaction with the data: the larger this coefficient, the more frequently users ask for the data to be modified.
(2) Quality testing submodule 13:
A "three-level evaluation model" is used to evaluate data quality: data are first divided into three classes by data size, and their quality is then evaluated by combining all attributes of the data other than size. The specific method is as follows:
Sample data are divided into high-quality data, medium-quality data, and low-quality data. If the data size is greater than threshold $T_1$, the data are high-quality data; if the data size is greater than threshold $T_2$ but less than $T_1$, the data are medium-quality data; if the data size is less than $T_2$, the data are low-quality data. Here $T_1 > T_2$, and $T_1$ and $T_2$ take values in [1 KB, 1 MB]. The high-quality data and low-quality data are further divided into grades. All other attributes of the data are chosen to form a vector, the mean of each attribute within each grade is computed from the sample data, and a mean vector is built for each grade. A new piece of data is represented by the vector $X = (x_1, \ldots, x_N)$ and the mean vector of a given grade by $Y = (y_1, \ldots, y_N)$, where $N$ is the number of attributes other than data size. The similarity of the two vectors is expressed by the similarity function $R(X, Y)$:
$$R(X, Y) = \sum_{i=1}^{N} \left|\frac{x_i - y_i}{x_i}\right|^2 + \sum_{i=1}^{N} \left|\frac{x_i - y_i}{y_i}\right|^2$$
The smaller the value of $R(X, Y)$, the greater the similarity, and vice versa. For each piece of data, the similarity to the mean vector of each grade is computed, and its quality grade is determined accordingly.
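The first stage of the three-level evaluation model, the split by data size, could look like the sketch below. The text does not say which class a size exactly equal to a threshold belongs to, so the boundary handling here is an assumption, as are the function name, the sample thresholds, and the convention that 1 KB is 1024 bytes.

```python
KB = 1024          # assumption: binary kilobyte
MB = 1024 * KB


def size_class(size_bytes, t1, t2):
    """Classify one datum as 'high', 'medium' or 'low' quality by its size."""
    if not (KB <= t2 < t1 <= MB):
        raise ValueError("thresholds must satisfy 1 KB <= T2 < T1 <= 1 MB")
    if size_bytes > t1:
        return "high"
    if size_bytes > t2:
        # Assumption: a size exactly equal to T1 is treated as medium.
        return "medium"
    # Assumption: a size exactly equal to T2 is treated as low.
    return "low"


# Hypothetical thresholds inside the permitted range.
T1, T2 = 512 * KB, 64 * KB
classes = [size_class(s, T1, T2) for s in (600 * KB, 100 * KB, 10 * KB)]
```

Only after this coarse split are the remaining attributes compared against the grade mean vectors, so each new datum is matched against the grades of one size class rather than all of them.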
(3) Quality testing submodule 14:
After the quality testing submodule has divided the data into quality grades, the data are managed hierarchically according to their grade.
Preferably,
(1) Data preprocessing submodule
Data are divided into different fields, the data field required by the customer is determined from the user's requirements, and the three-level evaluation model described above is used to select the high-quality, high-grade data in that field, forming a new data table K;
(2) Useful data construction submodule
After preprocessing, each data field contains different categories. A correlation coefficient $P$ is introduced to select the categories of useful data:
$$P = \frac{\dfrac{Z_s}{Z} - \rho}{1 - \rho}$$
where $Z_s$ is the number of bidirectional links among the data in one category of the new data table K, that is, pairs of data A and B such that A points to B and B also points to A; $Z$ is the related data amount in that category of data table K; and $\rho$ = [expression missing in the source], where $N$ is the total number of data in the category.
(3) Useful data correction submodule
During use, useful data are affected by two factors, deliberate damage and user voting; the correlation coefficient corrected for these two factors is $P'$. A threshold $T$ is set, with $T \in (0, 0.1]$; if $P' > T$, the category is useful data. When no qualifying useful data can be obtained from the high-quality data, qualifying useful data are searched for in the medium-quality data and then in the low-quality data. After all data have been searched, if the maximum $P'$ obtained is less than $T$, or the maximum $P'$ is greater than $T$ but the absolute value of its difference from $T$ is less than a set value $C$, then either no useful data can be found, or useful data can be found but their degree of association is already below expectation. In that case a prompt is automatically sent to the administrator to revise or add related data. Take $C = T/5$;
(4) Useful data layered mining submodule
Data table K is first scanned. Let the maximum and minimum of $P'$ be $P'_{max}$ and $P'_{min}$. Data table K is split into int([expression missing in the source]) non-overlapping regions, where int is the integer rounding function, and local frequent itemsets are mined in parallel. The Apriori property is then used to join the local frequent itemsets into global candidate itemsets. K is scanned again to count the actual support of each candidate itemset and so determine the global frequent itemsets.
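The two-scan, region-based mining described above resembles the classic partition approach to frequent itemset mining, and a minimal sketch of that approach follows. Several things here are assumptions: the number of regions is passed in directly because its defining expression is not available, regions are formed by position rather than by $P'$ value, the regions are mined one after another rather than in parallel, and all names and sample transactions are hypothetical.

```python
from itertools import combinations
from math import ceil


def local_frequent(transactions, min_frac):
    """Level-wise (Apriori-style) mining of frequent itemsets in one region."""
    min_count = ceil(min_frac * len(transactions))
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent, k = set(), 1
    while current:
        kept = {c for c in current if sum(c <= t for t in transactions) >= min_count}
        frequent |= kept
        k += 1
        # Join frequent (k-1)-itemsets into k-itemsets, then prune any candidate
        # that has an infrequent (k-1)-subset (the Apriori property).
        joined = {a | b for a in kept for b in kept if len(a | b) == k}
        current = [c for c in joined
                   if all(frozenset(s) in kept for s in combinations(c, k - 1))]
    return frequent


def partition_mine(transactions, n_regions, min_frac):
    """First scan: mine each region. Second scan: verify global support."""
    if not transactions or n_regions < 1:
        raise ValueError("need at least one transaction and one region")
    transactions = [frozenset(t) for t in transactions]
    size = ceil(len(transactions) / n_regions)
    regions = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # The union of local frequent itemsets forms the global candidate set.
    candidates = set().union(*(local_frequent(r, min_frac) for r in regions))
    min_count = ceil(min_frac * len(transactions))
    return {c for c in candidates if sum(c <= t for t in transactions) >= min_count}


# Hypothetical transactions standing in for rows of data table K.
rows = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
result = partition_mine(rows, n_regions=2, min_frac=0.5)
```

Any itemset that is frequent over the whole table must be frequent in at least one region, so the second scan only has to confirm or reject the locally found candidates.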
In the useful data correction submodule, the specific formula for correcting for deliberate damage and user voting is:
$$P' = P \times (1 - Y) \times (1 + H)$$
where $Y$ is the probability that the data are deliberately damaged and $H$ is the proportion of voting users in the total number of users.
In this embodiment, the network clustering coefficient is introduced to describe data, taking into account both the attributes of the data itself and the attributes of the data influencers, which improves classification accuracy; introducing the user revision frequency coefficient also reduces manual intervention, achieving the goal of efficiently evaluating data quality. The three-level evaluation model saves storage space and improves computational efficiency. The new similarity function amplifies the effect of larger relative errors, making the quality grading more rigorous and accurate. The useful data correction submodule corrects the correlation coefficient and so overcomes the influence of deliberate damage and user voting on the data; taking C = T/7, the range of data prompted increases by 3.5%, while the amount of computation increases by 3%. Region-based association rule mining is combined with the classification of useful data, so layered mining only needs to be carried out within the data tables produced by the three-level classification. Mining moves to the next data table only when the current one contains no qualifying data, which greatly reduces computation, and the mining can be tied to the useful data categories, making it more targeted.
Embodiment 4:
A device for data quality management and useful data mining, as shown in Figure 1, comprises a data quality management module 1 and a useful data mining module 2. The quality management module 1 comprises a preliminary processing submodule 11, a data description submodule 12, a quality testing submodule 13, and a quality testing submodule 14; the useful data mining module 2 comprises a data preprocessing submodule 21, a useful data construction submodule 22, a useful data correction submodule 23, and a useful data layered mining submodule 24.
The preliminary processing submodule 11 comprises:
a bottom-level data management unit, a middle-level data management unit, a high-level data management unit, a data template, and a global database.
Preferably, the global database comprises a bottom-level database, a middle-level database, and a high-level database.
Preferably, the bottom-level data management unit lets bottom-level users enter device information data according to the specification of the data template, checks the entered device information data against that specification, issues a modification prompt for device information data that do not meet the requirements, and saves compliant device information data in the bottom-level database;
The middle-level data management unit maps the device information data in the bottom-level database into the middle-level data management unit, where middle-level users establish virtual cross-link relations between device information data. The unit analyzes these virtual cross-link relations and produces a cross-link relation report; it prompts middle-level users to re-establish incorrectly defined virtual cross-link relations, and feeds errors in the bottom-level device information data found while establishing virtual cross-link relations back to bottom-level users for correction. Correct virtual cross-link relation data are saved in the middle-level database. The middle-level data management unit can also map virtual cross-link relation data already in the middle-level database onto itself for middle-level users to view and modify;
The high-level data management unit maps the device information data in the bottom-level database, and the virtual cross-link relation data between device information data in the middle-level database, into the high-level data management unit. The real airborne device data and the connection data of the actual onboard network, collected and entered by a high-level user, are converted into the formats of the bottom-level database and the middle-level database. The unit compares the mapped device information data with the collected real airborne device data, and compares the mapped virtual cross-link relation data with the collected connection data of the actual onboard network. It then produces a data analysis report that guides the user in troubleshooting.
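The comparison performed by the high-level data management unit can be pictured as a set difference between the mapped design data and the collected real data. The sketch below is illustrative only: the report keys, the treatment of connections as undirected pairs, and the device names are all hypothetical and are not taken from the patent.

```python
def normalize_links(links):
    """Treat each connection as undirected by sorting its two endpoints."""
    return {tuple(sorted(pair)) for pair in links}


def diff_report(designed, collected):
    """List entries present on only one side of the comparison."""
    designed, collected = set(designed), set(collected)
    return {
        "only_in_design": sorted(designed - collected),
        "only_in_collected": sorted(collected - designed),
    }


# Hypothetical device names and connections.
device_report = diff_report({"GPS", "ADC", "FMS"}, {"GPS", "FMS", "IRS"})
link_report = diff_report(
    normalize_links([("GPS", "FMS"), ("ADC", "FMS")]),
    normalize_links([("FMS", "GPS")]),
)
```

Entries that appear on only one side are the natural starting points for troubleshooting, since they mark where the design data and the real aircraft disagree.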
Preferably,
(1) Data description submodule 12:
Data are described by the attributes of the data itself and the attributes of the data influencers. The attributes of the data itself are data size, creation date, number of pictures contained, and related data amount, where the related data amount is the total of the other data that the current data points to and the other data that point to the current data. The attributes of the data influencers are represented by the influencer network clustering coefficient $\bar{K}$, which is obtained as follows:
A data influencer description network is built. For each piece of data, the influencers comprise multiple users and one administrator, and each influencer is a node. A user can browse the data and can propose modifications to it; the administrator can modify the data either independently or according to user suggestions.
The influencer network clustering coefficient $\bar{K}$ is then defined as:
$$\bar{K} = \frac{m\sigma_1 + l\sigma_2 + n(\delta_1 \times \sigma_3 + \delta_2 \times \sigma_4)}{m + l + n} \times \left[1 - \left(\frac{m - l}{m}\right)^3\right]$$
where $\sigma_1$ is the influence factor applied each time a user browses the data and $m$ is the total number of user views; $\sigma_2$ is the influence factor applied each time a user proposes a modification and $l$ is the total number of user suggestions; $\sigma_3$ is the influence factor applied each time the administrator modifies the data independently, and $\sigma_4$ is the influence factor applied each time the administrator modifies the data according to a user suggestion; $\delta_1$ and $\delta_2$ are the weights of $\sigma_3$ and $\sigma_4$ respectively, and $n$ is the total number of administrator modifications. The second factor in the formula is the user revision frequency coefficient, which reflects user satisfaction with the data: the larger this coefficient, the more frequently users ask for the data to be modified.
(2) Quality testing submodule 13:
A "three-level evaluation model" is used to evaluate data quality: data are first divided into three classes by data size, and their quality is then evaluated by combining all attributes of the data other than size. The specific method is as follows:
Sample data are divided into high-quality data, medium-quality data, and low-quality data. If the data size is greater than threshold $T_1$, the data are high-quality data; if the data size is greater than threshold $T_2$ but less than $T_1$, the data are medium-quality data; if the data size is less than $T_2$, the data are low-quality data. Here $T_1 > T_2$, and $T_1$ and $T_2$ take values in [1 KB, 1 MB]. The high-quality data and low-quality data are further divided into grades. All other attributes of the data are chosen to form a vector, the mean of each attribute within each grade is computed from the sample data, and a mean vector is built for each grade. A new piece of data is represented by the vector $X = (x_1, \ldots, x_N)$ and the mean vector of a given grade by $Y = (y_1, \ldots, y_N)$, where $N$ is the number of attributes other than data size. The similarity of the two vectors is expressed by the similarity function $R(X, Y)$:
$$R(X, Y) = \sum_{i=1}^{N} \left|\frac{x_i - y_i}{x_i}\right|^2 + \sum_{i=1}^{N} \left|\frac{x_i - y_i}{y_i}\right|^2$$
The smaller the value of $R(X, Y)$, the greater the similarity, and vice versa. For each piece of data, the similarity to the mean vector of each grade is computed, and its quality grade is determined accordingly.
(3) Quality testing submodule 14:
After the quality testing submodule has divided the data into quality grades, the data are managed hierarchically according to their grade.
Preferably,
(1) Data preprocessing submodule
Data are divided into different fields, the data field required by the customer is determined from the user's requirements, and the three-level evaluation model described above is used to select the high-quality, high-grade data in that field, forming a new data table K;
(2) Useful data construction submodule
After preprocessing, each data field contains different categories. A correlation coefficient $P$ is introduced to select the categories of useful data:
$$P = \frac{\dfrac{Z_s}{Z} - \rho}{1 - \rho}$$
where $Z_s$ is the number of bidirectional links among the data in one category of the new data table K, that is, pairs of data A and B such that A points to B and B also points to A; $Z$ is the related data amount in that category of data table K; and $\rho$ = [expression missing in the source], where $N$ is the total number of data in the category.
(3) Useful data correction submodule
During use, useful data are affected by two factors, deliberate damage and user voting; the correlation coefficient corrected for these two factors is $P'$. A threshold $T$ is set, with $T \in (0, 0.1]$; if $P' > T$, the category is useful data. When no qualifying useful data can be obtained from the high-quality data, qualifying useful data are searched for in the medium-quality data and then in the low-quality data. After all data have been searched, if the maximum $P'$ obtained is less than $T$, or the maximum $P'$ is greater than $T$ but the absolute value of its difference from $T$ is less than a set value $C$, then either no useful data can be found, or useful data can be found but their degree of association is already below expectation. In that case a prompt is automatically sent to the administrator to revise or add related data. Take $C = T/5$;
(4) Useful data layered mining submodule
Data table K is first scanned. Let the maximum and minimum of $P'$ be $P'_{max}$ and $P'_{min}$. Data table K is split into int([expression missing in the source]) non-overlapping regions, where int is the integer rounding function, and local frequent itemsets are mined in parallel. The Apriori property is then used to join the local frequent itemsets into global candidate itemsets. K is scanned again to count the actual support of each candidate itemset and so determine the global frequent itemsets.
In the useful data correction submodule, the specific formula for correcting for deliberate damage and user voting is:
$$P' = P \times (1 - Y) \times (1 + H)$$
where $Y$ is the probability that the data are deliberately damaged and $H$ is the proportion of voting users in the total number of users.
In this embodiment, the network clustering coefficient is introduced to describe data, taking into account both the attributes of the data itself and the attributes of the data influencers, which improves classification accuracy; introducing the user revision frequency coefficient also reduces manual intervention, achieving the goal of efficiently evaluating data quality. The three-level evaluation model saves storage space and improves computational efficiency. The new similarity function amplifies the effect of larger relative errors, making the quality grading more rigorous and accurate. The useful data correction submodule corrects the correlation coefficient and so overcomes the influence of deliberate damage and user voting on the data; taking C = T/8, the range of data prompted increases by 3%, while the amount of computation increases by 2.7%. Region-based association rule mining is combined with the classification of useful data, so layered mining only needs to be carried out within the data tables produced by the three-level classification. Mining moves to the next data table only when the current one contains no qualifying data, which greatly reduces computation, and the mining can be tied to the useful data categories, making it more targeted.
Embodiment 5:
A device for data quality management and useful data mining, as shown in Figure 1, comprises a data quality management module 1 and a useful data mining module 2. The quality management module 1 comprises a preliminary processing submodule 11, a data description submodule 12, a quality testing submodule 13, and a quality testing submodule 14; the useful data mining module 2 comprises a data preprocessing submodule 21, a useful data construction submodule 22, a useful data correction submodule 23, and a useful data layered mining submodule 24.
The preliminary processing submodule 11 comprises:
a bottom-level data management unit, a middle-level data management unit, a high-level data management unit, a data template, and a global database.
Preferably, the global database comprises a bottom-level database, a middle-level database, and a high-level database.
Preferably, the bottom-level data management unit lets bottom-level users enter device information data according to the specification of the data template, checks the entered device information data against that specification, issues a modification prompt for device information data that do not meet the requirements, and saves compliant device information data in the bottom-level database;
The middle-level data management unit maps the device information data in the bottom-level database into the middle-level data management unit, where middle-level users establish virtual cross-link relations between device information data. The unit analyzes these virtual cross-link relations and produces a cross-link relation report; it prompts middle-level users to re-establish incorrectly defined virtual cross-link relations, and feeds errors in the bottom-level device information data found while establishing virtual cross-link relations back to bottom-level users for correction. Correct virtual cross-link relation data are saved in the middle-level database. The middle-level data management unit can also map virtual cross-link relation data already in the middle-level database onto itself for middle-level users to view and modify;
The high-level data management unit maps the device information data in the bottom-level database, and the virtual cross-link relation data between device information data in the middle-level database, into the high-level data management unit. The real airborne device data and the connection data of the actual onboard network, collected and entered by a high-level user, are converted into the formats of the bottom-level database and the middle-level database. The unit compares the mapped device information data with the collected real airborne device data, and compares the mapped virtual cross-link relation data with the collected connection data of the actual onboard network. It then produces a data analysis report that guides the user in troubleshooting.
Preferably,
(1) Data description submodule 12:
Data are described by the attributes of the data itself and the attributes of the data influencers. The attributes of the data itself are data size, creation date, number of pictures contained, and related data amount, where the related data amount is the total of the other data that the current data points to and the other data that point to the current data. The attributes of the data influencers are represented by the influencer network clustering coefficient $\bar{K}$, which is obtained as follows:
A data influencer description network is built. For each piece of data, the influencers comprise multiple users and one administrator, and each influencer is a node. A user can browse the data and can propose modifications to it; the administrator can modify the data either independently or according to user suggestions.
The influencer network clustering coefficient $\bar{K}$ is then defined as:
$$\bar{K} = \frac{m\sigma_1 + l\sigma_2 + n(\delta_1 \times \sigma_3 + \delta_2 \times \sigma_4)}{m + l + n} \times \left[1 - \left(\frac{m - l}{m}\right)^3\right]$$
where $\sigma_1$ is the influence factor applied each time a user browses the data and $m$ is the total number of user views; $\sigma_2$ is the influence factor applied each time a user proposes a modification and $l$ is the total number of user suggestions; $\sigma_3$ is the influence factor applied each time the administrator modifies the data independently, and $\sigma_4$ is the influence factor applied each time the administrator modifies the data according to a user suggestion; $\delta_1$ and $\delta_2$ are the weights of $\sigma_3$ and $\sigma_4$ respectively, and $n$ is the total number of administrator modifications. The second factor in the formula is the user revision frequency coefficient, which reflects user satisfaction with the data: the larger this coefficient, the more frequently users ask for the data to be modified.
(2) Data quality evaluation submodule 13:
Data quality is evaluated with a "three-level evaluation model": data are first divided into three classes by data size, and their quality is then evaluated by combining all attributes other than data size. The specific method is as follows:
Sample data are divided into high-quality, medium-quality, and low-quality data. If the data size is greater than threshold $T_1$, the data are high-quality data; if the data size is greater than threshold $T_2$ but less than $T_1$, the data are medium-quality data; if the data size is less than $T_2$, the data are low-quality data. Here $T_1 > T_2$, and $T_1$, $T_2$ take values in [1KB, 1MB]. High-quality data and low-quality data are further divided into grades: all other attributes of the data are assembled into a vector, the mean of each attribute within each grade is computed from the sample data, and a mean vector is established for each grade. A new data item is represented by the vector $X = (x_1, \ldots, x_N)$ and the mean vector of a grade by $Y = (y_1, \ldots, y_N)$, where $N$ is the number of attributes other than data size. The similarity of the two vectors is expressed by the similarity function $R(X, Y)$:
$$R(X, Y) = \sum_{i=1}^{N} \left|\frac{x_i - y_i}{x_i}\right|^2 + \sum_{i=1}^{N} \left|\frac{x_i - y_i}{y_i}\right|^2$$
The smaller the value of $R(X, Y)$, the greater the similarity, and vice versa. For each data item, the similarity to the mean vector of every grade is computed, and its quality grade is determined from these values.
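A minimal Python sketch of the three-level evaluation follows. The function names, the grade labels, and the treatment of sizes exactly equal to a threshold are assumptions (the text only specifies strict inequalities); R is read literally as the sum of squared relative errors with respect to both vectors, so attribute values of zero are not supported.

```python
def size_tier(size_bytes, t1, t2):
    """Coarse tier from data size, with t1 > t2 (the thresholds T1 and T2)."""
    if size_bytes > t1:
        return "high"
    if size_bytes > t2:
        return "medium"   # assumption: a size equal to t1 falls in this tier
    return "low"          # assumption: a size equal to t2 falls in this tier


def similarity_r(x, y):
    """R(X, Y): squared relative errors taken against x and against y, summed."""
    return sum(abs((xi - yi) / xi) ** 2 + abs((xi - yi) / yi) ** 2
               for xi, yi in zip(x, y))


def assign_grade(x, grade_means):
    """Pick the grade whose mean vector gives the smallest R (most similar)."""
    return min(grade_means, key=lambda grade: similarity_r(x, grade_means[grade]))
```

For example, `assign_grade([10.0, 5.0], {"A": [10.0, 5.0], "B": [20.0, 1.0]})` selects grade `"A"`, because R is 0 for an exact match with that grade's mean vector.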
(3) Data quality graded management submodule 14:
After passing through the data quality evaluation submodule, data are divided into different quality grades and are managed in tiers according to their grade.
Preferably,
(1) Data preprocessing submodule
Data are divided into different domains. The data domain required by the client is determined from user demand, and the three-level evaluation model described above is used to screen the high-quality, high-grade data in that domain, forming a new data table K;
(2) Useful data construction submodule
In the preprocessed data, each data domain contains different categories. A correlation coefficient P is introduced to screen useful data categories:
$$P = \frac{\dfrac{Z_s}{Z} - \rho}{1 - \rho}$$
In the formula, $Z_s$ is the number of bidirectional links among the data in one category of the new data table K, that is, for data A and B, A points to B and B also points to A; $Z$ is the related data amount in that category of data table K; $\rho$ is computed from $N$, where $N$ is the total number of data items in the category;
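The correlation coefficient can be sketched as below. Because the defining expression for ρ does not survive in this text, the sketch takes ρ as an input rather than computing it from N; the function name is hypothetical.

```python
def correlation_coefficient(z_s, z, rho):
    """P = (Z_s / Z - rho) / (1 - rho).

    z_s: number of bidirectional links in the category.
    z:   related data amount in the category.
    rho: supplied by the caller; its formula in terms of N is not reproduced here.
    """
    if z <= 0:
        raise ValueError("z must be positive")
    if rho == 1:
        raise ValueError("rho must differ from 1")
    return (z_s / z - rho) / (1 - rho)
```

With `z_s=6`, `z=10`, and `rho=0.2`, the bidirectional share is 0.6 and P is approximately 0.5.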
(3) Useful data correction submodule
During use, useful data are affected by two factors: human-caused damage and user voting. The correlation coefficient corrected for these two factors is $P'$. A threshold $T$ is set, with $T \in (0, 0.1]$; if $P' > T$, the category is useful data. When no qualifying useful data can be obtained from the high-quality data, the medium-quality data and then the low-quality data are searched in turn. After all data have been searched, if the maximum $P'$ obtained is less than $T$, or the maximum $P'$ is greater than $T$ but the absolute value of its difference from $T$ is less than a set value $C$, then either no useful data can be found, or useful data can be found but their degree of correlation is already below expectation. In that case a prompt is automatically sent to the administrator to modify or add related data. Take $C = T/5$;
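The tiered search and the administrator prompt can be sketched as follows. This is an illustrative reading under stated assumptions: the corrected coefficients P′ are supplied per category, the function and parameter names are hypothetical, and a value of P′ exactly equal to T is treated as not useful.

```python
def search_useful_data(tiers, threshold, margin=None):
    """Search quality tiers in order and decide whether to prompt the administrator.

    tiers: list of (tier_name, {category: p_prime}) pairs ordered
           high -> medium -> low quality.
    Returns (tier_name or None, useful_categories, prompt_administrator).
    """
    if margin is None:
        margin = threshold / 5  # C = T/5, as stated in the text
    for name, scores in tiers:
        useful = {cat: p for cat, p in scores.items() if p > threshold}
        if useful:
            # Useful data exist; prompt only if the best P' is within C of T.
            prompt = abs(max(useful.values()) - threshold) < margin
            return name, useful, prompt
    # No tier has a category with P' > T, so the administrator is prompted.
    return None, {}, True
```

The search stops at the first tier that yields a useful category, so lower-quality tiers are only examined when the higher ones fail.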
(4) Useful data hierarchical mining submodule
First, data table K is scanned. Let the maximum and minimum of $P'$ be $P'_{\max}$ and $P'_{\min}$; data table K is partitioned into a number of non-overlapping regions, computed from $P'_{\max}$ and $P'_{\min}$ with the rounding function int, and local frequent itemsets are mined from the regions. Then, using the Apriori property, the local frequent itemsets are joined to obtain the global candidate itemsets. K is scanned again to count the actual support of each candidate and determine the global frequent itemsets.
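The two-scan, region-based procedure resembles the classic partition approach to frequent itemset mining, which can be sketched as below. This is a sketch under assumptions, not the patent's implementation: the number of regions is passed in directly (the expression in terms of P′max and P′min is not reproduced here), the rows of K are modeled as sets of items, and the same relative minimum support is used locally and globally.

```python
from itertools import combinations


def local_frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) mining inside one region."""
    min_count = min_support * len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = set()
    while current:
        level = {c for c in current
                 if sum(1 for t in transactions if c <= t) >= min_count}
        frequent |= level
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        joined = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        # Prune step (Apriori property): every k-subset must itself be frequent.
        current = [c for c in joined
                   if all(frozenset(s) in level
                          for s in combinations(c, len(c) - 1))]
    return frequent


def partition_mine(transactions, num_regions, min_support):
    """Mine each region locally, then verify the merged candidates globally."""
    transactions = [frozenset(t) for t in transactions]
    if not transactions or num_regions < 1:
        return {}
    size = -(-len(transactions) // num_regions)  # ceiling division
    regions = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    candidates = set()
    for region in regions:          # first pass: local frequent itemsets
        candidates |= local_frequent_itemsets(region, min_support)
    min_count = min_support * len(transactions)
    result = {}
    for c in candidates:            # second pass: actual global support
        count = sum(1 for t in transactions if c <= t)
        if count >= min_count:
            result[c] = count
    return result
```

Any itemset that is frequent over the whole table must be frequent in at least one region at the same relative support, so the union of local results is a complete candidate set and the second scan only has to discard false positives.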
The specific formula used in the useful data correction submodule to correct for human-caused damage and user voting is:
$$P' = P \times (1 - Y) \times (1 + H)$$
In the formula, $Y$ is the probability that the data are damaged by humans, and $H$ is the proportion of voting users among the total number of users.
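As a small worked illustration, the correction can be written as a one-line Python function. The function name and the range checks on Y and H are assumptions added for the sketch.

```python
def corrected_coefficient(p, damage_probability, voter_ratio):
    """P' = P * (1 - Y) * (1 + H)."""
    # Assumed sanity checks: Y is a probability and H is a proportion.
    if not 0 <= damage_probability <= 1:
        raise ValueError("Y must lie in [0, 1]")
    if not 0 <= voter_ratio <= 1:
        raise ValueError("H must lie in [0, 1]")
    return p * (1 - damage_probability) * (1 + voter_ratio)
```

For instance, with P = 0.08, Y = 0.25, and H = 0.5, the corrected value is 0.08 × 0.75 × 1.5 = 0.09: damage lowers the coefficient and voter participation raises it.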
In this embodiment, the network clustering coefficient is introduced to describe data, taking into account both the attributes of the data themselves and the attributes of the data influencers, which improves classification accuracy; the user-modification frequency coefficient reduces manual intervention, so that data quality is detected efficiently. The three-level evaluation model saves storage space and improves computational efficiency. The new similarity function amplifies the effect of larger relative errors, making the quality grading more scientific and accurate. The useful data correction submodule corrects the correlation coefficient and can fully overcome the influence of human-caused damage and user voting on the data; taking C = T/9, the range of data that triggers a prompt increases by 2.7%, while the amount of computation increases by 2.5%. Association rule mining based on region partitioning is combined with the classification of useful data: hierarchical mining only needs to be carried out in the data tables produced by the three-level classification, and mining moves on to the next data table only when the current table contains no qualifying data, so the amount of computation drops substantially. The mining can also be tied to the useful data categories, which makes it more purposeful.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit its scope of protection. Although the invention has been explained in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the invention may be modified or equivalently substituted without departing from its essence and scope.

Claims (5)

1. A device for data quality management and useful data mining, characterized in that it comprises a data quality management module and a useful data mining module, wherein the quality management module comprises a preliminary processing submodule, a data description submodule, a data quality evaluation submodule, and a data quality graded management submodule, and the useful data mining module comprises a data preprocessing submodule, a useful data construction submodule, a useful data correction submodule, and a useful data hierarchical mining submodule;
the preliminary processing submodule is characterized in that it comprises:
a bottom-level data management unit, a middle-level data management unit, a high-level data management unit, a data template, and a global database.
2. The device for data quality management and useful data mining according to claim 1, characterized in that the global database comprises a bottom-level database, a middle-level database, and a high-level database.
3. The device for data quality management and useful data mining according to claim 2, characterized in that the bottom-level data management unit is used by bottom-level users to enter device information data according to the specification of the data template; it checks the entered device information data against the specification of the data template, prompts for modification of device information data that do not meet the requirements, and saves compliant device information data in the bottom-level database;
the middle-level data management unit maps the device information data in the bottom-level database onto the middle-level data management unit, where middle-level users establish virtual cross-link relationships between device information data; the middle-level data management unit analyzes the virtual cross-link relationships between device information data and produces a cross-link relationship report, prompts middle-level users to re-establish wrongly defined virtual cross-link relationships, and feeds errors in the bottom-level device information data found while establishing virtual cross-link relationships back to bottom-level users for correction; correct virtual cross-link relationship data are saved in the middle-level database; the middle-level data management unit can also map existing virtual cross-link relationship data in the middle-level database onto the middle-level data management unit for middle-level users to view and modify;
the high-level data management unit maps the device information data in the bottom-level database, together with the virtual cross-link relationship data between device information data in the middle-level database, onto the high-level data management unit; the real airborne device data and the connection data of the actual airborne network, collected and entered by high-level users, are converted into the formats of the bottom-level database and the middle-level database; the unit compares the mapped device information data with the collected real airborne device data, compares the mapped virtual cross-link relationship data with the collected connection data of the actual airborne network, produces a data analysis report, and guides the user in troubleshooting.
4. The device for data quality management and useful data mining according to claim 1, characterized in that:
(1) Data description submodule
Data are described by the attributes of the data themselves and the attributes of the data influencers; the attributes of the data themselves are data size, creation date, number of pictures contained, and related data amount, where the related data amount is the sum of the other data that the current data points to and the other data that point to the current data; the attribute of the data influencers is the influencer network clustering coefficient $\bar{K}$, obtained as follows:
a data-influencer description network is built; for each data item, the influencers comprise multiple users and one administrator, and each influencer is a node; users can browse the data and can also propose modifications to it, and the administrator can modify the data on their own initiative or according to user suggestions;
the influencer network clustering coefficient $\bar{K}$ is then defined as:
$$\bar{K} = \frac{m\sigma_1 + l\sigma_2 + n(\delta_1 \times \sigma_3 + \delta_2 \times \sigma_4)}{m + l + n} \times \left[1 - \left(\frac{m - l}{m}\right)^3\right]$$
In the formula, $\sigma_1$ is the influence factor applied each time a user browses the data, and $m$ is the total number of user browses; $\sigma_2$ is the influence factor applied each time a user proposes a modification, and $l$ is the total number of user suggestions; $\sigma_3$ is the influence factor applied each time the administrator modifies the data on their own initiative, and $\sigma_4$ is the influence factor applied each time the administrator modifies the data according to a user suggestion; $\delta_1$ and $\delta_2$ are the weights of $\sigma_3$ and $\sigma_4$ respectively, and $n$ is the total number of administrator modifications; the term $1 - \left(\frac{m-l}{m}\right)^3$ is the user-modification frequency coefficient, which represents user satisfaction with the data: the larger this coefficient, the more frequently users propose modifications to the data;
(2) Data quality evaluation submodule
Data quality is evaluated with a "three-level evaluation model": data are first divided into three classes by data size, and their quality is then evaluated by combining all attributes other than data size; the specific method is as follows:
sample data are divided into high-quality, medium-quality, and low-quality data; if the data size is greater than threshold $T_1$, the data are high-quality data; if the data size is greater than threshold $T_2$ but less than $T_1$, the data are medium-quality data; if the data size is less than $T_2$, the data are low-quality data; $T_1 > T_2$, and $T_1$, $T_2$ take values in [1KB, 1MB]; high-quality data and low-quality data are further divided into grades: all other attributes of the data are assembled into a vector, the mean of each attribute within each grade is computed from the sample data, and a mean vector is established for each grade; a new data item is represented by the vector $X = (x_1, \ldots, x_N)$ and the mean vector of a grade by $Y = (y_1, \ldots, y_N)$, where $N$ is the number of attributes other than data size; the similarity of the two vectors is expressed by the similarity function $R(X, Y)$:
$$R(X, Y) = \sum_{i=1}^{N} \left|\frac{x_i - y_i}{x_i}\right|^2 + \sum_{i=1}^{N} \left|\frac{x_i - y_i}{y_i}\right|^2$$
The smaller the value of $R(X, Y)$, the greater the similarity, and vice versa; for each data item, the similarity to the mean vector of every grade is computed, and its quality grade is determined from these values;
(3) Data quality graded management submodule
After passing through the data quality evaluation submodule, data are divided into different quality grades and are managed in tiers according to their grade.
5. The device for data quality management and useful data mining according to claim 1, characterized in that:
(1) Data preprocessing submodule
Data are divided into different domains; the data domain required by the client is determined from user demand, and the three-level evaluation model described above is used to screen the high-quality, high-grade data in that domain, forming a new data table K;
(2) Useful data construction submodule
In the preprocessed data, each data domain contains different categories; a correlation coefficient P is introduced to screen useful data categories:
$$P = \frac{\dfrac{Z_s}{Z} - \rho}{1 - \rho}$$
In the formula, $Z_s$ is the number of bidirectional links among the data in one category of the new data table K, that is, for data A and B, A points to B and B also points to A; $Z$ is the related data amount in that category of data table K; $\rho$ is computed from $N$, where $N$ is the total number of data items in the category;
(3) Useful data correction submodule
During use, useful data are affected by two factors: human-caused damage and user voting; the correlation coefficient corrected for these two factors is $P'$; a threshold $T$ is set, with $T \in (0, 0.1]$; if $P' > T$, the category is useful data; when no qualifying useful data can be obtained from the high-quality data, the medium-quality data and then the low-quality data are searched in turn; after all data have been searched, if the maximum $P'$ obtained is less than $T$, or the maximum $P'$ is greater than $T$ but the absolute value of its difference from $T$ is less than a set value $C$, then either no useful data can be found, or useful data can be found but their degree of correlation is already below expectation, and a prompt is automatically sent to the administrator to modify or add related data; take $C = T/5$;
(4) Useful data hierarchical mining submodule
First, data table K is scanned; let the maximum and minimum of $P'$ be $P'_{\max}$ and $P'_{\min}$; data table K is partitioned into a number of non-overlapping regions, computed from $P'_{\max}$ and $P'_{\min}$ with the rounding function int, and local frequent itemsets are mined from the regions; then, using the Apriori property, the local frequent itemsets are joined to obtain the global candidate itemsets; K is scanned again to count the actual support of each candidate and determine the global frequent itemsets;
the specific formula used in the useful data correction submodule to correct for human-caused damage and user voting is:
$$P' = P \times (1 - Y) \times (1 + H)$$
In the formula, $Y$ is the probability that the data are damaged by humans, and $H$ is the proportion of voting users among the total number of users.
CN201610524328.2A 2016-07-04 2016-07-04 A kind of device excavated with useful data for data quality management Withdrawn CN106202347A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610524328.2A CN106202347A (en) 2016-07-04 2016-07-04 A kind of device excavated with useful data for data quality management

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610524328.2A CN106202347A (en) 2016-07-04 2016-07-04 A kind of device excavated with useful data for data quality management

Publications (1)

Publication Number Publication Date
CN106202347A true CN106202347A (en) 2016-12-07

Family

ID=57465852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610524328.2A Withdrawn CN106202347A (en) 2016-07-04 2016-07-04 A kind of device excavated with useful data for data quality management

Country Status (1)

Country Link
CN (1) CN106202347A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107171847A (en) * 2017-05-27 2017-09-15 武汉虹信通信技术有限责任公司 Automatic management bulk device method based on EMS
CN107171847B (en) * 2017-05-27 2019-10-25 武汉虹信通信技术有限责任公司 Automatic management bulk device method based on EMS
CN109785915A (en) * 2018-12-24 2019-05-21 东软集团股份有限公司 Data collect method, device, storage medium and electronic equipment
CN109785915B (en) * 2018-12-24 2021-03-19 东软集团股份有限公司 Data collection method, device, storage medium and electronic equipment
CN110309131A (en) * 2019-04-12 2019-10-08 北京星网锐捷网络技术有限公司 The method for evaluating quality and device of massive structured data

Similar Documents

Publication Publication Date Title
Contreras et al. Stochastic uncapacitated hub location
US5546564A (en) Cost estimating system
CN111222661A (en) Urban planning implementation effect analysis and evaluation method
CN106126521A (en) The social account method for digging of destination object and server
Chevalier et al. Data integration methods to account for spatial niche truncation effects in regional projections of species distribution
CN110135890A (en) The product data method for pushing and relevant device of knowledge based relation excavation
CN109255586A (en) A kind of online personalized recommendation method that E-Governance Oriented is handled affairs
Wang et al. Location optimization of multiple distribution centers under fuzzy environment
CN106970986A (en) Urban waterlogging influence degree method for digging and system based on deep learning
JP2002032773A (en) Device and method for processing map data
CN106202347A (en) A kind of device excavated with useful data for data quality management
CN106326923A (en) Sign-in position data clustering method in consideration of position repetition and density peak point
CN105975640A (en) Big data quality management and useful data mining device
CN108509198B (en) Neutral BOM-based product electronic album construction method
CN111798032B (en) Fine grid evaluation method for supporting dual evaluation of homeland space planning
CN110263109A (en) A kind of family's amount evaluation method merging internet information and GIS technology
CN115605903A (en) System and method for quickly composing, launching and configuring a customizable second-level migration structure with a built-in audit and monitoring structure
CN106126739A (en) A kind of device processing business association data
Morrice et al. An approach to ranking and selection for multiple performance measures
CN106202344A (en) The quality management of a kind of vehicle-mounted data and useful data excavating gear
CN106156323A (en) Realize data staging management and the device excavated
Sadrykia et al. A GIS-based decision making model using fuzzy sets and theory of evidence for seismic vulnerability assessment under uncertainty (case study: Tabriz)
Zhang et al. Clustering with implicit constraints: A novel approach to housing market segmentation
Al-Deek et al. Computing travel time reliability in transportation networks with multistates and dependent link failures
CN116415199A (en) Business data outlier analysis method based on audit intermediate table

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C04 Withdrawal of patent application after publication (patent law 2001)
WW01 Invention patent application withdrawn after publication

Application publication date: 20161207