CN113554288A - Universal data quality evaluation method and device - Google Patents

Universal data quality evaluation method and device

Info

Publication number
CN113554288A
Authority
CN
China
Prior art keywords
data
training
quality
feature
quality evaluation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110767866.5A
Other languages
Chinese (zh)
Inventor
李晓东
王伟
李颖
王翠翠
王威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuxi Technology Heze Co ltd
Shandong Fuxi Think Tank Internet Research Institute
Original Assignee
Fuxi Technology Heze Co ltd
Shandong Fuxi Think Tank Internet Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuxi Technology Heze Co ltd, Shandong Fuxi Think Tank Internet Research Institute
Priority to CN202110767866.5A
Publication of CN113554288A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06395Quality analysis or management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Operations Research (AREA)
  • Evolutionary Computation (AREA)
  • Strategic Management (AREA)
  • Mathematical Physics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Algebra (AREA)
  • Marketing (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a method and a device for evaluating the quality of general data, comprising the following steps: carrying out feature vectorization processing on the original data set to form feature synthetic data; respectively carrying out data quality evaluation model training and characteristic disturbance processing according to the characteristic synthetic data, and respectively sending model training parameters and disturbed data to a server; receiving a training updating parameter and a data mean value sent by a server, wherein the training updating parameter is updated to a data quality evaluation model to optimize the model; and obtaining a data quality predicted value according to the data quality evaluation model, and obtaining a data relative quality estimated value according to data mean processing. The client processes the original data and outputs the disturbed data to protect the privacy of the data; the server receives the disturbed data to complete the construction of a data quality evaluation model and the calculation of the mean value of the data quality characteristic vector; and the client acquires the server-side synthesis model and the mean value, and calculates to obtain an open quality prediction value and a relative quality estimation value.

Description

Universal data quality evaluation method and device
Technical Field
The invention relates to the technical field of big data analysis and data security, in particular to a general data quality evaluation method and device.
Background
At present, most data quality evaluation methods depend on a trusted third party: the data owner must transmit the data to the third party, which then evaluates the quality of the plaintext data. Prior-art data quality evaluation methods fall mainly into two categories: 1. methods that extract intrinsic characteristics of the data set, chiefly data quality characteristics based on constraint rules; 2. methods that measure the contribution of the data set to a calculation task, chiefly by evaluating, for a machine learning or deep learning task, whether training on the data set moves the gradient closer to a local optimal point.
The first method is a pre-evaluation algorithm that a data owner can complete locally without a third party, but it is limited in scope and reflects only basic quality problems such as missing or erroneous data. The second method is a post-evaluation algorithm: data quality can be evaluated only after the calculation task is completed. In many cases, however, data quality is a precondition for a data requester deciding whether to use the data. A feasible scheme is to train a model on the features and contribution evaluations of historical data sets, so that the contribution value of a new data set can be predicted from its features. The data sets owned by a single data owner are relatively limited, though, and a good training effect is difficult to achieve, so a third party must collect various data sets and their contributions for model training. Third parties are not all trustworthy, and once data leaves the data owner's local environment and is sent to a third party, uncontrollable events such as copying, transmission and leakage of the data may occur. In addition to the above two methods, the data market needs the relative quality of data to reflect market conditions and to help a data requester quickly judge the quality of the current data set against other data sets when deciding whether to use it. Calculating the relative quality of data likewise depends on collecting data sets and therefore also faces the problem of untrusted third parties.
Disclosure of Invention
The invention provides a general data quality evaluation method and device, which are used for overcoming the defect that data quality evaluation depends on a third party in the prior art, realizing data quality evaluation and protecting the privacy and safety of client information.
The invention provides a general data quality evaluation method, which is applied to a client, wherein the client communicates with a server, and the method comprises the following steps:
carrying out feature vectorization processing on the original data set to form feature synthetic data;
respectively carrying out data quality evaluation model training and feature disturbance processing according to the feature synthetic data, and respectively sending training parameters of the data quality evaluation model and the data subjected to the feature disturbance processing to the server;
receiving a training updating parameter and a data mean value sent by the server, wherein the training updating parameter is updated to a data quality evaluation model to optimize the model;
and obtaining a data quality predicted value according to the data quality evaluation model, and obtaining a data relative quality estimated value according to the data mean value processing.
According to the general data quality evaluation method provided by the invention, the feature vectorization processing is performed on the original data set to form feature synthesis data, and the method comprises the following steps:
performing quality evaluation on data in the original data set according to a constraint rule to obtain a first feature vector which meets standard quality;
carrying out cluster analysis on the data in the original data set to infer similarity data among the data;
and synthesizing the first feature vector and the similarity data to form a second feature vector.
According to the general data quality evaluation method provided by the invention, the quality evaluation is performed on the data in the original data set according to the constraint rule to obtain the first feature vector which accords with the standard quality, and the method comprises the following steps: and evaluating the data in the original data set with the data number of m according to the data quality measurement standard of the constraint rule to obtain a first feature vector with the length of n-1, wherein the first feature vector meets the standard quality of the constraint rule.
According to the general data quality evaluation method provided by the invention, the clustering analysis is carried out on the data in the original data set to deduce the similarity data among the data, and the method comprises the following steps:
performing feature vectorization on an original data set with m data numbers to generate m x k feature matrixes formed by m k-dimensional feature vectors;
and performing clustering speculation on the characteristic matrix by using an unsupervised clustering method to obtain similarity data among the data.
According to the general data quality evaluation method provided by the present invention, the synthesizing the first feature vector and the similarity data to form a second feature vector includes: and synthesizing the first feature vector and the similarity data into an n-dimensional vector as the second feature vector.
According to the general data quality evaluation method provided by the invention, the data quality evaluation model training and the feature disturbance processing are respectively carried out according to the feature synthesis data, and the training parameters of the data quality evaluation model and the data after the feature disturbance processing are respectively sent to the server side, and the method comprises the following steps:
according to a horizontal federated logistic regression method, the second feature vector is used as training data and the calculated contribution is used as the label for model training, the data quality assessment model is obtained, and model training parameters are sent to the server side;
and normalizing the second feature vector, disturbing the normalized feature by applying a random response mechanism, and sending disturbed data to the server.
According to the general data quality evaluation method provided by the invention, the data quality prediction value is obtained according to the data quality evaluation model, and the data relative quality estimation value is obtained according to the data mean value processing, and the method comprises the following steps:
updating the training updating parameters sent by the server to a data quality evaluation model to obtain an optimized data quality evaluation model, and inputting the new data set into the optimized data quality evaluation model to obtain an exposable data quality prediction value of the new data set;
and obtaining an exposable data relative quality estimated value of the new data set according to the relative error average value of the feature vector of the new data set and the feature of the data mean value.
The invention also provides a general data quality evaluation method, which is applied to a server side, wherein the server side communicates with a plurality of clients, and the method comprises the following steps:
receiving training parameters of a data quality evaluation model and data after characteristic disturbance processing sent by each client; wherein, each client applies the same random response mechanism to carry out characteristic disturbance treatment;
processing the training parameters sent by each client by applying a weighted average method to obtain training update parameters, and sending the training update parameters to each client respectively;
deducing the relation between the sum of the disturbance values and the sum of the original values according to the disturbance mechanism parameters, deducing the mean value of the original values, taking the deduced mean value of the original values as the data mean value and sending the data mean value to each client.
The invention also provides a general data quality evaluation device, which is applied to a client, wherein the client communicates with a server, and the device comprises:
the characteristic vectorization unit is used for carrying out characteristic vectorization processing on the original data set to form characteristic synthetic data;
the characteristic synthesis data processing unit is used for respectively carrying out data quality evaluation model training and characteristic disturbance processing according to the characteristic synthesis data and respectively sending training parameters of the data quality evaluation model and the data subjected to the characteristic disturbance processing to the server;
the receiving server data unit is used for receiving the training update parameters and the data mean value sent by the server, wherein the training update parameters are updated to a data quality evaluation model to optimize the model;
and the quality evaluation unit is used for obtaining a data quality predicted value according to the data quality evaluation model and obtaining a data relative quality estimated value according to the data mean value processing.
The invention also provides a general data quality evaluation device, which is applied to a server side, wherein the server side communicates with a plurality of clients, and the device comprises:
the receiving client data unit is used for receiving the training parameters of the data quality evaluation model and the data subjected to characteristic disturbance processing sent by each client; wherein, each client applies the same random response mechanism to carry out characteristic disturbance treatment;
the training parameter processing unit is used for processing the training parameters sent by the clients by applying a weighted average method to obtain training update parameters and sending the training update parameters to the clients respectively;
and the original value mean processing unit deduces the relation between the sum of the disturbance values and the sum of the original values according to the disturbance mechanism parameters, deduces the mean value of the original values, takes the deduced original value mean value as the data mean value and sends the data mean value to each client.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of any one of the above general data quality evaluation methods.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method for universal data quality assessment as described in any of the above.
According to the universal data quality evaluation method and device provided by the invention, the client processes the original data and outputs the disturbed data to protect the privacy of the data; the server receives the disturbed data to complete the construction of a data quality evaluation model and the calculation of the mean value of the data quality characteristic vector; and the client acquires the server-side synthesis model and the mean value, and calculates a quality prediction value and a relative quality estimation value that can be made public. The method integrates the two quality evaluation results, the quality prediction value and the relative quality estimation value, to show the data quality comprehensively, so that a data requester can gain a full understanding of the data without being able to see it; by transmitting only parameters rather than the original data, and by combining this with a data disturbance mechanism, the data evaluation can be completed without depending on a trusted third party, and the risk of privacy disclosure is reduced.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic block diagram of a general data quality assessment system provided by the present invention;
FIG. 2 is a flow chart of a method for evaluating the quality of data at a client end according to the present invention;
FIG. 3 is a flowchart of the method of step 210 in FIG. 2;
FIG. 4 is a flowchart of the method of step 212 of FIG. 3;
FIG. 5 is a flowchart of the method of step 220 of FIG. 2;
FIG. 6 is a flowchart of the method of step 240 of FIG. 2;
FIG. 7 is a flowchart of a method for evaluating the quality of data in a universal manner from the perspective of a server according to the present invention;
FIG. 8 is a schematic structural diagram of a device for evaluating the quality of data at a client end according to the present invention;
FIG. 9 is a schematic structural diagram of a device for evaluating quality of general data from a server side according to the present invention;
FIG. 10 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the present invention shown in fig. 1 provides a general data quality evaluation system, which includes a client and a server,
the client is used for respectively carrying out local model training and characteristic disturbance processing on the data of the original data set after carrying out characteristic vectorization processing, sending parameters after the local model training and disturbed data to the server, and respectively processing the parameters according to training updating parameters and data mean values returned by the server to obtain two data quality evaluation parameters, namely a data quality predicted value and a relative quality estimation value;
the server is used for receiving the local model training parameters and the disturbance characteristic data sent by the client, performing weighted average processing on the local model training parameters of each client to obtain training updating parameters, estimating the mean value of the original values, namely the data mean value according to the disturbance characteristics, and respectively returning the training updating parameters and the data mean value to the client.
In the embodiment of the invention, the client belongs to a data owner and can access the data directly, while the server belongs to an untrusted third party and must be prevented from acquiring any information from which data content could be inferred. The quantity relationship between the server and the clients is one-to-many: one server collects data from a plurality of clients. In the embodiment of the invention, the participants are assumed to be semi-honest, i.e. honest but curious, which means that a participant will perform calculation according to the predetermined algorithm but may try to use the information it acquires to infer more information and data.
As shown in fig. 2, the method for evaluating the quality of the general data applied to the client according to the embodiment of the present invention includes:
step 210: carrying out feature vectorization processing on the original data set to form feature synthetic data;
as shown in fig. 3, which is a detailed flowchart of step 210, this step specifically includes:
step 211: performing quality evaluation on data in the original data set according to a constraint rule to obtain a first feature vector which meets standard quality;
the method specifically comprises the following steps: the data in the original data set, which contains m data items, are evaluated against the data quality measurement standards of the constraint rule to obtain a first feature vector of length n-1 that reflects the standard quality under the constraint rule. In the embodiment of the invention, the constraint rule is designed around basic quality measurement standards such as correctness, completeness and consistency for a data set of size m, for example the number of null attributes, the number of inconsistent data items, the number of value-type errors and the number of value-range errors; these counts are gathered automatically by code to form the feature vector of length n-1. In this step, the data in the original data set of m data items finally form a first feature vector of length n-1. The step is a process of analyzing the data quality in the original data set and vectorizing the result into a feature vector. Each dimension of the feature vector generated from the original data set is obtained through constraint rule statistics, so only one quality evaluation statistic is performed on each data item in this step.
Step 212: carrying out cluster analysis on the data in the original data set to infer similarity data among the data;
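The constraint-rule statistics of step 211 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the schema, the attribute names `age` and `score`, and the choice of exactly three rules (so that n-1 = 3) are hypothetical, and normalizing the counts by the number of cells is an added assumption.

```python
from typing import Any, Dict, List

# Hypothetical schema: expected type and allowed range for each attribute.
SCHEMA = {
    "age": {"type": int, "min": 0, "max": 150},
    "score": {"type": float, "min": 0.0, "max": 1.0},
}


def constraint_features(records: List[Dict[str, Any]]) -> List[float]:
    """Return violation rates (null, type error, range error) for m records."""
    nulls = type_errors = range_errors = 0
    for rec in records:
        for name, rule in SCHEMA.items():
            value = rec.get(name)
            if value is None:
                nulls += 1                      # completeness rule
            elif not isinstance(value, rule["type"]):
                type_errors += 1                # value-type rule
            elif not (rule["min"] <= value <= rule["max"]):
                range_errors += 1               # value-range rule
    cells = max(len(records) * len(SCHEMA), 1)  # avoid division by zero
    return [nulls / cells, type_errors / cells, range_errors / cells]


records = [
    {"age": 30, "score": 0.5},
    {"age": None, "score": 2.0},   # one null, one range error
    {"age": "x", "score": 0.1},    # one type error
]
first_feature_vector = constraint_features(records)
```

Each rule contributes one dimension, so adding a consistency rule or any other rule would extend the vector by one entry.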
as shown in fig. 4, which is a detailed flowchart of step 212, this step specifically includes:
step 2121: performing feature vectorization on an original data set with m data numbers to generate m x k feature matrixes formed by m k-dimensional feature vectors;
in a specific example, if the calculation task, that is, the evaluation content, is to analyze domain name data, the domain name is converted into a feature vector according to the requirement of domain name data analysis. That is, the present step is to convert the original data set into the form of a feature matrix.
Step 2122: and performing clustering inference on the feature matrix by using an unsupervised clustering method to obtain similarity data among the data. In this step, a sufficiently small threshold T is set, an unsupervised clustering algorithm such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is used to cluster the m k-dimensional feature vectors, and the degree of similarity of the data is estimated from the number of clusters.
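A minimal pure-Python sketch of step 2122 follows. The DBSCAN routine is the textbook algorithm, with `eps` playing the role of the threshold T. The mapping from the clustering result to a single similarity value (one minus the share of distinct groups) is an assumption made for illustration, since the text only states that similarity is estimated from the number of clusters.

```python
import math
from typing import List, Sequence


def dbscan(points: Sequence[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Minimal DBSCAN: returns a cluster id per point, -1 for noise."""
    UNSEEN, NOISE = -2, -1
    labels = [UNSEEN] * len(points)

    def neighbors(i: int) -> List[int]:
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] != UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] != UNSEEN:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:     # j is a core point, keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels


def similarity_feature(points: Sequence[Sequence[float]],
                       eps: float = 0.05, min_pts: int = 2) -> float:
    """Assumed mapping: fewer distinct groups relative to m means more similar data."""
    labels = dbscan(points, eps, min_pts)
    if not labels:
        return 0.0
    clusters = len({lab for lab in labels if lab >= 0})
    noise = sum(1 for lab in labels if lab == -1)
    return 1.0 - (clusters + noise) / len(labels)


points = [[0.0, 0.0], [0.0, 0.01], [0.01, 0.0], [1.0, 1.0], [5.0, 5.0]]
similarity = similarity_feature(points)
```

With a small `eps`, only near-duplicate vectors fall into the same cluster, so a data set full of near-duplicates yields few groups and a similarity value close to 1.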
Step 213: and synthesizing the first feature vector and the similarity data to form a second feature vector.
The method specifically comprises the following steps: the first feature vector and the similarity data are combined into an n-dimensional vector as the second feature vector, so the second feature vector is the feature synthetic data output by step 210. In this step, the output of step 212 is a single feature, which forms one dimension of the synthesized n-dimensional feature vector.
In the embodiment of the present invention, step 211 and step 212 are relatively independent processes.
Step 220: respectively carrying out data quality evaluation model training and feature disturbance processing according to the feature synthetic data, and respectively sending training parameters of the data quality evaluation model and the data subjected to the feature disturbance processing to the server;
as shown in fig. 5, which is a detailed flowchart of step 220, the step specifically includes:
step 221: according to a horizontal federated logistic regression method, the second feature vector is used as training data and the calculated contribution is used as the label for model training, the data quality assessment model is obtained, and model training parameters are sent to the server side;
the input in this step is the output of step 210, and the data labels used to train the quality assessment model represent the contribution of the training data to the calculation task. Taking the training of a domain name analysis model as a specific example of the calculation task, the label is the magnitude of the data's contribution to the gradient during that training. Specifically, the model training process of the calculation task continuously approaches a local optimal point by gradient descent. When a piece of data is input into the calculation model, the gradient changes: a decrease in the gradient indicates that the piece of data contributes much, otherwise it contributes little; the larger the decrease, the greater the contribution. In embodiments of the present invention, the data labels are known prior to training.
Preferably, in this step, in order to ensure privacy of the calculation process before transmitting the training model parameters to the server, the model parameters are disturbed by using a differential privacy algorithm and then transmitted to the server.
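The local training in step 221 can be sketched as plain logistic regression trained by batch gradient descent. This is an illustration under stated assumptions: the function names, learning rate, epoch count and the tiny one-dimensional data set are hypothetical, labels are assumed to be contributions scaled to [0, 1], and the optional perturbation of the parameters before upload is omitted.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_quality(x: Sequence[float], w: Sequence[float]) -> float:
    """Model output in [0, 1]; w[-1] is the bias term."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])


def train_local(features: Sequence[Sequence[float]], labels: Sequence[float],
                weights: Sequence[float], lr: float = 0.5,
                epochs: int = 200) -> List[float]:
    """One client's local update: batch gradient descent on the log loss."""
    w = list(weights)
    m = len(features)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(features, labels):
            err = predict_quality(x, w) - y   # derivative of log loss w.r.t. logit
            for k, xk in enumerate(x):
                grad[k] += err * xk
            grad[-1] += err                   # bias gradient
        w = [wi - lr * g / m for wi, g in zip(w, grad)]
    return w


# Hypothetical second feature vectors (n = 1 here) with contribution labels.
features = [[0.0], [0.1], [0.9], [1.0]]
labels = [0.0, 0.0, 1.0, 1.0]
local_params = train_local(features, labels, [0.0, 0.0])
```

In a federated round, `local_params` is what the client would send to the server, and the `weights` argument is where the training update parameters returned by the server would be loaded for the next iteration.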
Step 222: and normalizing the second feature vector, disturbing the normalized feature by applying a random response mechanism, and sending disturbed data to the server.
The input in this step is the output of step 210. The features are first normalized and then disturbed by a random response mechanism that satisfies ε-local differential privacy, that is, for any two input values, the ratio of the probabilities of being disturbed to the same output is not more than e^ε; the disturbed values are then sent to the server.
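One way to realize such a disturbance is sketched below. The text does not specify which random response mechanism is used, so this example assumes a common construction: each normalized feature in [0, 1] is stochastically rounded to a bit, and the bit is kept with probability p = e^ε / (e^ε + 1) and flipped otherwise. The function names are hypothetical, and ε here is a per-feature budget.

```python
import math
import random
from typing import List, Sequence


def keep_probability(epsilon: float) -> float:
    """Probability of reporting the true bit under binary randomized response."""
    return math.exp(epsilon) / (math.exp(epsilon) + 1.0)


def perturb(features: Sequence[float], epsilon: float,
            rng: random.Random) -> List[int]:
    """Perturb normalized features in [0, 1] into one reported bit per feature."""
    p = keep_probability(epsilon)
    reported = []
    for x in features:
        bit = 1 if rng.random() < x else 0   # stochastic rounding, E[bit] = x
        if rng.random() >= p:                # flip with probability 1 - p
            bit = 1 - bit
        reported.append(bit)
    return reported


rng = random.Random(0)
disturbed = perturb([0.2, 0.8], epsilon=1.0, rng=rng)
```

Under this assumption the probability of reporting 1 always lies between 1 - p and p, so for any two inputs the ratio of output probabilities is at most p / (1 - p) = e^ε, which is the ε-local differential privacy condition.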
Step 230: receiving a training updating parameter and a data mean value sent by the server, wherein the training updating parameter is updated to a data quality evaluation model to optimize the model;
in this step, the server integrates the training parameters of the clients to form training update parameters, and returns the training update parameters to the local models of the clients for the next iteration.
Step 240: and obtaining a data quality predicted value according to the data quality evaluation model, and obtaining a data relative quality estimated value according to the data mean value processing.
As shown in fig. 6, which is a detailed flowchart of step 240, this step specifically includes:
step 241: updating the training updating parameters sent by the server to the data quality evaluation model to obtain an optimized data quality evaluation model, and inputting the new data set into the optimized data quality evaluation model to obtain a data quality prediction value of the new data set that can be disclosed and does not leak any original data; in this step, the numerical range of the data quality prediction value is [0,1]. In this step, the client loads locally the converged model parameters produced by the server's processing and uses them to predict the quality of the new data set.
Step 242: and obtaining an exposable data relative quality estimated value of the new data set according to the average of the relative errors between the features of the new data set's feature vector and the features of the data mean value. The relative quality estimation value obtained in this step does not reveal any privacy information and can be output publicly; its range is [0,1].
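Step 242 can be sketched as an average of per-feature relative errors. The text states only that the result lies in [0, 1], so two choices here are assumptions: each relative error is clipped to 1 to keep the average in range, and a small constant `tiny` guards against division by zero. Whether a smaller or larger value signals better quality is not specified in the text.

```python
from typing import Sequence


def relative_quality(features: Sequence[float], mean: Sequence[float],
                     tiny: float = 1e-12) -> float:
    """Average clipped relative error between a feature vector and the data mean."""
    errors = []
    for x, mu in zip(features, mean):
        rel = abs(x - mu) / max(abs(mu), tiny)
        errors.append(min(rel, 1.0))   # clip so the average stays in [0, 1]
    return sum(errors) / len(errors) if errors else 0.0


# Hypothetical feature vector of a new data set and server-provided data mean.
estimate = relative_quality([0.5, 0.2], [0.5, 0.4])
```

The first feature matches the mean exactly and the second deviates by half of the mean, so the two relative errors are 0 and 0.5.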
In the embodiment of the invention, the client locally completes the processing and treatment of the original data set, and ensures that the output of the client is disturbed data so as to protect the privacy of the data; the server receives the disturbed data, and completes construction of a data quality evaluation model and mean value calculation of a data quality characteristic vector according to the processed data; and the client acquires the model and the mean value synthesized by the server, and calculates to obtain the public quality prediction value and the relative quality score.
In the embodiment of the invention, 1) a plurality of data quality evaluation methods are integrated to show the data quality comprehensively, so that a data requester can gain a full understanding of the data without being able to see it; 2) a federated learning algorithm is combined to realize privacy-preserving prediction of data set contribution that does not depend on a trusted third party; 3) a differential privacy algorithm is combined to realize privacy-preserving relative quality assessment that does not depend on a trusted third party. The core idea of the federated learning algorithm is that only parameters are transmitted, not data: each data owner trains a model locally and uploads the parameters to a third party, which integrates them into the model update parameters for the current round; after the integration, the third party distributes the parameters to each data owner for updating the local model. This is iterated until convergence. The core idea of differential privacy is that the original data are disturbed by random variables obeying a certain distribution so that the original data become indistinguishable, while probability theory gives the relationship between the expectation of a function of the disturbed values and the expectation of the same function of the original values, so that the value of the function of the original values can be calculated from the function of the disturbed values.
As shown in fig. 7, an embodiment of the present invention further provides a method for evaluating quality of general data, which is applied to a server, where the server communicates with a plurality of clients, and the method includes:
step 710: receiving training parameters of a data quality evaluation model and data after characteristic disturbance processing sent by each client; wherein, each client applies the same random response mechanism to carry out characteristic disturbance treatment;
in the embodiment of the invention, in order to protect the privacy of the client data, the client performs disturbance processing on the data before sending it to the server.
Step 720: processing the training parameters sent by each client by applying a weighted average method to obtain training update parameters, and sending the training update parameters to each client respectively;
in this step, the input is the locally trained model parameters from each client; a weighted average algorithm is applied to obtain the parameters of the synthetic model, which are sent to each client as training update parameters for optimizing its local model.
Step 730: deducing the relation between the sum of the disturbance values and the sum of the original values according to the disturbance mechanism parameters, deducing the mean value of the original values, taking the deduced mean value of the original values as the data mean value and sending the data mean value to each client.
In the embodiment of the present invention, steps 720 and 730 are independent processes, and the processed training update parameters and the processed data mean are respectively returned to the client, and the client performs quality prediction and relative quality calculation on the new data set by using the returned values.
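The weighted averaging of step 720 can be sketched as follows. This is a minimal illustration, not the patented implementation: the source only says "weighted average", so weighting each client by its number of local samples is an assumption, and the name `weighted_average` is hypothetical.

```python
from typing import List, Sequence


def weighted_average(client_params: Sequence[Sequence[float]],
                     client_weights: Sequence[float]) -> List[float]:
    """Combine per-client parameter vectors into one training update."""
    if not client_params or len(client_params) != len(client_weights):
        raise ValueError("need one weight per client parameter vector")
    total = float(sum(client_weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(client_params[0])
    merged = [0.0] * dim
    for params, weight in zip(client_params, client_weights):
        if len(params) != dim:
            raise ValueError("all clients must send vectors of equal length")
        for i, value in enumerate(params):
            # Each client contributes in proportion to its weight.
            merged[i] += (weight / total) * value
    return merged


# Two hypothetical clients; weights are assumed to be local sample counts.
update = weighted_average([[1.0, 2.0], [3.0, 6.0]], [100, 300])
print(update)  # [2.5, 5.0]
```

The server would send `update` back to every client as the training update parameters for the next round.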
As shown in fig. 8, an embodiment of the present invention further provides a universal data quality evaluation apparatus, which is applied to a client, where the client communicates with a server, and the apparatus includes:
a feature vectorization unit 810, configured to perform feature vectorization on the original data set to form feature synthesized data;
in the embodiment of the present invention, the feature vectorization unit 810 includes two sub-units, namely a data quality vectorization sub-unit and a data similarity measurement sub-unit.
The data quality vectorization subunit is used for vectorizing the quality of the data based on constraint rules. The constraint rules are designed around basic quality metrics such as correctness, completeness and consistency for a data set of size m, for example counting the number of null attributes, the number of inconsistent data items, the number of value type errors and the number of value range errors; the counts are computed automatically by code to form a feature vector of length n-1;
and the data similarity measurement subunit is used for vectorizing the m pieces of original data according to the calculation task to generate an m x k feature matrix, clustering the m k-dimensional feature vectors by applying an unsupervised clustering algorithm, and estimating the similarity of the data according to the number of clusters.
And finally, synthesizing the results of the data quality vectorization subunit and the data similarity measurement subunit into characteristic synthetic data.
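The two subunits and their synthesis can be sketched as follows. This is a minimal illustration under stated assumptions: the `Schema` format, the three chosen counts, and the function names are hypothetical, and the source does not name a clustering algorithm, so a simple leader clustering with a fixed `radius` stands in for "an unsupervised clustering algorithm".

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical schema: field name -> (expected type, (min, max) range or None).
Schema = Dict[str, Tuple[type, Optional[Tuple[float, float]]]]


def quality_counts(records: List[dict], schema: Schema) -> List[int]:
    """Count null, type-error and range-error values over all records."""
    nulls = type_errors = range_errors = 0
    for record in records:
        for field, (expected_type, bounds) in schema.items():
            value = record.get(field)
            if value is None:
                nulls += 1
            elif not isinstance(value, expected_type):
                type_errors += 1
            elif bounds is not None and not (bounds[0] <= value <= bounds[1]):
                range_errors += 1
    return [nulls, type_errors, range_errors]


def count_clusters(vectors: List[List[float]], radius: float) -> int:
    """Leader clustering: a vector joins the first center within `radius`,
    otherwise it becomes a new center. Returns the number of clusters."""
    centers: List[List[float]] = []
    for vec in vectors:
        for center in centers:
            dist = sum((a - b) ** 2 for a, b in zip(vec, center)) ** 0.5
            if dist <= radius:
                break
        else:
            centers.append(vec)
    return len(centers)


records = [
    {"age": 30, "name": "a"},
    {"age": None, "name": "b"},
    {"age": "x", "name": "c"},
    {"age": 200, "name": None},
]
schema: Schema = {"age": (int, (0, 120)), "name": (str, None)}
# Assumed task-specific vectorization of the m = 4 records (k = 2).
vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]

# n - 1 quality counts plus the cluster count give an n-dimensional vector.
feature = quality_counts(records, schema) + [count_clusters(vectors, radius=1.0)]
print(feature)  # [2, 1, 1, 2]
```

Here n = 4: three constraint-rule counts followed by the number of clusters, which serves as the similarity estimate.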
The feature synthesis data processing unit 820 is configured to perform data quality evaluation model training and feature perturbation processing according to the feature synthesis data, and send training parameters of the data quality evaluation model and data after the feature perturbation processing to the server respectively;
in the embodiment of the invention, each client side constructs and trains a local data quality evaluation model according to characteristic synthetic data obtained by data processing, and sends model training parameters after local training to the server side, but in order to ensure the privacy of the calculation process, the model training parameters are disturbed by adopting a differential privacy algorithm and then sent to the server side.
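Local training on a client can be sketched as follows. This is a minimal illustration, assuming a logistic regression model (the claims mention federated logistic regression), binary contribution labels, and plain batch gradient descent; the names `train_local` and `predict`, the learning rate and the epoch count are all hypothetical. The disturbance of the parameters before upload is not shown.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights: List[float], x: List[float]) -> float:
    """Predicted probability; weights[-1] is the bias term."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + weights[-1])


def train_local(features: List[List[float]], labels: List[int],
                weights: List[float], lr: float = 0.1,
                epochs: int = 200) -> List[float]:
    """Run batch gradient descent on the logistic loss, starting from the
    parameters last received from the server."""
    w = list(weights)
    m = len(features)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(features, labels):
            err = predict(w, x) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi / m
            grad[-1] += err / m
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


# Hypothetical one-dimensional feature vectors with assumed 0/1 labels.
features = [[0.1], [0.2], [0.8], [0.9]]
labels = [0, 0, 1, 1]
params = train_local(features, labels, [0.0, 0.0])
```

The returned `params` are what the client would upload (after disturbance) as its training parameters for the current round.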
In the embodiment of the present invention, the feature synthesis data processing unit 820 processes the feature synthesis data by first normalizing the features and then disturbing them with a random response mechanism. The mechanism is required to satisfy ε-local differential privacy, that is, for any two input values, the ratio of the probabilities of being disturbed to the same output is not greater than e^ε. The disturbed values are then sent to the server.
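The normalize-then-disturb procedure can be sketched as follows. This is a minimal illustration, not the patented mechanism: the source does not specify which random response mechanism is used, so a one-bit randomized response on values in [0, 1] is assumed, as are the publicly agreed `bounds` used for normalization and all function names.

```python
import math
import random
from typing import List, Sequence, Tuple


def normalize(features: Sequence[float],
              bounds: Sequence[Tuple[float, float]]) -> List[float]:
    """Min-max scale each feature into [0, 1], clipping values outside its bounds."""
    scaled = []
    for value, (low, high) in zip(features, bounds):
        ratio = (value - low) / (high - low)
        scaled.append(min(1.0, max(0.0, ratio)))
    return scaled


def report_probability(x: float, epsilon: float) -> float:
    """Probability of reporting 1 for a normalized value x in [0, 1].

    It ranges from 1 / (e^eps + 1) at x = 0 to e^eps / (e^eps + 1) at x = 1,
    so for either output bit the probabilities under any two inputs differ
    by a factor of at most e^eps.
    """
    e = math.exp(epsilon)
    return (1.0 + x * (e - 1.0)) / (e + 1.0)


def perturb(features: Sequence[float], epsilon: float,
            rng: random.Random) -> List[int]:
    """Replace every normalized feature with a single randomized bit."""
    return [1 if rng.random() < report_probability(x, epsilon) else 0
            for x in features]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
bounds = [(0, 10), (0, 10), (0, 10), (1, 5)]  # hypothetical public bounds
normalized = normalize([2, 1, 1, 2], bounds)
disturbed = perturb(normalized, epsilon=1.0, rng=rng)
```

Only `disturbed` would leave the client. Note that disturbing each of n features with budget ε spends n·ε in total under sequential composition, so a deployment would have to divide the budget accordingly.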
A receiving server data unit 830, configured to receive a training update parameter and a data mean value sent by the server, where the training update parameter is updated to a data quality evaluation model to optimize the model;
and the quality evaluation unit 840 obtains a data quality prediction value according to the data quality evaluation model and obtains a data relative quality estimation value according to the data mean value processing.
In the embodiment of the invention, the client loads the training update parameters into its local model, uses the model to predict the quality of the new data set, and outputs a prediction value that can be disclosed publicly without leaking any original data.
The principle of performing a relative quality calculation on a data set is: the average relative error between the feature vector of the local data set and the data mean value is calculated and used as the relative quality estimation value of the data set; this value does not reveal any private information and can be output publicly.
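The relative quality calculation can be sketched as follows. This is a minimal illustration under stated assumptions: the source does not say whether the relative error is signed or absolute, so the absolute value is used, and the `tiny` guard against a zero mean and the name `relative_quality` are hypothetical.

```python
from typing import Sequence


def relative_quality(feature: Sequence[float], mean: Sequence[float],
                     tiny: float = 1e-12) -> float:
    """Average of |f_i - mean_i| / |mean_i| over all feature dimensions."""
    if not feature or len(feature) != len(mean):
        raise ValueError("feature vector and data mean must have equal, non-zero length")
    errors = [abs(f - m) / max(abs(m), tiny) for f, m in zip(feature, mean)]
    return sum(errors) / len(errors)


# Hypothetical normalized feature vector and server-provided data mean.
score = relative_quality([0.2, 0.5, 0.3], [0.25, 0.5, 0.2])
print(round(score, 4))  # 0.2333
```

A score near 0 means the data set is close to the population average on every quality feature; larger scores mean it deviates more.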
As shown in fig. 9, an embodiment of the present invention further provides a general data quality evaluation apparatus, which is applied to a server, where the server communicates with a plurality of clients, and the apparatus includes:
a receiving client data unit 910, configured to receive the training parameters and the feature disturbance processed data of the data quality evaluation model sent by each client; wherein, each client applies the same random response mechanism to carry out characteristic disturbance treatment;
a training parameter processing unit 920, configured to apply a weighted average method to process the training parameters sent by each client, so as to obtain training update parameters, and send the training update parameters to each client;
in the embodiment of the present invention, the training parameter processing unit 920 applies a weighted average algorithm to obtain parameters of the synthetic model, which are used as parameters for updating in the current round, and then returns the updated parameters to the local model of the client for the next iteration.
An original value mean processing unit 930, configured to deduce the relationship between the sum of the disturbance values and the sum of the original values according to the disturbance mechanism parameters, infer the mean value of the original values from that relationship, and send the inferred mean value to each client as the data mean value.
In the embodiment of the invention, each client applies the same random response mechanism to carry out feature vector disturbance treatment, and deduces a relational expression of the sum of disturbance values and the sum of original values according to disturbance mechanism parameters, thereby reasoning out a data mean value of the original values; and returning the data mean value to the client for relative quality calculation.
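The inference of the original mean from the disturbed values can be sketched as follows. This is a minimal illustration for a single feature dimension, assuming the clients use a one-bit randomized response that reports 1 with probability (1 + x·(e^ε − 1)) / (e^ε + 1) for a normalized value x in [0, 1]; the source does not name the mechanism, and `estimate_mean` is a hypothetical name.

```python
import math
from typing import Sequence


def estimate_mean(reports: Sequence[int], epsilon: float) -> float:
    """Debias the one-bit reports of n clients for one feature.

    Under the assumed mechanism,
        E[sum(reports)] = (n + (e^eps - 1) * sum(x)) / (e^eps + 1),
    so solving for the sum of the original values gives
        sum(x) = (sum(reports) * (e^eps + 1) - n) / (e^eps - 1).
    """
    if not reports:
        raise ValueError("at least one report is required")
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    e = math.exp(epsilon)
    n = len(reports)
    estimated_sum = (sum(reports) * (e + 1.0) - n) / (e - 1.0)
    return estimated_sum / n


# With e^eps = 3 and half of the reports equal to 1, the estimate is about 0.5.
mean = estimate_mean([1, 0, 1, 0], math.log(3))
```

The estimate is unbiased but noisy, and for small n it can fall outside [0, 1]; a server could clip it before returning it to the clients as the data mean value.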
A schematic diagram of the physical structure of an electronic device provided in an embodiment of the present invention is described below with reference to fig. 10. As shown in fig. 10, the electronic device may include: a processor 1010, a communication interface 1020, a memory 1030, and a communication bus 1040, wherein the processor 1010, the communication interface 1020, and the memory 1030 communicate with each other via the communication bus 1040. The processor 1010 may invoke logic instructions in the memory 1030 to perform a general data quality assessment method comprising: carrying out feature vectorization processing on the original data set to form feature synthetic data; respectively carrying out data quality evaluation model training and feature disturbance processing according to the feature synthetic data, and respectively sending training parameters of the data quality evaluation model and the data subjected to the feature disturbance processing to the server; receiving a training updating parameter and a data mean value sent by the server, wherein the training updating parameter is updated to a data quality evaluation model to optimize the model; and obtaining a data quality predicted value according to the data quality evaluation model, and obtaining a data relative quality estimated value according to the data mean value processing.
Furthermore, the logic instructions in the memory 1030 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other various media capable of storing program code.
In another aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer can execute the general data quality assessment method provided by the above methods, where the method includes: carrying out feature vectorization processing on the original data set to form feature synthetic data; respectively carrying out data quality evaluation model training and feature disturbance processing according to the feature synthetic data, and respectively sending training parameters of the data quality evaluation model and the data subjected to the feature disturbance processing to the server; receiving a training updating parameter and a data mean value sent by the server, wherein the training updating parameter is updated to a data quality evaluation model to optimize the model; and obtaining a data quality predicted value according to the data quality evaluation model, and obtaining a data relative quality estimated value according to the data mean value processing.
In yet another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the general data quality assessment method provided above, the method comprising: carrying out feature vectorization processing on the original data set to form feature synthetic data; respectively carrying out data quality evaluation model training and feature disturbance processing according to the feature synthetic data, and respectively sending training parameters of the data quality evaluation model and the data subjected to the feature disturbance processing to the server; receiving a training updating parameter and a data mean value sent by the server, wherein the training updating parameter is updated to a data quality evaluation model to optimize the model; and obtaining a data quality predicted value according to the data quality evaluation model, and obtaining a data relative quality estimated value according to the data mean value processing.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A general data quality evaluation method is applied to a client side, and the client side is communicated with a server side, and the method comprises the following steps:
carrying out feature vectorization processing on the original data set to form feature synthetic data;
respectively carrying out data quality evaluation model training and feature disturbance processing according to the feature synthetic data, and respectively sending training parameters of the data quality evaluation model and the data subjected to the feature disturbance processing to the server;
receiving a training updating parameter and a data mean value sent by the server, wherein the training updating parameter is updated to a data quality evaluation model to optimize the model;
and obtaining a data quality predicted value according to the data quality evaluation model, and obtaining a data relative quality estimated value according to the data mean value processing.
2. The method for evaluating the quality of general data according to claim 1, wherein the performing a feature vectorization process on the original data set to form feature synthesis data comprises:
performing quality evaluation on data in the original data set according to a constraint rule to obtain a first feature vector which meets standard quality;
carrying out cluster analysis on the data in the original data set to infer similarity data describing the similarity between the data;
and synthesizing the first feature vector and the similarity data to form a second feature vector.
3. The method for general data quality assessment according to claim 2, wherein the quality assessment of the data in the original data set according to the constraint rule to obtain the first feature vector meeting the standard quality comprises: and evaluating the data in the original data set with the data number of m according to the data quality measurement standard of the constraint rule to obtain a first feature vector with the length of n-1, wherein the first feature vector meets the standard quality of the constraint rule.
4. The method for universal data quality assessment according to claim 3, wherein the clustering analysis of the data in the original data set to deduce data similarity comprises:
performing feature vectorization on the original data set containing m data items to generate an m x k feature matrix formed by m k-dimensional feature vectors;
and clustering the feature matrix by using an unsupervised clustering method to obtain the similarity data between the data.
5. The method according to claim 4, wherein the synthesizing the first feature vector and the similarity data to form a second feature vector comprises: and synthesizing the first feature vector and the similarity data into an n-dimensional vector as the second feature vector.
6. The method for general data quality assessment according to claim 5, wherein the performing data quality assessment model training and feature perturbation processing respectively according to the feature synthesis data, and sending training parameters of the data quality assessment model and the feature perturbation processed data to the server respectively comprises:
according to a horizontal federated logistic regression method, the second feature vector is used as training data and a contribution label is calculated from model training on the training data, the data quality assessment model is obtained, and the model training parameters are sent to the server;
and normalizing the second feature vector, disturbing the normalized feature by applying a random response mechanism, and sending disturbed data to the server.
7. The method according to claim 6, wherein the obtaining a data quality prediction value according to the data quality assessment model and obtaining a data relative quality estimation value according to the data mean processing comprises:
updating the training update parameters sent by the server into the data quality evaluation model to obtain an optimized data quality evaluation model, and inputting the new data set into the optimized data quality evaluation model to obtain a publicly disclosable data quality prediction value of the new data set;
and obtaining a publicly disclosable data relative quality estimated value of the new data set according to the average relative error between the feature vector of the new data set and the data mean value.
8. A general data quality evaluation method is applied to a server side, and the server side is communicated with a plurality of clients, and the method comprises the following steps:
receiving training parameters of a data quality evaluation model and data after characteristic disturbance processing sent by each client; wherein, each client applies the same random response mechanism to carry out characteristic disturbance treatment;
processing the training parameters sent by each client by applying a weighted average method to obtain training update parameters, and sending the training update parameters to each client respectively;
deducing the relation between the sum of the disturbance values and the sum of the original values according to the disturbance mechanism parameters, deducing the mean value of the original values, taking the deduced mean value of the original values as the data mean value and sending the data mean value to each client.
9. A universal data quality assessment device applied to a client, the client communicates with a server, and the device comprises:
the characteristic vectorization unit is used for carrying out characteristic vectorization processing on the original data set to form characteristic synthetic data;
the characteristic synthesis data processing unit is used for respectively carrying out data quality evaluation model training and characteristic disturbance processing according to the characteristic synthesis data and respectively sending training parameters of the data quality evaluation model and the data subjected to the characteristic disturbance processing to the server;
the receiving server data unit is used for receiving the training update parameters and the data mean value sent by the server, wherein the training update parameters are updated to a data quality evaluation model to optimize the model;
and the quality evaluation unit is used for obtaining a data quality predicted value according to the data quality evaluation model and obtaining a data relative quality estimated value according to the data mean value processing.
10. A universal data quality assessment device is applied to a server side, the server side is communicated with a plurality of clients, and the device comprises:
the receiving client data unit is used for receiving the training parameters of the data quality evaluation model and the data subjected to characteristic disturbance processing sent by each client; wherein, each client applies the same random response mechanism to carry out characteristic disturbance treatment;
the training parameter processing unit is used for processing the training parameters sent by the clients by applying a weighted average method to obtain training update parameters and sending the training update parameters to the clients respectively;
and the original value mean processing unit deduces the relation between the sum of the disturbance values and the sum of the original values according to the disturbance mechanism parameters, deduces the mean value of the original values, takes the deduced original value mean value as the data mean value and sends the data mean value to each client.
CN202110767866.5A 2021-07-07 2021-07-07 Universal data quality evaluation method and device Pending CN113554288A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110767866.5A CN113554288A (en) 2021-07-07 2021-07-07 Universal data quality evaluation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110767866.5A CN113554288A (en) 2021-07-07 2021-07-07 Universal data quality evaluation method and device

Publications (1)

Publication Number Publication Date
CN113554288A true CN113554288A (en) 2021-10-26

Family

ID=78102955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110767866.5A Pending CN113554288A (en) 2021-07-07 2021-07-07 Universal data quality evaluation method and device

Country Status (1)

Country Link
CN (1) CN113554288A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114760191A (en) * 2022-05-24 2022-07-15 咪咕文化科技有限公司 Data service quality early warning method, system, device and readable storage medium
CN114760191B (en) * 2022-05-24 2023-09-19 咪咕文化科技有限公司 Data service quality early warning method, system, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN111177792B (en) Method and device for determining target business model based on privacy protection
CN111966904B (en) Information recommendation method and related device based on multi-user portrait model
CN111340611B (en) Risk early warning method and device
CN110210233B (en) Combined construction method and device of prediction model, storage medium and computer equipment
CN114782161A (en) Method, device, storage medium and electronic device for identifying risky users
CN114219562A (en) Model training method, enterprise credit evaluation method and device, equipment and medium
CN117061322A (en) Internet of things flow pool management method and system
CN113554288A (en) Universal data quality evaluation method and device
CN112269937B (en) Method, system and device for calculating user similarity
CN111245815B (en) Data processing method and device, storage medium and electronic equipment
CN111865899A (en) Threat-driven cooperative acquisition method and device
CN109241249B (en) Method and device for determining burst problem
CN110275880A (en) Data analysing method, device, server and readable storage medium storing program for executing
CN111405563A (en) Risk detection method and device for protecting user privacy
JP2006505858A (en) Providing method and computer structure for providing database information in the first database, and computer-aided formation method of statistical images in the database
CN113837481B (en) Financial big data management system based on block chain
CN112506063B (en) Data analysis method, system, electronic device and storage medium
CN111914905B (en) Anti-crawler system based on semi-supervision and design method
CN111723872B (en) Pedestrian attribute identification method and device, storage medium and electronic device
CN111881008B (en) Data evaluation method, data evaluation device, model training method, model evaluation device, model training equipment and storage medium
CN110087230B (en) Data processing method, data processing device, storage medium and electronic equipment
Riff et al. A graph-based immune-inspired constraint satisfaction search
CN114722061B (en) Data processing method and device, equipment and computer readable storage medium
CN117093697B (en) Real-time adaptive dialogue method, device, equipment and storage medium
CN113946758B (en) Data identification method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination