CN113836130A - Data quality evaluation method, device, equipment and storage medium - Google Patents

Data quality evaluation method, device, equipment and storage medium

Info

Publication number
CN113836130A
Authority
CN
China
Prior art keywords
data
evaluation
quality evaluation
evaluated
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111149024.XA
Other languages
Chinese (zh)
Other versions
CN113836130B (en
Inventor
唐立志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Skyworth Smart Technology Co ltd
Original Assignee
Shenzhen Skyworth Smart Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Skyworth Smart Technology Co ltd
Priority to CN202111149024.XA
Publication of CN113836130A
Application granted
Publication of CN113836130B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Abstract

The invention discloses a data quality evaluation method, device, equipment and storage medium. The method comprises the following steps: when a data quality evaluation task is created, determining a corresponding data evaluation mode according to the current configuration information; determining a corresponding data evaluation rule according to the data type of the data to be evaluated; and performing quality evaluation on the data to be evaluated through the data evaluation mode and the data evaluation rule to obtain a quality evaluation result of the data to be evaluated. Because the data evaluation mode is determined from the current configuration information and the data evaluation rule is determined from the data type of the data to be evaluated, the invention solves the technical problem in the prior art that an evaluation mode and an evaluation rule cannot be quickly determined for a specific scenario and data type, and thereby improves the efficiency of data quality evaluation.

Description

Data quality evaluation method, device, equipment and storage medium
Technical Field
The present invention relates to the field of big data technologies, and in particular, to a data quality evaluation method, apparatus, device, and storage medium.
Background
When data comes from diverse sources, its reliability and usability directly affect the correctness of conclusions drawn from statistical analysis, so data quality is particularly important. Data analysis and data mining depend on high-quality data, which is of great significance to products. In practice, however, data often lacks integrity, normalization and consistency, which biases the final data processing result. How to improve the efficiency of quality evaluation of existing data has therefore become a technical problem to be solved urgently.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a data quality evaluation method, a data quality evaluation device, data quality evaluation equipment and a storage medium, and aims to solve the technical problem of low data quality evaluation efficiency in the prior art.
In order to achieve the above object, the present invention provides a data quality evaluation method, including the steps of:
when the data quality evaluation task is established, determining a corresponding data evaluation mode according to the current configuration information;
determining a corresponding data evaluation rule according to the data type of the data to be evaluated;
and performing quality evaluation on the data to be evaluated through the data evaluation mode and the data evaluation rule to obtain a quality evaluation result of the data to be evaluated.
Optionally, the performing quality evaluation on the data to be evaluated through the data evaluation mode and the data evaluation rule to obtain a quality evaluation result of the data to be evaluated includes:
when the data evaluation mode is the cluster mode, acquiring the total recorded quantity of the data to be evaluated and the cluster node quantity of the cluster mode;
dividing the data quality evaluation task into a plurality of quality evaluation subtasks according to the total record number and the cluster node number;
distributing the quality evaluation subtasks to corresponding cluster nodes so as to process the quality evaluation subtasks through the corresponding cluster nodes and the data evaluation rule to obtain corresponding data quality evaluation results;
and determining the quality evaluation result of the data to be evaluated according to the data quality evaluation result.
Optionally, when the data evaluation mode is the cluster mode, after acquiring the total recorded quantity of the data to be evaluated and the number of cluster nodes of the cluster mode, the method further includes:
when the total recorded number is less than or equal to the preset data number, determining a target cluster node according to the total recorded number and the parameters of each cluster node;
and sending the data quality evaluation task to the target cluster node so as to process the data quality evaluation task through the target cluster node and the data evaluation rule and obtain a quality evaluation result of the data to be evaluated.
Optionally, the dividing the data quality assessment task into a plurality of quality assessment subtasks according to the total record number and the cluster node number includes:
dividing the data to be evaluated into a plurality of data sets according to the total recorded quantity and the preset data capacity;
and dividing the data quality evaluation task into a plurality of quality evaluation subtasks according to the plurality of data sets and the number of the cluster nodes.
Optionally, the performing quality evaluation on the data to be evaluated through the data evaluation mode and the data evaluation rule to obtain a quality evaluation result of the data to be evaluated includes:
when the data evaluation mode is a Hadoop mode, acquiring the data to be evaluated;
and storing the data to be evaluated to the HDFS, and performing quality evaluation on the data to be evaluated through a MapReduce task and a data evaluation rule to obtain a quality evaluation result of the data to be evaluated.
Optionally, the storing the data to be evaluated to the HDFS, and performing quality evaluation on the data to be evaluated through a MapReduce task and a data evaluation rule to obtain a quality evaluation result of the data to be evaluated includes:
storing the data to be evaluated to an HDFS (Hadoop distributed File System), and carrying out fragment evaluation on the data to be evaluated through a Map task and the data evaluation rule to obtain a fragment quality evaluation result;
and merging the slicing quality evaluation results through Reduce to obtain the quality evaluation result of the data to be evaluated.
Optionally, the determining a corresponding data evaluation mode according to the current configuration information when the data quality evaluation task is established includes:
when the data quality evaluation task is established, current configuration information is acquired;
when the current configuration information supports a Hadoop mode, setting the data evaluation mode to be the Hadoop mode;
and when the current configuration information supports the cluster mode, setting the data evaluation mode to be the cluster mode.
In addition, to achieve the above object, the present invention also provides a data quality evaluation apparatus, including:
the data evaluation mode determining module is used for determining a corresponding data evaluation mode according to the current configuration information when the data quality evaluation task is established;
the data evaluation rule determining module is used for determining a corresponding data evaluation rule according to the data type of the data to be evaluated;
and the quality evaluation module is used for carrying out quality evaluation on the data to be evaluated through the data evaluation mode and the data evaluation rule to obtain a quality evaluation result of the data to be evaluated.
Further, to achieve the above object, the present invention also proposes a data quality evaluation apparatus, including: a memory, a processor and a data quality assessment program stored on the memory and executable on the processor, the data quality assessment program configured to implement the steps of the data quality assessment method as described above.
Furthermore, to achieve the above object, the present invention also proposes a storage medium having stored thereon a data quality evaluation program which, when executed by a processor, implements the steps of the data quality evaluation method as described above.
In the present invention, when the data quality evaluation task is established, a corresponding data evaluation mode is determined according to the current configuration information; a corresponding data evaluation rule is determined according to the data type of the data to be evaluated; and quality evaluation is performed on the data to be evaluated through the data evaluation mode and the data evaluation rule to obtain a quality evaluation result of the data to be evaluated. Because the data evaluation mode is determined from the current configuration information and the data evaluation rule is determined from the data type of the data to be evaluated, the invention solves the technical problem in the prior art that an evaluation mode and an evaluation rule cannot be quickly determined for a specific scenario and data type, and thereby improves the efficiency of data quality evaluation.
Drawings
FIG. 1 is a schematic structural diagram of a data quality evaluation device of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a data quality evaluation method according to a first embodiment of the present invention;
FIG. 3 is a schematic flowchart of a data quality evaluation method according to a second embodiment of the present invention;
FIG. 4 is a block diagram of a first embodiment of the data quality evaluation apparatus according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a data quality evaluation device of a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the data quality evaluation device may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the data quality evaluation device, which may include more or fewer components than those shown, combine some components, or arrange the components differently.
As shown in fig. 1, the memory 1005, which is a storage medium, may include an operating system, a network communication module, a user interface module, and a data quality evaluation program.
In the data quality evaluation device shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The data quality evaluation device calls, through the processor 1001, the data quality evaluation program stored in the memory 1005 and executes the data quality evaluation method provided by the embodiment of the present invention.
An embodiment of the present invention provides a data quality assessment method, and referring to fig. 2, fig. 2 is a schematic flow diagram of a first embodiment of the data quality assessment method according to the present invention.
In this embodiment, the data quality evaluation method includes the following steps:
Step S10: when the data quality evaluation task is established, determine a corresponding data evaluation mode according to the current configuration information.
It should be noted that the execution subject of the embodiment may be a computing service device with data processing, network communication and program running functions, such as a tablet computer, a personal computer, a mobile phone, etc., or an electronic device, a data quality evaluation device, etc., capable of implementing the above functions. The present embodiment and the following embodiments will be described below by taking a data quality evaluation apparatus as an example.
It can be understood that the data quality evaluation task is a task for performing quality evaluation on data to be evaluated, the data quality evaluation task may be created by a user, and when the creation is completed, prompt information is sent to the data quality evaluation device; the data quality evaluation task may also be created by the data quality evaluation device when the data quality evaluation device receives a preset amount of data, which is not limited in this embodiment.
It should be understood that different data evaluation modes have different configuration requirements on the data quality evaluation device, and the corresponding data evaluation mode is determined according to the current configuration information, so that the device resources can be fully utilized, and the efficiency of data quality evaluation is improved.
In a specific implementation, for example, when the data quality evaluation device receives a preset amount of data, a data quality evaluation task is created, and when the data quality evaluation task is created, a corresponding data evaluation mode is determined according to information such as a current operating environment, an operating system, and memory parameters.
Step S20: determine a corresponding data evaluation rule according to the data type of the data to be evaluated.
It can be understood that the data to be evaluated is data which needs to be subjected to quality evaluation, and the data to be evaluated can be production data generated in a production process, traffic data of a certain area or information data of people and the like; the data type can be numerical value type, character string type, date type and the like; the data evaluation rule may be a rule set for evaluating the quality of a specific type of data.
It should be understood that different types of data require different evaluation rules. When the data type of the data to be evaluated is numerical, the data evaluation rule may be set as: count the values in the data that are smaller than a preset value and/or the entries that are characters or character strings rather than numerical values, and calculate their proportion of the total number of data items. When the data type is character string, the rule may be set as: count the character strings whose length is smaller than a preset length and/or whose format is not the preset format, and calculate their proportion of the total number of data items. When the data type is date, the rule may be set as: count the dates that fall outside a preset date range and/or whose format does not conform to the preset date format, and calculate their proportion of the total number of data items.
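The three rule families above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the regular-expression format check, and the `strptime` format string are hypothetical choices, not taken from the patent. Each function returns the proportion of unqualified entries.

```python
import re
from datetime import date, datetime


def numeric_unqualified(values, min_value):
    """Share of entries that are non-numeric or below the preset value."""
    bad = 0
    for v in values:
        try:
            number = float(v)
        except (TypeError, ValueError):
            bad += 1  # characters or strings other than numerical values
            continue
        if number < min_value:
            bad += 1  # numerical, but smaller than the preset value
    return bad / len(values) if values else 0.0


def string_unqualified(values, min_length, pattern):
    """Share of strings that are too short or not in the preset format.

    The preset format is modelled here as a regular expression (assumption).
    """
    regex = re.compile(pattern)
    bad = sum(
        1 for v in values
        if len(v) < min_length or regex.fullmatch(v) is None
    )
    return bad / len(values) if values else 0.0


def date_unqualified(values, fmt, start, end):
    """Share of dates that fail to parse with fmt or fall outside [start, end]."""
    bad = 0
    for v in values:
        try:
            parsed = datetime.strptime(v, fmt).date()
        except ValueError:
            bad += 1  # does not conform to the preset date format
            continue
        if not (start <= parsed <= end):
            bad += 1  # outside the preset date range
    return bad / len(values) if values else 0.0


# Hypothetical lookup from data type to rule, mirroring step S20.
RULES = {
    "numeric": numeric_unqualified,
    "string": string_unqualified,
    "date": date_unqualified,
}
```

For instance, `RULES["numeric"](["5", "12", "abc", "20"], 10)` counts "5" (below the preset value) and "abc" (not a number) as unqualified.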
Step S30: perform quality evaluation on the data to be evaluated through the data evaluation mode and the data evaluation rule to obtain a quality evaluation result of the data to be evaluated.
It can be understood that after the data evaluation mode and the data evaluation rule corresponding to the data type of the data to be evaluated are determined, the quality evaluation of the data to be evaluated can be performed according to the data evaluation mode and the data evaluation rule, so as to obtain a quality evaluation result of the data.
In a specific implementation, for example, the data quality evaluation device creates a data quality evaluation task when it receives a preset amount of data. When the task is created, the device determines a matching data evaluation mode according to parameter information of the current device, such as the operating environment, operating system and memory parameters. When the data to be evaluated is numerical, the corresponding data evaluation rule is determined as: count the values that are smaller than the preset value and/or the entries that are characters or character strings rather than numerical values, and calculate their proportion of the total number of data items. The device then evaluates the data to be evaluated according to the data evaluation rule and the data evaluation mode, determines the proportion of unqualified data in the data to be evaluated, and takes that proportion as the quality evaluation result.
Further, in a specific application scenario, when the data amount of the data to be evaluated is too large, the evaluation efficiency of the data quality may be affected, and in order to improve the efficiency of the data evaluation, the step S30 includes:
Step S301: when the data evaluation mode is the cluster mode, acquire the total record number of the data to be evaluated and the number of cluster nodes of the cluster mode.
It can be understood that the cluster mode is a mode in which data is processed by a plurality of data processing nodes at the same time, the total record number is the total number of records in the data to be evaluated, and the number of cluster nodes is the number of nodes processing the data in the cluster.
Step S302: divide the data quality evaluation task into a plurality of quality evaluation subtasks according to the total record number and the number of cluster nodes.
It should be understood that a quality evaluation subtask is the quality evaluation task that an individual cluster node needs to process. The data quality evaluation task may be divided into quality evaluation subtasks in either of two ways. In the first, the records are divided into a plurality of data sets according to the number of cluster nodes, each data set contains the same number of records, and the data sets correspond to the quality evaluation subtasks. In the second, the records are divided into a plurality of data sets according to the data processing capacity of each cluster node and the number of cluster nodes; the data sets may contain the same or different numbers of records, as determined by the data processing capacity and number of the cluster nodes, and the data sets again correspond to the quality evaluation subtasks.
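As a rough sketch of the two division strategies, the functions below return half-open record ranges, one per cluster node. The function names are hypothetical, and the handling of remainders (spreading leftover records over the first nodes, or giving them to the last node) is an assumption, since the text does not say what happens when the records do not divide evenly.

```python
def split_evenly(total_records, node_count):
    """Equal-sized (start, end) record ranges, one per cluster node."""
    base, extra = divmod(total_records, node_count)
    ranges, start = [], 0
    for i in range(node_count):
        # Assumption: the first `extra` nodes take one additional record.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def split_by_capacity(total_records, capacities):
    """Record ranges sized in proportion to each node's processing capacity."""
    total_capacity = sum(capacities)
    ranges, start = [], 0
    for i, capacity in enumerate(capacities):
        if i == len(capacities) - 1:
            end = total_records  # last node absorbs any rounding remainder
        else:
            end = start + total_records * capacity // total_capacity
        ranges.append((start, end))
        start = end
    return ranges
```

With 10000 records, `split_evenly(10000, 5)` gives five ranges of 2000 records, while `split_by_capacity(10000, [1, 1, 2])` gives the third node half of the records.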
Step S303: distribute the quality evaluation subtasks to the corresponding cluster nodes, and process the quality evaluation subtasks through the corresponding cluster nodes and the data evaluation rule to obtain corresponding data quality evaluation results.
It can be understood that a data quality evaluation result is the proportion of unqualified data in the data corresponding to a subtask, obtained when a cluster node processes its quality evaluation subtask according to the data evaluation rule. The data quality evaluation results of all quality evaluation subtasks are then integrated to obtain the quality evaluation result of the data to be evaluated.
It should be understood that when the data volumes corresponding to the quality evaluation subtasks are equal, the quality evaluation result of the data to be evaluated can be obtained by averaging the unqualified data proportions of the cluster nodes; when the data volumes corresponding to the quality evaluation subtasks are unequal, calculating the proportion of the total amount of the unqualified data of each cluster node to the total data volume, and obtaining the quality evaluation result of the data to be evaluated.
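The difference between the two aggregation cases can be illustrated with a short sketch. The function names and the `(unqualified_count, record_count)` tuple layout are assumptions made for this example.

```python
def combine_equal(proportions):
    """Equal-sized subtasks: average the per-node unqualified proportions."""
    return sum(proportions) / len(proportions)


def combine_weighted(results):
    """Unequal subtasks: total unqualified records over total records.

    `results` is a list of (unqualified_count, record_count) pairs,
    one pair per cluster node.
    """
    total_bad = sum(bad for bad, _ in results)
    total = sum(count for _, count in results)
    return total_bad / total


# One node finds 10 bad records out of 100, another 90 out of 300.
per_node = [(10, 100), (90, 300)]
overall = combine_weighted(per_node)  # 100 / 400
naive = combine_equal([bad / count for bad, count in per_node])
```

Here the per-node proportions are 0.1 and 0.3, so a plain average would understate the overall proportion of 0.25; this is why the unequal case sums counts instead of averaging proportions.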
Step S304: determine the quality evaluation result of the data to be evaluated according to the data quality evaluation results.
In a specific implementation, for example, when the data evaluation mode is the cluster mode, the data quality evaluation device divides the records into a plurality of data sets according to the number of cluster nodes, each containing the same number of records and each corresponding to one quality evaluation subtask. The subtasks are sent to the cluster nodes and processed there according to the data evaluation rule, yielding the proportion of unqualified data in the data set of each subtask. The average of these proportions is taken as the quality evaluation result of the data to be evaluated.
Further, in order to improve the evaluation efficiency of the data quality, the dividing the data quality evaluation task into a plurality of quality evaluation subtasks according to the total record number and the cluster node number includes: dividing the data to be evaluated into a plurality of data sets according to the total recorded quantity and the preset data capacity; and dividing the data quality evaluation task into a plurality of quality evaluation subtasks according to the plurality of data sets and the number of the cluster nodes.
It should be understood that the preset data capacity is the amount of data contained in the data set, the preset data capacity can be set according to a specific scenario, and the data to be evaluated can be divided into a plurality of data sets according to the preset data capacity and the total recorded number.
It can be understood that, when the data to be evaluated is stored in a data table, the total record number of the data table can be obtained and the number of records per page can be set, so that the data table is divided into a plurality of pages according to the total record number and the number of records per page. The page numbers of the start page and the end page are then determined, and a middle page number is determined from them, so that the data quality evaluation task corresponding to the data table is divided into two quality evaluation subtasks. Each of the two subtasks can in turn be split at the middle page of its own page range, giving 4 quality evaluation subtasks, and repeating this division splits the data quality evaluation task into a plurality of quality evaluation subtasks.
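One way to read the page-based division is as a repeated halving of the page range. The sketch below follows that reading; the function names, the 1-based inclusive page numbering, and the `depth` parameter controlling how many times the range is halved are all assumptions for illustration.

```python
import math


def page_count(total_records, page_size):
    """Number of pages needed to hold all records of the data table."""
    return math.ceil(total_records / page_size)


def bisect_pages(start_page, end_page, depth):
    """Halve the inclusive page range [start_page, end_page] `depth` times.

    depth=1 yields two subtasks, depth=2 yields four, and so on.
    """
    if depth == 0 or start_page >= end_page:
        return [(start_page, end_page)]
    middle = (start_page + end_page) // 2
    return (
        bisect_pages(start_page, middle, depth - 1)
        + bisect_pages(middle + 1, end_page, depth - 1)
    )


pages = page_count(10000, 100)          # 100 pages of 100 records
two_subtasks = bisect_pages(1, pages, 1)
four_subtasks = bisect_pages(1, pages, 2)
```

Each returned pair is a page range that one quality evaluation subtask would read from the data table.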
In a specific implementation, for example, the preset data capacity is 1000, the total recorded number is 10000, and the number of cluster nodes is 5, then the data to be evaluated may be divided into 10 data sets, and the data quality evaluation task may be divided into 5 quality evaluation subtasks according to the number of cluster nodes, where each quality evaluation subtask includes 2 data sets.
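The arithmetic of this example can be checked with a small sketch. The function names are hypothetical, and the round-robin assignment of data sets to subtasks is an assumption: the text only states that each subtask ends up with 2 data sets, not which ones.

```python
def make_data_sets(total_records, capacity):
    """Cut the records into (start, end) ranges of at most `capacity` records."""
    return [
        (start, min(start + capacity, total_records))
        for start in range(0, total_records, capacity)
    ]


def group_into_subtasks(data_sets, node_count):
    """Assign data sets to one subtask per cluster node, round-robin."""
    subtasks = [[] for _ in range(node_count)]
    for index, data_set in enumerate(data_sets):
        subtasks[index % node_count].append(data_set)
    return subtasks


data_sets = make_data_sets(10000, 1000)      # preset data capacity 1000
subtasks = group_into_subtasks(data_sets, 5)  # 5 cluster nodes
```

With these inputs there are 10 data sets and 5 subtasks of 2 data sets each, matching the figures in the example.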
Further, in order to improve the evaluation efficiency of the data quality, the step S30 includes: when the data evaluation mode is a Hadoop mode, acquiring the data to be evaluated; and storing the data to be evaluated to the HDFS, and performing quality evaluation on the data to be evaluated through a MapReduce task and a data evaluation rule to obtain a quality evaluation result of the data to be evaluated.
In a specific implementation, when the data evaluation mode is the Hadoop mode, the data quality evaluation device stores the data to be evaluated to the HDFS component of Hadoop, starts a MapReduce task, and performs quality evaluation on the data to be evaluated through the MapReduce task and the data evaluation rule to obtain a quality evaluation result.
Further, in order to improve data quality evaluation efficiency, the storing the data to be evaluated to the HDFS, and performing quality evaluation on the data to be evaluated through a MapReduce task and a data evaluation rule to obtain a quality evaluation result of the data to be evaluated includes: storing the data to be evaluated to an HDFS (Hadoop Distributed File System), and evaluating the data to be evaluated shard by shard through a Map task and the data evaluation rule to obtain shard quality evaluation results; and merging the shard quality evaluation results through Reduce to obtain the quality evaluation result of the data to be evaluated.
In a specific implementation, the data quality evaluation device stores the data to be evaluated to the HDFS component. The Map task is executed shard by shard: it loops over each record in the data to be evaluated, traverses the data evaluation rules, and evaluates the record to obtain shard quality evaluation results. The shard quality evaluation results are merged by the Reduce task to obtain the quality evaluation result of the data to be evaluated, which is stored to the HDFS component, completing the quality evaluation of the data to be evaluated.
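The Map and Reduce roles described above can be mimicked in plain Python to show the data flow. This sketch does not use any Hadoop API and does not touch HDFS; the function names, the in-memory shards, and the representation of rules as predicates that return True for a qualified record are all assumptions for illustration.

```python
from functools import reduce


def map_shard(shard, rules):
    """Map step: check every record in one shard against every rule.

    Returns (unqualified_count, record_count) for the shard.
    """
    bad = 0
    for record in shard:
        # A record is unqualified if it violates at least one rule.
        if any(not rule(record) for rule in rules):
            bad += 1
    return (bad, len(shard))


def merge(left, right):
    """Reduce step: add up the counts of two shard results."""
    return (left[0] + right[0], left[1] + right[1])


def evaluate(shards, rules):
    """Run the map step per shard, then reduce to one unqualified proportion."""
    partials = [map_shard(shard, rules) for shard in shards]
    bad, total = reduce(merge, partials, (0, 0))
    return bad / total if total else 0.0
```

In a real Hadoop job the shards would be input splits read from HDFS and the merged result would be written back there; here the list of lists merely stands in for those splits.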
Further, in order to determine a corresponding data evaluation mode according to the current configuration information and improve the evaluation efficiency of the data quality, the determining a corresponding data evaluation mode according to the current configuration information when the data quality evaluation task is established includes: when the data quality evaluation task is established, acquiring current configuration information; when the current configuration information supports a Hadoop mode, setting the data evaluation mode to be the Hadoop mode; and when the current configuration information supports the cluster mode, setting the data evaluation mode to be the cluster mode.
In a specific implementation, when a data quality evaluation task is created, the data quality evaluation device sets the data evaluation mode to a mode supported by the current configuration information: it sets the data evaluation mode to the Hadoop mode when the current configuration information supports the Hadoop mode, and to the cluster mode when the current configuration information supports the cluster mode.
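A minimal sketch of this selection is shown below. The configuration keys `hadoop_supported` and `cluster_supported` are hypothetical, and two behaviours are assumptions because the text does not specify them: the Hadoop mode is checked first when both modes are supported, and an error is raised when neither is.

```python
def select_mode(config):
    """Pick the data evaluation mode supported by the current configuration."""
    # Assumption: Hadoop takes precedence if both modes are supported.
    if config.get("hadoop_supported"):
        return "hadoop"
    if config.get("cluster_supported"):
        return "cluster"
    # Assumption: the text does not describe a fallback, so fail loudly.
    raise ValueError("current configuration supports no data evaluation mode")


mode = select_mode({"hadoop_supported": False, "cluster_supported": True})
```

The returned string would then decide which branch of step S30 runs.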
In the embodiment, when the data quality evaluation task is established, a corresponding data evaluation mode is determined according to the current configuration information; determining a corresponding data evaluation rule according to the data type of the data to be evaluated; and performing quality evaluation on the data to be evaluated through the data evaluation mode and the data evaluation rule to obtain a quality evaluation result of the data to be evaluated. In the embodiment, the corresponding data evaluation mode is determined through the current configuration information, the data evaluation rule is determined according to the data type of the data to be evaluated, and the quality evaluation is performed on the data to be evaluated through the determined data evaluation mode and the data evaluation rule to obtain the quality evaluation result.
Referring to fig. 3, fig. 3 is a flowchart illustrating a data quality evaluation method according to a second embodiment of the present invention.
Based on the first embodiment described above, in the present embodiment, after the step S301, the method includes:
Step S3011: when the total record number is less than or equal to the preset data quantity, determine a target cluster node according to the total record number and the parameters of each cluster node.
It can be understood that the preset data quantity is a threshold used to decide whether the data quality evaluation task needs to be divided into data quality evaluation subtasks. It may be determined from the cluster node with the maximum data throughput among the plurality of cluster nodes, that is, set to the maximum data throughput of that node; it may also be determined from the optimal data throughput of the node with the maximum data throughput among the plurality of cluster nodes. The preset data quantity may be set according to the specific scenario, and this embodiment does not limit this.
It should be understood that the parameters of each cluster node include the maximum data throughput and the optimal data throughput of that node, and the target cluster node can be determined from the total record number and these parameters as follows: (1) traverse the optimal data throughput of each cluster node and compare it with the total record number; when the absolute value of the difference between the total record number and the optimal data throughput of a cluster node is smaller than a preset value, determine that cluster node as the target cluster node; (2) when no target cluster node can be determined from the optimal data throughput, traverse the maximum data throughput of each cluster node and compare it with the total record number, and determine the cluster node whose maximum data throughput is larger than the total record number and differs least from it as the target cluster node.
In a specific implementation, suppose for example that there are 5 cluster nodes whose maximum data throughputs are 10000 for node 1, 8000 for node 2, 15000 for node 3, 12000 for node 4, and 13000 for node 5, and whose optimal data throughputs are 8000 for node 1, 6400 for node 2, 12000 for node 3, 9600 for node 4, and 10400 for node 5. The preset data quantity may then be set to 15000 according to the maximum data throughput, or to 12000 according to the optimal data throughput. When the total record number is 14000, node 3 is determined as the target cluster node.
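The two-step selection and the five-node example above can be sketched as follows. This is a minimal sketch under stated assumptions: the names ClusterNode and select_target_node are hypothetical, the tolerance of 500 stands in for the unspecified "preset value", and returning the first node that satisfies step (1) is an assumption, since the description does not say how to choose among several matches.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ClusterNode:
    name: str
    max_throughput: int      # maximum data throughput of the node
    optimal_throughput: int  # optimal data throughput of the node


def select_target_node(total_records: int,
                       nodes: Sequence[ClusterNode],
                       tolerance: int) -> Optional[ClusterNode]:
    # Step (1): a node whose optimal throughput is within `tolerance`
    # of the total record number (first match wins; an assumption).
    for node in nodes:
        if abs(total_records - node.optimal_throughput) < tolerance:
            return node
    # Step (2): among nodes whose maximum throughput is larger than the
    # total record number, take the one with the smallest difference.
    candidates = [n for n in nodes if n.max_throughput > total_records]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.max_throughput - total_records)


# Node parameters taken from the example in the description.
NODES = [
    ClusterNode("node 1", 10000, 8000),
    ClusterNode("node 2", 8000, 6400),
    ClusterNode("node 3", 15000, 12000),
    ClusterNode("node 4", 12000, 9600),
    ClusterNode("node 5", 13000, 10400),
]

# The tolerance of 500 is a hypothetical stand-in for the "preset value".
target = select_target_node(14000, NODES, tolerance=500)
print(target.name)  # node 3
```

With a total record number of 14000, no optimal throughput lies within the assumed tolerance, so step (2) applies, and node 3 is the only node whose maximum throughput exceeds 14000.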
Step S3012: sending the data quality evaluation task to the target cluster node, so that the task is processed through the target cluster node and the data evaluation rule to obtain a quality evaluation result of the data to be evaluated.
In a specific implementation, after the data quality evaluation task is sent to the target cluster node, the data to be evaluated corresponding to the task is obtained, each piece of data to be evaluated is iterated over, and the data evaluation rules are run on it one by one, so as to obtain the quality evaluation result of the data to be evaluated.
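The per-record loop on the target cluster node might look like the following sketch. The rule names, the record fields "id" and "age", and the function evaluate_records are hypothetical illustrations; the description does not specify concrete rules or a result format.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Rule = Callable[[Record], bool]  # returns True when the record passes the rule


def evaluate_records(records: Iterable[Record],
                     rules: Dict[str, Rule]) -> List[List[str]]:
    """For each record, run the rules one by one and list the rules it fails."""
    results = []
    for record in records:
        failed = [name for name, rule in rules.items() if not rule(record)]
        results.append(failed)
    return results


# Hypothetical rules for illustration only.
RULES: Dict[str, Rule] = {
    "id_not_empty": lambda r: bool(r.get("id")),
    "age_in_range": lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 150,
}

report = evaluate_records(
    [{"id": "a1", "age": 30}, {"id": "", "age": 200}, {"id": "c3"}],
    RULES,
)
```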
In this embodiment, when the total record number is less than or equal to the preset data quantity, a target cluster node is determined according to the total record number and the parameters of each cluster node, and the data quality evaluation task is sent to the target cluster node, so that the task is processed through the target cluster node and the data evaluation rule to obtain a quality evaluation result of the data to be evaluated. Because the target cluster node is determined from the parameters of the cluster nodes and the total record number when the total record number of the data to be evaluated does not exceed the preset data quantity, the resources of the cluster nodes are utilized to the greatest extent, the data quality evaluation efficiency is ensured, and the waste of equipment resources is reduced.
Furthermore, an embodiment of the present invention further provides a storage medium, on which a data quality evaluation program is stored, and the data quality evaluation program, when executed by a processor, implements the steps of the data quality evaluation method as described above.
Referring to fig. 4, a block diagram of a first embodiment of the data quality evaluation apparatus of the present invention is shown.
As shown in fig. 4, the data quality evaluation apparatus according to the embodiment of the present invention includes: a data evaluation mode determination module 10, a data evaluation rule determination module 20, and a quality evaluation module 30.
The data evaluation mode determining module 10 is configured to determine a corresponding data evaluation mode according to the current configuration information when creation of the data quality evaluation task is completed;
the data evaluation rule determining module 20 is configured to determine a corresponding data evaluation rule according to a data type of data to be evaluated;
and the quality evaluation module 30 is configured to perform quality evaluation on the data to be evaluated through the data evaluation mode and the data evaluation rule, so as to obtain a quality evaluation result of the data to be evaluated.
In this embodiment, when creation of the data quality evaluation task is completed, a corresponding data evaluation mode is determined according to the current configuration information; a corresponding data evaluation rule is determined according to the data type of the data to be evaluated; and quality evaluation is performed on the data to be evaluated through the data evaluation mode and the data evaluation rule to obtain a quality evaluation result of the data to be evaluated. That is, the data evaluation mode is determined from the current configuration information, the data evaluation rule is determined from the data type of the data to be evaluated, and the quality evaluation result is obtained by evaluating the data to be evaluated with the determined mode and rule.
A second embodiment of the data quality evaluation apparatus of the present invention is proposed based on the first embodiment of the data quality evaluation apparatus of the present invention described above.
In this embodiment, the quality evaluation module 30 is further configured to, when the data evaluation mode is the cluster mode, obtain the total record number of the data to be evaluated and the number of cluster nodes in the cluster mode; divide the data quality evaluation task into a plurality of quality evaluation subtasks according to the total record number and the number of cluster nodes; distribute the quality evaluation subtasks to corresponding cluster nodes, so that the subtasks are processed through the corresponding cluster nodes and the data evaluation rule to obtain corresponding data quality evaluation results; and determine the quality evaluation result of the data to be evaluated according to the data quality evaluation results.
The quality evaluation module 30 is further configured to divide the data to be evaluated into a plurality of data sets according to the total record number and a preset data capacity, and to divide the data quality evaluation task into a plurality of quality evaluation subtasks according to the plurality of data sets and the number of cluster nodes.
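The division into data sets and subtasks can be sketched as follows. This is a sketch under assumptions: data sets are represented as half-open index ranges, the function names are hypothetical, and the round-robin assignment of data sets to cluster nodes is an assumed policy, since the description does not state how subtasks are matched to nodes.

```python
from typing import Dict, List, Tuple

IndexRange = Tuple[int, int]  # half-open range [start, end) of record indices


def split_into_data_sets(total_records: int, capacity: int) -> List[IndexRange]:
    """Split the records into data sets holding at most `capacity` records each."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return [(start, min(start + capacity, total_records))
            for start in range(0, total_records, capacity)]


def assign_subtasks(data_sets: List[IndexRange],
                    node_count: int) -> Dict[int, List[IndexRange]]:
    """Assign data sets to cluster nodes in round-robin order (assumed policy)."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    assignment: Dict[int, List[IndexRange]] = {n: [] for n in range(node_count)}
    for index, data_set in enumerate(data_sets):
        assignment[index % node_count].append(data_set)
    return assignment


data_sets = split_into_data_sets(total_records=25000, capacity=10000)
subtasks = assign_subtasks(data_sets, node_count=2)
```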
The data evaluation mode determining module 10 is further configured to obtain the data to be evaluated when the data evaluation mode is the Hadoop mode; and to store the data to be evaluated to the HDFS and perform quality evaluation on the data to be evaluated through a MapReduce task and the data evaluation rule to obtain a quality evaluation result of the data to be evaluated.
The data evaluation mode determining module 10 is further configured to store the data to be evaluated to the HDFS and perform fragment evaluation on the data to be evaluated through a Map task and the data evaluation rule to obtain fragment quality evaluation results; and to merge the fragment quality evaluation results through a Reduce task to obtain the quality evaluation result of the data to be evaluated.
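The Map and Reduce stages can be illustrated with a plain Python simulation. This sketch does not use the Hadoop API or the HDFS; it only mirrors the data flow, with each in-memory list standing in for one fragment of the stored data. The function names, the counter keys, and the example rule are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Rule = Callable[[Record], bool]  # returns True when the record passes the rule


def map_fragment(fragment: Iterable[Record], rules: Dict[str, Rule]) -> Counter:
    """Map stage: evaluate one fragment and count records and rule failures."""
    counts: Counter = Counter()
    for record in fragment:
        counts["total"] += 1
        for name, rule in rules.items():
            if not rule(record):
                counts["failed:" + name] += 1
    return counts


def reduce_results(partials: Iterable[Counter]) -> Counter:
    """Reduce stage: merge the fragment results by summing their counts."""
    merged: Counter = Counter()
    for partial in partials:
        merged.update(partial)
    return merged


# Hypothetical rule and fragments for illustration only.
rules = {"id_not_empty": lambda r: bool(r.get("id"))}
fragments: List[List[Record]] = [
    [{"id": "a"}, {"id": ""}],
    [{"id": "b"}, {"id": None}, {"id": "c"}],
]
merged = reduce_results(map_fragment(f, rules) for f in fragments)
```

Because each fragment result is a set of additive counts, the merge is independent of the order in which fragments finish.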
The data evaluation mode determining module 10 is further configured to obtain the current configuration information when creation of the data quality evaluation task is completed; to set the data evaluation mode to the Hadoop mode when the current configuration information supports the Hadoop mode; and to set the data evaluation mode to the cluster mode when the current configuration information supports the cluster mode.
The quality evaluation module 30 is further configured to determine a target cluster node according to the total record number and the parameters of each cluster node when the total record number is less than or equal to the preset data quantity; and to send the data quality evaluation task to the target cluster node, so that the task is processed through the target cluster node and the data evaluation rule to obtain a quality evaluation result of the data to be evaluated.
For other embodiments or specific implementations of the data quality evaluation apparatus of the present invention, reference may be made to the above method embodiments; they are not described here again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
From the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and can certainly also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., a ROM/RAM, a magnetic disk, or an optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for evaluating data quality, the method comprising:
when the data quality evaluation task is established, determining a corresponding data evaluation mode according to the current configuration information;
determining a corresponding data evaluation rule according to the data type of the data to be evaluated;
and performing quality evaluation on the data to be evaluated through the data evaluation mode and the data evaluation rule to obtain a quality evaluation result of the data to be evaluated.
2. The method of claim 1, wherein the performing quality evaluation on the data to be evaluated through the data evaluation mode and the data evaluation rule to obtain a quality evaluation result of the data to be evaluated comprises:
when the data evaluation mode is the cluster mode, acquiring the total record number of the data to be evaluated and the number of cluster nodes in the cluster mode;
dividing the data quality evaluation task into a plurality of quality evaluation subtasks according to the total record number and the number of cluster nodes;
distributing the quality evaluation subtasks to corresponding cluster nodes so as to process the quality evaluation subtasks through the corresponding cluster nodes and the data evaluation rule to obtain corresponding data quality evaluation results;
and determining the quality evaluation result of the data to be evaluated according to the data quality evaluation result.
3. The method of claim 2, wherein after acquiring the total record number of the data to be evaluated and the number of cluster nodes in the cluster mode when the data evaluation mode is the cluster mode, the method further comprises:
when the total record number is less than or equal to a preset data quantity, determining a target cluster node according to the total record number and the parameters of each cluster node;
and sending the data quality evaluation task to the target cluster node so as to process the data quality evaluation task through the target cluster node and the data evaluation rule and obtain a quality evaluation result of the data to be evaluated.
4. The method of claim 2, wherein the dividing the data quality evaluation task into a plurality of quality evaluation subtasks according to the total record number and the number of cluster nodes comprises:
dividing the data to be evaluated into a plurality of data sets according to the total record number and a preset data capacity;
and dividing the data quality evaluation task into a plurality of quality evaluation subtasks according to the plurality of data sets and the number of the cluster nodes.
5. The method of claim 1, wherein the performing quality evaluation on the data to be evaluated through the data evaluation mode and the data evaluation rule to obtain a quality evaluation result of the data to be evaluated comprises:
when the data evaluation mode is a Hadoop mode, acquiring the data to be evaluated;
and storing the data to be evaluated to the HDFS, and performing quality evaluation on the data to be evaluated through a MapReduce task and a data evaluation rule to obtain a quality evaluation result of the data to be evaluated.
6. The method of claim 5, wherein the storing the data to be evaluated to the HDFS and performing quality evaluation on the data to be evaluated through a MapReduce task and a data evaluation rule to obtain a quality evaluation result of the data to be evaluated comprises:
storing the data to be evaluated to the HDFS (Hadoop Distributed File System), and performing fragment evaluation on the data to be evaluated through a Map task and the data evaluation rule to obtain fragment quality evaluation results;
and merging the fragment quality evaluation results through a Reduce task to obtain the quality evaluation result of the data to be evaluated.
7. The method of any one of claims 1 to 6, wherein determining the corresponding data evaluation mode based on the current configuration information upon completion of the creation of the data quality evaluation task comprises:
when the data quality evaluation task is established, current configuration information is acquired;
when the current configuration information supports a Hadoop mode, setting the data evaluation mode to be the Hadoop mode;
and when the current configuration information supports the cluster mode, setting the data evaluation mode to be the cluster mode.
8. An apparatus for evaluating data quality, the apparatus comprising:
the data evaluation mode determining module is used for determining a corresponding data evaluation mode according to the current configuration information when the data quality evaluation task is established;
the data evaluation rule determining module is used for determining a corresponding data evaluation rule according to the data type of the data to be evaluated;
and the quality evaluation module is used for carrying out quality evaluation on the data to be evaluated through the data evaluation mode and the data evaluation rule to obtain a quality evaluation result of the data to be evaluated.
9. A data quality evaluation device, characterized in that the device comprises: a memory, a processor, and a data quality evaluation program stored on the memory and executable on the processor, the data quality evaluation program being configured to implement the steps of the data quality evaluation method of any one of claims 1 to 7.
10. A storage medium having stored thereon a data quality evaluation program which, when executed by a processor, implements the steps of the data quality evaluation method according to any one of claims 1 to 7.
CN202111149024.XA 2021-09-28 Data quality evaluation method, device, equipment and storage medium Active CN113836130B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111149024.XA CN113836130B (en) 2021-09-28 Data quality evaluation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113836130A true CN113836130A (en) 2021-12-24
CN113836130B CN113836130B (en) 2024-05-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant