CN114064618A - Data quality evaluation method and system

Data quality evaluation method and system

Info

Publication number
CN114064618A
CN114064618A (application CN202010757320.7A)
Authority
CN
China
Prior art keywords
data
index
evaluation
data quality
evaluated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010757320.7A
Other languages
Chinese (zh)
Inventor
谭志远
王谦
宫云平
许盛宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202010757320.7A
Publication of CN114064618A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a data quality evaluation method and system, and relates to the field of information technology. The method comprises the following steps: processing data to be evaluated based on a data quality evaluation model of the data to obtain its basic statistical indexes, wherein the index information in the data quality evaluation model is selected from a general data quality index model according to the evaluation requirements of the data to be evaluated; calculating the corresponding evaluation dimension indexes and goodness rate index from the basic statistical indexes according to the statistical granularity; and evaluating the quality of the data to be evaluated according to the evaluation dimension indexes and the goodness rate index. For different data to be evaluated, a customized data quality evaluation model does not need to be developed anew, so the evaluation requirements of different data can be met.

Description

Data quality evaluation method and system
Technical Field
The present disclosure relates to the field of information technologies, and in particular, to a data quality assessment method and system.
Background
In the related art, models for evaluating data quality are designed for conventional relational databases. A typical implementation embeds SQL statements in Ruby scripts and obtains the quality indexes of the relevant data by periodically querying the database through scheduled tasks.
However, the evaluation objects targeted by the related art are fixed and the approach lacks extensibility; as services develop, quality evaluation for newly added data must be customized and developed again, which wastes investment.
Disclosure of Invention
The technical problem to be solved by the present disclosure is to provide a data quality evaluation method and system, which can meet evaluation requirements of different data.
According to an aspect of the present disclosure, a data quality evaluation method is provided, comprising: processing data to be evaluated based on a data quality evaluation model of the data to obtain its basic statistical indexes, wherein the index information in the data quality evaluation model is selected from a general data quality index model according to the evaluation requirements of the data to be evaluated; calculating the corresponding evaluation dimension indexes and goodness rate index from the basic statistical indexes according to the statistical granularity; and evaluating the quality of the data to be evaluated according to the evaluation dimension indexes and the goodness rate index.
In some embodiments, it is determined whether the evaluation dimension indexes and the goodness rate index of the data to be evaluated exceed their corresponding thresholds; if an evaluation dimension index or the goodness rate index exceeds its threshold, an alarm is raised.
In some embodiments, the general data quality index model includes, for different data sources, basic statistical indexes, individual evaluation dimension indexes, and goodness rate indexes, as well as an overall goodness rate index of a data set, wherein the data set includes a plurality of data sources.
In some embodiments, each evaluation dimension index of a data source is determined from the basic statistical indexes of that data source; the goodness rate index of each data source is determined from its evaluation dimension indexes; and the overall goodness rate index of the data set is determined from the goodness rate indexes of the individual data sources.
In some embodiments, each index in the data quality universal index model has a unique code, wherein the code of each index comprises an identifier corresponding to a primary index name and a number corresponding to a secondary index name, and the secondary index name is a subclass of the primary index name.
In some embodiments, for each evaluation dimension index and goodness rate index in the general data quality index model, as well as for the overall goodness rate index of the data set, a calculation formula and the related calculation indexes are specified.
In some embodiments, each index in the data quality evaluation model has a unique code, where the code of each index includes a data item corresponding to the data to be evaluated, an identifier corresponding to a name of the primary index, a number corresponding to a name of the secondary index, and a granularity flag.
In some embodiments, the encoding of each indicator in the data quality assessment model further comprises one or more of a data type and a custom dimension name of the data to be assessed.
In some embodiments, evaluating the data quality of the data to be evaluated comprises: and evaluating the data quality of the data to be evaluated according to different granularities.
In some embodiments, the evaluation dimension indexes include one or more of an integrity index, a timeliness index, a consistency index, an accuracy index, and a logicality index.
According to another aspect of the present disclosure, there is also provided a data quality evaluation system, comprising: a data quality evaluation model management module configured to select index information from the general data quality index model according to the evaluation requirements of a data source and construct a data quality evaluation model; a data processing module configured to process data to be evaluated based on the data quality evaluation model of the data to obtain its basic statistical indexes; a data quality evaluation management module configured to calculate the corresponding evaluation dimension indexes and goodness rate index from the basic statistical indexes according to the statistical granularity; and a statistical analysis and quality evaluation module configured to evaluate the quality of the data to be evaluated according to the evaluation dimension indexes and the goodness rate index.
In some embodiments, a monitoring alarm module is configured to determine whether the evaluation dimension indexes and the goodness rate index of the data to be evaluated exceed their corresponding thresholds, and to raise an alarm if an evaluation dimension index or the goodness rate index exceeds its threshold.
In some embodiments, a general index model management module is configured to manage the basic statistical indexes, individual evaluation dimension indexes, and goodness rate indexes of the different data sources in the general data quality index model, as well as the overall goodness rate index of a data set, wherein the data set includes a plurality of data sources.
According to another aspect of the present disclosure, there is also provided a data quality evaluation system, including: a memory; and a processor coupled to the memory, the processor configured to perform the data quality assessment method as described above based on instructions stored in the memory.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium is also presented, having stored thereon computer program instructions, which when executed by a processor, implement the data quality assessment method described above.
In the embodiments of the present disclosure, a data quality evaluation model with evaluation indexes suited to the data to be evaluated is selected, the data is processed to obtain its basic statistical indexes, the corresponding evaluation dimension indexes and goodness rate index are then calculated from the basic statistical indexes according to the statistical granularity, and the quality of the data is evaluated accordingly. In these embodiments, a customized data quality evaluation model does not need to be developed anew for different evaluation data, and the evaluation requirements of different data can be met.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
fig. 1 is a schematic flow diagram of some embodiments of a data quality assessment method of the present disclosure.
Fig. 2 is a schematic diagram of an index architecture in a data quality universal index model according to some embodiments of the present disclosure.
Fig. 3 is a flow diagram of some embodiments of a data quality assessment method of the present disclosure.
Fig. 4 is a flow diagram of some embodiments of a data quality assessment method of the present disclosure.
Fig. 5 is a schematic structural diagram of some embodiments of the data quality assessment system of the present disclosure.
Fig. 6 is a schematic structural diagram of another embodiment of the data quality evaluation system of the present disclosure.
Fig. 7 is a schematic structural diagram of another embodiment of the data quality evaluation system of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
Traditional relational databases cannot meet the development requirements of technologies such as cloud computing and big data, and the related evaluation index sets lack extensibility. Moreover, the index sets in the related art are limited and do not systematically cover the elements of data quality evaluation, so data quality cannot be comprehensively evaluated from different dimensions.
Fig. 1 is a schematic flow diagram of some embodiments of a data quality assessment method of the present disclosure.
In step 110, the data to be evaluated is processed based on its data quality evaluation model to obtain its basic statistical indexes, wherein the index information in the data quality evaluation model is selected from the general data quality index model according to the evaluation requirements of the data to be evaluated.
In some embodiments, the basic statistical indexes are the foundation of the index model evaluation: all upper-layer indexes are calculated from the basic statistical indexes after they are aggregated according to the statistical granularity.
In some embodiments, the general data quality index model includes, for different data sources, basic statistical indexes, individual evaluation dimension indexes, and goodness rate indexes, as well as an overall goodness rate index of a data set, wherein the data set includes a plurality of data sources.
In some embodiments, each evaluation dimension index of a data source is determined from the basic statistical indexes of that data source; the goodness rate index of each data source is determined from its evaluation dimension indexes; and the overall goodness rate index of the data set is determined from the goodness rate indexes of the individual data sources.
In some embodiments, taking fig. 2 as an example, a number of basic statistical indexes are involved for data source A and data source B, respectively. Each data source has a plurality of evaluation dimension indexes, for example an integrity index, a timeliness index, a consistency index, an accuracy index, and a logicality index. Each data source corresponds to one or more goodness rate indexes, and the data set as a whole has an overall goodness rate index.
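As a purely illustrative sketch of this hierarchy (not part of the disclosure; all class and field names are assumptions), the structure can be modelled roughly as follows:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DataSource:
    name: str
    # basic statistical indexes, keyed by an index code such as "A_1001"
    basic_indexes: Dict[str, float] = field(default_factory=dict)
    # evaluation dimension indexes (integrity, timeliness, ...), keyed by code
    dimension_indexes: Dict[str, float] = field(default_factory=dict)
    goodness_rate: float = 0.0

@dataclass
class DataSet:
    sources: List[DataSource] = field(default_factory=list)

    def overall_goodness_rate(self) -> float:
        # overall goodness rate as the arithmetic mean of per-source goodness rates
        rates = [s.goodness_rate for s in self.sources]
        return sum(rates) / len(rates) if rates else 0.0
```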
For different data sources, some or all of the indexes in the general data quality index model are selected to construct a data quality evaluation model. For example, for DPI (Deep Packet Inspection) data, some or all of the indexes are selected from the general index model according to the DPI data evaluation requirements to establish a DPI data quality evaluation model.
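A minimal sketch of this selection step, assuming the general index model is represented simply as a dictionary of index codes and descriptions (the codes and descriptions below are illustrative, not taken from the tables of the publication):

```python
# Hypothetical general index model: index code -> human-readable definition.
GENERAL_INDEX_MODEL = {
    "A_1001": "number of files processed in time",
    "A_2001": "total number of files processed",
    "B_0001": "file acquisition integrity rate",
    "C_0001": "data processing timeliness rate",
}

def build_evaluation_model(selected_codes):
    """Instantiate a data quality evaluation model for one data source
    by picking a subset of indexes from the general index model."""
    unknown = set(selected_codes) - GENERAL_INDEX_MODEL.keys()
    if unknown:
        raise ValueError(f"codes not in the general index model: {unknown}")
    return {code: GENERAL_INDEX_MODEL[code] for code in selected_codes}

# e.g. a DPI evaluation model that only needs timeliness-related indexes
dpi_model = build_evaluation_model(["A_1001", "A_2001", "C_0001"])
```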
In step 120, the corresponding evaluation dimension indexes and goodness rate index are calculated from the basic statistical indexes of the data to be evaluated according to the statistical granularity.
In some embodiments, after the basic statistical indexes of the data to be evaluated are obtained, they are aggregated according to the statistical granularity to obtain each evaluation dimension index, and the evaluation dimension indexes are then combined to obtain the goodness rate index of the data to be evaluated.
The statistical granularity ranges from fine to coarse: for example, hour-level data is aggregated into day-level data, and prefecture-level data is aggregated into city-level and province-level data.
In step 130, the data quality of the data to be evaluated is evaluated according to the evaluation dimension indexes and the goodness rate index.
In some embodiments, multidimensional analysis or trend analysis is performed on the indexes of the data to be evaluated at different granularities, such as time and region, so as to comprehensively evaluate the data quality.
In this embodiment, a data quality evaluation model with evaluation indexes suited to the data is selected, the data to be evaluated is processed to obtain its basic statistical indexes, the corresponding evaluation dimension indexes and goodness rate index are then calculated from the basic statistical indexes according to the statistical granularity, and the quality of the data is evaluated accordingly. A customized data quality evaluation model does not need to be developed anew for different evaluation data, and the evaluation requirements of different data can be met.
Fig. 3 is a flow diagram of some embodiments of a data quality assessment method of the present disclosure. In this embodiment, the steps 310-320 are the same as the steps 110-120, respectively.
In step 330, it is determined whether the evaluation dimension indexes and the goodness rate index of the data to be evaluated exceed their corresponding thresholds.
In some embodiments, threshold information for each evaluation dimension index and for the goodness rate index is set in the data quality evaluation model.
In step 340, if an evaluation dimension index or the goodness rate index exceeds its threshold, an alarm is raised.
In this embodiment, data quality is monitored by tracking each evaluation dimension index and the goodness rate index of the data to be evaluated in real time; when an index exceeds its corresponding threshold, an alarm is raised and the related alarm information is output.
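A minimal sketch of such a threshold check (the function name, the comparison direction, and the threshold values are assumptions; in practice the direction of the comparison depends on the index):

```python
def check_and_alarm(index_values, thresholds):
    """Compare each evaluation dimension / goodness rate index against its
    configured threshold and return alarm messages for violations.

    Both arguments are dicts keyed by index code; an alarm is produced when
    a value falls below its threshold (rates treated as 'higher is better')."""
    alarms = []
    for code, value in index_values.items():
        threshold = thresholds.get(code)
        if threshold is not None and value < threshold:
            alarms.append(f"ALARM: {code} = {value:.3f} below threshold {threshold:.3f}")
    return alarms

# e.g. an hourly timeliness rate of 92% against a 95% threshold triggers an alarm
print(check_and_alarm({"DPI_B_1001_H": 0.92}, {"DPI_B_1001_H": 0.95}))
```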
Fig. 4 is a flow diagram of some embodiments of a data quality assessment method of the present disclosure.
At step 410, a general data quality index model with basic statistical indexes, individual evaluation dimension indexes, goodness rate indexes, and an overall goodness rate index is constructed.
In some embodiments, each evaluation dimension index, goodness rate index, and overall goodness rate index is calculated from the basic indexes or from the next-lower level of indexes in the index hierarchy.
For example, the basic statistical indexes serve as the numerators and denominators of the upper-layer indexes. The supported calculations include addition, subtraction, multiplication, division, and common functions such as Sum, Avg, Max, Min, and Σ. When an index is defined, its calculation method and the related calculation indexes must be specified. For example, the data acquisition integrity rate index of the integrity dimension must specify division and the indexes used as its numerator and denominator, and the data processing timeliness rate of the timeliness dimension is calculated with the number of files processed in time within the statistical period as the numerator and the total number of files processed within the statistical period as the denominator.
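The following sketch shows one possible way to represent and evaluate such a formula specification; the class and the index codes are assumptions for illustration, not the patent's data format:

```python
from dataclasses import dataclass

@dataclass
class RatioIndexDef:
    """An upper-layer index defined as numerator / denominator over
    basic statistical indexes, as in the timeliness-rate example."""
    code: str
    numerator_code: str
    denominator_code: str

    def compute(self, basic_indexes: dict) -> float:
        num = basic_indexes[self.numerator_code]
        den = basic_indexes[self.denominator_code]
        return num / den if den else 0.0

# data processing timeliness rate = files processed in time / total files processed
timeliness_rate = RatioIndexDef("C_0001", numerator_code="A_1001",
                                denominator_code="A_2001")
print(timeliness_rate.compute({"A_1001": 950, "A_2001": 1000}))  # 0.95
```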
In some embodiments, the granularity includes a time type and a region type: time granularity, for example, by minute, hour, day, or month; region granularity, for example, by district or county, city, or province.
The basic statistical indexes are output at the minimum granularity. For example, if data is generated every hour, the indexes should be output at an hourly granularity rather than by day. This facilitates subsequent monitoring, alarming, and statistical analysis: hour-granularity data can be computed from minute-level data, hour-granularity data can be aggregated into day-granularity data, day-granularity data can be aggregated into month-granularity data, and so on. Alternatively, the data may be summarized along region dimensions such as district or county, city, and province.
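A minimal sketch of this roll-up from hour-level to day-level data, under the assumption that the basic statistical indexes are additive counts:

```python
from collections import defaultdict

def aggregate_hour_to_day(hourly_records):
    """Roll additive basic statistical indexes up from hour granularity to
    day granularity.  Each record is (day, hour, index_code, value)."""
    daily = defaultdict(float)
    for day, _hour, code, value in hourly_records:
        daily[(day, code)] += value
    return dict(daily)

hourly = [
    ("2020-07-31", 0, "A_1001", 40), ("2020-07-31", 0, "A_2001", 42),
    ("2020-07-31", 1, "A_1001", 38), ("2020-07-31", 1, "A_2001", 40),
]
print(aggregate_hour_to_day(hourly))
# {('2020-07-31', 'A_1001'): 78.0, ('2020-07-31', 'A_2001'): 82.0}
```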
In some embodiments, when data quality is actually evaluated, a comprehensive evaluation needs to be performed across the different dimensions of integrity, timeliness, consistency, accuracy, logicality, and so on; that is, the evaluation progresses from the form quality of the data to its content quality and logical quality. Form quality mainly concerns the integrity of the data, i.e., whether the collected data is complete and whether anything is missing, for example the file acquisition integrity rate. Once data collection is confirmed to be complete, the timeliness of the data processing needs to be evaluated.
Content quality mainly concerns the consistency and accuracy of the data. Consistency checks whether the data conforms to the agreed specification or data dictionary, for example whether a field's type and value range match the specification. When the data is consistent with the specification, it must further be verified whether the data matches objective reality, i.e., whether it objectively reflects the business scenario in which it was generated and whether the generated values are accurate; for example, whether the recorded longitude and latitude of a base station are the longitude and latitude of the base station's actual location. A further data quality requirement is that data items with correlation relationships satisfy the established logical rules, namely the logicality indexes.
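As an illustration of the consistency check described above, the sketch below validates field values against a hypothetical data dictionary (the field names, types, and ranges are assumptions):

```python
# Hypothetical data-dictionary entries: field name -> (expected type, allowed range)
FIELD_SPEC = {
    "duration_ms": (int, (0, 3_600_000)),
    "http_status": (int, (100, 599)),
}

def field_consistency_rate(records, field):
    """Fraction of records whose value for `field` matches the agreed type
    and value range in the data dictionary."""
    expected_type, (low, high) = FIELD_SPEC[field]
    consistent = sum(
        1 for r in records
        if isinstance(r.get(field), expected_type) and low <= r[field] <= high
    )
    return consistent / len(records) if records else 0.0

records = [{"http_status": 200}, {"http_status": 999}, {"http_status": "OK"}]
print(field_consistency_rate(records, "http_status"))  # 1/3
```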
In some embodiments, the goodness rate index is mainly used to evaluate the overall quality of a particular data item. For example, the DPI data goodness rate can be evaluated by selecting relevant indexes such as integrity, timeliness, consistency, and logicality. In some embodiments, different weight combinations may be used, according to actual needs and the focus of the assessment, to evaluate the goodness rate of the data.
In some embodiments, the overall goodness rate index refers to a comprehensive evaluation of the overall data quality of a plurality of data items, and may be a weighted average or an arithmetic average of the goodness rates of the individual data items.
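A minimal sketch of the weighted combination described here; the weights and example values are assumptions for illustration only:

```python
def goodness_rate(dimension_indexes, weights):
    """Weighted combination of a data source's evaluation dimension indexes
    (integrity, timeliness, consistency, logicality, ...)."""
    total_weight = sum(weights.values())
    return sum(dimension_indexes[d] * w for d, w in weights.items()) / total_weight

def overall_goodness_rate(goodness_rates, weights=None):
    """Overall goodness rate of a data set: weighted or arithmetic average
    of the goodness rates of its data items."""
    if weights is None:                       # arithmetic mean
        return sum(goodness_rates.values()) / len(goodness_rates)
    total_weight = sum(weights.values())
    return sum(goodness_rates[k] * w for k, w in weights.items()) / total_weight

dpi_dims = {"integrity": 0.99, "timeliness": 0.95, "consistency": 0.97, "logicality": 0.92}
dpi_goodness = goodness_rate(dpi_dims, {"integrity": 0.3, "timeliness": 0.3,
                                        "consistency": 0.2, "logicality": 0.2})
print(round(dpi_goodness, 4))                       # weighted goodness rate for DPI
print(overall_goodness_rate({"DPI": dpi_goodness, "MR": 0.96}))  # overall, arithmetic mean
```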
In some embodiments, each index in the data quality universal index model has a unique code, wherein the code of each index comprises an identifier corresponding to a primary index name and a number corresponding to a secondary index name, and the secondary index name is a subclass of the primary index name.
In some embodiments, each index name consists of a primary index name and a secondary index name. The primary index name is represented, for example, by one or more letters: the letter A identifies a basic statistical index, B an integrity index, C a timeliness index, D a consistency index, E an accuracy index, F a logicality index, Y a goodness rate index, Z an overall goodness rate index, and so on. It will be understood by those skilled in the art that this is only an example; the primary index name may also be identified by its Chinese pinyin abbreviation or the like. The secondary index names are numbered sequentially, for example with 4-digit numbers: the file timeliness rate index is represented by 0001, the data deletion timeliness rate by 0002, and so on. It will be understood by those skilled in the art that the secondary index name may also be identified by, for example, a 5-digit or 3-digit number; 4 digits are used here only as an example.
In some embodiments, the primary index name and the secondary index name are separated by an underscore, so that "X_XXXX" represents an index.
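For illustration, a small parser for this "identifier_number" coding scheme might look like the following; the parser itself is an assumption, while the letter mapping simply reproduces the example given above:

```python
import re

# Primary index identifiers from the example above (illustrative, not normative)
PRIMARY_INDEX_NAMES = {
    "A": "basic statistical index",
    "B": "integrity index",
    "C": "timeliness index",
    "D": "consistency index",
    "E": "accuracy index",
    "F": "logicality index",
    "Y": "goodness rate index",
    "Z": "overall goodness rate index",
}

def parse_general_code(code: str):
    """Split a general-model index code such as 'C_0001' into its
    primary identifier and secondary number."""
    m = re.fullmatch(r"([A-Z]+)_(\d{4})", code)
    if not m:
        raise ValueError(f"unexpected index code format: {code}")
    primary, secondary = m.groups()
    return PRIMARY_INDEX_NAMES.get(primary, "unknown"), secondary

print(parse_general_code("C_0001"))  # -> ('timeliness index', '0001')
```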
The basic statistical indicators suitable for integrity checking are shown in table 1.
TABLE 1 (reproduced as an image in the original publication)
The basic statistical index suitable for timeliness check is shown in table 2, for example.
TABLE 2 (reproduced as an image in the original publication)
The timeliness index is shown in table 3, for example.
TABLE 3 (reproduced as an image in the original publication)
The yield index is shown in table 4, for example.
TABLE 4 (reproduced as an image in the original publication)
In this embodiment, drawing on object-oriented thinking, a general data quality evaluation index model (analogous to a class) is constructed and the evaluation rules are unified, so that multi-dimensional, multi-granularity comprehensive evaluation of data quality can be realized, quality assurance is provided for mining the value of enterprise data, and decision risks caused by poor data quality are avoided.
In step 420, for the data evaluation requirements of different data sources, the corresponding indexes are selected from the general data quality index model to establish a data quality evaluation model for each data source. This process instantiates the general index model for a specific data quality evaluation.
In some embodiments, a new data quality assessment model may be added for a new data source, and an existing data quality assessment model may be modified or deleted.
In some embodiments, statistical granularity information is set in the data quality evaluation model, and threshold information of each index is set to trigger monitoring and alarming.
In some embodiments, each index in the data quality evaluation model has a unique code, where the code of each index includes a data item corresponding to the data to be evaluated, an identifier corresponding to a name of the primary index, a number corresponding to a name of the secondary index, and a granularity flag.
In some embodiments, the data item is an English abbreviation of the evaluated data: for example, wireless MR (Measurement Report) data is identified by MR, wireless CDR (Call Detail Record) data by CDR, wireless PM (Performance Management) data by PM, wireless CM data by CM, mobile DPI/XDP data by YDDPI, fixed-network DPI data by CWDPI, oid data by oid, MME (Mobility Management Entity) signaling log data by MME, packet-domain ticket data by PSCDR, Tianyi HD (IPTV) data by IPTV, and so on. The identifier corresponding to the primary index name and the number corresponding to the secondary index name are consistent with the codes of the indexes in the general data quality index model. The granularity flag identifies the granularity at which the index is output; for example, M5/15/30 indicates output every 5, 15, or 30 minutes, H hourly, D daily, W weekly, M monthly, and Y yearly.
In some embodiments, if a data item contains multiple objects to be evaluated, the code of the index further includes the data type of the data to be evaluated, such as HTTP, DNS, or STREAM.
In some embodiments, the code of the index further includes a custom dimension name. For example, when several indexes with the same code need to be calculated for the same data item and the same fine-grained data type, a dimension name may be defined to keep the indexes distinguishable. For instance, when evaluating the non-null rates of key fields such as synctime, synckttime, and firstplaytimetime in the HTTP traffic of DPI data, the indexes share the same "primary index identifier + secondary index number", so they are distinguished by custom dimension names.
In some embodiments, the data item of the data to be evaluated, the data type, the identifier corresponding to the primary index name, the number corresponding to the secondary index name, the granularity flag, and the custom dimension name may be joined by underscores.
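A small helper illustrating this naming rule; the segment order follows the description above, while the concrete example values (data item, secondary number, granularity flag, custom dimension) are assumptions:

```python
def build_index_code(data_item, primary_id, secondary_no, granularity,
                     data_type=None, custom_dim=None):
    """Join the code segments of an instantiated index with underscores:
    data item, optional data type, primary identifier, secondary number,
    granularity flag, optional custom dimension name."""
    parts = [data_item]
    if data_type:
        parts.append(data_type)
    parts += [primary_id, secondary_no, granularity]
    if custom_dim:
        parts.append(custom_dim)
    return "_".join(parts)

# e.g. non-null rate of the synctime field in mobile DPI HTTP data, output hourly
print(build_index_code("YDDPI", "D", "0003", "H",
                       data_type="HTTP", custom_dim="synctime"))
# -> YDDPI_HTTP_D_0003_H_synctime
```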
In some embodiments, the process of creating the data quality evaluation model requires interaction with the data source, i.e., with a metadata module, to read the field information of the metadata for data quality auditing; for example, when checking the value of a key field, the check must be made against the field's value range or enumerated values. Therefore, the field information of the evaluated data, including field types, value ranges, and whether fields may be empty, needs to be managed and maintained.
In step 430, scheduled by the task scheduling module, the data to be evaluated is processed by reading the relevant indexes in the data quality evaluation model, and its basic statistical indexes are obtained.
In some embodiments, big data processing technology is used to process the data according to the relevant business rules read from the data quality evaluation model, and the basic statistical indexes for data quality management are output alongside normal data processing.
In step 440, the basic statistical indicators are managed and aggregated according to rules defined in the data quality assessment model, and each assessment dimension indicator is generated.
In some embodiments, aggregation is performed according to granularity of each evaluation dimension index, and each evaluation dimension index value can be calculated because a calculation formula of the index and a related calculation index are set in the data quality evaluation model.
For example, taking DPI data, the dimension indexes may be calculated by hour or by day. Among the output basic statistical indexes, the number of files processed in time is aggregated from DPI_A_1001_H to DPI_A_1001_D, and the total number of processed files from DPI_A_2001_H to DPI_A_2001_D. The hourly timeliness rate DPI_B_1001_H then equals DPI_A_1001_H / DPI_A_2001_H.
In step 450, according to the rules defined in the data quality evaluation model, each evaluation dimension index is managed and aggregated, and the goodness rate index of the data to be evaluated is calculated.
In step 460, the data quality of the data to be evaluated is evaluated according to the evaluation dimension indexes and the goodness rate index, at different granularities.
For example, multidimensional analysis or trend analysis is performed on the indexes at different granularities, such as time and region, to obtain a comprehensive evaluation result for the data to be evaluated.
In step 470, when an evaluation dimension index or the goodness rate index exceeds its corresponding threshold, alarm processing is performed.
Steps 460 and 470 may be performed simultaneously or in either order.
In this embodiment, for different data sources, the corresponding indexes are selected from the general data quality evaluation index model to construct corresponding data quality evaluation models, so that multi-dimensional, multi-granularity quality evaluation of newly added data sources, together with data quality monitoring and alarming, can be realized quickly, and the evaluation indexes of existing data sources can be dynamically adjusted as needed. The approach requires no re-customization or re-development and is extensible, making it suitable for evaluating the quality of massive data in cloud computing, big data, and similar scenarios, saving development cost for enterprises, and efficiently supporting enterprise operations. In addition, the index naming rule and the coding rule for instantiated indexes facilitate management and maintenance and make the indexes easy for maintainers to understand during integration.
Fig. 5 is a schematic structural diagram of some embodiments of the data quality assessment system of the present disclosure. The system includes a data quality assessment model management module 510, a data processing module 520, a data quality assessment management module 530, and a statistical analysis and quality assessment module 540.
The data quality evaluation model management module 510 is configured to select index information from the data quality universal index model according to the evaluation requirement of the data source, and construct a data quality evaluation model.
The data quality evaluation model management module 510 supports operations such as adding, deleting, and modifying data quality evaluation models, and configures and manages the information of the evaluated data sources, such as the file directories, formats, and update frequencies for each evaluation model.
The data quality assessment model management module 510 is further configured to set threshold information for various metrics in order to trigger monitoring and alarms.
In some embodiments, the data quality assessment model management module 510 is further configured to manage rules, statistical granularity, etc. information of the metrics instantiation output. Each index in the data quality evaluation model has a unique code, wherein the code of each index comprises a data item corresponding to data to be evaluated, an identifier corresponding to a primary index name, a number corresponding to a secondary index name and a granularity mark.
In some embodiments, the code for each indicator in the data quality assessment model further includes a data type or custom dimension name of the data to be assessed.
The data processing module 520 is configured to process the data to be evaluated based on the data quality evaluation model of the data to be evaluated, so as to obtain a basic statistical index of the data to be evaluated.
In some embodiments, the data processing module 520 may be embedded in the ETL or data parsing module, and output basic statistical indicators for data quality management work while normal data processing is performed, for use by the data quality evaluation management module 530.
The data quality evaluation management module 530 is configured to calculate the corresponding evaluation dimension indexes and goodness rate index from the basic statistical indexes of the data to be evaluated according to the statistical granularity.
In some embodiments, after the basic statistical indexes of the data to be evaluated are obtained, they are aggregated according to the statistical granularity to obtain each evaluation dimension index, and the evaluation dimension indexes are then combined to obtain the goodness rate index of the data to be evaluated.
The statistical analysis and quality evaluation module 540 is configured to evaluate the data quality of the data to be evaluated according to the evaluation dimension indexes and the goodness rate index.
In some embodiments, this module performs multidimensional analysis, trend analysis, and the like on the indexes at different granularities, such as time and region, to comprehensively evaluate the data quality.
In the embodiment, aiming at different evaluation data, a customized data quality evaluation model does not need to be developed again, and evaluation requirements of different data can be met.
Fig. 6 is a schematic structural diagram of another embodiment of the data quality evaluation system of the present disclosure. In addition to the data quality evaluation model management module 510, the data processing module 520, the data quality evaluation management module 530, and the statistical analysis and quality evaluation module 540, the system comprises a monitoring alarm module 550 configured to determine whether the evaluation dimension indexes and the goodness rate index of the data to be evaluated exceed their corresponding thresholds, and to raise an alarm if an evaluation dimension index or the goodness rate index exceeds its threshold.
In some embodiments, the system further comprises a general index model management module 560 configured to manage the basic statistical indexes, individual evaluation dimension indexes, and goodness rate indexes of the different data sources in the general data quality index model, as well as the overall goodness rate index of a data set, wherein the data set comprises a plurality of data sources.
The general index model management module 560 defines each evaluation dimension index of each data source, which is determined from the basic statistical indexes of that data source; the goodness rate index of each data source is determined from its evaluation dimension indexes; and the overall goodness rate index of the data set is determined from the goodness rate indexes of the individual data sources.
The general index model management module 560 is further configured to manage and define the codes of the indexes. Each index has a unique code comprising an identifier corresponding to the primary index name and a number corresponding to the secondary index name, where the secondary index name is a subclass of the primary index name.
The general index model management module 560 is further configured to specify, for each evaluation dimension index and goodness rate index in the general data quality index model, as well as for the overall goodness rate index of the data set, a calculation formula and the related calculation indexes.
In this embodiment, drawing on object-oriented thinking, a general data quality evaluation index model is constructed and the evaluation rules are unified, so that multi-dimensional, multi-granularity comprehensive evaluation of data quality can be realized, quality assurance is provided for mining the value of enterprise data, and decision risks caused by poor data quality are avoided.
In other embodiments of the present disclosure, the system may further include a metadata management module 570 configured to manage and maintain relevant information of the evaluated data, such as field information, field type, value range, whether null or not, and the like. Interaction with a metadata module is needed in the process of creating the data evaluation model, so that field information of data is read to audit data quality and the like.
In other embodiments of the present disclosure, the system may further include a task scheduling module 580, wherein the data processing module 520 is capable of receiving the scheduling of the task scheduling module 580, and processing the data by reading the relevant business rules in the data quality assessment model using big data processing technology. The task scheduling module 580 is configured to schedule the data processing module 520 to perform data processing and quality auditing, and output related logs and completion information, etc. by reading related information such as scheduling frequency in the data quality assessment model.
In this embodiment, no re-customization or re-development is required for newly added evaluation data, and the approach is extensible; it is applicable to quality evaluation of massive data in cloud computing, big data, and similar scenarios, saves development cost for enterprises, and efficiently supports enterprise operations.
Fig. 7 is a schematic structural diagram of another embodiment of the data quality evaluation system of the present disclosure. The system includes a memory 710 and a processor 720. Wherein: the memory 710 may be a magnetic disk, flash memory, or any other non-volatile storage medium. The memory is used for storing instructions in the embodiments corresponding to fig. 1, 3 and 4. Processor 720, coupled to memory 710, may be implemented as one or more integrated circuits, such as a microprocessor or microcontroller. The processor 720 is configured to execute instructions stored in the memory.
In some embodiments, processor 720 is coupled to memory 710 through a BUS BUS 730. The system 700 may also be coupled to an external storage system 750 via a storage interface 740 for accessing external data, and to a network or another computer system (not shown) via a network interface 760. And will not be described in detail herein.
In this embodiment, by storing instructions in the memory and having the processor execute them, the evaluation requirements of different data can be met.
In further embodiments, a computer-readable storage medium has stored thereon computer program instructions which, when executed by a processor, implement the steps of the method in the embodiments corresponding to fig. 1, 3, 4. As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, apparatus, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Thus far, the present disclosure has been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.
Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are for purposes of illustration only and are not intended to limit the scope of the present disclosure. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (15)

1. A data quality assessment method, comprising:
processing the data to be evaluated based on a data quality evaluation model of the data to be evaluated to obtain basic statistical indexes of the data to be evaluated, wherein index information in the data quality evaluation model of the data to be evaluated is selected from a data quality general index model according to evaluation requirements of the data to be evaluated;
calculating corresponding evaluation dimension indexes and a goodness rate index from the basic statistical indexes of the data to be evaluated according to the statistical granularity; and
evaluating the data quality of the data to be evaluated according to the evaluation dimension indexes and the goodness rate index.
2. The data quality evaluation method of claim 1 further comprising:
judging whether the evaluation dimension indexes and the goodness rate index of the data to be evaluated exceed corresponding thresholds; and
raising an alarm if an evaluation dimension index or the goodness rate index exceeds the corresponding threshold.
3. The data quality evaluation method according to claim 1,
the data quality general index model comprises basic statistical indexes, evaluation dimension indexes and goodness rate indexes for different data sources, and an overall goodness rate index of a data set, wherein the data set comprises a plurality of data sources.
4. The data quality evaluation method according to claim 3,
each evaluation dimension index of each data source is determined according to the basic statistical indexes of the data source;
the goodness rate index of each data source is determined according to the evaluation dimension indexes of the data source; and
the overall goodness rate index of the data set is determined according to the goodness rate indexes of the respective data sources.
5. The data quality evaluation method according to claim 3,
each index in the data quality general index model has a unique code, wherein the code of each index comprises an identifier corresponding to a first-level index name and a number corresponding to a second-level index name, and the second-level index name is a subclass of the first-level index name.
6. The data quality evaluation method according to claim 3,
for each evaluation dimension index and goodness rate index in the data quality general index model, and for the overall goodness rate index of the data set, a calculation formula and related calculation indexes are specified.
7. The data quality evaluation method according to claim 3,
each index in the data quality evaluation model has a unique code, wherein the code of each index comprises a data item corresponding to data to be evaluated, an identifier corresponding to a primary index name, a number corresponding to a secondary index name and a granularity mark.
8. The data quality evaluation method according to claim 7,
the code of each index in the data quality evaluation model further comprises one or more items of data types and custom dimension names of the data to be evaluated.
9. The data quality assessment method of any one of claims 1 to 8, wherein assessing the data quality of the data to be assessed comprises:
and evaluating the data quality of the data to be evaluated according to different granularities.
10. The data quality evaluation method according to any one of claims 1 to 8,
the evaluation dimension index comprises one or more of an integrity index, a timeliness index, a consistency index, an accuracy index and a logicality index.
11. A data quality assessment system, comprising:
the data quality evaluation model management module is configured to select index information from the data quality general index model according to the evaluation requirement of the data source and construct a data quality evaluation model;
the data processing module is configured to process the data to be evaluated based on a data quality evaluation model of the data to be evaluated to obtain a basic statistical index of the data to be evaluated;
the data quality evaluation management module is configured to calculate corresponding evaluation dimension indexes and a goodness rate index from the basic statistical indexes of the data to be evaluated according to the statistical granularity; and
the statistical analysis and quality evaluation module is configured to evaluate the data quality of the data to be evaluated according to the evaluation dimension indexes and the goodness rate index.
12. The data quality evaluation system of claim 11 further comprising:
the monitoring alarm module is configured to judge whether the evaluation dimension indexes and the goodness rate index of the data to be evaluated exceed corresponding thresholds, and to raise an alarm if an evaluation dimension index or the goodness rate index exceeds the corresponding threshold.
13. The data quality evaluation system according to claim 11 or 12, further comprising:
a general index model management module configured to manage basic statistical indexes, evaluation dimension indexes and goodness rate indexes of different data sources in the data quality general index model, and an overall goodness rate index of a data set, wherein the data set includes a plurality of data sources.
14. A data quality assessment system, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the data quality assessment method of any of claims 1 to 10 based on instructions stored in the memory.
15. A non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the data quality assessment method of any one of claims 1 to 10.
CN202010757320.7A 2020-07-31 2020-07-31 Data quality evaluation method and system Pending CN114064618A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010757320.7A CN114064618A (en) 2020-07-31 2020-07-31 Data quality evaluation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010757320.7A CN114064618A (en) 2020-07-31 2020-07-31 Data quality evaluation method and system

Publications (1)

Publication Number Publication Date
CN114064618A true CN114064618A (en) 2022-02-18

Family

ID=80227445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010757320.7A Pending CN114064618A (en) 2020-07-31 2020-07-31 Data quality evaluation method and system

Country Status (1)

Country Link
CN (1) CN114064618A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115840743A (en) * 2023-02-20 2023-03-24 北京中兵数字科技集团有限公司 Method, apparatus, device and medium for data quality evaluation
CN116450632A (en) * 2023-04-18 2023-07-18 北京卫星信息工程研究所 Geographic sample data quality evaluation method, device and storage medium
CN116450632B (en) * 2023-04-18 2023-12-19 北京卫星信息工程研究所 Geographic sample data quality evaluation method, device and storage medium
CN116777288A (en) * 2023-06-28 2023-09-19 广东裕太科技有限公司 Government system information integration system and application method thereof
CN116777288B (en) * 2023-06-28 2024-03-12 广东裕太科技有限公司 Government system information integration system and application method thereof


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination