CN111949644A - Data quality evaluation method and equipment for power industry service system - Google Patents

Data quality evaluation method and equipment for power industry service system Download PDF

Info

Publication number
CN111949644A
CN111949644A CN202010826540.0A CN202010826540A CN111949644A CN 111949644 A CN111949644 A CN 111949644A CN 202010826540 A CN202010826540 A CN 202010826540A CN 111949644 A CN111949644 A CN 111949644A
Authority
CN
China
Prior art keywords
field
database
data table
fields
full
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010826540.0A
Other languages
Chinese (zh)
Other versions
CN111949644B (en
Inventor
许博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Jinghua Borui Technology Co ltd
Original Assignee
Shenyang Jinghua Borui Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Jinghua Borui Technology Co ltd filed Critical Shenyang Jinghua Borui Technology Co ltd
Priority to CN202010826540.0A priority Critical patent/CN111949644B/en
Publication of CN111949644A publication Critical patent/CN111949644A/en
Application granted granted Critical
Publication of CN111949644B publication Critical patent/CN111949644B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393Score-carding, benchmarking or key performance indicator [KPI] analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06395Quality analysis or management

Abstract

The embodiment of the invention provides a data quality evaluation method and equipment for a power industry service system, wherein the method comprises the steps of evaluating the integrity of a database through a database integrity evaluation index value to obtain an integrity evaluation result; evaluating the consistency of the database by a main foreign key relation screening method to obtain a consistency evaluation result; and comprehensively evaluating the database according to the evaluation result. In this way, the integrity and consistency of the database can be evaluated, the relation among the fields is pre-judged by comparing different field information, the fields which are obviously impossible to have the main foreign key relation are filtered, the workload of comparing the contents among the fields is greatly reduced, the missing main foreign key relation is quickly and efficiently detected, and the comprehensive evaluation results of the integrity and consistency are integrated to perform more comprehensive evaluation on the data quality of the whole database.

Description

Data quality evaluation method and equipment for power industry service system
Technical Field
Embodiments of the present invention generally relate to the field of computer data processing, and more particularly, to a data quality evaluation method and apparatus for a power industry business system.
Background
With the development of information technology, the size of data is increasing at a high rate. While the amount of data is increasing, quality problems are prevalent in the data. The quality of data can be characterized by six characteristics, namely data consistency, data integrity, data accuracy, data timeliness, data entity identity, and data uncertainty. Data integrity and data consistency are the most important key indexes for describing data quality.
In order to improve the system response speed and the development efficiency, more and more developers choose to ignore the main foreign key constraint in the relational database. Therefore, the same data in different tables cannot be timely and accurately synchronized, the data quality is remarkably reduced, and enterprise-level big data analysis and application cannot be supported. Enterprises have to spend a great deal of time and effort in carrying out data governance work. Due to the lack of the main foreign key relationship, data management workers cannot acquire the association relationship between the tables, and can acquire the association relationship only through communication with developers and business combing. Such methods are extremely inefficient, are difficult to maintain as the data grows, and are difficult to quantitatively evaluate the success of data governance work.
For the research of the data integrity evaluation method, the key point is to solve the problem of relation data integrity measurement. Most of existing data integrity evaluation methods are implemented by providing an abstract data integrity measurement framework, then realizing the influence of function dependence on data integrity on the basis of the framework, and providing a concrete relational data integrity measurement index and a system scheme. Under the current environment, the emphasis of research on data integrity focuses on definition, assurance technology, assurance model and application thereof, while the research on data integrity evaluation methods mostly focuses on verification of data, which is very slow in working efficiency for power industry business systems with huge data volume.
The following problems exist for database integrity assessment:
1) to evaluate database integrity without quantification;
2) with the rapid increase of the data volume, the integrity evaluation according to the database has low efficiency and high cost;
3) no specific database integrity evaluation method is given.
Disclosure of Invention
According to the embodiment of the invention, a data quality evaluation scheme oriented to a power industry business system is provided.
In a first aspect of the invention, a data quality evaluation method for a power industry service system is provided. The method comprises the following steps:
evaluating the integrity of the database through the database integrity evaluation index value to obtain an integrity evaluation result;
evaluating the consistency of the database by a main foreign key relation screening method to obtain a consistency evaluation result;
and comprehensively evaluating the database according to the integrity evaluation result and the consistency evaluation result.
Further, the evaluating the integrity of the database through the database integrity evaluation index value includes:
reading a first field from a first function table, wherein the first field is the total number of records contained in a data table; reading a second field and a third field from the second function table, and establishing an affiliation relationship between the data table and the fields according to the second field; the second field represents the name of the data table to which the field visible to the user belongs; the third field is a null value number contained in each field in the data table;
reading a data table from a database, and accumulating the values of a third field of the data table to obtain the number of none values of the data table; calculating the number of none values and the number of full field records for removing all empty fields;
traversing all data tables in the database, accumulating the obtained data index values, calculating the proportion of the full empty fields of the database, the recording proportion of the full fields of the database, the proportion of the none values of the database and the proportion of the none values of the removed full empty fields of the database, and evaluating the integrity of the database.
Further, the calculating the number of none values of the removed all-empty fields includes:
Q=P-M*N
q is the number of none values of all empty fields removed from the data table; p is the number of none values in the data table; m is the number of all-empty fields in the data table; n is the total number of records in the data table;
the full field record number is:
K=N-F
k is the number of full field records in the data table; n is the total number of records in the data table; f is the number of records of the non-full field in the data table; and the number of the records of the non-full field is the total number of the records in the field containing the null value in the data table.
Further, the accumulating the obtained data values includes:
accumulating the number of fields of each data table to obtain the total number of fields of the database;
accumulating the product of the number of fields of each data table and the number of records to obtain the total number of the numerical values of the database;
accumulating the total record number of each data table to obtain the total record number of the database;
accumulating the number of none values of each data table to obtain the number of the none values of the database;
accumulating the number of none values of all removed empty fields of each data table to obtain the number of all removed empty field none values of the database;
accumulating the full field record number of each data table to obtain the full field record number of the database;
accumulating the numerical values of all the data tables after all the empty fields are removed to obtain the total number of the numerical values of the database after all the empty fields are removed; the numerical value after removing the full-empty field is as follows:
H=(R-M)*N
h is a numerical value of the data table after all empty fields are removed; r is the number of fields in the data table; m is the number of all-empty fields in the data table; and N is the total number of records in the data table.
Further, the calculating the proportion of the fully empty fields of the database, the proportion of the full field records of the database, the proportion of the none values of the database and the proportion of the none values of the removed fully empty fields of the database comprises:
the proportion of the database full-empty fields is the ratio of the number of the database full-empty fields to the total number of the database fields;
the database full field record proportion is the ratio of the database full field record number to the total database record number;
the proportion of the database none values is the ratio of the number of the database none values to the total number of the database values;
the proportion of the none values of the database without the full-empty fields is the ratio of the number of the none values of the database without the full-empty fields to the total number of the numerical values of the database without the full-empty fields.
Further, the evaluating the data integrity of the database includes:
the data integrity of the database is in negative correlation with the proportion of the full empty fields of the database, the proportion of the none values of the database and the proportion of the none values of the removed full empty fields of the database;
the data integrity of the database is positively correlated with the full field record proportion of the database.
Further, the evaluation of the consistency of the database by the primary foreign key relationship screening method includes:
reading data table information and field information in the data table from a database, and judging the attribute of each field;
selecting a target data table and selecting a comparison data table from other data tables;
and evaluating the main foreign key relationship of the target data table and the comparison data table according to a main foreign key relationship screening method.
Further, the determining the attributes of each field includes:
judging whether the selected field is uuid;
judging the type of the selected field content;
judging whether the selected field is a main key of the data table to which the field belongs;
the judging whether the selected field is uuid includes:
if the selected field only contains characters '0123456789 ABCDEFAbcdef' and a separator '-' and the character length is between 32-42, judging that the selected field is uuid;
the judging the content type of the selected field comprises the following steps:
judging whether the types of each record in the selected field are consistent, if so, judging that the content type of the selected field is the record type; otherwise, judging the content type of the selected field as a non-type;
the judging whether the selected field is the primary key of the data table to which the field belongs includes:
judging whether the selected field meets one of the following conditions, if so, the selected field is the primary key of the affiliated data table, otherwise, the selected field is not the primary key of the affiliated data table;
condition 1: the selected field has been set as the data table primary key;
condition 2: the selected field is uuid, and the number of the field is 1;
condition 3: the number of different values contained in the selected field is not less than the number of records contained in the data table, and the field number of the selected field is 1.
Further, the performing of the evaluation of the primary foreign key relationship on the target data table and the comparison data table according to the primary foreign key relationship screening method includes:
step 1: judging whether the record number of the target data table is 0 or not, if so, determining that no main foreign key relation exists between the target data table and the comparison data table; otherwise, judging whether the record number of the comparison data table is 0, if so, determining that no main foreign key relation exists between the target data table and the comparison data table; otherwise, continuously judging whether the primary key field exists in the target data table, if so, executing the step 2; otherwise, no main foreign key relation exists between the target data table and the comparison data table;
step 2: selecting a main key field in the target data table and a comparison field in the comparison data table, judging the main key field and the comparison field according to field main foreign key relationship screening conditions, and if the main key field and the comparison field accord with one of the field main foreign key relationship screening conditions, determining that the main key field and the comparison field have no main foreign key relationship; otherwise, executing step 3;
the field main foreign key relationship screening conditions comprise:
condition 1: the comparison field is a main key of the comparison data table;
condition 2: the content types of the primary key field and the comparison field are different;
condition 3: one and only one of the primary key field and the contrast field is uuid;
condition 4: the data types of the primary key field and the comparison field are different;
condition 5: the average column length difference of the main key field and the comparison field exceeds a preset threshold value;
and step 3: judging whether records in a comparison field exist in the primary key field, if so, taking the primary key field and the comparison field as a primary foreign key relation, and obtaining a primary foreign key relation between the target data table and the comparison data table; otherwise, the primary key field and the comparison field have no primary foreign key relation.
In a second aspect of the invention, an electronic device is provided. The electronic device includes: a memory having a computer program stored thereon and a processor implementing the method as described above when executing the program.
It should be understood that the statements herein reciting aspects are not intended to limit the critical or essential features of any embodiment of the invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
According to the invention, through analyzing the proportion of the number of records with null attributes in each data table to the total number of records and the proportion of the number of attributes with null values in the table to the total number of attributes, the method has higher discovery rate and accuracy when searching the attributes such as spaces, null columns, null rows and the like of the data table, and evaluates the integrity of the database; through the comparison of different field information, the relation between the fields is pre-judged, the fields which obviously cannot have the main foreign key relation are filtered, the workload of comparing the contents between the fields is greatly reduced, the missing main foreign key relation is quickly and efficiently detected, the detection efficiency and the accuracy are improved, and the wide applicability is realized; and the overall and consistency evaluation results are integrated, so that the data quality of the whole database is more comprehensively evaluated.
Drawings
The above and other features, advantages and aspects of various embodiments of the present invention will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:
FIG. 1 shows a flow diagram of a data quality evaluator oriented towards an electric power industry business system according to an embodiment of the invention;
FIG. 2 shows a flow diagram of a database integrity evaluation method according to an embodiment of the invention;
FIG. 3 shows a flow diagram of a database consistency evaluation method according to an embodiment of the invention;
FIG. 4 shows a flowchart of a primary foreign key relationship evaluation process, according to an embodiment of the invention;
FIG. 5 illustrates a block diagram of an exemplary electronic device capable of implementing embodiments of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
In addition, the term "and/or" herein is only one kind of association relationship describing an associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
In the invention, the proportion of the number of records with empty attribute in each data table to the total number of records and the proportion of the number of attributes with empty value in the table to the total number of attributes are analyzed, so that the method has higher discovery rate and accuracy when searching the attributes such as space, empty column, empty row and the like of the data table, and evaluates the integrity of the database; through the comparison of different field information, the relation between the fields is pre-judged, the fields which obviously cannot have the main foreign key relation are filtered, the workload of comparing the contents between the fields is greatly reduced, the missing main foreign key relation is quickly and efficiently detected, the detection efficiency and the accuracy are improved, and the wide applicability is realized; and the overall and consistency evaluation results are integrated, so that the data quality of the whole database is more comprehensively evaluated.
Fig. 1 shows a flowchart of a data quality evaluator oriented to a power industry business system according to an embodiment of the present invention.
The method comprises the following steps:
and S110, evaluating the integrity of the database through the database integrity evaluation index value to obtain an integrity evaluation result.
Further, fig. 2 shows a database integrity evaluation process according to an embodiment of the present invention, which includes:
s111, reading a first field from a first function table, wherein the first field is the total record number contained in a data table; reading a second field and a third field from the second function table, and establishing an affiliation relationship between the data table and the fields according to the second field; the second field represents the name of the data table to which the field visible to the user belongs; the third field is the number of null values contained in each field in the data table.
As an embodiment of the present invention, a dbms _ stats function module is called in an Oracle database, and after the dbms _ stats function module is called, a first field is read from a first function table, such as a user _ tables table, which represents a current user visible table; the first field is, for example, a num _ row field, which indicates the total number of records contained in the data table. The second function table, for example, a user _ tab _ columns table, indicates the fields visible to the current user, that is, the fields visible to the current user; a second field, for example, a table _ name field, indicating the name of the data table to which the field visible to the user belongs; a third field, such as a num _ null field, indicates the number of null values contained in each field of the data table. And establishing the relationship between the data table and the data field according to the table _ name field.
In the embodiment, the dbms _ stats functional module carried by Oracle is utilized to realize the basic data collection of the data table, the difficulty and the workload of the development of the integrity method are reduced, and the accuracy is improved.
And S112, reading a data table from the database, and accumulating the values of the third field of the data table to obtain the number of the none values of the data table.
The data table read from the database is represented as an A table, and the values of num _ null fields in the A table are accumulated to obtain the number of none values of the A table; a num _ null field is a null value number contained in each field of the a table, for example, if num _ null is equal to num _ row, the field is indicated to be a completely null field, where num _ row is a total record number contained in the a table; if num _ null is 0, it indicates that there is no null value in this field, which is a full data field. The number of none values is the cumulative sum of the num _ null values of the fields in the a table.
And S113, calculating the number of none values of all empty fields.
The number of none values of the removed all-empty fields is the number of none values of an A table-the number of all-empty fields of the A table-the total number of records of the A table, namely the number Q of none values of the removed all-empty fields is as follows:
Q=P-M*N
q is the number of none values of all empty fields removed from the data table; p is the number of none values in the data table; m is the number of all-empty fields in the data table; and N is the total number of records in the data table.
And S114, calculating the full field record number.
The full field record number is calculated, and the following logic is realized through an sql statement: set of precedent instructions { a1,a2,...,anAnd (4) respectively judging whether each set element is a none or not for each field of the A table, if so, counting the number of records by +1, and traversing the whole set to obtain the number of records of the non-full field. The calculation method is as follows:
K=N-F
k is the number of full field records in the data table; n is the total number of records in the data table; f is the number of records of the non-full field in the data table; and the number of the records of the non-full field is the total number of the records in the field containing the null value in the data table.
And in the S111-S114, the dbms _ stats module of the Oracle database is used for counting the basic information of a single data table, and data index values such as the number of fields, the number of records, the number of all-empty fields, the number of none values without all-empty fields, the number of full-field records and the like of the data table are obtained through counting.
And S115, traversing all data tables in the database, and accumulating the obtained data index values.
As an embodiment of the present invention, the total data index value of the entire database is obtained by traversing all the data tables in the database and accumulating the data index values of each data table.
The method specifically comprises the following steps:
accumulating the number of the fields of each table and multiplying the number of records to obtain the total number of the numerical values of the database;
accumulating the number of the fields of each table to obtain the total number of the fields of the database;
accumulating the total record number of each table to obtain the total record number of the database;
accumulating the number of all null fields of each table to obtain the total number of all null fields of the database;
accumulating the number of the none values of each table to obtain the number of the none values of the database;
accumulating the number of none values of all removed empty fields of each table to obtain the number of the none values of all removed empty fields of the database;
accumulating the full field record number of each table to obtain the full field record number of the database;
accumulating the number of records in each table (field number-full empty field number) to obtain the total number of the values of the database after the full empty fields are removed, namely the values after the full empty fields are removed are as follows:
H=(R-M)*N
h is a numerical value of the data table after all empty fields are removed; r is the number of fields in the data table; m is the number of all-empty fields in the data table; and N is the total number of records in the data table.
And S116, calculating the proportion of the full empty fields of the database, the proportion of the full fields of the database, the proportion of the none values of the database and the proportion of the none values of the removed full empty fields of the database, and evaluating the data integrity of the database.
The proportion of the all-empty fields of the database is the number of all-empty fields/the total number of the fields of the database;
the full field record proportion of the database is the total number of full field records/the total number of database records;
the proportion of the database none values is the number of the none values/the total number of the database values;
the proportion of the none value of the database without the full-empty field is the number of the none values after the full-empty field is removed/the total number of the numerical values of the database without the full-empty field.
Further, 117, the integrity of the database can be evaluated through the four evaluation indexes, including:
the data integrity of the database is in negative correlation with the proportion of the all-empty field of the database, namely the higher the proportion of the all-empty field of the database is, the worse the integrity of the database is; conversely, the lower the proportion of the full fields of the database is, the better the integrity of the database is.
The data integrity of the database is in negative correlation with the proportion of the database none value, namely the higher the proportion of the database none value is, the worse the integrity of the database is; conversely, the lower the proportion of the database none values, the better the integrity of the database.
The data integrity of the database is in negative correlation with the proportion of the none value of the database with the removed full-empty field, namely the higher the proportion of the none value of the database with the removed full-empty field is, the worse the integrity of the database is; conversely, the lower the proportion of the none value of the database excluding the all-empty field is, the better the integrity of the database is.
The data integrity of the database is positively correlated with the full field record proportion of the database; namely, the higher the full field record proportion of the database is, the better the integrity of the database is; conversely, the lower the full field record proportion of the database, the worse the integrity of the database.
The integrity of the database can be evaluated from the perspective of quantitative evaluation of data quality through the index evaluation.
As an embodiment of the present invention, a weight may be assigned to the index value for evaluating the integrity of the database, for example, a corresponding weight may be assigned to the proportion of the empty field of the database, the proportion of the none value of the removed empty field of the database, and the proportion of the full field record of the database, and the proportional contribution of the weight to the integrity of the database may be evaluated in proportion, so as to finally obtain the evaluation result of the integrity of the database.
In the integrity evaluation process, the database system table is used as the basis of evaluation, and the integrity of the database is evaluated; in the evaluation process, the reading frequency and the workload of the database service table are reduced, and the evaluation efficiency is improved; when the attributes such as the blank space, the empty column and the empty row of the data table are searched, the method has higher discovery rate and accuracy.
And S120, evaluating the consistency of the database by a main foreign key relation screening method to obtain a consistency evaluation result.
FIG. 3 shows, as an embodiment of the present invention, a database consistency evaluation process of the present invention, including:
and S121, reading the data table information and the field information in the data table from the database, and judging the attribute of each field.
The method starts by reading the information of each data table and the information of each field in the data table from each data table in the database system. The data table information includes: recording the number of the data table; the field information in the data table includes: data type, data column number, field maximum, field minimum, average column length, etc.
After the data table record number and the field information of each data table are read, the field attribute of each field needs to be judged.
Further, the determining the attribute of each field includes three aspects:
in the first aspect, it is necessary to determine whether the selected field is uuid.
The UUID is an abbreviation of universal Unique Identifier (universal Unique Identifier), which is a standard for software construction. UUID refers to a number generated on one machine that is guaranteed to be unique to all machines in the same space-time; the UUID is used to identify the attribute type and is considered a unique identification in all spaces and times. In general, it is guaranteed that no UUID generated anywhere that this value is truly unique will have the same value. One benefit of using UUIDs is that new identifiers can be created for new services. Thus, a client, when looking for a service, need only indicate the UUID associated with a certain class of service (or a particular service) in its service lookup request, and return a response if the service provider can match the available service with this UUID.
The lengths of the uuid of different versions vary from 32 bits to 42 bits, so that whether the selected field is uuid is judged, whether the number of characters of the selected field is between 32 characters and 42 characters is judged firstly, and meanwhile, the condition that the selected field only contains the characters "0123456789 ABCDEFAbcdef" and the separator "-" is met. If the selected field satisfies both of the above conditions, the field can be considered to be uuid.
In the second aspect, the type of the content of the selected field needs to be determined.
Judging whether the selected field is uuid, if so, judging that the content type of the selected field is uuid, and finishing the judgment; otherwise, continuously judging whether the record types in the selected fields are consistent, if so, determining the field content type as the record type, otherwise, determining the field content type as the no type.
The record types in the selected fields are consistent with the following situations:
the content recorded in the selected field is digital, does not contain decimal point and does not start with 0, and then the content type of the selected field is an integer;
the contents recorded in the selected field are all numbers and comprise a decimal point and do not start with 0, and then the type of the decimal point is decimal;
if the contents recorded in the selected field are all Chinese characters, the type of the contents is Chinese characters;
and if the contents recorded in the selected field are all English letters, the type of the selected field is English characters.
As an embodiment of the present invention, a specified number of records may be randomly sampled from the selected field and the type of each sample record determined.
In a third aspect, it is determined whether the selected field is the primary key of the data table to which it belongs.
If the selected field is set as the primary key of the data table, the selected field is the primary key of the data table, and the judgment is finished;
if the selected field is the uuid field and the field number (column _ id in the oracle database) is 1, the selected field is the main key of the data table, and the judgment is ended;
if the number of different values contained in the selected field (e.g., num _ distict in the oracle database) is less than the number of records contained in the data table, then the selected field is not the primary key of the data table.
In the above step S121, the attribute of each field is determined in advance, and is used as a basis for subsequent data table screening.
S122, selecting a target data table, and selecting a comparison data table from other data tables, wherein the method specifically comprises the following steps:
s1221, selecting one data table from the data tables in the database as a target data table, and recording the target data table as a table A;
s1222, selecting one data table from the other data tables except for the table A in the database as a comparison data table, and recording the comparison data table as a table B.
S123, performing a primary foreign key relationship evaluation on the target data table and the comparison data table according to a primary foreign key relationship screening method, as shown in fig. 4, specifically including:
s1231, screening the data tables, judging whether the record number of the table A is 0, if so, indicating that the table A is empty and the main foreign key relation cannot exist between the table A and the table B, so that judging that the table A and the table B do not have the main foreign key relation, and returning to S1221 to reselect a target data table; if the number of records in table a is not 0, continuously determining whether the number of records in table B is 0, if the number of records in table B is 0, indicating that table B is empty and there is no main foreign key relationship with table a, so determining that there is no main foreign key relationship between table a and table B, and returning to S1222 to reselect a comparison data table; if the number of records in the table B is not 0, continuously judging whether the table A has the primary key field, if so, executing S1232; otherwise, there is no primary foreign key relationship between table a and table B.
S1232, selecting a primary key field in the table A and recording the primary key field as a field a;
s1233, selecting a comparison field in the table B and recording the comparison field as a field B;
s1234, pre-judging the field a and the field B, if any one of the following field main foreign key relationship screening conditions is met, judging that no main foreign key relationship exists between the field a and the field B, returning to S1233 to reselect the comparison field until all fields in the history table B; if not, go to S135;
further, the field main foreign key relationship screening condition is as follows:
condition 1: the comparison field (field b) is a primary key of the comparison data table; that is, if the field b is the primary key of the comparison data table, it is determined that there is no primary foreign key relationship between the field a and the field b.
Condition 2: the content types of the primary key field and the comparison field are different; generally, it is necessary to determine whether the content type of the field is the same data type in the current field a and the field b, for example, VARCHAR 2; the two fields with the same data type can not ensure the main foreign key relationship between the two fields, and the content type of the field still needs to be judged, namely if the content types of the field a and the field b are different, for example, the content type of the field a is an integer type, and the content type of the field b is a Chinese character type, the main foreign key relationship between the field a and the field b does not exist.
Condition 3: one and only one of the primary key field and the contrast field is uuid; that is, if one of the fields a and b is uuid and the other is not uuid, there is no primary foreign key relationship between the fields a and b.
Condition 4: the data types of the primary key field and the comparison field are different; that is, the data types in oracle include, but are not limited to: VARCHAR2, NCHAR, NVARCHAR, DATE, LONG, BLOG, CLOG, NUMBER, INTEREGER, FLOAT. If the data types of the field a and the field b are different, the field a and the field b can be judged to have no main foreign key relation.
Condition 5: the average column length difference of the main key field and the comparison field exceeds a preset threshold value; the preset threshold is set as a ratio or a difference of the field average column lengths, for example, if the preset threshold is a difference of the field average column lengths, the differences between the average column lengths a1 and b1 of the field a and the field b are compared, and if the absolute value of the difference between a1 and b1 is greater than the difference of the preset threshold, no primary foreign key relationship exists between the field a and the field b; for another example, the preset threshold is a ratio of the average column lengths of the fields, and if the ratio of the average column lengths of the field a and the field b is greater than the preset threshold, there is no primary foreign key relationship between the field a and the field b. The average column length of the main key field and the average column length of the comparison field are compared by the method of presetting the threshold, if the column lengths of the main key field and the comparison field are greatly different, the main key relation between the main key field and the comparison field can be judged to be absent, and therefore the screening is carried out.
S1235, acquiring a first number of records from the comparison field, judging whether the acquired records exist in the primary key field, if so, taking the primary key field and the comparison field as a primary foreign key relation, and further obtaining a primary foreign key relation between the target data table and the comparison data table; otherwise, the primary key field and the comparison field have no primary foreign key relation.
In S1234, the comparison fields that obviously have no main foreign key relationship are removed, in S1235, it is further determined whether all records in field b exist in field a, since the number of records in field is large, for convenience of operation, the number of one acquisition record can be set, a corresponding number of records are acquired from the comparison fields, the acquired records are used as samples to determine whether all records exist in field a, if yes, it can be basically inferred that all records in field b exist in field a, at this time, it can be determined that field a and field b have a main foreign key relationship, the determination result is recorded, and the determination process is ended. If the partial record in the field B does not exist in the field a, the field a and the field B have no main foreign key relationship, and the method returns to S1233 to reselect the comparison field until all fields in the history table B.
As an embodiment of the present invention, the whole database is traversed according to the above method, the evaluation of the primary foreign key relationship of each data table in the whole database is obtained, the evaluation results of the primary foreign key relationship of each table are accumulated, the proportion with the primary foreign key relationship is determined, and then the comparison is performed according to a preset threshold value, so as to obtain the overall evaluation of the consistency of the database.
Through different field information comparison, the relation between the fields is prejudged, the fields which obviously cannot have the main foreign key relation are filtered, the workload of content comparison between the fields is greatly reduced, the missing main foreign key relation is quickly and efficiently detected, the detection efficiency and accuracy are improved, and the method has wide applicability, is not only suitable for an electric power system database, but also can be widely applied to other fields.
And S130, comprehensively evaluating the database according to the integrity evaluation result and the consistency evaluation result.
In some embodiments, the evaluation of database integrity may be divided into four levels, one being the best and four being the worst; the database consistency evaluation may also be ranked four times, with one being the best and four being the worst. An overall star-level evaluation scheme can be set according to requirements, for example, if the total level number of the evaluation results of the integrity and consistency of the database is greater than or equal to 7 levels, the data quality of the database is excellent; if the total level number of the evaluation results of the integrity and the consistency of the database is more than or equal to 5 levels and less than 7 levels, the data quality of the database is good; if the total level number of the evaluation results of the integrity and the consistency of the database is more than or equal to 3 levels and less than 5 levels, the data quality of the database is general; and if the total level number of the evaluation results of the integrity and the consistency of the database is more than or equal to 1 level and less than 3 levels, the data quality of the database is poor.
According to the invention, through analyzing the proportion of the number of records with null attributes in each data table to the total number of records and the proportion of the number of attributes with null values in the table to the total number of attributes, the method has higher discovery rate and accuracy when searching the attributes such as spaces, null columns, null rows and the like of the data table, and evaluates the integrity of the database; through the comparison of different field information, the relation between the fields is pre-judged, the fields which obviously cannot have the main foreign key relation are filtered, the workload of comparing the contents between the fields is greatly reduced, the missing main foreign key relation is quickly and efficiently detected, the detection efficiency and the accuracy are improved, and the wide applicability is realized; and the overall and consistency evaluation results are integrated, so that the data quality of the whole database is more comprehensively evaluated.
As shown in fig. 5, the electronic device includes a Central Processing Unit (CPU) that can perform various appropriate actions and processes according to computer program instructions stored in a Read Only Memory (ROM) or computer program instructions loaded from a storage unit into a Random Access Memory (RAM). In the RAM, various programs and data required for the operation of the device can also be stored. The CPU, ROM, and RAM are connected to each other via a bus. An input/output (I/O) interface is also connected to the bus.
A plurality of components in an electronic device are connected to an I/O interface, including: an input unit such as a keyboard, a mouse, etc.; an output unit such as various types of displays, speakers, and the like; storage units such as magnetic disks, optical disks, and the like; and a communication unit such as a network card, modem, wireless communication transceiver, etc. The communication unit allows the electronic device to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processing unit performs the various methods and processes described above, such as methods S110 to S130. For example, in some embodiments, methods S110-S130 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as a storage unit. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device via ROM and/or the communication unit. When the computer program is loaded into RAM and executed by the CPU, one or more of the steps of methods S110-S160 described above may be performed. Alternatively, in other embodiments, the CPU may be configured to perform methods S110-S130 by any other suitable means (e.g., by way of firmware).
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a load programmable logic device (CPLD), and the like.
Program code for implementing the methods of the present invention may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Further, while operations are depicted in a particular order, this should be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the invention. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (10)

1. A data quality evaluation method for a power industry service system is characterized by comprising the following steps:
evaluating the integrity of the database through the database integrity evaluation index value to obtain an integrity evaluation result;
evaluating the consistency of the database by a main foreign key relation screening method to obtain a consistency evaluation result;
and comprehensively evaluating the database according to the integrity evaluation result and the consistency evaluation result.
2. The method of claim 1, wherein evaluating the integrity of the database by the database integrity evaluation index value comprises:
reading a first field from a first function table, wherein the first field is the total number of records contained in a data table; reading a second field and a third field from the second function table, and establishing an affiliation relationship between the data table and the fields according to the second field; the second field represents the name of the data table to which the field visible to the user belongs; the third field is a null value number contained in each field in the data table;
reading a data table from a database, and accumulating the values of a third field of the data table to obtain the number of none values of the data table; calculating the number of none values and the number of full field records for removing all empty fields;
traversing all data tables in the database, accumulating the obtained data index values, calculating the proportion of the full empty fields of the database, the recording proportion of the full fields of the database, the proportion of the none values of the database and the proportion of the none values of the removed full empty fields of the database, and evaluating the integrity of the database.
3. The method of claim 2, wherein the calculating the number of none values excluding all empty fields comprises:
Q=P-M*N
q is the number of none values of all empty fields removed from the data table; p is the number of none values in the data table; m is the number of all-empty fields in the data table; n is the total number of records in the data table;
the full field record number is:
K=N-F
k is the number of full field records in the data table; n is the total number of records in the data table; f is the number of records of the non-full field in the data table; and the number of the records of the non-full field is the total number of the records in the field containing the null value in the data table.
4. The method of claim 2, wherein accumulating the resulting data values comprises:
accumulating the number of fields of each data table to obtain the total number of fields of the database;
accumulating the product of the number of fields of each data table and the number of records to obtain the total number of the numerical values of the database;
accumulating the total record number of each data table to obtain the total record number of the database;
accumulating the number of none values of each data table to obtain the number of the none values of the database;
accumulating the number of none values of all removed empty fields of each data table to obtain the number of all removed empty field none values of the database;
accumulating the full field record number of each data table to obtain the full field record number of the database;
accumulating the numerical values of all the data tables after all the empty fields are removed to obtain the total number of the numerical values of the database after all the empty fields are removed; the numerical value after removing the full-empty field is as follows:
H=(R-M)*N
h is a numerical value of the data table after all empty fields are removed; r is the number of fields in the data table; m is the number of all-empty fields in the data table; and N is the total number of records in the data table.
5. The method of claim 2, wherein the calculating the proportion of all empty fields in the database, the proportion of full field records in the database, the proportion of none values in the database, and the proportion of none values in the removed all empty fields in the database comprises:
the proportion of the database full-empty fields is the ratio of the number of the database full-empty fields to the total number of the database fields;
the database full field record proportion is the ratio of the database full field record number to the total database record number;
the proportion of the database none values is the ratio of the number of the database none values to the total number of the database values;
the proportion of the none values of the database without the full-empty fields is the ratio of the number of the none values of the database without the full-empty fields to the total number of the numerical values of the database without the full-empty fields.
6. The method of claim 2, wherein said evaluating the data integrity of said database comprises:
the data integrity of the database is in negative correlation with the proportion of the full empty fields of the database, the proportion of the none values of the database and the proportion of the none values of the removed full empty fields of the database;
the data integrity of the database is positively correlated with the full field record proportion of the database.
7. The method of claim 1, wherein the evaluating the consistency of the database by the primary foreign key relationship screening method comprises:
reading data table information and field information in the data table from a database, and judging the attribute of each field;
selecting a target data table and selecting a comparison data table from other data tables;
and evaluating the main foreign key relationship of the target data table and the comparison data table according to a main foreign key relationship screening method.
8. The method of claim 7, wherein said determining the attributes of each field comprises:
judging whether the selected field is uuid;
judging the type of the selected field content;
judging whether the selected field is a main key of the data table to which the field belongs;
the judging whether the selected field is uuid includes:
if the selected field only contains characters '0123456789 ABCDEFAbcdef' and a separator '-' and the character length is between 32-42, judging that the selected field is uuid;
the judging the content type of the selected field comprises the following steps:
judging whether the types of each record in the selected field are consistent, if so, judging that the content type of the selected field is the record type; otherwise, judging the content type of the selected field as a non-type;
the judging whether the selected field is the primary key of the data table to which the field belongs includes:
judging whether the selected field meets one of the following conditions, if so, the selected field is the primary key of the affiliated data table, otherwise, the selected field is not the primary key of the affiliated data table;
condition 1: the selected field has been set as the data table primary key;
condition 2: the selected field is uuid, and the number of the field is 1;
condition 3: the number of different values contained in the selected field is not less than the number of records contained in the data table, and the field number of the selected field is 1.
9. The method according to claim 7, wherein the performing the primary foreign key relationship evaluation on the target data table and the comparison data table according to the primary foreign key relationship screening method comprises:
step 1: judging whether the record number of the target data table is 0 or not, if so, determining that no main foreign key relation exists between the target data table and the comparison data table; otherwise, judging whether the record number of the comparison data table is 0, if so, determining that no main foreign key relation exists between the target data table and the comparison data table; otherwise, continuously judging whether the primary key field exists in the target data table, if so, executing the step 2; otherwise, no main foreign key relation exists between the target data table and the comparison data table;
step 2: selecting a main key field in the target data table and a comparison field in the comparison data table, judging the main key field and the comparison field according to field main foreign key relationship screening conditions, and if the main key field and the comparison field accord with one of the field main foreign key relationship screening conditions, determining that the main key field and the comparison field have no main foreign key relationship; otherwise, executing step 3;
the field main foreign key relationship screening conditions comprise:
condition 1: the comparison field is a main key of the comparison data table;
condition 2: the content types of the primary key field and the comparison field are different;
condition 3: one and only one of the primary key field and the contrast field is uuid;
condition 4: the data types of the primary key field and the comparison field are different;
condition 5: the average column length difference of the main key field and the comparison field exceeds a preset threshold value;
and step 3: judging whether records in a comparison field exist in the primary key field, if so, taking the primary key field and the comparison field as a primary foreign key relation, and obtaining a primary foreign key relation between the target data table and the comparison data table; otherwise, the primary key field and the comparison field have no primary foreign key relation.
10. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program, wherein the processor, when executing the program, implements the method of any of claims 1-9.
CN202010826540.0A 2020-08-17 2020-08-17 Data quality evaluation method and equipment for power industry service system Active CN111949644B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010826540.0A CN111949644B (en) 2020-08-17 2020-08-17 Data quality evaluation method and equipment for power industry service system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010826540.0A CN111949644B (en) 2020-08-17 2020-08-17 Data quality evaluation method and equipment for power industry service system

Publications (2)

Publication Number Publication Date
CN111949644A true CN111949644A (en) 2020-11-17
CN111949644B CN111949644B (en) 2022-04-29

Family

ID=73342599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010826540.0A Active CN111949644B (en) 2020-08-17 2020-08-17 Data quality evaluation method and equipment for power industry service system

Country Status (1)

Country Link
CN (1) CN111949644B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651296A (en) * 2020-11-23 2021-04-13 安徽继远软件有限公司 Method and system for automatically detecting data quality problem without prior knowledge

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6161103A (en) * 1998-05-06 2000-12-12 Epiphany, Inc. Method and apparatus for creating aggregates for use in a datamart
US7739224B1 (en) * 1998-05-06 2010-06-15 Infor Global Solutions (Michigan), Inc. Method and system for creating a well-formed database using semantic definitions
CN105205581A (en) * 2014-06-30 2015-12-30 国网上海市电力公司 Power-supply-enterprise electronic file safety risk evaluation system
CN105976120A (en) * 2016-05-17 2016-09-28 全球能源互联网研究院 Electric power operation monitoring data quality assessment system and method
CN108132957A (en) * 2016-12-01 2018-06-08 中国移动通信有限公司研究院 A kind of data base processing method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6161103A (en) * 1998-05-06 2000-12-12 Epiphany, Inc. Method and apparatus for creating aggregates for use in a datamart
US7739224B1 (en) * 1998-05-06 2010-06-15 Infor Global Solutions (Michigan), Inc. Method and system for creating a well-formed database using semantic definitions
CN105205581A (en) * 2014-06-30 2015-12-30 国网上海市电力公司 Power-supply-enterprise electronic file safety risk evaluation system
CN105976120A (en) * 2016-05-17 2016-09-28 全球能源互联网研究院 Electric power operation monitoring data quality assessment system and method
CN108132957A (en) * 2016-12-01 2018-06-08 中国移动通信有限公司研究院 A kind of data base processing method and device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651296A (en) * 2020-11-23 2021-04-13 安徽继远软件有限公司 Method and system for automatically detecting data quality problem without prior knowledge

Also Published As

Publication number Publication date
CN111949644B (en) 2022-04-29

Similar Documents

Publication Publication Date Title
US20220284017A1 (en) Systems and methods for rapid data analysis
US11734233B2 (en) Method for classifying an unmanaged dataset
US8117224B2 (en) Accuracy measurement of database search algorithms
CN113092981B (en) Wafer data detection method and system, storage medium and test parameter adjustment method
CN104756106A (en) Characterizing data sources in a data storage system
US11550762B2 (en) Implementation of data access metrics for automated physical database design
JP6242540B1 (en) Data conversion system and data conversion method
CN112559271A (en) Method, device, equipment and storage medium for monitoring interface performance of distributed application
CN111414410A (en) Data processing method, device, equipment and storage medium
CN111949644B (en) Data quality evaluation method and equipment for power industry service system
CN111949663B (en) Big data main foreign key consistency evaluation method, device and equipment
CN110377805B (en) Sensor resource recommendation method based on rapid branch allocation and sorting algorithm
CN116910102A (en) Enterprise query method and device based on user feedback and electronic equipment
CN109727056B (en) Financial institution recommendation method, device, storage medium and device
CN115641191B (en) Data pushing method and AI system based on data analysis
CN110737673B (en) Data processing method and system
CN111897803B (en) Database integrity evaluation method for power industry service system
CN114785616A (en) Data risk detection method and device, computer equipment and storage medium
CN115080552A (en) Data quality evaluation method, device, equipment and computer readable storage medium
CN112131215B (en) Bottom-up database information acquisition method and device
JP6201053B2 (en) Feature data management system and feature data management method
CN112100670A (en) Big data based privacy data grading protection method
Zhang et al. A GUI Design for Composition Discovery of View Interestingness.
CN111897889B (en) Data table integrity evaluation method for power industry service system
CN111680083A (en) Intelligent multi-stage government financial data acquisition system and data acquisition method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant