CN111090644A - Data consistency evaluation method based on data distribution fluctuation rate - Google Patents
- Publication number: CN111090644A
- Application number: CN201911362810.0A
- Authority: CN (China)
- Prior art keywords: data, ratio, value, consistency, current
- Prior art date: 2019-12-26
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06F16/215 — Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- G06F16/2365 — Ensuring data consistency and integrity
- G06F16/2322 — Optimistic concurrency control using timestamps
Abstract
The invention discloses a data consistency evaluation method based on the data distribution fluctuation rate, applied in the field of big data analysis and processing. It targets a problem of the prior art: bugs in a business system, or errors in its ETL process, cause some data to be lost or modified incorrectly. First, the data under test is divided into historical data and current data according to a timestamp field. Then the current ratio and the historical ratio of each value pattern in the data under test are computed, and the change in each ratio is compared with a given threshold. If the ratio of some value pattern changes by more than the threshold, the data is considered to have a consistency problem; otherwise, the data is normal. The method can quickly and effectively discover data loss and modification errors caused by business-system bugs or ETL-process errors.
Description
Technical Field
The invention belongs to the field of big data analysis and processing, and in particular relates to a consistency evaluation technique for structured data.
Background
Structured data is, in simple terms, database data. It is easiest to understand through its typical scenarios: enterprise ERP and financial systems, hospital HIS databases, campus all-in-one card systems, government administrative approval, and other core databases.
The basic requirements for such data include high-speed storage and access, data backup, data sharing, and disaster recovery.
Structured data, also called row data, is data logically expressed and implemented by a two-dimensional table structure; it strictly follows data format and length specifications and is mainly stored and managed in relational databases. By contrast, unstructured data is data that is not well suited to a two-dimensional database table, including office documents of all formats, XML, HTML, various reports, images, and audio and video information. Databases that support unstructured data use multi-valued fields, sub-fields, and variable-length field mechanisms to create and manage data items; they are widely used in full-text retrieval and multimedia information processing.
With the development of information technology, departments, enterprises, and public institutions have all built data centers. Because the data quality of a data source is unknown, data inconsistencies regularly arise from errors in the ETL (Extract-Transform-Load) process and similar causes. Data consistency is one dimension of data quality assessment; it focuses on evaluating the degree to which data has been altered or has drifted. Currently, data consistency is generally evaluated only by checking the consistency of the data format within a field. In fact, merely evaluating format consistency within a field cannot solve the following problem:
bugs in a business system, or errors in its ETL process, cause some data to be lost or modified incorrectly. The usual evaluation methods cannot detect such abnormal data.
Disclosure of Invention
In order to solve the above technical problem, the invention provides a data consistency evaluation method based on the data distribution fluctuation rate, which preliminarily identifies data with abnormal fluctuation by evaluating the fluctuation rate of the value-pattern distribution within a field.
The technical scheme adopted by the invention is as follows: a data consistency evaluation method based on the data distribution fluctuation rate, in which the data under test is first divided into historical data and current data according to a timestamp field; then, the current ratio and the historical ratio of each value pattern in the data under test are computed, and the change in each ratio is compared with a given threshold; if the ratio of some value pattern changes by more than the threshold, the data is considered to have a consistency problem; otherwise, the data is normal.
The value-pattern ratio is calculated as:

f(k) = (Σ_{x=k} 1) / (Σ 1)

where Σ_{x=k} 1 counts the number of records whose value equals a given value k, x is the independent variable (the field value of a record), k is a data value, and Σ 1 denotes the total number of records for the field.
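A minimal sketch of this ratio computation in Python (function and variable names are illustrative, not from the patent):

```python
from collections import Counter

def pattern_ratios(values):
    """For each value pattern k, compute (records equal to k) / (total records)."""
    total = len(values)       # denominator: sum of 1 over all records of the field
    counts = Counter(values)  # numerator per pattern: sum of 1 over records with x = k
    return {k: n / total for k, n in counts.items()}

# Using the professor/associate professor/lecturer field from the embodiment:
ratios = pattern_ratios(["professor", "professor", "lecturer", "associate professor"])
# ratios["professor"] == 0.5
```

The ratios of all patterns in a field sum to 1, since every record matches exactly one value pattern.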
The change amplitude of a ratio is specifically the difference between the current value-pattern ratio of the data under test and its historical ratio.
Of course, before the data under test is divided into historical data and current data according to the timestamp field, the method further includes: judging whether the data under test is empty; if so, the procedure ends; otherwise, the data under test is divided into historical data and current data according to the timestamp field.
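Putting the steps together, here is a minimal end-to-end sketch of the method, assuming records arrive as (timestamp, value) pairs; all names are illustrative, not from the patent:

```python
from collections import Counter

def assess_consistency(records, split_time, threshold):
    """Flag value patterns whose ratio changed by more than `threshold`.

    `records` is a list of (timestamp, value) pairs. An empty input ends
    the procedure immediately, mirroring the empty-data check above.
    """
    if not records:
        return None  # nothing to assess
    # Split into historical (timestamp <= split_time) and current (> split_time).
    hist = [v for t, v in records if t <= split_time]
    curr = [v for t, v in records if t > split_time]
    if not hist or not curr:
        return None  # cannot compare without both segments

    def ratios(values):
        return {k: n / len(values) for k, n in Counter(values).items()}

    h, c = ratios(hist), ratios(curr)
    # A pattern signals a consistency problem if its ratio moved more than the threshold.
    return {k for k in set(h) | set(c)
            if abs(h.get(k, 0.0) - c.get(k, 0.0)) > threshold}
```

With a 5% threshold, a pattern whose share goes from 22% to 1% would be flagged, while one going from 30% to 25% would not.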
The beneficial effects of the invention are as follows: the invention can quantitatively evaluate how much the share of each value pattern in a field fluctuates compared with the past, and can find abnormal points, i.e., patterns whose current share changed more than expected. With this method, a data engineer can assess whether the data conforms to its historical pattern and whether an ETL-process error or an application-system bug may have made the data inconsistent; it can therefore serve as a data consistency evaluation method.
Drawings
FIG. 1 is a flow chart of the scheme of the invention.
Detailed Description
To help those skilled in the art understand the technical content of the present invention, the invention is further explained below with reference to the accompanying drawings.
The invention is first described in terms of a usage scenario: it can be used in any scenario where one wishes to quantitatively evaluate how much the share of the value patterns in a field has changed compared with the past.
In the present embodiment, the content of the invention is described in detail using a "student status-change detail table T" as an example, with fields "student number F2", "status-change reason F1", and "status-change time F0". F0 ranges over [2010-9-1, 2019-8-30], and the value patterns of F1 comprise "study abroad", "personal application", "unauthorized departure", "end of leave", "enrollment cleanup", and "poor grades". A value pattern in this invention is a value that can be looked up in a dictionary table; each value pattern represents one class of values. For example, if a field only takes the values professor, associate professor, and lecturer, the field may contain many records, but "professor" is one value pattern, "associate professor" is another, and "lecturer" is a third.
The processing flow is shown in FIG. 1:
The split time t can be set to 2018-8-30, which divides the values of F1 into two segments: F11, the F1 values of records whose F0 is earlier than t, and F12, the F1 values of records whose F0 is later than t. Records whose F0 equals t are generally assigned to F11 or F12 according to how the split time is defined; in this embodiment they are assigned to F11.
Then, for F11 and F12 separately, the ratio of each value pattern within the field is computed by grouping.
Assume the statistical results are as follows:
the ratio of the modes in F11 is as follows:
10% of abroad reserved school- > 22% of the applicant application- > 6% of free school- > 30% of rest period- > 20% of school address clearance- > 12% of low grade- >
The ratio of the modes in F12 is as follows:
11 percent of leaving school- >1 percent of the applicant application- >1 percent of the inventor, 29 percent of free school- >25 percent of the rest period, 21 percent of the school address clearance- >21 percent of the school address and 13 percent of the low grade
Given a threshold TH of 5%, the results for F11 and F12 are compared pattern by pattern:

y(x) = |f(x1) − f(x2)| − TH

It is found that for the two value patterns "personal application" and "unauthorized departure", the fluctuation rate in F12 exceeds the threshold (i.e., y(x) > 0). We can therefore preliminarily determine that the data has a consistency problem. Further analysis based on other information (not discussed here) shows that a bug in the most recently upgraded version of the business system caused the problem: when updating status-change data, "personal application" and "unauthorized departure" were assigned the same code, so all records that should have been updated to "personal application" were changed to "unauthorized departure".
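The comparison in this example can be reproduced numerically; the pattern names below are illustrative English renderings of the example's six dictionary values:

```python
# Historical (F11) and current (F12) value-pattern ratios from the example.
f11 = {"study abroad": 0.10, "personal application": 0.22,
       "unauthorized departure": 0.06, "end of leave": 0.30,
       "enrollment cleanup": 0.20, "poor grades": 0.12}
f12 = {"study abroad": 0.11, "personal application": 0.01,
       "unauthorized departure": 0.29, "end of leave": 0.25,
       "enrollment cleanup": 0.21, "poor grades": 0.13}
TH = 0.05  # given threshold

# y(x) = |f(x1) - f(x2)| - TH; patterns with y(x) > 0 are anomalous.
y = {k: abs(f11[k] - f12[k]) - TH for k in f11}
anomalous = sorted(k for k, v in y.items() if v > 0)
# anomalous == ['personal application', 'unauthorized departure']
```

Note that "end of leave" moves by exactly 5 percentage points (30% to 25%) and is not flagged, since only changes strictly greater than the threshold count.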
As illustrated by the above example, the present invention can be used as a method for evaluating data consistency.
The threshold in the invention is set between 3% and 6%; a large number of experiments by the applicant show that a threshold of 5% gives the best data consistency evaluation results.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and the invention is not limited to the specifically described embodiments and examples. Various modifications and alterations to this invention will be apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of its claims.
Claims (4)
1. A data consistency evaluation method based on the data distribution fluctuation rate, characterized in that: first, the data under test is divided into historical data and current data according to a timestamp field; then, the current ratio and the historical ratio of each value pattern in the data under test are computed, and the change in each ratio is compared with a given threshold; if the ratio of some value pattern changes by more than the threshold, the data is considered to have a consistency problem; otherwise, the data is normal.
2. The method according to claim 1, wherein the value-pattern ratio is calculated as:

f(k) = (Σ_{x=k} 1) / (Σ 1)

where Σ_{x=k} 1 counts the number of records whose value equals a given value k, x is the independent variable, k is a data value, and Σ 1 denotes the total number of records for the field.
3. The data consistency evaluation method based on the data distribution fluctuation rate according to claim 1, wherein the change in the ratio is specifically the difference between the current value-pattern ratio of the data under test and its historical ratio.
4. The data consistency evaluation method based on the data distribution fluctuation rate according to claim 1, further comprising, before dividing the data under test into historical data and current data according to the timestamp field: judging whether the data under test is empty; if so, ending the procedure; otherwise, dividing the data under test into historical data and current data according to the timestamp field.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911362810.0A CN111090644A (en) | 2019-12-26 | 2019-12-26 | Data consistency evaluation method based on data distribution fluctuation rate |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111090644A true CN111090644A (en) | 2020-05-01 |
Family
ID=70398241
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104715027A (en) * | 2015-03-04 | 2015-06-17 | 北京京东尚科信息技术有限公司 | Distributed data transaction judging and positioning method and system |
CN107943809A (en) * | 2016-10-13 | 2018-04-20 | 阿里巴巴集团控股有限公司 | Data quality monitoring method, device and big data calculating platform |
CN109241043A (en) * | 2018-08-13 | 2019-01-18 | 蜜小蜂智慧(北京)科技有限公司 | A kind of data quality checking method and device |
CN109872813A (en) * | 2019-01-24 | 2019-06-11 | 广州金域医学检验中心有限公司 | Detection system positive rate appraisal procedure and device, computer readable storage medium |
CN110008201A (en) * | 2019-04-09 | 2019-07-12 | 浩鲸云计算科技股份有限公司 | A kind of quality of data towards big data checks monitoring method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 2020-05-01 |