CN111090644A - Data consistency evaluation method based on data distribution fluctuation rate - Google Patents

Publication number
CN111090644A
Authority
CN
China
Prior art keywords
data
ratio
value
consistency
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911362810.0A
Other languages
Chinese (zh)
Inventor
唐雪飞
蒲高飞
黄永鑫
王东方
胡茂秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Comsys Information Technology Co ltd
Original Assignee
Chengdu Comsys Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Comsys Information Technology Co ltd
Priority to CN201911362810.0A
Publication of CN111090644A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2365 Ensuring data consistency and integrity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2308 Concurrency control
    • G06F16/2315 Optimistic concurrency control
    • G06F16/2322 Optimistic concurrency control using timestamps

Abstract

The invention discloses a data consistency evaluation method based on the data distribution fluctuation rate. It is applied in the field of big data analysis and processing and addresses a problem in the prior art: bugs in a business system or errors in the ETL process cause some data to be lost or wrongly modified. First, the data under test are divided into historical data and current data according to a timestamp field. Then the current ratio and the historical ratio of each value mode in the data are computed, and the change amplitude of each ratio is compared with a given threshold. If the change amplitude of the ratio of any value mode exceeds the threshold, the data are considered to have a consistency problem; otherwise, the data are considered normal. The method can quickly and effectively detect data loss or modification errors caused by business-system bugs or ETL process errors.

Description

Data consistency evaluation method based on data distribution fluctuation rate
Technical Field
The invention belongs to the field of big data analysis and processing, and particularly relates to a consistency evaluation technology for structured data.
Background
Structured data can be thought of simply as database data. It is easiest to understand through typical scenarios, such as enterprise ERP and financial systems, medical HIS databases, education all-in-one card systems, government administrative approval systems, and other core databases.
These scenarios basically involve requirements for high-speed storage, data backup, data sharing, and data disaster recovery.
Structured data, also called row data, is data logically represented and implemented by a two-dimensional table structure. It strictly follows data format and length specifications and is mainly stored and managed by relational databases. In contrast, unstructured data is not suitable for representation in two-dimensional database tables; it includes office documents of all formats, XML, HTML, various reports, pictures, and audio and video information. Databases that support unstructured data use multi-value fields, sub-fields, and variable-length fields to create and manage data items, and are widely used in full-text retrieval and multimedia information processing.
With the development of information technology, government departments, enterprises, and public institutions have built data centers. Because the data quality of the data sources is unknown, data inconsistency often arises from causes such as errors in the ETL (Extract, Transform, Load) process. Data consistency is one dimension of data quality assessment; it focuses on assessing the degree to which data have been altered or have varied. Currently, data consistency is generally evaluated only by checking the consistency of the data format within a field. In fact, evaluating format consistency alone cannot solve the following problem:
Bugs in the business system or errors in the ETL process cause some data to be lost or wrongly modified. The usual evaluation methods cannot find such abnormal data.
Disclosure of Invention
To solve the above technical problem, the invention provides a data consistency evaluation method based on the data distribution fluctuation rate, which makes a preliminary identification of abnormally fluctuating data by evaluating the fluctuation rate of the value mode distribution within a field.
The technical scheme adopted by the invention is a data consistency evaluation method based on the data distribution fluctuation rate. First, the data under test are divided into historical data and current data according to a timestamp field. Then the current ratio and the historical ratio of each value mode in the data under test are computed, and the change amplitude of each ratio is compared with a given threshold. If the change amplitude of the ratio of any value mode exceeds the threshold, the data are considered to have a consistency problem; otherwise, the data are considered normal.
The ratio of a value mode is calculated as follows:
$$f(k) = \frac{\sum_{x=k} 1}{\sum 1}$$
where $\sum_{x=k} 1$ counts the number of records whose value equals k, x is the independent variable, k is a data value, and $\sum 1$ is the total number of records in the field.
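The ratio calculation can be illustrated with a minimal Python sketch. The function name `value_mode_ratios` and the sample values are assumptions made for illustration; the patent does not prescribe an implementation.

```python
from collections import Counter

def value_mode_ratios(values):
    """Return {value mode: share of records} for one field (hypothetical helper)."""
    total = len(values)            # denominator: total number of records in the field
    if total == 0:
        return {}                  # no records, so no ratios to report
    counts = Counter(values)       # counts[k] = number of records whose value equals k
    return {k: n / total for k, n in counts.items()}

# Made-up sample field with three value modes
ratios = value_mode_ratios(
    ["professor", "professor", "instructor", "associate professor"]
)
print(ratios)
```

Each key of the returned dictionary is one value mode, and the ratios of all modes sum to 1.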
The change amplitude of a ratio is the difference between the current ratio and the historical ratio of a value mode in the data under test.
Before the data under test are divided into historical data and current data according to the timestamp field, the method further includes checking whether the data under test are empty. If they are, the operation ends; otherwise, the data are divided into historical data and current data according to the timestamp field.
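The steps described so far (empty check, split by timestamp, ratio comparison against a threshold) can be sketched end to end as follows. This is only an illustrative sketch: the names `evaluate_consistency` and `mode_ratios`, the `(timestamp, value)` row layout, and the sample rows are all assumptions, and a value mode missing from one segment is treated here as having ratio 0.

```python
from collections import Counter
from datetime import date

def mode_ratios(values):
    """Share of each value mode among the given values (hypothetical helper)."""
    total = len(values)
    return {k: n / total for k, n in Counter(values).items()} if total else {}

def evaluate_consistency(rows, split_time, threshold):
    """rows: list of (timestamp, value) pairs.
    Returns None for empty input, otherwise {value mode: change amplitude}
    for every mode whose ratio changed by more than the threshold."""
    if not rows:                                   # empty data: end the operation
        return None
    history = [v for ts, v in rows if ts <= split_time]
    current = [v for ts, v in rows if ts > split_time]
    past, now = mode_ratios(history), mode_ratios(current)
    flagged = {}
    for mode in set(past) | set(now):
        # a mode absent from one segment is assumed to have ratio 0 there
        change = abs(now.get(mode, 0.0) - past.get(mode, 0.0))
        if change > threshold:
            flagged[mode] = change
    return flagged

# Made-up rows: "a" disappears after the split time
rows = [
    (date(2018, 1, 1), "a"), (date(2018, 2, 1), "a"),
    (date(2018, 3, 1), "b"), (date(2018, 4, 1), "b"),
    (date(2019, 1, 1), "b"), (date(2019, 2, 1), "b"),
    (date(2019, 3, 1), "b"), (date(2019, 4, 1), "b"),
]
result = evaluate_consistency(rows, date(2018, 8, 30), 0.05)
```

An empty result dictionary corresponds to the "data is normal" outcome, while any entry in it signals a possible consistency problem.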
The beneficial effects of the invention are as follows. The invention evaluates how much the share of each value mode in a field has fluctuated compared with the past, and can find abnormal points, that is, value modes whose current change is larger than expected. With this method, a data engineer can assess whether the data conform to historical patterns and whether inconsistency may have been introduced by an ETL process error or an application-system bug. The method can therefore serve as a method for evaluating data consistency.
Drawings
FIG. 1 is a flow chart of the scheme of the invention.
Detailed Description
In order to facilitate the understanding of the technical contents of the present invention by those skilled in the art, the present invention will be further explained with reference to the accompanying drawings.
The usage scenario is described first: the invention may be used in any scenario that requires evaluating how much the share of each value mode in a field has changed compared with the past.
This embodiment describes the invention in detail using a student status change subclass table T as an example, which includes the fields "student number F2", "status change type F1", and "status change time F0". F0 ranges over [2010-9-1, 2019-8-30]. The value modes of F1 are "enrollment retained while abroad", "personal application", "leaving school without permission", "leave of absence ended", "enrollment record clearing", and "poor performance". In the invention, value modes are values that can be looked up in a dictionary table, and each value mode represents one class of values. For example, if a field with many records contains only the values professor, associate professor, and instructor, then professor is one value mode, associate professor is one value mode, and instructor is one value mode.
The processing flow is shown in FIG. 1:
The split time t can be set to 2018-8-30, dividing the values of F1 into two segments: F11, where F0 is earlier than t, and F12, where F0 is later than t. Records where F0 equals t are assigned to either F11 or F12 according to the chosen convention; in this embodiment they are assigned to F11.
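The split can be sketched in Python as below. The function name `split_by_time` and the three sample rows are invented for illustration; only the split time and the rule that F0 equal to t goes to F11 come from the embodiment.

```python
from datetime import date

def split_by_time(rows, t):
    """rows are (F0, F1) pairs; returns the F1 values as (F11, F12)."""
    f11 = [f1 for f0, f1 in rows if f0 <= t]   # F0 == t is assigned to F11, as in this embodiment
    f12 = [f1 for f0, f1 in rows if f0 > t]
    return f11, f12

# Made-up sample rows, including one exactly at the split time
rows = [
    (date(2015, 3, 1), "personal application"),
    (date(2018, 8, 30), "poor performance"),
    (date(2019, 1, 15), "leaving school without permission"),
]
f11, f12 = split_by_time(rows, date(2018, 8, 30))
print(f11, f12)
```

Using `<=` for F11 and `>` for F12 guarantees that every record lands in exactly one segment.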
Then, for F11 and F12 separately, the records are grouped by value mode and the ratio of each value mode in the field is computed:
$$f(k) = \frac{\sum_{x=k} 1}{\sum 1}$$
Assume the statistical results are as follows.
The ratios of the value modes in F11 are:
enrollment retained while abroad -> 10%, personal application -> 22%, leaving school without permission -> 6%, leave of absence ended -> 30%, enrollment record clearing -> 20%, poor performance -> 12%
The ratios of the value modes in F12 are:
enrollment retained while abroad -> 11%, personal application -> 1%, leaving school without permission -> 29%, leave of absence ended -> 25%, enrollment record clearing -> 21%, poor performance -> 13%
Given a threshold TH of 5%, the results for F11 and F12 are compared for each value mode:
$$y(x) = |f(x_1) - f(x_2)| - TH$$
where $f(x_1)$ and $f(x_2)$ denote the ratios of value mode x in F11 and F12, respectively.
The fluctuation of two value modes in F12, "personal application" and "leaving school without permission", exceeds the threshold (i.e., y(x) > 0), so it can be preliminarily determined that the data have a consistency problem. Further analysis based on other information (not discussed here) showed that a bug in the most recently upgraded version of the business system caused the problem: when status change data were updated, "personal application" and "leaving school without permission" were assigned the same code, so all records updated as "personal application" were changed to "leaving school without permission".
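The comparison in this example can be reproduced with a short Python sketch. The percentages are those of the embodiment; the English mode names are renderings assumed here, and working in integer percentage points is a choice made for this sketch so that the boundary case (a change of exactly 5 points) is not affected by floating-point rounding.

```python
TH = 5  # threshold, in percentage points

# Ratios from the embodiment, in percent
f11 = {
    "enrollment retained while abroad": 10,
    "personal application": 22,
    "leaving school without permission": 6,
    "leave of absence ended": 30,
    "enrollment record clearing": 20,
    "poor performance": 12,
}
f12 = {
    "enrollment retained while abroad": 11,
    "personal application": 1,
    "leaving school without permission": 29,
    "leave of absence ended": 25,
    "enrollment record clearing": 21,
    "poor performance": 13,
}

# y(x) = |f(x1) - f(x2)| - TH for each value mode x
y = {mode: abs(f11[mode] - f12[mode]) - TH for mode in f11}

# A mode is flagged only when y(x) > 0, i.e. the change strictly exceeds TH
flagged = sorted(mode for mode, value in y.items() if value > 0)
print(flagged)
# → ['leaving school without permission', 'personal application']
```

"Leave of absence ended" changes by exactly 5 points, so y(x) = 0 and it is not flagged; the other three modes change by only 1 point.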
As illustrated by the above example, the present invention can be used as a method for evaluating data consistency.
In the invention, the threshold is set in the range of 3% to 6%. According to the applicant's extensive experiments, the data consistency evaluation performs best when the threshold is set to 5%.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (4)

1. A data consistency evaluation method based on data distribution fluctuation rate, characterized in that: first, the data under test are divided into historical data and current data according to a timestamp field; then, the current ratio and the historical ratio of each value mode in the data under test are analyzed, and the change amplitude of each ratio is compared with a given threshold; if the change amplitude of the ratio of any value mode in the data is larger than the threshold, the data are considered to have a consistency problem; otherwise, the data are considered normal.
2. The method according to claim 1, wherein the ratio of a value mode is calculated as:
$$f(k) = \frac{\sum_{x=k} 1}{\sum 1}$$
where $\sum_{x=k} 1$ counts the number of records whose value equals k, x is the independent variable, k is a data value, and $\sum 1$ is the total number of records in the field.
3. The data consistency evaluation method based on data distribution fluctuation rate according to claim 1, wherein the change amplitude of a ratio is the difference between the current ratio and the historical ratio of a value mode in the data under test.
4. The data consistency evaluation method based on data distribution fluctuation rate according to claim 1, further comprising, before the data under test are divided into historical data and current data according to the timestamp field: checking whether the data under test are empty, and if so, ending the operation; otherwise, dividing the data under test into historical data and current data according to the timestamp field.
CN201911362810.0A 2019-12-26 2019-12-26 Data consistency evaluation method based on data distribution fluctuation rate Pending CN111090644A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911362810.0A CN111090644A (en) 2019-12-26 2019-12-26 Data consistency evaluation method based on data distribution fluctuation rate

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911362810.0A CN111090644A (en) 2019-12-26 2019-12-26 Data consistency evaluation method based on data distribution fluctuation rate

Publications (1)

Publication Number Publication Date
CN111090644A true CN111090644A (en) 2020-05-01

Family

ID=70398241

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911362810.0A Pending CN111090644A (en) 2019-12-26 2019-12-26 Data consistency evaluation method based on data distribution fluctuation rate

Country Status (1)

Country Link
CN (1) CN111090644A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104715027A (en) * 2015-03-04 2015-06-17 北京京东尚科信息技术有限公司 Distributed data transaction judging and positioning method and system
CN107943809A (en) * 2016-10-13 2018-04-20 阿里巴巴集团控股有限公司 Data quality monitoring method, device and big data calculating platform
CN109241043A (en) * 2018-08-13 2019-01-18 蜜小蜂智慧(北京)科技有限公司 A kind of data quality checking method and device
CN109872813A (en) * 2019-01-24 2019-06-11 广州金域医学检验中心有限公司 Detection system positive rate appraisal procedure and device, computer readable storage medium
CN110008201A (en) * 2019-04-09 2019-07-12 浩鲸云计算科技股份有限公司 A kind of quality of data towards big data checks monitoring method

Similar Documents

Publication Publication Date Title
US11429614B2 (en) Systems and methods for data quality monitoring
CN106815326B (en) System and method for detecting consistency of data table without main key
US8161070B2 (en) Efficient delta handling in star and snowflake schemes
CN106033437A (en) Method and system for processing distributed transaction
US10671627B2 (en) Processing a data set
WO2013014558A1 (en) Auto-mapping between source and target models using statistical and ontology techniques
CN110555770B (en) Block chain world state checking and recovering method based on incremental hash
US10198346B1 (en) Test framework for applications using journal-based databases
CN105045917A (en) Example-based distributed data recovery method and device
CN111159272A (en) Data quality monitoring and early warning method and system based on data warehouse and ETL
CN103440265A (en) MapReduce-based CDC (Change Data Capture) method of MYSQL database
CN105930375A (en) XBRL file-based data mining method
Shahbaz Data mapping for data warehouse design
US10606829B1 (en) Methods and systems for identifying data inconsistencies between electronic record systems using data partitioning
CN111090644A (en) Data consistency evaluation method based on data distribution fluctuation rate
US20230099164A1 (en) Systems and methods for automated data quality semantic constraint identification using rich data type inferences
CN112131291B (en) Structured analysis method, device and equipment based on JSON data and storage medium
CN103778218A (en) Cloud computation-based standard information consistency early warning system and method
CN103605699A (en) Method and device for configuring relations of data
CN112214983A (en) Data record duplicate checking method and system
CN107766884B (en) Bayes fusion evaluation method based on representative point optimization
CN105243479A (en) Risk judgment method and data processing system
US11768806B1 (en) System and method for regular updates to computer-form files
CN111046056A (en) Data consistency evaluation method based on data pattern clustering
US20100250621A1 (en) Financial-analysis support apparatus and financial-analysis support method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200501