CN111090644A - Data consistency evaluation method based on data distribution fluctuation rate - Google Patents
- Publication number: CN111090644A
- Application number: CN201911362810.0A
- Authority: CN (China)
- Prior art keywords: data, ratio, value, consistency, current
- Prior art date: 2019-12-26
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06F16/215 — Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- G06F16/2365 — Ensuring data consistency and integrity
- G06F16/2322 — Optimistic concurrency control using timestamps
Abstract
The invention discloses a data consistency evaluation method based on the data distribution fluctuation rate, applied in the field of big data analysis and processing. It targets a problem of the prior art: bugs in a business system, or errors in its ETL process, cause some data to be lost or modified incorrectly. First, the data under test is divided into historical data and current data according to a timestamp field. Then the current ratio and the historical ratio of each value pattern in the data under test are computed, and the change in each ratio is compared with a given threshold. If the ratio of some value pattern changes by more than the threshold, the data is considered to have a consistency problem; otherwise, the data is normal. The method can quickly and effectively discover data loss and modification errors caused by business-system bugs or ETL-process errors.
Description
Technical Field
The invention belongs to the field of big data analysis and processing, and in particular relates to a consistency evaluation technique for structured data.
Background
Structured data is, in simple terms, database data. It is easiest to understand through its typical scenarios: enterprise ERP and financial systems, hospital HIS databases, campus all-in-one card systems, government administrative approval, and other core databases.
The basic requirements for such data include high-speed storage and access, data backup, data sharing, and disaster recovery.
Structured data, also called row data, is data logically expressed and implemented by a two-dimensional table structure; it strictly follows data format and length specifications and is mainly stored and managed in relational databases. By contrast, unstructured data is data that is not well suited to a two-dimensional database table, including office documents of all formats, XML, HTML, various reports, images, and audio and video information. Databases that support unstructured data use multi-valued fields, sub-fields, and variable-length field mechanisms to create and manage data items; they are widely used in full-text retrieval and multimedia information processing.
With the development of information technology, departments, enterprises, and public institutions have all built data centers. Because the data quality of a data source is unknown, data inconsistencies regularly arise from errors in the ETL (Extract-Transform-Load) process and similar causes. Data consistency is one dimension of data quality assessment; it focuses on evaluating the degree to which data has been altered or has drifted. Currently, data consistency is generally evaluated only by checking the consistency of the data format within a field. In fact, merely evaluating format consistency within a field cannot solve the following problem:
bugs in a business system, or errors in its ETL process, cause some data to be lost or modified incorrectly. The usual evaluation methods cannot detect such abnormal data.
Disclosure of Invention
In order to solve the above technical problem, the invention provides a data consistency evaluation method based on the data distribution fluctuation rate, which preliminarily identifies data with abnormal fluctuation by evaluating the fluctuation rate of the value-pattern distribution within a field.
The technical scheme adopted by the invention is as follows: a data consistency evaluation method based on the data distribution fluctuation rate, in which the data under test is first divided into historical data and current data according to a timestamp field; then, the current ratio and the historical ratio of each value pattern in the data under test are computed, and the change in each ratio is compared with a given threshold; if the ratio of some value pattern changes by more than the threshold, the data is considered to have a consistency problem; otherwise, the data is normal.
The value-pattern ratio is calculated as:

f(k) = (Σ_{x=k} 1) / (Σ 1)

where Σ_{x=k} 1 counts the number of records whose value equals a given value k, x is the independent variable (the field value of a record), k is a data value, and Σ 1 denotes the total number of records for the field.
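A minimal sketch of this ratio computation in Python (function and variable names are illustrative, not from the patent):

```python
from collections import Counter

def pattern_ratios(values):
    """For each value pattern k, compute (records equal to k) / (total records)."""
    total = len(values)       # denominator: sum of 1 over all records of the field
    counts = Counter(values)  # numerator per pattern: sum of 1 over records with x = k
    return {k: n / total for k, n in counts.items()}

# Using the professor/associate professor/lecturer field from the embodiment:
ratios = pattern_ratios(["professor", "professor", "lecturer", "associate professor"])
# ratios["professor"] == 0.5
```

The ratios of all patterns in a field sum to 1, since every record matches exactly one value pattern.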
The change amplitude of a ratio is specifically the difference between the current value-pattern ratio of the data under test and its historical ratio.
Of course, before the data under test is divided into historical data and current data according to the timestamp field, the method further includes: judging whether the data under test is empty; if so, the procedure ends; otherwise, the data under test is divided into historical data and current data according to the timestamp field.
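Putting the steps together, here is a minimal end-to-end sketch of the method, assuming records arrive as (timestamp, value) pairs; all names are illustrative, not from the patent:

```python
from collections import Counter

def assess_consistency(records, split_time, threshold):
    """Flag value patterns whose ratio changed by more than `threshold`.

    `records` is a list of (timestamp, value) pairs. An empty input ends
    the procedure immediately, mirroring the empty-data check above.
    """
    if not records:
        return None  # nothing to assess
    # Split into historical (timestamp <= split_time) and current (> split_time).
    hist = [v for t, v in records if t <= split_time]
    curr = [v for t, v in records if t > split_time]
    if not hist or not curr:
        return None  # cannot compare without both segments

    def ratios(values):
        return {k: n / len(values) for k, n in Counter(values).items()}

    h, c = ratios(hist), ratios(curr)
    # A pattern signals a consistency problem if its ratio moved more than the threshold.
    return {k for k in set(h) | set(c)
            if abs(h.get(k, 0.0) - c.get(k, 0.0)) > threshold}
```

With a 5% threshold, a pattern whose share goes from 22% to 1% would be flagged, while one going from 30% to 25% would not.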
The beneficial effects of the invention are as follows: the invention can quantitatively evaluate how much the share of each value pattern in a field fluctuates compared with the past, and can find abnormal points, i.e., patterns whose current share changed more than expected. With this method, a data engineer can assess whether the data conforms to its historical pattern and whether an ETL-process error or an application-system bug may have made the data inconsistent; it can therefore serve as a data consistency evaluation method.
Drawings
FIG. 1 is a flow chart of the scheme of the invention.
Detailed Description
To help those skilled in the art understand the technical content of the present invention, the invention is further explained below with reference to the accompanying drawings.
The invention is first described in terms of a usage scenario: it can be used in any scenario where one wishes to quantitatively evaluate how much the share of the value patterns in a field has changed compared with the past.
In the present embodiment, the content of the invention is described in detail using a "student status-change detail table T" as an example, with fields "student number F2", "status-change reason F1", and "status-change time F0". F0 ranges over [2010-9-1, 2019-8-30], and the value patterns of F1 comprise "study abroad", "personal application", "unauthorized departure", "end of leave", "enrollment cleanup", and "poor grades". A value pattern in this invention is a value that can be looked up in a dictionary table; each value pattern represents one class of values. For example, if a field only takes the values professor, associate professor, and lecturer, the field may contain many records, but "professor" is one value pattern, "associate professor" is another, and "lecturer" is a third.
The processing flow is shown in FIG. 1:
The split time t can be set to 2018-8-30, which divides the values of F1 into two segments: F11, the F1 values of records whose F0 is earlier than t, and F12, the F1 values of records whose F0 is later than t. Records whose F0 equals t are generally assigned to F11 or F12 according to how the split time is defined; in this embodiment they are assigned to F11.
Then, for F11 and F12 separately, the ratio of each value pattern within the field is computed by grouping.
Assume the statistical results are as follows:
the ratio of the modes in F11 is as follows:
10% of abroad reserved school- > 22% of the applicant application- > 6% of free school- > 30% of rest period- > 20% of school address clearance- > 12% of low grade- >
The ratio of the modes in F12 is as follows:
11 percent of leaving school- >1 percent of the applicant application- >1 percent of the inventor, 29 percent of free school- >25 percent of the rest period, 21 percent of the school address clearance- >21 percent of the school address and 13 percent of the low grade
Given a threshold TH of 5%, the results for F11 and F12 are compared pattern by pattern:

y(x) = |f(x1) − f(x2)| − TH

It is found that for the two value patterns "personal application" and "unauthorized departure", the fluctuation rate in F12 exceeds the threshold (i.e., y(x) > 0). We can therefore preliminarily determine that the data has a consistency problem. Further analysis based on other information (not discussed here) shows that a bug in the most recently upgraded version of the business system caused the problem: when updating status-change data, "personal application" and "unauthorized departure" were assigned the same code, so all records that should have been updated to "personal application" were changed to "unauthorized departure".
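The comparison in this example can be reproduced numerically; the pattern names below are illustrative English renderings of the example's six dictionary values:

```python
# Historical (F11) and current (F12) value-pattern ratios from the example.
f11 = {"study abroad": 0.10, "personal application": 0.22,
       "unauthorized departure": 0.06, "end of leave": 0.30,
       "enrollment cleanup": 0.20, "poor grades": 0.12}
f12 = {"study abroad": 0.11, "personal application": 0.01,
       "unauthorized departure": 0.29, "end of leave": 0.25,
       "enrollment cleanup": 0.21, "poor grades": 0.13}
TH = 0.05  # given threshold

# y(x) = |f(x1) - f(x2)| - TH; patterns with y(x) > 0 are anomalous.
y = {k: abs(f11[k] - f12[k]) - TH for k in f11}
anomalous = sorted(k for k, v in y.items() if v > 0)
# anomalous == ['personal application', 'unauthorized departure']
```

Note that "end of leave" moves by exactly 5 percentage points (30% to 25%) and is not flagged, since only changes strictly greater than the threshold count.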
As illustrated by the above example, the present invention can be used as a method for evaluating data consistency.
The threshold in the invention is set between 3% and 6%; a large number of experiments by the applicant show that a threshold of 5% gives the best data consistency evaluation results.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and the invention is not limited to the specifically described embodiments and examples. Various modifications and alterations to this invention will be apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of its claims.
Claims (4)
1. A data consistency evaluation method based on the data distribution fluctuation rate, characterized in that: first, the data under test is divided into historical data and current data according to a timestamp field; then, the current ratio and the historical ratio of each value pattern in the data under test are computed, and the change in each ratio is compared with a given threshold; if the ratio of some value pattern changes by more than the threshold, the data is considered to have a consistency problem; otherwise, the data is normal.
2. The method according to claim 1, wherein the value-pattern ratio is calculated as:

f(k) = (Σ_{x=k} 1) / (Σ 1)

where Σ_{x=k} 1 counts the number of records whose value equals a given value k, x is the independent variable, k is a data value, and Σ 1 denotes the total number of records for the field.
3. The data consistency evaluation method based on the data distribution fluctuation rate according to claim 1, wherein the change in the ratio is specifically the difference between the current value-pattern ratio of the data under test and its historical ratio.
4. The data consistency evaluation method based on the data distribution fluctuation rate according to claim 1, further comprising, before dividing the data under test into historical data and current data according to the timestamp field: judging whether the data under test is empty; if so, ending the procedure; otherwise, dividing the data under test into historical data and current data according to the timestamp field.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911362810.0A CN111090644A (en) | 2019-12-26 | 2019-12-26 | Data consistency evaluation method based on data distribution fluctuation rate |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111090644A true CN111090644A (en) | 2020-05-01 |
Family
ID=70398241
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104715027A (en) * | 2015-03-04 | 2015-06-17 | 北京京东尚科信息技术有限公司 | Distributed data transaction judging and positioning method and system |
CN107943809A (en) * | 2016-10-13 | 2018-04-20 | 阿里巴巴集团控股有限公司 | Data quality monitoring method, device and big data calculating platform |
CN109241043A (en) * | 2018-08-13 | 2019-01-18 | 蜜小蜂智慧(北京)科技有限公司 | A kind of data quality checking method and device |
CN109872813A (en) * | 2019-01-24 | 2019-06-11 | 广州金域医学检验中心有限公司 | Detection system positive rate appraisal procedure and device, computer readable storage medium |
CN110008201A (en) * | 2019-04-09 | 2019-07-12 | 浩鲸云计算科技股份有限公司 | A kind of quality of data towards big data checks monitoring method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 2020-05-01 |