CN113342834A - Method for solving historical data change in big data system - Google Patents

Publication number
CN113342834A
Authority
CN
China
Prior art keywords
data
task
business
service
changed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110675041.0A
Other languages
Chinese (zh)
Inventor
张春剑
吴凯
潘自滨
刘水军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Quanshopkeeper Technology Co ltd
Original Assignee
Qingdao Quanshopkeeper Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-06-18
Filing date
2021-06-18
Publication date
2021-09-03
Application filed by Qingdao Quanshopkeeper Technology Co ltd
Priority to CN202110675041.0A
Publication of CN113342834A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for handling historical data changes in a big data system, comprising the following steps: an update time field is set on each business table to mark when the corresponding record was last updated; an operational data store (ODS) task runs once every morning, checks the update time field of the business tables, and generates ODS layer data; a zipper table module compares the latest record of each piece of ODS layer data with the corresponding previous record and, if the record has changed, records the fact that the table changed on that day in a change notification table; after the zipper table task finishes, a task that generates the business daily table runs, checks the change notification table, executes the corresponding task units according to the changed table and the changed date, and generates the business daily table data. The method determines when historical data must be recalculated after it changes and which part must be recalculated, thereby saving a large amount of computing resources.

Description

Method for solving historical data change in big data system
Technical Field
The invention belongs to the technical field of big data processing, and particularly relates to a method for solving historical data change in a big data system.
Background
Big data is a relatively new technology that is widely applied to business scenarios such as reporting, analysis, and prediction. With big data technology, data can be processed quickly and support can be provided to the business along many dimensions. In most scenarios, big data processing only has to turn input data into output data, which downstream services then use. The historical data is usually fixed and does not change, so output data obtained by processing can continue to be used afterwards. A typical scenario is summarizing daily business data to generate a report for presentation. If historical data may be modified, however, the output data must be recalculated. For example, in policy data processing in the insurance field, historical data can change at any time (for example, some information on a historical policy is corrected when it is found to be wrong), so the historical output data must be recalculated. It is then necessary to determine which day and which part of the historical business data needs recalculation; otherwise computing resources are wasted.
Developing a program for the Spark big data computing engine solves the problem of processing very large data volumes. However, such tasks consume a great deal of processor and memory resources, and with fixed cluster resources, the more tasks there are, the longer they take.
A big data task computes and summarizes historical data to obtain the final statistics over the business data, and in most cases historical business data does not change once generated. If historical data does change, the output data has to be regenerated. The more complex the business and the more frequent the data changes, the more complex the regeneration logic becomes.
Imagine a scenario in which 3 years of insurance data must be aggregated by month. If that were the only requirement, the insurance data for each month could be extracted by time and a Spark program written to aggregate it. In actual business, however, the insurance data may be modified (for example, the customer surrenders the policy after several months, the customer asks to amend the policy, or the benefit distribution changes because the insurance policy in effect at the time changed). Statistics by week may also be required later to give the analysis more dimensions.
This becomes difficult because it is not known in advance which day's data has changed, or which part of it. A simple idea is to recompute the statistics for all months every day, but this is practically impossible: with the current cluster's computing power, one full recalculation may take several days, which the schedule does not allow. Expanding the cluster would cost hundreds of thousands or even millions per year, which is not worthwhile.
The present invention addresses the above problems by providing a method for handling historical data changes in a big data system.
Disclosure of Invention
The purpose of the invention is as follows: in view of the problems described in the background, the invention provides a method for handling historical data changes in a big data system. It determines when historical data must be recalculated after it changes and which part must be recalculated, so that a large amount of computing resources is saved while the correctness of the business results is preserved.
To solve these problems, the invention adopts the following technical scheme:
A method for solving historical data change in a big data system, characterized by comprising the following steps:
(1) all business data is stored in a MySQL database; each business table has an update time field that marks when the record was last updated, and whenever the business system updates a database table, the update time field of the affected record is set to the time of that update;
(2) the operational data store (ODS) task runs once every morning and checks the update time fields of all business tables; if a record's update time falls on the previous day, the business table data is copied as-is to generate ODS layer data, so that the ODS in effect stores the historical versions of every record in the original business database;
(3) the zipper table generation task starts after the ODS task has finished; in a task on the Spark big data computing engine, the zipper table module compares the latest record of each piece of ODS layer data with the corresponding previous record, and if a field that the business daily table depends on has changed, the fact that the table changed on that day is recorded in the change notification table;
(4) after the zipper table task finishes, the task that generates the business daily table runs; it checks the change notification table and executes the corresponding task units according to the changed table and the changed date, thereby refreshing the previously generated business daily table data.
Further, in step (2), the ODS layer data is stored in a Hive database.
Further, in step (4), generating a business daily table accommodates the various dimensions of subsequent business queries, including day, month, and year, and the daily table records can flexibly support the upper-layer front-end display.
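Step (2) can be illustrated with a minimal Python sketch. It is not the patented implementation: plain lists of dictionaries stand in for the MySQL business table and the Hive ODS layer, and the names `extract_ods_snapshot`, `update_time`, and `ods_date` are assumptions made for illustration.

```python
from datetime import date, datetime, timedelta


def extract_ods_snapshot(business_rows, run_date):
    """Copy rows whose update_time falls on the day before run_date."""
    previous_day = run_date - timedelta(days=1)
    snapshot = []
    for row in business_rows:
        if row["update_time"].date() == previous_day:
            # Copy the row unchanged and tag it with the day it was captured for.
            snapshot.append({**row, "ods_date": previous_day})
    return snapshot


# Hypothetical business table: only policy 1 was touched on 2021-03-28.
policy_table = [
    {"policy_id": 1, "amount": 1000, "update_time": datetime(2021, 3, 28, 15, 30)},
    {"policy_id": 2, "amount": 500, "update_time": datetime(2021, 3, 1, 9, 0)},
]

ods_rows = extract_ods_snapshot(policy_table, date(2021, 3, 29))
print([row["policy_id"] for row in ods_rows])
```

Because only rows updated on the previous day are copied, the ODS layer grows by one version per changed record per day rather than by a full copy of every table.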
The technical scheme provided by the embodiments of the invention has at least the following beneficial effects:
(1) The invention handles the complex business scenario in which historical data changes frequently. It copes effectively with both business requirements and continual changes to historical data, reduces the amount of computation, and ensures that the final statistics are up to date and correct every day.
(2) With this method, statistics are not computed directly by month, as the final requirement would suggest; instead the business data is split up and aggregated by day.
(3) With this mechanism, the changed business module and date can be located quickly, and only that part of the data is recalculated, which greatly saves computing resources. In addition, because statistics are kept per business module and per day, subsequent changes in requirements can be met more flexibly.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
Fig. 1 is a data flow diagram of a method for resolving historical data changes in a big data system according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The invention provides a method for handling historical data changes in a big data system, for scenarios that use big data technology while the historical data keeps changing. The overall data flow is shown in Fig. 1, and the detailed process is as follows:
1. First, all business data is stored in a MySQL database. Each business table has an update time field that marks when the record was last updated; whenever the business system updates a database table, the update time field of the affected record is set to the current time.
2. ODS stands for Operational Data Store; ODS data can be understood as a version record of the original business data. The ODS layer data generation task runs once every morning. It first checks the update time fields of all relevant business tables; if a record's update time falls on the previous day, the business data is copied as-is to generate ODS data, so that the ODS in effect stores the historical versions of every record in the original business database. ODS data is stored in a Hive database.
Note: Hive is a database in the big data ecosystem. Its underlying layer is essentially file storage, it supports SQL queries similar to those of MySQL, and it is well suited to storing structured, log-type data.
3. The zipper table generation task starts after the ODS task has finished. In a task on the Spark big data computing engine, the zipper table module compares the latest record of each piece of ODS data with the corresponding previous record; if a field that the business daily table depends on has changed, the fact that the table changed on that day is recorded in the change notification table.
4. After the zipper table task finishes, the task that generates the business daily table runs. It checks the change notification table and executes the corresponding task units according to the changed table and the changed date, thereby refreshing the previously generated business daily table data. A daily table is generated because subsequent business queries use various dimensions such as day, month, and year, and the daily table records can flexibly support the upper-layer front-end display.
5. After the daily table is generated, various interfaces can be built on top of it to give front-end data visualization flexible access.
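The reason given in step 4 for keeping statistics at daily granularity can be shown with a short Python sketch: coarser views are derived by summing daily rows, so a corrected day only has to be fixed once. The table layout, the organization name `org_a`, and the function `roll_up` are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily table: one row per (day, organization).
daily_table = [
    {"day": date(2021, 2, 10), "organization": "org_a", "revenue": 1000},
    {"day": date(2021, 2, 11), "organization": "org_a", "revenue": 500},
    {"day": date(2021, 3, 1), "organization": "org_a", "revenue": 300},
]


def roll_up(daily_rows, period_key):
    """Sum daily revenue into coarser periods chosen by period_key(day)."""
    totals = defaultdict(int)
    for row in daily_rows:
        totals[(period_key(row["day"]), row["organization"])] += row["revenue"]
    return dict(totals)


# Monthly view keyed by (year, month); weekly view keyed by ISO (year, week).
monthly = roll_up(daily_table, lambda d: (d.year, d.month))
weekly = roll_up(daily_table, lambda d: tuple(d.isocalendar())[:2])
print(monthly)
print(weekly)
```

A new reporting dimension then only needs a new `period_key`, not a new pass over the raw business data.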
The zipper table generation process essentially records the change information of the ODS layer data in a zipper table. At the same time, the fact that an ODS layer data table changed on a given day is recorded in the change notification table. When the business daily table is generated later, the change notification table is checked first, and the daily table data is then generated from the data of the corresponding zipper table.
Through the design of the zipper table, the daily table, and so on, the system supports changes to historical data well. Each time business data changes, the zipper table generation module stores the change in the change notification table. By checking the change notification table, the business daily table generation module knows which dates need to be regenerated. This saves a great deal of computing resources, because regeneration can be targeted at the data of a specific day instead of regenerating all data.
The process of the invention is described below by way of an example. Because the actual business logic is very complex, the application scenario is simplified here for ease of understanding.
In the statistical scenario, the revenue of each organization is counted. The original business database has two basic tables, a user table and a policy table. The user table records each marketer's user number and the organization the marketer belongs to; the policy table contains information such as the policy amount and the marketer's user number. From these two tables, a daily table of each organization's daily revenue can be derived, and from the daily table each organization's weekly and monthly revenue can be derived in turn. The following illustrates how the daily table is recalculated after historical data changes, assuming the scheduled big data statistics task runs on 2021.3.29.
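A zipper table is commonly built as a table of record versions, each with a validity interval. The following Python sketch shows one way the zipper table and the change notification table could be maintained together. The source does not specify the schema, so the field names (`start_date`, `end_date`, `status`, `underwriting_date`), the far-future end marker, and the choice to log the affected business date are all assumptions.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed marker for "this version is still current"


def apply_snapshot(zipper, notifications, snapshot, snapshot_date,
                   table="policy", key="policy_id",
                   watched=("status", "amount"),
                   business_date_field="underwriting_date"):
    """Fold one day's ODS rows into the zipper table and log affected dates."""
    current = {r[key]: r for r in zipper if r["end_date"] == OPEN_END}
    for row in snapshot:
        old = current.get(row[key])
        if old is not None and all(old[f] == row[f] for f in watched):
            continue  # none of the watched fields changed
        if old is not None:
            old["end_date"] = snapshot_date  # close the previous version
        new_version = {**row, "start_date": snapshot_date, "end_date": OPEN_END}
        zipper.append(new_version)
        current[row[key]] = new_version
        # Record which table changed and which business date is affected.
        notifications.add((table, row[business_date_field]))


zipper_table = [{
    "policy_id": 1, "status": "active", "amount": 1000,
    "underwriting_date": date(2021, 2, 10),
    "start_date": date(2021, 2, 10), "end_date": OPEN_END,
}]
change_notifications = set()
ods_snapshot = [{
    "policy_id": 1, "status": "surrendered", "amount": 1000,
    "underwriting_date": date(2021, 2, 10),
}]

apply_snapshot(zipper_table, change_notifications, ods_snapshot, date(2021, 3, 28))
print(len(zipper_table), change_notifications)
```

Only changes to the watched fields produce a notification, which matches the idea that the daily table is refreshed only when a field it depends on has changed.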
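As a rough illustration of how the daily table follows from the two base tables, the Python sketch below joins policies to organizations through the marketer's user number and sums policy amounts per day. The column names, the sample rows, and the organization names `org_a` and `org_b` are hypothetical, and treating the policy amount as revenue is a simplifying assumption.

```python
from collections import defaultdict
from datetime import date

# Hypothetical user table: marketer user number -> organization.
user_table = [
    {"user_id": "u1", "organization": "org_a"},
    {"user_id": "u2", "organization": "org_b"},
]
# Hypothetical policy table: amount, marketer user number, underwriting date.
policy_table = [
    {"policy_id": 1, "user_id": "u1", "amount": 1000, "underwriting_date": date(2021, 2, 10)},
    {"policy_id": 2, "user_id": "u1", "amount": 500, "underwriting_date": date(2021, 2, 10)},
    {"policy_id": 3, "user_id": "u2", "amount": 300, "underwriting_date": date(2021, 2, 10)},
    {"policy_id": 4, "user_id": "u1", "amount": 200, "underwriting_date": date(2021, 2, 11)},
]


def build_daily_table(users, policies):
    """Join policies to organizations and sum amounts per (day, organization)."""
    organization_of = {u["user_id"]: u["organization"] for u in users}
    totals = defaultdict(int)
    for policy in policies:
        unit = (policy["underwriting_date"], organization_of[policy["user_id"]])
        totals[unit] += policy["amount"]
    return dict(totals)


daily_revenue = build_daily_table(user_table, policy_table)
print(daily_revenue)
```

Each `(day, organization)` key corresponds to one task unit that can later be recomputed on its own.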
Assume that a policy of the Yantai organization with an underwriting date of 2021.2.10 is surrendered on 2021.3.28.
1. The business platform system first modifies that record accordingly and sets its update time to 2021.3.28.
2. When the whole big data task runs on the morning of 2021.3.29, the ODS generation task pulls all business data with an update time of 2021.3.28 into the local database. This includes the surrendered Yantai policy underwritten on 2021.2.10.
3. The zipper table generation task then processes the ODS layer data whose update time is 2021.3.28 (this logic is complex in practice) and records in the change notification table that the policy data of the Yantai organization for 2021.2.10 has changed.
4. The daily table generation task checks the newly generated entries in the change notification table. Having learned that the Yantai organization's data for 2021.2.10 has changed, it runs the task unit that computes the Yantai organization's revenue for 2021.2.10 and recalculates that day's revenue, at which point the Yantai organization's revenue data is up to date.
The example above is basically consistent with the actual flow; the actual requirements are merely simplified so that the whole flow of the big data part is easier to illustrate.
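The final step of the example can be sketched in Python as follows. Only the units listed in the change notification table are recomputed; every other day keeps its stored value. The amounts are invented, the organization is written as `Yantai` on the assumption that this is the place name intended in the example, and counting a surrendered policy as zero revenue is likewise an assumption.

```python
from datetime import date

# Hypothetical policies of one organization after the surrender was recorded.
policies = [
    {"organization": "Yantai", "underwriting_date": date(2021, 2, 10),
     "amount": 1000, "status": "surrendered"},
    {"organization": "Yantai", "underwriting_date": date(2021, 2, 10),
     "amount": 500, "status": "active"},
    {"organization": "Yantai", "underwriting_date": date(2021, 2, 11),
     "amount": 800, "status": "active"},
]
# Daily table as computed before the surrender, and the pending notification.
daily_table = {
    (date(2021, 2, 10), "Yantai"): 1500,
    (date(2021, 2, 11), "Yantai"): 800,
}
change_notifications = {(date(2021, 2, 10), "Yantai")}


def revenue_for(day, organization):
    """Task unit: revenue of one organization on one day (active policies only)."""
    return sum(p["amount"] for p in policies
               if p["underwriting_date"] == day
               and p["organization"] == organization
               and p["status"] == "active")


def refresh_daily_table(table, notifications, compute_unit):
    """Recompute only the (day, organization) units named in the notifications."""
    refreshed = []
    for day, organization in sorted(notifications):
        table[(day, organization)] = compute_unit(day, organization)
        refreshed.append((day, organization))
    return refreshed


refreshed_units = refresh_daily_table(daily_table, change_notifications, revenue_for)
print(refreshed_units, daily_table)
```

In this sketch one unit is recomputed out of two; with three years of daily data per organization, the saving over a full rebuild grows accordingly.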
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Of course, the processor and the storage medium may reside as discrete components in a user terminal.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or the claims is intended to mean a "non-exclusive or".

Claims (3)

1. A method for solving historical data change in a big data system, characterized by comprising the following steps:
(1) all business data is stored in a MySQL database; each business table has an update time field that marks when the record was last updated, and whenever the business system updates a database table, the update time field of the affected record is set to the time of that update;
(2) the operational data store (ODS) layer data table task runs once every morning and first checks the update time fields of all business tables; if a record's update time falls on the previous day, the business table data is copied as-is to generate ODS layer data, so that the ODS layer data table in effect stores the historical versions of every record in the original business database;
(3) the zipper table generation task starts after the ODS layer data table task has finished; in a task on the Spark big data computing engine, the zipper table module compares the latest record of each piece of ODS layer data with the corresponding previous record, and if the record has changed, the fact that the ODS layer data table changed on that day is recorded in the change notification table;
(4) after the zipper table task finishes, the task that generates the business daily table runs; it checks the change notification table and executes the corresponding task units according to the changed table and the changed date, thereby refreshing the previously generated business daily table data.
2. The method for solving historical data change in a big data system as claimed in claim 1, wherein in step (2) the ODS layer data is stored in a Hive database.
3. The method for solving historical data change in a big data system as claimed in claim 1, wherein in step (4) generating a business daily table accommodates the various dimensions of subsequent business queries, including day, month, and year, and the daily table records can flexibly support the upper-layer front-end display.
CN202110675041.0A 2021-06-18 2021-06-18 Method for solving historical data change in big data system Pending CN113342834A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110675041.0A CN113342834A (en) 2021-06-18 2021-06-18 Method for solving historical data change in big data system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110675041.0A CN113342834A (en) 2021-06-18 2021-06-18 Method for solving historical data change in big data system

Publications (1)

Publication Number Publication Date
CN113342834A (en) 2021-09-03

Family

ID=77476175

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110675041.0A Pending CN113342834A (en) 2021-06-18 2021-06-18 Method for solving historical data change in big data system

Country Status (1)

Country Link
CN (1) CN113342834A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102043800A (en) * 2009-10-16 2011-05-04 无锡华润上华半导体有限公司 Data storage realization method and data warehouse
CN105718468A (en) * 2014-12-02 2016-06-29 阿里巴巴集团控股有限公司 Method and device for building ODS layer of data warehouse
CN107193985A (en) * 2017-05-27 2017-09-22 郑州云海信息技术有限公司 A kind of slide fastener table design method of record data change histories
CN107679136A (en) * 2017-09-22 2018-02-09 上海携程商务有限公司 The storage method and storage system of slide fastener table
CN112765135A (en) * 2021-01-29 2021-05-07 北京达佳互联信息技术有限公司 Data processing method and device, electronic equipment and storage medium
CN112817970A (en) * 2021-01-14 2021-05-18 内蒙古蒙商消费金融股份有限公司 Data table generation method and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114840534A (en) * 2022-03-10 2022-08-02 创云融达信息技术(天津)股份有限公司 Data consistency keeping method and device based on inspection service
CN116383228A (en) * 2023-06-05 2023-07-04 建信金融科技有限责任公司 Data processing method, device, computer equipment and storage medium
CN116383228B (en) * 2023-06-05 2023-08-25 建信金融科技有限责任公司 Data processing method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
US6904411B2 (en) Multi-processing financial transaction processing system
US10372492B2 (en) Job-processing systems and methods with inferred dependencies between jobs
CN113342834A (en) Method for solving historical data change in big data system
US20100138391A1 (en) Management method, management program and management apparatus of database
CN110209650A (en) The regular moving method of data, device, computer equipment and storage medium
CN114528127A (en) Data processing method and device, storage medium and electronic equipment
CN110659999A (en) Data processing method and device and electronic equipment
CN114385760A (en) Method and device for real-time synchronization of incremental data, computer equipment and storage medium
CN115185955A (en) Data lake data processing method and system
CN112925835A (en) Data synchronization method and device and server
CN110046172B (en) Online computing data processing method and system
CN116204391A (en) Early warning method and device based on custom configuration
Nguyen et al. An approach towards an event-fed solution for slowly changing dimensions in data warehouses with a detailed case study
CN104317820A (en) Statistical method and device of report
CN115114354A (en) Distributed data storage and query system
CN114511314A (en) Payment account management method and device, computer equipment and storage medium
CN110704488B (en) Method for managing data and corresponding system, computer device and medium
CN101004816A (en) Method, device, and system for maintaining levels of clients
CN111506628A (en) Data processing method and device
KR102620080B1 (en) Method and apparatus for processing order data
CN112559641A (en) Processing method and device of pull chain table, readable storage medium and electronic equipment
US10810662B1 (en) Utilizing time-series tables for tracking accumulating values
US10558647B1 (en) High performance data aggregations
CN115544096B (en) Data query method and device, computer equipment and storage medium
US20230168939A1 (en) Definition and implementation of process sequences

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210903