CN113342834A - Method for solving historical data change in big data system

Info

Publication number: CN113342834A
Application number: CN202110675041.0A
Authority: CN
Inventors: 张春剑; 吴凯; 潘自滨; 刘水军
Original assignee: Qingdao Quanshopkeeper Technology Co ltd
Current assignee: Qingdao Quanshopkeeper Technology Co ltd
Priority date: 2021-06-18
Filing date: 2021-06-18
Publication date: 2021-09-03

Abstract

The invention provides a method for solving historical data change in a big data system, comprising the following steps: an update time field is set for each business table to mark the time at which a record was last updated; an operational data store (ODS) task is executed once every morning, checks the update time field of the business tables, and generates the ODS layer data; a zipper table module compares the latest record of all ODS layer data with the corresponding previous record and, if the latest record has changed, records the fact that the table changed on that day in a change notification table; and after the zipper table task is finished, a task for generating the business daily table is executed, which checks the change notification table, executes the corresponding task units according to the changed table and the changed date, and generates the business daily table data. The method determines when historical data must be recalculated, and which part of it must be recalculated, after the historical data has changed, thereby saving a large amount of computing resources.

Description

Method for solving historical data change in big data system

Technical Field

The invention belongs to the technical field of big data processing, and particularly relates to a method for solving historical data change in a big data system.

Background

Big data is an emerging technology that is widely applied in business scenarios such as reporting, analysis and prediction. With big data technology, data can be processed quickly and support can be provided to the business along every dimension. In most scenarios, big data technology only needs to process the input data to obtain the output data, which subsequent services then use. In general, the historical data in such scenarios is fixed and does not change, so the output data obtained by processing can continue to be used afterwards. A typical scenario is to aggregate daily business data and generate a data report for presentation. However, if the historical data is allowed to be modified, the output data needs to be recalculated. For example, when processing policy data in the insurance field, historical data may change at arbitrary times (e.g., some information of a historical policy is modified when the policy information is found to be incorrect), so the historical output data has to be recalculated. It is then necessary to determine which day and which part of the historical business data needs to be recalculated; otherwise, computing resources are wasted.

Developing a program on the Spark big data computing engine to calculate the data solves the problem of processing very large data volumes. However, such tasks consume a great deal of processor and memory resources, and with fixed cluster resources, the more tasks there are, the longer they take.

A big data task calculates and summarizes historical data to obtain the final statistics of the processed business data, and in most cases the historical business data does not change once it has been generated. If the historical data does change, the data has to be regenerated. The more complex the business and the more frequent the data changes, the more complex the data regeneration logic becomes.

Imagine a scenario in which 3 years of insurance data need to be aggregated by month. If the requirement is simple, the monthly insurance data can be extracted by time, and a program for the Spark big data engine can be written to compute the monthly statistics as required. In actual business, however, the insurance data may be modified (for example, a client may surrender the policy after several months, a client may request a modification of the policy, or the distribution of benefits may change because the insurance policy in force at the time was changed). Statistics by week may also be required later to provide more dimensions for analysis.

This becomes difficult because it is not known which day's data, or which part of it, has changed. A simple idea is to recompute the data for all months every day, but this is essentially impossible: with the current cluster computing power, one recalculation may well take several days, and there is not enough time for that. Expanding the cluster would cost hundreds of thousands or even millions per year, which is not worth the expense.

The present invention aims to solve the above problems and therefore provides a method for solving historical data change in a big data system.

Disclosure of Invention

The purpose of the invention is as follows: in view of the problems described in the background, the invention provides a method for solving historical data change in a big data system. The method determines when the historical data needs to be recalculated, and which part of it needs to be recalculated, after the historical data has changed, so that a large amount of computing resources is saved while the correctness of the business results is preserved.

In order to solve the problems, the technical scheme adopted by the invention is as follows:

A method for solving historical data change in a big data system, characterized by comprising the following steps:

(1) all business data is stored in a MySQL database, and each business table has an update time field that marks when a record was last updated; whenever the business system updates a database table, the update time field of the corresponding record is set to the time of the update;

(2) the operational data store (ODS) task is executed once every morning and checks the update time fields of all business tables; if a record's update time falls on the previous day, the business table data is copied as-is to generate a piece of ODS layer data, so that the ODS in effect stores the historical versions of every record in the original business database;

(3) the zipper table generation task starts after the ODS task has been executed; through a Spark big data computing engine task, the zipper table module compares the latest record of all ODS layer data with the corresponding previous record, and if a field that the business daily table depends on has changed, the fact that the table changed on that day is recorded in a change notification table;

(4) after the zipper table task is finished, the task for generating the business daily table is executed; this task checks the change notification table and executes the corresponding task units according to the changed table and the changed date, thereby refreshing the previously generated business daily table data.
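Step (2) can be illustrated with a minimal Python sketch. It is not the patented implementation: the row layout, the `ods_date` tag and the function name `extract_ods_rows` are assumptions made for illustration, and in-memory lists stand in for the MySQL source and the ODS store.

```python
from datetime import date, datetime, timedelta


def extract_ods_rows(business_rows, run_date):
    """Copy rows whose update_time falls on the day before run_date."""
    previous_day = run_date - timedelta(days=1)
    snapshot = []
    for row in business_rows:
        if row["update_time"].date() == previous_day:
            # Copy the row unchanged and tag it with the load date (assumed
            # column) so successive versions of a record can be told apart.
            snapshot.append({**row, "ods_date": run_date})
    return snapshot


# Hypothetical business table: only policy 1 was updated on 2021-03-28.
policy_table = [
    {"policy_id": 1, "amount": 1000, "update_time": datetime(2021, 3, 28, 15, 30)},
    {"policy_id": 2, "amount": 500, "update_time": datetime(2021, 3, 20, 9, 0)},
]

# The morning run on 2021-03-29 picks up only the rows updated the day before.
ods_rows = extract_ods_rows(policy_table, date(2021, 3, 29))
```

Because each run appends only the rows touched on the previous day, the ODS grows by one version per change rather than by a full copy of every table.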

Further, in step (2), the ODS layer data is stored in a Hive database.

Further, in step (4), a business daily table is generated. The daily table suits the various dimensions of subsequent business queries, including day, month and year, and its records allow flexible support for the upper-layer front-end display.
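The benefit of storing results at day granularity can be shown with a short sketch: coarser dimensions are obtained by summing daily rows, so a corrected day automatically flows into its month and year. The table layout, the organization name `branch_a` and the function `roll_up` are hypothetical, and the figures are made-up illustration data.

```python
from collections import defaultdict
from datetime import date


def roll_up(daily_table, granularity):
    """Sum daily revenue per organization at 'day', 'month' or 'year' level."""
    formats = {"day": "%Y-%m-%d", "month": "%Y-%m", "year": "%Y"}
    fmt = formats[granularity]
    totals = defaultdict(int)
    for row in daily_table:
        totals[(row["org"], row["day"].strftime(fmt))] += row["revenue"]
    return dict(totals)


# Hypothetical daily table for one organization.
daily_table = [
    {"org": "branch_a", "day": date(2021, 2, 10), "revenue": 300},
    {"org": "branch_a", "day": date(2021, 2, 11), "revenue": 200},
    {"org": "branch_a", "day": date(2021, 3, 1), "revenue": 100},
]

monthly = roll_up(daily_table, "month")
yearly = roll_up(daily_table, "year")
```

A weekly view could be added in the same way by grouping on the ISO week of each day, without touching the daily table itself.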

The technical scheme provided by the embodiments of the invention has at least the following beneficial effects:

(1) The invention handles the complex business scenario in which historical data changes frequently. It copes effectively with changing business requirements and continuously changing historical data, reduces the amount of computation, and ensures that the final statistical result is up to date and correct every day.

(2) With this method, statistics are not computed by month directly from the data that the final requirement calls for; instead, the business data is split up and statistics are computed by day.

(3) Through the mechanism of the invention, the business module and date that changed can be located quickly, and only that part of the data is recalculated, which greatly saves computing resources. In addition, because statistics are computed per business module and per day, subsequent changes in requirements can be accommodated more flexibly.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

Fig. 1 is a data flow diagram of a method for solving historical data change in a big data system according to an embodiment of the present invention.

Detailed Description of Embodiments

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The invention provides a method for solving historical data change in a big data system, addressing the problem that historical data may change continuously in scenarios that use big data technology. The overall data flow is shown in Fig. 1, and the detailed process is as follows:

1. First, all business data is stored in the MySQL database, and each business table has an update time field that marks when a record was last updated. Whenever the business system updates a database table, the update time field of the corresponding record is set to the time of the update.

2. ODS stands for Operational Data Store; ODS data can be understood as a version history of the original business data. The task that generates this layer is executed once every morning. It first checks the update time fields of all relevant business tables; if a record's update time falls on the previous day, the business record is copied as-is to generate a piece of ODS data, so that the ODS in effect stores the historical versions of every record in the original business database. ODS data is stored in the Hive database.

Note: Hive is a database in the big data ecosystem. Its underlying layer is essentially file storage, it supports SQL queries similar to those of MySQL, and it is well suited to storing structured, log-type data.

3. The zipper table generation task starts after the ODS task has been executed. Through a Spark big data computing engine task, the zipper table module compares the latest record of all ODS data with the corresponding previous record, and if a field that the business daily table depends on has changed, the fact that the table changed on that day is recorded in the change notification table.

4. After the zipper table task is finished, the task for generating the business daily table is executed. This task checks the change notification table and executes the corresponding task units according to the changed table and the changed date, thereby refreshing the previously generated business daily table data. The daily table is generated because subsequent business queries use several flexible dimensions such as day, month and year, and the daily table records allow flexible support for the upper-layer front-end display.

5. After the daily table is generated, various interfaces can be implemented on top of it, providing flexible interfaces for front-end data visualization.

The zipper table generation process is essentially the process of recording the change information of the ODS layer data in a zipper table. At the same time, the fact that an ODS layer data table changed on a certain day is recorded in the change notification table. When the business daily table is subsequently generated, the change notification table is checked first, and the daily table data is then generated from the data of the corresponding zipper table.

Through the design of the zipper table, the daily table and so on, the system provides better support for changes to historical data. Every time the business data changes, the zipper table generation module stores the change in the change notification table. By checking the change notification table, the business daily table generation module knows which dates need to have their data regenerated. This saves a great deal of computing resources, because regeneration can be narrowed down to specific days instead of regenerating all the data each time.
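The comparison performed by the zipper table module can be sketched in plain Python as follows. This is a simplified illustration rather than the Spark job described above: the record layout, the field names and the function `detect_changes` are assumptions, and a record with only one stored version is skipped for brevity.

```python
from datetime import date


def detect_changes(table_name, ods_versions, key, watched_fields, day_field):
    """Compare each record's latest ODS version with the one before it."""
    by_key = {}
    for row in sorted(ods_versions, key=lambda r: r["ods_date"]):
        by_key.setdefault(row[key], []).append(row)

    notifications = set()
    for versions in by_key.values():
        if len(versions) < 2:
            continue  # simplification: no earlier version to compare against
        previous, latest = versions[-2], versions[-1]
        # Only fields that the daily table depends on trigger a notification.
        if any(previous[f] != latest[f] for f in watched_fields):
            notifications.add((table_name, previous[day_field]))
            notifications.add((table_name, latest[day_field]))
    return notifications


# Hypothetical ODS versions of two policies.
ods_policy = [
    {"policy_id": 1, "status": "active", "amount": 1000,
     "underwriting_date": date(2021, 2, 10), "ods_date": date(2021, 2, 11)},
    {"policy_id": 1, "status": "surrendered", "amount": 1000,
     "underwriting_date": date(2021, 2, 10), "ods_date": date(2021, 3, 29)},
    {"policy_id": 2, "status": "active", "amount": 500,
     "underwriting_date": date(2021, 3, 1), "ods_date": date(2021, 3, 2)},
]

change_notifications = detect_changes(
    "policy", ods_policy, "policy_id", ["status", "amount"], "underwriting_date"
)
```

Restricting the comparison to watched fields keeps edits that do not affect the daily table, such as a corrected contact detail, from triggering a recalculation.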

The process of the present invention is described in detail below by way of an example. Because the actual business logic is very complex, the actual application scenario is simplified here for ease of understanding.

In an actual statistical scenario, the revenue of each organization is to be calculated. The original business data has two basic tables, a user table and a policy table. The user table records the user number of each marketer and the organization to which the marketer belongs, and the policy table contains information such as the policy amount and the marketer's user number. From these two tables, a daily table of the daily revenue of each organization can be obtained. From the daily table, the weekly and monthly revenue of each organization can in turn be obtained. The following illustrates how the daily table is recalculated after historical data changes, taking the scheduled big data statistics task executed on 2021.3.29 as an example.

Suppose that a policy of the Yantai organization, underwritten on 2021.2.10, was surrendered on 2021.3.28.

1. At this point, the business platform system first modifies the record accordingly and sets its update time to 2021.3.28.

2. When the whole big data task is executed on the morning of 2021.3.29, the ODS generation task pulls all business data whose update time is 2021.3.28 into the local database. This includes the Yantai organization's policy that was underwritten on 2021.2.10 and has since been surrendered.

3. The zipper table generation task then processes the ODS layer data whose update time is 2021.3.28 (in practice this part of the logic is complex) and records in the change notification table that the policy data of the Yantai organization for 2021.2.10 has changed.

4. The daily table generation task checks the newly generated entries in the change notification table. On learning that the data of the Yantai organization for 2021.2.10 has changed, it executes the task that computes the revenue of the Yantai organization for 2021.2.10 and recalculates that day's revenue, which brings the revenue data of the Yantai organization up to date.

The above example is basically consistent with the actual flow; the actual requirements are merely simplified so that the whole flow of the big data part can be illustrated more conveniently.
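The final step of the example, recalculating only the affected organization and day, might look like the sketch below. All data values, the second organization `Qingdao`, the `surrendered` flag and the rule that a surrendered policy contributes no revenue are assumptions made for illustration; the organization in the example is rendered here as `Yantai`.

```python
from datetime import date

# Hypothetical user table: marketer user number -> organization.
users = {"u1": "Yantai", "u2": "Qingdao"}

# Hypothetical policy table after the surrender of policy 1 on 2021-03-28.
policies = [
    {"policy_id": 1, "user": "u1", "amount": 1000,
     "underwriting_date": date(2021, 2, 10), "surrendered": True},
    {"policy_id": 2, "user": "u1", "amount": 400,
     "underwriting_date": date(2021, 2, 10), "surrendered": False},
    {"policy_id": 3, "user": "u2", "amount": 700,
     "underwriting_date": date(2021, 2, 10), "surrendered": False},
]

# Daily table as computed before the surrender was known.
daily_table = {
    ("Yantai", date(2021, 2, 10)): 1400,
    ("Qingdao", date(2021, 2, 10)): 700,
    ("Yantai", date(2021, 2, 11)): 250,
}

# Entry written by the zipper table task on the morning of 2021-03-29.
change_notifications = [{"table": "policy", "org": "Yantai", "day": date(2021, 2, 10)}]


def org_revenue(org, day):
    """Revenue of one organization on one day (assumed rule: surrendered = 0)."""
    return sum(
        p["amount"]
        for p in policies
        if users[p["user"]] == org
        and p["underwriting_date"] == day
        and not p["surrendered"]
    )


# Recompute only the (organization, day) cells named in the notifications.
recomputed = []
for note in change_notifications:
    cell = (note["org"], note["day"])
    daily_table[cell] = org_revenue(*cell)
    recomputed.append(cell)
```

Only one cell of the daily table is touched; every other organization and day keeps its previously computed value, which is where the saving in computing resources comes from.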

It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.

In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Of course, the processor and the storage medium may reside as discrete components in a user terminal.

For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.

What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".

Claims

1. A method for solving historical data change in a big data system is characterized by comprising the following steps:

(2) an operational data store (ODS) layer data table task is executed once every morning; it first checks the update time fields of all business tables, and if a record's update time falls on the previous day, the business table data is copied as-is to generate a piece of ODS layer data, so that the ODS layer data table in effect stores the historical versions of every record in the original business database;

(3) a zipper table generation task starts after the ODS layer data table task has been executed; through a Spark big data computing engine task, a zipper table module compares the latest record of all ODS layer data with the corresponding previous record, and if the latest record has changed relative to the previous record, the fact that the ODS layer data table changed on that day is recorded in a change notification table;

2. The method for solving historical data change in a big data system as claimed in claim 1, wherein in step (2), the ODS layer data is stored in a Hive database.

3. The method for solving historical data change in a big data system as claimed in claim 1, wherein in step (4), a business daily table is generated, which suits the various dimensions of subsequent business queries, including day, month and year, and whose records allow flexible support for the upper-layer front-end display.