CN112256734A - Big data processing method, device, system, equipment and storage medium

Info

Publication number: CN112256734A
Application number: CN202011127509.4A
Authority: CN
Inventors: 朱伟伟; 徐烨; 陈萌; 杜锐; 薛飞; 牛佩云; 张子奇; 蒋威
Original assignee: Agricultural Bank of China
Current assignee: Agricultural Bank of China
Priority date: 2020-10-20
Filing date: 2020-10-20
Publication date: 2021-01-22

Abstract

The application discloses a big data processing method, apparatus, system, device, and storage medium. Historical data is counted according to a first business rule to obtain a data statistics result. Real-time streaming data is acquired and processed in real time according to a second business rule to obtain a real-time processing result, which is displayed through a preset window. When a service summarizing task is triggered, the data statistics result and the real-time processing result are summarized according to the business rule indicated by the service summarizing task to obtain a service summarizing result. Compared with the prior art, summarizing the data statistics result and the real-time processing result directly reduces the waste of a large amount of computing resources. Moreover, because the historical data is counted according to the first business rule, the computing burden on the stream computation engine is reduced, the stream computation engine does not have to open an additional window to process the historical data, and the waste of computing resources is reduced.

Description

Big data processing method, device, system, equipment and storage medium

Technical Field

The present application relates to the field of big data technologies, and in particular, to a big data processing method, apparatus, system, device, and storage medium.

Background

As businesses grow, stream computing is used in more and more scenarios, and stream computing engines have developed rapidly. In practice, the computing power of a stream computing engine depends on the computing resources it occupies: the more resources it occupies, the greater its computing power.

Currently, when a stream computing engine performs big data processing over a long window (a conventional term for a data processing window with a long working time, such as a bank's around-the-clock business monitoring window), a large amount of hardware has to be deployed to guarantee the engine's computing power. However, long-window big data processing does not require high-intensity computing power all the time, so the prior art inevitably wastes a large amount of computing resources.

Disclosure of Invention

The applicant found that, in scenarios where a window is used to monitor service data (for example, a monitoring window with a working time of 30 days monitors a service's real-time stream data over those 30 days and computes on it according to a business rule to obtain a calculation result), the various stream computation engines all perform poorly. The main reason is that, when monitoring service data with a long window, the stream computation engine must not only store massive data calculation results but also open a large number of windows to process (for example, clean, convert, and otherwise process) different types of data from different time periods, and opening these additional windows consumes huge computing resources.

Therefore, the present application provides a big data processing method, apparatus, system, device, and storage medium. The aim is an effective big data processing method that, when a stream computation engine performs long-window big data processing, keeps the stream computation engine from opening additional windows to process historical data and reduces the waste of computing resources.

In order to achieve the above object, the present application provides the following technical solutions:

a big data processing system, comprising:

a batch computation engine and a stream computation engine;

the batch computation engine is used for counting historical data according to a first business rule to obtain a data statistics result; the first business rule is a business rule preset for data processing tasks whose data processing duration is greater than a preset threshold;

the batch computation engine is further configured to send the data statistics result to the stream computation engine according to a preset interval time;

the stream computation engine is used for acquiring real-time stream data, processing the real-time stream data in real time according to a second business rule to obtain a real-time processing result, and displaying the real-time processing result through a preset window; the second business rule is a business rule preset for data processing tasks whose data processing duration is not greater than the preset threshold;

the stream computation engine is further configured to, when a service summarizing task is triggered, summarize the data statistics result and the real-time processing result according to the business rule indicated by the service summarizing task to obtain a service summarizing result, and display the service summarizing result through the preset window; the service summarizing task is a task for summarizing the historical data and the real-time streaming data within a preset period time.

Optionally, the system further includes:

a data synchronization engine;

the data synchronization engine is used for sending the historical data pre-stored in the database to the batch computation engine according to the preset interval time.

Optionally, the data synchronization engine is further configured to:

receive the data statistics result sent by the batch computation engine, and send the data statistics result to the stream computation engine.

Optionally, the data synchronization engine is further configured to:

receive the real-time streaming data and store the real-time streaming data into the database.

Optionally, the data synchronization engine is further configured to:

empty the historical data in the database according to the preset interval time.

A big data processing method comprises the following steps:

counting historical data according to a first business rule to obtain a data statistics result; the first business rule is a business rule preset for data processing tasks whose data processing duration is greater than a preset threshold;

acquiring real-time streaming data, processing the real-time streaming data in real time according to a second business rule to obtain a real-time processing result, and displaying the real-time processing result through a preset window; the second business rule is a business rule preset for data processing tasks whose data processing duration is not greater than the preset threshold;

when a service summarizing task is triggered, summarizing the data statistics result and the real-time processing result according to the business rule indicated by the service summarizing task to obtain a service summarizing result, and displaying the service summarizing result through the preset window; the service summarizing task is a task for summarizing the historical data and the real-time streaming data within a preset period time.

Optionally, the method further includes:

storing the real-time streaming data into a preset database;

and emptying the historical data prestored in the database according to the preset interval time.

A big data processing apparatus, comprising:

the statistical unit is used for counting historical data according to the first business rule to obtain a data statistics result; the first business rule is a business rule preset for data processing tasks whose data processing duration is greater than a preset threshold;

the real-time processing unit is used for acquiring real-time streaming data, processing the real-time streaming data in real time according to a second business rule to obtain a real-time processing result, and displaying the real-time processing result through a preset window; the second business rule is a business rule preset for data processing tasks whose data processing duration is not greater than the preset threshold;

the summarizing unit is used for, when a service summarizing task is triggered, summarizing the data statistics result and the real-time processing result according to the business rule indicated by the service summarizing task to obtain a service summarizing result, and displaying the service summarizing result through the preset window; the service summarizing task is a task for summarizing the historical data and the real-time streaming data within a preset period time.

A computer-readable storage medium comprising a stored program, wherein the program executes the big data processing method.

A big data processing device, comprising: a processor, a memory, and a bus; the processor and the memory are connected through the bus;

the memory is used for storing programs, and the processor is used for running the programs, wherein the programs execute the big data processing method during running.

According to the technical solution, historical data is counted according to the first business rule to obtain a data statistics result. The first business rule is a business rule preset for data processing tasks whose data processing duration is greater than a preset threshold. Real-time streaming data is acquired and processed in real time according to a second business rule to obtain a real-time processing result, which is displayed through a preset window. The second business rule is a business rule preset for data processing tasks whose data processing duration is not greater than the preset threshold. When a service summarizing task is triggered, the data statistics result and the real-time processing result are summarized according to the business rule indicated by the service summarizing task to obtain a service summarizing result, which is displayed through the preset window. The service summarizing task is a task for summarizing historical data and real-time streaming data within a preset period time.

When a stream computation engine performs long-window big data processing and a service summarizing task is triggered, the prior art has to open an additional window to summarize the historical data and the real-time stream data. The present application instead summarizes the data statistics result and the real-time processing result directly, which reduces the waste of a large amount of computing resources and markedly improves the processing efficiency of the service summarizing task. Moreover, because the historical data is counted according to the first business rule to obtain the data statistics result, the computing burden on the stream computation engine is reduced, the stream computation engine does not have to open an additional window to process the historical data, and the waste of computing resources is reduced.

Drawings

To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application; a person skilled in the art can derive other drawings from them without creative effort.

FIG. 1a is a block diagram of a big data processing system according to an embodiment of the present disclosure;

FIG. 1b is a schematic diagram of a detailed implementation process of big data processing according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a big data processing method according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a big data processing apparatus according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

FIG. 1a shows an architecture diagram of a big data processing system provided in an embodiment of the present application. The system includes:

a data synchronization engine 100, a batch computation engine 200, and a stream computation engine 300.

It should be noted that the data synchronization engine 100 includes, but is not limited to, treesont, fed, and dbsync engines; the batch computation engine 200 includes, but is not limited to, Spark, Storm, and Flink engines; and the stream computation engine 300 includes, but is not limited to, Spark, Storm, and Flink engines.

The data processing mode adopted by the batch calculation engine 200 is batch calculation, and the data processing mode adopted by the stream calculation engine 300 is stream calculation. Batch computation and streaming computation are each applicable in different big data application scenarios.

Batch computing, as the name implies, is a data computing mode in which data is collected uniformly, stored in a database, and then processed in batches. Stream computing, by contrast, processes data continuously as it arrives. The two differ in the following respects:

1. Data timeliness differs: stream computing is real-time and low-latency, while batch computing is non-real-time and high-latency;

2. Data characteristics differ: streaming data is typically dynamic and unbounded, while batch data is typically static;

3. Application scenarios differ: stream computing is applied to real-time scenarios with high timeliness requirements, such as real-time recommendation and service monitoring, while batch computing is applied to offline scenarios with low real-time requirements, such as data analysis and offline reporting;

4. Operation modes differ: a stream computing task runs continuously, while a batch computing task completes in one run.
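The contrast between the two modes can be illustrated with a minimal Python sketch. The function names `batch_total` and `stream_totals` and the sample amounts are hypothetical and do not come from the application; the sketch only shows that a batch job produces one result after all data has been collected, whereas a stream job produces an updated result for each arriving record.

```python
from typing import Iterable, Iterator, List


def batch_total(stored_amounts: List[float]) -> float:
    """Batch style: the whole (static) data set is available and is summed once."""
    return sum(stored_amounts)


def stream_totals(incoming: Iterable[float]) -> Iterator[float]:
    """Stream style: emit an updated running total as each record arrives."""
    running = 0.0
    for amount in incoming:
        running += amount
        yield running


if __name__ == "__main__":
    amounts = [100.0, 250.0, 50.0]
    print(batch_total(amounts))          # one result after all data is collected
    print(list(stream_totals(amounts)))  # one result per arriving record
```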

FIG. 1b shows the process by which the big data processing system implements big data processing, which includes the following steps:

S101: The data synchronization engine receives the real-time streaming data and stores the real-time streaming data in the database.

The data synchronization engine can receive real-time stream data sent by a preset service system, or acquire the real-time stream data from a cloud. Once real-time data has been stored in the database it is no longer real-time, so the data stored in the database is referred to as historical data.
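Step S101 might be sketched as follows. This is only an illustration under stated assumptions: the application does not name a database or a record layout, so the in-memory SQLite store, the `transfers` table, and the `(customer_id, day, amount, status)` record shape are all hypothetical.

```python
import sqlite3
from typing import Iterable, Tuple

# Hypothetical record shape: (customer_id, day, amount, status)
Record = Tuple[str, str, float, str]


def open_store() -> sqlite3.Connection:
    """Create an in-memory database standing in for the preset database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transfers "
        "(customer_id TEXT, day TEXT, amount REAL, status TEXT)"
    )
    return conn


def store_stream(conn: sqlite3.Connection, records: Iterable[Record]) -> int:
    """S101: persist received stream records; once stored they count as historical data."""
    rows = list(records)
    conn.executemany("INSERT INTO transfers VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

A caller would invoke `store_stream` for each batch of records received from the service system or the cloud source.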

S102: The data synchronization engine sends the historical data in the database to the batch computation engine according to the preset interval time.

After executing S102, the data synchronization engine continues to execute S103.

S103: The data synchronization engine empties the historical data in the database according to the preset interval time.

Because the historical data is not real-time and has already been sent to the batch computation engine, it can be regarded as having no further use value, and removing it from the database effectively saves computing resources. In other words, the data synchronization engine periodically clears historical data that has no use value, which realizes dynamic storage of the data.
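The send-then-empty cycle of S102 and S103 can be sketched as below. The `SyncEngine` class, its method names, and the use of a Python list in place of the database are assumptions made for illustration; the application does not specify how the interval is scheduled, so `flush` is simply meant to be called once per preset interval by some external timer.

```python
from typing import Callable, List


class SyncEngine:
    """Toy data synchronization engine; a list stands in for the database."""

    def __init__(self, send_to_batch: Callable[[List[dict]], None]) -> None:
        self._database: List[dict] = []
        self._send_to_batch = send_to_batch

    def store(self, record: dict) -> None:
        """Keep a received record until the next interval."""
        self._database.append(record)

    def flush(self) -> int:
        """Run once per preset interval: S102 (send), then S103 (empty)."""
        snapshot = list(self._database)
        self._send_to_batch(snapshot)  # S102: hand historical data to the batch engine
        self._database.clear()         # S103: the data has been sent, so drop it
        return len(snapshot)

    def pending(self) -> int:
        """Number of records currently held in the database."""
        return len(self._database)
```

Sending before clearing matters here: if the order were reversed, the batch computation engine would receive nothing.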

S104: The batch computation engine counts the historical data according to the first business rule to obtain a data statistics result.

After executing S104, the batch computation engine proceeds to execute S105.

The first business rule is a business rule preset for data processing tasks whose data processing duration is greater than a preset threshold. For example, assume the first business rule indicates that a customer's total monthly transfer amount is to be counted and, accordingly, the historical data includes the customer's daily transfer amounts. Counting the historical data then includes: cleaning the historical data to remove invalid data (such as failed transfers), and summing the customer's daily transfer amounts for the current month to obtain the data statistics result.
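The monthly transfer example can be sketched in a few lines of Python. The record layout, the `"success"` status value, and the function name `monthly_transfer_totals` are hypothetical; the sketch only mirrors the two stated operations, cleaning out invalid records and summing the remaining amounts for the month.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record shape: (customer_id, "YYYY-MM-DD", amount, status)
Transfer = Tuple[str, str, float, str]


def monthly_transfer_totals(history: Iterable[Transfer], month: str) -> Dict[str, float]:
    """Clean the historical data, then sum each customer's transfers for `month`."""
    totals: Dict[str, float] = defaultdict(float)
    for customer_id, day, amount, status in history:
        if status != "success":          # cleaning: drop failed transfers
            continue
        if not day.startswith(month):    # keep only the requested month ("YYYY-MM")
            continue
        totals[customer_id] += amount
    return dict(totals)
```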

It should be noted that a data processing task whose data processing duration is greater than the preset threshold is not time-sensitive; that is, it does not require real-time calculation. Unlike in the prior art, the stream computation engine does not need to open a new window to handle such tasks: during long-window big data processing with the stream computation engine, the batch computation engine is responsible for counting the historical data, which reduces the computing burden on the stream computation engine and keeps it from wasting computing resources on historical data.

S105: The batch computation engine sends the data statistics result to the data synchronization engine according to the preset interval time.

S106: the data synchronization engine sends the data statistics to the stream computation engine.

S107: The stream computation engine acquires the real-time stream data, processes the real-time stream data in real time according to a second business rule to obtain a real-time processing result, and displays the real-time processing result through a preset window.

The second business rule is a business rule preset for data processing tasks whose data processing duration is not greater than the preset threshold. For example, assume the second business rule indicates that a website's access traffic is to be monitored and, accordingly, the real-time streaming data includes a visitor's IP address, the content the visitor accessed, and the search information the visitor entered. Processing the real-time streaming data in real time then includes: recording the visitor's IP address, extracting keywords from the content the visitor accessed, filtering the search information the visitor entered to retain the character strings that meet the requirements, and displaying the IP address, the keywords, and the character strings through the window.
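A per-event handler for the website traffic example might look like the following sketch. The application does not say how keywords are extracted or which search strings "meet the requirements", so the word-frequency keyword rule, the allowed-character pattern, and the event field names `ip`, `content`, and `search` are all assumptions chosen for illustration.

```python
import re
from collections import Counter
from typing import Dict

_WORD = re.compile(r"[A-Za-z]{4,}")            # assumed: keywords are words of 4+ letters
_ALLOWED_QUERY = re.compile(r"^[\w ]{1,50}$")  # assumed: what counts as an acceptable string


def process_visit(event: Dict[str, str], top_n: int = 3) -> Dict[str, object]:
    """Process one visit record: record the IP, extract keywords, filter the search text."""
    words = [w.lower() for w in _WORD.findall(event["content"])]
    keywords = [word for word, _ in Counter(words).most_common(top_n)]
    query = event["search"].strip()
    kept_query = query if _ALLOWED_QUERY.match(query) else None
    return {"ip": event["ip"], "keywords": keywords, "search": kept_query}
```

The returned dictionary is what a display layer would render in the window; rejected search strings are reported as `None` rather than dropped silently.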

It should be emphasized that S107 runs in parallel with S101; that is, the execution order of S101-S106 does not affect the execution of S107. A data processing task whose data processing duration is not greater than the preset threshold is time-sensitive; that is, it requires real-time calculation.
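The division of work between the two engines follows from comparing a task's data processing duration with the preset threshold. A minimal sketch, assuming a hypothetical `Task` record and a `choose_engine` helper that are not part of the application:

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Task:
    """Hypothetical description of a data processing task."""
    name: str
    duration: timedelta  # the task's data processing duration


def choose_engine(task: Task, threshold: timedelta) -> str:
    """Duration greater than the threshold goes to batch; otherwise to stream."""
    return "batch" if task.duration > threshold else "stream"
```

A task whose duration exactly equals the threshold goes to the stream engine, matching the "not greater than" wording of the second business rule.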

S108: When a service summarizing task is triggered, the stream computation engine summarizes the data statistics result and the real-time processing result according to the business rule indicated by the service summarizing task to obtain a service summarizing result, and displays the service summarizing result through the preset window.

The service summarizing task is a task for summarizing historical data and real-time streaming data within a preset period time; in practice, it can reflect in detail how the service changes over that period. For example, assume the business rule indicated by the service summarizing task is to compute a customer's total transfer amount over 30 days. Correspondingly, the data statistics result contains the customer's total transfer amount over the previous 29 days, and the real-time processing result contains the customer's transfer amount for the current day. Summarizing the two then consists of adding the 29-day total to the current-day amount to obtain the customer's 30-day total transfer amount, which is displayed through the window.
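The 30-day example reduces to adding two precomputed values per customer. A minimal sketch, assuming both results are held as dictionaries keyed by a hypothetical customer identifier:

```python
from typing import Dict


def summarize(batch_result: Dict[str, float],
              realtime_result: Dict[str, float]) -> Dict[str, float]:
    """Add the 29-day batch total and the current-day real-time amount per customer."""
    customers = sorted(set(batch_result) | set(realtime_result))
    return {
        c: batch_result.get(c, 0.0) + realtime_result.get(c, 0.0)
        for c in customers
    }
```

No raw historical records are touched at summary time; only the two already-computed results are combined, which is the source of the resource saving the application describes.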

It should be noted that, under the condition of triggering the service summarizing task, the stream calculation engine only needs to summarize the data statistics result and the real-time processing result according to the service rule indicated by the service summarizing task, and then the service summarizing result can be obtained.

In summary, when the stream computation engine performs long-window big data processing and a service summarizing task is triggered, the prior art has to open an additional window to summarize the historical data and the real-time stream data, whereas the present application summarizes the data statistics result and the real-time processing result directly, which reduces the waste of computing resources. Moreover, because the historical data is counted according to the first business rule to obtain the data statistics result, the computing burden on the stream computation engine is reduced, the stream computation engine does not have to open an additional window to process the historical data, and the waste of computing resources is reduced.

It should be noted that the data synchronization engine mentioned in the foregoing embodiment is an optional functional module of the big data processing system; omitting it does not affect the implementation of the overall big data processing flow. Likewise, storing the real-time streaming data into a preset database and emptying the historical data pre-stored in the database according to the preset interval time are an optional specific implementation of the big data processing flow. The big data processing flow of the above embodiment can therefore be summarized as the method shown in FIG. 2.

FIG. 2 is a schematic diagram of a big data processing method provided in an embodiment of the present application. The method includes the following steps:

S201: Count the historical data according to the first business rule to obtain a data statistics result.

The first business rule is a business rule which is preset aiming at the data processing task with the data processing duration being greater than a preset threshold value.

S202: Acquire real-time streaming data, process the real-time streaming data in real time according to a second business rule to obtain a real-time processing result, and display the real-time processing result through a preset window.

The second business rule is a business rule which is preset aiming at the data processing task with the data processing duration not greater than the preset threshold value.

S203: When a service summarizing task is triggered, summarize the data statistics result and the real-time processing result according to the business rule indicated by the service summarizing task to obtain a service summarizing result, and display the service summarizing result through the preset window.

The service summarizing task is a task for summarizing historical data and real-time stream data within a preset period time.
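Steps S201 to S203 can be tied together in one small sketch. Everything here is illustrative: the `Pipeline` class, its method names, the `(customer_id, amount)` record shape, and the list that stands in for the preset display window are assumptions, and a real deployment would run the two halves on separate batch and stream engines.

```python
from typing import Dict, Iterable, List, Tuple

Transfer = Tuple[str, float]  # hypothetical record shape: (customer_id, amount)


class Pipeline:
    """Toy end-to-end flow for S201-S203; all names are illustrative."""

    def __init__(self) -> None:
        self.batch_result: Dict[str, float] = {}
        self.realtime_result: Dict[str, float] = {}
        self.window: List[str] = []  # stands in for the preset display window

    def run_batch(self, history: Iterable[Transfer]) -> None:
        """S201: count historical data under the first business rule."""
        totals: Dict[str, float] = {}
        for customer_id, amount in history:
            totals[customer_id] = totals.get(customer_id, 0.0) + amount
        self.batch_result = totals

    def on_event(self, transfer: Transfer) -> None:
        """S202: process one real-time record and display the result."""
        customer_id, amount = transfer
        current = self.realtime_result.get(customer_id, 0.0) + amount
        self.realtime_result[customer_id] = current
        self.window.append(f"realtime {customer_id}: {current:.2f}")

    def on_summary_trigger(self) -> Dict[str, float]:
        """S203: combine the two precomputed results and display the summary."""
        customers = sorted(set(self.batch_result) | set(self.realtime_result))
        summary = {
            c: self.batch_result.get(c, 0.0) + self.realtime_result.get(c, 0.0)
            for c in customers
        }
        for customer_id, total in summary.items():
            self.window.append(f"summary {customer_id}: {total:.2f}")
        return summary
```

In this sketch `run_batch` and `on_event` never read each other's state, which reflects the statement that batch counting and real-time processing proceed independently until a summary is triggered.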

Corresponding to the big data processing method provided by the embodiment of the application, the application also provides a big data processing device.

FIG. 3 is a schematic structural diagram of a big data processing apparatus provided in an embodiment of the present application. The apparatus includes:

The statistical unit 301 is configured to count the historical data according to the first business rule to obtain a data statistics result. The first business rule is a business rule preset for data processing tasks whose data processing duration is greater than a preset threshold.

The real-time processing unit 302 is configured to obtain real-time stream data, perform real-time processing on the real-time stream data according to a second service rule to obtain a real-time processing result, and display the real-time processing result through a preset window. The second business rule is a business rule which is preset aiming at the data processing task with the data processing duration not greater than the preset threshold value.

The summarizing unit 303 is configured to summarize the data statistics result and the real-time processing result according to the service rule indicated by the service summarizing task under the condition that the service summarizing task is triggered, obtain a service summarizing result, and display the service summarizing result through a preset window. The service summarizing task is a task for summarizing historical data and real-time streaming data within a preset period time.

The storage unit 304 is configured to store the real-time streaming data in a preset database.

The emptying unit 305 is configured to empty the historical data pre-stored in the database according to a preset interval.

The application also provides a computer readable storage medium, which comprises a stored program, wherein the program executes the big data processing method provided by the application.

The present application also provides a big data processing device, including: a processor, a memory, and a bus. The processor is connected with the memory through the bus, the memory is used for storing programs, and the processor is used for running the programs, wherein the programs, when run, execute the big data processing method provided by the application, namely steps S201 to S203 described above.

Optionally, the method further includes:

storing the real-time streaming data into a preset database;

and emptying the historical data pre-stored in the database according to the preset interval time.

The functions described in the method of the embodiment of the present application, if implemented in the form of software functional units and sold or used as independent products, may be stored in a storage medium readable by a computing device. Based on such understanding, part of the contribution to the prior art of the embodiments of the present application or part of the technical solution may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computing device (which may be a personal computer, a server, a mobile computing device or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A big data processing system, comprising:

a batch computation engine and a stream computation engine;

2. The system of claim 1, further comprising:

a data synchronization engine;

3. The system of claim 2, wherein the data synchronization engine is further configured to:

4. The system of claim 2, wherein the data synchronization engine is further configured to:

5. The system of claim 2, wherein the data synchronization engine is further configured to:

6. A big data processing method is characterized by comprising the following steps:

7. The method of claim 6, further comprising:

storing the real-time streaming data into a preset database;

8. A big data processing apparatus, comprising:

9. A computer-readable storage medium characterized in that the computer-readable storage medium includes a stored program, wherein the program executes the big data processing method according to claims 6 to 7.

10. A big data processing device, comprising: a processor, a memory, and a bus; the processor and the memory are connected through the bus;

the memory is used for storing a program, and the processor is used for running the program, wherein the program runs to execute the big data processing method of claims 6-7.