CN115080156A - Flow-batch-integration-based optimized calculation method and device for big data batch calculation - Google Patents

Flow-batch-integration-based optimized calculation method and device for big data batch calculation Download PDF

Info

Publication number
CN115080156A
CN115080156A (application CN202211012966.8A)
Authority
CN
China
Prior art keywords
data
batch
trigger time
changeflag
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211012966.8A
Other languages
Chinese (zh)
Other versions
CN115080156B (en)
Inventor
李雪峰
杨敏
孙开翠
刘广东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aspire Technologies Shenzhen Ltd
Original Assignee
Aspire Technologies Shenzhen Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aspire Technologies Shenzhen Ltd filed Critical Aspire Technologies Shenzhen Ltd
Priority to CN202211012966.8A priority Critical patent/CN115080156B/en
Publication of CN115080156A publication Critical patent/CN115080156A/en
Application granted granted Critical
Publication of CN115080156B publication Critical patent/CN115080156B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/448Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4482Procedural
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention discloses an optimized calculation method and device, computer equipment, and a storage medium for big data batch computation based on stream-batch integration, wherein the method comprises the following steps: determining a field combination according to the business requirement; grouping the data streams according to the field combination to form a plurality of different grouped data streams; calculating each grouped data stream in a KeyedProcessFunction according to stepwise-decreasing time windows so as to process the data logic; and, when the calculation is completed, outputting the data subset meeting the business requirement. In the embodiments of the application, the cluster resources and computation time of batch computation are optimized through stepwise reduction of the timer interval combined with a repeated data-change heuristic: the data computation can be completed in one pass and the data subsets meeting the business requirement output, without repeated iterative computation, which effectively saves a large amount of computing resources, while the stepwise-decreasing timer intervals effectively reduce the computation time.

Description

Optimized calculation method and device for big data batch calculation based on stream and batch integration
Technical Field
The invention relates to the technical field of big data batch computation, and in particular to an optimized computation method and device, computer equipment, and a storage medium for big data batch computation based on stream-batch integration.
Background
Flink is a computing framework and distributed processing engine that supports both real-time computation over unbounded streaming big data and batch computation over bounded big data.
At present, the standard batch-data calculation process of Flink based on streaming computation is to group the data stream, feed it into a WindowedStream, emit data according to a certain time window, and process the data logic within that time window. Because the input data stream cannot be guaranteed to be in time order, a complete data subset cannot always be fully processed within one window period. For example, when the data in a certain time range of a grouped data stream expires within the window1 period, that data is emitted; if further data of the same group expires within the window2 period, the time-window processing logic is triggered again, so a data stream belonging to one group yields several partially processed data subsets.
If only one complete data subset satisfying a distinct-style uniqueness constraint is needed, the above calculation process must be iterated many times, which results in high computing-resource consumption, long computation time, and an indeterminate number of iterations.
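The fragmentation described above can be illustrated with a small self-contained sketch (plain Java, not Flink code; the fixed tumbling-window model and the event layout are simplifications for illustration only):

```java
import java.util.*;

// Toy illustration: events for ONE key that straddle two fixed time
// windows are emitted as two partial aggregates, so a later merge or
// iteration pass would be needed to obtain one subset per key.
class WindowFragmentation {
    // events[i] = {timestamp, value}; all events share the same key.
    static List<Long> windowedSums(long[][] events, long windowSize) {
        Map<Long, Long> perWindow = new TreeMap<>();
        for (long[] e : events) {
            long window = e[0] / windowSize;          // which window the event falls in
            perWindow.merge(window, e[1], Long::sum); // partial aggregate per window
        }
        return new ArrayList<>(perWindow.values());   // one partial result per window
    }
}
```

Two of the three events land in the first window and one in the second, so the single key produces two partial subsets instead of one.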
Disclosure of Invention
In view of the above, it is necessary to provide an optimized computing method, device, computer equipment, and storage medium for big data batch computation based on stream-batch integration, so as to solve the problems of high computing-resource consumption and long computation time in the prior art.
In a first aspect, an optimized calculation method for big data batch computation based on stream-batch integration is provided, comprising:
determining a field combination according to the business requirement;
grouping the data streams according to the field combination to form a plurality of different grouped data streams;
calculating each of the grouped data streams in a KeyedProcessFunction according to stepwise-decreasing time windows so as to process the data logic;
and, when the calculation is completed, outputting the data subset meeting the business requirement.
In one embodiment, the calculating according to stepwise-decreasing time windows comprises:
using a plurality of variables of Flink state types in the open function to record different calculation states;
processing the data streams of the same field combination through the processElement function, and registering a first trigger time for triggering the onTimer function, so that the onTimer function is triggered when the first trigger time is reached;
within the first trigger time, when the calculation state meets a first preset trigger condition, registering a second trigger time for re-triggering the onTimer function, so that the onTimer function is re-triggered when the second trigger time is reached;
within the second trigger time, when the calculation state meets a second preset trigger condition, registering a third trigger time for re-triggering the onTimer function, so that the onTimer function is re-triggered when the third trigger time is reached;
wherein the first trigger time is greater than the second trigger time, and the second trigger time is greater than the third trigger time.
In one embodiment, the calculation state includes dateState, changeFlag, and idl;
the dateState is used to express the business data, meeting the business requirement, that needs to be emitted downstream;
the changeFlag is used to identify whether, in the data stream of the same field combination, the dateState has changed because new data was received;
the idl is used to identify how many consecutive times the dateState has remained unchanged in the data stream of the same field combination.
In one embodiment, the registering of the first trigger time that triggers the onTimer function includes:
when the dateState is null, calculating and updating the dateState, setting the changeFlag to false, setting the idl to 0, and registering the first trigger time for triggering the onTimer function;
wherein false indicates that the data of the dateState has not changed.
In an embodiment, the processing of the data in the data stream of the same field combination through the processElement function further includes:
when the dateState is not null, calculating and updating the dateState, and setting the changeFlag to true;
wherein true indicates that the data of the dateState has changed.
In an embodiment, the registering of the second trigger time for re-triggering the onTimer function when the first preset trigger condition is met includes:
step a: when the changeFlag is true, resetting the changeFlag to false, resetting the idl to 0, and registering the second trigger time for the first time to re-trigger the onTimer function;
step b: when the changeFlag is true within that second trigger time, resetting the changeFlag to false, resetting the idl to 0, and registering the second trigger time a second time to re-trigger the onTimer function;
step c: when the changeFlag is true within the second trigger time registered in step b, resetting the changeFlag to false, resetting the idl to 0, and registering the second trigger time a third time to re-trigger the onTimer function; and repeating steps b to c until the changeFlag is false within the Nth registered second trigger time.
In an embodiment, the registering of the third trigger time for re-triggering the onTimer function when the second preset trigger condition is met includes:
when the changeFlag is false, determining whether the idl is smaller than a first preset threshold;
if so, keeping the changeFlag at false, incrementing the idl once, and registering the third trigger time for re-triggering the onTimer function.
In an embodiment, after the registering of the third trigger time for re-triggering the onTimer function, the method includes:
when the dateState changes within the third trigger time, resetting the changeFlag to false, resetting the idl to 0, and registering the second trigger time again to re-trigger the onTimer function; or
when the dateState has not changed within the preset third trigger time, determining whether the idl is greater than a second preset threshold;
and if so, outputting the business data and clearing the dateState, changeFlag, and idl.
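The decision made each time the onTimer function fires, as laid out in the embodiments above, can be sketched as a small pure function (a hedged illustration; the enum name, the method name, and the collapsing of the first and second preset thresholds into a single idleLimit are assumptions, not part of the patent):

```java
// Hedged sketch of the per-timer decision: given the current flags,
// decide whether to re-arm T2, probe again with the shorter T3, or
// emit the business data and clear the state.
class TimerDecision {
    enum Action { REARM_T2, PROBE_T3, EMIT_AND_CLEAR }

    static Action decide(boolean changeFlag, int idl, int idleLimit) {
        if (changeFlag) return Action.REARM_T2;      // data changed: wait a full T2
        if (idl < idleLimit) return Action.PROBE_T3; // idle, keep probing at T3
        return Action.EMIT_AND_CLEAR;                // idle long enough: output once
    }
}
```

Because T3 is shorter than T2, the idle probes converge on the emission point quickly while changed data still gets the longer T2 grace period.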
In a second aspect, an optimized computing device for big data batch computation based on stream-batch integration is provided, comprising:
a field combination determining unit, configured to determine the field combination according to the business requirement;
a grouping unit, configured to group the data streams according to the field combination to form a plurality of different grouped data streams;
a computing unit, configured to calculate each of the grouped data streams in the KeyedProcessFunction according to stepwise-decreasing time windows so as to process the data logic;
and a data subset output unit, configured to output the data subset meeting the business requirement when the calculation is completed.
In a third aspect, a computer device is provided, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the steps of the above optimized calculation method for big data batch computation based on stream-batch integration.
In a fourth aspect, a readable storage medium is provided, storing computer-readable instructions which, when executed by a processor, implement the steps of the above optimized calculation method for big data batch computation based on stream-batch integration.
The above optimized calculation method, device, computer equipment, and storage medium for big data batch computation based on stream-batch integration are realized through a method comprising: determining a field combination according to the business requirement; grouping the data streams according to the field combination to form a plurality of different grouped data streams; calculating each grouped data stream in a KeyedProcessFunction according to stepwise-decreasing time windows so as to process the data logic; and, when the calculation is completed, outputting the data subset meeting the business requirement. In the embodiments of the application, the cluster resources and computation time of batch computation are optimized through stepwise reduction of the timer interval combined with a repeated data-change heuristic: the calculation of data in the same group can be completed in one pass and the data subsets meeting the business requirement output, without repeated iterative computation, which effectively saves a large amount of computing resources, while the stepwise-decreasing timer intervals effectively reduce the computation time.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of an optimized calculation method for big data batch computation based on stream-batch integration according to an embodiment of the present invention;
FIG. 2 is a schematic processing flow diagram of an optimized calculation method for big data batch computation based on stream-batch integration according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating an implementation of a KeyedProcessFunction optimization method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an optimized computing device for batch computation of big data based on batch integration of streams according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art from the embodiments herein without creative effort shall fall within the protection scope of the present invention.
In an embodiment, as shown in fig. 1, an optimized computing method for big data batch computation based on stream-batch integration is provided, comprising the following steps:
In step S110, a field combination is determined according to the business requirement;
In the embodiments of the application, the batch big data is obtained through the Flink engine. Business requirements differ across business scenarios such as video viewing, shopping, and sports; the household video-viewing business is taken as an example here. When the business requirement is to cumulatively calculate, within a specific time range, the viewing duration of a user identifier on a service carrier identifier, the user identifier and the service carrier identifier can together be used as the field combination, i.e., the key. When the business requirement is to obtain the earliest and the longest viewing time of a user within a specific time range, the user identifier can be used as the field combination, i.e., the key. When the business requirement is to obtain the daily viewing activity of a user within a specific time range, the user identifier can likewise be used as the field combination, i.e., the key.
In step S120, the data streams are grouped according to the field combination to form a plurality of different grouped data streams;
In the embodiment of the present application, the data stream may be unbounded real-time streaming data or bounded batch data; the method is generally used for bounded data. Once the field combination key is determined, the data stream can be grouped by the hash value of the key.
In the embodiments of the application, the data streams are grouped according to the field combination key. Taking the household business scenario as an example: when the key is the user identifier plus the service carrier identifier, user viewing-behavior data such as the user, service carrier identifier, start viewing time, end viewing time, and device identifier can be divided into the same grouped data stream, so that the viewing duration of the user identifier on the service carrier identifier can be cumulatively calculated within a specific time range from this viewing-behavior data. When the key is the user identifier alone, the viewing-behavior data of a large number of users in the billing month (service carrier identifier, start viewing time, end viewing time, device identifier, and the like) can be divided into the same grouped data stream, so that either the earliest and longest viewing time of each user or the daily viewing activity of each user within a specific time range can be calculated from this viewing-behavior data.
In the embodiment of the present application, the constituent fields of the key are determined by the business requirement; different business requirements may correspond to different keys, and one key may consist of a combination of one or more fields.
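As a hedged illustration of how a grouping key may be assembled from the field combination that a business requirement selects (the field names, the record-as-map layout, and the | separator are hypothetical, chosen only for this sketch):

```java
import java.util.*;

// Build a grouping key from the selected field combination of a record.
class KeyBuilder {
    static String key(Map<String, String> record, List<String> keyFields) {
        StringJoiner j = new StringJoiner("|");
        for (String f : keyFields) j.add(record.get(f)); // concatenate selected fields
        return j.toString();
    }
}
```

Different business requirements simply pass different field lists: cumulative viewing duration per user per carrier uses (userId, carrierId), while per-user statistics use (userId) alone.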
In step S130, the grouped data streams are each calculated in the KeyedProcessFunction according to stepwise-decreasing time windows to process the data logic;
Referring to fig. 2, in the embodiment of the present application, after the DataStream is grouped according to the key, a KeyedStream is formed. The KeyedStream is then input into a KeyedProcessFunction (a keyed processing function), the calculation of each grouped data stream is completed in one pass, and finally the processed DataStream is output when the calculation is completed, i.e., the business data subset meeting the business requirement.
Referring to fig. 3, in the embodiment of the present application, the logic of calculating the grouped data streams according to stepwise-decreasing time windows to process the data includes:
using a plurality of variables of Flink state types in the open function to record different calculation states;
processing the data streams of the same field combination through the processElement function, and registering a first trigger time for triggering the onTimer function, so that the onTimer function is triggered when the first trigger time is reached;
within the first trigger time, when a first preset trigger condition is met, registering a second trigger time for re-triggering the onTimer function, so that the onTimer function is re-triggered when the second trigger time is reached;
within the second trigger time, when a second preset trigger condition is met, registering a third trigger time for re-triggering the onTimer function, so that the onTimer function is re-triggered when the third trigger time is reached;
wherein the first trigger time is greater than the second trigger time, and the second trigger time is greater than the third trigger time.
Each piece of data in the same grouped data stream triggers the processElement function.
Further, the processing of the same grouped data stream through the processElement function includes:
using a plurality of variables of Flink state types in the open function to record different calculation states, the calculation states including dateState, changeFlag, and idl;
the dateState is used to express the business data, meeting the business requirement, that needs to be emitted downstream;
the changeFlag is used to identify whether, in the data stream of the same field combination, the dateState has changed because new data was received;
the idl is used to identify how many consecutive times the dateState has remained unchanged in the data stream of the same field combination.
In the embodiment of the present application, for data streams of the same key, assume the value of the key is k1. When the first piece of data matching k1 arrives, it is processed in the processElement function. At this point the dateState, changeFlag, and idl do not yet hold any meaningful value, so the first piece of data is calculated and assigned to the dateState as an initial value, the changeFlag is set to unchanged, the idl is set to 0, and the first trigger time T1 is registered, notifying Flink that the data of k1 may be emitted downstream after time T1. That is, for the first piece of data matching k1, Flink triggers the onTimer function when time T1 arrives.
In the embodiment of the present application, the first trigger condition may be whether the dateState is updated within the first trigger time, as indicated specifically by the changeFlag. If new data arrives and the dateState is updated within the first trigger time, the changeFlag changes; at this time a second trigger time T2 is registered and Flink is notified to re-trigger the onTimer function after T2. If the changeFlag changes again during the registered second trigger time T2, another second trigger time T2 is registered, and so on until the value of the changeFlag remains unchanged, i.e., no new data arrives and the dateState is not updated.
In the embodiment of the present application, the second trigger condition may be that the value of the changeFlag remains unchanged within the second trigger time, i.e., no new data arrives and the dateState is not updated; a third trigger time T3, with T3 less than T2, may then be registered with Flink, so that the onTimer function is re-triggered through Flink when T3 is reached.
In the embodiment of the application, when no new data arrives within a number of consecutive third trigger times T3, the data of k1 has been fully processed; at this time, the data meeting the business requirement may be emitted downstream to form a data subset. By setting several stepwise-decreasing trigger times and triggering the onTimer function accordingly, the same grouped data stream can be calculated and output in one pass, without emitting the business data in multiple batches or repeating iterative computation, which saves a large amount of computation time and improves computing efficiency.
In an embodiment of the present application, the registering of the first trigger time for triggering the onTimer function includes:
when the dateState is null, calculating and updating the dateState, setting the changeFlag to false, setting the idl to 0, and registering the first trigger time for triggering the onTimer function;
wherein false indicates that the data of the dateState has not changed.
Specifically, for data streams of the same key, assume the value of the key is k1. When the first piece of data matching k1 arrives, it is processed in the processElement function. At this point the dateState, changeFlag, and idl do not yet hold any meaningful value, so the dateState is null; the data meeting the business requirement calculated from this first piece of data is written to the dateState, the data-change flag is set to unchanged (expressed by false), the accumulated count of unchanged checks is set to 0, and the first trigger time for triggering the onTimer function is registered.
In this embodiment of the present application, the processing of the data in the data stream of the same field combination through the processElement function further includes:
when the dateState is not null, calculating and updating the dateState, and setting the changeFlag to true;
wherein true indicates that the data of the dateState has changed.
Specifically, when the second to Nth pieces of data matching k1 arrive, the processElement function is still triggered for each. Because the dateState, changeFlag, and idl already hold values, it is only necessary to perform the cumulative calculation with the second to Nth pieces of data, update the value of the dateState, and set the changeFlag to true to indicate that the dateState has changed.
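A minimal sketch of the two processElement branches just described (plain Java rather than the Flink state API; the running-sum accumulator and the timerT1Armed flag are assumptions introduced for illustration):

```java
// First record initialises the state and arms T1; later records for the
// same key only accumulate and raise the change flag.
class ElementHandler {
    Long dateState = null;     // null until the first record for this key
    boolean changeFlag = false;
    int idl = 0;
    boolean timerT1Armed = false;

    void processElement(long value) {
        if (dateState == null) {     // first record for this key
            dateState = value;       // initial value of the business data
            changeFlag = false;      // false: no change since last timer
            idl = 0;
            timerT1Armed = true;     // register the first trigger time T1
        } else {                     // 2nd..Nth record: accumulate only
            dateState += value;      // cumulative calculation
            changeFlag = true;       // true: dateState has changed
        }
    }
}
```

Note that only the first record arms a timer; the onTimer side, not processElement, is responsible for all later timer registrations.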
In this embodiment of the present application, the registering of the second trigger time for re-triggering the onTimer function when the first preset trigger condition is met includes:
step a: when the changeFlag is true, resetting the changeFlag to false, resetting the idl to 0, and registering the second trigger time for the first time to re-trigger the onTimer function;
step b: when the changeFlag is true within that second trigger time, resetting the changeFlag to false, resetting the idl to 0, and registering the second trigger time a second time to re-trigger the onTimer function;
step c: when the changeFlag is true within the second trigger time registered in step b, resetting the changeFlag to false, resetting the idl to 0, and registering the second trigger time a third time to re-trigger the onTimer function; and repeating steps b to c until the changeFlag is false within the Nth registered second trigger time.
Specifically, by checking the value of the data-change flag: when the changeFlag indicates that several pieces of data matching k1 arrived in succession within the first trigger time and updated the dateState, the data meeting the business requirement has changed and new data may still arrive, so the data of the dateState cannot yet be emitted downstream. The value of the changeFlag is reset to false (no new data since this check), the value of the idl is reset to 0 (no accumulated idle checks), and the second trigger time is registered.
In the embodiment of the present application, when the (N+1)th to (N+m)th pieces of data matching k1 arrive, they are still processed in the processElement function; since the dateState, changeFlag, and idl already hold values, it is only necessary to perform the cumulative calculation with these pieces of data, update the value of the dateState, and set the changeFlag to indicate that the data has changed. When the second trigger time is reached, the onTimer function is re-triggered.
In this embodiment of the present application, when the (N+1)th to (N+m)th pieces of data arrive in succession within the second trigger time, the dateState is updated and the changeFlag indicates a data change, i.e., true; the second trigger time T2 is then registered again, and when it arrives the onTimer function is triggered again. This repeats until no new data arrives within the Nth registered second trigger time (N being a natural number), the value of the changeFlag remains unchanged, and the dateState is not updated; the third trigger time is then registered.
In this embodiment of the present application, the registering of the third trigger time for re-triggering the onTimer function when the second preset trigger condition is met includes:
when the changeFlag is false, determining whether the idl is smaller than a first preset threshold;
if so, keeping the changeFlag at false, incrementing the idl once, and registering the third trigger time for re-triggering the onTimer function.
Specifically, when no new data matching k1 arrives within the second trigger time, the value of the changeFlag remains unchanged and the dateState is not updated. It may then be further determined whether the accumulated count of unchanged checks, idl, is smaller than the first preset threshold. If so, the data-change flag is kept at false (the dateState is unchanged), the accumulated count of unchanged checks is incremented once, and the third trigger time is registered so that the onTimer function is triggered when it is reached.
In this embodiment of the application, if data arrives within the third trigger time, a second trigger time may be registered again, and Flink is notified to re-trigger the onTimer function after that second trigger time.
In an embodiment of the present application, after the registering of the third trigger time for re-triggering the onTimer function, the method includes:
when the dateState changes within the third trigger time, resetting the changeFlag to false, resetting the idl to 0, and registering the second trigger time again for re-triggering the onTimer function; or
when the dateState has not changed within the preset third trigger time, determining whether the idl is greater than a second preset threshold;
and if so, outputting the business data and clearing the dateState, changeFlag, and idl.
Specifically, if data arrives within the third trigger time, a second trigger time may be registered again, and Flink is notified to re-trigger the onTimer function after that second trigger time.
When no new data arrives within the preset third trigger time, it can be determined whether the accumulated count of unchanged checks, idl, is greater than the second preset threshold. When the idl has accumulated to a certain value, no new data has arrived across several third trigger times, which means the data matching k1 has been fully processed, and the data can be emitted downstream.
Further, when no new data arrives within the first third-trigger interval, a second third-trigger interval may be registered; when no new data arrives within that one either, a third interval may be registered, and so on. When a preset number of third-trigger intervals, for example 5, have all passed without new data, the data matching k1 can be considered fully processed and emitted downstream.
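The complete per-key cadence described above, T1 once, T2 while data keeps changing, a bounded number of T3 idle probes, then a single emission, can be simulated end to end in plain Java (a sketch under assumptions: abstract integer ticks instead of wall-clock time, a running sum standing in for the business calculation, and the two preset thresholds collapsed into one IDLE_THRESHOLD):

```java
import java.util.*;

// Simulation of the per-key state machine for one key; not Flink code.
class KeyedBatchSimulator {
    // Tunable timer intervals, T1 > T2 > T3 (abstract "ticks").
    static final long T1 = 10, T2 = 5, T3 = 2;
    static final int IDLE_THRESHOLD = 3; // stand-in for the preset thresholds

    Long dateState = null;       // accumulated business value, null until first record
    boolean changeFlag = false;
    int idl = 0;
    long timerAt = -1;           // next scheduled timer tick, -1 = none
    final List<Long> emitted = new ArrayList<>();

    void processElement(long now, long value) {
        if (dateState == null) {            // first record for this key
            dateState = value;
            changeFlag = false;
            idl = 0;
            timerAt = now + T1;             // register the first trigger time
        } else {                            // subsequent record: accumulate
            dateState += value;
            changeFlag = true;
        }
    }

    void onTimer(long now) {
        if (changeFlag) {                   // data changed: more may still come
            changeFlag = false;
            idl = 0;
            timerAt = now + T2;             // re-check after the shorter T2
        } else if (idl < IDLE_THRESHOLD) {  // idle, but not yet long enough to emit
            idl++;
            timerAt = now + T3;             // probe again after the shortest T3
        } else {                            // idle long enough: emit once, clear state
            emitted.add(dateState);
            dateState = null; changeFlag = false; idl = 0; timerAt = -1;
        }
    }

    // Drive events (tick -> value) through the machine, firing due timers.
    void run(Map<Long, Long> events, long until) {
        for (long tick = 0; tick <= until; tick++) {
            if (timerAt == tick) onTimer(tick);
            Long v = events.get(tick);
            if (v != null) processElement(tick, v);
        }
    }
}
```

Driving one key's out-of-order events through the machine emits exactly one data subset, matching the one-pass property claimed above.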
In the embodiment of the application, for each data stream matching key1, the open function is triggered only once; each piece of data in the data stream matching key1 triggers the processElement function once; for the registered first trigger time, the onTimer function is triggered only once; for registered second trigger times, the onTimer function may be triggered multiple times; and for registered third trigger times, the onTimer function may be triggered a preset number of times, the preset number being at least 2.
In step S140, when the calculation is completed, the data subset meeting the service requirement is output.
In this embodiment of the application, the data subset may include data meeting the service requirement. For example, with the user identifier and the service carrier identifier taken as the key, user viewing-behavior data (such as the user identifier, the service carrier identifier, the viewing start time, the viewing end time, and the device identifier) are divided into a group, and the batch data are input through the interface of the KeyedProcessFunction to complete the cumulative calculation of the viewing duration of each user identifier on each service carrier within a specific time range. When the calculation is completed, the viewing duration of each user on the terminal within each day, each week, each month, and other time ranges may be output.
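A minimal sketch of the grouping and accumulation described above, with hypothetical field values; plain Python dictionaries stand in for Flink's keyBy and keyed state:

```python
from collections import defaultdict

# Hypothetical viewing records: (user_id, carrier_id, start_ts, end_ts)
records = [
    ("u1", "c1", 0, 600),
    ("u1", "c1", 700, 1000),
    ("u2", "c1", 0, 300),
]

# Group by the field combination (user_id, carrier_id) and accumulate the
# viewing duration per group, mirroring keyBy + keyed state accumulation.
durations = defaultdict(int)
for user_id, carrier_id, start_ts, end_ts in records:
    durations[(user_id, carrier_id)] += end_ts - start_ts

print(durations[("u1", "c1")])  # 900
print(durations[("u2", "c1")])  # 300
```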
In the embodiment of the application, for bounded batch-processed big data, with the values of T1, T2 and T3 and their relationship reasonably tuned, most data behaviors are time-ordered, so most data will be processed within time T1. The remaining temporally out-of-order source data will re-trigger the onTimer function at a constant interval T2. If the data does not change — that is, no subsequent data with the same key remains to be processed — and the number of unchanged-data detections has not reached the second preset threshold, the system automatically performs a limited number of further checks, triggering the onTimer function at interval T3; when no temporally out-of-order source data remains, the system automatically emits the calculated data.
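The stepped T1/T2/T3 timer heuristic can be illustrated with a small self-contained simulation. The interval values, the event timeline, and the idle threshold below are illustrative assumptions, and a plain list of pending timers stands in for Flink's timer service:

```python
# Illustrative parameters (assumptions, not values from the patent):
T1, T2, T3 = 100, 10, 2   # stepped-decreasing timer intervals, T1 > T2 > T3
IDLE_LIMIT = 3            # stands in for the "second preset threshold" on idl

class KeyedBatchState:
    """Per-key state mirroring dateState / changeFlag / idl from the text."""
    def __init__(self):
        self.date_state = None    # accumulated service data (dateState)
        self.change_flag = False  # did new data change dateState? (changeFlag)
        self.idl = 0              # consecutive unchanged-timer count (idl)
        self.emitted = None       # data emitted downstream, once complete

    def process_element(self, value, now, timers):
        if self.date_state is None:       # first element for this key
            self.date_state = value
            self.change_flag, self.idl = False, 0
            timers.append(now + T1)       # register the first trigger time
        else:                             # subsequent element: mark a change
            self.date_state += value
            self.change_flag = True

    def on_timer(self, now, timers):
        if self.change_flag:              # data changed: probe again after T2
            self.change_flag, self.idl = False, 0
            timers.append(now + T2)
        elif self.idl < IDLE_LIMIT:       # unchanged: short probe after T3
            self.idl += 1
            timers.append(now + T3)
        else:                             # idle long enough: emit and clear
            self.emitted = self.date_state
            self.date_state, self.idl = None, 0

# Two records for the same key arrive before the first timer fires.
state, timers = KeyedBatchState(), []
for ts, value in [(0, 5), (50, 7)]:
    state.process_element(value, ts, timers)
while timers and state.emitted is None:   # drive the pending timers in order
    state.on_timer(timers.pop(0), timers)
print(state.emitted)  # 12
```

After the change observed at the T1 probe, the simulation falls back to one T2 probe and then a bounded number of cheap T3 probes before emitting, which is the stepped-decrement behavior the text describes.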
The embodiment of the application provides an optimized calculation method for big data batch calculation based on stream-batch integration, comprising: determining a field combination according to the service requirement, and grouping the data streams according to the field combination; calculating the grouped data streams according to a time window that decreases in steps, so as to process the data logic; and outputting the data subset meeting the service requirement when the calculation is completed. In the embodiment of the application, a single window operator optimizes the cluster resources and computing time of batch computing in an incremental manner, through stepped reduction of the processing-time interval and multiple data-change heuristics; the calculation of the data can be completed in one pass, data subsets meeting the service requirements are output, a large amount of computing resources can be saved, the computing time is effectively reduced, and the problem that the number of iterations cannot be determined in iterative computing is solved.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, an optimized computing device for big data batch computation based on stream-batch integration is provided, which corresponds one-to-one with the stream-batch-integration-based optimized calculation method for big data batch computation in the above embodiment. As shown in fig. 4, the device includes a field combination determining unit 10, a grouping unit 20, a calculating unit 30, and a data subset output unit 40. The constituent units are described in detail as follows:
a field combination determining unit 10, configured to determine a field combination according to a service requirement;
a grouping unit 20, configured to group data streams according to the field combination to form multiple different groups of data streams;
a calculating unit 30, configured to calculate the multiple different groups of data streams in the KeyedProcessFunction according to a time window decreasing in steps, so as to process the data logic;
and a data subset output unit 40, configured to output the data subset meeting the service requirement when the calculation is completed.
In an embodiment, the computing unit 30 is further configured to:
using a plurality of variables of Flink state types in the open function to respectively record different calculation states;
processing data streams of the same field combination through a processElement function, and registering first trigger time for triggering the onTimer function so as to trigger the onTimer function when the first trigger time is reached;
registering second trigger time for re-triggering the onTimer function when the calculation state meets a first preset trigger condition within the first trigger time, so as to re-trigger the onTimer function when the second trigger time is reached;
within the second trigger time, when the calculation state meets a second preset trigger condition, registering a third trigger time for re-triggering the onTimer function, so that the onTimer function is re-triggered when the third trigger time is reached;
wherein the first trigger time is greater than the second trigger time, which is greater than the third trigger time.
In one embodiment, the computation state includes dateState, changeFlag, idl;
the dateState is used for representing the service data that meets the service requirement and is to be output downstream;
the changeFlag is used for identifying whether, in the data stream of the same field combination, the dateState has changed as a result of newly received data;
the idl is used for identifying the accumulated number of times the dateState has remained unchanged in the data stream of the same field combination.
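A minimal sketch of these three states as plain Python attributes (in a real Flink job they would be held in ValueState handles initialized in the open function); the class and method names are illustrative assumptions:

```python
class ComputationState:
    """The three per-key computation states described above."""
    def __init__(self):
        # Service data to be output downstream once complete (dateState).
        self.dateState = None
        # Whether newly received data changed dateState since the last
        # onTimer firing (changeFlag).
        self.changeFlag = False
        # Accumulated count of onTimer firings during which dateState
        # remained unchanged (idl).
        self.idl = 0

    def on_new_data(self, merged):
        # A change is only flagged when dateState already held a value;
        # idl is reset/incremented in onTimer, not here.
        self.changeFlag = self.dateState is not None
        self.dateState = merged

s = ComputationState()
s.on_new_data(42)
print(s.dateState, s.changeFlag, s.idl)  # 42 False 0
```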
In an embodiment, the computing unit 30 is further configured to:
when the dateState is a null value, calculating and updating the dateState, updating the changeFlag to false, updating the idl to 0, and registering a first trigger time for triggering the onTimer function;
wherein the false is used to indicate that the data of the dateState has not changed.
In an embodiment, the computing unit 30 is further configured to:
when the dateState is not a null value, calculating and updating the dateState, and updating the changeFlag to true;
wherein the true is used for indicating that the data of the dateState has change.
In an embodiment, the computing unit 30 is further configured to:
step a: when the changeFlag is true, resetting the changeFlag to false, resetting idl to 0, and recording a first second trigger time for re-triggering the onTimer function;
step b: within the first second trigger time, when the changeFlag is true, resetting the changeFlag to false, resetting idl to 0, and recording a second second trigger time for re-triggering the onTimer function;
step c: within the second second trigger time, when the changeFlag is true, resetting the changeFlag to false, resetting idl to 0, while recording a third second trigger time for re-triggering the onTimer function; and repeating steps b to c until, at an Nth second trigger time, the changeFlag is false.
In an embodiment, the computing unit 30 is further configured to:
when the changeFlag is false, determining whether the idl is smaller than a first preset threshold;
if yes, the changeFlag is updated to false, the idl is incremented once, and a third trigger time for re-triggering the onTimer function is recorded.
In an embodiment, the computing unit 30 is further configured to:
when the dateState changes within the third trigger time, resetting the changeFlag to false, resetting idl to 0, and recording a new second trigger time for re-triggering the onTimer function; or
when the dateState has not changed within the preset third trigger time, determining whether idl is greater than a second preset threshold;
and if so, outputting the service data, and clearing the dateState, changeFlag and idl.
In the embodiment of the application, for bounded batch-processed big data, with the values of T1, T2 and T3 and their relationship reasonably tuned, most data behaviors are time-ordered, so most data will be processed within time T1, and the remaining temporally out-of-order source data will re-trigger the onTimer function at a constant interval T2. If the data does not change, that is, no subsequent data with the same key remains to be processed, and the number of unchanged-data detections has not reached the threshold, the system automatically triggers the onTimer function a limited number of times at interval T3; when no temporally out-of-order source data remains, the system automatically emits the calculated data. In this way, a single window operator optimizes the cluster resources and computing time of batch computing in an incremental manner, through stepped reduction of the processing-time interval and multiple data-change heuristics; the calculation of the data can be completed in one pass, data subsets meeting the service requirements are output, a large amount of computing resources can be saved, the computing time is effectively reduced, and the problem that the number of iterations cannot be determined in iterative computing is solved.
For specific limitations of the stream-batch-integration-based optimized computing device for big data batch computation, reference may be made to the above limitations of the stream-batch-integration-based optimized calculation method for big data batch computation; details are not repeated here. Each module in the above device may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in or independent of a processor in the computer device in hardware form, or stored in a memory in the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal device, and whose internal structure may be as shown in fig. 5. The computer device comprises a processor, a memory, and a network interface connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a readable storage medium, which stores computer readable instructions. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer readable instructions, when executed by the processor, implement the stream-batch-integration-based optimized calculation method for big data batch calculation. The readable storage media provided by this embodiment include nonvolatile readable storage media and volatile readable storage media.
A computer device comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer readable instructions, implements the steps of the stream-batch-integration-based optimized calculation method for big data batch calculation described above.
A readable storage medium storing computer readable instructions which, when executed by a processor, implement the steps of the stream-batch-integration-based optimized calculation method for big data batch calculation described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by computer readable instructions instructing relevant hardware; the instructions may be stored in a non-volatile or volatile readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (11)

1. A stream-batch-integration-based optimized calculation method for big data batch calculation, characterized by comprising the following steps:
determining a field combination according to a service requirement;
grouping data streams according to the field combination to form multiple different groups of data streams;
calculating the multiple different groups of data streams in a KeyedProcessFunction according to a time window decreasing in steps, so as to process data logic;
and outputting the data subset meeting the service requirement when the calculation is completed.
2. The stream-batch-integration-based optimized calculation method for big data batch calculation according to claim 1, wherein the calculating according to the time window decreasing in steps comprises:
using a plurality of variables of Flink state types in the open function to respectively record different calculation states;
processing data streams of the same field combination through a processElement function, and registering a first trigger time for triggering an onTimer function, so that the onTimer function is triggered when the first trigger time is reached;
within the first trigger time, when the calculation state meets a first preset trigger condition, registering a second trigger time for re-triggering the onTimer function, so that the onTimer function is re-triggered when the second trigger time is reached;
within the second trigger time, when the calculation state meets a second preset trigger condition, registering a third trigger time for re-triggering the onTimer function, so that the onTimer function is re-triggered when the third trigger time is reached;
wherein the first trigger time is greater than the second trigger time, and the second trigger time is greater than the third trigger time.
3. The stream-batch-integration-based optimized calculation method for big data batch calculation according to claim 2, wherein the calculation state comprises dateState, changeFlag and idl;
the dateState is used for representing the service data that meets the service requirement and is to be output downstream;
the changeFlag is used for identifying whether, in the data stream of the same field combination, the dateState has changed as a result of newly received data;
the idl is used for identifying the accumulated number of times the dateState has remained unchanged in the data stream of the same field combination.
4. The stream-batch-integration-based optimized calculation method for big data batch calculation according to claim 3, wherein the registering a first trigger time for triggering the onTimer function comprises:
when the dateState is a null value, calculating and updating the dateState, updating the changeFlag to false, updating the idl to 0, and registering a first trigger time for triggering the onTimer function;
wherein false is used to indicate that the data of the dateState has not changed.
5. The stream-batch-integration-based optimized calculation method for big data batch calculation according to claim 3, wherein after the processing of the data streams of the same field combination through the processElement function, the method comprises:
when the dateState is not a null value, calculating and updating the dateState, and updating the changeFlag to true;
wherein true is used to indicate that the data of the dateState has changed.
6. The stream-batch-integration-based optimized calculation method for big data batch calculation according to claim 5, wherein the registering a second trigger time for re-triggering the onTimer function when the first preset trigger condition is met comprises:
step a: when the changeFlag is true, resetting the changeFlag to false, resetting idl to 0, and recording a first second trigger time for re-triggering the onTimer function;
step b: within the first second trigger time, when the changeFlag is true, resetting the changeFlag to false, resetting idl to 0, and recording a second second trigger time for re-triggering the onTimer function;
step c: within the second second trigger time, when the changeFlag is true, resetting the changeFlag to false, resetting idl to 0, while recording a third second trigger time for re-triggering the onTimer function; and repeating steps b to c until, at an Nth second trigger time, the changeFlag is false.
7. The stream-batch-integration-based optimized calculation method for big data batch calculation according to claim 6, wherein the registering a third trigger time for re-triggering the onTimer function when the second preset trigger condition is met comprises:
when the changeFlag is false, determining whether the idl is smaller than a first preset threshold;
and if so, updating the changeFlag to false, incrementing the idl once, and recording a third trigger time for re-triggering the onTimer function.
8. The stream-batch-integration-based optimized calculation method for big data batch calculation according to claim 7, wherein after the recording a third trigger time for re-triggering the onTimer function, the method comprises:
when the dateState changes within the third trigger time, resetting the changeFlag to false, resetting idl to 0, and recording a new second trigger time for re-triggering the onTimer function; or
when the dateState has not changed within the preset third trigger time, determining whether idl is greater than a second preset threshold;
and if so, outputting the service data, and clearing the dateState, changeFlag and idl.
9. A stream-batch-integration-based optimized computing device for big data batch computation, characterized in that the device comprises:
a field combination determining unit, configured to determine a field combination according to a service requirement;
a grouping unit, configured to group data streams according to the field combination to form multiple different groups of data streams;
a calculating unit, configured to calculate the multiple different groups of data streams in a KeyedProcessFunction according to a time window decreasing in steps, so as to process data logic;
and a data subset output unit, configured to output the data subset meeting the service requirement when the calculation is completed.
10. A computer device comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer readable instructions, implements the steps of the stream-batch-integration-based optimized calculation method for big data batch calculation according to any one of claims 1 to 8.
11. A readable storage medium storing computer readable instructions, wherein the computer readable instructions, when executed by a processor, implement the steps of the stream-batch-integration-based optimized calculation method for big data batch calculation according to any one of claims 1 to 8.
CN202211012966.8A 2022-08-23 2022-08-23 Flow-batch-integration-based optimized calculation method and device for big data batch calculation Active CN115080156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211012966.8A CN115080156B (en) 2022-08-23 2022-08-23 Flow-batch-integration-based optimized calculation method and device for big data batch calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211012966.8A CN115080156B (en) 2022-08-23 2022-08-23 Flow-batch-integration-based optimized calculation method and device for big data batch calculation

Publications (2)

Publication Number Publication Date
CN115080156A true CN115080156A (en) 2022-09-20
CN115080156B CN115080156B (en) 2022-11-11

Family

ID=83244813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211012966.8A Active CN115080156B (en) 2022-08-23 2022-08-23 Flow-batch-integration-based optimized calculation method and device for big data batch calculation

Country Status (1)

Country Link
CN (1) CN115080156B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116881310A (en) * 2023-09-07 2023-10-13 卓望数码技术(深圳)有限公司 Method and device for calculating set of big data

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017023432A1 (en) * 2015-08-05 2017-02-09 Google Inc. Data flow windowing and triggering
CN110532044A (en) * 2019-08-26 2019-12-03 锐捷网络股份有限公司 A kind of big data batch processing method, device, electronic equipment and storage medium
CN111367951A (en) * 2020-02-29 2020-07-03 深圳前海微众银行股份有限公司 Method and device for processing stream data
CN112000636A (en) * 2020-08-31 2020-11-27 民生科技有限责任公司 User behavior statistical analysis method based on Flink streaming processing
CN113609201A (en) * 2021-08-10 2021-11-05 珍岛信息技术(上海)股份有限公司 Service data processing method and system
CN113742004A (en) * 2020-08-26 2021-12-03 北京沃东天骏信息技术有限公司 Data processing method and device based on flink framework
CN114595047A (en) * 2022-03-03 2022-06-07 京东科技控股股份有限公司 Batch task processing method and device
CN114860846A (en) * 2022-05-30 2022-08-05 深圳前海微众银行股份有限公司 Data processing method and device and electronic equipment


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RESEMBLE_: "Flink ProcessFunction onTimer delayed data processing", 《HTTPS://BLOG.CSDN.NET/QQ_27657429/ARTICLE/DETAILS/105350909》 *
BAI Yuxin et al.: "Research on Application Scenarios of Hadoop and Flink", 《Communications Technology》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116881310A (en) * 2023-09-07 2023-10-13 卓望数码技术(深圳)有限公司 Method and device for calculating set of big data
CN116881310B (en) * 2023-09-07 2023-11-14 卓望数码技术(深圳)有限公司 Method and device for calculating set of big data

Also Published As

Publication number Publication date
CN115080156B (en) 2022-11-11

Similar Documents

Publication Publication Date Title
CN110990138B (en) Resource scheduling method, device, server and storage medium
CN106202280B (en) Information processing method and server
CN115080156B (en) Flow-batch-integration-based optimized calculation method and device for big data batch calculation
CN112434039A (en) Data storage method, device, storage medium and electronic device
CN111262795A (en) Service interface-based current limiting method and device, electronic equipment and storage medium
CN108334633B (en) Data updating method and device, computer equipment and storage medium
CN113342603B (en) Alarm data processing method and device, computer equipment and storage medium
Kuszmaul et al. The multiplicative version of azuma's inequality, with an application to contention analysis
CN112101674B (en) Resource allocation matching method, device, equipment and medium based on group intelligent algorithm
CN107622121B (en) Data analysis method and device based on bitmap data structure
CN110990350A (en) Log analysis method and device
CN114690731A (en) Associated scene recommendation method and device, storage medium and electronic device
CN117675866A (en) Data processing method, device, equipment and medium based on Bayesian inference
CN106294457B (en) Network information pushing method and device
CN111092922B (en) Information sending method and device
CN111126572B (en) Model parameter processing method and device, electronic equipment and storage medium
CN109379605B (en) Bullet screen distribution method, device, equipment and storage medium based on bullet screen sequence
CN106156169B (en) Discrete data processing method and device
CN107391590B (en) Theme list updating method and device, electronic equipment and storage medium
CN109542662B (en) Memory management method, device, server and storage medium
CN109542609B (en) Deduction-based repayment method and device, computer equipment and storage medium
CN114004623A (en) Machine learning method and system
CN111858542A (en) Data processing method, device, equipment and computer readable storage medium
CN117009094B (en) Data oblique scattering method and device, electronic equipment and storage medium
CN112181672A (en) Block chain data processing method, block chain system and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant