CN110209685B - Real-time data processing method and system - Google Patents
- Publication number: CN110209685B
- Application number: CN201910507802.4A
- Authority: CN (China)
- Prior art keywords
- time
- data
- data processing
- window
- processing window
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides a real-time data processing method and system, and relates to the technical field of data processing. The data real-time processing method comprises the following steps: dividing the data to be processed in the system based on the data processing window; judging a data processing window based on the system reference time; and when the system reference time is greater than or equal to the end time of the data processing window, calculating and outputting the calculation result of the data in the current data processing window. According to the scheme, the accuracy of calculation can be guaranteed, and a large amount of data aggregation calculation can be realized.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to a real-time data processing method and system.
Background
With the rapid growth of data volumes, big data processing technology has developed rapidly. At present, big data processing platforms can be divided by computation mode into offline computation and real-time computation. As economic and social information becomes increasingly real-time, the demand for real-time data computation keeps rising. For example, real-time computation is needed in risk control systems (such as anti-fraud and compliance checks that judge whether funds are flowing into illegal channels) and in data processing systems for data extraction, transformation and loading (ETL for short).
However, existing real-time computing systems may produce inaccurate results when they process a large amount of data.
Disclosure of Invention
The embodiment of the invention provides a real-time data processing method and system, which aim to solve the problem that existing data processing cannot guarantee accurate calculation results when the amount of data to be processed is large.
In order to solve the above technical problem, an embodiment of the present invention provides a real-time data processing method, including:
dividing the data to be processed in the system based on the data processing window;
judging a data processing window based on the system reference time;
and when the system reference time is greater than or equal to the end time of the data processing window, calculating and outputting the calculation result of the data in the current data processing window.
Specifically, data processing windows are distinguished by time and group the data to be processed; the data time of any data belonging to a data processing window is greater than or equal to the start time of that window and less than its end time.
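The half-open membership rule above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the class name `Window`, the method names `contains` and `offer`, and the use of integer seconds for times are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Window:
    """A data processing window covering the half-open interval [start, end)."""
    start: int  # start time in seconds (inclusive)
    end: int    # end time in seconds (exclusive)
    values: List[float] = field(default_factory=list)

    def contains(self, data_time: int) -> bool:
        # Membership rule: start <= data_time < end.
        return self.start <= data_time < self.end

    def offer(self, data_time: int, value: float) -> bool:
        """Store the value if its data time falls inside this window."""
        if self.contains(data_time):
            self.values.append(value)
            return True
        return False
```

A piece of data whose data time equals the end time is rejected by this window and would belong to the next one.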
Optionally, the data real-time processing method further includes:
and activating the data processing window and calculating the data in the data processing window.
Optionally, the data real-time processing method further includes:
and destroying the data processing window.
Further, destroying the data processing window includes:
when the destruction delay time is equal to zero, destroying the data processing window after the data processing window is activated for the first time; or
when the destruction delay time is not equal to zero and the system reference time is greater than or equal to a first target time, destroying the data processing window;
wherein the first target time is equal to the end time of the data processing window plus the destruction delay time.
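The two destruction conditions can be sketched as a single predicate. This is a minimal illustration, not the patented implementation; the function name `should_destroy` and its parameter names are hypothetical, and all times are assumed to be numbers in the same unit.

```python
def should_destroy(reference_time: float,
                   window_end: float,
                   destroy_delay: float,
                   activated_once: bool) -> bool:
    """Decide whether a data processing window may be destroyed."""
    if destroy_delay == 0:
        # Zero delay: destroy right after the first activation.
        return activated_once
    # Non-zero delay: wait until the reference time reaches
    # the first target time (window end + destruction delay).
    first_target_time = window_end + destroy_delay
    return reference_time >= first_target_time
```

With a non-zero delay the window stays available for data that arrives after its end time, at the cost of holding its contents in memory longer.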
Optionally, before the determining the data processing window based on the system reference time, the method further includes:
updating the system reference time;
and the value of the updated system reference time is greater than or equal to the value of the system reference time before updating.
Further, the updating mode of the system reference time comprises at least one of the following modes:
updating the system reference time once the system receives a new piece of data;
and updating the system reference time according to a preset time interval.
Optionally, the updating the system reference time includes:
determining the first data time as the updated system reference time; or
updating the system reference time according to the first data time and the waiting time;
wherein the first data time is the largest data time among the data received by the system in the data processing window.
Further, the updating the system reference time according to the first data time and the waiting time includes:
according to the formula $A = b - X$, updating the system reference time;
wherein A is the updated system reference time, b is the first data time, and X is the waiting time.
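A minimal sketch of this update rule follows. It combines the subtraction of the waiting time with the requirement stated earlier that the updated system reference time must not be smaller than the previous value. The class name `ReferenceClock` and the method `observe` are hypothetical, and the sketch assumes the first data time is tracked as the largest data time seen so far.

```python
from typing import Optional


class ReferenceClock:
    """Tracks a system reference time that never moves backwards."""

    def __init__(self, waiting_time: int = 0) -> None:
        self.waiting_time = waiting_time
        self.max_data_time: Optional[int] = None   # the "first data time"
        self.reference_time: Optional[int] = None

    def observe(self, data_time: int) -> int:
        """Update on each received record and return the reference time."""
        if self.max_data_time is None or data_time > self.max_data_time:
            self.max_data_time = data_time
        # Reference time = largest data time seen - waiting time.
        candidate = self.max_data_time - self.waiting_time
        if self.reference_time is None or candidate > self.reference_time:
            self.reference_time = candidate
        return self.reference_time
```

A late record (one whose data time is smaller than the largest seen so far) leaves the reference time unchanged, so windows are not closed or reopened by out-of-order arrivals.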
Optionally, the updating manner includes updating the system reference time according to a preset time interval, and the updating the system reference time includes:
updating the system reference time according to the first data time, the first time increment of the operating system and the waiting time;
the first data time is the largest data time among the data received by the system in the data processing window;
the first time increment is the operating system time elapsed between the time the system receives new data and the time the system reference time is updated.
Further, the updating the system reference time according to the first data time, the first time increment of the operating system, and the waiting time includes:
according to the formula $A = b + A_1 - X$, updating the system reference time;
wherein A is the updated system reference time, b is the first data time, $A_1$ is the first time increment of the operating system, and X is the waiting time.
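For the interval-based update, the reference time keeps advancing even when no new data arrives, because the operating system time elapsed since the last reception is added. The following sketch illustrates this; the function and parameter names are hypothetical, and the `previous_reference` argument is an added assumption used to keep the result from moving backwards.

```python
from typing import Optional


def periodic_reference_time(first_data_time: float,
                            last_receive_os_time: float,
                            now_os_time: float,
                            waiting_time: float,
                            previous_reference: Optional[float] = None) -> float:
    """Reference time = first data time + first time increment - waiting time."""
    # First time increment: OS time elapsed between the last reception
    # of new data and this update tick.
    first_increment = now_os_time - last_receive_os_time
    candidate = first_data_time + first_increment - waiting_time
    if previous_reference is not None:
        # Keep the reference time monotonically non-decreasing.
        candidate = max(candidate, previous_reference)
    return candidate
```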
Optionally, the updating method includes updating the system reference time once every time the system receives a new piece of data, and the updating of the system reference time includes:
updating the system reference time according to the data time of the new data received by the system, the second time increment of the operating system and the waiting time;
wherein the second time increment is the operating system time elapsed between the time the system receives the new data and the time the system received the previous piece of data.
Further, the updating the system reference time according to the data time of the new data received by the system, the second time increment of the operating system and the waiting time includes:
determining a second target time according to the data time of the new data received by the system and the second time increment of the operating system;
updating the system reference time according to the second target time and the waiting time;
wherein the second target time is the larger of the data time of the new data received by the system and the second time increment of the operating system.
Specifically, the updating the system reference time according to the second target time and the waiting time includes:
according to the formula $A = B - X$, updating the system reference time;
wherein A is the updated system reference time, B is the second target time, and X is the waiting time.
Specifically, the data time includes at least one of: the time at which the data is generated, the time at which the data enters the system, and the time at which the data is processed by the system.
Optionally, the data real-time processing method further includes:
judging whether the data volume to be processed of the next data processing window is larger than the stored data threshold of the next data processing window;
when the data volume to be processed is larger than the stored data threshold value, splitting the next data processing window into at least two data processing windows;
respectively acquiring calculation results of the data in the at least two split data processing windows;
and acquiring the calculation result of the data in the data processing window before the splitting to which the at least two data processing windows belong according to the calculation result of the data in the at least two data processing windows.
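Recovering the result of the original window from its sub-windows requires that the per-window results can be combined. The sketch below illustrates this for a sum, count and mean aggregation; it is an assumption-laden example (the names `Partial`, `merge` and `aggregate` are hypothetical), and other aggregations would need their own mergeable representation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Partial:
    """Mergeable partial result of one (sub-)window: running sum and count."""
    total: float = 0.0
    count: int = 0

    def merge(self, other: "Partial") -> "Partial":
        # Combining two sub-window results gives the result of their union.
        return Partial(self.total + other.total, self.count + other.count)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def aggregate(values: Iterable[float]) -> Partial:
    """Compute the partial result for the data of one window."""
    items = list(values)
    return Partial(sum(items), len(items))
```

Keeping the count alongside the sum is what allows the mean of the original window to be recovered; averaging the sub-window means directly would be wrong when the sub-windows hold different amounts of data.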
Further, before judging whether the data volume to be processed of the next data processing window is greater than the stored data threshold of the next data processing window, the method further includes:
acquiring the stored data threshold of the next data processing window.
Optionally, acquiring the stored data threshold of the next data processing window includes:
acquiring the memory usage ratio of each piece of data;
acquiring the data volume at which the memory usage ratio of the operating system equals a preset value, and determining that data volume as the stored data threshold of the next data processing window.
Specifically, acquiring the memory usage ratio of each piece of data includes:
acquiring the memory usage ratios of the operating system at the start time and the end time of each of the previous P data processing windows;
acquiring the data volume of each of the previous P data processing windows;
determining the memory usage ratio of each piece of data according to the data volumes and the memory usage ratios of the previous P data processing windows;
wherein P is an integer greater than or equal to 1.
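The text does not give the exact formula for these two steps, so the following is only one plausible sketch: it assumes the per-record memory usage is the total growth in memory usage across the previous P windows divided by the total number of records, and that the threshold is the number of records that fit between a baseline usage and the preset value. All names and both formulas are assumptions.

```python
import math
from typing import Sequence, Tuple

# Each entry: (usage at window start, usage at window end, record count).
WindowStats = Tuple[float, float, int]


def per_record_memory_ratio(windows: Sequence[WindowStats]) -> float:
    """Estimate the memory usage attributable to one record (assumed formula)."""
    total_growth = sum(end - start for start, end, _ in windows)
    total_records = sum(count for _, _, count in windows)
    if total_records == 0:
        raise ValueError("no records in the previous windows")
    return total_growth / total_records


def stored_data_threshold(per_record: float,
                          baseline_ratio: float,
                          preset_ratio: float) -> int:
    """Record count at which usage would reach the preset value (assumed formula)."""
    if per_record <= 0:
        raise ValueError("per-record memory usage must be positive")
    return max(0, math.floor((preset_ratio - baseline_ratio) / per_record))
```

For instance, with usage expressed in percent, two windows that grew usage by 10 and 15 points while holding 40 and 60 records give 0.25 points per record.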
Optionally, before judging whether the data volume to be processed of the next data processing window is greater than the stored data threshold of the next data processing window, the method further includes:
predicting the data volume to be processed of the next data processing window according to the data volume of each of the previous P data processing windows.
Specifically, the prediction mode of the amount of data to be processed in the next data processing window is a linear regression mode.
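As an illustration of the linear regression prediction, the sketch below fits an ordinary least-squares line to the data volumes of the previous P windows, using the window index as the independent variable, and extrapolates one step ahead. The function name is hypothetical, and the choice of the window index as the regressor is an assumption, since the text does not specify the regression inputs.

```python
from typing import Sequence


def predict_next_volume(volumes: Sequence[float]) -> float:
    """Predict the next window's data volume by least-squares linear regression."""
    n = len(volumes)
    if n == 0:
        raise ValueError("at least one previous window is required")
    if n == 1:
        # A single point gives no trend; repeat the last volume.
        return float(volumes[0])
    mean_x = (n - 1) / 2
    mean_y = sum(volumes) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(volumes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Extrapolate to index n (the next window); volumes cannot be negative.
    return max(0.0, slope * n + intercept)
```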
Specifically, the splitting the next data processing window into at least two data processing windows includes:
acquiring the maximum duration time of the split window according to the data volume to be processed and the stored data threshold of the next data processing window;
splitting a next data processing window according to the maximum duration;
and the duration of each split data processing window is less than or equal to the maximum duration.
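The splitting step can be sketched as follows. The text does not state how the maximum duration is computed, so this example assumes data arrives at a uniform rate within the window, which makes the maximum duration equal to the window duration multiplied by the threshold and divided by the predicted volume. The function name and this formula are assumptions.

```python
from typing import List, Tuple


def split_window(start: float, end: float,
                 predicted_volume: int,
                 threshold: int) -> List[Tuple[float, float]]:
    """Split [start, end) into equal sub-windows that each fit the threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if predicted_volume <= threshold:
        return [(start, end)]
    duration = end - start
    # Smallest number of equal parts whose duration does not exceed
    # max_duration = duration * threshold / predicted_volume.
    parts = -(-predicted_volume // threshold)  # integer ceiling division
    step = duration / parts
    bounds = [start + i * step for i in range(parts)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))
```

Splitting a 60-second window with a predicted volume of 250 and a threshold of 100 yields three 20-second sub-windows, each shorter than the 24-second maximum duration implied by the uniform-rate assumption.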
The embodiment of the invention also provides a real-time data processing method, which comprises the following steps:
judging whether the data volume to be processed of the next data processing window is larger than the stored data threshold of the next data processing window;
when the data volume to be processed is larger than the stored data threshold value, splitting the next data processing window into at least two data processing windows;
respectively acquiring calculation results of the data in the at least two split data processing windows;
and acquiring the calculation result of the data in the data processing window before the splitting to which the at least two data processing windows belong according to the calculation result of the data in the at least two data processing windows.
Optionally, before judging whether the data volume to be processed of the next data processing window is greater than the stored data threshold of the next data processing window, the method further includes:
acquiring the stored data threshold of the next data processing window.
Further, acquiring the stored data threshold of the next data processing window includes:
acquiring the memory usage ratio of each piece of data;
acquiring the data volume at which the memory usage ratio of the operating system equals a preset value, and determining that data volume as the stored data threshold of the next data processing window.
Specifically, acquiring the memory usage ratio of each piece of data includes:
acquiring the memory usage ratios of the operating system at the start time and the end time of each of the previous P data processing windows;
acquiring the data volume of each of the previous P data processing windows;
determining the memory usage ratio of each piece of data according to the data volumes and the memory usage ratios of the previous P data processing windows;
wherein P is an integer greater than or equal to 1.
Optionally, before judging whether the data volume to be processed of the next data processing window is greater than the stored data threshold of the next data processing window, the method further includes:
predicting the data volume to be processed of the next data processing window according to the data volume of each of the previous P data processing windows.
Specifically, the prediction mode of the amount of data to be processed in the next data processing window is a linear regression mode.
Optionally, the splitting the next data processing window into at least two data processing windows includes:
acquiring the maximum duration time of the split window according to the data volume to be processed and the stored data threshold of the next data processing window;
splitting a next data processing window according to the maximum duration;
and the duration of each split data processing window is less than or equal to the maximum duration.
Specifically, data processing windows are distinguished by time and group the data to be processed; the data time of any data belonging to a data processing window is greater than or equal to the start time of that window and less than its end time.
An embodiment of the present invention further provides a real-time data processing system, including:
the boundary module is used for dividing the data to be processed in the system based on the data processing window;
the first judgment module is used for judging the data processing window based on the system reference time;
and the calculation module is used for calculating and outputting the calculation result of the data in the current data processing window when the system reference time is greater than or equal to the end time of the data processing window.
Specifically, data processing windows are distinguished by time and group the data to be processed; the data time of any data belonging to a data processing window is greater than or equal to the start time of that window and less than its end time.
Optionally, the data real-time processing system further includes:
and the activation unit is used for activating the data processing window and calculating the data in the data processing window.
Optionally, the data real-time processing system further includes:
and the window processing module is used for destroying the data processing window.
Specifically, the window processing module is configured to:
when the destruction delay time is equal to zero, destroying the data processing window after the data processing window is activated for the first time; or
when the destruction delay time is not equal to zero and the system reference time is greater than or equal to a first target time, destroying the data processing window;
wherein the first target time is equal to the end time of the data processing window plus the destruction delay time.
Optionally, the data real-time processing system further includes:
the updating module is used for updating the system reference time;
and the value of the updated system reference time is greater than or equal to the value of the system reference time before updating.
Specifically, the updating mode of the system reference time comprises at least one of the following modes:
updating the system reference time once the system receives a new piece of data;
and updating the system reference time according to a preset time interval.
Optionally, the update module is configured to:
determining the first data time as the updated system reference time; or
updating the system reference time according to the first data time and the waiting time;
wherein the first data time is the largest data time among the data received by the system in the data processing window.
Specifically, the manner of updating the system reference time according to the first data time and the waiting time is as follows:
according to the formula $A = b - X$, updating the system reference time;
wherein A is the updated system reference time, b is the first data time, and X is the waiting time.
Optionally, the updating manner includes updating the system reference time according to a preset time interval, and the updating module is configured to:
updating the system reference time according to the first data time, the first time increment of the operating system and the waiting time;
the first data time is the largest data time among the data received by the system in the data processing window;
the first time increment is the operating system time elapsed between the time the system receives new data and the time the system reference time is updated.
Specifically, the manner of updating the system reference time according to the first data time, the first time increment of the operating system, and the waiting time is as follows:
according to the formula $A = b + A_1 - X$, updating the system reference time;
wherein A is the updated system reference time, b is the first data time, $A_1$ is the first time increment of the operating system, and X is the waiting time.
Optionally, the updating method includes updating the system reference time once every time the system receives a new piece of data, and the updating module is configured to:
updating the system reference time according to the data time of the new data received by the system, the second time increment of the operating system and the waiting time;
wherein the second time increment is the operating system time elapsed between the time the system receives the new data and the time the system received the previous piece of data.
Further, the update module includes:
a first determining unit, configured to determine a second target time according to the data time of the new data received by the system and the second time increment of the operating system;
an updating unit, configured to update the system reference time according to the second target time and the waiting time;
wherein the second target time is the larger of the data time of the new data received by the system and the second time increment of the operating system.
Specifically, the updating unit is configured to:
according to the formula $A = B - X$, updating the system reference time;
wherein A is the updated system reference time, B is the second target time, and X is the waiting time.
Specifically, the data time includes at least one of: the time at which the data is generated, the time at which the data enters the system, and the time at which the data is processed by the system.
Optionally, the data real-time processing system further includes:
the second judgment module is used for judging whether the data volume to be processed of the next data processing window is larger than the stored data threshold of the next data processing window;
the first splitting module is used for splitting the next data processing window into at least two data processing windows when the data volume to be processed is larger than the stored data threshold;
the first acquisition module is used for respectively acquiring calculation results of the data in the at least two split data processing windows;
and the second obtaining module is used for obtaining the calculation result of the data in the data processing window before the splitting to which the at least two data processing windows belong according to the calculation result of the data in the at least two data processing windows.
Optionally, the real-time data processing system further includes:
a third acquisition module, configured to acquire the stored data threshold of the next data processing window before the second judgment module judges whether the data volume to be processed of the next data processing window is greater than the stored data threshold of the next data processing window.
Further, the third acquisition module includes:
a first acquisition unit, configured to acquire the memory usage ratio of each piece of data;
a second determining unit, configured to acquire the data volume at which the memory usage ratio of the operating system equals a preset value, and to determine that data volume as the stored data threshold of the next data processing window.
Specifically, the first acquisition unit includes:
a first acquisition subunit, configured to acquire the memory usage ratios of the operating system at the start time and the end time of each of the previous P data processing windows;
a second acquisition subunit, configured to acquire the data volume of each of the previous P data processing windows;
a first determining subunit, configured to determine the memory usage ratio of each piece of data according to the data volumes and the memory usage ratios of the previous P data processing windows;
wherein P is an integer greater than or equal to 1.
Optionally, the real-time data processing system further includes:
a first prediction module, configured to predict, before the second judgment module makes its judgment, the data volume to be processed of the next data processing window according to the data volume of each of the previous P data processing windows.
Specifically, the prediction mode of the amount of data to be processed in the next data processing window is a linear regression mode.
Optionally, the first splitting module includes:
the second acquisition unit is used for acquiring the maximum duration time of the split window according to the data volume to be processed and the stored data threshold value of the next data processing window;
the first splitting unit is used for splitting the next data processing window according to the maximum duration;
and the duration of each split data processing window is less than or equal to the maximum duration.
An embodiment of the present invention further provides a real-time data processing system, including:
the third judging module is used for judging whether the data volume to be processed of the next data processing window is larger than the stored data threshold value of the next data processing window;
the second splitting module is used for splitting the next data processing window into at least two data processing windows when the data volume to be processed is larger than the stored data threshold;
the fourth acquisition module is used for respectively acquiring calculation results of the data in the at least two split data processing windows;
and the fifth obtaining module is used for obtaining the calculation result of the data in the data processing window before the splitting to which the at least two data processing windows belong according to the calculation result of the data in the at least two data processing windows.
Optionally, the real-time data processing system further includes:
a sixth acquisition module, configured to acquire the stored data threshold of the next data processing window before the third judgment module judges whether the data volume to be processed of the next data processing window is greater than the stored data threshold of the next data processing window.
Further, the sixth acquisition module includes:
a third acquisition unit, configured to acquire the memory usage ratio of each piece of data;
a third determining unit, configured to acquire the data volume at which the memory usage ratio of the operating system equals a preset value, and to determine that data volume as the stored data threshold of the next data processing window.
Specifically, the third acquisition unit includes:
a third acquisition subunit, configured to acquire the memory usage ratios of the operating system at the start time and the end time of each of the previous P data processing windows;
a fourth acquisition subunit, configured to acquire the data volume of each of the previous P data processing windows;
a second determining subunit, configured to determine the memory usage ratio of each piece of data according to the data volumes and the memory usage ratios of the previous P data processing windows;
wherein P is an integer greater than or equal to 1.
Optionally, the real-time data processing system further includes:
a second prediction module, configured to predict, before the third judgment module makes its judgment, the data volume to be processed of the next data processing window according to the data volume of each of the previous P data processing windows.
Specifically, the prediction mode of the amount of data to be processed in the next data processing window is a linear regression mode.
Optionally, the second splitting module includes:
a fourth obtaining unit, configured to obtain a maximum duration of the split window according to a to-be-processed data amount and a stored data threshold of a next data processing window;
the second splitting unit is used for splitting the next data processing window according to the maximum duration;
and the duration of each split data processing window is less than or equal to the maximum duration.
Specifically, data processing windows are distinguished by time and group the data to be processed; the data time of any data belonging to a data processing window is greater than or equal to the start time of that window and less than its end time.
The embodiment of the invention also provides a data real-time processing system, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor; the processor implements the steps of the data real-time processing method when executing the computer program.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the data real-time processing method.
The invention has the beneficial effects that:
In the above scheme, judging the data processing window based on the system reference time solves the problem of out-of-order data within a data processing window, and adaptively splitting the data processing window avoids memory overflow. The scheme therefore guarantees the accuracy of calculation and enables aggregation calculations over large amounts of data.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a real-time data processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of window division;
FIG. 3 is a second flowchart of a real-time data processing method according to an embodiment of the present invention;
FIG. 4 is a block diagram of a real-time data processing system according to an embodiment of the present invention;
FIG. 5 is a second block diagram of a real-time data processing system according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, which is a flowchart of a real-time data processing method according to an embodiment of the present invention, the method is applied to a real-time data processing system and includes:
It should be noted that, in stream computing, data is infinite and unbounded, and each piece of data is processed in real time; in other words, the system processes each piece of data as soon as it is received (activating the computing logic once). To perform an aggregation over a batch of data, for example calculating the sum of a batch, the start and end boundaries of the batch, that is, its first and last pieces of data, must first be found within the infinite, unbounded data. In the embodiment of the present invention, a data processing window (hereinafter referred to as a Window) is used for this demarcation. Specifically, data processing windows are distinguished by time and group the data to be processed; the data time of any data belonging to a data processing window is greater than or equal to the start time of that window and less than its end time. The data time includes at least one of the following: the time at which the data is generated, the time at which the data enters the system, and the time at which the data is processed by the system.
The data processing window is configured in advance by the system or by a user, and the main configuration parameters include at least one of the following: the window size, the time interval at which windows are created, and the offset of the start time of the first window (specifically, this may be an offset from Unix timestamp 0, that is, from 00:00:00 on January 1, 1970). The window size is the duration of the window, that is, the end time of the window minus the start time of the window.
For example, as shown in FIG. 2, Windows convert the infinite data stream into an unbounded sequence of finite, bounded data sets, one batch per Window. A Window internally stores a batch of data (i.e., the data belonging to the Window; each time data belonging to the Window is received, it is stored in the Window). Windows are distinguished by time: if the start time of a Window is 00:01:00 and its end time is 00:01:05, data whose data time falls within this interval (start inclusive, end exclusive) belongs to the Window. Data is grouped by key, the key being the value of group; the value of group together with the start and end times of the Window defines a Window, as in the following example:
The data is grouped according to the corresponding group values and time intervals (which define the Windows), and the aggregation operation is performed within each group. Taking the 10 pieces of data in Table 1 as an example, a specific grouping method is described below.
TABLE 1 data sheet
id | group | time | value |
0 | A | 00:05:00 | 1 |
1 | B | 00:05:01 | 1 |
2 | B | 00:05:02 | 1 |
3 | A | 00:05:03 | 1 |
4 | A | 00:05:04 | 1 |
5 | B | 00:05:05 | 1 |
6 | A | 00:05:06 | 1 |
7 | A | 00:05:07 | 1 |
8 | A | 00:05:08 | 1 |
9 | A | 00:05:09 | 1 |
If grouping starts at 00:05:00 with an interval of 5 seconds and uses the group value as the key, the grouping result (i.e., the defined Windows) is:
Group:A,Time:[00:05:00,00:05:05),data:[id:0,id:3,id:4];
Group:B,Time:[00:05:00,00:05:05),data:[id:1,id:2];
Group:A,Time:[00:05:05,00:05:10),data:[id:6,id:7,id:8,id:9];
Group:B,Time:[00:05:05,00:05:10),data:[id:5].
On the basis of this grouping, an aggregation operation is performed, e.g., summing the values of the value field:
Group:A,Time:[00:05:00,00:05:05),data:[id:0,id:3,id:4],Sum(value):3
Group:B,Time:[00:05:00,00:05:05),data:[id:1,id:2],Sum(value):2
Group:A,Time:[00:05:05,00:05:10),data:[id:6,id:7,id:8,id:9],Sum(value):4
Group:B,Time:[00:05:05,00:05:10),data:[id:5],Sum(value):1.
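As a non-limiting illustration, the grouping and summation of the Table 1 data may be sketched as follows. The sketch assumes that the window creation interval equals the window size (5 seconds) and that the offset is 0; the names RECORDS, window_start and sum_by_window are hypothetical and are not part of the embodiment.

```python
from collections import defaultdict

WINDOW_SIZE_S = 5  # window size in seconds (creation interval assumed equal to the size)
OFFSET_S = 0       # assumed offset of the first window start from Unix timestamp 0

# (id, group, time, value) rows of Table 1
RECORDS = [
    (0, "A", "00:05:00", 1), (1, "B", "00:05:01", 1),
    (2, "B", "00:05:02", 1), (3, "A", "00:05:03", 1),
    (4, "A", "00:05:04", 1), (5, "B", "00:05:05", 1),
    (6, "A", "00:05:06", 1), (7, "A", "00:05:07", 1),
    (8, "A", "00:05:08", 1), (9, "A", "00:05:09", 1),
]


def to_seconds(hms):
    """Convert 'HH:MM:SS' to seconds since 00:00:00."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def to_hms(seconds):
    """Convert seconds since 00:00:00 back to 'HH:MM:SS'."""
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"


def window_start(t):
    """Start of the window [start, start + size) that contains data time t."""
    return t - (t - OFFSET_S) % WINDOW_SIZE_S


def sum_by_window(records):
    """Sum the value field per (group, window start, window end)."""
    sums = defaultdict(int)
    for _id, group, hms, value in records:
        start = window_start(to_seconds(hms))
        key = (group, to_hms(start), to_hms(start + WINDOW_SIZE_S))
        sums[key] += value
    return dict(sums)


if __name__ == "__main__":
    for key, total in sorted(sum_by_window(RECORDS).items()):
        print(key, total)
```

Because each window is half-open, the record with id 5 (time 00:05:05) falls into the second window of group B rather than the first.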
Step 13: when the system reference time is greater than or equal to the end time of the data processing window, calculating and outputting the calculation result of the data in the current data processing window.
It should be noted that the data time is an attribute of each piece of data: each piece of data carries a timestamp corresponding to its data time. The time at which the data is generated (Event Time) is assigned by the system that generates the data. The time at which the data enters the system (Ingest Time) may specifically refer to the time at which the data enters the data source operator, and the time at which the data is processed by the system (Processing Time) may specifically refer to the time at which the data enters the data processing operator; both the Ingest Time and the Processing Time are assigned by the RT system of the present invention. It should be noted that the system in the present invention refers to a real-time data processing (RT) system, specifically a streaming data real-time processing system, i.e., a stream computing system, and an operator refers to an operator of the RT system. The Event Time represents the time at which the data was generated, and the system generating the data should generate a corresponding timestamp for each piece of data. The Ingest Time and the Processing Time are taken from the clock of the machine on which the system runs; for example, the Ingest Time is the current (real) time of the operating system of the machine running the RT system at the moment the data enters the data source operator.
It should be noted that, according to the above scheme, order preserving processing of data in the data processing window can be realized, and the problem of inaccurate calculation caused by data disorder is avoided.
Specifically, out-of-order means that the chronological order in which data is generated does not coincide with the chronological order in which data enters the system and/or the chronological order in which data is processed. Because network transmission delays differ, disorder may occur while the system is receiving data. Disorder may also occur inside the RT system, owing to different transmission speeds among the devices of a distributed system, different transmission speeds among internal components of the system, different processing speeds of different computing nodes (generally located on different devices), different processing speeds of different threads on the same computing node, and the like. In some service scenarios, disorder causes wrong calculation results: these scenarios depend on order preservation, and the result is correct only when the data is processed in order. For example, in a banking scenario, A and B are two deposits of 500,000 each and C is a withdrawal of 1,000,000; the order A, B, C is legal, whereas the order A, C, B would falsely trigger an alarm. As another example, in a statistical analysis scenario, the sum of the data in a certain time period is calculated, and the aggregation operation is performed and the result output when data with a certain timestamp arrives; if disorder exists, data that arrives later than the data with that timestamp is left out of the calculation. Order-preserving processing is therefore required to solve the disorder problem. The current technical difficulty of order-preserving processing in the real-time processing of streaming data in a distributed system lies in guaranteeing, through a series of steps and algorithms, the order in which events are processed when complex event processing runs in the system.
It should be noted that, after the data to be processed is demarcated by data processing Windows, the calculation result can be obtained in two ways. The first way is to set a temporary (temp) value: each time data belonging to the data processing Window is received, it is accumulated into the temp value, so that the temp value is updated in real time as the data stream is generated (i.e., one calculation per piece of data); when the Window boundary is reached, the temp value is output and then reset. The second way is to set a storage region: each time data belonging to the data processing Window is received, it is stored in this region, and after the last piece of data belonging to the Window has been received, the final result is calculated directly. The embodiment of the present invention adopts the second way, that is, the data is first stored and then calculated in batch, with output each time the Window boundary is reached. When the system reference time (i.e., the water line time, Watermark) is greater than or equal to the end time of a Window, the system determines that the Window boundary has been reached, activates the current Window, and performs calculation and output.
It should be noted that, after the data processing window is created and after the operation is completed, in order to timely release the memory of the system, the embodiment of the present invention further includes: and destroying the data processing window.
The method can be realized in at least one of the following ways:
in a first mode, when the destroying delay time is equal to zero, after a data processing window is activated for the first time, the data processing window is destroyed;
in a second mode, when the destroying delay time is not equal to zero and the system reference time is greater than or equal to the first target time, the data processing window is destroyed;
specifically, the first target time is equal to the end time of the data processing window plus the destruction delay time.
It should be noted that destroying the data processing window includes clearing the data stored in the window. The destruction behavior is controlled by the destruction delay time (AllowedLateness): when AllowedLateness is 0, the window is destroyed directly after it is activated for the first time; when AllowedLateness is not 0, the window is destroyed when the system reference time is greater than or equal to the end time of the data processing window plus the destruction delay time. The unit of AllowedLateness is generally milliseconds, e.g., N milliseconds, where N is an integer greater than 0. After the window is destroyed, all data of the window is destroyed, and no further operation can be performed on it. AllowedLateness implements the function of waiting a specific time for data that arrives late, so that some out-of-order data can still be processed, making the calculation more accurate.
It should be further noted that a window must be created before it can be destroyed; a window is created when the first piece of data belonging to it is received. It should be further noted that a window can be destroyed only after it has been activated. Activating the data processing window means activating the calculation of the data processing window, i.e., activating the window and calculating the data in it.
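As a non-limiting illustration, the creation, activation and destruction of windows may be sketched as follows. The class WindowStore and its methods are hypothetical names; the sketch uses a sum as the aggregation, recalculates an already activated window when data belonging to it arrives before destruction, and assumes, which the embodiment does not specify, that data arriving for an already destroyed window is discarded.

```python
class WindowStore:
    """Hypothetical sketch: create on first record, activate on watermark, destroy after lateness."""

    def __init__(self, size_ms, allowed_lateness_ms=0):
        self.size_ms = size_ms
        self.allowed_lateness_ms = allowed_lateness_ms
        self.watermark_ms = 0
        self.windows = {}   # window start -> {"end", "data", "activated"}
        self.outputs = []   # (start, end, sum) emitted on each activation

    def add(self, data_time_ms, value):
        start = data_time_ms - data_time_ms % self.size_ms
        end = start + self.size_ms
        # Assumption: data for a window that is already destroyed is dropped.
        if self.watermark_ms >= end + self.allowed_lateness_ms:
            return
        # The window is created when its first piece of data is received.
        win = self.windows.setdefault(
            start, {"end": end, "data": [], "activated": False})
        win["data"].append(value)
        if win["activated"]:
            self._activate(start, win)  # late data triggers a recalculation

    def _activate(self, start, win):
        win["activated"] = True
        self.outputs.append((start, win["end"], sum(win["data"])))

    def advance(self, watermark_ms):
        """Apply a new system reference time (it never moves backward)."""
        self.watermark_ms = max(self.watermark_ms, watermark_ms)
        for start in sorted(self.windows):
            win = self.windows[start]
            if not win["activated"] and self.watermark_ms >= win["end"]:
                self._activate(start, win)
            # First target time = window end time + destruction delay time.
            if win["activated"] and \
                    self.watermark_ms >= win["end"] + self.allowed_lateness_ms:
                del self.windows[start]  # destroying clears the stored data
```

With allowed_lateness_ms equal to 0, a window is removed in the same call to advance that first activates it; with a non-zero value it survives until the system reference time reaches the first target time.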
In order to calculate the data in a window smoothly, in the embodiment of the present invention a water line time (Watermark) is set as the system reference time. The water line time is the reference time of the current system (RT system) in the dimension of the data generation time (Event Time), and is used by the system to determine whether all data belonging to the current window has been received. Specifically, in the embodiment of the present invention, the water line time is set based on one of three times: the time at which the data is generated (Event Time), the time at which the data enters the system (Ingest Time), and the time at which the data is processed by the system (Processing Time).
It should be further noted that the system reference time needs to be updated in real time as the system runs. Specifically, the value of the updated system reference time is greater than or equal to its value before the update, that is, the system reference time only increases and never decreases; if data entering the system would cause the calculated system reference time to be smaller than the current value, the system reference time is not updated. Since the RT system is a streaming computing system, the system reference time likewise only moves forward and never moves backward.
Specifically, the updating mode of the system reference time comprises at least one of the following modes:
in a first mode, updating the system reference time each time the system receives a new piece of data;
in a second mode, updating the system reference time at a preset time interval.
The following describes a specific implementation process of updating the system reference time.
First, an updating process applicable to both of the above updating modes
It should be noted that, in this case, the specific implementation manner of updating the system reference time may include one of the following:
according to the first implementation mode, determining the first data time as the updated system reference time;
it should be noted that the first data time is the data time of the data with the largest data time value among the data of the data processing window that the system has received.
According to the second implementation mode, updating the system reference time according to the first data time and the waiting time;
specifically, this is implemented as follows:
according to the formula a = b - X, updating the system reference time;
wherein a is the updated system reference time, b is the first data time, and X is the waiting time.
It should be noted that this updating process is applicable to the case where data is not sparse: the system reference time advances only with the data time and does not advance with the (real) time of the operating system.
When the waiting time (maxOutOfOrder) is 0, the update of the system reference time does not take maxOutOfOrder into account; when maxOutOfOrder is not 0, the updated system reference time is obtained by subtracting maxOutOfOrder from the data time of the data with the largest data time value among the received data. The unit of maxOutOfOrder is typically milliseconds, e.g., M milliseconds, where M is an integer greater than 0.
Example 1: maxOutOfOrder is set to 5 seconds and the system reference time is updated every 3 seconds; the updates of the system reference time are shown in Table 2 below.
TABLE 2 System reference time update situation table
It is further noted that maxOutOfOrder implements the function of waiting a specific time for data that arrives late, so that some out-of-order data can still be processed, making the calculation more accurate.
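As a non-limiting illustration, the data-driven update a = b - X, combined with the rule that the system reference time never decreases, may be sketched as follows. The class name DataDrivenWatermark and its methods are hypothetical; calling update after every observe corresponds to the per-record mode, and calling it from a timer corresponds to the preset-interval mode.

```python
class DataDrivenWatermark:
    """Hypothetical sketch of the update a = b - X with a non-decreasing result."""

    def __init__(self, max_out_of_order_ms=0):
        self.max_out_of_order_ms = max_out_of_order_ms  # waiting time X
        self.max_data_time_ms = None                    # first data time b
        self.watermark_ms = None                        # system reference time a

    def observe(self, data_time_ms):
        """Track the largest data time among the data received so far."""
        if self.max_data_time_ms is None or data_time_ms > self.max_data_time_ms:
            self.max_data_time_ms = data_time_ms

    def update(self):
        """Recompute a = b - X; keep the old value if the candidate is smaller."""
        if self.max_data_time_ms is None:
            return self.watermark_ms  # no data yet, nothing to derive from
        candidate = self.max_data_time_ms - self.max_out_of_order_ms
        if self.watermark_ms is None or candidate > self.watermark_ms:
            self.watermark_ms = candidate
        return self.watermark_ms
```

A record whose data time is smaller than the largest data time already seen leaves the system reference time unchanged, which is what allows such a record to be counted in its window as long as the window has not yet been activated or destroyed.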
Second, an updating process applicable only to the mode in which the system reference time is updated at a preset time interval
It should be noted that, in this case, the specific implementation manner of updating the system reference time is as follows:
updating the system reference time according to the first data time, the first time increment of the operating system and the waiting time;
the first data time is the data time of the data with the largest data time value among the data of the data processing window that the system has received;
the first time increment of the operating system is the operating system time that elapses between the time the RT system receives new data and the time the system reference time is updated.
Further, the system reference time is updated according to the first data time, the first time increment of the operating system and the waiting time as follows:
according to the formula a = b + A1 - X, updating the system reference time;
where a is the updated system reference time, b is the first data time, A1 is the first time increment of the operating system, and X is the waiting time.
It should be noted that this updating process is applicable to the case where data is sparse. The system reference time is updated at a fixed interval and is affected by both the data time and the system time: it advances as the system time advances, with the data time taking priority. That is, after the RT system has received its first piece of data after starting and has generated the system reference time from the timestamp of that data, if new data is received within an update interval, the first data time plus the first time increment of the operating system is used to advance the system reference time; otherwise, the elapsed system time is used as the increase of the system reference time. When data is sparse, data with a certain timestamp may not be generated or may not arrive in time; with this approach, results can still be calculated and output in time.
For the above specific implementation, see the following examples.
The updating process is described taking maxOutOfOrder = 0 and an update of the system reference time every 5 seconds as an example:
if no data is received in the first 5 seconds, the system reference time is 5 (the previously generated system reference time being 0);
if data with a timestamp (data time) of 3 seconds is received at the 4th second of the first 5 seconds, the system reference time is 4;
if data with a timestamp of 3 seconds is received at the 1st second of the first 5 seconds, the system reference time is 7;
if data with a timestamp of 3 seconds is received at the 0th second of the first 5 seconds, the system reference time is 8.
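As a non-limiting illustration, the interval-based update a = b + A1 - X and the four cases above may be sketched as follows. The function periodic_watermark and its parameters are hypothetical; the tuple received is assumed to hold the first data time and the second, counted from the start of the interval, at which that data was received. When no data was received, the sketch advances the previous value by one full interval, as in the first case above.

```python
def periodic_watermark(prev_watermark, interval, received=None, max_out_of_order=0):
    """Return the system reference time produced at the end of one update interval.

    received is None when no data arrived in the interval; otherwise it is
    (data_time, receive_offset), both in seconds.
    """
    if received is None:
        # No data: the elapsed system time alone advances the reference time.
        candidate = prev_watermark + interval
    else:
        data_time, receive_offset = received
        a1 = interval - receive_offset          # OS time from receipt to update
        candidate = data_time + a1 - max_out_of_order
    # The system reference time only increases, never decreases.
    return max(prev_watermark, candidate)


if __name__ == "__main__":
    print(periodic_watermark(0, 5))          # no data
    print(periodic_watermark(0, 5, (3, 4)))  # timestamp 3 received at second 4
    print(periodic_watermark(0, 5, (3, 1)))  # timestamp 3 received at second 1
    print(periodic_watermark(0, 5, (3, 0)))  # timestamp 3 received at second 0
```

The data time takes priority: in the second case the result is 4 even though the interval alone would have given 5.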
Example 2: maxOutOfOrder is set to 5 seconds and the system reference time is updated every 3 seconds; the updates of the system reference time are shown in Table 3 below.
TABLE 3 System reference time update situation Table
Third, an updating process applicable only to the mode in which the system reference time is updated each time the system receives a new piece of data
It should be noted that, in this case, the specific implementation manner of updating the system reference time is as follows:
updating the system reference time according to the data time of the new data received by the system, the second time increment of the operating system and the waiting time;
wherein the second time increment of the operating system is the operating system time interval between the time the RT system receives the new data and the time the RT system received the previous piece of data.
Further, the specific implementation manner of updating the system reference time according to the data time of the new data received by the system, the second time increment of the operating system and the waiting time is as follows:
determining a second target time according to the data time of the new data received by the system and the second time increment of the operating system;
updating the system reference time according to the second target time and the waiting time;
wherein the second target time is the larger of the data time of the new data received by the system and the second time increment of the operating system.
It should be noted that, according to the second target time and the waiting time, the specific implementation manner of updating the system reference time is as follows:
according to the formula a = B - X, updating the system reference time;
wherein a is the updated system reference time, B is the second target time, and X is the waiting time.
It should be noted that, in this process, the system reference time is updated once per piece of data (i.e., each time the RT system receives a piece of data) and is affected by both the data time and the operating system time (it advances as the operating system time advances). That is, after the RT system has received its first piece of data after starting and has generated the system reference time from the timestamp of that data, the larger of the data time of the most recently received data and the operating system time interval between receiving the new data and receiving the previous piece of data is selected as the increase of the system reference time.
It should be noted that, in the embodiment of the present invention, the Event Time is used to define the system reference time, i.e., the water line time, and the Processing Time and the Ingest Time are copied into the Event Time so as to simplify the processing logic. The water line time increment logic is activated when the system receives the first piece of data. The user can select one of the three schemes in the initial configuration, the configuration item being located in the water line time operator.
Specifically, in the above description of the embodiments of the present invention, the time at which the system receives new data is preferably the time at which the RT system starts to process the new data. By default, a window is activated when the water line time is greater than or equal to the end time of the window, and the window remains valid until it is destroyed. The first activation occurs when the water line time becomes greater than or equal to the end time of the window for the first time; after the first activation and until the window is destroyed, the window is activated again (the calculation is repeated, and the result is updated and output) each time data belonging to the window is received.
It should also be noted that streaming computing processes infinite, unbounded data, and that windows (Window) convert this data into an unbounded sequence of finite, bounded data sets, one batch per window.
Because the data in the Windows of current real-time computing systems is stored in memory to improve computing efficiency, and the memory of a machine is limited, the amount of data that can be cached is limited; if the amount of data to be cached is larger than the memory available to a Window, the memory overflows and the result cannot be calculated.
A large Window can be split into small Windows: each small Window calculates its result, the Window data is cleared in time, and the results are finally merged.
Specifically, the data real-time processing method according to the embodiment of the present invention further includes:
judging whether the data volume to be processed of the next data processing window is larger than the stored data threshold of the next data processing window;
when the data volume to be processed is larger than the stored data threshold value, splitting the next data processing window into at least two data processing windows;
respectively acquiring calculation results of the data in the at least two split data processing windows;
and acquiring the calculation result of the data in the data processing window before the splitting to which the at least two data processing windows belong according to the calculation result of the data in the at least two data processing windows.
It should be noted that, in this embodiment, the system (RT system) monitors the change of the data and memory usage ratio based on the window start time and the window end time.
Further, before determining whether the amount of data to be processed in the next data processing window is greater than the threshold of data to be stored in the next data processing window, the system needs to first obtain the threshold of data to be stored in the next data processing window, and specifically, the implementation manner is:
obtaining the memory usage ratio of each piece of data; and acquiring the data volume when the memory usage proportion of the operating system is equal to a preset value, and determining the data volume as the stored data threshold of the next data processing window.
It should be noted that, the system defaults or the user sets a percentage threshold V (i.e. the above-mentioned preset value), for example, 80%, and when the memory usage ratio of the next data processing window is expected to be greater than the set percentage threshold, the window splitting policy is updated.
Further, the specific implementation manner of the memory usage ratio for acquiring each piece of data is as follows:
acquiring memory usage ratios of the operating systems corresponding to the starting time and the ending time of each data processing window in the previous P data processing windows respectively;
acquiring the data volume of each data processing window in the previous P data processing windows;
determining the memory usage ratio of each piece of data according to the data volume and the memory usage ratio of the previous P data processing windows; wherein P is an integer greater than or equal to 1.
It should be noted that the system defaults or the user sets a value P, for example, P is 5, and the system counts the received data of the previous P windows to deduce the amount of change of the os memory usage corresponding to the next window.
It should be noted that the memory usage ratio is obtained as follows: for each of the P windows, the memory usage ratio corresponding to the start time is subtracted from the memory usage ratio corresponding to the end time, and the differences are summed to give the total memory usage ratio of the data in the P windows; the total data volume of the P data processing windows is obtained from the data volume of each of the previous P data processing windows; dividing the total memory usage ratio by the total data volume gives the average memory usage ratio of each piece of data.
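As a non-limiting illustration, the average memory usage ratio of each piece of data and the resulting stored data threshold may be sketched as follows. The function names are hypothetical; memory usage is assumed to be expressed in whole percentage points, and the memory usage at the moment the next window starts (baseline_percent) is assumed to be known.

```python
from fractions import Fraction
import math


def per_record_memory_ratio(windows):
    """Average memory usage (percentage points) per piece of data.

    windows is a list of (record_count, percent_at_start, percent_at_end)
    for the previous P windows, with integer percentages.
    """
    total_ratio = sum(end - start for _count, start, end in windows)
    total_records = sum(count for count, _start, _end in windows)
    return Fraction(total_ratio, total_records)  # exact, avoids float rounding


def stored_data_threshold(per_record_ratio, baseline_percent, preset_percent):
    """Largest record count that keeps memory usage at or below the preset value."""
    headroom = preset_percent - baseline_percent
    return math.floor(headroom / per_record_ratio)
```

For instance, with illustrative values of five windows holding 1000, 1200, 1400, 1600 and 1800 pieces of data whose memory usage rises from 60% to 70%, 72%, 74%, 76% and 78% respectively, the average is 0.01 percentage points per piece of data, and a baseline of 60% with a preset value of 80% gives a threshold of 2000 pieces.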
It should be further noted that the RT system also needs to predict the data volume to be processed of the next data processing window; specifically, this data volume is predicted according to the data volume of each of the previous P data processing windows. In the embodiment of the present invention, the prediction is performed by linear regression.
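As a non-limiting illustration, the prediction may be sketched as an ordinary least-squares line fitted to the data volumes of the previous P windows and evaluated one step ahead. The embodiment states only that linear regression is used; the choice of the window index 0 to P-1 as the regressor and the function name predict_next_count are assumptions.

```python
def predict_next_count(counts):
    """Fit y = intercept + slope * x over x = 0..P-1 and evaluate at x = P."""
    n = len(counts)
    if n == 0:
        raise ValueError("at least one previous window is required")
    if n == 1:
        return float(counts[0])  # a single point gives no trend to extrapolate
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * n
```

For data volumes that grow by a constant amount per window, such as 1000, 1200, 1400, 1600 and 1800, the fitted line passes through every point and the prediction for the next window is 2000.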
It should be noted that, the splitting the next data processing window into at least two data processing windows includes: acquiring the maximum duration time of the split window according to the data volume to be processed and the stored data threshold of the next data processing window;
splitting a next data processing window according to the maximum duration;
and the duration of each split data processing window is less than or equal to the maximum duration.
It should be noted that, preferably, the duration of each split window is the same and is the maximum duration of the window, or the duration of the last split window is smaller and the durations of the other windows are the maximum durations of the windows, and the total duration satisfies the requirement of storing the predicted amount of data to be processed of the next window.
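As a non-limiting illustration, the splitting may be sketched as follows. The sketch assumes that data is spread evenly over the window, so that the maximum duration is the window duration scaled by the ratio of the stored data threshold to the predicted data volume, rounded down to whole seconds; the function name split_window is hypothetical.

```python
def split_window(start_s, end_s, predicted_count, threshold):
    """Split [start_s, end_s) into sub-windows no longer than the maximum duration.

    All sub-windows have the maximum duration except possibly the last one,
    which takes whatever time remains.
    """
    if predicted_count <= threshold:
        return [(start_s, end_s)]  # no split needed
    duration = end_s - start_s
    # Maximum duration = duration / predicted_count * threshold, floored;
    # at least 1 second so that the loop below always makes progress.
    max_duration = max(1, duration * threshold // predicted_count)
    parts = []
    t = start_s
    while t < end_s:
        nxt = min(t + max_duration, end_s)
        parts.append((t, nxt))
        t = nxt
    return parts
```

For a 600-second window with a predicted data volume of 2200 and a stored data threshold of 2000, the maximum duration is 545 seconds, so the window is split into one sub-window of 545 seconds and one of 55 seconds.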
It should be noted that, after the window is split, each split small window calculates an aggregation result, and then caches the result in the database, and destroys the small window at the same time, and after all the small windows split from the large window are calculated, takes out the aggregation results of all the small windows, and calculates the aggregation result of the large window.
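As a non-limiting illustration, the caching and merging of the small-window results may be sketched as follows for a sum aggregation. The class SplitAggregator is a hypothetical name, and a dictionary stands in for the database in which the small-window results are cached.

```python
class SplitAggregator:
    """Hypothetical sketch: cache each small-window sum, merge when all are done."""

    def __init__(self, sub_windows):
        self.pending = set(sub_windows)  # small windows not yet calculated
        self.partials = {}               # stands in for the database cache

    def complete(self, sub_window, values):
        """Calculate one small window; return the large-window sum once all are done."""
        self.partials[sub_window] = sum(values)
        # The small window is destroyed here: only its result is retained,
        # the values themselves are not kept.
        self.pending.discard(sub_window)
        if self.pending:
            return None
        return sum(self.partials.values())
```

Merging by addition is valid because a sum over the large window equals the sum of the sums over its small windows; an aggregation such as an average would instead require each small window to cache both a sum and a count.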
For example, starting from 00:00:00, the user needs to sum a certain column over every 10 minutes of data; the memory usage percentage threshold V is 80%, and the prediction is made from the previous P = 5 windows. Table 4 records the related data of each window.
TABLE 4 correlation data for each Window
1. At 00:50:00, the data volume of each of the previous 5 windows and the corresponding memory usage ratio (the ratio at the end time minus the ratio at the start time) are counted: 1000: 70% - 60%; 1200: 72% - 60%; 1400: 74% - 60%; 1600: 76% - 60%; 1800: 78% - 60%. Before the next window is activated, the previous window is destroyed, so the memory returns to its initial state, i.e., a memory usage percentage of 60%.
2. The total memory usage ratio of the 1000 + 1200 + 1400 + 1600 + 1800 = 7000 pieces of data is (70% - 60%) + (72% - 60%) + (74% - 60%) + (76% - 60%) + (78% - 60%) = 70%, so the average memory usage ratio of each piece of data is 0.01%. Starting from 60% memory usage, the maximum amount of data that a window can store before reaching the 80% threshold (i.e., the stored data threshold) is therefore 2000 pieces.
3. The data volume of the next window is predicted to be 2000 pieces according to the data volumes of the previous 5 windows, which does not exceed the maximum storage data volume of the window. Table 5 records the predicted related data of the next window.
TABLE 5 correlation data for the next window
4. At 01:00:00, the data volume of each of the previous 5 windows and the corresponding memory usage ratio (the ratio at the end time minus the ratio at the start time) are counted: 1200: 72% - 60%; 1400: 74% - 60%; 1600: 76% - 60%; 1800: 78% - 60%; 2000: 80% - 60%.
5. The average memory usage ratio of each piece of data is determined according to the data volumes and memory usage ratios of the previous P data processing windows;
for example, the average memory usage ratio of each piece of data is 0.01%, so, starting from 60% memory usage, the maximum storage data volume of the window is 2000 pieces.
6. The data volume to be processed of the next window is predicted to be 2200 pieces according to the data volumes of the previous 5 windows, which exceeds the maximum storage data volume of the next window; the maximum duration of a split window is therefore calculated from the data volume to be processed and the maximum storage data volume of the next window as 10 min / 2200 × 2000 ≈ 545 s.
7. According to the maximum window duration, the next window is split into two windows of 545 s and 55 s, i.e., two windows whose start and end times are 01:00:00-01:09:05 and 01:09:05-01:10:00 respectively.
8. At 01:09:05, the aggregation result of the 01:00:00-01:09:05 window is calculated and the window is destroyed; specifically, the aggregation result of the 01:00:00-01:09:05 window may be calculated when the system reference time is equal to 01:09:05.
9. At 01:10:00, the aggregation result of the 01:09:05-01:10:00 window is calculated and the window is destroyed, and the merged value of the aggregation values of the two windows 01:00:00-01:09:05 and 01:09:05-01:10:00 is calculated and output as the final result.
It should be noted that, in the embodiment of the present invention, the data processing window is determined based on the system reference time, so that the problem of disorder in the data processing window can be solved, and the accuracy of calculation is ensured; the updating process of the system reference time in the embodiment of the invention can simplify the updating of the system reference time and ensure the updating accuracy; the problem of overflow of a storage memory can be avoided by carrying out self-adaptive splitting on the data processing window; according to the scheme, the accuracy of calculation can be guaranteed, and a large amount of data aggregation calculation can be realized.
Referring to fig. 3, fig. 3 is a flowchart of a real-time data processing method according to an embodiment of the present invention, the real-time data processing method is applied to a real-time data processing system, and includes:
and step 34, obtaining the calculation result of the data in the data processing window before the splitting to which the at least two data processing windows belong according to the calculation result of the data in the at least two data processing windows.
It should be further noted that, because the data in the Windows of current real-time computing systems is stored in memory to improve computing efficiency, memory overflow is possible; a large Window may therefore be split into small Windows, each small Window calculates its result, the Window data is cleared in time, and the results are finally merged.
Further, before determining whether the amount of data to be processed in the next data processing window is greater than the threshold of data to be stored in the next data processing window, the system needs to first obtain the threshold of data to be stored in the next data processing window, and specifically, the implementation manner is:
obtaining the memory usage ratio of each piece of data;
and acquiring the data volume when the memory usage proportion of the operating system is equal to a preset value, and determining the data volume as the stored data threshold of the next data processing window.
Further, the obtaining of the memory usage percentage of each piece of data includes:
acquiring memory usage ratios of the operating systems corresponding to the starting time and the ending time of each data processing window in the previous P data processing windows respectively;
acquiring the data volume of each data processing window in the previous P data processing windows;
determining the memory usage ratio of each piece of data according to the data volume and the memory usage ratio of the previous P data processing windows;
wherein P is an integer greater than or equal to 1.
Optionally, before the determining whether the amount of data to be processed in the next data processing window is greater than the threshold of data stored in the next data processing window, the method further includes:
and predicting the data volume to be processed of the next data processing window according to the data volume of each data processing window in the previous P data processing windows.
Specifically, the prediction mode of the amount of data to be processed in the next data processing window is a linear regression mode.
It should be noted that the memory usage ratio of each piece of data is obtained as follows: for each of the P windows, the memory usage ratio corresponding to the start time is subtracted from the memory usage ratio corresponding to the end time, and the differences are summed to obtain the total memory usage ratio of the data in the P windows; the data volumes of the previous P data processing windows are summed to obtain the total data volume of the P data processing windows; the total memory usage ratio is then divided by the total data volume to obtain the memory usage ratio of each piece of data.
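By way of illustration only, the calculation of the memory usage ratio of each piece of data and of the stored data threshold may be sketched as follows. The function names, the tuple layout of each window record, and the optional baseline_ratio parameter (memory already in use before any window data is stored) are assumptions of this sketch; the embodiment only states that the threshold is the data volume at which the memory usage ratio equals the preset value.

```python
import math


def per_record_memory_ratio(windows):
    """Memory usage ratio attributable to one piece of data.

    windows: list of (ratio_at_start, ratio_at_end, data_volume) tuples,
             one per window for the previous P windows.
    """
    total_ratio = sum(end - start for start, end, _ in windows)
    total_volume = sum(volume for _, _, volume in windows)
    if total_volume == 0:
        raise ValueError("the previous P windows contain no data")
    return total_ratio / total_volume


def stored_data_threshold(per_record_ratio, preset_ratio, baseline_ratio=0.0):
    """Data volume at which memory usage reaches preset_ratio.

    Assumes memory usage grows linearly with the number of stored pieces
    of data, starting from baseline_ratio (both are assumptions).
    """
    if per_record_ratio <= 0:
        raise ValueError("per-record ratio must be positive")
    return math.floor((preset_ratio - baseline_ratio) / per_record_ratio)


# Hypothetical measurements for P = 2 windows.
history = [(0.25, 0.5, 1024), (0.5, 0.75, 1024)]
ratio = per_record_memory_ratio(history)
threshold = stored_data_threshold(ratio, preset_ratio=0.75)
```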
It should be noted that splitting the next data processing window into at least two data processing windows includes:
acquiring the maximum duration time of the split window according to the data volume to be processed and the stored data threshold of the next data processing window;
splitting a next data processing window according to the maximum duration;
and the duration of each split data processing window is less than or equal to the maximum duration.
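By way of illustration only, the splitting steps above may be sketched as follows. The embodiment does not give a formula for the maximum duration; this sketch assumes that data arrives at a roughly uniform rate within the window, so that the maximum duration is the window duration scaled by the ratio of the stored data threshold to the predicted data volume. The function name split_window and the choice of equal-length sub-windows are likewise assumptions.

```python
import math


def split_window(start, end, predicted_volume, threshold):
    """Split [start, end) into sub-windows no longer than the maximum duration.

    Returns a list of (sub_start, sub_end) pairs covering [start, end).
    """
    if end <= start:
        raise ValueError("window end must be after window start")
    if threshold <= 0:
        raise ValueError("threshold must be positive")

    if predicted_volume <= threshold:
        # No split is needed when the predicted volume fits in memory.
        return [(start, end)]

    duration = end - start
    # Assumed uniform arrival rate: volume is proportional to duration.
    max_duration = duration * threshold / predicted_volume
    parts = math.ceil(duration / max_duration)
    step = duration / parts  # step <= max_duration by construction

    bounds = [start + i * step for i in range(parts)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))
```

For a 60-second window with a predicted volume of 2500 pieces of data and a threshold of 1000, the assumed maximum duration is 24 seconds, so the sketch produces three 20-second sub-windows.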
It should be noted that the data processing windows are distinguished by time and group the data to be processed; the data time of data belonging to a data processing window is greater than or equal to the start time of that data processing window and less than its end time.
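By way of illustration only, the membership rule above corresponds to a half-open interval [start, end). The sketch below assumes fixed-length, non-overlapping windows aligned to an origin; the window layout, the function names and the origin parameter are assumptions and are not specified by the embodiment.

```python
def window_for(data_time, window_size, origin=0):
    """Return the (start, end) window that contains data_time.

    Assumes fixed-length, non-overlapping windows aligned to origin.
    """
    index = (data_time - origin) // window_size
    start = origin + index * window_size
    return (start, start + window_size)


def belongs_to(data_time, window):
    """True when start <= data_time < end, as stated in the description."""
    start, end = window
    return start <= data_time < end
```

With 60-second windows, data with a data time of 59 falls in the window (0, 60), whereas data with a data time of exactly 60 falls in the window (60, 120), because the end time is excluded.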
It should be further noted that all descriptions regarding the data processing window and the split data processing window in the foregoing embodiments are applicable to this embodiment, and are not described herein again.
It should be noted that, in the embodiment of the present invention, by adaptively splitting the window, overflow of the window memory can be avoided, so that it can be ensured that the system can normally perform data processing, a relatively large amount of data aggregation calculation can be implemented, and accuracy of the calculation result is ensured.
It should be noted that, in the foregoing implementation manner, the embodiment of performing window splitting and the embodiment of performing data order-preserving processing may be adaptively combined; that is, each of the two embodiments may be applied within the other.
Referring to fig. 4, fig. 4 is a block diagram of a data real-time processing system according to an embodiment of the present invention. As shown in fig. 4, the data real-time processing system 40 includes:
the boundary module 41 is configured to divide the data to be processed in the system based on the data processing window;
a first judging module 42, configured to judge the data processing window based on the system reference time;
and the calculating module 43 is configured to perform calculation and output a calculation result of the data in the current data processing window when the system reference time is greater than or equal to the end time of the data processing window.
Specifically, the data processing windows are distinguished by time and group the data to be processed; the data time of data belonging to a data processing window is greater than or equal to the start time of that data processing window and less than its end time.
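By way of illustration only, the cooperation of the three modules may be sketched as follows: data is grouped by window, and a window is calculated and output once the system reference time is greater than or equal to its end time. The class name WindowedSum, the use of a sum as the calculation, the fixed-length windows, and the immediate removal of a window after output are assumptions of this sketch.

```python
from collections import defaultdict


class WindowedSum:
    """Buffers values per window and outputs sums for finished windows."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.buffers = defaultdict(list)  # window start -> buffered values

    def add(self, data_time, value):
        # Group the data into the window with start <= data_time < end.
        start = (data_time // self.window_size) * self.window_size
        self.buffers[start].append(value)

    def fire_ready(self, reference_time):
        """Calculate and output every window whose end <= reference_time."""
        results = {}
        for start in sorted(self.buffers):
            end = start + self.window_size
            if reference_time >= end:
                # Window data is removed right after output (assumption).
                results[(start, end)] = sum(self.buffers.pop(start))
        return results
```

In this sketch a window (0, 10) holding the values 5 and 7 produces no output at reference time 9 and outputs 12 at reference time 10.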
Optionally, the data real-time processing system further includes:
and the activation unit is used for activating the data processing window and calculating the data in the data processing window.
Optionally, the data real-time processing system further includes:
and the window processing module is used for destroying the data processing window.
Further, the window processing module is configured to:
when the destruction delay time is equal to zero, destroying the data processing window after the data processing window is activated for the first time; or
when the destruction delay time is not equal to zero and the system reference time is greater than or equal to the first target time, destroying the data processing window;
wherein the first target time is equal to the end time of the data processing window plus the destruction delay time.
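By way of illustration only, the two destruction conditions may be sketched as a single predicate. The function name should_destroy and the activated_once flag, which records whether the window has been activated at least once, are assumptions of this sketch.

```python
def should_destroy(window_end, destruction_delay, reference_time, activated_once):
    """Decide whether a data processing window should be destroyed."""
    if destruction_delay == 0:
        # Zero delay: destroy right after the first activation.
        return activated_once
    # Non-zero delay: wait until the first target time is reached.
    first_target_time = window_end + destruction_delay
    return reference_time >= first_target_time
```

A non-zero destruction delay keeps the window available for data that arrives after the window has first been calculated.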
Optionally, the data real-time processing system further includes:
the updating module is used for updating the system reference time;
and the value of the updated system reference time is greater than or equal to the value of the system reference time before updating.
Specifically, the updating mode of the system reference time comprises at least one of the following modes:
updating the system reference time once the system receives a new piece of data;
and updating the system reference time according to a preset time interval.
Optionally, the update module is configured to:
determining the first data time as an updated system reference time; or
Updating the system reference time according to the first data time and the waiting time;
the first data time is the data time corresponding to the data with the maximum data time value in the data processing window received by the system.
Specifically, the manner of updating the system reference time according to the first data time and the waiting time is as follows:
according to the formula: a = b - X, updating the system reference time;
wherein, a is the updated system reference time, b is the first data time, and X is the waiting time.
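By way of illustration only, this update may be sketched as follows, where the updated system reference time is the first data time minus the waiting time. The function name and the use of max() to keep the system reference time from moving backwards are assumptions of this sketch; the description only requires that the updated value is greater than or equal to the value before updating.

```python
def update_reference_time(current_reference, first_data_time, waiting_time):
    """Return the updated system reference time, a = b - X.

    first_data_time (b): maximum data time among the data received so far.
    waiting_time (X): how long the system waits for late data.
    """
    candidate = first_data_time - waiting_time
    # Never move the reference time backwards (assumed enforcement).
    return max(current_reference, candidate)
```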
Optionally, the updating manner includes updating the system reference time according to a preset time interval, and the updating module is configured to:
updating the system reference time according to the first data time, the first time increment of the operating system and the waiting time;
the first data time is the data time corresponding to the data with the maximum data time value in the data processing window received by the system;
the first time increment is the operating system time interval between the time the system receives new data and the time the system reference time is updated.
Specifically, the manner of updating the system reference time according to the first data time, the first time increment of the operating system, and the waiting time is as follows:
according to the formula: a = b + A1 - X, updating the system reference time;
where a is the updated system reference time, b is the first data time, A1 is the first time increment of the operating system, and X is the waiting time.
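By way of illustration only, the interval-based update may be sketched as follows. The first time increment is computed here as the operating system time at the moment of the update minus the operating system time at which the newest data was received. The function and parameter names, and the use of max() to keep the reference time from moving backwards, are assumptions of this sketch.

```python
def update_on_timer(current_reference, first_data_time,
                    os_time_at_receipt, os_time_now, waiting_time):
    """Return the updated system reference time, a = b + A1 - X."""
    # A1: operating system time elapsed since the new data was received.
    first_increment = os_time_now - os_time_at_receipt
    candidate = first_data_time + first_increment - waiting_time
    # Never move the reference time backwards (assumed enforcement).
    return max(current_reference, candidate)
```

Because A1 keeps growing while no data arrives, the reference time can still advance during idle periods, which allows pending windows to be calculated.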
Optionally, the updating method includes updating the system reference time once every time the system receives a new piece of data, and the updating module is configured to:
updating the system reference time according to the data time of the new data received by the system, the second time increment of the operating system and the waiting time;
wherein the second time increment is the operating system time interval between the time the system receives the new data and the time the system received the previous piece of data.
Further, the update module includes:
the first determining unit is used for determining a second target time according to the data time of the new data received by the system and the second time increment of the operating system;
the updating unit is used for updating the system reference time according to the second target time and the waiting time;
wherein the second target time is the larger of the data time of the new data received by the system and the second time increment of the operating system.
Specifically, the updating unit is configured to:
according to the formula: a = B - X, updating the system reference time;
wherein a is the updated system reference time, B is the second target time, and X is the waiting time.
Specifically, the data time includes at least one of: the time at which the data is generated, the time at which the data enters the system, and the time at which the data is processed by the system.
Optionally, the data real-time processing system further includes:
the second judgment module is used for judging whether the data volume to be processed of the next data processing window is larger than the stored data threshold of the next data processing window;
the first splitting module is used for splitting the next data processing window into at least two data processing windows when the data volume to be processed is larger than the stored data threshold;
the first acquisition module is used for respectively acquiring calculation results of the data in the at least two split data processing windows;
and the second obtaining module is used for obtaining the calculation result of the data in the data processing window before the splitting to which the at least two data processing windows belong according to the calculation result of the data in the at least two data processing windows.
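By way of illustration only, obtaining the result of the window before splitting from the results of the split windows may be sketched as follows. The embodiment does not name a particular calculation; this sketch assumes an aggregation that can be merged from partial results, here a count and a sum from which a mean is derived. All function names are assumptions.

```python
def partial(values):
    """Calculation result of one split window: (count, sum)."""
    return (len(values), sum(values))


def merge(partials):
    """Combine split-window results into the result of the original window."""
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return (count, total)


def mean(aggregate):
    """Derive the mean from a merged (count, sum) result."""
    count, total = aggregate
    return total / count if count else None


# Two split windows whose data can be discarded once each partial is computed.
merged = merge([partial([1, 2, 3]), partial([4, 5])])
```

Only the small partial results need to be retained after each split window is calculated, which is what allows the window data to be removed from memory promptly.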
Further, before the second determining module determines whether the amount of data to be processed in the next data processing window is greater than the threshold of data stored in the next data processing window, the system further includes:
and the third acquisition module is used for acquiring the stored data threshold of the next data processing window.
Specifically, the third obtaining module includes:
the first acquisition unit is used for acquiring the memory usage ratio of each piece of data;
and the second determining unit is used for acquiring the data volume when the memory usage proportion of the operating system is equal to a preset value, and determining the data volume as the stored data threshold of the next data processing window.
Further, the first obtaining unit includes:
the first obtaining subunit is configured to obtain memory usage ratios of the operating system corresponding to the start time and the end time of each of the P previous data processing windows, respectively;
the second acquisition subunit is used for acquiring the data volume of each data processing window in the previous P data processing windows;
the first determining subunit is configured to determine the memory usage proportion of each piece of data according to the data amount and the memory usage proportion of the previous P data processing windows;
wherein P is an integer greater than or equal to 1.
Further, before the second determining module determines whether the amount of data to be processed in the next data processing window is greater than the threshold of data stored in the next data processing window, the system further includes:
and the first prediction module is used for predicting the data volume to be processed of the next data processing window according to the data volume of each data processing window in the previous P data processing windows.
Specifically, the prediction mode of the amount of data to be processed in the next data processing window is a linear regression mode.
Optionally, the first splitting module includes:
the second acquisition unit is used for acquiring the maximum duration time of the split window according to the data volume to be processed and the stored data threshold value of the next data processing window;
the first splitting unit is used for splitting the next data processing window according to the maximum duration;
and the duration of each split data processing window is less than or equal to the maximum duration.
The invention also provides a real-time data processing system, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor; wherein, the processor implements the steps of the data real-time processing method when executing the computer program.
The present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements each process of the data real-time processing method embodiment, and can achieve the same technical effect, and in order to avoid repetition, the detailed description is omitted here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
Referring to fig. 5, fig. 5 is a block diagram of a data real-time processing system according to an embodiment of the present invention. As shown in fig. 5, the data real-time processing system 50 includes:
a third judging module 51, configured to judge whether the amount of data to be processed in a next data processing window is greater than a stored data threshold of the next data processing window;
a second splitting module 52, configured to split the next data processing window into at least two data processing windows when the amount of the data to be processed is greater than the stored data threshold;
a fourth obtaining module 53, configured to obtain calculation results of data in the at least two split data processing windows respectively;
a fifth obtaining module 54, configured to obtain, according to the calculation results of the data in the at least two data processing windows, the calculation results of the data in the data processing windows before splitting to which the at least two data processing windows belong.
Optionally, before the third determining module 51 determines whether the amount of data to be processed in the next data processing window is greater than the threshold of data stored in the next data processing window, the system further includes:
and the sixth acquisition module is used for acquiring the stored data threshold of the next data processing window.
Specifically, the sixth obtaining module includes:
the third acquisition unit is used for acquiring the memory usage ratio of each piece of data;
and the third determining unit is used for acquiring the data volume when the memory usage proportion of the operating system is equal to a preset value, and determining the data volume as the stored data threshold of the next data processing window.
Further, the third obtaining unit includes:
the third acquiring subunit is configured to acquire the memory usage ratios of the operating systems corresponding to the start time and the end time of each of the P previous data processing windows, respectively;
the fourth acquiring subunit is used for acquiring the data volume of each data processing window in the previous P data processing windows;
the second determining subunit is configured to determine the memory usage proportion of each piece of data according to the data amount and the memory usage proportion of the previous P data processing windows;
wherein P is an integer greater than or equal to 1.
Optionally, before the third determining module 51 determines whether the amount of data to be processed in the next data processing window is greater than the threshold of data stored in the next data processing window, the system further includes:
and the second prediction module is used for predicting the data volume to be processed of the next data processing window according to the data volume of each data processing window in the previous P data processing windows.
Specifically, the prediction mode of the amount of data to be processed in the next data processing window is a linear regression mode.
Optionally, the second splitting module 52 includes:
a fourth obtaining unit, configured to obtain a maximum duration of the split window according to a to-be-processed data amount and a stored data threshold of a next data processing window;
the second splitting unit is used for splitting the next data processing window according to the maximum duration;
and the duration of each split data processing window is less than or equal to the maximum duration.
Specifically, the data processing windows are distinguished by time and group the data to be processed; the data time of data belonging to a data processing window is greater than or equal to the start time of that data processing window and less than its end time.
The invention also provides a real-time data processing system, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor; wherein, the processor implements the steps of the data real-time processing method when executing the computer program.
The present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements each process of the data real-time processing method embodiment, and can achieve the same technical effect, and in order to avoid repetition, the detailed description is omitted here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (38)
1. A method for real-time processing of data, comprising:
dividing the data to be processed in the system based on the data processing window;
judging a data processing window based on the system reference time;
when the system reference time is greater than or equal to the end time of the data processing window, calculating and outputting a calculation result of the data in the current data processing window;
before the determining the data processing window based on the system reference time, the method further includes: updating the system reference time; wherein, the value of the updated system reference time is greater than or equal to the value of the system reference time before updating;
the updating of the system reference time includes any one of the following:
updating the system reference time according to a first data time and a waiting time, wherein the first data time is a data time corresponding to data with the maximum data time value in data in a data processing window received by a system;
the updating mode comprises updating the system reference time according to a preset time interval, and updating the system reference time according to a first data time, a first time increment of an operating system and waiting time, wherein the first data time is the data time corresponding to the data with the maximum data time value in the data in a data processing window received by the system, and the first time increment is the operating system time interval between the time when the system receives new data and the updating time of the system reference time;
the updating mode comprises updating the system reference time once when the system receives a new piece of data, and updating the system reference time according to the data time of the new data received by the system, the second time increment of the operating system and the waiting time, wherein the second time increment is the operating system time interval between the time when the system receives the new data and the time when the system receives the previous piece of data.
2. The real-time data processing method according to claim 1, wherein the data processing windows are distinguished by time, the data to be processed are grouped, and the data time of the data belonging to one data processing window is greater than or equal to the start time of the data processing window and less than the end time of the data processing window.
3. The real-time data processing method according to claim 1, further comprising:
and activating the data processing window and calculating the data in the data processing window.
4. The real-time data processing method according to claim 1, further comprising:
and destroying the data processing window.
5. The real-time data processing method according to claim 4, wherein the destroying data processing window comprises:
when the destruction delay time is equal to zero, destroying the data processing window after the data processing window is activated for the first time; or
when the destruction delay time is not equal to zero and the system reference time is greater than or equal to the first target time, destroying the data processing window;
wherein the first target time is equal to the end time of the data processing window plus the destruction delay time.
6. The real-time data processing method according to claim 1, wherein the updating manner of the system reference time comprises at least one of the following manners:
updating the system reference time once the system receives a new piece of data;
and updating the system reference time according to a preset time interval.
7. The real-time data processing method according to claim 1, wherein the updating the system reference time according to the first data time and the waiting time comprises:
according to the formula: a = b - X, updating the system reference time;
wherein, a is the updated system reference time, b is the first data time, and X is the waiting time.
8. The real-time data processing method according to claim 1, wherein the updating the system reference time according to the first data time, the first time increment of the operating system, and the waiting time comprises:
according to the formula: a = b + A1 - X, updating the system reference time;
where a is the updated system reference time, b is the first data time, A1 is the first time increment of the operating system, and X is the waiting time.
9. The real-time data processing method according to claim 1, wherein the updating the system reference time according to the data time of the new data received by the system, the second time increment of the operating system, and the waiting time comprises:
determining a second target time according to the data time of the new data received by the system and the second time increment of the operating system;
updating the system reference time according to the second target time and the waiting time;
wherein the second target time is the larger of the data time of the new data received by the system and the second time increment of the operating system.
10. The real-time data processing method according to claim 9, wherein the updating the system reference time according to the second target time and the waiting time comprises:
according to the formula: a = B - X, updating the system reference time;
wherein a is the updated system reference time, B is the second target time, and X is the waiting time.
11. The real-time data processing method according to claim 1, 2 or 9, wherein the data time includes at least one of: the time at which the data is generated, the time at which the data enters the system, and the time at which the data is processed by the system.
12. The real-time data processing method according to any one of claims 1 to 10, further comprising:
judging whether the data volume to be processed of the next data processing window is larger than the stored data threshold of the next data processing window;
when the data volume to be processed is larger than the stored data threshold value, splitting the next data processing window into at least two data processing windows;
respectively acquiring calculation results of the data in the at least two split data processing windows;
and acquiring the calculation result of the data in the data processing window before the splitting to which the at least two data processing windows belong according to the calculation result of the data in the at least two data processing windows.
13. The real-time data processing method according to claim 12, before the determining whether the amount of data to be processed in the next data processing window is greater than the threshold of data stored in the next data processing window, further comprising:
and acquiring a stored data threshold value of the next data processing window.
14. The method of claim 13, wherein the obtaining the stored data threshold of the next data processing window comprises:
obtaining the memory usage ratio of each piece of data;
and acquiring the data volume when the memory usage proportion of the operating system is equal to a preset value, and determining the data volume as the stored data threshold of the next data processing window.
15. The real-time data processing method according to claim 14, wherein the obtaining of the memory usage ratio of each piece of data includes:
acquiring memory usage ratios of the operating systems corresponding to the starting time and the ending time of each data processing window in the previous P data processing windows respectively;
acquiring the data volume of each data processing window in the previous P data processing windows;
determining the memory usage ratio of each piece of data according to the data volume and the memory usage ratio of the previous P data processing windows;
wherein P is an integer greater than or equal to 1.
16. The real-time data processing method according to claim 12, before the determining whether the amount of data to be processed in the next data processing window is greater than the threshold of data stored in the next data processing window, further comprising:
and predicting the data volume to be processed of the next data processing window according to the data volume of each data processing window in the previous P data processing windows.
17. The real-time data processing method according to claim 16, wherein the prediction mode of the amount of data to be processed in the next data processing window is a linear regression mode.
18. The real-time data processing method according to claim 12, wherein the splitting the next data processing window into at least two data processing windows comprises:
acquiring the maximum duration time of the split window according to the data volume to be processed and the stored data threshold of the next data processing window;
splitting a next data processing window according to the maximum duration;
and the duration of each split data processing window is less than or equal to the maximum duration.
19. A system for real-time processing of data, comprising:
the boundary module is used for dividing the data to be processed in the system based on the data processing window;
the first judgment module is used for judging the data processing window based on the system reference time;
the calculation module is used for calculating and outputting a calculation result of data in the current data processing window when the system reference time is greater than or equal to the end time of the data processing window;
wherein, the data real-time processing system further comprises:
the updating module is used for updating the system reference time; wherein, the value of the updated system reference time is greater than or equal to the value of the system reference time before updating;
the update module is configured to implement any one of:
updating the system reference time according to a first data time and a waiting time, wherein the first data time is a data time corresponding to data with the maximum data time value in data in a data processing window received by a system;
the updating mode comprises updating the system reference time according to a preset time interval, and updating the system reference time according to a first data time, a first time increment of an operating system and waiting time, wherein the first data time is a data time corresponding to data with the maximum data time value in data in a data processing window received by a system; the first time increment is an operating system time interval between the time when the system receives new data and the system reference time updating time;
the updating mode comprises updating the system reference time once when the system receives a new piece of data, and updating the system reference time according to the data time of the new data received by the system, the second time increment of the operating system and the waiting time, wherein the second time increment is the operating system time interval between the time when the system receives the new data and the time when the system receives the previous piece of data.
20. The real-time data processing system according to claim 19, wherein the data processing windows are differentiated by time, the data to be processed are grouped, and the data time of the data belonging to one data processing window is greater than or equal to the start time of the data processing window and less than the end time of the data processing window.
21. The real-time data processing system of claim 19, further comprising:
and the activation unit is used for activating the data processing window and calculating the data in the data processing window.
22. The real-time data processing system of claim 19, further comprising:
and the window processing module is used for destroying the data processing window.
23. The real-time data processing system according to claim 22, wherein the window processing module is configured to:
when the destruction delay time is equal to zero, destroying the data processing window after the data processing window is activated for the first time; or
when the destruction delay time is not equal to zero and the system reference time is greater than or equal to the first target time, destroying the data processing window;
wherein the first target time is equal to the end time of the data processing window plus the destruction delay time.
24. The real-time data processing system according to claim 19, wherein the updating manner of the system reference time comprises at least one of the following manners:
updating the system reference time once the system receives a new piece of data;
and updating the system reference time according to a preset time interval.
25. The real-time data processing system according to claim 19, wherein the updating of the system reference time according to the first data time and the waiting time is performed by:
according to the formula: a = b - X, updating the system reference time;
wherein, a is the updated system reference time, b is the first data time, and X is the waiting time.
26. The real-time data processing system of claim 19, wherein the updating of the system reference time according to the first data time, the first time increment of the operating system, and the latency is performed by:
according to the formula: a = b + A1 - X, updating the system reference time;
where a is the updated system reference time, b is the first data time, A1 is the first time increment of the operating system, and X is the waiting time.
27. The real-time data processing system according to claim 19, wherein the update module comprises:
the first determining unit is used for determining a second target time according to the data time of the new data received by the system and the second time increment of the operating system;
the updating unit is used for updating the system reference time according to the second target time and the waiting time;
wherein the second target time is the larger of the data time of the new data received by the system and the second time increment of the operating system.
28. The real-time data processing system according to claim 27, wherein the updating unit is configured to:
according to the formula: a = B - X, updating the system reference time;
wherein a is the updated system reference time, B is the second target time, and X is the waiting time.
29. The real-time data processing system according to claim 19, 20 or 27, wherein the data time comprises at least one of: the time at which the data is generated, the time at which the data enters the system, and the time at which the data is processed by the system.
30. The real-time data processing system according to any one of claims 19 to 28, further comprising:
the second judgment module is used for judging whether the data volume to be processed of the next data processing window is larger than the stored data threshold of the next data processing window;
the first splitting module is used for splitting the next data processing window into at least two data processing windows when the data volume to be processed is larger than the stored data threshold;
the first acquisition module is used for respectively acquiring calculation results of the data in the at least two split data processing windows;
and the second acquisition module is used for obtaining, according to the calculation results of the data in the at least two split data processing windows, the calculation result of the data in the original data processing window from which the at least two data processing windows were split.
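The split-then-merge flow of claim 30 might look like the following Python sketch. It assumes a sum aggregate, equal-duration sub-windows, and records given as `(data_time, value)` pairs; none of these choices come from the claims, and all names are hypothetical.

```python
import math
from typing import List, Tuple

Record = Tuple[float, float]  # (data_time, value), an assumed record shape


def split_window(start: float, end: float, parts: int) -> List[Tuple[float, float]]:
    """Split [start, end) into `parts` equal-duration sub-windows."""
    step = (end - start) / parts
    bounds = [start + i * step for i in range(parts)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))


def aggregate_with_split(records: List[Record], start: float, end: float,
                         threshold: int) -> float:
    """Sum the values in [start, end), splitting the window first when the
    pending data volume exceeds the stored data threshold."""
    pending = [r for r in records if start <= r[0] < end]
    if len(pending) <= threshold:
        return sum(v for _, v in pending)
    # At least two sub-windows, as the claim requires.
    parts = max(2, math.ceil(len(pending) / threshold))
    partials = [
        sum(v for t, v in pending if s <= t < e)
        for s, e in split_window(start, end, parts)
    ]
    # Merge step: for a sum, the original window's result is the sum of partials.
    return sum(partials)


records = [(float(t), 1.0) for t in range(10)]
print(aggregate_with_split(records, 0, 10, threshold=4))
```

Equal-duration splitting does not guarantee that every sub-window stays under the threshold when data is bursty. The merge step shown here also only works for aggregates that can be combined from partial results, such as sums, counts, minima and maxima.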
31. The real-time data processing system according to claim 30, further comprising:
and the third acquisition module is used for acquiring the stored data threshold of the next data processing window before the second judgment module judges whether the data volume to be processed of the next data processing window is larger than the stored data threshold of the next data processing window.
32. The real-time data processing system according to claim 31, wherein the third acquisition module comprises:
the first acquisition unit is used for acquiring the memory usage proportion of each piece of data;
and the second determining unit is used for acquiring the data volume when the memory usage proportion of the operating system is equal to a preset value, and determining the data volume as the stored data threshold of the next data processing window.
33. The real-time data processing system according to claim 32, wherein the first acquisition unit comprises:
the first acquisition subunit is used for respectively acquiring the memory usage proportions of the operating system corresponding to the start time and the end time of each of the previous P data processing windows;
the second acquisition subunit is used for acquiring the data volume of each data processing window in the previous P data processing windows;
the first determining subunit is used for determining the memory usage proportion of each piece of data according to the data volumes and the memory usage proportions of the previous P data processing windows;
wherein P is an integer greater than or equal to 1.
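Claims 32 and 33 do not give an explicit formula, so the following Python sketch is only one plausible reading. It estimates the memory usage proportion of each piece of data as the total memory growth across the previous P windows divided by the total number of records, then derives the stored data threshold as the record count at which usage would reach the preset value. The averaging rule, the `baseline_ratio` parameter, and all names are assumptions.

```python
from typing import List, Tuple

# (usage ratio at window start, usage ratio at window end, data volume)
WindowStats = Tuple[float, float, int]


def per_item_memory_ratio(windows: List[WindowStats]) -> float:
    """Assumed estimate: total memory growth over the previous P windows
    divided by the total number of records they held."""
    total_growth = sum(end - start for start, end, _ in windows)
    total_items = sum(count for _, _, count in windows)
    if total_items == 0:
        raise ValueError("previous windows contain no data")
    return total_growth / total_items


def stored_data_threshold(per_item: float, preset_ratio: float,
                          baseline_ratio: float = 0.0) -> int:
    """Data volume at which memory usage would reach preset_ratio,
    starting from an assumed baseline usage."""
    if per_item <= 0:
        raise ValueError("per-item memory ratio must be positive")
    return int((preset_ratio - baseline_ratio) / per_item)


# Hypothetical history: two windows of 1000 records, each adding about 10% usage.
history = [(0.20, 0.30, 1000), (0.25, 0.35, 1000)]
ratio = per_item_memory_ratio(history)
print(ratio, stored_data_threshold(ratio, preset_ratio=0.8, baseline_ratio=0.2))
```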
34. The real-time data processing system according to claim 30, further comprising:
and the first prediction module is used for predicting, before the second judgment module judges whether the data volume to be processed of the next data processing window is larger than the stored data threshold of the next data processing window, the data volume to be processed of the next data processing window according to the data volume of each data processing window in the previous P data processing windows.
35. The real-time data processing system according to claim 34, wherein the data volume to be processed of the next data processing window is predicted by linear regression.
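A minimal sketch of the linear-regression prediction named in claim 35, assuming ordinary least squares over the window index (0, 1, ..., P-1) and extrapolation to index P. The claim names only "linear regression", so the choice of regressor and the function name are assumptions.

```python
from typing import Sequence


def predict_next_volume(volumes: Sequence[float]) -> float:
    """Fit y = slope * x + intercept over the previous P window volumes
    (x = 0..P-1) by least squares and evaluate the line at x = P."""
    p = len(volumes)
    if p == 0:
        raise ValueError("at least one previous window is required")
    if p == 1:
        # A single point cannot define a slope; reuse the last volume.
        return float(volumes[0])
    mean_x = (p - 1) / 2
    mean_y = sum(volumes) / p
    sxx = sum((x - mean_x) ** 2 for x in range(p))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(volumes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope * p + intercept


# Volumes growing by 100 per window extrapolate to 400 for the next one.
print(predict_next_volume([100, 200, 300]))
```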
36. The real-time data processing system according to claim 30, wherein the first splitting module comprises:
the second acquisition unit is used for acquiring the maximum duration time of the split window according to the data volume to be processed and the stored data threshold value of the next data processing window;
the first splitting unit is used for splitting the next data processing window according to the maximum duration;
and the duration of each split data processing window is less than or equal to the maximum duration.
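Claim 36 does not state how the maximum duration is derived. One simple possibility, sketched below in Python, assumes data arrives at a uniform rate within the window, so the longest sub-window that stays within the stored data threshold is the window duration scaled by threshold / pending volume. The uniform-arrival assumption and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def max_split_duration(window_duration: float, pending: int, threshold: int) -> float:
    """Longest sub-window expected to hold at most `threshold` records,
    assuming records are spread uniformly over the window."""
    if pending <= threshold:
        return window_duration
    return window_duration * threshold / pending


def split_by_max_duration(start: float, end: float,
                          max_duration: float) -> List[Tuple[float, float]]:
    """Split [start, end) into the fewest equal sub-windows whose duration
    does not exceed max_duration."""
    parts = math.ceil((end - start) / max_duration)
    step = (end - start) / parts
    bounds = [start + i * step for i in range(parts)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))


# Hypothetical 60-second window, 3000 pending records, threshold of 1000.
limit = max_split_duration(60, pending=3000, threshold=1000)
print(limit, split_by_max_duration(0, 60, limit))
```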
37. A data real-time processing system comprising a memory, a processor and a computer program stored on said memory and executable on said processor; characterized in that the processor implements the steps in the method for real-time processing of data according to any one of claims 1 to 18 when executing the computer program.
38. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method for real-time processing of data according to any one of claims 1 to 18.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910507802.4A CN110209685B (en) | 2019-06-12 | 2019-06-12 | Real-time data processing method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110209685A (en) | 2019-09-06 |
CN110209685B (en) | 2020-04-21 |
Family
ID=67792196
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910507802.4A Active CN110209685B (en) | 2019-06-12 | 2019-06-12 | Real-time data processing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110209685B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111142942B (en) * | 2019-12-26 | 2023-08-04 | 远景智能国际私人投资有限公司 | Window data processing method and device, server and storage medium |
CN112286582B (en) * | 2020-12-31 | 2021-03-16 | 浙江岩华文化科技有限公司 | Multithreading data processing method, device and medium based on streaming computing framework |
CN113760989A (en) * | 2021-02-04 | 2021-12-07 | 北京沃东天骏信息技术有限公司 | Method, device and equipment for processing unbounded stream data and storage medium |
CN113312434A (en) * | 2021-07-29 | 2021-08-27 | 北京快立方科技有限公司 | Pre-polymerization treatment method for massive structured data |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103414545A (en) * | 2013-07-31 | 2013-11-27 | 东软集团股份有限公司 | Time-out judging method and system between heterogeneous systems |
CN106101752A (en) * | 2016-07-08 | 2016-11-09 | 青岛海信宽带多媒体技术有限公司 | A kind of time shift time obtaining method and Set Top Box |
CN107729504A (en) * | 2017-10-23 | 2018-02-23 | 武汉楚鼎信息技术有限公司 | A kind of method and system for handling large data objectses |
CN109614413A (en) * | 2018-12-12 | 2019-04-12 | 上海金融期货信息技术有限公司 | A kind of memory streaming computing plateform system |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102902746A (en) * | 2012-09-18 | 2013-01-30 | 杭州勒卡斯广告策划有限公司 | Method, device and system for processing mass data |
CN104156524B (en) * | 2014-08-01 | 2018-03-06 | 河海大学 | The Aggregation Query method and system of transport data stream |
EP3215963A1 (en) * | 2015-08-05 | 2017-09-13 | Google, Inc. | Data flow windowing and triggering |
CN106911589B (en) * | 2015-12-22 | 2020-04-24 | 阿里巴巴集团控股有限公司 | Data processing method and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN110209685A (en) | 2019-09-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110209685B (en) | Real-time data processing method and system | |
CN106201829B (en) | Monitor Threshold and device, monitoring alarm method, apparatus and system | |
CN112640380A (en) | Apparatus and method for anomaly detection of an input stream of events | |
CN109597800B (en) | Log distribution method and device | |
CN109450659B (en) | Block delay broadcasting method, equipment and storage medium | |
US10452658B2 (en) | Caching methods and a system for entropy-based cardinality estimation | |
CN108492150B (en) | Method and system for determining entity heat degree | |
CN112631767A (en) | Data processing method, system, device, electronic equipment and readable storage medium | |
CN109756372B (en) | Elastic expansion method and device for telecommunication charging system | |
CN109471989A (en) | A kind of page request processing method and relevant apparatus | |
CN102932264B (en) | Method and device for judging flow overflowing | |
US11855837B2 (en) | Adaptive time window-based log message deduplication | |
CN109522100A (en) | Real-time calculating task method of adjustment and device | |
CN114185885A (en) | Streaming data processing method and system based on column storage database | |
CN112001563B (en) | Method and device for managing ticket quantity, electronic equipment and storage medium | |
Bashyam et al. | Application of perturbation analysis to a class of periodic review (s, S) inventory systems | |
CN114116853B (en) | Data security analysis method and device based on time sequence association analysis | |
CN108509148B (en) | I/O request processing method and device | |
CN108255710B (en) | Script abnormity detection method and terminal thereof | |
WO2023077451A1 (en) | Stream data processing method and system based on column-oriented database | |
Leonardi et al. | Modeling least recently used caches with shot noise request processes | |
JP2002215581A (en) | Method for estimating service level of computer system in operation | |
CN110083441A (en) | A kind of distributed computing system and distributed computing method | |
CN107145495B (en) | Method and device for dynamically adjusting parameter rules | |
CN113938429A (en) | Flow control method, flow control device and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||