CN114185885A - Streaming data processing method and system based on column storage database - Google Patents

Streaming data processing method and system based on column storage database

Publication number
CN114185885A
Authority
CN
China
Prior art keywords
window
data
time
processing
streaming
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111307991.4A
Other languages
Chinese (zh)
Inventor
程学旗
郭嘉丰
李冰
邱强
张志斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202111307991.4A priority Critical patent/CN114185885A/en
Publication of CN114185885A publication Critical patent/CN114185885A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/221Column-oriented storage; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • G06F16/24539Query rewriting; Transformation using cached or materialised query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • G06F16/24554Unary operations; Data partitioning operations
    • G06F16/24556Aggregation; Duplicate elimination
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24568Data stream processing; Continuous queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a streaming data processing method and system based on column storage data, comprising the following steps: acquiring column-stored streaming data to be processed and a processing task corresponding to the streaming data, dividing the streaming data into batch data blocks along the time dimension, and assigning a window sequence number to each record in a batch data block according to a preset window mode; splitting the batch data block into a plurality of intermediate data blocks, each of which contains only data with the same window sequence number, and performing pre-aggregation on the data of each intermediate data block to generate a pre-aggregated intermediate state; and, according to a preset streaming data time processing mode, extracting from internal storage the pre-aggregated intermediate state of the window sequence number corresponding to a window, executing the processing task corresponding to that intermediate state, and outputting the task execution result as the streaming data processing result. By using a column storage and computation engine combined with pre-aggregation, the invention improves throughput in data analysis scenarios while keeping latency low.

Description

Streaming data processing method and system based on column storage database
Technical Field
The invention belongs to the field of distributed computing, specifically distributed streaming data computation, and in particular relates to a streaming data processing method and system based on a column storage database.
Background
Streaming data computing engines are emerging and spreading across many industries. At present, almost all cloud service providers offer streaming data computing engines, which can be used for data aggregation, data association, data monitoring, data analysis and other scenarios. Current mainstream streaming data computing engines, represented by systems such as Apache Flink, Apache Spark Streaming and Storm, use a directed acyclic graph to represent user jobs and offer a programming model more flexible than MapReduce. The current generation of streaming data computing engines aggregates data in the time dimension through window techniques and supports out-of-order message processing through event time.
Introduction to window techniques:
The streaming data computing engine uses window techniques to aggregate data in the time dimension. Common windows include rolling windows and sliding windows. The rolling window, also called a fixed time window, aggregates data at fixed time intervals, for example summarizing the data once a day. The sliding window, also called a hopping window, is a window of fixed size that slides at a fixed time interval; it can be used, for example, to generate statistics for the last week every day. When the window size equals the sliding interval, the sliding window degenerates into a rolling window. When the sliding interval is smaller than the window size, the sliding windows overlap, and one record may belong to several different windows.
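The window membership described above can be illustrated with a minimal Python sketch. The function names and the integer timestamps are illustrative assumptions, not part of the patent; windows are assumed to be half-open intervals [start, end) aligned to multiples of the interval.

```python
from __future__ import annotations


def tumbling_window(ts: int, size: int) -> tuple[int, int]:
    """Return the [start, end) rolling window that contains timestamp ts."""
    start = ts - ts % size
    return start, start + size


def sliding_windows(ts: int, size: int, slide: int) -> list[tuple[int, int]]:
    """Return every [start, end) sliding window that contains timestamp ts."""
    if slide <= 0 or size < slide:
        raise ValueError("require 0 < slide <= size")
    windows = []
    start = ts - ts % slide  # latest window start at or before ts
    while start > ts - size:  # a window contains ts only if start > ts - size
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)
```

With a window size of 30 and a sliding interval of 10, a record at time 25 belongs to three overlapping windows; with equal size and interval, `sliding_windows` returns the single window that `tumbling_window` returns.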
Introduction to time semantics:
The streaming data computing engine processes data in the time dimension and generally supports two kinds of time semantics: processing time and event time. Processing time is the time at which a message enters the computing engine. Data is bound to increasing timestamps in the order in which it enters the system, and because processing-time semantics use the physical machine time, windows are triggered by machine time, so data processing is simpler in this mode. Event time is the time at which the data was actually generated. After being generated, the data may arrive at the server out of order because of network instability and similar causes, or may not arrive at all because of a network failure. Therefore, in the event-time mode the system cannot use machine time to determine whether all data of a window has arrived. One mainstream approach is the watermark (water level line) mechanism: a watermark is a marker, estimated by the system with a specific algorithm, indicating that all data of a given window is considered to have arrived by that time; it is usually obtained with a heuristic algorithm. Because the data is not known in advance, the system cannot predict the exact watermark position, so data may still arrive late, after the watermark, and must be handled separately. Since data may be delayed by hours or even days, and discarding data is unacceptable in fields such as finance, caching large amounts of window data is a major challenge for a streaming data computing system.
Introduction to storage and computation modes:
The storage and computation modes of streaming computing engines fall into two types, row-based and column-based. In the row-based mode, the system stores data and performs computation by rows of a data table, as shown in FIG. 1. Row storage is a very intuitive mode, similar to the tabular layout people are used to. Its advantage is that operating on the attributes of a single record is efficient, which makes it well suited to transactional operations. However, row storage must read all the data of each row when reading, so if a query needs only some of the attributes of a record, the mode incurs unnecessary read and write overhead, which can seriously affect system performance when records have many attributes. In addition, when the whole data set must be aggregated by a certain attribute, row storage has to read all the data of every record, which is not memory-friendly and results in poor performance. Mainstream streaming data computing engines such as Apache Flink and Apache Spark Streaming use row storage, which gives the system low latency in scenarios such as data cleaning, filtering and transformation.
In the column storage mode, the system maintains data records and performs computation by columns of a data table. Each column of the data table represents one attribute of the data records, and all records are stored in memory in order, attribute by attribute, as shown in FIG. 2. This mode is less intuitive than row storage and was created to improve performance in data analysis scenarios. Because the attributes of a single record are not stored contiguously, operating on a single record is slower than in row storage, and the mode is not well suited to transactional operations. However, column storage can retrieve only the specified attributes without reading all the data, which greatly reduces read and write overhead in scenarios that require data filtering. It is also memory-friendly in data aggregation scenarios, which gives it a distinct advantage in data analysis. Column storage is widely used in data analysis engines such as HBase and ClickHouse.
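The difference between the two layouts can be made concrete with a small Python sketch. The records, attribute names and the `to_columns` helper are hypothetical and serve only to illustrate the access pattern; real engines store columns in typed, contiguous buffers.

```python
# Row layout: each record keeps all of its attributes together.
rows = [
    {"user": "a", "city": "NYC", "fare": 12.5},
    {"user": "b", "city": "SF", "fare": 7.0},
    {"user": "c", "city": "NYC", "fare": 3.5},
]


def to_columns(records):
    """Convert a list of records into one list per attribute (column layout)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Aggregating one attribute in the row layout visits every record.
row_total = sum(record["fare"] for record in rows)

# In the column layout the same aggregation reads a single column.
col_total = sum(columns["fare"])
```

Both totals are equal; the column version never touches the `user` or `city` values, which is the source of the read savings described above.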
In summary, the prior art has the following problems and shortcomings:
(1) Row storage performs poorly in data analysis scenarios. Mainstream streaming data computing engines are designed and optimized for log data processing and adopt a row-based storage and computation mode to achieve real-time message processing. However, row-based computation performs poorly in data analysis scenarios, and research shows that the throughput of mainstream streaming data computing engines may be 500 times or more lower than that of column storage data analysis engines such as SQL Server and Shark. Column storage engines use hardware resources more efficiently in scenarios such as sorting and aggregation and have a distinct advantage in big data analysis. However, because mainstream databases lack support for an incremental computation model, they cannot support streaming data computation.
(2) The use of multiple systems is difficult and the overhead of data copying causes performance loss. Many analytical tasks, such as real-time recommendations, online machine learning, or streaming graph computation processes, have complex computational patterns that often require aggregated computation from multiple different systems, such as aggregating data in streaming data computation engines, databases, and content caching systems. For example, the advertisement analysis system uses advertiser and user data in a relational database and uses this data in a streaming data processing task. Similarly, in online machine learning or graph computation tasks, databases may also be accessed to obtain information such as training data. The use of multiple systems increases the learning cost of the user, and also makes the system logic complex and difficult to maintain, and in addition, data copying, serialization and deserialization overhead is brought because data needs to be transferred among multiple different systems. The mainstream streaming data computing system does not support database storage, so the user service can be completed only by matching with a database system, and meanwhile, a message queue is required to be introduced to realize the communication between the streaming data computing system and the database system.
Disclosure of Invention
The invention aims to improve the computational efficiency of streaming data computing systems in data analysis scenarios, and provides a streaming data computing method and system that use a column storage and computation engine.
To address the shortcomings of the prior art, the invention provides a streaming data processing method based on column storage data, comprising the following steps:
step 1, acquiring column-stored streaming data to be processed and a processing task corresponding to the streaming data, dividing the streaming data into batch data blocks along the time dimension, and assigning a window sequence number to each record in a batch data block according to a preset window mode;
step 2, splitting the batch data block into a plurality of intermediate data blocks, each of which contains only data with the same window sequence number, and performing pre-aggregation on the data of each intermediate data block to generate a pre-aggregated intermediate state;
and step 3, according to a preset streaming data time processing mode, extracting from internal storage the pre-aggregated intermediate state of the window sequence number corresponding to a window, executing the corresponding processing task, and outputting the task execution result as the streaming data processing result.
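Steps 1 to 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a rolling window of hypothetical size 10 is used, the window end time serves as the window sequence number, the processing task is assumed to be an average, and the pre-aggregated intermediate state is assumed to be a (sum, count) pair.

```python
from collections import defaultdict

WINDOW_SIZE = 10  # hypothetical rolling window size


def assign_window_ids(timestamps, size=WINDOW_SIZE):
    """Step 1: use the end time of the enclosing rolling window as the ID."""
    return [ts - ts % size + size for ts in timestamps]


def split_by_window(batch, window_ids):
    """Step 2a: split a column-oriented batch into one block per window ID."""
    blocks = defaultdict(list)
    for wid, value in zip(window_ids, batch["value"]):
        blocks[wid].append(value)
    return blocks


def pre_aggregate(states, blocks):
    """Step 2b: fold each block into its window's (sum, count) state."""
    for wid, values in blocks.items():
        total, count = states.get(wid, (0.0, 0))
        states[wid] = (total + sum(values), count + len(values))


def fire(states, window_id):
    """Step 3: take the window's state out of storage and finish the task."""
    total, count = states.pop(window_id)
    return total / count


states = {}
for batch in (
    {"ts": [1, 4, 12], "value": [2.0, 4.0, 10.0]},
    {"ts": [7, 15], "value": [6.0, 20.0]},
):
    ids = assign_window_ids(batch["ts"])
    pre_aggregate(states, split_by_window(batch, ids))

average_first_window = fire(states, 10)  # window covering [0, 10)
```

Because each batch is reduced to a small state as soon as it arrives, firing a window only has to finish the computation from the state rather than rescan the raw records.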
The streaming data processing method based on column storage data, wherein step 2 comprises: when performing the pre-aggregation, discarding data of expired windows either immediately or after a specified time.
The streaming data processing method based on column storage data, wherein the streaming data time processing mode in step 3 is a processing-time mode or an event-time mode;
in the processing-time mode, a trigger is set using the machine time of the computer that executes the processing task, so that when the machine time reaches a window end time, a window processing command is invoked, the pre-aggregated intermediate state of the window corresponding to that window end time is selected, and the processing task corresponding to that intermediate state is executed;
and in the event-time mode, a trigger is set using a watermark mechanism in which the maximum time of all streaming data is taken as the watermark, and when the watermark meets a trigger condition, the pre-aggregated intermediate state of the window corresponding to the window end time is selected and the processing task corresponding to that intermediate state is executed.
The streaming data processing method based on column storage data, wherein step 1 comprises:
when the window mode is a rolling window, taking the sum of the start time of the window containing the data in the batch data block and the window size as the window end time, and using the window end time as the window sequence number;
when the window mode is a sliding window, calculating, according to the sliding interval, the start time of the window containing the data in the batch data block, and taking the sum of that start time and the sliding interval as the window end time;
and setting a temporary sub-window whose size is the greatest common divisor of the window size and the window sliding interval and whose start time is the window end time, sliding the temporary sub-window in the direction of decreasing time until the window with the smallest sequence number that contains the data in the batch data block is found, and using its end time as the window sequence number.
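One possible reading of the window sequence number assignment above is sketched below in Python. The function names are hypothetical, timestamps are assumed to be non-negative integers, and windows are assumed to be aligned to multiples of the interval; the loop follows the described backward search of the temporary sub-window rather than any confirmed implementation.

```python
from math import gcd


def rolling_window_id(ts, size):
    """Rolling window: window start plus window size, i.e. the window end."""
    start = ts - ts % size
    return start + size


def sliding_window_id(ts, size, slide):
    """Sliding window: end time of the gcd-sized sub-window containing ts."""
    step = gcd(size, slide)          # size of the temporary sub-window
    end = ts - ts % slide + slide    # window start plus sliding interval
    sub_end = end
    # Slide the sub-window [sub_end - step, sub_end) toward earlier times
    # until it contains ts.
    while sub_end - step > ts:
        sub_end -= step
    return sub_end
```

For a window size of 30 and a sliding interval of 20, the sub-window size is 10, so records at times 20 to 29 all receive the sequence number 30. Under the alignment assumption, the search gives the same result as the closed form `ts - ts % step + step`.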
The streaming data processing method based on column storage data, wherein the streaming data is physiological data, image data or log text data acquired by a sensor in real time, and the processing task corresponding to the streaming data is a database statistics task.
The invention also provides a streaming data processing system based on column storage data, comprising:
a module 1, configured to acquire column-stored streaming data to be processed and a processing task corresponding to the streaming data, divide the streaming data into batch data blocks along the time dimension, and assign a window sequence number to each record in a batch data block according to a preset window mode;
a module 2, configured to split the batch data block into a plurality of intermediate data blocks, each of which contains only data with the same window sequence number, and to perform pre-aggregation on the data of each intermediate data block to generate a pre-aggregated intermediate state;
and a module 3, configured to extract from internal storage, according to a preset streaming data time processing mode, the pre-aggregated intermediate state of the window sequence number corresponding to a window, execute the processing task corresponding to that intermediate state, and output the task execution result as the streaming data processing result.
The streaming data processing system based on column storage data, wherein the module 2 is configured to discard data of expired windows either immediately or after a specified time when performing the pre-aggregation.
The streaming data processing system based on column storage data, wherein the streaming data time processing mode in the module 3 is a processing-time mode or an event-time mode;
in the processing-time mode, a trigger is set using the machine time of the computer that executes the processing task, so that when the machine time reaches a window end time, a window processing command is invoked, the pre-aggregated intermediate state of the window corresponding to that window end time is selected, and the processing task corresponding to that intermediate state is executed;
and in the event-time mode, a trigger is set using a watermark mechanism in which the maximum time of all streaming data is taken as the watermark, and when the watermark meets a trigger condition, the pre-aggregated intermediate state of the window corresponding to the window end time is selected and the processing task corresponding to that intermediate state is executed.
The streaming data processing system based on column storage data, wherein the module 1 is configured to perform the following:
when the window mode is a rolling window, taking the sum of the start time of the window containing the data in the batch data block and the window size as the window end time, and using the window end time as the window sequence number;
when the window mode is a sliding window, calculating, according to the sliding interval, the start time of the window containing the data in the batch data block, and taking the sum of that start time and the sliding interval as the window end time;
and setting a temporary sub-window whose size is the greatest common divisor of the window size and the window sliding interval and whose start time is the window end time, sliding the temporary sub-window in the direction of decreasing time until the window with the smallest sequence number that contains the data in the batch data block is found, and using its end time as the window sequence number.
The streaming data processing system based on column storage data, wherein the streaming data is physiological data, image data or log text data acquired by a sensor in real time, and the processing task corresponding to the streaming data is a database statistics task.
As the above scheme shows, the invention has the following advantages:
The invention provides a streaming data computing system that uses a column storage engine. Compared with the prior art, the system improves throughput in data analysis scenarios by using a column storage and computation engine combined with pre-aggregation, while keeping latency low. In the Yahoo streaming data computing benchmark, the system achieves 14.8 times the throughput of Apache Flink, a system well known in the industry, and in a typical data analysis scenario using the New York taxi data set its throughput is more than 2700 times that of Flink and Apache Spark Streaming.
Drawings
FIG. 1 is a schematic diagram of the row storage mode;
FIG. 2 is a schematic diagram of the column storage mode;
FIG. 3 is a diagram of how the system is used;
FIG. 4 is a flow chart of streaming data processing;
FIG. 5 is a diagram of the WindowView creation syntax;
FIG. 6 is a diagram illustrating an example of a watermark;
FIG. 7 is a diagram illustrating an example use of the late-data strategy;
FIG. 8 is a diagram of the TUMBLE function definition;
FIG. 9 is a diagram illustrating an example use of the TUMBLE function;
FIG. 10 is a diagram of the HOP function definition;
FIG. 11 is a diagram illustrating an example use of the HOP function.
Detailed Description
Many users find that the throughput of streaming data computing tasks in data analysis scenarios is markedly lower than that of traditional database computing tasks. In studying streaming computing engines, the inventors found that this shortcoming of the prior art is caused by the row-based storage and processing engine used by those engines: a row storage engine computes on one record at a time, which makes it difficult to exploit the relationships between records to accelerate aggregation. Mainstream streaming data computing engines do not adopt column storage engines because row storage processes records one at a time with low processing latency, whereas column storage increases processing latency. Based on research into the prior art, the inventors propose a streaming data computing system built on a column storage engine, which reduces the processing latency of the column storage engine through techniques such as window slicing, window ID compression and pre-aggregation of window computation state, and which persists expired windows through storage engine optimizations so that expired data need never be discarded.
Specifically, the present application involves the following key technical points:
key point 1, a streaming data computing system using a column storage and computation engine; technical effect: the system divides the streaming data into batch data blocks along the time dimension, uses data blocks rather than single records as the unit of computation, and makes full use of column storage and computation to accelerate aggregation;
key point 2, window pre-aggregation; technical effect: computing tasks are pre-aggregated into an intermediate computation state, which reduces the amount of computation when a window is triggered and thus reduces computation latency;
key point 3, sliding window slicing and computation state reuse; technical effect: overlapping sliding windows are sliced into non-overlapping contiguous windows, pre-aggregation is performed on the sliced windows, and the pre-aggregated state is reused when windows are triggered, which reduces the cost of repeated computation for sliding windows and thus reduces computation latency.
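Key point 3 can be illustrated with a short Python sketch. It assumes a sum as the aggregate and a hypothetical window of size 30 sliding by 10, so each slice has size gcd(30, 10) = 10; the function names and the dictionary used as state storage are illustrative assumptions.

```python
def slice_id(ts, step):
    """End time of the non-overlapping slice [id - step, id) containing ts."""
    return ts - ts % step + step


def pre_aggregate(slices, timestamps, values, step):
    """Fold each record into the running sum of its slice."""
    for ts, value in zip(timestamps, values):
        sid = slice_id(ts, step)
        slices[sid] = slices.get(sid, 0) + value


def fire_window(slices, window_end, size, step):
    """Combine the pre-aggregated slices that make up one sliding window."""
    ids = range(window_end - size + step, window_end + step, step)
    return sum(slices.get(sid, 0) for sid in ids)


slices = {}
pre_aggregate(slices, timestamps=[1, 12, 25, 31], values=[1, 2, 3, 4], step=10)

window_0_30 = fire_window(slices, window_end=30, size=30, step=10)
window_10_40 = fire_window(slices, window_end=40, size=30, step=10)
```

The slices ending at 20 and 30 are computed once and reused by both windows, so each record is aggregated a single time even though it belongs to several overlapping windows.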
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
The system implements streaming data processing under SQL (Structured Query Language) semantics in the form of a view. By defining a WindowView view table, a relational source data table is converted into streaming data, and after the streaming data is processed in the WindowView, the result is output to a target table, as shown in FIG. 3. Like a traditional database view, the WindowView monitors a source data table and automatically reads newly inserted data. The source data table can be any table in the system, such as an ordinary relational data table, or a special table such as a distributed table, a Kafka table, a file table or a Null table. Distributed computation can be achieved through a distributed table, and through a Null table data can be inserted directly into the WindowView, so that streaming data is processed without being written to disk. FIG. 4 shows the WindowView streaming data processing flow.
Process one: a WindowView table is created with an SQL statement. The syntax for creating a WindowView is similar to that for creating a database view table, as shown in FIG. 5, and the keywords are described in Table 1.
Table 1 WindowView keyword description:
[Table 1 is provided as an image in the original publication (Figure BDA0003340928800000071).]
The system supports the following watermark mechanisms; an example of their use is shown in FIG. 6:
STRICTLY_ASCENDING: the watermark is submitted as the maximum time observed by the system, and data whose time is less than the maximum observed time is not treated as late. Here the maximum time is the "latest time" of all logs observed by the system. If the system observes the log sequence 1, 5, 3, 4, the "maximum time" is 5. The term "maximum time" is used rather than "most recent time" because time is expressed in the system as a timestamp, and the larger the number, the more recent the time.
ASCENDING: the watermark is submitted as the maximum time observed by the system minus 1, and data whose time is not greater than the maximum observed time is not treated as late.
Bound: the watermark is submitted as the maximum time observed by the system minus a fixed time interval.
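The three watermark strategies can be summarized in a short Python sketch. The strategy strings mirror the names in the text; the `watermark` and `ready_windows` helpers, the `bound` parameter and the trigger condition (a window fires once the watermark reaches its end time) are assumptions made for illustration and are not confirmed by the source.

```python
def watermark(timestamps, strategy, bound=0):
    """Compute the watermark from the timestamps observed so far."""
    max_observed = max(timestamps)
    if strategy == "STRICTLY_ASCENDING":
        return max_observed
    if strategy == "ASCENDING":
        return max_observed - 1
    if strategy == "Bound":
        return max_observed - bound  # bound is the fixed time interval
    raise ValueError(f"unknown watermark strategy: {strategy}")


def ready_windows(window_ends, timestamps, strategy, bound=0):
    """Assumed trigger: a window is ready when watermark >= window end."""
    mark = watermark(timestamps, strategy, bound)
    return [end for end in sorted(window_ends) if end <= mark]


observed = [1, 5, 3, 4]  # the log sequence used as an example in the text
strict = watermark(observed, "STRICTLY_ASCENDING")
ascending = watermark(observed, "ASCENDING")
bounded = watermark(observed, "Bound", bound=2)
```

A larger fixed interval in the bounded strategy delays window triggering and therefore tolerates more out-of-order data, at the cost of higher output latency.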
The system uses the Window Function to assign a Window number to the data set, the Window number being a unique identifier for identifying the Window, the system supporting the TUMBLE (scrolling) and HOP (sliding) Window functions.
The TUMBLE window function defines a window that scrolls at fixed time intervals in the time dimension, the definition of which is shown in FIG. 8. The parameter time _ attr is a time stamp contained in the data, and the data time can also be specified as the system current time by using a function now (); the interval is used to specify the window size; the parameter timezone is an optional parameter for specifying a time zone different from the system, and defaults to the system time zone. FIG. 9 is an example use of the TUMBLE function, which defines a rolling time window of one day in size.
The HOP window function defines a window of fixed size that slides along the time dimension; its definition is shown in FIG. 10. The parameter time_attr is a timestamp contained in the data; the function now() can also be used to specify the current system time as the data time. The parameter hop_interval is the window sliding interval, and the parameter window_interval is the window size. When the window size is larger than the sliding interval, the sliding windows overlap; when the window size equals the sliding interval, the window degenerates into a rolling window; when the window size is smaller than the sliding interval, the windows become discontinuous. Because the system does not support discontinuous windows, the window size cannot be smaller than the sliding interval. The optional parameter timezone specifies a time zone different from that of the system and defaults to the system time zone. FIG. 11 shows an example use of the HOP function that defines a time window three days in size with a sliding interval of one day, which can be used to compute, every day, statistics over the data of the last three days.
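The relationship between window size and sliding interval can be made concrete with a short Python sketch that lists every sliding window containing a given timestamp. It is illustrative only: the function name `hop_windows_containing` is hypothetical, timestamps are plain integers (for example, day numbers), windows are assumed to be half-open intervals [start, end), and window starts are assumed to be aligned to multiples of the sliding interval.

```python
def hop_windows_containing(ts, window_size, hop):
    """List the (start, end) sliding windows that contain timestamp `ts`.

    Assumptions (not from the source): integer timestamps, half-open
    windows [start, end), starts aligned to multiples of `hop`.
    """
    if window_size < hop:
        # The text states that discontinuous windows are not supported.
        raise ValueError("window size cannot be smaller than the sliding interval")
    windows = []
    start = ts - ts % hop             # latest window start not after ts
    while start + window_size > ts:   # window still covers ts
        windows.append((start, start + window_size))
        start -= hop                  # step back one sliding interval
    return windows[::-1]              # oldest window first


# Window size 3, sliding interval 1 (e.g. days): three overlapping windows.
print(hop_windows_containing(5, window_size=3, hop=1))
# Window size equal to the sliding interval: a single rolling window.
print(hop_windows_containing(5, window_size=2, hop=2))
```

With a three-day window sliding by one day, each record falls into three overlapping windows, which is the overlap that the later window-splitting step is designed to avoid recomputing.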
Process two: while streaming data is being processed, newly arrived data may be appended to the system's source data table by the user application. The system can also automatically monitor data sources such as Kafka, in which case new data is automatically inserted into the source data table when it arrives.
Process three: the WindowView automatically monitors updates to the source data table; when the source data table is updated, the newly inserted data is automatically pushed to the WindowView.
Process four: to take full advantage of the column storage engine, data is temporarily cached after being inserted into the WindowView. Once a certain amount has accumulated, the cached data is packed into data blocks, and subsequent processing operates on data blocks as the unit. The packing policy can be configured to trigger a packing operation based on the number of data entries, the data volume, or a time interval.
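The packing policy of process four can be sketched as a small buffer that emits a block when any of the three configured thresholds is reached. This is a hedged illustration, not the system's code: the class name `BlockPacker`, its parameters, and the choice to check the time threshold only when a new row arrives are all assumptions made for brevity.

```python
class BlockPacker:
    """Buffer rows and emit a data block when a threshold is reached.

    Hypothetical sketch: thresholds are row count, byte volume, and the
    age of the oldest buffered row (checked on each insert).
    """

    def __init__(self, max_rows, max_bytes, max_age):
        self.max_rows = max_rows
        self.max_bytes = max_bytes
        self.max_age = max_age
        self.rows = []
        self.size = 0
        self.first_at = None          # arrival time of the oldest buffered row

    def add(self, row, size, now):
        """Buffer one row; return a packed block if a trigger fires, else None."""
        if not self.rows:
            self.first_at = now
        self.rows.append(row)
        self.size += size
        if (len(self.rows) >= self.max_rows
                or self.size >= self.max_bytes
                or now - self.first_at >= self.max_age):
            return self.flush()
        return None

    def flush(self):
        """Return the buffered rows as one block and reset the buffer."""
        block, self.rows, self.size, self.first_at = self.rows, [], 0, None
        return block


packer = BlockPacker(max_rows=3, max_bytes=1000, max_age=60)
print(packer.add("a", 10, now=0))     # buffered, no block yet
print(packer.add("b", 10, now=1))     # buffered, no block yet
print(packer.add("c", 10, now=2))     # row-count trigger fires
```

A production system would presumably also flush on a timer so that a quiet stream does not hold rows indefinitely; that part is omitted here.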
Process five: if the user's computing task includes a window aggregation operation, data belonging to expired windows is filtered out of the data block. The system supports discarding expired data either immediately or after the window has been expired for a period of time; the policy can be specified in the WindowView creation statement.
Process six: a window sequence number is calculated and assigned to each piece of data in the data block, where the timestamp used is either the processing time or the event time of the data record:
Process 6.1: if the window is a rolling window, the start time of the window is obtained. The window start time may be calculated using, for example, the method of Table 2 below.
Process 6.2: the start time obtained in process 6.1 plus the window size is taken as the window end time.
Process 6.3: the window end time obtained in process 6.2 is assigned as the window sequence number.
Process 6.4: if the window is a sliding window, the window start time is calculated using the method of Table 2 below, with the sliding interval taken as the window size.
Process 6.5: the window start time obtained in process 6.4 plus the sliding interval is taken as the window end time.
Process 6.6: because sliding windows overlap, each sliding window is split into small, continuous, non-overlapping windows in order to avoid the repeated calculation that overlapping windows would cause.
Process 6.7: the greatest common divisor of the window size and the sliding interval is calculated and used as the size of the non-overlapping small windows described in process 6.6.
Process 6.8: a temporary window is set whose start time is the window end time obtained in process 6.5 and whose size is the greatest common divisor obtained in process 6.7, and the temporary window is slid in the direction of decreasing time until the first window whose end time is less than the data timestamp is found. The purpose of this step is to find the first window containing the target data timestamp. Because that window cannot be obtained directly by numerical calculation, the system can only slide the window to find the first window whose end time is less than the target timestamp, and then slide the window one unit in the direction of increasing time.
Process 6.9: the end time of the window found in process 6.8 plus the greatest common divisor obtained in process 6.7 is assigned as the window sequence number.
TABLE 2 Window Start time calculation method
Figure BDA0003340928800000091
Figure BDA0003340928800000101
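The window sequence number assignment of process six can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented implementation: Table 2 is only available as an image, so the start-time rule used here (rounding the timestamp down to a multiple of the interval, counted from zero) is an assumption; timestamps are plain integers; windows are treated as half-open intervals [start, end), so the backward slide stops as soon as the candidate end time is no longer greater than the timestamp. All function names are hypothetical.

```python
from math import gcd


def window_start(ts, interval):
    # Assumed stand-in for Table 2: round down to a multiple of `interval`.
    return ts - ts % interval


def tumble_window_id(ts, window_size):
    # Processes 6.1 to 6.3: the window end time is the window sequence number.
    return window_start(ts, window_size) + window_size


def hop_window_id(ts, window_size, hop):
    # Processes 6.4 to 6.9 for a sliding window.
    if window_size < hop:
        raise ValueError("window size cannot be smaller than the sliding interval")
    step = gcd(window_size, hop)           # 6.7: size of the small windows
    end = window_start(ts, hop) + hop      # 6.4 and 6.5: candidate end time
    latest = end
    while end > ts:                        # 6.8: slide toward decreasing time
        latest = end
        end -= step
    return latest                          # 6.9: end time of the small window


print(tumble_window_id(7, window_size=5))
print(hop_window_id(7, window_size=6, hop=4))
print(hop_window_id(4, window_size=6, hop=4))
```

With a window size of 6 and a sliding interval of 4, the small windows have size 2, so timestamp 7 lands in the small window [6, 8) and receives sequence number 8, while timestamp 4 lands in [4, 6) and receives 6.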
Process seven: the data block is split into multiple intermediate data blocks according to the window sequence numbers assigned in process six, so that each intermediate data block contains only data with the same window sequence number. Pre-aggregation is then computed over the data of each intermediate data block, producing a pre-aggregation intermediate state.
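Process seven amounts to a group-by on the window sequence number followed by a partial aggregation per group. A minimal Python sketch, assuming a data block is a list of (window sequence number, value) pairs and that the aggregation is a sum; the function names are hypothetical:

```python
from collections import defaultdict


def split_by_window(block):
    """Split a block of (window_id, value) rows into per-window blocks."""
    parts = defaultdict(list)
    for window_id, value in block:
        parts[window_id].append(value)
    return dict(parts)


def pre_aggregate(parts):
    """Compute one intermediate state (here: a partial sum) per window."""
    return {window_id: sum(values) for window_id, values in parts.items()}


block = [(10, 2), (20, 5), (10, 3), (20, 1), (30, 7)]
parts = split_by_window(block)
print(parts)
print(pre_aggregate(parts))
```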
When the system pre-aggregates data blocks, the column storage technique allows it to read only the columns required by the aggregation operation, which reduces disk read time. For example, to count the total number of users older than 30 in each window, the system first reads the age column and filters out users aged 30 or younger, then reads the window sequence number column and aggregates by window sequence number. No other column of the data table needs to be read during the whole operation, which reduces disk overhead. In addition, the data handled in this process is more compact and therefore friendlier to the CPU cache, which speeds up the calculation.
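The column-pruning example above can be mimicked in Python with a table stored as one list per column. This is only a sketch of the access pattern: the table layout, the column names `age`, `window_id`, and `name`, and the function name are assumptions, and in-memory lists stand in for column files on disk.

```python
from collections import Counter


def count_users_over_30(columns):
    """Count users older than 30 per window, touching only two columns."""
    ages = columns["age"]                       # first column read
    keep = [i for i, age in enumerate(ages) if age > 30]
    window_ids = columns["window_id"]           # second column read
    return dict(Counter(window_ids[i] for i in keep))


table = {
    "age": [25, 31, 40, 30, 52],
    "window_id": [10, 10, 10, 20, 20],
    "name": ["a", "b", "c", "d", "e"],          # never read by the query
}
print(count_users_over_30(table))
```

The `name` column is never accessed, which corresponds to the disk reads that a column store avoids.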
The pre-aggregation technique can be illustrated as follows. Suppose the calculation task is a numeric sum and four numbers, 1, 2, 3, and 4, arrive on a data stream. With pre-aggregation, the system computes in advance as each number arrives, and the successive pre-aggregation intermediate states are (1, 1), (2, 3), (3, 6), and (4, 10). When the system triggers the final calculation, it directly reads the latest, that is, the fourth, pre-aggregation intermediate state, and 10 is the final calculation result.
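The running-sum example can be reproduced with a few lines of Python. The sketch reads each intermediate state as the pair (arriving number, running sum); this reading is an assumption, since for the stream 1, 2, 3, 4 the pairs could equally be read as (count, running sum).

```python
from itertools import accumulate


def pre_aggregation_states(stream):
    """Return one (arriving number, running sum) state per arrival."""
    values = list(stream)
    return list(zip(values, accumulate(values)))


states = pre_aggregation_states([1, 2, 3, 4])
print(states)            # one intermediate state per arriving number
print(states[-1][1])     # the final result is read from the latest state
```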
Process eight: the pre-aggregation intermediate state is written to the internal storage engine.
Process nine: in streaming data processing, data arrives continuously, so a background task must perform merge operations from time to time. The system uses a background task that, when computing resources are idle, automatically performs pre-aggregation on the data blocks in the storage engine that share the same window sequence number, merging them into a single data block.
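The background merge of process nine can be sketched as folding all stored intermediate states with the same window sequence number into one. The example below is hypothetical: it models the storage engine as a list of (window sequence number, state) entries and uses a (sum, count) state, as an average would require, to show that intermediate states are merged rather than final values.

```python
def merge_blocks(stored):
    """Merge stored (window_id, (sum, count)) entries into one state per window."""
    merged = {}
    for window_id, (part_sum, part_count) in stored:
        prev_sum, prev_count = merged.get(window_id, (0, 0))
        merged[window_id] = (prev_sum + part_sum, prev_count + part_count)
    return merged


stored = [(10, (3, 2)), (20, (5, 1)), (10, (4, 3))]
print(merge_blocks(stored))
```

Merging is safe to run at any time because combining two (sum, count) states gives the same result regardless of the order in which the blocks arrived.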
Process ten: in the processing time mode, the system sets a trigger based on the machine time of the computer; when the machine time reaches the end time of a window, a window processing command is called and the data of the window corresponding to that time is calculated. In the event time mode, the system sets a trigger using the watermark mechanism: the maximum time observed across all messages so far is taken as the watermark, and when the watermark satisfies the trigger condition, the corresponding window processing command is called. The window processing command comprises the following steps:
Process 10.1: the pre-aggregation intermediate states of the window sequence numbers corresponding to the window are extracted from the internal storage. Each rolling window corresponds to one window sequence number, whereas a sliding window corresponds to one or more window sequence numbers because of window splitting.
Process 10.2: if the pre-aggregation intermediate state extracted in process 10.1 consists of multiple data blocks, a pre-aggregation calculation is performed to merge them into a single data block.
Process 10.3: the final calculation operation is applied to the pre-aggregation intermediate state of the single data block to produce the final calculation result.
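The window processing command can be sketched in Python as a lookup, a merge, and a finalization. Several details are assumptions made for illustration: the store is a dictionary from window sequence number to a list of (sum, count) states; each small window is identified by its end time; a sliding window ending at `window_end` is therefore covered by the small windows whose end times run from `window_end - window_size + step` to `window_end`; and the final operation is an average. The function names are hypothetical.

```python
from math import gcd


def sub_window_ids(window_end, window_size, step):
    # Small windows of size `step` that tile [window_end - window_size, window_end).
    return list(range(window_end - window_size + step, window_end + 1, step))


def process_window(store, window_end, window_size, hop):
    """Process 10.1 to 10.3: extract states, merge them, compute the result."""
    step = gcd(window_size, hop)
    states = []
    for window_id in sub_window_ids(window_end, window_size, step):
        states.extend(store.get(window_id, []))       # 10.1: extract
    if not states:
        return None
    total = sum(s for s, _ in states)                 # 10.2: merge into one state
    count = sum(c for _, c in states)
    return total / count                              # 10.3: final calculation


store = {6: [(3, 1)], 8: [(4, 1), (1, 1)], 10: [(2, 2)]}
print(process_window(store, window_end=10, window_size=6, hop=4))   # sliding
print(process_window(store, window_end=8, window_size=2, hop=2))    # rolling
```

For a rolling window the sliding interval equals the window size, so the list of sequence numbers collapses to the single window end time.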
Process eleven: if the TO keyword was specified when the WindowView was created, the final calculation result is output to the target table.
Process twelve: if a client monitors the WindowView using the WATCH keyword, the final calculation result is output to the client terminal.
Process thirteen: processes three through twelve are repeated as new data arrives.
Process fourteen: the system uses a background task to periodically clear the data of expired windows according to the late data processing policy and release the storage space.
In summary, the system uniformly divides every processing task (calculation operation) into two steps: computing pre-aggregation intermediate states, and merging the pre-aggregation intermediate states to produce the final calculation result. The calculation operation may be a common database operation such as summing, averaging, counting, or sorting. Take a summation over 100 pieces of data as an example, and assume the machine has 10 computing threads. The system assigns 10 pieces of data to each computing thread. Step one: each computing thread sums its 10 assigned pieces of data, and each of these sums is a pre-aggregation intermediate state. Step two: the 10 sums produced by the 10 threads are merged to produce the final calculation state, which is the sum of the 100 pieces of data.
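The two-step summation over 100 pieces of data with 10 threads can be sketched with Python's standard thread pool. This is an illustration of the division of work only; the function name `two_step_sum` is hypothetical, and Python threads are used for clarity rather than for speed.

```python
from concurrent.futures import ThreadPoolExecutor


def two_step_sum(data, threads=10):
    """Step one: per-thread partial sums. Step two: merge the partial sums."""
    chunk = max(1, -(-len(data) // threads))          # ceiling division
    chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(sum, chunks))        # intermediate states
    return partials, sum(partials)                    # final calculation


partials, total = two_step_sum(list(range(1, 101)))
print(len(partials))     # number of intermediate states
print(total)             # sum of the 100 pieces of data
```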
The following is a system embodiment corresponding to the method embodiment above, and the two embodiments can be implemented in cooperation with each other. The related technical details mentioned in the above embodiment remain valid in this embodiment and are not repeated here. Accordingly, the related technical details mentioned in this embodiment can also be applied to the above embodiment.
The invention also provides a streaming data processing system based on column storage data, which comprises:
a module 1, configured to acquire column-stored streaming data to be processed and a processing task corresponding to the column-stored streaming data, divide the streaming data into batch data blocks based on a time dimension, and assign a window sequence number to each piece of data in the batch data blocks according to a preset window mode;
a module 2, configured to segment the batch data block into a plurality of intermediate data blocks, where each intermediate data block only contains data with the same window sequence number, and perform pre-aggregation calculation on the data of each intermediate data block to generate a pre-aggregation intermediate state;
and a module 3, configured to extract, according to a preset streaming data time processing mode, the pre-aggregation intermediate state of the window sequence number corresponding to a window from the internal storage, execute the processing task corresponding to the pre-aggregation intermediate state, and output the task execution result as the streaming data processing result.
In the streaming data processing system based on column storage data, the module 2 is configured to, when performing the pre-aggregation, discard window-expired data directly or discard it after a specified time.
In the streaming data processing system based on column storage data, the streaming data time processing mode in the module 3 is a processing time mode or an event time mode;
in the processing time mode, a trigger is set using the machine time of the computer that executes the processing task, so that when the machine time reaches the window end time, a window processing command is called, the pre-aggregation intermediate state of the window corresponding to the window end time is selected, and the processing task corresponding to the pre-aggregation intermediate state is executed;
in the event time mode, a trigger is set using a watermark mechanism, with the maximum time of all streaming data taken as the watermark; when the watermark satisfies a trigger condition, the pre-aggregation intermediate state of the window corresponding to the window end time is selected, and the processing task corresponding to the pre-aggregation intermediate state is executed.
In the streaming data processing system based on column storage data, the module 1 is configured to:
when the window mode is a rolling window, take the sum of the window start time of the data in the batch data block and the window size as the window end time, and take the window end time as the window sequence number;
when the window mode is a sliding window, calculate, according to the sliding interval, the start time of the window in which the data in the batch data block is located, and take the sum of the window start time and the sliding interval as the window end time;
and set a temporary sub-window whose size is the greatest common divisor of the window size and the window sliding interval and whose start time is the window end time, slide the temporary sub-window in the direction of decreasing time until the window with the smallest sequence number that contains the data in the batch data block is found, and take the end time of that window as the window sequence number.
The streaming data processing system based on the column storage data is characterized in that the streaming data is physiological data, image data or log text data acquired by a sensor in real time; and the processing task corresponding to the streaming data is a database statistical task.

Claims (10)

1. A streaming data processing method based on columnar data, comprising:
step 1, acquiring column-stored streaming data to be processed and a processing task corresponding to the column-stored streaming data, dividing the streaming data into batch data blocks based on a time dimension, and allocating a window sequence number to each piece of data in the batch data blocks according to a preset window mode;
step 2, segmenting the batch data block into a plurality of intermediate data blocks, wherein each intermediate data block only contains data with the same window sequence number, and performing pre-aggregation calculation on the data of each intermediate data block to generate a pre-aggregation intermediate state;
step 3, according to a preset streaming data time processing mode, extracting the pre-aggregation intermediate state of the window sequence number corresponding to a window from the internal storage, executing the corresponding processing task, and outputting the task execution result as the streaming data processing result.
2. The streaming data processing method based on columnar data according to claim 1, wherein the step 2 comprises: when performing the pre-aggregation, directly discarding window-expired data or discarding the window-expired data after a specified time.
3. The streaming data processing method based on columnar data according to claim 1, wherein the streaming data time processing mode in step 3 is a processing time mode or an event time mode;
in the processing time mode, a trigger is set using the machine time of the computer that executes the processing task, so that when the machine time reaches the window end time, a window processing command is called, the pre-aggregation intermediate state of the window corresponding to the window end time is selected, and the processing task corresponding to the pre-aggregation intermediate state is executed;
in the event time mode, a trigger is set using a watermark mechanism, with the maximum time of all streaming data taken as the watermark; when the watermark satisfies a trigger condition, the pre-aggregation intermediate state of the window corresponding to the window end time is selected, and the processing task corresponding to the pre-aggregation intermediate state is executed.
4. The streaming data processing method based on columnar data according to claim 1, wherein the step 1 comprises:
when the window mode is a rolling window, taking the sum of the window starting time and the window size of the data in the batch data block as the window ending time, and taking the window ending time as the window serial number;
when the window mode is a sliding window, calculating the starting time of a window where data in the batch data block are located according to a sliding interval, and taking the sum of the starting time and the sliding interval of the window as the window ending time;
and setting a temporary sub-window whose size is the greatest common divisor of the window size and the window sliding interval and whose start time is the window end time, sliding the temporary sub-window in the direction of decreasing time until the window with the smallest sequence number that contains the data in the batch data block is found, and taking the end time of that window as the window sequence number.
5. The streaming data processing method based on the columnar data as claimed in claim 1, wherein the streaming data is physiological data, image data or log text data acquired by a sensor in real time; and the processing task corresponding to the streaming data is a database statistical task.
6. A streaming data processing system based on columnar data, comprising:
a module 1, configured to acquire column-stored streaming data to be processed and a processing task corresponding to the column-stored streaming data, divide the streaming data into batch data blocks based on a time dimension, and assign a window sequence number to each piece of data in the batch data blocks according to a preset window mode;
a module 2, configured to segment the batch data block into a plurality of intermediate data blocks, wherein each intermediate data block only contains data with the same window sequence number, and to perform pre-aggregation calculation on the data of each intermediate data block to generate a pre-aggregation intermediate state;
and a module 3, configured to extract, according to a preset streaming data time processing mode, the pre-aggregation intermediate state of the window sequence number corresponding to a window from the internal storage, execute the processing task corresponding to the pre-aggregation intermediate state, and output the task execution result as the streaming data processing result.
7. The streaming data processing system based on columnar data according to claim 6, wherein the module 2 is configured to, when performing the pre-aggregation, discard window-expired data directly or discard the window-expired data after a specified time.
8. The streaming data processing system based on columnar data according to claim 6, wherein the streaming data time processing mode in the module 3 is a processing time mode or an event time mode;
in the processing time mode, a trigger is set using the machine time of the computer that executes the processing task, so that when the machine time reaches the window end time, a window processing command is called, the pre-aggregation intermediate state of the window corresponding to the window end time is selected, and the processing task corresponding to the pre-aggregation intermediate state is executed;
in the event time mode, a trigger is set using a watermark mechanism, with the maximum time of all streaming data taken as the watermark; when the watermark satisfies a trigger condition, the pre-aggregation intermediate state of the window corresponding to the window end time is selected, and the processing task corresponding to the pre-aggregation intermediate state is executed.
9. The streaming data processing system based on columnar data according to claim 6, wherein the module 1 is configured for:
When the window mode is a rolling window, taking the sum of the window starting time and the window size of the data in the batch data block as the window ending time, and taking the window ending time as the window serial number;
when the window mode is a sliding window, calculating the starting time of a window where data in the batch data block are located according to a sliding interval, and taking the sum of the starting time and the sliding interval of the window as the window ending time;
and setting a temporary sub-window whose size is the greatest common divisor of the window size and the window sliding interval and whose start time is the window end time, sliding the temporary sub-window in the direction of decreasing time until the window with the smallest sequence number that contains the data in the batch data block is found, and taking the end time of that window as the window sequence number.
10. The streaming data processing system based on columnar data according to claim 6, wherein the streaming data is physiological data, image data or log text data acquired by a sensor in real time; and the processing task corresponding to the streaming data is a database statistical task.
CN202111307991.4A 2021-11-05 2021-11-05 Streaming data processing method and system based on column storage database Pending CN114185885A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111307991.4A CN114185885A (en) 2021-11-05 2021-11-05 Streaming data processing method and system based on column storage database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111307991.4A CN114185885A (en) 2021-11-05 2021-11-05 Streaming data processing method and system based on column storage database

Publications (1)

Publication Number Publication Date
CN114185885A true CN114185885A (en) 2022-03-15

Family

ID=80540772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111307991.4A Pending CN114185885A (en) 2021-11-05 2021-11-05 Streaming data processing method and system based on column storage database

Country Status (1)

Country Link
CN (1) CN114185885A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114722014A (en) * 2022-06-09 2022-07-08 杭银消费金融股份有限公司 Batch data time sequence transmission method and system based on database log file
CN115794900A (en) * 2022-11-10 2023-03-14 南京捷崎信息科技有限公司 Data processing method and system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114722014A (en) * 2022-06-09 2022-07-08 杭银消费金融股份有限公司 Batch data time sequence transmission method and system based on database log file
CN114722014B (en) * 2022-06-09 2022-09-02 杭银消费金融股份有限公司 Batch data time sequence transmission method and system based on database log file
CN115794900A (en) * 2022-11-10 2023-03-14 南京捷崎信息科技有限公司 Data processing method and system

Similar Documents

Publication Publication Date Title
CN106681846B (en) Statistical method, device and system of log data
CN114185885A (en) Streaming data processing method and system based on column storage database
US7376682B2 (en) Time model
US20050055673A1 (en) Automatic database diagnostic monitor architecture
CN107623639B (en) EMD distance-based data flow distributed similarity connection method
US20090248725A1 (en) Compressability estimation of non-unique indexes in a database management system
CN108228322B (en) Distributed link tracking and analyzing method, server and global scheduler
CN114116665B (en) Method for writing transaction log in parallel in database to promote processing efficiency
CN104157065B (en) Internet voting method and device
CN107480072B (en) Transparent computing server cache optimization method and system based on association mode
CN102063449A (en) Method and device for improving reliability of statistic information of data object in database
CN110825598A (en) Log real-time processing method and system
US9094225B1 (en) Discovery of short-term and emerging trends in computer network traffic
CN114185884A (en) Streaming data processing method and system based on column storage data
CN104317820B (en) Statistical method and device for report forms
KR20170130178A (en) In-Memory DB Connection Support Type Scheduling Method and System for Real-Time Big Data Analysis in Distributed Computing Environment
WO2023077451A1 (en) Stream data processing method and system based on column-oriented database
CN113220530B (en) Data quality monitoring method and platform
CN113760950B (en) Index data query method, device, electronic equipment and storage medium
CN111813833B (en) Real-time two-degree communication relation data mining method
CN113590322A (en) Data processing method and device
CN116226296B (en) OpenGauss-based data packet aggregation method
CN115952200B (en) MPP architecture-based multi-source heterogeneous data aggregation query method and device
Yang et al. Fast and accurate stream processing by filtering the cold
CN116483886B (en) Method for inquiring OLAP by combining KV storage engine and time sequence storage engine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination