CN111522846B

CN111522846B - Data aggregation method based on time sequence intermediate state data structure

Info

Publication number: CN111522846B
Application number: CN202010273950.7A
Authority: CN
Inventors: 王新根; 王新宇; 鲁萍; 黄滔; 陈伟; 金路
Original assignee: Zhejiang Bangsheng Technology Co ltd
Current assignee: Zhejiang Bangsheng Technology Co ltd
Priority date: 2020-04-09
Filing date: 2020-04-09
Publication date: 2023-08-22
Anticipated expiration: 2040-04-09
Also published as: CN111522846A

Abstract

The invention discloses a data aggregation method based on a time sequence intermediate state data structure. The time sequence intermediate state data structure comprises a primary key, a feature key and data. The primary key associates a specific business object in the business system and is used to load-balance data storage and calculation; the feature key describes the feature name of the specific business object; the data are the numerical values, together with their calculation methods, formed when events in the business system are processed by the feature computing system, and are used for calculating intermediate results when time sequence intermediate state data are merged. In the data aggregation method, events are converted into several pieces of intermediate state data and stored in the corresponding cache queues; the intermediate state data are merged according to their primary key, feature key and timestamp, and are finally stored in a database. The method can reduce the IO load of the system, and reduce and reasonably distribute the calculation load of the system.

Description

Data aggregation method based on time sequence intermediate state data structure

Technical Field

The invention relates to the field of data processing, in particular to a data aggregation method based on a time sequence intermediate state data structure.

Background

Time series data is a data sequence recorded in chronological order. Time sequence feature calculation produces feature data by processing time series data with a calculation method such as summation, average or variance. For example, transaction records containing time stamps are time series data, and time sequence features such as the total transaction amount for the month or the number of transactions in the last hour can be obtained through summation and counting. In industries such as mobile internet, Internet of Things and financial services, time sequence features are widely used in business scenarios such as transaction fraud prevention, personalized recommendation and in-process decision making. Besides serving as a basis for business decisions, time sequence features can also serve as input to a rule engine, a machine learning model and the like, in order to handle complex decisions.

The real-time time sequence feature computing system (hereinafter referred to as the feature computing system) computes the time sequence features of an event data stream in real time, and has the following three characteristics: 1) Event driven: feature computation is triggered by receiving an event. 2) Stateful computation: feature computation relies on past or associated data and cannot be derived from the event currently being processed alone. For example, the "total transaction amount in the last 5 minutes" cannot be calculated from the information of the latest transaction alone. The feature computing system therefore needs to maintain a series of states. 3) Real-time computation: the value of data decreases with the passage of time, and since the feature computing system serves as a basis for decisions, feature computation needs to be completed in the shortest possible time.

The feature computing system is mostly implemented with a classical stream computing architecture; the current mainstream big data stream computing frameworks include Flink, Spark Streaming, Storm and the like. However, as the means of data acquisition multiply and business complexity grows, the amount of data that the feature computing system must process is expanding rapidly. The system faces greater challenges, mainly in two respects. First, the number of features is huge: a multi-dimensional, complex feature system must be established for business objects such as users, accounts and assets, forming a huge system containing trillions of features. Second, the event concurrency is huge: the feature computing system needs to handle more than ten million events per second, which puts enormous IO pressure on network transmission and the underlying storage.

These two pressures have a great impact on the classical stream computing architecture. As the number of features increases dramatically, the number of states that must be maintained inside the stream computing framework also increases dramatically. Because a stream computing framework such as Flink is implemented only as a computing framework, and does not manage or optimize the underlying storage, engineers have to move state management out of the stream computing framework into an external distributed in-memory database. Each time an event is received, the feature computing system needs to fetch the data related to the current time sequence feature from the in-memory database and send it into the stream computing framework for processing. On the other hand, as the event concurrency increases, the frequency and volume of state data synchronization increase, raising the network IO load inside the system. In the end, the real-time requirements of feature computation can be met only by adding a large number of stream computing framework nodes and in-memory database nodes. The cost of ownership and the operation and maintenance cost of feature computing systems have therefore increased dramatically.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a data aggregation method based on a time sequence intermediate state data structure, which can reduce the IO load of a system and reduce and reasonably distribute the calculation load of the system.

The aim of the invention is realized by the following technical scheme: a data aggregation method based on a time sequence intermediate state data structure, in which the data of the business system are converted into intermediate state data by the feature computing system, and the intermediate state data are then aggregated and stored;

the structure of the intermediate state data comprises a primary key PKey, a feature key Fkey and Data;

the primary key PKey is used for associating a specific business object in the business system and is a globally unique key value; load balancing is performed on the storage and calculation of the data;

the feature key Fkey is used for describing the feature names of specific business objects; the feature names have uniqueness; by a combination of the primary key and the feature key, a particular feature of a certain business object can be uniquely determined.

The Data are the numerical values, together with their calculation method, formed when events in the business system are processed by the feature computing system, and are used for calculating intermediate results when time sequence intermediate state data are merged. The Data comprise a time stamp, an aggregation mode, a result value and auxiliary data; the time stamp is the starting point of the time slice to which the current intermediate state data belongs, and is mapped by the feature computing system from the event time stamp; the aggregation mode describes the method by which the intermediate state data are aggregated; the result value is the specific value of the intermediate state data as currently known; the auxiliary data are the additional data, related to the aggregation mode, that are required when the intermediate state data are aggregated.
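The mapping from an event time stamp to the starting point of its time slice can be illustrated with a minimal sketch. It assumes millisecond epoch time stamps and one-hour slices aligned to whole hours; the names `SLICE_MS` and `slice_start` are hypothetical and do not come from the patent text.

```python
# Assumed slice length: one hour, expressed in milliseconds.
SLICE_MS = 60 * 60 * 1000


def slice_start(event_ts_ms: int, slice_ms: int = SLICE_MS) -> int:
    """Map an event time stamp to the start of the time slice containing it."""
    return event_ts_ms - (event_ts_ms % slice_ms)


# An event two minutes into a slice maps back to the start of that slice.
start = 1585713600000
two_minutes_in = start + 2 * 60 * 1000
print(slice_start(two_minutes_in) == start)
```

All events that fall into the same slice receive the same time stamp, which is what later allows their intermediate state data to be merged.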

Further, the method comprises the steps of:

(1) The service system randomly sends the event to any node in the feature computing system;

(2) The feature computing system node receiving the event calculates the time sequence feature of the event data stream in real time and converts the time sequence feature into intermediate state data; determining a target node corresponding to the intermediate state data according to a primary key PKey of the intermediate state data, and sending the intermediate state data to a cache Queue of the target node;

(3) The cache Queue takes out n pieces of intermediate state data at a time, and the intermediate state data are compared and merged pairwise according to whether their primary key PKey, feature key Fkey and Timestamp are consistent;

(4) Merging the intermediate state data merging results in the step (3) with corresponding intermediate state data in the memory database MemDB one by one in the same way as the step (3), and storing the final merged results in the memory database MemDB.

Further, the feature computing system has several nodes, each node comprising three primary structures: the feature processor Feature Processor, the cache Queue and the memory database MemDB;

the feature processor Feature Processor is configured to receive an event from the service system, convert the event into intermediate data, and forward the intermediate data to a corresponding node for subsequent processing according to a PKey corresponding to the intermediate data;

the cache Queue buffers the intermediate state data output by the feature processor Feature Processor, and is used for decoupling the feature processor from the underlying database;

the memory database MemDB is the underlying storage of the whole feature computing system and is used for storing all intermediate state data.

Further, the primary key may be a merchant number in a clearing system or a sensor ID in the Internet of Things; it is a specific unique object designed and abstracted according to the business system.

Further, the feature computing system is a system for computing timing features of an event data stream in real time.

Further, the feature computing system converts the primary key into a fixed value through a hash algorithm, and selects nodes in the feature computing system for data processing and storage according to the fixed value.
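A minimal sketch of this node selection follows. The patent does not name a hash algorithm, so the use of SHA-256 and the modulo mapping are assumptions made for illustration, and the function name `target_node` is hypothetical.

```python
import hashlib


def target_node(pkey: str, node_count: int) -> int:
    """Return the index of the node assumed responsible for a primary key."""
    # A stable digest is used so that every node computes the same value;
    # Python's built-in hash() for str is randomized per process by default.
    digest = hashlib.sha256(pkey.encode("utf-8")).digest()
    fixed_value = int.from_bytes(digest[:8], "big")
    return fixed_value % node_count


# The same primary key always maps to the same node.
print(target_node("T-8IXY5C8S", 4) == target_node("T-8IXY5C8S", 4))
```

Because routing depends only on the primary key, all intermediate state data of one business object arrive at the same node, regardless of which node first received the event.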

Further, the aggregation mode includes aggregating data by maximum/minimum value, average value, variance, standard deviation and the like.

Further, when a merge of time sequence intermediate state data updates the result value, the auxiliary data must be updated correspondingly.
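As an illustration of updating auxiliary data together with the result value, the sketch below merges two partial variance states. The patent does not specify which auxiliary data a variance aggregation keeps; the choice of count, mean and sum of squared deviations, and the pairwise merge formula, are assumptions borrowed from the standard parallel variance technique. The names `VarState`, `from_value` and `merge_var` are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class VarState:
    count: int   # auxiliary data: number of values seen
    mean: float  # auxiliary data: mean of the values seen
    m2: float    # auxiliary data: sum of squared deviations from the mean

    @property
    def variance(self) -> float:
        """Result value: population variance of the values seen."""
        return self.m2 / self.count


def from_value(x: float) -> VarState:
    """Build the state for a single observed value."""
    return VarState(1, x, 0.0)


def merge_var(a: VarState, b: VarState) -> VarState:
    """Merge two partial states; all auxiliary fields are updated together."""
    n = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / n
    m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / n
    return VarState(n, mean, m2)


state = reduce(merge_var, (from_value(x) for x in [1.0, 2.0, 3.0, 4.0]))
print(state.count, state.mean, state.variance)
```

If only the variance were stored, two partial results could not be combined; keeping the auxiliary fields is what makes the state mergeable.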

The invention has the beneficial effects that:

(1) The IO load of the system is reduced:

Reducing network load: compared with transmitting the complete original detail data, transmitting intermediate state data records significantly reduces the network IO load among the modules of the system.

Reducing storage load: for persistent storage, only the intermediate state data records need to be stored, and the storage medium does not need to be read and written frequently, so the IO load on the storage medium is reduced.

(2) The computational load of the system is reduced and reasonably distributed:

Distributing the computational load: intermediate state data records can be merged, so computation can take place in several structures of the system, and the computational load is effectively spread across the system instead of falling on nodes dedicated to data computation.

Reducing serialization load: because data must be transmitted over the network between the parts of the system, many serialization and deserialization operations are involved. Using intermediate state data greatly reduces the amount of data transmitted, and with it the serialization and deserialization work required of each module, which lowers the computational load of the whole system.

Drawings

FIG. 1 is an architecture overview of a feature computing system;

FIG. 2 is a diagram of a timing intermediate data record structure;

FIG. 3 is a diagram illustrating an environmental temperature monitoring architecture in accordance with an embodiment of the present invention.

Detailed Description

The invention provides a data aggregation method based on a time sequence intermediate state data structure, which uses intermediate state data as the medium for feature calculation and data propagation within the system. In the field of stateful computing, intermediate state data is the counterpart of final state data. For a time sequence feature, the value finally calculated when the time window corresponding to the feature changes is the final state data. For example, the time window of the time sequence feature "total transaction amount in the last 24 hours" slides once per hour, producing a new piece of final state data each time. In contrast, intermediate state data is calculated from the events within a certain time slice, and retains intermediate results instead of detail data. When the time window slides, the corresponding intermediate state data are aggregated to obtain the final state data. Taking the "total transaction amount in the last 24 hours" as an example, since the time window slides once per hour, the intermediate state data is sliced by hour and stores the total transaction amount of each hour. When a new event arrives, only an incremental calculation is performed on the intermediate state data of the time slice corresponding to the event time stamp, changing its value. When the window slides, only the intermediate state data of the 24 corresponding time slices need to be summed, and the detail data do not need to be recalculated.
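The hourly slicing described above can be sketched as follows. This is a minimal single-process illustration that assumes millisecond time stamps and keeps the slices in a dictionary; the names `HOUR_MS`, `slices`, `on_event` and `last_24h_total` are hypothetical.

```python
from collections import defaultdict

HOUR_MS = 3_600_000

# Slice start (ms) -> running transaction total for that hour.
slices = defaultdict(float)


def on_event(ts_ms: int, amount: float) -> None:
    """Incrementally update only the slice that the event falls into."""
    slices[ts_ms - ts_ms % HOUR_MS] += amount


def last_24h_total(window_end_ms: int) -> float:
    """Sum the 24 hourly slices that end at window_end_ms (exclusive)."""
    return sum(
        slices.get(window_end_ms - k * HOUR_MS, 0.0) for k in range(1, 25)
    )


on_event(0 * HOUR_MS + 5, 10.0)
on_event(1 * HOUR_MS + 1, 5.0)
on_event(23 * HOUR_MS + 59, 2.5)

total_first = last_24h_total(24 * HOUR_MS)  # covers slices 0 to 23
total_next = last_24h_total(25 * HOUR_MS)   # window slid by one hour
print(total_first, total_next)
```

Sliding the window only changes which 24 slice totals are summed; no individual transaction is revisited.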

Because intermediate state data can be aggregated, the full calculation in feature computation becomes an incremental calculation. On the one hand, this reduces the repeated computation caused by time window movement during feature calculation, lowering the system's CPU consumption; on the other hand, because the full set of detail data no longer needs to be transmitted repeatedly within the system, the IO consumption of the internal network and memory is reduced.

The specific structure of the intermediate state data is shown in fig. 2, and comprises three parts, namely a primary key (PKey), a feature key (FKey) and Data (Data):

a. The primary key (PKey) associates a specific business object in a business system and is a globally unique key value. The business system is a system that has feature calculation requirements and fulfils them by interfacing with the feature computing system through an interface/client. The primary key may be a merchant number in a clearing system or a sensor ID in the Internet of Things; it is a specific unique object designed and abstracted according to the business system. In addition, the primary key is also used for load balancing key processes such as data storage and calculation. The feature computing system converts the primary key into a fixed value through a hash algorithm, and selects the node in the feature computing system for data processing and storage according to that fixed value.

b. The feature key (FKey) describes the feature name of a specific business object. Feature names are unique: for a particular object, no two features have the same name. By combining the primary key and the feature name, a particular feature of an object in the system can be uniquely determined. For example, the feature computing system receives temperature sensing data uploaded by several temperature sensors. If a feature needs to calculate the highest temperature over the past 24 hours, the globally unique hardware device ID of the sensor, such as T-8IXY5C8S, can be used as the primary key. The storage and computing resources associated with it are determined by the hash value of the ID. The "highest temperature over the past 24 hours" can be used as the feature key. Different temperature sensors may all have the feature key "highest temperature over the past 24 hours", while "the highest temperature of T-8IXY5C8S over the past 24 hours" uniquely identifies the feature of that particular temperature sensor.

c. The Data (Data) are the numerical values, together with their calculation method, formed when events are processed by the feature computing system, and are used for calculating intermediate results when intermediate state data are merged. The Data contain four parts: first, the time stamp, which is the starting point of the time slice to which the current intermediate state data belongs and is mapped by the feature computing system from the event time stamp; second, the aggregation mode, which describes the method for aggregating intermediate state data, for example maximum/minimum, mean, variance or standard deviation; third, the result value, which is the specific value of the intermediate state data as currently known; and fourth, the auxiliary data, which are the additional data required by the aggregation mode when intermediate state data are aggregated. For example, if the intermediate state data requires an average value, the number of values that went into the average is recorded as auxiliary data in addition to the known average. When new data arrives, the new average can be calculated from the existing average and the number of values. When intermediate state data are merged and the result value is updated, the auxiliary data must be updated correspondingly. In fig. 2, the maximum value Max needs no auxiliary data when merging, so none is shown.
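The three-part structure and the average example can be sketched as below. This is one possible in-memory representation, not the layout shown in fig. 2; the class name `MSRecord` borrows the record naming used in the embodiment, while the field names, the `"avg"` mode string, the `"count"` auxiliary key and the feature key `avg_temp_24h` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MSRecord:
    pkey: str        # primary key: identifies the business object
    fkey: str        # feature key: name of the feature
    timestamp: int   # Data: start of the time slice (ms)
    agg: str         # Data: aggregation mode
    value: float     # Data: result value known so far
    aux: dict = field(default_factory=dict)  # Data: auxiliary data


def merge_avg(a: MSRecord, b: MSRecord) -> MSRecord:
    """Merge two average records of the same feature and time slice."""
    n = a.aux["count"] + b.aux["count"]
    value = (a.value * a.aux["count"] + b.value * b.aux["count"]) / n
    # The auxiliary count is updated together with the result value.
    return MSRecord(a.pkey, a.fkey, a.timestamp, "avg", value, {"count": n})


a = MSRecord("T-8IXY5C8S", "avg_temp_24h", 1585713600000, "avg", 16.0, {"count": 3})
b = MSRecord("T-8IXY5C8S", "avg_temp_24h", 1585713600000, "avg", 20.0, {"count": 1})
merged = merge_avg(a, b)
print(merged.value, merged.aux["count"])
```

Weighting each average by its count gives the same result as averaging the four underlying values directly, which is why the count has to travel with the record.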

FIG. 1 is an architecture overview of a feature computing system. The business system is a Client of the feature computing system, and sends two types of data to it: events and processing rules. The processing rules define how intermediate state data are derived from an event. The feature computing system consists of 1 or more nodes and supports horizontal scaling. Each node comprises 3 main structures, specifically as follows:

1. the feature processor Feature Processor is responsible for receiving events from the Client, converting the events into intermediate state data, and forwarding the intermediate state data to the corresponding node for subsequent processing according to the PKey of the intermediate state data;

2. the cache Queue buffers the intermediate state data output by the feature processor and is used for decoupling the feature processor from the underlying database;

3. the memory database MemDB is the underlying storage of the whole system, and is used for storing all intermediate state data.

Before the feature computing system processes data, the Client sends the processing rules to any node, and the nodes synchronize the processing rules among themselves. The specific steps are as follows:

(1) The Client randomly sends the event to any node in the feature computing system;

(2) The feature processor Feature Processor converts the event into 1 or more intermediate state data according to a preset processing rule, determines a target node corresponding to the intermediate state data according to a primary key PKey of the intermediate state data, and sends the intermediate state data to a cache Queue of the target node.

(3) The cache Queue takes out n pieces of intermediate state data at a time, and the intermediate state data are compared and merged pairwise according to whether their primary key PKey, feature key Fkey and Timestamp are consistent. The specific algorithm for merging several intermediate state data records is as follows (expressed in Python syntax):

The parameter records on line 01 is a list containing all intermediate state data to be merged. The variable results on line 02 holds the final merged result. Starting from line 03, each intermediate state data record in records is matched against results to find out whether results contains intermediate state data that it can be merged with. The is_merge method on line 05 judges whether two intermediate state data can be merged by comparing whether the PKey, Fkey and Timestamp of result and record are consistent. If result and record can be merged, the merge_record method on line 06 merges the two intermediate state data and replaces the corresponding element in results. If results contains no intermediate state data that can be merged, the current intermediate state data record is appended to the end of the results list. Finally, the merged intermediate state data record list results is returned.
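The listing that this paragraph walks through is not reproduced in this text. The following is a reconstruction sketched from the prose description only, so its line layout does not correspond to the line numbers cited above. The names records, results, is_merge and merge_record come from the description; the function name `merge_records`, the dictionary representation of a record and the restriction of merge_record to the maximum aggregation mode are assumptions.

```python
def is_merge(result: dict, record: dict) -> bool:
    """Two records can be merged if PKey, Fkey and Timestamp all match."""
    keys = ("pkey", "fkey", "timestamp")
    return all(result[k] == record[k] for k in keys)


def merge_record(result: dict, record: dict) -> dict:
    """Merge two records; only the maximum aggregation mode is sketched."""
    merged = dict(result)
    merged["value"] = max(result["value"], record["value"])
    return merged


def merge_records(records: list) -> list:
    results = []
    for record in records:
        for i, result in enumerate(results):
            if is_merge(result, record):
                # Replace the matching element with the merged record.
                results[i] = merge_record(result, record)
                break
        else:
            # No mergeable record found: append to the end of results.
            results.append(record)
    return results


batch = [
    {"pkey": "T-8IXY5C8S", "fkey": "max_temp_24h", "timestamp": 1585713600000, "value": 16.9},
    {"pkey": "T-28WMYPDB", "fkey": "max_temp_24h", "timestamp": 1585713600000, "value": 35.53},
    {"pkey": "T-8IXY5C8S", "fkey": "max_temp_24h", "timestamp": 1585713600000, "value": 17.3},
]
merged = merge_records(batch)
print(len(merged), [r["value"] for r in merged])
```

The `for ... else` form runs the `else` branch only when the inner loop finishes without a `break`, which matches the "no mergeable record in results" case of the description.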

(4) After the merging of the cached intermediate data is completed, merging the merging results with the corresponding intermediate state data in the memory database MemDB one by one in the same way as the step (3).

If steps (3) and (4) were performed in the conventional way, one piece of data would be fetched from the Queue at a time and transferred to MemDB, which requires serialization and deserialization. MemDB would then be queried for the current value of the feature, and the value would be stored back into MemDB after calculation and merging. Thus n pieces of data require n serialization and deserialization transfer operations, and n MemDB reads and writes.

In comparison, the method of the invention merges n pieces of intermediate state data into m pieces (m ≤ n), reducing the cost of subsequent transmission and MemDB storage. Because the feature system generally partitions the underlying storage by feature, and most of the event data in the system follows a normal distribution, most of the data is concentrated in a small part of the features, so most of the intermediate state data associated with the same feature sits in the same Queue. In the extreme case, if all n pieces of intermediate state data belong to the same feature, only 1 serialization and deserialization transfer operation and 1 MemDB read and write are needed, which greatly reduces the computation and IO consumption of the system.

One embodiment of the invention is as follows:

In the field of the Internet of Things, monitoring ambient temperature is a common requirement. Consider the following scenario: the highest temperature over the past 24 hours needs to be monitored for all temperature sensors, and the time window slides once per hour. As shown in fig. 3, the feature computing system receives 6 Event messages, from event_01 to event_06. According to the content of the 6 events, two temperature sensors are involved, with IDs T-8IXY5C8S and T-28WMYPDB respectively. T-8IXY5C8S collects the temperature once per minute and issues four temperature measurement events, at 12:02:00, 12:03:02, 12:04:01 and 12:05:00. T-28WMYPDB collects the temperature every 3 minutes and issues two temperature measurement events, at 12:03:01 and 12:06:00. The feature computing system processes these data in the following steps:

(1) The Client sends the 6 events to a feature computing system;

(2) The feature processor Feature Processor converts the events into 6 intermediate state data, from MSRecord_01 to MSRecord_06, according to the preset processing rule. The sensor ID maps to the primary key PKey, and the "highest temperature over the past 24 hours" maps to the feature key Fkey. The Data portion includes:

a. the timestamp, 1585713600000: because all the data belong to the time slice 2020/04/01 12:00:00, the timestamp of the intermediate state data converted from each of the 6 events is 1585713600000;

b. the calculation method, max, is the statistical maximum;

c. the value of the intermediate state data is the temperature value of each of the 6 events, because the system does not perform aggregation calculation at this step.

Then, the feature processor Feature Processor sends the 6 intermediate state data to the cache Queue of the corresponding node according to the hash value of the primary key PKey. In this example, assume that T-8IXY5C8S and T-28WMYPDB are routed to the same node.

(3) After the cache Queue finishes processing the previous batch of intermediate state data, it takes out the 6 newly received intermediate state data and merges them according to their primary key PKey, feature key Fkey and Timestamp. The 4 intermediate state data MSRecord_01, MSRecord_03, MSRecord_04 and MSRecord_05 agree on all three elements and are merged into MSRecord_AGG_01, whose value is the maximum of the 4 intermediate state data, 17.3. MSRecord_02 and MSRecord_06 are merged into MSRecord_AGG_02, whose value is the maximum of the 2 intermediate state data, 35.53.

(4) After the merging of the cached intermediate data is completed, the system merges the merging result with the corresponding intermediate state data in the memory database MemDB one by one. MSRecord_AGG_01 is combined with MSRecord_X to generate MSRecord_X ', and MSRecord_AGG_02 is combined with MSRecord_Y to generate MSRecord_Y'. The merging rule is consistent with the step (3).

Because intermediate state data is used and the data are merged in the cache structure, the number of reads and writes the system performs on the memory database is reduced from 12 (6 Select and 6 Update) to 4 (2 Select and 2 Update), one third of the original number.
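The operation counts in this embodiment can be reproduced with a short sketch. The sensor IDs, the slice timestamp and the two maxima (17.3 and 35.53) come from the embodiment; the other four temperatures, the assignment of the maxima to particular records and the feature key name `max_temp_24h` are hypothetical, since those details appear only in fig. 3.

```python
SLICE_START = 1585713600000  # time slice 2020/04/01 12:00:00 from the embodiment
FKEY = "max_temp_24h"        # hypothetical name for the feature key

# (record name, sensor ID, temperature); non-maximum temperatures are made up.
records = [
    ("MSRecord_01", "T-8IXY5C8S", 16.9),
    ("MSRecord_02", "T-28WMYPDB", 35.53),
    ("MSRecord_03", "T-8IXY5C8S", 17.3),
    ("MSRecord_04", "T-8IXY5C8S", 17.1),
    ("MSRecord_05", "T-8IXY5C8S", 17.0),
    ("MSRecord_06", "T-28WMYPDB", 35.1),
]

# Merge in the cache by (PKey, Fkey, Timestamp), keeping the maximum.
merged = {}
for _name, pkey, temp in records:
    key = (pkey, FKEY, SLICE_START)
    merged[key] = max(temp, merged.get(key, temp))

# One Select and one Update per record written to the memory database.
ops_without_merge = 2 * len(records)
ops_with_merge = 2 * len(merged)

print(merged[("T-8IXY5C8S", FKEY, SLICE_START)])
print(merged[("T-28WMYPDB", FKEY, SLICE_START)])
print(ops_without_merge, ops_with_merge)
```

The saving grows with the number of records per feature in a batch: the database cost depends on the number of distinct (PKey, Fkey, Timestamp) combinations, not on the number of events.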

The above-described embodiments are intended to illustrate the present invention, not to limit it, and any modifications and variations made thereto are within the spirit of the invention and the scope of the appended claims.

Claims

1. A data aggregation method based on a time sequence intermediate state data structure is characterized in that the method converts data of a service system into intermediate state data through a feature computing system, and then aggregates and stores the intermediate state data; the method comprises the following steps:

the feature key Fkey is used for describing the feature names of specific business objects; the feature names have uniqueness; a particular feature of a certain business object can be uniquely determined through the combination of the primary key and the feature key;

the Data are the numerical values, together with their calculation method, formed when events in the business system are processed by the feature computing system, and are used for calculating intermediate results when time sequence intermediate state data are merged; the Data comprise a time stamp, an aggregation mode, a result value and auxiliary data; the time stamp is the starting point of the time slice to which the current intermediate state data belongs, and is mapped by the feature computing system from the event time stamp; the aggregation mode describes the method by which the intermediate state data are aggregated; the result value is the specific value of the intermediate state data as currently known; the auxiliary data are the additional data, related to the aggregation mode, that are required when the intermediate state data are aggregated;

2. The method of data aggregation based on time-series intermediate data structures according to claim 1, wherein the feature computing system has a plurality of nodes, each node comprising three structures: the feature processor Feature Processor, the cache Queue and the memory database MemDB;

3. The data aggregation method based on the time sequence intermediate state data structure according to claim 1, wherein the primary key is a merchant number in a clearing system or a certain sensor ID in the internet of things, and the specific unique object is designed and abstracted according to a business system.

4. A method of data aggregation based on a time-series intermediate data structure according to claim 1, wherein the feature computing system is a system for computing time-series features of an event data stream in real time.

5. The method of claim 4, wherein the feature computing system converts the primary key to a fixed value by a hash algorithm, and selects nodes in the feature computing system for data processing and storage based on the fixed value.

6. The method of claim 1, wherein the aggregation comprises aggregating data according to a maximum/minimum value, average value, variance, or standard deviation.

7. The method of claim 1, wherein the auxiliary data is updated in response to completion of the merging of the intermediate data.