CN113934759B

CN113934759B - Data caching device and system for fusion calculation in Gaia system

Info

Publication number: CN113934759B
Application number: CN202111201901.3A
Authority: CN
Inventors: 赵恒泰; 赵宇海; 王国仁; 季航旭; 李博扬
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2021-10-15
Filing date: 2021-10-15
Publication date: 2024-05-17
Anticipated expiration: 2041-10-15
Also published as: CN113934759A

Abstract

The invention discloses a data caching device and system for fusion calculation in the Gaia system, implemented by modifying the data source operator, the data shuffle virtual operator and the dimension table association calculation operator in the Gaia system, and relates to the technical field of distributed big data processing. Specifically, the invention comprises a full-quantity caching device and an incremental caching device for fusion calculation in the Gaia system, a distributed caching system formed by a plurality of full-quantity caching devices, a distributed caching system formed by a plurality of incremental caching devices, and a distributed caching system formed by mixing a plurality of full-quantity caching devices and incremental caching devices. Applying these devices and systems improves the cache scalability of the Gaia computing system, raises the upper limit on the scale of batch data that the whole Gaia computing system can cache, improves the hybrid computing efficiency of the Gaia system per unit time, reduces batch data query delay, and better supports computation over massive data.

Description

Data caching device and system for fusion calculation in Gaia system

Technical Field

The invention relates to the technical field of distributed big data processing, in particular to a data caching device and system for fusion calculation in Gaia systems.

Background

The Gaia system is a highly timely, scalable, new-generation distributed big data analysis system designed for the coexistence of multiple computing models. It addresses a series of key technical problems at several core layers of big data analysis systems, including adaptive and scalable big data storage, batch-stream fusion big data computing, high-dimensional large-scale machine learning, and highly timely intelligent interactive guidance for big data. It builds an independently controllable, highly timely and scalable new-generation big data analysis system and masters the core technologies of internationally leading big data analysis systems.

The Gaia system has a full-cycle, multi-scale optimization technology and a unified computing engine for batch-stream fusion tasks. Existing big data computing systems either rely on their own computing engine to simulate the behavior of another type of framework, or define a set of universal interfaces to hide the differences between the underlying computing engines, and their support for batch-stream fusion is weak. In addition, their optimization is mostly confined to a specific time or level of execution and is not targeted at highly complex tasks. The Gaia system was developed to address these issues. It provides unified expression logic for batch-stream fusion processing and achieves a true fusion of batch and stream processing by integrating the computing model, data model, transformation model and action model of batch and stream processing through unified expression modeling. In view of the diversity, persistence and iterative nature of jobs, it provides optimization strategies for multi-job, multi-task, iterative and persistent computation, making the optimization more targeted. It also provides full-cycle optimization before and during execution, subdivided into several scales such as the job level, task level and transformation level, to achieve very fast response and massive throughput.

In traditional batch computing, each computation stage is independent and waits for and caches its own data. A stage typically needs only a sequential traversal of the cached data to complete its computation, and once the computation is finished and the next stage begins, the data of the current stage can be discarded. In traditional stream computing, most scenarios do not need to cache data: data is computed in real time and can be discarded once the checkpoint is completed. Only some computations involving multiple streams need caching; the cache structure is usually based on basic array types, queries are mostly traversal queries, and because the amount of data in each computation period is small, querying the cache has little impact on the system. Unlike traditional batch or stream computing, in some fusion computing scenarios of the Gaia system the batch data must be cached for a long time after it has been processed so that the stream computation can consume it. As a result, the data in fusion computing cannot be discarded after use as in batch computing, since that would lose data required by part of the computation; nor can it be stored in traditional array types and queried by traversal as in stream computing, because the batch data set in fusion computing is normally very large and traversal queries would seriously reduce stream computing efficiency. Both limitations reduce the data throughput and increase the computation delay of the Gaia system.

Disclosure of Invention

To address the above shortcomings of traditional computing engines, the invention provides a data caching device and system for fusion calculation in the Gaia system, comprising an incremental caching device, a full-quantity caching device, and a data caching system for fusion calculation in the Gaia system formed by a plurality of incremental caching devices and/or full-quantity caching devices. In use, the incremental caching device, the full-quantity caching device or the data caching system is selected according to the data scale and purpose, so as to optimize the batch data reading speed, cache scale and query efficiency under fusion calculation and keep pace with high-speed stream data input, thereby improving the data throughput of the Gaia system and reducing the computation delay.

In order to achieve the above purpose, the technical scheme of the invention is as follows:

A full-quantity caching device for fusion calculation in the Gaia system is used for caching all data from external data sources during the fusion calculation of batch data and stream data, and is suitable for the case where the sum of the memory of all running nodes of the Gaia computing system is greater than or equal to the size of the external data.

Further, according to the full-quantity caching device facing fusion calculation in the Gaia system, the full-quantity caching device comprises a data source module and at least 1 calculation node module;

the data source module further comprises:

The full-external data source connection module is used for reading the data stored in the external storage system in the initialization stage of the calculation process and transmitting the read data to the full-data cleaning module; implementing an external data source connection operator abstract class provided inside a Gaia computing system, providing a data connector for a MySQL data source, reserving a unified interface called by the data source connector, allowing a user to design and implement other data connectors by himself and access the system for connecting external data sources which are not covered by the module according to the data connector for the MySQL data source provided by the system as a template;

the full-data cleaning module is used for normalizing and cleaning data in different formats obtained from different external data sources, and the processed data is sent to the incremental information analysis module for data analysis; the normalization uniformly constructs all data into a triple form comprising key value information of all data columns of the external data, key value information of the data connection (join) columns, and the information of all data columns; the cleaning processes the normalized data, eliminating repeated data and merging data with the same key value;

The incremental information analysis module is used for receiving the data in the form of the triples sent by the full-data cleaning module; and comparing the key value information of all data columns in the triple form data with the key value information acquired from the external data caching module, and judging the processing mode of the incremental data: when the key value information in the triple form data exists in the external data cache module, skipping the data, not sending the triple form data to the data distribution module, and only correspondingly updating the version information of the key value information in the external data cache module, when the key value information in the triple form data does not exist in the external data cache module, sending the key value information to the external data cache module for storage, and sending the triple form data and the data adding operation instruction to the data distribution module; after each data reading period is finished, retrieving version information of key value information in the triple form data stored in the external data cache module, constructing all key value information of a non-latest version into a triple form, and sending the triple form data and a data deleting operation instruction to the data distribution module;

the external data caching module is used for receiving the key value information of the triple form data sent by the incremental information analysis module, and storing the received key value information after adding corresponding version information;

The data distribution module is responsible for maintaining data change and forwarding the data and the instructions received from the incremental information analysis module to the computing node module;

The compute node module further includes:

the data receiving module is used for receiving and parsing the quintuple-form data sent by the data source module and sending the parsed data to the full-data caching module;

The full-data caching module is used for receiving the instruction from the data receiving module and the data in the form of triples and correspondingly modifying the local data caching according to the instruction type;

the full-quantity-calculation module is responsible for the final calculation service: when stream data reaches the module, constructing corresponding key value information through the stream data so as to construct query information, initiating a data query request to the full-data cache module, and connecting and calculating batch data of the query result received from the full-data cache module with the stream data reaching the module;

the full-data caching module further comprises:

the cache frame module is used for providing a data cache function; exposing a unified data operation interface in the whole data caching module, wherein the unified data operation interface comprises a data query operation, a data deletion operation and a data addition operation; according to the received different instructions, respectively calling a data deleting operation interface or a data adding operation interface to carry out cache modification;

and the data query module is used for receiving the data query message from the full-quantity-calculation module, calling a data query operation interface of the cache frame module to perform data query according to the key value information provided by the data query message, and returning the query result to the full-quantity-calculation module.

Further, according to the full-scale caching apparatus for fusion calculation in the Gaia system, the data distribution module further includes:

The data encapsulation module is used for receiving the data in the form of the triples and the data update operation instructions comprising the data addition operation instructions and the data deletion operation instructions, adding version information of the data update operation instructions and the key value information on the basis of the data in the form of the triples, constructing the data in the form of the quintuples, and delivering the data to the distribution strategy module for data distribution;

And the distribution strategy module is used for providing selection of a data distribution strategy by realizing a data shuffle virtual operator in the Gaia computing system, supporting copying and distributing data in a quintuple form to each computing node module or distributing the data in a specific computing node module according to the data key value information, and further increasing the data cache upper limit of the Gaia computing system.

Further, according to the full-scale caching apparatus for fusion calculation in the Gaia system, the data receiving module further includes:

The data analysis module is used for decomposing the data in the quintuple form, reducing the data in the quintuple form into data in the triplet form, data updating operation instructions and version information of key value information, and sending the analyzed data to the operation analysis module;

The operation analysis module is used for processing the version information of the key value information from the data analysis module and discarding all the key value information lower than the local version information; then analyzing the data updating operation instruction to obtain a data adding instruction or a data deleting instruction; and finally, sending the analyzed instruction and the data in the form of the triples to a full-data caching module.

An incremental caching device for fusion calculation in the Gaia system is used for controlling the cache scale during the fusion calculation of batch data and stream data and for dynamically caching data from external data sources, and is suitable for the case where the memory of all running nodes of the Gaia computing system cannot store the external data completely.

Further, according to the above incremental caching device for fusion calculation in the Gaia system, the incremental caching device includes:

The increment-external data source connection module is used for receiving a data query command from the increment-data cache module in the device in the calculation process, and interacting with an external data source to obtain query result data; implementing an external data source connection operator abstract class provided inside a Gaia computing system, providing a data connector for a MySQL data source, reserving a unified interface called by the data source connector, allowing a user to design and implement other data connectors by himself and access the system for connecting external data sources which are not covered by the module according to the data connector for the MySQL data source provided by the system as a template;

the increment-data cleaning module is used for normalizing and cleaning the query result data acquired from an external data source, converting the original query result data into triple-form data, de-duplicating it, and then returning it to the increment-data caching module;

The increment-data caching module is used for caching and managing the data acquired from the increment-data cleaning module, and cleaning the cache according to a preset data size and a cache timeout threshold or a user-defined data size and a cache timeout threshold; meanwhile, the method is responsible for receiving and processing data query from the increment-calculation module, and when corresponding data cannot be queried in the local cache, the data query is initiated to an external data source through the increment-external data source connection module;

The increment-calculation module is responsible for final calculation service: when stream data reaches the module, constructing corresponding key value information through the stream data, further constructing query information, initiating a data query request to the increment-data cache module, and connecting and calculating batch data of the query result received from the data query module with the stream data reaching the module.
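The behaviour of the increment-data caching module described above (bounded size, cache timeout, and a fallback query to the external source on a miss) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `IncrementalCache`, the `loader` callback standing in for the increment-external data source connection module, and the least-recently-used eviction order are all hypothetical choices; the source only states that the cache is cleaned according to a data size threshold and a timeout threshold.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, List, Tuple


class IncrementalCache:
    """Bounded cache with a timeout; misses fall back to the external source."""

    def __init__(
        self,
        loader: Callable[[Any], List[tuple]],  # hypothetical stand-in for the connector
        max_entries: int = 1000,               # data size threshold (assumed unit: keys)
        ttl_seconds: float = 60.0,             # cache timeout threshold
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: "OrderedDict[Any, Tuple[float, List[tuple]]]" = OrderedDict()

    def query(self, join_key: Any) -> List[tuple]:
        now = self._clock()
        entry = self._entries.get(join_key)
        if entry is not None and now - entry[0] <= self._ttl:
            # Fresh hit: mark as recently used and answer from memory.
            self._entries.move_to_end(join_key)
            return entry[1]
        # Miss or expired entry: query the external data source.
        rows = self._loader(join_key)
        self._entries[join_key] = (now, rows)
        self._entries.move_to_end(join_key)
        # Assumption: evict least recently used keys once the size threshold is exceeded.
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)
        return rows
```

Injecting the clock keeps the timeout logic testable without sleeping; a deployment would leave the default monotonic clock in place.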

Further, according to the above incremental caching device for fusion calculation in the Gaia system, the increment-data caching module further includes:

the cache frame module is used for providing a data cache function; exposing a unified data operation interface in the increment-data cache module, wherein the unified data operation interface comprises a data query operation, a data deletion operation and a data addition operation; according to the received different instructions, respectively calling a data deleting operation interface or a data adding operation interface to carry out cache modification;

and the data query module is used for receiving the data query message from the increment-calculation module, calling a data query operation interface of the cache frame module to perform data query according to the key value information provided by the data query message, and returning the query result to the increment-calculation module.

A data caching system for fusion computation in Gaia systems, which comprises at least 1 full-scale caching device.

A data caching system for fusion computation in Gaia systems, which comprises at least 1 incremental caching device.

A data caching system for fusion calculation in the Gaia system, which comprises at least 1 full-quantity caching device and at least 1 incremental caching device.

The beneficial effects of the above technical scheme are as follows. The data caching device and system for fusion calculation in the Gaia system are implemented by modifying the data source operator, the data shuffle virtual operator and the dimension table association calculation operator in the Gaia system.

(1) The incremental caching device is introduced into the fusion computing architecture of the Gaia system, so that the system can cache batch data during computation, which improves batch data query efficiency. This in turn improves the hybrid computing efficiency of the Gaia system per unit time and reduces batch data query delay.

(2) The full-quantity caching device is introduced into the fusion computing architecture of the Gaia system, and all batch data is cached within the range allowed by the memory of the Gaia system, which greatly improves batch data query efficiency, further improves the hybrid computing efficiency of the Gaia system per unit time, and reduces batch data query delay.

(3) A distributed cache architecture is introduced into both the full-quantity caching device and the incremental caching device, which improves the cache scalability of the Gaia computing system, raises the upper limit on the scale of batch data that the whole Gaia computing system can cache, and facilitates computation over massive data. In the incremental caching device, the stream processing characteristics of the Gaia computing system limit the range of stream data received by each incremental caching device, which indirectly reduces the cache scale of each device; in the full-quantity caching device, an independent data source module reads and distributes the external data source, and the batch data is split and distributed to the full-quantity-calculation modules.

(4) The two devices of the invention can be mixed in the Gaia computing system to suit more computing scenarios. The incremental caching device is suitable when the external data source changes quickly or the data scale is extremely large; the full-quantity caching device is suitable when the external data source changes slowly and the memory of the computing system is greater than or equal to the data scale. In the mixed case, the required caching device can be selected dynamically according to the characteristics of the external data source in each computation step.
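The selection rule stated in effect (4) can be written down as a small decision function. This is only a sketch of the stated criteria; the function name, the byte-based units and the boolean `source_changes_fast` flag are illustrative assumptions, and a real deployment would also have to account for memory that is not available for caching.

```python
from typing import Sequence


def choose_cache_device(
    node_memory_bytes: Sequence[int],
    external_data_bytes: int,
    source_changes_fast: bool,
) -> str:
    """Return "full" or "incremental" for one external data source."""
    # Full caching requires that the summed node memory can hold the external data.
    fits_in_memory = sum(node_memory_bytes) >= external_data_bytes
    if fits_in_memory and not source_changes_fast:
        return "full"
    # Fast-changing or oversized sources are served by the incremental device.
    return "incremental"
```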

Drawings

Fig. 1 is a schematic architecture diagram of the data source module in the full-quantity caching device for fusion calculation in the Gaia system according to this embodiment;

Fig. 2 is a schematic architecture diagram of the full-quantity-computing node module in the full-quantity caching device for fusion calculation in the Gaia system according to this embodiment;

Fig. 3 is a schematic architecture diagram of the incremental caching device for fusion calculation in the Gaia system according to this embodiment;

Fig. 4 is a schematic architecture diagram of a data caching system for fusion calculation in the Gaia system according to this embodiment;

Fig. 5 is a schematic diagram of a data caching system for fusion calculation in the Gaia system according to this embodiment;

Fig. 6 is a schematic diagram of a data caching system for fusion calculation in the Gaia system according to this embodiment.

Detailed Description

In order that the application may be readily understood, a more complete description of the application will be rendered by reference to the appended drawings. The drawings illustrate preferred embodiments of the application. This application may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.

This embodiment first provides a full-quantity caching device for fusion calculation in the Gaia system, which is used for caching all data from external data sources during fusion calculation and is suitable for the case where the total memory of all running nodes of the Gaia computing system is greater than or equal to the size of the external data. In this embodiment, the full-quantity caching device comprises a data source module and at least 1 compute node module.

The data source module, as shown in fig. 1, further comprises a full-external data source connection module, a full-data cleaning module, an incremental information analysis module, an external data cache module and a data distribution module.

The full-external data source connection module is used for reading data stored in the external storage system in the initialization stage of the calculation process and transmitting the read data to the full-data cleaning module; the module realizes the external data source connection operator abstract class provided in Gaia computing system, provides a data connector for MySQL data source, reserves a unified interface called by the data source connector, allows a user to design and realize other data connectors by himself according to the data connector for MySQL data source provided by the system as a template, and accesses the system for connecting the external data source which is not covered by the module.
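The connector arrangement described above (an abstract connection class, one provided connector serving as a template, and a unified interface for user-written connectors) might look like the sketch below. The Gaia abstract class itself is not shown in this document, so `ExternalSourceConnector` and its method names are hypothetical. Python's built-in `sqlite3` module is used purely as a stand-in so the example runs without a database server; a MySQL connector would presumably follow the same shape.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator, Optional, Tuple

Row = Tuple[Any, ...]


class ExternalSourceConnector(ABC):
    """Hypothetical unified interface that every data connector implements."""

    @abstractmethod
    def open(self, connection_info: Dict[str, str]) -> None:
        """Connect using address, user name, password and similar settings."""

    @abstractmethod
    def read_all(self) -> Iterator[Row]:
        """Yield every row of the external data set (initialization stage)."""

    @abstractmethod
    def close(self) -> None:
        """Release the connection."""


class SqliteConnector(ExternalSourceConnector):
    """Stand-in for the provided MySQL connector, usable as a template."""

    def __init__(self, table: str) -> None:
        self._table = table
        self._conn: Optional[sqlite3.Connection] = None

    def open(self, connection_info: Dict[str, str]) -> None:
        self._conn = sqlite3.connect(connection_info["address"])

    def read_all(self) -> Iterator[Row]:
        if self._conn is None:
            raise RuntimeError("connector is not open")
        # The table name comes from trusted configuration in this sketch.
        yield from self._conn.execute(f"SELECT * FROM {self._table}")

    def close(self) -> None:
        if self._conn is not None:
            self._conn.close()
            self._conn = None
```

A connector for another source would subclass `ExternalSourceConnector` in the same way, leaving the rest of the data source module unchanged.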

The full-data cleaning module is used for normalizing and cleaning data in different formats obtained from different external data sources, sending the processed data to the external data caching module for storage, and sending the processed data to the incremental information analysis module for data analysis. The normalization uniformly constructs all data into a triple (tuple 3) form, comprising key value information of all data columns of the external data, key value information of the data connection (join) columns, and the information of all data columns. The cleaning processes the normalized data, eliminating repeated data and merging data with the same key value.
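A minimal sketch of this triple construction and cleaning step is given below. Several details are assumptions made for illustration: the key value information is modelled as a SHA-1 digest of the column values, duplicates are detected by the all-column key, and "merging data with the same key value" is interpreted as grouping rows that share a join key. The function names are hypothetical.

```python
import hashlib
from typing import Any, Dict, Iterable, List, Sequence, Tuple

# (key of all columns, key of join columns, all column values)
Triple = Tuple[str, str, Tuple[Any, ...]]


def _digest(values: Sequence[Any]) -> str:
    """Derive a stable key from a sequence of column values (assumed scheme)."""
    return hashlib.sha1(repr(tuple(values)).encode("utf-8")).hexdigest()


def normalize(row: Sequence[Any], join_columns: Sequence[int]) -> Triple:
    """Build the triple form for one external row."""
    row_key = _digest(row)
    join_key = _digest([row[i] for i in join_columns])
    return (row_key, join_key, tuple(row))


def clean(triples: Iterable[Triple]) -> Dict[str, List[Triple]]:
    """Drop exact duplicates and group the remaining triples by join key."""
    seen = set()
    merged: Dict[str, List[Triple]] = {}
    for triple in triples:
        row_key, join_key, _ = triple
        if row_key in seen:
            continue  # repeated data is eliminated
        seen.add(row_key)
        merged.setdefault(join_key, []).append(triple)
    return merged
```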

The incremental information analysis module is used for receiving the triple-form data sent by the full-data cleaning module, comparing the key value information of all data columns in the triple-form data with the key value information obtained from the external data caching module, and deciding how the incremental data is processed. When the key value information in the triple-form data already exists in the external data caching module, the data is skipped: the triple-form data is not sent to the data distribution module, and only the version information of that key value information in the external data caching module is updated. When the key value information does not exist in the external data caching module, it is sent to the external data caching module for storage, and the triple-form data and a data adding operation instruction are sent to the data distribution module. After each data reading period ends, the module retrieves the version information of the key value information stored in the external data caching module, constructs all key value information that is not of the latest version into triple form, and sends the triple-form data and a data deleting operation instruction to the data distribution module.

The external data caching module is used for receiving the key value information of the triplet form data sent by the incremental information analysis module, adding corresponding version information to the received key value information, and then storing the key value information and the version information, wherein the stored key value information and the version information can be used as the basis for judging the data change and the data updating strategy by the incremental information analysis module.
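The interplay between the incremental information analysis module and the external data caching module can be sketched as a per-period diff. The sketch assumes that the version information is the number of the reading period in which a key was last seen, that the join key is stored alongside it so a delete triple can be built, and that a key is forgotten once its delete instruction has been emitted; none of these details is specified in the text, and all names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, tuple]  # (row_key, join_key, columns)
ADD, DELETE = "ADD", "DELETE"


class IncrementalAnalyzer:
    """Emits ADD for unseen keys and DELETE for keys missing from a period."""

    def __init__(self) -> None:
        # Plays the role of the external data caching module.
        self._versions: Dict[str, int] = {}   # row_key -> last period seen
        self._join_keys: Dict[str, str] = {}  # row_key -> join_key (assumption)
        self._period = 0

    def run_period(self, triples: Iterable[Triple]) -> List[Tuple[str, Triple]]:
        self._period += 1
        out: List[Tuple[str, Triple]] = []
        for triple in triples:
            row_key, join_key, _ = triple
            if row_key not in self._versions:
                # New key: store it and forward the data with an add instruction.
                self._join_keys[row_key] = join_key
                out.append((ADD, triple))
            # Known key: skipped, only its version is refreshed.
            self._versions[row_key] = self._period
        # End of period: keys not refreshed are no longer in the external source.
        stale = [k for k, v in self._versions.items() if v < self._period]
        for row_key in stale:
            out.append((DELETE, (row_key, self._join_keys.pop(row_key), ())))
            del self._versions[row_key]
        return out
```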

The data distribution module is responsible for maintaining data change and forwarding the data and the instructions received from the incremental information analysis module to the computing node module; the data distribution module in this embodiment further includes a data encapsulation module and a distribution policy module.

The data packaging module is used for acquiring the triple form data and the data updating operation instruction, namely the data adding operation instruction and the data deleting operation instruction, from the incremental information analyzing module, adding version information of the data updating operation instruction and the key value information on the basis of the triple form data, constructing quintuple form data, and delivering the quintuple form data to the distribution strategy module for data distribution.

The distribution policy module is configured to provide a selection of a data distribution policy by implementing a data shuffle virtual operator in the Gaia computing system, and support copying and distributing data in a quintuple form to each computing node module, or distributing the data in a specific computing node module according to data key value information, so as to increase an upper limit of a data cache of the Gaia computing system.
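The two distribution strategies can be illustrated with plain functions. This sketch does not use the Gaia shuffle operator, whose API is not shown here; the quintuple field order, the CRC32-based hash and all function names are assumptions.

```python
import zlib
from typing import Callable, Dict, List, Sequence, Tuple

# Assumed field order: (row_key, join_key, columns, operation, version)
Quintuple = Tuple[str, str, tuple, str, int]
Strategy = Callable[[Quintuple, int], List[int]]


def broadcast(record: Quintuple, num_nodes: int) -> List[int]:
    """Copy the record to every compute node module."""
    return list(range(num_nodes))


def by_key(record: Quintuple, num_nodes: int) -> List[int]:
    """Send the record to one node chosen deterministically from its join key."""
    join_key = record[1]
    return [zlib.crc32(join_key.encode("utf-8")) % num_nodes]


def distribute(
    records: Sequence[Quintuple], num_nodes: int, strategy: Strategy
) -> Dict[int, List[Quintuple]]:
    """Fill one outbox per compute node according to the chosen strategy."""
    outboxes: Dict[int, List[Quintuple]] = {n: [] for n in range(num_nodes)}
    for record in records:
        for node in strategy(record, num_nodes):
            outboxes[node].append(record)
    return outboxes
```

With broadcasting, every node holds the whole data set, so total capacity is bounded by a single node. With key-based distribution each node holds only its share, which is what raises the cache upper limit, provided the stream data is routed to nodes with the same key function.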

The computing node module, as shown in fig. 2, further includes a data receiving module, a full-data caching module, and a full-computing module.

The data receiving module is used for processing the quintuple-form data sent by the data source module and further comprises a data analysis module and an operation analysis module.

The data analysis module is used for decomposing the quintuple-form data, restoring it into the triple-form data, the data updating operation instruction and the version information of the key value information, and sending the parsed data to the operation analysis module.

The operation analysis module is used for processing the version information of the key value information from the data analysis module and discarding all key value information lower than the local version information; then analyzing the data updating operation instruction to obtain a data adding instruction or a data deleting instruction; and finally, sending the analyzed instruction and the data in the form of the triples to a full-data caching module.
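The encapsulation performed on the data source side and the parsing and version filtering performed on the compute node side can be sketched together. The field order of the quintuple, the integer version numbers and the rule that an equal version is still accepted are assumptions; the text only states that key value information lower than the local version is discarded.

```python
from typing import Dict, Optional, Tuple

Triple = Tuple[str, str, tuple]               # (row_key, join_key, columns)
Quintuple = Tuple[str, str, tuple, str, int]  # triple + operation + version


def encapsulate(triple: Triple, op: str, version: int) -> Quintuple:
    """Data encapsulation: append the instruction and the key version."""
    return (*triple, op, version)


def parse(record: Quintuple) -> Tuple[Triple, str, int]:
    """Data analysis: split a quintuple back into its three parts."""
    row_key, join_key, columns, op, version = record
    return (row_key, join_key, columns), op, version


class OperationAnalyzer:
    """Drops outdated updates and resolves the instruction type."""

    def __init__(self) -> None:
        self._local_versions: Dict[str, int] = {}

    def accept(self, record: Quintuple) -> Optional[Tuple[str, Triple]]:
        triple, op, version = parse(record)
        row_key = triple[0]
        if version < self._local_versions.get(row_key, 0):
            return None  # older than the local version: discard
        if op not in ("ADD", "DELETE"):
            raise ValueError(f"unknown operation: {op!r}")
        self._local_versions[row_key] = version
        return op, triple
```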

The full-data caching module is used for receiving the instruction from the data receiving module and the data in the form of triples, and correspondingly modifying the local data caching according to the instruction type, and further comprises a caching framework module and a data query module.

The cache frame module is used for providing a data cache function; exposing a unified data operation interface in the whole data caching module, wherein the unified data operation interface comprises a data query operation, a data deletion operation and a data addition operation; and respectively calling a data deleting operation interface or a data adding operation interface to carry out cache modification according to the received different instructions.

The data query module is used for receiving the data query message from the full-quantity-calculation module, calling the data query operation interface of the cache frame module to perform data query according to the key value information provided by the data query message, and returning the query result to the full-quantity-calculation module.
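A cache framework exposing the three unified operations might be sketched as below. Indexing by join key in a hash map is an assumption, chosen because it replaces the traversal queries criticised in the Background section with a constant-time lookup; the class and method names are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, tuple]  # (row_key, join_key, columns)


class CacheFramework:
    """In-memory cache with add, delete and query operations."""

    def __init__(self) -> None:
        # join_key -> {row_key -> columns}
        self._by_join_key: Dict[str, Dict[str, tuple]] = {}

    def add(self, triple: Triple) -> None:
        row_key, join_key, columns = triple
        self._by_join_key.setdefault(join_key, {})[row_key] = columns

    def delete(self, triple: Triple) -> None:
        row_key, join_key, _ = triple
        bucket = self._by_join_key.get(join_key)
        if bucket is None:
            return
        bucket.pop(row_key, None)
        if not bucket:
            del self._by_join_key[join_key]

    def query(self, join_key: str) -> List[tuple]:
        """Return all cached rows for a join key without scanning other keys."""
        return list(self._by_join_key.get(join_key, {}).values())

    def apply(self, op: str, triple: Triple) -> None:
        """Dispatch an instruction received from the data receiving module."""
        if op == "ADD":
            self.add(triple)
        elif op == "DELETE":
            self.delete(triple)
        else:
            raise ValueError(f"unknown operation: {op!r}")
```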

The full-computation module is responsible for the final computation service: when stream data arrives at the module, it constructs the corresponding key value information from the stream data and builds a query message, initiates a data query request to the data query module, and joins and computes the batch data in the query result received from the data query module with the stream data that arrived at the module, thereby realizing the join computation between the stream data of the Gaia system and the external data cached in the cache frame module. The module is based on the dimension table association calculation in the existing Gaia system and adds a data query interface that connects to the full-data caching module, so that the dimension table association calculation supports data caching, which improves computing efficiency and reduces computation delay.
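The lookup-and-join behaviour of this module can be sketched as a generator over the stream. The join type is not specified in the text; this sketch assumes an inner join, so stream events without a cached match produce no output. The `key_of` and `query` callbacks are hypothetical stand-ins for the key construction step and the data query module.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Event = Dict[str, Any]


def lookup_join(
    stream: Iterable[Event],
    key_of: Callable[[Event], Any],
    query: Callable[[Any], List[tuple]],
) -> Iterator[Tuple[Event, tuple]]:
    """Join each stream event with the cached batch rows sharing its key."""
    for event in stream:
        key = key_of(event)     # construct key value information from the stream data
        for row in query(key):  # data query request to the cache
            yield (event, row)  # joined record handed to downstream computation
```

For example, with a cache mapping user ID 7 to one row, an event for user 7 yields one joined pair and an event for an uncached user yields nothing.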

The specific working process of the full-quantity caching device is as follows:

Step 1: acquiring connection information of an external data source;

The external data source is a data source outside the Gaia computing system, and the connection information is related information used for communicating with the external data source, such as a data source address, a data source user name, a data source password, and the like.

Step 2: connecting and reading external data through a data source module according to the connection information of the external data source;

Step 3: the data source module caches the obtained key value information of the data, distributes the data according to a distribution strategy selected by a user, and distributes the data to the downstream full-quantity-computing node module;

Step 4: the full-quantity-computing node module receives and caches the data, and the reception of stream data is suspended during the batch data receiving stage;

Step 5: after the first data distribution is finished, the data source module notifies all downstream full-quantity-computing node modules in the Gaia computing system to start receiving and processing stream data;

Step 6: at the beginning of each data reading period, the data source module re-reads the external data, judges whether the data needs to be distributed according to the local data cache, packages different data operation instructions according to different judging results and distributes the data. During the subsequent cache data update, the cache update of the full-compute node module is asynchronous and the processing of the streaming data will not stop.

Step 7: after the receiving of stream data has started, the full-quantity-computing node module computes from the stream data the key value information used for the cache query, queries the full-data caching module with that key value information, and passes the query result together with the stream data to the dimension table association operator for fusion computation.

In this embodiment, the stream data is data in the form of tuples, and the user ID column is the primary key column of the tuple data, which is to be used for performing the cache query.
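
By way of non-limiting illustration, Steps 4, 5 and 7 can be sketched as follows. The class `FullCacheNode`, its method names and the choice to hold stream tuples in a queue until the first batch distribution finishes are assumptions made for this sketch; the text only states that receiving of stream data is temporarily suspended.

```python
from collections import deque


class FullCacheNode:
    """Hypothetical stand-in for one full-quantity-computing node module."""

    def __init__(self):
        self.cache = {}          # primary key -> cached batch record
        self.ready = False       # False during the first batch distribution
        self.pending = deque()   # stream tuples held back while not ready (assumption)

    def on_batch(self, key, record):
        # Step 4: receive and cache batch data sent by the data source module.
        self.cache[key] = record

    def on_first_distribution_done(self):
        # Step 5: the data source module signals that stream processing may start.
        self.ready = True
        released = [self.join(t) for t in self.pending]
        self.pending.clear()
        return released

    def on_stream(self, stream_tuple):
        # Step 4: stream tuples are not processed during the first batch load.
        if not self.ready:
            self.pending.append(stream_tuple)
            return None
        return self.join(stream_tuple)

    def join(self, stream_tuple):
        # Step 7: the user ID column is the key used for the cache query.
        user_id = stream_tuple[0]
        return stream_tuple, self.cache.get(user_id)


node = FullCacheNode()
node.on_batch(7, {"user_id": 7, "name": "alice"})
early = node.on_stream((7, "click"))           # held back, nothing joined yet
released = node.on_first_distribution_done()   # held tuples are joined now
late = node.on_stream((8, "click"))            # key 8 is not cached
```

In this sketch a stream tuple whose key is absent from the cache is paired with `None`, which mirrors the null result a full cache returns for unknown keys.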

The second aspect of the present embodiment provides an incremental caching device for fusion calculation in the Gaia system. The incremental caching device controls the cache size during fusion calculation and dynamically caches data from an external data source, so that the system can cache batch data during the calculation. It is suitable for the case where the memory of all operation nodes of the Gaia computing system cannot hold the external data completely.

In this embodiment, the incremental caching device, as shown in Fig. 3, includes an increment-external data source connection module, an increment-data cleaning module, an increment-data caching module, and an increment-calculation module.

The increment-external data source connection module receives data query commands from the increment-data caching module of the device during calculation and interacts with the external data source to obtain query result data. The module implements the external data source connection operator abstract class provided in the Gaia computing system and provides a data connector for MySQL data sources. It also reserves a unified interface for invoking data source connectors, so that a user can take the MySQL data connector provided by the system as a template, design and implement other data connectors, and plug them into the system to connect external data sources that the module does not cover.

The increment-data cleaning module performs the corresponding data reduction and cleaning on the query result data acquired from the external data source, converts the original query result data into triples, de-duplicates them, and returns them to the increment-data caching module.

The increment-data caching module caches and manages the data acquired from the increment-data cleaning module and cleans the cache according to a data size threshold and a cache timeout threshold, which are either preset or user-defined. It also receives and processes data queries from the increment-calculation module; when the requested data cannot be found in the local cache, it issues the query to the external data source through the increment-external data source connection module. The increment-data caching module further comprises a cache frame module and a data query module.

The cache frame module provides the data caching function. It exposes a unified data operation interface within the increment-data caching module, comprising a data query operation, a data deletion operation and a data addition operation, and calls the data deletion operation interface or the data addition operation interface to modify the cache according to the instruction received.

The data query module is used for receiving the data query message from the increment-calculation module, calling the data query operation interface of the cache frame module to perform data query according to the key value information provided by the data query message, and returning the query result to the increment-calculation module.
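
By way of non-limiting illustration, the relationship between the cache frame module and the data query module can be sketched as follows. The class names, the `"ADD"` and `"DELETE"` instruction strings and the dictionary-shaped query message are assumptions made for this sketch and are not taken from the Gaia system.

```python
class CacheFrame:
    """Hypothetical cache frame exposing query, add and delete operations."""

    def __init__(self):
        self._store = {}

    def query(self, key):
        return self._store.get(key)

    def add(self, key, value):
        self._store[key] = value

    def delete(self, key):
        self._store.pop(key, None)

    def apply(self, instruction, key, value=None):
        # Dispatch a received instruction to the matching modification interface.
        if instruction == "ADD":
            self.add(key, value)
        elif instruction == "DELETE":
            self.delete(key)
        else:
            raise ValueError(f"unknown instruction: {instruction}")


class DataQueryModule:
    """Receives query messages and answers them through the cache frame."""

    def __init__(self, frame):
        self._frame = frame

    def handle(self, query_message):
        # The query message carries the key value information to look up.
        return self._frame.query(query_message["key"])


frame = CacheFrame()
frame.apply("ADD", 42, "user-42-record")
queries = DataQueryModule(frame)
hit = queries.handle({"key": 42})
frame.apply("DELETE", 42)
miss = queries.handle({"key": 42})
```

Keeping the three operations behind one interface means the calculation module only ever depends on `query`, while cache modifications arrive as instructions.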

The increment-calculation module is responsible for the final calculation service. When stream data arrives at the module, the module constructs the corresponding key value information from the stream data, builds a query message from it, and sends a data query request to the data query module. It then joins the batch data returned by the data query module with the stream data that arrived at the module, thereby realizing the join calculation between the stream data of the Gaia system and the external data cached in the cache frame module. The module builds on the dimension table association calculation of the existing Gaia system and adds a data query interface to the increment-data caching module, so that the dimension table association calculation supports data caching, which improves calculation efficiency and reduces calculation latency.

The specific workflow of the incremental caching device is as follows:

Step I: acquiring connection information of an external data source;

Step II: the increment-external data source connection module establishes a connection pool corresponding to the external data source according to the connection information of the external data source;

Step III: after the receiving of stream data has started, the increment-calculation module computes from the stream data the key value information used for the cache query and queries the increment-data caching module with that key value information; the query result is passed together with the stream data to the dimension table association operator for fusion calculation;

Step IV: the increment-calculation module starts to receive stream data, computes from the stream data the primary key information used for the cache query, and sends a data query request to the increment-data caching module according to the primary key information; the query result is passed together with the stream data to the dimension table association operator for fusion calculation. In this embodiment, the stream data is data in the form of tuples, and the user ID column is the primary key column of the tuple data and is used for performing the cache query.
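
By way of non-limiting illustration, the query path of the workflow above can be sketched as a read-through cache: the local cache is consulted first, and the external data source is queried only on a miss. The names `IncrementalCache`, `fuse` and `external_table` are hypothetical, and a plain dictionary stands in for the MySQL data source and its connection pool.

```python
class IncrementalCache:
    """Hypothetical read-through cache: local lookup first, external source on a miss."""

    def __init__(self, fetch_from_source):
        self._local = {}
        # Stands in for the increment-external data source connection module.
        self._fetch = fetch_from_source
        self.source_queries = 0

    def query(self, key):
        if key in self._local:
            return self._local[key]
        self.source_queries += 1
        record = self._fetch(key)
        self._local[key] = record
        return record


def fuse(stream_tuple, cache):
    # The user ID column is the primary key column used for the cache query.
    user_id = stream_tuple[0]
    return stream_tuple + (cache.query(user_id),)


external_table = {1: "alice", 2: "bob"}   # toy stand-in for the external data source
cache = IncrementalCache(external_table.get)
stream = [(1, "click-a"), (2, "click-b"), (1, "click-c")]
results = [fuse(t, cache) for t in stream]
```

Only the first occurrence of each user ID reaches the external source; the repeated key is answered from the local cache.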

In a third aspect of this embodiment, a data caching system for fusion computation in the Gaia system is provided. The architecture of the whole system is shown in Fig. 4. The data caching system is a distributed caching architecture formed by a plurality of full-quantity caching devices and includes at least 1 full-quantity caching device as described above. It improves the cache expansion capability of the Gaia computing system and expands the upper limit of the batch data size that the whole Gaia computing system can cache, which in turn raises the upper limit of the cached data supported by the system and is more beneficial to computations that process massive data.

In a fourth aspect of this embodiment, a data caching system for fusion computation in the Gaia system is provided. The architecture of the whole system is shown in Fig. 5. The data caching system is a distributed caching architecture formed by a plurality of incremental caching devices and includes at least 1 incremental caching device as described above. Owing to the stream processing characteristics of the Gaia computing system, the range of stream data received by each incremental caching device is limited, so the distributed caching architecture indirectly reduces the cache size required of each incremental caching device.

In a fifth aspect of this embodiment, a data caching system for fusion computation in the Gaia system is provided. The architecture of the whole system is shown in Fig. 6. The data caching system is formed by mixing a plurality of full-quantity caching devices and incremental caching devices and includes at least 1 full-quantity caching device and at least 1 incremental caching device. The system adapts to more computing scenarios: the incremental caching device suits the case where the external data source changes quickly or the data scale is extremely large, while the full-quantity caching device suits the case where the external data source changes slowly and the memory of the computing system is larger than or equal to the data scale. By mixing the two devices, the system can dynamically select the required caching device in each computing step according to the characteristics of the external data source.
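
By way of non-limiting illustration, the selection between the two devices can be sketched as a simple rule. The function name, the byte-based sizes, the boolean flag for a fast-changing source and the step names and numbers are all hypothetical; in particular, "extremely large" is modeled here as "larger than the total node memory", which is an assumption of this sketch.

```python
def choose_cache_device(data_size_bytes, total_node_memory_bytes, source_changes_fast):
    """Pick a caching device for one computing step from the stated criteria."""
    if source_changes_fast or data_size_bytes > total_node_memory_bytes:
        return "incremental"
    return "full"


# Hypothetical computing steps: (name, data size, total node memory, changes fast?)
steps = [
    ("user_info", 2_000, 8_000, False),
    ("click_log", 50_000, 8_000, False),
    ("prices", 1_000, 8_000, True),
]
plan = {name: choose_cache_device(size, mem, fast) for name, size, mem, fast in steps}
```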

Examples

In this embodiment, the processing of a recommendation flow by the Gaia system is used as the actual fusion calculation scenario, MySQL is used as the data source, hash partitioning as the partition strategy, and a HashMap as the storage structure and query strategy. Commodity recommendation is a very common business scenario in the current big data environment: a recommended commodity ID is calculated from the user ID, the user click data and the user information data, the data for the recommended commodity ID is expanded, and the recommendation result is finally returned. In this embodiment, the data caching device and system are used in this process, and the data expansion is computed together with the dimension table association operator of the Gaia system.

The data used in this embodiment are all generated manually in the format of real production data. There are three data tables: a user information table, a commodity information table and a user click data table. The number of records and the data volume of each table are shown in Table 1.

Table 1 database data table size

The whole experimental business logic is as follows:

(1) Stream user IDs into the Gaia system used for testing;

(2) According to the user ID, expand the user information from the user information table through a data join operation, and expand the commodity IDs clicked by the user from the user click data table;

(3) Add a random number to each clicked commodity ID to obtain a recommended commodity ID;

(4) Acquire the specific commodity information through the recommended commodity ID and return it.

In this embodiment, when data caching for fusion calculation is implemented by the full-quantity caching device, the external data source is MySQL and the data is obtained from the database through standard SQL queries. After the data is obtained from MySQL, the key value for the join operation is calculated; at the same time, the data reduction operation in the full-data source cleaning module arranges the data into a standard JSON format string, and the hash value of that string is used as the characteristic key value of the data. Finally, the characteristic key value, the join key value and the JSON format string are assembled into a triple. The external data caching module caches the key value information of the triple in a HashMap and does not cache the JSON format string of the triple. In the first round of data distribution, the external data caching module is empty, so all data is sent with add-data messages. From the second round onward, all data sent by the full-data source cleaning module is compared with the key values cached in the external data caching module, and a data update operation is performed when the data has changed. There are three cases. In the first case, the primary key value of the incoming data is already cached but the non-primary-key column data differs; this triggers a data modification operation, in which a delete message and an add message are issued to the data distribution module and together complete the update. In the second case, the primary key of the incoming data is not cached, and an add message is sent to the data distribution module. 
In the third case, the data has been deleted, and a delete message is sent to the data distribution module. In this embodiment, the data sent to the full-quantity-computing node module is a five-tuple of Integer and String fields: the first field is the data version, the second is the data operation type, the third is the data primary key, the fourth is the complete data, and the fifth is the distribution channel, which is filled in by the distribution policy module. Data key value information is distributed to specific computing nodes, and the distribution channel is calculated from the hash value of the primary key and the number of full-quantity-computing nodes. After receiving a five-tuple, the data receiving module parses it to restore the triple, the data update operation and the data version information, and sends the parsed data to the operation parsing module. The operation parsing module first processes the data version information and discards all data whose version is lower than the local data version. It then reads the data operation type and constructs an execution command for the full-data caching module. Execution commands are either data add commands or data delete commands, corresponding respectively to the data add and data delete operations generated by the incremental information analysis module of the data source module: for an add, the data is written into the cache according to the data primary key and the complete data; for a delete, the cached data is removed according to the data primary key. 
Finally, the data command and the data in triple form are sent to the full-data caching module for processing. In this embodiment, the data is cached in a hash table, which gives O(1) query time complexity at the cost of additional memory. The full-computation module adds a data query interface to the full-data caching module on top of the dimension table association computation of the Gaia system. In this embodiment, the user-defined function interface that the dimension table association operator of the Gaia system provides to users is used to implement the data expansion of a user ID, the click expansion, the recommended commodity ID calculation and the commodity information expansion; each expansion step is an independent dimension table association operation.
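
By way of non-limiting illustration, the message path described in this embodiment can be sketched as follows: building the triple, wrapping it into a five-tuple with a distribution channel, and applying it on a computing node with a version check. Several details are assumptions of this sketch and are not stated in the text: SHA-256 as the hash of the JSON string, integer primary keys, modulo as the way the channel is derived from the key and the node count, and the names `make_triple`, `to_five_tuple` and `ComputeNode`.

```python
import hashlib
import json


def make_triple(row, join_column):
    """Build (characteristic key, join key, JSON string) for one source row."""
    payload = json.dumps(row, sort_keys=True)   # canonical JSON format string
    feature_key = hashlib.sha256(payload.encode("utf-8")).hexdigest()  # assumed hash
    return feature_key, row[join_column], payload


def to_five_tuple(version, operation, triple, node_count):
    """Wrap a triple as (version, operation, primary key, complete data, channel)."""
    _, primary_key, payload = triple
    channel = primary_key % node_count          # assumption: modulo partitioning
    return version, operation, primary_key, payload, channel


class ComputeNode:
    """Hypothetical computing node: parses five-tuples and updates its cache."""

    def __init__(self):
        self.local_version = 0
        self.cache = {}

    def receive(self, message):
        version, operation, primary_key, payload, _channel = message
        if version < self.local_version:
            return False                        # stale version, discard
        self.local_version = version
        if operation == "ADD":
            self.cache[primary_key] = payload
        elif operation == "DELETE":
            self.cache.pop(primary_key, None)
        return True


nodes = [ComputeNode() for _ in range(3)]
triple = make_triple({"user_id": 7, "name": "alice"}, "user_id")
msg = to_five_tuple(1, "ADD", triple, len(nodes))
nodes[msg[4]].receive(msg)
stale = to_five_tuple(0, "DELETE", triple, len(nodes))
accepted = nodes[stale[4]].receive(stale)       # lower version than the node has seen
```

Because the JSON string is serialized with sorted keys, two rows with the same content always produce the same characteristic key, which is what makes the key usable for change detection.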

In this embodiment, the specific working process by which the full-quantity caching device implements data caching is as follows. First, the connection information of the external data source MySQL is acquired, comprising the connection user, connection password, connection address, database name, data table name and data column information of MySQL. A database connection pool is established from the connection information, a query statement is constructed from the database name, data table name, data columns and similar information, and the data of the external data source is obtained. The data distribution strategy partitions the data according to the hash value of the data primary key value, which keeps the cached data evenly spread across the full-quantity-computing node modules in the distributed environment. The full-quantity-computing node module receives and caches the data, and the receiving of stream data is blocked during the first batch data distribution and receiving stage. After the first data distribution of the data source is finished, all downstream full-quantity-computing node modules are notified to start receiving and processing stream data, and the receiving and processing of stream data is no longer blocked during subsequent data distribution and receiving. Because this is a full cache, all data is assumed to have been acquired; if a query request sent by a computing task unit finds no data, a null value is returned directly. When the data cache refresh period is reached, the data source module re-reads the external data and compares it with the old data information already cached locally. 
When the primary key of a read data item does not exist in the local cache, a data add operation is performed; when the primary key exists in the local cache but the other key values differ, a data modification operation is performed. Both the add operation and the modification operation write the data into the local new cache and delete the corresponding data from the local old cache. After reading is finished, a data delete operation is performed for all data remaining in the local old cache, and the new cache becomes the old cache. In the subsequent updating process, the cache update of the full-data caching module is asynchronous with respect to the full-computation module. Specifically, the incremental information analysis module is activated, the data to be updated is constructed and encapsulated into the corresponding tuple information of the Gaia system, and the tuple information is sent to the downstream computing node modules; the data receiving module of each computing node module parses the data and dynamically updates the cache of the full-data caching module. The update of the computing node cache is an atomic operation and does not affect queries.
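
By way of non-limiting illustration, the refresh comparison can be sketched as a diff between the old key cache and the freshly read data. The function name `refresh`, the message tuples and the sample keys are hypothetical; following the text, a modification is emitted as a delete message followed by an add message, and every key left in the old cache after reading is emitted as a delete.

```python
def refresh(old_cache, fresh_rows):
    """Diff freshly read rows against the old key cache.

    old_cache maps primary key -> characteristic key (hash of the row content).
    fresh_rows is an iterable of (primary key, characteristic key) pairs.
    Returns (messages, new_cache); new_cache becomes the old cache next period.
    """
    old = dict(old_cache)   # work on a copy so the caller's cache is untouched
    new_cache = {}
    messages = []
    for primary_key, feature_key in fresh_rows:
        if primary_key not in old:
            messages.append(("ADD", primary_key))
        elif old.pop(primary_key) != feature_key:
            # Modification: same primary key, different content.
            messages.append(("DELETE", primary_key))
            messages.append(("ADD", primary_key))
        new_cache[primary_key] = feature_key
    # Keys still in the old cache were not read again: deleted at the source.
    for primary_key in old:
        messages.append(("DELETE", primary_key))
    return messages, new_cache


old_keys = {1: "h1", 2: "h2", 3: "h3"}
fresh = [(1, "h1"), (2, "h2-changed"), (4, "h4")]
messages, new_keys = refresh(old_keys, fresh)
```

Unchanged rows produce no message at all, so the traffic to the computing nodes is proportional to the number of changes and not to the table size.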

In this embodiment, when data caching for fusion calculation is implemented by the incremental caching device, the external data source is MySQL and the data is obtained from the database through standard SQL queries. The whole process of requesting data from the database runs in an independent asynchronous thread, and this asynchronous mode increases the number of queries that can be served per unit time. After the data is obtained from MySQL, the key value for the join operation is calculated; at the same time, the data reduction operation in the increment-data cleaning module arranges the data into a standard JSON format string, and the hash value of that string is used as the characteristic key value of the data. Finally, the characteristic key value, the join key value and the JSON format string are assembled into a triple. The caching of query results is implemented with the open source caching component Guava Cache, and each cache record is refreshed by timeout-based cleaning, which keeps the cached data valid. When stream data reaches the increment-calculation module, the module constructs the corresponding key value information from the stream data, builds a query message, and sends a data query request to the data query module; it then joins the batch data returned by the data query module with the stream data that reached the module, realizing the join calculation between the stream data of the Gaia system and the external data cached in the cache frame module. 
The specific workflow of the incremental caching device in this embodiment is as follows. The external data source connection information is obtained, comprising the connection user, connection password, connection address, database name, data table name and data column information of MySQL. The increment-external data source connection module establishes a data source connection pool for the external data source according to this connection information; in this embodiment, a connection pool is constructed directly in each computing node, and the data source is then accessed by asynchronous queries. The calculation flow is started. After receiving a query request sent by the increment-calculation module, the increment-data caching module performs the query as follows: it queries the local cache and returns the result directly when the corresponding key value data exists there; when the data does not exist, it hands the query to the increment-external data source connection module, which returns the data to the increment-data cleaning module for data reduction and cleaning when the data is found and returns empty information to the increment-data cleaning module otherwise. The increment-data cleaning module performs reduction and cleaning on the acquired data and sends the arranged data to the increment-data caching module for processing. After receiving the data returned by the increment-data cleaning module, the increment-data caching module first caches it; when the acquired information is empty, the miss itself is cached, which reduces the query cost when the same data is queried again within the current caching period. Finally, the query result or empty data is returned according to the data type.
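
By way of non-limiting illustration, timeout cleaning combined with miss caching can be sketched as follows. The embodiment uses Guava Cache; this sketch does not use that library and instead shows the same idea with a small hand-written cache. The class `TimeoutCache`, its parameters, the oldest-first eviction rule and the injected clock are assumptions of this sketch.

```python
import time

_MISS = object()   # sentinel stored for keys the external source does not have


class TimeoutCache:
    """Hypothetical size-bounded cache with per-entry timeout and miss caching."""

    def __init__(self, fetch, ttl_seconds, max_entries, clock=time.monotonic):
        self._fetch = fetch          # stands in for the external source query
        self._ttl = ttl_seconds
        self._max = max_entries
        self._clock = clock
        self._entries = {}           # key -> (expiry time, value or _MISS)
        self.source_queries = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return None if entry[1] is _MISS else entry[1]
        # Absent or timed out: ask the external source.
        self.source_queries += 1
        value = self._fetch(key)
        if key not in self._entries and len(self._entries) >= self._max:
            # Respect the size bound by evicting the oldest inserted entry.
            self._entries.pop(next(iter(self._entries)))
        self._entries.pop(key, None)
        self._entries[key] = (now + self._ttl, _MISS if value is None else value)
        return value


now = [0.0]                          # controllable clock for the example
table = {1: "alice"}                 # toy stand-in for the external data source
cache = TimeoutCache(table.get, ttl_seconds=10, max_entries=100, clock=lambda: now[0])
first = cache.get(1)                 # fetched from the source
absent = cache.get(2)                # not in the source, the miss is cached
absent_again = cache.get(2)          # answered from the cached miss
now[0] = 11.0                        # move past the timeout
refreshed = cache.get(1)             # entry expired, fetched again
```

Caching the miss means a key that is absent from the external source costs one source query per caching period instead of one per stream tuple.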

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions, which are defined by the scope of the appended claims.

Claims

1. A fusion computation oriented data caching apparatus in Gaia systems, the apparatus comprising:

The full-quantity caching device is used for caching all data from an external data source in the fusion calculation process of the batch data and the stream data, and is suitable for the situation that the sum of the internal memories of all operation nodes of Gaia computing systems is larger than or equal to the size of the external data;

The incremental caching device is used for controlling the caching scale in the fusion calculation process of the batch data and the stream data, dynamically caching the data from the external data source, and is suitable for the situation that memories of all operation nodes of the Gaia computing system cannot completely store the external data;

The full-quantity caching device comprises a data source module and at least 1 computing node module;

the data source module further comprises:

The compute node module further includes:

the full-data caching module further comprises:

The data query module is used for receiving the data query message from the full-quantity-calculation module, calling a data query operation interface of the cache frame module to perform data query according to key value information provided by the data query message, and returning a query result to the full-quantity-calculation module;

The increment buffer apparatus includes:

2. The fusion-oriented data caching device in a Gaia system according to claim 1, wherein the data distribution module further includes:

3. The fusion-oriented data caching device in a Gaia system according to claim 1, wherein the data receiving module further includes:

4. The fusion-oriented data caching device in the Gaia system of claim 1, wherein the delta-data caching module further comprises:

5. A fusion calculation oriented data caching system in a Gaia system, characterized in that the system comprises at least 1 data caching device according to any one of claims 1-4.