CN113568938B - Data stream processing method and device, electronic equipment and storage medium - Google Patents

Data stream processing method and device, electronic equipment and storage medium

Info

Publication number
CN113568938B
CN113568938B (Application CN202110892026.1A)
Authority
CN
China
Prior art keywords
data
format
processing
index information
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110892026.1A
Other languages
Chinese (zh)
Other versions
CN113568938A (en)
Inventor
巴铁凯
封磊
池阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110892026.1A
Publication of CN113568938A
Application granted
Publication of CN113568938B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/24568 Data stream processing; Continuous queries
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G06F16/23 Updating
    • G06F16/24552 Database cache management
    • G06F16/2471 Distributed queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a data stream processing method and device, an electronic device, and a storage medium, relating to the technical field of big data processing. The method comprises: acquiring a data stream to be processed, where the data stream to be processed includes stream-processing data and batch-processing data; processing the data stream to be processed in a unified stream-batch mode to obtain processed data; and storing the processed data in a data lake. According to this technical scheme, processing the stream-processing data and batch-processing data in a unified stream-batch mode improves the timeliness of data processing, and storing the processed data in a data lake meets users' low-latency requirements for data queries.

Description

Data stream processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to the field of big data processing.
Background
With the rapid growth of internet service data, product managers, operations staff, and related industry managers place ever higher timeliness requirements on obtaining data results. In an era of rapid information acquisition and exchange, whoever extracts value from data first can make decisions and act faster.
As data scale continues to expand in the big data era, the magnitude, generation speed, and complexity of data keep increasing. Because of the sheer complexity of the data to be analyzed, reports and index results in many scenarios are produced by batch processing, which introduces delay into the results, so data results cannot be obtained quickly and decisions cannot be made in time. Under conditions of continuous mass data production, how to process data and obtain results rapidly by technical means is a problem that urgently needs to be solved, and it is especially important for product decision-making and commercial promotion.
Disclosure of Invention
The disclosure provides a data stream processing method and device, an electronic device, and a storage medium, so as to address at least one of the technical problems described above.
According to an aspect of the present disclosure, there is provided a data stream processing method, including:
acquiring a data stream to be processed, wherein the data stream to be processed comprises stream processing data and batch processing data;
processing the data stream to be processed according to a stream batch integrated processing mode to obtain processed data;
and storing the processed data into a data lake.
According to another aspect of the present disclosure, there is provided a data stream processing apparatus including:
The acquisition module is used for acquiring a data stream to be processed, wherein the data stream to be processed comprises stream processing data and batch processing data;
the processing module is used for processing the data stream to be processed according to a stream batch integrated processing mode to obtain processed data;
and the storage module is used for storing the processed data into the data lake.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method in any of the embodiments of the present disclosure.
According to the data stream processing method, device, electronic device, and storage medium provided by the technical scheme of the disclosure, processing the stream-processing data and batch-processing data to be processed in a unified stream-batch mode improves the timeliness of data processing, and storing the processed data in a data lake meets users' low-latency requirements for data queries.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of a system architecture of a data stream processing system in an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a data stream processing method according to an embodiment of the disclosure;
FIG. 3 is a schematic diagram of a data stream processing method according to an embodiment of the disclosure;
FIG. 4 is a schematic diagram of data warehouse and data lake compatible storage in accordance with an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a data stream processing method according to an embodiment of the disclosure;
FIG. 6 is a schematic diagram of a data query system in an embodiment of the present disclosure;
FIG. 7 is a diagram comparing query performance according to an embodiment of the disclosure;
FIG. 8 is a schematic diagram of a data stream processing apparatus according to an embodiment of the disclosure;
FIG. 9 is a schematic diagram of a query module according to an embodiment of the disclosure;
fig. 10 is a block diagram of an electronic device for implementing a data stream processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The technical scheme of the present disclosure can be applied to big data processing scenarios: stream-processing data and batch-processing data generated in real time by an internet service are processed uniformly, the processed data is stored in a data lake, and users can query the processing results to support product decisions and business promotion.
The execution subject of the present disclosure may be any electronic device, e.g., a server. The following describes the technical scheme of the present disclosure in detail through a plurality of embodiments.
FIG. 1 is a system architecture diagram of a data stream processing system according to an embodiment of the present disclosure. As shown in fig. 1, a data stream processing system in an embodiment of the present disclosure includes: a batch integration layer and an online service layer.
The stream-batch integrated layer of the data stream processing system receives a real-time data stream to be processed; optionally, the data stream may be received through a Kafka message queue. The data stream to be processed includes batch-processing data and stream-processing data. Batch-processing data is data suited to batch processing, generally data with a larger processing volume, for example data generated over a larger time span; stream-processing data is data suited to stream processing, generally data with a smaller processing volume, for example data generated over a smaller time span. The stream-batch integrated layer can process the data stream in a unified stream-batch mode through a stream processing engine, which may include but is not limited to the Flink data processing engine. The Flink data processing engine performs type conversion, structuring, text cleaning, and similar operations on the data, and the processed data is stored in the form of data tables in the lake-warehouse integrated storage engine of the online service layer, so that users can perform data query and data analysis through applications. The lake-warehouse integrated storage engine comprises a data warehouse and a data lake: the data warehouse stores historical data, and the data lake stores the real-time data processed by the stream-batch integrated layer. Because the two have different formats, the historical data can be converted into the same format as the real-time data and then stored in the data lake. The data lake can be understood as a distributed database storing massive data; its main difference from the data warehouse is that it can ingest data in real time while providing good query performance.
Data queries against the data lake achieve average second-level response times, meeting users' low-latency query requirements.
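As a minimal, illustrative sketch of the architecture above (all function and field names below are hypothetical stand-ins, not the patented implementation; the queue, engine, and lake are simulated in plain Python):

```python
def receive_from_queue():
    """Stand-in for consuming the to-be-processed data stream from a
    Kafka message queue; it mixes stream and batch records."""
    return [{"type": "stream", "v": " A "}, {"type": "batch", "v": "b"}]

def unified_process(records):
    """Stand-in for the stream-batch integrated layer: one engine
    cleans and structures every record the same way, regardless of
    whether it is stream or batch data."""
    return [{"v": r["v"].strip().lower()} for r in records]

# Stand-in for the data-lake table inside the lake-warehouse
# integrated storage engine of the online service layer.
data_lake = []

data_lake.extend(unified_process(receive_from_queue()))
```

In a real deployment the queue would be Kafka, the engine Flink, and the lake a distributed store; the point of the sketch is only that a single processing path serves both kinds of data.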
Fig. 2 is a schematic diagram of a data stream processing method according to an embodiment of the disclosure. As shown in fig. 2, the data stream processing method may include:
step S201, obtaining a data stream to be processed, wherein the data stream to be processed comprises stream processing data and batch processing data;
the execution subject of the embodiments of the present disclosure may be a server. The server receives a real-time data stream to be processed, optionally, the data stream to be processed can be received through a kafaka message queue, wherein the data stream to be processed can comprise batch processing data and stream processing data, and the batch processing data can be processed in a batch processing mode, and can be data with larger data processing capacity, such as data generated in a larger time span; the stream processing data may be data processed by stream processing, and may generally be data having a smaller data processing amount, for example, data generated over a smaller time span.
Step S202, processing a data stream to be processed according to a stream batch integrated processing mode to obtain processed data;
the flow batch integrated processing mode may be to process flow processing data and batch processing data in a unified data processing mode without distinguishing flow processing data or batch processing data, and specific data processing may include, but is not limited to, performing type conversion, structured processing, text type cleaning and other processes on the data to obtain processed data.
Step S203, the processed data is stored in a data lake.
The server stores the processed data in the data lake in the form of a data table in a specific format, where the specific format may be any format determined according to actual needs. A data lake can be understood as a distributed database storing massive data; its main difference from a traditional data warehouse is that it can ingest data in real time while providing good query performance. Data queries against the data lake achieve average second-level response times, meeting users' low-latency query requirements.
According to the data stream processing method provided by the embodiments of the present disclosure, processing the stream-processing data and batch-processing data to be processed in a unified stream-batch mode improves the timeliness of data processing, and storing the processed data in a data lake meets users' low-latency requirements for data queries.
A specific implementation of processing the data stream to be processed in the stream-batch integrated mode is as follows:
in one embodiment, the processing of the data stream to be processed according to the stream batch integrated processing mode to obtain processed data includes:
And processing the data stream to be processed by using the same data processing engine to obtain processed data.
In the stream-batch integrated mode, the same data processing engine processes the entire data stream to be processed, handling stream-processing data and batch-processing data uniformly without distinguishing between them.
Optionally, the Flink data processing engine is used to process the data stream to be processed in an integrated way, which may include, but is not limited to, type conversion, structuring, and text cleaning; the processed data is stored in the data lake in the form of a data table for users to query and analyze through applications. The Flink data processing engine can process the data stream in real time, supporting streaming data processing in the true sense and realizing real-time processing and ingestion, which further reduces the latency of data entering the data lake.
In the embodiments of the present disclosure, stream-processing data and batch-processing data are processed uniformly by the same data processing engine, so the technical architecture is simple, function reuse at each stage is maximized, and the system is easy to maintain and manage. This avoids the high maintenance cost of running two big data processing engines, one for streams and one for batches: with two engines, the architectures and application programming interfaces (APIs) of the processing layers differ, any logic change must be updated synchronously in multiple places, and iteration cycles lengthen as data volume grows.
In addition, processing stream and batch data uniformly with the same engine allows processed data to enter the data lake in real time, minimizing production latency; the visibility of processing results can be reduced from day or hour level to minute or even second level. With two engines, all data ultimately reaching the data warehouse is processed in scheduled batches; for example, with batch tasks scheduled at day granularity, data changed today only enters the warehouse after unified processing the next day, so ingestion latency is high and users cannot query and analyze processing results promptly.
A specific implementation of storing the processed data in the data lake is as follows:
in one embodiment, storing the processed data in a data lake includes:
pre-configuring index information of a data table in a first format in a data lake;
updating the processed data according to the index information;
and storing the updated data into a data table in the first format.
The processed data is stored in the data lake in the form of a data table in a first format, where the first format may be any format determined according to actual needs. Index information for the first-format data table may be preconfigured, including, but not limited to, which fields are indexed and with what kind of index. The index information can be a data structure that sorts the values of one or more columns of the data table. The processed data is written into the first-format data table in real time according to the index information in a change data capture (CDC) manner; alternatively, target data in the table is read according to the index information and then modified.
Optionally, the server may preconfigure the data type of each field in the first-format data table and the compression policy to use, and update the processed data in the first-format data table according to at least one of the index information, the data types, and the compression policy.
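The preconfigured metadata and the CDC-style update path can be sketched as follows. This is an illustrative simplification with hypothetical field names and a single-field index; real lake formats store such metadata in file footers rather than a Python dict:

```python
# Hypothetical preconfigured metadata for the first-format table:
# which fields are indexed, their declared types, and a compression policy.
table_config = {
    "index_fields": ["user_id"],
    "types": {"user_id": int, "score": float},
    "compression": "snappy",  # illustrative choice
}

table = {}  # rows keyed by the indexed field

def upsert(change):
    """Apply a CDC-style change record: insert or update the row
    identified by the indexed field, casting values to declared types."""
    key = change[table_config["index_fields"][0]]
    row = {f: table_config["types"][f](v)
           for f, v in change.items() if f in table_config["types"]}
    table[key] = row

upsert({"user_id": 7, "score": "3.5"})   # insert
upsert({"user_id": 7, "score": "4.0"})   # later change captured and applied
```

Because writes are applied through the index key, a later change record updates the existing row in place rather than appending a duplicate.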
In the embodiments of the present disclosure, the index information of the data table is preconfigured and the data in the first-format data table is updated according to it, so that when users query data, they can query the data lake using the index information, improving query efficiency. Without index information, each query corresponds to a batch processing task, so queries are slow and latency is relatively high.
In one embodiment, the method further comprises:
under the condition that the processed data are data stored in a data table in a first format, historical data are acquired from a data warehouse, and the historical data are data stored in a data table in a second format;
the historical data is converted into data stored in a data table in a first format and is stored in a data lake.
The second format may be any format different from the first format. For stock data, that is, historical data stored in the data warehouse, data held in tables of a different format can be converted into the same format as the real-time data stream: data stored in second-format data tables is converted into data stored in first-format data tables and stored in the data lake, realizing compatible storage.
In the embodiments of the present disclosure, converting historical data into the same format as the real-time data stream and storing it in the data lake allows composite queries over historical data and real-time processing results through a single data interface, improving query convenience while enabling fast queries.
In one embodiment, wherein obtaining historical data from a data warehouse comprises:
obtaining a table structure and a data storage path of a data table in a second format from a data warehouse;
based on the table structure and the data storage path, historical data stored in the data table in the second format is obtained.
The historical data is stored in the data warehouse in the form of second-format data tables. By extracting the table structure and data storage path of a second-format data table, an instance of the table can be obtained based on that structure and path, and thereby the historical data.
In the embodiment of the disclosure, the history data is acquired by acquiring the table structure and the data storage path of the data table in the second format, so that the history data stored in the data table in the second format can be conveniently and quickly acquired.
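As an illustrative sketch only (the metastore contents, table name, and path below are hypothetical), obtaining the table structure and data storage path can be pictured as a metastore lookup:

```python
# Hypothetical stand-in for a Hive-metastore entry describing the
# second-format table: its column schema and its storage path.
metastore = {
    "events_hive": {
        "schema": [("user_id", "int"), ("event", "string")],
        "location": "hdfs://warehouse/events_hive",
    }
}

def describe_table(name):
    """Return the table structure and data storage path; from these an
    instance of the table (its stored files) can be located."""
    entry = metastore[name]
    return entry["schema"], entry["location"]

schema, path = describe_table("events_hive")
```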
In one embodiment, where converting the historical data into data stored in a data table in a first format and storing the data in a data lake includes:
creating an external table of the data table in the second format, wherein the external table is associated with the data table in the second format through a data storage path so that the historical data stored in the data table in the second format becomes the historical data stored in the external table;
And loading the historical data stored in the data table in the second format into the data table in the first format by using the external table, and storing the historical data stored in the data table in the first format into the data lake.
The external table can be a data table created in the data lake, outside the data warehouse. Its data storage path is defined to point to the second-format data table in the data warehouse. Optionally, the data stored in the second-format table may be partitioned by the time span over which it was generated (e.g., 1 hour or 1 day), and each partition registered in the external table; that is, the storage path of each partition is associated with the metadata of the external table, so that the historical data stored in the second-format table becomes historical data stored in the external table. The external table is then used to load the historical data from the second-format table into the first-format table, introducing the historical data into the data lake.
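The partition-registration idea can be sketched as follows. This is a simplified illustration (the class, paths, and storage dict are hypothetical); the essential point is that the external table holds only path metadata, so no data is copied at registration time:

```python
class ExternalTable:
    """A table defined outside the warehouse whose metadata records,
    per partition, the storage path of the warehouse data. Registering
    a partition makes its rows readable through this table without
    copying them first."""

    def __init__(self):
        self.partitions = {}  # partition key -> storage path

    def register_partition(self, key, path):
        self.partitions[key] = path

    def read(self, storage):
        # 'storage' maps paths to row lists (a stand-in for HDFS reads)
        rows = []
        for path in self.partitions.values():
            rows.extend(storage[path])
        return rows

storage = {
    "hdfs://warehouse/events/dt=2021-08-01": [{"user_id": 1}],
    "hdfs://warehouse/events/dt=2021-08-02": [{"user_id": 2}],
}
ext = ExternalTable()
ext.register_partition("2021-08-01", "hdfs://warehouse/events/dt=2021-08-01")
ext.register_partition("2021-08-02", "hdfs://warehouse/events/dt=2021-08-02")
```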
In the embodiments of the present disclosure, historical data is introduced into the data lake through the external table of the second-format data table, so that the data lake contains both historical data and the real-time processed data stream, satisfying users' composite query requirements over historical data and real-time processing results.
In one embodiment, the loading the history data stored in the data table of the second format into the data table of the first format using the external table includes:
converting the storage format of the history data stored in the external table into the storage format of the data table in the first format;
acquiring index information of a data table in a first format;
and writing the data after the storage format conversion into a data table of the first format according to the index information.
Specifically, since the historical data stored in the external table is held in the second format, its storage format must be converted into the storage format of the first-format data table, and the data written into the first-format table according to the preconfigured index information of that table. Optionally, the Spark compute engine may be used to load the historical data from the second-format table into the first-format table.
In the embodiments of the present disclosure, through storage format conversion and the index information, the historical data can be loaded into the first-format data table, realizing compatible storage of historical data and the real-time processed data stream in the data lake.
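A minimal sketch of the conversion step, under the assumption (illustrative only) that the second format is row-oriented and the first format is column-oriented with rows ordered by the indexed field so the preconfigured index information stays valid:

```python
def convert_and_load(rows, index_field):
    """Convert rows held in a row-oriented (second) format into a
    column-oriented (first) format, ordered by the indexed field."""
    ordered = sorted(rows, key=lambda r: r[index_field])
    # One list per column, in index order.
    return {f: [r[f] for r in ordered] for f in ordered[0]}

hive_rows = [{"user_id": 3, "event": "buy"},
             {"user_id": 1, "event": "click"}]
first_format_table = convert_and_load(hive_rows, "user_id")
```

Sorting by the indexed field during the load is what later allows index-based pruning and binary search on the converted data.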
In one embodiment, the first format is the CarbonData format and the second format is the Hive format.
The data tables in the data lake may use the CarbonData format, which can be understood as a distributed file storage format that supports storing massive data sets on the Hadoop Distributed File System (HDFS). The data tables stored in the data warehouse may use the Hive format.
In the embodiments of the present disclosure, data is stored in the data lake in CarbonData-format data tables; compared with the Hive format, queries have shorter response times and higher speed.
In one embodiment, the method further comprises:
receiving a data query request, wherein the data query request comprises a query condition;
determining index information matched with the query condition, wherein the index information is index information of a data table in a first format in a data lake;
and determining the data meeting the query conditions in the data lake according to the index information.
Specifically, the server receives a data query request sent by a user terminal through an application, parses the request, extracts the query condition it carries, and queries according to the index information and the query condition. The query condition can be any condition the query target must satisfy; the present disclosure does not limit it.
Optionally, the index information of the first-format data table in the data lake is preconfigured. The data lake contains a number of first-format files storing the first-format data table, and corresponding index information can be configured for each file. The query condition is matched against the index information of each first-format file to obtain the files whose matching degree meets a preset threshold; a file-level query is then performed according to the index information of those files to obtain the data blocks satisfying the query condition. Optionally, a binary search may further be performed within a data block to obtain the final query target.
In the embodiments of the present disclosure, querying according to the index information of the first-format data table enables fast responses to data query requests, improving query speed and meeting users' low-latency query requirements.
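The two-level query described above, file-level pruning by index information followed by binary search inside a data block, can be sketched as follows (the per-file min/max index is an illustrative choice of index structure, not necessarily the patented one):

```python
import bisect

# Hypothetical per-file index information: min/max of the indexed
# column for each first-format file, plus the file's sorted values.
files = [
    {"name": "f0", "min": 1, "max": 50, "values": [1, 7, 23, 50]},
    {"name": "f1", "min": 51, "max": 99, "values": [51, 60, 99]},
]

def query(target):
    """File-level pruning via index information, then binary search
    inside the matching data block."""
    for f in files:
        if f["min"] <= target <= f["max"]:   # index match: this file may hold it
            i = bisect.bisect_left(f["values"], target)
            if i < len(f["values"]) and f["values"][i] == target:
                return f["name"], i
    return None

result = query(60)  # only f1's block is searched; f0 is pruned by its index
```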
In one embodiment, determining index information that matches the query condition includes:
and querying index information matched with the query condition in the distributed cache server.
Specifically, when determining the index information matched with the query condition, the query can first be performed in the distributed cache server, which caches index information from before the data query request. The distributed cache server can be enabled and disabled through configuration; once enabled, it is requested by default for every user data query, and if it holds index information matching the query condition, that information is used preferentially to query the data lake.
The distributed cache server supports horizontal scaling and can flexibly choose a cache retention policy according to the service scenario; it may be a server cluster composed of multiple servers, further supporting efficient data queries.
In the embodiment of the disclosure, the timeliness of data query can be improved by querying index information matched with the query condition in the distributed cache server and then performing data query according to the index information.
In one embodiment, determining index information that matches the query condition includes:
under the condition that index information matched with the query condition does not exist in the distributed cache server, determining the index information matched with the query condition in a data lake through the cluster nodes;
and loading index information matched with the query condition into the distributed cache server.
Specifically, if no index information matched with the query condition exists in the distributed cache server, the index information matched with the query condition is determined in the data lake through the cluster nodes, that index information is loaded into the distributed cache server, and the data query is then carried out in the data lake according to it.
The cluster nodes can be query nodes of a data query system, the data query system can comprise a plurality of cluster nodes, each cluster node can be one or more servers or terminal equipment, and data query can be performed in a data lake through each cluster node.
In the embodiment of the disclosure, under the condition that index information matched with the query condition does not exist in the distributed cache server, the cluster node determines the index information matched with the query condition in the data lake, so that the data query requirement of a user is met.
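The lookup order described above — check the distributed cache server first, fall back to the cluster nodes on a miss, then load the result into the cache — is a cache-aside pattern. A hedged in-process sketch, in which the cache class and the cluster-node computation are simplified stand-ins invented for illustration:

```python
class IndexCache:
    """In-memory stand-in for the distributed cache server."""
    def __init__(self):
        self._store = {}
    def get(self, condition):
        return self._store.get(condition)
    def put(self, condition, index_info):
        self._store[condition] = index_info

def compute_index_in_lake(condition):
    """Stand-in for the cluster nodes determining the index in the data lake."""
    return f"index-for-{condition}"

def lookup_index(cache, condition):
    index_info = cache.get(condition)                  # 1. try the cache first
    if index_info is None:
        index_info = compute_index_in_lake(condition)  # 2. miss: compute via cluster nodes
        cache.put(condition, index_info)               # 3. load into the cache
    return index_info

cache = IndexCache()
first = lookup_index(cache, "date=2021-08-04")   # miss: computed, then cached
second = lookup_index(cache, "date=2021-08-04")  # hit: served from the cache
```

Subsequent queries with the same condition are answered from the cache, which is what gives the repeated-query speedup the disclosure describes.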
Fig. 3 is a schematic diagram of a data stream processing method according to an embodiment of the disclosure. As shown in fig. 3, the data stream processing method may include:
in step S301, a data stream to be processed is acquired, where the data stream to be processed includes stream processing data and batch processing data.
Step S302, the same data processing engine is utilized to process the data stream to be processed, and the processed data is obtained.
Step S303, the processed data are stored in a data lake.
Step S304, a data query request is received, wherein the data query request comprises query conditions.
In step S305, index information matching the query condition is queried in the distributed cache server, and if index information matching the query condition does not exist in the distributed cache server, the index information matching the query condition is determined in the data lake by the cluster node.
Step S306, index information matched with the query condition is loaded into the distributed cache server.
Step S307, according to the index information, determining the data meeting the query condition in the data lake.
In the embodiment of the disclosure, the stream processing data and the batch processing data to be processed are processed in a stream-batch integrated manner, which improves the timeliness of data processing; efficient query is performed using the index information and the distributed cache server, which can support second-level response for PB-level data queries. The embodiment supports both real-time data ingestion into the data lake and quick response to data queries, so that the overall time from data generation to the user's final query is shortened and the end-to-end low-latency requirement is well guaranteed.
FIG. 4 is a schematic diagram of data warehouse and data lake compatible storage in an embodiment of the present disclosure. As shown in fig. 4, in this embodiment, the first format is the CarbonData format and the second format is the Hive format. The history data is stored in a data warehouse (the "traditional data warehouse production environment" in the figure) in the form of a Hive table. A metadata management service (also called Hive MetaStore) maintains the table structure and the data storage path of the Hive table. The table structure and the data storage path are acquired from the metadata management service ("Schema acquisition" in the figure) to obtain a Hive table instance (the "Hive table" in the figure); a Hive external table is created in the data lake (the "CarbonData data lake production environment" in the figure); the Hive table and the Hive external table are associated through the data storage path ("external table association" in the figure); the history data stored in the Hive external table is format-converted and loaded into the CarbonData table; and a client can read and write data in the Hive external table or the CarbonData table ("read and write" in the figure). The client can connect to the data lake and query the data in Spark-Shell mode or ThriftServer mode. Both the CarbonData table and the Hive table may be stored in a distributed storage system.
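The external-table association and load described above could be sketched in SQL along the following lines. This is an assumption-laden illustration, not the disclosure's actual statements: the table names, column list, storage path, and the `STORED AS carbondata` clause are all hypothetical.

```sql
-- Hypothetical sketch: expose the existing Hive data through an external
-- table associated by its storage path (taken from the metadata service),
-- then load it into a CarbonData-format table in the data lake.
CREATE EXTERNAL TABLE hive_history_ext (id BIGINT, payload STRING)
LOCATION 'hdfs://warehouse/history';

CREATE TABLE history_carbon (id BIGINT, payload STRING)
STORED AS carbondata;

INSERT INTO history_carbon SELECT * FROM hive_history_ext;
```

Because the external table only references the existing storage path, the history data is not copied twice: the format conversion happens once, during the load into the CarbonData table.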
Fig. 5 is a schematic diagram of a data stream processing method in an embodiment of the disclosure. As shown in fig. 5, in this embodiment, the data table in the first format is a CarbonData table. The method comprises two aspects: pre-configuration of the CarbonData table and real-time capture of CarbonData table data. The pre-configuration of the CarbonData table comprises: the index information and the data type of each field in the CarbonData table are preconfigured ("CarbonData table structure definition" in the figure), and subsequent data is injected into the data lake according to the preconfigured index information and data types ("index type/field injection" in the figure). The real-time capture of CarbonData table data comprises: the data stream to be processed flows into the Flink data processing engine through a Kafka message queue; the data is processed by the Flink data processing engine and then injected into the change data capture (CDC) module; finally the data is written into the CarbonData table in the data lake through the CDC module for data query by the user.
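The Kafka → Flink → CDC → CarbonData pipeline above can be simulated in-process to show the shape of the data flow. A real deployment would use Kafka and Flink; the queue, the transform, and the sink below are illustrative stand-ins, and the record format is invented for the example.

```python
from collections import deque

kafka_queue = deque(["raw:1", "raw:2", "raw:3"])  # stand-in for a Kafka topic
carbon_table = []                                  # stand-in for the CarbonData table

def flink_process(record):
    """Stand-in for the Flink processing step (parse the raw record)."""
    return {"value": int(record.split(":")[1])}

def cdc_write(row, table):
    """Stand-in for the CDC module writing a captured change to the table."""
    table.append(row)

# drain the queue: each record is processed, then written via the CDC module
while kafka_queue:
    cdc_write(flink_process(kafka_queue.popleft()), carbon_table)
```

Each stage hands one record at a time to the next, which is why data becomes queryable in the data lake shortly after it is produced.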
Fig. 6 is a schematic diagram of a data query system in an embodiment of the present disclosure. As shown in fig. 6, the data query system includes a ThriftServer node, a distributed cache server (also referred to as the Distributed Index Server), and a plurality of cluster nodes (cluster node 1, cluster node 2 … cluster node n in the figure). The ThriftServer node includes a request forwarding sub-node (also referred to as the ThriftServer sub-node) and a session context sub-node (also referred to as the SparkContext sub-node); the distributed cache server includes an index driver node (also referred to as the Index Driver node) and a plurality of index execution nodes (also referred to as Index Executor nodes): index execution node 1, index execution node 2 … index execution node m; each cluster node includes a compute sub-node (also known as Spark Executor) and a query sub-node (also known as Carbon Engine). A plurality of files stored in the form of CarbonData tables, such as file 1, file 2, file 3, file 4 … file X, file Y in the figure, are stored in the data lake (the "distributed storage system" in the figure).
The ThriftServer node is configured with the request address and port of the user side, so that the user can connect to the node through Java Database Connectivity (JDBC). The ThriftServer node is connected to a Spark cluster (a server cluster based on the Spark computing engine), so that the user's query request can be distributed to each cluster node for distributed query, improving data query efficiency. After the ThriftServer node is started, it keeps long-lived connections with the other nodes, and when a data query request is received, the request is distributed to all cluster nodes in real time. Spark resources can also be locked at system startup and kept resident in memory, so that data query requests distributed by the ThriftServer node can be responded to at any time.
The specific data query process in this embodiment comprises: when a data query request sent by a user application is received, the request forwarding sub-node extracts the query condition from the data query request, drives each index execution node through the index driver node, and queries the distributed cache server for index information matched with the query condition through each index execution node; if matching index information is found, the corresponding target file is queried in the data lake according to that index information. If no index information matched with the query condition exists in the distributed cache server, the request forwarding sub-node forwards the data processing request to the session context sub-node, the session context sub-node computes the index information matched with the query condition through the compute sub-nodes in the cluster nodes, and the query sub-node queries the corresponding target file in the data lake according to the index information.
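The fan-out from the index driver node to the index execution nodes can be sketched as a parallel lookup over partitions of the cached index. The partition layout, the key format, and the "first hit wins" policy below are simplifying assumptions for illustration, not the disclosure's actual protocol.

```python
from concurrent.futures import ThreadPoolExecutor

# each index execution node holds one partition of the cached index
index_partitions = [
    {"city=beijing": "file1"},
    {"city=shanghai": "file3"},
    {"date=2021-08-04": "file4"},
]

def executor_lookup(partition, condition):
    """One index execution node searching its own partition."""
    return partition.get(condition)

def driver_query(condition):
    """Index driver: fan the condition out to all executors, collect a hit."""
    with ThreadPoolExecutor(max_workers=len(index_partitions)) as pool:
        results = pool.map(executor_lookup, index_partitions,
                           [condition] * len(index_partitions))
    hits = [r for r in results if r is not None]
    return hits[0] if hits else None

target = driver_query("city=shanghai")  # resolved to "file3"
```

Partitioning the cached index across execution nodes is what lets the cache scale horizontally while keeping a single lookup's latency low.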
The CarbonData table and the Hive table are compared in terms of query response time using actual service scenario data, to show the actual query acceleration of the present scheme relative to the data warehouse. In this embodiment, 6 representative query scenarios are selected; in each scenario, 10 queries are taken and the average time consumption is computed. The corresponding details are shown in table 1.
TABLE 1
As shown in fig. 7, the horizontal axis in fig. 7 represents each scenario, and the vertical axis represents the query time consumption of the Hive table and the CarbonData table in each scenario. To compare effects under the same conditions, the query environments are aligned as follows:
(1) The storage levels are exactly aligned: the Hive table and the CarbonData table use the same storage cluster.
(2) The computing resources are exactly aligned: the same resources and the same client are used for querying; the resource allocation is specifically 30 executor instances, each with 2 cores and 8 GB of memory.
From fig. 7 and table 1, the following conclusions can be drawn:
(1) The CarbonData query speed exceeds the Hive query speed in all six query scenarios, being 17.91 times faster on average.
(2) Even with limited query resources, CarbonData can achieve second-level response in most scenarios. With more adequate query resources, the query response performance would improve further.
(3) CarbonData offers significant advantages for complex aggregate queries and point queries.
Fig. 8 is a schematic diagram of a data stream processing apparatus according to an embodiment of the disclosure. As shown in fig. 8, the data stream processing apparatus may include:
an obtaining module 801, configured to obtain a data stream to be processed, where the data stream to be processed includes stream processing data and batch processing data;
a processing module 802, configured to process a data stream to be processed according to a batch integrated processing manner, to obtain processed data;
a storage module 803 for storing the processed data in a data lake.
According to the data stream processing device provided by the embodiment of the disclosure, stream processing data and batch processing data to be processed are processed in a stream-batch integrated manner, which improves the timeliness of data processing; the processed data is stored in a data lake, which can meet the user's low-latency requirement for data query.
In one embodiment, the processing module 802 is specifically configured to:
and processing the data stream to be processed by using the same data processing engine to obtain processed data.
In one embodiment, the storage module 803 is specifically configured to:
pre-configuring index information of a data table in a first format in a data lake;
Updating the processed data according to the index information;
and storing the updated data into a data table in the first format.
In one embodiment, the data stream processing device further includes a conversion module, where the conversion module includes an acquisition unit and a conversion unit;
the acquisition unit is used for acquiring historical data from the data warehouse when the processed data are stored in the data table in the first format, wherein the historical data are stored in the data table in the second format;
and the conversion unit is used for converting the historical data into data stored in the data table in the first format and storing the data into the data lake.
In one embodiment, the obtaining unit is specifically configured to:
obtaining a table structure and a data storage path of a data table in a second format from a data warehouse;
based on the table structure and the data storage path, historical data stored in the data table in the second format is obtained.
In one embodiment, the conversion unit is specifically configured to:
creating an external table of the data table in the second format, wherein the external table is associated with the data table in the second format through a data storage path so that the historical data stored in the data table in the second format becomes the historical data stored in the external table;
And loading the historical data stored in the data table in the second format into the data table in the first format by using the external table, and storing the historical data stored in the data table in the first format into the data lake.
In one embodiment, the conversion unit is configured to, when loading the history data stored in the data table of the second format into the data table of the first format using the external table:
converting the storage format of the history data stored in the external table into the storage format of the data table in the first format;
acquiring index information of a data table in a first format;
and writing the data after the storage format conversion into a data table of the first format according to the index information.
In one embodiment, the first format is the CarbonData format; the second format is the Hive format.
In one implementation manner, the data stream processing apparatus further includes a query module, and fig. 9 is a schematic diagram of the query module in an embodiment of the disclosure, where, as shown in fig. 9, the query module includes a receiving unit 901, a first determining unit 902, and a second determining unit 903;
a receiving unit 901, configured to receive a data query request, where the data query request includes a query condition;
a first determining unit 902, configured to determine index information that matches the query condition, where the index information is index information of a data table in a first format in the data lake;
A second determining unit 903, configured to determine, according to the index information, data meeting the query condition in the data lake.
In one embodiment, the first determining unit 902 is configured to:
and querying index information matched with the query condition in the distributed cache server.
In one embodiment, the first determining unit 902 is configured to:
under the condition that index information matched with the query condition does not exist in the distributed cache server, determining the index information matched with the query condition in a data lake through the cluster nodes;
and loading index information matched with the query condition into the distributed cache server.
The functions of each unit, module or sub-module in each apparatus of the embodiments of the present disclosure may be referred to the corresponding descriptions in the above method embodiments, which are not repeated herein.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the user personal information involved all conform to the provisions of relevant laws and regulations, and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the apparatus 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Various components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and communication unit 1009 such as a network card, modem, wireless communication transceiver, etc. Communication unit 1009 allows device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the respective methods and processes described above, such as a data stream processing method. For example, in some embodiments, the data stream processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1000 via ROM 1002 and/or communication unit 1009. When a computer program is loaded into RAM 1003 and executed by computing unit 1001, one or more steps of the data stream processing method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the data stream processing method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (18)

1. A method of data stream processing, the method comprising:
acquiring a data stream to be processed, wherein the data stream to be processed comprises stream processing data and batch processing data;
processing the data stream to be processed according to a stream batch integrated processing mode to obtain processed data;
storing the processed data into a data lake;
acquiring historical data from a data warehouse under the condition that the processed data are data stored in a data table in a first format, wherein the historical data are data stored in a data table in a second format;
Creating an external table of the data table in the second format, wherein the external table is associated with the data table in the second format through a data storage path so that the history data stored in the data table in the second format is the history data stored in the external table;
converting the storage format of the history data stored by the external table into the storage format of a data table in a first format; acquiring index information of the data table in the first format; writing the data after the storage format conversion into the data table of the first format according to the index information, and storing the historical data stored in the data table of the first format into the data lake.
2. The method of claim 1, wherein the processing the data stream to be processed according to a batch integrated processing manner to obtain processed data comprises:
and processing the data stream to be processed by using the same data processing engine to obtain processed data.
3. The method of claim 1, the storing the processed data into a data lake, comprising:
pre-configuring index information of a data table in a first format in the data lake;
updating the processed data according to the index information;
And storing the updated data into the data table of the first format.
4. The method of claim 1, wherein the obtaining historical data from a data warehouse comprises:
obtaining a table structure and a data storage path of a data table in a second format from the data warehouse;
and obtaining the historical data stored in the data table in the second format based on the table structure and the data storage path.
5. The method of any of claims 1-4, wherein the first format is a CarbonData format; the second format is a Hive format.
6. The method of claim 1, further comprising:
receiving a data query request, wherein the data query request comprises a query condition;
determining index information matched with the query condition, wherein the index information is index information of a data table in a first format in the data lake;
and determining data meeting the query conditions in the data lake according to the index information.
7. The method of claim 6, the determining index information that matches the query condition, comprising:
and querying index information matched with the query condition in the distributed cache server.
8. The method of claim 6, the determining index information that matches the query condition, comprising:
under the condition that index information matched with the query condition does not exist in the distributed cache server, determining the index information matched with the query condition in the data lake through cluster nodes;
and loading the index information matched with the query condition into the distributed cache server.
9. A data stream processing apparatus, the apparatus comprising:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a data stream to be processed, and the data stream to be processed comprises stream processing data and batch processing data;
the processing module is used for processing the data stream to be processed according to a stream batch integrated processing mode to obtain processed data;
the storage module is used for storing the processed data into a data lake;
the device also comprises a conversion module, wherein the conversion module comprises an acquisition unit and a conversion unit;
the acquiring unit is used for acquiring historical data from a data warehouse under the condition that the processed data are data stored in a data table in a first format, wherein the historical data are data stored in a data table in a second format;
The conversion unit is configured to create an external table of the data table in the second format, where the external table is associated with the data table in the second format through a data storage path, so that the history data stored in the data table in the second format is the history data stored in the external table;
converting the storage format of the history data stored by the external table into the storage format of a data table in a first format; acquiring index information of the data table in the first format; writing the data after the storage format conversion into the data table of the first format according to the index information, and storing the historical data stored in the data table of the first format into the data lake.
10. The apparatus of claim 9, the processing module being specifically configured to:
and processing the data stream to be processed by using the same data processing engine to obtain processed data.
11. The apparatus of claim 9, the storage module being specifically configured to:
pre-configuring index information of a data table in a first format in the data lake;
updating the processed data according to the index information;
and storing the updated data into the data table of the first format.
12. The apparatus of claim 9, wherein the obtaining unit is specifically configured to:
obtaining a table structure and a data storage path of a data table in a second format from the data warehouse;
and obtaining the historical data stored in the data table in the second format based on the table structure and the data storage path.
13. The apparatus of any of claims 9-12, wherein the first format is a CarbonData format; the second format is a Hive format.
14. The apparatus of claim 9, further comprising a query module comprising a receiving unit, a first determining unit, and a second determining unit;
the receiving unit is used for receiving a data query request, wherein the data query request comprises a query condition;
the first determining unit is configured to determine index information matched with the query condition, where the index information is index information of a data table in a first format in the data lake;
and the second determining unit is used for determining the data meeting the query condition in the data lake according to the index information.
15. The apparatus of claim 14, wherein the first determining unit is specifically configured to:
query index information matching the query condition in the distributed cache server.
16. The apparatus of claim 14, wherein the first determining unit is specifically configured to:
in a case where no index information matching the query condition exists in the distributed cache server, determine, through cluster nodes, the index information matching the query condition in the data lake;
and load the index information matching the query condition into the distributed cache server.
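Taken together, claims 15 and 16 describe a cache-aside pattern: consult the distributed cache first; on a miss, derive the index information from the data lake and load it back into the cache so subsequent queries hit. A minimal sketch, with a plain dict standing in for the distributed cache server and a hypothetical `lake_scan_for_index` as the expensive fallback:

```python
cache = {}                                        # stand-in distributed cache

def lake_scan_for_index(condition):
    """Expensive fallback: derive index info from the lake via cluster nodes."""
    return {"condition": condition, "row_ids": [0, 1]}

def get_index_info(condition):
    if condition in cache:                        # claim 15: try the cache
        return cache[condition]
    info = lake_scan_for_index(condition)         # claim 16: miss -> data lake
    cache[condition] = info                       # load the result into cache
    return info

first = get_index_info("user=alice")              # miss: computed and cached
second = get_index_info("user=alice")             # hit: served from cache
```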
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
CN202110892026.1A 2021-08-04 2021-08-04 Data stream processing method and device, electronic equipment and storage medium Active CN113568938B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110892026.1A CN113568938B (en) 2021-08-04 2021-08-04 Data stream processing method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113568938A CN113568938A (en) 2021-10-29
CN113568938B true CN113568938B (en) 2023-11-14

Family

ID=78170351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110892026.1A Active CN113568938B (en) 2021-08-04 2021-08-04 Data stream processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113568938B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113824954B (en) * 2021-11-23 2022-02-08 深圳市华曦达科技股份有限公司 OTT video quality monitoring method, device, equipment and storage medium
CN114297189B (en) * 2022-01-10 2024-05-10 成都国铁电气设备有限公司 Subway track geometry detection data cleaning method based on Flink stream processing
CN115629918B (en) * 2022-10-24 2023-06-27 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110187385A (en) * 2019-06-27 2019-08-30 北京中油瑞飞信息技术有限责任公司 Seismic data acquisition method, seismic data processing technique and device
CN111291047A (en) * 2020-01-16 2020-06-16 北京明略软件系统有限公司 Space-time data storage method and device, storage medium and electronic equipment
CN111459908A (en) * 2020-03-08 2020-07-28 中国科学院城市环境研究所 Multi-source heterogeneous ecological environment big data processing method and system based on data lake
CN111767332A (en) * 2020-06-12 2020-10-13 上海森亿医疗科技有限公司 Data integration method, system and terminal for heterogeneous data sources
CN111831713A (en) * 2019-04-18 2020-10-27 阿里巴巴集团控股有限公司 Data processing method, device and equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11232009B2 (en) * 2018-08-24 2022-01-25 EMC IP Holding Company LLC Model-based key performance indicator service for data analytics processing platforms

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111831713A (en) * 2019-04-18 2020-10-27 阿里巴巴集团控股有限公司 Data processing method, device and equipment
CN110187385A (en) * 2019-06-27 2019-08-30 北京中油瑞飞信息技术有限责任公司 Seismic data acquisition method, seismic data processing technique and device
CN111291047A (en) * 2020-01-16 2020-06-16 北京明略软件系统有限公司 Space-time data storage method and device, storage medium and electronic equipment
CN111459908A (en) * 2020-03-08 2020-07-28 中国科学院城市环境研究所 Multi-source heterogeneous ecological environment big data processing method and system based on data lake
CN111767332A (en) * 2020-06-12 2020-10-13 上海森亿医疗科技有限公司 Data integration method, system and terminal for heterogeneous data sources

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on the big data management system and processing mechanism of universities based on data lakes; Gu Hongbin; Yang Xi; Wei Kongpeng; Computer Era (Issue 05); full text *

Also Published As

Publication number Publication date
CN113568938A (en) 2021-10-29

Similar Documents

Publication Publication Date Title
CN113568938B (en) Data stream processing method and device, electronic equipment and storage medium
CN113051446A (en) Topological relation query method, device, electronic equipment and medium
US11093496B1 (en) Performance-based query plan caching
CN113407649A (en) Data warehouse modeling method and device, electronic equipment and storage medium
CN115617849A (en) Data processing method and device, electronic equipment and storage medium
CN113220710B (en) Data query method, device, electronic equipment and storage medium
CN113312539B (en) Method, device, equipment and medium for providing search service
CN112925811B (en) Method, apparatus, device, storage medium and program product for data processing
CN108319604B (en) Optimization method for association of large and small tables in hive
CN109726219A (en) The method and terminal device of data query
CN116932147A (en) Streaming job processing method and device, electronic equipment and medium
CN113239054B (en) Information generation method and related device
CN116383207A (en) Data tag management method and device, electronic equipment and storage medium
CN116185578A (en) Scheduling method of computing task and executing method of computing task
CN115905322A (en) Service processing method and device, electronic equipment and storage medium
US11074244B1 (en) Transactional range delete in distributed databases
CN116991562B (en) Data processing method and device, electronic equipment and storage medium
CN115086300B (en) Video file scheduling method and device
CN115759233B (en) Model training method, graph data processing device and electronic equipment
CN113032402B (en) Method, device, equipment and storage medium for storing data and acquiring data
CN114329161A (en) Data query method and device and electronic equipment
CN114996557B (en) Service stability determination method, device, equipment and storage medium
US20230342352A1 (en) System and Method for Matching into a Complex Data Set
CN117931805A (en) Data processing method and device, electronic equipment and storage medium
CN116932623A (en) Conversion method, device and equipment of cloud container scheduling data and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant