CN111159135A

CN111159135A - Data processing method and device, electronic equipment and storage medium

Info

Publication number: CN111159135A
Application number: CN201911340646.3A
Authority: CN
Inventors: 李文学; 田浩; 史忠伟
Original assignee: Wuba Co Ltd
Current assignee: Wuba Co Ltd
Priority date: 2019-12-23
Filing date: 2019-12-23
Publication date: 2020-05-15

Abstract

The application discloses a data processing method, a data processing device, electronic equipment and a storage medium. The method comprises: acquiring a log data set from a distributed file system; performing offline ETL processing on the log data set to obtain an offline result data set; importing the offline result data set into Druid, wherein Druid contains at least one piece of real-time result data imported in advance, the real-time result data being obtained by real-time ETL (extract-transform-load) processing of the log data; and, in Druid, fusing the offline result data with the real-time result data. By using Druid to fuse offline data and real-time data, the method inherits the advantages of Druid and can also support constantly changing data analysis requirements. For example, when offline data is analyzed, the results for real-time data can be viewed directly alongside it, and when algorithm models are trained, requirements that involve both offline data and real-time data can be supported.

Description

Data processing method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, an electronic device, and a storage medium.

Background

In the era of information and data intelligence, data warehouses provide computing resources and databases for internet and intranet applications; they can hold vast quantities of data for analysis and support a variety of data access technologies.

A conventional data warehouse provides two independent data processing links, one of which is an offline processing link for processing offline data, and the other of which is a real-time processing link for processing real-time data, so that a real-time data result and an offline data result are generated in the two independent data processing links, respectively.

Because the two data processing links differ in development mode, processing method, logic, data source and so on, such a data warehouse architecture struggles to support continuously changing data analysis requirements.

For example, in a typical data analysis scenario, real-time report data and offline report data are queried from the data warehouse separately, and the offline report data is used to check the correctness of the real-time report data, i.e., to reconcile the two sets of figures. Because the offline report data and the real-time report data are generated by different data processing links, this reconciliation is extremely difficult to carry out.

Disclosure of Invention

The application provides a data processing method, a data processing device, electronic equipment and a storage medium, which are used for supporting continuously changing data analysis requirements.

In a first aspect, the present application provides a data processing method, including:

obtaining a log data set from a distributed file system, wherein the log data set comprises at least one log data;

performing offline ETL processing on the log data set to obtain an offline result data set, wherein the offline result data set comprises at least one piece of offline result data, and each piece of offline result data corresponds to one log data;

importing the offline result data set into a Druid, wherein the Druid comprises at least one piece of real-time result data which is imported in advance, and the real-time result data is obtained by processing the log data through a real-time ETL (extract transform load);

and in the Druid, fusing the offline result data and the real-time result data.

Further, importing the offline result data set into the Druid includes:

fragmenting the offline result data set according to a preset time sequence to obtain one or more offline data segments, wherein the preset time sequence comprises one or more time periods, and each offline data segment corresponds to one time period;

creating an index for each of the offline data segments;

storing the offline data segment with the index into the Druid.

Further, the fragmenting the offline result data set according to a preset time sequence to obtain one or more offline data segments includes:

acquiring a time stamp of log data corresponding to each piece of offline result data;

determining a time period corresponding to each piece of offline result data according to the timestamp;

and forming an offline data segment from the offline result data corresponding to the same time period, wherein the formed offline data segment corresponds to that time period.

Further, the performing offline ETL processing on the log data set to obtain an offline result data set includes:

transmitting the at least one log data into an offline ETL processing framework according to the time stamp sequence of the log data, wherein the offline ETL processing framework is a Hive-Sql-based data processing framework;

and carrying out offline ETL processing on the log data through the offline ETL processing framework to obtain offline result data respectively corresponding to each log data.

Further, the method further comprises:

when the log data is generated, transmitting the generated log data into a real-time ETL processing framework through Kafka, wherein the real-time ETL processing framework is a Spark Streaming-based data processing framework;

performing real-time ETL processing on the log data through the real-time ETL processing framework to obtain real-time result data;

importing the real-time result data into the Druid through Kafka.

Further, the importing the real-time result data into the Druid by Kafka includes:

acquiring a time stamp of log data corresponding to the real-time result data;

determining a time period corresponding to the real-time result data according to the timestamp;

and storing the real-time result data into the designated real-time data segment in the Druid according to the time period corresponding to the real-time result data.

Further, the fusing the offline result data and the real-time result data includes:

and fusing the offline data segment and the real-time data segment corresponding to the same time segment.

In a second aspect, the present application further provides a data processing apparatus, the apparatus comprising:

an acquisition module, configured to acquire a log data set from a distributed file system, wherein the log data set comprises at least one piece of log data;

an offline ETL processing module, configured to perform offline ETL processing on the log data set to obtain an offline result data set, where the offline result data set includes at least one offline result data, and each offline result data corresponds to one log data;

the offline data import module is used for importing the offline result data set into a Druid, the Druid comprises at least one piece of real-time result data which is imported in advance, and the real-time result data is obtained by processing the log data through a real-time ETL (extract transform and load);

and the fusion module is used for fusing the offline result data and the real-time result data in the Druid.

Further, the offline data importing module includes:

a data fragmentation unit, configured to fragment the offline result data set according to a preset time sequence to obtain one or more offline data segments, wherein the preset time sequence comprises one or more time periods, and each offline data segment corresponds to one time period;

an index creating unit, configured to create an index for each offline data segment;

and the data import unit is used for storing the offline data segment with the index into the Druid.

Further, the data fragmentation unit is specifically configured to:

Further, the offline ETL processing module comprises:

the offline data transmitting unit is used for transmitting the at least one log data into an offline ETL processing framework according to the time stamp sequence of the log data, and the offline ETL processing framework is a Hive-Sql-based data processing framework;

and the offline ETL processing unit is used for performing offline ETL processing on the log data through the offline ETL processing framework to obtain offline result data corresponding to each log data.

Further, the apparatus further comprises:

a first Kafka module, configured to, when log data is generated, transmit the generated log data to a real-time ETL processing framework through Kafka, where the real-time ETL processing framework is a Spark Streaming-based data processing framework;

the real-time ETL processing module is used for carrying out real-time ETL processing on the log data through the real-time ETL processing framework to obtain real-time result data;

a second Kafka module for importing the real-time result data into the Druid by Kafka.

Further, the second Kafka module is specifically configured to:

acquiring a time stamp of log data corresponding to the real-time result data;

Further, the fusion module is specifically configured to fuse the offline data segment and the real-time data segment corresponding to the same time period.

In a third aspect, the present application further provides an electronic device, including:

a memory for storing program instructions;

a processor for calling and executing program instructions in said memory to implement the method of any of the first aspects.

In a fourth aspect, the present application further provides a storage medium having a computer program stored therein, wherein when the computer program is executed by at least one processor of the apparatus of any one of the second aspects, the apparatus performs the method of any one of the first aspects.

As can be seen from the above technical solution, the embodiments of the application provide a data processing method that comprises: acquiring a log data set from a distributed file system; performing offline ETL processing on the log data set to obtain an offline result data set; importing the offline result data set into Druid, wherein Druid contains at least one piece of real-time result data imported in advance, the real-time result data being obtained by real-time ETL (extract-transform-load) processing of the log data; and, in Druid, fusing the offline result data with the real-time result data. By using Druid to fuse offline data and real-time data, the method inherits the advantages of Druid and can also support constantly changing data analysis requirements. For example, when offline data is analyzed, the results for real-time data can be viewed directly alongside it, and when algorithm models are trained, requirements that involve both offline data and real-time data can be supported.

Drawings

In order to more clearly explain the technical solution of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to the drawings without any creative effort.

FIG. 1 is a schematic diagram of a data warehouse architecture shown in the present application in accordance with an exemplary embodiment;

FIG. 2 is a flow diagram illustrating a data processing method according to an exemplary embodiment of the present application;

FIG. 3 is a flow diagram illustrating another data processing method according to an exemplary embodiment of the present application;

FIG. 4 is a block diagram of a data processing device according to an exemplary embodiment of the present application;

FIG. 5 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

To meet continuously changing data analysis requirements, an embodiment of the present application provides a data processing method that is applied to a data warehouse architecture based on the Lambda architecture. The method and the corresponding data warehouse architecture retain the high fault tolerance, low latency and scalability of Lambda while also fusing offline data with real-time data, and can therefore support continuously changing data analysis requirements. For example, when offline data is analyzed, the results for real-time data can be viewed directly alongside it, and when algorithm models are trained, requirements that involve both offline data and real-time data can be supported.

FIG. 1 is a schematic diagram of a data warehouse architecture according to an exemplary embodiment of the present application. As shown in FIG. 1, the data warehouse architecture includes: a Flume data acquisition module 10, an HDFS module 20, a Hive-Sql offline ETL module 30, a Hadoop index module 40, a first Kafka module 50, a Spark Streaming real-time ETL module 60, a second Kafka module 70, and a Druid module 80.

The Flume data acquisition module 10 is used for acquiring log data. Data is acquired in two modes, real-time and offline. Log data acquired in real time (real-time data) is transmitted in real time through the first Kafka module 50 (Kafka is a high-throughput distributed publish-subscribe messaging system) to the Spark Streaming real-time ETL module 60 for real-time ETL (Extract-Transform-Load) processing, and log data acquired offline (offline data) is stored in the HDFS module 20 (a distributed file system).

In a specific implementation, the Flume data acquisition module 10 may be the log collection system Flume, a highly available, highly reliable, distributed system for collecting, aggregating and transmitting massive volumes of logs. Flume supports customizing various data senders within the logging system for collecting data; it also provides simple processing of data and can write the data to various data receivers. With Flume, log information collected from multiple web servers can be stored efficiently in HDFS (Hadoop Distributed File System), and data acquired from multiple servers can be handed over to Hadoop quickly.

The HDFS module 20 is used for receiving data imported by the Flume data acquisition module 10 and storing the imported log data offline. HDFS is a distributed file system designed to run on general-purpose hardware. It is highly fault tolerant, is suited to deployment on low-cost machines, provides high-throughput data access, and is well suited to applications with large data sets.

A Hive-Sql offline ETL module 30, configured to perform offline ETL processing on offline log data.

In a specific implementation, the Hive-Sql offline ETL module 30 is a Hive-Sql-based data processing framework. Hive is a data warehouse analysis system built on Hadoop that provides rich SQL-style querying for analyzing data stored in the Hadoop distributed file system. It can map structured data files to database tables, provides complete SQL query functionality, and converts SQL statements into MapReduce tasks for execution. Its own SQL dialect, Hive SQL for short, is used to query and analyze the required content, so that users unfamiliar with MapReduce can conveniently query, summarize and analyze data using SQL.
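To make the SQL-to-MapReduce idea concrete, the following sketch shows, in plain Python, how an aggregation such as `SELECT page, COUNT(*) FROM logs GROUP BY page` decomposes into map, shuffle and reduce phases. It is a conceptual illustration only: the table name `logs`, the column `page` and the sample rows are assumptions, and the code does not reproduce how Hive actually plans or executes a query.

```python
from collections import defaultdict

# Hypothetical rows of a table mapped from a structured log file.
rows = [
    {"user": "u1", "page": "/home"},
    {"user": "u2", "page": "/search"},
    {"user": "u3", "page": "/home"},
]

def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["page"], 1

def shuffle(pairs):
    """Collect all values that share a key, as a shuffle stage would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values of each key, which yields COUNT(*) per group."""
    return {key: sum(values) for key, values in groups.items()}

page_counts = reduce_phase(shuffle(map_phase(rows)))
```

The user writes only the SQL statement; the three phases above are what the framework would derive from it.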

The Hadoop index module 40 is configured to import the offline result data output by the Hive-Sql offline ETL module 30 into the Druid module 80 by means of Hadoop indexing. Specifically, the Hadoop index module 40 is configured to fragment the offline result data set according to a preset time sequence to obtain one or more offline data segments, where the time sequence includes one or more time periods, and each offline data segment corresponds to one time period; create an index for each of the offline data segments; and store the offline data segments with the index into the Druid.

The first Kafka module 50 is configured to transmit the log data collected by the Flume data acquisition module 10 to the Spark Streaming real-time ETL module 60 in real time.

A Spark Streaming real-time ETL module 60, configured to perform real-time ETL processing on received log data; in a specific implementation, the Spark Streaming real-time ETL module 60 is a Spark Streaming based data processing framework.

A second Kafka module 70 for importing the real-time data results output by the real-time ETL module 60 into the Druid module 80.

The Druid module 80 is used for fusing the imported offline data result and the real-time data result.

In a specific implementation, the Druid module 80 is an OLAP data analysis system for processing time-series data in real time. It first segments data by time in order to build the data index, and then routes queries to the index by time. Druid accepts imports of both the offline data result and the real-time data result and fuses the two. In addition, Druid has the following advantages: sub-second response times; support for highly concurrent, user-facing applications; support for highly concurrent real-time import; high availability of all components; a distributed shared-nothing architecture; scalability to the PB level; economical use of resources; and support for aggregation functions. Compared with other OLAP tools such as Kylin, Druid can scale to more dimensions. Druid also suits the star schema of a data warehouse and supports importing both offline and real-time data.
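The benefit of segmenting by time and routing queries by time can be illustrated with a small sketch: a query over a time interval only needs to touch the segments whose time period overlaps that interval. The code below is a simplified stand-in written in plain Python, not Druid's implementation; the hourly granularity and the function name `segments_for_query` are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)  # assumed segment granularity

def segments_for_query(segment_starts, query_start, query_end):
    """Return start times of hourly segments overlapping [query_start, query_end).

    segment_starts must be sorted in ascending order. A segment [s, s + HOUR)
    overlaps the query when s > query_start - HOUR and s < query_end.
    """
    first = bisect_right(segment_starts, query_start - HOUR)
    last = bisect_left(segment_starts, query_end)
    return segment_starts[first:last]

starts = [datetime(2019, 12, 23, h) for h in (9, 10, 11, 12)]
hit = segments_for_query(
    starts, datetime(2019, 12, 23, 10, 30), datetime(2019, 12, 23, 11, 30)
)
```

In this example the query touches only the 10:00 and 11:00 segments; the 09:00 and 12:00 segments are never scanned.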

As can be seen, the data warehouse architecture introduces Druid on top of the Lambda architecture. It retains the high fault tolerance, low latency and scalability of Lambda, inherits the advantages of Druid, and uses Druid to fuse offline data and real-time data, so that the architecture can support constantly changing data analysis requirements. For example, when offline data is analyzed, the results for real-time data can be viewed directly alongside it, and when algorithm models are trained, requirements that involve both offline data and real-time data can be supported.

Based on the data warehouse architecture, an embodiment of the present application provides a data processing method, and fig. 2 is a flowchart of a data processing method according to an exemplary embodiment of the present application, and as shown in fig. 2, the method may include:

Step 100, obtaining a log data set from a distributed file system, wherein the log data set comprises at least one piece of log data.

As can be seen from fig. 1, the distributed file system is used to store offline data, and thus the log data set obtained from the distributed file system is an offline data set.

Step 200, performing offline ETL processing on the log data set to obtain an offline result data set, where the offline result data set includes at least one offline result data, and each offline result data corresponds to one log data.

In a specific implementation, offline ETL processing is performed on the log data set by an offline ETL processing framework, which outputs an offline result data set formed by at least one piece of offline result data. The offline ETL processing framework may be a Hive-Sql-based data processing framework.

For example, at least one log data in a log data set is transmitted into an offline ETL processing framework in the order of the time stamps of the log data; and carrying out offline ETL processing on the log data through the offline ETL processing framework to obtain offline result data respectively corresponding to each log data.
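The ordering and one-to-one correspondence described above can be sketched as follows. This is a minimal illustration in plain Python rather than Hive SQL; the record types `LogRecord` and `ResultRecord`, the `"user_id, page"` payload format and the `transform` function are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class LogRecord:
    ts: datetime  # timestamp of the log entry
    raw: str      # hypothetical "user_id, page" payload

@dataclass(frozen=True)
class ResultRecord:
    ts: datetime
    user_id: str
    page: str

def transform(log: LogRecord) -> ResultRecord:
    """Hypothetical transform: split the raw payload into named fields."""
    user_id, page = log.raw.split(",", 1)
    return ResultRecord(ts=log.ts, user_id=user_id.strip(), page=page.strip())

def offline_etl(log_data_set):
    """Process logs in timestamp order, producing one result per log."""
    return [transform(log) for log in sorted(log_data_set, key=lambda r: r.ts)]

logs = [
    LogRecord(datetime(2019, 12, 23, 10, 5, tzinfo=timezone.utc), "u2, /search"),
    LogRecord(datetime(2019, 12, 23, 9, 30, tzinfo=timezone.utc), "u1, /home"),
]
results = offline_etl(logs)
```

Each result keeps the timestamp of its source log, which is what later allows the result to be assigned to a time period.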

Step 300, importing the offline result data set into a Druid, wherein the Druid comprises at least one piece of real-time result data imported in advance, and the real-time result data is obtained by processing the log data through real-time ETL.

According to the Druid characteristic, in the process of importing the offline result data set into the Druid, an index is created for the offline result data set, which may specifically include the following steps as shown in fig. 3:

Step 310, fragmenting the offline result data set according to a preset time sequence to obtain one or more offline data segments, where the time sequence includes one or more time periods, and each offline data segment corresponds to one time period.

For example, a timestamp of log data corresponding to each offline result data is obtained first; then determining a time period corresponding to each offline result data according to the time stamp; and finally, forming an offline data segment by the offline result data corresponding to the same time segment, wherein the formed offline data segment corresponds to the time segment.

Step 320, creating an index for each offline data segment; specifically, an index bitmap may be created according to a time period corresponding to each offline data segment, where the index bitmap includes an index corresponding to each offline data segment.

Step 330, storing the offline data segment with the index into the Druid.
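Steps 310 to 330 can be sketched in plain Python as follows. This is an illustrative model only and does not use Druid or its Hadoop indexing: the hourly time period, the segment identifier format and the dictionary used as storage are assumptions.

```python
from collections import defaultdict
from datetime import datetime

def period_start(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its (assumed hourly) time period."""
    return ts.replace(minute=0, second=0, microsecond=0)

def shard_by_period(results):
    """Step 310: group (timestamp, row) pairs into one segment per time period."""
    segments = defaultdict(list)
    for ts, row in results:
        segments[period_start(ts)].append(row)
    return dict(segments)

def create_index(segments):
    """Step 320: map each time period to a (hypothetical) segment identifier."""
    return {start: f"offline_{start:%Y%m%dT%H}" for start in sorted(segments)}

def store_segments(storage, segments, index):
    """Step 330: store each segment under its indexed identifier."""
    for start, rows in segments.items():
        storage[index[start]] = rows
    return storage

results = [
    (datetime(2019, 12, 23, 9, 5), {"page": "/home"}),
    (datetime(2019, 12, 23, 9, 40), {"page": "/search"}),
    (datetime(2019, 12, 23, 10, 1), {"page": "/home"}),
]
segments = shard_by_period(results)
index = create_index(segments)
storage = store_segments({}, segments, index)
```

Here the two 09:xx rows land in one segment and the 10:01 row in another, and the index lets a segment be located from its time period alone.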

It should be noted that the Druid includes at least one piece of real-time result data imported in advance, and the real-time result data is log data obtained through real-time ETL processing.

Specifically, when log data are generated, the Flume data acquisition module 10 acquires the log data in real time, and transmits the acquired real-time log data into a real-time ETL processing framework through Kafka; carrying out real-time ETL processing on the log data through a real-time ETL processing framework to obtain real-time result data; and finally, transmitting the real-time result data into the Druid through Kafka. Wherein the real-time ETL processing framework can be a Spark Streaming based data processing framework.
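The shape of this real-time path (a message queue in, small batches through the ETL step, a message queue out) can be sketched without any streaming infrastructure. In the sketch below, `collections.deque` objects stand in for the two Kafka topics and a generator mimics micro-batch processing; no Kafka or Spark Streaming API is used, and the transform and batch size are hypothetical.

```python
from collections import deque

def micro_batches(topic: deque, batch_size: int):
    """Drain a topic in small batches, mimicking micro-batch stream processing."""
    while topic:
        yield [topic.popleft() for _ in range(min(batch_size, len(topic)))]

def realtime_etl(batch):
    """Hypothetical transform: drop blank lines and normalise case."""
    return [line.strip().lower() for line in batch if line.strip()]

def run_pipeline(raw_topic: deque, result_topic: deque, batch_size: int = 2):
    """Read raw logs, transform each batch, and publish the results."""
    for batch in micro_batches(raw_topic, batch_size):
        result_topic.extend(realtime_etl(batch))

raw_topic = deque(["GET /Home", "  ", "GET /Search", "POST /Login"])
result_topic = deque()
run_pipeline(raw_topic, result_topic)
```

After the run, `raw_topic` is empty and `result_topic` holds the three cleaned records, ready to be consumed by the store that ingests real-time results.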

Since, in the Druid, the ETL-processed data is stored in the form of data segments and each data segment corresponds to a time period, transmitting the real-time result data into the Druid through Kafka may include: acquiring a timestamp of the log data corresponding to the real-time result data; determining a time period corresponding to the real-time result data according to the timestamp; and storing the real-time result data into the designated real-time data segment in the Druid according to the time period corresponding to the real-time result data. To distinguish them from the data segments that store the offline data result, the data segments that store the real-time data result are collectively referred to as real-time data segments in this embodiment.
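Routing each arriving real-time result to the segment for its time period might look like the sketch below. The class `RealtimeSegments` is a hypothetical in-memory stand-in for the real-time data segments, and the hourly time period is an assumption.

```python
from datetime import datetime

class RealtimeSegments:
    """In-memory stand-in for real-time data segments, keyed by time period."""

    def __init__(self):
        self.segments = {}  # period start -> list of result rows

    @staticmethod
    def period_of(ts: datetime) -> datetime:
        """Determine the (assumed hourly) time period from a log timestamp."""
        return ts.replace(minute=0, second=0, microsecond=0)

    def append(self, ts: datetime, row: dict) -> datetime:
        """Store a row in the segment for its time period; return that period."""
        period = self.period_of(ts)
        self.segments.setdefault(period, []).append(row)
        return period

store = RealtimeSegments()
p1 = store.append(datetime(2019, 12, 23, 9, 59, 30), {"page": "/home"})
p2 = store.append(datetime(2019, 12, 23, 10, 0, 0), {"page": "/search"})
```

The two rows arrive thirty seconds apart but fall on either side of the hour boundary, so they are stored in different real-time data segments.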

Step 400, in the Druid, fusing the offline result data and the real-time result data.

Specifically, in the Druid, an offline data segment and a real-time data segment corresponding to the same time period are fused. Wherein the data fusion process includes, but is not limited to, overwriting the real-time data segment with the offline data segment.
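The overwrite form of fusion mentioned above can be sketched as a per-period merge in which an offline segment, when present, replaces the real-time segment for the same time period. The period keys and the `pv` values below are hypothetical sample data, and the function models only the overwrite behavior, not Druid's internal segment handling.

```python
def fuse(realtime_segments, offline_segments):
    """Fuse segments per time period; offline data overwrites real-time data.

    Returns a dict mapping each period to a (source, rows) pair so that the
    origin of the surviving data stays visible.
    """
    fused = {
        period: ("realtime", rows) for period, rows in realtime_segments.items()
    }
    for period, rows in offline_segments.items():
        fused[period] = ("offline", rows)
    return fused

# Hypothetical page-view totals per hourly period.
realtime = {"2019-12-23T09": [{"pv": 98}], "2019-12-23T10": [{"pv": 41}]}
offline = {"2019-12-23T09": [{"pv": 100}]}
fused = fuse(realtime, offline)
```

In this example the 09:00 period is served from offline data once the offline result has been imported, while the 10:00 period, for which no offline result exists yet, is still served from real-time data.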

As can be seen from the foregoing embodiments, the embodiments of the present application provide a data processing method that comprises: acquiring a log data set from a distributed file system; performing offline ETL processing on the log data set to obtain an offline result data set; importing the offline result data set into Druid, wherein Druid contains at least one piece of real-time result data imported in advance, the real-time result data being obtained by real-time ETL (extract-transform-load) processing of the log data; and, in Druid, fusing the offline result data with the real-time result data. By using Druid to fuse offline data and real-time data, the method inherits the advantages of Druid and can also support constantly changing data analysis requirements. For example, when offline data is analyzed, the results for real-time data can be viewed directly alongside it, and when algorithm models are trained, requirements that involve both offline data and real-time data can be supported.

According to the data processing method provided by the foregoing embodiment, an embodiment of the present application further provides a data processing apparatus, as shown in fig. 4, the apparatus may include:

an obtaining module 410, configured to obtain a log data set from a distributed file system, where the log data set includes at least one log data;

an offline ETL processing module 420, configured to perform offline ETL processing on the log data set to obtain an offline result data set, where the offline result data set includes at least one offline result data, and each offline result data corresponds to one log data;

an offline data importing module 430, configured to import the offline result data set into a Druid, where the Druid includes at least one piece of real-time result data imported in advance, and the real-time result data is obtained by performing real-time ETL processing on the log data;

a fusion module 440, configured to fuse, in the Druid, the offline result data and the real-time result data.

In some embodiments, the offline data import module includes:

In some embodiments, the data fragmentation unit is specifically configured to:

In some embodiments, the offline ETL processing module comprises:

In some embodiments, the apparatus further comprises:

In some embodiments, the second Kafka module is specifically configured to:

acquiring a time stamp of log data corresponding to the real-time result data;

In some embodiments, the fusion module is specifically configured to fuse the offline data segment and the real-time data segment corresponding to the same time period.

Fig. 5 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application. As shown in fig. 5, the electronic device may include: a memory 501 for storing program instructions; a processor 502 for calling and executing the program instructions in the memory to implement the above-mentioned data processing method.

In this embodiment, the processor and the memory may be connected by a bus or other means. The processor may be a general-purpose processor, such as a central processing unit, a digital signal processor, an application specific integrated circuit, or one or more integrated circuits configured to implement embodiments of the present invention. The memory may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid state disk.

In a specific implementation, the present invention further provides a computer storage medium, where the computer storage medium may store a computer program, and when at least one processor of a data processing apparatus executes the computer program, the data processing apparatus executes some or all of the steps in the embodiments of the data processing method of the present application. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a Random Access Memory (RAM).

Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented as software plus a required general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.

The same and similar parts in the various embodiments in this specification may be referred to each other. In particular, as for the device, the electronic apparatus and the storage medium embodiments, since they are substantially similar to the method embodiments, the description is simple, and the relevant points can be referred to the description in the method embodiments.

The above-described embodiments of the present invention should not be construed as limiting the scope of the present invention.

Claims

1. A method of data processing, the method comprising:

and in the Druid, fusing the offline result data and the real-time result data.

2. The method of claim 1, wherein importing the offline result data set into a Druid comprises:

creating an index for each of the offline data segments;

storing the offline data segment with the index into the Druid.

3. The method of claim 2, wherein the fragmenting the offline result dataset according to the preset time sequence to obtain one or more offline data segments comprises:

4. The method of claim 1, wherein the offline ETL processing the log data set to obtain an offline result data set comprises:

5. The method of claim 1, further comprising:

importing the real-time result data into the Druid through Kafka.

6. The method of claim 5, wherein said importing the real-time result data into the Druid by Kafka comprises:

acquiring a time stamp of log data corresponding to the real-time result data;

7. The method of any one of claims 1-6, wherein fusing offline result data with the real-time result data comprises:

8. A data processing apparatus, characterized in that the apparatus comprises:

9. The apparatus of claim 8, wherein the offline data import module comprises:

10. The apparatus of claim 9, wherein the data fragmentation unit is specifically configured to:

11. The apparatus of claim 8, wherein the offline ETL processing module comprises:

12. The apparatus of claim 8, wherein the apparatus further comprises:

13. The apparatus of claim 12, wherein the second Kafka module is specifically configured to:

acquiring a time stamp of log data corresponding to the real-time result data;

14. The apparatus according to any one of claims 8 to 13, wherein the fusion module is specifically configured to fuse an offline data segment and a real-time data segment corresponding to the same time segment.

15. An electronic device, comprising:

a memory for storing program instructions;

a processor for calling and executing program instructions in said memory to implement the method of any of claims 1-7.

16. A storage medium having a computer program stored thereon, wherein the apparatus performs the method of any of claims 1-7 when the computer program is executed by at least one processor of the apparatus of any of claims 8-14.