WO2017096941A1 - Background refreshing method based on spark-sql big data processing platform - Google Patents


Info

Publication number
WO2017096941A1
WO2017096941A1 (PCT/CN2016/095361; CN2016095361W)
Authority
WO
WIPO (PCT)
Prior art keywords
data
spark
sql
big data
query
Prior art date
Application number
PCT/CN2016/095361
Other languages
French (fr)
Chinese (zh)
Inventor
王成
冯骏
Original Assignee
深圳市华讯方舟软件技术有限公司
华讯方舟科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市华讯方舟软件技术有限公司 and 华讯方舟科技有限公司
Publication of WO2017096941A1 publication Critical patent/WO2017096941A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/14: Details of searching files based on file metadata
    • G06F16/148: File search processing
    • G06F16/17: Details of further file system functions
    • G06F16/1737: Details of further file system functions for reducing power consumption or coping with limited storage space, e.g. in mobile devices
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems

Definitions

  • the invention relates to a background refreshing method of a big data processing platform, in particular to a background refreshing method based on a Spark-SQL big data processing platform.
  • the data import function of the Spark big data processing platform is implemented by Spark-SQL, that is, by Hive on Spark.
  • Hive queries can be submitted to the Spark cluster as Spark tasks for computation. Compared with Impala and Shark, Hive offers more comprehensive SQL syntax support and a broader user base.
  • Data import usually involves key points such as import content, storage format, and import speed:
  • the imported content may be a formatted or unformatted text file in which each record and each field is separated by a specific delimiter or file format.
  • the data content may be transmitted as a file or as a data stream, and its size is uncertain.
  • the format of the stored data can be either text format or compressed format to reduce disk usage.
  • the compression formats supported by Spark-SQL include zip, snappy, and parquet.
  • importing data can be partitioned based on content, and data can be stored in partitions to speed up queries.
  • the prior-art Spark-SQL data import and data refresh scheme (with external data files in text format) is as follows:
  • for tables compressed and stored in PARQUET format, the first query scans the directory structure on the distributed file system HDFS and updates the metastore, so in the context of big data the first query takes a long time; subsequent queries no longer scan the HDFS directory structure but directly reuse the scan result of the first query, shortening the query time.
  • the advantage of this mechanism is that non-first queries are faster, but there is a drawback that cannot be ignored: after the first scan, no direct modification of the table space on HDFS is recognized.
  • HDFS does not support in-place modification in principle.
  • any insert and delete operations can only be performed through Spark-SQL and executed in Spark.
  • both reads and writes consume a certain amount of system resources, which indirectly slows down both data import and queries.
  • if a data file of the table space is lost, all Spark queries on the table fail with a file-not-found error; the only remedy is to restart the Spark-SQL process and perform the first query and HDFS scan again.
  • the first Spark-SQL query scans the entire table space of the queried table on the HDFS distributed file system and saves a snapshot of it; in the context of big data this first query takes a very long time and cannot meet the time requirements, and any change to the table after the scan is not recognized by Spark-SQL.
  • Scala is a purely object-oriented programming language whose Scalac compiler compiles source files into Java class files (bytecode running on the JVM); query and import programs written in it are relatively inefficient.
  • in the Standalone mode of the Spark big data processing platform, resources are wasted on the control node.
  • the Spark big data processing platform is generally deployed as a cluster composed of several machines.
  • while the cluster is running, external data import and real-time queries usually proceed simultaneously, so machine resources are allocated to the data import program and the data query program at the same time.
  • the two inevitably conflict to some degree over I/O, CPU time, and memory allocation, and in severe cases the performance of both degrades sharply.
  • the technical problem to be solved by the present invention is to avoid the step of scanning the distributed file system HDFS for the first query in the context of big data, and greatly shorten the first query time of the Spark-SQL big data processing platform.
  • the background refresh method of the present invention, based on the Spark-SQL big data processing platform, creates a refresh process in the entry function of Spark-SQL and sets up a timed refresh mechanism that periodically scans the file directory structure of the specified table spaces on the distributed file system HDFS.
  • before the first refresh of the refresh process completes, the directory structure information of the specified table spaces is not yet in memory.
  • in that case the original first-query policy is used: the file directory structure of the specified table space on HDFS is scanned before the query.
  • once the refresh process has completed its first refresh, the directory structure information of the specified table spaces on HDFS is kept in memory; when Spark-SQL receives a query it no longer scans HDFS but directly uses the in-memory directory structure information of the table space, shortening the query time.
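The refresh-with-fallback behaviour just described can be sketched in a few lines. The following Python sketch uses a local directory tree as a stand-in for the HDFS table space; all class and method names are illustrative assumptions, not taken from the patent's Scala implementation.

```python
import os
import threading

class TableSpaceCache:
    """Illustrative sketch of the refresh mechanism described above.

    A background timer periodically scans a directory tree (standing in
    for the HDFS table space) and stores the listing in memory. A query
    uses the cached listing when one exists; before the first refresh
    completes it falls back to scanning directly, mirroring the
    'original first-query policy'.
    """

    def __init__(self, root, interval_s=5.0):
        self.root = root
        self.interval_s = interval_s
        self._snapshot = None          # None until the first refresh finishes
        self._lock = threading.Lock()

    def _scan(self):
        # Walk the tree and collect every file path (the 'directory structure').
        paths = []
        for dirpath, _dirnames, filenames in os.walk(self.root):
            for name in filenames:
                paths.append(os.path.join(dirpath, name))
        return sorted(paths)

    def refresh_once(self):
        snapshot = self._scan()        # build the new result fully first
        with self._lock:
            self._snapshot = snapshot  # then overwrite; never clear first

    def start(self):
        # Timed refresh loop; a real system would cancel this on shutdown.
        def loop():
            self.refresh_once()
            t = threading.Timer(self.interval_s, loop)
            t.daemon = True
            t.start()
        loop()

    def query_paths(self):
        with self._lock:
            snapshot = self._snapshot
        if snapshot is None:
            # First refresh not done yet: scan on the query path, like the
            # unmodified first-query behaviour.
            return self._scan()
        return snapshot
```

Note how a query between refreshes sees the cached listing, so files added after a refresh only become visible at the next refresh, exactly the staleness trade-off the patent's mechanism accepts.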
  • the refresh interval is one tenth to one half of the time taken to refresh once, or the refresh interval is 5 seconds to 10 seconds, and the refresh interval size may be customized according to product or user requirements.
  • the external data file is compressed and stored, and the compression format is ZIP, BZ2, SNAPPY or PARQUET.
  • creating the temporary table means creating, according to the data model, a temporary table for storing text-format data; this temporary table serves as the data source of the final data table;
  • creating the big data table with partition information means that, in the context of big data, a table with partition information improves query speed; in practice the data may be partitioned by month, week, day, or hour, by a substring of a string, by integer interval, or by a combination of these, further subdividing the data to improve query speed;
  • importing the data files into the temporary table means executing, according to the data file format, a Spark-SQL statement or a Hadoop-supported LOAD statement that imports the text-format data directly into the temporary table.
  • processing the temporary-table data and storing it in the big data table with partition information means executing a Spark-SQL statement that specifies the partition format and the storage format: the data in the temporary table is analyzed and processed according to the specified partition format and then written to the final big data table in the specified (compressed) storage format; in this step, Spark first divides the data in the temporary table space into RDD data blocks according to the configuration, each RDD data block is assigned to a task for parallel processing, and the internal transformation mechanism of Spark-SQL then converts the partition information in the SQL statement into specific operations on the RDD data blocks, so that the data is partitioned on the basis of the RDD blocks, compressed, and written to the distributed file system HDFS.
  • the background refresh method based on the Spark-SQL big data processing platform has the following beneficial effects compared with the prior art.
  • the first-query time of the Spark-SQL big data processing platform is greatly shortened.
  • taking 20 TB of data as an example, a big data table is divided into 25 first-level partitions by hour (0 to 23 plus a default partition) and into 1001 second-level partitions by the first three digits of the mobile phone number (000 to 999 plus a default partition), and is compressed and stored in PARQUET format.
  • for a query of the total number of records in a given number segment over a given time period, the original first-query time is about 20 minutes; the background refresh method optimized by the present invention shortens the first query to about 45 seconds.
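The two-level partitioning scheme in this example (hour plus phone-number prefix, each with a default partition) can be illustrated by deriving a record's partition keys. This is a hedged Python sketch: the function name, its arguments, and its fallback handling are assumptions for illustration, not the patent's code.

```python
def partition_keys(record_hour, phone):
    """Derive the two partition keys from the example above: 25 hour
    partitions (0-23 plus a default) and 1001 prefix partitions
    (000-999 plus a default). Illustrative only."""
    # First-level key: the hour, or the default partition if out of range.
    if isinstance(record_hour, int) and 0 <= record_hour <= 23:
        hour_part = str(record_hour)
    else:
        hour_part = "default"
    # Second-level key: first three digits of the phone number, or default.
    prefix = phone[:3] if isinstance(phone, str) else ""
    if len(prefix) == 3 and prefix.isdigit():
        prefix_part = prefix
    else:
        prefix_part = "default"
    return hour_part, prefix_part
```

A query restricted to one hour and one number segment then only needs to touch one of the 25 x 1001 partitions, which is what makes the partitioned layout fast to query.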
  • the native Spark data import program is the data import statement of Spark-SQL.
  • when used, the data import program occupies some or all of the computing resources of the Spark big data processing platform, which greatly affects query speed and efficiency.
  • using a more efficient data importer that processes data independently of Spark raises overall system utilization.
  • the background refresh adopts an independent process, which does not occupy the original Spark's system resources.
  • the common compression formats in Spark are ZIP, BZ2, SNAPPY, and PARQUET.
  • the PARQUET format is supported by all projects in the Hadoop ecosystem; it provides a columnar data representation with efficient compression and is independent of data processing framework, data model, and programming language.
  • the PARQUET format is therefore a preferred big data storage format.
  • the Spark big data processing platform has a certain limitation on queries of PARQUET-format data: for large data tables stored in PARQUET format, Spark-SQL scans the directory structure of the table on HDFS only when the table is first queried, and the scan is not performed again, so directory structure added or deleted after the first query cannot be recognized.
  • the background refreshing technique of the present invention can effectively solve this problem.
  • FIG. 1 is a schematic diagram of an overall framework of a Spark big data processing platform in the prior art.
  • FIG. 2 is a flow chart of a background refresh method based on the Spark-SQL big data processing platform of the present invention.
  • Figure 3 is a flow chart of the modified data query.
  • the background refresh method based on the Spark-SQL big data processing platform in this embodiment creates a refresh process in the Spark-SQL entry function and sets a timed refresh mechanism that periodically scans the file directory structure of the specified table spaces on the distributed file system HDFS.
  • as a preference, the refresh result is stored in memory to serve query requests for the table data.
  • before the first refresh of the refresh process completes, the directory structure information of the specified table spaces is not yet in memory.
  • the first-query policy is to scan the file directory structure of the specified table space on the distributed file system HDFS before querying; once the refresh process has completed its first refresh, the directory structure information of the specified table spaces on HDFS is kept in memory, and when Spark-SQL receives a query it no longer scans HDFS but directly uses the in-memory directory structure information of the table space, shortening the query time.
  • the refresh interval is one tenth to one half of the time taken to refresh once, or the refresh interval is 5 seconds to 10 seconds, and the refresh interval size may be customized according to product or user requirements.
  • the external data file is compressed and stored, and the compression format is ZIP, BZ2, SNAPPY or PARQUET.
  • creating the temporary table means creating, according to the data model, a temporary table for storing text-format data; this temporary table serves as the data source of the final data table;
  • creating the big data table with partition information means that, in the context of big data, a table with partition information improves query speed; in practice the data may be partitioned by month, week, day, or hour, by a substring of a string, by integer interval, or by a combination of these, further subdividing the data to improve query speed;
  • importing the text-format data files into the temporary table means executing, according to the data file format, a Spark-SQL statement or a Hadoop-supported LOAD statement that imports the data directly into the temporary table.
  • processing the temporary-table data and storing it in the big data table with partition information means executing a Spark-SQL statement that specifies the partition format and the storage format: the data in the temporary table is analyzed and processed according to the specified partition format and then written to the final big data table in the specified (compressed) storage format; in this step, Spark first divides the data in the temporary table space into RDD data blocks according to the configuration, each RDD data block is assigned to a task for parallel processing, and the internal transformation mechanism of Spark-SQL then converts the partition information in the SQL statement into specific operations on the RDD data blocks, so that the data is partitioned on the basis of the RDD blocks, compressed, and written to the distributed file system HDFS.
  • FIG. 2 shows the background refresh flow.
  • Spark-SQL is programmed in Scala; a background refresh process is added to the Spark-SQL entry function that periodically scans the specified table-space directory structure on the distributed file system HDFS and saves it to memory for use by data queries.
  • after Spark-SQL starts, it first reads the hive-site.xml configuration file, parses out the configuration items related to the background refresh process, and sets up the timed refresh mechanism so that timed refreshes are triggered by messages.
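The configuration-reading step can be illustrated as follows. Note that the property names below are hypothetical: the patent states that configuration items are added to hive-site.xml (enable flag, refresh interval, table-space set) but does not give their exact keys, so the keys and the parser are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical property names; the patent does not spell out the keys.
SAMPLE_HIVE_SITE = """\
<configuration>
  <property><name>background.refresh.enabled</name><value>true</value></property>
  <property><name>background.refresh.interval.seconds</name><value>5</value></property>
  <property><name>background.refresh.tables</name><value>db.calls,db.sms</value></property>
</configuration>
"""

def parse_refresh_config(xml_text):
    """Read the refresh-related items out of a hive-site.xml style file."""
    props = {}
    for prop in ET.fromstring(xml_text).findall("property"):
        props[prop.findtext("name")] = prop.findtext("value")
    return {
        # Whether the background refresh process is enabled at all.
        "enabled": props.get("background.refresh.enabled") == "true",
        # Refresh interval in seconds (defaulting to 5 if absent).
        "interval_s": int(props.get("background.refresh.interval.seconds", "5")),
        # The set of big data table spaces to refresh.
        "tables": [t for t in props.get("background.refresh.tables", "").split(",") if t],
    }
```

hive-site.xml already uses this `<property><name>…</name><value>…</value></property>` layout for Hive settings, which is why adding the refresh items there keeps them alongside the platform's other configuration.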
  • Spark-SQL creates a query plan for the big data table to be refreshed, locates the space in memory that stores the table information according to the query plan, and calls the refresh method in its attributes to scan the table directory structure on the distributed file system HDFS; the refresh method overwrites the previous scan result, and the original result is not cleared before being overwritten, ensuring that data remains available if a query request arrives during a refresh.
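The overwrite-without-clearing rule amounts to an atomic swap: the new scan result is built completely before it replaces the old one, so a concurrent query always observes a complete snapshot, never an emptied cache. The following is an illustrative Python sketch of that rule, not the patent's implementation.

```python
import threading

class SnapshotHolder:
    """Holds the latest directory-structure snapshot for one table.

    publish() swaps in a fully built snapshot under a lock; the old
    snapshot stays readable right up to the single assignment, which is
    what keeps queries answerable during a refresh.
    """

    def __init__(self):
        self._snapshot = {}
        self._lock = threading.Lock()

    def publish(self, new_snapshot):
        # The caller builds new_snapshot entirely before this call, so
        # readers never see a half-filled or cleared structure.
        with self._lock:
            self._snapshot = new_snapshot

    def read(self):
        with self._lock:
            return self._snapshot
```

The design choice to swap rather than clear-then-refill is what the flow chart's "original result is not emptied before overwriting" step describes.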
  • FIG. 3 is a flow chart of the modified data query.
  • the configuration item is located in the hive-site.xml file in the conf folder of the Spark installation directory.
  • the refresh process supports all data compression formats supported by Spark, such as PARQUET, SNAPPY, and ZIP.


Abstract

Disclosed in the present invention is a background refresh method based on a Spark-SQL big data processing platform. A new process is created, and a timed refresh mechanism is set, in an entry function of Spark-SQL, and the specified table-space file directory structure of the Hadoop distributed file system (HDFS) is periodically scanned. Configuration items are added to hive-site.xml under the conf folder of the Spark installation directory, so that whether the refresh process is enabled, the refresh interval, and the set of big data table spaces to be refreshed can be configured in a customized manner. In the present invention, in the context of big data, the first-query time of the Spark-SQL big data processing platform is greatly reduced. Taking 20 TB of data as an example, a big data table is divided into 25 first-level partitions by hour, into 1001 second-level partitions by the first three digits of a mobile phone number, and is compressed and stored in PARQUET format; for a query of the total amount of data in a certain number segment over a certain period of time, the original first-query time is approximately 20 minutes, and with the background refresh method optimized by the present invention, the first-query time is reduced to approximately 45 seconds.

Description

Background refresh method based on Spark-SQL big data processing platform

Technical Field

The invention relates to a background refresh method for a big data processing platform, and in particular to a background refresh method based on the Spark-SQL big data processing platform.

Background Art
With the development of the Internet, the mobile Internet, and the Internet of Things, we have entered an era of big data, and processing and analyzing this data has become an important and urgent need.

As technology has evolved, big data processing platforms have progressed from the original Hadoop and HBase to the later SQL-based Hive, Shark, and similar systems, while key-value platforms such as HBase have also grown. Today the rise of the SQL-on-Hadoop concept has driven the growth of the Spark ecosystem, which has become one of the most popular, most widely used, and most efficient big data processing platforms.

Whichever big data processing platform is chosen, the purpose is the same: to process and analyze big data and to extract useful information from it for people to use. At the most basic level, Map-Reduce-based Hadoop, key-value-based HBase, and RDD-based Spark all share the same overall flow of three main steps: data import, data analysis and processing, and presentation of results. The two most important parts are data import and data analysis and processing. The import speed determines how quickly the whole system can process data in real time and therefore affects overall system performance, while import and analysis together form the core of data processing.

As shown in Figure 1, the overall framework of the Spark big data processing platform is as follows. The data import function is implemented by Spark-SQL, that is, by Hive on Spark: Hive queries can be submitted to the Spark cluster as Spark tasks for computation. Compared with Impala and Shark, Hive offers more comprehensive SQL syntax support and has a broader user base. Data import usually involves several key points: the content to be imported, the storage format, and the import speed.
1. Import content

The imported content may be a formatted or unformatted text file in which each record and each field is separated by a specific delimiter or file format. The data may arrive as files or as a data stream, and its size is uncertain.

2. Storage format

The data may be stored in text format or in a compressed format to reduce disk usage; the compression formats currently supported by Spark-SQL include zip, snappy, and parquet.

In the context of big data, imported data can be partitioned by content and stored by partition, which speeds up queries.

3. Import speed

In the context of big data, data is generated continuously, which places high demands on import speed. Depending on the actual situation, the import speed must not fall below x records per second or x MB per second, and data loss, import errors, and data backlogs must not occur.
In the prior art, the Spark-SQL data import and data refresh scheme (with external data files in text format) is as follows.

When a query is issued, conditions can be added in the conditional clause to limit the range of data queried. In the Spark big data processing platform, different storage formats have different refresh mechanisms, mainly the following two:

i) If the data is ultimately stored as text (TEXTFILE) or as optimized row columnar (ORC) with ZIP or SNAPPY compression, every query of a big data table first scans the directory structure on the distributed file system HDFS and updates the metastore, so all updates to the table space on HDFS, including inserts, modifications, and deletions, are recognized. When there are many directories and data files, each HDFS scan takes a long time, and that time grows as data accumulates. The scan time is included in the query time: only after the scan does Spark divide the work into tasks according to the scan result and submit them to the executors, so the scan time directly affects the query time.

ii) If the data is ultimately compressed and stored in PARQUET format, the first query of a table scans the directory structure on HDFS and updates the metastore, so in the context of big data the first query takes a very long time; subsequent queries no longer scan the HDFS directory structure but reuse the scan result of the first query, shortening the query time. The advantage of this mechanism is that non-first queries are fast, but there is a drawback that cannot be ignored: after the first scan, no direct modification of the table space on HDFS is recognized, and any insert or delete operation (HDFS does not support in-place modification in principle) can only be performed through Spark-SQL. When Spark executor resources are limited, both reads and writes consume system resources, which indirectly slows down both data import and queries. In addition, if a data file of the table space on HDFS is lost, every query of that table on Spark fails with a file-not-found error, and the only remedy is to restart the Spark-SQL process and perform the first query and HDFS scan again.
In summary, the problems in the prior art are:

1. The first Spark-SQL query scans the entire table space of the queried table on the HDFS distributed file system and saves a snapshot of it. In the context of big data, this first query takes a very long time and cannot meet the time requirements, and Spark-SQL does not recognize any modification made to the table after the scan.

2. Prior-art data import programs based on Hive or Spark-SQL are written in Scala and run on the JVM, and suffer from low efficiency, slow speed, and frequent memory overflows. Scala is a purely object-oriented programming language whose Scalac compiler compiles source files into Java class files (bytecode running on the JVM), so query and import efficiency is relatively low.

3. In the Standalone mode of the Spark big data processing platform, resources are wasted on the control node. In the prior art, the platform is generally deployed as a cluster of several machines. While the cluster is running, external data import and real-time queries usually proceed simultaneously, so machine resources are allocated to both the data import program and the data query program at the same time. The two inevitably conflict to some degree over I/O, CPU time, and memory allocation, and in severe cases the performance of both degrades sharply.
Summary of the Invention

The technical problem to be solved by the present invention is, in the context of big data, to avoid the first-query step of scanning the distributed file system HDFS and thereby greatly shorten the first-query time of the Spark-SQL big data processing platform.

To solve this problem, the background refresh method of the present invention, based on the Spark-SQL big data processing platform, creates a refresh process in the entry function of Spark-SQL and sets up a timed refresh mechanism that periodically scans the file directory structure of the specified table spaces on the distributed file system HDFS.

Configuration items are added to hive-site.xml in the conf folder of the Spark installation directory, so that whether the background refresh process is enabled, the refresh interval, and the set of big data table spaces to refresh can all be customized.

If the background refresh process is enabled, then before its first refresh completes, the directory structure information of the specified table spaces is not yet in memory; if Spark-SQL receives a query statement at this point, it follows the original first-query policy and scans the file directory structure of the specified table space on HDFS before querying. Once the first refresh has completed, the directory structure information of the specified table spaces on HDFS is kept in memory; when Spark-SQL receives a query statement, it no longer scans HDFS but directly uses the in-memory directory structure information of the table space, shortening the query time.
所述刷新间隔是刷新一次所用时间的十分之一至二分之一,或者,所述刷新间隔是5秒至10秒,可以根据产品或者用户需求自定义所述刷新间隔大小。The refresh interval is one tenth to one half of the time taken to refresh once, or the refresh interval is 5 seconds to 10 seconds, and the refresh interval size may be customized according to product or user requirements.
将外部数据文件进行压缩存储,所述压缩格式为ZIP、BZ2、SNAPPY或PARQUET。The external data file is compressed and stored, and the compression format is ZIP, BZ2, SNAPPY or PARQUET.
采用Scala编程,修改Spark源码中关于Spark-SQL执行查询语句的策略。Using Scala programming, modify the Spark source code for Spark-SQL to execute query statements.
Before refreshing, the following steps are performed in order: creating a temporary table, creating a big data table with partition information, importing text-format data files into the temporary table, and processing the temporary table data and storing it in the big data table with partition information.
Creating the temporary table means: creating, according to the data model, a temporary table for storing text-format data; this temporary table serves as the data source of the final data table.
Creating the big data table with partition information means: in a big data context, creating a big data table with partition information can increase data query speed. In practical applications, the data is partitioned by month, week, day, or hour; or by a substring of a string; or by integer intervals; or by a combination of these, further subdividing the data to increase query speed.
Importing the text-format data files into the temporary table means: according to the data file format, executing a Spark-SQL statement or a Hadoop-supported Load statement to import the text-format data directly into the temporary table.
Processing the temporary table data and storing it in the big data table with partition information means: executing a Spark-SQL statement that specifies the partition format and the storage format, analyzing and processing the data in the temporary table according to the specified partition format, and then writing the data into the final big data table in the specified storage format (compression format). In this step, Spark first divides the data in the temporary table space into RDD data blocks according to the configuration, and each RDD data block is assigned to a designated task for parallel processing; then, through Spark-SQL's internal transformation mechanism, the partition information in the SQL statement is converted into specific operations on the RDD data blocks, so that the data is partitioned on the basis of the RDD data blocks, and the partitioned data is compressed and written into the distributed file system HDFS.
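The temporary-table-to-partitioned-table flow described above can be sketched as Spark-SQL statements. This is only an illustrative sketch: the table and column names (`tmp_records`, `big_records`, `phone`, `ts`) are invented for the example, the `ts` column is assumed to parse as a timestamp, and the partition scheme mirrors the hour/number-segment example given later in the description rather than any statement disclosed verbatim in the patent.

```sql
-- Hypothetical schema; names are illustrative only.
CREATE TABLE tmp_records (phone STRING, ts STRING, payload STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Final table, partitioned by hour and by the first 3 digits of the
-- phone number, stored compressed in the PARQUET format.
CREATE TABLE big_records (phone STRING, ts STRING, payload STRING)
  PARTITIONED BY (hr INT, seg STRING)
  STORED AS PARQUET;

-- Import the text-format data file directly into the temporary table.
LOAD DATA INPATH '/data/in/records.csv' INTO TABLE tmp_records;

-- Dynamic partition inserts typically require this Hive setting.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Analyze and partition the temporary data, writing it into the final
-- table in the specified storage (compression) format.
INSERT INTO TABLE big_records PARTITION (hr, seg)
SELECT phone, ts, payload,
       hour(ts)               AS hr,
       substring(phone, 1, 3) AS seg
FROM tmp_records;
```

Internally, Spark parallelizes the `INSERT ... SELECT` over RDD blocks and turns the `PARTITION (hr, seg)` clause into per-block partitioning operations, as described above.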
Compared with the prior art, the background refresh method based on the Spark-SQL big data processing platform of the present invention has the following beneficial effects.
1) In a big data context, the first-query time of the Spark-SQL big data processing platform is greatly shortened. Taking 20 TB of data as an example, the big data table is divided into 25 first-level partitions by hour (hours 0 to 23 plus one default partition) and 1001 second-level partitions by the first 3 digits of the mobile phone number (000 to 999 plus one default partition), and is stored compressed in the PARQUET format. For a query counting all records of a given number segment within a given time period, the first query originally took about 20 minutes; with the background refresh method optimized by the present invention, the first query takes only about 45 seconds.
2) While a more efficient and faster data import program is used, newly added files in the HDFS distributed file system are identified and recorded in the metadata for serving user query requests. Spark-SQL's original data import method runs at about 20,000 records per second; a more efficient import program that writes data directly to HDFS can raise the import speed to 200,000 records per second or higher (depending on the degree of concurrency). Although such a program bypasses Spark and writes new files directly to HDFS, the background refresh method proposed by the present invention can identify all newly added files in the specified table space and make them available for querying, without restarting the Spark-SQL service and without increasing query time.
3) The system resource utilization of the control nodes of the Spark big data processing platform is improved. The native Spark data import program is the Spark-SQL data import statement; while it runs, it occupies some or even all of the computing resources of the Spark big data processing platform, which greatly affects the speed and efficiency of data queries. Using a more efficient data import program that processes data independently of Spark yields higher system utilization. Meanwhile, the background refresh runs as an independent process and does not occupy Spark's own system resources.
4) In a big data context, disk space is also a bottleneck for system availability, so it is necessary to store external data files in compressed form. Common compression formats in Spark are ZIP, BZ2, SNAPPY, and PARQUET. The PARQUET format is supported by all projects in the Hadoop ecosystem, provides an efficiently compressed columnar data representation, and is independent of the data processing framework, data model, and programming language, so PARQUET can be the preferred big data storage format. However, the Spark big data processing platform has a limitation when querying data in the PARQUET format: for a big data table stored in the PARQUET format, Spark-SQL scans the directory structure of that table on HDFS only at the first query and never scans it again afterwards, so it cannot recognize directory structures added or deleted after the first query. The background refresh technique of the present invention effectively solves this problem.
5) Using Scala programming to modify the strategy in the Spark source code by which Spark-SQL executes query statements can greatly improve programming efficiency.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of the overall framework of a Spark big data processing platform in the prior art.
FIG. 2 is a flowchart of the background refresh method based on the Spark-SQL big data processing platform of the present invention.
FIG. 3 is a flowchart of the modified data query.
DETAILED DESCRIPTION
As shown in FIG. 2 and FIG. 3, the background refresh method based on the Spark-SQL big data processing platform of this embodiment creates a refresh process in the entry function of Spark-SQL and sets up a timed refresh mechanism that periodically scans the file directory structure of the specified table spaces on the distributed file system HDFS; preferably, the refresh result is kept in memory to support query requests against the data of those tables.
By adding configuration items to hive-site.xml under the conf folder of the Spark installation directory, one can customize whether the background refresh process is enabled, the refresh interval, and the set of big data table spaces to be refreshed.
If the refresh process is enabled, then before its first refresh completes, the directory structure information of the specified table spaces is not yet in memory; if Spark-SQL receives a query statement at this time, it adopts the original first-query strategy and scans the file directory structure of the specified table spaces on the distributed file system HDFS before querying. Once the refresh process has completed its first refresh, the directory structure information of the specified table spaces on HDFS is kept in memory; when Spark-SQL then receives a query statement, it no longer scans HDFS but directly uses the in-memory directory structure information of the table space, thereby shortening query time.
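The two-phase query policy just described can be modeled in a few lines of Scala, the patent's implementation language. This is a minimal stand-alone sketch, not the actual Spark source modification; the file paths and the `scanFileSystem` stand-in are invented for illustration:

```scala
import java.util.concurrent.atomic.AtomicReference

object QueryPolicySketch {
  // In-memory cache of the table space's directory structure
  // (None until the first background refresh completes).
  val cache = new AtomicReference[Option[List[String]]](None)

  // Stand-in for a real HDFS directory scan.
  def scanFileSystem(): List[String] =
    List("/table/hr=0/f1.parquet", "/table/hr=1/f2.parquet")

  // Query path: use the cache when populated, otherwise fall back
  // to the original strategy of scanning before the query.
  def filesForQuery(): List[String] = cache.get() match {
    case Some(files) => files             // first refresh done: no scan
    case None        => scanFileSystem()  // before first refresh: scan
  }

  def main(args: Array[String]): Unit = {
    println(filesForQuery().size) // before refresh: scans directly
    cache.set(Some(scanFileSystem())) // background refresh completes
    println(filesForQuery().size) // after refresh: served from memory
  }
}
```

The point of the sketch is that the query path itself never blocks on a scan once the cache is populated; keeping the result in an atomic reference lets the refresh process publish new results without coordinating with readers.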
The refresh interval is one tenth to one half of the time taken by a single refresh; alternatively, the refresh interval is 5 to 10 seconds. The refresh interval can be customized according to product or user requirements.
External data files are stored in compressed form; the compression format is ZIP, BZ2, SNAPPY, or PARQUET.
Scala programming is used to modify the strategy in the Spark source code by which Spark-SQL executes query statements.
Before refreshing, the following steps are performed in order: creating a temporary table, creating a big data table with partition information, importing text-format data files into the temporary table, and processing the temporary table data and storing it in the big data table with partition information.
Creating the temporary table means: creating, according to the data model, a temporary table for storing text-format data; this temporary table serves as the data source of the final data table.
Creating the big data table with partition information means: in a big data context, creating a big data table with partition information can increase data query speed. In practical applications, the data is partitioned by month, week, day, or hour; or by a substring of a string; or by integer intervals; or by a combination of these, further subdividing the data to increase query speed.
Importing the text-format data files into the temporary table means: according to the data file format, executing a Spark-SQL statement or a Hadoop-supported Load statement to import the data directly into the temporary table.
Processing the temporary table data and storing it in the big data table with partition information means: executing a Spark-SQL statement that specifies the partition format and the storage format, analyzing and processing the data in the temporary table according to the specified partition format, and then writing the data into the final big data table in the specified storage format (compression format). In this step, Spark first divides the data in the temporary table space into RDD data blocks according to the configuration, and each RDD data block is assigned to a designated task for parallel processing; then, through Spark-SQL's internal transformation mechanism, the partition information in the SQL statement is converted into specific operations on the RDD data blocks, so that the data is partitioned on the basis of the RDD data blocks, and the partitioned data is compressed and written into the distributed file system HDFS.
FIG. 2 shows the background refresh flowchart.
1) Scala is used to add a background refresh process to the entry function of Spark-SQL; the process periodically scans the specified table-space directory structures on the distributed file system HDFS and saves the results in memory for use by data queries. After Spark-SQL starts, it first reads the hive-site.xml configuration file, parses the configuration items related to the background refresh process, and sets up the timed refresh mechanism, triggering refreshes by messages. On each refresh, Spark-SQL creates a query plan for the big data table to be refreshed, locates, according to the query plan, the memory region that stores that table's information, calls the refresh method among its attributes, and scans the table's directory structure on the distributed file system HDFS. The refresh method overwrites the previous scan result, and the old result is not cleared before being overwritten, which guarantees that data remains available for query requests received while a refresh is in progress.
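The timed, overwrite-without-clearing behavior described above can be sketched in plain Scala with a scheduled executor. This is only a structural illustration under stated assumptions (the real implementation lives inside Spark-SQL's entry function, and `scanTableSpace` here is a stand-in for the actual HDFS scan):

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicReference

object RefreshSketch {
  // Latest scan result. Readers always see either the complete old list
  // or the complete new list, never a cleared intermediate state.
  val dirCache = new AtomicReference[List[String]](Nil)

  // Stand-in for scanning the table's directory structure on HDFS.
  def scanTableSpace(): List[String] =
    List("/table/hr=0/seg=135", "/table/hr=1/seg=136")

  def main(args: Array[String]): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val refresh = new Runnable {
      // Atomic swap of the whole result: overwrite, never clear-then-fill.
      def run(): Unit = dirCache.set(scanTableSpace())
    }
    // Timed refresh, e.g. every 5 seconds (a value within the 5-10 s range
    // the description gives as one option for the interval).
    scheduler.scheduleAtFixedRate(refresh, 0, 5, TimeUnit.SECONDS)
    // ... query code reads dirCache.get() instead of scanning HDFS ...
    scheduler.shutdown()
  }
}
```

Publishing each scan as a single atomic swap is what makes the guarantee in the paragraph above hold: a query arriving mid-refresh simply reads the previous complete result.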
FIG. 3 is the flowchart of the modified data query.
2) The strategy by which Spark-SQL handles data queries is modified: the work of scanning the distributed file system HDFS at the first query is done by the background refresh process, and the first query directly uses the refresh process's scan result, shortening query time. After the modification, the first query follows the same strategy as subsequent queries, that is, every query directly uses the in-memory table directory structure information scanned out by the background refresh process.
3) The background refresh function is customizable.
Before running Spark-SQL, the items related to the background refresh function can be customized, such as whether the background refresh function is enabled, the set of big data tables to be refreshed, and the refresh interval. The configuration items are located in hive-site.xml under the conf folder of the Spark installation directory; when Spark-SQL starts, all configuration items are read and parsed at once, so no extra program is needed to read and parse the configuration file, saving system overhead.
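A hive-site.xml fragment of the kind described might look like the following. The property names (`spark.background.refresh.*`) and table names are invented for illustration; the patent describes the mechanism but does not disclose the actual configuration keys:

```xml
<configuration>
  <!-- Hypothetical property names; only the mechanism is from the patent. -->
  <property>
    <name>spark.background.refresh.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>spark.background.refresh.interval.seconds</name>
    <value>5</value>
  </property>
  <property>
    <name>spark.background.refresh.tables</name>
    <value>db1.big_records,db1.big_events</value>
  </property>
</configuration>
```

Because these live in the same hive-site.xml that Spark-SQL already parses at startup, no extra configuration reader is needed, which is the overhead saving the paragraph above refers to.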
The key points of the present invention are as follows.
1) Scala programming is used, integrated into the Spark source code, to add the background refresh process without affecting any function of native Spark.
2) The original Spark-SQL query-handling strategy is modified to increase the speed of the first query.
3) The refresh process supports all data compression formats supported by Spark, such as PARQUET, SNAPPY, and ZIP.
4) The background refresh technique makes it possible to separate Spark's data import from its data query, improving system resource utilization.
The advantages of the present invention are as follows.
1) It makes it possible to use an efficient, fast data import program that can recognize all updates to the specified table spaces on the distributed file system HDFS, including addition, deletion, and modification operations. Meanwhile, the data import program is independent of Spark and does not interfere with data queries, improving the processing capacity of both.
2) The original strategy by which Spark-SQL handles query statements is modified, and the work of scanning the distributed file system HDFS is merged into a separate refresh process, greatly shortening query time.
It should be noted that the embodiments described above with reference to the drawings are only intended to illustrate the present invention, not to limit its scope. Those of ordinary skill in the art should understand that modifications or equivalent substitutions made to the present invention without departing from its spirit and scope shall all fall within the scope of the present invention. In addition, unless the context indicates otherwise, words in the singular include the plural and vice versa. Furthermore, unless otherwise stated, all or part of any embodiment may be used in combination with all or part of any other embodiment.

Claims (9)

  1. A background refresh method based on a Spark-SQL big data processing platform, characterized in that: a refresh process is created in the entry function of Spark-SQL and a timed refresh mechanism is set up, periodically scanning the file directory structure of the specified table spaces on the distributed file system HDFS.
  2. The background refresh method based on a Spark-SQL big data processing platform according to claim 1, characterized in that: configuration items are added to hive-site.xml under the conf folder of the Spark installation directory, so that whether the background refresh process is enabled, the refresh interval, and the set of big data table spaces to be refreshed can be customized.
  3. The background refresh method based on a Spark-SQL big data processing platform according to claim 2, characterized in that: if the refresh process is enabled, then before its first refresh completes, the directory structure information of the specified table spaces is not yet in memory, and if Spark-SQL receives a query statement at this time, it adopts the original first-query strategy and scans the file directory structure of the specified table spaces on the distributed file system HDFS before querying; once the refresh process has completed its first refresh, the directory structure information of the specified table spaces on HDFS is kept in memory, and when Spark-SQL receives a query statement, it no longer scans HDFS but directly uses the in-memory directory structure information of the table space, thereby shortening query time.
  4. The background refresh method based on a Spark-SQL big data processing platform according to claim 2, characterized in that: the refresh interval is one tenth to one half of the time taken by a single refresh, or the refresh interval is 5 to 10 seconds.
  5. The background refresh method based on a Spark-SQL big data processing platform according to claim 1, characterized in that: external data files are stored in compressed form, the compression format being ZIP, BZ2, SNAPPY, or PARQUET.
  6. The background refresh method based on a Spark-SQL big data processing platform according to claim 1, characterized in that: Scala programming is used to modify the strategy in the Spark source code by which Spark-SQL executes query statements.
  7. The background refresh method based on a Spark-SQL big data processing platform according to claim 1, characterized in that: before refreshing, the following steps are performed in order: creating a temporary table, creating a big data table with partition information, importing text-format data files into the temporary table, and processing the temporary table data and storing it in the big data table with partition information.
  8. The background refresh method based on a Spark-SQL big data processing platform according to claim 1, characterized in that: while a more efficient and faster data import program is used, newly added files in the HDFS distributed file system are identified and recorded in the metadata for serving user query requests.
  9. The background refresh method based on a Spark-SQL big data processing platform according to claim 1, characterized in that:
    the creating a temporary table is: creating, according to the data model, a temporary table for storing text-format data, the temporary table serving as the data source of the final data table;
    the creating a big data table with partition information is: in a big data context, creating a big data table with partition information can increase data query speed; in practical applications, the data is partitioned by month, week, day, or hour, or by a substring of a string, or by integer intervals, or by a combination of these, further subdividing the data to increase query speed;
    the importing text-format data files into the temporary table is: executing, according to the data file format, a Spark-SQL statement or a Hadoop-supported Load statement to import the text-format data directly into the temporary table;
    the processing the temporary table data and storing it in the big data table with partition information is: executing a Spark-SQL statement that specifies the partition format and the storage format, analyzing and processing the data in the temporary table according to the specified partition format, and then writing the data into the final big data table in the specified storage format; in this step, Spark first divides the data in the temporary table space into RDD data blocks according to the configuration, each RDD data block is assigned to a designated task for parallel processing, and then, through Spark-SQL's internal transformation mechanism, the partition information in the SQL statement is converted into specific operations on the RDD data blocks, so that the data is partitioned on the basis of the RDD data blocks and the partitioned data is compressed and written into the distributed file system HDFS.
PCT/CN2016/095361 2015-12-11 2016-08-15 Background refreshing method based on spark-sql big data processing platform WO2017096941A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510919868.6A CN105550293B (en) 2015-12-11 2015-12-11 A kind of backstage method for refreshing based on Spark SQL big data processing platforms
CN201510919868.6 2015-12-11

Publications (1)

Publication Number Publication Date
WO2017096941A1 true WO2017096941A1 (en) 2017-06-15

Family

ID=55829482

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/095361 WO2017096941A1 (en) 2015-12-11 2016-08-15 Background refreshing method based on spark-sql big data processing platform

Country Status (2)

Country Link
CN (1) CN105550293B (en)
WO (1) WO2017096941A1 (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110136777A (en) * 2018-02-09 2019-08-16 深圳先进技术研究院 It is a kind of that sequence sequence alignment method is resurveyed based on Spark frame
CN110162563A (en) * 2019-05-28 2019-08-23 深圳市网心科技有限公司 A kind of data storage method, system and electronic equipment and storage medium
CN110727684A (en) * 2019-10-08 2020-01-24 浪潮软件股份有限公司 Incremental data synchronization method for big data statistical analysis
CN110765154A (en) * 2019-10-16 2020-02-07 华电莱州发电有限公司 Method and device for processing mass real-time generated data of thermal power plant
CN110990340A (en) * 2019-11-12 2020-04-10 上海麦克风文化传媒有限公司 Big data multi-level storage framework
CN110990669A (en) * 2019-10-16 2020-04-10 广州丰石科技有限公司 DPI (deep packet inspection) analysis method and system based on rule generation
CN111179048A (en) * 2019-12-31 2020-05-19 中国银行股份有限公司 SPARK-based user information personalized analysis method, device and system
CN111488323A (en) * 2020-04-14 2020-08-04 中国农业银行股份有限公司 Data processing method and device and electronic equipment
CN111666260A (en) * 2019-03-08 2020-09-15 杭州海康威视数字技术股份有限公司 Data processing method and device
CN112163030A (en) * 2020-11-03 2021-01-01 北京明略软件系统有限公司 Multi-table batch operation method and system and computer equipment
CN112783923A (en) * 2020-11-25 2021-05-11 辽宁振兴银行股份有限公司 Implementation method for efficiently acquiring database based on Spark and Impala
CN113434608A (en) * 2021-07-06 2021-09-24 中国银行股份有限公司 Data processing method and device for Hive data warehouse
CN113553533A (en) * 2021-06-10 2021-10-26 国网安徽省电力有限公司 Index calculation method based on digital internal five-level market assessment system

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550293B (en) * 2015-12-11 2018-01-16 深圳市华讯方舟软件技术有限公司 A kind of backstage method for refreshing based on Spark SQL big data processing platforms
US10305967B2 (en) * 2016-03-14 2019-05-28 Business Objects Software Ltd. Unified client for distributed processing platform
CN106570129A (en) * 2016-10-27 2017-04-19 南京邮电大学 Storage system for rapidly analyzing real-time data and storage method thereof
CN106777278B (en) * 2016-12-29 2021-02-23 海尔优家智能科技(北京)有限公司 Spark-based data processing method and device
CN106682213B (en) * 2016-12-30 2020-08-07 Tcl科技集团股份有限公司 Internet of things task customizing method and system based on Hadoop platform
CN108959952B (en) * 2017-05-23 2020-10-30 中国移动通信集团重庆有限公司 Data platform authority control method, device and equipment
CN107391555B (en) * 2017-06-07 2020-08-04 中国科学院信息工程研究所 Spark-Sql retrieval-oriented metadata real-time updating method
CN108108490B (en) * 2018-01-12 2019-08-27 平安科技(深圳)有限公司 Hive table scan method, apparatus, computer equipment and storage medium
CN109491973A (en) * 2018-09-25 2019-03-19 中国平安人寿保险股份有限公司 Electronic device, declaration form delta data distribution analysis method and storage medium
CN109189798B (en) * 2018-09-30 2021-12-17 浙江百世技术有限公司 Spark-based data synchronous updating method
CN109473178B (en) * 2018-11-12 2022-04-01 北京懿医云科技有限公司 Method, system, device and storage medium for medical data integration
CN109800782A (en) * 2018-12-11 2019-05-24 国网甘肃省电力公司金昌供电公司 A kind of electric network fault detection method and device based on fuzzy knn algorithm
CN110222009B (en) * 2019-05-28 2021-08-06 咪咕文化科技有限公司 Method and device for automatically processing Hive warehousing abnormal file
CN110209654A (en) * 2019-06-05 2019-09-06 深圳市网心科技有限公司 A kind of text file data storage method, system and electronic equipment and storage medium
CN111159235A (en) * 2019-12-20 2020-05-15 中国建设银行股份有限公司 Data pre-partition method and device, electronic equipment and readable storage medium
CN111427887A (en) * 2020-03-17 2020-07-17 中国邮政储蓄银行股份有限公司 Method, device and system for rapidly scanning HBase partition table
CN114238450B (en) * 2022-02-22 2022-08-16 阿里云计算有限公司 Time partitioning method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8516022B1 (en) * 2012-01-11 2013-08-20 Emc Corporation Automatically committing files to be write-once-read-many in a file system
CN104239377A (en) * 2013-11-12 2014-12-24 新华瑞德(北京)网络科技有限公司 Platform-crossing data retrieval method and device
CN104767795A (en) * 2015-03-17 2015-07-08 浪潮通信信息系统有限公司 LTE MRO data statistical method and system based on HADOOP
CN105550293A (en) * 2015-12-11 2016-05-04 深圳市华讯方舟软件技术有限公司 Background refreshing method based on Spark-SQL big data processing platform

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699676B (en) * 2013-12-30 2017-02-15 厦门市美亚柏科信息股份有限公司 MSSQL SERVER based table partition and automatic maintenance method and system



Also Published As

Publication number Publication date
CN105550293A (en) 2016-05-04
CN105550293B (en) 2018-01-16

Similar Documents

Publication Publication Date Title
WO2017096941A1 (en) Background refreshing method based on spark-sql big data processing platform
US9678969B2 (en) Metadata updating method and apparatus based on columnar storage in distributed file system, and host
CN109891402B (en) Revocable and online mode switching
US11119997B2 (en) Lock-free hash indexing
US11556396B2 (en) Structure linked native query database management system and methods
WO2017096940A1 (en) Data import method for spark-sql-based big-data processing platform
EP3170109B1 (en) Method and system for adaptively building and updating column store database from row store database based on query demands
WO2019128205A1 (en) Method and device for achieving grayscale publishing, computing node and system
US11275759B2 (en) Data storage method and apparatus, server, and storage medium
CN111797121B (en) Strong consistency query method, device and system of read-write separation architecture service system
US9418094B2 (en) Method and apparatus for performing multi-stage table updates
CN104679898A (en) Big data access method
CN112286941B (en) Big data synchronization method and device based on Binlog + HBase + Hive
CN104778270A (en) Storage method for multiple files
US20230418811A1 (en) Transaction processing method and apparatus, computing device, and storage medium
Sears et al. Rose: Compressed, log-structured replication
WO2020041950A1 (en) Data update method, device, and storage device employing b+ tree indexing
CN113050886B (en) Nonvolatile memory storage method and system for embedded memory database
US10558636B2 (en) Index page with latch-free access
CN113672556A (en) Batch file migration method and device
Schindler Profiling and analyzing the I/O performance of NoSQL DBs
Valvag et al. Cogset vs. Hadoop: Measurements and analysis
US20190163799A1 (en) Database management system and database management method
US11816106B2 (en) Memory management for KLL sketch
US20240143566A1 (en) Data processing method and apparatus, and computing system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16872137

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16872137

Country of ref document: EP

Kind code of ref document: A1