WO2017096941A1 - Background refreshing method based on spark-sql big data processing platform - Google Patents


Info

Publication number
WO2017096941A1
WO2017096941A1 (PCT/CN2016/095361; CN2016095361W)
Authority
WO
WIPO (PCT)
Prior art keywords
data
spark
sql
big data
query
Prior art date
Application number
PCT/CN2016/095361
Other languages
French (fr)
Chinese (zh)
Inventor
王成
冯骏
Original Assignee
深圳市华讯方舟软件技术有限公司
华讯方舟科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市华讯方舟软件技术有限公司 and 华讯方舟科技有限公司
Publication of WO2017096941A1 publication Critical patent/WO2017096941A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/14: Details of searching files based on file metadata
    • G06F16/148: File search processing
    • G06F16/17: Details of further file system functions
    • G06F16/1737: Details of further file system functions for reducing power consumption or coping with limited storage space, e.g. in mobile devices
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems

Definitions

  • the invention relates to a background refreshing method of a big data processing platform, in particular to a background refreshing method based on a Spark-SQL big data processing platform.
  • the data import function of the Spark big data processing platform is implemented by Spark-SQL, that is, by Hive on Spark.
  • Hive queries can be submitted to the Spark cluster as Spark tasks for computation. Compared with Impala and Shark, Hive offers more comprehensive SQL syntax support and a broader user base.
  • Data import usually involves key points such as import content, storage format, and import speed:
  • the imported content may be a formatted or unformatted text file in which each record and each field is separated by a specific delimiter or file format.
  • the data content may be transmitted as a file or as a data stream, and its size is uncertain.
  • the format of the stored data can be either text format or compressed format to reduce disk usage.
  • the compression formats supported by Spark-SQL include zip, snappy, and parquet.
  • importing data can be partitioned based on content, and data can be stored in partitions to speed up queries.
  • the prior-art Spark-SQL data import and data refresh scheme (with external data files in text format) is as follows:
  • for tables compressed and stored in PARQUET format, the first query scans the directory structure on the distributed file system HDFS and updates the metastore, so in the context of big data the first query takes a long time; subsequent queries no longer scan the HDFS directory structure but directly reuse the scan result of the first query, shortening the query time.
  • the advantage of this mechanism is that non-first queries are faster, but there is a drawback that cannot be ignored: after the first scan, no direct modification of the table space on HDFS is recognized.
  • HDFS does not support in-place modification in principle.
  • any insert and delete operations can only be performed through Spark-SQL and executed in Spark.
  • both reads and writes consume a certain amount of system resources, which indirectly slows down both data import and queries.
  • if a data file of the table space is lost, all Spark queries on the table fail with a file-not-found error; the only remedy is to restart the Spark-SQL process and perform the first query and HDFS scan again.
  • the first Spark-SQL query scans the entire table space of the queried table on the HDFS distributed file system and saves a snapshot of it; in the context of big data this first query takes a very long time and cannot meet the time requirements, and any change to the table after the scan is not recognized by Spark-SQL.
  • Scala is a purely object-oriented programming language whose Scalac compiler compiles source files into Java class files (bytecode running on the JVM); query and import programs written in it are relatively inefficient.
  • in the Standalone mode of the Spark big data processing platform, resources are wasted on the control node.
  • the Spark big data processing platform is generally deployed as a cluster composed of several machines.
  • while the cluster is running, external data import and real-time queries usually proceed simultaneously, so machine resources are allocated to the data import program and the data query program at the same time.
  • the two inevitably conflict to some degree over I/O, CPU time, and memory allocation, and in severe cases the performance of both degrades sharply.
  • the technical problem to be solved by the present invention is to avoid the step of scanning the distributed file system HDFS for the first query in the context of big data, and greatly shorten the first query time of the Spark-SQL big data processing platform.
  • the background refresh method of the present invention, based on the Spark-SQL big data processing platform, creates a refresh process in the entry function of Spark-SQL and sets up a timed refresh mechanism that periodically scans the file directory structure of the specified table spaces on the distributed file system HDFS.
  • before the first refresh of the refresh process completes, the directory structure information of the specified table spaces is not yet in memory.
  • in that case the original first-query policy is used: the file directory structure of the specified table space on HDFS is scanned before the query.
  • once the refresh process has completed its first refresh, the directory structure information of the specified table spaces on HDFS is kept in memory; when Spark-SQL receives a query it no longer scans HDFS but directly uses the in-memory directory structure information of the table space, shortening the query time.
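The refresh-with-fallback behaviour just described can be sketched in a few lines. The following Python sketch uses a local directory tree as a stand-in for the HDFS table space; all class and method names are illustrative assumptions, not taken from the patent's Scala implementation.

```python
import os
import threading

class TableSpaceCache:
    """Illustrative sketch of the refresh mechanism described above.

    A background timer periodically scans a directory tree (standing in
    for the HDFS table space) and stores the listing in memory. A query
    uses the cached listing when one exists; before the first refresh
    completes it falls back to scanning directly, mirroring the
    'original first-query policy'.
    """

    def __init__(self, root, interval_s=5.0):
        self.root = root
        self.interval_s = interval_s
        self._snapshot = None          # None until the first refresh finishes
        self._lock = threading.Lock()

    def _scan(self):
        # Walk the tree and collect every file path (the 'directory structure').
        paths = []
        for dirpath, _dirnames, filenames in os.walk(self.root):
            for name in filenames:
                paths.append(os.path.join(dirpath, name))
        return sorted(paths)

    def refresh_once(self):
        snapshot = self._scan()        # build the new result fully first
        with self._lock:
            self._snapshot = snapshot  # then overwrite; never clear first

    def start(self):
        # Timed refresh loop; a real system would cancel this on shutdown.
        def loop():
            self.refresh_once()
            t = threading.Timer(self.interval_s, loop)
            t.daemon = True
            t.start()
        loop()

    def query_paths(self):
        with self._lock:
            snapshot = self._snapshot
        if snapshot is None:
            # First refresh not done yet: scan on the query path, like the
            # unmodified first-query behaviour.
            return self._scan()
        return snapshot
```

Note how a query between refreshes sees the cached listing, so files added after a refresh only become visible at the next refresh, exactly the staleness trade-off the patent's mechanism accepts.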
  • the refresh interval is one tenth to one half of the time taken to refresh once, or the refresh interval is 5 seconds to 10 seconds, and the refresh interval size may be customized according to product or user requirements.
  • the external data file is compressed and stored, and the compression format is ZIP, BZ2, SNAPPY or PARQUET.
  • creating the temporary table means creating, according to the data model, a temporary table for storing text-format data; this temporary table serves as the data source of the final data table;
  • creating the big data table with partition information means that, in the context of big data, a table with partition information improves query speed; in practice the data may be partitioned by month, week, day, or hour, by a substring of a string, by integer interval, or by a combination of these, further subdividing the data to improve query speed;
  • importing the data files into the temporary table means executing, according to the data file format, a Spark-SQL statement or a Hadoop-supported LOAD statement that imports the text-format data directly into the temporary table.
  • processing the temporary-table data and storing it in the big data table with partition information means executing a Spark-SQL statement that specifies the partition format and the storage format: the data in the temporary table is analyzed and processed according to the specified partition format and then written to the final big data table in the specified (compressed) storage format; in this step, Spark first divides the data in the temporary table space into RDD data blocks according to the configuration, each RDD data block is assigned to a task for parallel processing, and the internal transformation mechanism of Spark-SQL then converts the partition information in the SQL statement into specific operations on the RDD data blocks, so that the data is partitioned on the basis of the RDD blocks, compressed, and written to the distributed file system HDFS.
  • the background refresh method based on the Spark-SQL big data processing platform has the following beneficial effects compared with the prior art.
  • the first-query time of the Spark-SQL big data processing platform is greatly shortened.
  • taking 20 TB of data as an example, a big data table is divided into 25 first-level partitions by hour (0 to 23 plus a default partition) and into 1001 second-level partitions by the first three digits of the mobile phone number (000 to 999 plus a default partition), and is compressed and stored in PARQUET format.
  • for a query of the total number of records in a given number segment over a given time period, the original first-query time is about 20 minutes; the background refresh method optimized by the present invention shortens the first query to about 45 seconds.
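The two-level partitioning scheme in this example (hour plus phone-number prefix, each with a default partition) can be illustrated by deriving a record's partition keys. This is a hedged Python sketch: the function name, its arguments, and its fallback handling are assumptions for illustration, not the patent's code.

```python
def partition_keys(record_hour, phone):
    """Derive the two partition keys from the example above: 25 hour
    partitions (0-23 plus a default) and 1001 prefix partitions
    (000-999 plus a default). Illustrative only."""
    # First-level key: the hour, or the default partition if out of range.
    if isinstance(record_hour, int) and 0 <= record_hour <= 23:
        hour_part = str(record_hour)
    else:
        hour_part = "default"
    # Second-level key: first three digits of the phone number, or default.
    prefix = phone[:3] if isinstance(phone, str) else ""
    if len(prefix) == 3 and prefix.isdigit():
        prefix_part = prefix
    else:
        prefix_part = "default"
    return hour_part, prefix_part
```

A query restricted to one hour and one number segment then only needs to touch one of the 25 x 1001 partitions, which is what makes the partitioned layout fast to query.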
  • the native Spark data import program is the data import statement of Spark-SQL.
  • when used, the data import program occupies some or all of the computing resources of the Spark big data processing platform, which greatly affects query speed and efficiency.
  • using a more efficient data importer that processes data independently of Spark raises overall system utilization.
  • the background refresh adopts an independent process, which does not occupy the original Spark's system resources.
  • the common compression formats in Spark are ZIP, BZ2, SNAPPY, and PARQUET.
  • the PARQUET format is supported by all projects in the Hadoop ecosystem; it provides a columnar data representation with efficient compression and is independent of data processing framework, data model, and programming language.
  • the PARQUET format is therefore a preferred big data storage format.
  • the Spark big data processing platform has a certain limitation on queries of PARQUET-format data: for large data tables stored in PARQUET format, Spark-SQL scans the directory structure of the table on HDFS only when the table is first queried, and the scan is not performed again, so directory structure added or deleted after the first query cannot be recognized.
  • the background refreshing technique of the present invention can effectively solve this problem.
  • FIG. 1 is a schematic diagram of an overall framework of a Spark big data processing platform in the prior art.
  • FIG. 2 is a flow chart of a background refresh method based on the Spark-SQL big data processing platform of the present invention.
  • Figure 3 is a flow chart of the modified data query.
  • the background refresh method based on the Spark-SQL big data processing platform in this embodiment creates a refresh process in the Spark-SQL entry function and sets a timed refresh mechanism that periodically scans the file directory structure of the specified table spaces on the distributed file system HDFS.
  • as a preference, the refresh result is stored in memory to serve query requests for the table data.
  • before the first refresh of the refresh process completes, the directory structure information of the specified table spaces is not yet in memory.
  • the first-query policy is to scan the file directory structure of the specified table space on the distributed file system HDFS before querying; once the refresh process has completed its first refresh, the directory structure information of the specified table spaces on HDFS is kept in memory, and when Spark-SQL receives a query it no longer scans HDFS but directly uses the in-memory directory structure information of the table space, shortening the query time.
  • the refresh interval is one tenth to one half of the time taken to refresh once, or the refresh interval is 5 seconds to 10 seconds, and the refresh interval size may be customized according to product or user requirements.
  • the external data file is compressed and stored, and the compression format is ZIP, BZ2, SNAPPY or PARQUET.
  • creating the temporary table means creating, according to the data model, a temporary table for storing text-format data; this temporary table serves as the data source of the final data table;
  • creating the big data table with partition information means that, in the context of big data, a table with partition information improves query speed; in practice the data may be partitioned by month, week, day, or hour, by a substring of a string, by integer interval, or by a combination of these, further subdividing the data to improve query speed;
  • importing the text-format data files into the temporary table means executing, according to the data file format, a Spark-SQL statement or a Hadoop-supported LOAD statement that imports the data directly into the temporary table.
  • processing the temporary-table data and storing it in the big data table with partition information means executing a Spark-SQL statement that specifies the partition format and the storage format: the data in the temporary table is analyzed and processed according to the specified partition format and then written to the final big data table in the specified (compressed) storage format; in this step, Spark first divides the data in the temporary table space into RDD data blocks according to the configuration, each RDD data block is assigned to a task for parallel processing, and the internal transformation mechanism of Spark-SQL then converts the partition information in the SQL statement into specific operations on the RDD data blocks, so that the data is partitioned on the basis of the RDD blocks, compressed, and written to the distributed file system HDFS.
  • FIG. 2 shows the background refresh flow.
  • Spark-SQL is programmed in Scala; a background refresh process is added to the Spark-SQL entry function that periodically scans the specified table-space directory structure on the distributed file system HDFS and saves it to memory for use by data queries.
  • after Spark-SQL starts, it first reads the hive-site.xml configuration file, parses out the configuration items related to the background refresh process, and sets up the timed refresh mechanism so that timed refreshes are triggered by messages.
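The configuration-reading step can be illustrated as follows. Note that the property names below are hypothetical: the patent states that configuration items are added to hive-site.xml (enable flag, refresh interval, table-space set) but does not give their exact keys, so the keys and the parser are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical property names; the patent does not spell out the keys.
SAMPLE_HIVE_SITE = """\
<configuration>
  <property><name>background.refresh.enabled</name><value>true</value></property>
  <property><name>background.refresh.interval.seconds</name><value>5</value></property>
  <property><name>background.refresh.tables</name><value>db.calls,db.sms</value></property>
</configuration>
"""

def parse_refresh_config(xml_text):
    """Read the refresh-related items out of a hive-site.xml style file."""
    props = {}
    for prop in ET.fromstring(xml_text).findall("property"):
        props[prop.findtext("name")] = prop.findtext("value")
    return {
        # Whether the background refresh process is enabled at all.
        "enabled": props.get("background.refresh.enabled") == "true",
        # Refresh interval in seconds (defaulting to 5 if absent).
        "interval_s": int(props.get("background.refresh.interval.seconds", "5")),
        # The set of big data table spaces to refresh.
        "tables": [t for t in props.get("background.refresh.tables", "").split(",") if t],
    }
```

hive-site.xml already uses this `<property><name>…</name><value>…</value></property>` layout for Hive settings, which is why adding the refresh items there keeps them alongside the platform's other configuration.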
  • Spark-SQL creates a query plan for the big data table to be refreshed, locates the space in memory that stores the table information according to the query plan, and calls the refresh method in its attributes to scan the table directory structure on the distributed file system HDFS; the refresh method overwrites the previous scan result, and the original result is not cleared before being overwritten, ensuring that data remains available if a query request arrives during a refresh.
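The overwrite-without-clearing rule amounts to an atomic swap: the new scan result is built completely before it replaces the old one, so a concurrent query always observes a complete snapshot, never an emptied cache. The following is an illustrative Python sketch of that rule, not the patent's implementation.

```python
import threading

class SnapshotHolder:
    """Holds the latest directory-structure snapshot for one table.

    publish() swaps in a fully built snapshot under a lock; the old
    snapshot stays readable right up to the single assignment, which is
    what keeps queries answerable during a refresh.
    """

    def __init__(self):
        self._snapshot = {}
        self._lock = threading.Lock()

    def publish(self, new_snapshot):
        # The caller builds new_snapshot entirely before this call, so
        # readers never see a half-filled or cleared structure.
        with self._lock:
            self._snapshot = new_snapshot

    def read(self):
        with self._lock:
            return self._snapshot
```

The design choice to swap rather than clear-then-refill is what the flow chart's "original result is not emptied before overwriting" step describes.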
  • FIG. 3 is a flow chart of the modified data query.
  • the configuration item is located in the hive-site.xml file in the conf folder of the Spark installation directory.
  • the refresh process supports all data compression formats supported by Spark, such as PARQUET, SNAPPY, and ZIP.


Abstract

Disclosed in the present invention is a background refresh method based on a Spark-SQL big data processing platform. A new process is created, and a timed refresh mechanism is set, in an entry function of Spark-SQL, and the specified table-space file directory structure of the Hadoop distributed file system (HDFS) is periodically scanned. Configuration items are added to hive-site.xml under the conf folder of the Spark installation directory, so that whether the refresh process is enabled, the refresh interval, and the set of big data table spaces to be refreshed can be configured in a customized manner. In the present invention, in the context of big data, the first-query time of the Spark-SQL big data processing platform is greatly reduced. Taking 20 TB of data as an example, a big data table is divided into 25 first-level partitions by hour, into 1001 second-level partitions by the first three digits of a mobile phone number, and is compressed and stored in PARQUET format; for a query of the total amount of data in a certain number segment over a certain period of time, the original first-query time is approximately 20 minutes, and with the background refresh method optimized by the present invention, the first-query time is reduced to approximately 45 seconds.

Description

Background refresh method based on Spark-SQL big data processing platform

Technical Field

The invention relates to a background refresh method for a big data processing platform, and in particular to a background refresh method based on the Spark-SQL big data processing platform.

Background Art
With the development of the Internet, the mobile Internet, and the Internet of Things, we have entered an era of big data, and processing and analyzing this data has become an important and urgent need.

As technology has evolved, big data processing platforms have progressed from the original Hadoop and HBase to the later SQL-based Hive, Shark, and similar systems, while key-value platforms such as HBase have also grown. Today the rise of the SQL-on-Hadoop concept has driven the growth of the Spark ecosystem, which has become one of the most popular, most widely used, and most efficient big data processing platforms.

Whichever big data processing platform is chosen, the purpose is the same: to process and analyze big data and to extract useful information from it for people to use. At the most basic level, Map-Reduce-based Hadoop, key-value-based HBase, and RDD-based Spark all share the same overall flow of three main steps: data import, data analysis and processing, and presentation of results. The two most important parts are data import and data analysis and processing. The import speed determines how quickly the whole system can process data in real time and therefore affects overall system performance, while import and analysis together form the core of data processing.

As shown in Figure 1, the overall framework of the Spark big data processing platform is as follows. The data import function is implemented by Spark-SQL, that is, by Hive on Spark: Hive queries can be submitted to the Spark cluster as Spark tasks for computation. Compared with Impala and Shark, Hive offers more comprehensive SQL syntax support and has a broader user base. Data import usually involves several key points: the content to be imported, the storage format, and the import speed.
1. Import content

The imported content may be a formatted or unformatted text file in which each record and each field is separated by a specific delimiter or file format. The data may arrive as files or as a data stream, and its size is uncertain.

2. Storage format

The data may be stored in text format or in a compressed format to reduce disk usage; the compression formats currently supported by Spark-SQL include zip, snappy, and parquet.

In the context of big data, imported data can be partitioned by content and stored by partition, which speeds up queries.

3. Import speed

In the context of big data, data is generated continuously, which places high demands on import speed. Depending on the actual situation, the import speed must not fall below x records per second or x MB per second, and data loss, import errors, and data backlogs must not occur.
In the prior art, the Spark-SQL data import and data refresh scheme (with external data files in text format) is as follows.

When a query is issued, conditions can be added in the conditional clause to limit the range of data queried. In the Spark big data processing platform, different storage formats have different refresh mechanisms, mainly the following two:

i) If the data is ultimately stored as text (TEXTFILE) or as optimized row columnar (ORC) with ZIP or SNAPPY compression, every query of a big data table first scans the directory structure on the distributed file system HDFS and updates the metastore, so all updates to the table space on HDFS, including inserts, modifications, and deletions, are recognized. When there are many directories and data files, each HDFS scan takes a long time, and that time grows as data accumulates. The scan time is included in the query time: only after the scan does Spark divide the work into tasks according to the scan result and submit them to the executors, so the scan time directly affects the query time.

ii) If the data is ultimately compressed and stored in PARQUET format, the first query of a table scans the directory structure on HDFS and updates the metastore, so in the context of big data the first query takes a very long time; subsequent queries no longer scan the HDFS directory structure but reuse the scan result of the first query, shortening the query time. The advantage of this mechanism is that non-first queries are fast, but there is a drawback that cannot be ignored: after the first scan, no direct modification of the table space on HDFS is recognized, and any insert or delete operation (HDFS does not support in-place modification in principle) can only be performed through Spark-SQL. When Spark executor resources are limited, both reads and writes consume system resources, which indirectly slows down both data import and queries. In addition, if a data file of the table space on HDFS is lost, every query of that table on Spark fails with a file-not-found error, and the only remedy is to restart the Spark-SQL process and perform the first query and HDFS scan again.
In summary, the problems in the prior art are:

1. The first Spark-SQL query scans the entire table space of the queried table on the HDFS distributed file system and saves a snapshot of it. In the context of big data, this first query takes a very long time and cannot meet the time requirements, and Spark-SQL does not recognize any modification made to the table after the scan.

2. Prior-art data import programs based on Hive or Spark-SQL are written in Scala and run on the JVM, and suffer from low efficiency, slow speed, and frequent memory overflows. Scala is a purely object-oriented programming language whose Scalac compiler compiles source files into Java class files (bytecode running on the JVM), so query and import efficiency is relatively low.

3. In the Standalone mode of the Spark big data processing platform, resources are wasted on the control node. In the prior art, the platform is generally deployed as a cluster of several machines. While the cluster is running, external data import and real-time queries usually proceed simultaneously, so machine resources are allocated to both the data import program and the data query program at the same time. The two inevitably conflict to some degree over I/O, CPU time, and memory allocation, and in severe cases the performance of both degrades sharply.
Summary of the Invention

The technical problem to be solved by the present invention is, in the context of big data, to avoid the first-query step of scanning the distributed file system HDFS and thereby greatly shorten the first-query time of the Spark-SQL big data processing platform.

To solve this problem, the background refresh method of the present invention, based on the Spark-SQL big data processing platform, creates a refresh process in the entry function of Spark-SQL and sets up a timed refresh mechanism that periodically scans the file directory structure of the specified table spaces on the distributed file system HDFS.

Configuration items are added to hive-site.xml in the conf folder of the Spark installation directory, so that whether the background refresh process is enabled, the refresh interval, and the set of big data table spaces to refresh can all be customized.

If the background refresh process is enabled, then before its first refresh completes, the directory structure information of the specified table spaces is not yet in memory; if Spark-SQL receives a query statement at this point, it follows the original first-query policy and scans the file directory structure of the specified table space on HDFS before querying. Once the first refresh has completed, the directory structure information of the specified table spaces on HDFS is kept in memory; when Spark-SQL receives a query statement, it no longer scans HDFS but directly uses the in-memory directory structure information of the table space, shortening the query time.
所述刷新间隔是刷新一次所用时间的十分之一至二分之一,或者,所述刷新间隔是5秒至10秒,可以根据产品或者用户需求自定义所述刷新间隔大小。The refresh interval is one tenth to one half of the time taken to refresh once, or the refresh interval is 5 seconds to 10 seconds, and the refresh interval size may be customized according to product or user requirements.
将外部数据文件进行压缩存储,所述压缩格式为ZIP、BZ2、SNAPPY或PARQUET。The external data file is compressed and stored, and the compression format is ZIP, BZ2, SNAPPY or PARQUET.
采用Scala编程,修改Spark源码中关于Spark-SQL执行查询语句的策略。Using Scala programming, modify the Spark source code for Spark-SQL to execute query statements.
Before refreshing, the following steps are performed in order: creating a temporary table, creating a big data table with partition information, importing text-format data files into the temporary table, and processing the temporary table data and storing it in the big data table with partition information.
Creating the temporary table means: creating, according to the data model, a temporary table for storing text-format data; this temporary table serves as the data source of the final data table.
Creating the big data table with partition information means: in a big data context, creating a big data table with partition information can increase data query speed. In practical applications, the data is partitioned by month, week, day, or hour; or by a substring of a string; or by integer intervals; or by a combination of these, further subdividing the data to increase query speed.
Importing the text-format data files into the temporary table means: according to the data file format, executing a Spark-SQL statement or a Hadoop-supported Load statement to import the text-format data directly into the temporary table.
Processing the temporary table data and storing it in the big data table with partition information means: executing a Spark-SQL statement that specifies the partition format and the storage format, analyzing and processing the data in the temporary table according to the specified partition format, and then writing the data into the final big data table in the specified storage format (compression format). In this step, Spark first divides the data in the temporary table space into RDD data blocks according to the configuration, and each RDD data block is assigned to a designated task for parallel processing; then, through Spark-SQL's internal transformation mechanism, the partition information in the SQL statement is converted into specific operations on the RDD data blocks, so that the data is partitioned on the basis of the RDD data blocks, and the partitioned data is compressed and written into the distributed file system HDFS.
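The temporary-table-to-partitioned-table flow described above can be sketched as Spark-SQL statements. This is only an illustrative sketch: the table and column names (`tmp_records`, `big_records`, `phone`, `ts`) are invented for the example, the `ts` column is assumed to parse as a timestamp, and the partition scheme mirrors the hour/number-segment example given later in the description rather than any statement disclosed verbatim in the patent.

```sql
-- Hypothetical schema; names are illustrative only.
CREATE TABLE tmp_records (phone STRING, ts STRING, payload STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Final table, partitioned by hour and by the first 3 digits of the
-- phone number, stored compressed in the PARQUET format.
CREATE TABLE big_records (phone STRING, ts STRING, payload STRING)
  PARTITIONED BY (hr INT, seg STRING)
  STORED AS PARQUET;

-- Import the text-format data file directly into the temporary table.
LOAD DATA INPATH '/data/in/records.csv' INTO TABLE tmp_records;

-- Dynamic partition inserts typically require this Hive setting.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Analyze and partition the temporary data, writing it into the final
-- table in the specified storage (compression) format.
INSERT INTO TABLE big_records PARTITION (hr, seg)
SELECT phone, ts, payload,
       hour(ts)               AS hr,
       substring(phone, 1, 3) AS seg
FROM tmp_records;
```

Internally, Spark parallelizes the `INSERT ... SELECT` over RDD blocks and turns the `PARTITION (hr, seg)` clause into per-block partitioning operations, as described above.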
Compared with the prior art, the background refresh method based on the Spark-SQL big data processing platform of the present invention has the following beneficial effects.
1) In a big data context, the first-query time of the Spark-SQL big data processing platform is greatly shortened. Taking 20 TB of data as an example, the big data table is divided into 25 first-level partitions by hour (hours 0 to 23 plus one default partition) and 1001 second-level partitions by the first 3 digits of the mobile phone number (000 to 999 plus one default partition), and is stored compressed in the PARQUET format. For a query counting all records of a given number segment within a given time period, the first query originally took about 20 minutes; with the background refresh method optimized by the present invention, the first query takes only about 45 seconds.
2) While a more efficient and faster data import program is used, newly added files in the HDFS distributed file system are identified and recorded in the metadata for serving user query requests. Spark-SQL's original data import method runs at about 20,000 records per second; a more efficient import program that writes data directly to HDFS can raise the import speed to 200,000 records per second or higher (depending on the degree of concurrency). Although such a program bypasses Spark and writes new files directly to HDFS, the background refresh method proposed by the present invention can identify all newly added files in the specified table space and make them available for querying, without restarting the Spark-SQL service and without increasing query time.
3) The system resource utilization of the control nodes of the Spark big data processing platform is improved. The native Spark data import program is the Spark-SQL data import statement; while it runs, it occupies some or even all of the computing resources of the Spark big data processing platform, which greatly affects the speed and efficiency of data queries. Using a more efficient data import program that processes data independently of Spark yields higher system utilization. Meanwhile, the background refresh runs as an independent process and does not occupy Spark's own system resources.
4) In a big data context, disk space is also a bottleneck for system availability, so it is necessary to store external data files in compressed form. Common compression formats in Spark are ZIP, BZ2, SNAPPY, and PARQUET. The PARQUET format is supported by all projects in the Hadoop ecosystem, provides an efficiently compressed columnar data representation, and is independent of the data processing framework, data model, and programming language, so PARQUET can be the preferred big data storage format. However, the Spark big data processing platform has a limitation when querying data in the PARQUET format: for a big data table stored in the PARQUET format, Spark-SQL scans the directory structure of that table on HDFS only at the first query and never scans it again afterwards, so it cannot recognize directory structures added or deleted after the first query. The background refresh technique of the present invention effectively solves this problem.
5) Using Scala programming to modify the strategy in the Spark source code by which Spark-SQL executes query statements can greatly improve programming efficiency.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of the overall framework of a Spark big data processing platform in the prior art.
FIG. 2 is a flowchart of the background refresh method based on the Spark-SQL big data processing platform of the present invention.
FIG. 3 is a flowchart of the modified data query.
DETAILED DESCRIPTION
As shown in FIG. 2 and FIG. 3, the background refresh method based on the Spark-SQL big data processing platform of this embodiment creates a refresh process in the entry function of Spark-SQL and sets up a timed refresh mechanism that periodically scans the file directory structure of the specified table spaces on the distributed file system HDFS; preferably, the refresh result is kept in memory to support query requests against the data of those tables.
By adding configuration items to hive-site.xml under the conf folder of the Spark installation directory, one can customize whether the background refresh process is enabled, the refresh interval, and the set of big data table spaces to be refreshed.
If the refresh process is enabled, then before its first refresh completes, the directory structure information of the specified table spaces is not yet in memory; if Spark-SQL receives a query statement at this time, it adopts the original first-query strategy and scans the file directory structure of the specified table spaces on the distributed file system HDFS before querying. Once the refresh process has completed its first refresh, the directory structure information of the specified table spaces on HDFS is kept in memory; when Spark-SQL then receives a query statement, it no longer scans HDFS but directly uses the in-memory directory structure information of the table space, thereby shortening query time.
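The two-phase query policy just described can be modeled in a few lines of Scala, the patent's implementation language. This is a minimal stand-alone sketch, not the actual Spark source modification; the file paths and the `scanFileSystem` stand-in are invented for illustration:

```scala
import java.util.concurrent.atomic.AtomicReference

object QueryPolicySketch {
  // In-memory cache of the table space's directory structure
  // (None until the first background refresh completes).
  val cache = new AtomicReference[Option[List[String]]](None)

  // Stand-in for a real HDFS directory scan.
  def scanFileSystem(): List[String] =
    List("/table/hr=0/f1.parquet", "/table/hr=1/f2.parquet")

  // Query path: use the cache when populated, otherwise fall back
  // to the original strategy of scanning before the query.
  def filesForQuery(): List[String] = cache.get() match {
    case Some(files) => files             // first refresh done: no scan
    case None        => scanFileSystem()  // before first refresh: scan
  }

  def main(args: Array[String]): Unit = {
    println(filesForQuery().size) // before refresh: scans directly
    cache.set(Some(scanFileSystem())) // background refresh completes
    println(filesForQuery().size) // after refresh: served from memory
  }
}
```

The point of the sketch is that the query path itself never blocks on a scan once the cache is populated; keeping the result in an atomic reference lets the refresh process publish new results without coordinating with readers.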
The refresh interval is one tenth to one half of the time taken by a single refresh; alternatively, the refresh interval is 5 to 10 seconds. The refresh interval can be customized according to product or user requirements.
External data files are stored in compressed form; the compression format is ZIP, BZ2, SNAPPY, or PARQUET.
Scala programming is used to modify the strategy in the Spark source code by which Spark-SQL executes query statements.
Before refreshing, the following steps are performed in order: creating a temporary table, creating a big data table with partition information, importing text-format data files into the temporary table, and processing the temporary table data and storing it in the big data table with partition information.
Creating the temporary table means: creating, according to the data model, a temporary table for storing text-format data; this temporary table serves as the data source of the final data table.
Creating the big data table with partition information means: in a big data context, creating a big data table with partition information can increase data query speed. In practical applications, the data is partitioned by month, week, day, or hour; or by a substring of a string; or by integer intervals; or by a combination of these, further subdividing the data to increase query speed.
Importing the text-format data files into the temporary table means: according to the data file format, executing a Spark-SQL statement or a Hadoop-supported Load statement to import the data directly into the temporary table.
Processing the temporary table data and storing it in the big data table with partition information means: executing a Spark-SQL statement that specifies the partition format and the storage format, analyzing and processing the data in the temporary table according to the specified partition format, and then writing the data into the final big data table in the specified storage format (compression format). In this step, Spark first divides the data in the temporary table space into RDD data blocks according to the configuration, and each RDD data block is assigned to a designated task for parallel processing; then, through Spark-SQL's internal transformation mechanism, the partition information in the SQL statement is converted into specific operations on the RDD data blocks, so that the data is partitioned on the basis of the RDD data blocks, and the partitioned data is compressed and written into the distributed file system HDFS.
FIG. 2 shows the background refresh flowchart.
1) Scala is used to add a background refresh process to the entry function of Spark-SQL; the process periodically scans the specified table-space directory structures on the distributed file system HDFS and saves the results in memory for use by data queries. After Spark-SQL starts, it first reads the hive-site.xml configuration file, parses the configuration items related to the background refresh process, and sets up the timed refresh mechanism, triggering refreshes by messages. On each refresh, Spark-SQL creates a query plan for the big data table to be refreshed, locates, according to the query plan, the memory region that stores that table's information, calls the refresh method among its attributes, and scans the table's directory structure on the distributed file system HDFS. The refresh method overwrites the previous scan result, and the old result is not cleared before being overwritten, which guarantees that data remains available for query requests received while a refresh is in progress.
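The timed, overwrite-without-clearing behavior described above can be sketched in plain Scala with a scheduled executor. This is only a structural illustration under stated assumptions (the real implementation lives inside Spark-SQL's entry function, and `scanTableSpace` here is a stand-in for the actual HDFS scan):

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicReference

object RefreshSketch {
  // Latest scan result. Readers always see either the complete old list
  // or the complete new list, never a cleared intermediate state.
  val dirCache = new AtomicReference[List[String]](Nil)

  // Stand-in for scanning the table's directory structure on HDFS.
  def scanTableSpace(): List[String] =
    List("/table/hr=0/seg=135", "/table/hr=1/seg=136")

  def main(args: Array[String]): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val refresh = new Runnable {
      // Atomic swap of the whole result: overwrite, never clear-then-fill.
      def run(): Unit = dirCache.set(scanTableSpace())
    }
    // Timed refresh, e.g. every 5 seconds (a value within the 5-10 s range
    // the description gives as one option for the interval).
    scheduler.scheduleAtFixedRate(refresh, 0, 5, TimeUnit.SECONDS)
    // ... query code reads dirCache.get() instead of scanning HDFS ...
    scheduler.shutdown()
  }
}
```

Publishing each scan as a single atomic swap is what makes the guarantee in the paragraph above hold: a query arriving mid-refresh simply reads the previous complete result.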
FIG. 3 is the flowchart of the modified data query.
2) The strategy by which Spark-SQL handles data queries is modified: the work of scanning the distributed file system HDFS at the first query is done by the background refresh process, and the first query directly uses the refresh process's scan result, shortening query time. After the modification, the first query follows the same strategy as subsequent queries, that is, every query directly uses the in-memory table directory structure information scanned out by the background refresh process.
3) The background refresh function is customizable.
Before running Spark-SQL, the items related to the background refresh function can be customized, such as whether the background refresh function is enabled, the set of big data tables to be refreshed, and the refresh interval. The configuration items are located in hive-site.xml under the conf folder of the Spark installation directory; when Spark-SQL starts, all configuration items are read and parsed at once, so no extra program is needed to read and parse the configuration file, saving system overhead.
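A hive-site.xml fragment of the kind described might look like the following. The property names (`spark.background.refresh.*`) and table names are invented for illustration; the patent describes the mechanism but does not disclose the actual configuration keys:

```xml
<configuration>
  <!-- Hypothetical property names; only the mechanism is from the patent. -->
  <property>
    <name>spark.background.refresh.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>spark.background.refresh.interval.seconds</name>
    <value>5</value>
  </property>
  <property>
    <name>spark.background.refresh.tables</name>
    <value>db1.big_records,db1.big_events</value>
  </property>
</configuration>
```

Because these live in the same hive-site.xml that Spark-SQL already parses at startup, no extra configuration reader is needed, which is the overhead saving the paragraph above refers to.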
The key points of the present invention are as follows.
1) Scala programming is used, integrated into the Spark source code, to add the background refresh process without affecting any function of native Spark.
2) The original Spark-SQL query-handling strategy is modified to increase the speed of the first query.
3) The refresh process supports all data compression formats supported by Spark, such as PARQUET, SNAPPY, and ZIP.
4) The background refresh technique makes it possible to separate Spark's data import from its data query, improving system resource utilization.
The advantages of the present invention are as follows.
1) It makes it possible to use an efficient, fast data import program that can recognize all updates to the specified table spaces on the distributed file system HDFS, including addition, deletion, and modification operations. Meanwhile, the data import program is independent of Spark and does not interfere with data queries, improving the processing capacity of both.
2) The original strategy by which Spark-SQL handles query statements is modified, and the work of scanning the distributed file system HDFS is merged into a separate refresh process, greatly shortening query time.
It should be noted that the embodiments described above with reference to the drawings are only intended to illustrate the present invention, not to limit its scope. Those of ordinary skill in the art should understand that modifications or equivalent substitutions made to the present invention without departing from its spirit and scope shall all fall within the scope of the present invention. In addition, unless the context indicates otherwise, words in the singular include the plural and vice versa. Furthermore, unless otherwise stated, all or part of any embodiment may be used in combination with all or part of any other embodiment.

Claims (9)

  1. A background refresh method based on a Spark-SQL big data processing platform, characterized in that: a refresh process is created in the entry function of Spark-SQL and a timed refresh mechanism is set up, periodically scanning the file directory structure of the specified table spaces on the distributed file system HDFS.
  2. The background refresh method based on a Spark-SQL big data processing platform according to claim 1, characterized in that: configuration items are added to hive-site.xml under the conf folder of the Spark installation directory, so that whether the background refresh process is enabled, the refresh interval, and the set of big data table spaces to be refreshed can be customized.
  3. The background refresh method based on a Spark-SQL big data processing platform according to claim 2, characterized in that: if the refresh process is enabled, then before its first refresh completes, the directory structure information of the specified table spaces is not yet in memory, and if Spark-SQL receives a query statement at this time, it adopts the original first-query strategy and scans the file directory structure of the specified table spaces on the distributed file system HDFS before querying; once the refresh process has completed its first refresh, the directory structure information of the specified table spaces on HDFS is kept in memory, and when Spark-SQL receives a query statement, it no longer scans HDFS but directly uses the in-memory directory structure information of the table space, thereby shortening query time.
  4. The background refresh method based on a Spark-SQL big data processing platform according to claim 2, characterized in that: the refresh interval is one tenth to one half of the time taken by a single refresh, or the refresh interval is 5 to 10 seconds.
  5. The background refresh method based on a Spark-SQL big data processing platform according to claim 1, characterized in that: external data files are stored in compressed form, the compression format being ZIP, BZ2, SNAPPY, or PARQUET.
  6. The background refresh method based on a Spark-SQL big data processing platform according to claim 1, characterized in that: Scala programming is used to modify the strategy in the Spark source code by which Spark-SQL executes query statements.
  7. The background refresh method based on a Spark-SQL big data processing platform according to claim 1, characterized in that: before refreshing, the following steps are performed in order: creating a temporary table, creating a big data table with partition information, importing text-format data files into the temporary table, and processing the temporary table data and storing it in the big data table with partition information.
  8. The background refresh method based on a Spark-SQL big data processing platform according to claim 1, characterized in that: while a more efficient and faster data import program is used, newly added files in the HDFS distributed file system are identified and recorded in the metadata for serving user query requests.
  9. The background refresh method based on a Spark-SQL big data processing platform according to claim 1, characterized in that:
    the creating a temporary table is: creating, according to the data model, a temporary table for storing text-format data, the temporary table serving as the data source of the final data table;
    the creating a big data table with partition information is: in a big data context, creating a big data table with partition information can increase data query speed; in practical applications, the data is partitioned by month, week, day, or hour, or by a substring of a string, or by integer intervals, or by a combination of these, further subdividing the data to increase query speed;
    the importing text-format data files into the temporary table is: executing, according to the data file format, a Spark-SQL statement or a Hadoop-supported Load statement to import the text-format data directly into the temporary table;
    the processing the temporary table data and storing it in the big data table with partition information is: executing a Spark-SQL statement that specifies the partition format and the storage format, analyzing and processing the data in the temporary table according to the specified partition format, and then writing the data into the final big data table in the specified storage format; in this step, Spark first divides the data in the temporary table space into RDD data blocks according to the configuration, each RDD data block is assigned to a designated task for parallel processing, and then, through Spark-SQL's internal transformation mechanism, the partition information in the SQL statement is converted into specific operations on the RDD data blocks, so that the data is partitioned on the basis of the RDD data blocks and the partitioned data is compressed and written into the distributed file system HDFS.
PCT/CN2016/095361 2015-12-11 2016-08-15 Background refreshing method based on spark-sql big data processing platform WO2017096941A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510919868.6A CN105550293B (en) 2015-12-11 2015-12-11 A kind of backstage method for refreshing based on Spark SQL big data processing platforms
CN201510919868.6 2015-12-11

Publications (1)

Publication Number Publication Date
WO2017096941A1 true WO2017096941A1 (en) 2017-06-15

Family

ID=55829482

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/095361 WO2017096941A1 (en) 2015-12-11 2016-08-15 Background refreshing method based on spark-sql big data processing platform

Country Status (2)

Country Link
CN (1) CN105550293B (en)
WO (1) WO2017096941A1 (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110136777A (en) * 2018-02-09 2019-08-16 深圳先进技术研究院 It is a kind of that sequence sequence alignment method is resurveyed based on Spark frame
CN110162563A (en) * 2019-05-28 2019-08-23 深圳市网心科技有限公司 A kind of data storage method, system and electronic equipment and storage medium
CN110727684A (en) * 2019-10-08 2020-01-24 浪潮软件股份有限公司 Incremental data synchronization method for big data statistical analysis
CN110765154A (en) * 2019-10-16 2020-02-07 华电莱州发电有限公司 Method and device for processing mass real-time generated data of thermal power plant
CN110990340A (en) * 2019-11-12 2020-04-10 上海麦克风文化传媒有限公司 Big data multi-level storage framework
CN110990669A (en) * 2019-10-16 2020-04-10 广州丰石科技有限公司 DPI (deep packet inspection) analysis method and system based on rule generation
CN111179048A (en) * 2019-12-31 2020-05-19 中国银行股份有限公司 SPARK-based user information personalized analysis method, device and system
CN111488323A (en) * 2020-04-14 2020-08-04 中国农业银行股份有限公司 Data processing method and device and electronic equipment
CN111666260A (en) * 2019-03-08 2020-09-15 杭州海康威视数字技术股份有限公司 Data processing method and device
CN112163030A (en) * 2020-11-03 2021-01-01 北京明略软件系统有限公司 Multi-table batch operation method and system and computer equipment
CN112783923A (en) * 2020-11-25 2021-05-11 辽宁振兴银行股份有限公司 Implementation method for efficiently acquiring database based on Spark and Impala
CN113434608A (en) * 2021-07-06 2021-09-24 中国银行股份有限公司 Data processing method and device for Hive data warehouse
CN113553533A (en) * 2021-06-10 2021-10-26 国网安徽省电力有限公司 Index calculation method based on digital internal five-level market assessment system

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550293B (en) * 2015-12-11 2018-01-16 深圳市华讯方舟软件技术有限公司 A kind of backstage method for refreshing based on Spark SQL big data processing platforms
US10305967B2 (en) * 2016-03-14 2019-05-28 Business Objects Software Ltd. Unified client for distributed processing platform
CN106570129A (en) * 2016-10-27 2017-04-19 南京邮电大学 Storage system for rapidly analyzing real-time data and storage method thereof
CN106777278B (en) * 2016-12-29 2021-02-23 海尔优家智能科技(北京)有限公司 Spark-based data processing method and device
CN106682213B (en) * 2016-12-30 2020-08-07 Tcl科技集团股份有限公司 Internet of things task customizing method and system based on Hadoop platform
CN108959952B (en) * 2017-05-23 2020-10-30 中国移动通信集团重庆有限公司 Data platform authority control method, device and equipment
CN107391555B (en) * 2017-06-07 2020-08-04 中国科学院信息工程研究所 Spark-Sql retrieval-oriented metadata real-time updating method
CN108108490B (en) * 2018-01-12 2019-08-27 平安科技(深圳)有限公司 Hive table scan method, apparatus, computer equipment and storage medium
CN109491973A (en) * 2018-09-25 2019-03-19 中国平安人寿保险股份有限公司 Electronic device, declaration form delta data distribution analysis method and storage medium
CN109189798B (en) * 2018-09-30 2021-12-17 浙江百世技术有限公司 Spark-based data synchronous updating method
CN109473178B (en) * 2018-11-12 2022-04-01 北京懿医云科技有限公司 Method, system, device and storage medium for medical data integration
CN109800782A (en) * 2018-12-11 2019-05-24 国网甘肃省电力公司金昌供电公司 A kind of electric network fault detection method and device based on fuzzy knn algorithm
CN110222009B (en) * 2019-05-28 2021-08-06 咪咕文化科技有限公司 Method and device for automatically processing Hive warehousing abnormal file
CN110209654A (en) * 2019-06-05 2019-09-06 深圳市网心科技有限公司 A kind of text file data storage method, system and electronic equipment and storage medium
CN111159235A (en) * 2019-12-20 2020-05-15 中国建设银行股份有限公司 Data pre-partition method and device, electronic equipment and readable storage medium
CN111427887A (en) * 2020-03-17 2020-07-17 中国邮政储蓄银行股份有限公司 Method, device and system for rapidly scanning HBase partition table
CN114238450B (en) * 2022-02-22 2022-08-16 阿里云计算有限公司 Time partitioning method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8516022B1 (en) * 2012-01-11 2013-08-20 Emc Corporation Automatically committing files to be write-once-read-many in a file system
CN104239377A (en) * 2013-11-12 2014-12-24 新华瑞德(北京)网络科技有限公司 Platform-crossing data retrieval method and device
CN104767795A (en) * 2015-03-17 2015-07-08 浪潮通信信息系统有限公司 LTE MRO data statistical method and system based on HADOOP
CN105550293A (en) * 2015-12-11 2016-05-04 深圳市华讯方舟软件技术有限公司 Background refreshing method based on Spark-SQL big data processing platform

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699676B (en) * 2013-12-30 2017-02-15 厦门市美亚柏科信息股份有限公司 MSSQL SERVER based table partition and automatic maintenance method and system



Also Published As

Publication number Publication date
CN105550293A (en) 2016-05-04
CN105550293B (en) 2018-01-16

Similar Documents

Publication Publication Date Title
WO2017096941A1 (en) Background refreshing method based on spark-sql big data processing platform
US9678969B2 (en) Metadata updating method and apparatus based on columnar storage in distributed file system, and host
CN109891402B (en) Revocable and online mode switching
US11119997B2 (en) Lock-free hash indexing
US11556396B2 (en) Structure linked native query database management system and methods
WO2017096940A1 (en) Data import method for spark-sql-based big-data processing platform
EP3170109B1 (en) Method and system for adaptively building and updating column store database from row store database based on query demands
WO2019128205A1 (en) Method and device for achieving grayscale publishing, computing node and system
US11275759B2 (en) Data storage method and apparatus, server, and storage medium
CN111797121B (en) Strong consistency query method, device and system of read-write separation architecture service system
US9418094B2 (en) Method and apparatus for performing multi-stage table updates
CN104679898A (en) Big data access method
CN112286941B (en) Big data synchronization method and device based on Binlog + HBase + Hive
CN104778270A (en) Storage method for multiple files
US20230418811A1 (en) Transaction processing method and apparatus, computing device, and storage medium
Sears et al. Rose: Compressed, log-structured replication
WO2020041950A1 (en) Data update method, device, and storage device employing b+ tree indexing
CN113050886B (en) Nonvolatile memory storage method and system for embedded memory database
US10558636B2 (en) Index page with latch-free access
CN113672556A (en) Batch file migration method and device
Schindler Profiling and analyzing the I/O performance of NoSQL DBs
Valvag et al. Cogset vs. Hadoop: Measurements and analysis
US20190163799A1 (en) Database management system and database management method
US11816106B2 (en) Memory management for KLL sketch
US20240143566A1 (en) Data processing method and apparatus, and computing system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16872137

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16872137

Country of ref document: EP

Kind code of ref document: A1