WO2017101475A1 - Query method based on spark big data processing platform - Google Patents

Query method based on spark big data processing platform

Info

Publication number
WO2017101475A1
WO2017101475A1 · PCT/CN2016/095353 · CN 2016095353 W
Authority
WO
WIPO (PCT)
Prior art keywords
result
spark
output
query
task
Prior art date
Application number
PCT/CN2016/095353
Other languages
French (fr)
Chinese (zh)
Inventor
万修远
Original Assignee
深圳市华讯方舟软件技术有限公司
华讯方舟科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市华讯方舟软件技术有限公司 and 华讯方舟科技有限公司
Publication of WO2017101475A1 publication Critical patent/WO2017101475A1/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 — of structured data, e.g. relational data
    • G06F16/24 — Querying
    • G06F16/245 — Query processing
    • G06F16/2453 — Query optimisation
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 — of structured data, e.g. relational data
    • G06F16/24 — Querying
    • G06F16/245 — Query processing
    • G06F16/2458 — Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 — Distributed queries

Definitions

  • the invention relates to a query method for data processing, in particular to a query method based on a Spark big data processing platform.
  • MapReduce performs computations in parallel on multiple nodes of the cluster, which greatly speeds up queries.
  • As data volumes grow, however, MapReduce gradually becomes inadequate, and the Spark computing framework, based on in-memory computing, emerged in response. Spark's query speed is up to 100 times that of Hadoop, making it the most advanced distributed parallel computing framework.
  • With the development of the Spark ecosystem, Spark SQL, Spark Streaming, MLlib, GraphX and others have been built on top of it. Spark SQL is a tool developed for SQL users that allows structured data to be analyzed and queried in the SQL language.
  • the query method based on the Spark big data processing platform can be divided into five steps:
  • Step 1: After receiving the user's SQL statement, the Spark SQL application performs syntax parsing, execution-strategy optimization, and job (query job) generation, and finally submits the job by calling the SparkContext interface of the Spark platform.
  • Step 2: After receiving the job, SparkContext defines how the calculation result is to be stored once a Task (calculation task) executes successfully, then submits the job to the eventProcessActor, waits for the eventProcessActor to signal that the job has finished, and finally returns the calculation result to Spark SQL.
  • Step 3: After receiving the job-submission event, the eventProcessActor allocates multiple Tasks on the various nodes to begin parallel computation.
  • Step 4: After each Task finishes, it reports its status and result to the eventProcessActor, which tracks whether all Tasks of the job have completed; if so, it notifies SparkContext that the submitted job has ended, and SparkContext returns the calculation result to Spark SQL.
  • Step 5: After Spark SQL obtains the calculation result, it first performs format conversion, then sends a copy to the output module, which finally outputs the result.
  • Step 1 mainly parses the syntax of the SQL statement and generates a group of RDDs representing one job.
  • An RDD is a distributed data structure that describes both the distributed data to be processed and the algorithm for processing it; one RDD therefore represents one operation on the data, and a group of RDDs is a sequence of operations. Completing that sequence of operations in order completes one query calculation. Spark adopts a lazy-execution strategy: operations are not executed one by one as they are defined; instead, the sequence of operations is built first and then sent to the executor for execution. Because the operations represented by the group of RDDs are ordered and contain no cycles, their logical dependency graph is called a directed acyclic graph (DAG); in the DAG, a downstream RDD is generated by performing an operation on an upstream RDD.
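The lazy-execution strategy above can be illustrated with a minimal pure-Python sketch (not Spark's implementation): each transformation only records itself in an operation sequence, and nothing runs until the sequence is finally sent for execution. The class and method names here are illustrative only.

```python
# A minimal sketch of lazy execution: operations are recorded first,
# and only executed when the whole sequence is submitted.

class LazyPipeline:
    def __init__(self, data):
        self.data = data
        self.ops = []            # the recorded operation sequence

    def map(self, fn):
        self.ops.append(("map", fn))
        return self              # nothing is executed yet

    def filter(self, pred):
        self.ops.append(("filter", pred))
        return self              # still nothing executed

    def collect(self):
        # only now is the operation sequence executed, in order
        result = self.data
        for kind, fn in self.ops:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

p = LazyPipeline([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(p.collect())   # [20, 30, 40]
```

In Spark the recorded sequence is the DAG of RDD operations; here it is just a flat list, which is enough to show why defining an operation and executing it are separate steps.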
  • Step 2 mainly submits the DAG to the eventProcessActor, which runs in a thread environment separate from the submitting thread.
  • Each Task result is a subset of the final result, so it is not necessary to output all the result subsets together.
  • The size of the result that can be returned is directly limited by the memory size allotted to the program.
  • Step 3 mainly implements the division of the DAG into stages and the generation of a Task set for each stage. All Tasks within a stage perform the same operations on different data, so they can execute in parallel; Tasks in different stages, however, may not be able to run in parallel.
  • Each dark-gray filled rectangle represents a data block, and each data block is computed by a corresponding Task. Because a data block of RDD2 is calculated from multiple data blocks of RDD1, all Tasks of RDD1 must finish before RDD2 can start computing, so RDD1 and RDD2 must be divided into two different stages. When RDD5 is computed from RDD2, each data block is independent of the others: the Task computing one block of RDD2
  • does not need to wait for the other blocks before starting the calculation that produces RDD5 (here, a join operation), so RDD2 and RDD5 can belong to the same stage. Similarly, RDD3 and RDD4 can belong to the same stage, but RDD4 and RDD5 cannot. In Figure 4, stage1 and stage2 are independent of each other and can execute in parallel, while stage3 depends on both stage1 and stage2 and must therefore wait for them to complete before executing.
  • Step 4 mainly stores the calculation result in the memory specified by SparkContext after each Task of the last stage executes successfully. In Figure 5, the Tasks of stage1 and stage2 produce only intermediate results; each Task of stage3 produces part of the final result, and the final output is spliced together from the results of the stage3 Tasks, possibly with sorting during the splicing.
  • For a sorted query, the Task results are stored in rank order; if the results are not sorted, they are stored in the order in which the Tasks complete, so the order of the results of each query will be random.
  • Step 5 mainly formats the result, which is an array of record rows, into a sequence of strings: each row record is converted to string format, and the column delimiter is replaced with a tab character. Finally, when the output module fetches the formatted result, a copy is made for the output module, which then outputs it. In fact, formatting the result is not strictly necessary: formatting may look better, but it consumes considerable memory and performance, and in some cases the data itself is already neat and needs no formatting.
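The prior-art step-5 formatting can be sketched in a few lines; the row layout and column values here are illustrative assumptions, but the transformation is the one described: each record row becomes a string with tab characters as the column delimiter.

```python
# A minimal sketch of step-5 formatting: convert an array of record rows
# into a sequence of tab-delimited strings.

rows = [("alice", 30, "cn"), ("bob", 25, "us")]   # result as an array of record rows

# each row record is converted to string format, columns joined by a tab
formatted = ["\t".join(str(col) for col in row) for row in rows]

for line in formatted:
    print(line)
```

As the text notes, this step is optional: it produces neater output at the cost of allocating a second, string-typed copy of every result row.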
  • the query method based on the Spark big data processing platform in the prior art has the following technical problems:
  • When the Spark big data processing platform executes a query, the user response time is too long; in particular, when large-scale data is analyzed, the response time exceeds what users can tolerate, and the delay grows in step with the amount of data being analyzed.
  • the Spark big data processing platform does not support the output of large-scale query results.
  • The default configuration allows only 1 GB of query result data to be output. If the limit is configured too small, memory resources are not fully utilized; if it is configured larger than the remaining memory space allows, a memory-overflow exception results. In addition, on machines with a low memory configuration, the amount of data that can be output is further reduced.
  • Before the result is formally output, Spark SQL performs format conversion and data copying, which leaves multiple identical or similar copies of the data in memory. This wastes memory resources and reduces performance, directly affecting user response time and the amount of result data that can be stored, and the effect grows as the output grows.
  • The technical problem to be solved by the present invention is to provide a query method based on the Spark big data processing platform.
  • When the query method performs a conventional simple query (one whose DAG has relatively few stages), it can return quickly even if the data to be processed is very large.
  • When a complicated query is performed, the user response time can be greatly shortened relative to the original method: whichever query is executed, a result is output immediately, without any delay, as soon as it satisfies the output condition.
  • In the query method based on the Spark big data processing platform of the present invention, when the Spark application submits a job to the Spark big data processing platform, it transmits a result-formatting rule, a result-output rule, and a notification of whether the result is to be sorted.
  • Based on these, the processing strategy applied after a Task executes successfully is set:
  • For a sorted query, it is judged whether the rank number of the current Task's result is the position immediately after the last output rank. If so, the result is output according to the result-formatting and result-output rules passed by the Spark application; it is then judged, from the rank numbers, whether consecutively ranked results of other Tasks are already stored in the queue, in which case they are output together, and the memory of everything already output is released immediately. If not, the current Task's result is stored at the index position of its rank number. In this way, as soon as a result satisfies the output condition by rank, it is output without any delay. In the fastest case, the Task computing the first-ranked result completes first; in the slowest case it completes last; on average, therefore, at least half of the response time is saved.
  • For a non-sorted query, each Task outputs its result immediately upon success, according to the result-formatting and result-output rules passed by the Spark application.
  • The result is not stored, so as soon as a Task executes successfully, its result is output.
  • Results are output continuously in this way until the last completed Task has output, so the entire calculation process has no output delay: whenever a new calculation result exists, it is output immediately. When a large-scale data set is analyzed, the number of Tasks increases but the size of the data block processed by each Task does not change, and no matter how large the data set is, the first completed Task outputs its result immediately. Therefore, for a simple query, even a very large data set can output its first result quickly.
  • The Spark big data processing platform no longer applies for memory to store the calculation result; accordingly, each Task of the last stage of the DAG outputs its result directly after executing. For a sorted query in which a Task's result needs to be stored temporarily, it is judged whether memory is sufficient to hold the result; if not, the current job is terminated immediately and the Spark application is notified that the query result exceeds the system capacity, prompting the client to add filter conditions. Thus, for non-sorted queries the Spark big data processing platform can output a large number of query results continuously, supporting the return of large query volumes; for sorted queries, memory-overflow exceptions caused by oversized query results no longer occur.
  • After the Spark SQL application obtains the calculation result, it first judges whether the result is empty. If empty, the output flow is skipped; if not, formatting can be selected according to configuration, and the output flow then proceeds.
  • Before submitting a job to the Spark big data processing platform, the Spark SQL application must predefine the result-formatting rule, the result-output rule, and whether the results are to be sorted, and pass this information when the job is submitted; depending on configuration, the result-formatting rule may be empty.
  • The overloaded interface adds the result-formatting rule, the result-output rule, and the notification of whether the results are to be sorted; finally, before the job is formally submitted, the Task-success processing strategy is set according to these three parameters. The Spark SQL application uses this overloaded interface when submitting jobs.
  • Compared with the prior art, the query method based on the Spark big data processing platform of the present invention has the following beneficial effects.
  • Non-sorted queries can output a large number of query results, even the full volume of the stored data; sorted queries are guaranteed not to trigger a memory-overflow exception because the output is too large, and with a certain probability the amount of data that can be output is greatly increased.
  • The Spark SQL application can output both formatted and unformatted results.
  • When obtaining the query result, the output module of the Spark SQL application directly references the result obtained by the job-submission module, thereby avoiding result copying.
  • The Spark big data processing platform gains the function of outputting results, while the rules for formatting and outputting results are defined by the Spark application, so this function applies to all Spark applications.
  • Existing Spark applications can still use the original interface when submitting jobs and are unaffected.
  • FIG. 1 is an architectural diagram of a Spark SQL execution query in the prior art.
  • FIG. 2 is a framework diagram of a Spark SQL generated DAG in the prior art.
  • FIG. 3 is a flow chart of a SparkContext submit job in the prior art.
  • FIG. 4 is a schematic diagram of the RDD phase division in the prior art.
  • FIG. 5 is a process diagram of a prior-art DAG executed in stages.
  • FIG. 6 is a schematic diagram of a sorted query Task storage calculation result in the prior art.
  • FIG. 7 is a flowchart of implementing an undelayed output of a query result according to an embodiment of the present invention.
  • FIG. 8 is a flowchart of implementing sorting query processing in a Task successful processing strategy according to an embodiment of the present invention.
  • FIG. 9 is a flowchart of implementing non-sorted query processing in a Task successful processing policy according to an embodiment of the present invention.
  • FIG. 10 is a flowchart of implementing a non-sorted query to support a massive query result according to an embodiment of the present invention.
  • FIG. 11 is a flowchart of implementing memory protection for a query result by using a sort query according to an embodiment of the present invention.
  • FIG. 12 is a flowchart of an implementation of a Spark SQL processing job query result according to an embodiment of the present invention.
  • Embodiment 1:
  • The application programming interface SparkContext of the Spark big data processing platform provides a new job-submission interface.
  • The new interface requires the result-output rule, the result-formatting rule, and notification of whether the results are to be sorted.
  • The new interface redefines the result-processing strategy used when a Task succeeds, which is executed when the Task-success event occurs. In this strategy the Task's result is no longer simply stored; instead, based on the result's rank, it is judged whether the result satisfies the immediate-output condition. If it does, the result is formatted and output according to the formatting and output rules defined and passed by the Spark application, and is not stored after output. If the output condition is not yet met, the result is stored temporarily; when the next Task succeeds, the condition is judged again, and any result that now satisfies it is output immediately and its memory released.
  • A Spark application under development can use this new interface to achieve result output without delay, while existing Spark applications can still work normally with the original interface.
  • The result to be formatted is an array of record rows: the row array is converted to an array of strings, and the column separator in each string is replaced with a tab character using a character-substitution function, yielding a tab-delimited result form.
  • The result is printed directly to the console, so for the Spark SQL application the output rule prints to the specified console. When the Spark SQL application processes the SQL statement, it determines whether the results are to be sorted according to whether the statement contains an ORDER BY clause (a sorting clause).
  • If the job fails, the Spark application receives the job-failure notification, outputs an error message, and waits for the next query job to be submitted.
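The shape of the new submission interface can be sketched as follows. The function and field names are hypothetical (this is not the actual SparkContext API): the point is only that the three extra parameters are attached to the job as the Task-success strategy before formal submission.

```python
# A hedged sketch of the overloaded job-submission interface: the caller
# passes the output rule, the formatting rule, and a sorted flag, and the
# Task-success processing strategy is built from those three parameters.

def new_submit_job(job, output_rule, format_rule, results_sorted):
    """Set the Task-success strategy from the three parameters, then
    (in the real platform) formally submit the job."""
    def on_task_success(result):
        # format only if a formatting rule was supplied (it may be empty)
        formatted = format_rule(result) if format_rule else result
        output_rule(formatted)          # output immediately, do not store
    job["on_task_success"] = on_task_success
    job["results_sorted"] = results_sorted
    return job

printed = []
job = new_submit_job({"name": "q1"}, printed.append,
                     format_rule=str.upper, results_sorted=False)
job["on_task_success"]("hello")   # simulate one Task succeeding
print(printed)
```

An application built against the old interface simply never attaches these callbacks, which is why the two interfaces can coexist.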
  • Embodiment 2:
  • The specific implementation of sorted-query processing in the Task-success processing strategy of the query method based on the Spark big data processing platform is as follows:
  • The rank number of each Task's result has already been calculated by the stages that run before the last stage of the job, so every Task of the last stage knows the rank of its result; as soon as an individual Task finishes, the rank of its result is therefore already determined, without waiting for all Tasks to finish.
  • The Tasks mentioned in the embodiments of the present invention all refer to Tasks of the last stage of the job.
  • Task1 completes first; its result is ranked 3rd, while output has currently reached position 0 (that is, nothing has been output), so the result cannot be output to the customer yet and is stored at index position 3.
  • Task2 completes second; its result is ranked 1st, so it is output immediately without being stored. It is then judged whether consecutive results exist at the index positions immediately following (starting from position 2); since position 2 holds no result, nothing more is done. The last output rank is updated to 1.
  • Task3 completes third; its result is ranked 5th, while output has reached position 1, so it cannot be output yet and is stored at index position 5.
  • Task4 completes fourth; its result is ranked 4th, while output has reached position 1, so it cannot be output yet and is stored at index position 4.
  • Task5 completes fifth; its result is ranked 7th, while output has reached position 1, so it cannot be output yet and is stored at index position 7.
  • Task6 completes sixth; its result is ranked 2nd, and output has reached position 1, so it is output immediately. It is then judged whether consecutively ranked results exist at the following index positions (starting from position 3).
  • The three consecutive results at positions 3, 4, and 5 are output in turn; position 7 also holds a result, but the result for position 6 has not yet arrived, so it cannot be output.
  • The last Task completes with the result ranked 6th, and output has reached position 5, so it is output immediately. It is then judged whether consecutive results exist at the following index positions (starting from position 7); since position 7 holds a result, the 7th result is output as well. Finally the memory occupied by the 7th result is released and the current output rank is updated to 7.
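The walkthrough above can be sketched as a small ordered-flush buffer (a minimal illustration, not the patented implementation): each finished Task reports its rank, a result is emitted only when its rank is next in line, and any consecutively ranked buffered results are flushed with it and their memory released.

```python
# Sorted-query output strategy: emit the next-ranked result immediately,
# buffer out-of-order results, and flush consecutive runs as gaps close.

class OrderedFlusher:
    def __init__(self, output):
        self.output = output          # the result-output rule (a callback)
        self.buffer = {}              # rank -> result, temporary storage
        self.last_emitted = 0         # highest rank output so far

    def on_task_success(self, rank, result):
        if rank == self.last_emitted + 1:
            self.output(result)
            self.last_emitted = rank
            # flush any consecutively ranked buffered results
            while self.last_emitted + 1 in self.buffer:
                self.last_emitted += 1
                # pop() releases the buffered entry as it is output
                self.output(self.buffer.pop(self.last_emitted))
        else:
            self.buffer[rank] = result

emitted = []
f = OrderedFlusher(emitted.append)
# Completion order from the embodiment: Task1..Task7 finish holding
# ranks 3, 1, 5, 4, 7, 2, 6 respectively.
for rank in [3, 1, 5, 4, 7, 2, 6]:
    f.on_task_success(rank, f"row{rank}")
print(emitted)   # results come out in rank order
```

Replaying the embodiment's completion order produces exactly the behavior described: rank 1 is emitted on Task2's arrival, ranks 2-5 flush together when Task6 arrives, and ranks 6-7 flush when the last Task arrives.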
  • Embodiment 3:
  • This embodiment concerns non-sorted-query processing in the Task-success processing strategy of the query method based on the Spark big data processing platform.
  • The specific implementation of the query processing is as follows:
  • When a Task succeeds, the non-sorted-query processing flow starts, as follows:
  • The result is formatted according to the formatting rule passed by the Spark application, then output according to the output rule passed by the Spark application, and the Task-success processing ends.
  • Task1 is the first to complete; because no ordering of results is required, its result can be output immediately. The other Tasks are processed in the same way, so by the time all Tasks have executed, all results have already been output.
  • Embodiment 4:
  • The specific implementation by which the query method based on the Spark big data processing platform supports massive query results for non-sorted queries is as follows:
  • Each data file block is 256 MB,
  • and each file block corresponds to one Task performing the query calculation.
  • Because the result of the query calculation is a subset of the file block's contents, the calculation result is at most 256 MB in size.
  • The memory consumed by the Spark big data processing platform for a Task's result during query processing therefore always stays between 0 and 256 MB; no matter how many file blocks and Tasks there are, as long as the portion of memory managed by the platform for storing Task results exceeds 256 MB, all Task results can be received and processed.
  • Embodiment 5:
  • The specific implementation of memory protection for sorted-query results in the query method based on the Spark big data processing platform is as follows:
  • When applying for memory to store each Task's result, the Spark big data processing platform judges whether the memory space is sufficient; if not, it terminates the current job and notifies the Spark application that the job has failed.
  • In the best case, all Tasks complete in the order of their ranks.
  • The result of each Task then needs no storage and can be output directly.
  • Task1 finishes first and its result happens to be ranked 1st, so the output condition is satisfied and it is output directly.
  • Task2 finishes second and its result happens to be ranked 2nd, so the output condition is satisfied and it is output directly.
  • In this way all Tasks output their results in order. In the worst case, the Task holding the 1st-ranked result finishes last; at that point the results of all Tasks except that last-finishing one must be stored temporarily.
  • Since most runs fall between the best and worst cases, and temporarily stored Task results are output and their memory released as soon as the output condition is met, the Spark big data processing platform can, with a certain probability, support the return of massive query results even for sorted queries. For the worse cases, when the volume of Task results that must be stored temporarily exceeds the memory limit, memory protection is applied to avoid the system memory-overflow exception that insufficient memory would otherwise cause.
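The memory-protection rule can be sketched as a guarded buffer (class and exception names are hypothetical): before an out-of-order result is stored, the platform checks that the configured result-memory budget can hold it; if not, the job is terminated and the application is told the result exceeds system capacity.

```python
# A minimal sketch of sorted-query memory protection: refuse to buffer a
# Task result that would exceed the result-memory budget.

class ResultBufferOverflow(Exception):
    """Raised when buffered Task results would exceed the memory budget."""

class GuardedBuffer:
    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.buffer = {}          # rank -> result bytes awaiting output

    def store(self, rank, result_bytes):
        if self.used + len(result_bytes) > self.budget:
            # the platform would terminate the current job here and notify
            # the application that the result exceeds system capacity
            raise ResultBufferOverflow(
                "query result exceeds system capacity; add filter conditions")
        self.buffer[rank] = result_bytes
        self.used += len(result_bytes)

buf = GuardedBuffer(budget_bytes=10)
buf.store(3, b"abcd")                 # 4 of 10 bytes used: fits
try:
    buf.store(5, b"0123456789")       # 4 + 10 > 10: over budget
    aborted = False
except ResultBufferOverflow:
    aborted = True
print(aborted)
```

Failing fast with a controlled job termination is the design choice the embodiment describes: a deliberate abort with a clear message replaces an unpredictable system-wide memory-overflow exception.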
  • The specific implementation of Spark SQL's handling of job query results in the query method based on the Spark big data processing platform is as follows:
  • To remain compatible with both the old and the new job-submission interfaces, the Spark SQL application judges, when the result is returned at the end of a job, whether the result is empty (NULL). If it is empty, the output flow is skipped and the query ends; if it is not empty, a new configuration property determines whether to format the result; if formatting is configured, the result is formatted, and the output flow then proceeds.
  • When the result of a single Task or of the whole job is passed to the output module of the Spark SQL application, the output module accesses the result memory directly and prints it to the console, without applying for a new block of memory and copying the result into it.
  • The specific implementation by which the Spark application passes its result-output rule, result-formatting rule, and sort notification to the Spark big data processing platform is as follows:
  • The result-output rule is implemented as a function.
  • The function can be called wherever a result is to be output.
  • The function is ultimately passed into the processing strategy applied after a Task succeeds.
  • The processing strategy here is also implemented as a function, containing a piece of business-logic processing, and is called after a Task succeeds.
  • The result-formatting rule is likewise implemented as a function. Formatting is optional and determined by a configuration switch: if the switch is on, the formatting-rule function contains the specific formatting steps; if it is off, the function contains no steps.
  • The formatting rule is ultimately passed into the processing strategy applied when a Task succeeds.
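The configuration-switched formatting rule can be sketched as a function factory (names and row layout are illustrative assumptions): when the switch is on, the rule contains the tab-delimiting step; when it is off, the rule contains no steps and passes the result through unchanged.

```python
# A minimal sketch of the configurable formatting rule: the configuration
# switch decides whether the rule formats or is a no-op.

def make_format_rule(format_enabled):
    if format_enabled:
        # switch on: convert the record row to a tab-delimited string
        return lambda row: "\t".join(str(col) for col in row)
    # switch off: the rule contains no steps
    return lambda row: row

fmt_on = make_format_rule(True)
fmt_off = make_format_rule(False)
print(fmt_on(("alice", 30)))    # tab-delimited string
print(fmt_off(("alice", 30)))   # the row, unchanged
```

Because both branches have the same function signature, the Task-success strategy can call the rule unconditionally and the switch costs nothing at output time.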
  • Each result subset of the query releases its memory as soon as it is output; because memory is released promptly, new result subsets can be received continuously.
  • The Spark SQL application accesses the result memory directly when outputting the query result, without applying for new memory and copying the result.
  • The Spark application must pass its result-output rule and result-formatting rule to the Spark big data processing platform. This ensures that the solution of the present invention can be reused by other modules: a module that wants the immediate-output function need only pass its own output and formatting rules to the Spark big data platform, and the system then knows how to output its results, since different modules may process data in different formats and require different output formats.
  • Compared with the prior art, this embodiment has the following beneficial effects.
  • The Spark SQL application can choose, according to configuration, whether to format the calculation result; in scenarios where the output data does not need to be formatted to look neat, memory and performance consumption can be reduced to some extent.
  • When obtaining the query result, the output module of the Spark SQL application directly references the result obtained by the job-submission module, thereby avoiding result copying.
  • The Spark big data processing platform gains the function of outputting results, while the rules for formatting and outputting results are defined by the Spark application, so this feature applies to all Spark applications.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)

Abstract

A query method based on a Spark big data processing platform. The method comprises: for a sorted query, determining whether the rank of the current calculation task's result is the next one after the last output rank; if so, outputting the result, then determining from the ranks whether consecutively ranked results of other calculation tasks are already stored next to it and, if so, outputting them together; if not, storing the current task's result at the index position corresponding to its rank. For a non-sorted query, the result of each completed calculation task is output immediately; output results are not stored, or the memory they occupy is released immediately. When the query method performs a routine simple query, a result can be returned quickly even if the data to be processed is very large; when it performs a complicated query, the user response time can be greatly shortened.

Description

A query method based on a Spark big data processing platform

Technical field

The invention relates to a query method for data processing, in particular to a query method based on a Spark big data processing platform.

Background
With the development of the Internet, tens of thousands of web pages are constantly emerging. To search these pages they must first be crawled and stored, then analyzed and computed; this is what Google did. But ever-growing data meant that storage faced the problem of a single machine's capacity being insufficient, and queries took too long. For these two problems Google proposed solutions based on distributed storage and distributed parallel computing, and the later Hadoop products are an implementation of that solution. Hadoop provides the distributed file system HDFS and the distributed parallel computing framework MapReduce. As Hadoop developed, its ecosystem kept producing new projects such as HBase, Hive, and Pig, all based on the HDFS storage layer and the MapReduce computing framework. MapReduce executes computations in parallel across multiple nodes of a cluster, greatly speeding up queries, but as data volumes grew, MapReduce gradually became inadequate, and the Spark computing framework, based on in-memory computing, emerged. Spark's query speed is up to 100 times that of Hadoop, making it the most advanced distributed parallel computing framework. With the growth of the Spark ecosystem, Spark SQL, Spark Streaming, MLlib, GraphX and others have been built on top of it; Spark SQL is a tool developed for SQL users that allows structured data to be analyzed and queried in the SQL language.
As shown in FIG. 1, in the prior art, taking a Spark SQL application as an example, a query method based on the Spark big data processing platform can be divided into five steps:
Step 1: After receiving the user's SQL statement, the Spark SQL application performs syntax parsing, execution-plan optimization, and job (query job) generation, and finally submits the job by calling the SparkContext interface of the Spark platform.
Step 2: After receiving the job, SparkContext defines how computation results are to be stored once a Task (computation task) succeeds, then submits the job to eventProcessActor and waits for eventProcessActor to signal that the job has finished; when it finishes, SparkContext returns the computation results to Spark SQL.
Step 3: After receiving the job-submission event, eventProcessActor allocates multiple Tasks across the cluster nodes to begin parallel computation.
Step 4: After each Task finishes, it reports its status and result to eventProcessActor; eventProcessActor checks whether all Tasks of the job have completed, and if so, notifies SparkContext that the submitted job has ended, whereupon SparkContext returns the computation results to Spark SQL.
Step 5: After Spark SQL obtains the computation results, it first performs format conversion, then copies the results to the output module, and finally the output module outputs them.
As shown in FIG. 2, step 1 mainly parses the syntax of the SQL statement and generates a set of RDDs representing one job. An RDD is a distributed data structure describing the distributed data to be processed and the algorithm for processing it; one RDD therefore represents one operation on the data, and a set of RDDs is a sequence of operations. Completing this series of operations in order completes one query computation. Spark adopts a lazy execution strategy: each operation is not executed immediately; instead, the sequence of operations is generated first and then sent to the executors for execution. Because the operations represented by this set of RDDs are ordered and acyclic, the logical dependency graph they form is called a directed acyclic graph (DAG); in the DAG, a downstream RDD is generated by applying an operation to an upstream RDD.
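The lazy execution strategy described above can be pictured with a minimal model (illustrative Python, not Spark's actual API; the class and method names are hypothetical): transformations only record operations in a lineage, and nothing runs until an action is called.

```python
# Minimal, illustrative model of Spark's lazy execution: transformations
# only record an operation in the lineage; nothing runs until an action
# (here, collect) walks the recorded sequence. Names are hypothetical,
# not Spark's actual API.

class LazyDataset:
    def __init__(self, data, lineage=None):
        self._data = data
        self._lineage = lineage or []      # recorded operations, not yet run

    def map(self, fn):
        # Transformation: append to the lineage, return a new downstream node.
        return LazyDataset(self._data, self._lineage + [("map", fn)])

    def filter(self, pred):
        return LazyDataset(self._data, self._lineage + [("filter", pred)])

    def collect(self):
        # Action: only now is the recorded operation sequence executed.
        rows = list(self._data)
        for kind, fn in self._lineage:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

ds = LazyDataset(range(6)).map(lambda x: x * 10).filter(lambda x: x >= 20)
# No computation has happened yet; collect() triggers the whole sequence:
print(ds.collect())   # [20, 30, 40, 50]
```

The lineage built by `map` and `filter` plays the role of the operation sequence that Spark ships to the executors.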
As shown in FIG. 3, step 2 mainly submits the DAG to eventProcessActor, which runs in another thread. Before submitting, SparkContext allocates a block of memory and instructs eventProcessActor to store results into it whenever a Task succeeds. After submitting, the current thread suspends and waits for eventProcessActor to wake it when the job completes; once woken, all Tasks have finished and all computation results are already stored in the pre-allocated memory, so the address of the memory holding the results is returned directly to the Spark SQL module. Because no results are returned until all Tasks finish, the client response time is excessive; in fact, each Task's result is a subset of the final result, and there is no need to output all the result subsets together. Moreover, because the entire query result must be stored before output, the result size is directly limited by the program's heap size.
As shown in FIG. 4, step 3 mainly divides the DAG into stages and generates the Task set for each stage. All Tasks within a stage perform the same operations, differing only in the data they act upon, so they can execute fully in parallel; Tasks in different stages, however, cannot necessarily run in parallel. Each dark-gray filled rectangle represents a data block, and each data block is computed by a corresponding Task. Because a data block of RDD2 is computed from multiple data blocks of RDD1, all Tasks of RDD1 must finish before computation of RDD2 can begin, so RDD1 and RDD2 must belong to two different stages. When RDD2 produces RDD5, each data block is processed independently of the others, so a Task computing one data block of RDD2 can begin the computation toward RDD5 (here, a join operation) without waiting for the Tasks of the other data blocks; RDD2 and RDD5 can therefore belong to the same stage. Similarly, RDD3 and RDD4 can belong to the same stage, but RDD4 and RDD5 cannot. In FIG. 4, stage1 and stage2 are independent of each other and can execute in parallel, while stage3 depends on both stage1 and stage2 and therefore must wait for both to complete before executing.
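The stage-splitting rule just described can be sketched as follows (an illustrative Python model, not Spark's real scheduler code): a wide dependency, where one block is computed from many upstream blocks via a shuffle, forces a stage boundary, while narrow dependencies keep RDDs inside one stage. The dependency table mirrors FIG. 4.

```python
# Illustrative model (not Spark's real scheduler code) of the stage rule
# described above: a wide dependency (a shuffle, where one block is built
# from many upstream blocks) starts a new stage; narrow dependencies keep
# RDDs inside the same stage. The graph below mirrors FIG. 4.

deps = {
    "RDD1": [],
    "RDD2": [("RDD1", "wide")],    # every RDD2 block needs many RDD1 blocks
    "RDD3": [],
    "RDD4": [("RDD3", "narrow")],
    "RDD5": [("RDD2", "narrow"), ("RDD4", "wide")],  # join: RDD4 side arrives via shuffle
}

def stage_of(rdd, _memo={}):
    """Return (members, upstream_stages): the RDDs sharing `rdd`'s stage,
    and a frozenset for each stage it must wait for across a shuffle."""
    if rdd not in _memo:
        members, upstream = {rdd}, set()
        for parent, kind in deps[rdd]:
            p_members, p_upstream = stage_of(parent)
            if kind == "narrow":
                members |= p_members                 # same stage as the parent
                upstream |= p_upstream
            else:
                upstream.add(frozenset(p_members))   # parent stage must finish first
                upstream |= p_upstream
        _memo[rdd] = (members, upstream)
    return _memo[rdd]

members, upstream = stage_of("RDD5")
print(sorted(members))                       # ['RDD2', 'RDD5'] share the final stage
print(sorted(sorted(s) for s in upstream))   # [['RDD1'], ['RDD3', 'RDD4']]
```

The output matches the division in FIG. 4: RDD1 alone forms one stage, RDD3 and RDD4 another, and RDD2 with RDD5 the final stage that waits on both.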
As shown in FIG. 5, step 4 mainly stores the computation result into the memory designated by SparkContext after each Task of the last stage succeeds. In FIG. 5, the Tasks of stage1 and stage2 generate only intermediate results; each Task of stage3 produces part of the final result, and the final output is assembled by concatenating the results of the stage3 Tasks, possibly with sorting during concatenation. As shown in FIG. 6, if the query statement requires the results to be sorted, the Task results are stored in rank order; if not, the results are ordered by Task completion time, so the ordering of each query's results is random. In the sorted case, since every Task knows where its result should be ranked, the Task ranked first has already computed the head of the final result and the client could be notified immediately. In the unsorted case, since the client does not care about the ordering, whichever Task finishes first could report its result to the client at once; there is no need to wait for the other Tasks, and even with waiting, the final results are still ordered by completion time.
Step 5 mainly formats the result, an array of record rows, into a sequence of strings: each record row is converted to string format, and the column separators are replaced with tab characters. Finally, when the output module extracts the formatted result, it copies it into the output module and then outputs it. In fact, formatting the result is not necessary; formatting may look nicer, but it consumes considerable memory and performance, and in some cases the data is already tidy, so there is no need to format it.
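The row formatting in step 5 amounts to the following (a minimal Python sketch; the helper name is hypothetical, not a Spark API): each record row becomes one string whose column separator is a tab character.

```python
# Minimal sketch of step 5's formatting: each record row (a tuple of column
# values) is converted into one string, with tab characters as the column
# separators. `format_rows` is an illustrative name, not a Spark API.

def format_rows(rows, sep="\t"):
    """Format an array of record rows into a list of tab-separated strings."""
    return [sep.join(str(col) for col in row) for row in rows]

rows = [(1, "alice", 95), (2, "bob", 87)]
print(format_rows(rows))   # ['1\talice\t95', '2\tbob\t87']
```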
In summary, the prior-art query method based on the Spark big data processing platform has the following technical problems:
1. When the Spark big data processing platform executes a query, the user response time is too long; in particular, when analyzing larger-scale data, the response time exceeds what users can tolerate, and the delay grows in step with the volume of data analyzed.
2. The Spark big data processing platform does not support output of large-scale query results: the default configuration allows only 1 GB of query-result data to be output. Configuring the limit too low leaves memory resources underused; configuring it too high risks a memory-overflow exception if the result exceeds the memory actually remaining. Furthermore, on machines with little memory, the amount of data that can be output shrinks even more.
3. After Spark SQL obtains computation results from the Spark big data processing platform, it performs format conversion and data copying before the actual output, leaving multiple copies of identical or nearly identical data in memory. This wastes memory, degrades performance, and directly affects user response time and result storage capacity, and the effect grows as the output grows.
Summary of the Invention
The technical problem to be solved by the present invention is to provide a query method based on the Spark big data processing platform that, when executing a conventional simple query (with few DAG stages), can return results quickly even when the data to be processed is very large, and that, when executing a complex query, can greatly shorten the user response time relative to the prior art. For either kind of query, the method aims to output any result as soon as it satisfies the output conditions, without any delay.
To solve the above technical problem, in the query method of the present invention based on the Spark big data processing platform, when a Spark application submits a job to the Spark big data processing platform, it simultaneously passes a result-formatting rule, a result-output rule, and a notification of whether the results are to be sorted; inside the Spark platform, the processing strategy applied after a Task succeeds is set according to this information:
For a sorted query, it is judged whether the rank number of the current Task's result is the next position after the last output rank number. If it is, the result is output according to the result-formatting rule and output rule passed by the Spark application, and it is then checked, by rank number, whether consecutively ranked results of other Tasks are already stored immediately after it; if so, those results are output as well, and the memory occupied by any output result is released immediately. If it is not, the current Task's result is stored at the index position corresponding to its rank number. In this way, a result is output as soon as its rank number satisfies the output condition, without any delay: in the fastest case, the Task computing the head of the results finishes first; in the slowest case, it finishes last; on average, therefore, the response time is shortened by at least half.
For an unsorted query, as soon as each Task succeeds, its result is output according to the result-formatting rule and output rule passed by the Spark application, and the result is not stored. Thus, whenever a Task succeeds its result is output immediately; as the Task set completes, results are output continuously until the last completed Task is output. In this case the entire computation has no output delay: any new computation result is output at once. For analysis of large-scale data sets, the number of Tasks increases, but the size of the data block each Task processes is unchanged, and no matter how large the data set is, the first Task to complete outputs its result immediately; thus, for any simple query, even over an ultra-large data set, the first results can be output quickly.
For an unsorted query, the Spark big data processing platform no longer allocates memory for storing computation results; correspondingly, the Tasks of the last DAG stage output their results directly upon success. For a sorted query in which a Task result must be stored temporarily, it is judged whether memory is sufficient to hold that result; if not, the current job is terminated immediately and the Spark application is notified that the query result exceeds the system capacity, prompting the client to add filter conditions. Therefore, for unsorted queries, the Spark big data processing platform can output a continuous stream of massive query results, supporting scenarios in which queries return large volumes of data; for sorted queries, memory-overflow exceptions caused by overly large query results cannot occur.
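The sorted-query memory protection described above can be sketched as follows (illustrative Python; the class, exception name, and capacity figure are assumptions for illustration, not Spark's): before a Task result is buffered, the remaining capacity is checked, and a result that does not fit aborts the job with a capacity error.

```python
# Sketch of the sorted-query memory guard: before a Task result is buffered
# at its rank index, check whether it still fits; if not, terminate the job
# and report that the query result exceeds system capacity. All names and
# the capacity figure here are illustrative.

class ResultCapacityExceeded(Exception):
    """Raised to abort the job and prompt the client to add filter conditions."""

class GuardedBuffer:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.slots = {}                     # rank -> buffered result

    def store(self, rank, result_bytes):
        if self.used + len(result_bytes) > self.capacity:
            raise ResultCapacityExceeded(
                "query result exceeds system capacity; add filter conditions")
        self.slots[rank] = result_bytes
        self.used += len(result_bytes)

buf = GuardedBuffer(capacity_bytes=10)
buf.store(3, b"12345")                      # fits: 5 of 10 bytes used
try:
    buf.store(5, b"123456789")              # would exceed capacity: job aborted
except ResultCapacityExceeded as err:
    print("job terminated:", err)
```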
After the Spark SQL application obtains a computation result, it first judges whether the result is empty. If it is empty, the output flow is skipped; if not, formatting may be enabled or disabled by configuration, after which the output flow proceeds.
When the Spark SQL application outputs a result, it references the result directly instead of copying it to the output module.
Before submitting a job to the Spark big data processing platform, the Spark SQL application must predefine the result-formatting rule, the result-output rule, and the notification of whether the results are to be sorted, and pass this information when submitting the job; depending on configuration, the result-formatting rule may be empty.
Every job-submission interface of the Spark big data processing platform is given an overloaded counterpart; the overloaded interfaces add three parameters: the result-formatting rule, the result-output rule, and the notification of whether the results are to be sorted. Before the job is formally submitted, the post-Task-success processing strategy is set according to these three parameters. The Spark SQL application uses the overloaded interfaces when submitting jobs.
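One way to picture the overloaded submission interface (a hypothetical Python sketch; Spark's real SparkContext is JVM code and none of these names are its actual API): the legacy call shape is preserved, and the overload accepts the three extra parameters that configure the Task-success strategy.

```python
# Hypothetical sketch of the overloaded job-submission interface (Python for
# illustration; Spark's real SparkContext is JVM code and these names are
# not its actual API). The legacy call shape is preserved; the overload
# takes the three extra parameters and emits results as Tasks complete.

class SparkContextSketch:
    def submit_job(self, partitions, format_rule=None, output_rule=None,
                   sorted_results=False):
        if output_rule is None:
            # Legacy path: accumulate everything, return it when the job ends.
            return [row for part in partitions for row in part]
        # Overloaded path: the three parameters set the Task-success strategy;
        # each partition stands in for one Task, whose result is emitted
        # immediately and never stored (unsorted case shown here).
        fmt = format_rule or (lambda rows: rows)
        for part in partitions:
            output_rule(fmt(part))
        return None                          # results were already emitted

sc = SparkContextSketch()
out = []
sc.submit_job([[1, 2], [3]], format_rule=lambda rows: [r * 10 for r in rows],
              output_rule=out.extend)
print(out)                                   # [10, 20, 30]
print(sc.submit_job([[1, 2], [3]]))          # legacy path: [1, 2, 3]
```

Existing callers that pass only the job keep the legacy behavior, matching the requirement that prior applications remain unaffected.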
Compared with the prior art, the query method of the present invention based on the Spark big data processing platform has the following beneficial effects.
1. When executing a conventional simple query (with few DAG stages), the method returns results quickly even when the data to be processed is very large; when executing a complex query, it greatly shortens the user response time relative to the prior art. For either kind of query, any result that satisfies the output conditions is output immediately, without delay. For any query over data below large scale, and for simple queries over ultra-large-scale data, the first batch of results appears within 3 s, i.e., the client response time is always kept within 3 s; for complex queries over ultra-large-scale data, the client response is substantially accelerated relative to prior-art implementations.
2. Unsorted queries can output massive query results, up to the total volume of stored data; sorted queries are guaranteed not to trigger a memory-overflow exception due to overly large output, and in many cases the amount of data that can be output is greatly increased.
3. The Spark SQL application can output both formatted and unformatted results.
4. When the output module of the Spark SQL application obtains a query result, it directly references the result obtained by the job-submission module, avoiding result copying.
5. The Spark big data processing platform itself has the result-output function, and the rules for formatting and outputting results are defined by the Spark application, so the function applies to all Spark applications.
6. Existing Spark applications can still use the original interfaces when submitting jobs and are unaffected.
Brief Description of the Drawings
The query method of the present invention based on the Spark big data processing platform is described in further detail below with reference to the accompanying drawings and specific embodiments.
FIG. 1 is an architecture diagram of query execution by Spark SQL in the prior art.
FIG. 2 is a framework diagram of DAG generation by Spark SQL in the prior art.
FIG. 3 is a flowchart of job submission by SparkContext in the prior art.
FIG. 4 is a schematic diagram of RDD stage division in the prior art.
FIG. 5 is a process diagram of stage-by-stage DAG execution in the prior art.
FIG. 6 is a schematic diagram of storing sorted-query Task results in the prior art.
FIG. 7 is a flowchart of implementing zero-delay output of query results according to an embodiment of the present invention.
FIG. 8 is a flowchart of implementing sorted-query processing in the Task-success processing strategy according to an embodiment of the present invention.
FIG. 9 is a flowchart of implementing unsorted-query processing in the Task-success processing strategy according to an embodiment of the present invention.
FIG. 10 is a flowchart of implementing support for massive query results for unsorted queries according to an embodiment of the present invention.
FIG. 11 is a flowchart of implementing memory protection of query results for sorted queries according to an embodiment of the present invention.
FIG. 12 is a flowchart of implementing the processing of job query results by Spark SQL according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it.
Embodiment 1:
As shown in FIG. 7, zero-delay result output in the query method of this embodiment based on the Spark big data processing platform is implemented as follows:
SparkContext, the application programming interface of the Spark big data processing platform, provides a new job-submission interface; the new interface requires the caller to pass the result-output rule, the result-formatting rule, and the notification of whether the results are to be sorted.
The new interface redefines the result-processing strategy applied when a Task succeeds, which is executed on the Task-success event. Under this strategy, the Task result is no longer simply stored; instead, according to whether the results are to be sorted, it is judged whether the result satisfies the immediate-output condition. If it does, the result is formatted and then output directly according to the result-formatting rule and output rule defined and passed by the Spark application, and is not stored after output. If the output condition is not yet satisfied, the result is stored temporarily; when the next Task succeeds, the output condition is checked again, and any result that now satisfies it is output immediately and its storage memory released.
Spark applications under development can use the new interface to achieve zero-delay result output; existing Spark applications can still work normally with the original interface.
For example, a Spark SQL application processes structured data, so its results are formatted as an array of record rows; because the column separators are to be replaced with tab characters, the record-row array is further formatted into a string array so that string character replacement can perform the substitution, yielding results whose column separator is a tab. As for output, since the Spark SQL application is a command-line program whose results are printed directly to the console, outputting the results means printing them to the designated console. When the Spark SQL application processes a SQL statement, it determines whether the results are to be sorted by whether the statement contains an order by clause (sorting clause).
The result-formatting rule, the result-output rule, and whether the results are to be sorted are specific to each Spark application; this information therefore needs to be passed to the Spark big data processing platform when the application submits a job, and is ultimately used when a Task succeeds so that results can be output accurately and promptly.
When all Tasks have succeeded, the entire job ends successfully; since all results were already output as the Tasks succeeded, the Spark application need not execute any output flow after the job ends and can immediately proceed to submit the next query job.
When a Task fails, the entire job is declared failed; upon being notified that the job ended in failure, the Spark application outputs an error message and waits for the next query job to be submitted.
Embodiment 2:
As shown in FIG. 8, sorted-query processing in the Task-success processing strategy of the query method of this embodiment based on the Spark big data processing platform is implemented as follows:
In the Task-success processing strategy, it is determined whether the current query requires the results to be sorted; if so, the sorted-query processing procedure is applied, detailed as follows:
It is judged whether the rank number of the current Task's result is the next position after the last output rank number. If it is, the result is formatted and then output according to the formatting and output rules passed by the Spark application, and it is then checked, by rank number, whether consecutively ranked results of other Tasks are already stored immediately after it; if so, those results are output as well, and the memory they occupy is released after output. If it is not, the current Task's result is stored at the index position corresponding to its rank number.
It should be noted that the rank numbers of Task results are computed in advance by the earlier stages, before the Spark big data processing platform executes the last stage of the job; every Task of the last stage knows the rank number of its own result, so once an individual Task finishes, its rank position is already determined, without waiting for all Tasks to finish. The Tasks mentioned in the embodiments of the present invention all refer to Tasks of the last stage of a job.
For example, consider a query whose returned results are as shown in Table 1.
Table 1:
Task  | Completion order | Rank of its result
Task1 | 1st              | 3
Task2 | 2nd              | 1
Task3 | 3rd              | 5
Task4 | 4th              | 4
Task5 | 5th              | 7
Task6 | 6th              | 2
Task7 | 7th              | 6
Task1 finishes first. Its result ranks 3rd, but output has so far reached only position 0 (i.e., nothing has been output yet), so it cannot be output to the client; it is stored at index position 3.
Task2 finishes second. Its result ranks 1st, so it is output immediately and not stored; it is then checked whether consecutively ranked results exist at the index positions immediately following (starting at position 2). Since there is no result at index position 2, nothing further is done, and the current output rank number is updated to 1.
Task3 finishes third. Its result ranks 5th, but output has reached only position 1, so it cannot be output; it is stored at index position 5.
Task4 finishes fourth. Its result ranks 4th, but output has reached only position 1, so it cannot be output; it is stored at index position 4.
Task5 finishes fifth. Its result ranks 7th, but output has reached only position 1, so it cannot be output; it is stored at index position 7.
Task6 finishes sixth. Its result ranks 2nd, and output has reached position 1, so it can be output immediately; it is then checked whether consecutively ranked results exist at the following index positions (starting at position 3). Positions 3, 4, and 5 all hold results, so these three consecutive results are output as well; position 7 holds a result, but the result for position 6 has not yet arrived, so it cannot be output. Finally, the memory occupied by the results at positions 3, 4, and 5 is released and the current output rank number is updated to 5.
Task7 finishes last. Its result ranks 6th, and output has reached position 5, so it can be output immediately; it is then checked whether consecutively ranked results exist at the following index positions (starting at position 7). Position 7 holds a result, so that result is output as well; finally, the memory occupied by the result at position 7 is released and the current output rank number is updated to 7.
At this point, all Tasks have finished, and all results have been output in order.
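The walkthrough above can be replayed with a short sketch of the sorted-output strategy (illustrative Python; in the invention this logic runs inside the Spark platform's Task-success handler): a result is emitted as soon as it extends the contiguous output prefix, and is otherwise buffered at its rank index and freed once emitted.

```python
# Runnable sketch of the sorted-query strategy walked through above:
# results arrive in Task-completion order, each tagged with its rank;
# a result is emitted as soon as it extends the contiguous output prefix,
# otherwise it is buffered at its rank index and released once emitted.
# (Illustrative Python; class and method names are not Spark's API.)

class OrderedEmitter:
    def __init__(self, output_rule):
        self.output_rule = output_rule
        self.next_rank = 1          # next rank number expected for output
        self.buffer = {}            # rank -> stored result awaiting its turn

    def on_task_success(self, rank, result):
        if rank != self.next_rank:
            self.buffer[rank] = result          # store at its rank index
            return
        self.output_rule(result)                # output immediately
        self.next_rank += 1
        # Drain any consecutively ranked results already buffered.
        while self.next_rank in self.buffer:
            self.output_rule(self.buffer.pop(self.next_rank))  # pop frees memory
            self.next_rank += 1

# Replay Table 1: Tasks finish in order 1..7 with result ranks 3,1,5,4,7,2,6.
emitted = []
em = OrderedEmitter(emitted.append)
for rank in [3, 1, 5, 4, 7, 2, 6]:
    em.on_task_success(rank, f"row-{rank}")
print(emitted)   # ['row-1', 'row-2', 'row-3', 'row-4', 'row-5', 'row-6', 'row-7']
```

Replaying the Table 1 sequence produces the fully ordered output, with the buffer empty at the end, just as in the step-by-step account above.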
Embodiment 3:
As shown in FIG. 9, unsorted-query processing in the Task-success processing strategy of the query method of this embodiment based on the Spark big data processing platform is implemented as follows:
In the Task-success processing strategy, it is determined whether the current query requires the results to be sorted; if not, the unsorted-query processing procedure is applied, detailed as follows:
The result is first formatted according to the result-formatting rule passed by the Spark application and then output according to the result-output rule passed by the Spark application; this completes the Task-success processing.
No Task results are stored: whichever Task finishes first is output first, one at a time as each finishes, until all Tasks have finished. For example, consider the query shown in Table 2:
Table 2:
Figure PCTCN2016095353-appb-000003
Task1 finishes first; since no ordering of the results is required, it can be output immediately. The other Tasks are processed the same way, so by the time all Tasks have finished executing, the results have already been output.
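The non-sorted path above can be sketched in a few lines: each Task result is formatted and output the moment it arrives, in completion order, with nothing buffered. The function name `handle_success` and the formatting/output callables are illustrative assumptions, not names from the patent.

```python
def handle_success(result, format_rule, output_rule):
    # Non-sorted query: format, then output immediately; the result is not stored.
    output_rule(format_rule(result))

out = []
for r in ["r3", "r1", "r2"]:   # arbitrary Task completion order
    handle_success(r, format_rule=str.upper, output_rule=out.append)
```

Results appear in completion order (`R3`, `R1`, `R2`), which is exactly why this path needs no buffering and no per-result memory accounting.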
Embodiment 4:
As shown in FIG. 10, the way this embodiment's query method based on the Spark big data processing platform supports massive query results for non-sorted queries is implemented as follows:
When the query executed by the user does not require sorted results, no Task stores its subset of the query results. The Spark big data processing platform therefore does not allocate indexed memory for storing Task results, and returns a null value when returning results to the Spark application.
Because query results are no longer accumulated in storage, whenever a new Task succeeds its result is held in memory only temporarily; as soon as the result has been output, the memory it occupies is reclaimed automatically. As Task results keep arriving, memory usage stays within a small, fixed bound, so the Spark big data processing platform can stream computation results continuously and supports the return of massive query results.
For example, suppose each data file block is 256 MB and each file block is queried by one Task. Since a Task's query result is a subset of the file block's contents, the result is at most 256 MB. Accordingly, the memory the Spark big data processing platform consumes for storing Task results during query processing always stays between 0 and 256 MB; no matter how many file blocks and Tasks there are, as long as the platform-managed memory reserved for Task results exceeds 256 MB, all Task results can be received and processed.
Embodiment 5:
As shown in FIG. 11, the memory protection that this embodiment's query method based on the Spark big data processing platform applies to query results for sorted queries is implemented as follows:
When the query executed by the user requires sorted results, some Task results may need to be stored temporarily because they do not satisfy the immediate-output condition. To prevent memory overflow caused by overly large results, the Spark big data processing platform checks whether memory space is sufficient when allocating memory to store each Task's result; if it is not, the platform terminates the current job and notifies the Spark application that the job has failed.
In the best case, all Tasks happen to finish in the order of their result ranks; no Task result needs to be stored, and every result can be output directly. For example, if Task1 finishes first and its result happens to rank first, the output condition is satisfied and it is output directly; Task2 finishes second with the second-ranked result and is output directly; and so on, with all Tasks outputting in order. In the worst case, the Task whose result ranks first finishes last; then every Task result except that last one must be stored temporarily. Most runs fall between these two extremes, and a temporarily stored Task result is output, and its memory released, as soon as the output condition is met, so even for sorted queries the Spark big data processing platform can, with some probability, support the return of massive query results. In the bad case where the volume of temporarily stored Task results would exceed the memory limit, a little memory protection avoids a system out-of-memory exception caused by insufficient memory allocation.
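The memory guard described above can be sketched as a budget check performed before any out-of-order result is buffered: if the buffered total would exceed the limit, the job is terminated and the client is told to add filter conditions. The names `buffer_result` and `JobAborted`, and the sizes and limit used here, are illustrative assumptions, not values from the patent.

```python
class JobAborted(Exception):
    """Raised when buffering another sorted-query result would exceed the budget."""

def buffer_result(buffered, rank, result, size, used, limit):
    # Memory protection: refuse to buffer if the budget would be exceeded.
    if used + size > limit:
        raise JobAborted("query result exceeds system capacity; add filter conditions")
    buffered[rank] = result
    return used + size

buffered, used, LIMIT = {}, 0, 256
used = buffer_result(buffered, 3, "r3", 100, used, LIMIT)   # fits: used becomes 100
used = buffer_result(buffered, 4, "r4", 100, used, LIMIT)   # fits: used becomes 200
aborted = False
try:
    used = buffer_result(buffered, 5, "r5", 100, used, LIMIT)  # 300 > 256
except JobAborted:
    aborted = True   # job terminated; Spark application notified of failure
```

The guard only fires on results that must be buffered; results satisfying the immediate-output condition bypass it entirely, which is why the best-case execution never hits the limit.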
Embodiment 6:
As shown in FIG. 12, the way Spark SQL processes job query results in this embodiment's query method based on the Spark big data processing platform is implemented as follows:
To keep the Spark SQL application compatible with both the old and the new job-submission interfaces, when the Spark SQL application receives the result returned at the end of a job, it first checks whether the result is empty (NULL). If it is, the output flow is skipped and the query ends; if it is not, the application decides from the newly added configuration property whether to format the result, formats it if so configured, and then proceeds to the output flow.
Embodiment 7:
In this embodiment, direct referencing of query results by Spark SQL in the query method based on the Spark big data processing platform is implemented as follows:
When a single Task result, or the result of the whole job, is passed to the output module of the Spark SQL application, the output module accesses the result's memory directly and prints it to the console, without allocating a new block of memory and copying the result into it.
Because duplicate copies of the result are avoided, memory consumption is reduced and the time cost of copying is saved, which to some extent supports returning more query results and responding to clients faster.
Embodiment 8:
In this embodiment, the way the Spark application passes its result-output rule, result-formatting rule, and sort-or-not notification to the Spark big data processing platform is implemented as follows:
1. The result-output rule is implemented as a function; it is simply called wherever results are to be output. The function is ultimately passed to the on-Task-success handling strategy, which is itself implemented as a function containing a piece of business-logic processing to be invoked when a Task succeeds;
2. The result-formatting rule is also implemented as a function. Formatting is optional and controlled by a configuration switch: if the switch is on, the formatting-rule function contains concrete formatting steps; if the switch is off, it contains no steps. The formatting rule is ultimately passed to the on-Task-success handling strategy;
3. The sort-or-not notification is implemented as a variable. After the Spark application generates the execution plan, whether the current query requires sorting can be determined dynamically from whether the plan's node type is Sort; a Boolean variable stores this determination and is passed to the on-Task-success handling strategy.
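The three items above can be sketched as plumbing: the application hands the platform an output function, a formatting function (a no-op when the configuration switch is off), and a Boolean sort flag, and the platform composes them into the on-Task-success strategy. All names here (`make_on_success`, `submit_job`, `FORMAT_SWITCH_ON`) are illustrative assumptions; the real strategy's sorted branch would also route results through the rank buffer, which is omitted to keep the parameter passing in focus.

```python
def make_on_success(format_rule, output_rule, results_sorted):
    # Composes the on-Task-success strategy from the three passed-in parameters.
    def on_success(result):
        # Both branches format then output here; the sorted branch of the
        # actual platform would additionally consult the rank-index buffer.
        output_rule(format_rule(result))
    return on_success

def submit_job(tasks, format_rule, output_rule, results_sorted):
    # Overloaded submission interface: sets the strategy before running Tasks.
    on_success = make_on_success(format_rule, output_rule, results_sorted)
    for t in tasks:            # stands in for Tasks finishing on the platform
        on_success(t)

FORMAT_SWITCH_ON = True
fmt = (lambda r: f"[{r}]") if FORMAT_SWITCH_ON else (lambda r: r)  # no-op when off
printed = []
submit_job(["a", "b"], fmt, printed.append, results_sorted=False)
```

Because the rules are ordinary functions, any module can reuse the immediate-output feature by supplying its own pair, which is the reuse property claimed for the overloaded interface.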
The above describes only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto; any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of the claims.
The innovations of this embodiment are as follows.
1. During a query, as soon as a new result is computed and satisfies the output condition, it is output immediately; there is no need to wait until all parallel tasks have computed their results before outputting them together.
2. Each result subset of the query releases its memory as soon as it is output; because memory is freed promptly, new result subsets can be received continuously.
3. Receiving Task results and allocating memory to store them is protected: when the system cannot accommodate more results, the job is terminated and the client is notified, effectively avoiding program exceptions.
4. Whether query results are formatted into a neat, uniform layout is optional, and formatting is done in batches, effectively reducing the number of copies of large data sets held in memory.
5. When the Spark SQL application outputs query results, it accesses the result memory directly, without reallocating memory and copying the result.
6. The Spark application needs to pass its result-output rule and result-formatting rule to the Spark big data processing platform. This ensures that the scheme of the present invention can be reused by other modules: a module that wants to use the immediate-output feature only needs to pass its output rule and formatting rule to the Spark big data processing platform, and the system then knows how to output, because different modules may process data, and produce output, in different formats.
Compared with the prior art, this embodiment has the following beneficial effects.
1. When executing a big-data query, client response time is greatly shortened; the first result can be seen in a very short time.
2. There is no limit on the number or size of results for non-sorted queries; massive results can be output continuously.
3. For sorted queries, memory protection is applied to the output results, effectively preventing the Spark big data processing platform from crashing abnormally because the output is too large.
4. The Spark SQL application can choose, via configuration, whether to format computation results; in some scenarios the output data does not need to look neat, so memory and performance overhead can be reduced to some extent.
5. When obtaining query results, the output module of the Spark SQL application directly references the result obtained by the job-submission module, avoiding result copying;
6. The Spark big data processing platform provides the result-output capability, while the rules for formatting and outputting results are defined by the Spark application, so the feature applies to all Spark applications;

Claims (6)

  1. A query method based on a Spark big data processing platform, wherein, when a Spark application submits a job to the Spark big data processing platform, it also passes a result-formatting rule, a result-output rule, and a notification of whether the results are to be sorted, and Spark internally sets a handling strategy for when a Task succeeds, characterized in that:
    for a sorted query, it is determined whether the rank number of the current Task result immediately follows the last output rank number; if so, the result is output according to the result-formatting rule and the result-output rule passed by the Spark application, and it is then determined from the rank numbers whether consecutively ranked, already-stored results of other Tasks immediately follow; if so, those results are output as well, and the memory occupied by results that have been output is released immediately; if not, the current Task result is stored at the index position corresponding to its rank number;
    for a non-sorted query, as soon as each Task succeeds, its result is output according to the result-formatting rule and the result-output rule passed by the Spark application, and the result is not stored.
  2. The query method based on a Spark big data processing platform according to claim 1, characterized in that: for a non-sorted query, the Spark big data processing platform no longer allocates memory for storing computation results, and accordingly each Task of the last stage of the job outputs its result directly upon success; for a sorted query in which a Task result needs to be stored temporarily, it is determined whether memory is sufficient to hold the Task result, and if it is not, the current job is terminated immediately and the Spark application is notified that the query result exceeds the system capacity, prompting the client to add filter conditions.
  3. The query method based on a Spark big data processing platform according to claim 1, characterized in that: after Spark SQL, the interactive SQL query engine application integrated in the Spark big data processing platform, obtains a computation result, it first checks whether the result is empty; if it is empty, the output flow is skipped; if it is not empty, whether to format can be selected according to the configuration, after which the output flow proceeds.
  4. The query method based on a Spark big data processing platform according to claim 3, characterized in that: when the Spark SQL application outputs a result, it references the result directly instead of copying it again to the output module.
  5. The query method based on a Spark big data processing platform according to claim 1, characterized in that: before the Spark SQL application submits a job to the Spark big data processing platform, the result-formatting rule, the result-output rule, and the notification of whether results are to be sorted must be defined in advance and passed when the job is submitted, wherein the result-formatting rule may be empty depending on the configuration.
  6. The query method based on a Spark big data processing platform according to claim 1, characterized in that: every interface of the Spark big data processing platform related to job submission is overloaded, and the overloaded interfaces add three parameters: the result-formatting rule, the result-output rule, and the notification of whether results are to be sorted; before the job is formally submitted, the on-Task-success handling strategy is set according to these three parameters; and the Spark SQL application uses the overloaded interfaces when submitting jobs.
PCT/CN2016/095353 2015-12-15 2016-08-15 Query method based on spark big data processing platform WO2017101475A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510930909.1A CN105550318B (en) 2015-12-15 2015-12-15 A kind of querying method based on Spark big data processing platforms
CN201510930909.1 2015-12-15

Publications (1)

Publication Number Publication Date
WO2017101475A1 true WO2017101475A1 (en) 2017-06-22

Family

ID=55829507

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/095353 WO2017101475A1 (en) 2015-12-15 2016-08-15 Query method based on spark big data processing platform

Country Status (2)

Country Link
CN (1) CN105550318B (en)
WO (1) WO2017101475A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109582706A (en) * 2018-11-14 2019-04-05 重庆邮电大学 The neighborhood density imbalance data mixing method of sampling based on Spark big data platform
CN110659292A (en) * 2019-09-21 2020-01-07 北京海致星图科技有限公司 Spark and Ignite-based distributed real-time graph construction and query method and system
CN112612584A (en) * 2020-12-16 2021-04-06 远光软件股份有限公司 Task scheduling method and device, storage medium and electronic equipment
CN113392140A (en) * 2021-06-11 2021-09-14 上海达梦数据库有限公司 Data sorting method and device, electronic equipment and storage medium

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550318B (en) * 2015-12-15 2017-12-26 深圳市华讯方舟软件技术有限公司 A kind of querying method based on Spark big data processing platforms
CN106372127B (en) * 2016-08-24 2019-05-03 云南大学 The diversity figure sort method of large-scale graph data based on Spark
CN106909621B (en) * 2017-01-17 2020-02-11 中国科学院信息工程研究所 Accelerated IPC (International Process control) code-based query processing method
CN107480202B (en) * 2017-07-18 2020-06-02 湖南大学 Data processing method and device for multiple parallel processing frameworks
CN110019497B (en) * 2017-08-07 2021-06-08 北京国双科技有限公司 Data reading method and device
CN107609130A (en) * 2017-09-18 2018-01-19 链家网(北京)科技有限公司 A kind of method and server for selecting data query engine
CN108062251B (en) * 2018-01-09 2023-02-28 福建星瑞格软件有限公司 Server resource recovery method and computer equipment
CN108536727A (en) * 2018-02-24 2018-09-14 国家计算机网络与信息安全管理中心 A kind of data retrieval method and device
CN108874897B (en) * 2018-05-23 2019-09-13 新华三大数据技术有限公司 Data query method and device
CN110109747B (en) * 2019-05-21 2021-05-14 北京百度网讯科技有限公司 Apache Spark-based data exchange method, system and server

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130346988A1 (en) * 2012-06-22 2013-12-26 Microsoft Corporation Parallel data computing optimization
CN104239501A (en) * 2014-09-10 2014-12-24 中国电子科技集团公司第二十八研究所 Mass video semantic annotation method based on Spark
US9135559B1 (en) * 2015-03-20 2015-09-15 TappingStone Inc. Methods and systems for predictive engine evaluation, tuning, and replay of engine performance
CN105550318A (en) * 2015-12-15 2016-05-04 深圳市华讯方舟软件技术有限公司 Spark big data processing platform based query method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102799622B (en) * 2012-06-19 2015-07-15 北京大学 Distributed structured query language (SQL) query method based on MapReduce expansion framework
CN103995827B (en) * 2014-04-10 2017-08-04 北京大学 High-performance sort method in MapReduce Computational frames
CN104951509A (en) * 2015-05-25 2015-09-30 中国科学院信息工程研究所 Big data online interactive query method and system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TENCENT;: "Tencent Big Data TDW Calculation Engine Analysis-Shuffle", 11 July 2014 (2014-07-11), Retrieved from the Internet <URL:http://data.qq.com/article?id=543> *


Also Published As

Publication number Publication date
CN105550318A (en) 2016-05-04
CN105550318B (en) 2017-12-26

Similar Documents

Publication Publication Date Title
WO2017101475A1 (en) Query method based on spark big data processing platform
Hu et al. Flutter: Scheduling tasks closer to data across geo-distributed datacenters
US8209703B2 (en) Apparatus and method for dataflow execution in a distributed environment using directed acyclic graph and prioritization of sub-dataflow tasks
US8959519B2 (en) Processing hierarchical data in a map-reduce framework
Stergiou et al. Shortcutting label propagation for distributed connected components
CN110308984B (en) Cross-cluster computing system for processing geographically distributed data
US20240061712A1 (en) Method, apparatus, and system for creating training task on ai training platform, and medium
US11487555B2 (en) Running PBS jobs in kubernetes
CN106569896B (en) A kind of data distribution and method for parallel processing and system
US11748164B2 (en) FAAS distributed computing method and apparatus
US8458136B2 (en) Scheduling highly parallel jobs having global interdependencies
CN106383746A (en) Configuration parameter determination method and apparatus of big data processing system
Shen et al. Performance prediction of parallel computing models to analyze cloud-based big data applications
Henzinger et al. Scheduling large jobs by abstraction refinement
WO2023284171A1 (en) Resource allocation method and system after system restart, and related component
Leida et al. Distributed SPARQL query answering over RDF data streams
Alemi et al. CCFinder: using Spark to find clustering coefficient in big graphs
US9436503B2 (en) Concurrency control mechanisms for highly multi-threaded systems
Zhao et al. A data locality optimization algorithm for large-scale data processing in Hadoop
CN114168594A (en) Secondary index creating method, device, equipment and storage medium of horizontal partition table
CN113868249A (en) Data storage method and device, computer equipment and storage medium
Huang et al. Improving speculative execution performance with coworker for cloud computing
CN114240632A (en) Batch job execution method, apparatus, device, medium, and product
CN113504966A (en) GPU cluster scheduling strategy simulation method and GPU cluster simulator
US10303567B2 (en) Managing database nodes

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16874535

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16874535

Country of ref document: EP

Kind code of ref document: A1