WO2017101475A1 - Query method based on spark big data processing platform - Google Patents

Query method based on spark big data processing platform

Info

Publication number
WO2017101475A1
WO2017101475A1 · PCT/CN2016/095353 · CN 2016095353 W
Authority
WO
WIPO (PCT)
Prior art keywords
result
spark
output
query
task
Prior art date
Application number
PCT/CN2016/095353
Other languages
French (fr)
Chinese (zh)
Inventor
万修远
Original Assignee
深圳市华讯方舟软件技术有限公司
华讯方舟科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市华讯方舟软件技术有限公司 and 华讯方舟科技有限公司
Publication of WO2017101475A1 publication Critical patent/WO2017101475A1/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 — of structured data, e.g. relational data
    • G06F16/24 — Querying
    • G06F16/245 — Query processing
    • G06F16/2453 — Query optimisation
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 — of structured data, e.g. relational data
    • G06F16/24 — Querying
    • G06F16/245 — Query processing
    • G06F16/2458 — Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 — Distributed queries

Definitions

  • the invention relates to a query method for data processing, in particular to a query method based on a Spark big data processing platform.
  • MapReduce performs computations in parallel on multiple nodes of the cluster, which greatly speeds up queries.
  • As data volumes grow, however, MapReduce gradually becomes inadequate, and the Spark computing framework, based on in-memory computing, emerged in response. Spark's query speed is up to 100 times that of Hadoop, making it the most advanced distributed parallel computing framework.
  • With the development of the Spark ecosystem, Spark SQL, Spark Streaming, MLlib, GraphX and others have been built on top of it. Spark SQL is a tool developed for SQL users that allows structured data to be analyzed and queried in the SQL language.
  • the query method based on the Spark big data processing platform can be divided into five steps:
  • Step 1: After receiving the user's SQL statement, the Spark SQL application performs syntax parsing, execution-strategy optimization, and job (query job) generation, and finally submits the job by calling the SparkContext interface of the Spark platform.
  • Step 2: After receiving the job, SparkContext defines how the calculation result is to be stored once a Task (calculation task) executes successfully, then submits the job to the eventProcessActor, waits for the eventProcessActor to signal that the job has finished, and finally returns the calculation result to Spark SQL.
  • Step 3: After receiving the job-submission event, the eventProcessActor allocates multiple Tasks on the various nodes to begin parallel computation.
  • Step 4: After each Task finishes, it reports its status and result to the eventProcessActor, which tracks whether all Tasks of the job have completed; if so, it notifies SparkContext that the submitted job has ended, and SparkContext returns the calculation result to Spark SQL.
  • Step 5: After Spark SQL obtains the calculation result, it first performs format conversion, then sends a copy to the output module, which finally outputs the result.
  • Step 1 mainly parses the syntax of the SQL statement and generates a group of RDDs representing one job.
  • An RDD is a distributed data structure that describes both the distributed data to be processed and the algorithm for processing it; one RDD therefore represents one operation on the data, and a group of RDDs is a sequence of operations. Completing that sequence of operations in order completes one query calculation. Spark adopts a lazy-execution strategy: operations are not executed one by one as they are defined; instead, the sequence of operations is built first and then sent to the executor for execution. Because the operations represented by the group of RDDs are ordered and contain no cycles, their logical dependency graph is called a directed acyclic graph (DAG); in the DAG, a downstream RDD is generated by performing an operation on an upstream RDD.
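The lazy-execution strategy above can be illustrated with a minimal pure-Python sketch (not Spark's implementation): each transformation only records itself in an operation sequence, and nothing runs until the sequence is finally sent for execution. The class and method names here are illustrative only.

```python
# A minimal sketch of lazy execution: operations are recorded first,
# and only executed when the whole sequence is submitted.

class LazyPipeline:
    def __init__(self, data):
        self.data = data
        self.ops = []            # the recorded operation sequence

    def map(self, fn):
        self.ops.append(("map", fn))
        return self              # nothing is executed yet

    def filter(self, pred):
        self.ops.append(("filter", pred))
        return self              # still nothing executed

    def collect(self):
        # only now is the operation sequence executed, in order
        result = self.data
        for kind, fn in self.ops:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

p = LazyPipeline([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(p.collect())   # [20, 30, 40]
```

In Spark the recorded sequence is the DAG of RDD operations; here it is just a flat list, which is enough to show why defining an operation and executing it are separate steps.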
  • Step 2 mainly submits the DAG to the eventProcessActor, which runs in a thread environment separate from the submitting thread.
  • Each Task result is a subset of the final result, so it is not necessary to output all the result subsets together.
  • The size of the result that can be returned is directly limited by the memory size allotted to the program.
  • Step 3 mainly implements the division of the DAG into stages and the generation of a Task set for each stage. All Tasks within a stage perform the same operations on different data, so they can execute in parallel; Tasks in different stages, however, may not be able to run in parallel.
  • Each dark-gray filled rectangle represents a data block, and each data block is computed by a corresponding Task. Because a data block of RDD2 is calculated from multiple data blocks of RDD1, all Tasks of RDD1 must finish before RDD2 can start computing, so RDD1 and RDD2 must be divided into two different stages. When RDD5 is computed from RDD2, each data block is independent of the others: the Task computing one block of RDD2
  • does not need to wait for the other blocks before starting the calculation that produces RDD5 (here, a join operation), so RDD2 and RDD5 can belong to the same stage. Similarly, RDD3 and RDD4 can belong to the same stage, but RDD4 and RDD5 cannot. In Figure 4, stage1 and stage2 are independent of each other and can execute in parallel, while stage3 depends on both stage1 and stage2 and must therefore wait for them to complete before executing.
  • Step 4 mainly stores the calculation result in the memory specified by SparkContext after each Task of the last stage executes successfully. In Figure 5, the Tasks of stage1 and stage2 produce only intermediate results; each Task of stage3 produces part of the final result, and the final output is spliced together from the results of the stage3 Tasks, possibly with sorting during the splicing.
  • For a sorted query, the Task results are stored in rank order; if the results are not sorted, they are stored in the order in which the Tasks complete, so the order of the results of each query will be random.
  • Step 5 mainly formats the result, which is an array of record rows, into a sequence of strings: each row record is converted to string format, and the column delimiter is replaced with a tab character. Finally, when the output module fetches the formatted result, a copy is made for the output module, which then outputs it. In fact, formatting the result is not strictly necessary: formatting may look better, but it consumes considerable memory and performance, and in some cases the data itself is already neat and needs no formatting.
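The prior-art step-5 formatting can be sketched in a few lines; the row layout and column values here are illustrative assumptions, but the transformation is the one described: each record row becomes a string with tab characters as the column delimiter.

```python
# A minimal sketch of step-5 formatting: convert an array of record rows
# into a sequence of tab-delimited strings.

rows = [("alice", 30, "cn"), ("bob", 25, "us")]   # result as an array of record rows

# each row record is converted to string format, columns joined by a tab
formatted = ["\t".join(str(col) for col in row) for row in rows]

for line in formatted:
    print(line)
```

As the text notes, this step is optional: it produces neater output at the cost of allocating a second, string-typed copy of every result row.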
  • the query method based on the Spark big data processing platform in the prior art has the following technical problems:
  • When the Spark big data processing platform executes a query, the user response time is too long; in particular, when large-scale data is analyzed, the response time exceeds what users can tolerate, and the delay grows in step with the amount of data being analyzed.
  • the Spark big data processing platform does not support the output of large-scale query results.
  • The default configuration allows only 1 GB of query result data to be output. If the limit is configured too small, memory resources are not fully utilized; if it is configured larger than the remaining memory space allows, a memory-overflow exception results. In addition, on machines with a low memory configuration, the amount of data that can be output is further reduced.
  • Before the result is formally output, Spark SQL performs format conversion and data copying, which leaves multiple identical or similar copies of the data in memory. This wastes memory resources and reduces performance, directly affecting user response time and the amount of result data that can be stored, and the effect grows as the output grows.
  • The technical problem to be solved by the present invention is to provide a query method based on the Spark big data processing platform.
  • When the query method performs a conventional simple query (one whose DAG has relatively few stages), it can return quickly even if the data to be processed is very large.
  • When a complicated query is performed, the user response time can be greatly shortened relative to the original method: whichever query is executed, a result is output immediately, without any delay, as soon as it satisfies the output condition.
  • In the query method based on the Spark big data processing platform of the present invention, when the Spark application submits a job to the Spark big data processing platform, it transmits a result-formatting rule, a result-output rule, and a notification of whether the result is to be sorted.
  • Based on these, the processing strategy applied after a Task executes successfully is set:
  • For a sorted query, it is judged whether the rank number of the current Task's result is the position immediately after the last output rank. If so, the result is output according to the result-formatting and result-output rules passed by the Spark application; it is then judged, from the rank numbers, whether consecutively ranked results of other Tasks are already stored in the queue, in which case they are output together, and the memory of everything already output is released immediately. If not, the current Task's result is stored at the index position of its rank number. In this way, as soon as a result satisfies the output condition by rank, it is output without any delay. In the fastest case, the Task computing the first-ranked result completes first; in the slowest case it completes last; on average, therefore, at least half of the response time is saved.
  • For a non-sorted query, each Task outputs its result immediately upon success, according to the result-formatting and result-output rules passed by the Spark application.
  • The result is not stored, so as soon as a Task executes successfully, its result is output.
  • Results are output continuously in this way until the last completed Task has output, so the entire calculation process has no output delay: whenever a new calculation result exists, it is output immediately. When a large-scale data set is analyzed, the number of Tasks increases but the size of the data block processed by each Task does not change, and no matter how large the data set is, the first completed Task outputs its result immediately. Therefore, for a simple query, even a very large data set can output its first result quickly.
  • The Spark big data processing platform no longer applies for memory to store the calculation result; accordingly, each Task of the last stage of the DAG outputs its result directly after executing. For a sorted query in which a Task's result needs to be stored temporarily, it is judged whether memory is sufficient to hold the result; if not, the current job is terminated immediately and the Spark application is notified that the query result exceeds the system capacity, prompting the client to add filter conditions. Thus, for non-sorted queries the Spark big data processing platform can output a large number of query results continuously, supporting the return of large query volumes; for sorted queries, memory-overflow exceptions caused by oversized query results no longer occur.
  • After the Spark SQL application obtains the calculation result, it first judges whether the result is empty. If empty, the output flow is skipped; if not, formatting can be selected according to configuration, and the output flow then proceeds.
  • Before submitting a job to the Spark big data processing platform, the Spark SQL application must predefine the result-formatting rule, the result-output rule, and whether the results are to be sorted, and pass this information when the job is submitted; depending on configuration, the result-formatting rule may be empty.
  • The overloaded interface adds the result-formatting rule, the result-output rule, and the notification of whether the results are to be sorted; finally, before the job is formally submitted, the Task-success processing strategy is set according to these three parameters. The Spark SQL application uses this overloaded interface when submitting jobs.
  • Compared with the prior art, the query method based on the Spark big data processing platform of the present invention has the following beneficial effects.
  • Non-sorted queries can output a large number of query results, even the full volume of the stored data; sorted queries are guaranteed not to trigger a memory-overflow exception because the output is too large, and with a certain probability the amount of data that can be output is greatly increased.
  • The Spark SQL application can output both formatted and unformatted results.
  • When obtaining the query result, the output module of the Spark SQL application directly references the result obtained by the job-submission module, thereby avoiding result copying.
  • The Spark big data processing platform gains the function of outputting results, while the rules for formatting and outputting results are defined by the Spark application, so this function applies to all Spark applications.
  • Existing Spark applications can still use the original interface when submitting jobs and are unaffected.
  • FIG. 1 is an architectural diagram of a Spark SQL execution query in the prior art.
  • FIG. 2 is a framework diagram of a Spark SQL generated DAG in the prior art.
  • FIG. 3 is a flow chart of a SparkContext submit job in the prior art.
  • FIG. 4 is a schematic diagram of the RDD phase division in the prior art.
  • FIG. 5 is a process diagram of a prior-art DAG executed in stages.
  • FIG. 6 is a schematic diagram of a sorted query Task storage calculation result in the prior art.
  • FIG. 7 is a flowchart of implementing an undelayed output of a query result according to an embodiment of the present invention.
  • FIG. 8 is a flowchart of implementing sorting query processing in a Task successful processing strategy according to an embodiment of the present invention.
  • FIG. 9 is a flowchart of implementing non-sorted query processing in a Task successful processing policy according to an embodiment of the present invention.
  • FIG. 10 is a flowchart of implementing a non-sorted query to support a massive query result according to an embodiment of the present invention.
  • FIG. 11 is a flowchart of implementing memory protection for a query result by using a sort query according to an embodiment of the present invention.
  • FIG. 12 is a flowchart of an implementation of a Spark SQL processing job query result according to an embodiment of the present invention.
  • Embodiment 1:
  • The application programming interface SparkContext of the Spark big data processing platform provides a new job-submission interface.
  • The new interface requires the result-output rule, the result-formatting rule, and notification of whether the results are to be sorted.
  • The new interface redefines the result-processing strategy used when a Task succeeds, which is executed when the Task-success event occurs. In this strategy the Task's result is no longer simply stored; instead, based on the result's rank, it is judged whether the result satisfies the immediate-output condition. If it does, the result is formatted and output according to the formatting and output rules defined and passed by the Spark application, and is not stored after output. If the output condition is not yet met, the result is stored temporarily; when the next Task succeeds, the condition is judged again, and any result that now satisfies it is output immediately and its memory released.
  • A Spark application under development can use this new interface to achieve result output without delay, while existing Spark applications can still work normally with the original interface.
  • The result to be formatted is an array of record rows: the row array is converted to an array of strings, and the column separator in each string is replaced with a tab character using a character-substitution function, yielding a tab-delimited result form.
  • The result is printed directly to the console, so for the Spark SQL application the output rule prints to the specified console. When the Spark SQL application processes the SQL statement, it determines whether the results are to be sorted according to whether the statement contains an ORDER BY clause (a sorting clause).
  • If the job fails, the Spark application receives the job-failure notification, outputs an error message, and waits for the next query job to be submitted.
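The shape of the new submission interface can be sketched as follows. The function and field names are hypothetical (this is not the actual SparkContext API): the point is only that the three extra parameters are attached to the job as the Task-success strategy before formal submission.

```python
# A hedged sketch of the overloaded job-submission interface: the caller
# passes the output rule, the formatting rule, and a sorted flag, and the
# Task-success processing strategy is built from those three parameters.

def new_submit_job(job, output_rule, format_rule, results_sorted):
    """Set the Task-success strategy from the three parameters, then
    (in the real platform) formally submit the job."""
    def on_task_success(result):
        # format only if a formatting rule was supplied (it may be empty)
        formatted = format_rule(result) if format_rule else result
        output_rule(formatted)          # output immediately, do not store
    job["on_task_success"] = on_task_success
    job["results_sorted"] = results_sorted
    return job

printed = []
job = new_submit_job({"name": "q1"}, printed.append,
                     format_rule=str.upper, results_sorted=False)
job["on_task_success"]("hello")   # simulate one Task succeeding
print(printed)
```

An application built against the old interface simply never attaches these callbacks, which is why the two interfaces can coexist.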
  • Embodiment 2:
  • The specific implementation of sorted-query processing in the Task-success processing strategy of the query method based on the Spark big data processing platform is as follows:
  • The rank number of each Task's result has already been calculated by the stages that run before the last stage of the job, so every Task of the last stage knows the rank of its result; as soon as an individual Task finishes, the rank of its result is therefore already determined, without waiting for all Tasks to finish.
  • The Tasks mentioned in the embodiments of the present invention all refer to Tasks of the last stage of the job.
  • Task1 completes first; its result is ranked 3rd, while output has currently reached position 0 (that is, nothing has been output), so the result cannot be output to the customer yet and is stored at index position 3.
  • Task2 completes second; its result is ranked 1st, so it is output immediately without being stored. It is then judged whether consecutive results exist at the index positions immediately following (starting from position 2); since position 2 holds no result, nothing more is done. The last output rank is updated to 1.
  • Task3 completes third; its result is ranked 5th, while output has reached position 1, so it cannot be output yet and is stored at index position 5.
  • Task4 completes fourth; its result is ranked 4th, while output has reached position 1, so it cannot be output yet and is stored at index position 4.
  • Task5 completes fifth; its result is ranked 7th, while output has reached position 1, so it cannot be output yet and is stored at index position 7.
  • Task6 completes sixth; its result is ranked 2nd, and output has reached position 1, so it is output immediately. It is then judged whether consecutively ranked results exist at the following index positions (starting from position 3).
  • The three consecutive results at positions 3, 4, and 5 are output in turn; position 7 also holds a result, but the result for position 6 has not yet arrived, so it cannot be output.
  • The last Task completes with the result ranked 6th, and output has reached position 5, so it is output immediately. It is then judged whether consecutive results exist at the following index positions (starting from position 7); since position 7 holds a result, the 7th result is output as well. Finally the memory occupied by the 7th result is released and the current output rank is updated to 7.
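The walkthrough above can be sketched as a small ordered-flush buffer (a minimal illustration, not the patented implementation): each finished Task reports its rank, a result is emitted only when its rank is next in line, and any consecutively ranked buffered results are flushed with it and their memory released.

```python
# Sorted-query output strategy: emit the next-ranked result immediately,
# buffer out-of-order results, and flush consecutive runs as gaps close.

class OrderedFlusher:
    def __init__(self, output):
        self.output = output          # the result-output rule (a callback)
        self.buffer = {}              # rank -> result, temporary storage
        self.last_emitted = 0         # highest rank output so far

    def on_task_success(self, rank, result):
        if rank == self.last_emitted + 1:
            self.output(result)
            self.last_emitted = rank
            # flush any consecutively ranked buffered results
            while self.last_emitted + 1 in self.buffer:
                self.last_emitted += 1
                # pop() releases the buffered entry as it is output
                self.output(self.buffer.pop(self.last_emitted))
        else:
            self.buffer[rank] = result

emitted = []
f = OrderedFlusher(emitted.append)
# Completion order from the embodiment: Task1..Task7 finish holding
# ranks 3, 1, 5, 4, 7, 2, 6 respectively.
for rank in [3, 1, 5, 4, 7, 2, 6]:
    f.on_task_success(rank, f"row{rank}")
print(emitted)   # results come out in rank order
```

Replaying the embodiment's completion order produces exactly the behavior described: rank 1 is emitted on Task2's arrival, ranks 2-5 flush together when Task6 arrives, and ranks 6-7 flush when the last Task arrives.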
  • Embodiment 3:
  • This embodiment concerns non-sorted-query processing in the Task-success processing strategy of the query method based on the Spark big data processing platform.
  • The specific implementation of the query processing is as follows:
  • When a Task succeeds, the non-sorted-query processing flow starts, as follows:
  • The result is formatted according to the formatting rule passed by the Spark application, then output according to the output rule passed by the Spark application, and the Task-success processing ends.
  • Task1 is the first to complete; because no ordering of results is required, its result can be output immediately. The other Tasks are processed in the same way, so by the time all Tasks have executed, all results have already been output.
  • Embodiment 4:
  • The specific implementation by which the query method based on the Spark big data processing platform supports massive query results for non-sorted queries is as follows:
  • Each data file block is 256 MB,
  • and each file block corresponds to one Task performing the query calculation.
  • Because the result of the query calculation is a subset of the file block's contents, the calculation result is at most 256 MB in size.
  • The memory consumed by the Spark big data processing platform for a Task's result during query processing therefore always stays between 0 and 256 MB; no matter how many file blocks and Tasks there are, as long as the portion of memory managed by the platform for storing Task results exceeds 256 MB, all Task results can be received and processed.
  • Embodiment 5:
  • The specific implementation of memory protection for sorted-query results in the query method based on the Spark big data processing platform is as follows:
  • When applying for memory to store each Task's result, the Spark big data processing platform judges whether the memory space is sufficient; if not, it terminates the current job and notifies the Spark application that the job has failed.
  • In the best case, all Tasks complete in the order of their ranks.
  • The result of each Task then needs no storage and can be output directly.
  • Task1 finishes first and its result happens to be ranked 1st, so the output condition is satisfied and it is output directly.
  • Task2 finishes second and its result happens to be ranked 2nd, so the output condition is satisfied and it is output directly.
  • In this way all Tasks output their results in order. In the worst case, the Task holding the 1st-ranked result finishes last; at that point the results of all Tasks except that last-finishing one must be stored temporarily.
  • Since most runs fall between the best and worst cases, and temporarily stored Task results are output and their memory released as soon as the output condition is met, the Spark big data processing platform can, with a certain probability, support the return of massive query results even for sorted queries. For the worse cases, when the volume of Task results that must be stored temporarily exceeds the memory limit, memory protection is applied to avoid the system memory-overflow exception that insufficient memory would otherwise cause.
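The memory-protection rule can be sketched as a guarded buffer (class and exception names are hypothetical): before an out-of-order result is stored, the platform checks that the configured result-memory budget can hold it; if not, the job is terminated and the application is told the result exceeds system capacity.

```python
# A minimal sketch of sorted-query memory protection: refuse to buffer a
# Task result that would exceed the result-memory budget.

class ResultBufferOverflow(Exception):
    """Raised when buffered Task results would exceed the memory budget."""

class GuardedBuffer:
    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.buffer = {}          # rank -> result bytes awaiting output

    def store(self, rank, result_bytes):
        if self.used + len(result_bytes) > self.budget:
            # the platform would terminate the current job here and notify
            # the application that the result exceeds system capacity
            raise ResultBufferOverflow(
                "query result exceeds system capacity; add filter conditions")
        self.buffer[rank] = result_bytes
        self.used += len(result_bytes)

buf = GuardedBuffer(budget_bytes=10)
buf.store(3, b"abcd")                 # 4 of 10 bytes used: fits
try:
    buf.store(5, b"0123456789")       # 4 + 10 > 10: over budget
    aborted = False
except ResultBufferOverflow:
    aborted = True
print(aborted)
```

Failing fast with a controlled job termination is the design choice the embodiment describes: a deliberate abort with a clear message replaces an unpredictable system-wide memory-overflow exception.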
  • The specific implementation of Spark SQL's handling of job query results in the query method based on the Spark big data processing platform is as follows:
  • To remain compatible with both the old and the new job-submission interfaces, the Spark SQL application judges, when the result is returned at the end of a job, whether the result is empty (NULL). If it is empty, the output flow is skipped and the query ends; if it is not empty, a new configuration property determines whether to format the result; if formatting is configured, the result is formatted, and the output flow then proceeds.
  • When the result of a single Task or of the whole job is passed to the output module of the Spark SQL application, the output module accesses the result memory directly and prints it to the console, without applying for a new block of memory and copying the result into it.
  • The specific implementation by which the Spark application passes its result-output rule, result-formatting rule, and sort notification to the Spark big data processing platform is as follows:
  • The result-output rule is implemented as a function.
  • The function can be called wherever a result is to be output.
  • The function is ultimately passed into the processing strategy applied after a Task succeeds.
  • The processing strategy here is also implemented as a function, containing a piece of business-logic processing, and is called after a Task succeeds.
  • The result-formatting rule is likewise implemented as a function. Formatting is optional and determined by a configuration switch: if the switch is on, the formatting-rule function contains the specific formatting steps; if it is off, the function contains no steps.
  • The formatting rule is ultimately passed into the processing strategy applied when a Task succeeds.
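The configuration-switched formatting rule can be sketched as a function factory (names and row layout are illustrative assumptions): when the switch is on, the rule contains the tab-delimiting step; when it is off, the rule contains no steps and passes the result through unchanged.

```python
# A minimal sketch of the configurable formatting rule: the configuration
# switch decides whether the rule formats or is a no-op.

def make_format_rule(format_enabled):
    if format_enabled:
        # switch on: convert the record row to a tab-delimited string
        return lambda row: "\t".join(str(col) for col in row)
    # switch off: the rule contains no steps
    return lambda row: row

fmt_on = make_format_rule(True)
fmt_off = make_format_rule(False)
print(fmt_on(("alice", 30)))    # tab-delimited string
print(fmt_off(("alice", 30)))   # the row, unchanged
```

Because both branches have the same function signature, the Task-success strategy can call the rule unconditionally and the switch costs nothing at output time.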
  • Each result subset of the query releases its memory as soon as it is output; because memory is released promptly, new result subsets can be received continuously.
  • The Spark SQL application accesses the result memory directly when outputting the query result, without applying for new memory and copying the result.
  • The Spark application must pass its result-output rule and result-formatting rule to the Spark big data processing platform. This ensures that the solution of the present invention can be reused by other modules: a module that wants the immediate-output function need only pass its own output and formatting rules to the Spark big data platform, and the system then knows how to output its results, since different modules may process data in different formats and require different output formats.
  • Compared with the prior art, this embodiment has the following beneficial effects.
  • The Spark SQL application can choose, according to configuration, whether to format the calculation result; in scenarios where the output data does not need to be formatted to look neat, memory and performance consumption can be reduced to some extent.
  • When obtaining the query result, the output module of the Spark SQL application directly references the result obtained by the job-submission module, thereby avoiding result copying.
  • The Spark big data processing platform gains the function of outputting results, while the rules for formatting and outputting results are defined by the Spark application, so this feature applies to all Spark applications.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)

Abstract

A query method based on a Spark big data processing platform. The method comprises: for a sorted query, determining whether the rank of the current calculation task's result is the next one after the last output rank; if so, outputting the result, then determining from the ranks whether consecutively ranked results of other calculation tasks are already stored next to it and, if so, outputting them together; if not, storing the current task's result at the index position corresponding to its rank. For a non-sorted query, the result of each completed calculation task is output immediately; output results are not stored, or the memory they occupy is released immediately. When the query method performs a routine simple query, a result can be returned quickly even if the data to be processed is very large; when it performs a complicated query, the user response time can be greatly shortened.

Description

A query method based on a Spark big data processing platform

Technical field

The invention relates to a query method for data processing, in particular to a query method based on a Spark big data processing platform.

Background
With the development of the Internet, tens of thousands of web pages are constantly emerging. To search these pages they must first be crawled and stored, then analyzed and computed; this is what Google did. But ever-growing data meant that storage faced the problem of a single machine's capacity being insufficient, and queries took too long. For these two problems Google proposed solutions based on distributed storage and distributed parallel computing, and the later Hadoop products are an implementation of that solution. Hadoop provides the distributed file system HDFS and the distributed parallel computing framework MapReduce. As Hadoop developed, its ecosystem kept producing new projects such as HBase, Hive, and Pig, all based on the HDFS storage layer and the MapReduce computing framework. MapReduce executes computations in parallel across multiple nodes of a cluster, greatly speeding up queries, but as data volumes grew, MapReduce gradually became inadequate, and the Spark computing framework, based on in-memory computing, emerged. Spark's query speed is up to 100 times that of Hadoop, making it the most advanced distributed parallel computing framework. With the growth of the Spark ecosystem, Spark SQL, Spark Streaming, MLlib, GraphX and others have been built on top of it; Spark SQL is a tool developed for SQL users that allows structured data to be analyzed and queried in the SQL language.
As shown in FIG. 1, in the prior art, taking a Spark SQL application as an example, a query method based on the Spark big data processing platform can be divided into five steps:
Step 1: After receiving the user's SQL statement, the Spark SQL application performs syntax parsing, execution-plan optimization, and job (query job) generation, and finally submits the job by calling the SparkContext interface of the Spark platform.
Step 2: After receiving the job, SparkContext defines how computation results are to be stored once a Task (computation task) succeeds, then submits the job to eventProcessActor and waits for eventProcessActor to signal that the job has finished; when it finishes, SparkContext returns the computation results to Spark SQL.
Step 3: After receiving the job-submission event, eventProcessActor allocates multiple Tasks across the cluster nodes to begin parallel computation.
Step 4: After each Task finishes, it reports its status and result to eventProcessActor; eventProcessActor checks whether all Tasks of the job have completed, and if so, notifies SparkContext that the submitted job has ended, whereupon SparkContext returns the computation results to Spark SQL.
Step 5: After Spark SQL obtains the computation results, it first performs format conversion, then copies the results to the output module, and finally the output module outputs them.
As shown in FIG. 2, step 1 mainly parses the syntax of the SQL statement and generates a set of RDDs representing one job. An RDD is a distributed data structure describing the distributed data to be processed and the algorithm for processing it; one RDD therefore represents one operation on the data, and a set of RDDs is a sequence of operations. Completing this series of operations in order completes one query computation. Spark adopts a lazy execution strategy: each operation is not executed immediately; instead, the sequence of operations is generated first and then sent to the executors for execution. Because the operations represented by this set of RDDs are ordered and acyclic, the logical dependency graph they form is called a directed acyclic graph (DAG); in the DAG, a downstream RDD is generated by applying an operation to an upstream RDD.
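The lazy execution strategy described above can be pictured with a minimal model (illustrative Python, not Spark's actual API; the class and method names are hypothetical): transformations only record operations in a lineage, and nothing runs until an action is called.

```python
# Minimal, illustrative model of Spark's lazy execution: transformations
# only record an operation in the lineage; nothing runs until an action
# (here, collect) walks the recorded sequence. Names are hypothetical,
# not Spark's actual API.

class LazyDataset:
    def __init__(self, data, lineage=None):
        self._data = data
        self._lineage = lineage or []      # recorded operations, not yet run

    def map(self, fn):
        # Transformation: append to the lineage, return a new downstream node.
        return LazyDataset(self._data, self._lineage + [("map", fn)])

    def filter(self, pred):
        return LazyDataset(self._data, self._lineage + [("filter", pred)])

    def collect(self):
        # Action: only now is the recorded operation sequence executed.
        rows = list(self._data)
        for kind, fn in self._lineage:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

ds = LazyDataset(range(6)).map(lambda x: x * 10).filter(lambda x: x >= 20)
# No computation has happened yet; collect() triggers the whole sequence:
print(ds.collect())   # [20, 30, 40, 50]
```

The lineage built by `map` and `filter` plays the role of the operation sequence that Spark ships to the executors.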
As shown in FIG. 3, step 2 mainly submits the DAG to eventProcessActor, which runs in another thread. Before submitting, SparkContext allocates a block of memory and instructs eventProcessActor to store results into it whenever a Task succeeds. After submitting, the current thread suspends and waits for eventProcessActor to wake it when the job completes; once woken, all Tasks have finished and all computation results are already stored in the pre-allocated memory, so the address of the memory holding the results is returned directly to the Spark SQL module. Because no results are returned until all Tasks finish, the client response time is excessive; in fact, each Task's result is a subset of the final result, and there is no need to output all the result subsets together. Moreover, because the entire query result must be stored before output, the result size is directly limited by the program's heap size.
As shown in FIG. 4, step 3 mainly divides the DAG into stages and generates the Task set for each stage. All Tasks within a stage perform the same operations, differing only in the data they act upon, so they can execute fully in parallel; Tasks in different stages, however, cannot necessarily run in parallel. Each dark-gray filled rectangle represents a data block, and each data block is computed by a corresponding Task. Because a data block of RDD2 is computed from multiple data blocks of RDD1, all Tasks of RDD1 must finish before computation of RDD2 can begin, so RDD1 and RDD2 must belong to two different stages. When RDD2 produces RDD5, each data block is processed independently of the others, so a Task computing one data block of RDD2 can begin the computation toward RDD5 (here, a join operation) without waiting for the Tasks of the other data blocks; RDD2 and RDD5 can therefore belong to the same stage. Similarly, RDD3 and RDD4 can belong to the same stage, but RDD4 and RDD5 cannot. In FIG. 4, stage1 and stage2 are independent of each other and can execute in parallel, while stage3 depends on both stage1 and stage2 and therefore must wait for both to complete before executing.
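The stage-splitting rule just described can be sketched as follows (an illustrative Python model, not Spark's real scheduler code): a wide dependency, where one block is computed from many upstream blocks via a shuffle, forces a stage boundary, while narrow dependencies keep RDDs inside one stage. The dependency table mirrors FIG. 4.

```python
# Illustrative model (not Spark's real scheduler code) of the stage rule
# described above: a wide dependency (a shuffle, where one block is built
# from many upstream blocks) starts a new stage; narrow dependencies keep
# RDDs inside the same stage. The graph below mirrors FIG. 4.

deps = {
    "RDD1": [],
    "RDD2": [("RDD1", "wide")],    # every RDD2 block needs many RDD1 blocks
    "RDD3": [],
    "RDD4": [("RDD3", "narrow")],
    "RDD5": [("RDD2", "narrow"), ("RDD4", "wide")],  # join: RDD4 side arrives via shuffle
}

def stage_of(rdd, _memo={}):
    """Return (members, upstream_stages): the RDDs sharing `rdd`'s stage,
    and a frozenset for each stage it must wait for across a shuffle."""
    if rdd not in _memo:
        members, upstream = {rdd}, set()
        for parent, kind in deps[rdd]:
            p_members, p_upstream = stage_of(parent)
            if kind == "narrow":
                members |= p_members                 # same stage as the parent
                upstream |= p_upstream
            else:
                upstream.add(frozenset(p_members))   # parent stage must finish first
                upstream |= p_upstream
        _memo[rdd] = (members, upstream)
    return _memo[rdd]

members, upstream = stage_of("RDD5")
print(sorted(members))                       # ['RDD2', 'RDD5'] share the final stage
print(sorted(sorted(s) for s in upstream))   # [['RDD1'], ['RDD3', 'RDD4']]
```

The output matches the division in FIG. 4: RDD1 alone forms one stage, RDD3 and RDD4 another, and RDD2 with RDD5 the final stage that waits on both.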
As shown in FIG. 5, step 4 mainly stores the computation result into the memory designated by SparkContext after each Task of the last stage succeeds. In FIG. 5, the Tasks of stage1 and stage2 generate only intermediate results; each Task of stage3 produces part of the final result, and the final output is assembled by concatenating the results of the stage3 Tasks, possibly with sorting during concatenation. As shown in FIG. 6, if the query statement requires the results to be sorted, the Task results are stored in rank order; if not, the results are ordered by Task completion time, so the ordering of each query's results is random. In the sorted case, since every Task knows where its result should be ranked, the Task ranked first has already computed the head of the final result and the client could be notified immediately. In the unsorted case, since the client does not care about the ordering, whichever Task finishes first could report its result to the client at once; there is no need to wait for the other Tasks, and even with waiting, the final results are still ordered by completion time.
Step 5 mainly formats the result, an array of record rows, into a sequence of strings: each record row is converted to string format, and the column separators are replaced with tab characters. Finally, when the output module extracts the formatted result, it copies it into the output module and then outputs it. In fact, formatting the result is not necessary; formatting may look nicer, but it consumes considerable memory and performance, and in some cases the data is already tidy, so there is no need to format it.
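The row formatting in step 5 amounts to the following (a minimal Python sketch; the helper name is hypothetical, not a Spark API): each record row becomes one string whose column separator is a tab character.

```python
# Minimal sketch of step 5's formatting: each record row (a tuple of column
# values) is converted into one string, with tab characters as the column
# separators. `format_rows` is an illustrative name, not a Spark API.

def format_rows(rows, sep="\t"):
    """Format an array of record rows into a list of tab-separated strings."""
    return [sep.join(str(col) for col in row) for row in rows]

rows = [(1, "alice", 95), (2, "bob", 87)]
print(format_rows(rows))   # ['1\talice\t95', '2\tbob\t87']
```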
In summary, the prior-art query method based on the Spark big data processing platform has the following technical problems:
1. When the Spark big data processing platform executes a query, the user response time is too long; in particular, when analyzing larger-scale data, the response time exceeds what users can tolerate, and the delay grows in step with the volume of data analyzed.
2. The Spark big data processing platform does not support output of large-scale query results: the default configuration allows only 1 GB of query-result data to be output. Configuring the limit too low leaves memory resources underused; configuring it too high risks a memory-overflow exception if the result exceeds the memory actually remaining. Furthermore, on machines with little memory, the amount of data that can be output shrinks even more.
3. After Spark SQL obtains computation results from the Spark big data processing platform, it performs format conversion and data copying before the actual output, leaving multiple copies of identical or nearly identical data in memory. This wastes memory, degrades performance, and directly affects user response time and result storage capacity, and the effect grows as the output grows.
Summary of the Invention
The technical problem to be solved by the present invention is to provide a query method based on the Spark big data processing platform that, when executing a conventional simple query (with few DAG stages), can return results quickly even when the data to be processed is very large, and that, when executing a complex query, can greatly shorten the user response time relative to the prior art. For either kind of query, the method aims to output any result as soon as it satisfies the output conditions, without any delay.
To solve the above technical problem, in the query method of the present invention based on the Spark big data processing platform, when a Spark application submits a job to the Spark big data processing platform, it simultaneously passes a result-formatting rule, a result-output rule, and a notification of whether the results are to be sorted; inside the Spark platform, the processing strategy applied after a Task succeeds is set according to this information:
For a sorted query, it is judged whether the rank number of the current Task's result is the next position after the last output rank number. If it is, the result is output according to the result-formatting rule and output rule passed by the Spark application, and it is then checked, by rank number, whether consecutively ranked results of other Tasks are already stored immediately after it; if so, those results are output as well, and the memory occupied by any output result is released immediately. If it is not, the current Task's result is stored at the index position corresponding to its rank number. In this way, a result is output as soon as its rank number satisfies the output condition, without any delay: in the fastest case, the Task computing the head of the results finishes first; in the slowest case, it finishes last; on average, therefore, the response time is shortened by at least half.
For an unsorted query, as soon as each Task succeeds, its result is output according to the result-formatting rule and output rule passed by the Spark application, and the result is not stored. Thus, whenever a Task succeeds its result is output immediately; as the Task set completes, results are output continuously until the last completed Task is output. In this case the entire computation has no output delay: any new computation result is output at once. For analysis of large-scale data sets, the number of Tasks increases, but the size of the data block each Task processes is unchanged, and no matter how large the data set is, the first Task to complete outputs its result immediately; thus, for any simple query, even over an ultra-large data set, the first results can be output quickly.
For an unsorted query, the Spark big data processing platform no longer allocates memory for storing computation results; correspondingly, the Tasks of the last DAG stage output their results directly upon success. For a sorted query in which a Task result must be stored temporarily, it is judged whether memory is sufficient to hold that result; if not, the current job is terminated immediately and the Spark application is notified that the query result exceeds the system capacity, prompting the client to add filter conditions. Therefore, for unsorted queries, the Spark big data processing platform can output a continuous stream of massive query results, supporting scenarios in which queries return large volumes of data; for sorted queries, memory-overflow exceptions caused by overly large query results cannot occur.
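The sorted-query memory protection described above can be sketched as follows (illustrative Python; the class, exception name, and capacity figure are assumptions for illustration, not Spark's): before a Task result is buffered, the remaining capacity is checked, and a result that does not fit aborts the job with a capacity error.

```python
# Sketch of the sorted-query memory guard: before a Task result is buffered
# at its rank index, check whether it still fits; if not, terminate the job
# and report that the query result exceeds system capacity. All names and
# the capacity figure here are illustrative.

class ResultCapacityExceeded(Exception):
    """Raised to abort the job and prompt the client to add filter conditions."""

class GuardedBuffer:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.slots = {}                     # rank -> buffered result

    def store(self, rank, result_bytes):
        if self.used + len(result_bytes) > self.capacity:
            raise ResultCapacityExceeded(
                "query result exceeds system capacity; add filter conditions")
        self.slots[rank] = result_bytes
        self.used += len(result_bytes)

buf = GuardedBuffer(capacity_bytes=10)
buf.store(3, b"12345")                      # fits: 5 of 10 bytes used
try:
    buf.store(5, b"123456789")              # would exceed capacity: job aborted
except ResultCapacityExceeded as err:
    print("job terminated:", err)
```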
After the Spark SQL application obtains a computation result, it first judges whether the result is empty. If it is empty, the output flow is skipped; if not, formatting may be enabled or disabled by configuration, after which the output flow proceeds.
When the Spark SQL application outputs a result, it references the result directly instead of copying it to the output module.
Before submitting a job to the Spark big data processing platform, the Spark SQL application must predefine the result-formatting rule, the result-output rule, and the notification of whether the results are to be sorted, and pass this information when submitting the job; depending on configuration, the result-formatting rule may be empty.
Every job-submission interface of the Spark big data processing platform is given an overloaded counterpart; the overloaded interfaces add three parameters: the result-formatting rule, the result-output rule, and the notification of whether the results are to be sorted. Before the job is formally submitted, the post-Task-success processing strategy is set according to these three parameters. The Spark SQL application uses the overloaded interfaces when submitting jobs.
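One way to picture the overloaded submission interface (a hypothetical Python sketch; Spark's real SparkContext is JVM code and none of these names are its actual API): the legacy call shape is preserved, and the overload accepts the three extra parameters that configure the Task-success strategy.

```python
# Hypothetical sketch of the overloaded job-submission interface (Python for
# illustration; Spark's real SparkContext is JVM code and these names are
# not its actual API). The legacy call shape is preserved; the overload
# takes the three extra parameters and emits results as Tasks complete.

class SparkContextSketch:
    def submit_job(self, partitions, format_rule=None, output_rule=None,
                   sorted_results=False):
        if output_rule is None:
            # Legacy path: accumulate everything, return it when the job ends.
            return [row for part in partitions for row in part]
        # Overloaded path: the three parameters set the Task-success strategy;
        # each partition stands in for one Task, whose result is emitted
        # immediately and never stored (unsorted case shown here).
        fmt = format_rule or (lambda rows: rows)
        for part in partitions:
            output_rule(fmt(part))
        return None                          # results were already emitted

sc = SparkContextSketch()
out = []
sc.submit_job([[1, 2], [3]], format_rule=lambda rows: [r * 10 for r in rows],
              output_rule=out.extend)
print(out)                                   # [10, 20, 30]
print(sc.submit_job([[1, 2], [3]]))          # legacy path: [1, 2, 3]
```

Existing callers that pass only the job keep the legacy behavior, matching the requirement that prior applications remain unaffected.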
Compared with the prior art, the query method of the present invention based on the Spark big data processing platform has the following beneficial effects.
1. When executing a conventional simple query (with few DAG stages), the method returns results quickly even when the data to be processed is very large; when executing a complex query, it greatly shortens the user response time relative to the prior art. For either kind of query, any result that satisfies the output conditions is output immediately, without delay. For any query over data below large scale, and for simple queries over ultra-large-scale data, the first batch of results appears within 3 s, i.e., the client response time is always kept within 3 s; for complex queries over ultra-large-scale data, the client response is substantially accelerated relative to prior-art implementations.
2. Unsorted queries can output massive query results, up to the total volume of stored data; sorted queries are guaranteed not to trigger a memory-overflow exception due to overly large output, and in many cases the amount of data that can be output is greatly increased.
3. The Spark SQL application can output both formatted and unformatted results.
4. When the output module of the Spark SQL application obtains a query result, it directly references the result obtained by the job-submission module, avoiding result copying.
5. The Spark big data processing platform itself has the result-output function, and the rules for formatting and outputting results are defined by the Spark application, so the function applies to all Spark applications.
6. Existing Spark applications can still use the original interfaces when submitting jobs and are unaffected.
Brief Description of the Drawings
The query method of the present invention based on the Spark big data processing platform is described in further detail below with reference to the accompanying drawings and specific embodiments.
FIG. 1 is an architecture diagram of query execution by Spark SQL in the prior art.
FIG. 2 is a framework diagram of DAG generation by Spark SQL in the prior art.
FIG. 3 is a flowchart of job submission by SparkContext in the prior art.
FIG. 4 is a schematic diagram of RDD stage division in the prior art.
FIG. 5 is a process diagram of stage-by-stage DAG execution in the prior art.
FIG. 6 is a schematic diagram of storing sorted-query Task results in the prior art.
FIG. 7 is a flowchart of implementing zero-delay output of query results according to an embodiment of the present invention.
FIG. 8 is a flowchart of implementing sorted-query processing in the Task-success processing strategy according to an embodiment of the present invention.
FIG. 9 is a flowchart of implementing unsorted-query processing in the Task-success processing strategy according to an embodiment of the present invention.
FIG. 10 is a flowchart of implementing support for massive query results for unsorted queries according to an embodiment of the present invention.
FIG. 11 is a flowchart of implementing memory protection of query results for sorted queries according to an embodiment of the present invention.
FIG. 12 is a flowchart of implementing the processing of job query results by Spark SQL according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it.
Embodiment 1:
As shown in FIG. 7, zero-delay result output in the query method of this embodiment based on the Spark big data processing platform is implemented as follows:
SparkContext, the application programming interface of the Spark big data processing platform, provides a new job-submission interface; the new interface requires the caller to pass the result-output rule, the result-formatting rule, and the notification of whether the results are to be sorted.
The new interface redefines the result-processing strategy applied when a Task succeeds, which is executed on the Task-success event. Under this strategy, the Task result is no longer simply stored; instead, according to whether the results are to be sorted, it is judged whether the result satisfies the immediate-output condition. If it does, the result is formatted and then output directly according to the result-formatting rule and output rule defined and passed by the Spark application, and is not stored after output. If the output condition is not yet satisfied, the result is stored temporarily; when the next Task succeeds, the output condition is checked again, and any result that now satisfies it is output immediately and its storage memory released.
Spark applications under development can use the new interface to achieve zero-delay result output; existing Spark applications can still work normally with the original interface.
For example, a Spark SQL application processes structured data, so its results are formatted as an array of record rows; because the column separators are to be replaced with tab characters, the record-row array is further formatted into a string array so that string character replacement can perform the substitution, yielding results whose column separator is a tab. As for output, since the Spark SQL application is a command-line program whose results are printed directly to the console, outputting the results means printing them to the designated console. When the Spark SQL application processes a SQL statement, it determines whether the results are to be sorted by whether the statement contains an order by clause (sorting clause).
The result-formatting rule, the result-output rule, and whether the results are to be sorted are specific to each Spark application; this information therefore needs to be passed to the Spark big data processing platform when the application submits a job, and is ultimately used when a Task succeeds so that results can be output accurately and promptly.
When all Tasks have succeeded, the entire job ends successfully; since all results were already output as the Tasks succeeded, the Spark application need not execute any output flow after the job ends and can immediately proceed to submit the next query job.
When a Task fails, the entire job is declared failed; upon being notified that the job ended in failure, the Spark application outputs an error message and waits for the next query job to be submitted.
Embodiment 2:
As shown in FIG. 8, sorted-query processing in the Task-success processing strategy of the query method of this embodiment based on the Spark big data processing platform is implemented as follows:
In the Task-success processing strategy, it is determined whether the current query requires the results to be sorted; if so, the sorted-query processing procedure is applied, detailed as follows:
It is judged whether the rank number of the current Task's result is the next position after the last output rank number. If it is, the result is formatted and then output according to the formatting and output rules passed by the Spark application, and it is then checked, by rank number, whether consecutively ranked results of other Tasks are already stored immediately after it; if so, those results are output as well, and the memory they occupy is released after output. If it is not, the current Task's result is stored at the index position corresponding to its rank number.
It should be noted that the rank numbers of Task results are computed in advance by the earlier stages, before the Spark big data processing platform executes the last stage of the job; every Task of the last stage knows the rank number of its own result, so once an individual Task finishes, its rank position is already determined, without waiting for all Tasks to finish. The Tasks mentioned in the embodiments of the present invention all refer to Tasks of the last stage of a job.
For example, consider a query whose returned results are as shown in Table 1.
Table 1:
Task  | Completion order | Rank of its result
Task1 | 1st              | 3
Task2 | 2nd              | 1
Task3 | 3rd              | 5
Task4 | 4th              | 4
Task5 | 5th              | 7
Task6 | 6th              | 2
Task7 | 7th              | 6
Task1 finishes first. Its result ranks 3rd, but output has so far reached only position 0 (i.e., nothing has been output yet), so it cannot be output to the client; it is stored at index position 3.
Task2 finishes second. Its result ranks 1st, so it is output immediately and not stored; it is then checked whether consecutively ranked results exist at the index positions immediately following (starting at position 2). Since there is no result at index position 2, nothing further is done, and the current output rank number is updated to 1.
Task3 finishes third. Its result ranks 5th, but output has reached only position 1, so it cannot be output; it is stored at index position 5.
Task4 finishes fourth. Its result ranks 4th, but output has reached only position 1, so it cannot be output; it is stored at index position 4.
Task5 finishes fifth. Its result ranks 7th, but output has reached only position 1, so it cannot be output; it is stored at index position 7.
Task6 finishes sixth. Its result ranks 2nd, and output has reached position 1, so it can be output immediately; it is then checked whether consecutively ranked results exist at the following index positions (starting at position 3). Positions 3, 4, and 5 all hold results, so these three consecutive results are output as well; position 7 holds a result, but the result for position 6 has not yet arrived, so it cannot be output. Finally, the memory occupied by the results at positions 3, 4, and 5 is released and the current output rank number is updated to 5.
Task7 finishes last. Its result ranks 6th, and output has reached position 5, so it can be output immediately; it is then checked whether consecutively ranked results exist at the following index positions (starting at position 7). Position 7 holds a result, so that result is output as well; finally, the memory occupied by the result at position 7 is released and the current output rank number is updated to 7.
At this point, all Tasks have finished, and all results have been output in order.
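The walkthrough above can be replayed with a short sketch of the sorted-output strategy (illustrative Python; in the invention this logic runs inside the Spark platform's Task-success handler): a result is emitted as soon as it extends the contiguous output prefix, and is otherwise buffered at its rank index and freed once emitted.

```python
# Runnable sketch of the sorted-query strategy walked through above:
# results arrive in Task-completion order, each tagged with its rank;
# a result is emitted as soon as it extends the contiguous output prefix,
# otherwise it is buffered at its rank index and released once emitted.
# (Illustrative Python; class and method names are not Spark's API.)

class OrderedEmitter:
    def __init__(self, output_rule):
        self.output_rule = output_rule
        self.next_rank = 1          # next rank number expected for output
        self.buffer = {}            # rank -> stored result awaiting its turn

    def on_task_success(self, rank, result):
        if rank != self.next_rank:
            self.buffer[rank] = result          # store at its rank index
            return
        self.output_rule(result)                # output immediately
        self.next_rank += 1
        # Drain any consecutively ranked results already buffered.
        while self.next_rank in self.buffer:
            self.output_rule(self.buffer.pop(self.next_rank))  # pop frees memory
            self.next_rank += 1

# Replay Table 1: Tasks finish in order 1..7 with result ranks 3,1,5,4,7,2,6.
emitted = []
em = OrderedEmitter(emitted.append)
for rank in [3, 1, 5, 4, 7, 2, 6]:
    em.on_task_success(rank, f"row-{rank}")
print(emitted)   # ['row-1', 'row-2', 'row-3', 'row-4', 'row-5', 'row-6', 'row-7']
```

Replaying the Table 1 sequence produces the fully ordered output, with the buffer empty at the end, just as in the step-by-step account above.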
Embodiment 3:
As shown in FIG. 9, unsorted-query processing in the Task-success processing strategy of the query method of this embodiment based on the Spark big data processing platform is implemented as follows:
In the Task-success processing strategy, it is determined whether the current query requires the results to be sorted; if not, the unsorted-query processing procedure is applied, detailed as follows:
The result is first formatted according to the result-formatting rule passed by the Spark application and then output according to the result-output rule passed by the Spark application; this completes the Task-success processing.
No Task results are stored: whichever Task finishes first is output first, one at a time as each finishes, until all Tasks have finished. For example, consider the query shown in Table 2:
Table 2:
Figure PCTCN2016095353-appb-000003
Task1 finishes first; since no ordering of the results is required, it can be output immediately. The other Tasks are processed the same way, so by the time all Tasks have finished executing, the results have already been output.
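The non-sorted path above can be sketched in a few lines: each Task result is formatted and output the moment it arrives, in completion order, with nothing buffered. The function name `handle_success` and the formatting/output callables are illustrative assumptions, not names from the patent.

```python
def handle_success(result, format_rule, output_rule):
    # Non-sorted query: format, then output immediately; the result is not stored.
    output_rule(format_rule(result))

out = []
for r in ["r3", "r1", "r2"]:   # arbitrary Task completion order
    handle_success(r, format_rule=str.upper, output_rule=out.append)
```

Results appear in completion order (`R3`, `R1`, `R2`), which is exactly why this path needs no buffering and no per-result memory accounting.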
Embodiment 4:
As shown in FIG. 10, the way this embodiment's query method based on the Spark big data processing platform supports massive query results for non-sorted queries is implemented as follows:
When the query executed by the user does not require sorted results, no Task stores its subset of the query results. The Spark big data processing platform therefore does not allocate indexed memory for storing Task results, and returns a null value when returning results to the Spark application.
Because query results are no longer accumulated in storage, whenever a new Task succeeds its result is held in memory only temporarily; as soon as the result has been output, the memory it occupies is reclaimed automatically. As Task results keep arriving, memory usage stays within a small, fixed bound, so the Spark big data processing platform can stream computation results continuously and supports the return of massive query results.
For example, suppose each data file block is 256 MB and each file block is queried by one Task. Since a Task's query result is a subset of the file block's contents, the result is at most 256 MB. Accordingly, the memory the Spark big data processing platform consumes for storing Task results during query processing always stays between 0 and 256 MB; no matter how many file blocks and Tasks there are, as long as the platform-managed memory reserved for Task results exceeds 256 MB, all Task results can be received and processed.
Embodiment 5:
As shown in FIG. 11, the memory protection that this embodiment's query method based on the Spark big data processing platform applies to query results for sorted queries is implemented as follows:
When the query executed by the user requires sorted results, some Task results may need to be stored temporarily because they do not satisfy the immediate-output condition. To prevent memory overflow caused by overly large results, the Spark big data processing platform checks whether memory space is sufficient when allocating memory to store each Task's result; if it is not, the platform terminates the current job and notifies the Spark application that the job has failed.
In the best case, all Tasks happen to finish in the order of their result ranks; no Task result needs to be stored, and every result can be output directly. For example, if Task1 finishes first and its result happens to rank first, the output condition is satisfied and it is output directly; Task2 finishes second with the second-ranked result and is output directly; and so on, with all Tasks outputting in order. In the worst case, the Task whose result ranks first finishes last; then every Task result except that last one must be stored temporarily. Most runs fall between these two extremes, and a temporarily stored Task result is output, and its memory released, as soon as the output condition is met, so even for sorted queries the Spark big data processing platform can, with some probability, support the return of massive query results. In the bad case where the volume of temporarily stored Task results would exceed the memory limit, a little memory protection avoids a system out-of-memory exception caused by insufficient memory allocation.
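The memory guard described above can be sketched as a budget check performed before any out-of-order result is buffered: if the buffered total would exceed the limit, the job is terminated and the client is told to add filter conditions. The names `buffer_result` and `JobAborted`, and the sizes and limit used here, are illustrative assumptions, not values from the patent.

```python
class JobAborted(Exception):
    """Raised when buffering another sorted-query result would exceed the budget."""

def buffer_result(buffered, rank, result, size, used, limit):
    # Memory protection: refuse to buffer if the budget would be exceeded.
    if used + size > limit:
        raise JobAborted("query result exceeds system capacity; add filter conditions")
    buffered[rank] = result
    return used + size

buffered, used, LIMIT = {}, 0, 256
used = buffer_result(buffered, 3, "r3", 100, used, LIMIT)   # fits: used becomes 100
used = buffer_result(buffered, 4, "r4", 100, used, LIMIT)   # fits: used becomes 200
aborted = False
try:
    used = buffer_result(buffered, 5, "r5", 100, used, LIMIT)  # 300 > 256
except JobAborted:
    aborted = True   # job terminated; Spark application notified of failure
```

The guard only fires on results that must be buffered; results satisfying the immediate-output condition bypass it entirely, which is why the best-case execution never hits the limit.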
Embodiment 6:
As shown in FIG. 12, the way Spark SQL processes job query results in this embodiment's query method based on the Spark big data processing platform is implemented as follows:
To keep the Spark SQL application compatible with both the old and the new job-submission interfaces, when the Spark SQL application receives the result returned at the end of a job, it first checks whether the result is empty (NULL). If it is, the output flow is skipped and the query ends; if it is not, the application decides from the newly added configuration property whether to format the result, formats it if so configured, and then proceeds to the output flow.
Embodiment 7:
In this embodiment, direct referencing of query results by Spark SQL in the query method based on the Spark big data processing platform is implemented as follows:
When a single Task result, or the result of the whole job, is passed to the output module of the Spark SQL application, the output module accesses the result's memory directly and prints it to the console, without allocating a new block of memory and copying the result into it.
Because duplicate copies of the result are avoided, memory consumption is reduced and the time cost of copying is saved, which to some extent supports returning more query results and responding to clients faster.
Embodiment 8:
In this embodiment, the way the Spark application passes its result-output rule, result-formatting rule, and sort-or-not notification to the Spark big data processing platform is implemented as follows:
1. The result-output rule is implemented as a function; it is simply called wherever results are to be output. The function is ultimately passed to the on-Task-success handling strategy, which is itself implemented as a function containing a piece of business-logic processing to be invoked when a Task succeeds;
2. The result-formatting rule is also implemented as a function. Formatting is optional and controlled by a configuration switch: if the switch is on, the formatting-rule function contains concrete formatting steps; if the switch is off, it contains no steps. The formatting rule is ultimately passed to the on-Task-success handling strategy;
3. The sort-or-not notification is implemented as a variable. After the Spark application generates the execution plan, whether the current query requires sorting can be determined dynamically from whether the plan's node type is Sort; a Boolean variable stores this determination and is passed to the on-Task-success handling strategy.
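The three items above can be sketched as plumbing: the application hands the platform an output function, a formatting function (a no-op when the configuration switch is off), and a Boolean sort flag, and the platform composes them into the on-Task-success strategy. All names here (`make_on_success`, `submit_job`, `FORMAT_SWITCH_ON`) are illustrative assumptions; the real strategy's sorted branch would also route results through the rank buffer, which is omitted to keep the parameter passing in focus.

```python
def make_on_success(format_rule, output_rule, results_sorted):
    # Composes the on-Task-success strategy from the three passed-in parameters.
    def on_success(result):
        # Both branches format then output here; the sorted branch of the
        # actual platform would additionally consult the rank-index buffer.
        output_rule(format_rule(result))
    return on_success

def submit_job(tasks, format_rule, output_rule, results_sorted):
    # Overloaded submission interface: sets the strategy before running Tasks.
    on_success = make_on_success(format_rule, output_rule, results_sorted)
    for t in tasks:            # stands in for Tasks finishing on the platform
        on_success(t)

FORMAT_SWITCH_ON = True
fmt = (lambda r: f"[{r}]") if FORMAT_SWITCH_ON else (lambda r: r)  # no-op when off
printed = []
submit_job(["a", "b"], fmt, printed.append, results_sorted=False)
```

Because the rules are ordinary functions, any module can reuse the immediate-output feature by supplying its own pair, which is the reuse property claimed for the overloaded interface.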
The above describes only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto; any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of the claims.
The innovations of this embodiment are as follows.
1. During a query, as soon as a new result is computed and satisfies the output condition, it is output immediately; there is no need to wait until all parallel tasks have computed their results before outputting them together.
2. Each result subset of the query releases its memory as soon as it is output; because memory is freed promptly, new result subsets can be received continuously.
3. Receiving Task results and allocating memory to store them is protected: when the system cannot accommodate more results, the job is terminated and the client is notified, effectively avoiding program exceptions.
4. Whether query results are formatted into a neat, uniform layout is optional, and formatting is done in batches, effectively reducing the number of copies of large data sets held in memory.
5. When the Spark SQL application outputs query results, it accesses the result memory directly, without reallocating memory and copying the result.
6. The Spark application needs to pass its result-output rule and result-formatting rule to the Spark big data processing platform. This ensures that the scheme of the present invention can be reused by other modules: a module that wants to use the immediate-output feature only needs to pass its output rule and formatting rule to the Spark big data processing platform, and the system then knows how to output, because different modules may process data, and produce output, in different formats.
Compared with the prior art, this embodiment has the following beneficial effects.
1. When executing a big-data query, client response time is greatly shortened; the first result can be seen in a very short time.
2. There is no limit on the number or size of results for non-sorted queries; massive results can be output continuously.
3. For sorted queries, memory protection is applied to the output results, effectively preventing the Spark big data processing platform from crashing abnormally because the output is too large.
4. The Spark SQL application can choose, via configuration, whether to format computation results; in some scenarios the output data does not need to look neat, so memory and performance overhead can be reduced to some extent.
5. When obtaining query results, the output module of the Spark SQL application directly references the result obtained by the job-submission module, avoiding result copying;
6. The Spark big data processing platform provides the result-output capability, while the rules for formatting and outputting results are defined by the Spark application, so the feature applies to all Spark applications;

Claims (6)

  1. A query method based on a Spark big data processing platform, wherein, when a Spark application submits a job to the Spark big data processing platform, it also passes a result-formatting rule, a result-output rule, and a notification of whether the results are to be sorted, and Spark internally sets a handling strategy for when a Task succeeds, characterized in that:
    for a sorted query, it is determined whether the rank number of the current Task result immediately follows the last output rank number; if so, the result is output according to the result-formatting rule and the result-output rule passed by the Spark application, and it is then determined from the rank numbers whether consecutively ranked, already-stored results of other Tasks immediately follow; if so, those results are output as well, and the memory occupied by results that have been output is released immediately; if not, the current Task result is stored at the index position corresponding to its rank number;
    for a non-sorted query, as soon as each Task succeeds, its result is output according to the result-formatting rule and the result-output rule passed by the Spark application, and the result is not stored.
  2. The query method based on a Spark big data processing platform according to claim 1, characterized in that: for a non-sorted query, the Spark big data processing platform no longer allocates memory for storing computation results, and accordingly each Task of the last stage of the job outputs its result directly upon success; for a sorted query in which a Task result needs to be stored temporarily, it is determined whether memory is sufficient to hold the Task result, and if it is not, the current job is terminated immediately and the Spark application is notified that the query result exceeds the system capacity, prompting the client to add filter conditions.
  3. The query method based on a Spark big data processing platform according to claim 1, characterized in that: after Spark SQL, the interactive SQL query engine application integrated in the Spark big data processing platform, obtains a computation result, it first checks whether the result is empty; if it is empty, the output flow is skipped; if it is not empty, whether to format can be selected according to the configuration, after which the output flow proceeds.
  4. The query method based on a Spark big data processing platform according to claim 3, characterized in that: when the Spark SQL application outputs a result, it references the result directly instead of copying it again to the output module.
  5. The query method based on a Spark big data processing platform according to claim 1, characterized in that: before the Spark SQL application submits a job to the Spark big data processing platform, the result-formatting rule, the result-output rule, and the notification of whether results are to be sorted must be defined in advance and passed when the job is submitted, wherein the result-formatting rule may be empty depending on the configuration.
  6. The query method based on a Spark big data processing platform according to claim 1, characterized in that: every interface of the Spark big data processing platform related to job submission is overloaded, and the overloaded interfaces add three parameters: the result-formatting rule, the result-output rule, and the notification of whether results are to be sorted; before the job is formally submitted, the on-Task-success handling strategy is set according to these three parameters; and the Spark SQL application uses the overloaded interfaces when submitting jobs.
PCT/CN2016/095353 2015-12-15 2016-08-15 Query method based on spark big data processing platform WO2017101475A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510930909.1A CN105550318B (en) 2015-12-15 2015-12-15 A kind of querying method based on Spark big data processing platforms
CN201510930909.1 2015-12-15

Publications (1)

Publication Number Publication Date
WO2017101475A1 true WO2017101475A1 (en) 2017-06-22

Family

ID=55829507

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/095353 WO2017101475A1 (en) 2015-12-15 2016-08-15 Query method based on spark big data processing platform

Country Status (2)

Country Link
CN (1) CN105550318B (en)
WO (1) WO2017101475A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109582706A (en) * 2018-11-14 2019-04-05 重庆邮电大学 The neighborhood density imbalance data mixing method of sampling based on Spark big data platform
CN110659292A (en) * 2019-09-21 2020-01-07 北京海致星图科技有限公司 Spark and Ignite-based distributed real-time graph construction and query method and system
CN112612584A (en) * 2020-12-16 2021-04-06 远光软件股份有限公司 Task scheduling method and device, storage medium and electronic equipment
CN113392140A (en) * 2021-06-11 2021-09-14 上海达梦数据库有限公司 Data sorting method and device, electronic equipment and storage medium

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550318B (en) * 2015-12-15 2017-12-26 深圳市华讯方舟软件技术有限公司 A kind of querying method based on Spark big data processing platforms
CN106372127B (en) * 2016-08-24 2019-05-03 云南大学 The diversity figure sort method of large-scale graph data based on Spark
CN106909621B (en) * 2017-01-17 2020-02-11 中国科学院信息工程研究所 Accelerated IPC (International Process control) code-based query processing method
CN107480202B (en) * 2017-07-18 2020-06-02 湖南大学 Data processing method and device for multiple parallel processing frameworks
CN110019497B (en) * 2017-08-07 2021-06-08 北京国双科技有限公司 Data reading method and device
CN107609130A (en) * 2017-09-18 2018-01-19 链家网(北京)科技有限公司 A kind of method and server for selecting data query engine
CN108062251B (en) * 2018-01-09 2023-02-28 福建星瑞格软件有限公司 Server resource recovery method and computer equipment
CN108536727A (en) * 2018-02-24 2018-09-14 国家计算机网络与信息安全管理中心 A kind of data retrieval method and device
CN108874897B (en) * 2018-05-23 2019-09-13 新华三大数据技术有限公司 Data query method and device
CN110109747B (en) * 2019-05-21 2021-05-14 北京百度网讯科技有限公司 Apache Spark-based data exchange method, system and server

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130346988A1 (en) * 2012-06-22 2013-12-26 Microsoft Corporation Parallel data computing optimization
CN104239501A (en) * 2014-09-10 2014-12-24 中国电子科技集团公司第二十八研究所 Mass video semantic annotation method based on Spark
US9135559B1 (en) * 2015-03-20 2015-09-15 TappingStone Inc. Methods and systems for predictive engine evaluation, tuning, and replay of engine performance
CN105550318A (en) * 2015-12-15 2016-05-04 深圳市华讯方舟软件技术有限公司 Spark big data processing platform based query method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102799622B (en) * 2012-06-19 2015-07-15 北京大学 Distributed structured query language (SQL) query method based on MapReduce expansion framework
CN103995827B (en) * 2014-04-10 2017-08-04 北京大学 High-performance sort method in MapReduce Computational frames
CN104951509A (en) * 2015-05-25 2015-09-30 中国科学院信息工程研究所 Big data online interactive query method and system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TENCENT;: "Tencent Big Data TDW Calculation Engine Analysis-Shuffle", 11 July 2014 (2014-07-11), Retrieved from the Internet <URL:http://data.qq.com/article?id=543> *


Also Published As

Publication number Publication date
CN105550318A (en) 2016-05-04
CN105550318B (en) 2017-12-26

Similar Documents

Publication Publication Date Title
WO2017101475A1 (en) Query method based on spark big data processing platform
Hu et al. Flutter: Scheduling tasks closer to data across geo-distributed datacenters
US8209703B2 (en) Apparatus and method for dataflow execution in a distributed environment using directed acyclic graph and prioritization of sub-dataflow tasks
US8959519B2 (en) Processing hierarchical data in a map-reduce framework
Stergiou et al. Shortcutting label propagation for distributed connected components
CN110308984B (en) Cross-cluster computing system for processing geographically distributed data
US20240061712A1 (en) Method, apparatus, and system for creating training task on ai training platform, and medium
US11487555B2 (en) Running PBS jobs in kubernetes
CN106569896B (en) A kind of data distribution and method for parallel processing and system
US11748164B2 (en) FAAS distributed computing method and apparatus
US8458136B2 (en) Scheduling highly parallel jobs having global interdependencies
CN106383746A (en) Configuration parameter determination method and apparatus of big data processing system
Shen et al. Performance prediction of parallel computing models to analyze cloud-based big data applications
Henzinger et al. Scheduling large jobs by abstraction refinement
WO2023284171A1 (en) Resource allocation method and system after system restart, and related component
Leida et al. Distributed SPARQL query answering over RDF data streams
Alemi et al. CCFinder: using Spark to find clustering coefficient in big graphs
US9436503B2 (en) Concurrency control mechanisms for highly multi-threaded systems
Zhao et al. A data locality optimization algorithm for large-scale data processing in Hadoop
CN114168594A (en) Secondary index creating method, device, equipment and storage medium of horizontal partition table
CN113868249A (en) Data storage method and device, computer equipment and storage medium
Huang et al. Improving speculative execution performance with coworker for cloud computing
CN114240632A (en) Batch job execution method, apparatus, device, medium, and product
CN113504966A (en) GPU cluster scheduling strategy simulation method and GPU cluster simulator
US10303567B2 (en) Managing database nodes

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16874535

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16874535

Country of ref document: EP

Kind code of ref document: A1