CN107544844A - A method and device for improving Spark operating efficiency - Google Patents

A method and device for improving Spark operating efficiency

Info

Publication number
CN107544844A
CN107544844A CN201610482075.7A CN201610482075A
Authority
CN
China
Prior art keywords
cache
tasks
processing process
spark
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201610482075.7A
Other languages
Chinese (zh)
Inventor
肖丽华
王跃
刘晏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp filed Critical ZTE Corp
Priority to CN201610482075.7A priority Critical patent/CN107544844A/en
Publication of CN107544844A publication Critical patent/CN107544844A/en
Withdrawn legal-status Critical Current


Abstract

The invention discloses a method and device for improving Spark operating efficiency, relating to the field of big data analysis and processing. The method includes: determining the tables in the system that need to be cached; identifying the cache tasks that take the identified tables as input or output; grouping the identified cache tasks and creating a processing process for each cache task group; and, according to the current state of each processing process and the real-time usage of Spark cluster resources, combining the cache tasks to be submitted and sending them to the Spark cluster for processing. Embodiments of the present invention can make full use of the system resources of the cluster, reasonably determine the tables and content to be cached, and dynamically decide on and schedule processes, increasing parallelism to the greatest extent permitted by the available resources and thereby improving Spark operating efficiency.

Description

A method and device for improving Spark operating efficiency
Technical field
The present invention relates to the field of big data analysis and processing, and in particular to a method and device for improving Spark operating efficiency.
Background technology
With the development of informatization, the data to be processed by enterprises has grown explosively, with data volumes reaching the terabyte (TB) and even petabyte (PB) level. To support the analysis and processing of data at this scale, various big data frameworks, tools and technologies have emerged, Spark being one of them.
Spark is a big data processing framework built around speed, ease of use and sophisticated analytics. By using a lower-cost shuffle in the data processing pipeline, it lifts the MapReduce ("map-reduce") model to a higher level, and with in-memory data storage and near-real-time processing, its performance is many times faster than other big data processing technologies. Spark SQL is a Spark component that supports the Structured Query Language (SQL) standard; thanks to its ease of use and high performance, and by drawing on users' familiarity with traditional SQL and its ecosystem, it is widely favored by users.
The scheduling core of Spark SQL is Catalyst, the functional relational query optimization framework in Spark SQL, which performs query optimization while translating SQL into the final execution plan. In addition, Spark SQL can convert a cached table (cache table) to columnar storage and load the data into memory for caching, greatly reducing the memory footprint of cached data, the amount of data transmitted over the network, and input/output (I/O) overhead. Columnar storage also keeps data of the same type contiguous, so serialization and compression can further reduce memory usage.
Although Spark SQL has a powerful optimizer and supports columnar storage, memory caching and storage compression via cache table, these optimizations all act during Spark SQL execution, i.e. after tasks have been submitted to Spark. If task submission is not rationalized according to actual conditions, the advantages of Spark SQL cannot be fully exploited, problems inevitably arise, and the efficiency gains of Spark SQL's intrinsic advantages are offset:
1. The system has sufficient memory and CPU resources, but they cannot be fully used, for example because tasks are submitted serially or the degree of parallelism is insufficient.
2. The timing of caching and releasing a cache table is hard to get right.
Caching too early or releasing too late means system resources (especially memory) cannot be released in time. A cache table is mainly used to cache intermediate results; its characteristics are that the data volume is small and the data is frequently used by subsequent computations (SQL). Once an intermediate result is no longer needed, the uncache command should be issued immediately to free the cache space for other data.
3. Data that should be cached is not cached, data that should not be cached is cached, or a full table is cached when only some of its columns are needed.
4. Cached data is not fully reused, causing the same data to be cached repeatedly.
5. The essence of a distributed computing system is to move computation rather than data, yet in actual computation data movement still occurs, unless copies of the data are kept on all nodes of the cluster. Moving data, i.e. transferring it from one node to another for computation, consumes not only network I/O but also disk I/O, reducing the efficiency of the whole computation.
Summary of the invention
The technical problem solved by the technical solutions provided in the embodiments of the present invention is how to improve Spark operating efficiency.
A method for improving Spark operating efficiency provided according to an embodiment of the present invention includes:
Determining the tables in the system that need to be cached;
Identifying the cache tasks that take the identified tables to be cached as input or output;
Grouping the identified cache tasks, and creating processing processes for the corresponding cache task groups;
According to the current state of each processing process and the real-time usage of Spark cluster resources, combining the cache tasks to be submitted and sending them to the Spark cluster for processing.
Preferably, the step of determining the tables in the system that need to be cached includes:
Determining the tables that need to be cached according to the out-degree of a table, the number of records per cache, and the ready-time differences between multiple cache tasks on the table; and/or
Determining tables of a user-defined cache type as tables that need to be cached.
Preferably, the step of grouping the identified cache tasks includes:
Taking the identified cache tasks as objects, establishing a directed acyclic graph of the cache tasks;
According to the directed acyclic graph, assigning cache tasks that are related to each other through tables that need to be cached to the same cache task group.
Preferably, the step of creating processing processes for the corresponding cache task groups includes:
If the grouping of the identified cache tasks is the first grouping, creating a corresponding processing process for each cache task group obtained by this grouping;
If the grouping of the identified cache tasks is not the first grouping, obtaining the set of historical cache task groups, determining, from the relationship between the set of cache task groups obtained by this grouping and the set of historical cache task groups, the cache task groups for which processing processes need to be created, and creating the corresponding processing processes.
Preferably, the method further includes:
Determining, from the relationship between the cache task groups obtained by this grouping and the historical cache task groups, the cache task groups whose processing processes need to be cancelled, and cancelling the corresponding processing processes.
Preferably, combining the cache tasks to be submitted according to the current state of each processing process and the real-time usage of Spark cluster resources and sending them to the Spark cluster for processing includes:
If a processing process is in the to-be-launched state, determining the available resources of the process according to the real-time usage of the Spark cluster resources and the resource requirements of the processing process;
If the processing process is in the ready state, combining the cache tasks to be submitted according to the available resources of the processing process and the priorities and resource requirements of the tasks, and sending them via the message channel corresponding to the processing process to the Spark cluster for processing.
Preferably, the method further includes:
If a processing process is in the not-ready state, doing nothing;
If a processing process is in the cancelled, abnormal or completed state, releasing the resources it occupies.
A storage medium provided according to an embodiment of the present invention stores a program for implementing the above method for improving Spark operating efficiency.
A device for improving Spark operating efficiency provided according to an embodiment of the present invention includes:
A cache table identification module, for determining the tables in the system that need to be cached;
A cache task identification module, for identifying the cache tasks that take the identified tables to be cached as input or output;
A grouping and process management module, for grouping the identified cache tasks and creating processing processes for the corresponding cache task groups;
A cache task submission module, for combining the cache tasks to be submitted according to the current state of each processing process and the real-time usage of Spark cluster resources, and sending them to the Spark cluster for processing.
Preferably, the cache table identification module determines the tables that need to be cached according to the out-degree of a table, the number of records per cache, and the ready-time differences between multiple cache tasks on the table, and/or determines tables of a user-defined cache type as tables that need to be cached.
Preferably, the grouping and process management module takes the identified cache tasks as objects, establishes a directed acyclic graph of the cache tasks and, according to the directed acyclic graph, assigns cache tasks that are related to each other through tables that need to be cached to the same cache task group.
Preferably, if the grouping of the identified cache tasks is the first grouping, the grouping and process management module creates a corresponding processing process for each cache task group obtained by this grouping; otherwise the grouping and process management module obtains the set of historical cache task groups, determines, from the relationship between the set of cache task groups obtained by this grouping and the set of historical cache task groups, the cache task groups for which processing processes need to be created, and creates the corresponding processing processes.
Preferably, the grouping and process management module is further used to determine, from the relationship between the cache task groups obtained by this grouping and the historical cache task groups, the cache task groups whose processing processes need to be cancelled, and to cancel the corresponding processing processes.
Preferably, when a processing process is in the to-be-launched state, the cache task submission module determines the available resources of the process according to the real-time usage of the Spark cluster resources and the resource requirements of the processing process; when the processing process is in the ready state, it combines the cache tasks to be submitted according to the available resources of the processing process and the priorities and resource requirements of the tasks, and sends them via the message channel corresponding to the processing process to the Spark cluster for processing.
Preferably, the cache task submission module does nothing when a processing process is in the not-ready state, and releases the resources occupied by a processing process when it is in the cancelled, abnormal or completed state.
A server provided according to an embodiment of the present invention includes the above device for improving Spark operating efficiency.
The technical solutions provided in the embodiments of the present invention have the following beneficial effects:
Embodiments of the present invention can make full use of the system resources of the cluster, reasonably determine the tables and content to be cached, and dynamically decide on and schedule processes, increasing parallelism to the greatest extent permitted by the available resources and thereby improving Spark operating efficiency.
Brief description of the drawings
Fig. 1 is a block diagram of the method for improving Spark efficiency provided in an embodiment of the present invention;
Fig. 2 is a block diagram of the device for improving Spark efficiency provided in an embodiment of the present invention;
Fig. 3 is a basic structural diagram of improving Spark efficiency provided in an embodiment of the present invention;
Fig. 4 is an overall execution flowchart of improving Spark efficiency provided in an embodiment of the present invention;
Fig. 5 is a cache grouping update flowchart provided in an embodiment of the present invention;
Fig. 6 is a flow block diagram of improving Spark efficiency provided in an embodiment of the present invention.
Detailed description of the embodiments
A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings. It should be understood that the preferred embodiments described below are merely intended to illustrate and explain the present invention, not to limit it.
Fig. 1 is a block diagram of the method for improving Spark efficiency provided in an embodiment of the present invention. As shown in Fig. 1, the steps include:
Step S101: Determine the tables in the system that need to be cached.
The tables that need to be cached are determined according to the out-degree of a table, the number of records per cache, and the ready-time differences between multiple cache tasks on the table; and/or tables of a user-defined cache type are determined as tables that need to be cached.
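The selection logic of step S101 can be sketched as follows. This is an illustrative Python sketch only: the embodiment names the signals (out-degree, records per cache, ready-time gaps, user-defined cache type) but no concrete thresholds, so the threshold values, field names and function name below are all assumptions.

```python
# Hypothetical cache-table selection heuristic; thresholds are assumed,
# not taken from the patent.
def needs_cache(table, user_defined_cache_tables,
                min_out_degree=2, max_records=1_000_000, max_ready_gap_s=600):
    """table: dict with keys 'name', 'out_degree', 'records',
    'ready_times' (times at which consuming cache tasks become ready)."""
    if table["name"] in user_defined_cache_tables:       # manual customization
        return True
    # Gaps between consecutive ready times of the table's cache tasks.
    gaps = [b - a for a, b in zip(table["ready_times"], table["ready_times"][1:])]
    return (table["out_degree"] >= min_out_degree        # reused by several tasks
            and table["records"] <= max_records          # small enough to cache
            and (not gaps or max(gaps) <= max_ready_gap_s))  # consumers close in time

t = {"name": "mid_result", "out_degree": 3, "records": 50_000,
     "ready_times": [0, 30, 60]}
print(needs_cache(t, set()))  # → True
```

The and/or structure of the claim is reflected in the early return for user-defined tables: manual customization overrides the automatic heuristic.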
Step S102: Identify the cache tasks that take the identified tables to be cached (referred to below as cache tables) as input or output.
Step S103: Group the identified cache tasks, and create processing processes for the corresponding cache task groups.
During grouping, the identified cache tasks are taken as objects and a directed acyclic graph of the cache tasks is established; according to this graph, cache tasks related to each other through cache tables are assigned to the same cache task group. That is, related cache tasks are put in one group, and unrelated cache tasks in separate groups.
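The grouping rule of step S103 amounts to finding connected components of the task/cache-table relation: tasks with a direct or indirect relationship through a cache table land in one group. A minimal sketch, with illustrative task and table names (the patent builds a full DAG; for grouping alone, the undirected connectivity shown here suffices):

```python
from collections import defaultdict

def group_cache_tasks(tasks):
    """tasks: dict task_name -> set of cache tables it reads or writes.
    Returns a list of groups, each a set of related task names."""
    table_to_tasks = defaultdict(set)
    for task, tables in tasks.items():
        for tbl in tables:
            table_to_tasks[tbl].add(task)
    groups, seen = [], set()
    for start in tasks:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                      # flood-fill one connected component
            t = stack.pop()
            if t in comp:
                continue
            comp.add(t)
            for tbl in tasks[t]:
                stack.extend(table_to_tasks[tbl] - comp)
        seen |= comp
        groups.append(comp)
    return groups

tasks = {"t1": {"A"}, "t2": {"A", "B"}, "t3": {"B"}, "t4": {"C"}}
print(sorted(sorted(g) for g in group_cache_tasks(tasks)))
# → [['t1', 't2', 't3'], ['t4']]
```

Here t1 and t3 share no table directly but are indirectly related through t2, so all three form one group, while t4 is unrelated and gets its own group, matching the "direct or indirect relationship" rule.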
When creating processing processes: if the grouping of the identified cache tasks is the first grouping, there are no historical groups, and a corresponding processing process is created for each cache task group obtained by this grouping. If the grouping is not the first grouping, historical groups exist. This scenario occurs when cache tasks change (e.g. are added, deleted or modified); changes to cache tasks may alter the relationships between them and thus affect the grouping. In this case the set of historical cache task groups is obtained, and from the relationship between the set of cache task groups obtained by this grouping and the set of historical cache task groups, the cache task groups for which processing processes need to be created are determined and the corresponding processing processes are created, while the cache task groups whose processing processes need to be cancelled are determined and the corresponding processing processes are cancelled.
Step S104: According to the current state of each processing process and the real-time usage of Spark cluster resources, combine the cache tasks to be submitted and send them to the Spark cluster for processing.
When a cache task in the waiting queue meets the ready condition, it is added to the ready task queue of the processing process it belongs to.
If there are cache tasks in the ready task queue of a processing process but the message system between it and the Spark cluster has not been started and the Spark context of the Spark cluster has not been started, the processing process is in the to-be-launched state; the available resources of the process are then determined from the real-time usage of the Spark cluster resources, obtained by monitoring, and the resource requirements of the processing process. If there are cache tasks in the ready task queue of the processing process and the message system between it and the Spark cluster has been started and the Spark context of the Spark cluster has been started, the processing process is in the ready state; for a processing process in the ready state, the cache tasks to be submitted are combined according to the available resources of the processing process and the priorities and resource requirements of the tasks, and sent via the message channel corresponding to the processing process to the Spark cluster for processing.
Further, if there are no cache tasks in the ready task queue of a processing process, the message system between it and the Spark cluster has not been started and the Spark context of the Spark cluster has not been started, the processing process is in the not-ready state and nothing is done.
Further, if a processing process is in the cancelled, abnormal or completed state, the resources it occupies are released.
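The per-state handling of step S104 can be summarized as a dispatch over process states. The sketch below is a paraphrase of the text only: the state names, the placeholder resource allocation, and the batching rule are all illustrative assumptions, not a real Spark API.

```python
from enum import Enum, auto

class ProcState(Enum):
    NOT_READY = auto(); TO_BE_LAUNCHED = auto(); READY = auto()
    CANCELLED = auto(); ABNORMAL = auto(); COMPLETED = auto()

def handle(proc):
    """proc: dict with 'state', 'ready_tasks' (highest priority first),
    'available_slots'. Returns the action taken, for illustration."""
    s = proc["state"]
    if s is ProcState.NOT_READY:
        return "noop"                    # not ready: do nothing
    if s is ProcState.TO_BE_LAUNCHED:
        # determine available resources from cluster usage + process demand
        proc["available_slots"] = 4      # placeholder allocation
        return "allocated"
    if s is ProcState.READY:
        # combine as many ready tasks as the available resources allow
        batch = proc["ready_tasks"][:proc["available_slots"]]
        return ("submit", batch)
    # CANCELLED / ABNORMAL / COMPLETED: release occupied resources
    proc["available_slots"] = 0
    return "released"

p = {"state": ProcState.READY, "ready_tasks": ["c1", "c2", "c3"], "available_slots": 2}
print(handle(p))  # → ('submit', ['c1', 'c2'])
```

The ready-state branch is where tasks are combined: with two available slots, only the two highest-priority ready cache tasks are submitted together, matching the combined (concurrent) submission described above.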
It should be noted that if resources do not permit combined submission, or an independent submission mode and strategy is required, cache tasks can also be submitted individually rather than in combination (i.e. rather than in the concurrent execution mode).
It should be noted that, according to the resource usage of each processing process, it can be determined whether tasks within a processing process execute concurrently, i.e. whether multiple tasks in the processing process are combined before being sent.
It should be noted that, according to the resource usage of the Spark cluster, it can be determined whether multiple processes execute concurrently, i.e. whether the tasks of multiple processing processes are sent simultaneously.
The present embodiment can solve the problems of insufficient use of cluster resources, inability to use cache tables reasonably and to determine the content to be cached, imprecise caching and release timing, and unnecessary data movement in the cluster. The present embodiment makes full use of the advantages of Spark SQL such as columnar storage, memory caching and storage compression; supports user-defined tasks, algorithms and resource requirements; combines intelligent identification with manual customization to determine the table objects that need to be cached; reasonably groups all related tasks according to cache table dependencies and allocates a processing process to each group; dynamically decides on task submission according to data readiness; determines whether multiple processes execute concurrently according to the resource usage of the Spark cluster and whether tasks within a process execute concurrently according to the resource usage of each process; and also handles process exceptions, cache timeouts and the like. Further, within the limits of the available resources, it increases parallelism to the greatest extent, achieving the effect of improving Spark operating efficiency.
Those skilled in the art will appreciate that all or some of the steps in the above method embodiment can be accomplished by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium and, when executed, includes steps S101 to S104. The storage medium can be a ROM/RAM, a magnetic disk, an optical disc, etc.
Fig. 2 is a block diagram of the device for improving Spark efficiency provided in an embodiment of the present invention. As shown in Fig. 2, the device includes:
A cache table identification module 10, for determining the tables in the system that need to be cached. The cache table identification module 10 determines the tables that need to be cached according to the out-degree of a table, the number of records per cache, and the ready-time differences between multiple cache tasks on the table, and/or determines tables of a user-defined cache type as tables that need to be cached. That is, the cache table identification module 10 can determine the tables that need to be cached through automatic intelligent analysis and/or manual customization.
A cache task identification module 20, for identifying the cache tasks that take the identified tables to be cached as input or output.
A grouping and process management module 30, for grouping the identified cache tasks and creating processing processes for the corresponding cache task groups. The grouping and process management module 30 takes the identified cache tasks as objects, establishes a directed acyclic graph of the cache tasks and, according to the graph, assigns cache tasks that are related to each other through tables that need to be cached to the same cache task group. If the grouping of the identified cache tasks is the first grouping, there are no historical groups, and the grouping and process management module 30 creates a corresponding processing process for each cache task group obtained by this grouping; otherwise historical groups exist. This scenario occurs when cache tasks change (e.g. are added, deleted or modified); changes to cache tasks may alter the relationships between them and thus affect the grouping. In this case the grouping and process management module 30 obtains the set of historical cache task groups and, from the relationship between the set of cache task groups obtained by this grouping and the set of historical cache task groups, determines the cache task groups for which processing processes need to be created and creates the corresponding processing processes, while determining the cache task groups whose processing processes need to be cancelled and cancelling the corresponding processing processes.
A cache task submission module 40, for combining the cache tasks to be submitted according to the current state of each processing process and the real-time usage of Spark cluster resources, and sending them to the Spark cluster for processing. When a processing process is in the to-be-launched state, i.e. there are cache tasks in its ready task queue but the message system between it and the Spark cluster has not been started and the Spark context of the Spark cluster has not been started, the cache task submission module 40 determines the available resources of the process according to the real-time usage of the Spark cluster resources and the resource requirements of the processing process. When a processing process is in the ready state, i.e. there are cache tasks in its ready task queue and the message system between it and the Spark cluster has been started and the Spark context of the Spark cluster has been started, it combines the cache tasks to be submitted according to the available resources of the processing process and the priorities and resource requirements of the tasks, and sends them via the message channel corresponding to the processing process to the Spark cluster for processing. If resources do not permit combined submission, or an independent submission mode and strategy is required, the grouping and process management module 30 can also submit cache tasks individually.
Further, when a processing process is in the not-ready state, i.e. there are no cache tasks in its ready task queue, the message system between it and the Spark cluster has not been started and the Spark context of the Spark cluster has not been started, the cache task submission module 40 does nothing; when a processing process is in the cancelled, abnormal or completed state, it releases the resources the process occupies.
It should be noted that if the resources of the Spark cluster are sufficient, multiple processing processes can obtain resources, and these processes can then send tasks simultaneously; within each processing process, multiple tasks in its ready task queue can be combined according to the available resources of the processing process and the priorities and resource requirements of the tasks, achieving task concurrency within a process.
The present embodiment makes full use of the advantages of Spark SQL to improve Spark operating efficiency. It takes into account factors such as available system resources, actual task demands, cache and uncache (release) timing, and data movement; plans and decides for each task; and supports multi-process execution and concurrency within a process.
An embodiment of the present invention further provides a server including the above device for improving Spark operating efficiency.
Fig. 3 is a basic structural diagram of improving Spark efficiency provided in an embodiment of the present invention. As shown in Fig. 3, the local side has a base support structure including a cache table identification function (equivalent to the function of the cache table identification module 10) and a cache task grouping function (equivalent to the functions of the cache task identification module 20 and the grouping and process management module 30), as well as a logical processing structure with a task scheduling mechanism, a Spark execution mechanism and multiple task submission instances (equivalent to the cache task submission module 40). The remote Spark cluster (i.e. the cluster) has an adaptation layer including an adapter and multiple SparkContexts. The local side and the cluster are connected by a message system with a message sender and a message receiver (e.g. an Akka actor message system).
The workflow includes the following steps:
1. Determine the table objects that need to be cached.
These objects (the tables that need to be cached) are mainly identified automatically and intelligently by the system according to the out-degree of a table, the data volume of the table, and the readiness of its out-degree tasks; for flexibility and the specific demands of some scenarios (e.g. verification), manual customization through configuration is also supported.
2. Identify the cache tasks in the system and group them, determining a group key.
Cache tasks are the tasks that take cache tables as input or output. All cache tasks in the system are identified, and a simplified directed acyclic graph (Directed Acyclic Graph, DAG) is established according to the data dependencies (i.e. the relationships between cache tables); this graph contains only cache tables and no non-cache tables. The cache tasks are grouped based on the DAG: tasks with direct or indirect relationships are assigned to the same group, and tasks without any relationship are assigned to different groups. The set of all cache tables of a group is taken as its group key, uniquely identifying that cache group.
3. Create an independent processing process for each cache task group.
Each process owns its own local-application-to-remote-Spark-cluster message channel, resource requirements and resource allocation snapshot, ready task queue, cache table queue, task submission instance, etc.
4. The task scheduling mechanism periodically checks the ready status of each cache task and identifies the processing process of each ready task.
For a ready cache task, the set of all cache tables in its input and output is obtained; the process whose group key contains this set is the processing process of that cache task, and the task is added to the ready task queue of that processing process.
5. The Spark task execution mechanism (i.e. the Spark execution mechanism) periodically checks the state of each process and triggers different handling flows according to the state.
6. The Spark task submission instance of a process (i.e. the task submission instance) combines the cache tasks to be submitted to the Spark cluster and submits them to the Spark cluster for subsequent processing.
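Workflow step 4 routes each ready task by a containment test on group keys: the owning process is the one whose group key (the set of all cache tables of the group) contains the task's own cache tables. A small sketch with illustrative process and table names:

```python
# Hypothetical routing of a ready cache task to its processing process.
def find_process(task_tables, group_keys):
    """task_tables: frozenset of cache tables the task reads/writes.
    group_keys: dict process_id -> frozenset of the group's cache tables."""
    for pid, key in group_keys.items():
        if task_tables <= key:           # group key contains the task's set
            return pid
    return None                          # no owning process yet

keys = {"proc_1": frozenset({"A", "B"}), "proc_2": frozenset({"C"})}
print(find_process(frozenset({"B"}), keys))   # → proc_1
print(find_process(frozenset({"C"}), keys))   # → proc_2
```

Because the grouping step puts every related task in one group, the subset test can match at most one group key, so the group key does act as a unique identifier for routing.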
The present embodiment makes full use of the advantages of Spark SQL such as columnar storage, memory caching and storage compression; supports user-defined tasks, algorithms and resource requirements; combines intelligent identification with manual customization to determine the table objects that need to be cached; reasonably groups all related tasks according to cache table dependencies and allocates a processing process to each group; dynamically decides on task submission according to data readiness; determines whether multiple processes execute concurrently according to the resource usage of the Spark cluster and whether tasks within a process execute concurrently according to the resource usage of each process; and also handles process exceptions, cache timeouts and the like.
Fig. 4 is the overall execution flowchart of improving Spark efficiency provided in an embodiment of the present invention. As shown in Fig. 4, the steps include:
Step S200: Start the flow.
Step S201: Identify cache table objects and cache tasks.
Step S202: The cache tasks enter the scheduling mechanism.
Step S203: Establish the DAG, group the cache tasks based on the DAG, and allocate a processing process and process parameters to each group.
Step S204: The task scheduling mechanism periodically scans for ready tasks.
Step S205: If a cache task is ready, identify the processing process the ready cache task belongs to, and add the cache task to the ready task queue of that processing process.
Step S206: The Spark execution mechanism scans all processing processes and, according to the state of each processing process, triggers the corresponding task submission instance to perform the corresponding operation.
The states of a processing process include cancelled, not ready, to be launched, ready, abnormal and completed.
Step S207: The task submission instance of a processing process combines the cache tasks and submits them to the Spark cluster for processing.
Step S208: All processes have been handled; end the flow.
Fig. 5 is a flowchart of cache group updating according to an embodiment of the present invention. As shown in Fig. 5, the steps include:
Step S301: Start the flow.
Step S302: Group the cache tasks based on the DAG graph to obtain a set K1 containing multiple cache task groups, and take the set of all cache tables of each cache task group as the group key of that group, which uniquely identifies the cache task group.
Step S303: Determine whether historical groups exist; if so, perform step S304, otherwise perform step S308.
Step S304: Obtain the set K2 containing the historical cache task groups. The part of K2 that does not overlap with K1 is handled according to step S305, and the overlapping part, i.e. the intersection of K1 and K2, is handled according to step S309.
Step S305: For the groups in K2 that do not overlap with K1: for groups whose group keys have been split or merged in K1, set the corresponding processing processes to the cancelled state; for the remaining groups, update the group key if necessary (if a group has added new tasks, the newly added cache tables need to be added to the corresponding group key).
For example, suppose K2 contains group key1, group key2 and group key3, while K1 contains group key a, group key b and group key c, where key a in K1 is obtained by merging group key1 and group key2, and group key b and group key c are obtained by splitting group key3. In this case, the processing processes corresponding to group key1, group key2 and group key3 in K2 are cancelled.
Step S306: Obtain the group key set K2' of all processing processes in K2 that are not in the cancelled state.
Step S307: Obtain the difference set of K1 and K2'. For each element in the difference set, i.e. each cache task group that belongs to K1 but not to K2', create a processing process and plan the process parameters.
Step S308: If no historical groups exist, create a corresponding processing process for each cache task group in K1, and plan the process parameters.
Step S309: The intersection, i.e. the overlapping part of K1 and K2, requires no processing.
Step S310: End the flow.
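The group-update decision of steps S302 to S309 can be sketched as pure set arithmetic over group keys. The following is a minimal illustration, not the patent's implementation: it simplifies step S305 by cancelling every historical group whose key no longer appears unchanged in K1 (which covers the split and merge cases); the function and field names are assumptions.

```python
def plan_group_update(k1, k2):
    """Decide which cache processes to keep, cancel and create.

    k1: new grouping, a set of frozenset group keys (each key is the set
        of cache tables of one cache task group).
    k2: historical grouping, same shape.
    """
    overlap = k1 & k2        # step S309: intersection needs no processing
    cancel = k2 - k1         # step S305 (simplified): stale keys -> cancel
    k2_prime = k2 - cancel   # step S306: keys of surviving processes
    create = k1 - k2_prime   # step S307: difference set -> create processes
    return {"keep": overlap, "cancel": cancel, "create": create}
```

Applied to the example above (key1 and key2 merged into key a, key3 split into key b and key c), all three historical processes are cancelled and three new ones are created, matching step S305.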
Fig. 6 is a flow block diagram of improving Spark efficiency according to an embodiment of the present invention. As shown in Fig. 6, the steps include:
Step S401: Determine the table objects that need to be cached.
The table objects that need to be cached have two sources: automatic intelligent analysis and identification, and manual customization through configuration.
1.1. Automatic intelligent analysis and identification.
A table that simultaneously satisfies the following three criteria is a table object that needs to be cached. The constants used are empirical values determined by combining theory with practical verification; using the empirical values gives the best effect. The empirical values may be modified, and modification affects the effect to a certain degree.
(1) The out-degree of the table > 3.
Taking all tasks in the system as objects, a DAG graph is created according to the data dependencies.
The DAG graph contains table nodes and task nodes, wherein:
Table node: contains an input task set and an output task set. The number of elements in the input task set is the in-degree of the table, and the number of elements in the output task set is the out-degree of the table.
Task node: contains an input table set and an output table set. The number of elements in the input table set is the in-degree of the task, and the number of elements in the output table set is the out-degree of the task.
The DAG graph contains two classes of relations, table-task and task-table, wherein:
Table-task: the mapping relation from a table key to a table node.
Task-table: the mapping relation from a task key to a task node.
Taking all table nodes as analysis objects, the table nodes whose out-degree is greater than 3 are filtered out; the tables corresponding to these table nodes are the candidate table objects that need to be cached (i.e. the candidate cache table objects).
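The out-degree filter of criterion (1) can be sketched as follows. This is an illustrative reading of the DAG definition above, not the patent's code: each task is described by its input and output table sets, a table's output task set is the set of tasks that read it, and its size is the table's out-degree; all names are assumptions.

```python
from collections import defaultdict

def candidate_cache_tables(tasks, min_out_degree=3):
    """Return the tables whose out-degree exceeds min_out_degree.

    tasks: mapping task_key -> (input_tables, output_tables).
    The out-degree of a table is the number of tasks that take it as
    input, i.e. the size of the table node's output task set.
    """
    readers = defaultdict(set)  # table -> tasks that read it
    for task, (inputs, _outputs) in tasks.items():
        for table in inputs:
            readers[table].add(task)
    return {t for t, tasks_of_t in readers.items()
            if len(tasks_of_t) > min_out_degree}
```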
(2) The number of records cached in a single pass > 10,000,000.
The candidate cache table objects filtered out by criterion (1) are analyzed, and the tables whose single-pass cache record count exceeds 10,000,000 are found, narrowing the range of candidate table objects.
(3) The ready time difference among the multiple tasks depending on the same cached table <= 1 hour.
Based on the candidate cache table objects further determined by criterion (2), the tables whose multiple output tasks have ready time differences within 1 hour are filtered out; these are the final table objects that need to be cached as determined by automatic intelligent analysis.
Calculation of the task ready time difference: based on one month of historical task ready time difference data for the cached tables, a regression model is established using linear regression analysis, and the ready time difference of the tasks is predicted.
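The patent does not spell out the regression features, so the following is only a minimal sketch under the assumption that the ready time difference is regressed against the day index of the one-month history; the function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit y = a + b*x, in pure Python."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def predict_ready_diff(history_minutes):
    """Predict the next ready-time difference (e.g. in minutes) from a
    chronological history of observed differences."""
    xs = list(range(len(history_minutes)))
    a, b = fit_line(xs, history_minutes)
    return a + b * len(history_minutes)
```

The predicted value would then be compared against the 1-hour threshold of criterion (3).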
1.2. Manual customization through configuration.
In order to flexibly support the requirements of some special user scenarios and manual intervention, a configuration interface is provided through which users can define by themselves the cached tables, the specific caching mode, the algorithm, and so on. The tables directly defined by users as cache types are the other source of table objects that need to be cached.
Step S402: Based on the cache table set determined in step S401, identify and group the cache tasks in the system, and determine the group keys.
This specifically comprises the following steps:
Step S4021: A cache task is a task that takes a cached table as input or output; all such tasks in the current system are identified.
Step S4022: Taking all cache tasks from step S4021 as objects, a DAG graph is established according to the data dependency relations. This graph is simplified: the nodes and relations involving tables only include cached tables and do not include non-cached tables.
Step S4023: The cache tasks are grouped based on the simplified cache task DAG graph (i.e. the DAG graph); tasks with direct or indirect association relations are assigned to the same group, and tasks without any association relation are assigned to different groups.
Step S4024: The set of all cached tables of each group is taken as the group key, which uniquely identifies one cache group.
Step S4025: When the cache tasks of the system change (addition, removal, modification), the cache task DAG graph is updated.
Step S4026: If the cache task DAG graph (i.e. the DAG graph) changes (except for the scenario where a task is removed from the system because its execution has completed), steps S4023 and S4024 are re-executed.
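Steps S4023 and S4024 amount to finding connected components of the simplified graph and keying each component by its table set. The following is one possible realization (union-find over tasks that share a table); the patent does not mandate this data structure, and all names are illustrative.

```python
from collections import defaultdict

def group_cache_tasks(cache_tasks):
    """Group cache tasks into connected components of the simplified DAG.

    cache_tasks: mapping task -> set of cache tables it reads or writes.
    Tasks that share a table, directly or transitively, land in one group;
    the group key is the frozenset of all cache tables of the group.
    """
    parent = {task: task for task in cache_tasks}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    first_task_for_table = {}
    for task, tables in cache_tasks.items():
        for table in tables:
            if table in first_task_for_table:
                union(task, first_task_for_table[table])
            else:
                first_task_for_table[table] = task

    members_by_root = defaultdict(set)
    for task in cache_tasks:
        members_by_root[find(task)].add(task)
    return {frozenset(t for m in members for t in cache_tasks[m]): members
            for members in members_by_root.values()}
```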
Step S403: According to the grouping result of the cache task DAG graph (i.e. the DAG graph), perform the corresponding cache process (i.e. the processing process of a cache task group) creation or cancellation operation; for a newly created process, also determine its resource requirement and priority.
Cache process (i.e. the processing process of a cache task group) creation or cancellation distinguishes two scenarios, described in detail in Fig. 5:
(1) For initial grouping, a new cache processing process is created directly for each cache group (i.e. cache task group), with the group key of the group as the group key of the process.
(2) For group updating, whether an original cache process is retained or cancelled is determined according to the relation between each updated cache group key and the group keys of the existing cache processes; if there are newly added cache groups, new cache processes are additionally created for them.
The cache process creation of step S403 is not process creation in the true sense; it merely prepares some parameters necessary for the normal operation of a cache process, mainly including:
(1) Process id: uniquely identifies a cache process.
(2) Process priority: determines resource allocation and process scheduling.
(3) Process state: normal state or cancelled state; the default is the normal state. When the cache groups change, some cache processes may need to be cancelled.
(4) Information of the akka actor message system: the IP and port of the local message sender; the IP, port, username and password of the remote message receiver (the username and password are mainly used to create a session); and the open/closed state of the remote message receiver session, where the default is open.
(5) Resource requirement percentage: the system resources needed by the current cache process, including cores and memory.
(6) Cache group key: the set of all cached tables contained in the cache group, determined based on analysis of the cache task DAG graph.
(7) The submission instance responsible for submitting cache tasks to the Spark cluster.
Process cancellation in step S403 merely sets the cache process state to the cancelled state; the actual cancellation operation is performed later, during the periodic scheduling of the Spark executor.
Step S404: The task scheduling mechanism periodically wakes up an independent thread to scan whether the cache tasks in the waiting queue satisfy the ready condition. For a ready cache task, the set of all cached tables in its input and output is obtained; the cache process whose group key contains this set is the processing process of that cache task, and the task is added to the ready task queue of that processing process. Otherwise, the task waits to be scanned and judged again in a subsequent polling cycle.
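The routing rule of step S404 — find the process whose group key contains the task's table set — can be sketched as follows. This is an assumed simplification in which a process is just its group key plus a ready queue; by construction of the group keys (step S4024) at most one process can match.

```python
def dispatch_ready_task(task_tables, processes):
    """Route a ready cache task to its owning processing process.

    task_tables: set of cache tables in the task's input and output.
    processes: mapping group_key (frozenset of tables) -> ready task
    queue (a list).  Returns the matched group key, or None when no
    process owns the task yet.
    """
    for group_key, ready_queue in processes.items():
        if task_tables <= group_key:       # group key contains the set
            ready_queue.append(task_tables)
            return group_key
    return None
```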
Step S405: The Spark task execution mechanism periodically wakes up an independent thread to scan all current cache processes.
(1) If the current cache process is in the cancelled state, the actual cancellation operation is performed: uncache (i.e. release) the tables already cached in the process and reset their states; empty the ready task queue and return its tasks to scheduling; stop the message system and the SparkContext; remove the process and release its resources.
(2) If the current cache process is in the normal state, then:
A. The process is not ready: there is no pending task or cached table in the process, and the message system and SparkContext have not been started. No processing is performed; the process simply waits for the subsequent scan cycle of the thread.
B. The process is to be launched: there are pending tasks or cached tables in the process, but the message system and SparkContext have not been started. Whether the system resources are sufficient is judged according to the resource requirement of the process; if so, resources are applied for, the message system and SparkContext are started, their health status is checked, and the process is prepared for task execution.
C. The process is ready: there are pending tasks or cached tables in the process, the message system and SparkContext have been started, and the health check is normal. According to the available resources of the process, the priority of the tasks, the resource requirements of the tasks, the submission mode of the tasks (whether they must be submitted individually), and the task submission strategy of the system (individual submission, grouped submission, etc.), a task list to be submitted is organized and sent to the Spark task submitter corresponding to the process for submission.
D. The process is abnormal: there are pending tasks or cached tables in the process, the message system and SparkContext have been started, but the health check is abnormal. Uncache the tables already cached in the process and reset their states; empty the ready task queue and return its tasks to scheduling; stop the message system and SparkContext and release the resources. This handling is substantially similar to cancelling the process, except that the process is not removed; the subsequent scan cycle performs the corresponding handling again according to the specific state of the process.
E. The process is completed: all tasks and cached tables in the process have been processed. Stop the akka actor message system and the SparkContext, and remove the process.
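The per-process dispatch of step S405 is a small state machine. The following sketch labels the branches with the behaviors (1) and (2)A-E above; the state names and action strings are illustrative labels, not API names from the patent.

```python
def scan_process(state, resources_ok=True):
    """Return the action taken for one cache process during one scan
    cycle of the Spark task execution mechanism (step S405)."""
    if state == "cancelled":
        # (1) perform the actual cancellation
        return "uncache tables, stop context, remove process"
    if state == "not_ready":
        # A. nothing pending: wait quietly for the next scan cycle
        return "wait"
    if state == "to_launch":
        # B. start the message system and SparkContext only if resources suffice
        return "start context" if resources_ok else "wait"
    if state == "ready":
        # C. organize a task list and hand it to the task submitter
        return "submit tasks"
    if state == "abnormal":
        # D. like cancellation, but the process itself is kept for rescanning
        return "uncache tables, stop context, keep process"
    if state == "completed":
        # E. everything processed: tear down and remove
        return "stop context, remove process"
    raise ValueError("unknown state: " + state)
```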
Step S406: When the Spark task submission instance of a cache process receives a new cache task group sent by the Spark task execution mechanism, it starts an independent thread to perform the task group submission. Specifically, it generates the parameter file supporting the execution of the task group and uploads it to the Spark cluster environment of the process; submits the tasks to the SparkContext through the message system; analyzes the result and returns the execution result after receiving normal feedback from the remote message system; and starts the corresponding exception protection flow when the remote message system is abnormal, the SparkContext is abnormal, or the task execution times out.
The present embodiment is an efficiency optimization based on the Spark computing framework, and can make full use of the advantages of Spark SQL such as columnar storage, in-memory caching and storage compression.
The present embodiment combines an automatic decision framework with manual customization and intervention, and improves the utilization of the various cluster resources from multiple angles such as grouping, parallelism, asynchrony and caching.
The present embodiment adopts layered identification, grouping, decision-making, scheduling and submission of cached tables and tasks. Each layer has clear responsibilities and is independent of the others, while the layers cooperate with one another. The embodiment can reasonably plan and adjust whether a table is cached, effectively group the cache tasks, accurately grasp the timing of caching and releasing tables, dynamically monitor the task states and the resource usage of the cluster, decide task scheduling and submission in a timely manner, and apply for and release resources in a timely manner, so as to make full use of the advantages of Spark SQL such as columnar storage, in-memory caching and storage compression, increase the task parallelism of the cluster, make fuller use of the cluster resources, and improve the Spark operating efficiency of the whole system.
Although the present invention has been described in detail above, the present invention is not limited thereto, and those skilled in the art can make various modifications according to the principle of the present invention. Therefore, all modifications made according to the principle of the present invention should be understood as falling within the protection scope of the present invention.

Claims (15)

1. A method for improving Spark operating efficiency, comprising:
determining the tables that need to be cached in the system;
identifying the cache tasks that take the determined tables needing to be cached as input or output;
grouping the identified cache tasks, and creating processing processes for the corresponding cache task groups;
combining the cache tasks to be submitted according to the current state of each processing process and the real-time usage of the Spark cluster resources, and sending them to the Spark cluster for processing.
2. The method according to claim 1, wherein the step of determining the tables that need to be cached in the system comprises:
determining the tables that need to be cached according to the out-degree of a table, the number of records cached in a single pass, and the ready time difference among the multiple cache tasks of the table; and/or
determining tables of a customized cache type as tables that need to be cached.
3. The method according to claim 1, wherein the step of grouping the identified cache tasks comprises:
taking the identified cache tasks as objects, establishing a directed acyclic graph of the cache tasks;
according to the directed acyclic graph, assigning the cache tasks that are related to one another through the tables needing to be cached to the same cache task group.
4. The method according to claim 1, wherein the step of creating processing processes for the corresponding cache task groups comprises:
if the grouping of the identified cache tasks is the first grouping, creating a corresponding processing process for each cache task group obtained by this grouping;
if the grouping of the identified cache tasks is not the first grouping, obtaining the set of historical cache task groups, determining, according to the relation between the set of cache task groups obtained by this grouping and the set of historical cache task groups, the cache task groups for which processing processes need to be created, and creating the corresponding processing processes.
5. The method according to claim 4, further comprising:
determining, according to the relation between the cache task groups obtained by this grouping and the historical cache task groups, the cache task groups whose processing processes need to be cancelled, and cancelling the corresponding processing processes.
6. The method according to claim 1, wherein combining the cache tasks to be submitted according to the current state of each processing process and the real-time usage of the Spark cluster resources, and sending them to the Spark cluster for processing comprises:
if a processing process is in the to-be-launched state, determining the available resources of the process according to the real-time usage of the Spark cluster resources and the resource requirement of the processing process;
if the processing process is in the ready state, combining the cache tasks to be submitted according to the available resources of the processing process and the priorities and resource requirements of the tasks, and sending them via the message channel corresponding to the processing process to the Spark cluster for processing.
7. The method according to claim 6, further comprising:
if a processing process is in the not-ready state, performing no processing;
if a processing process is in the cancelled state, the abnormal state or the completed state, releasing the resources it occupies.
8. A device for improving Spark operating efficiency, comprising:
a cache table identification module, configured to determine the tables that need to be cached in the system;
a cache task identification module, configured to identify the cache tasks that take the determined tables needing to be cached as input or output;
a grouping and process management module, configured to group the identified cache tasks and create processing processes for the corresponding cache task groups;
a cache task submission module, configured to combine the cache tasks to be submitted according to the current state of each processing process and the real-time usage of the Spark cluster resources, and send them to the Spark cluster for processing.
9. The device according to claim 8, wherein the cache table identification module determines the tables that need to be cached according to the out-degree of a table, the number of records cached in a single pass, and the ready time difference among the multiple cache tasks of the table, and/or determines tables of a customized cache type as tables that need to be cached.
10. The device according to claim 8, wherein the grouping and process management module takes the identified cache tasks as objects, establishes a directed acyclic graph of the cache tasks, and, according to the directed acyclic graph, assigns the cache tasks that are related to one another through the tables needing to be cached to the same cache task group.
11. The device according to claim 8, wherein if the grouping of the identified cache tasks is the first grouping, the grouping and process management module creates a corresponding processing process for each cache task group obtained by this grouping; otherwise, the grouping and process management module obtains the set of historical cache task groups, determines, according to the relation between the set of cache task groups obtained by this grouping and the set of historical cache task groups, the cache task groups for which processing processes need to be created, and creates the corresponding processing processes.
12. The device according to claim 11, wherein the grouping and process management module is further configured to determine, according to the relation between the cache task groups obtained by this grouping and the historical cache task groups, the cache task groups whose processing processes need to be cancelled, and to cancel the corresponding processing processes.
13. The device according to claim 8, wherein when a processing process is in the to-be-launched state, the cache task submission module determines the available resources of the process according to the real-time usage of the Spark cluster resources and the resource requirement of the processing process; and when the processing process is in the ready state, the cache task submission module combines the cache tasks to be submitted according to the available resources of the processing process and the priorities and resource requirements of the tasks, and sends them via the message channel corresponding to the processing process to the Spark cluster for processing.
14. The device according to claim 13, wherein the cache task submission module performs no processing when a processing process is in the not-ready state, and releases the resources occupied by a processing process when it is in the cancelled state, the abnormal state or the completed state.
15. A big data server, comprising the device for improving Spark operating efficiency according to any one of claims 8 to 14.
CN201610482075.7A 2016-06-27 2016-06-27 A kind of method and device of lifting Spark Operating ettectiveness Withdrawn CN107544844A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610482075.7A CN107544844A (en) 2016-06-27 2016-06-27 A kind of method and device of lifting Spark Operating ettectiveness


Publications (1)

Publication Number Publication Date
CN107544844A true CN107544844A (en) 2018-01-05

Family

ID=60961296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610482075.7A Withdrawn CN107544844A (en) 2016-06-27 2016-06-27 A kind of method and device of lifting Spark Operating ettectiveness

Country Status (1)

Country Link
CN (1) CN107544844A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108563508A (en) * 2018-04-27 2018-09-21 新华三大数据技术有限公司 YARN resource allocation methods and device
CN109324894A (en) * 2018-08-13 2019-02-12 中兴飞流信息科技有限公司 PC cluster method, apparatus and computer readable storage medium
CN109409734A (en) * 2018-10-23 2019-03-01 中国电子科技集团公司第五十四研究所 A kind of satellite data production scheduling system
WO2019228237A1 (en) * 2018-05-29 2019-12-05 华为技术有限公司 Data processing method and computer device
CN114741121A (en) * 2022-04-14 2022-07-12 哲库科技(北京)有限公司 Method and device for loading module and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834532A (en) * 2015-06-03 2015-08-12 星环信息科技(上海)有限公司 Distributed data vectorization processing method and device
US20160110416A1 (en) * 2013-04-06 2016-04-21 Citrix Systems, Inc. Systems and methods for caching of sql responses using integrated caching
CN105577806A (en) * 2015-12-30 2016-05-11 Tcl集团股份有限公司 Distributed cache method and system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
邓诗卓等: ""PCPIR-V:基于Spark的并行隐私保护近邻查询算法"", 《网络与信息安全学报》 *
陈康等: ""Spark计算引擎的数据对象缓存优化研究"", 《中兴通讯技术》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108563508A (en) * 2018-04-27 2018-09-21 新华三大数据技术有限公司 YARN resource allocation methods and device
WO2019228237A1 (en) * 2018-05-29 2019-12-05 华为技术有限公司 Data processing method and computer device
US11422861B2 (en) 2018-05-29 2022-08-23 Huawei Technologies Co., Ltd. Data processing method and computer device
CN109324894A (en) * 2018-08-13 2019-02-12 中兴飞流信息科技有限公司 PC cluster method, apparatus and computer readable storage medium
CN109409734A (en) * 2018-10-23 2019-03-01 中国电子科技集团公司第五十四研究所 A kind of satellite data production scheduling system
CN114741121A (en) * 2022-04-14 2022-07-12 哲库科技(北京)有限公司 Method and device for loading module and electronic equipment
CN114741121B (en) * 2022-04-14 2023-10-20 哲库科技(北京)有限公司 Method and device for loading module and electronic equipment

Similar Documents

Publication Publication Date Title
CN107544844A (en) A kind of method and device of lifting Spark Operating ettectiveness
US11630832B2 (en) Dynamic admission control for database requests
US8082234B2 (en) Closed-loop system management method and process capable of managing workloads in a multi-system database environment
US8082273B2 (en) Dynamic control and regulation of critical database resources using a virtual memory table interface
US8775413B2 (en) Parallel, in-line, query capture database for real-time logging, monitoring and optimizer feedback
US7805436B2 (en) Arrival rate throttles for workload management
US8224845B2 (en) Transaction prediction modeling method
US9785468B2 (en) Finding resource bottlenecks with low-frequency sampled data
KR101694287B1 (en) Apparatus and method for managing processing tasks
US8392404B2 (en) Dynamic query and step routing between systems tuned for different objectives
CN104050042B (en) The resource allocation methods and device of ETL operations
US8042119B2 (en) States matrix for workload management simplification
JPH03130842A (en) Simultaneous execution controller for data base system
CN111752965A (en) Real-time database data interaction method and system based on micro-service
US8392461B2 (en) Virtual data maintenance
CN110069329A (en) A kind of task processing method, device, server and storage medium
CN112181621A (en) Task scheduling system, method, equipment and storage medium
Jeong Conceptual frame for development of optimized simulation-based scheduling systems
US20110023044A1 (en) Scheduling highly parallel jobs having global interdependencies
US8510273B2 (en) System, method, and computer-readable medium to facilitate application of arrival rate qualifications to missed throughput server level goals
CN113391911B (en) Dynamic scheduling method, device and equipment for big data resources
CN110084507A (en) The scientific workflow method for optimizing scheduling of perception is classified under cloud computing environment
Yang et al. Design of kubernetes scheduling strategy based on LSTM and grey model
WO2019029721A1 (en) Task scheduling method, apparatus and device, and storage medium
CN110018887A (en) Task schedule and Resource Management Algorithm on a kind of Reconfigurable Platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20180105