CN107544844A - Method and device for improving Spark operating efficiency - Google Patents
Method and device for improving Spark operating efficiency
- Publication number
- CN107544844A (application CN201610482075.7A)
- Authority
- CN
- China
- Prior art keywords
- cache
- tasks
- treatment progress
- spark
- task
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Abstract
The invention discloses a method and device for improving Spark operating efficiency, relating to the field of big data analysis and processing. The method includes: determining the tables in the system that need to be cached; identifying the cache tasks that take the determined tables to be cached as input or output; grouping the identified cache tasks, and creating a processing process for each corresponding cache task group; combining the cache tasks to be submitted according to the current state of each processing process and the real-time usage of Spark cluster resources, and sending them to the Spark cluster for processing. Embodiments of the present invention can make full use of the cluster's system resources, reasonably determine the tables and content to be cached, dynamically decide on and schedule processes, and, where resources permit, maximize the degree of parallelism, thereby improving Spark operating efficiency.
Description
Technical field
The present invention relates to the field of big data analysis and processing, and in particular to a method and device for improving Spark operating efficiency.
Background art
With the development of informatization, the data that enterprises must process has grown explosively, and data volumes have reached the terabyte (TB) and even petabyte (PB) level. To support the analysis and processing of data at such scale, various big data frameworks, tools and technologies have emerged, and Spark is one of them.
Spark is a big data processing framework built around speed, ease of use and sophisticated analytics. By using a lower-cost shuffle during data processing, it takes the Map-Reduce model to a higher level, and with in-memory data storage and near-real-time processing capabilities, its performance is many times faster than other big data processing technologies. Spark SQL is a Spark component that supports the Structured Query Language (SQL) standard; thanks to its ease of use and high performance, and to the strengths of the traditional SQL user habits and ecosystem it draws on, it is widely favored by users.
The scheduling core of Spark SQL is Catalyst, the functional, relational query optimization framework in Spark SQL, which performs query optimization while translating the SQL language into the final execution plan. In addition, Spark SQL can convert a cached table (cache table) to columnar storage and load its data into memory for caching, greatly reducing the amount of cached data in memory, the amount of data transmitted over the network, and input/output (I/O) overhead. Columnar storage also stores data of the same type contiguously, so serialization and compression can be used to reduce memory footprint.
Although Spark SQL has a powerful optimizer and supports columnar storage, memory caching and storage compression through cache tables, these optimizations all take effect during Spark SQL execution, i.e. after tasks have been submitted to Spark. If tasks are not submitted rationally according to actual conditions, the advantages of Spark SQL cannot be fully exploited, and problems inevitably arise that offset the efficiency gains brought by Spark SQL's inherent advantages:
1. The system has sufficient memory and kernel (CPU) resources, but they cannot be fully used, e.g. tasks are submitted serially or the degree of parallelism is insufficient.
2. The timing of caching and releasing a cache table is hard to judge accurately.
Caching too early or releasing too late prevents system resources (especially memory) from being released in time. A cache table is mainly used to cache intermediate results; its characteristics are a small data volume and frequent use by subsequent computations (SQL). Once an intermediate result is no longer needed, the uncache command should be issued immediately to release the cache space so that other data can be cached.
3. Data that should be cached is not cached, data that should not be cached is cached, or a full table is cached when only some of its columns are needed.
4. Cached data cannot be fully reused, causing the same data to be cached repeatedly.
5. The essence of a distributed computing system is to move computation rather than data, but in practice data movement always occurs, unless a copy of the data is kept on every node of the cluster. Moving data, i.e. moving it from one node to another for computation, consumes not only network I/O but also disk I/O, reducing the efficiency of the whole computation.
Summary of the invention
The technical problem solved by the technical solutions provided in the embodiments of the present invention is how to improve Spark operating efficiency.
A method for improving Spark operating efficiency provided according to an embodiment of the present invention includes:
determining the tables in the system that need to be cached;
identifying the cache tasks that take the determined tables to be cached as input or output;
grouping the identified cache tasks, and creating a processing process for each corresponding cache task group;
combining the cache tasks to be submitted according to the current state of each processing process and the real-time usage of Spark cluster resources, and sending them to the Spark cluster for processing.
Preferably, the step of determining the tables in the system that need to be cached includes:
determining the tables to be cached according to the out-degree of a table, the number of records cached per caching operation, and the ready-time differences between multiple cache tasks on the table; and/or
determining tables of a user-customized cache type as tables to be cached.
Preferably, the step of grouping the identified cache tasks includes:
taking the identified cache tasks as objects and building a directed acyclic graph of the cache tasks;
according to the directed acyclic graph, assigning cache tasks that are related to one another through tables to be cached to the same cache task group.
Preferably, the step of creating a processing process for each corresponding cache task group includes:
if the grouping of the identified cache tasks is the first grouping operation, creating a corresponding processing process for each cache task group obtained by this grouping;
if the grouping of the identified cache tasks is not the first grouping operation, obtaining the historical set of cache task groups, determining the cache task groups for which processing processes need to be created according to the relation between the set of cache task groups obtained by this grouping and the historical set, and creating the corresponding processing processes.
Preferably, the method further includes:
determining, according to the relation between the cache task groups obtained by this grouping and the historical cache task groups, the cache task groups whose processing processes need to be cancelled, and cancelling the corresponding processing processes.
Preferably, combining the cache tasks to be submitted according to the current state of each processing process and the real-time usage of Spark cluster resources and sending them to the Spark cluster for processing includes:
if a processing process is in the to-be-launched state, determining the available resources of the process according to the real-time usage of the Spark cluster resources and the resource requirements of the processing process;
if the processing process is in the ready state, combining the cache tasks to be submitted according to the available resources of the processing process and the priorities and resource requirements of the tasks, and sending them to the Spark cluster for processing via the message channel corresponding to the processing process.
Preferably, the method further includes:
if a processing process is in the not-ready state, doing nothing;
if a processing process is in the cancelled, exception or completed state, releasing the resources it occupies.
A storage medium provided according to an embodiment of the present invention stores a program for implementing the above method for improving Spark operating efficiency.
A device for improving Spark operating efficiency provided according to an embodiment of the present invention includes:
a cache table identification module, configured to determine the tables in the system that need to be cached;
a cache task identification module, configured to identify the cache tasks that take the determined tables to be cached as input or output;
a grouping and process management module, configured to group the identified cache tasks and create a processing process for each corresponding cache task group;
a cache task submission module, configured to combine the cache tasks to be submitted according to the current state of each processing process and the real-time usage of Spark cluster resources, and send them to the Spark cluster for processing.
Preferably, the cache table identification module determines the tables to be cached according to the out-degree of a table, the number of records cached per caching operation, and the ready-time differences between multiple cache tasks on the table, and/or determines tables of a user-customized cache type as tables to be cached.
Preferably, the grouping and process management module takes the identified cache tasks as objects, builds a directed acyclic graph of the cache tasks, and, according to the directed acyclic graph, assigns cache tasks that are related to one another through tables to be cached to the same cache task group.
Preferably, if the grouping of the identified cache tasks is the first grouping operation, the grouping and process management module creates a corresponding processing process for each cache task group obtained by this grouping; otherwise the grouping and process management module obtains the historical set of cache task groups, determines the cache task groups for which processing processes need to be created according to the relation between the set of cache task groups obtained by this grouping and the historical set, and creates the corresponding processing processes.
Preferably, the grouping and process management module is further configured to determine, according to the relation between the cache task groups obtained by this grouping and the historical cache task groups, the cache task groups whose processing processes need to be cancelled, and cancel the corresponding processing processes.
Preferably, when a processing process is in the to-be-launched state, the cache task submission module determines the available resources of the process according to the real-time usage of the Spark cluster resources and the resource requirements of the processing process; when the processing process is in the ready state, it combines the cache tasks to be submitted according to the available resources of the processing process and the priorities and resource requirements of the tasks, and sends them to the Spark cluster for processing via the message channel corresponding to the processing process.
Preferably, when a processing process is in the not-ready state, the cache task submission module does nothing; when a processing process is in the cancelled, exception or completed state, it releases the resources the process occupies.
A server provided according to an embodiment of the present invention includes the above device for improving Spark operating efficiency.
The technical solutions provided in the embodiments of the present invention have the following beneficial effects:
the embodiments of the present invention can make full use of the cluster's system resources, reasonably determine the tables and content to be cached, dynamically decide on and schedule processes, and, where resources permit, maximize the degree of parallelism, thereby improving Spark operating efficiency.
Brief description of the drawings
Fig. 1 is a flowchart of the method for improving Spark efficiency provided in an embodiment of the present invention;
Fig. 2 is a block diagram of the device for improving Spark efficiency provided in an embodiment of the present invention;
Fig. 3 is a basic architecture diagram of improving Spark efficiency provided in an embodiment of the present invention;
Fig. 4 is the overall execution flowchart of improving Spark efficiency provided in an embodiment of the present invention;
Fig. 5 is the flowchart of cache group updating provided in an embodiment of the present invention;
Fig. 6 is the flow block diagram of improving Spark efficiency provided in an embodiment of the present invention.
Detailed description of the embodiments
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be understood that the preferred embodiments described below are only intended to illustrate and explain the present invention, not to limit it.
Fig. 1 is a flowchart of the method for improving Spark efficiency provided in an embodiment of the present invention. As shown in Fig. 1, the steps include:
Step S101: Determine the tables in the system that need to be cached.
The tables to be cached are determined according to the out-degree of a table, the number of records cached per caching operation, and the ready-time differences between multiple cache tasks on the table; and/or tables of a user-customized cache type are determined as tables to be cached.
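The selection rule of step S101 can be sketched as a simple scoring filter: a table qualifies when enough downstream tasks read it (high out-degree), a single caching operation stays small, and its consumers become ready close together in time; user-customized tables are always included. The thresholds, field names and units below are illustrative assumptions, not values from the patent:

```python
# Hypothetical sketch of step S101: pick tables worth caching.
# All thresholds are illustrative assumptions, not from the patent.
MIN_OUT_DEGREE = 2          # table must feed at least this many tasks
MAX_RECORDS = 1_000_000     # per-cache record count must stay small
MAX_READY_GAP = 300         # consumers must become ready within this window (s)

def select_cache_tables(tables, custom_cache_tables=()):
    """tables: dict name -> dict(out_degree, records, ready_times)."""
    selected = set(custom_cache_tables)          # manual customization
    for name, info in tables.items():
        ready = sorted(info["ready_times"])
        gap = (ready[-1] - ready[0]) if len(ready) > 1 else 0
        if (info["out_degree"] >= MIN_OUT_DEGREE
                and info["records"] <= MAX_RECORDS
                and gap <= MAX_READY_GAP):
            selected.add(name)                   # automatic identification
    return selected
```

A table that fails any criterion is simply left uncached, which matches the background's point that caching the wrong data wastes memory.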
Step S102: Identify the cache tasks that take the determined tables to be cached (denoted cache tables) as input or output.
Step S103: Group the identified cache tasks, and create a processing process for each corresponding cache task group.
During grouping, the identified cache tasks are taken as objects and a directed acyclic graph of the cache tasks is built; according to this graph, cache tasks related to one another through cache tables are assigned to the same cache task group. In other words, related cache tasks are placed in the same group, and unrelated cache tasks are placed in different groups.
When creating the processing processes, if this grouping of the identified cache tasks is the first grouping operation, there are no historical groups, and a corresponding processing process is created for each cache task group obtained by this grouping. If it is not the first grouping operation, historical groups exist; this scenario occurs when cache tasks change (e.g. are added, deleted or modified), which may alter the association relations between cache tasks and thus affect the grouping. In that case the historical set of cache task groups is obtained; according to the relation between the set of cache task groups obtained by this grouping and the historical set, the cache task groups for which processing processes need to be created are determined and the corresponding processing processes are created, and at the same time the cache task groups whose processing processes need to be cancelled are determined and the corresponding processing processes are cancelled.
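The grouping rule of step S103, putting directly or transitively related cache tasks in one group and unrelated tasks in different groups, amounts to finding connected components among tasks that share cache tables. A minimal sketch under that reading, using union-find (task and table names are hypothetical):

```python
from collections import defaultdict

def group_cache_tasks(tasks):
    """tasks: dict task_name -> set of cache tables it uses as input or output.
    Tasks linked (directly or transitively) by a shared cache table land in
    one group; unrelated tasks land in different groups."""
    parent = {t: t for t in tasks}
    def find(x):                          # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    by_table = defaultdict(list)
    for task, tbls in tasks.items():
        for tbl in tbls:
            by_table[tbl].append(task)
    for linked in by_table.values():      # merge tasks touching the same table
        for other in linked[1:]:
            parent[find(other)] = find(linked[0])
    groups = defaultdict(set)
    for t in tasks:
        groups[find(t)].add(t)
    return [sorted(g) for g in groups.values()]
```

Here union-find is one possible realization; the patent only requires that association through cache tables decides group membership.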
Step S104: Combine the cache tasks to be submitted according to the current state of each processing process and the real-time usage of Spark cluster resources, and send them to the Spark cluster for processing.
When a cache task in the waiting queue meets the ready condition, it is added to the ready task queue of the processing process it belongs to.
If the ready task queue of a processing process contains cache tasks but neither the message system between it and the Spark cluster nor the Spark context of the Spark cluster has been started, the processing process is in the to-be-launched state; the available resources of the process are then determined according to the real-time usage of the Spark cluster resources obtained by monitoring and the resource requirements of the processing process. If the ready task queue of the processing process contains cache tasks and both the message system between it and the Spark cluster and the Spark context of the Spark cluster have been started, the processing process is in the ready state; for a processing process in the ready state, the cache tasks to be submitted are combined according to the available resources of the processing process and the priorities and resource requirements of the tasks, and sent to the Spark cluster for processing via the message channel corresponding to the processing process.
Further, if the ready task queue of a processing process contains no cache tasks and neither the message system between it and the Spark cluster nor the Spark context of the Spark cluster has been started, the processing process is in the not-ready state, and nothing is done.
Further, if a processing process is in the cancelled, exception or completed state, the resources it occupies are released.
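The per-state handling in step S104 reduces to a small dispatch over the process state. The following sketch paraphrases the text; the field names, resource units and return values are illustrative assumptions:

```python
# Illustrative sketch of the per-process state handling of step S104.
def handle_process(proc, cluster_usage):
    """proc: dict with 'state', 'ready_queue', 'demand'; returns the action taken."""
    state = proc["state"]
    if state == "to_be_launched":
        # Queue has tasks but messaging / Spark context not started yet:
        # only work out how much of the cluster this process may use.
        proc["available"] = min(proc["demand"], cluster_usage["free"])
        return "allocated"
    if state == "ready":
        # Combine queued tasks by priority within the available resources.
        batch, used = [], 0
        for t in sorted(proc["ready_queue"], key=lambda t: -t["priority"]):
            if used + t["need"] <= proc["available"]:
                batch.append(t)
                used += t["need"]
        return ("submitted", len(batch))
    if state == "not_ready":
        return "noop"                      # nothing to do
    if state in ("cancelled", "exception", "completed"):
        proc["available"] = 0              # release occupied resources
        return "released"
```

The "ready" branch is where combined (concurrent) submission happens; the other branches only manage resources.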
It should be noted that if resources do not permit, or an independent submission mode and strategy is required, cache tasks can be submitted individually instead of in combination (i.e. instead of the concurrent execution mode).
It should be noted that, according to the resource usage of each processing process, it can be determined whether the tasks within a processing process execute concurrently, i.e. whether multiple tasks in the processing process are combined before being sent.
It should be noted that, according to the resource usage of the Spark cluster, it can be determined whether multiple processes execute concurrently, i.e. whether the tasks of multiple processing processes are sent simultaneously.
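The two-level concurrency decision in the notes above can be sketched as follows: the cluster's free resources decide which processes may send at once, and each admitted process's own budget decides how many of its queued tasks are combined. All structures and numbers here are illustrative assumptions:

```python
def submission_plan(cluster_free, processes):
    """Decide which processes send now (multi-process concurrency) and, per
    process, which queued tasks to combine (in-process concurrency).
    processes: dict name -> dict('demand', 'task_needs'); units are illustrative."""
    plan, used = {}, 0
    for name, p in processes.items():
        if used + p["demand"] > cluster_free:
            continue                     # this process waits for resources
        used += p["demand"]
        # Within the process, combine tasks greedily under its own budget.
        budget, batch = p["demand"], []
        for need in sorted(p["task_needs"]):
            if need <= budget:
                batch.append(need)
                budget -= need
        plan[name] = batch
    return plan
```

When the budget admits only one task, the plan degenerates naturally into the independent (one-by-one) submission mode mentioned above.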
This embodiment solves problems such as insufficient use of cluster resources, inability to use cache tables reasonably and to determine the content to be cached, poorly judged caching and release timing, and unnecessary data movement within the cluster. This embodiment makes full use of the advantages of Spark SQL such as columnar storage, memory caching and storage compression; supports customized tasks, algorithms and resource requirements; determines the table objects to be cached by intelligent identification combined with manual customization; reasonably groups all related tasks according to the dependencies between cache tables and allocates a processing process to each group; dynamically decides on task submission according to the data arrival situation; determines whether multiple processes execute concurrently according to the resource usage of the Spark cluster, and whether the tasks within a process execute concurrently according to the resource usage of each process; and also handles process exceptions, cache timeouts and the like. Furthermore, where resources permit, it maximizes the degree of parallelism and thus improves Spark operating efficiency.
Those skilled in the art will appreciate that all or some of the steps in the above method embodiment can be completed by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium and, when executed, performs steps S101 to S104. The storage medium can be a ROM/RAM, a magnetic disk, an optical disc, or the like.
Fig. 2 is a block diagram of the device for improving Spark efficiency provided in an embodiment of the present invention. As shown in Fig. 2, the device includes:
A cache table identification module 10, configured to determine the tables in the system that need to be cached. The cache table identification module 10 determines the tables to be cached according to the out-degree of a table, the number of records cached per caching operation, and the ready-time differences between multiple cache tasks on the table, and/or determines tables of a user-customized cache type as tables to be cached. That is, the cache table identification module 10 can determine the tables to be cached through automatic intelligent analysis and/or manual customization.
A cache task identification module 20, configured to identify the cache tasks that take the determined tables to be cached as input or output.
A grouping and process management module 30, configured to group the identified cache tasks and create a processing process for each corresponding cache task group. The grouping and process management module 30 takes the identified cache tasks as objects, builds a directed acyclic graph of the cache tasks, and, according to the graph, assigns cache tasks that are related to one another through tables to be cached to the same cache task group. If the grouping of the identified cache tasks is the first grouping operation, there are no historical groups, and the grouping and process management module 30 creates a corresponding processing process for each cache task group obtained by this grouping. Otherwise historical groups exist; this scenario occurs when cache tasks change (e.g. are added, deleted or modified), which may alter the association relations between cache tasks and thus affect the grouping. In that case the grouping and process management module 30 obtains the historical set of cache task groups, determines the cache task groups for which processing processes need to be created according to the relation between the set of cache task groups obtained by this grouping and the historical set, creates the corresponding processing processes, and at the same time determines the cache task groups whose processing processes need to be cancelled and cancels those processes.
A cache task submission module 40, configured to combine the cache tasks to be submitted according to the current state of each processing process and the real-time usage of Spark cluster resources, and send them to the Spark cluster for processing. When a processing process is in the to-be-launched state, that is, its ready task queue contains cache tasks but neither the message system between it and the Spark cluster nor the Spark context of the Spark cluster has been started, the cache task submission module 40 determines the available resources of the process according to the real-time usage of the Spark cluster resources and the resource requirements of the processing process. When the processing process is in the ready state, that is, its ready task queue contains cache tasks and both the message system between it and the Spark cluster and the Spark context of the Spark cluster have been started, the module combines the cache tasks to be submitted according to the available resources of the processing process and the priorities and resource requirements of the tasks, and sends them to the Spark cluster for processing via the message channel corresponding to the processing process. If resources do not permit, or an independent submission mode and strategy is required, the grouping and process management module 30 can also submit cache tasks individually.
Further, when a processing process is in the not-ready state, that is, its ready task queue contains no cache tasks and neither the message system between it and the Spark cluster nor the Spark context of the Spark cluster has been started, the cache task submission module 40 does nothing; when a processing process is in the cancelled, exception or completed state, it releases the resources the process occupies.
It should be noted that if the resources of the Spark cluster are sufficient, multiple processing processes can obtain resources, and in that case multiple processing processes can send tasks simultaneously; within each processing process, multiple tasks in its ready task queue can be combined according to the available resources of the processing process and the priorities and resource requirements of the tasks, achieving in-process task concurrency.
This embodiment makes full use of the advantages of Spark SQL to improve Spark operating efficiency; it takes into account factors such as available system resources, actual task demands, cache and uncache (release) timing, and data movement, plans and decides for each task, and supports multi-process execution and in-process concurrency.
An embodiment of the present invention further provides a server, which includes the above device for improving Spark operating efficiency.
Fig. 3 is a basic architecture diagram of improving Spark efficiency provided in an embodiment of the present invention. As shown in Fig. 3, the local side has a base support structure including a cache table identification function (equivalent to the function of the cache table identification module 10) and a cache task grouping function (equivalent to the functions of the cache task identification module 20 and the grouping and process management module 30), as well as a logical processing structure with a task scheduling mechanism, a Spark execution mechanism, and multiple task submission instances (equivalent to the cache task submission module 40). The remote Spark cluster (i.e. the cluster) has an adaptation layer including an adapter and multiple SparkContexts. The local side and the cluster are connected by a message system with a message sender and a message receiver (e.g. an Akka actor message system).
The workflow comprises the following steps:
1. Determine the table objects that need to be cached.
These objects (the tables to be cached) are mainly identified automatically and intelligently by the system according to the out-degree of the table, the data volume of the table, and the ready status of the out-degree tasks; to allow flexibility and to meet the specific demands of some scenarios (e.g. verification), manual customization through configuration is also supported.
2. Identify the cache tasks in the system, group them, and determine the group keys.
Cache tasks are the tasks that take cache tables as input or output. All cache tasks in the system are identified, and a simplified directed acyclic graph (Directed Acyclic Graph, DAG) is built according to the data dependencies (i.e. the relations between cache tables); this graph contains only cache tables and no non-cache tables. The cache tasks are then grouped based on the DAG: tasks with direct or indirect association relations are assigned to the same group, and tasks without any association relation are assigned to different groups. The set of all cache tables of the current group is taken as the group key, which uniquely identifies a cache group.
3. Create an independent processing process for each cache task group.
Each process owns its own message channel between the local application and the remote Spark cluster, its own resource requirements and resource allocation snapshot, ready task queue, cache table queue, task submission instance, and so on.
4. The task scheduling mechanism periodically checks the ready status of each cache task and identifies the processing process of each ready task.
For a ready cache task, the set of all cache tables in its input and output is obtained; the cache process whose group key contains this set is the processing process of that cache task, and the task is added to the ready task queue of that processing process.
5. The Spark task execution mechanism (i.e. the Spark execution mechanism) periodically checks the state of each process and triggers a different handling flow according to the state.
6. The Spark task submission instance of each process (i.e. the task submission instance) combines the cache tasks to be submitted to the Spark cluster, submits them to the Spark cluster, and performs the follow-up processing.
This embodiment makes full use of the advantages of Spark SQL such as columnar storage, memory caching and storage compression; supports customized tasks, algorithms and resource requirements; determines the table objects to be cached by intelligent identification combined with manual customization; reasonably groups all related tasks according to the dependencies between cache tables and allocates a processing process to each group; dynamically decides on task submission according to the data arrival situation; determines whether multiple processes execute concurrently according to the resource usage of the Spark cluster, and whether the tasks within a process execute concurrently according to the resource usage of each process; and also handles process exceptions, cache timeouts and the like.
Fig. 4 is the overall execution flowchart of improving Spark efficiency provided in an embodiment of the present invention. As shown in Fig. 4, the steps include:
Step S200: Start the flow.
Step S201: Identify the cache table objects and the cache tasks.
Step S202: The cache tasks enter the scheduling mechanism.
Step S203: Build the DAG, group the cache tasks based on the DAG, and allocate a processing process and process parameters to each group.
Step S204: The task scheduling mechanism periodically scans for ready tasks.
Step S205: If a cache task is ready, identify the processing process it belongs to and add the cache task to the ready task queue of that processing process.
Step S206: The Spark execution mechanism scans all processing processes and, according to the state of each processing process, triggers the corresponding task submission instance to perform the corresponding operation.
The states of a processing process include cancelled, not ready, to be launched, ready, exception, and completed.
Step S207: The task submission instance of the processing process combines the cache tasks and submits them to the Spark cluster for processing.
Step S208: All processes are finished; end the flow.
Fig. 5 is the cache grouping update flow chart provided by an embodiment of the present invention. As shown in Fig. 5, the steps include:
Step S301: Start the flow.
Step S302: Group the cache tasks based on the DAG graph to obtain a set K1 containing multiple cache task groups. For each cache task group, take the set of all its cache tables as the group key, which uniquely identifies that cache task group.
Step S303: Judge whether historical groups exist. If so, perform step S304; otherwise perform step S308.
Step S304: Obtain the set K2 containing the historical cache task groups. For the part of K2 not overlapping with K1, proceed according to step S305; for the overlapping part, i.e. the intersection of K1 and K2, proceed according to step S309.
Step S305: Take the groups in K2 that do not overlap with K1. For the groups whose group key has been split or merged in K1, set the corresponding processing process to the cancelled state; for the remaining groups, update the group key as needed (if a group has added tasks, the newly added cache tables need to be added to the corresponding group key).
For example, suppose K2 contains group key1, group key2, and group key3, while K1 contains group key a, group key b, and group key c, where key a in K1 was obtained by merging key1 and key2, and key b and key c were obtained by splitting key3. In this case, the processing processes corresponding to key1, key2, and key3 in K2 are cancelled.
Step S306: Obtain the group key set K2' of all processing processes in K2 that are not in the cancelled state.
Step S307: Obtain the difference set of K1 and K2'. For each element of the difference set, i.e. each cache task group belonging to K1 but not to K2', create a processing process and plan its process parameters.
Step S308: If no historical groups exist, create a corresponding processing process for each cache task group in K1 and plan its process parameters.
Step S309: The intersection part, i.e. the overlap of K1 and K2, requires no processing.
Step S310: End the flow.
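The grouping-update logic of steps S303 to S309 reduces to set operations on group keys. The following is a minimal sketch, assuming each group key is represented as a frozenset of cache table names; the names `new_keys` and `old_keys` are illustrative, not taken from the original:

```python
def plan_group_update(new_keys, old_keys):
    """Decide which processing processes to keep, cancel, or create.

    new_keys: set of frozensets -- group keys from the fresh grouping (K1)
    old_keys: set of frozensets -- group keys of historical groups (K2)
    Returns (keep, cancel, create) as sets of group keys.
    """
    keep = new_keys & old_keys       # S309: intersection needs no processing
    cancel = old_keys - new_keys     # S305: split/merged groups are cancelled
    surviving = old_keys - cancel    # S306: K2' -- non-cancelled history
    create = new_keys - surviving    # S307: K1 minus K2' gets new processes
    return keep, cancel, create

# {"a"} and {"b"} were merged into {"a","b"}: their old processes are
# cancelled and one new process is created; {"c"} is left untouched.
keep, cancel, create = plan_group_update(
    new_keys={frozenset({"a", "b"}), frozenset({"c"})},
    old_keys={frozenset({"a"}), frozenset({"b"}), frozenset({"c"})},
)
```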
Fig. 6 is the flow block diagram for improving Spark efficiency provided by an embodiment of the present invention. As shown in Fig. 6, the steps include:
Step S401: Determine the table objects that need cache.
A table object that needs cache has two sources: automatic intelligent analysis and identification, or manual customization through configuration.
1.1. Automatic intelligent analysis and identification.
A table that simultaneously satisfies the following three elements is a table object that needs cache. The constants used are empirical values determined by combining theory with practical verification; the effect is best when the empirical values are used. The empirical values can be modified, and modification affects the effect to a certain degree.
(1) Out-degree of the table > 3.
Taking all tasks in the system as objects, a DAG graph is created according to the data dependencies.
The DAG graph contains table nodes and task nodes, where:
Table node: contains an input task set and an output task set. The number of elements of the input task set is the in-degree of the table; the number of elements of the output task set is the out-degree of the table.
Task node: contains an input table set and an output table set. The number of elements of the input table set is the in-degree of the task; the number of elements of the output table set is the out-degree of the task.
The DAG graph contains two classes of relations, table-task and task-table, where:
Table-task: the mapping relation between a table key and a table node.
Task-table: the mapping relation between a task key and a task node.
Taking all table nodes as analysis objects, the table nodes whose out-degree is greater than 3 are filtered out; the tables corresponding to these table nodes are the candidate table objects that need cache (i.e. the candidate cache table objects).
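The table/task bookkeeping and the out-degree filter of element (1) can be sketched as follows; the class and field names are illustrative, not taken from the original:

```python
from collections import defaultdict

class DepGraph:
    """Bipartite DAG of tables and tasks built from data dependencies."""

    def __init__(self):
        self.table_inputs = defaultdict(set)   # table -> tasks writing it
        self.table_outputs = defaultdict(set)  # table -> tasks reading it

    def add_task(self, task, inputs, outputs):
        # A task reads its input tables and writes its output tables.
        for t in inputs:
            self.table_outputs[t].add(task)    # reader = table out-edge
        for t in outputs:
            self.table_inputs[t].add(task)     # writer = table in-edge

    def cache_candidates(self, min_out_degree=3):
        # Element (1): a table read by more than min_out_degree tasks
        # becomes a candidate for caching.
        return {t for t, readers in self.table_outputs.items()
                if len(readers) > min_out_degree}

g = DepGraph()
for i in range(4):                       # four tasks all read table "hot"
    g.add_task(f"t{i}", inputs=["hot"], outputs=[f"out{i}"])
g.add_task("t9", inputs=["cold"], outputs=["out9"])
```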
(2) Single-pass cache record count > 10,000,000.
The candidate cache table objects filtered out by element (1) are analyzed to find the tables whose single-pass cache record count exceeds 10,000,000, narrowing the scope of the candidate table objects.
(3) Ready time difference of the multiple tasks depending on the same cache table <= 1 hour.
Among the candidate cache table objects further determined by element (2), the tables whose multiple output tasks have ready time differences within 1 hour are filtered out; these are the final table objects needing cache as determined by automatic intelligent analysis.
Method for computing the task ready time difference: based on one month of historical task ready time difference data for the cache tables, a regression model is established using linear regression analysis, and the ready time difference of the tasks is predicted.
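The ready-time-difference prediction can be sketched with ordinary least squares over the one-month history. The regressor used here (day index) and the synthetic history are assumptions, since the original does not specify the model's features:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with one regressor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# One month of observed ready-time differences (minutes) per day index.
days = list(range(30))
diffs = [40 + 0.5 * d for d in days]       # synthetic history, for illustration
slope, intercept = fit_line(days, diffs)
predicted = slope * 30 + intercept         # forecast for the next day
# Element (3) keeps the table only if the predicted difference <= 60 minutes.
needs_cache = predicted <= 60
```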
1.2. Manual customization through configuration.
In order to flexibly support certain special user scenarios and manual intervention, a configuration interface is provided through which users can define the cache tables, the concrete cache mode, the algorithm, and so on. The tables that users directly define as cache types are the other source of table objects that need cache.
Step S402: Based on the cache table set determined in step S401, identify and group the cache tasks in the system, and determine the group keys.
This specifically comprises the following steps:
Step S4021: A cache task is a task that takes a cache table as input or output; identify all such tasks in the current system.
Step S4022:Using step S4021 all cache tasks as object, DAG figures are established according to data dependence relation,
This figure is simplified, and the node and relation for being related to table only do not include non-cache tables comprising cache tables.
Step S4023:Cache tasks are grouped based on simplified cache tasks DAG figures (i.e. DAG figures), had directly
Or the task of indirect association relation is assigned to same group, the task without any incidence relation is assigned to different groups.
Step S4024:The set of all cache tables of each packet is taken as packet key, for unique mark one
Cache is grouped.
Step S4025:(newly-increased, removal, modification), renewal cache tasks DAG when the cache tasks of system have variation
Figure.
Step S4026:If cache tasks DAG figures (i.e. DAG figures) change, (tasks carrying completes what is removed from system
Except scene), then re-execute step S4023 and step S4024.
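Grouping by direct or indirect association in step S4023 amounts to computing connected components over the simplified graph, treating the task-table edges as undirected for reachability. A minimal sketch under that reading (names illustrative):

```python
from collections import defaultdict

def group_cache_tasks(tasks):
    """tasks: dict task -> set of cache tables it reads or writes.
    Returns a dict mapping each group key (frozenset of cache tables,
    step S4024) to the set of tasks in that group (step S4023)."""
    # Tasks sharing a cache table are directly associated; chains of
    # shared tables give indirect association, i.e. connected components.
    by_table = defaultdict(set)
    for task, tables in tasks.items():
        for t in tables:
            by_table[t].add(task)

    seen, groups = set(), {}
    for start in tasks:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:                      # flood-fill one component
            task = stack.pop()
            if task in comp:
                continue
            comp.add(task)
            for t in tasks[task]:
                stack.extend(by_table[t] - comp)
        seen |= comp
        key = frozenset().union(*(tasks[t] for t in comp))
        groups[key] = comp
    return groups

groups = group_cache_tasks({
    "t1": {"A"}, "t2": {"A", "B"}, "t3": {"B"},  # one chain via tables A, B
    "t4": {"C"},                                  # no association: own group
})
```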
Step S403: According to the grouping result of the cache task DAG graph, perform the corresponding create or cancel operations on the cache processes (i.e. the processing processes of the cache task groups). For a newly created process, also determine its resource requirement and priority.
The creation or cancellation of a cache process distinguishes two scenarios, described in detail with reference to Fig. 5:
(1) For an initial grouping, a new cache processing process is directly created for each cache group (i.e. cache task group), with the group key serving as the key of the process.
(2) For a grouping update, the relation between each updated cache group key and the group keys of the existing cache processes determines whether each original cache process is retained or cancelled; if there are newly added cache groups, additional new cache processes are created.
The cache process creation of step S403 is not process creation in the real sense; it merely creates some parameters necessary for the normal operation of a cache process, mainly including:
(1) Process id: uniquely identifies a cache process.
(2) Process priority: determines resource allocation and process scheduling.
(3) Process state: normal state or cancelled state. The default is the normal state; when the cache grouping changes, some cache processes may need to be cancelled.
(4) Information of the akka actor message system: the IP and port of the local message sender; the IP, port, username, and password of the remote message receiver (the username and password are mainly used to create a session); and the open/closed state of the remote message receiver session, which defaults to open.
(5) Resource requirement percentage: the system resources needed by the current cache process, covering cores and memory.
(6) Cache group key: the set of all cache tables that the cache group contains, determined based on the cache task DAG graph analysis.
(7) The submission instance responsible for submitting cache tasks to the Spark cluster.
The process cancellation of step S403 simply sets the cache process state to the cancelled state; the actual cancellation operation is performed later, during the periodic scheduling of the Spark executor.
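The parameter record "created" in step S403 can be sketched as a plain data holder; the field names and default values are illustrative renderings of items (1) to (7), not taken from the original:

```python
from dataclasses import dataclass

@dataclass
class CacheProcessParams:
    """Parameters prepared at creation time in step S403 -- no real
    OS process or SparkContext is started yet."""
    process_id: str                          # (1) unique id of the process
    priority: int                            # (2) drives allocation/scheduling
    state: str = "normal"                    # (3) "normal" or "cancelled"
    local_endpoint: tuple = ("127.0.0.1", 2552)      # (4) akka sender ip/port
    remote_endpoint: tuple = ("spark-master", 2552)  # (4) receiver, assumed
    session_open: bool = True                # (4) remote session defaults open
    resource_pct: float = 0.1                # (5) share of cores and memory
    group_key: frozenset = frozenset()       # (6) all cache tables of the group
    submitter: object = None                 # (7) Spark submission instance

p = CacheProcessParams(process_id="p1", priority=5,
                       group_key=frozenset({"orders", "users"}))
```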
Step S404: The task scheduling mechanism periodically wakes up a separate thread to scan whether the cache tasks in the waiting queue satisfy the ready condition. For a ready cache task, obtain the set of all cache tables in its input and output; the cache process whose group key includes this set is the processing process of that cache task, and the task is added to the ready-task queue of that processing process. Otherwise, the task is scanned and judged again in a subsequent polling cycle.
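Matching a ready task to its processing process in step S404 is a subset test of the task's cache tables against each process's group key. A sketch under that reading (function and variable names illustrative):

```python
def dispatch_ready_task(task_tables, processes):
    """task_tables: set of cache tables in the task's input and output.
    processes: dict process_id -> group key (set of cache tables).
    Returns the id of the owning process, or None to retry next cycle."""
    for pid, group_key in processes.items():
        if task_tables <= group_key:   # group key includes the task's tables
            return pid
    return None

procs = {"p1": {"A", "B"}, "p2": {"C"}}
assert dispatch_ready_task({"A", "B"}, procs) == "p1"
assert dispatch_ready_task({"D"}, procs) is None   # wait for a later cycle
```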
Step S405: The Spark task execution mechanism periodically wakes up a separate thread to scan all current cache processes.
(1) If the current cache process is in the cancelled state, perform the actual cancellation operation: uncache (i.e. release) the tables already cached in the process and reset their states; empty the ready-task queue and return its tasks to scheduling; stop the message system and the SparkContext; remove the process and release its resources.
(2) If the current cache process is in the normal state, the cases include:
a. The process is not ready: there is no task or cache table in the process in the pending state, and the message system and SparkContext have not yet been started. Nothing is done; the process simply waits for the thread's subsequent scan cycle.
b. The process is to be launched: there are pending tasks or cache tables in the process, but the message system and SparkContext have not been started. Judge whether the system resources are sufficient according to the resource requirement of the process; if sufficient, apply for resources, start the message system and SparkContext, check the health states of the message system and SparkContext, and prepare for task execution.
c. The process is ready: there are pending tasks or cache tables in the process, the message system and SparkContext have been started, and the health check is normal. According to the available resources of the process, the priorities of the tasks, the resource requirements of the tasks, the submission mode of the tasks (whether they must be submitted individually), and the task submission strategy of the system (single submission, grouped submission, etc.), organize the task list to be submitted, and send it to the Spark task submitter corresponding to the process to perform the submission.
d. The process has an exception: there are pending tasks or cache tables in the process, the message system and SparkContext have been started, but the health check is abnormal. Uncache the tables already cached in the process and reset their states; empty the ready-task queue and return its tasks to scheduling; stop the message system and SparkContext, and release the resources. This is essentially similar to cancelling a process, except that the process is not removed; in a subsequent scan cycle, the appropriate handling is performed again according to the specific state of the process.
e. The process has completed: all tasks and cache tables in the process have been processed. Stop the akka actor message system and the SparkContext, and remove the process.
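The per-process scan of step S405 is a state-machine dispatch. A minimal sketch, with the state names and handler actions as assumed shorthand for the operations described above:

```python
from enum import Enum, auto

class ProcState(Enum):
    CANCELLED = auto()
    NOT_READY = auto()
    TO_LAUNCH = auto()
    READY = auto()
    EXCEPTION = auto()
    COMPLETED = auto()

def scan_process(state, actions):
    """One scan-cycle decision for a single cache process.
    `actions` collects the operations the executor would perform."""
    if state is ProcState.CANCELLED:
        actions += ["uncache", "stop_context", "remove", "release"]
    elif state is ProcState.NOT_READY:
        pass                                  # wait for the next scan cycle
    elif state is ProcState.TO_LAUNCH:
        actions += ["check_resources", "start_context"]
    elif state is ProcState.READY:
        actions += ["organize_tasks", "submit"]
    elif state is ProcState.EXCEPTION:
        actions += ["uncache", "stop_context", "release"]  # kept, not removed
    elif state is ProcState.COMPLETED:
        actions += ["stop_context", "remove"]
    return actions
```

Note the deliberate asymmetry between the cancelled and exception branches: an exceptional process releases its resources but stays registered, so a later scan can re-handle it according to its then-current state.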
Step S406: When the Spark task submission instance of a cache process receives a new cache task group sent by the Spark task execution mechanism, it starts a separate thread to perform the submission of the task group. Specifically, it generates the parameter file supporting the execution of the task group and uploads it to the Spark cluster environment of the process; submits the tasks to the SparkContext through the message system; analyzes the result and returns the execution result after obtaining the normal feedback of the remote message system; and starts the corresponding exception protection flow when the remote message system is abnormal, the SparkContext is abnormal, or the tasks time out during execution.
The present embodiment is an efficiency optimization based on the Spark computing framework, and can take full advantage of Spark SQL columnar storage, memory caching, storage compression, and the like.
The present embodiment combines an automatic decision framework with manual customization and intervention, and improves the various resource utilizations of the cluster from multiple angles such as grouping, parallelism, asynchrony, and caching.
The present embodiment uses layered identification, grouping, decision, scheduling, and submission of cache tables and tasks. Each layer has clear-cut responsibilities independent of the others, and the layers cooperate with each other in combination. Whether a table is cached can thereby be rationally planned and adjusted, the cache tasks can be effectively grouped, the opportunities for caching and releasing cache tables can be accurately seized, the task states and the resource usage of the cluster can be dynamically monitored, and the scheduling and submission of tasks as well as the application and release of resources can be decided in a timely manner. As a result, the advantages of Spark SQL columnar storage, memory caching, storage compression, and the like are exploited as fully as possible, the task parallelism of the cluster is improved, the cluster resources are more fully utilized, and the Spark operating efficiency of the whole system is improved.
Although the present invention has been described in detail above, the invention is not restricted thereto, and those skilled in the art can make various modifications according to the principles of the present invention. Therefore, all modifications made according to the principles of the invention should be understood as falling within the protection scope of the present invention.
Claims (15)
1. A method for improving Spark operating efficiency, comprising:
determining the tables in a system that need cache (caching);
identifying the cache tasks that take a determined table needing cache as input or output;
grouping the identified cache tasks, and creating processing processes for the corresponding cache task groups;
according to the current state of each processing process and the real-time usage of the Spark cluster resources, combining the cache tasks to be submitted, and sending them to the Spark cluster for processing.
2. The method according to claim 1, wherein the step of determining the tables in the system that need cache comprises:
determining the tables that need cache according to the out-degree of a table, the single-pass cache record count, and the ready time differences among the multiple cache tasks on the table; and/or
determining the tables of customized cache types as tables that need cache.
3. The method according to claim 1, wherein the step of grouping the identified cache tasks comprises:
taking the identified cache tasks as objects, establishing a directed acyclic graph of the cache tasks;
according to the directed acyclic graph, assigning the cache tasks that are interrelated through tables needing cache to the same cache task group.
4. The method according to claim 1, wherein the step of creating processing processes for the corresponding cache task groups comprises:
if the grouping of the identified cache tasks is the first grouping, creating a corresponding processing process for each cache task group obtained by this grouping;
if the grouping of the identified cache tasks is not the first grouping, obtaining the set of historical cache task groups, determining the cache task groups that need processing processes to be created according to the relation between the set of cache task groups obtained by this grouping and the set of historical cache task groups, and creating the corresponding processing processes.
5. The method according to claim 4, further comprising:
according to the relation between the cache task groups obtained by this grouping and the historical cache task groups, determining the cache task groups whose processing processes need to be cancelled, and cancelling the corresponding processing processes.
6. The method according to claim 1, wherein the step of combining the cache tasks to be submitted according to the current state of each processing process and the real-time usage of the Spark cluster resources, and sending them to the Spark cluster for processing, comprises:
if a processing process is in the process-to-be-launched state, determining the available resources of the process according to the real-time usage of the Spark cluster resources and the resource requirement of the processing process;
if the processing process is in the process-ready state, combining the cache tasks to be submitted according to the available resources of the processing process and the priorities and resource requirements of the tasks, and sending them via the message channel corresponding to the processing process to the Spark cluster for processing.
7. The method according to claim 6, further comprising:
if a processing process is in the process-not-ready state, performing no processing;
if a processing process is in the process-cancelled state, the process-exception state, or the process-completed state, releasing the resources it occupies.
8. A device for improving Spark operating efficiency, comprising:
a cache table identification module, configured to determine the tables in a system that need cache (caching);
a cache task identification module, configured to identify the cache tasks that take a determined table needing cache as input or output;
a grouping and process management module, configured to group the identified cache tasks and create processing processes for the corresponding cache task groups;
a cache task submission module, configured to combine the cache tasks to be submitted according to the current state of each processing process and the real-time usage of the Spark cluster resources, and send them to the Spark cluster for processing.
9. The device according to claim 8, wherein the cache table identification module determines the tables that need cache according to the out-degree of a table, the single-pass cache record count, and the ready time differences among the multiple cache tasks on the table, and/or determines the tables of customized cache types as tables that need cache.
10. The device according to claim 8, wherein the grouping and process management module takes the identified cache tasks as objects, establishes a directed acyclic graph of the cache tasks, and, according to the directed acyclic graph, assigns the cache tasks that are interrelated through tables needing cache to the same cache task group.
11. The device according to claim 8, wherein, if the grouping of the identified cache tasks is the first grouping, the grouping and process management module creates a corresponding processing process for each cache task group obtained by this grouping; otherwise, the grouping and process management module obtains the set of historical cache task groups, determines the cache task groups that need processing processes to be created according to the relation between the set of cache task groups obtained by this grouping and the set of historical cache task groups, and creates the corresponding processing processes.
12. The device according to claim 11, wherein the grouping and process management module is further configured to determine, according to the relation between the cache task groups obtained by this grouping and the historical cache task groups, the cache task groups whose processing processes need to be cancelled, and cancel the corresponding processing processes.
13. The device according to claim 8, wherein, when a processing process is in the process-to-be-launched state, the cache task submission module determines the available resources of the process according to the real-time usage of the Spark cluster resources and the resource requirement of the processing process; and when the processing process is in the process-ready state, the cache task submission module combines the cache tasks to be submitted according to the available resources of the processing process and the priorities and resource requirements of the tasks, and sends them via the message channel corresponding to the processing process to the Spark cluster for processing.
14. The device according to claim 13, wherein the cache task submission module performs no processing when a processing process is in the process-not-ready state, and releases the resources occupied by a processing process when it is in the process-cancelled state, the process-exception state, or the process-completed state.
15. A big data server, comprising the device for improving Spark operating efficiency according to any one of claims 8 to 14.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201610482075.7A CN107544844A (en) | 2016-06-27 | 2016-06-27 | Method and device for improving Spark operating efficiency
Publications (1)
Publication Number | Publication Date |
---|---|
CN107544844A true CN107544844A (en) | 2018-01-05 |
Family
ID=60961296
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610482075.7A Withdrawn CN107544844A (en) | 2016-06-27 | 2016-06-27 | A kind of method and device of lifting Spark Operating ettectiveness |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107544844A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104834532A (en) * | 2015-06-03 | 2015-08-12 | 星环信息科技(上海)有限公司 | Distributed data vectorization processing method and device |
US20160110416A1 (en) * | 2013-04-06 | 2016-04-21 | Citrix Systems, Inc. | Systems and methods for caching of sql responses using integrated caching |
CN105577806A (en) * | 2015-12-30 | 2016-05-11 | Tcl集团股份有限公司 | Distributed cache method and system |
Non-Patent Citations (2)

Title |
---|
Deng Shizhuo et al., "PCPIR-V: a Spark-based parallel privacy-preserving nearest-neighbor query algorithm", Chinese Journal of Network and Information Security * |
Chen Kang et al., "Research on data object cache optimization for the Spark computing engine", ZTE Technology Journal * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108563508A (en) * | 2018-04-27 | 2018-09-21 | 新华三大数据技术有限公司 | YARN resource allocation methods and device |
WO2019228237A1 (en) * | 2018-05-29 | 2019-12-05 | 华为技术有限公司 | Data processing method and computer device |
US11422861B2 (en) | 2018-05-29 | 2022-08-23 | Huawei Technologies Co., Ltd. | Data processing method and computer device |
CN109324894A (en) * | 2018-08-13 | 2019-02-12 | 中兴飞流信息科技有限公司 | PC cluster method, apparatus and computer readable storage medium |
CN109409734A (en) * | 2018-10-23 | 2019-03-01 | 中国电子科技集团公司第五十四研究所 | A kind of satellite data production scheduling system |
CN114741121A (en) * | 2022-04-14 | 2022-07-12 | 哲库科技(北京)有限公司 | Method and device for loading module and electronic equipment |
CN114741121B (en) * | 2022-04-14 | 2023-10-20 | 哲库科技(北京)有限公司 | Method and device for loading module and electronic equipment |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20180105 |