CN114691302A - Dynamic cache replacement method and device for big data processing

Info

Publication number
CN114691302A
CN114691302A
Authority
CN
China
Prior art keywords
data
data processing
cache replacement
stage
big data
Prior art date
Legal status
Pending
Application number
CN202210424807.2A
Other languages
Chinese (zh)
Inventor
周明贤
钱柱中
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202210424807.2A
Publication of CN114691302A
Legal status: Pending

Classifications

    • G06F9/45558 Hypervisor-specific management and integration aspects (under G06F9/455 Emulation; Interpretation; Software simulation)
    • G06F9/5016 Allocation of resources to service a request, the resource being the memory (under G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU])
    • G06F2009/45583 Memory management, e.g. access or allocation (under G06F9/45558)

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a dynamic cache replacement method and device for big data processing. The method comprises the following steps: abstracting the big data processing application into a directed acyclic graph G = (V, E), where the node set V represents the data computed in the application and the edge set E represents the dependencies among the data; establishing, from the data contained in G = (V, E), a mathematical model of the cache replacement problem whose objective is to minimize the overall execution time of the application, the model deciding which data to cache at each time t; simplifying the cache replacement problem based on the characteristics of big data processing; and solving the simplified cache replacement problem with a dynamic-programming approach. The invention realizes cache replacement that dynamically adapts to the data access pattern of the data processing process, improves memory utilization, and greatly reduces the execution time of big data processing applications.

Description

Dynamic cache replacement method and device for big data processing
Technical Field
The invention relates to big data processing systems and memory optimization techniques for big data processing applications, and in particular to a dynamic cache replacement method and device for big data processing applications.
Background
With the rapid development of the modern information society and the Internet, data is growing explosively: the white paper "Data Age 2025" published by IDC in 2018 predicts that the total amount of global data will reach 175 ZB by 2025. Thanks to the development of computing software based on large-scale clusters, namely big data processing systems, a great deal of work in industry and academia analyzes and processes massive data in order to mine the rich and complex information it contains, thereby promoting modern economic development, social progress, and technological innovation. Unlike the simple parallelized operations of traditional data processing systems, modern big data systems are richer in functionality, stronger in expressive power, and more user-friendly. Specifically: 1) big data processing systems provide storage, computation, analysis, and mining functions for massive data; 2) big data processing systems support user-defined methods, so that combined operations such as merging, grouping, slicing, joining, and sorting can be executed flexibly on the data; 3) big data processing systems provide users with an application-layer framework and support processing and analyzing data with the simple and clear Structured Query Language (SQL).
Thanks to the above advantages, big data processing systems (Hadoop, Spark, Flink, etc.) have been widely used in graph computation, machine learning, and stream processing. An experimental report from Microsoft on actual production clusters shows that up to 60% of the jobs in a cluster exhibit data recomputation, i.e., multiple tasks receive the same data and execute the same computation logic. To exploit this repeated computation, big data processing systems use caching to speed up the execution of data processing applications, i.e., they write intermediate data generated during execution into memory or disk. Since cached data does not need to be recomputed, the completion time of the application is greatly reduced. We observe that the caching process of memory-based big data processing systems faces a caching requirement under limited memory. Because memory is often the bottleneck of a big data processing system, the framework cannot cache all intermediate data in memory during execution and must dynamically replace the cached data at runtime: while the application runs, it decides which data to cache and replaces cached data according to priority. This is called the cache replacement problem. Since big data processing applications are often accompanied by parallel task execution and parallel data computation, cache benefits are difficult to predict, which affects cache replacement decisions; a cache replacement strategy that accounts for this parallelism is therefore urgently needed.
Disclosure of Invention
The purpose of the invention is as follows: the invention provides a dynamic cache replacement method and device for big data processing applications, which realize cache replacement that dynamically adapts to the data access pattern of the data processing process, improve memory utilization, and greatly reduce the execution time of big data processing applications.
The technical scheme is as follows: to achieve the above purpose, the invention adopts the following technical scheme:
In a first aspect, a dynamic cache replacement method for big data processing applications includes the following steps:
(1) abstracting the big data processing application into a directed acyclic graph G = (V, E), wherein each node of the node set V represents data of the application, and any element v in V has two attributes: the memory space s_v the data occupies and the time c_v required to compute it; each edge ⟨u, v⟩ of the edge set E indicates that the computation of data v in the application depends on data u;
(2) based on the directed acyclic graph G = (V, E) and the hierarchical relationship among the application, jobs, stages, and data, obtaining the execution delay of stage S_{i,j} in the i-th executed job J_i of the application, expressed as f(S_{i,j}, CS): the difference between the time at which stage S_{i,j} of job J_i finishes computing under the cached data set CS and the time at which job J_i starts executing;
(3) establishing a mathematical model of the cache replacement problem P_1, whose objective is to minimize the overall execution time of the application: at each time t at which a data computation completes, deciding the data set CS_{new,t} stored in the cache space so as to minimize the total completion time of the application from the p_t-th job onward;
(4) solving the cache replacement problem P_1 to obtain a dynamic cache replacement strategy.
Further, in step (2), the hierarchical relationship is: the big data processing application consists of several big data processing jobs executed serially; a job comprises several big data processing stages executed serially or in parallel; and a stage comprises several abstract data sets computed serially or in parallel.
Further, in the step (2), f (S)i,jCS) is calculated by the following method:
Figure BDA0003608069440000021
wherein S isi,jRepresenting the jth phase, x, of the ith job in the applicationi,j,kRepresenting a stage Si,jThe k-th calculated abstract data set, Ni,jRepresenting a stage Si,jNumber of abstract data sets in D (S)i,j) Representing a stage Si,jAt operation SiThe set of dependent phases in, g (S)i,jX, CS) represents the stage Si,jThe execution delay under the cached data set CS is equal to the phase Si,jIn the final calculation of data
Figure BDA0003608069440000022
Time and stage S of completion of calculationi,jThe difference of the starting execution time is calculated as follows:
Figure BDA0003608069440000031
wherein c (x, S)i,j) Is shown in stage Si,jTime required for calculating data x, P (x, S)i,j) Indicating data x at stage Si,jA collection of dependent data.
Further, in step (3), the cache replacement problem P_1 is expressed as:
P_1: min_{CS_{new,t}} Σ_{i=p_t}^{|J|} f(S_{i,M_i}, CS_{new,t})
s.t. 1 ≤ p_t ≤ |J|,
CS_{new,t} ⊆ CS_{old,t} ∪ {x_t},
Σ_{x ∈ CS_{new,t}} s(x) ≤ L,
∀ t ∈ T,
wherein CS_{old,t} denotes the cached data set before x_t is stored in the cache space at time t, CS_{new,t} denotes the cached data set after x_t is stored in the cache space at time t, x_t denotes the abstract data set whose computation completes at time t, T denotes the set of times at which the computations of all abstract data sets in the application complete, p_t denotes the index of the job executing at time t, S_{i,M_i} denotes the final execution stage of job J_i, s(x) denotes the memory occupied by data x, L denotes the total memory upper limit of the cache space, and |J| denotes the number of jobs in the application.
Further, in step (4), solving the cache replacement problem P_1 comprises: simplifying P_1 based on the characteristics of big data processing, and solving the simplified cache replacement problem with a dynamic-programming approach.
Further, simplifying the cache replacement problem P_1 based on the characteristics of big data processing comprises:
replacing all stages of each job in the application with the "job critical path", so that the execution mode of the stages changes from parallel to serial and problem P_1 reduces to problem P_2, wherein the "job critical path" is defined as the stage computation chain with the longest execution time in the job;
replacing the abstract data sets computed in each stage with the "hot-spot access data", so that the data computation mode within a stage changes from parallel to serial and problem P_2 reduces to problem P_3, wherein the "hot-spot access data" is defined as the abstract data sets represented by nodes with out-degree greater than 1 in the directed acyclic graph abstracted from the application;
replacing the "hot-spot access data" in each data processing stage with the execution result of the "stage representative computation", so that problem P_3 reduces to problem P_4, which is equivalent to the 0-1 knapsack problem, wherein the "stage representative computation" is the data processing operator represented by the hot-spot access data computed last in the stage.
Further, solving the simplified cache replacement problem with dynamic programming comprises:
preprocessing based on problem simplification: receiving the job set J, the data CS_{old,t} already in the cache space at time t, and the data x_t to be added to the cache space at time t; computing the "job critical path", analyzing the "hot-spot access data", and counting the "stage representative computations" and their cache gains; and returning the "stage representative data" set x' and the cache gains RRT of the stage representative data;
cache replacement based on dynamic programming: taking the set x' and the cache gains RRT returned by preprocessing, together with the memory upper limit L of the cache space, as input, and traversing each value up to the memory upper limit L and each element of the set x', decomposing the cache replacement problem into subproblems by dynamic programming, thereby obtaining the optimal caching decision CS_{new,t} of problem P_4.
Further, when the memory upper limit L of the cache space and the memory s_x occupied by each data x are natural numbers, the result of the dynamic-programming cache replacement algorithm is an optimal caching decision.
In a second aspect, a computer device comprises one or more processors and a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the steps of the big-data-processing-oriented dynamic cache replacement method according to the first aspect of the invention.
In a third aspect, a computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the big data processing oriented dynamic cache replacement method according to the first aspect of the present invention.
Beneficial effects: a difficulty of the cache replacement problem in big data processing systems is that cache gains are hard to predict because of the parallel execution modes that frequently occur in data processing applications. Addressing this difficulty, the invention mathematically models the multi-stage parallel execution of big data processing jobs and the multi-data parallel computation within big data processing stages, and on this basis defines an operator-level cache replacement problem for multi-stage parallel big data processing applications. The modeled problem is NP-hard, and the invention simplifies it based on the characteristics of the "job critical path", "hot-spot access data", and "stage representative computation". Because the simplified problem has an optimal substructure, the invention provides a solution algorithm based on dynamic programming to determine the dynamic cache replacement strategy. The invention realizes cache replacement that dynamically adapts to the data access pattern of the data processing process, improves memory utilization, and greatly reduces the execution time of big data processing applications.
Drawings
FIG. 1 is a flow chart of a dynamic cache replacement method of the present invention;
FIG. 2 is a directed acyclic graph depicting a big data processing application in a particular embodiment;
FIG. 3 is a simplified diagram of a cache replacement problem in an embodiment;
fig. 4 is a flowchart of an algorithm based on the idea of dynamic programming in an embodiment.
Detailed Description
The technical scheme of the invention is further explained by combining the attached drawings.
The invention addresses the cache replacement problem for big data processing applications. As shown in fig. 1, the invention provides a dynamic cache replacement method for big data processing applications, comprising the following steps:
(1) big data processing applications are abstracted into directed acyclic graphs.
The invention abstracts the big data processing application into a directed acyclic graph G = (V, E) to describe its data processing process. In the directed acyclic graph, each element of the node set V represents data of the application, i.e., an abstract data set. In a big data processing system, an abstract data set is defined as the execution result of a data processing operator and consists of in-memory data blocks distributed over different nodes of the cluster. Any element v of V (an abstract data set) has two attributes: the memory space s_v the data occupies and the time c_v required to compute it. Each edge ⟨u, v⟩ of the edge set E represents a computational dependency between abstract data sets: the edge ⟨u, v⟩ means that the computation of data v depends on data u. In fact, a dependency between data in a big data processing application has a concrete meaning, namely a specific operator execution; the edge set E does not record this, because the attribute c_v of each node v, the time required to compute v, already accounts for the operator's execution.
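As an illustrative sketch (the class and field names below are assumptions chosen for readability, not part of the patent text), the abstraction can be expressed in Python roughly as follows:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Data:
    name: str
    s: float                                          # s_v: memory the data occupies
    c: float                                          # c_v: time required to compute it
    deps: List["Data"] = field(default_factory=list)  # edges <u, v>: v depends on each u

# Example: w depends on u and v, i.e. edges <u, w> and <v, w>
u = Data("u", s=2.0, c=1.0)
v = Data("v", s=1.0, c=3.0)
w = Data("w", s=4.0, c=2.0, deps=[u, v])
```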
(2) Analyze the parallel execution of jobs in the big data processing application from the information contained in the directed acyclic graph.
The directed acyclic graph contains hierarchical information consisting of the big data processing application, big data processing jobs, big data processing stages, and abstract data sets. Specifically: the application consists of several jobs executed serially; a job comprises several stages executed serially or in parallel; and a stage comprises several abstract data sets computed serially or in parallel. From the node set V, the edge set E, and this hierarchical information, the execution delay of the i-th executed job J_i in the application can be obtained, which further yields the mathematical model of the optimization objective of the cache replacement problem. Since a data processing job consists of stages and a stage consists of abstract data sets (hereinafter simply "data"), the execution delay is described below in the hierarchical order of data, stages, and jobs, using the notation of Table 1.
Symbol | Definition
J_i | the i-th executed job in the application
M_i | the number of stages contained in job J_i
S_{i,j} | the j-th stage of the i-th job in the application
N_{i,j} | the number of data contained in stage S_{i,j}
x_{i,j,k} | the k-th computed data of stage S_{i,j}
D(S_{i,j}) | the set of stages on which S_{i,j} depends within job J_i
c(x, S_{i,j}) | the time required to compute data x in stage S_{i,j}
P(x, S_{i,j}) | the set of data on which x depends in stage S_{i,j}
r(x, S_{i,j}) | the cache gain of data x in stage S_{i,j}
TABLE 1. Notation of the abstract directed acyclic graph of a big data processing job
In the directed acyclic graph abstracted from the big data processing application, the function c(x, S_{i,j}) denotes the time required to compute data x in stage S_{i,j} (equivalent to c_x above), and the cache gain of data x in stage S_{i,j} (the time saved in stage S_{i,j} by caching data x) is denoted r(x, S_{i,j}). The function P(x, S_{i,j}) is defined as the set of data on which x depends in stage S_{i,j}. The relationship between c(x, S_{i,j}) and r(x, S_{i,j}) is:
r(x, S_{i,j}) = c(x, S_{i,j}) + max_{y ∈ P(x, S_{i,j})} r(y, S_{i,j}),
i.e., the cache gain of x equals the critical-path cost of recomputing x from its dependencies (the maximum over an empty P(x, S_{i,j}) being taken as 0).
in each phase of a large data processing application, there is one piece of data that is not relied upon by any other data in that phase, which is referred to as the final computed data for that phase. On this basis, the execution latency of a stage may be associated with the final computed data for that stage. Assume that the set of cached data in the cache space is CS and the variable xi,j,kRepresents the stage Si,jThe k-th calculated data of (1), variable Ni,jIs a stage Si,jThe amount of data contained, then stage Si,jThe execution delay under the cached data set CS is equal to the phase Si,jTo finally calculate data
Figure BDA0003608069440000063
Time and stage S of completion of calculationi,jThe difference of the starting execution time is formally expressed as
Figure BDA0003608069440000064
Wherein the recursive function g (S)i,jX, CS) is used to model the phenomenon of multiple data parallel computing within a phase, which is formally described as follows:
Figure BDA0003608069440000071
in the operation of large data processing application, when the execution of one stage depends on two or more other stages, the parallel execution phenomenon of the stages will occur, and the operation execution delay modeling in the invention mainly focuses on the parallel execution phenomenon of the stages. Given that there is a phase in each data processing job that is not relied upon by any other phase in the job, referred to as the final execution phase of the job, we associate the execution latency of the job with the final execution phase of the job. Assume that a set of cached data in the cache space is CS and the variable JiRepresenting the ith executed job in a big data processing application, Job JiThe number of stages contained is MiFunction D (S)i,j) Representing a stage Si,jAt operation JiThe set of dependent phases in (1). Then operation JiExecution latency under the cached data set CS equals Job JiMiddle final execution stage
Figure BDA0003608069440000072
Time of completion of execution and job JiThe difference of the starting execution time is formally expressed as
Figure BDA0003608069440000073
Wherein the recursive function f (S)i,jCS) is used to model a multi-stage parallel execution phenomenon within the industry, in the form ofThe description is as follows:
Figure BDA0003608069440000074
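A minimal runnable sketch of the two recursions, under assumed structures (the names are illustrative, not from the patent):

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Data:
    name: str
    c: float                                           # c(x, S_ij): time to compute x
    deps: List["Data"] = field(default_factory=list)   # P(x, S_ij)

@dataclass
class Stage:
    final: Data                                        # x_{i,j,N_ij}: final computed data
    deps: List["Stage"] = field(default_factory=list)  # D(S_ij): stages it depends on

def g(x: Data, cs: Set[str]) -> float:
    # Cached data costs nothing; otherwise pay c_x after the slowest
    # (parallel) dependency finishes.
    if x.name in cs:
        return 0.0
    return x.c + max((g(y, cs) for y in x.deps), default=0.0)

def f(stage: Stage, cs: Set[str]) -> float:
    # A stage starts after the slowest stage it depends on completes.
    return g(stage.final, cs) + max((f(s, cs) for s in stage.deps), default=0.0)

# Example: caching "b" removes its whole recomputation chain.
a = Data("a", 1.0); b = Data("b", 3.0, [a]); out = Data("out", 2.0, [b])
s1 = Stage(final=b); s2 = Stage(final=out, deps=[s1])
print(f(s2, set()), f(s2, {"b"}))   # 10.0 2.0
```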
as shown in FIG. 2, the present invention abstracts big data processing applications into a directed acyclic graph. The directed acyclic graph contains hierarchical information consisting of applications, jobs, phases, and data. In addition, the nodes in the directed acyclic graph represent data in a big data processing application, and the edges represent the dependency of the data. Based on the directed acyclic graph, the operation time delay under different cache states can be obtained, and the method is used for modeling the cache replacement problem facing the big data processing application. When the application executes, the big data processing framework stores intermediate data generated during the operation into a cache space. Large data processing frameworks suffer from a trade-off problem when performing cache acceleration with limited memory, namely the cache replacement problem. For example, when the data 27 of stage 4 in fig. 2 is calculated and needs to be buffered, if the buffer space is about to be exhausted, it is necessary to decide the data that should be stored in the buffer space.
(3) A mathematical model is established for the cache replacement problem.
The mathematical model of the cache replacement problem for big data processing frameworks has the following characteristics. At the time t at which any data x_t finishes computing, the decision variable CS_{new,t} represents the data that should be placed in the cache space. Let CS_{old,t} denote the data already in the cache space before x_t is considered, let T denote the set of times at which the computations of all data in the application complete, let p_t denote the index of the job executing at time t, let |J| denote the number of jobs of the application, let the function s(x) (equivalent to the variable s_x in step (1)) denote the memory space occupied by data x, and let L denote the memory upper limit of the cache space. The operator-level cache replacement problem P_1 for big data processing applications under limited memory can then be expressed as: at each time t at which a data computation completes, decide the data set CS_{new,t} stored in the cache space so as to minimize the total completion time of the application from the p_t-th job onward. Using the notation of Table 2, problem P_1 is formally described as:
P_1: min_{CS_{new,t}} Σ_{i=p_t}^{|J|} f(S_{i,M_i}, CS_{new,t})
s.t. 1 ≤ p_t ≤ |J|,
CS_{new,t} ⊆ CS_{old,t} ∪ {x_t},
Σ_{x ∈ CS_{new,t}} s(x) ≤ L,
∀ t ∈ T.
Symbol | Definition
x_t | the data whose computation completes at time t
CS_{old,t} | the cached data set before x_t is stored in the cache space at time t
CS_{new,t} | the cached data set after x_t is stored in the cache space at time t
T | the set of times at which the computations of all data in the application complete
p_t | the index of the job executing at time t
|J| | the number of jobs in the application
s(x) | the memory space occupied by data x
L | the memory upper limit of the cache space
TABLE 2. Notation of the cache replacement problem P_1 for big data processing applications
The constraints have the following characteristics. First, when considering the cache replacement problem for a particular big data processing application, the optimization objective only needs to minimize the overall completion time of the p_t-th and subsequently executed jobs, and the value of p_t does not exceed the number of data processing jobs in the application; the corresponding constraint is 1 ≤ p_t ≤ |J|. Second, the decision variable CS_{new,t} is a subset of the set CS_{old,t} ∪ {x_t}; the corresponding constraint is CS_{new,t} ⊆ CS_{old,t} ∪ {x_t}. Third, the cached data set in the caching decision must not exceed the memory capacity of the cache space; the corresponding constraint is Σ_{x ∈ CS_{new,t}} s(x) ≤ L.
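For intuition, the per-instant decision in P_1 amounts to the exhaustive search sketched below. This is a toy baseline for tiny instances only, not the algorithm of the invention, since P_1 is NP-hard (as established in step (4) below); remaining_time stands in for the objective Σ_{i=p_t}^{|J|} f(S_{i,M_i}, CS):

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Iterable

def best_cache_set(candidates: Iterable[str],
                   size: Dict[str, int],
                   L: int,
                   remaining_time: Callable[[FrozenSet[str]], float]) -> FrozenSet[str]:
    """Enumerate every subset of CS_old ∪ {x_t} that fits in L and keep the
    one minimizing the remaining completion time of jobs p_t..|J|."""
    cand = list(candidates)
    best, best_t = frozenset(), remaining_time(frozenset())
    for r in range(1, len(cand) + 1):
        for subset in combinations(cand, r):
            if sum(size[x] for x in subset) <= L:
                t = remaining_time(frozenset(subset))
                if t < best_t:
                    best, best_t = frozenset(subset), t
    return best

# Toy usage: three candidates, memory cap 4, a stand-in objective.
size = {"a": 3, "b": 2, "c": 2}
base = {"a": 5.0, "b": 4.0, "c": 1.0}
rt = lambda cs: 20.0 - sum(base[x] for x in cs)
print(best_cache_set(["a", "b", "c"], size, 4, rt))  # frozenset({'a'})
```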
(4) the cache replacement problem is simplified based on big data handling features.
Assume that in a big data processing application each job contains exactly one stage and each stage contains exactly one datum; this special case of problem P_1 is denoted P_1*. It can be observed that P_1* is equivalent to the 0-1 knapsack problem, so the 0-1 knapsack problem reduces to P_1, and P_1 is therefore NP-hard. The invention accordingly simplifies problem P_1 based on the characteristics of the "job critical path", "hot-spot access data", and "stage representative computation" in the data processing process, making it easier to solve. The details are as follows:
A. The "job critical path" of a big data processing application is defined as the stage computation chain with the longest execution time in the job. The delay of the "job critical path" approximates the overall delay of the job, and the stages on the critical path all execute serially. Thus the stage set {S_{i,1}, ..., S_{i,M_i}} of a big data processing job is replaced by the "job critical path" {S'_{i,1}, ..., S'_{i,M'_i}}. The recursive function f(S_{i,j}, CS) describing multi-stage parallelism can then be removed from problem P_1, and problem P_1 reduces to problem P_2:
P_2: min_{CS_{new,t}} Σ_{i=p_t}^{|J|} Σ_{j=1}^{M'_i} g(S'_{i,j}, x'_{i,j,N'_{i,j}}, CS_{new,t})
s.t. 1 ≤ p_t ≤ |J|,
CS_{new,t} ⊆ CS_{old,t} ∪ {x_t},
Σ_{x ∈ CS_{new,t}} s(x) ≤ L,
∀ t ∈ T,
wherein the variables M'_i, S'_{i,j}, N'_{i,j}, and x'_{i,j,k} denote the concepts related to the "job critical path" (detailed definitions in Table 3), and the function g(S_{i,j}, x, CS) is defined in step (2).
Symbol | Definition
M'_i | the number of stages contained in the critical path of job J_i
N'_{i,j} | the number of data contained in stage S'_{i,j}
S'_{i,j} | the j-th stage in the "critical path" of the i-th job in the application
x'_{i,j,k} | the k-th computed data of stage S'_{i,j}
TABLE 3. Notation of problem P_2
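A sketch of step A as a memoized longest-path search over the stage dependency graph (the function and parameter names are assumptions):

```python
from functools import lru_cache
from typing import Dict, List, Tuple

def job_critical_path(final_stage: str,
                      delay: Dict[str, float],
                      deps: Dict[str, List[str]]) -> Tuple[float, Tuple[str, ...]]:
    """Longest-delay chain of stages ending at the job's final stage."""
    @lru_cache(maxsize=None)
    def longest(s: str) -> Tuple[float, Tuple[str, ...]]:
        preds = deps.get(s, [])
        if not preds:
            return delay[s], (s,)
        t, chain = max(longest(p) for p in preds)
        return t + delay[s], chain + (s,)
    return longest(final_stage)

# Example: stage "s3" depends on "s1" and "s2"; the slower branch wins.
delay = {"s1": 4.0, "s2": 1.0, "s3": 2.0}
deps = {"s3": ["s1", "s2"]}
print(job_critical_path("s3", delay, deps))  # (6.0, ('s1', 's3'))
```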
B. The "hot-spot access data" of a big data processing application is defined as the data represented by nodes with out-degree greater than 1 in the directed acyclic graph abstracted from the application. The cache gain of the "hot-spot access data" approximates the cache gain of all data, and the "hot-spot access data" can be treated as approximately serially computed. Therefore, after the data computed in each stage are replaced by the "hot-spot access data", the data computation mode within a stage changes from parallel to approximately serial, and the recursive function g(S'_{i,j}, x, CS) describing multi-data parallel computation can be removed from problem P_2. To ease the formalization, the invention equivalently replaces "minimizing the overall completion time of the jobs" in the optimization objective with "maximizing the time saved by caching data". On this basis, problem P_2 reduces to problem P_3:
P_3: max_{CS_{new,t}} Σ_{i=p_t}^{|J|} Σ_{j=1}^{M'_i} Σ_{k ∈ μ(i,j,CS_{new,t})} r(x'_{i,j,k}, S'_{i,j})
s.t. 1 ≤ p_t ≤ |J|,
CS_{new,t} ⊆ CS_{old,t} ∪ {x_t},
Σ_{x ∈ CS_{new,t}} s(x) ≤ L,
∀ t ∈ T.
The set of "hot-spot access data" in the application is defined as HD = {v ∈ V : outdegree(v) > 1}. The functions μ(i, j, CS_{new,t}) and φ(i, j, CS_{new,t}) are defined as the sets formed by the indices of the "hot-spot access data" of stage S'_{i,j} that are, respectively, inside and outside the cached set, formally:
μ(i, j, CS) = {k : x'_{i,j,k} ∈ HD and x'_{i,j,k} ∈ CS},
φ(i, j, CS) = {k : x'_{i,j,k} ∈ HD and x'_{i,j,k} ∉ CS}.
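Identifying the "hot-spot access data" reduces to an out-degree count over the DAG edges; a minimal sketch:

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple

def hotspot_data(edges: Iterable[Tuple[str, str]]) -> Set[str]:
    """HD: data whose node out-degree in the application DAG exceeds 1,
    i.e. data consumed by more than one downstream computation."""
    out_degree = defaultdict(int)
    for u, _v in edges:
        out_degree[u] += 1
    return {u for u, d in out_degree.items() if d > 1}

# Example: "a" feeds both "b" and "c", so only "a" is hot-spot data.
print(hotspot_data([("a", "b"), ("a", "c"), ("b", "c")]))  # {'a'}
```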
C. Since big data processing frameworks feature "lazy computation", the operator represented by the hot-spot access data computed last in a data processing stage is called the "stage representative computation". It is observed that the "stage representative computation" can approximately replace the overall computation of the stage. Therefore, after the "hot-spot access data" in each data processing stage are replaced by the execution result of the "stage representative computation", namely the "stage representative data", problem P_3 reduces to problem P_4:
P_4: max_{CS_{new,t}} Σ_{i=p_t}^{|J|} Σ_{j=1}^{M'_i} η(x'_{i,j,k*}, CS_{new,t}) · r(x'_{i,j,k*}, S'_{i,j})
s.t. CS_{new,t} ⊆ CS_{old,t} ∪ {x_t},
Σ_{x ∈ CS_{new,t}} s(x) ≤ L,
k* = max(μ(i, j, CS_{old,t} ∪ {x_t})).
In problem P_4, the variable k* denotes the index of the representative data of each stage, i.e., of the "stage representative computation"; k* is a fixed value for each given stage. The function η(x, CS) indicates whether data x is an element of the set CS, 1 meaning yes and 0 meaning no.
D. Based on the characteristics of the "job critical path", "hot-spot access data", and "stage representative computation", the operator-level cache replacement problem P_1 for multi-stage parallel big data processing applications can thus be simplified to problem P_4. On this basis, suppose that at each time t the number of data in the set CS_{old,t} ∪ {x_t} is n, that the k-th computed datum of CS_{old,t} ∪ {x_t} is x'_k, and that the decision variable z_k ∈ {0, 1} determines whether data x'_k is cached (1 means cached, 0 means not cached). Problem P_4 can then be equivalently converted into the following problem:
max_z Σ_{k=1}^{n} z_k · RRT_k
s.t. Σ_{k=1}^{n} z_k · s(x'_k) ≤ L,
k* = max(μ(i, j, CS_{old,t} ∪ {x_t})),
wherein RRT_k is the time saved by caching data x'_k (RRT_k = 0 when x'_k is not a "stage representative computation"). In the 0-1 knapsack problem, suppose the total number of items is n and the weight and value of item j are w_j and p_j respectively. It can be observed that the decision variables z_j ∈ {0, 1} of the 0-1 knapsack problem, the item weights w_j, and the item values p_j correspond respectively to the decision variables z_k of problem P_4, the memory s(x'_k) occupied by data x'_k, and the time RRT_k saved by caching data x'_k. Problem P_4 is therefore equivalent to the 0-1 knapsack problem.
Fig. 3 shows the specific contents of the "job critical path", "hot-spot access data", and "stage representative computation". On this basis it can be observed that a cache replacement problem considering only the data relevant to the "stage representative computation", the "stage representative data", can approximately replace the cache replacement problem considering all data, which greatly reduces the complexity of the problem.
(5) An optimal algorithm is designed based on dynamic programming.
Based on the optimal-substructure property of the simplified problem P_4, a cache replacement algorithm based on dynamic programming is designed. Fig. 4 shows the flow of the algorithm. After a preprocessing step based on problem simplification, the dynamic-programming cache replacement module receives the "stage representative data" set x' and the cache gains RRT of the stage representative data, together with the memory upper limit L of the cache space, as input. Traversing each value up to the memory upper limit L and each element of the set x', the module decomposes the cache replacement problem into subproblems by dynamic programming, obtaining the optimal caching decision CS_{new,t} of problem P_4. Specifically, the algorithm comprises the following modules:
(5.1) Preprocessing module based on problem simplification: the module receives the job set J, the data CS_{old,t} already in the cache space at time t, and the data x_t to be added to the cache space at time t. By computing the "job critical path", analyzing the "hot-spot access data", and counting the "stage representative computations" and their cache gains, it returns the "stage representative data" set x' and the cache gains RRT of the stage representative data, which serve as the input of the dynamic-programming cache replacement module. The steps are as follows:
A. Receive the inputs J, CS_{old,t}, and x_t
B. Initialize the outputs, the "stage representative data" and their cache gains: x' and RRT
C. Compute the "job critical path" CP of J by a longest-path algorithm
D. Collect the "hot-spot access data" HD from the directed acyclic graph expressed by J
E. Determine the set J_t of jobs in J not yet completed at time t
F. For each unexecuted stage S'_{i,j} of each job J_i in J_t, perform the following operations:
a) Collect the topological order TP of all data in stage S'_{i,j}
b) Obtain the last element x_u of TP ∩ (CS_{old,t} ∪ {x_t}) ∩ HD
c) Update x': x' ← x' ∪ {x_u}
d) Update RRT: RRT_u ← RRT_u + the gain of caching data x_u in stage S'_{i,j}
G. Output the "stage representative data" and their cache gains: x' and RRT
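A hedged sketch of steps F(a)-(d) above, assuming the per-stage topological orders, the candidate set CS_{old,t} ∪ {x_t}, the hot-spot set HD, and a per-stage gain function are given (the names are assumptions):

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

def stage_representatives(stage_orders: Dict[str, List[str]],
                          candidates: Set[str],
                          hd: Set[str],
                          gain: Callable[[str, str], float]
                          ) -> Tuple[Set[str], Dict[str, float]]:
    """For each unexecuted stage, walk its topological data order TP, pick
    the last element that is both a caching candidate and hot-spot data
    (steps F(a)-(b)), and accumulate its cache gain (steps F(c)-(d))."""
    reps: Set[str] = set()
    rrt: Dict[str, float] = defaultdict(float)
    for stage, tp in stage_orders.items():
        eligible = [x for x in tp if x in candidates and x in hd]
        if not eligible:
            continue
        x_u = eligible[-1]            # last element of TP ∩ (CS_old ∪ {x_t}) ∩ HD
        reps.add(x_u)                 # x' <- x' ∪ {x_u}
        rrt[x_u] += gain(x_u, stage)  # RRT_u <- RRT_u + gain in stage S'_ij
    return reps, dict(rrt)
```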
(5.2) Cache replacement module based on dynamic programming: the module receives the output of the preprocessing module, namely the "stage representative data" set x' and the cache gains RRT of the stage representative data, together with the memory upper limit L of the cache space, as its input. Traversing each value up to the memory upper limit L and each element of the set x', the module decomposes the cache replacement problem into subproblems by dynamic programming, obtaining the optimal caching decision CS_{new,t} of problem P_4. The specific steps are as follows:
A. Receive the inputs: the "stage representative data" x', the cache gains RRT, and the memory upper limit L of the cache space
B. Initialize the dynamic programming array dp and the set C of optimal subproblem results
C. Loop over i = 1 to |x'| and j = 1 to L, performing the following operations:
a) If s(x'_i) > j, perform: dp_{i,j}, C_{i,j} ← dp_{i-1,j}, C_{i-1,j}
b) Otherwise, perform: dp_{i,j} ← max(dp_{i-1,j}, dp_{i-1,j-s(x'_i)} + RRT_i), with C_{i,j} ← C_{i-1,j} if the first term attains the maximum and C_{i,j} ← C_{i-1,j-s(x'_i)} ∪ {x'_i} otherwise
D. Output the optimal caching decision: C_{|x'|,L}
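A sketch of this module as a textbook 0-1 knapsack over the stage representative data, assuming integral sizes (cf. the natural-number condition stated above):

```python
from typing import Dict, FrozenSet, List

def dp_cache_decision(items: List[str],
                      size: Dict[str, int],
                      rrt: Dict[str, float],
                      L: int) -> FrozenSet[str]:
    """Maximize total cache gain RRT under the memory cap L and return
    the optimal set CS_new of representatives to keep cached."""
    n = len(items)
    dp = [[0.0] * (L + 1) for _ in range(n + 1)]
    keep: List[List[FrozenSet[str]]] = [[frozenset()] * (L + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        x = items[i - 1]
        for j in range(L + 1):
            dp[i][j], keep[i][j] = dp[i - 1][j], keep[i - 1][j]      # skip x
            if size[x] <= j and dp[i - 1][j - size[x]] + rrt[x] > dp[i][j]:
                dp[i][j] = dp[i - 1][j - size[x]] + rrt[x]           # cache x
                keep[i][j] = keep[i - 1][j - size[x]] | {x}
    return keep[n][L]

# Example: two representatives competing for 4 units of memory.
print(dp_cache_decision(["a", "b"], {"a": 3, "b": 2}, {"a": 5.0, "b": 4.0}, 4))
# frozenset({'a'})
```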
The computational complexity of the algorithm is jointly determined by the preprocessing module based on problem simplification and the cache replacement module based on dynamic programming. The two steps of computing the "job critical path" and analyzing the "hot-spot access data" in the preprocessing module need to be executed only once, before the big data processing application starts, so their influence is small. The key step of the preprocessing module is therefore counting the "stage representative computations" and their cache gains, whose computational complexity is O(|V|^2). The computational complexity of the dynamic-programming cache replacement module is determined by the search space of the dynamic programming algorithm, which is directly related to the scale of the "hot-spot access data" and to the memory upper limit of the cache space, and is O(|V|^2 × L). The overall computational complexity of the dynamic cache replacement algorithm for big data processing applications is therefore O(|V|^2 × L).
Because memory resources are limited, using them for cache acceleration in big data processing applications involves a trade-off. The invention provides a dynamic cache replacement method for multi-stage parallel big data processing applications. To overcome the difficulty that the parallel computation in data processing applications is hard to describe formally, the invention establishes a mathematical model of the caching process of multi-stage parallel data processing applications and proves that the operator-level cache replacement problem is NP-hard. In view of the complexity of the problem, its difficulty is reduced, based on observation of the real caching process of big data processing systems, using the characteristics of the "job critical path", "hot-spot access data", and "stage representative computation", making the problem easier to solve. Because the simplified problem has an optimal substructure, a dynamic cache replacement algorithm based on dynamic programming is designed. The invention fills the gap in cache replacement for parallel computation under big data processing frameworks, realizes cache replacement that dynamically adapts to the data access pattern of the data processing process, improves memory utilization, and greatly reduces the execution time of big data processing applications.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing associated hardware, and the program may be stored in a computer-readable storage medium. In the context of the present invention, the computer-readable medium may be considered tangible and non-transitory. Non-limiting examples of a non-transitory tangible computer-readable medium include a non-volatile memory circuit (e.g., a flash memory circuit, an erasable programmable read-only memory circuit, or a mask read-only memory circuit), a volatile memory circuit (e.g., a static random access memory circuit or a dynamic random access memory circuit), a magnetic storage medium (e.g., an analog or digital tape or hard drive), and an optical storage medium (e.g., a CD, DVD, or blu-ray disc), among others.
Program code for implementing the methods of the present invention may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
Further, while operations are depicted in a particular order, this should be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the invention. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the details of those embodiments; various equivalent modifications can be made within the technical spirit of the present invention, and such equivalent modifications all fall within the protection scope of the present invention.

Claims (10)

1. A dynamic cache replacement method for big data processing applications, characterized by comprising the following steps:
(1) abstracting the big data processing application into a directed acyclic graph G = (V, E), wherein each node of the node set V represents data of the application, and any element v in V has two attributes: the memory space s_v the data occupies and the time c_v required to compute it; each edge ⟨u, v⟩ of the edge set E indicates that the computation of data v in the application depends on data u;
(2) based on the directed acyclic graph G = (V, E) and the hierarchical relationship among the application, jobs, stages, and data, obtaining the execution delay of stage S_{i,j} in the i-th executed job J_i of the application, expressed as f(S_{i,j}, CS): the difference between the time at which stage S_{i,j} of job J_i finishes computing under the cached data set CS and the time at which job J_i starts executing;
(3) establishing a mathematical model of the cache replacement problem P_1, whose objective is to minimize the overall execution time of the application: at each time t at which a data computation completes, deciding the data set CS_{new,t} stored in the cache space so as to minimize the total completion time of the application from the p_t-th job onward;
(4) solving the cache replacement problem P_1 to obtain a dynamic cache replacement strategy.
2. The dynamic cache replacement method according to claim 1, wherein in step (2) the hierarchical relationship is: the big data processing application consists of several big data processing jobs executed serially; a job comprises several big data processing stages executed serially or in parallel; and a stage comprises several abstract data sets computed serially or in parallel.
3. The dynamic cache replacement method according to claim 1, wherein in step (2), f(S_{i,j}, CS) is calculated as:
f(S_{i,j}, CS) = g(S_{i,j}, x_{i,j,N_{i,j}}, CS) + max_{S' ∈ D(S_{i,j})} f(S', CS),
wherein S_{i,j} denotes the j-th stage of the i-th job in the application, x_{i,j,k} denotes the k-th computed abstract data set of stage S_{i,j}, N_{i,j} denotes the number of abstract data sets in stage S_{i,j}, D(S_{i,j}) denotes the set of stages on which S_{i,j} depends within job J_i (the maximum over an empty set being taken as 0), and g(S_{i,j}, x, CS) denotes the execution delay of stage S_{i,j} under the cached data set CS, equal to the difference between the time at which the final computed data x_{i,j,N_{i,j}} of stage S_{i,j} finishes computing and the time at which stage S_{i,j} starts executing, calculated as:
g(S_{i,j}, x, CS) = 0, if x ∈ CS;
g(S_{i,j}, x, CS) = c(x, S_{i,j}) + max_{y ∈ P(x, S_{i,j})} g(S_{i,j}, y, CS), otherwise,
wherein c(x, S_{i,j}) denotes the time required to compute data x in stage S_{i,j} and P(x, S_{i,j}) denotes the set of data on which x depends in stage S_{i,j}.
4. The dynamic cache replacement method according to claim 1, wherein in step (3) the cache replacement problem P_1 is expressed as:
P_1: min_{CS_{new,t}} Σ_{i=p_t}^{|J|} f(S_{i,M_i}, CS_{new,t})
s.t. 1 ≤ p_t ≤ |J|,
CS_{new,t} ⊆ CS_{old,t} ∪ {x_t},
Σ_{x ∈ CS_{new,t}} s(x) ≤ L,
∀ t ∈ T,
wherein CS_{old,t} denotes the cached data set before x_t is stored in the cache space at time t, CS_{new,t} denotes the cached data set after x_t is stored in the cache space at time t, x_t denotes the data whose computation completes at time t, T denotes the set of times at which the computations of all data in the application complete, p_t denotes the index of the job executing at time t, S_{i,M_i} denotes the final execution stage of job J_i, s(x) denotes the memory occupied by data x, L denotes the total memory upper limit of the cache space, and |J| denotes the number of jobs in the application.
5. The dynamic cache replacement method according to claim 1, wherein in step (4) solving the cache replacement problem P_1 comprises: simplifying P_1 based on the characteristics of big data processing, and solving the simplified cache replacement problem with a dynamic-programming approach.
6. The dynamic cache replacement method according to claim 5, wherein simplifying the cache replacement problem P_1 based on the characteristics of big data processing comprises:
replacing all stages of each job in the application with the "job critical path", so that the execution mode of the stages changes from parallel to serial and problem P_1 reduces to problem P_2, wherein the "job critical path" is defined as the stage computation chain with the longest execution time in the job;
replacing the abstract data sets computed in each stage with the "hot-spot access data", so that the data computation mode within a stage changes from parallel to serial and problem P_2 reduces to problem P_3, wherein the "hot-spot access data" is defined as the abstract data sets represented by nodes with out-degree greater than 1 in the directed acyclic graph abstracted from the application;
replacing the "hot-spot access data" in each data processing stage with the execution result of the "stage representative computation", so that problem P_3 reduces to problem P_4, which is equivalent to the 0-1 knapsack problem, wherein the "stage representative computation" is the data processing operator represented by the data computed last in the data processing stage.
7. The dynamic cache replacement method according to claim 6, wherein solving the simplified cache replacement problem based on dynamic programming comprises:
preprocessing based on problem simplification: receiving the job set J, the data CS_{old,t} already in the cache space at time t, and the data x_t to be added to the cache space at time t; computing the "job critical path", analyzing the "hot-spot access data", and counting the "stage representative computations" and their cache gains; and returning the "stage representative data" set x' and the cache gains RRT of the stage representative data;
cache replacement based on dynamic programming: taking the set x' and the cache gains RRT returned by preprocessing, together with the memory upper limit L of the cache space, as input, and traversing each value up to the memory upper limit L and each element of the set x', decomposing the cache replacement problem into subproblems by dynamic programming, thereby obtaining the optimal caching decision CS_{new,t} of problem P_4.
8. The dynamic cache replacement method according to claim 7, wherein when the memory upper limit L of the cache space and the memory s_x occupied by each data x are natural numbers, the result of the dynamic-programming cache replacement algorithm is an optimal caching decision.
9. A computer device, comprising:
one or more processors;
a memory;
and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs when executed implementing the steps of the method according to any one of claims 1-8.
10. A computer-readable storage medium, on which one or more computer programs are stored, which when executed by a processor implement the steps of the method according to any of claims 1-8.
CN202210424807.2A 2022-04-21 2022-04-21 Dynamic cache replacement method and device for big data processing Pending CN114691302A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210424807.2A CN114691302A (en) 2022-04-21 2022-04-21 Dynamic cache replacement method and device for big data processing

Publications (1)

Publication Number Publication Date
CN114691302A 2022-07-01

Family

ID=82144269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210424807.2A Pending CN114691302A (en) 2022-04-21 2022-04-21 Dynamic cache replacement method and device for big data processing

Country Status (1)

Country Link
CN (1) CN114691302A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115221460A (en) * 2022-09-20 2022-10-21 浙江保融科技股份有限公司 Method for solving ordered knapsack problem segmentation dynamic planning under limited resources
CN115221460B (en) * 2022-09-20 2023-01-06 浙江保融科技股份有限公司 Method for solving ordered knapsack problem segmentation dynamic programming under limited resources

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination