CN114691302A - Dynamic cache replacement method and device for big data processing

Info

Publication number
CN114691302A
CN114691302A
Authority
CN
China
Prior art keywords
data
data processing
cache replacement
stage
big data
Prior art date
Legal status
Pending
Application number
CN202210424807.2A
Other languages
Chinese (zh)
Inventor
周明贤
钱柱中
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202210424807.2A
Publication of CN114691302A
Legal status: Pending

Classifications

    • G06F9/45558 Hypervisor-specific management and integration aspects (under G06F9/455 Emulation; Interpretation; Software simulation)
    • G06F9/5016 Allocation of resources to service a request, the resource being the memory (under G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU])
    • G06F2009/45583 Memory management, e.g. access or allocation (under G06F9/45558)

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a dynamic cache replacement method and device for big data processing. The method comprises the following steps: abstracting the big data processing application into a directed acyclic graph G = (V, E), where the node set V represents the data computed in the application and the edge set E represents the dependencies among the data; establishing, from the data contained in G = (V, E), a mathematical model of the cache replacement problem whose objective is to minimize the overall execution time of the application, the model deciding which data to cache at each time t; simplifying the cache replacement problem based on the characteristics of big data processing; and solving the simplified cache replacement problem with a dynamic-programming approach. The invention realizes cache replacement that dynamically adapts to the data access pattern of the data processing process, improves memory utilization, and greatly reduces the execution time of big data processing applications.

Description

Dynamic cache replacement method and device for big data processing
Technical Field
The invention relates to big data processing systems and memory optimization techniques for big data processing applications, and in particular to a dynamic cache replacement method and device for big data processing applications.
Background
With the rapid development of the modern information society and the Internet, data is growing explosively: the white paper "Data Age 2025" published by IDC in 2018 predicts that the total amount of global data will reach 175 ZB by 2025. Thanks to the development of computing software based on large-scale clusters, namely big data processing systems, a great deal of work in industry and academia analyzes and processes massive data in order to mine the rich and complex information it contains, thereby promoting modern economic development, social progress, and technological innovation. Unlike the simple parallelized operations of traditional data processing systems, modern big data systems are richer in functionality, stronger in expressive power, and more user-friendly. Specifically: 1) big data processing systems provide storage, computation, analysis, and mining functions for massive data; 2) big data processing systems support user-defined methods, so that combined operations such as merging, grouping, slicing, joining, and sorting can be executed flexibly on the data; 3) big data processing systems provide users with an application-layer framework and support processing and analyzing data with the simple and clear Structured Query Language (SQL).
Thanks to the above advantages, big data processing systems (Hadoop, Spark, Flink, etc.) have been widely used in graph computation, machine learning, and stream processing. An experimental report from Microsoft on actual production clusters shows that up to 60% of the jobs in a cluster exhibit data recomputation, i.e., multiple tasks receive the same data and execute the same computation logic. To exploit this repeated computation, big data processing systems use caching to speed up the execution of data processing applications, i.e., they write intermediate data generated during execution into memory or disk. Since cached data does not need to be recomputed, the completion time of the application is greatly reduced. We observe that the caching process of memory-based big data processing systems faces a caching requirement under limited memory. Because memory is often the bottleneck of a big data processing system, the framework cannot cache all intermediate data in memory during execution and must dynamically replace the cached data at runtime: while the application runs, it decides which data to cache and replaces cached data according to priority. This is called the cache replacement problem. Since big data processing applications are often accompanied by parallel task execution and parallel data computation, cache benefits are difficult to predict, which affects cache replacement decisions; a cache replacement strategy that accounts for this parallelism is therefore urgently needed.
Disclosure of Invention
The purpose of the invention is as follows: the invention provides a dynamic cache replacement method and device for big data processing applications, which realize cache replacement that dynamically adapts to the data access pattern of the data processing process, improve memory utilization, and greatly reduce the execution time of big data processing applications.
The technical scheme is as follows: to achieve the above purpose, the invention adopts the following technical scheme:
In a first aspect, a dynamic cache replacement method for big data processing applications includes the following steps:
(1) abstracting the big data processing application into a directed acyclic graph G = (V, E), wherein each node of the node set V represents data of the application, and any element v in V has two attributes: the memory space s_v the data occupies and the time c_v required to compute it; each edge ⟨u, v⟩ of the edge set E indicates that the computation of data v in the application depends on data u;
(2) based on the directed acyclic graph G = (V, E) and the hierarchical relationship among the application, jobs, stages, and data, obtaining the execution delay of stage S_{i,j} in the i-th executed job J_i of the application, expressed as f(S_{i,j}, CS): the difference between the time at which stage S_{i,j} of job J_i finishes computing under the cached data set CS and the time at which job J_i starts executing;
(3) establishing a mathematical model of the cache replacement problem P_1, whose objective is to minimize the overall execution time of the application: at each time t at which a data computation completes, deciding the data set CS_{new,t} stored in the cache space so as to minimize the total completion time of the application from the p_t-th job onward;
(4) solving the cache replacement problem P_1 to obtain a dynamic cache replacement strategy.
Further, in step (2), the hierarchical relationship is: the big data processing application consists of several big data processing jobs executed serially; a job comprises several big data processing stages executed serially or in parallel; and a stage comprises several abstract data sets computed serially or in parallel.
Further, in the step (2), f (S)i,jCS) is calculated by the following method:
Figure BDA0003608069440000021
wherein S isi,jRepresenting the jth phase, x, of the ith job in the applicationi,j,kRepresenting a stage Si,jThe k-th calculated abstract data set, Ni,jRepresenting a stage Si,jNumber of abstract data sets in D (S)i,j) Representing a stage Si,jAt operation SiThe set of dependent phases in, g (S)i,jX, CS) represents the stage Si,jThe execution delay under the cached data set CS is equal to the phase Si,jIn the final calculation of data
Figure BDA0003608069440000022
Time and stage S of completion of calculationi,jThe difference of the starting execution time is calculated as follows:
Figure BDA0003608069440000031
wherein c (x, S)i,j) Is shown in stage Si,jTime required for calculating data x, P (x, S)i,j) Indicating data x at stage Si,jA collection of dependent data.
Further, in step (3), the cache replacement problem P_1 is expressed as:
P_1: min_{CS_{new,t}} Σ_{i=p_t}^{|J|} f(S_{i,M_i}, CS_{new,t})
s.t. 1 ≤ p_t ≤ |J|,
CS_{new,t} ⊆ CS_{old,t} ∪ {x_t},
Σ_{x ∈ CS_{new,t}} s(x) ≤ L,
∀ t ∈ T,
wherein CS_{old,t} denotes the cached data set before x_t is stored in the cache space at time t, CS_{new,t} denotes the cached data set after x_t is stored in the cache space at time t, x_t denotes the abstract data set whose computation completes at time t, T denotes the set of times at which the computations of all abstract data sets in the application complete, p_t denotes the index of the job executing at time t, S_{i,M_i} denotes the final execution stage of job J_i, s(x) denotes the memory occupied by data x, L denotes the total memory upper limit of the cache space, and |J| denotes the number of jobs in the application.
Further, in step (4), solving the cache replacement problem P_1 comprises: simplifying P_1 based on the characteristics of big data processing, and solving the simplified cache replacement problem with a dynamic-programming approach.
Further, simplifying the cache replacement problem P_1 based on the characteristics of big data processing comprises:
replacing all stages of each job in the application with the "job critical path", so that the execution mode of the stages changes from parallel to serial and problem P_1 reduces to problem P_2, wherein the "job critical path" is defined as the stage computation chain with the longest execution time in the job;
replacing the abstract data sets computed in each stage with the "hot-spot access data", so that the data computation mode within a stage changes from parallel to serial and problem P_2 reduces to problem P_3, wherein the "hot-spot access data" is defined as the abstract data sets represented by nodes with out-degree greater than 1 in the directed acyclic graph abstracted from the application;
replacing the "hot-spot access data" in each data processing stage with the execution result of the "stage representative computation", so that problem P_3 reduces to problem P_4, which is equivalent to the 0-1 knapsack problem, wherein the "stage representative computation" is the data processing operator represented by the hot-spot access data computed last in the stage.
Further, solving the simplified cache replacement problem with dynamic programming comprises:
preprocessing based on problem simplification: receiving the job set J, the data CS_{old,t} already in the cache space at time t, and the data x_t to be added to the cache space at time t; computing the "job critical path", analyzing the "hot-spot access data", and counting the "stage representative computations" and their cache gains; and returning the "stage representative data" set x' and the cache gains RRT of the stage representative data;
cache replacement based on dynamic programming: taking the set x' and the cache gains RRT returned by preprocessing, together with the memory upper limit L of the cache space, as input, and traversing each value up to the memory upper limit L and each element of the set x', decomposing the cache replacement problem into subproblems by dynamic programming, thereby obtaining the optimal caching decision CS_{new,t} of problem P_4.
Further, when the memory upper limit L of the cache space and the memory s_x occupied by each data x are natural numbers, the result of the dynamic-programming cache replacement algorithm is an optimal caching decision.
In a second aspect, a computer device comprises one or more processors and a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the steps of the big-data-processing-oriented dynamic cache replacement method according to the first aspect of the invention.
In a third aspect, a computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the big data processing oriented dynamic cache replacement method according to the first aspect of the present invention.
Beneficial effects: a difficulty of the cache replacement problem in big data processing systems is that cache gains are hard to predict because of the parallel execution modes that frequently occur in data processing applications. Addressing this difficulty, the invention mathematically models the multi-stage parallel execution of big data processing jobs and the multi-data parallel computation within big data processing stages, and on this basis defines an operator-level cache replacement problem for multi-stage parallel big data processing applications. The modeled problem is NP-hard, and the invention simplifies it based on the characteristics of the "job critical path", "hot-spot access data", and "stage representative computation". Because the simplified problem has an optimal substructure, the invention provides a solution algorithm based on dynamic programming to determine the dynamic cache replacement strategy. The invention realizes cache replacement that dynamically adapts to the data access pattern of the data processing process, improves memory utilization, and greatly reduces the execution time of big data processing applications.
Drawings
FIG. 1 is a flow chart of a dynamic cache replacement method of the present invention;
FIG. 2 is a directed acyclic graph depicting a big data processing application in a particular embodiment;
FIG. 3 is a simplified diagram of a cache replacement problem in an embodiment;
fig. 4 is a flowchart of an algorithm based on the idea of dynamic programming in an embodiment.
Detailed Description
The technical scheme of the invention is further explained by combining the attached drawings.
The invention addresses the cache replacement problem for big data processing applications. As shown in fig. 1, the invention provides a dynamic cache replacement method for big data processing applications, comprising the following steps:
(1) big data processing applications are abstracted into directed acyclic graphs.
The invention abstracts the big data processing application into a directed acyclic graph G = (V, E) to describe its data processing process. In the directed acyclic graph, each element of the node set V represents data of the application, i.e., an abstract data set. In a big data processing system, an abstract data set is defined as the execution result of a data processing operator and consists of in-memory data blocks distributed over different nodes of the cluster. Any element v of V (an abstract data set) has two attributes: the memory space s_v the data occupies and the time c_v required to compute it. Each edge ⟨u, v⟩ of the edge set E represents a computational dependency between abstract data sets: the edge ⟨u, v⟩ means that the computation of data v depends on data u. In fact, a dependency between data in a big data processing application has a concrete meaning, namely a specific operator execution; the edge set E does not record this, because the attribute c_v of each node v, the time required to compute v, already accounts for the operator's execution.
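As an illustrative sketch (the class and field names below are assumptions chosen for readability, not part of the patent text), the abstraction can be expressed in Python roughly as follows:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Data:
    name: str
    s: float                                          # s_v: memory the data occupies
    c: float                                          # c_v: time required to compute it
    deps: List["Data"] = field(default_factory=list)  # edges <u, v>: v depends on each u

# Example: w depends on u and v, i.e. edges <u, w> and <v, w>
u = Data("u", s=2.0, c=1.0)
v = Data("v", s=1.0, c=3.0)
w = Data("w", s=4.0, c=2.0, deps=[u, v])
```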
(2) Analyze the parallel execution of jobs in the big data processing application from the information contained in the directed acyclic graph.
The directed acyclic graph contains hierarchical information consisting of the big data processing application, big data processing jobs, big data processing stages, and abstract data sets. Specifically: the application consists of several jobs executed serially; a job comprises several stages executed serially or in parallel; and a stage comprises several abstract data sets computed serially or in parallel. From the node set V, the edge set E, and this hierarchical information, the execution delay of the i-th executed job J_i in the application can be obtained, which further yields the mathematical model of the optimization objective of the cache replacement problem. Since a data processing job consists of stages and a stage consists of abstract data sets (hereinafter simply "data"), the execution delay is described below in the hierarchical order of data, stages, and jobs, using the notation of Table 1.
Symbol | Definition
J_i | the i-th executed job in the application
M_i | the number of stages contained in job J_i
S_{i,j} | the j-th stage of the i-th job in the application
N_{i,j} | the number of data contained in stage S_{i,j}
x_{i,j,k} | the k-th computed data of stage S_{i,j}
D(S_{i,j}) | the set of stages on which S_{i,j} depends within job J_i
c(x, S_{i,j}) | the time required to compute data x in stage S_{i,j}
P(x, S_{i,j}) | the set of data on which x depends in stage S_{i,j}
r(x, S_{i,j}) | the cache gain of data x in stage S_{i,j}
TABLE 1. Notation of the abstract directed acyclic graph of a big data processing job
In the directed acyclic graph abstracted from the big data processing application, the function c(x, S_{i,j}) denotes the time required to compute data x in stage S_{i,j} (equivalent to c_x above), and the cache gain of data x in stage S_{i,j} (the time saved in stage S_{i,j} by caching data x) is denoted r(x, S_{i,j}). The function P(x, S_{i,j}) is defined as the set of data on which x depends in stage S_{i,j}. The relationship between c(x, S_{i,j}) and r(x, S_{i,j}) is:
r(x, S_{i,j}) = c(x, S_{i,j}) + max_{y ∈ P(x, S_{i,j})} r(y, S_{i,j}),
i.e., the cache gain of x equals the critical-path cost of recomputing x from its dependencies (the maximum over an empty P(x, S_{i,j}) being taken as 0).
in each phase of a large data processing application, there is one piece of data that is not relied upon by any other data in that phase, which is referred to as the final computed data for that phase. On this basis, the execution latency of a stage may be associated with the final computed data for that stage. Assume that the set of cached data in the cache space is CS and the variable xi,j,kRepresents the stage Si,jThe k-th calculated data of (1), variable Ni,jIs a stage Si,jThe amount of data contained, then stage Si,jThe execution delay under the cached data set CS is equal to the phase Si,jTo finally calculate data
Figure BDA0003608069440000063
Time and stage S of completion of calculationi,jThe difference of the starting execution time is formally expressed as
Figure BDA0003608069440000064
Wherein the recursive function g (S)i,jX, CS) is used to model the phenomenon of multiple data parallel computing within a phase, which is formally described as follows:
Figure BDA0003608069440000071
in the operation of large data processing application, when the execution of one stage depends on two or more other stages, the parallel execution phenomenon of the stages will occur, and the operation execution delay modeling in the invention mainly focuses on the parallel execution phenomenon of the stages. Given that there is a phase in each data processing job that is not relied upon by any other phase in the job, referred to as the final execution phase of the job, we associate the execution latency of the job with the final execution phase of the job. Assume that a set of cached data in the cache space is CS and the variable JiRepresenting the ith executed job in a big data processing application, Job JiThe number of stages contained is MiFunction D (S)i,j) Representing a stage Si,jAt operation JiThe set of dependent phases in (1). Then operation JiExecution latency under the cached data set CS equals Job JiMiddle final execution stage
Figure BDA0003608069440000072
Time of completion of execution and job JiThe difference of the starting execution time is formally expressed as
Figure BDA0003608069440000073
Wherein the recursive function f (S)i,jCS) is used to model a multi-stage parallel execution phenomenon within the industry, in the form ofThe description is as follows:
Figure BDA0003608069440000074
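A minimal runnable sketch of the two recursions, under assumed structures (the names are illustrative, not from the patent):

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Data:
    name: str
    c: float                                           # c(x, S_ij): time to compute x
    deps: List["Data"] = field(default_factory=list)   # P(x, S_ij)

@dataclass
class Stage:
    final: Data                                        # x_{i,j,N_ij}: final computed data
    deps: List["Stage"] = field(default_factory=list)  # D(S_ij): stages it depends on

def g(x: Data, cs: Set[str]) -> float:
    # Cached data costs nothing; otherwise pay c_x after the slowest
    # (parallel) dependency finishes.
    if x.name in cs:
        return 0.0
    return x.c + max((g(y, cs) for y in x.deps), default=0.0)

def f(stage: Stage, cs: Set[str]) -> float:
    # A stage starts after the slowest stage it depends on completes.
    return g(stage.final, cs) + max((f(s, cs) for s in stage.deps), default=0.0)

# Example: caching "b" removes its whole recomputation chain.
a = Data("a", 1.0); b = Data("b", 3.0, [a]); out = Data("out", 2.0, [b])
s1 = Stage(final=b); s2 = Stage(final=out, deps=[s1])
print(f(s2, set()), f(s2, {"b"}))   # 10.0 2.0
```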
as shown in FIG. 2, the present invention abstracts big data processing applications into a directed acyclic graph. The directed acyclic graph contains hierarchical information consisting of applications, jobs, phases, and data. In addition, the nodes in the directed acyclic graph represent data in a big data processing application, and the edges represent the dependency of the data. Based on the directed acyclic graph, the operation time delay under different cache states can be obtained, and the method is used for modeling the cache replacement problem facing the big data processing application. When the application executes, the big data processing framework stores intermediate data generated during the operation into a cache space. Large data processing frameworks suffer from a trade-off problem when performing cache acceleration with limited memory, namely the cache replacement problem. For example, when the data 27 of stage 4 in fig. 2 is calculated and needs to be buffered, if the buffer space is about to be exhausted, it is necessary to decide the data that should be stored in the buffer space.
(3) A mathematical model is established for the cache replacement problem.
The mathematical model of the cache replacement problem for big data processing frameworks has the following characteristics. At the time t at which any data x_t finishes computing, the decision variable CS_{new,t} represents the data that should be placed in the cache space. Let CS_{old,t} denote the data already in the cache space before x_t is considered, let T denote the set of times at which the computations of all data in the application complete, let p_t denote the index of the job executing at time t, let |J| denote the number of jobs of the application, let the function s(x) (equivalent to the variable s_x in step (1)) denote the memory space occupied by data x, and let L denote the memory upper limit of the cache space. The operator-level cache replacement problem P_1 for big data processing applications under limited memory can then be expressed as: at each time t at which a data computation completes, decide the data set CS_{new,t} stored in the cache space so as to minimize the total completion time of the application from the p_t-th job onward. Using the notation of Table 2, problem P_1 is formally described as:
P_1: min_{CS_{new,t}} Σ_{i=p_t}^{|J|} f(S_{i,M_i}, CS_{new,t})
s.t. 1 ≤ p_t ≤ |J|,
CS_{new,t} ⊆ CS_{old,t} ∪ {x_t},
Σ_{x ∈ CS_{new,t}} s(x) ≤ L,
∀ t ∈ T.
Symbol | Definition
x_t | the data whose computation completes at time t
CS_{old,t} | the cached data set before x_t is stored in the cache space at time t
CS_{new,t} | the cached data set after x_t is stored in the cache space at time t
T | the set of times at which the computations of all data in the application complete
p_t | the index of the job executing at time t
|J| | the number of jobs in the application
s(x) | the memory space occupied by data x
L | the memory upper limit of the cache space
TABLE 2. Notation of the cache replacement problem P_1 for big data processing applications
The constraints have the following characteristics. First, when considering the cache replacement problem for a particular big data processing application, the optimization objective only needs to minimize the overall completion time of the p_t-th and subsequently executed jobs, and the value of p_t does not exceed the number of data processing jobs in the application; the corresponding constraint is 1 ≤ p_t ≤ |J|. Second, the decision variable CS_{new,t} is a subset of the set CS_{old,t} ∪ {x_t}; the corresponding constraint is CS_{new,t} ⊆ CS_{old,t} ∪ {x_t}. Third, the cached data set in the caching decision must not exceed the memory capacity of the cache space; the corresponding constraint is Σ_{x ∈ CS_{new,t}} s(x) ≤ L.
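For intuition, the per-instant decision in P_1 amounts to the exhaustive search sketched below. This is a toy baseline for tiny instances only, not the algorithm of the invention, since P_1 is NP-hard (as established in step (4) below); remaining_time stands in for the objective Σ_{i=p_t}^{|J|} f(S_{i,M_i}, CS):

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Iterable

def best_cache_set(candidates: Iterable[str],
                   size: Dict[str, int],
                   L: int,
                   remaining_time: Callable[[FrozenSet[str]], float]) -> FrozenSet[str]:
    """Enumerate every subset of CS_old ∪ {x_t} that fits in L and keep the
    one minimizing the remaining completion time of jobs p_t..|J|."""
    cand = list(candidates)
    best, best_t = frozenset(), remaining_time(frozenset())
    for r in range(1, len(cand) + 1):
        for subset in combinations(cand, r):
            if sum(size[x] for x in subset) <= L:
                t = remaining_time(frozenset(subset))
                if t < best_t:
                    best, best_t = frozenset(subset), t
    return best

# Toy usage: three candidates, memory cap 4, a stand-in objective.
size = {"a": 3, "b": 2, "c": 2}
base = {"a": 5.0, "b": 4.0, "c": 1.0}
rt = lambda cs: 20.0 - sum(base[x] for x in cs)
print(best_cache_set(["a", "b", "c"], size, 4, rt))  # frozenset({'a'})
```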
(4) the cache replacement problem is simplified based on big data handling features.
Assume that in a big data processing application each job contains exactly one stage and each stage contains exactly one datum; this special case of problem P_1 is denoted P_1*. It can be observed that P_1* is equivalent to the 0-1 knapsack problem, so the 0-1 knapsack problem reduces to P_1, and P_1 is therefore NP-hard. The invention accordingly simplifies problem P_1 based on the characteristics of the "job critical path", "hot-spot access data", and "stage representative computation" in the data processing process, making it easier to solve. The details are as follows:
A. The "job critical path" of a big data processing application is defined as the stage computation chain with the longest execution time in the job. The delay of the "job critical path" approximates the overall delay of the job, and the stages on the critical path all execute serially. Thus the stage set {S_{i,1}, ..., S_{i,M_i}} of a big data processing job is replaced by the "job critical path" {S'_{i,1}, ..., S'_{i,M'_i}}. The recursive function f(S_{i,j}, CS) describing multi-stage parallelism can then be removed from problem P_1, and problem P_1 reduces to problem P_2:
P_2: min_{CS_{new,t}} Σ_{i=p_t}^{|J|} Σ_{j=1}^{M'_i} g(S'_{i,j}, x'_{i,j,N'_{i,j}}, CS_{new,t})
s.t. 1 ≤ p_t ≤ |J|,
CS_{new,t} ⊆ CS_{old,t} ∪ {x_t},
Σ_{x ∈ CS_{new,t}} s(x) ≤ L,
∀ t ∈ T,
wherein the variables M'_i, S'_{i,j}, N'_{i,j}, and x'_{i,j,k} denote the concepts related to the "job critical path" (detailed definitions in Table 3), and the function g(S_{i,j}, x, CS) is defined in step (2).
Symbol | Definition
M'_i | the number of stages contained in the critical path of job J_i
N'_{i,j} | the number of data contained in stage S'_{i,j}
S'_{i,j} | the j-th stage in the "critical path" of the i-th job in the application
x'_{i,j,k} | the k-th computed data of stage S'_{i,j}
TABLE 3. Notation of problem P_2
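A sketch of step A as a memoized longest-path search over the stage dependency graph (the function and parameter names are assumptions):

```python
from functools import lru_cache
from typing import Dict, List, Tuple

def job_critical_path(final_stage: str,
                      delay: Dict[str, float],
                      deps: Dict[str, List[str]]) -> Tuple[float, Tuple[str, ...]]:
    """Longest-delay chain of stages ending at the job's final stage."""
    @lru_cache(maxsize=None)
    def longest(s: str) -> Tuple[float, Tuple[str, ...]]:
        preds = deps.get(s, [])
        if not preds:
            return delay[s], (s,)
        t, chain = max(longest(p) for p in preds)
        return t + delay[s], chain + (s,)
    return longest(final_stage)

# Example: stage "s3" depends on "s1" and "s2"; the slower branch wins.
delay = {"s1": 4.0, "s2": 1.0, "s3": 2.0}
deps = {"s3": ["s1", "s2"]}
print(job_critical_path("s3", delay, deps))  # (6.0, ('s1', 's3'))
```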
B. The "hot-spot access data" of a big data processing application is defined as the data represented by nodes with out-degree greater than 1 in the directed acyclic graph abstracted from the application. The cache gain of the "hot-spot access data" approximates the cache gain of all data, and the "hot-spot access data" can be treated as approximately serially computed. Therefore, after the data computed in each stage are replaced by the "hot-spot access data", the data computation mode within a stage changes from parallel to approximately serial, and the recursive function g(S'_{i,j}, x, CS) describing multi-data parallel computation can be removed from problem P_2. To ease the formalization, the invention equivalently replaces "minimizing the overall completion time of the jobs" in the optimization objective with "maximizing the time saved by caching data". On this basis, problem P_2 reduces to problem P_3:
P_3: max_{CS_{new,t}} Σ_{i=p_t}^{|J|} Σ_{j=1}^{M'_i} Σ_{k ∈ μ(i,j,CS_{new,t})} r(x'_{i,j,k}, S'_{i,j})
s.t. 1 ≤ p_t ≤ |J|,
CS_{new,t} ⊆ CS_{old,t} ∪ {x_t},
Σ_{x ∈ CS_{new,t}} s(x) ≤ L,
∀ t ∈ T.
The set of "hot-spot access data" in the application is defined as HD = {v ∈ V : outdegree(v) > 1}. The functions μ(i, j, CS_{new,t}) and φ(i, j, CS_{new,t}) are defined as the sets formed by the indices of the "hot-spot access data" of stage S'_{i,j} that are, respectively, inside and outside the cached set, formally:
μ(i, j, CS) = {k : x'_{i,j,k} ∈ HD and x'_{i,j,k} ∈ CS},
φ(i, j, CS) = {k : x'_{i,j,k} ∈ HD and x'_{i,j,k} ∉ CS}.
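Identifying the "hot-spot access data" reduces to an out-degree count over the DAG edges; a minimal sketch:

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple

def hotspot_data(edges: Iterable[Tuple[str, str]]) -> Set[str]:
    """HD: data whose node out-degree in the application DAG exceeds 1,
    i.e. data consumed by more than one downstream computation."""
    out_degree = defaultdict(int)
    for u, _v in edges:
        out_degree[u] += 1
    return {u for u, d in out_degree.items() if d > 1}

# Example: "a" feeds both "b" and "c", so only "a" is hot-spot data.
print(hotspot_data([("a", "b"), ("a", "c"), ("b", "c")]))  # {'a'}
```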
C. Since big data processing frameworks feature "lazy computation", the operator represented by the hot-spot access data computed last in a data processing stage is called the "stage representative computation". It is observed that the "stage representative computation" can approximately replace the overall computation of the stage. Therefore, after the "hot-spot access data" in each data processing stage are replaced by the execution result of the "stage representative computation", namely the "stage representative data", problem P_3 reduces to problem P_4:
P_4: max_{CS_{new,t}} Σ_{i=p_t}^{|J|} Σ_{j=1}^{M'_i} η(x'_{i,j,k*}, CS_{new,t}) · r(x'_{i,j,k*}, S'_{i,j})
s.t. CS_{new,t} ⊆ CS_{old,t} ∪ {x_t},
Σ_{x ∈ CS_{new,t}} s(x) ≤ L,
k* = max(μ(i, j, CS_{old,t} ∪ {x_t})).
In problem P_4, the variable k* denotes the index of the representative data of each stage, i.e., of the "stage representative computation"; k* is a fixed value for each given stage. The function η(x, CS) indicates whether data x is an element of the set CS, 1 meaning yes and 0 meaning no.
D. Based on the characteristics of the "job critical path", "hot-spot access data", and "stage representative computation", the operator-level cache replacement problem P_1 for multi-stage parallel big data processing applications can thus be simplified to problem P_4. On this basis, suppose that at each time t the number of data in the set CS_{old,t} ∪ {x_t} is n, that the k-th computed datum of CS_{old,t} ∪ {x_t} is x'_k, and that the decision variable z_k ∈ {0, 1} determines whether data x'_k is cached (1 means cached, 0 means not cached). Problem P_4 can then be equivalently converted into the following problem:
max_z Σ_{k=1}^{n} z_k · RRT_k
s.t. Σ_{k=1}^{n} z_k · s(x'_k) ≤ L,
k* = max(μ(i, j, CS_{old,t} ∪ {x_t})),
wherein RRT_k is the time saved by caching data x'_k (RRT_k = 0 when x'_k is not a "stage representative computation"). In the 0-1 knapsack problem, suppose the total number of items is n and the weight and value of item j are w_j and p_j respectively. It can be observed that the decision variables z_j ∈ {0, 1} of the 0-1 knapsack problem, the item weights w_j, and the item values p_j correspond respectively to the decision variables z_k of problem P_4, the memory s(x'_k) occupied by data x'_k, and the time RRT_k saved by caching data x'_k. Problem P_4 is therefore equivalent to the 0-1 knapsack problem.
Fig. 3 shows the specific contents of the "job critical path", "hot-spot access data", and "stage representative computation". On this basis it can be observed that a cache replacement problem considering only the data relevant to the "stage representative computation", the "stage representative data", can approximately replace the cache replacement problem considering all data, which greatly reduces the complexity of the problem.
(5) An optimal algorithm is designed based on dynamic programming.
Based on the optimal-substructure property of the simplified problem P_4, a cache replacement algorithm based on dynamic programming is designed. Fig. 4 shows the flow of the algorithm. After a preprocessing step based on problem simplification, the dynamic-programming cache replacement module receives the "stage representative data" set x' and the cache gains RRT of the stage representative data, together with the memory upper limit L of the cache space, as input. Traversing each value up to the memory upper limit L and each element of the set x', the module decomposes the cache replacement problem into subproblems by dynamic programming, obtaining the optimal caching decision CS_{new,t} of problem P_4. Specifically, the algorithm comprises the following modules:
(5.1) Preprocessing module based on problem simplification: the module receives the job set J, the data CS_{old,t} already in the cache space at time t, and the data x_t to be added to the cache space at time t. By computing the "job critical path", analyzing the "hot-spot access data", and counting the "stage representative computations" and their cache gains, it returns the "stage representative data" set x' and the cache gains RRT of the stage representative data, which serve as the input of the dynamic-programming cache replacement module. The steps are as follows:
A. Receive the inputs J, CS_{old,t}, and x_t
B. Initialize the outputs, the "stage representative data" and their cache gains: x' and RRT
C. Compute the "job critical path" CP of J by a longest-path algorithm
D. Collect the "hot-spot access data" HD from the directed acyclic graph expressed by J
E. Determine the set J_t of jobs in J not yet completed at time t
F. For each unexecuted stage S'_{i,j} of each job J_i in J_t, perform the following operations:
a) Collect the topological order TP of all data in stage S'_{i,j}
b) Obtain the last element x_u of TP ∩ (CS_{old,t} ∪ {x_t}) ∩ HD
c) Update x': x' ← x' ∪ {x_u}
d) Update RRT: RRT_u ← RRT_u + the gain of caching data x_u in stage S'_{i,j}
G. Output the "stage representative data" and their cache gains: x' and RRT
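A hedged sketch of steps F(a)-(d) above, assuming the per-stage topological orders, the candidate set CS_{old,t} ∪ {x_t}, the hot-spot set HD, and a per-stage gain function are given (the names are assumptions):

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

def stage_representatives(stage_orders: Dict[str, List[str]],
                          candidates: Set[str],
                          hd: Set[str],
                          gain: Callable[[str, str], float]
                          ) -> Tuple[Set[str], Dict[str, float]]:
    """For each unexecuted stage, walk its topological data order TP, pick
    the last element that is both a caching candidate and hot-spot data
    (steps F(a)-(b)), and accumulate its cache gain (steps F(c)-(d))."""
    reps: Set[str] = set()
    rrt: Dict[str, float] = defaultdict(float)
    for stage, tp in stage_orders.items():
        eligible = [x for x in tp if x in candidates and x in hd]
        if not eligible:
            continue
        x_u = eligible[-1]            # last element of TP ∩ (CS_old ∪ {x_t}) ∩ HD
        reps.add(x_u)                 # x' <- x' ∪ {x_u}
        rrt[x_u] += gain(x_u, stage)  # RRT_u <- RRT_u + gain in stage S'_ij
    return reps, dict(rrt)
```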
(5.2) Cache replacement module based on dynamic programming: the module receives the output of the preprocessing module, namely the "stage representative data" set x' and the cache gains RRT of the stage representative data, together with the memory upper limit L of the cache space, as its input. Traversing each value up to the memory upper limit L and each element of the set x', the module decomposes the cache replacement problem into subproblems by dynamic programming, obtaining the optimal caching decision CS_{new,t} of problem P_4. The specific steps are as follows:
A. Receive the inputs: the "stage representative data" x', the cache gains RRT, and the memory upper limit L of the cache space
B. Initialize the dynamic programming array dp and the set C of optimal subproblem results
C. Loop over i = 1 to |x'| and j = 1 to L, performing the following operations:
a) If s(x'_i) > j, perform: dp_{i,j}, C_{i,j} ← dp_{i-1,j}, C_{i-1,j}
b) Otherwise, perform: dp_{i,j} ← max(dp_{i-1,j}, dp_{i-1,j-s(x'_i)} + RRT_i), with C_{i,j} ← C_{i-1,j} if the first term attains the maximum and C_{i,j} ← C_{i-1,j-s(x'_i)} ∪ {x'_i} otherwise
D. Output the optimal caching decision: C_{|x'|,L}
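A sketch of this module as a textbook 0-1 knapsack over the stage representative data, assuming integral sizes (cf. the natural-number condition stated above):

```python
from typing import Dict, FrozenSet, List

def dp_cache_decision(items: List[str],
                      size: Dict[str, int],
                      rrt: Dict[str, float],
                      L: int) -> FrozenSet[str]:
    """Maximize total cache gain RRT under the memory cap L and return
    the optimal set CS_new of representatives to keep cached."""
    n = len(items)
    dp = [[0.0] * (L + 1) for _ in range(n + 1)]
    keep: List[List[FrozenSet[str]]] = [[frozenset()] * (L + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        x = items[i - 1]
        for j in range(L + 1):
            dp[i][j], keep[i][j] = dp[i - 1][j], keep[i - 1][j]      # skip x
            if size[x] <= j and dp[i - 1][j - size[x]] + rrt[x] > dp[i][j]:
                dp[i][j] = dp[i - 1][j - size[x]] + rrt[x]           # cache x
                keep[i][j] = keep[i - 1][j - size[x]] | {x}
    return keep[n][L]

# Example: two representatives competing for 4 units of memory.
print(dp_cache_decision(["a", "b"], {"a": 3, "b": 2}, {"a": 5.0, "b": 4.0}, 4))
# frozenset({'a'})
```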
The computational complexity of the algorithm is jointly determined by the preprocessing module based on problem simplification and the cache replacement module based on dynamic programming. The two steps of computing the "job critical path" and analyzing the "hot-spot access data" in the preprocessing module need to be executed only once, before the big data processing application starts, so their influence is small. The key step of the preprocessing module is therefore counting the "stage representative computations" and their cache gains, whose computational complexity is O(|V|^2). The computational complexity of the dynamic-programming cache replacement module is determined by the search space of the dynamic programming algorithm, which is directly related to the scale of the "hot-spot access data" and to the memory upper limit of the cache space, and is O(|V|^2 × L). The overall computational complexity of the dynamic cache replacement algorithm for big data processing applications is therefore O(|V|^2 × L).
Because memory resources are limited, using them for cache acceleration in big data processing applications involves a trade-off. The invention provides a dynamic cache replacement method for multi-stage parallel big data processing applications. To overcome the difficulty that the parallel computation in data processing applications is hard to describe formally, the invention establishes a mathematical model of the caching process of multi-stage parallel data processing applications and proves that the operator-level cache replacement problem is NP-hard. In view of the complexity of the problem, its difficulty is reduced, based on observation of the real caching process of big data processing systems, using the characteristics of the "job critical path", "hot-spot access data", and "stage representative computation", making the problem easier to solve. Because the simplified problem has an optimal substructure, a dynamic cache replacement algorithm based on dynamic programming is designed. The invention fills the gap in cache replacement for parallel computation under big data processing frameworks, realizes cache replacement that dynamically adapts to the data access pattern of the data processing process, improves memory utilization, and greatly reduces the execution time of big data processing applications.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing associated hardware, and the program may be stored in a computer-readable storage medium. In the context of the present invention, the computer-readable medium may be considered tangible and non-transitory. Non-limiting examples of a non-transitory tangible computer-readable medium include a non-volatile memory circuit (e.g., a flash memory circuit, an erasable programmable read-only memory circuit, or a mask read-only memory circuit), a volatile memory circuit (e.g., a static random access memory circuit or a dynamic random access memory circuit), a magnetic storage medium (e.g., an analog or digital tape or hard drive), and an optical storage medium (e.g., a CD, DVD, or blu-ray disc), among others.
Program code for implementing the methods of the present invention may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
Further, while operations are depicted in a particular order, this should be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the invention. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the details of those embodiments; various equivalent modifications can be made within the technical spirit of the present invention, and such equivalent modifications all fall within the protection scope of the present invention.

Claims (10)

1. A dynamic cache replacement method for big data processing applications, characterized by comprising the following steps:
(1) abstracting the big data processing application into a directed acyclic graph G = (V, E), wherein each node of the node set V represents data of the application, and any element v in V has two attributes: the memory space s_v the data occupies and the time c_v required to compute it; each edge ⟨u, v⟩ of the edge set E indicates that the computation of data v in the application depends on data u;
(2) based on the directed acyclic graph G = (V, E) and the hierarchical relationship among the application, jobs, stages, and data, obtaining the execution delay of stage S_{i,j} in the i-th executed job J_i of the application, expressed as f(S_{i,j}, CS): the difference between the time at which stage S_{i,j} of job J_i finishes computing under the cached data set CS and the time at which job J_i starts executing;
(3) establishing a mathematical model of the cache replacement problem P_1, whose objective is to minimize the overall execution time of the application: at each time t at which a data computation completes, deciding the data set CS_{new,t} stored in the cache space so as to minimize the total completion time of the application from the p_t-th job onward;
(4) solving the cache replacement problem P_1 to obtain a dynamic cache replacement strategy.
2. The dynamic cache replacement method according to claim 1, wherein in step (2) the hierarchical relationship is: the big data processing application consists of several big data processing jobs executed serially; a job comprises several big data processing stages executed serially or in parallel; and a stage comprises several abstract data sets computed serially or in parallel.
3. The dynamic cache replacement method according to claim 1, wherein in step (2), f(S_{i,j}, CS) is calculated as:
f(S_{i,j}, CS) = g(S_{i,j}, x_{i,j,N_{i,j}}, CS) + max_{S' ∈ D(S_{i,j})} f(S', CS),
wherein S_{i,j} denotes the j-th stage of the i-th job in the application, x_{i,j,k} denotes the k-th computed abstract data set of stage S_{i,j}, N_{i,j} denotes the number of abstract data sets in stage S_{i,j}, D(S_{i,j}) denotes the set of stages on which S_{i,j} depends within job J_i (the maximum over an empty set being taken as 0), and g(S_{i,j}, x, CS) denotes the execution delay of stage S_{i,j} under the cached data set CS, equal to the difference between the time at which the final computed data x_{i,j,N_{i,j}} of stage S_{i,j} finishes computing and the time at which stage S_{i,j} starts executing, calculated as:
g(S_{i,j}, x, CS) = 0, if x ∈ CS;
g(S_{i,j}, x, CS) = c(x, S_{i,j}) + max_{y ∈ P(x, S_{i,j})} g(S_{i,j}, y, CS), otherwise,
wherein c(x, S_{i,j}) denotes the time required to compute data x in stage S_{i,j} and P(x, S_{i,j}) denotes the set of data on which x depends in stage S_{i,j}.
4. The dynamic cache replacement method according to claim 1, wherein in step (3) the cache replacement problem P_1 is expressed as:
P_1: min_{CS_{new,t}} Σ_{i=p_t}^{|J|} f(S_{i,M_i}, CS_{new,t})
s.t. 1 ≤ p_t ≤ |J|,
CS_{new,t} ⊆ CS_{old,t} ∪ {x_t},
Σ_{x ∈ CS_{new,t}} s(x) ≤ L,
∀ t ∈ T,
wherein CS_{old,t} denotes the cached data set before x_t is stored in the cache space at time t, CS_{new,t} denotes the cached data set after x_t is stored in the cache space at time t, x_t denotes the data whose computation completes at time t, T denotes the set of times at which the computations of all data in the application complete, p_t denotes the index of the job executing at time t, S_{i,M_i} denotes the final execution stage of job J_i, s(x) denotes the memory occupied by data x, L denotes the total memory upper limit of the cache space, and |J| denotes the number of jobs in the application.
5. The dynamic cache replacement method according to claim 1, wherein in step (4) solving the cache replacement problem P_1 comprises: simplifying P_1 based on the characteristics of big data processing, and solving the simplified cache replacement problem with a dynamic-programming approach.
6. The dynamic cache replacement method according to claim 5, wherein simplifying the cache replacement problem P_1 based on the characteristics of big data processing comprises:
replacing all stages of each job in the application with the "job critical path", so that the execution mode of the stages changes from parallel to serial and problem P_1 reduces to problem P_2, wherein the "job critical path" is defined as the stage computation chain with the longest execution time in the job;
replacing the abstract data sets computed in each stage with the "hot-spot access data", so that the data computation mode within a stage changes from parallel to serial and problem P_2 reduces to problem P_3, wherein the "hot-spot access data" is defined as the abstract data sets represented by nodes with out-degree greater than 1 in the directed acyclic graph abstracted from the application;
replacing the "hot-spot access data" in each data processing stage with the execution result of the "stage representative computation", so that problem P_3 reduces to problem P_4, which is equivalent to the 0-1 knapsack problem, wherein the "stage representative computation" is the data processing operator represented by the data computed last in the data processing stage.
7. The dynamic cache replacement method according to claim 6, wherein solving the simplified cache replacement problem based on dynamic programming comprises:
preprocessing based on problem simplification: receiving the job set J, the data CS_{old,t} already in the cache space at time t, and the data x_t to be added to the cache space at time t; computing the "job critical path", analyzing the "hot-spot access data", and counting the "stage representative computations" and their cache gains; and returning the "stage representative data" set x' and the cache gains RRT of the stage representative data;
cache replacement based on dynamic programming: taking the set x' and the cache gains RRT returned by preprocessing, together with the memory upper limit L of the cache space, as input, and traversing each value up to the memory upper limit L and each element of the set x', decomposing the cache replacement problem into subproblems by dynamic programming, thereby obtaining the optimal caching decision CS_{new,t} of problem P_4.
8. The dynamic cache replacement method according to claim 7, wherein when the memory upper limit L of the cache space and the memory s_x occupied by each data x are natural numbers, the result of the dynamic-programming cache replacement algorithm is an optimal caching decision.
9. A computer device, comprising:
one or more processors;
a memory;
and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs when executed implementing the steps of the method according to any one of claims 1-8.
10. A computer-readable storage medium, on which one or more computer programs are stored, which when executed by a processor implement the steps of the method according to any of claims 1-8.
CN202210424807.2A 2022-04-21 2022-04-21 Dynamic cache replacement method and device for big data processing Pending CN114691302A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210424807.2A CN114691302A (en) 2022-04-21 2022-04-21 Dynamic cache replacement method and device for big data processing

Publications (1)

Publication Number Publication Date
CN114691302A 2022-07-01

Family

ID=82144269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210424807.2A Pending CN114691302A (en) 2022-04-21 2022-04-21 Dynamic cache replacement method and device for big data processing

Country Status (1)

Country Link
CN (1) CN114691302A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115221460A (en) * 2022-09-20 2022-10-21 浙江保融科技股份有限公司 Method for solving ordered knapsack problem segmentation dynamic planning under limited resources
CN115221460B (en) * 2022-09-20 2023-01-06 浙江保融科技股份有限公司 Method for solving ordered knapsack problem segmentation dynamic programming under limited resources

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination