CN110941494A - Deep learning-oriented GPU parallel computing data processing method - Google Patents

Deep learning-oriented GPU parallel computing data processing method

Info

Publication number
CN110941494A
Authority
CN
China
Prior art keywords
graph
calculation
tensor
time
operations
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911210933.2A
Other languages
Chinese (zh)
Inventor
吴艳霞
任宁
李晓松
张硕
王旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN201911210933.2A priority Critical patent/CN110941494A/en
Publication of CN110941494A publication Critical patent/CN110941494A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a deep learning-oriented GPU parallel computing data processing method. First, the input data are modeled as a computation graph: (1) operation rules for the vertices and edges of the directed graph are constructed; (2) a topological ordering defines the execution order of the operations in the graph; (3) parameters are updated by training the model. Next, the tensor life cycle is introduced, and the computation graph is then rewritten based on data-operation cost to obtain an optimal operation strategy, mainly comprising the following steps: first, the cost-based computation graph is modeled and the operation function on the CPU is redefined; then, the swap-out operations of the same tensor are fused into a single swap-out operation; finally, a traversal order is obtained with a tensor swap-back strategy based on computation and transmission cost. A computation graph modeling method based on formalized rules is thus constructed. Finally, the invention combines an extensible neural network with the computation graph, which can improve the training speed of the model and effectively improve the image processing effect.

Description

Deep learning-oriented GPU parallel computing data processing method
Technical Field
The invention relates to a data processing method for GPU parallel computing, and in particular to a computation graph modeling method based on formalized rules.
Background
With the rapid development of artificial intelligence, deep neural networks have penetrated many fields of scientific research, and neural network models have grown increasingly complex, which is why GPUs are applied to deep learning. Compared with CPUs, GPUs excel at accelerating matrix computation; for example, the face recognition accuracy of the FaceNet network model developed by Google reaches 99.63%, and Optasia, developed by Microsoft, shows high accuracy and performance for relational queries over large-city traffic cameras. Complex network models, however, pose significant challenges to the GPU processor: with large samples, large-scale parameters, and deep structures, training becomes slower. The invention therefore provides a computation graph modeling technique based on formalized rules. Unlike research on data exchange between GPU memory and external memory, the method provides, on the basis of this technique, a fusible data processing method: input data of different structures are preprocessed according to the model into data of the same size before model operation, which improves both the graph processing speed and the training speed of the neural network model.
The computation graph is a dynamic graph built from different input data structures. In general, most deep learning projects require preprocessing of the model's training data. In this process, input data of various sizes and structures are cut into data of the same dimension and size and pushed onto a stack, after which the batch-processing flow of the model training stage proceeds. For example, in many fields, including natural language processing (parse trees) and chemical informatics (molecular graphs), the inputs of the model training process are parse trees and molecular graphs; model parsing can complete neural network training computed on the graph structure, which is an optimal way to solve graph problems. A minimal preprocessing sketch follows.
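As an illustration of the preprocessing step just described, the following sketch pads variable-sized 2-D inputs to a common shape and stacks them into one batch. It assumes NumPy arrays; the names `preprocess` and `target_shape` are illustrative, not taken from the patent.

```python
import numpy as np

def preprocess(samples, target_shape):
    """Crop or zero-pad each 2-D sample to target_shape, then stack
    the results into a single batch tensor (hypothetical helper)."""
    h, w = target_shape
    batch = []
    for s in samples:
        s = s[:h, :w]                                    # crop if too large
        pad = ((0, h - s.shape[0]), (0, w - s.shape[1]))
        batch.append(np.pad(s, pad))                     # zero-pad if too small
    return np.stack(batch)                               # shape: (N, h, w)

batch = preprocess([np.ones((3, 4)), np.ones((5, 2))], (4, 4))
assert batch.shape == (2, 4, 4)
```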
Disclosure of Invention
The invention aims to provide a deep learning-oriented GPU parallel computing data processing method that can improve data-stream transmission between the CPU and the GPU and enhance the generality of the data transmission model.
The purpose of the invention is realized as follows:
step one, modeling a calculation graph;
secondly, introducing a tensor life cycle on the basis of modeling of the calculation graph;
step three, introducing a cost-based data flow operation method, and rewriting a calculation graph;
and step four, introducing an optimization strategy based on calculation and transmission cost to obtain accurate data transmission.
The present invention may further comprise:
1. the first step specifically comprises the following steps:
(1) computation graph
Constructing a directed graph G = (V, E, λ, τ) for the input data, where V is the set of vertices in G, E ⊆ V × V is the set of edges in G, λ: V → (O, Bool) maps each vertex to an operation o ∈ O together with a Boolean value indicating whether the operation is parameterized, and τ: E → (D, ACT) maps each edge to a data type D and an action in ACT, where ACT = {"read", "update", "control"};
(2) topological ordering
Given a computation graph G = (V, E, λ, τ), let N be the number of vertices in the graph. A topological ordering is a vertex-to-integer mapping γ: V → {0, 1, ..., N-1} satisfying the following requirements:
· ∀v ∈ V with λ(v) = (variable, _): γ(v) = 0
· ∀(u, v) ∈ E with λ(u) ≠ (variable, _): γ(u) < γ(v)
A topological ordering represents the execution order of the operations in the graph: given two operations u and v, if γ(u) < γ(v), then u executes before v; if γ(u) = γ(v), then u and v execute in parallel. Variables are always assigned order 0, meaning they execute first and their incoming edges do not change the execution order; subsequent executions of a variable depend on its incoming operations irrespective of the ordering, and these executions do not trigger the variable's outgoing operations;
(3) updating parameters
Training a neural network involves minimizing an objective function f that measures the gap between the predicted result and the actual value; the objective function is a composition of multiple functions with learnable parameters, and a gradient descent algorithm is used to minimize it. Optimization is an iterative process that updates the learnable parameters to minimize the objective function, where each training iteration comprises three stages: a forward stage that computes the objective function, a backward stage that computes gradients, and an update stage that updates the learnable parameters using the gradients. At the start of an iteration, the tensors other than the learnable parameters are unloaded, the input tensors are updated, and the iterative process is triggered.
2. The second step specifically comprises:
each tensor is allocated with an citing counter, namely an operand, after the tensor is used up in each operation, the citing count is reduced by 1, and if the citing count reaches zero, the memory space of the tensor is released; the lifetime of the tensor is from the beginning of the operation that generated it to the end of the last operation that used it, let TsIs the tensor produced by operation u, and v1,v2,...vkIs using TsK operations of (T)sIs calculated as max [ gamma (v) ]1),γ(v2),...γ(vk)}-γ(u);
The lifetime calculation formula for the change tensor is max (Time)all(v1),Timeall(v2),...Timeall(vk)}-Timeall(u) where Timeall(v) Representing the total time spent by the program executing to node v.
3. The third step specifically comprises:
(1) data flow computation graph modeling method based on cost
Setting edge (f_1, f_2) to be executed on the GPU, where τ(f_1, f_2) = (D, _), the computation along the edge is expressed as f_2^G ∘ f_1^G, where the superscript G denotes an operation on the GPU. This computation is rewritten as f_2^G ∘ id^C ∘ f_1^G. The function id in this formula is a swap-out operation, where the superscript C denotes the CPU and id is the identity function, i.e., id(x) = x, so tensor values are unchanged by swap-out and swap-in. Since id executes on the CPU, the output tensor of f_1 is swapped out to CPU memory immediately after f_1 completes, releasing GPU memory; when f_2 is triggered, the output tensor of id is swapped back into the GPU and fed into f_2;
the graph is rewritten using the formula f_2^G ∘ id^C ∘ f_1^G;
(2) function redefinition
The id function is rewritten as the composition of two functions, a primary function and its inverse; choosing a function f for id gives id = f^{-1} ∘ f, and the formula f_2^G ∘ id^C ∘ f_1^G is rewritten as f_2^G ∘ (f^{-1})^C ∘ f^C ∘ f_1^G. Using a pair of encoding and decoding functions in place of id yields f_2^G ∘ id_2^C ∘ id_1^C ∘ f_1^G, in which the swap-back operation id_2 switches the tensor back to the device.
4. The computational graph rewrite conditions are such that a threshold value α is defined, and if the following conditions are simultaneously met, the tensor of u is swapped out to the CPU,
α≤γ(v)-γ(u)
α≥n←trans(u)
α′≤γ(u)-γ(u_fwd)
n ← trans(u) denotes the total time cost of swapping out; the number of nodes n in that time interval is found from pre-run execution-time estimates. α′ is the number of nodes between the current node u and the previously rewritten node; its value is adjusted according to the execution behavior of the current deep learning model, and the optimal value is obtained through training.
5. The fourth step specifically comprises:
performing performance analysis of a deep learning method by determining the time cost of data transmission, and determining the time of theoretical node calculation; determining actual calculation time through pre-operation of an algorithm model to realize calculation graph rewriting, wherein the strategy of realizing swapping out and swapping back of a memory is to add control edges between nodes and determine the weight of data transmission on the edges;
and adding the control edge of the exchange strategy into the nodes of the computational graph for triggering the exchange operation.
Because traditional computation graphs have poor generality, applying the computation graph in a CPU-GPU heterogeneous environment can greatly improve the training speed of deep learning models; computation graph modeling in a CPU-GPU heterogeneous environment therefore has significant research value. Compared with the traditional computation graph, the computation graph modeling technique based on formalized rules redefines the graph modeling method and then optimizes the model using the derived swap-out and swap-in operations within the existing graph and rules. The technique improves the adaptability of deep neural network structures, raises the running speed on the CPU and the GPU, and achieves a good image batch-processing effect.
The invention provides a computation graph modeling method based on formalized rules that combines an extensible neural network with computation graph reconstruction; it can effectively improve data-stream transmission between the CPU and the GPU, enhances the generality of the data transmission model, and has definite research and practical value.
Drawings
FIG. 1 is the indexing flow diagram;
FIG. 2 shows computation graph topology examples;
FIG. 3 is a diagram of the backward computation process;
FIG. 4 is a computation graph rewriting example;
FIG. 5 presents Table 1, the graph semantics;
FIG. 6 is the flow chart of the invention.
Detailed Description
The invention mainly comprises the following contents:
1. modeling computational graphs
(1) Computation graph
A directed graph G = (V, E, λ, τ) is constructed for the input data, where V is the set of vertices in G, E ⊆ V × V is the set of edges in G, λ: V → (O, Bool) maps each vertex to an operation o ∈ O together with a Boolean value indicating whether the operation is parameterized, and τ: E → (D, ACT) maps each edge to a data type D and an action in ACT, where ACT = {"read", "update", "control"}.
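The following is a minimal sketch of one way the structure G = (V, E, λ, τ) could be held in code; the class and field names are illustrative assumptions, not the patent's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    name: str
    op: str                  # the operation this vertex maps to (lambda)
    parameterized: bool      # Boolean component of lambda(v)

@dataclass
class Graph:
    vertices: dict = field(default_factory=dict)   # name -> Vertex
    edges: dict = field(default_factory=dict)      # (u, v) -> (data type, action)

    def add_edge(self, u, v, dtype, act):
        # tau maps each edge to a data type and an action in ACT
        assert act in {"read", "update", "control"}
        self.edges[(u, v)] = (dtype, act)

g = Graph()
g.vertices["x"] = Vertex("x", op="variable", parameterized=True)
g.vertices["h"] = Vertex("h", op="matmul", parameterized=False)
g.add_edge("x", "h", dtype="float32", act="read")
```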
(2) Topological ordering
Given a computation graph G = (V, E, λ, τ), let N be the number of vertices in the graph. A topological ordering is a vertex-to-integer mapping γ: V → {0, 1, ..., N-1} satisfying the following requirements:
· ∀v ∈ V with λ(v) = (variable, _): γ(v) = 0
· ∀(u, v) ∈ E with λ(u) ≠ (variable, _): γ(u) < γ(v)
The topological ordering represents the execution order of the operations in the graph. Given two operations u and v, if γ(u) < γ(v), then u executes before v; if γ(u) = γ(v), then u and v execute in parallel. Variables are always assigned order 0, meaning they execute first and their incoming ("update") edges do not change the execution order; subsequent executions of a variable depend on its incoming operations irrespective of the ordering, and these executions do not trigger the variable's outgoing operations.
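A minimal sketch of computing such an ordering γ is given below, assuming an acyclic graph: variables are pinned to order 0 and every other operation is ranked after its non-variable predecessors, so equal ranks mark operations that may run in parallel. Function names are illustrative.

```python
def topological_order(vertices, edges, is_variable):
    """Assign gamma: variables get 0; each other op gets a rank one
    greater than its highest-ranked non-variable predecessor.
    Assumes the graph is acyclic, as the text requires."""
    gamma = {v: 0 for v in vertices if is_variable(v)}
    remaining = [v for v in vertices if not is_variable(v)]
    while remaining:
        for v in list(remaining):
            preds = [u for (u, w) in edges if w == v and not is_variable(u)]
            if all(u in gamma for u in preds):      # all predecessors ranked
                gamma[v] = max((gamma[u] for u in preds), default=0) + 1
                remaining.remove(v)
    return gamma

print(topological_order(["x", "h", "f"], [("x", "h"), ("h", "f")],
                        lambda v: v == "x"))   # {'x': 0, 'h': 1, 'f': 2}
```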
(3) Updating parameters
Training the neural network involves minimizing the objective function f, which measures the gap between the predicted result and the actual value. The objective function is a composition of multiple functions with learnable parameters, and a gradient descent algorithm is used to minimize it. Optimization is an iterative process that updates the learnable parameters to minimize the objective function, where each training iteration comprises three stages: a forward stage that computes the objective function, a backward stage that computes gradients, and an update stage that updates the learnable parameters using the gradients. At the start of an iteration, the tensors other than the learnable parameters are unloaded, the input tensors are updated, and the iterative process is triggered.
2. Introducing the tensor life cycle on the basis of the computation graph modeling.
If a tensor is no longer used, its data is released by TensorFlow's garbage collection. To this end, each tensor is assigned a reference counter over its consuming operations. Each time an operation finishes using the tensor, the reference count is decremented by 1; when the count reaches zero, the memory space of the tensor is released. Accordingly, the lifetime of a tensor runs from the start of the operation that produces it to the end of the last operation that uses it. For example, let T_s be the tensor produced by operation u, and let v_1, v_2, ..., v_k be the k operations that use T_s. The lifetime of T_s is computed as max{γ(v_1), γ(v_2), ..., γ(v_k)} - γ(u).
The modified lifetime formula is max{Time_all(v_1), Time_all(v_2), ..., Time_all(v_k)} - Time_all(u), where Time_all(v) denotes the total time spent by the program in executing up to node v. This calculation is more accurate and efficient than the traditional one. A reference-counting sketch follows.
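The reference-counting rule and the two lifetime formulas above can be sketched as follows; the names are illustrative.

```python
def lifetime_by_order(gamma, producer, consumers):
    # max{gamma(v_1), ..., gamma(v_k)} - gamma(u)
    return max(gamma[v] for v in consumers) - gamma[producer]

def lifetime_by_time(time_all, producer, consumers):
    # max{Time_all(v_1), ..., Time_all(v_k)} - Time_all(u)
    return max(time_all[v] for v in consumers) - time_all[producer]

class RefCountedTensor:
    def __init__(self, num_consumers):
        self.refcount = num_consumers     # one reference per consuming op
    def release_after_use(self):
        self.refcount -= 1
        if self.refcount == 0:            # last consumer finished
            print("tensor memory released")

gamma = {"u": 1, "v1": 3, "v2": 5}
print(lifetime_by_order(gamma, "u", ["v1", "v2"]))   # 4
```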
3. After modeling is carried out on the basis of the calculation graph, a data flow operation method based on cost is introduced, and the calculation graph is rewritten.
"Long-lived" tensors in the GPU are temporarily sent to the CPU and swapped back into the GPU when necessary, releasing GPU memory and accelerating execution.
(1) Data flow computation graph modeling method based on cost
Suppose edge (f_1, f_2) is executed on the GPU, where τ(f_1, f_2) = (D, _). The computation along the edge can be expressed as
f_2^G ∘ f_1^G  (2-1)
where the superscript G denotes an operation on the GPU. This computation is rewritten as:
f_2^G ∘ id^C ∘ f_1^G  (2-2)
The function id in the above equation is a swap-out operation, where the superscript C denotes the CPU and id is the identity function, i.e., id(x) = x; tensor values are unchanged by swap-out and swap-in. Since id executes on the CPU, the output tensor of f_1 is swapped out to CPU memory immediately after f_1 completes, releasing GPU memory. When f_2 is triggered, the output tensor of id is swapped back into the GPU and fed into f_2.
Rewriting the graph with formula (2-2) reduces GPU memory consumption. However, given the bandwidth limits of the system, continuously swapping out tensors increases the load on the platform, so a strategy of swapping tensors out at intervals is proposed. Moreover, for an edge (u, v) where v executes immediately after u, no tensor exchange on that edge is required. The rewrite condition of the cost-based dataflow computation graph is as follows: a threshold α is defined, and the tensor of u can be swapped out to the CPU if the following conditions are met simultaneously (a minimal check is sketched after the conditions):
α ≤ γ(v) - γ(u)
α ≥ n ← trans(u)
α′ ≤ γ(u) - γ(u_fwd)
Here n ← trans(u) denotes the total time cost of swapping out; the number of nodes n in that time interval is found from pre-run execution-time estimates. α′ is the number of nodes between the current node u and the previously rewritten node; its value needs to be adjusted appropriately according to the execution behavior of the current deep learning model, after which the optimal value is obtained through training.
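A minimal check of the three conditions is sketched below; `trans_nodes(u)` stands in for n ← trans(u), the node count derived from the pre-run swap-cost estimate, and is assumed given.

```python
def should_swap_out(gamma, u, v, u_fwd, trans_nodes, alpha, alpha_prime):
    """True if the tensor produced by u may be swapped out to the CPU."""
    return (alpha <= gamma[v] - gamma[u]                  # consumer far enough away
            and alpha >= trans_nodes(u)                   # swap cost fits the gap
            and alpha_prime <= gamma[u] - gamma[u_fwd])   # spacing since last rewrite

gamma = {"u_fwd": 2, "u": 10, "v": 30}
print(should_swap_out(gamma, "u", "v", "u_fwd",
                      trans_nodes=lambda u: 5, alpha=8, alpha_prime=4))  # True
```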
(2) Function redefinition
Since the swap-out and swap-in operations cannot be represented by the same function, and the tensor should be swapped back as early as possible, additional operations must be added. The id function is rewritten as the composition of two functions, a primary function and its inverse. Choosing a function f for id gives:
id = f^{-1} ∘ f  (2-4)
Formula (2-2) is then rewritten as:
f_2^G ∘ (f^{-1})^C ∘ f^C ∘ f_1^G  (2-5)
Using a pair of encoding and decoding functions in place of id reduces memory consumption on the CPU:
f_2^G ∘ id_2^C ∘ id_1^C ∘ f_1^G  (2-6)
In equation (2-6), the swap-back operation id_2 switches the tensor back to the device. Moreover, id_2 must be triggered with high priority; otherwise id_2 would execute immediately after id_1. It is therefore necessary to add a control edge from some operation to id_2. An optimization strategy based on computation and transmission costs is proposed next.
4. And an optimization strategy based on calculation and transmission cost is introduced to obtain a more accurate data transmission mode.
(1) Fusion operation strategy
When a large number of swap operations exchange tensors repeatedly and consume bandwidth, multiple swap operations can be merged into a single swap operation, provided the tensor is large and the operation intervals are short. The tensor is then swapped back only once and resides in GPU memory for reuse by the consuming operations.
(2) Swap-in strategy based on operation transmission cost
Performance analysis of the deep learning method is carried out by determining the time cost of data transmission, which yields the theoretical computation time of the nodes. The computation graph rewrite is realized by determining the actual computation time through a pre-run of the algorithm model; the key to the memory swap-out and swap-back strategy is to add control edges between nodes, accurately determine the weight of data transmitted on the edges, and design an efficient control-edge addition algorithm.
The control edges of the exchange strategy are added to the nodes of the computation graph to trigger exchange operations, reducing exchange communication overhead while preserving the equivalence of the computation graph.
In formula (2-6), the control operation that triggers the swap-back operation id_2 must be selected from a set of operations V_c, where for every v ∈ V_c, γ(id_1) < γ(v) < γ(f_2) holds, to ensure the correctness of the computation graph. Let k = γ(f_2) - γ(v) be the distance between f_2 and v. If k is too large, the tensor is swapped back too early and will be stored in the device for a long time until f_2 uses it.
The method of the invention is based on static theoretical analysis and pre-run measured data, and introduces two parameters, a lower limit σ_l and an upper limit σ_u, thereby defining the condition under which a node adds a dependent edge.
1) Determining the upper and lower limits of the search nodes.
Define the swap-out and swap-back cost of the tensor of node f_l as trans(f_l), where f_l satisfies the swap-out condition and f_i denotes its backward operation. Searching from the backward node toward the forward nodes, the number of computation nodes whose actual pre-run execution time can cover swapping the data back to the GPU is determined; this number is the lower limit σ_l:
Σ_{i=1..n_exe} time_exe(v_i) ≥ trans(f_l)
The left side of this inequality is the accumulated pre-run time of the nodes, and n_exe is the total number of nodes that can be selected in the pre-run state. The upper limit σ_u is obtained by the analogous process:
Σ_{i=1..n_theory} time_theory(v_i) ≥ trans(f_l)
where time_theory denotes the estimated run time of a node and n_theory the total number of nodes that can be selected under the theoretical analysis of the evaluation model.
2) The maximum swap back data volume on the dependent node is defined as the maximum throughput of the system.
The function Thr(f) denotes the volume of dependent data added on node f, and Through_max denotes the maximum throughput. If swap-back data and dependencies are added for the current node, the node must satisfy:
Thr(f) < Through_max
Under the current system load, the system throughput serves as the criterion when adding swap-back data and dependencies; the volume of data transmitted from the CPU to the GPU must not exceed the maximum throughput of the system, thereby ensuring performance stability. A minimal check is sketched below.
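The throughput check described above can be sketched as follows; the names follow Thr and Through_max from the text, and the byte values are illustrative.

```python
def can_add_swap_back(thr_f, extra, through_max):
    """True if adding `extra` bytes of swap-back dependencies on node f
    keeps its dependent data volume below the CPU-to-GPU throughput cap."""
    return thr_f + extra < through_max

print(can_add_swap_back(thr_f=2**20, extra=2**18, through_max=2**22))  # True
```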
3) Modeling process:
referring to FIG. 1, an index flow diagram is depicted. Firstly, inputting a starting node and an upper limit and a lower limit of a calculation graph, acquiring the starting node, then finding out a node meeting the range of the upper limit and the lower limit, determining an accessible edge, judging whether the node meets the condition of adding dependence or not according to a cost function, and dividing into two implementation strategies of backward and forward according to different directions of searching nodes, namely different degrees of approaching the upper limit or the lower limit. For example, for one edge (f)1,f2) Using a swap-out operation, a swap-back operation as si
Figure BDA0002298081010000083
For the switch back operation siThe first is a backward topology query strategy. At the target operation f2And an overwrite operation f1In between, directly using topological ordering to obtain a set of candidate nodes for controllable operation, lower and upper bounds are relative to f2. Candidate node is and2is at a distance oflTo sigmauOperation within range, there is one node to f2And the data-dependent data volume currently added by the node does not exceed the maximum throughput of the system per second, once an operation meeting the conditions is found, the algorithm automatically stops, and the control node obtained by the algorithm is the operation which triggers the data to be changed back at the latest.
The second strategy is a forward search. The chain-rule strategy starts from the source operation f_1 and searches along the computation direction to find the corresponding backward operations as candidates for the control operation. Breadth-first search is used to traverse the operations of the forward phase, where the lower and upper bounds are relative to f_1. For the breadth-first search, the invention maintains two open sets s_1 and s_2 and one closed set s_c: s_1 contains the current forward operations, s_2 the next level of forward operations (the outgoing operations of all operations in s_1), and s_c the operations already visited. Starting from f_1, within the range σ_l to σ_u the algorithm checks the validity of the outgoing backward operations of the current operations. If a valid operation exists, the current node is the candidate node and the algorithm returns it; otherwise the algorithm enters the next loop.
Finally, the proposed cost-based dataflow exchange method is optimized through the memory fusion strategy and the forward and backward search modes to obtain the final computation graph model.
The invention is described in more detail below by way of example.
The invention relates to a computation graph modeling technique based on formalized rules: dataflow computation graph modeling is performed first, the learnable parameters are updated, and the tensor life cycle is introduced; the model is then optimized by rewriting the computation graph and deriving operation strategies. The steps are described below:
1. modeling computational graphs
(1) Computation graph
Table 1, shown in FIG. 5, lists the symbolic representations of the different vertices and edges of the computation graph. The composition of functions is expressed as f_2 ∘ f_1, defined by (f_2 ∘ f_1)(x) = f_2(f_1(x)). The function "_i" takes the i-th element of a tuple, e.g., (a, b)_2 returns b. An operation in the computation graph is triggered to start executing when all of its incoming edges carry data. After the operation completes, data is produced on its output edges, and the next operations are triggered in the same manner. This process ends when all reachable operations have been performed and all reachable edges are filled with data; at that point, every reachable operation except the variables has executed once. Variables are introduced because, at the start of a computation, execution of the graph cannot be triggered if no edge carries a value; and since the computation graph is acyclic, operations without incoming edges could never be triggered. Introducing variables solves this problem. Variables in the computation graph store the learnable parameters and the input and output data, thereby triggering the graph's computation. This special use of variables is an important feature distinguishing deep learning computation graphs from general graphs.
(2) Topological ordering
The topological ordering represents the execution order of the operations in the graph. Given two operations u and v, if γ(u) < γ(v), then u executes before v; if γ(u) = γ(v), then u and v execute in parallel. Variables are always assigned order 0, meaning they execute first and their incoming ("update") edges do not change the execution order; subsequent executions of a variable depend on its incoming operations irrespective of the ordering, and these executions do not trigger the variable's outgoing operations.
FIG. 2(a) has the execution sequence x → h → f → x: the user first initializes the variable x, which triggers operation h; h executes and triggers operation f; x is updated with the output of f, and the computation ends. The execution of operation h depends only on x, and x itself cannot trigger h again.
The example graph in FIG. 2(b) has two possible execution orders: x → h → f → x → f_1, or x → h → f → f_1 → x. Since both tensors T_1 and T_2 can be used to trigger operation f_1, f_1 must execute after f and x. However, x executes multiple times, so the key step is to determine which output of x is used as the input of f_1. To avoid ambiguity, the method proposes the following conventions for variables:
the operation always uses the most recently updated value of the variable.
In operations using the same tensor, a variable always has the highest execution priority.
These conventions ensure that f_1 executes after x has been updated with the output of f. The execution order of operations depends not only on the data availability of incoming edges but also on whether the "control" edges carry data. A "control" edge is not an input to an operation; it is used to control the order in which the operations of the computation graph execute, and adding a "control" edge changes the topological ordering of the graph. For example, if (u, v) is a "control" edge and γ(u) < γ(v), then v must execute after u. From this definition it follows that variables have no control edges.
The role of the "control" edge is shown in FIG. 2(c). A new operation f_2 is added; it uses the output of f, performs its computation, and updates the variable x. Without a control edge from f_2 to f_1, f_1 and f_2 could execute in parallel after f, since they do not depend on each other. But both access the variable x (f_1 reads x and f_2 writes x), so a control edge is needed to ensure they access x in sequence. The "control" edge from f_2 to f_1 means that f_1 executes only after f_2 has completed and updated the data of x.
(3) Updating parameters
The process of updating the learnable parameters (represented by variables) during training is shown in FIG. 3. In the forward phase, variable x_i is the input of a function f_i whose output is used by a later function, and the objective function f finally produces the loss value. In the backward phase, the method computes the gradients of f with respect to the learnable parameters; this requires the gradient function ∇f_i of f_i to compute the gradient of f with respect to x_i. Finally, in the update phase, the function Ux_i updates x_i using the gradient. A minimal training-loop sketch follows.
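The three stages can be illustrated with a minimal NumPy training loop; the model f(x) = w·x with a squared-error objective is an illustrative stand-in, not the patent's network.

```python
import numpy as np

w = np.float64(0.0)                # learnable parameter x_i
lr = 0.1
for step in range(100):
    x, y = 2.0, 6.0                # one training sample
    pred = w * x                   # forward stage: evaluate f
    loss = (pred - y) ** 2
    grad = 2 * (pred - y) * x      # backward stage: gradient of f w.r.t. w
    w = w - lr * grad              # update stage: Ux_i applies the gradient
print(w)                           # converges toward 3.0
```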
2. Introducing the tensor life cycle on the basis of the computation graph modeling.
The life cycle of the tensor is modified; the calculation formula is max{Time_all(v_1), Time_all(v_2), ..., Time_all(v_k)} - Time_all(u), where Time_all(v) denotes the total time spent by the program in executing up to node v. This calculation is more accurate and efficient than the traditional method.
3. After modeling is carried out on the basis of the calculation graph, a data flow operation method based on cost is introduced, and the calculation graph is rewritten.
(1) Data flow computation graph modeling method based on cost
A computation graph rewrite example is illustrated in FIG. 4. The bold edges of (a) are rewritten to generate (b). The integer above or below a vertex is that vertex's position in the topological execution order. The tensor of vertex f_1 is swapped out, so the edges from f_1 to f_2 and f_3 are rewritten; f_i and f_j are control-dependent operations that trigger the execution of the swap operations id_3 and id_4, respectively. To store the tensor in CPU memory, derived operations automatically send the tensor to the CPU and back to the GPU when it is used.
For an edge (f_1, f_2) executed on the GPU, where τ(f_1, f_2) = (D, _), the computation along the edge can be expressed as
f_2^G ∘ f_1^G  (2-1)
where the superscript G denotes an operation on the GPU. This computation is rewritten as:
f_2^G ∘ id^C ∘ f_1^G  (2-2)
Since id executes on the CPU, the output tensor of f_1 is swapped out to CPU memory immediately after f_1 completes, releasing GPU memory. When f_2 is triggered, the output tensor of id is swapped back into the GPU and fed into f_2.
Rewriting the graph with formula (2-2) reduces GPU memory consumption. To reduce system bandwidth consumption, a strategy of swapping tensors out at intervals is adopted; and for an edge (u, v) where v executes immediately after u, no tensor exchange on that edge is needed. The rewrite condition of the computation graph based on dataflow cost is as follows: a threshold α is defined, and the tensor of u is swapped out to the CPU if the following conditions are met simultaneously:
α ≤ γ(v) - γ(u)
α ≥ n ← trans(u)  (2-3)
α′ ≤ γ(u) - γ(u_fwd)
Here n ← trans(u) denotes the total time cost of swapping out; the number of nodes n in that time interval is found from pre-run execution-time estimates. α′ is the number of nodes between the current node u and the previously rewritten node; its value needs to be adjusted appropriately according to the execution behavior of the current deep learning model, after which the optimal value is obtained through training.
(2) Function redefinition
As shown in FIG. 4, the swap-out and swap-in operations cannot be represented by the same function, and the tensor should be swapped back as early as possible, which requires adding extra operations. The id function is rewritten as the composition of two functions, a primary function and its inverse. Choosing a function f for id gives:
id = f^{-1} ∘ f  (2-4)
Equation (2-2) is rewritten as:
f_2^G ∘ (f^{-1})^C ∘ f^C ∘ f_1^G  (2-5)
Using a pair of encoding and decoding functions in place of id reduces memory consumption on the CPU:
f_2^G ∘ id_2^C ∘ id_1^C ∘ f_1^G  (2-6)
In equation (2-6), the swap-back operation id_2 switches the tensor back to the device. Moreover, id_2 must be triggered with high priority; otherwise id_2 would execute immediately after id_1, so a control edge from some operation to id_2 must be added. An optimization strategy based on computation and transmission costs is therefore proposed.
4. And an optimization strategy based on calculation and transmission cost is introduced to obtain a more accurate data transmission mode.
(1) Fusion operation strategy
When tensors are large and operation intervals are short, multiple swap operations are merged into one. The tensor is then swapped back only once and resides in GPU memory for reuse by the consuming operations. For example, in FIG. 4(d), if f_2 and f_3 are close together and the transmitted tensor is large, id_3 and id_4 are merged. To determine the proximity of two operations, the invention defines a threshold β on the topological distance between them. A minimal sketch follows.
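A minimal fusion sketch under these assumptions: swap-ins of the same tensor are grouped when the topological distance between consecutive consumers is below β. The tuple layout is illustrative.

```python
def fuse_swap_ins(swap_ins, gamma, beta):
    """swap_ins: list of (tensor_id, consumer_op); returns groups that
    each become a single swap-in residing in GPU memory."""
    swap_ins = sorted(swap_ins, key=lambda s: gamma[s[1]])
    groups, current = [], [swap_ins[0]]
    for s in swap_ins[1:]:
        same_tensor = s[0] == current[-1][0]
        close = gamma[s[1]] - gamma[current[-1][1]] < beta
        if same_tensor and close:
            current.append(s)            # merge into the current swap-in
        else:
            groups.append(current)
            current = [s]
    groups.append(current)
    return groups

gamma = {"f2": 5, "f3": 6, "f9": 20}
print(fuse_swap_ins([("t1", "f2"), ("t1", "f3"), ("t1", "f9")], gamma, beta=3))
# [[('t1', 'f2'), ('t1', 'f3')], [('t1', 'f9')]]
```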
(2) Swap-in strategy based on operation transmission cost
Based on the work above, the time cost of data transmission is determined, and the theoretical node computation time is obtained through performance analysis of the deep learning method. The computation graph rewrite is realized by determining the actual computation time of the algorithm model through pre-running; the key to the memory swap-out and swap-back strategy is to add control edges between nodes, determine the weight of data transmitted on the edges as accurately as possible, and design a reasonable control-edge addition algorithm. The control edges of the exchange strategy are added to the nodes of the computation graph to trigger exchange operations, reducing exchange communication overhead while preserving the equivalence of the computation graph.
In equation (2-6), the control operation that triggers the swap-back operation id_2 must be selected from a set of operations V_c, where for every v ∈ V_c, γ(id_1) < γ(v) < γ(f_2) holds, to ensure the correctness of the computation graph. Let k = γ(f_2) - γ(v) be the distance between f_2 and v. If k is too large, the tensor is swapped back too early and will remain in the device for a long time until f_2 uses it.
The scheme is based on static theoretical analysis and pre-run measured data, and introduces two parameters, a lower limit σ_l and an upper limit σ_u, defining the condition under which a node adds a dependent edge.
1) Determining the upper and lower limits of the search nodes.
Define the swap-out and swap-back cost of the tensor of node f_l as trans(f_l), where f_l satisfies the swap-out condition and f_i denotes its backward operation. Searching from the backward node toward the forward nodes, the number of computation nodes whose actual pre-run execution time can cover swapping the data back to the GPU is determined; this number is the lower limit σ_l:
Σ_{i=1..n_exe} time_exe(v_i) ≥ trans(f_l)  (3-1)
The left side of equation (3-1) is the accumulated pre-run time of the nodes, and n_exe is the total number of nodes that can be selected in the pre-run state. The upper limit σ_u is obtained by the analogous process:
Σ_{i=1..n_theory} time_theory(v_i) ≥ trans(f_l)  (3-2)
In equation (3-2), time_theory denotes the estimated run time of a node and n_theory the total number of nodes that can be selected under the theoretical analysis of the evaluation model.
2) The maximum swap back data volume on the dependent node is defined as the maximum throughput of the system.
The function Thr(f) denotes the volume of dependent data to be added on node f, and Through_max denotes the maximum throughput. If swap-back data and dependencies are added for the current node, the node must satisfy:
Thr(f) < Through_max  (3-3)
Under the current system load, the system throughput serves as the criterion when adding swap-back data and dependencies; the volume of data transmitted from the CPU to the GPU must not exceed the maximum throughput of the system, thereby ensuring performance stability.
3) Modeling process:
for one edge (f)1,f2) Using a swap-out operation, a swap-back operation as sl
Figure BDA0002298081010000133
For the switch back operation slThe first is a backward topology query strategy. At the targetOperation f2And operation f of rewriting1In between, directly using topological ordering to obtain a set of candidate nodes for controllable operation, lower and upper bounds are relative to f2. Algorithm 1 is an implementation of this strategy. Candidate node is and2is at a distance oflTo sigmau(line 7) operation in the range from them to f2And the current amount of data-dependent data added by the node does not exceed the maximum throughput per second of the system, the algorithm will stop by itself as soon as an operation is found that satisfies the above conditions (lines 10-13), and the control node through the algorithm is the operation that triggered the data swap back the latest. And finally outputting the traversal sequence of the directed graph. The specific implementation process is described as follows:
Figure BDA0002298081010000141
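Since the original listing is not reproduced, the following is a hedged Python sketch of the backward topology query as the text describes it; `thr`, `extra`, and the data layout are assumptions.

```python
def backward_topology_query(ops, gamma, f2, sigma_l, sigma_u,
                            thr, extra, through_max):
    """Scan candidates backward from f2 so the returned control node
    triggers the swap-back as late as possible (sketch of Algorithm 1)."""
    candidates = [v for v in ops
                  if sigma_l <= gamma[f2] - gamma[v] <= sigma_u]
    for v in sorted(candidates, key=lambda v: gamma[v], reverse=True):
        if thr[v] + extra < through_max:   # throughput constraint (3-3)
            return v                       # latest valid trigger found
    return None

gamma = {"a": 3, "b": 5, "f2": 10}
print(backward_topology_query(["a", "b"], gamma, "f2", 2, 7,
                              thr={"a": 0, "b": 0}, extra=1,
                              through_max=10))   # 'b', the later candidate
```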
the second is a forward market search strategy. Chain rule policy slave source operation f1A search along the calculation direction is started to find the corresponding backward operation as a candidate for the control operation. Breadth-first search is used to traverse operations in the forward phase, where the lower and upper bounds are relative to f1. Algorithm 2 shows the specific course of the strategy. For breadth-first search strategy, two open sets s are set in the text1And s2One set for one closed set sc,s1Involving the current forward operation, s2Involving the next stage of forward operation (including s)1Outgoing operation of all operations in (1), scContaining accessed operations, from f1At the beginning, if the algorithm is at σlσuThe output backward operations (line 9) of the current operation are obtained within the range (line 8), and the validity of these backward operations is checked (lines 10-11). If there is a valid operation, the current node is the candidate node and it is returned. Otherwise, the algorithm enters the next loop. And finally outputting the traversal sequence of the directed graph. (lines 27-30).
Figure BDA0002298081010000151
Figure BDA0002298081010000161
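Likewise, a hedged sketch of the forward breadth-first search: `succ(op)` and `backward_of(op)` are assumed helpers returning an operation's outgoing operations and its corresponding backward operations, and `is_valid` encapsulates the validity check from the text.

```python
def forward_bfs(f1, succ, backward_of, sigma_l, sigma_u, is_valid):
    """Sketch of Algorithm 2: expand open sets level by level from f1,
    keeping a closed set of visited ops; within [sigma_l, sigma_u],
    return the first valid backward operation as the control candidate."""
    s1, sc = {f1}, set()           # current frontier and closed set
    depth = 0
    while s1 and depth <= sigma_u:
        if depth >= sigma_l:
            for op in s1:
                for b in backward_of(op):
                    if is_valid(b):
                        return b
        s2 = set()                 # next frontier: outgoing ops of s1
        for op in s1:
            s2 |= set(succ(op)) - sc
        sc |= s1
        s1, depth = s2, depth + 1
    return None

succ = {"f1": ["g"], "g": []}
back = {"f1": ["bf1"], "g": ["bg"]}
print(forward_bfs("f1", lambda o: succ[o], lambda o: back[o],
                  sigma_l=1, sigma_u=3, is_valid=lambda b: b == "bg"))  # bg
```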
Finally, the established computation graph model is combined with a ResNet network model for training to obtain the optimal tensors, and data of different structures are input again for testing. Optimization through the memory fusion strategy and the forward and backward search modes then yields the final traversal-order data set.
Combining the computation graph modeling technique based on formalized rules with a deep learning network model can improve the interpretability and generalizability of the model and improve the image batch-processing effect.

Claims (6)

1. A deep learning-oriented GPU parallel computing data processing method is characterized by comprising the following steps:
step one, modeling a calculation graph;
secondly, introducing a tensor life cycle on the basis of modeling of the calculation graph;
step three, introducing a cost-based data flow operation method, and rewriting a calculation graph;
and step four, introducing an optimization strategy based on calculation and transmission cost to obtain accurate data transmission.
2. The deep learning oriented GPU parallel computing data processing method according to claim 1, wherein the first step specifically comprises:
(1) computation graph
Constructing a directed graph G = (V, E, λ, τ) for the input data, where V is the set of vertices in G, E ⊆ V × V is the set of edges in G, λ: V → (O, Bool) maps each vertex to an operation o ∈ O together with a Boolean value indicating whether the operation is parameterized, and τ: E → (D, ACT) maps each edge to a data type D and an action in ACT, where ACT = {"read", "update", "control"};
(2) topological ordering
Given a computation graph G = (V, E, λ, τ), let N be the number of vertices in the graph. A topological ordering is a vertex-to-integer mapping γ: V → {0, 1, ..., N-1} satisfying the following requirements:
· ∀v ∈ V with λ(v) = (variable, _): γ(v) = 0
· ∀(u, v) ∈ E with λ(u) ≠ (variable, _): γ(u) < γ(v)
A topological ordering represents the execution order of the operations in the graph: given two operations u and v, if γ(u) < γ(v), then u executes before v; if γ(u) = γ(v), then u and v execute in parallel. Variables are always assigned order 0, meaning they execute first and their incoming edges do not change the execution order; subsequent executions of a variable depend on its incoming operations irrespective of the ordering, and these executions do not trigger the variable's outgoing operations;
(3) updating parameters
Training a neural network involves minimizing an objective function f that measures the gap between the predicted result and the actual value; the objective function is a composition of multiple functions with learnable parameters, and a gradient descent algorithm is used to minimize it. Optimization is an iterative process that updates the learnable parameters to minimize the objective function, where each training iteration comprises three stages: a forward stage that computes the objective function, a backward stage that computes gradients, and an update stage that updates the learnable parameters using the gradients. At the start of an iteration, the tensors other than the learnable parameters are unloaded, the input tensors are updated, and the iterative process is triggered.
3. The deep learning-oriented GPU parallel computing data processing method as claimed in claim 1, wherein the second step specifically comprises:
each tensor is allocated with an citing counter, namely an operand, after the tensor is used up in each operation, the citing count is reduced by 1, and if the citing count reaches zero, the memory space of the tensor is released; the lifetime of the tensor is the operation from which it is derivedMake T to end with the last operation to use itsIs the tensor produced by operation u, and v1,v2,...vkIs using TsK operations of (T)sIs calculated as max [ gamma (v) ]1),γ(v2),...γ(vk)}-γ(u);
The lifetime calculation formula for the change tensor is max (Time)all(v1),Timeall(v2),...Timeall(vk)}-Timeall(u) where Timeall(v) Representing the total time spent by the program executing to node v.
4. The deep learning-oriented GPU parallel computing data processing method according to claim 1, wherein the third step specifically comprises:
(1) data flow computation graph modeling method based on cost
Setting edge (f_1, f_2) to be executed on the GPU, where τ(f_1, f_2) = (D, _), the computation along the edge is expressed as f_2^G ∘ f_1^G, where the superscript G denotes an operation on the GPU, and this computation is rewritten as f_2^G ∘ id^C ∘ f_1^G; the function id in the formula is a swap-out operation, where the superscript C denotes the CPU and id is the identity function, i.e., id(x) = x, so tensor values are unchanged by swap-out and swap-in; since id executes on the CPU, the output tensor of f_1 is swapped out to CPU memory immediately after f_1 completes, releasing GPU memory, and when f_2 is triggered, the output tensor of id is swapped back into the GPU and fed into f_2;
the graph is rewritten using the formula f_2^G ∘ id^C ∘ f_1^G;
(2) function redefinition
the id function is rewritten as the composition of two functions, a primary function and its inverse; choosing a function f for id gives id = f^{-1} ∘ f, and the formula f_2^G ∘ id^C ∘ f_1^G is rewritten as f_2^G ∘ (f^{-1})^C ∘ f^C ∘ f_1^G; using a pair of encoding and decoding functions in place of id yields f_2^G ∘ id_2^C ∘ id_1^C ∘ f_1^G, in which the swap-back operation id_2 switches the tensor back to the device.
5. The deep learning-oriented GPU parallel computing data processing method according to claim 4, wherein the computation graph rewrite condition is that a threshold α is defined, and if the following conditions are met simultaneously, the tensor of u is swapped out to the CPU:
α≤γ(v)-γ(u)
α≥n←trans(u)
α′≤γ(u)-γ(u_fwd)
n ← trans(u) denotes the total time cost of swapping out; the number of nodes n in that time interval is found from pre-run execution-time estimates. α′ is the number of nodes between the current node u and the previously rewritten node; its value is adjusted according to the execution behavior of the current deep learning model, and the optimal value is obtained through training.
6. The deep learning-oriented GPU parallel computing data processing method as claimed in claim 1, wherein the fourth step specifically comprises:
performing performance analysis of the deep learning method by determining the time cost of data transmission, thereby determining the theoretical node computation time; determining the actual computation time through a pre-run of the algorithm model to realize the computation graph rewrite, wherein the strategy for memory swap-out and swap-back is to add control edges between nodes and determine the weight of the data transmitted on the edges;
and adding the control edge of the exchange strategy into the nodes of the computational graph for triggering the exchange operation.
CN201911210933.2A 2019-12-02 2019-12-02 Deep learning-oriented GPU parallel computing data processing method Pending CN110941494A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911210933.2A CN110941494A (en) 2019-12-02 2019-12-02 Deep learning-oriented GPU parallel computing data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911210933.2A CN110941494A (en) 2019-12-02 2019-12-02 Deep learning-oriented GPU parallel computing data processing method

Publications (1)

Publication Number Publication Date
CN110941494A true CN110941494A (en) 2020-03-31

Family

ID=69908471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911210933.2A Pending CN110941494A (en) 2019-12-02 2019-12-02 Deep learning-oriented GPU parallel computing data processing method

Country Status (1)

Country Link
CN (1) CN110941494A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111580827A (en) * 2020-04-30 2020-08-25 腾讯科技(深圳)有限公司 Compiling optimization method and device of machine learning model
CN111652346A (en) * 2020-04-21 2020-09-11 厦门渊亭信息科技有限公司 Large-scale map deep learning calculation framework based on hierarchical optimization paradigm
CN112132287A (en) * 2020-09-04 2020-12-25 苏州浪潮智能科技有限公司 Distributed quantum computing simulation method and device
CN112306697A (en) * 2020-12-31 2021-02-02 之江实验室 Deep learning memory management method and system based on Tensor access
CN113066023A (en) * 2021-03-19 2021-07-02 哈尔滨工程大学 SAR image speckle removing method based on self-calibration convolutional neural network
CN114327630A * 2022-01-05 2022-04-12 北京大学 High-performance operator generation method suitable for Huawei Ascend (Shengteng) chip
CN115033391A (en) * 2022-08-10 2022-09-09 之江实验室 Data flow method and device for neural network calculation
CN115268877A (en) * 2022-09-27 2022-11-01 之江实验室 Intermediate representation method and device for parallel execution of graph computation
CN115268936A (en) * 2022-09-27 2022-11-01 之江实验室 Optimization method and device for compiling calculation graph
WO2023093689A1 (en) * 2021-11-29 2023-06-01 华为技术有限公司 Computational graph optimization method and apparatus, and device
CN116432778A (en) * 2023-06-12 2023-07-14 摩尔线程智能科技(北京)有限责任公司 Data processing method and device, storage medium and electronic equipment
CN116610456A (en) * 2023-07-19 2023-08-18 首都师范大学 Memory optimization method based on eager memory reuse algorithm
US11782723B1 (en) 2022-09-27 2023-10-10 Zhejiang Lab Intermediate representation method and apparatus for parallel execution of graph computation
US11789894B2 (en) 2022-01-27 2023-10-17 Wistron Corporation Acceleration system and dynamic configuration method thereof
CN117522669A (en) * 2024-01-08 2024-02-06 之江实验室 Method, device, medium and equipment for optimizing internal memory of graphic processor

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105302974A (en) * 2015-11-06 2016-02-03 北京航空航天大学 Real-time cutting simulation method of flexible object on the basis of finite element and time-variant modal analysis
CN109919310A (en) * 2019-01-15 2019-06-21 中国科学院信息工程研究所 A kind of GPU Memory Optimize Method and system towards deep learning training mission

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105302974A (en) * 2015-11-06 2016-02-03 北京航空航天大学 Real-time cutting simulation method of flexible object on the basis of finite element and time-variant modal analysis
CN109919310A (en) * 2019-01-15 2019-06-21 中国科学院信息工程研究所 A kind of GPU Memory Optimize Method and system towards deep learning training mission

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴文文 (Wu Wenwen): "Research on GPU Memory Access Optimization for Deep Learning", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111652346A (en) * 2020-04-21 2020-09-11 厦门渊亭信息科技有限公司 Large-scale map deep learning calculation framework based on hierarchical optimization paradigm
CN111580827B (en) * 2020-04-30 2021-09-28 腾讯科技(深圳)有限公司 Compiling optimization method and device of machine learning model
CN111580827A (en) * 2020-04-30 2020-08-25 腾讯科技(深圳)有限公司 Compiling optimization method and device of machine learning model
CN112132287A (en) * 2020-09-04 2020-12-25 苏州浪潮智能科技有限公司 Distributed quantum computing simulation method and device
CN112132287B (en) * 2020-09-04 2022-05-17 苏州浪潮智能科技有限公司 Distributed quantum computing simulation method and device
CN112306697A (en) * 2020-12-31 2021-02-02 之江实验室 Deep learning memory management method and system based on Tensor access
CN112306697B (en) * 2020-12-31 2021-04-27 之江实验室 Deep learning memory management method and system based on Tensor access
CN113066023A (en) * 2021-03-19 2021-07-02 哈尔滨工程大学 SAR image speckle removing method based on self-calibration convolutional neural network
WO2023093689A1 (en) * 2021-11-29 2023-06-01 华为技术有限公司 Computational graph optimization method and apparatus, and device
CN114327630A * 2022-01-05 2022-04-12 北京大学 High-performance operator generation method suitable for Huawei Ascend (Shengteng) chip
US11789894B2 (en) 2022-01-27 2023-10-17 Wistron Corporation Acceleration system and dynamic configuration method thereof
TWI819480B (en) * 2022-01-27 2023-10-21 緯創資通股份有限公司 Acceleration system and dynamic configuration method thereof
CN115033391A (en) * 2022-08-10 2022-09-09 之江实验室 Data flow method and device for neural network calculation
US11941507B2 (en) 2022-08-10 2024-03-26 Zhejiang Lab Data flow method and apparatus for neural network computation by determining input variables and output variables of nodes of a computational graph of a neural network
CN115268877A (en) * 2022-09-27 2022-11-01 之江实验室 Intermediate representation method and device for parallel execution of graph computation
US11782723B1 (en) 2022-09-27 2023-10-10 Zhejiang Lab Intermediate representation method and apparatus for parallel execution of graph computation
CN115268936A (en) * 2022-09-27 2022-11-01 之江实验室 Optimization method and device for compiling calculation graph
CN116432778B (en) * 2023-06-12 2023-09-19 摩尔线程智能科技(北京)有限责任公司 Data processing method and device, storage medium and electronic equipment
CN116432778A (en) * 2023-06-12 2023-07-14 摩尔线程智能科技(北京)有限责任公司 Data processing method and device, storage medium and electronic equipment
CN116610456B (en) * 2023-07-19 2023-09-26 首都师范大学 Memory optimization method based on eager memory reuse algorithm
CN116610456A (en) * 2023-07-19 2023-08-18 首都师范大学 Memory optimization method based on eager memory reuse algorithm
CN117522669A (en) * 2024-01-08 2024-02-06 之江实验室 Method, device, medium and equipment for optimizing internal memory of graphic processor
CN117522669B (en) * 2024-01-08 2024-03-26 之江实验室 Method, device, medium and equipment for optimizing internal memory of graphic processor

Similar Documents

Publication Publication Date Title
CN110941494A (en) Deep learning-oriented GPU parallel computing data processing method
CN112579063B (en) Acceleration method for exploring optimization space in deep learning compiler
Le et al. Tflms: Large model support in tensorflow by graph rewriting
CN110276442B (en) Searching method and device of neural network architecture
CN105550746A (en) Training method and training device of machine learning model
CN114897173B (en) Method and device for determining PageRank based on variable component sub-line
CN115860081B (en) Core algorithm scheduling method, system, electronic equipment and storage medium
CN111966495B (en) Data processing method and device
CN112764893B (en) Data processing method and data processing system
Le et al. Automatic gpu memory management for large neural models in tensorflow
US20210392073A1 (en) Regular path queries (rpqs) for distributed graphs
Neele et al. Solving parameterised boolean equation systems with infinite data through quotienting
CN112433853A (en) Heterogeneous sensing data partitioning method for parallel application of supercomputer data
CN113609806B (en) Quantum circuit program general transformation method combining sub-graph isomorphism
CN112528108A (en) Model training system, gradient aggregation method and device in model training
Cruz et al. A linear logic programming language for concurrent programming over graph structures
US20240028802A1 (en) Automatic low level operator loop generation, parallelization and vectorization for tensor computations
Liu et al. Integrating alternating direction method of multipliers and bush for solving the traffic assignment problem
CN114662662A (en) Method for constructing subgraph and allocating stripe size in subgraph
Han et al. Robot path planning in dynamic environments based on deep reinforcement learning
CN116187463B (en) Quantum measurement mode-to-quantum circuit compiling method and device and electronic equipment
CN116187464B (en) Blind quantum computing processing method and device and electronic equipment
CN116167447B (en) Quantum circuit processing method and device and electronic equipment
CN116187458B (en) Quantum circuit processing method and device and electronic equipment
CN117829242B (en) Model processing method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200331)