CN110941494A

CN110941494A - Deep learning-oriented GPU parallel computing data processing method

Info

Publication number: CN110941494A
Application number: CN201911210933.2A
Authority: CN
Inventors: 吴艳霞; 任宁; 李晓松; 张硕; 王旭
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2019-12-02
Filing date: 2019-12-02
Publication date: 2020-03-31

Abstract

The invention provides a deep learning-oriented data processing method for GPU parallel computing. First, a computation graph is modeled from the input data: (1) operation rules are constructed for the vertices and edges of a directed graph; (2) the execution order of the operations in the graph is defined by topological ordering; (3) parameters are updated by training the model. Next, a tensor life cycle is introduced, and the computation graph is rewritten according to the cost of data operations to obtain an optimal operation strategy. This rewriting mainly comprises the following steps: first, a cost-based computation graph is modeled and the operation function on the CPU is redefined; then, the swap-out operations of the same tensor are fused into a single swap-out operation; finally, a traversal order is obtained using a tensor swap-back strategy based on computation and transmission costs. A computation graph modeling method based on formalization rules is thereby constructed. The invention combines an extensible neural network with the computation graph, which can improve the training speed of the model and the image processing effect.

Description

Deep learning-oriented GPU parallel computing data processing method

Technical Field

The invention relates to a data processing method of GPU parallel computation, in particular to a computation graph modeling method based on formalization rules.

Background

With the rapid development of artificial intelligence, deep neural networks have penetrated many fields of scientific research, and neural network models have become increasingly complex, so GPUs are now widely applied to deep learning. Compared with a CPU, a GPU performs very well at accelerating matrix computation. For example, the FaceNet network model developed by Google reaches a face recognition accuracy of 99.63%, and Optasia, developed by Microsoft, shows high accuracy and performance in relational queries over traffic cameras in large cities. However, complex network models pose significant challenges to GPU processors: with large samples, large-scale parameters, and deep structures, training becomes slower. The invention therefore provides a computation graph modeling technology based on formalization rules. Unlike research on data exchange between GPU memory and external memory, the method provides, on the basis of this technology, a data processing mode that supports fusion: input data with different data structures can be preprocessed according to the model to obtain data of the same size before the model operations are carried out, which improves both the processing speed of the graph and the training speed of the neural network model.

The computation graph is a dynamic graph built from different input data structures. In general, most deep learning projects require preprocessing of the model's training data. In this process, input data of various sizes and structures are cut into data of the same dimension and size and pushed onto a stack, after which the batch processing flow of the model training stage is carried out. For example, in fields such as natural language processing (parse trees) and chemical informatics (molecular graphs), the inputs of the model training process are parse trees and molecular graphs, and the model can complete neural network training computed on the graph structure, which is an optimal way to solve graph problems.

Disclosure of Invention

The invention aims to provide a data processing method for GPU parallel computing facing deep learning, which can improve the transmission of data streams between a CPU and a GPU and enhance the generality of a data transmission model.

The purpose of the invention is realized as follows:

step one, modeling a calculation graph;

step two, introducing a tensor life cycle on the basis of the computation graph modeling;

step three, introducing a cost-based data flow operation method, and rewriting a calculation graph;

and step four, introducing an optimization strategy based on calculation and transmission cost to obtain accurate data transmission.

The present invention may further comprise:

1. the first step specifically comprises the following steps:

(1) calculation chart

Constructing a directed graph G = (V, E, λ, τ) for the input data, where V is the set of vertices in G; E ⊆ V × V is the set of edges in G; λ: V → (O, Bool) is a function that maps each vertex to an operation o ∈ O and to a Boolean value indicating whether the operation is parameterized; and τ: E → (D, ACT) is a function that maps each edge to a data type D and an action in ACT, where ACT = {"read", "update", "control"};

(2) topological ordering

Given a computation graph G = (V, E, λ, τ), let N be the number of vertices in the graph. A topological ordering is a mapping from vertices to integers, γ: V → {0, 1, ..., N-1}, satisfying the following requirements

·

·

The topological ordering represents the order of execution of the operations in the graph: given two operations u and v, if γ(u) < γ(v), then u is executed before v; if γ(u) = γ(v), then u and v are executed in parallel; the order of a variable is always 0, which means the variable is executed first and its incoming edges do not change the execution order; subsequent executions of a variable depend on its incoming operations and are independent of the variable's order, except that these executions do not trigger the outgoing operations of the variable;

(3) updating parameters

Training a neural network involves minimizing an objective function f to measure the gap between the predicted result and the actual value, the objective function being a combination of a plurality of functions having learnable parameters, a gradient descent algorithm being used to minimize the objective function; the optimization is an iterative process for updating learnable parameters based on minimizing an objective function, wherein the training iteration comprises three stages: calculating an objective function in a forward stage, calculating a backward gradient, and updating a learnable parameter by using the gradient; at the start of the iteration, the tensors other than the learnable parameter are unloaded, the input tensors are updated and the iterative process is triggered.

2. The second step specifically comprises:

each tensor is assigned a reference counter, namely the number of operations that use it; each time an operation finishes using the tensor, the reference count is decremented by 1, and when the reference count reaches zero, the memory space of the tensor is released; the lifetime of a tensor lasts from the beginning of the operation that generates it to the end of the last operation that uses it; let T_s be the tensor produced by operation u, and let v₁, v₂, ..., v_k be the k operations that use T_s; the lifetime of T_s is calculated as max{γ(v₁), γ(v₂), ..., γ(v_k)} - γ(u);

The modified tensor lifetime is calculated as max{Time_all(v₁), Time_all(v₂), ..., Time_all(v_k)} - Time_all(u), where Time_all(v) represents the total time the program has spent when execution reaches node v.

3. The third step specifically comprises:

(1) data flow computation graph modeling method based on cost

Let (f₁, f₂) be an edge with τ(f₁, f₂) = (D, _), where f₁ and f₂ are executed on the GPU. The computation of the edge is expressed as

f₂^G(f₁^G(x))

where the superscript G represents operations on the GPU. This computation is rewritten as:

f₂^G(id^C(f₁^G(x)))

The function id in this equation is a swap-out operation, where the superscript C stands for the CPU and id is the identity function, i.e., id(x) = x, so the tensor value is unchanged by swapping out and swapping in. Because id is executed on the CPU, the output tensor of f₁ is swapped out to CPU memory immediately after f₁ finishes, releasing GPU memory, and when f₂ is triggered, the output tensor of id is swapped into the GPU and input to f₂;

the graph is rewritten using this formula;

(2) function redefinition

Rewriting the id function as a composition of two functions, namely a function and its inverse, with f chosen as the function for id:

formula (II)

The rewrite is:

by using a pair of encoding and decoding functions instead of id,

in the rewritten equation, the swap-back operation id₂ switches the tensor back to the device.

4. The computation graph rewrite conditions are as follows: a threshold α is defined, and the tensor of u is swapped out to the CPU if the following conditions are all met:

α≤γ(v)-γ(u)

α≥n←trans(u)

α′≤γ(u)-γ(u_fwd)

where n ← trans(u) denotes the total time cost of swapping out, converted into the number n of nodes that fall within that time interval according to the computation times estimated in the pre-run; α′ is the number of nodes between the current node u and the previously rewritten node; the value of α′ is adjusted according to the execution of the current deep learning model, and the optimal value is obtained through training.

5. The fourth step specifically comprises:

performing performance analysis of a deep learning method by determining the time cost of data transmission, and determining the time of theoretical node calculation; determining actual calculation time through pre-operation of an algorithm model to realize calculation graph rewriting, wherein the strategy of realizing swapping out and swapping back of a memory is to add control edges between nodes and determine the weight of data transmission on the edges;

and adding the control edge of the exchange strategy into the nodes of the computational graph for triggering the exchange operation.

Traditional computation graphs generalize poorly, whereas applying the computation graph to a CPU-GPU heterogeneous environment can greatly improve the training speed of deep learning models. Computation graph modeling technology for CPU-GPU heterogeneous environments is therefore of significant research value. Compared with the traditional computation graph, the modeling technology based on formalization rules redefines how the computation graph is modeled, and then optimizes the model using swap-out and swap-in operations derived from the existing graph and rules. The technology improves the adaptability of deep neural network structures, improves the running speed of the CPU and the GPU, and gives good results for image batch processing.

The invention provides a calculation graph modeling method based on formalization rules, which combines an extensible neural network with calculation graph reconstruction, can effectively improve the transmission of data streams between a CPU and a GPU, enhances the generality of a data transmission model, and has certain research and use values.

Drawings

FIG. 1 is an indexing flow diagram;

FIG. 2 shows computation graph topologies;

FIG. 3 is a diagram of a backward calculation process;

FIG. 4 is a calculation graph rewrite example;

FIG. 5 shows Table 1, the symbolic representations of the vertices and edges of the computation graph;

FIG. 6 is a flow chart of the present invention.

Detailed Description

The invention mainly comprises the following contents:

1. modeling computational graphs

(1) Calculation chart

Constructing a directed graph G = (V, E, λ, τ) for the input data, where V is the set of vertices in G; E ⊆ V × V is the set of edges in G; λ: V → (O, Bool) is a function that maps each vertex to an operation o ∈ O and to a Boolean value indicating whether the operation is parameterized; and τ: E → (D, ACT) is a function that maps each edge to a data type D and an action in ACT, where ACT = {"read", "update", "control"}.
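As an illustration only, the tuple G = (V, E, λ, τ) can be held in a small data structure. The sketch below is a minimal Python rendering under stated assumptions: the class name ComputationGraph, its method names, and the string data types are hypothetical and are not taken from the invention.

```python
from dataclasses import dataclass, field

ACTIONS = {"read", "update", "control"}  # the set ACT from the definition


@dataclass
class ComputationGraph:
    """Hypothetical container for G = (V, E, lambda, tau)."""
    ops: dict = field(default_factory=dict)    # lambda: vertex -> (operation, is_parameterized)
    edges: dict = field(default_factory=dict)  # tau: (u, v) -> (data_type, action)

    def add_op(self, v, operation, parameterized=False):
        self.ops[v] = (operation, parameterized)

    def add_edge(self, u, v, dtype, action="read"):
        if u not in self.ops or v not in self.ops:
            raise KeyError("both endpoints must be vertices of the graph")
        if action not in ACTIONS:
            raise ValueError(f"unknown action {action!r}")
        self.edges[(u, v)] = (dtype, action)
```

Keeping λ and τ as dictionaries keyed by vertex and by edge mirrors the definition directly; a production framework would use its own graph classes.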

(2) Topological ordering

·

·

The topological ordering represents the order of execution of the operations in the graph. Given two operations u and v, if γ(u) < γ(v), then u is executed before v. If γ(u) = γ(v), then u and v are executed in parallel. The order of a variable is always 0, which means the variable is executed first and its incoming edges (the "update" edges) do not change the execution order. Subsequent executions of a variable depend on its incoming operations and are independent of the variable's order, except that these executions do not trigger the outgoing operations of the variable.
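The ordering described above can be sketched in Python. The two formal requirements on γ are not reproduced in this text, so the sketch assumes a common choice: a variable has order 0 and ignores its incoming "update" edges, and every other vertex has order one greater than the largest order among its predecessors. The function name topological_levels is hypothetical.

```python
from collections import defaultdict


def topological_levels(vertices, edges, variables=()):
    """Assign each vertex an integer order; equal orders may run in parallel.

    Assumption (not from the source): order(v) = 1 + max(order of predecessors),
    and variables are pinned to order 0.
    """
    variables = set(variables)
    preds = defaultdict(list)
    for u, v in edges:
        if v not in variables:  # "update" edges do not reorder a variable
            preds[v].append(u)
    level = {}

    def visit(v, stack=()):
        if v in level:
            return level[v]
        if v in stack:
            raise ValueError(f"cycle through {v!r}")
        parents = preds[v]
        if parents:
            level[v] = 1 + max(visit(u, stack + (v,)) for u in parents)
        else:
            level[v] = 0
        return level[v]

    for v in vertices:
        visit(v)
    return level
```

For a graph with edges x → h, h → f and an update edge f → x, where x is a variable, this gives orders 0, 1 and 2 for x, h and f.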

(3) Updating parameters

Training the neural network involves minimizing an objective function f that measures the gap between the predicted result and the actual value. The objective function is a combination of multiple functions with learnable parameters, and a gradient descent algorithm is used to minimize it. The optimization is an iterative process that updates the learnable parameters so as to minimize the objective function, and each training iteration comprises three stages: a forward stage that computes the objective function, a backward stage that computes the gradients, and an update stage that updates the learnable parameters using the gradients. At the start of an iteration, the tensors other than the learnable parameters are unloaded, the input tensors are updated, and the iterative process is triggered.
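The three stages of one training iteration can be illustrated with a deliberately tiny example. The sketch assumes a one-parameter linear model with a squared-error objective; the function name train_step, the learning rate, and the data values are illustrative choices, not part of the invention.

```python
def train_step(w, x, y, lr=0.1):
    """One training iteration on the toy objective (w * x - y) ** 2."""
    # Forward stage: compute the prediction and the objective value.
    pred = w * x
    loss = (pred - y) ** 2
    # Backward stage: gradient of the objective with respect to w.
    grad = 2.0 * (pred - y) * x
    # Update stage: gradient descent step on the learnable parameter.
    return w - lr * grad, loss


w = 0.0
losses = []
for _ in range(20):
    w, loss = train_step(w, x=1.0, y=3.0)
    losses.append(loss)
```

With these values the error in w shrinks by a constant factor on every iteration, so the recorded losses decrease and w approaches 3.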

2. Introducing a tensor life cycle on the basis of the computation graph modeling.

If a tensor is no longer used in TensorFlow, its data is released by TensorFlow garbage collection. To this end, each tensor is assigned a reference counter, namely the number of operations that use it. Each time an operation finishes using the tensor, its reference count is decremented by 1. When the reference count reaches zero, the memory space of the tensor is released. Correspondingly, the lifetime of a tensor lasts from the beginning of the operation that generates it to the end of the last operation that uses it. For example, let T_s be the tensor produced by operation u, and let v₁, v₂, ..., v_k be the k operations that use T_s. The lifetime of T_s is calculated as max{γ(v₁), γ(v₂), ..., γ(v_k)} - γ(u).
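The reference-counting rule described above can be sketched as follows. This is a toy model only: the class TensorPool and its methods are hypothetical and do not correspond to any TensorFlow API.

```python
class TensorPool:
    """Toy reference-counted tensor store (hypothetical, for illustration)."""

    def __init__(self):
        self.refcount = {}
        self.storage = {}

    def produce(self, name, value, num_consumers):
        # The counter starts at the number of operations that will use the tensor.
        self.storage[name] = value
        self.refcount[name] = num_consumers

    def consume(self, name):
        value = self.storage[name]
        self.refcount[name] -= 1
        if self.refcount[name] == 0:  # last consumer finished: release memory
            del self.storage[name]
            del self.refcount[name]
        return value
```

A tensor registered with two consumers stays in storage after the first consume call and is released by the second.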

The modified tensor lifetime is calculated as max{Time_all(v₁), Time_all(v₂), ..., Time_all(v_k)} - Time_all(u), where Time_all(v) represents the total time the program has spent when execution reaches node v. This calculation is more accurate and efficient than the traditional one.
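Both lifetime formulas are simple enough to evaluate directly. The sketch below assumes that γ and Time_all are available as dictionaries keyed by operation name; the function names and the sample values are hypothetical.

```python
def lifetime_by_order(gamma, producer, consumers):
    """max{gamma(v_1), ..., gamma(v_k)} - gamma(u), in topological steps."""
    return max(gamma[v] for v in consumers) - gamma[producer]


def lifetime_by_time(time_all, producer, consumers):
    """max{Time_all(v_1), ..., Time_all(v_k)} - Time_all(u), in time units."""
    return max(time_all[v] for v in consumers) - time_all[producer]


gamma = {"u": 1, "v1": 2, "v2": 5}
time_all = {"u": 0.5, "v1": 0.75, "v2": 3.0}
order_lifetime = lifetime_by_order(gamma, "u", ["v1", "v2"])    # 5 - 1
time_lifetime = lifetime_by_time(time_all, "u", ["v1", "v2"])   # 3.0 - 0.5
```

The time-based variant distinguishes two tensors with the same step distance when the operations between them have very different run times.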

3. On the basis of the computation graph modeling, a cost-based data flow operation method is introduced and the computation graph is rewritten.

The 'long-lived' tensors in the GPU are temporarily sent to the CPU and swapped back into the GPU when needed, which releases GPU memory and accelerates the running speed of the CPU.

(1) Data flow computation graph modeling method based on cost

Suppose (f₁, f₂) is an edge with τ(f₁, f₂) = (D, _), where f₁ and f₂ are executed on the GPU. The computation of the edge can be expressed as

f₂^G(f₁^G(x))

where the superscript G represents operations on the GPU. This computation is rewritten as:

f₂^G(id^C(f₁^G(x)))

The function id in the above equation is a swap-out operation, where the superscript C stands for the CPU and id is the identity function, i.e., id(x) = x, so the tensor value is unchanged by swapping out and swapping in. Because id is executed on the CPU, the output tensor of f₁ is swapped out to CPU memory immediately after f₁ finishes, releasing GPU memory. When f₂ is triggered, the output tensor of id is swapped into the GPU and input to f₂.
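The edge rewrite can be illustrated on a plain edge list. The sketch assumes the graph is given as a list of (source, target) pairs plus a device map; the function insert_swap and the naming scheme for the inserted vertex are hypothetical.

```python
def identity(x):
    """The swap operation leaves the tensor value unchanged: id(x) = x."""
    return x


def insert_swap(edges, device, f1, f2):
    """Replace edge (f1, f2) by f1 -> id -> f2, with id placed on the CPU."""
    if (f1, f2) not in edges:
        raise KeyError((f1, f2))
    swap = f"id_{f1}_{f2}"  # hypothetical name for the inserted vertex
    new_edges = [e for e in edges if e != (f1, f2)]
    new_edges += [(f1, swap), (swap, f2)]
    new_device = dict(device)  # copy so the input map is not modified
    new_device[swap] = "CPU"
    return new_edges, new_device, swap


edges = [("f1", "f2"), ("f1", "f3")]
device = {"f1": "GPU", "f2": "GPU", "f3": "GPU"}
new_edges, new_device, swap = insert_swap(edges, device, "f1", "f2")
```

Because the inserted vertex computes the identity, the rewritten graph produces the same values as the original one; only the placement of the intermediate tensor changes.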

Using this formula, the graph can be rewritten to reduce GPU memory consumption. However, given the limited bandwidth of the system, continuously swapping tensors out would increase the load on the platform. A strategy of swapping tensors out at intervals is therefore proposed. In addition, for an edge (u, v) where v is executed immediately after u, no tensor exchange is needed. The rewrite conditions of the computation graph based on data flow cost are described next: a threshold α is defined, and the tensor of u can be swapped out to the CPU if the following conditions are all satisfied.

α≤γ(v)-γ(u)

α≥n←trans(u)

α′≤γ(u)-γ(u_fwd)

Here n ← trans(u) denotes the total time cost of swapping out, converted into the number n of nodes that fall within that time interval according to the computation times estimated in the pre-run; α′ is the number of nodes between the current node u and the previously rewritten node. The value of α′ needs to be adjusted appropriately according to the execution of the current deep learning model, and the optimal value is then obtained through training.
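The three conditions can be checked mechanically once n is known. The sketch below is one possible reading, stated as an assumption: n is taken to be the number of subsequent nodes whose pre-run times add up to at least the transfer time of the tensor. The function names and parameters are hypothetical.

```python
def nodes_covering(transfer_time, node_times):
    """Count the following nodes needed for their total run time to cover transfer_time."""
    elapsed, n = 0.0, 0
    for t in node_times:
        if elapsed >= transfer_time:
            break
        elapsed += t
        n += 1
    return n


def should_swap_out(gamma_u, gamma_v, gamma_prev, alpha, alpha_prime, n):
    """Check the three rewrite conditions for the tensor produced by u.

    gamma_prev is the order of the previously rewritten node (u_fwd in the text).
    """
    return (alpha <= gamma_v - gamma_u               # consumer is far enough away
            and alpha >= n                           # transfer fits inside the interval
            and alpha_prime <= gamma_u - gamma_prev) # spacing between rewritten nodes


n = nodes_covering(2.5, [1.0, 1.0, 1.0, 1.0])
decision = should_swap_out(gamma_u=2, gamma_v=10, gamma_prev=0,
                           alpha=4, alpha_prime=2, n=n)
```

In this example three nodes are enough to cover the transfer, and all three conditions hold, so the tensor would be swapped out.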

(2) Function redefinition

Since the swap-in and swap-out operations cannot be represented by the same function, and the tensor must be swapped back as early as possible, additional operations need to be added. The id function is rewritten as a composition of two functions, a function and its inverse. f is chosen as the function for id:

formula (II)

The rewrite is:

by using a pair of encoding and decoding functions instead of an id, memory consumption on the CPU is reduced.

In the rewritten equation, the swap-back operation id₂ switches the tensor back to the device. Moreover, id₂ must be triggered with high priority; otherwise id₂ would be executed after id₁, so a control edge from some operation to id₂ needs to be added. An optimization strategy based on computation and transmission costs is therefore proposed next.

4. An optimization strategy based on computation and transmission costs is introduced to obtain a more accurate data transmission mode.

(1) Fusion operation strategy

When a large number of swap operations move the same tensors repeatedly and consume bandwidth, multiple swap operations can be merged into one if the tensor is large and the interval between the operations is short. The tensor is then swapped back only once and resides in GPU memory for reuse by the consuming operations.

(2) Swap-in strategy based on operation transmission cost

The performance of the deep learning method is analyzed by determining the time cost of data transmission and the theoretical computation time of the nodes. The actual computation time is determined by pre-running the algorithm model, and the computation graph is rewritten accordingly. The key to realizing the swap-out and swap-back strategy for memory is to add control edges between nodes, which requires accurately determining the weight of data transmission on the edges and designing an efficient control-edge addition algorithm.

The control edges of the exchange strategy are added to the nodes of the computation graph to trigger the swap operations, so as to reduce the overhead of swap communication while ensuring the equivalence of the computation graph.

In the rewritten formula, the control operation for the swap-back operation id₂ must be selected from a set of operations V_c, where for all v ∈ V_c, γ(id₁) < γ(v) < γ(f₂), to ensure the correctness of the computation graph. Let k = γ(f₂) - γ(v) be the distance between f₂ and v. If k is too large, the tensor is swapped back too early and is stored in the device for a long time until it is used by f₂.

The method is based on static theoretical analysis and actual pre-run data, and introduces two parameters, a lower limit σ_l and an upper limit σ_u, which define the condition under which a dependent edge is added to a node.

1) Determining the upper and lower limits of the node search.

Define the cost of swapping the tensor of node f_i out and back as trans(f_i), where f_i is defined as the backward operation and f₁ satisfies the swap-out condition. Searching from the backward node toward the forward node, the number of compute nodes whose execution can cover swapping the data back to the GPU is determined from the actual node run times measured in the pre-run; this number is the lower limit σ_l:

In the above equation, the time term is the pre-run time of a node and n_exe is the total number of nodes that can be selected in the pre-run state. The upper limit σ_u can be obtained as follows:

Here time_theory represents the estimated run time of a node, and n_theory represents the total number of nodes that can be selected under theoretical analysis of the evaluation model.

2) The maximum swap back data volume on the dependent node is defined as the maximum throughput of the system.

A function Thr(f) represents the amount of dependent data already added on node f, and Through_max represents the maximum throughput. If a swap-back dependency is to be added to the current node, the node must satisfy:

Thr(f) < Through_max

Under the current system load, the system throughput is used as the constraint when a swap-back dependency is added: the amount of data transmitted from the CPU to the GPU must not exceed the maximum throughput of the system, which ensures stable performance.

3) Modeling process:

FIG. 1 depicts the index flow chart. First, the starting node and the upper and lower limits of the computation graph are input and the starting node is acquired; then the nodes within the range of the upper and lower limits are found and the accessible edges are determined; finally, a cost function is used to judge whether a node satisfies the condition for adding a dependency. Depending on the search direction, that is, on whether the search approaches the upper or the lower limit, there are two implementation strategies, backward and forward. For example, for an edge (f₁, f₂) rewritten with a swap-out operation, let the swap-back operation be s_i:

For the swap-back operation s_i, the first strategy is a backward topological query. Between the target operation f₂ and the rewritten operation f₁, the topological ordering is used directly to obtain a set of candidate nodes for the control operation, with the lower and upper limits taken relative to f₂. A candidate node is an operation whose distance to f₂ lies within the range σ_l to σ_u, from which there is a path to f₂, and for which the amount of dependent data currently added does not exceed the maximum throughput of the system per second. Once an operation meeting these conditions is found, the algorithm stops, and the control node it returns is the operation that triggers the swap-back at the latest possible time.
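The backward query can be sketched as a scan over topological distances, starting with the operations closest to f₂ so that the first hit triggers the swap-back as late as possible. The sketch assumes the graph is an adjacency dictionary, γ is a dictionary of orders, and throughput is measured in bytes; all function and parameter names are hypothetical.

```python
def reaches(succ, src, dst):
    """Depth-first reachability test on an adjacency dictionary."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(succ.get(node, ()))
    return False


def backward_search(succ, gamma, f1, f2, sigma_l, sigma_u,
                    thr, tensor_bytes, through_max):
    """Return the latest valid control operation for swapping back before f2."""
    for k in range(sigma_l, sigma_u + 1):  # distance to f2, nearest first
        for v in sorted(gamma):
            if gamma[f2] - gamma[v] != k or gamma[v] <= gamma[f1]:
                continue
            fits = thr.get(v, 0) + tensor_bytes < through_max  # throughput limit
            if fits and reaches(succ, v, f2):
                return v
    return None


succ = {"f1": ["a"], "a": ["b"], "b": ["c"], "c": ["f2"]}
gamma = {"f1": 0, "a": 1, "b": 2, "c": 3, "f2": 4}
```

On this chain, with σ_l = 2 and σ_u = 3, the search returns b; if b already carries too much dependent data, it falls back to a.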

The second strategy is a forward search. The chain-rule strategy starts from the source operation f₁ and searches along the computation direction to find corresponding backward operations as candidates for the control operation. Breadth-first search is used to traverse the operations of the forward phase, with the lower and upper limits taken relative to f₁. For the breadth-first search, the invention maintains two open sets s₁ and s₂ and a closed set s_c: s₁ contains the current forward operations, s₂ contains the forward operations of the next level (the outgoing operations of all operations in s₁), and s_c contains the operations already visited. Starting from f₁, if the algorithm is within the range σ_l to σ_u, the outgoing backward operations of the current operation are checked for validity. If there is a valid operation, the current node is a candidate node and the algorithm returns it. Otherwise, the algorithm enters the next loop.
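The forward strategy can be sketched as a level-by-level breadth-first search with the two open sets and the closed set named in the text. The sketch makes several assumptions: the search depth from f₁ is used as the distance compared with σ_l and σ_u, the validity test is supplied by the caller, and the function returns the backward operation that was found. All names are hypothetical.

```python
def forward_search(succ, is_backward, is_valid, f1, sigma_l, sigma_u):
    """Breadth-first search from f1 for a valid outgoing backward operation."""
    s1, s_c = {f1}, set()  # current open set and closed (visited) set
    depth = 0
    while s1 and depth <= sigma_u:
        s2 = set()         # open set for the next level
        for op in sorted(s1):
            if op in s_c:
                continue
            s_c.add(op)
            outs = succ.get(op, ())
            if depth >= sigma_l:
                for b in outs:
                    if is_backward(b) and is_valid(b):
                        return b
            s2.update(o for o in outs if not is_backward(o) and o not in s_c)
        s1, depth = s2, depth + 1
    return None


# Forward chain f1 -> f2 -> f3; each forward op also feeds a backward op b1..b3.
succ = {"f1": ["f2", "b1"], "f2": ["f3", "b2"], "f3": ["b3"]}
is_bwd = lambda name: name.startswith("b")
```

With σ_l = 1 and σ_u = 2 the search skips b1 (depth 0 is below the lower limit) and returns b2; if b2 is not valid, it continues to the next level and returns b3.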

Finally, the proposed cost-based data flow exchange method is optimized through the memory fusion strategy and the forward and backward search modes to obtain the final computation graph model.

The invention is described in more detail below by way of example.

The invention is a computation graph modeling technology based on formalization rules. It first models the data flow computation graph, updates the learnable parameters, and introduces the tensor life cycle. It then optimizes the model by rewriting the computation graph and applying the derived operation strategies. The steps are described in detail as follows:

1. modeling computational graphs

(1) Calculation chart

Table 1, shown in FIG. 5, lists the symbolic representations of the different vertices and edges of the computation graph. The combination of functions is expressed as

From the definition

The function "_i" takes the i-th element of a tuple; e.g., (a, b)_2 returns b. An operation in the computation graph is triggered to execute when all of its incoming edges have data. After the operation completes, data is generated on its outgoing edges, and the next operations are then triggered in the same way. The process ends when all reachable operations have been executed and all reachable edges are filled with data. At that point, every reachable operation other than a variable has been executed once. Variables are introduced because, at the start of a computation, the execution of the graph cannot be triggered if no value is set on any edge. Moreover, the computation graph is acyclic, and operations without incoming edges could never be triggered; introducing variables solves this problem. Variables in the computation graph are used to store learnable parameters and input and output data, thereby triggering the computation of the graph. This special use of variables is an important feature that distinguishes deep learning computation graphs from general graphs.

(2) Topological ordering

The graph in FIG. 2(a) has the following execution order: "x → h → f → x". First, the user initializes the variable x, which triggers operation h; h is executed and triggers operation f; x is then updated with the output of f, and the computation finishes. The execution of operation h depends only on x, but the update of x cannot trigger h again.

The example graph in FIG. 2(b) has two possible execution orders: "x → h → f → x → f₁" or "x → h → f₁ → x". Because both tensors T₁ and T₂ can be used to trigger operation f₁, f₁ must be executed after f and x. However, x is executed multiple times, so the key step is to determine which output of x is used as the input of f₁. To avoid ambiguity, the method adopts the following conventions for variables:

the operation always uses the most recently updated value of the variable.

In operations using the same tensor, a variable always has the highest execution priority.

This convention ensures that f₁ is executed after x has been updated with the output of f. The order of execution of operations depends not only on the availability of data on the incoming edges but also on whether the "control" edges carry data. "Control" edges are not inputs to operations; they are used to control the order in which the operations of the computation graph are executed. Adding a "control" edge to the computation graph changes the topological ordering of the graph. For example, if (u, v) is a "control" edge and γ(u) < γ(v), then v must be executed after u. From this definition it follows that variables have no control edges.

The role of the "control" edge is shown in FIG. 2(c). A new operation f₂ is added, which uses the output of f, performs a computation, and updates the variable x. Without a control edge from f₂ to f₁, f₁ and f₂ could be executed in parallel after f is executed, because they do not depend on each other. However, both access the variable x (f₁ reads x and f₂ writes x), so a control edge is needed to ensure that they access x in sequence. The "control" edge from f₂ to f₁ means that f₁ is executed only after f₂ has completed and updated x.

(3) Updating parameters

The process of updating learnable parameters (represented by variables) during training is shown in FIG. 3. In the forward phase, the variable x_i is an input of the function f_i, the output of f_i is used by the subsequent functions, and the loss value is finally produced by the objective function f. In the backward phase, the method computes the gradient of f with respect to the learnable parameters, which requires, for each f_i, a function

to compute the gradient of f with respect to x_i. Finally, in the update phase, the function Ux_i is used to update x_i.

With the modified tensor life cycle, the lifetime is calculated as max{Time_all(v₁), Time_all(v₂), ..., Time_all(v_k)} - Time_all(u), where Time_all(v) represents the total time the program has spent when execution reaches node v. This calculation is more accurate and efficient than the traditional method.

(1) Data flow computation graph modeling method based on cost

A computation graph rewrite example is illustrated in FIG. 4. The thick edges in (a) are rewritten to generate (b). The integers above or below a vertex give the execution order of that vertex in the topological ordering. The output of vertex f₁ is swapped out, so the edges from f₁ to f₂ and f₃ are rewritten. In (c), f_i and f_j are control-dependency operations that trigger the swap operations id₃ and id₄ respectively. To store the tensor in CPU memory, a derived operation automatically sends the tensor to the CPU and brings it back to the GPU when it is used.

For an edge (f₁, f₂) with τ(f₁, f₂) = (D, _), where f₁ and f₂ are executed on the GPU, the computation of the edge can be expressed as

f₂^G(f₁^G(x))

where the superscript G represents operations on the GPU. This computation is rewritten as:

f₂^G(id^C(f₁^G(x))) (2-2)

Because id is executed on the CPU, the output tensor of f₁ is swapped out to CPU memory immediately after f₁ finishes, releasing GPU memory. When f₂ is triggered, the output tensor of id is swapped into the GPU and input to f₂.

The graph is rewritten using equation (2-2) to reduce GPU memory consumption. To reduce the consumption of system bandwidth, a strategy of swapping tensors out at intervals is adopted. For an edge (u, v) where v is executed immediately after u, no tensor exchange is needed. The rewrite conditions of the computation graph based on data flow cost are as follows: a threshold α is defined, and the tensor of u is swapped out to the CPU if the following conditions are all met.

α≤γ(v)-γ(u)

α≥n←trans(u) (2-3)

α′≤γ(u)-γ(u_fwd)

Here n ← trans(u) denotes the total time cost of the swap, converted into the number n of nodes that fall within that time interval according to the computation times estimated in the pre-run; α′ is the number of nodes between the current node u and the previously rewritten node. The value of α′ needs to be adjusted appropriately according to the execution of the current deep learning model, and the optimal value is obtained through training.

(2) Function redefinition

As shown in FIG. 4, the swap-in and swap-out operations cannot be represented by the same function, and the tensor must be swapped back as early as possible, so additional operations need to be added. The id function is rewritten as a composition of two functions, a function and its inverse. f is chosen as the function for id:

equation 2-2 is rewritten as:

by using a pair of encoding and decoding functions instead of id, memory consumption on the CPU is reduced.

In equation (2-6), the swap-back operation id₂ switches the tensor back to the device. Moreover, id₂ must be triggered with high priority; otherwise id₂ would be executed after id₁, so a control edge from some operation to id₂ needs to be added. Optimization strategies based on computation and transmission costs are therefore proposed.

(1) Fusion operation strategy

When the tensor is large and the interval between operations is short, multiple swap operations are fused into one swap operation. The tensor is swapped back only once and resides in GPU memory for reuse by the consuming operations. For example, in Fig. 4(d), if f₂ and f₃ are close and the transmitted tensor is large, id₃ and id₄ are fused into a single swap-back operation. To determine whether two operations are close, the invention defines a threshold β on the distance between them, which represents the proximity of the two operations.
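The fusion rule can be illustrated as a grouping of consumer operations by distance. This is a sketch under stated assumptions: the function name, the size threshold `min_bytes`, and the choice to measure distance between consecutive consumers are all hypothetical and not taken from the patent.

```python
def fuse_swap_ins(consumer_positions, beta, tensor_bytes, min_bytes):
    """Group the consumers of one swapped-out tensor into fused swap-back operations.

    consumer_positions : topological indices of the operations that consume the tensor
    beta               : proximity threshold between two consuming operations
    tensor_bytes       : size of the tensor
    min_bytes          : assumed size above which a tensor counts as "large"
    Each returned group shares a single swap-back operation.
    """
    positions = sorted(consumer_positions)
    if tensor_bytes < min_bytes:
        # Small tensor: keep one swap-back per consumer.
        return [[p] for p in positions]
    groups = []
    for p in positions:
        if groups and p - groups[-1][-1] <= beta:
            groups[-1].append(p)   # close to the previous consumer: reuse its swap-back
        else:
            groups.append([p])     # too far away: start a new swap-back
    return groups
```

With consumers at positions 50, 52 and 90 and β = 5, a large tensor yields two swap-back operations, one shared by the consumers at 50 and 52 and one for the consumer at 90.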

(2) Swap-in strategy based on operation transmission cost

Based on the above, the time cost of data transmission is determined, and the theoretical computation time of each node is determined through performance analysis of the deep learning method. The computation graph is rewritten according to the actual computation time of the model and of its operations. The key to realizing the memory swap-out and swap-back strategy is to add control edges between nodes, to determine the data transmission weight on each edge as accurately as possible, and to design a reasonable algorithm for adding control variables. The control edges of the exchange strategy are added to the nodes of the computation graph to trigger the swap operations, so that the overhead of swap communication is reduced while the equivalence of the computation graph is preserved.

In Equation 2-6, the operation that controls when id₂ is triggered must be selected from a set of operations V_c, where for all v ∈ V_c, γ(id₁)＜γ(v)＜γ(f₂), to ensure the correctness of the computation graph. Let k = γ(f₂) - γ(v) be the distance between f₂ and v. If k is too large, the tensor is swapped back too early and remains on the device for a long time before f₂ uses it.
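The candidate set V_c and the distance k can be computed directly from the topological indices. The sketch below assumes that γ is available as a dictionary from operation name to index; the function and variable names are hypothetical.

```python
def control_candidates(gamma, id1, f2):
    """Return {v: k} for every operation v with gamma[id1] < gamma[v] < gamma[f2].

    gamma : dict mapping operation name to topological index (assumed input)
    k     : gamma[f2] - gamma[v], the distance between f2 and v
    """
    low, high = gamma[id1], gamma[f2]
    return {v: high - g for v, g in gamma.items() if low < g < high}
```

A smaller k means the swap-back is triggered closer to f₂, so the tensor spends less time resident on the device before it is used.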

The scheme is based on static theoretical analysis and on actual data from a pre-run. It introduces two parameters, a lower limit σ_l and an upper limit σ_u, and defines the conditions under which a dependency edge is added to a node.

1) Determining the upper and lower limits of the node search.

Define the cost of swapping the tensor of node f_i out and back as trans(f_i). The backward operation is denoted f_i, and f₁ satisfies the swap-out condition. Searching from the backward node toward the forward node, the number of computing nodes whose actual pre-run execution time is sufficient to cover swapping the data back to the GPU is determined; this number is the lower limit σ_l:

In Equation 3-1, the time term is the pre-run time of the node, and n_exe is the total number of nodes that can be selected in the pre-run state. The upper limit σ_u is obtained as follows:

In Equation 3-2, time_theory is the estimated run time of the node, and n_theory is the total number of nodes that can be selected under theoretical analysis of the evaluation model.

Let the function Thr(f) denote the amount of data whose swap-back dependency is to be added at node f, and let Through_max denote the maximum throughput. If a swap-back dependency is added at the current node, the node must satisfy:

Thr(f)＜Through_max (3-3)

Subject to the current system load, system throughput is used as the criterion for adding swap-back dependencies: the amount of data transferred from the CPU to the GPU must not exceed the maximum throughput of the system, which keeps performance stable.

3) Modeling process:

For an edge (f₁, f₂) to which a swap-out operation is applied, the swap-back operation is denoted s_l:

For the swap-back operation s_l, the first strategy is a backward topological search. Between the target operation f₂ and the rewritten operation f₁, topological ordering is used directly to obtain the set of candidate nodes for the control operation; the lower and upper bounds are relative to f₂. Algorithm 1 implements this strategy. A candidate node is an operation whose distance to f₂ lies in the range σ_l to σ_u (line 7), from which there is a path to f₂, and at which the amount of data whose dependency is currently added does not exceed the maximum throughput per second of the system. The algorithm stops as soon as an operation satisfying these conditions is found (lines 10-13), and the control node it returns is the operation that triggers the data swap-back at the latest point. Finally, the traversal order of the directed graph is output. The specific implementation process is described as follows:
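The backward search can be sketched in Python as follows. This is an illustrative reconstruction from the prose, not the patent's Algorithm 1 listing, so the line numbers cited in the text do not apply to it. The inputs (`order`, `succ`, `thr`, `through_max`) and both function names are hypothetical, and σ_l is assumed to be at least 1.

```python
def has_path(succ, src, dst):
    """Depth-first reachability check in a graph given as {node: [successors]}."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(succ.get(node, ()))
    return False


def backward_search(order, succ, f1, f2, sigma_l, sigma_u, thr, through_max):
    """Find the control operation for the swap-back, searching backward from f2.

    order       : list of operations in topological order
    succ        : adjacency dict {node: [successors]}
    sigma_l/u   : lower and upper distance bounds, relative to f2
    thr         : dict of dependent data amounts per node (Thr(f))
    through_max : maximum throughput (Through_max)
    """
    pos = {op: i for i, op in enumerate(order)}
    for k in range(sigma_l, sigma_u + 1):     # smallest k first: latest possible trigger
        idx = pos[f2] - k
        if idx <= pos[f1]:                    # candidates must lie strictly after f1
            break
        v = order[idx]
        if has_path(succ, v, f2) and thr.get(v, 0) < through_max:
            return v                          # stop at the first valid candidate
    return None
```

Because the loop starts at the lower bound, the first valid candidate is also the one closest to f₂, which matches the stated goal of triggering the swap-back as late as possible.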

The second is a forward search strategy based on the chain rule. Starting from the source operation f₁, the chain-rule strategy searches along the computation direction to find the corresponding backward operations as candidates for the control operation. Breadth-first search is used to traverse the operations of the forward phase, where the lower and upper bounds are relative to f₁. Algorithm 2 shows the procedure. For the breadth-first search, two open sets s₁ and s₂ and one closed set s_c are maintained: s₁ contains the current forward operations, s₂ contains the forward operations of the next level (the outgoing operations of all operations in s₁), and s_c contains the operations already visited. Starting from f₁, if the algorithm is within the range σ_l to σ_u (line 8), it obtains the outgoing backward operations of the current operation (line 9) and checks their validity (lines 10-11). If a valid operation exists, the current node is the candidate node and it is returned. Otherwise, the algorithm enters the next loop. Finally, the traversal order of the directed graph is output (lines 27-30).
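The forward breadth-first search can be sketched in the same way. Again this is an illustrative reconstruction from the prose rather than the patent's Algorithm 2 listing. The inputs `fwd_succ`, `bwd_of` and `is_valid` are hypothetical, the BFS level is assumed to be the distance from f₁, and the sketch returns the first valid backward operation it finds.

```python
def forward_search(fwd_succ, bwd_of, f1, sigma_l, sigma_u, is_valid):
    """Breadth-first search from f1 for a backward operation usable as control operation.

    fwd_succ : {forward op: [next forward ops]}
    bwd_of   : {forward op: [backward ops that take its output]}
    sigma_l/u: lower and upper level bounds, relative to f1
    is_valid : predicate deciding whether a backward op may act as control operation
    """
    s1, s_c = {f1}, set()            # open set of the current level, closed set
    level = 0
    while s1 and level <= sigma_u:
        s2 = set()                   # open set of the next level
        for op in sorted(s1):        # sorted only to make the result deterministic
            if level >= sigma_l:
                valid = [b for b in bwd_of.get(op, ()) if is_valid(b)]
                if valid:
                    return valid[0]
            s2.update(fwd_succ.get(op, ()))
        s_c |= s1                    # mark the current level as visited
        s1 = s2 - s_c
        level += 1
    return None
```

Keeping the closed set s_c prevents an operation that is reachable along several forward paths from being expanded more than once.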

Finally, the established computation graph model is combined with a ResNet network model and trained to obtain the optimal tensors, and data of different structures are then input for testing. The result is further optimized through the memory fusion strategy and the forward and backward search strategies to obtain the final traversal order data set.

Combining the computation graph modeling technique based on formalization rules with a deep learning network model can improve the interpretability and generalizability of the model and improve the results of batch image processing.

Claims

1. A deep learning-oriented GPU parallel computing data processing method is characterized by comprising the following steps:

step one, modeling a calculation graph;

2. The deep learning oriented GPU parallel computing data processing method according to claim 1, wherein the first step specifically comprises:

(1) calculation chart

(2) topological ordering

(3) updating parameters

3. The deep learning-oriented GPU parallel computing data processing method as claimed in claim 1, wherein the second step specifically comprises:

each tensor is assigned a reference count, namely the number of operations that take it as an operand; each time an operation finishes using the tensor, the reference count is decremented by 1, and when the reference count reaches zero, the memory space of the tensor is released; the lifetime of a tensor extends from the operation that produces it to the last operation that uses it; let T_s be the tensor produced by operation u, and let v₁, v₂, ..., v_k be the k operations that use T_s; the lifetime of T_s is calculated as max{γ(v₁), γ(v₂), ..., γ(v_k)} - γ(u);

4. The deep learning-oriented GPU parallel computing data processing method as claimed in claim 1, characterized by comprising the following steps:

(1) data flow computation graph modeling method based on cost

Superscript G represents operations on the GPU, rewriting this calculation as:

equation of

using the formula

Rewriting the graph;

(2) function redefinition

formula (II)

The rewrite is:

by using a pair of encoding and decoding functions instead of id,

in the equation

5. The deep learning-oriented GPU parallel computing data processing method as claimed in claim 4, wherein the condition for rewriting the computation graph is as follows: a threshold α is defined, and the tensor of u is swapped out to the CPU if the following conditions are satisfied simultaneously:

α≤γ(v)-γ(u)

α≥n←trans(u)

α′≤γ(u)-γ(u_fwd)

6. The deep learning-oriented GPU parallel computing data processing method as claimed in claim 1, wherein the fourth step specifically comprises: