CN113254867A - Automatic configuration template generation method and device, server and storage medium - Google Patents

Automatic configuration template generation method and device, server and storage medium

Info

Publication number
CN113254867A
CN113254867A (application CN202110715535.7A; granted as CN113254867B)
Authority
CN
China
Prior art keywords
input
convolution
convolutional
vertices
directed acyclic
Prior art date
Legal status
Granted
Application number
CN202110715535.7A
Other languages
Chinese (zh)
Other versions
CN113254867B (en)
Inventor
张晓扬
肖俊敏
曹连雨
Current Assignee
Hyperai Cloud Technology Beijing Co ltd
Original Assignee
Hyperai Cloud Technology Beijing Co ltd
Priority date
Filing date
Publication date
Application filed by Hyperai Cloud Technology Beijing Co ltd filed Critical Hyperai Cloud Technology Beijing Co ltd
Priority to CN202110715535.7A priority Critical patent/CN113254867B/en
Publication of CN113254867A publication Critical patent/CN113254867A/en
Application granted granted Critical
Publication of CN113254867B publication Critical patent/CN113254867B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations


Abstract

The application relates to an automatic configuration template generation method, apparatus, server and storage medium. The method comprises the following steps: acquiring current input data and the on-chip memory space of a current terminal; determining a lower bound of convolutional communication based on the current input data, the on-chip memory space of the current terminal and a convolution operation rule; determining an optimality condition according to the lower bound of convolutional communication, wherein the optimality condition is used for indicating the data stream stored in the on-chip memory space of the current terminal during the convolution operation; generating a search space according to the optimality condition, wherein the search space comprises at least one configuration template, at least one adjustable parameter is the same in each configuration template, and the at least one adjustable parameter is the adjustable parameter corresponding to the optimality condition; and adjusting the access locality of the convolution operation based on the search space. The method and the device reduce the time cost and hardware cost of searching parameter configurations.

Description

Automatic configuration template generation method and device, server and storage medium
Technical Field
The present application relates to the intersecting fields of deep learning, compilation technology and high-performance computing, and in particular to an automatic configuration template generation method, apparatus, server and storage medium.
Background
Inference optimization for deep learning has attracted wide attention in industry. Since inference terminals are diverse, including embedded CPUs (Central Processing Units), GPUs (Graphics Processing Units), FPGAs (Field-Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits) and the like, and different hardware targets differ in memory organization, computing functional units and so on, the skills required for inference optimization differ across terminals, and performance tuning becomes repetitive work that requires professional engineering experience and consumes a great deal of manpower.
The automatic tuning framework TVM (an end-to-end automatic optimizing compiler for deep learning) commonly adopted in the related art first takes a model from an existing framework as input and converts it into a computation graph representation, after which the system performs high-level dataflow rewriting to generate an optimized graph. Given a rich set of scheduling primitives, TVM finds the optimal operator implementation for each layer of the deep learning model, creates a specialized operator for the specific input shapes and layouts associated with each layer, and selects scheduling optimizations, such as modifying the loop order, optimizing the memory hierarchy, and tuning specific parameters such as the tile size and the loop unrolling factor.
TVM provides graph-level and operator-level optimization: it introduces a tensor expression language to build operators and provide program transformation primitives, selects code-generation choices (e.g., loop tiling and ordering, caching, unrolling) based on a combination of memory access patterns, threading patterns and new hardware primitives, and introduces an automatic program optimization framework to find optimized tensor operators.
In the template-guided search mode of TVM, the search space is defined by manual templates: the user must manually write a template for the computation definition, which defines the structure of the tensor expression using some adjustable parameters (e.g., tile size and unrolling factor), and the compiler then searches for the best values of these parameters for a particular computation graph shape configuration and a particular hardware target.
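For concreteness, such a manual template looks roughly like the following sketch, written against TVM's Python te/autotvm API (the template name and knob names are illustrative, not taken from the patent):

```python
import tvm
from tvm import te, autotvm

@autotvm.template("tutorial/matmul")
def matmul(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)
    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    s = te.create_schedule(C.op)

    # Adjustable parameters of the template: the tile sizes define the search space.
    cfg = autotvm.get_config()
    y, x = s[C].op.axis
    cfg.define_split("tile_y", y, num_outputs=2)
    cfg.define_split("tile_x", x, num_outputs=2)
    yo, yi = cfg["tile_y"].apply(s, C, y)
    xo, xi = cfg["tile_x"].apply(s, C, x)
    s[C].reorder(yo, xo, k, yi, xi)
    return s, [A, B, C]
```

Each define_split knob enumerates tile-size candidates; the tuner then measures configurations drawn from the product of all knobs, which is exactly the parameter search space discussed here.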
With respect to the above-described related art, the inventors consider that the following drawback exists: the search space of parameter configurations is huge, so the time cost and hardware cost of the parameter configuration process are high.
Disclosure of Invention
In order to reduce the time cost and the hardware cost of a search parameter configuration process, the application provides an automatic configuration template generation method, an automatic configuration template generation device, a server and a storage medium.
In a first aspect, the present application provides an automatic configuration template generation method, which adopts the following technical scheme:
an automatic configuration template generation method comprises the following steps:
acquiring current input data and an on-chip memory space of a current terminal;
determining a lower bound of convolutional communication based on the current input data, the on-chip memory space of the current terminal and a convolutional operation rule;
determining an optimality condition according to the lower bound of the convolutional communication, wherein the optimality condition is used for indicating a data stream stored in an on-chip memory space of the current terminal in convolutional operation, and the data stream meets the lower bound of the convolutional communication;
generating a search space according to the optimality condition, wherein the search space comprises at least one configuration template, the configuration template comprises a plurality of adjustable parameters, at least one adjustable parameter is the same in each configuration template, and the at least one adjustable parameter is an adjustable parameter corresponding to the optimality condition;
and adjusting the access locality of the convolution operation based on the search space.
By adopting the technical scheme, the data stream in the convolution operation process is determined based on the lower bound of convolutional communication, and the allocation of storage locations for this data stream forms the optimality condition. At least one adjustable parameter in the configuration template can be determined according to the optimality condition; that is, at least one adjustable parameter in the search space of parameter configurations is fixed and only the other adjustable parameters need to be configured, so the search space of parameter configurations can be reduced.
In a possible implementation manner, the determining a lower bound of convolutional communication based on the current input data, the on-chip memory space of the current terminal, and the convolutional operation rule includes:
generating a convolution directed acyclic graph according to the current input data through a convolution operation rule;
decomposing the convolution directed acyclic graph into at least two convolution input subgraphs;

determining the total number of vertices of the convolution directed acyclic graph;

partitioning the convolution directed acyclic graph into at least one convolution input subset based on the S-partition;
determining an upper bound of the number of vertices based on the current input data, any convolution input subset, all convolution input subgraphs, the relationship between any convolution input subset and all the convolution input subgraphs, the on-chip memory space of the current terminal and a preset algorithm;

wherein the upper bound of the number of vertices is used for representing the number of vertices corresponding to the convolution input subset having the largest number of vertices among all the convolution input subsets;

and determining the lower bound of convolutional communication based on the upper bound of the number of vertices, the total number of vertices of the convolution directed acyclic graph, the red-blue pebble model and the on-chip memory space of the current terminal.
In a possible implementation manner, the determining an upper bound of the number of vertices based on the current input data, any one of the subset of convolutional inputs, all of the sub-graphs of convolutional inputs, a relationship between any one of the subset of convolutional inputs and all of the sub-graphs of convolutional inputs, an on-chip memory space of the current terminal, and a preset algorithm includes:
defining any dominating set of the convolutional input subset as an input dominating set, dividing the input dominating set into a first vertex set and a second vertex set, determining the number of first vertices corresponding to the first vertex set, and determining the number of second vertices corresponding to the second vertex set;
and determining the upper bound of the number of vertices based on the current input data, any convolution input subset, the input dominating set, the number of first vertices, the number of second vertices, all the convolution input subgraphs, the relationship between any convolution input subset and all the convolution input subgraphs, the on-chip memory space of the current terminal and a preset algorithm.
In one possible implementation, the relationship between any one convolution input subset and all the convolution input subgraphs includes:

(i) all inputs of $V_i \cap U_{j+1}$ are contained in $\mathrm{Out}(U_j) \cup D_i$;

(ii) the upper bound of $|V_i \cap U_{j+1}|$ can be determined according to $\mathrm{Gen}(\mathrm{Out}(U_j) \cup D_i)$;

(iii) $|V_i \cap U_{j+1}| \le \left|\mathrm{Gen}(\mathrm{Out}(U_j) \cup D_i) \cap U_{j+1}\right|$;

wherein, for any directed acyclic graph G(V, E):

$V_i$ is any subset of G(V, E) obtained based on the S-partition;

$U_j$ is the vertex set corresponding to the j-th sub-graph of G(V, E);

$U_{j+1}$ is the vertex set corresponding to the (j+1)-th sub-graph of the directed acyclic graph;

$D_i$ is the dominating set of $V_i$;

$\mathrm{Out}(U_j)$ is the output set of $U_j$;

$\mathrm{Gen}(D_i)$ denotes the set containing all vertices that can be generated by $D_i$;

and substituting the convolution directed acyclic graph into relations (i) to (iii) gives the relationship between any convolution input subset and all the convolution input subgraphs.
In one possible implementation, the determining of the upper bound of the number of vertices based on the current input data, any one convolution input subset, the input dominating set, the number of first vertices, the number of second vertices, all the convolution input subgraphs, the relationship between any one convolution input subset and all the convolution input subgraphs, the on-chip memory space of the current terminal, and a preset algorithm includes:

$$T(S) = \max_{k_1 + k_2 + \cdots + k_n \le S} \sum_{j=1}^{n} \varphi_j(m_j), \qquad m_1 = k_1,\quad m_{j+1} = \hat{\varphi}_j(m_j) + k_{j+1} \tag{1}$$

wherein, for any directed acyclic graph G(V, E): S is the capacity of the fast memory; the vertices of $D_i$ are divided into at least two vertex sets whose numbers of vertices are $k_i^1, k_i^2, \cdots, k_i^n$ respectively, with $k_i^1 + k_i^2 + \cdots + k_i^n \le S$;

$$\varphi_j(D) = \left|\mathrm{Gen}(D) \cap U \cap U_j\right|, \qquad \hat{\varphi}_j(D) = \left|\mathrm{Gen}(D) \cap U \cap \mathrm{Out}(U_j)\right| \tag{2}$$

wherein U is an arbitrary vertex set of algorithm operations in the directed acyclic graph G(V, E); $U_j$ is the vertex set of algorithm operations corresponding to the j-th sub-graph obtained by decomposing the directed acyclic graph according to a preset decomposition rule; k is an arbitrary integer and D is any dominating set satisfying $|D| \le k$; $\mathrm{Out}(U_j)$ is the output set of $U_j$; $\mathrm{Gen}(D)$ represents the set containing all vertices that can be generated by the set D; $\varphi_j(D)$ is the number of vertices of $U \cap U_j$ generated by D, and $\hat{\varphi}_j(D)$ is the number of vertices of $U \cap \mathrm{Out}(U_j)$ generated by D;

$$\varphi_j(k) = \max_{|D| \le k} \varphi_j(D), \qquad \hat{\varphi}_j(k) = \max_{|D| \le k} \hat{\varphi}_j(D) \tag{3}$$

wherein, for any given k and the j-th sub-computation, $\varphi_j(k)$ represents the maximum number of vertices in $U \cap U_j$ generated by k input vertices, and $\hat{\varphi}_j(k)$ represents the maximum number of vertices generated by k input vertices in $U \cap \mathrm{Out}(U_j)$;

substituting the convolution directed acyclic graph into formula (1) to formula (3) and applying the preset algorithm gives:

$$T(S) = 2RS \tag{4}$$

wherein R is the maximum number of times each input vertex is reused and S is the capacity of the fast memory;

and substituting the current input data, the on-chip memory space of the current terminal and the convolution directed acyclic graph into formula (4), the obtained T(S) is the upper bound of the number of vertices.
In one possible implementation, the determining of the lower bound of convolutional communication based on the upper bound of the number of vertices, the total number of vertices of the convolution directed acyclic graph, the red-blue pebble model, and the on-chip memory space of the current terminal includes:

$$Q \ge S \cdot \left( \left\lceil \frac{|V|}{T(2S)} \right\rceil - 1 \right) \tag{5}$$

wherein, for any directed acyclic graph G(V, E): |V| is the number of vertices in the vertex set of the algorithm operations of G(V, E); S is the capacity of the fast memory; when Q takes the minimum value, Q is the communication lower bound;

and substituting the current input data, the on-chip memory space of the current terminal and the convolution directed acyclic graph into formula (5), when Q takes the minimum value, Q is the lower bound of convolutional communication.
In one possible implementation, determining an optimality condition according to the lower bound of the convolutional communication includes:
determining, based on the lower bound of convolutional communication, the convolution input subgraph to which storage space is to be allocated;

determining, according to the lower bound of convolutional communication, an output sub-block and its size and an input sub-block and its size corresponding to the convolution input subgraph, wherein the sizes of the input sub-block and the output sub-block both satisfy the lower bound of convolutional communication;

and determining the output sub-block as the data stored in the on-chip memory space of the current terminal, and determining in the on-chip memory space of the current terminal an input space occupied by the input sub-block and the convolution kernel corresponding to the input sub-block, so as to obtain the optimality condition.
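To make the optimality condition concrete: keeping the output sub-block resident on chip while streaming input sub-blocks and their kernels means the tile sizes must satisfy a capacity constraint. The helper below is a hypothetical sketch (the function name and bookkeeping are ours, not the patent's) that enumerates output-tile sizes fitting a fast memory of S words:

```python
def feasible_tiles(S, c_in, h_ker, w_ker, stride=1, max_tile=64):
    """Enumerate output-tile sizes (th, tw) such that the output sub-block,
    the corresponding input sub-block and one kernel fit in S words."""
    kernel = h_ker * w_ker * c_in
    tiles = []
    for th in range(1, max_tile + 1):
        for tw in range(1, max_tile + 1):
            out_block = th * tw
            in_block = ((th - 1) * stride + h_ker) * ((tw - 1) * stride + w_ker) * c_in
            if out_block + in_block + kernel <= S:
                tiles.append((th, tw))
    return tiles

# Example: 16K words of fast memory, 3x3 kernel over 64 input channels.
print(len(feasible_tiles(S=16384, c_in=64, h_ker=3, w_ker=3)))
```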
In a second aspect, the present application provides an automatic configuration template generating apparatus, which adopts the following technical solution:
an automatic configuration template generation apparatus comprising:
the acquisition module is used for acquiring current input data and the on-chip memory space of the current terminal;
the analysis module is used for determining a lower bound of convolutional communication based on the current input data, the on-chip memory space of the current terminal and a convolutional operation rule;
the distribution module is used for determining an optimality condition according to the lower bound of the convolutional communication, wherein the optimality condition is used for indicating a data stream stored in an on-chip memory space of the current terminal in convolutional operation, and the data stream meets the lower bound of the convolutional communication;
a configuration module, configured to generate a search space according to the optimality condition, where the search space includes at least one configuration template, the configuration template includes multiple adjustable parameters, at least one adjustable parameter is the same in each configuration template, and the at least one adjustable parameter is an adjustable parameter corresponding to the optimality condition;
and the optimization module is used for adjusting the access locality of the convolution operation based on the search space.
In a third aspect, the present application provides a server, which adopts the following technical solutions:
a server, the server comprising:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the above automatic configuration template generation method.
In a fourth aspect, the present application provides a computer-readable storage medium, which adopts the following technical solutions:
a computer-readable storage medium storing a computer program that can be loaded by a processor to perform the above automatic configuration template generation method.
Drawings
FIG. 1 is a schematic flow chart of an automatic configuration template generation method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a TVM tuning procedure;
FIG. 3 is a directed acyclic graph of direct convolution;
FIG. 4 is a diagram illustrating a data flow corresponding to a convolution operation rule;
FIG. 5 is a schematic diagram of an automatic configuration template generation apparatus according to an embodiment of the present application;
fig. 6 is a schematic diagram of a server according to an embodiment of the present application.
Detailed Description
The present application is described in further detail below with reference to the attached drawings.
After reading the present specification, a person skilled in the art may make modifications to the embodiments as necessary without making an inventive contribution, but all such modifications are protected by patent law only within the scope of the claims of the present application.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments.
All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In addition, the term "and/or" herein is only one kind of association relationship describing an associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone.
In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship, unless otherwise specified.
The embodiment of the application provides an automatic configuration template generation method, which is executed by a server and comprises the following steps:
with reference to figure 1 of the drawings,
Step S101, acquiring current input data and the on-chip memory space of a current terminal.
The on-chip memory space of the current terminal is the fast memory space of the hardware terminal.
For complex convolution operations in deep learning, the memory accesses corresponding to the convolution operation dominate the hardware cost and time cost, and off-chip accesses account for almost all of the energy consumption. The inputs and weights of convolution operations are typically stored in an off-chip Dynamic Random Access Memory (DRAM); an on-chip global buffer (GBuf) based on Static Random Access Memory (SRAM) stores the parts of the inputs and weights loaded from the off-chip DRAM; each processing element (PE) has registers (Reg) for storing the inputs and weights read from the global buffer (GBuf); and partial sums (psum) are stored in GBuf or Regs. There is therefore complex data transfer within the memory hierarchy during the computation.
If the on-chip memory is large enough, every type of data (inputs, weights and outputs) is guaranteed to be accessed only once when the processing element (PE) performs the convolution operation, and this access count is the single-level memory-access optimum.
The convolution operation has a 7-level loop nest, so directly searching for the communication-optimal convolution scheme is difficult. The number of memory accesses can be effectively reduced through data reuse, but parameters such as different loop orders, loop tile sizes and loop unrolling factors need to be searched, and the configurations of these parameters form a huge search space. A reference form of the loop nest is shown below.
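The seven loops are the batch, output-channel, two output-spatial, input-channel and two kernel-spatial loops. A plain NumPy sketch (layout and names are illustrative) makes the search dimensions visible; every reordering, tiling or unrolling of these seven loops is one point in the search space:

```python
import numpy as np

def direct_conv(inp, ker, stride=1):
    """Reference 7-level loop nest of direct convolution (NCHW layout)."""
    B, Cin, Hin, Win = inp.shape
    Cout, _, Hker, Wker = ker.shape
    Hout = (Hin - Hker) // stride + 1
    Wout = (Win - Wker) // stride + 1
    out = np.zeros((B, Cout, Hout, Wout), dtype=inp.dtype)
    for b in range(B):                            # 1: batch
        for co in range(Cout):                    # 2: output channels
            for ho in range(Hout):                # 3: output rows
                for wo in range(Wout):            # 4: output columns
                    for ci in range(Cin):         # 5: input channels
                        for kh in range(Hker):    # 6: kernel rows
                            for kw in range(Wker):  # 7: kernel columns
                                out[b, co, ho, wo] += (
                                    inp[b, ci, ho * stride + kh, wo * stride + kw]
                                    * ker[co, ci, kh, kw]
                                )
    return out
```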
With reference to figure 1 of the drawings,
Step S102, determining a lower bound of convolutional communication based on the current input data, the on-chip memory space of the current terminal and the convolution operation rule.
The lower bound of the convolution communication is the minimum value (minimum overhead) of off-chip access in the convolution operation process.
Maximizing data reuse in convolution helps reduce communication; a combination of data reuse methods can constitute a very complex data stream, and the given hardware resource conditions become one of the constraints for designing the data stream.
With reference to figure 1 of the drawings,
Step S103, determining an optimality condition according to the lower bound of convolutional communication, wherein the optimality condition is used for indicating the data stream stored in the on-chip memory space of the current terminal during the convolution operation, and the data stream satisfies the lower bound of convolutional communication.
With reference to figure 1 of the drawings,
Step S104, generating a search space according to the optimality condition, wherein the search space comprises at least one configuration template, the configuration template comprises a plurality of adjustable parameters, at least one adjustable parameter is the same in each configuration template, and the at least one adjustable parameter is the adjustable parameter corresponding to the optimality condition.
First, the main operations of the TVM are described, including:
(1) A tensor expression language is introduced to construct operators and provide program transformation primitives, and these primitives generate different program versions through various optimizations. TVM extends Halide's compute/schedule separation concept, separates the target hardware intrinsics from the transformation primitives so as to support new accelerators, and introduces new transformation primitives supporting deployment to specialized accelerators, so that different program transformation sequences can be applied to form a rich, effective program space for a given operator declaration. (2) An automatic program optimization framework is introduced to find optimized tensor operators; the optimizer is guided by a machine-learning-based performance model, which is refined as more data are collected from the hardware backend. (3) On top of the automatic code generator, a graph rewriter is introduced that makes full use of high-level optimization and operator-level optimization.
A mathematical computation can be implemented on a computer in a variety of ways, and different implementations have different memory accesses and different performance, so TVM requires the user to provide a "schedule" to specify how the computation is performed (e.g., loop tiling and ordering, caching, and unrolling).
When finding optimized tensor operators, TVM is guided by templates: the search space is defined by manual templates, and TVM requires the user to manually write a template for the computation definition, which defines the structure of the tensor program using some adjustable parameters (e.g., tile size and unrolling factor); the compiler then searches for the best values of these parameters for the particular input shape configuration and the particular hardware target, as in the sketch below.
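As a minimal sketch of what a "schedule" with the two adjustable parameters named here looks like (our own toy example using TVM's te scheduling primitives, not code from the patent):

```python
import tvm
from tvm import te

N = 1024
A = te.placeholder((N, N), name="A")
B = te.compute((N, N), lambda i, j: A[i, j] * 2.0, name="B")

s = te.create_schedule(B.op)
i, j = s[B].op.axis

tile = 32                      # adjustable parameter: tile size
io, ii = s[B].split(i, factor=tile)
jo, ji = s[B].split(j, factor=tile)
s[B].reorder(io, jo, ii, ji)   # the loop order is itself searchable
s[B].unroll(ji)                # adjustable parameter: unrolling

print(tvm.lower(s, [A, B], simple_mode=True))
```

Each distinct assignment of tile size, loop order and unrolled axis yields a different generated program with different memory-access behavior.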
This approach works well for common deep learning operators.
However, developing templates requires a lot of work, e.g. the code library of TVM already contains more than 15K lines of code for these templates, and this number continues to grow with the advent of new operators and new hardware platforms.
In addition, constructing a high quality template requires expertise in tensor operators and hardware.
Developing high-quality templates requires a great deal of research effort. Although template design is complex, manual templates cover only limited program structures, because manually enumerating all optimization options for all operators is prohibitive; this approach usually requires defining a template for each operator. FlexTensor proposes a generic template covering multiple operators, but its template is still designed at single-operator granularity and does not include optimizations involving multiple operators (e.g., operator fusion). The search space of a multi-operator optimized computation graph should contain different operator combinations, which cannot be achieved by the template-based approach, because fixed templates, which expose only the underlying parameters, cannot be decomposed and recombined.
Referring to fig. 2, automatically generating templates on the basis of TVM is a common technical means in the art and is not the point of the present application, so the description is not repeated here; the tensor program structures produced by automatic template generation cover the tensor program structures produced by manual template writing.
Templates are automatically generated by a template generator and delivered to a template manager; the template parameter information under the template manager is put into a configuration manager; the generated program is then configured according to these parameters and fed into a performance model, which predicts its running time on the given hardware backend; the configuration effect of the parameters in the search space is determined based on the predicted running time, and at the same time the performance model is trained with the running-time information collected during exploration.
The adjustable parameters in the template (e.g., tile size and unroll factor) define the structure of the tensor program generated by the optimizer, and each schedule (Schedule) leads to different memory accesses and different performance.
In the present method, one of the conditions for generating the template is determined based on the memory access pattern corresponding to the convolution operation in the optimized deep learning network, and the parameters corresponding to memory access among the template parameters are thereby determined; that is, the search space formed by the combination of the template parameters is reduced.
With reference to figure 1 of the drawings,
Step S105, adjusting the access locality of the convolution operation based on the search space.
In the present application, during the convolution operation process, the data stream is determined based on the lower bound of convolutional communication, the allocation of storage locations for this data stream forms the optimality condition, and at least one adjustable parameter in the configuration template can be determined according to the optimality condition; that is, at least one adjustable parameter in the search space of parameter configurations is fixed and only the other adjustable parameters need to be configured, so the search space of parameter configurations can be reduced.
Since the above optimality condition is determined based on the lower bound of the convolutional communication, the lower bound of the convolutional communication is determined first.
To introduce the lower bound of convolutional communication, we first review the red-blue pebble model, which is used to estimate the minimum amount of data transfer between two levels of memory; the derivation of the lower bound of convolutional communication relies on this model.
The red-blue pebble model is described below:
the red and blue pebble game is carried out based on a directed acyclic graph;
let G (V, E) represent a directed acyclic graph
V is a set of vertices representing the operation of the algorithm; e is a set of edges representing the dependency of two operations; we use the symbol |. to represent the number of vertices of an arbitrary set, e.g. | V | represents the number of vertices in set V.
If there is a partition on G that satisfies the following four properties, the partition is called an S-partition.

Property 1: V is divided into h subsets $V_1, V_2, \ldots, V_h$ such that the $V_i$ are pairwise disjoint and their union is V.

Property 2: each $V_i$ has a dominating set $D_i$ containing at most S vertices. A dominating set $D_i$ of $V_i$ is a set of vertices in V such that any path from an input of G to a vertex of $V_i$ contains some vertex of $D_i$.

Property 3: each $V_i$ has a minimum set $M_i$ containing at most S vertices. The minimum set $M_i$ of $V_i$ is defined as the set of vertices of $V_i$ that have no successor vertices belonging to $V_i$.

Property 4: there is no cyclic dependency among $V_1, \ldots, V_h$.
Let P(S) be the minimum number of subsets that any S-partition of the directed acyclic graph must contain.
The following theorem gives the communication lower bound of the red-blue pebble model for any directed acyclic graph:

For a directed acyclic graph G = (V, E), if the number of red pebbles is at most S (i.e., given a fast memory of capacity S), then the minimum I/O overhead Q (i.e., the minimum number of communications) required by any complete calculation of the red-blue pebble game satisfies:

$$Q \ge S \cdot \left( P(2S) - 1 \right) \tag{6}$$

wherein S is the capacity of the fast memory; P(2S) is the minimum number of subsets over all 2S-partitions of the directed acyclic graph; and 2S is twice the capacity S of the fast memory.
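As a classical sanity check of equation (6), which is not part of the patent: for $n \times n$ matrix multiplication, Hong and Kung showed that $P(2S) = \Omega\!\left(n^3 / (S\sqrt{S})\right)$, so equation (6) yields

$$Q \;\ge\; S \cdot \left( P(2S) - 1 \right) \;=\; \Omega\!\left( \frac{n^3}{\sqrt{S}} \right)$$

which is the well-known I/O lower bound of matrix multiplication.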
Assuming that the directed acyclic graph G(V, E) describes an n-stage algorithm, it is difficult to directly infer P(S) for a multi-stage algorithm, so we estimate a lower bound of P(S):

Let $\mathcal{P}(S)$ be the set containing all possible S-partitions of the directed acyclic graph G(V, E), where each element $P \in \mathcal{P}(S)$ represents one S-partition of G(V, E). Let

$$H(S) = \frac{|V|}{\max_{P \in \mathcal{P}(S)} \max_{V_i \in P} |V_i|} \tag{7}$$

wherein |V| represents the number of vertices in the set V; $V_i$ is any subset of the directed acyclic graph obtained based on the S-partition; and H(S) represents the lower bound of the number of sets in any S-partition of G(V, E).

According to equations (6) and (7), the minimum I/O overhead Q (i.e., the minimum number of communications) can be expressed as:

$$Q \ge S \cdot \left( H(2S) - 1 \right) \tag{8}$$

wherein S is the capacity of the fast memory; Q is the minimum I/O overhead required by any complete calculation of the red-blue pebble game; and 2S is twice the capacity S of the fast memory.

Thus, for a multi-stage algorithm, only H(2S) needs to be estimated instead of P(2S). From equation (7) we can see that H(S) depends on the value of $\max_i |V_i|$, which means that estimating the upper bound of $|V_i|$ is the key to the problem.
For any subset $V_i$, we estimate below the number of vertices of $V_i$ through the relationship between $V_i$ and all sub-computations of G(V, E):
first, a multi-stage partition is formalized for the directed acyclic graph G (V, E):
definitions 1.1 suppose that a directed acyclic graph G (V, E) is decomposed into n sub-graphs G1(U1,E1),G2(U2,E2),···,Gn(Un,En) Wherein G isj(Uj,Ej) Is the corresponding pair calculation.
Then { G }1(U1,E1),···,Gn(Un,En) Is a multi-stage division called G (V, E), if and only if Gj(Uj,Ej) Must be Gj−1(Uj−1,Ej−1) And all UjAre mutually exclusive.
Wherein any sequence of sub-computations can be represented as a multi-stage partition of a directed acyclic graph representing the total computation.
Suppose $\{G_1(U_1,E_1), \cdots, G_n(U_n,E_n)\}$ is a multi-stage partition of G(V, E). Since $V_i = \bigcup_{j=1}^{n} \left( V_i \cap U_j \right)$, if we can use Properties 2 and 3 in the S-partition definition to estimate upper bounds of all $|V_i \cap U_j|$ (j = 1, 2, …, n), then it is possible to obtain the maximum value of $|V_i|$.

Next, all $|V_i \cap U_j|$ are determined by way of recursive analysis: for the j-th sub-computation, assume the upper bound of $|V_i \cap U_j|$ has been successfully obtained; the next question is how to estimate $|V_i \cap U_{j+1}|$.
The output set of $U_j$ is denoted $\mathrm{Out}(U_j)$, and the dominating set of $V_i$ is denoted $D_i$. To determine which vertices of $\mathrm{Out}(U_j)$ are related to $V_i \cap U_{j+1}$, we introduce the concept of vertex generation:

Definition 1.2: In a directed acyclic graph G(V, E), a set of vertices U may generate another set of vertices U' if and only if every path from an input of G to a vertex of U' contains some vertex in U. Further, $\mathrm{Gen}(U)$ represents the set containing all vertices that can be generated by U. Definition 1.2 is a new concept of vertex generation that describes the dependencies between vertices.

Since the dominating set $D_i$ can only generate one part of $V_i$, evidently all inputs of $V_i \cap U_{j+1}$ are contained in $\mathrm{Out}(U_j) \cup D_i$; therefore $V_i \cap U_{j+1}$ is contained in $\mathrm{Gen}(\mathrm{Out}(U_j) \cup D_i)$, and $\left|\mathrm{Gen}(\mathrm{Out}(U_j) \cup D_i) \cap U_{j+1}\right|$ can be used to infer the upper bound of $|V_i \cap U_{j+1}|$.

In summary, if we can deduce the upper bounds of $|V_i \cap U_j|$ and $\left|\mathrm{Gen}(D_i) \cap \mathrm{Out}(U_j)\right|$, then taking $\mathrm{Out}(U_j) \cup D_i$ as the dominating set of $V_i \cap U_{j+1}$ makes it possible to obtain the upper bounds of $|V_i \cap U_{j+1}|$ and $\left|\mathrm{Gen}(\mathrm{Out}(U_j) \cup D_i) \cap \mathrm{Out}(U_{j+1})\right|$.
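Vertex generation per Definition 1.2 can be checked mechanically on a small graph. The sketch below is a minimal, hypothetical implementation (function and variable names are ours, not the patent's): a vertex v belongs to Gen(U) if v is in U, or if v has predecessors and all of them already belong to Gen(U), which is equivalent to every input-to-v path crossing U.

```python
from collections import defaultdict
from graphlib import TopologicalSorter

def gen(vertices, edges, u):
    """Compute Gen(U): all vertices v such that every path from an
    input of the DAG to v contains some vertex of U."""
    preds = defaultdict(list)
    for a, b in edges:  # edge a -> b
        preds[b].append(a)
    generated = set()
    order = TopologicalSorter({v: set(preds[v]) for v in vertices}).static_order()
    for v in order:
        if v in u:
            generated.add(v)  # a vertex trivially generates itself
        elif preds[v] and all(p in generated for p in preds[v]):
            generated.add(v)  # every path to v is cut by U
    return generated

# Toy two-stage example: two inputs feed a product, the product feeds a sum.
V = ["x", "w", "prod", "sum"]
E = [("x", "prod"), ("w", "prod"), ("prod", "sum")]
print(gen(V, E, {"x", "w"}))   # {'x', 'w', 'prod', 'sum'}
print(gen(V, E, {"prod"}))     # {'prod', 'sum'}
```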
Based on the above derivation, the relationship between $V_i$ and all the sub-graphs of the directed acyclic graph includes:

(i) all inputs of $V_i \cap U_{j+1}$ are contained in $\mathrm{Out}(U_j) \cup D_i$;

(ii) the upper bound of $|V_i \cap U_{j+1}|$ can be determined according to $\mathrm{Gen}(\mathrm{Out}(U_j) \cup D_i)$;

(iii) $|V_i \cap U_{j+1}| \le \left|\mathrm{Gen}(\mathrm{Out}(U_j) \cup D_i) \cap U_{j+1}\right|$;

wherein, for any directed acyclic graph G(V, E): $V_i$ is any subset of G(V, E) obtained based on the S-partition; $U_j$ is the vertex set corresponding to the j-th sub-graph of G(V, E); $U_{j+1}$ is the vertex set corresponding to the (j+1)-th sub-graph; $D_i$ is the dominating set of $V_i$; $\mathrm{Out}(U_j)$ is the output set of $U_j$; and $\mathrm{Gen}(D_i)$ is the set containing all vertices that can be generated by $D_i$.
Based on the recursive analysis, upper bounds of $|V_i \cap U_j|$ can be established for all j = 1, 2, …, n. We proceed to analyze how to obtain the upper bound of $|V_i|$.

First, we define the maximum vertex generation function to determine the number of vertices that each $D_i$ generates in $V_i \cap U_j$ and in $V_i \cap \mathrm{Out}(U_j)$. For any integer k and any dominating set D satisfying $|D| \le k$, two vertex generation functions are defined for the j-th sub-computation as follows:

$$\varphi_j(D) = \left|\mathrm{Gen}(D) \cap U \cap U_j\right|, \qquad \hat{\varphi}_j(D) = \left|\mathrm{Gen}(D) \cap U \cap \mathrm{Out}(U_j)\right| \tag{2}$$

wherein U is an arbitrary vertex set of algorithm operations in the directed acyclic graph G(V, E); $U_j$ is the vertex set of algorithm operations corresponding to the j-th sub-graph obtained by decomposing the directed acyclic graph according to a preset decomposition rule; $\mathrm{Out}(U_j)$ is the output set of $U_j$; and $\mathrm{Gen}(D)$ represents the set containing all vertices that can be generated by the set D. That is to say, $\varphi_j(D)$ and $\hat{\varphi}_j(D)$ respectively represent the numbers of vertices generated by D in the two vertex sets $U \cap U_j$ and $U \cap \mathrm{Out}(U_j)$. The preset decomposition rule is Definition 1.1.

For any given k and the j-th sub-computation, the maximum vertex generation function is defined as follows:

$$\varphi_j(k) = \max_{|D| \le k} \varphi_j(D), \qquad \hat{\varphi}_j(k) = \max_{|D| \le k} \hat{\varphi}_j(D) \tag{3}$$

wherein $\varphi_j(k)$ represents the maximum number of vertices in $U \cap U_j$ generated by k input vertices, and $\hat{\varphi}_j(k)$ represents the maximum number of vertices generated by k input vertices in $U \cap \mathrm{Out}(U_j)$. That is to say, $\varphi_j(k)$ and $\hat{\varphi}_j(k)$ provide upper-bound estimates of the numbers of vertices in $U_j$ and $\mathrm{Out}(U_j)$.
Second, the upper bound of $|V_i|$ for any set $V_i$ of an arbitrary S-partition of the directed acyclic graph is inferred based on the maximum vertex generation function of the first step.

We first infer two auxiliary results. Consider the set $\mathrm{Out}(U_j) \cup D_i$: for any vertex v of $V_i \cap U_{j+1}$, any path from the input set of V to v has at least one vertex belonging to $\mathrm{Out}(U_j) \cup D_i$.

Lemma 1: $\mathrm{Out}(U_j) \cup D_i$ is a dominating set of $V_i \cap U_{j+1}$.

Lemma 2: $\mathrm{Out}(U_j) \cup D_i$ is also a dominating set of $V_i \cap \mathrm{Out}(U_{j+1})$.

Next, by Lemma 1 and Lemma 2 above we can obtain the upper bound of $|V_i|$, which is very important for the I/O complexity analysis based on the red-blue pebble game model, namely:

Suppose $\{G_1(U_1,E_1), \cdots, G_n(U_n,E_n)\}$ is a multi-stage partition of the directed acyclic graph G(V, E). Then for any S-partition $\{V_1, \cdots, V_h\}$ of G(V, E), $|V_i|$ has the upper bound:

$$|V_i| \le \max_{k_1+\cdots+k_n \le S} \sum_{j=1}^{n} \varphi_j(m_j), \qquad m_1 = k_1, \quad m_{j+1} = \hat{\varphi}_j(m_j) + k_{j+1}$$

When $k_1, \cdots, k_n$ take the values that maximize the right-hand side, the expression takes its maximum value, i.e., the upper bound T(S).
The parameters in formula (1) are explained below:

$$T(S) = \max_{k_1 + k_2 + \cdots + k_n \le S} \sum_{j=1}^{n} \varphi_j(m_j), \qquad m_1 = k_1,\quad m_{j+1} = \hat{\varphi}_j(m_j) + k_{j+1} \tag{1}$$

wherein, for any directed acyclic graph G(V, E): S is the capacity of the fast memory; the vertices of $D_i$ are divided into at least two vertex sets whose numbers of vertices are $k_i^1, k_i^2, \cdots, k_i^n$ respectively, with $k_i^1 + k_i^2 + \cdots + k_i^n \le S$, where S is the capacity of the fast memory.

Thus, in formula (3) the value of $\varphi_j$ depends on the choice of the i-th subset (i.e., which $V_i$), but in formula (1) the value of T(S) is independent of the value of i. Therefore T(S) can be used to estimate the upper bound of any subset $V_i$; i.e., T(S) can be used to estimate the vertex upper bound of the subset having the largest number of vertices among all subsets $V_i$.

In summary, based on the maximum vertex generation function of the first step and the relationship between $V_i$ and all subgraphs, we obtain the upper bound of $|V_i|$.
In the third step, the communication lower bound is determined based on the obtained upper bound of $|V_i|$:

From equations (6), (7) and (8) we obtain: given a fast memory of capacity S, in order to complete the computation of the whole algorithm, the number of I/O operations Q between the fast memory and the slow memory satisfies:

$$Q \ge S \cdot \left( \left\lceil \frac{|V|}{T(2S)} \right\rceil - 1 \right) \tag{5}$$

wherein, for any directed acyclic graph G(V, E): |V| is the number of vertices in the vertex set of the algorithm operations of G(V, E); S is the capacity of the fast memory; when Q takes the minimum value, Q is the communication lower bound.
The process of deriving the upper bound of $|V_i|$ based on the relationship between $V_i$ and all sub-graphs and on the first and second steps above is illustrated as follows:

Suppose there is a directed acyclic graph $G(V,E) = G_1(U_1,E_1) \cup G_2(U_2,E_2)$ consisting of two sub-computations, and denote the number of vertices of the dominating set $D_i$ of $V_i$ as $k_i$.

According to the definition of the S-partition, we have $|D_i| = k_i \le S$. Split $k_i$ as $k_i = k_i^1 + k_i^2$, where $k_i^1$ is the number of input vertices of $V_i \cap U_1$ and $k_i^2$ is the number of input vertices of $V_i \cap U_2$.

For any integer k, we have the two functions $\varphi_j(k)$ and $\hat{\varphi}_j(k)$, where $\varphi_j(k)$ represents the largest number of vertices in $U \cap U_j$ that k input vertices can generate. Therefore $|V_i \cap U_1| \le \varphi_1(k_i^1)$, and at most $\hat{\varphi}_1(k_i^1)$ generated vertices serve as inputs of $V_i \cap U_2$; therefore $V_i \cap U_2$ has at most $\hat{\varphi}_1(k_i^1) + k_i^2$ input vertices. In addition, $|V_i \cap U_2| \le \varphi_2\!\left(\hat{\varphi}_1(k_i^1) + k_i^2\right)$ is valid.

Thus, we have:

$$|V_i| \le \varphi_1(k_i^1) + \varphi_2\!\left(\hat{\varphi}_1(k_i^1) + k_i^2\right) \tag{9}$$

wherein, for any directed acyclic graph G(V, E): S is the capacity of the fast memory; the vertices of $D_i$ are partitioned into a first vertex set and a second vertex set, the number of vertices in the first vertex set is $k_i^1$, the number of vertices in the second vertex set is $k_i^2$, and $k_i^1 + k_i^2 \le S$.

$$T(S) = \max_{k_1 + k_2 \le S} \left[ \varphi_1(k_1) + \varphi_2\!\left(\hat{\varphi}_1(k_1) + k_2\right) \right] \tag{10}$$

We get $|V_i| \le T(S)$, and when $|V_i| = T(S)$, the value of T(S) is the upper bound of $|V_i|$.
In summary, for the algorithm corresponding to any directed acyclic graph, its communication lower bound can be determined based on the red-blue pebble model, and the red-blue pebble model can effectively describe many computation-graph problems, for example FFT, matrix multiplication, convolution operations and so on; therefore we can compute the lower bound of convolutional communication based on the above analysis.
The following determines the lower bound of convolutional communications based on the analysis process described above:
in step S102, a possible implementation manner of the embodiment of the present application, determining a lower bound of convolutional communication based on current input data, an on-chip memory space of a current terminal, and a convolutional operation rule, includes:
step S1021 (not shown), generating a convolution directed acyclic graph according to the current input data by a convolution operation rule, and decomposing the convolution directed acyclic graph into at least two convolution input subgraphs.
The current input data may take forms including but not limited to images, matrices and the like; the directed acyclic graph generated from the current input data according to the convolution operation rule is the convolution directed acyclic graph.
Referring to fig. 3, the convolution directed acyclic graph is divided into three levels: input vertices, product vertices and addition vertices. The input vertices are arranged according to the convolution order; each multiplication unit in the graph outputs a product term, and the addition tree accumulates the input product terms to obtain the final output sum.

For an input image with sliding windows, let $I_i$ be the i-th sliding input tensor of size $W_{ker} \times H_{ker} \times C_{in}$, and let $K_j$ be the j-th convolution kernel of size $W_{ker} \times H_{ker} \times C_{in}$. Direct convolution comprises two stages for each $I_i$ and $K_j$:

The first stage performs the element-wise multiplication of $I_i$ and $K_j$ to generate $W_{ker} H_{ker} C_{in}$ product terms.

The second stage sums the product terms generated from $I_i$ and $K_j$ based on the summing tree to form one final output.

The summing tree is a directed acyclic sub-graph with a tree structure: every vertex of the tree other than its input vertices (i.e., the product vertices) has in-degree at most 2, all inputs of the tree are added together, and the tree has only one output; the direct convolution is completed after the summation.
Step S1022 (not shown) determines the total number of vertices of the convolution directed acyclic graph.
Specifically, the total number of vertices of the convolution directed acyclic graph is:

$$|V| = N_{in} + N_{win}\, C_{out}\, W_{ker} H_{ker} C_{in} + N_{win}\, C_{out} \left( W_{ker} H_{ker} C_{in} - 1 \right)$$

where $N_{in}$ is the number of input vertices (image elements and kernel weights) and $N_{win}$ is the number of sliding windows; the second term counts the product vertices and the third term counts the addition vertices of the summing trees.
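Under the vertex count as written above, |V| can be tabulated directly; the function below is an illustrative sketch (names and dimension handling are ours):

```python
def conv_dag_vertices(h_out, w_out, c_out, h_ker, w_ker, c_in, n_inputs):
    """Total vertices |V| of the direct-convolution DAG:
    inputs + product vertices + addition vertices of the summing trees."""
    windows = h_out * w_out                     # number of sliding windows
    products = windows * c_out * h_ker * w_ker * c_in
    additions = windows * c_out * (h_ker * w_ker * c_in - 1)
    return n_inputs + products + additions
```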
step S1023 (not shown), the convolutional directed acyclic graph is divided into at least one subset of convolutional inputs based on S-partitioning.
Specifically, the S-partition (S-partition) can be referred to the S-partition (S-partition) in the aforementioned Red-blue pebble, and will not be described herein in detailThe convolution directed acyclic graph forms at least one convolution input subset after being divided according to S-mode, and any one convolution input subset can be counted as a convolution input subset V for convenient representationi
Step S1024 (not shown), determining the upper bound of the number of vertices based on the current input data, any convolution input subset, all convolution input subgraphs, the relationship between any convolution input subset and all the convolution input subgraphs, the on-chip memory space of the current terminal, and a preset algorithm.

The upper bound of the number of vertices is used for representing the number of vertices corresponding to the convolution input subset having the largest number of vertices among all the convolution input subsets.

Step S1025 (not shown), determining the lower bound of convolutional communication based on the upper bound of the number of vertices, the total number of vertices of the convolution directed acyclic graph, the red-blue pebble model, and the on-chip memory space of the current terminal.
In step S1024, determining an upper bound of the number of vertices based on the current input data, any convolution input subset, all convolution input subgraphs, the relationship between any convolution input subset and all convolution input subgraphs, the on-chip memory space of the current terminal, and a preset algorithm, includes:
step S241 (not shown), define any dominating set of any convolution input subset as an input dominating set, divide the input dominating set into a first vertex set and a second vertex set, determine a first vertex number corresponding to the first vertex set, and determine a second vertex number corresponding to the second vertex set.
Specifically, based on step S102 and the aforementioned Definition 1.1, the multi-stage partition of the direct-convolution directed acyclic graph G(V, E) can be written as $G(V,E) = G_1(U_1,E_1) \cup G_2(U_2,E_2)$;

wherein each convolution input subgraph $G_i(U_i,E_i)$ represents the corresponding stage of direct convolution: the first convolution input subgraph $G_1(U_1,E_1)$ corresponds to the first stage of the direct convolution described above, and the second convolution input subgraph $G_2(U_2,E_2)$ corresponds to the second stage.

Specifically, the S-partition of any directed acyclic graph G(V, E) was introduced together with the red-blue pebble model.

Specifically, for the case where the input dominating set is divided into a first vertex set and a second vertex set, formula (1) is equivalent to formula (9): the number of vertices in the first vertex set is $k_i^1$, the number of vertices in the second vertex set is $k_i^2$, and $k_i^1 + k_i^2 \le S$.

In the convolution directed acyclic graph generated in step S1021, the first vertices are the input vertices of the first stage; the input vertices of the second stage do not include the output vertices of the first stage, but only those inputs needed in the second stage that do not come from the first stage.
Step S242 (not shown), determining an upper bound of the number of vertices based on the current input data, any convolution input subset, the input dominating set, the number of first vertices, the number of second vertices, all convolution input subgraphs, the relationship between any convolution input subset and all convolution input subgraphs, the on-chip memory space of the current terminal, and the preset algorithm.
First, in step S242, in a possible implementation of the embodiment of the present application, the relationship between any convolution input subset and all the convolution input subgraphs is obtained by substituting the convolution directed acyclic graph into relations (i) to (iii).

Second, in step S242, determining the upper bound of the number of vertices based on the current input data, any convolution input subset, the input dominating set, the number of first vertices, the number of second vertices, all the convolution input subgraphs, the relationship between any convolution input subset and all the convolution input subgraphs, the on-chip memory space of the current terminal, and the preset algorithm includes:

substituting the convolution directed acyclic graph into formula (9), formula (10), formula (1), formula (2) and formula (3) according to the relationship between any convolution input subset and all the convolution input subgraphs, and obtaining, based on the preset algorithm:

$$T(S) = 2RS \tag{4}$$

wherein R is the maximum number of times each input vertex is reused, and S is the capacity of the fast memory;

and substituting the current input data, the on-chip memory space of the current terminal and the convolution directed acyclic graph into formula (4), the obtained T(S) is the upper bound of the number of vertices.
The preset algorithm is the following estimate of the vertex generation functions:

$$\varphi_1(k_1) \le R\,k_1, \qquad \hat{\varphi}_1(k_1) \le R\,k_1, \qquad \varphi_2(k_2) \le k_2$$

which holds for any integers $k_1$ and $k_2$.
The demonstration process is as follows:

Let U be an arbitrary vertex set whose dominating set D and minimum set M each contain at most S vertices, i.e., $|D| \le S$ and $|M| \le S$. To estimate $\varphi_1(k)$ and $\hat{\varphi}_1(k)$, consider how many vertices of $U \cap U_1$ can be generated by D.

Since $U_1$ (the product stage) has no internal vertices, every vertex of $U \cap U_1$ generated by D can serve as an input of $U \cap U_2$.

Since the internal vertex sets of different summing trees do not intersect each other, U can have a non-empty intersection with the internal vertex sets of at most S summing trees: each such intersection contributes at least one distinct vertex to the minimum set M, and $|M| \le S$.

Each output vertex of a summing tree can be formally represented as the sum of $W_{ker} H_{ker} C_{in}$ product terms. We verify the bound on the product stage by contradiction: suppose k vertices of D generate more than $R\,k$ product vertices. Since each input vertex can be reused at most R times, these product vertices would involve more than $R k / R = k$ independent input vertices; and because the product stage has no internal vertices, all of these input vertices must belong to D itself, which contradicts $|D| \le k$. Therefore the assumption does not hold: k vertices of the dominating set generate at most $R\,k$ product vertices, i.e., $\varphi_1(k) \le R\,k$; and since every generated product can serve as an input of the summing trees, $\hat{\varphi}_1(k) \le R\,k$ as well.

On the other hand, a summing tree with m input vertices involves $m - 1$ internal vertices and 1 output vertex; thus m input vertices can generate at most m vertices of $U \cap U_2$, i.e., $\varphi_2(m) \le m$.

Substituting these estimates into formula (10), we obtain:

$$T(S) = \max_{k_1+k_2 \le S} \left[ \varphi_1(k_1) + \varphi_2\!\left(\hat{\varphi}_1(k_1) + k_2\right) \right] \le \max_{k_1+k_2 \le S} \left[ R\,k_1 + \left( R\,k_1 + k_2 \right) \right] = 2RS
The current input data is explained below. The current input data in step S1021 is as follows: for an input image processed with sliding windows, let I_i be the i-th sliding input tensor of size W_ker × H_ker × C_in, and let K_j be the j-th convolution kernel of size W_ker × H_ker × C_in. The maximum number of times each input element of the input image can be reused by different sliding windows is denoted R, and its value is R = W_ker·H_ker/μ², where μ is the size of the stride.
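As a concrete illustration of this reuse count, the following Python sketch (function name and sizes are ours, purely illustrative) counts by brute force how many sliding windows touch each element of an input feature map; for unit stride, each interior element is covered by exactly W_ker·H_ker windows, matching R = W_ker·H_ker/μ² with μ = 1:

```python
# Illustrative check of the reuse count R (names and sizes are ours).
# For unit stride, each interior input element is covered by exactly
# Wker*Hker sliding windows, i.e. R = Wker*Hker/mu**2 with mu = 1.

def max_input_reuse(W_in, H_in, W_ker, H_ker, mu):
    """Brute-force count of sliding windows touching each input element."""
    W_out = (W_in - W_ker) // mu + 1
    H_out = (H_in - H_ker) // mu + 1
    touch = [[0] * W_in for _ in range(H_in)]
    for oy in range(H_out):
        for ox in range(W_out):
            for ky in range(H_ker):
                for kx in range(W_ker):
                    touch[oy * mu + ky][ox * mu + kx] += 1
    return max(max(row) for row in touch)

print(max_input_reuse(16, 16, 3, 3, 1))   # 9 == 3*3/1**2
```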
In step S1025, determining the lower bound of convolutional communication based on the upper bound of the number of vertices, the total number of vertices of the convolution directed acyclic graph, the red-blue pebble model and the on-chip memory space of the current terminal includes:

substituting the convolution directed acyclic graph into formula (5) on the basis of step S1021 and step S1024; when Q takes its minimum value, Q is the lower bound of convolutional communication, i.e.:

[formula]    (11)
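For intuition only, the following sketch evaluates a lower bound of this general shape. Since formulas (4), (5) and (11) appear only as images in the source, the sketch assumes the common S-partition form Q ≥ S·(|V|/T(S) − 1) and a vertex upper bound of order T(S) ≈ S·√(RS); both forms are our assumptions, chosen to be consistent with the orders of magnitude appearing in equations (12) and (13) below:

```python
# Sketch of a red-blue pebble style bound (assumed forms, not the
# patent's exact formulas, which are images in the source).
import math

def conv_io_lower_bound_estimate(Wout, Hout, Cout, Wker, Hker, Cin, R, S):
    V2 = Wout * Hout * Cout * Wker * Hker * Cin  # second-stage (MAC) vertices
    T = S * math.sqrt(R * S)                     # assumed order of T(S)
    return S * (V2 / T - 1)                      # assumed S-partition form

print(conv_io_lower_bound_estimate(56, 56, 64, 3, 3, 64, R=9, S=2**15))
```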
In step S103, in a possible implementation manner of the embodiment of the present application, determining an optimality condition according to the lower bound of convolutional communication includes:
step S400 (not shown), determining a convolution input sub-graph of the storage space to be allocated based on the convolution communication lower bound.
Specifically, according to the foregoing scheme, for the lower bound of convolutional communication, in equations (9) and (10),
Figure 571262DEST_PATH_IMAGE155
(9)
Figure 203844DEST_PATH_IMAGE156
(10)
the highest order term of the lower bound result of convolutional communication must be bounded by some phijIt is decided that since the highest order term of the lower bound of I/O represents the majority of I/O, the highest order term φiThe main process involving the most I/O operations is pointed to, so that the I/O of the jth stage is optimized intensively to reduce off-chip access memory of convolution operation, the highest order term is determined by k, and which phi k-order is high selects which stage.
As can be seen from equations (9) and (10), the second sub-calculation corresponding to the second stage of the direct convolution, i.e., the second stage of the direct convolution, mainly determines the lower communication bound of the convolution.
Step S401 (not shown), determining, according to the lower bound of convolutional communication, the output sub-block, the size of the output sub-block, the input sub-block and the size of the input sub-block corresponding to the convolution input sub-graph, where the sizes of the input sub-block and the output sub-block both satisfy the lower bound of convolutional communication.
We analyze the calculation process of the second stage of direct convolution as follows:

According to step S101, maximizing data reuse in convolution helps to reduce communication. The output vertex corresponding to the second stage of the convolution operation is a sum value, so we can exploit output reuse: a convolution output pixel resides in on-chip storage until it is completely computed (all input channels have been traversed). We therefore choose to let the second-stage output vertices occupy as much of the current device's on-chip memory as possible.
With reference to figures 3 and 4 of the drawings, first, the output sub-block and the size of the output sub-block are determined:

To achieve the above, for a sub-block of the output image of size x × y × z we tend to have xyz ≈ S/N_p, where N_p is the total number of available processors and S is the on-chip memory space of the current terminal.
If we execute the data flows in parallel, the on-chip memory owned by each processor is fully utilized to generate partial sums, so that the output data can be reused to the maximum extent and the data transmission in the memory hierarchy is reduced; the different sub-blocks are processed in parallel by the N_p processors. In our data-flow design, there are (W_out·H_out·C_out)/(xyz) output sub-blocks in total.
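A minimal sketch of this sizing rule follows, assuming the xy = Rz condition derived later in this section; the enumeration strategy and names are ours, purely illustrative:

```python
# Sketch: choose an output tile x*y*z with x*y ~= R*z and x*y*z within
# the per-processor on-chip budget S/Np, then count the sub-blocks.
import math

def choose_output_tile(Wout, Hout, Cout, S, Np, R):
    budget = S // Np                  # on-chip elements available per processor
    best = None
    for z in range(1, Cout + 1):
        xy = R * z                    # optimality condition xy = R*z (derived below)
        if xy * z > budget:
            break
        x = max(1, min(Wout, int(math.sqrt(xy))))
        y = max(1, min(Hout, xy // x))
        blocks = math.ceil(Wout / x) * math.ceil(Hout / y) * math.ceil(Cout / z)
        best = (x, y, z, blocks)
    return best

# e.g. a 56x56x64 output, S = 32768 elements, 4 processors, R = 9:
print(choose_output_tile(56, 56, 64, S=2**15, Np=4, R=9))
```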
Secondly, determining the input sub-blocks and the sizes of the input sub-blocks:
In the convolution operation, in order to calculate the result of an x × y × z sub-block of the output image, we need the inputs of all channel dimensions corresponding to the x' × y' positions (the input sub-block) and the z convolution kernels associated with the corresponding part of the output channels; that is, the number of channels of the input picture is equal to the number of channels of the convolution kernels.

Since the on-chip memory is limited and needs to store as many output results as possible, the picture input and the convolution kernels must be loaded continuously rather than all at once into fast memory. Because the i-th input channel can only be reused by the i-th channel of the weights, and to ensure that the output sub-block can be as large as possible within the limited on-chip memory, we set α = 1; that is, our data-flow design loads x' × y' patches with a fixed number of channels and slides along the channel dimension. After an x' × y' input patch and the corresponding weights of the z convolution kernels are loaded into the on-chip memory, partial-sum computation can be performed on the output sub-block.

To update the entire output sub-block, we successively slide the x' × y' input block along the channel direction, load the corresponding inputs and weights, and perform a partial update.
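The following Python sketch illustrates this data flow under the stated assumptions (α = 1, channel-wise sliding, no padding); the array layouts and names are ours, and the explicit loops and slicing stand in for on-chip loads and MAC hardware:

```python
# Sketch of the channel-sliding data flow: one output sub-block stays
# resident while input patches and weights are "loaded" slice by slice.
import numpy as np

def update_output_subblock(image, kernels, ox, oy, oz, x, y, z, mu):
    # image: (Win, Hin, Cin); kernels: (Cout, Wker, Hker, Cin)
    Wker, Hker, Cin = kernels.shape[1:]
    xp = (x - 1) * mu + Wker                 # x': input patch width
    yp = (y - 1) * mu + Hker                 # y': input patch height
    out = np.zeros((x, y, z))                # resident output sub-block
    for c in range(Cin):                     # slide along the channel dim
        patch = image[ox*mu : ox*mu + xp, oy*mu : oy*mu + yp, c]  # "load"
        w = kernels[oz : oz + z, :, :, c]                          # "load"
        for i in range(x):
            for j in range(y):
                win = patch[i*mu : i*mu + Wker, j*mu : j*mu + Hker]
                out[i, j, :] += np.tensordot(w, win, axes=([1, 2], [0, 1]))
    return out

img = np.random.rand(18, 18, 8)              # Cin = 8
ker = np.random.rand(16, 3, 3, 8)            # Cout = 16
sub = update_output_subblock(img, ker, ox=0, oy=0, oz=0, x=4, y=4, z=16, mu=1)
print(sub.shape)                             # (4, 4, 16)
```

Per channel slice the sketch reads x'·y' image elements and z·W_ker·H_ker weights, so the totals over C_in slices match the I/O count derived next.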
To update each sub-block, we need to read x'·y'·C_in elements from the input image and W_ker·H_ker·C_in·z elements from the z convolution kernels. Since

x' ≈ xμ,

y' ≈ yμ,

R = W_ker·H_ker/μ²,

the I/O size of the data read-in is:

Q_read = x'y'·C_in + W_ker·H_ker·C_in·z,

and calculation yields:

Q_read ≥ 2·W_ker·H_ker·C_in·√(xyz/R)    (12)

where equality holds if and only if xy = Rz; Q_read then takes its minimum value, and the result is of the same order of magnitude as that in equation (11).
Based on x' ≈ xμ, y' ≈ yμ and R = W_ker·H_ker/μ², the condition xy = Rz is equivalent to x'y' = z·W_ker·H_ker, which determines the optimal size of each x' × y' patch so as to minimize the memory occupied by the input vertices (i.e. the product vertices) of the second stage.

The I/O size of storing the output is W_out·H_out·C_out. When we take xyz ≈ S/N_p and xy = Rz, the total amount of I/O is:

Q_total = 2·W_out·H_out·C_out·W_ker·H_ker·C_in·√N_p/√(R·S) + W_out·H_out·C_out    (13)
If [formula] and [formula] hold (conditions under which the first term of (13) dominates), the term W_out·H_out·C_out can be ignored, and the data flow reaches the I/O lower bound.
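A small sketch evaluating this total-I/O expression follows; the closed form below is our reconstruction of formula (13) from the surrounding text (the source renders (13) only as an image):

```python
# Sketch: evaluate the reconstructed total-I/O expression under
# xyz ~= S/Np and xy = R*z, and compare reads against the write term.
import math

def total_io(Wout, Hout, Cout, Wker, Hker, Cin, R, S, Np):
    reads = (2 * Wout * Hout * Cout * Wker * Hker * Cin
             * math.sqrt(Np) / math.sqrt(R * S))   # reconstructed first term
    writes = Wout * Hout * Cout                    # output-store term
    return reads, writes

reads, writes = total_io(56, 56, 64, 3, 3, 64, R=9, S=2**15, Np=4)
print(reads / writes)   # ~4.2 here: reads dominate, the write term is minor
```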
Step S402 (not shown), determining the output sub-block as data stored in the on-chip memory space of the current terminal, and determining an input space in the on-chip memory space of the current terminal, where the input space is a space occupied by the input sub-block and a convolution kernel corresponding to the input sub-block, so as to obtain an optimality condition.
The above embodiments describe the automatic configuration template generation method from the perspective of the method flow; the following embodiments describe the automatic configuration template generation apparatus from the perspective of virtual modules or virtual units.

An embodiment of the present application provides an automatic configuration template generation apparatus 100. Referring to fig. 5, the apparatus 100 may specifically include:
an obtaining module 1001, configured to obtain current input data and an on-chip memory space of a current terminal;
the analysis module 1002 is configured to determine a lower bound of convolutional communication based on current input data, an on-chip memory space of a current terminal, and a convolutional operation rule;
the allocation module 1003 is configured to determine an optimality condition according to the lower bound of the convolutional communication, where the optimality condition is used to indicate a data stream stored in an on-chip memory space of the current terminal in the convolutional operation, and the data stream meets the lower bound of the convolutional communication;
a configuration module 1004, configured to generate a search space according to the optimality condition, where the search space includes at least one configuration template, the configuration template includes multiple adjustable parameters, at least one adjustable parameter is the same in each configuration template, and at least one adjustable parameter is an adjustable parameter corresponding to the optimality condition;
and an optimizing module 1005, configured to adjust the memory access locality of the convolution operation based on the search space.
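As a sketch of how the configuration module 1004 above might enumerate such a search space (the structure and parameter names are ours; the patent does not fix them), the parameters tied to the optimality condition (xy = Rz with xyz within the on-chip budget) are held fixed across templates while the remaining tunables vary:

```python
# Sketch: enumerate configuration templates; condition-bound parameters
# are shared, the remaining tunables (illustrative) are free.
from itertools import product

def build_search_space(z_candidates, loop_orders, vector_widths, R, S, Np):
    templates = []
    for z in z_candidates:
        xy = R * z                         # parameter fixed by the optimality condition
        if xy * z > S // Np:               # respect the on-chip budget
            continue
        for order, vw in product(loop_orders, vector_widths):
            templates.append({"xy": xy, "z": z,       # shared, condition-bound
                              "loop_order": order,    # free tunable
                              "vector_width": vw})    # free tunable
    return templates

space = build_search_space([8, 16, 24], ["xyz", "zxy"], [4, 8], R=9, S=2**15, Np=4)
print(len(space), space[0])
```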
In a possible implementation manner of this embodiment of the present application, when determining the lower bound of the convolutional communication based on the current input data, the on-chip memory space of the current terminal, and the convolutional operation rule, the analyzing module 1002 is specifically configured to:
generating a convolution directed acyclic graph according to current input data through a convolution operation rule;
decomposing the convolution directed acyclic graph into at least two convolution input subgraphs;
determining the total number of vertexes of the convolution directed acyclic graph;
partitioning the convolutional directed acyclic graph into at least one convolution input subset based on S-partition;
determining an upper bound of the number of vertexes based on current input data, any convolution input subset, all convolution input subgraphs, the relation between any convolution input subset and all convolution input subgraphs, an on-chip memory space of a current terminal and a preset algorithm;
the upper bound of the number of the vertexes is used for representing the number of the vertexes corresponding to the convolution input subset with the maximum number of the vertexes in all the convolution input subsets;
and determining a convolution communication lower bound based on the vertex number upper bound, the vertex total number of the convolution directed acyclic graph, the red and blue pebble model and the on-chip memory space of the current terminal.
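As a toy illustration of the convolution directed acyclic graph referred to in these steps, the following sketch counts its vertices under our reading that each output pixel roots a summation tree over its W_ker·H_ker·C_in product vertices; the breakdown is an assumption, not quoted from the source:

```python
# Sketch: vertex counts of a direct-convolution DAG (our reading).
# A summation tree over n leaves has n - 1 internal/root vertices.

def conv_dag_vertex_counts(Win, Hin, Cin, Wker, Hker, Cout, mu):
    Wout = (Win - Wker) // mu + 1
    Hout = (Hin - Hker) // mu + 1
    inputs = Win * Hin * Cin + Wker * Hker * Cin * Cout    # image + weights
    products = Wout * Hout * Cout * Wker * Hker * Cin      # stage-1 outputs
    sums = Wout * Hout * Cout * (Wker * Hker * Cin - 1)    # stage-2 vertices
    return {"inputs": inputs, "products": products,
            "sums": sums, "total": inputs + products + sums}

print(conv_dag_vertex_counts(16, 16, 8, 3, 3, 4, 1))
```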
In a possible implementation manner of this embodiment of the present application, when determining the upper bound of the number of vertices based on the current input data, any convolution input subset, all convolution input subgraphs, the relationship between any convolution input subset and all convolution input subgraphs, the on-chip memory space of the current terminal, and the preset algorithm, the analyzing module 1002 is specifically configured to:
defining any dominating set of any convolution input subset as an input dominating set, dividing the input dominating set into a first vertex set and a second vertex set, determining the number of first vertexes corresponding to the first vertex set, and determining the number of second vertexes corresponding to the second vertex set;
and determining the upper bound of the vertex number based on the current input data, any convolution input subset, an input dominating set, the first vertex number, the second vertex number, all convolution input subgraphs, the relation between any convolution input subset and all convolution input subgraphs, the on-chip memory space of the current terminal and a preset algorithm.
In a possible implementation manner of the embodiment of the present application, when determining the relationship between any convolution input subset and all convolution input subgraphs, the analysis module 1002 is specifically configured to:
1) all inputs of V_i ∩ U_{j+1} are contained in [formula];    (i)

2) the upper bound of |V_i ∩ U_{j+1}| can be determined according to [formula];    (ii)

3) [formula];    (iii)

wherein, for any directed acyclic graph G(V, E):

V_i is any subset obtained from the directed acyclic graph G(V, E) on the basis of the S-partition;

U_j is the vertex set corresponding to the j-th subgraph of the directed acyclic graph G(V, E);

U_{j+1} is the vertex set corresponding to the (j+1)-th subgraph of the directed acyclic graph;

D_i is the dominating set of V_i;

[symbol] is the output set of U_j;

[symbol] denotes a set containing all vertices that can be generated by D_i;

and the convolution directed acyclic graph is substituted into formulas (i) to (iii) to obtain the relation between any convolution input subset and all convolution input subgraphs.
In a possible implementation manner of this embodiment of the present application, the analysis module 1002 is specifically configured to, when determining the upper bound of the vertex number based on the current input data, any convolution input subset, the input dominating set, the first vertex number, the second vertex number, all convolution input subgraphs, the relationship between any convolution input subset and all convolution input subgraphs, the on-chip memory space of the current terminal, and a preset algorithm:
[formula]    (1)

wherein, for any directed acyclic graph G(V, E):

S is the capacity of the fast memory;

D_i is divided into at least two vertex sets, and the numbers of vertices in the respective vertex sets are: [formula], [formula], [formula];

[formula]    (2)

wherein U is a vertex set representing the operations of the algorithm in any directed acyclic graph G(U, E);

U_j is the vertex set of the algorithm operations corresponding to the j-th subgraph obtained by decomposing the directed acyclic graph G(U, E) according to a preset decomposition rule;

k is an arbitrary integer, and D is a vertex set of an arbitrary dominating set satisfying [formula];

[symbol] is the output set of U_j;

[symbol] denotes a set containing all vertices that can be generated by the set D;

[symbol] denotes the number of vertices in U ∩ U_j generated by D;

[symbol] denotes the number of vertices in [symbol] generated by D;

[formula]    (3)

wherein U is the vertex set representing the algorithm operations in the directed acyclic graph G(U, E) of formula (2);

k is the arbitrary integer of formula (2);

for any given k and the j-th sub-computation, [symbol] denotes the maximum number of vertices in U ∩ U_j generated by k input vertices, and [symbol] denotes the maximum number of vertices generated by k input vertices in [symbol];

the convolution directed acyclic graph is substituted into formulas (1) to (3), and, based on the preset algorithm, the following is obtained:

T(S) = [formula]    (4)

wherein R is the maximum number of times each input vertex can be reused;

S is the capacity of the fast memory;

and the current input data, the on-chip memory space of the current terminal and the convolution directed acyclic graph are substituted into formula (4), the obtained T(S) being the upper bound of the number of vertices.
In a possible implementation manner of the embodiment of the present application, when the analysis module determines the lower bound of the convolution communication based on the upper bound of the number of vertices, the total number of vertices of the convolution directed acyclic graph, the red-blue pebble model, and the on-chip memory space of the current terminal, the analysis module is specifically configured to:
[formula]    (5)

wherein, for any directed acyclic graph G(V, E):

|V| is the number of vertices in the vertex set of the algorithm operations of the directed acyclic graph G(V, E);

S is the capacity of the fast memory;

when Q takes its minimum value, Q is the communication lower bound;

and the current input data, the on-chip memory space of the current terminal and the convolution directed acyclic graph are substituted into formula (5); when Q takes its minimum value, Q is the lower bound of convolutional communication.
In a possible implementation manner of the embodiment of the present application, when determining the optimality condition according to the lower bound of the convolutional communication, the allocating module 1003 is specifically configured to:
determining a convolution input subgraph of a storage space to be allocated based on a convolution communication lower bound;
determining, according to the lower bound of convolutional communication, the output sub-block, the size of the output sub-block, the input sub-block and the size of the input sub-block corresponding to the convolution input sub-graph, wherein the sizes of the input sub-block and the output sub-block both satisfy the lower bound of convolutional communication;
and determining the output sub-block as data stored in the on-chip memory space of the current terminal, and determining an input space in the on-chip memory space of the current terminal, wherein the input space is a space occupied by the input sub-block and a convolution kernel corresponding to the input sub-block, so as to obtain an optimality condition.
An embodiment of the present application provides a server 1100. As shown in fig. 6, the server 1100 includes: a processor 1101 and a memory 1103.
The processor 1101 is coupled to the memory 1103, such as by a bus 1102.
Optionally, the server 1100 may also include a transceiver 1104.
It should be noted that the transceiver 1104 is not limited to one in practical applications, and the structure of the server 1100 is not limited to the embodiment of the present application.
The processor 1101 may be a CPU (central processing unit), a general purpose processor, a DSP (digital signal processor), an ASIC (application specific integrated circuit), an FPGA (field programmable gate array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
It may implement or execute the various illustrative logical blocks, modules and circuits described in connection with this disclosure.

The processor 1101 may also be a combination implementing computing functions, e.g. a combination of one or more microprocessors, or a combination of a DSP and a microprocessor, and the like.
Bus 1102 may include a path that transfers information between the above components.
The bus 1102 may be a PCI (peripheral component interconnect) bus, an EISA (extended industry standard architecture) bus, or the like.
The bus 1102 may be divided into an address bus, a data bus, a control bus, and the like.
For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus.
The memory 1103 may be a ROM (read-only memory) or other type of static storage device that can store static information and instructions, a RAM (random access memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (electrically erasable programmable read-only memory), a CD-ROM (compact disc read-only memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.

The memory 1103 is used for storing the application program code for executing the solution of the present application, and execution is controlled by the processor 1101.
The processor 1101 is configured to execute application program code stored in the memory 1103 to implement the content shown in the foregoing method embodiments.
The server includes, but is not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and in-vehicle terminals (e.g. in-vehicle navigation terminals), and fixed terminals such as digital TVs and desktop computers.
The server shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order.

Unless explicitly stated herein, the steps need not be performed in the exact order shown and may be performed in other orders.

Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which need not be performed sequentially; they may be performed in turn or alternately with other steps, or with at least a portion of the sub-steps or stages of other steps.
The foregoing is only a partial embodiment of the present application. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims (10)

1. An automatic configuration template generation method, comprising:
acquiring current input data and an on-chip memory space of a current terminal;
determining a lower bound of convolutional communication based on the current input data, the on-chip memory space of the current terminal and a convolutional operation rule;
determining an optimality condition according to the lower bound of the convolutional communication, wherein the optimality condition is used for indicating a data stream stored in an on-chip memory space of the current terminal in convolutional operation, and the data stream meets the lower bound of the convolutional communication;
generating a search space according to the optimality condition, wherein the search space comprises at least one configuration template, the configuration template comprises a plurality of adjustable parameters, at least one adjustable parameter is the same in each configuration template, and the at least one adjustable parameter is an adjustable parameter corresponding to the optimality condition;
and adjusting the access locality of the convolution operation based on the search space.
2. The method of claim 1, wherein determining a lower bound for convolutional communication based on the current input data, an on-chip memory space of the current terminal, and a convolutional operation rule comprises:
generating a convolution directed acyclic graph according to the current input data through a convolution operation rule;
decomposing the convolved directed acyclic graph into at least two convolved input subgraphs;
determining a total number of vertices of the convolved directed acyclic graph;
partitioning the convolutional directed acyclic graph into at least one subset of convolutional inputs based on S-partitioning;
determining an upper bound of the number of vertexes based on the current input data, any convolution input subset, all convolution input subgraphs, the relationship between any convolution input subset and all the convolution input subgraphs, the on-chip memory space of the current terminal and a preset algorithm;
wherein the vertex number upper bound is used for representing the vertex number corresponding to the convolution input subset with the maximum vertex number in all the convolution input subsets;
and determining the lower bound of the convolution communication based on the upper bound of the number of the vertexes, the total number of the vertexes of the convolution directed acyclic graph, the red and blue pebble model and the on-chip internal memory space of the current terminal.
3. The method of claim 2, wherein determining the upper bound on the number of vertices based on the current input data, any subset of the convolutional inputs, all the convolutional input subgraphs, a relationship between any subset of the convolutional inputs and all the convolutional input subgraphs, an on-chip memory space of the current terminal, and a preset algorithm comprises:
defining any dominating set of the convolutional input subset as an input dominating set, dividing the input dominating set into a first vertex set and a second vertex set, determining the number of first vertices corresponding to the first vertex set, and determining the number of second vertices corresponding to the second vertex set;
and determining the upper bound of the number of the vertexes based on the current input data, any convolution input subset, the input dominating set, the number of the first vertexes, the number of the second vertexes, all convolution input subgraphs, the relation between any convolution input subset and all the convolution input subgraphs, the on-chip memory space of the current terminal and a preset algorithm.
4. The method of claim 3, wherein the relationship between any subset of the convolved inputs and all of the convolved input subgraphs comprises:
all inputs of V_i ∩ U_{j+1} are contained in [formula];    (i)

the upper bound of |V_i ∩ U_{j+1}| can be determined according to [formula];    (ii)

[formula];    (iii)

wherein, for any directed acyclic graph G(V, E):

V_i is any subset obtained from the directed acyclic graph G(V, E) on the basis of the S-partition;

U_j is the vertex set corresponding to the j-th subgraph of the directed acyclic graph G(V, E);

U_{j+1} is the vertex set corresponding to the (j+1)-th subgraph of the directed acyclic graph;

D_i is the dominating set of V_i;

[symbol] is the output set of U_j;

[symbol] denotes a set containing all vertices that can be generated by D_i;

and the convolution directed acyclic graph is substituted into formulas (i) to (iii) to obtain the relation between any convolution input subset and all the convolution input subgraphs.
5. The method of claim 4, wherein determining an upper bound on the number of vertices based on the current input data, any subset of the convolutional inputs, the input dominating set, the first number of vertices, the second number of vertices, all of the convolutional input subgraphs, the relationship between any subset of the convolutional inputs and all of the convolutional input subgraphs, an on-chip memory space of the current terminal, and a preset algorithm comprises:
[formula]    (1)

wherein, for any directed acyclic graph G(V, E):

S is the capacity of the fast memory;

D_i is divided into at least two vertex sets, and the numbers of vertices in the respective vertex sets are: [formula], [formula], [formula];

[formula]    (2)

wherein U is a vertex set representing the operations of the algorithm in any directed acyclic graph G(U, E);

U_j is the vertex set of the algorithm operations corresponding to the j-th subgraph obtained by decomposing the directed acyclic graph G(U, E) according to a preset decomposition rule;

k is an arbitrary integer, and D is a vertex set of an arbitrary dominating set satisfying [formula];

[symbol] is the output set of U_j;

[symbol] denotes a set containing all vertices that can be generated by the set D;

[symbol] denotes the number of vertices in U ∩ U_j generated by D;

[symbol] denotes the number of vertices in [symbol] generated by D;

[formula]    (3)

wherein U is the vertex set representing the algorithm operations in the directed acyclic graph G(U, E) of formula (2);

k is the arbitrary integer of formula (2);

for any given k and the j-th sub-computation, [symbol] denotes the maximum number of vertices in U ∩ U_j generated by k input vertices, and [symbol] denotes the maximum number of vertices generated by k input vertices in [symbol];

substituting the convolution directed acyclic graph into formulas (1) to (3), and obtaining, based on the preset algorithm:

T(S) = [formula]    (4)

wherein R is the maximum number of times each input vertex can be reused;

S is the capacity of the fast memory;

and substituting the current input data, the on-chip memory space of the current terminal and the convolution directed acyclic graph into formula (4), the obtained T(S) being the upper bound of the number of vertices.
6. The method of claim 5, wherein the determining the lower bound for the convolutional communication based on the upper bound for the number of vertices, the total number of vertices for the convolutional directed acyclic graph, a red-blue pebble model, and an on-chip memory space for the current terminal comprises:
[formula]    (5)

wherein, for any directed acyclic graph G(V, E):

|V| is the number of vertices in the vertex set of the algorithm operations of the directed acyclic graph G(V, E);

S is the capacity of the fast memory;

when Q takes its minimum value, Q is the communication lower bound;

and substituting the current input data, the on-chip memory space of the current terminal and the convolution directed acyclic graph into formula (5), when Q takes its minimum value, Q being the lower bound of the convolutional communication.
7. The method of claim 3, wherein determining an optimality condition based on the lower bound of convolutional communications comprises:
determining the convolution input subgraph of the storage space to be allocated based on the convolution communication lower bound;
determining, according to the lower bound of the convolutional communication, an output sub-block, a size of the output sub-block, an input sub-block and a size of the input sub-block corresponding to the convolution input sub-graph, wherein the sizes of the input sub-block and the output sub-block both satisfy the lower bound of the convolutional communication;
and determining the output sub-block as data stored in the on-chip internal memory space of the current terminal, and determining an input space in the on-chip internal memory space of the current terminal, wherein the input space is occupied by the input sub-block and a convolution kernel corresponding to the input sub-block so as to obtain the optimality condition.
8. An automatic configuration template generation apparatus, comprising:
the acquisition module is used for acquiring current input data and the on-chip memory space of the current terminal;
the analysis module is used for determining a lower bound of convolutional communication based on the current input data, the on-chip memory space of the current terminal and a convolutional operation rule;
the distribution module is used for determining an optimality condition according to the lower bound of the convolutional communication, wherein the optimality condition is used for indicating a data stream stored in an on-chip memory space of the current terminal in convolutional operation, and the data stream meets the lower bound of the convolutional communication;
a configuration module, configured to generate a search space according to the optimality condition, where the search space includes at least one configuration template, the configuration template includes multiple adjustable parameters, at least one adjustable parameter is the same in each configuration template, and the at least one adjustable parameter is an adjustable parameter corresponding to the optimality condition;
and the optimization module is used for adjusting the access locality of the convolution operation based on the search space.
9. A server, comprising:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to: perform the automatic configuration template generation method of any one of claims 1 to 7.

10. A computer-readable storage medium, comprising: a computer program loadable by a processor and adapted to perform the automatic configuration template generation method according to any one of claims 1 to 7.
CN202110715535.7A 2021-06-28 2021-06-28 Automatic configuration template generation method and device, server and storage medium Active CN113254867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110715535.7A CN113254867B (en) 2021-06-28 2021-06-28 Automatic configuration template generation method and device, server and storage medium

Publications (2)

Publication Number Publication Date
CN113254867A true CN113254867A (en) 2021-08-13
CN113254867B CN113254867B (en) 2021-10-29

Family

ID=77189776

Country Status (1)

Country Link
CN (1) CN113254867B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190188537A1 (en) * 2017-12-14 2019-06-20 Robert Bosch Gmbh Effective building block design for deep convolutional neural networks using search
CN111401520A (en) * 2020-03-10 2020-07-10 北京迈格威科技有限公司 Method and device for determining model search space and electronic equipment
CN111523642A (en) * 2020-04-10 2020-08-11 厦门星宸科技有限公司 Data reuse method, operation method and device and chip for convolution operation
CN111859797A (en) * 2020-07-14 2020-10-30 Oppo广东移动通信有限公司 Data processing method and device and storage medium
CN112711422A (en) * 2020-12-31 2021-04-27 北京清微智能科技有限公司 Optimization method and system for neural network compiling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xiaoming Chen et al.: "Communication Lower Bound in Convolution Accelerators", 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA).

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114841326A (en) * 2022-05-19 2022-08-02 北京百度网讯科技有限公司 Operator processing method, device and equipment of deep learning framework and storage medium
CN114841326B (en) * 2022-05-19 2024-01-12 北京百度网讯科技有限公司 Operator processing method, device, equipment and storage medium of deep learning framework
CN117008916A (en) * 2023-07-06 2023-11-07 清华大学 Tensor program optimization method and device

Also Published As

Publication number Publication date
CN113254867B (en) 2021-10-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant