CN113254867A

CN113254867A - Automatic configuration template generation method and device, server and storage medium

Info

Publication number: CN113254867A
Application number: CN202110715535.7A
Authority: CN
Inventors: 张晓扬; 肖俊敏; 曹连雨
Original assignee: Hyperai Cloud Technology Beijing Co ltd
Current assignee: Hyperai Cloud Technology Beijing Co ltd
Priority date: 2021-06-28
Filing date: 2021-06-28
Publication date: 2021-08-13
Anticipated expiration: 2041-06-28
Also published as: CN113254867B

Abstract

The application relates to a method, a device, a server and a storage medium for generating an automatic configuration template, wherein the method comprises the following steps: acquiring current input data and an on-chip memory space of a current terminal; determining a lower bound of convolutional communication based on current input data, an on-chip memory space of a current terminal and a convolutional operation rule; determining an optimality condition according to the lower bound of the convolutional communication, wherein the optimality condition is used for indicating a data stream stored in an on-chip memory space of the current terminal in the convolutional operation; generating a search space according to the optimality condition, wherein the search space comprises at least one configuration template, at least one adjustable parameter is the same in each configuration template, and at least one adjustable parameter is an adjustable parameter corresponding to the optimality condition; and adjusting the access locality of the convolution operation based on the search space. The method and the device have the effect of reducing the time cost and the hardware cost in the process of searching the parameter configuration.

Description

Automatic configuration template generation method and device, server and storage medium

Technical Field

The present application relates to the field of deep learning, compiling technology, and high-performance computing cross technology, and in particular, to a method, an apparatus, a server, and a storage medium for automatically configuring a template.

Background

Reasoning optimization of deep learning is widely concerned in the industry, and as reasoning terminals are various and include embedded CPUs (central processing units), GPUs (graphics processing units), fpgas (field Programmable Gate arrays), ASICs (application specific integrated circuits), and the like, different hardware objects are different in memory organization, computing functional units, and the like, so that the skills required for reasoning optimization of different terminals are different, and tuning performance becomes a repetitive work requiring professional engineering experience and consuming a lot of manpower.

An automatic tuning framework TVM (deep learning-oriented automatic end-to-end optimization compiler) commonly adopted in the related art firstly acquires a model from an existing framework as input, converts the model into a calculation graph representation, and then a system executes high-level data stream rewriting to generate an optimized graph; given a rich set of scheduling primitives, the TVM finds the optimal operator implementation for each layer of the deep learning model, creates a specialized operator for the specific input shapes and layouts associated with each layer, and selects scheduling optimizations, such as modifying the round robin order, optimizing the memory hierarchy, and scheduling specific parameters such as tile size and round robin expansion factor.

TVM discloses graph and operator level optimization, introduces a tensor expression language to build operators and provide program conversion primitives, selects code for generation (e.g., loop blocks and ordering, caching, unfolding) based on a combination of memory access, thread patterns, and new hardware primitives, and introduces an automatic program optimization framework to find optimized tensor operators.

In the template-guided search mode of TVM, the search space is written by a manual template that requires the user to manually write a template for the computation definition, which defines the structure of the tensor expression using some adjustable parameters (e.g., tile size and expansion factor) and searches for the best values of these parameters for a particular computation graph shape configuration and a particular hardware object.

With respect to the above-described related art, the inventors consider that the following drawbacks exist: the search space for parameter configuration is huge, so that the time cost and hardware cost for the parameter configuration process are high.

Disclosure of Invention

In order to reduce the time cost and the hardware cost of a search parameter configuration process, the application provides an automatic configuration template generation method, an automatic configuration template generation device, a server and a storage medium.

In a first aspect, the present application provides a method for generating an automatically tuned and optimized template, which adopts the following technical scheme:

an automatic configuration template generation method comprises the following steps:

acquiring current input data and an on-chip memory space of a current terminal;

determining a lower bound of convolutional communication based on the current input data, the on-chip memory space of the current terminal and a convolutional operation rule;

determining an optimality condition according to the lower bound of the convolutional communication, wherein the optimality condition is used for indicating a data stream stored in an on-chip memory space of the current terminal in convolutional operation, and the data stream meets the lower bound of the convolutional communication;

generating a search space according to the optimality condition, wherein the search space comprises at least one configuration template, the configuration template comprises a plurality of adjustable parameters, at least one adjustable parameter is the same in each configuration template, and the at least one adjustable parameter is an adjustable parameter corresponding to the optimality condition;

and adjusting the access locality of the convolution operation based on the search space.

By adopting the technical scheme, the data stream in the convolution operation process is determined based on the lower bound of the convolution communication in the convolution operation process, the optimal condition is formed by the allocation of the storage position of the data stream, at least one adjustable parameter in the configuration template can be determined according to the optimal condition, namely, at least one adjustable parameter in the search space of the parameter configuration is fixed, and only other adjustable parameters are needed to be configured, so that the search space of the parameter configuration can be reduced.

In a possible implementation manner, the determining a lower bound of convolutional communication based on the current input data, the on-chip memory space of the current terminal, and the convolutional operation rule includes:

generating a convolution directed acyclic graph according to the current input data through a convolution operation rule;

decomposing the convolved directed acyclic graph into at least two convolved input subgraphs;

determining a total number of vertices of the convolved directed acyclic graph;

partitioning the convolutional directed acyclic graph into at least one subset of convolutional inputs based on S-partitioning;

determining an upper bound of the number of vertexes based on the current input data, any convolution input subset, all convolution input subgraphs, the relationship between any convolution input subset and all the convolution input subgraphs, the on-chip memory space of the current terminal and a preset algorithm;

wherein the vertex number upper bound is used for representing the vertex number corresponding to the convolution input subset with the maximum vertex number in all the convolution input subsets;

and determining the lower bound of the convolution communication based on the upper bound of the number of the vertexes, the total number of the vertexes of the convolution directed acyclic graph, the red and blue pebble model and the on-chip internal memory space of the current terminal.

In a possible implementation manner, the determining an upper bound of the number of vertices based on the current input data, any one of the subset of convolutional inputs, all of the sub-graphs of convolutional inputs, a relationship between any one of the subset of convolutional inputs and all of the sub-graphs of convolutional inputs, an on-chip memory space of the current terminal, and a preset algorithm includes:

defining any dominating set of the convolutional input subset as an input dominating set, dividing the input dominating set into a first vertex set and a second vertex set, determining the number of first vertices corresponding to the first vertex set, and determining the number of second vertices corresponding to the second vertex set;

and determining the upper bound of the number of the vertexes based on the current input data, any convolution input subset, the input dominating set, the number of the first vertexes, the number of the second vertexes, all the convolution input subgraphs, the relation between any convolution input subset and all the convolution input subgraphs, the on-chip memory space of the current terminal and a preset algorithm.

In one possible implementation, the relationship between any one subset of the convolutional inputs and all the convolutional input subgraphs includes:

V_i∩U_j+1all inputs being contained in

Performing the following steps; (i)

according to

Can determine | V_i∩U_j+1The upper bound of |; (ii)

；（iii）

wherein, for any directed acyclic graph G (V, E):

V_iany subset obtained for any directed acyclic graph G (V, E) based on S-division;

U_ja vertex set corresponding to the jth sub-graph of any directed acyclic graph G (V, E);

U_j+1the vertex set corresponding to the j +1 th sub-graph of the directed acyclic graph is obtained;

D_iis a V_iThe dominating set of (2);

is U_jThe output set of (2);

means that one contains all the pieces capable of being D_iA set of generated vertices;

and (5) substituting the convolution directed acyclic graph into an equation (i) to an equation (iii) to obtain the relation between any convolution input subset and all the convolution input subgraphs.

In one possible implementation, the determining an upper bound of the number of vertices based on the current input data, any one of the subset of convolutional inputs, the input dominating set, the first number of vertices, the second number of vertices, all of the sub-graphs of convolutional inputs, a relationship between any one of the subset of convolutional inputs and all of the sub-graphs of convolutional inputs, an on-chip memory space of the current terminal, and a preset algorithm includes:

（1）

wherein, for any directed acyclic graph G (V, E):

s is the capacity of the flash memory;

will D_iDividing the vertexes into at least two vertex sets, wherein the number of vertexes in each vertex set is respectively as follows:

、

、

；

（2）

wherein, U is a vertex set which represents the operation of the algorithm in any directed acyclic graph G (U, E);

U_jdecomposing any directed acyclic graph G (U, E) according to a preset decomposition rule to obtain a vertex set of algorithm operation corresponding to the jth sub-graph;

k is an arbitrary integer, D is a number satisfying

A set of vertices of the arbitrary dominance set;

is U_jThe set of outputs of (a) is,

representing a set containing all vertices that can be generated by the set D;

representing the vertex set U n U by D_jThe number of vertices generated;

representing a set of vertices by D

The number of vertices generated in (1);

（3）

wherein, U is any vertex set which represents algorithm operation in the directed acyclic graph G (U, E) of the formula (2);

k is an arbitrary integer of formula (2);

for any given k and jth sub-computation,

represents the maximum number of vertices in U #uj generated by k input vertices,

representing k input vertices at

The maximum number of vertices generated;

substituting the convolution directed acyclic graph into a formula (1) to a formula (3), and obtaining the result based on a preset algorithm:

（4）

wherein R is the maximum reuse times of each input vertex;

s is the capacity of the flash memory;

and (4) substituting the current input data, the on-chip memory space of the current terminal and the convolution directed acyclic graph into a formula (4), wherein the obtained T (S) is the upper bound of the vertex number.

In one possible implementation manner, the determining the lower bound of the convolutional communication based on the upper bound of the number of vertices, the total number of vertices of the convolutional directed acyclic graph, the red-blue pebble model, and the on-chip memory space of the current terminal includes:

（5）

wherein, for any directed acyclic graph G (V, E):

the number of the top points in the top set of the algorithm operation of any directed acyclic graph G (V, E) is | V |;

s is the capacity of the flash memory;

when Q takes the minimum value, Q is the lower communication bound;

and substituting the current input data, the on-chip memory space of the current terminal and the convolution directed acyclic graph into a formula (5), wherein when Q is the minimum value, Q is the lower bound of the convolution communication.

In one possible implementation, determining an optimality condition according to the lower bound of the convolutional communication includes:

determining the convolution input subgraph of the storage space to be allocated based on the convolution communication lower bound;

determining an output sub-block, a size of the output sub-block, a size of an input sub-block and a size of the input sub-block corresponding to the convolution input sub-graph according to the lower bound of the convolution communication, wherein the sizes of the input sub-block and the output sub-block both meet the lower bound of the convolution communication;

and determining the output sub-block as data stored in the on-chip internal memory space of the current terminal, and determining an input space in the on-chip internal memory space of the current terminal, wherein the input space is occupied by the input sub-block and a convolution kernel corresponding to the input sub-block so as to obtain the optimality condition.

In a second aspect, the present application provides an automatic configuration template generating apparatus, which adopts the following technical solution:

an automatic configuration template generation apparatus comprising:

the acquisition module is used for acquiring current input data and the on-chip memory space of the current terminal;

the analysis module is used for determining a lower bound of convolutional communication based on the current input data, the on-chip memory space of the current terminal and a convolutional operation rule;

the distribution module is used for determining an optimality condition according to the lower bound of the convolutional communication, wherein the optimality condition is used for indicating a data stream stored in an on-chip memory space of the current terminal in convolutional operation, and the data stream meets the lower bound of the convolutional communication;

a configuration module, configured to generate a search space according to the optimality condition, where the search space includes at least one configuration template, the configuration template includes multiple adjustable parameters, at least one adjustable parameter is the same in each configuration template, and the at least one adjustable parameter is an adjustable parameter corresponding to the optimality condition;

and the optimization module is used for adjusting the access locality of the convolution operation based on the search space.

In a third aspect, the present application provides a server, which adopts the following technical solutions:

a server, the server comprising:

one or more processors;

a memory;

one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to: and executing the method for automatically generating the tuning template.

In a fourth aspect, the present application provides a computer-readable storage medium, which adopts the following technical solutions:

a computer-readable storage medium, comprising: a computer program is stored which can be loaded by a processor and which performs the method of automatically tuning template generation described above.

Drawings

FIG. 1 is a schematic flow chart illustrating a method for generating an automatically tuned template according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a TVM tuning procedure;

FIG. 3 is a directed acyclic graph of direct convolution;

FIG. 4 is a diagram illustrating a data flow corresponding to a convolution operation rule;

FIG. 5 is a schematic diagram of an apparatus for automatically generating a tuning template according to an embodiment of the present application;

fig. 6 is a schematic diagram of a server according to an embodiment of the present application.

Detailed Description

The present application is described in further detail below with reference to the attached drawings.

A person skilled in the art, after reading the present specification, may make modifications to the present embodiments as necessary without inventive contribution, but only within the scope of the claims of the present application are protected by patent laws.

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments.

All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In addition, the term "and/or" herein is only one kind of association relationship describing an associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone.

In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship, unless otherwise specified.

The embodiment of the application provides an automatic configuration template generation method, which is executed by a server and comprises the following steps:

with reference to figure 1 of the drawings,

step S101, obtaining current input data and an on-chip internal memory space of a current terminal.

The on-chip memory space of the current terminal is the fast memory space of the hardware terminal.

For complex convolution operation in deep learning, hardware cost and time cost are dominant in memory access corresponding to the convolution operation, and off-chip access almost occupies all energy consumption; the inputs and weights of the convolution operations are typically stored in an off-chip Dynamic Random Access Memory (DRAM), an on-chip global buffer (GBuf) -based Static Random Access Memory (SRAM) stores part of the inputs and weights loaded from the off-chip Dynamic Random Access Memory (DRAM), each processing unit (PE) has some processors (Reg) for storing the inputs and weights read from the global buffer (GBuf), and part of the sum (psum) is stored in GBuf or Regs, so that there is a complex data transfer in the memory hierarchy during the calculation.

If the on-chip memory is large enough, all types of data (input, weight and output) are guaranteed to be accessed and stored only once when the processing unit (Pe) performs convolution operation, and the access and storage are single-layer access and storage optimal values.

The convolution operation has 7 layers of circulation, the direct search of the convolution optimal communication method is difficult, the access times can be effectively reduced through data multiplexing, but different circulation sequences, circulation step lengths, circulation expansion and other parameters need to be searched, and the configuration of the parameters forms a huge search space.

With reference to figure 1 of the drawings,

step S102, determining a lower bound of convolution communication based on current input data, an on-chip memory space of a current terminal and a convolution operation rule.

The lower bound of the convolution communication is the minimum value (minimum overhead) of off-chip access in the convolution operation process.

Maximizing data reuse in convolution helps reduce communication, and a combination of data reuse methods can constitute a very complex data stream, and a given hardware resource condition becomes one of the constraints for designing the data stream.

With reference to figure 1 of the drawings,

and S103, determining an optimality condition according to the lower bound of the convolutional communication, wherein the optimality condition is used for indicating a data stream stored in an on-chip memory space of the current terminal in the convolutional operation, and the data stream meets the lower bound of the convolutional communication.

With reference to figure 1 of the drawings,

and S104, generating a search space according to the optimality condition, wherein the search space comprises at least one configuration template, the configuration template comprises a plurality of adjustable parameters, at least one adjustable parameter is the same in each configuration template, and at least one adjustable parameter is an adjustable parameter corresponding to the optimality condition.

First, the main operations of the TVM are described, including:

(1) the method comprises the steps of introducing a tensor expression language to construct an operational character and provide program conversion primitives, generating programs of different versions through various optimizations by the primitives, expanding the computation/scheduling separation concept of Halide, separating the target hardware essence from the conversion primitives so as to support a new accelerator, introducing the new conversion primitives and supporting deployment to a special accelerator so that different program conversion sequences can be applied to form a rich effective program space for a given operator statement; (2) an automatic program optimization framework is introduced to find an optimized tensor operator, an optimizer takes a machine learning-based performance model as a guide, more data are collected from the rear end of hardware, and the performance model is optimized; (3) on the basis of an automatic code generator, a graph rewriter is introduced, and the graph rewriter makes full use of high-level optimization and operator-level optimization.

In which a mathematical computation may be implemented on a computer in a variety of ways, and different implementations may have different memory accesses and different performance, so that the TVM requires a user to provide a "schedule" to specify how to perform the computation (e.g., circular tiling and ordering, caching, and expanding).

The TVM is guided by the template to search when finding the optimized tensor operator, in which the search space is defined by a manual template, the TVM requires the user to manually write a template for the calculation definition, which defines the structure of the tensor program using some adjustable parameters (e.g., tiling size and expansion factor); the compiler then searches for the best values of these parameters for the particular input shape configuration and the particular hardware target.

The method has good effect on the common deep learning operator.

However, developing templates requires a lot of work, e.g. the code library of TVM already contains more than 15K lines of code for these templates, and this number continues to grow with the advent of new operators and new hardware platforms.

In addition, constructing a high quality template requires expertise in tensor operators and hardware.

Developing high quality templates requires a lot of research effort, although template design is complex, manual templates cover only limited program structures because all optimization options for manually enumerating all operators are prohibitive, this approach usually requires defining a template for each operator, FlexTensor proposes a generic template to cover multiple operators, but its template is still designed for single operator granularity, and does not include optimizations involving multiple operators (e.g., operator fusion); the search space of the multi-operator optimized computation graph should contain different operator combinations, which cannot be achieved by the template-based approach because it cannot decompose and recombine fixed templates, which represent the underlying parameters.

Referring to fig. 2, a common technical means in the art, which is not the point of the present application, is used for automatically generating a template based on TVM, and therefore, the description is not repeated here, and the tensor program structure generated by the automatic template generation technology covers the tensor program structure generated by the manual template generation.

The method comprises the steps of automatically generating templates through a template generator, delivering the generated templates to a template manager, putting template parameter information under the template manager into a configuration manager, then correspondingly configuring a generated program according to the parameters and inputting the program into a performance model, predicting the running time of the performance model on the given hardware back end, determining the configuration effect of the parameters in a search space based on the predicted running time, and simultaneously training the performance model by using the running time information collected during exploration.

Adjustable parameters in the template (e.g., tile size and unroll factor) define the structure of the tensor program for generating the optimizer, and each Schedule (Schedule) determines the difference in memory access and performance.

In the method, one of the conditions for generating the template is determined based on the memory access mode corresponding to the convolution operation in the optimized deep learning network, and the parameters corresponding to the memory access in the template parameters are determined, namely, the search space formed by combining a plurality of template parameters is reduced.

With reference to figure 1 of the drawings,

and S105, adjusting the access locality of the convolution operation based on the search space.

In the method and the device, in the convolution operation process, the data stream in the convolution operation process is determined based on the lower bound of convolution communication, the allocation of the storage position of the data stream forms an optimality condition, at least one adjustable parameter in the configuration template can be determined according to the optimality condition, namely, at least one adjustable parameter in the search space of parameter configuration is fixed, and only other adjustable parameters are needed to be configured, so that the search space of parameter configuration can be reduced.

Since the above optimality condition is determined based on the lower bound of the convolutional communication, the lower bound of the convolutional communication is determined first.

To introduce the lower bound of convolutional communication, we first understand the red-blue pebble model, which is used to estimate the minimum amount of data transfer between two levels of memory, and the derivation of the lower bound of convolutional communication relies on the red-blue pebble model.

The red-blue pebble model is described below:

the red and blue pebble game is carried out based on a directed acyclic graph;

let G (V, E) represent a directed acyclic graph

V is a set of vertices representing the operation of the algorithm; e is a set of edges representing the dependency of two operations; we use the symbol |. to represent the number of vertices of an arbitrary set, e.g. | V | represents the number of vertices in set V.

If there is a partition on G that satisfies the following four properties, the partition is called an S-partition (S-partition).

Properties 1: v is divided into h subsets V₁,V₂,…,V_hSo that Vi do not intersect but their union is V.

Properties 2: each V_iHaving an allocation set D containing at most S vertices_i。

V_iAn dominating set D_iIs a series of nodes in V to which the input from G is applied_iAny path of the node of (2) contains D_iThe node (b).

Properties 3: each V_iAll have a minimum set M containing at most S vertices_i。

V_iIs defined as V_iDoes not have any successor vertices in it that belong to V_iThe set of vertices of (1).

Properties 4: v₁,…,V_hThere is no cyclic dependency between them.

Let P (S) be at least the minimum number of subsets that any S-partition (S-partition) of the directed acyclic graph satisfies.

The following theorem describes the lower communication bound of any directed acyclic graph red-blue pebble model:

for a directed acyclic graph G = (V, E), if the number of red pebbles is at most S (i.e. given a fast memory of capacity S), then the minimum I/O overhead Q (i.e. minimum number of communications) required for any complete calculation of a red and blue pebble game is:

（6）

wherein S is the capacity of the flash memory;

V_iany subset of the directed acyclic graph obtained based on S-partitioning (S-partition);

2S is 2 times the capacity S of the flash memory.

Assuming that the directed acyclic graph G (V, E) describes an n-stage algorithm, it is difficult to directly infer p(s) for the multi-stage algorithm, and we can estimate the lower bound of p(s):

order to

Is a set containing all possibilities of S-partitioning (S-partition) of a directed acyclic graph G (V, E), and each

The elements in (A) represent some of the S-divisions (S-partitions) of G (V, E).

Order to

（7）

Wherein | V | represents the number of vertices in the set V;

h (S) represents the lower bound of the number of S-partition (S-partition) sets of any of G (V, E).

According to equations (6) and (7), the minimum I/O overhead Q (i.e., the minimum number of communications) can be expressed as:

（8）

wherein S is the capacity of the flash memory;

q is the minimum I/O overhead required by any complete calculation of the red and blue pebble game;

2S is 2 times the capacity S of the flash memory.

Thus, for a multi-stage algorithm, only H (2S) needs to be estimated instead of P (2S), and from equation (7), we can see that H (S) depends on

By a value of (c), which means for V_iIs the key to the problem.

For any subset V_i：

Below by V_iRelationship to all sub-computations of G (V, E), estimating V_iThe number of vertices of (2):

first, a multi-stage partition is formalized for the directed acyclic graph G (V, E):

definitions 1.1 suppose that a directed acyclic graph G (V, E) is decomposed into n sub-graphs G₁(U₁,E₁),G₂(U₂,E₂),···,G_n(U_n,E_n) Wherein G is_j(U_j,E_j) Is the corresponding pair calculation.

Then { G }₁(U₁,E₁),···,G_n(U_n,E_n) Is a multi-stage division called G (V, E), if and only if G_j(U_j,E_j) Must be G_j−1(U_j−1,E_j−1) And all U_jAre mutually exclusive.

Wherein any sequence of sub-computations can be represented as a multi-stage partition of a directed acyclic graph representing the total computation.

Suppose { G₁(U₁,E₁),… ,G_n(U_n,E_n) Is a multistage division of G (V, E) since

；

Therefore if we can use S-partitioning (S-partition)Properties 2 and 3 in the definition estimate all | V_i∩U_jAn upper bound of | V (j =1,2, …, n), then it is possible to obtain | V_iThe maximum value of | is.

Next, all | V's are determined by way of recursive analysis_i∩U_j|：

For the jth sub-calculation, assume | V_i∩U_jIf the upper bound of | has been successfully reached, the next question is how to estimate | V_i∩U_j+1|。

U_jThe output set of (2) is recorded as

；

V_iThe dominating set of (D) is denoted as_i；

To determine

Neutral and V_i∩U_j+1In relation to vertices, we introduce the concept of vertex generation:

definition 1.2: in a directed acyclic graph G (V, E), one set of vertices U may generate another set of vertices U ', if and only if each vertex path from the input to U' starting at V contains some of the vertices in U.

Further to the above, it is preferable that,

representing a set containing all vertices that can be generated by U.

Definition 1.2 is a new concept of vertex generation that describes dependencies between vertices.

Due to dominance set D_iOnly V can be generated_iOne of the sets of (1), apparently V_i∩U_j+1All inputs being contained in

In, therefore V_i∩U_j+1All inputs being contained in

In (1),

can be used to infer | V_i∩U_j+1The upper bound of | is.

In summary, if we can deduce | V_i∩U_jI and

by then will

As V_i∩U_j+1To obtain | V_i∩U_j+1I and

is possible.

Based on the above derivation, V is obtained_iRelationships with all subgraphs of the directed acyclic graph include:

V_i∩U_j+1all inputs being contained in

Performing the following steps; (i)

according to

Can determine | V_i∩U_j+1The upper bound of |; (ii)

；（iii）

wherein, for any directed acyclic graph G (V, E):

D_iis a V_iThe dominating set of (2);

is U_jThe output set of (2);

means that one contains all the pieces capable of being D_iThe set of generated vertices.

Based on recursive analysis, | V_i∩U_j+1All upper bounds can be established for | j =1,2, …, n, and we proceed to analyze how to get | V_iThe upper bound of | is.

First, we define the maximum vertex generation function to determine the respective D_iAt V_i∩U_jAnd V_i∩

The number of generated vertices in (2), which specifically includes:

for any integer k and satisfy

Define two vertex generation functions for the jth sub-computation as follows:

（2）

k is an arbitrary integer, D is a number satisfying

A set of vertices of the arbitrary dominance set;

is U_jThe set of outputs of (a) is,

representing a set containing all vertices that can be generated by the set D;

representing the vertex set U n U by D_jThe number of vertices generated;

representing a set of vertices by D

The number of vertices generated in (1);

that is to say that the first and second electrodes,

and

respectively representing the two vertex sets U n U by D_jAnd U #

The number of vertices generated in (1);

wherein the predetermined decomposition rule is definition 1.1.

For any given k and jth sub-computation, the maximum vertex generation function is defined as follows:

（3）

k is an arbitrary integer of formula (2);

for any given k and jth sub-computation,

u # U representing k input vertex generations_jThe number of the largest top points in the middle,

representing k input vertices at

The maximum number of generated vertices;

that is to say that the first and second electrodes,

and

provide U_jAnd

upper bound estimate of the number of vertices in (1).

Second, any set | V of arbitrary S-partitions (S-partitions) of the directed acyclic graph is generated based on the maximum vertex generating function in the first step_iThe upper bound of | is inferred.

We first infer two secondary results:

order to

Is that

Such that for any

Any path from the input set of V to V has at leastOne vertex belongs to

。

Introduction 1:

is that

。

2, leading:

is also that

Next, we can get | V by the above-mentioned theorem 1 and theorem 2_iThe upper bound of | is very important for I/O complexity analysis based on the red and blue pebble game model, namely:

suppose that

Is a multi-stage partition of the directed acyclic graph G (V, E), for an arbitrary S-partition (S-partition) G (V, E)

，|V_iI has an upper bound:

when in use

When the temperature of the water is higher than the set temperature,

take a maximum value, i.e.

And (4) an upper bound.

Lower pair

The parameters in (1) are explained:

（1）

wherein, for any directed acyclic graph G (V, E):

s is the capacity of the flash memory;

、

、

。

in that

In, S is the capacity of the flash memory;

thus, in the formula 3)

Depends on the choice of the ith subset (i.e., which V is_i) But in the formula (1)

The value of (d) is independent of the value of i; thus passing through

Can estimate any subset V_iTo the upper bound, i.e.

Can be used to estimate all subsets V_iThe vertex upper bound of the subset having the largest number of vertices.

In summary, the function is generated based on the maximum vertex and V in the first step_iWith the relationship between all subgraphs, we get

The upper bound of (c).

Third step, based on the obtained

The upper bound of (2) determines the lower communication bound:

from equations (6), (7) and (8), we obtain: given a fast memory with a capacity of S size, in order to realize the calculation of the whole algorithm, the I/O operand Q between the fast memory and the slow memory satisfies:

（5）

wherein, for any directed acyclic graph G (V, E):

s is the capacity of the flash memory;

when Q takes the minimum value, Q is the lower communication bound.

The above-mentioned V-based_iRelationships with all sub-graphs, first and second steps to derive | V_iThe process of the upper bound of | is illustrated:

suppose there is a directed acyclic graph G (U, E) = G consisting of two sub-computations₁(U₁,E₁)+G₂(U₂,E₂) Will V_iAn dominating set D_iThe number of vertices of (a) is denoted as kⁱ。

According to the definition of S-partition (S-partition), we have | D_i|=kⁱLess than or equal to S, adding k_iIs divided into kⁱ=kⁱ ₁+kⁱ ₂Wherein k isⁱ ₁Is V_i∩U₁Input the number of vertices, kⁱ ₂Is V_i∩U₂The number of vertices is input.

For any integer k, we have two functions

And

wherein

Representing a generated U of k input vertices_jThe largest number of vertices in the population.

Therefore, the temperature of the molten metal is controlled,

not more than

And at most generate

One vertex is taken as V_i∩U₂Input of, therefore, V_i∩U₂At most have

An input vertex.

In addition to this, the present invention is,

is effective.

Thus, we have:

（9）

wherein, for any directed acyclic graph G (V, E):

s is the capacity of the flash memory;

will D_iPartitioning vertices into a first set of vertices and a second set of vertices: the number of vertices in the first set of vertices is

The number of sets in the second vertex set is

；

In that

And S is the capacity of the flash memory.

（10）

We get | Vi ≦ T (S), and when | Vi | = T (S), the value of T (S) is the upper bound of | Vi |.

In summary, an algorithm corresponding to any directed acyclic graph can determine a communication lower bound thereof based on a red and blue pebble model, and the red and blue pebble model can effectively describe many computational graph problems, for example: FFT, matrix multiplication, convolution operation, etc., so we can compute the lower bound of the convolution communication based on the above analysis.

The following determines the lower bound of convolutional communications based on the analysis process described above:

in step S102, a possible implementation manner of the embodiment of the present application, determining a lower bound of convolutional communication based on current input data, an on-chip memory space of a current terminal, and a convolutional operation rule, includes:

step S1021 (not shown), generating a convolution directed acyclic graph according to the current input data by a convolution operation rule, and decomposing the convolution directed acyclic graph into at least two convolution input subgraphs.

Wherein, the form of the current input data includes but is not limited to images, matrixes, and the like;

and the directed acyclic graph generated based on the convolution operation rule according to the current input data is a convolution directed acyclic graph.

Referring to fig. 3, the convolution directed acyclic graph is divided into three levels, which are input vertices, product vertices, and addition vertices, the input vertices are arranged according to the convolution order, each multiplication unit in the graph outputs a product term, and the addition tree is used to accumulate the input product terms and obtain the final output sum.

For an input image with a sliding window, let I_iAs the ith size is W_ker×H_ker×C_inSliding the input tensor, let K_jAs the jth size of W_ker×H_ker×C_inConvolution kernel, direct convolution comprising two stages, for each I_iAnd K_j，

The first stage is through I_iAnd K_jPerforming a corresponding element multiplication operation to generate W_kerH_kerC_inThe term product term.

The second stage is to mix I_iAnd K_jThe resulting product terms are summed to form a final output based on the summing tree.

The summing tree is a directed acyclic graph subgraph of a tree structure, the tree has at most 2 incomes of other vertices except input vertices (i.e., product vertices), and all the inputs of the tree are added together and have only one output, and the direct convolution is completed after the summing process.

Step S1022 (not shown) determines the total number of vertices of the convolution directed acyclic graph.

Specifically, the total number of vertices of the convolution directed acyclic graph is calculated as:

step S1023 (not shown), the convolutional directed acyclic graph is divided into at least one subset of convolutional inputs based on S-partitioning.

Specifically, the S-partition (S-partition) can be referred to the S-partition (S-partition) in the aforementioned Red-blue pebble, and will not be described herein in detailThe convolution directed acyclic graph forms at least one convolution input subset after being divided according to S-mode, and any one convolution input subset can be counted as a convolution input subset V for convenient representation_i。

Step S1024 (not shown), determining an upper bound of the number of vertices based on the current input data, any convolution input subset, all convolution input subgraphs, the relationship between any convolution input subset and all convolution input subgraphs, the on-chip memory space of the current terminal, and a preset algorithm.

And the vertex number upper bound is used for representing the vertex number corresponding to the convolution input subset with the maximum vertex number in all the convolution input subsets.

Step S1025 (not shown), determining a convolution communication lower bound based on the upper bound of the number of vertices, the total number of vertices of the convolution directed acyclic graph, the red-blue pebble model, and the on-chip memory space of the current terminal.

In step S1024, determining an upper bound of the number of vertices based on the current input data, any convolution input subset, all convolution input subgraphs, the relationship between any convolution input subset and all convolution input subgraphs, the on-chip memory space of the current terminal, and a preset algorithm, includes:

step S241 (not shown), define any dominating set of any convolution input subset as an input dominating set, divide the input dominating set into a first vertex set and a second vertex set, determine a first vertex number corresponding to the first vertex set, and determine a second vertex number corresponding to the second vertex set.

Specifically, based on step S102 and the aforementioned definition 1.1, the multi-stage division of the directly convolved directed acyclic graph G (V, E) can be written as G (V, E) = G₁(U₁,E₁)∪G₂(U₂,E₂)；

Wherein the convolution input subgraph G_i(U_i,E_i) Representing the corresponding stage of direct convolution, the multiple convolved input subgraphs may be represented as a first convolved input subgraph, a second convolved input subgraph, respectively₁(U₁,E₁) Corresponding to the first stage of the direct convolution described above, the second convolution is input into subgraph G₂(U₂,E₂) Corresponding to the second stage of the direct convolution described above.

Specifically, when the red-blue pebble model is introduced, the S-partition of any directed acyclic graph (V, E) is introduced.

Specifically, for the case where the input dominance set is divided into a first vertex set and a second vertex set, equation (1) is therefore equivalent to equation (9), i.e.: the number of vertices in the first set of vertices is

The number of sets in the second vertex set is

。

In the convolution directed acyclic graph generated in step S1021, the first vertices are input vertices in the first stage, and the input vertices in the second stage do not include output vertices in the first stage, but only include input vertices that do not belong to the first stage but are needed in the second stage.

Step S242 (not shown), determining an upper bound of the number of vertices based on the current input data, any convolution input subset, the input dominating set, the number of first vertices, the number of second vertices, all convolution input subgraphs, the relationship between any convolution input subset and all convolution input subgraphs, the on-chip memory space of the current terminal, and the preset algorithm.

First, in step S242, a possible implementation manner of the embodiment of the present application, where the relationship between any convolution input subset and all convolution input subgraphs includes:

and (5) substituting the convolution directed acyclic graph into the formulas (i) - (iii) to obtain the relation between any convolution input subset and all convolution input subgraphs.

Secondly, a possible implementation manner of the embodiment of the present application, in step S242, determining an upper bound of the number of vertices based on the current input data, any convolution input subset, the input dominating set, the number of first vertices, the number of second vertices, all convolution input subgraphs, a relationship between any convolution input subset and all convolution input subgraphs, an on-chip memory space of the current terminal, and a preset algorithm, includes:

then, substituting the convolution directed acyclic graph into an expression (9), an expression (10), an expression (1), an expression (2) and an expression (3) according to the relation between any convolution input subset and all convolution input subgraphs;

and based on a preset algorithm, obtaining:

（4）

wherein R is the maximum reuse times of each input vertex;

s is the capacity of the flash memory;

The preset algorithm is as follows:

，

and is and

for any integer k₁And k₂Are available.

The demonstration process is as follows:

let U be an arbitrary vertex set, and its dominating set D and minimum set M contain S vertices at most.

Suppose that

。

To estimate

And

let us consider

How many of the vertices can be composed of

And (4) generating.

Due to the fact that

Without internal vertices, all of

The generated vertex can be used as

Is input.

Therefore, the temperature of the molten metal is controlled,

。

due to the fact that

The internal sets of vertices with different summing trees do not intersect each other, so that U may have non-empty intersections with the internal sets of vertices of up to S summing trees, and each intersection has at least one different vertex in the minimum set.

For any with S output vertices

Each vertex can be formally represented as

。

Without loss of generality, weSuppose that

Each of which is

Therein is at least provided with

Item is at

Each is in

In (1)

All have at most

Item is at

In (1).

On the one hand, for

From this set, we can prove

。

In the following, we verify this fact by paradoxical means.

Suppose that

。

Since each one is

In that

Therein is at least provided with

Item, therefore

Set at least relates to

Is/are as follows

Vertices, some of which may be identical.

Since each input vertex can be reused at most R times,

in at least

An independent vertex, an

This is in accordance with the fact

Are in contradiction.

Therefore, the assumption is not established, so

。

Due to the fact that

，

Vertices in relation D do not exceed S.

Therefore, the temperature of the molten metal is controlled,

and

can generate at most the vertex in

Product of the item of

。

On the other hand, for sets

To say, because

At most there are

Item, therefore

And

at most, the vertex in (1) can be formed

And (4) accumulating.

In any case, it is preferable that,

in that

Can generate at most

。

This means that

。

Assuming that a vertex set U has an dominating set D and a minimum set M, satisfy

,

,

。

To deduce

By the upper bound, we only need to consider

And

formed of

。

Due to the fact that

At most, only

Each vertex would be the input to the summing tree.

And one has

The summation tree of input vertices involves

Internal vertices and 1 output vertex.

Thus for

A vertex can be generated at most

A vertex.

Thus, we obtain

。

The current input data is explained below:

the current input data is the following in step S1021: for an input image with a sliding window, let I_iAs the ith size is W_ker×H_ker×C_inSliding the input tensor, let K_jAs the jth size of W_ker×H_ker×C_inA convolution kernel; wherein the maximum reuse number of each input (element) in an input image is denoted as R by different sliding windows, and its value is R = W_kerH_ker/μ²And μ is the size of the stride.

In step S1025, the determining a lower bound of convolution communication based on the upper bound of the number of vertices, the total number of vertices of the convolution directed acyclic graph, the red-blue pebble model, and the on-chip memory space of the current terminal includes:

substituting the convolution directed acyclic graph into formula (5) based on step S1021, step S1024, and formula (5): when Q takes the minimum value, Q is the lower bound of convolutional communication, i.e.:

（11）

in step S103, a possible implementation manner of the embodiment of the present application, determining an optimality condition according to a lower bound of convolutional communication includes:

step S400 (not shown), determining a convolution input sub-graph of the storage space to be allocated based on the convolution communication lower bound.

Specifically, according to the foregoing scheme, for the lower bound of convolutional communication, in equations (9) and (10),

（9）

（10）

the highest order term of the lower bound result of convolutional communication must be bounded by some phi_jIt is decided that since the highest order term of the lower bound of I/O represents the majority of I/O, the highest order term φ_iThe main process involving the most I/O operations is pointed to, so that the I/O of the jth stage is optimized intensively to reduce off-chip access memory of convolution operation, the highest order term is determined by k, and which phi k-order is high selects which stage.

As can be seen from equations (9) and (10), the second sub-calculation corresponding to the second stage of the direct convolution, i.e., the second stage of the direct convolution, mainly determines the lower communication bound of the convolution.

Step S401 (not shown), determining the output sub-block, the size of the output sub-block, and the sizes of the input sub-block and the input sub-block corresponding to the convolution input sub-graph according to the lower bound of convolution communication, where the sizes of the input sub-block and the output sub-block both satisfy the lower bound of convolution communication.

We analyze the calculation process of the second stage of direct convolution as follows:

according to step S101, maximizing data reuse in convolution helps to reduce communication, and the output vertex corresponding to the second stage of convolution operation is a sum value, so we can utilize output reuse, which is a convolution output pixel that resides in on-chip storage until it is completely calculated (all input channels are traversed), and therefore we choose to make the second stage output vertex occupy the current device on-chip memory as much as possible.

With reference to figures 3 and 4 of the drawings,

first, the output sub-block and the size of the output sub-block are determined:

to achieve the aboveFor a sub-block of the output image of size x y x z we tend to have xyz ≈ S/N_pIn which N is_pIs the total number of available processors and S is the on-chip memory space of the current terminal.

If we execute data streams in parallel, it means that the sum of the on-chip memories owned by each processor is fully utilized to generate partial sums, so that the output data can be reused to the maximum extent, and the data transmission in the hierarchical memory is reduced; different sub-blocks are composed of N_pThe processors process in parallel.

In our data stream design, there is a common (W)_outH_outC_out) /(xyz) output sub-block.

Secondly, determining the input sub-blocks and the sizes of the input sub-blocks:

in the convolution operation, in order to calculate the result of x × y × z sub-blocks of the output image, we need inputs of all channel dimensions corresponding to x '× y' (input sub-block) positions and z convolution kernels associated with part of the output channels, i.e. the number of channels of the input picture is equal to the number of channels of the convolution kernels.

Since on-chip memory is limited and needs to be used for storing as many output results as possible, when loading picture input and convolution kernel, continuous loading is needed instead of loading into fast memory at one time, since the input ith channel can only be reused by weighted ith channel, in order to ensure that output sub-block can be as large as possible in limited on-chip memory, we make α =1, that is, our dataflow design requires loading x '× y' patches with fixed number of channels and sliding along the direction of channel dimension; after the input patch of x '× y' and the corresponding z convolution kernel weights are loaded to the on-chip memory, partial sum calculation can be performed on the output sub-blocks;

to update the entire output sub-block, we successively slide the x '× y' input block along the channel direction and load the corresponding inputs and weights and perform a partial update.

To update each sub-block, we need to input x 'y' C from the input image_inElement, W needs to be input from z convolution kernels_kerH_kerC_inz elements, due to

，

，

The I/O size of data read-in is:

and (3) calculating to obtain:

（12）

wherein if and only if the xy = Rz formula holds,

taking the minimum value and the result is the same as that in equation (11)

On the same order of magnitude.

Based on

，

，

The requirement of xy = Rz condition is such that x 'y' = zW_kerH_ker。

It determines the optimal size of each x '× y' patch in order to minimize the memory occupied by the input vertices, i.e., product vertices, of the second stage.

The I/O size of the storage output is W_outH_outC_out，When we make xyz ≈ S/N_pAnd xy = Rz, total amount of I/O being:

（13）

if it is not

And is

Then can ignore

The data stream can reach the lower I/O bound.

Step S402 (not shown), determining the output sub-block as data stored in the on-chip memory space of the current terminal, and determining an input space in the on-chip memory space of the current terminal, where the input space is a space occupied by the input sub-block and a convolution kernel corresponding to the input sub-block, so as to obtain an optimality condition.

The above embodiments describe a method for generating an automatic configuration template from the perspective of a method flow, and the following embodiments describe an apparatus for generating an automatic configuration template from the perspective of a virtual module or a virtual unit, which are described in detail in the following embodiments.

An embodiment of the present application provides an apparatus 100 for automatically configuring template generation, and referring to fig. 5, the apparatus 100 for automatically configuring template generation may specifically include:

an automatic configuration template generation apparatus comprising:

an obtaining module 1001, configured to obtain current input data and an on-chip memory space of a current terminal;

the analysis module 1002 is configured to determine a lower bound of convolutional communication based on current input data, an on-chip memory space of a current terminal, and a convolutional operation rule;

the allocation module 1003 is configured to determine an optimality condition according to the lower bound of the convolutional communication, where the optimality condition is used to indicate a data stream stored in an on-chip memory space of the current terminal in the convolutional operation, and the data stream meets the lower bound of the convolutional communication;

a configuration module 1004, configured to generate a search space according to the optimality condition, where the search space includes at least one configuration template, the configuration template includes multiple adjustable parameters, at least one adjustable parameter is the same in each configuration template, and at least one adjustable parameter is an adjustable parameter corresponding to the optimality condition;

and an optimizing module 1005, configured to adjust the memory access locality of the convolution operation based on the search space.

In a possible implementation manner of this embodiment of the present application, when determining the lower bound of the convolutional communication based on the current input data, the on-chip memory space of the current terminal, and the convolutional operation rule, the analyzing module 1002 is specifically configured to:

generating a convolution directed acyclic graph according to current input data through a convolution operation rule;

decomposing the convolution directed acyclic graph into at least two convolution input subgraphs;

determining the total number of vertexes of the convolution directed acyclic graph;

partitioning a convolutional directed acyclic graph into at least one subset of convolutional inputs based on S-partitioning (S-partitioning);

determining an upper bound of the number of vertexes based on current input data, any convolution input subset, all convolution input subgraphs, the relation between any convolution input subset and all convolution input subgraphs, an on-chip memory space of a current terminal and a preset algorithm;

the upper bound of the number of the vertexes is used for representing the number of the vertexes corresponding to the convolution input subset with the maximum number of the vertexes in all the convolution input subsets;

and determining a convolution communication lower bound based on the vertex number upper bound, the vertex total number of the convolution directed acyclic graph, the red and blue pebble model and the on-chip memory space of the current terminal.

In a possible implementation manner of this embodiment of the present application, when determining the upper bound of the number of vertices based on the current input data, any convolution input subset, all convolution input subgraphs, the relationship between any convolution input subset and all convolution input subgraphs, the on-chip memory space of the current terminal, and the preset algorithm, the analyzing module 1002 is specifically configured to:

defining any dominating set of any convolution input subset as an input dominating set, dividing the input dominating set into a first vertex set and a second vertex set, determining the number of first vertexes corresponding to the first vertex set, and determining the number of second vertexes corresponding to the second vertex set;

and determining the upper bound of the vertex number based on the current input data, any convolution input subset, an input dominating set, the first vertex number, the second vertex number, all convolution input subgraphs, the relation between any convolution input subset and all convolution input subgraphs, the on-chip memory space of the current terminal and a preset algorithm.

In a possible implementation manner of the embodiment of the present application, when determining the relationship between any convolution input subset and all convolution input subgraphs, the analysis module 1002 is specifically configured to:

1） V_i∩U_j+1all inputs being contained in

Performing the following steps; (i)

2) according to

Can determine | V_i∩U_j+1The upper bound of |; (ii)

3）

；（iii）

wherein, for any directed acyclic graph G (V, E):

D_iis a V_iThe dominating set of (2);

is U_jThe output set of (2);

and (5) substituting the convolution directed acyclic graph into the formulas (i) to (iii) to obtain the relation between any convolution input subset and all convolution input subgraphs.

In a possible implementation manner of this embodiment of the present application, the analysis module 1002 is specifically configured to, when determining the upper bound of the vertex number based on the current input data, any convolution input subset, the input dominating set, the first vertex number, the second vertex number, all convolution input subgraphs, the relationship between any convolution input subset and all convolution input subgraphs, the on-chip memory space of the current terminal, and a preset algorithm:

（1）

wherein, for any directed acyclic graph G (V, E):

s is the capacity of the flash memory;

、

、

；

（2）

k is an arbitrary integer, D is a number satisfying

A set of vertices of the arbitrary dominance set;

is U_jThe set of outputs of (a) is,

representing a set containing all vertices that can be generated by the set D;

representing the vertex set U n U by D_jThe number of vertices generated;

representing a set of vertices by D

The number of vertices generated in (1);

（3）

k is an arbitrary integer of formula (2);

for any givenk and the jth sub-calculation,

representing k input vertices at

The maximum number of vertices generated;

substituting the convolution directed acyclic graph into an equation (1) to an equation (3), and obtaining the result based on a preset algorithm:

（4）

wherein R is the maximum reuse times of each input vertex;

s is the capacity of the flash memory;

In a possible implementation manner of the embodiment of the present application, when the analysis module determines the lower bound of the convolution communication based on the upper bound of the number of vertices, the total number of vertices of the convolution directed acyclic graph, the red-blue pebble model, and the on-chip memory space of the current terminal, the analysis module is specifically configured to:

（5）

wherein, for any directed acyclic graph G (V, E):

s is the capacity of the flash memory;

when Q takes the minimum value, Q is the lower communication bound;

and (5) substituting the current input data, the on-chip memory space of the current terminal and the convolution directed acyclic graph into a formula (5), wherein when Q takes the minimum value, Q is the lower bound of convolution communication.

In a possible implementation manner of the embodiment of the present application, when determining the optimality condition according to the lower bound of the convolutional communication, the allocating module 1003 is specifically configured to:

determining a convolution input subgraph of a storage space to be allocated based on a convolution communication lower bound;

determining an output sub-block, the size of the input sub-block and the size of the input sub-block corresponding to the convolution input sub-graph according to the lower bound of convolution communication, wherein the size of the input sub-block and the size of the output sub-block both meet the lower bound of convolution communication;

and determining the output sub-block as data stored in the on-chip memory space of the current terminal, and determining an input space in the on-chip memory space of the current terminal, wherein the input space is a space occupied by the input sub-block and a convolution kernel corresponding to the input sub-block, so as to obtain an optimality condition.

In an embodiment of the present application, with reference to fig. 6, a server 1100 shown in fig. 6 includes: a processor 1101 and a memory 1103.

The processor 1101 is coupled to the memory 1103, such as by a bus 1102.

Optionally, the server 1100 may also include a transceiver 1104.

It should be noted that the transceiver 1104 is not limited to one in practical applications, and the structure of the server 1100 is not limited to the embodiment of the present application.

The processor 1101 may be a CPU (central processing unit), a general purpose processor, a DSP (digital signal processor), an ASIC (application specific integrated circuit), an FPGA (field programmable gate array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.

Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure.

The processor 1101 may also be a combination of computing functions, e.g., comprising one or more microprocessors, DSPs and microprocessors, and the like.

Bus 1102 may include a path that transfers information between the above components.

The bus 1102 may be a PCI (peripheral component interconnect) bus, an EISA (extended industry standard architecture) bus, or the like.

The bus 1102 may be divided into an address bus, a data bus, a control bus, and the like.

For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus.

The memory 1103 may be a ROM (read only memory) or other type of static storage device that can store static information and instructions, a RAM (random access memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (electrically erasable programmable read only memory), a CD-ROM (compact read only memory) or other optical disk storage, optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.

The memory 1103 is used for storing application program codes for executing the present application, and the execution is controlled by the processor 1101.

The processor 1101 is configured to execute application program code stored in the memory 1103 to implement the content shown in the foregoing method embodiments.

Among them, the server includes but is not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like.

The server shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments.

It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows.

The steps are not performed in the exact order shown and may be performed in other orders unless explicitly stated herein.

Moreover, at least a portion of the steps in the flow chart of the figure may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, which are not necessarily performed in sequence, but may be performed alternately or alternately with other steps or at least a portion of the sub-steps or stages of other steps.

The foregoing is only a partial embodiment of the present application, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present application, and these modifications and decorations should also be regarded as the protection scope of the present application.

Claims

1. An automatic configuration template generation method, comprising:

acquiring current input data and an on-chip memory space of a current terminal;

2. The method of claim 1, wherein determining a lower bound for convolutional communication based on the current input data, an on-chip memory space of the current terminal, and a convolutional operation rule comprises:

determining a total number of vertices of the convolved directed acyclic graph;

3. The method of claim 2, wherein determining the upper bound on the number of vertices based on the current input data, any subset of the convolutional inputs, all the convolutional input subgraphs, a relationship between any subset of the convolutional inputs and all the convolutional input subgraphs, an on-chip memory space of the current terminal, and a preset algorithm comprises:

and determining the upper bound of the number of the vertexes based on the current input data, any convolution input subset, the input dominating set, the number of the first vertexes, the number of the second vertexes, all convolution input subgraphs, the relation between any convolution input subset and all the convolution input subgraphs, the on-chip memory space of the current terminal and a preset algorithm.

4. The method of claim 3, wherein the relationship between any subset of the convolved inputs and all of the convolved input subgraphs comprises:

V_i∩U_j+1all inputs being contained in

Performing the following steps; (i)

according to

Can determine | V_i∩U_j+1The upper bound of |; (ii)

；（iii）

wherein, for any directed acyclic graph G (V, E):

D_iis a V_iThe dominating set of (2);

is U_jThe output set of (2);

5. The method of claim 4, wherein determining an upper bound on the number of vertices based on the current input data, any subset of the convolutional inputs, the input dominating set, the first number of vertices, the second number of vertices, all of the convolutional input subgraphs, the relationship between any subset of the convolutional inputs and all of the convolutional input subgraphs, an on-chip memory space of the current terminal, and a preset algorithm comprises:

（1）

wherein, for any directed acyclic graph G (V, E):

s is the capacity of the flash memory;

、

、

；

（2）

k is an arbitrary integer, D is a number satisfying

A set of vertices of the arbitrary dominance set;

is U_jThe set of outputs of (a) is,

representing a set containing all vertices that can be generated by the set D;

representing the vertex set U n U by D_jThe number of vertices generated;

representing a set of vertices by D

The number of vertices generated in (1);

（3）

k is an arbitrary integer of formula (2);

for any given k and jth sub-computation,

representing k input vertices at

The maximum number of vertices generated;

（4）

wherein R is the maximum reuse times of each input vertex;

s is the capacity of the flash memory;

6. The method of claim 5, wherein the determining the lower bound for the convolutional communication based on the upper bound for the number of vertices, the total number of vertices for the convolutional directed acyclic graph, a red-blue pebble model, and an on-chip memory space for the current terminal comprises:

（5）

wherein, for any directed acyclic graph G (V, E):

s is the capacity of the flash memory;

when Q takes the minimum value, Q is the lower communication bound;

7. The method of claim 3, wherein determining an optimality condition based on the lower bound of convolutional communications comprises:

8. An automatic configuration template generation apparatus, comprising:

9. A server, comprising:

one or more processors;

a memory;

one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to: performing the method of automatically configuring template generation of any of claims 1-7.

10. A computer-readable storage medium, comprising: a computer program loadable by a processor and adapted to perform the method of automatically configuring a template generation according to any of claims 1 to 7.