CN116562388A - Method, device and readable storage medium for determining sample batch size - Google Patents

Info

Publication number
CN116562388A
Authority
CN
China
Prior art keywords
sample batch
determining
batch size
deep learning
learning model
Prior art date
Legal status
Pending
Application number
CN202210111790.5A
Other languages
Chinese (zh)
Inventor
吕文媛
淡孝强
曹睿
Current Assignee
Beijing Simm Computing Technology Co ltd
Original Assignee
Beijing Simm Computing Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Simm Computing Technology Co ltd
Priority to CN202210111790.5A
Publication of CN116562388A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An embodiment of the invention discloses a method, an apparatus, and a readable storage medium for determining sample batch size. According to the embodiment, N average calculation cycle numbers of a deep learning model are determined based on N sample batch sizes, where N is a positive integer greater than or equal to 1 and each average calculation cycle number corresponds to one sample batch size; a first candidate sample batch size is determined from the N average calculation cycle numbers; N computational intensities of the deep learning model are determined based on the N sample batch sizes, each computational intensity corresponding to one sample batch size; a second candidate sample batch size is determined from the N computational intensities; and the maximum of the first candidate sample batch size and the second candidate sample batch size is determined as the target sample batch size. In this way, a preferred sample batch size can be determined accurately, so that the deep learning model achieves its best performance.

Description

Method, device and readable storage medium for determining sample batch size
Technical Field
The present invention relates to the field of computer technology, and in particular, to a method, an apparatus, and a readable storage medium for determining a sample batch size.
Background
Deep learning (DL) models are an important part of machine learning. Deployment of a deep learning model on hardware can generally be divided into training scenarios and inference scenarios; in both, the sample batch size (bs) processed at a time affects the degree and speed of training optimization and the performance of inference.
In an inference scenario, although the sample batch size does not affect the accuracy of the inference results, a reasonable choice of sample batch size can improve the inference efficiency of the deep learning model. In inference, the chip uses storage media with relatively limited bandwidth, such as DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory), so bandwidth is an important factor limiting inference performance. If the sample batch size (bs) is too small, the computation of the deep learning model is small and the hardware's compute capability cannot be fully used; if bs is too large, more storage is needed for intermediate calculation results, data may be moved repeatedly between the cache and slower memory during inference, and the hardware bandwidth limit degrades inference performance. The compiler also schedules the computation of the deep learning model according to the hardware bandwidth limit, i.e., it adjusts the computation order of different samples so that intermediate results stay in the hardware cache as much as possible, reducing data movement between caches and improving inference performance. As bs increases, the algorithmic complexity of compiler scheduling increases accordingly; when bs is too large, the scheduling algorithm may become so complex that a schedule with optimal performance is hard to find.
In the prior art, a grid search method is used to traverse the possible sample batch sizes: the deep learning model is actually run on hardware with each candidate batch size and its inference performance is measured, and the sample batch size is then selected accordingly.
In summary, how to accurately determine a preferred sample batch size is a problem that currently needs to be solved.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, an apparatus, and a readable storage medium for determining sample batch size, which can accurately determine a preferred sample batch size so that a deep learning model achieves preferred performance.
In a first aspect, an embodiment of the present invention provides a method for determining a sample batch size, the method comprising:
determining N average calculation cycle numbers of a deep learning model based on N sample batch sizes respectively, wherein N is a positive integer greater than or equal to 1 and each average calculation cycle number corresponds to one sample batch size;
determining a first candidate sample batch size according to the N average calculation cycle numbers;
determining N computational intensities of the deep learning model based on the N sample batch sizes respectively, wherein each computational intensity corresponds to one sample batch size;
determining a second candidate sample batch size according to the N computational intensities;
and determining the maximum of the first candidate sample batch size and the second candidate sample batch size as the target sample batch size.
Optionally, determining the N average calculation cycle numbers of the deep learning model based on the N sample batch sizes specifically includes:
determining, for each of the N sample batch sizes, the calculation cycle numbers of a plurality of operators in the deep learning model, and determining the sum of these calculation cycle numbers as the total calculation cycle number corresponding to that sample batch size;
and determining the ratio of each total calculation cycle number to the corresponding sample batch size as an average calculation cycle number.
Optionally, determining the first candidate sample batch size according to the N average calculation cycle numbers specifically includes:
determining a first inflection point in order of increasing sample batch size, wherein the first inflection point is the inflection point of the N average calculation cycle numbers;
and determining the sample batch size corresponding to the first inflection point as the first candidate sample batch size.
Optionally, determining the first inflection point in order of increasing sample batch size specifically includes:
calculating N logarithms of the N average calculation cycle numbers;
determining N first slope values from the N logarithms;
sorting the N first slope values in order of increasing sample batch size;
and determining, as the first inflection point, the average calculation cycle number corresponding to the first slope value in the sorted sequence that first exceeds a first set threshold.
Optionally, determining the N computational intensities of the deep learning model based on the N sample batch sizes includes:
determining, for each of the N sample batch sizes, the ratio of the total computation amount of the deep learning model to the total memory access amount of the deep learning model as the computational intensity corresponding to that sample batch size.
Optionally, determining the N computational intensities of the deep learning model based on the N sample batch sizes includes:
determining, for each of the N sample batch sizes, the computation amounts of a plurality of operators in the deep learning model;
determining the sum of the computation amounts of the plurality of operators as the total computation amount of the deep learning model;
determining, for each of the N sample batch sizes, the memory access amount of the input and output data of the plurality of operators, the parameter access amount of the plurality of operators, and the access amount of the intermediate calculation results of the plurality of operators;
and determining the sum of the input/output data access amount, the parameter access amount, and the intermediate-result access amount as the total memory access amount of the deep learning model.
Optionally, determining the second candidate sample batch size according to the N computational intensities specifically includes:
determining a second inflection point in order of increasing sample batch size, wherein the second inflection point is the inflection point of the N computational intensities;
and determining the sample batch size corresponding to the second inflection point as the second candidate sample batch size.
Optionally, determining the second inflection point in order of increasing sample batch size specifically includes:
calculating N logarithms of the N computational intensities;
determining N second slope values from the N logarithms;
sorting the N second slope values in order of increasing sample batch size;
and determining, as the second inflection point, the computational intensity corresponding to the second slope value in the sorted sequence that first exceeds a second set threshold.
In a second aspect, embodiments of the present invention provide an apparatus for determining a sample batch size, the apparatus comprising:
a first determining unit, configured to determine N average calculation cycle numbers of a deep learning model based on N sample batch sizes respectively, wherein N is a positive integer greater than or equal to 1 and each average calculation cycle number corresponds to one sample batch size;
a second determining unit, configured to determine a first candidate sample batch size according to the N average calculation cycle numbers;
a third determining unit, configured to determine N computational intensities of the deep learning model based on the N sample batch sizes, wherein each computational intensity corresponds to one sample batch size;
a fourth determining unit, configured to determine a second candidate sample batch size according to the N computational intensities;
and a fifth determining unit, configured to determine the maximum of the first candidate sample batch size and the second candidate sample batch size as the target sample batch size.
Optionally, the first determining unit is specifically configured to:
determine, for each of the N sample batch sizes, the calculation cycle numbers of a plurality of operators in the deep learning model;
determine the sum of these calculation cycle numbers as the total calculation cycle number corresponding to that sample batch size;
and determine the ratio of each total calculation cycle number to the corresponding sample batch size as an average calculation cycle number.
Optionally, the second determining unit is specifically configured to:
determine a first inflection point in order of increasing sample batch size, wherein the first inflection point is the inflection point of the N average calculation cycle numbers;
and determine the sample batch size corresponding to the first inflection point as the first candidate sample batch size.
Optionally, the second determining unit is specifically configured to:
calculate N logarithms of the N average calculation cycle numbers;
determine N first slope values from the N logarithms;
sort the N first slope values in order of increasing sample batch size;
and determine, as the first inflection point, the average calculation cycle number corresponding to the first slope value in the sorted sequence that first exceeds a first set threshold.
Preferably, the third determining unit is specifically configured to: determine, for each of the N sample batch sizes, the ratio of the total computation amount of the deep learning model to the total memory access amount of the deep learning model as the computational intensity corresponding to that sample batch size.
Preferably, determining the N computational intensities of the deep learning model based on the N sample batch sizes includes:
determining, for each of the N sample batch sizes, the computation amounts of a plurality of operators in the deep learning model;
determining the sum of the computation amounts of the plurality of operators as the total computation amount of the deep learning model;
determining, for each of the N sample batch sizes, the memory access amount of the input and output data of the plurality of operators, the parameter access amount of the plurality of operators, and the access amount of the intermediate calculation results of the plurality of operators;
and determining the sum of the input/output data access amount, the parameter access amount, and the intermediate-result access amount as the total memory access amount of the deep learning model.
Optionally, the fourth determining unit is specifically configured to:
determine a second inflection point in order of increasing sample batch size, wherein the second inflection point is the inflection point of the N computational intensities;
and determine the sample batch size corresponding to the second inflection point as the second candidate sample batch size.
Optionally, the fourth determining unit is specifically configured to:
calculate N logarithms of the N computational intensities;
determine N second slope values from the N logarithms;
sort the N second slope values in order of increasing sample batch size;
and determine, as the second inflection point, the computational intensity corresponding to the second slope value in the sorted sequence that first exceeds a second set threshold.
In a third aspect, embodiments of the present invention provide computer program instructions which, when executed by a processor, implement the method of the first aspect or any possible implementation of the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of the first aspect or any possible implementation of the first aspect.
In a fifth aspect, an embodiment of the present invention provides a chip comprising a memory and a processing core, the memory being configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processing core to implement the method of the first aspect or any possible implementation of the first aspect.
In a sixth aspect, an embodiment of the present invention provides a board card, where the board card includes the chip of the fifth aspect.
In a seventh aspect, an embodiment of the present invention provides a server, where the server includes the board card of the sixth aspect.
According to the embodiment of the invention, N average calculation cycle numbers of a deep learning model are determined based on N sample batch sizes, wherein N is a positive integer greater than or equal to 1 and each average calculation cycle number corresponds to one sample batch size; a first candidate sample batch size is determined from the N average calculation cycle numbers; N computational intensities of the deep learning model are determined based on the N sample batch sizes, each computational intensity corresponding to one sample batch size; a second candidate sample batch size is determined from the N computational intensities; and the maximum of the first candidate sample batch size and the second candidate sample batch size is determined as the target sample batch size. In this way, a preferred sample batch size can be determined accurately, so that the deep learning model achieves its best performance.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart of a method for determining sample batch size according to an embodiment of the present invention;
FIG. 2 is a graph showing the relationship between the average calculation cycle number and the sample batch size according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for determining a first candidate sample batch size according to an embodiment of the present invention;
FIG. 4 is a graph showing the relationship between the computational intensity and the sample batch size according to an embodiment of the present invention;
FIG. 5 is a flowchart of a method for determining a second candidate sample batch size according to an embodiment of the present invention;
FIG. 6 is a graph showing the relationship between the computational intensity and the sample batch size according to an embodiment of the present invention;
FIG. 7 is a graph showing the relationship between the average calculation cycle number and the sample batch size according to an embodiment of the present invention;
FIG. 8 is a graph showing the relationship between the computational intensity and the sample batch size according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of an apparatus for determining sample batch size according to an embodiment of the present invention.
Detailed Description
The present disclosure is described below based on embodiments, but it is not limited to these embodiments. In the following detailed description, certain specific details are set forth; those skilled in the art can fully understand the present disclosure without some of these details. Well-known methods, procedures, flows, components, and circuits are not described in detail so as not to obscure the substance of the disclosure.
Moreover, those of ordinary skill in the art will appreciate that the drawings are provided herein for illustrative purposes and that the drawings are not necessarily drawn to scale.
Unless the context clearly requires otherwise, the words "comprise," "comprising," and the like throughout the application are to be construed in an inclusive sense rather than an exclusive or exhaustive one; that is, as meaning "including but not limited to."
In the description of the present disclosure, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present disclosure, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the prior art, deployment of a deep learning model on hardware can generally be divided into a training scenario and an inference scenario; in both, the sample batch size (bs) processed at a time affects the degree and speed of training optimization and the performance of inference. A reasonable sample batch size bs strikes the best balance among hardware bandwidth, memory capacity, hardware compute power, and model structure, thereby exploiting the hardware's computing capability so that model computation and deployment achieve the best performance.
In a training scenario, the sample batch size bs affects training efficiency and accuracy. If bs is too small, the training is hard to converge and may underfit; if bs is too large, the required memory grows correspondingly, and more epochs over the training set are needed to reach the best result, reducing training efficiency. Choosing a reasonable bs therefore improves training efficiency while reducing the amplitude of training oscillation and improving training accuracy. In an inference scenario, the sample batch size does not affect the accuracy of the inference results, but a reasonable choice improves the inference efficiency of the deep learning model. In inference, the chip uses storage media with relatively limited bandwidth, such as DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory), so bandwidth is an important factor limiting inference performance. If bs is too small, the model's computation is small and the hardware's compute capability cannot be fully used; if bs is too large, more storage is needed for intermediate calculation results, data may be moved repeatedly between the cache and slower memory during inference, and the hardware bandwidth limit degrades inference performance. The compiler also schedules the model's computation according to the hardware bandwidth limit, i.e., it adjusts the computation order of different samples so that intermediate results stay in the hardware cache as much as possible, reducing data movement between caches and improving inference performance. As bs increases, the algorithmic complexity of compiler scheduling increases accordingly; when bs is too large, the scheduling algorithm may become so complex that a schedule with optimal performance is hard to find.
In the prior art, a grid search method traverses the possible sample batch sizes: the deep learning model is actually run on hardware with each candidate batch size and its inference performance is measured, and the batch size is then selected. Because the search space of grid search is large and every candidate batch size must actually be run before the best one can be chosen, this brings a huge time cost and the selection efficiency is too low. Therefore, how to accurately determine the optimal sample batch size is a problem that currently needs to be solved.
In an embodiment of the present invention, the inference scenario of a deep learning model is deployed on neural-network processing unit (NPU) hardware. To determine a suitable sample batch size for the deep learning model, a method for determining the sample batch size is provided, as shown in FIG. 1. FIG. 1 is a flowchart of a method for determining sample batch size according to an embodiment of the present invention, which specifically includes:
step S100, based on the size of N sample batches, N average calculation cycle numbers of a deep learning model are respectively determined, wherein N is a positive integer greater than or equal to 1, and each average calculation cycle number corresponds to one sample batch size.
Specifically, determining the N average calculation cycle numbers of the deep learning model based on the N sample batch sizes includes: determining, for each of the N sample batch sizes, the calculation cycle numbers of a plurality of operators in the deep learning model, and determining the sum of these calculation cycle numbers as the total calculation cycle number corresponding to that sample batch size. The plurality of operators in the deep learning model can be selected as required and are generally the core operators of the model; the number of operators is the number of selected operators. The ratio of each total calculation cycle number to the corresponding sample batch size is determined as an average calculation cycle number, the N total calculation cycle numbers corresponding to the N average calculation cycle numbers.
In one possible implementation, based on the structure of the NPU's computing units, the total calculation cycle number is estimated for each candidate sample batch size of the deep learning model; specifically, the total calculation cycle number is the sum of the calculation cycle numbers of the plurality of operators:

C_B = Σ_{i=1}^{n} C_i

where C_i denotes the calculation cycle number of operator OP_i inferred on the NPU, C_B denotes the total calculation cycle number, and n is the total number of operators. The n operators may be all operators in the deep learning model or some of them, e.g., the computation-intensive operators.
In one possible implementation, the ratio of the total calculation cycle number C_B to the sample batch size bs is the average calculation cycle number c_b for inferring a single sample:

c_b = C_B / bs

Specifically, the average calculation cycle number c_b reflects how well the deep learning model utilizes computing resources during computation; the smaller the value of c_b, the higher the utilization of computing resources.
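As a minimal sketch of this step in Python (illustrative only; the patent gives no code, and the function and argument names here are assumptions), the average calculation cycle number for one candidate batch size could be computed as:

    def average_cycles_per_sample(per_operator_cycles, bs):
        """Average calculation cycle number c_b = C_B / bs.

        per_operator_cycles: cycle counts C_i of the selected (e.g. core)
                             operators OP_i, estimated for batch size bs on the NPU.
        bs:                  the sample batch size.
        """
        c_total = sum(per_operator_cycles)  # C_B: total calculation cycle number
        return c_total / bs                 # smaller c_b -> higher utilization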
In one possible implementation, the NPU hardware is provided with multiple MAC (multiply-accumulate) computing units that form a MAC array. When the MAC array matches the shape of the operator input data of the deep learning model, the computing resources are well utilized and the utilization is high; if the MAC array and the shape of the operator input data do not match, some MAC units sit idle and the utilization of computing resources drops. The utilization of computing resources is represented by the model's average calculation cycle number; the sample batch size affects the shape of the operator input data, and the shape of the operator input data in turn affects how well the deep learning model utilizes the NPU's computing units.
Taking the click-through rate (CTR) model in deep learning as an example, the sample batch size bs is proportional to the number of rows m of the left matrix of the matrix multiplication (Matmul), m = k × bs, where k is a coefficient. Whether m is an integer multiple of the number of rows of the MAC array affects the utilization of hardware computing resources: if m = k × bs cannot be divided evenly by the number of rows x of the MAC array, then x − (k × bs) mod x rows of MAC computing resources are wasted. A reasonable bs value maximizes the NPU computing-resource utilization during inference of the deep learning model. For a single operator, as bs increases, the operator's calculation cycle number first grows in step with bs, and then the average calculation cycle number stabilizes. As shown in FIG. 2, when bs is too small, the input data cannot fully occupy the MAC computing units, computation is wasted, and the operator's average theoretical calculation cycle number is high; as bs increases, the operator's utilization of computing resources gradually reaches its maximum and the average theoretical cycle number stabilizes. The overall deep learning model consists of multiple operators, and since the average calculation cycle number of each operator has a similar relationship with bs, the model as a whole can be regarded as the accumulation of its operators; a reasonable bs value therefore maximizes the model's computing-resource utilization. In FIG. 2, the model's utilization of the computing units becomes stable after bs_1, so when the sample batch size is greater than or equal to bs_1, the utilization of the computing units is maximal. In FIG. 2, the dotted lines show the relationship between each operator's calculation cycle number and the sample batch size, and the solid line shows the relationship between the model's average calculation cycle number and the sample batch size.
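The row-utilization argument above can be made concrete with a small sketch (hypothetical helper, not from the patent; x is the number of rows of the MAC array and k the coefficient with m = k × bs):

    def wasted_mac_rows(bs, k, x):
        """MAC-array rows left idle in the last tile when m = k*bs
        is not an integer multiple of the array row count x."""
        remainder = (k * bs) % x
        return 0 if remainder == 0 else x - remainder

    # e.g. with x = 32 and k = 1: bs = 24 wastes 8 rows per tile, bs = 32 wastes none.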
In the embodiment of the present invention, bs_1 in FIG. 2 needs to be determined from the plurality of candidate sample batch sizes; the specific calculation method is as described in step S101.
Step S101, determining the first candidate sample batch size according to the N average calculation cycle numbers.
Specifically, determining the first candidate sample batch size according to the N average calculation cycle numbers, as shown in FIG. 3, includes the following steps:
Step S300, obtaining N logarithms of the N average calculation cycle numbers.
In the embodiment of the invention, because the average calculation cycle numbers are large, logarithms of the average calculation cycle numbers and of the corresponding sample batch sizes are taken to simplify calculation.
Step S301, determining N first slope values according to the N logarithms.
Specifically, since the logarithms of the average calculation cycle numbers and of the corresponding sample batch sizes determine N points, the slope between each pair of points adjacent along the horizontal axis can be determined.
Step S302, sorting the N first slope values in order of increasing sample batch size.
Step S303, determining, as the first inflection point, the average calculation cycle number corresponding to the first slope value in the sorted sequence that first exceeds a first set threshold.
Step S304, determining the sample batch size corresponding to the first inflection point as the first candidate sample batch size.
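Steps S300 to S304 amount to a simple slope test on the log-log curve. A hedged sketch follows (the patent fixes neither the logarithm base nor which endpoint of a slope segment maps to the inflection point; base-2 logarithms and the segment's right endpoint are assumptions here):

    import math

    def first_inflection(batch_sizes, values, threshold):
        """Batch size at the inflection point of `values`.

        batch_sizes: candidate sample batch sizes in increasing order.
        values:      e.g. the N average calculation cycle numbers, one per size.
        threshold:   the set threshold; the first slope above it marks the knee.
        """
        log_bs = [math.log2(b) for b in batch_sizes]
        log_v = [math.log2(v) for v in values]
        for i in range(1, len(batch_sizes)):
            slope = (log_v[i] - log_v[i - 1]) / (log_bs[i] - log_bs[i - 1])
            if slope > threshold:        # first slope above the threshold
                return batch_sizes[i]    # candidate sample batch size
        return batch_sizes[-1]           # no knee found: fall back to the largest

The same routine can serve for the second inflection point in step S103, with the N computational intensities and the second set threshold as inputs.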
Step S102, determining N computational intensities of the deep learning model based on the N sample batch sizes respectively, wherein each computational intensity corresponds to one sample batch size.
Specifically, determining the N computational intensities of the deep learning model includes: determining the ratio of the total computation amount of the deep learning model to its total memory access amount as the computational intensity, the N computational intensities corresponding to the N sample batch sizes. The total computation amount of the deep learning model is the sum of the computation amounts of all operators in the model, or of some of them, e.g., the computation-intensive operators. The computation amount of an operator is the number of floating-point operations performed when the operator computes on a single input sample; it may also be called the time complexity of the deep learning model.
In one possible implementation, the total computation amount of the deep learning model is the sum of the computation amounts of the plurality of operators in the model; the total memory access amount is the sum of the access amount of the model's input and output data, the model's parameter access amount, and the access amount of the model's intermediate calculation results.
For example, the computational intensity I (FLOPs/Byte) of the deep learning model is the ratio of the model's total computation amount R (FLOPs) to its total memory access amount L (Bytes):

I = R / L

The total computation amount R of the deep learning model equals the sum of the computation amounts of all operators in the model; the computation amount of an operator is the number of floating-point operations performed when the operator computes on an input sample, and may also be called the time complexity of the deep learning model.
Specifically, let R_i denote the floating-point operation count of the i-th operator OP_i of the deep learning model for a single sample. When the inference sample batch size is bs, the computation amount of operator OP_i is bs · R_i, and the total computation amount of the deep learning model is:

R = bs · Σ_{i=1}^{n} R_i
in the embodiment of the present invention, the calculation amounts of different operators are different, and the calculation amount formulas of partial calculation intensive operators are listed in the following table 1, which specifically includes the following steps:
TABLE 1
In the embodiment of the invention, the total memory access amount L (Bytes) of the deep learning model represents the number of bytes of storage that must be accessed during the model's computation, and reflects the model's demand on memory bandwidth.
In one possible implementation, on the NPU, since the deep learning model carries data between caches during inference, its memory access amount is represented by the data-carrying volume. The total memory access amount of the deep learning model consists of three parts: the access amount P of the input and output data, the parameter access amount W of the model, and the access amount S of the model's intermediate calculation results. The three parts are described in detail below.
Specifically, regarding the access amount P of the input and output data: since the deep learning model is used for inference, the input data must be carried stage by stage from external memory to the NPU's on-chip cache before being computed, and after the whole model has finished computing, the output data is carried stage by stage from the NPU's on-chip cache back to external memory. The input and output data volumes of the model are therefore proportional to the inference sample batch size bs. Let f_in and f_out denote the per-sample access amounts of the model's input and output data respectively; then the input/output data access amount is P = bs · (f_in + f_out).
Regarding the parameter access amount W of the deep learning model: it represents the access amount of the parameters participating in the model in the inference scenario, e.g., the convolution kernels in a convolution computation. A given parameter may be carried between the cache and external memory one or more times during inference; the variable α_i denotes the number of times the parameters used by operator OP_i are carried, and p_i denotes the parameter access amount used in OP_i's computation. The value of α_i is related to operator OP_i's memory requirement during computation and to the total NPU cache. Specifically, the number of samples b_i of an operator that the hardware can store is determined by the operator's memory requirement S_i and the NPU's cache resource M; for each operator OP_i, the number of samples the NPU can compute and store is at most:

b_i = floor(M / S_i)

In one possible implementation, when the inference sample batch size bs is less than or equal to b_i, operator OP_i can, with reasonable scheduling by the compiler, keep its intermediate and final results resident in the NPU cache, with no data transfer between the cache and slower memory. When bs is greater than b_i, operator OP_i must be split into several computations, each over b_i (or fewer) samples, and the parameters of OP_i are carried onto the NPU α_i times:

α_i = ceil(bs / b_i)

where ceil denotes rounding up.
In the embodiment of the invention, the parameter access amount W of the deep learning model is the sum of the parameter access amounts of all operators:

W = Σ_{i=1}^{n} α_i · p_i
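A sketch of W under the definitions above (hedged: the names are illustrative, and b_i = floor(M / S_i) and α_i = ceil(bs / b_i) are taken from the reconstruction above):

    import math

    def parameter_access(bs, cache_bytes, mem_per_sample, param_bytes):
        """Parameter access amount W = sum_i alpha_i * p_i.

        cache_bytes:    M, the NPU cache resource.
        mem_per_sample: the S_i, per-sample memory requirement of each OP_i.
        param_bytes:    the p_i, parameter access amount of each OP_i.
        """
        W = 0
        for S_i, p_i in zip(mem_per_sample, param_bytes):
            b_i = max(1, cache_bytes // S_i)  # samples storable at once
            alpha_i = math.ceil(bs / b_i)     # carries of OP_i's parameters
            W += alpha_i * p_i
        return W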
the method comprises the steps of aiming at the access quantity S of intermediate calculation results of a deep learning model, wherein the access quantity S of the intermediate calculation results represents that when all intermediate calculation results cannot be stored due to the NPU cache size in the reasoning process of the deep learning model, part of intermediate calculation results need to overflow from an NPU cache to a low cache (namely the external cache above), and when the intermediate calculation results need to be continuously used in subsequent calculation, the intermediate calculation results are loaded from the low cache back to the cache; because bandwidth is a performance bottleneck in deep learning model pushing, compilers typically reduce the data handling of computational intermediate computation results between different cache levels by means of scheduling. Therefore, although the actual access amount of the intermediate calculation result calculated by the deep learning model cannot be determined before the compiler is scheduled, the access amount of the intermediate calculation result can be estimated, specifically, S:
Wherein, the liquid crystal display device comprises a liquid crystal display device,the S is oc bs, as can be seen from the above formula, the minimum value of S is 0, which means that all intermediate results of the deep learning model reasoning flow reside on the cache, and no data needs to be carried between the cache and the low-speed cache; the maximum value of S is->The maximum data quantity for carrying the intermediate calculation result between the cache and the low-speed cache is represented, the maximum data quantity is data of a sample batch size bs which is executed according to the topological sequence of the deep neural model under the non-scheduling optimization, and when the NPU cache cannot accommodate the intermediate calculation result, the intermediate calculation result overflows to the low-speed cache; wherein beta is i Representation operator OP i Number of transfers between cache and cache, s i Representation operator OP i Calculating the data size of the output result; the main reason for multiplying 2 in the above formula is that if the intermediate result is moved out of the cache, the data needs to be reloaded back into the cache when there is a subsequent calculation depending on the intermediate result, and therefore, the amount of access to the intermediate result should be multiplied by two when estimating the amount of access to the intermediate result. Meanwhile, the S is positively related to the sample lot size bs; when the sample batch size is continuously increased, the memory requirement of the intermediate calculation result inferred by the deep learning model is gradually increased, and after the actual buffer capacity of hardware is exceeded, the carrying of data among buffers is correspondingly increased.
Combining the three parts above, the total memory access amount of the deep learning model can be expressed as:

L = P + W + S

The computational intensity can then be expressed as:

I = R / L = (bs · Σ_{i=1}^{n} R_i) / (P + W + S)

Substituting the value range of S into this formula gives:

(bs · Σ_{i=1}^{n} R_i) / (P + W + 2 · bs · Σ_{i=1}^{n} β_i · s_i) ≤ I ≤ (bs · Σ_{i=1}^{n} R_i) / (P + W)

In one possible implementation, in the actual computation of the deep learning model, the access amount of the actual intermediate calculation results can be taken as approximately 0 thanks to compiler scheduling optimization; S = 0 is therefore used in the estimate, giving

I_est = (bs · Σ_{i=1}^{n} R_i) / (P + W)

with I ≈ I_est.
For I_est, as shown in FIG. 4: when the sample batch size bs is small, the computational intensity is small; as bs increases, the computational intensity also increases until it reaches an inflection point bs_2, after which it stabilizes. Therefore, when the sample batch size reaches bs_2, the computational intensity of the deep learning model is considered greatest.
In the embodiment of the present invention, bs_2 in FIG. 4 needs to be determined from the plurality of candidate sample batch sizes; the specific calculation method is as described in step S103.
Step S103, determining the second candidate sample batch size according to the N computational intensities.
Specifically, determining the second candidate sample batch size according to the N computational intensities, as shown in FIG. 5, includes the following steps:
Step S500, obtaining N logarithms of the N computational intensities.
In the embodiment of the invention, because the computational intensities are large, logarithms of the computational intensities and of the corresponding sample batch sizes are taken to simplify calculation.
Step S501, determining N second slope values according to the N logarithms.
Specifically, since the logarithms of the computational intensities and of the corresponding sample batch sizes determine N points, the slope between each pair of points adjacent along the horizontal axis can be determined.
Step S502, sorting the N second slope values in order of increasing sample batch size.
Step S503, determining, as the second inflection point, the computational intensity corresponding to the second slope value in the sorted sequence that first exceeds a second set threshold.
Step S504, determining the sample batch size corresponding to the second inflection point as the second candidate sample batch size.
Step S104, determining the maximum of the first candidate sample batch size and the second candidate sample batch size as the target sample batch size.
For example, assuming the first candidate sample batch size is bs_1 and the second candidate sample batch size is bs_2, max(bs_1, bs_2) is taken as the target sample batch size.
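With the illustrative helpers sketched above (in particular first_inflection from step S101), the whole of steps S100 to S104 reduces to the following sketch, which is not the patent's own code:

    def target_batch_size(batch_sizes, avg_cycles, intensities, t1, t2):
        """bs = max(bs1, bs2), per step S104."""
        bs1 = first_inflection(batch_sizes, avg_cycles, t1)   # step S101
        bs2 = first_inflection(batch_sizes, intensities, t2)  # step S103
        return max(bs1, bs2)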
In one possible implementation, the bs values corresponding to the inflection points in FIG. 2 and FIG. 4 are selected. The reason for not simply setting one sufficiently large bs is as follows:
First, when estimating the computational intensity of the deep learning model, the intermediate-result access amount S is estimated as zero, whereas in actual computation S grows as the sample batch size grows. When bs is small, the estimate of the model's computational intensity is quite accurate; as bs grows, because S is positively correlated with the sample batch size, the deviation between the estimated access amount L_est and the actual total access amount L_em of the deep learning model increases, i.e., L_em ≥ L_est and the gap L_em − L_est widens with increasing bs. Thus, as shown in FIG. 6, at large sample batch sizes the actual computational intensity I_em of the deep learning model tends to stabilize at a value slightly smaller than I_est. It follows that selecting an overly large sample batch size may reduce the actual computational intensity.
Second, when the compiler performs scheduling optimization on the deep learning model, an optimization algorithm searches, within the given sample batch size range, for the model's optimal inference order so as to reduce the access amount of the model's intermediate calculation results; the larger the sample batch size, the higher the complexity of the optimization algorithm. For some choices of scheduling algorithm, increasing the sample batch size makes the algorithm's complexity grow exponentially, which is unacceptable in practice.
A detailed description of determining sample batch size is provided below with respect to one embodiment.
Assume, by way of example, that an optimal inference sample batch size is sought for the resnet50 model on NPU 1 hardware: NPU 1 has 8 MAC computing units capable of efficient matrix operations, with MAC shapes of 32×64 (float16) and 32×128 (int8); each computing unit has an exclusive 1280 KB level-1 cache, and the 8 computing units share a sufficiently large level-2 cache. Given the eight-computing-unit hardware structure and MAC shapes of NPU 1, only powers of 2 need to be considered as candidate sample batch sizes in practice.
When estimating the computational intensity of the resnet50 model, the level-1 cache on each computing unit is used as the main cache; the cache capacity of NPU 1 is 8 × 1280 KB = 10240 KB, and the total memory access amount of the resnet50 model refers to the data exchanged between the level-1 and level-2 caches.
Under these premises, the computational utilization of the MAC array on NPU 1 by the resnet50 model, i.e., the model's average calculation cycle number, is estimated. Table 2 below lists the calculation cycle numbers of operators of various computation types on NPU 1:
TABLE 2
In the embodiment of the present invention, assuming the candidate sample batch sizes are 1, 2, 4, 8, 16, 32, 64 and 128, calculating the average calculation cycle number of the Resnet50 model at each sample batch size gives the data shown in Table 3:
TABLE 3 Table 3
In one possible implementation, to find the inflection point bs_1 of the Resnet50 model from the average calculation cycle numbers, logarithms of the average calculation cycle numbers (avg cycle num) and of the sample batch sizes are taken, giving Table 4:
TABLE 4 Table 4
FIG. 7 is drawn from the logarithmic values corresponding to each sample batch size in Table 4, and the slope value (slope) of the line connecting each pair of adjacent points on the curve in FIG. 7 is determined, as shown in Table 5:
TABLE 5
Assuming the first set threshold is −0.02, the sample batch size whose slope value first exceeds the threshold is determined as the first candidate sample batch size: −0.01 is the first value greater than the first set threshold, and the bs value 8 corresponding to −0.01 is determined as the first candidate sample batch size, i.e., the value of inflection point bs_1 is 8.
In one possible implementation, the computational intensities of the Resnet50 model at different sample batch sizes, calculated with the formulas in the method above, are shown in Table 6 below:
TABLE 6
Following the same procedure as for Tables 4 and 5, logarithms are first taken and slopes then calculated; finally, FIG. 8 is drawn from the logarithmic values corresponding to each sample batch size, the slope value (slope) of the line connecting each pair of adjacent points on the curve in FIG. 8 is determined, and the sample batch size whose slope value first exceeds the second set threshold is determined as the second candidate sample batch size: the corresponding bs value 32 is the second candidate sample batch size, i.e., the value of inflection point bs_2 is 32.
In the above specific embodiment, bs_1 = 8, bs_2 = 32, and bs = max(bs_1, bs_2) = 32.
In one possible implementation, the method can also be applied to computing hardware other than NPUs, such as GPUs. For a GPU, parameters of the specific architecture such as the numbers of SPs (streaming processors) and SMs (streaming multiprocessors) and the warp size can serve as the basis for computing the average calculation cycle number, while the capacity of the GPU's shared memory and the data movement between shared memory and global memory can be used to estimate the model's computational intensity; with these indices, the optimal inference sample batch size of the model on the GPU can be estimated.
FIG. 9 is a schematic diagram of an apparatus for determining sample batch size according to an embodiment of the present invention. As shown in FIG. 9, the apparatus of this embodiment includes a first determining unit 901, a second determining unit 902, a third determining unit 903, a fourth determining unit 904, and a fifth determining unit 905.
The first determining unit 901 is configured to determine N average calculation cycle numbers of a deep learning model based on N sample batch sizes, where N is a positive integer greater than or equal to 1 and each average calculation cycle number corresponds to one sample batch size; the second determining unit 902 is configured to determine a first candidate sample batch size according to the N average calculation cycle numbers; the third determining unit 903 is configured to determine N computational intensities of the deep learning model based on the N sample batch sizes, where each computational intensity corresponds to one sample batch size; the fourth determining unit 904 is configured to determine a second candidate sample batch size according to the N computational intensities; and the fifth determining unit 905 is configured to determine the maximum of the first candidate sample batch size and the second candidate sample batch size as the target sample batch size.
Optionally, the first determining unit is specifically configured to:
determine, for each of the N sample batch sizes, the calculation cycle numbers of a plurality of operators in the deep learning model;
determine the sum of these calculation cycle numbers as the total calculation cycle number corresponding to that sample batch size;
and determine the ratio of each total calculation cycle number to the corresponding sample batch size as an average calculation cycle number.
Optionally, the second determining unit is specifically configured to:
determine a first inflection point in order of increasing sample batch size, wherein the first inflection point is the inflection point of the N average calculation cycle numbers;
and determine the sample batch size corresponding to the first inflection point as the first candidate sample batch size.
Optionally, the second determining unit is specifically configured to:
calculate N logarithms of the N average calculation cycle numbers;
determine N first slope values from the N logarithms;
sort the N first slope values in order of increasing sample batch size;
and determine, as the first inflection point, the average calculation cycle number corresponding to the first slope value in the sorted sequence that first exceeds a first set threshold.
Preferably, the third determining unit is specifically configured to: determine, for each of the N sample batch sizes, the ratio of the total computation amount of the deep learning model to the total memory access amount of the deep learning model as the computational intensity corresponding to that sample batch size.
Preferably, determining the N computational intensities of the deep learning model based on the N sample batch sizes includes:
determining, for each of the N sample batch sizes, the computation amounts of a plurality of operators in the deep learning model;
determining the sum of the computation amounts of the plurality of operators as the total computation amount of the deep learning model;
determining, for each of the N sample batch sizes, the memory access amount of the input and output data of the plurality of operators, the parameter access amount of the plurality of operators, and the access amount of the intermediate calculation results of the plurality of operators;
and determining the sum of the input/output data access amount, the parameter access amount, and the intermediate-result access amount as the total memory access amount of the deep learning model.
Optionally, the fourth determining unit is specifically configured to:
determining a second inflection point in ascending order of the N sample batch sizes, wherein the second inflection point is the inflection point of the N computing intensities;
and determining the sample batch size corresponding to the second inflection point as a second candidate sample batch size.
Optionally, the fourth determining unit is specifically configured to:
calculating N logarithms of the N computing intensities;
determining N second slope values from the N logarithms;
sorting the N second slope values in ascending order of sample batch size;
and determining, as the second inflection point, the computing intensity corresponding to the first slope value in the sorted sequence that exceeds a second set threshold.
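Tying the two candidates together: the sketch below reuses `first_inflection` from the earlier block. Note that the text applies the same greater-than test to both curves, whereas computing intensity typically rises and then saturates with batch size, so the intensity branch here flips the comparison; that flip is my interpretation, not the literal wording, and the thresholds remain illustrative:

```python
import math

def intensity_inflection(batch_sizes, intensities, threshold):
    """Mirror of `first_inflection` for an increasing curve: return the
    batch size at which the log-space slope first falls below `threshold`,
    i.e. where computing intensity stops growing appreciably."""
    logs = [math.log(v) for v in intensities]
    for i in range(1, len(logs)):
        slope = (logs[i] - logs[i - 1]) / (batch_sizes[i] - batch_sizes[i - 1])
        if slope < threshold:       # intensity curve has flattened
            return batch_sizes[i]
    return batch_sizes[-1]

def target_batch_size(batch_sizes, avg_cycles, intensities):
    """Final step of the method: the target sample batch size is the
    maximum of the two candidate batch sizes."""
    first_candidate = first_inflection(batch_sizes, avg_cycles, threshold=-0.01)
    second_candidate = intensity_inflection(batch_sizes, intensities, threshold=0.01)
    return max(first_candidate, second_candidate)

batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128]
avg_cycles = [3000 / b + 80 for b in batch_sizes]
intensities = [200 * b / (b + 25) for b in batch_sizes]  # rises, then saturates
print(target_batch_size(batch_sizes, avg_cycles, intensities))  # -> 64
```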
An embodiment of the present invention also provides computer program instructions which, when executed by a processor, implement the method of any of the above embodiments.
An embodiment of the present invention also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of any of the above embodiments.
An embodiment of the present invention provides a chip including a memory for storing one or more computer program instructions, and a processing core, where the one or more computer program instructions are executed by the processing core to implement the method of any of the above embodiments.
An embodiment of the present invention provides a board card comprising the above chip.
An embodiment of the present invention provides a server comprising the above board card.
As will be appreciated by one skilled in the art, aspects of embodiments of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of embodiments of the invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module," or "system." Furthermore, aspects of embodiments of the invention may take the form of a computer program product embodied in one or more computer-readable media having computer-readable program code embodied thereon.
Any combination of one or more computer readable media may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of embodiments of the present invention, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of embodiments of the present invention may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
Aspects of embodiments of the present invention are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and various modifications and variations may be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of determining sample batch size, the method comprising:
determining, based on N sample batch sizes, N average calculation cycle numbers of a deep learning model respectively, wherein N is a positive integer greater than or equal to 1, and each average calculation cycle number corresponds to one sample batch size;
determining a first candidate sample batch size according to the N average calculation cycle numbers;
respectively determining N computing intensities of the deep learning model based on the N sample batch sizes, wherein each computing intensity corresponds to one sample batch size;
determining a second candidate sample batch size according to the N computing intensities;
and determining the maximum value of the first candidate sample batch size and the second candidate sample batch size as the target sample batch size.
2. The method of claim 1, wherein determining the N average calculation cycle numbers of the deep learning model based on the N sample batch sizes respectively comprises:
determining, for each of the N sample batch sizes, the calculation cycle numbers of a plurality of operators in the deep learning model;
determining the sum of the calculation cycle numbers of the plurality of operators as the total calculation cycle number corresponding to that sample batch size;
and determining the ratio of each total calculation cycle number to the corresponding sample batch size as the average calculation cycle number.
3. The method according to claim 1 or 2, wherein determining the first candidate sample batch size from the N average calculation cycle numbers comprises:
determining a first inflection point in ascending order of the N sample batch sizes, wherein the first inflection point is the inflection point of the N average calculation cycle numbers;
and determining the sample batch size corresponding to the first inflection point as a first candidate sample batch size.
4. The method according to claim 3, wherein determining the first inflection point in ascending order of the N sample batch sizes comprises:
calculating N logarithms of the N average calculation cycle numbers;
determining N first slope values from the N logarithms;
sorting the N first slope values in ascending order of sample batch size;
and determining, as the first inflection point, the average calculation cycle number corresponding to the first slope value in the sorted sequence that exceeds a first set threshold.
5. The method of claim 1, wherein determining the N computing intensities of the deep learning model based on the N sample batch sizes respectively comprises:
determining, for each of the N sample batch sizes, the ratio of the total computation amount of the deep learning model to the total memory access amount of the deep learning model as the computing intensity corresponding to that sample batch size.
6. The method of claim 5, wherein determining the N computing intensities of the deep learning model based on the N sample batch sizes respectively comprises:
determining, for each of the N sample batch sizes, the computation amounts of a plurality of operators in the deep learning model;
determining the sum of the computation amounts of the plurality of operators as the total computation amount of the deep learning model;
determining, for each of the N sample batch sizes, the memory access amount of the input and output data of the plurality of operators, the memory access amount of the parameters of the plurality of operators, and the memory access amount of the intermediate calculation results of the plurality of operators;
and determining the sum of these three memory access amounts as the total memory access amount of the deep learning model.
7. The method of claim 1, wherein determining the second candidate sample batch size from the N computing intensities comprises:
determining a second inflection point in ascending order of the N sample batch sizes, wherein the second inflection point is the inflection point of the N computing intensities;
and determining the sample batch size corresponding to the second inflection point as a second candidate sample batch size.
8. The method of claim 7, wherein determining the second inflection point in ascending order of the N sample batch sizes comprises:
calculating N logarithms of the N computing intensities;
determining N second slope values from the N logarithms;
sorting the N second slope values in ascending order of sample batch size;
and determining, as the second inflection point, the computing intensity corresponding to the first slope value in the sorted sequence that exceeds a second set threshold.
9. An apparatus for determining sample batch size, the apparatus comprising:
the first determining unit is used for respectively determining N average calculation cycle numbers of the deep learning model based on N sample batch sizes, wherein N is a positive integer greater than or equal to 1, and each average calculation cycle number corresponds to one sample batch size;
a second determining unit, configured to determine a first candidate sample batch size according to the N average calculation cycle numbers;
a third determining unit, configured to determine N computing intensities of a deep learning model, based on the N sample batch sizes, where each computing intensity corresponds to one sample batch size;
a fourth determining unit configured to determine a second candidate sample batch size according to the N calculation intensities;
and a fifth determining unit configured to determine a maximum value of the first candidate sample batch size and the second candidate sample batch size as the target sample batch size.
10. Computer program instructions which, when executed by a processor, implement the method according to any one of claims 1 to 8.
CN202210111790.5A 2022-01-29 2022-01-29 Method, device and readable storage medium for determining sample batch size Pending CN116562388A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210111790.5A CN116562388A (en) 2022-01-29 2022-01-29 Method, device and readable storage medium for determining sample batch size


Publications (1)

Publication Number Publication Date
CN116562388A (en) 2023-08-08

Family

ID=87492002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210111790.5A Pending CN116562388A (en) 2022-01-29 2022-01-29 Method, device and readable storage medium for determining sample batch size

Country Status (1)

Country Link
CN (1) CN116562388A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination