CN113240077B - Tensor processing method and system

Tensor processing method and system

Info

Publication number
CN113240077B
CN113240077B (application CN202110458766.4A)
Authority
CN
China
Prior art keywords
tensor
input
areas
area
input area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110458766.4A
Other languages
Chinese (zh)
Other versions
CN113240077A (en)
Inventor
李国亮
李锐
张磊
杨勤富
钱军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hanbo Semiconductor Shanghai Co ltd
Original Assignee
Hanbo Semiconductor Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hanbo Semiconductor Shanghai Co ltd filed Critical Hanbo Semiconductor Shanghai Co ltd
Priority to CN202110458766.4A priority Critical patent/CN113240077B/en
Publication of CN113240077A publication Critical patent/CN113240077A/en
Application granted granted Critical
Publication of CN113240077B publication Critical patent/CN113240077B/en
Legal status: Active
Anticipated expiration: not listed

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation using electronic means
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)
  • Image Generation (AREA)

Abstract

The application provides a tensor processing method and system, wherein the method comprises the following steps: determining, based on working parameter information of a hardware accelerator, information on the number of regions into which a tensor to be processed is divided; determining at least two input regions corresponding to the tensor to be processed based on the information on the number of divided regions; and executing predetermined processing on the input regions through the hardware accelerator to obtain output regions respectively corresponding to the input regions. On the one hand, the tensor processing method can break through hardware limitations with a small amount of calculation and a high processing speed; on the other hand, when the parts are processed, the processing loads of the hardware accelerators involved are relatively balanced, which improves the overall efficiency of processing the original tensor to be processed.

Description

Tensor processing method and system
Technical Field
The application relates to the field of computer information processing, in particular to a tensor processing technology.
Background
As the processing power of computer systems has grown enormously, machine learning techniques (e.g., deep learning neural networks) have come into wide use. In some cases, one needs to perform convolution calculations on two-dimensional or higher-dimensional tensors by computer. In actual computation, some parameters of the tensor may exceed the relevant limits of the hardware accelerator (e.g., the height or width of the tensor may exceed the accelerator's corresponding limits, or the accelerator's on-chip memory, available computing resources, etc. may be limited), which is inconvenient.
Disclosure of Invention
An object of the present application is to provide a tensor processing method and a tensor processing system.
According to an embodiment of the present application, there is provided a tensor processing method including:
determining, based on working parameter information of a hardware accelerator, information on the number of regions into which a tensor to be processed is divided;
determining at least two input regions corresponding to the tensor to be processed based on the information on the number of divided regions; and
executing predetermined processing on the input regions through the hardware accelerator to obtain output regions respectively corresponding to the input regions.
According to another embodiment of the present application, there is also provided a tensor processing system, wherein the system includes at least:
a region-number determining device, wherein the region-number determining device is configured to:
determine, based on working parameter information of a hardware accelerator, information on the number of regions into which a tensor to be processed is divided;
an input region determining device, wherein the input region determining device is configured to:
determine at least two input regions corresponding to the tensor to be processed based on the information on the number of divided regions; and
a region processing device, wherein the region processing device is configured to:
execute predetermined processing on the input regions through the hardware accelerator to obtain output regions respectively corresponding to the input regions.
According to another embodiment of the present application, there is provided a computer-readable storage medium having stored thereon a computer program that, when executed, is capable of carrying out the operations of any one of the methods described above.
According to another embodiment of the present application, there is provided an electronic device including at least:
one or more processors;
a memory for storing executable instructions;
the one or more processors are configured to implement, via the executable instructions, the operations of any of the methods above.
Compared with the prior art, the present application divides a large-size tensor into a plurality of input regions and processes the divided input regions separately. Specifically, the number of regions into which the large-size tensor to be processed is divided is determined first, and the basis for that determination is the working parameters of the hardware accelerator; the tensor to be processed is then divided into the corresponding number of regions, each region is calculated correspondingly, and the resulting output regions can be used for subsequent processing, for example spliced into the output corresponding to the original large-size tensor. Although the widely used processing method of splitting the convolution kernel can also solve the above problem, that method needs to split the input tensor correspondingly; that is, for each block of the convolution kernel a partial sum is obtained first, and these partial sums finally need to be further accumulated to recover the result of the whole convolution kernel, which consumes additional hardware resources. Compared with the prior art, on the one hand, the tensor processing method of the present application can break through hardware limitations with a small amount of calculation and a high processing speed; on the other hand, because the number of divided regions of the original tensor to be processed is determined according to the processing capacity of the hardware accelerators, the processing loads of the hardware accelerators involved are relatively balanced when the parts are processed, the situation in which some accelerators wait for others to finish is avoided, and the efficiency of processing the original tensor to be processed is improved as a whole.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a flow diagram of a tensor processing method in one embodiment of the present application;
FIGS. 2a to 2d are schematic views of the position of an input region in the original tensor under different conditions;
figs. 3a to 3d are schematic diagrams, for different situations, of the output tensor corresponding to an input tensor and of the distribution within that output tensor of the output region corresponding to an input region;
fig. 4a to 4d show different cases of convolution processing, respectively.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The tensor processing method provided by the present application is suitable for processing tensors, particularly large-size tensors, through a hardware accelerator, for example calculating a convolution over a tensor; the method is also applicable to max pooling and average pooling. Taking the convolution of a tensor as an example, a hardware accelerator generally has various limitations, such as the physical limits of on-chip memory, limits on other computing resources, and the resulting limits on the height and width of the input tensor. Under such resource constraints, the prior art often splits the convolution kernel to work around the limits, but splitting the kernel requires splitting the input tensor correspondingly; that is, each block of the convolution kernel first yields a partial sum, and the partial sums must finally be accumulated to recover the result of the whole kernel, which consumes additional hardware resources. The present scheme is not based on splitting the convolution kernel; instead, it determines the blocks of the large-size tensor according to the relevant hardware parameters and then processes the blocks separately. Compared with splitting the kernel, this saves the hardware resources spent obtaining and accumulating partial sums, can process a large-size tensor within the limits of the hardware resources, and has a small amount of calculation and a high processing speed.
In addition, since the number of divided regions of the original tensor to be processed is determined according to the processing capacity of the hardware accelerators, the processing loads of the accelerators involved are relatively balanced when the parts are processed; the situation in which some accelerators wait for others to finish is avoided, and the efficiency of processing the original tensor to be processed is improved as a whole.
The following describes in detail various embodiments of the present application, taking a tensor processing apparatus as an example.
Referring to fig. 1, the present application provides a tensor processing method. The method includes step S100, step S200, and step S300.
In step S100, the tensor processing device determines, based on the working parameter information of the hardware accelerator (for example, the limit on the height or width of a tensor input at one time), information on the number of regions into which the tensor to be processed is divided. Then, in step S200, the tensor processing device determines at least two input regions corresponding to the tensor to be processed based on that information; for example, it determines a start position (for example, start boundary coordinates, such as the coordinates of the element located at the start position) and an end position (for example, end boundary coordinates, such as the coordinates of the element located at the end position) of each input region. Next, in step S300, the tensor processing device performs predetermined processing on each input region (for example, calculating a convolution over the region) through the hardware accelerator to obtain the output regions respectively corresponding to the input regions. In some embodiments, each output region is merged with the output regions corresponding to the other input regions to form the output tensor corresponding to the original tensor to be processed.
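The three steps can be sketched in Python; this is a hedged illustration rather than the patent's implementation, assuming the only working parameter is a hypothetical per-accelerator limit on input height, and ignoring for now the overlap that convolution requires between adjacent regions (covered later):

```python
import math

def plan_regions(tensor_height, accel_height_limit):
    """Step S100: derive the number of divided regions from the
    accelerator's working parameter (here: a max input height)."""
    return math.ceil(tensor_height / accel_height_limit)

def split_heights(tensor_height, num_regions):
    """Step S200: inclusive start/end row coordinates of each input
    region (no overlap in this simplified sketch)."""
    base, extra = divmod(tensor_height, num_regions)
    regions, start = [], 0
    for i in range(num_regions):
        h = base + (1 if i < extra else 0)  # spread the remainder
        regions.append((start, start + h - 1))
        start += h
    return regions
```

For example, a 100-row tensor with a 32-row limit yields 4 regions, and `split_heights` tiles the rows without gaps; step S300 would then hand each `(start, end)` pair to an accelerator.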
For ease of understanding, the embodiments of the present application are described by taking the original tensor to be processed as a two-dimensional tensor. Referring to fig. 2a, the two-dimensional tensor to be processed is discussed in terms of the current input region marked in the figure. The coordinates of the start boundary and end boundary of the current input region are both row coordinates (that is, the original tensor is divided into a plurality of regions distributed along the height direction; in this case the start and end boundaries of each input region are its upper and lower boundaries, and the height of each input region is constrained by the accelerator's limit on input-tensor height), or the coordinates of the start boundary and end boundary of the current input region are both column coordinates (that is, the original two-dimensional tensor is divided into a plurality of regions distributed along the width direction; in this case the start and end boundaries of each input region are its left and right boundaries, and the width of each input region is constrained by the accelerator's limit on input-tensor width). Of course, those skilled in the art will appreciate that the following description is by way of example only and is not intended to limit the embodiments of the present application. In particular, when the original tensor to be processed is a three-dimensional tensor, the start boundary coordinate and the end boundary coordinate of the current input region are both coordinates along the length, width, or height dimension; tensors of higher dimensionality can be handled by analogy and are not described further.
Here, the tensor to be processed is in some embodiments a large-size tensor, for example one whose size, or the hardware resources required to process it, exceeds the limits of the current hardware accelerator. Each input region is a part of the tensor to be processed (i.e., each input region is a part of the original tensor to be processed).
For the case where the original tensor to be processed is a two-dimensional tensor, without limitation, each input region and the region covered by the two-dimensional tensor to be processed are both rectangular regions, and in different embodiments each input region may be one of the following cases:
a1) the width is consistent with the original two-dimensional tensor to be processed, and the height is smaller than the original two-dimensional tensor to be processed;
a2) the height is consistent with the original two-dimensional tensor to be processed, and the width is smaller than the original two-dimensional tensor to be processed;
a3) the height and the width are both smaller than the original two-dimensional tensor to be processed.
For the case where the original tensor to be processed is a three-dimensional tensor, without limitation, each input region and the region covered by the three-dimensional tensor to be processed are box-shaped (cuboid) regions, and in different embodiments each input region may be one of the following cases:
b1) the width and the length of the three-dimensional tensor are consistent with those of the original three-dimensional tensor to be processed, and the height of the three-dimensional tensor is smaller than that of the original three-dimensional tensor to be processed;
b2) the height and the length are consistent with the original three-dimensional tensor to be processed, and the width is smaller than the original three-dimensional tensor to be processed;
b3) the height and the width are consistent with the original three-dimensional tensor to be processed, and the length is smaller than the original three-dimensional tensor to be processed;
b4) at least two of the height, the width and the length are smaller than the original three-dimensional tensor to be processed.
The case of input tensors of higher dimensions is similar; the skilled person can reason by analogy from the above examples without creative effort, and for simplicity the examples are not expanded further.
Of course, those skilled in the art should understand that the above listed situations are only examples, and are intended to illustrate that the above input area is a part of the tensor to be processed, and not to limit any specific embodiment of the present application; other existing or future divisions of the original to-be-processed tensor, as applicable to the present application, are also included within the scope of the present application and are incorporated by reference herein.
Next, the case where the tensor to be processed is a two-dimensional tensor is described. Still with respect to the current input region, the start boundary coordinates may be determined in several different ways depending on the actual situation. For simplicity, the following expands on case a1) above, i.e., the case where the region's width coincides with the original two-dimensional tensor and its height is smaller than the original two-dimensional tensor. It should be understood that other embodiments, such as a2) and a3), are the same or substantially the same; those skilled in the art can make corresponding modifications according to the actual situation, and the modified embodiments are also covered by the protection scope of the present application. As an example, the original two-dimensional tensor to be processed is divided into two regions. Figs. 2a to 2d respectively show the position of the current input region in the original two-dimensional tensor in different situations; in order to ensure that adjacent input regions generate adjacent output regions, adjacent input regions should overlap. Figs. 3a, 3b, 3c, and 3d correspond to figs. 2a, 2b, 2c, and 2d, respectively, and show the output tensors corresponding to the two-dimensional tensors to be processed, as well as the distribution, within those output tensors, of the first output region corresponding to the current input region.
Specifically, in some embodiments, in step S200, the tensor processing device determines, based on the information on the number of divided regions, the start boundary coordinates and end boundary coordinates of at least two input regions corresponding to the tensor to be processed, where the start and end boundary coordinates are used to determine the corresponding input regions. For example, in some embodiments, the start and end boundary coordinates of each input region are determined using the number of regions determined in the previous step as a parameter. Each determined input region is then fed to a corresponding hardware accelerator for subsequent processing. On the one hand, the parts of the divided original tensor need not be calculated sequentially one by one; once their respective ranges are determined, they can be input into corresponding hardware accelerators for distributed processing, making full use of the processing capacity of multiple hardware accelerators. On the other hand, the start and end boundaries of each input region are determined based on the number of divided regions, and that number is in turn determined based on the working parameters of each hardware accelerator, so the whole process of handling the original tensor to be processed never exceeds the working-parameter limits of any hardware accelerator.
In some embodiments, step S200 further includes sub-step S210 and sub-step S220 (both not shown). In sub-step S210, the tensor processing device determines, based on the information on the number of divided regions, the end boundary coordinates of at least two input regions corresponding to the tensor to be processed. In sub-step S220, the tensor processing device determines the start boundary coordinates of each input region. The start boundary of each input region can be determined in two ways. First, for the first input region, since no preceding input region exists, its start boundary coordinates coincide with the start boundary coordinates of the tensor to be processed. Second, for the input regions other than the first, the start boundary coordinates are determined based on the end boundary coordinates of the preceding input region (or of the output region corresponding to the preceding input region), so that each output region can be spliced with the output region corresponding to its preceding input region (meanwhile, the input regions can be processed separately rather than one by one); after the output regions corresponding to all input regions are spliced, the complete output corresponding to the original tensor to be processed is finally obtained. Specifically, as described above, the boundary position and size of each output region correspond to those of its input region, so the division into input regions can be completed before any part of the tensor to be processed is actually calculated or transformed.
In other words, once the boundaries of all input regions of the tensor to be processed are determined, the input regions can be read into different hardware accelerators for parallel processing; there is no need to finish processing the current input region before determining and processing the next one. This greatly improves the processing efficiency for tensors, especially large-size tensors.
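Because every region's boundaries are fixed before any computation, dispatch order is arbitrary. A minimal Python sketch, using a thread pool as a stand-in for multiple hardware accelerators (`process_fn` and the region tuples are hypothetical placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

def process_regions_in_parallel(regions, process_fn, num_accels=4):
    """Dispatch pre-computed regions independently; the pool models
    multiple accelerators, and results come back in region order."""
    with ThreadPoolExecutor(max_workers=num_accels) as pool:
        return list(pool.map(process_fn, regions))
```

Here `process_fn` would read a region's slice of the tensor and run the predetermined processing (e.g., a convolution) on its accelerator.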
Further, in some embodiments, the method further includes step S400 and step S500 (both not shown). In step S400, the tensor processing device determines, based on the information on the number of divided regions, the end boundary coordinates of at least two output regions corresponding to the tensor to be processed; subsequently, in step S500, the start boundary coordinates of the output region corresponding to each input region are determined. For example, after the end boundary coordinates of the output regions corresponding to some or all of the input regions are determined, the start boundary coordinates (start positions) of the output regions whose start boundaries have not yet been determined are derived from the end boundary coordinates (end positions) of the output regions corresponding to the other input regions. Specifically, after the number and order of the divisions of the input regions are determined, for an input region whose start and end positions are to be determined (hereinafter the current input region): if the current input region is the initial input region (for example, the first input region obtained by dividing the original tensor to be processed), the start boundary coordinates of its output region are the start boundary coordinates of the target output tensor corresponding to the tensor to be processed; otherwise, the start boundary coordinates of its output region are determined based on the end boundary coordinates of the preceding output region (i.e., the output region corresponding to the previous input region).
Correspondingly, in sub-step S210, the tensor processing device determines the end boundary coordinates of each input region based on the end boundary coordinates of the output region corresponding to that input region; in sub-step S220, it determines the start boundary coordinates of each input region based on the start boundary coordinates of the output region corresponding to that input region.
In some embodiments, the tensor needs to be input into the hardware accelerator in a particular format. Thus, the divided input regions need some pre-processing before being processed by the hardware accelerator (e.g., before calculating a convolution). For example, a two-dimensional tensor must be read from the storage unit into the accelerator in units of the atomic-region height (for example, 8) along the height direction and/or width direction (a three-dimensional tensor must be read in units of 8 along the height, width, and/or length directions), and written back to the storage unit in the same format. In other words, the hardware accelerator accesses the tensor in units of the atomic-region height, i.e., the relevant dimension of the input or output tensor needs to be a multiple of the atomic-region height, so as to improve read/write efficiency. For the output tensor, taking the height direction as an example, the height of the output region corresponding to each input region should be a multiple of the atomic-region height (for example, 8), and the output regions should not overlap, so that the output regions corresponding to the input regions can be spliced together seamlessly and without overlap, ensuring that the output obtained after processing all parts of the original tensor to be processed is correct.
At this point, the size (e.g., height) of the output region corresponding to each input region can be determined on the above basis: for example, the height of each output region must not exceed the processing capacity of the hardware accelerator and must be an integral multiple of the atomic-region height. In one embodiment, the number N of blocks into which the original tensor to be processed needs to be divided is determined first. One alternative is to determine N based on a relevant size parameter of the output tensor corresponding to the original tensor to be processed (such as its total height; this parameter is uniquely determined by the corresponding size parameter of the tensor to be processed together with the processing mode of the predetermined processing) and the corresponding working parameter information of the hardware accelerator (such as the limit on the tensor height a single hardware accelerator can process), for example by rounding up the quotient of the two. Then the number C of atomic-region heights contained in the output tensor is determined, specifically from the relevant size parameter of the output tensor (for example, its total height) and the atomic-region height, again by rounding up the quotient of the two. The atomic regions are then assigned to the blocks (the output regions corresponding to the input regions): if C is divisible by N, each block corresponds to C/N atomic regions; otherwise, some blocks are assigned one more atomic region than the others.
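The arithmetic of this embodiment (round up twice, then spread any remainder one atomic region at a time) can be written out as a short Python sketch; the function name and parameters are illustrative, not from the patent:

```python
import math

def assign_atomic_regions(out_total_height, accel_height_limit, atomic_height=8):
    """N blocks and C atomic regions by ceiling division, then
    distribute C over N so the counts differ by at most one."""
    n_blocks = math.ceil(out_total_height / accel_height_limit)
    c_atoms = math.ceil(out_total_height / atomic_height)
    base, extra = divmod(c_atoms, n_blocks)
    counts = [base + (1 if i < extra else 0) for i in range(n_blocks)]
    heights = [c * atomic_height for c in counts]  # per-block output height
    return n_blocks, c_atoms, heights
```

With a 100-row output, a 40-row accelerator limit, and atomic height 8, this gives N = 3, C = 13, and block heights 40, 32, 32; every height is a multiple of 8 and the counts differ by at most one atomic region.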
Then, for the output region corresponding to each input region, its height can be calculated from the number of assigned atomic regions, i.e., the number of assigned atomic regions multiplied by the atomic-region height. For a given output region, its end boundary coordinate is obtained by accumulating the heights of the output regions up to and including it (taking the start boundary position of the first output region as the origin).
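Accumulating block heights into inclusive start/end coordinates might look like the following sketch (assumed convention: the first output region starts at coordinate 0, and each later region starts at its predecessor's end plus 1, so the regions tile without overlap):

```python
def output_region_bounds(block_heights):
    """Inclusive (start, end) coordinates for each output region,
    obtained by accumulating the per-block heights."""
    bounds, end = [], -1
    for h in block_heights:
        start = end + 1       # previous region's end plus 1
        end = start + h - 1   # accumulate this block's height
        bounds.append((start, end))
    return bounds
```

For block heights 8, 8, 16 this yields (0, 7), (8, 15), (16, 31): seamless and non-overlapping, as the splicing requires.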
After the start and end boundary coordinates of the corresponding output region are determined for a given input region, the start and end boundary coordinates of that input region follow accordingly, since the mapping between points in the input and output regions is uniquely determined by the predetermined processing (e.g., convolution) to be performed on the input region. That is, as described above, the start and end boundary coordinates of each input region are determined without actually performing the processing (e.g., calculating the convolution) and obtaining the corresponding output region. Specifically, in sub-step S210, the end boundary coordinates of each input region are determined based on the end boundary coordinates of the output region corresponding to that input region; in sub-step S220, the start boundary coordinates of each input region are determined based on the start boundary coordinates of the corresponding output region.
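For a plain convolution, this inverse mapping is the standard receptive-field formula. The sketch below assumes a 1-D convolution with hypothetical kernel, stride, and padding parameters (the patent does not fix these); note how adjacent output regions map to overlapping input regions, which is exactly the overlap required between adjacent input regions:

```python
def input_range_for_output(out_start, out_end, kernel, stride, padding, in_size):
    """Output element j of a 1-D convolution reads input coordinates
    [j*stride - padding, j*stride - padding + kernel - 1]; clamp the
    union of these windows to the valid input range."""
    lo = out_start * stride - padding
    hi = out_end * stride - padding + kernel - 1
    return max(lo, 0), min(hi, in_size - 1)
```

With kernel 3, stride 1, padding 1 on a 32-row input, output rows 0..7 need input rows 0..8 and output rows 8..15 need input rows 7..16, so the two input regions overlap on rows 7..8 even though the output regions do not overlap at all.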
Wherein, for a particular output region, its start boundary coordinates may be determined from the end boundary coordinates of the preceding output region; for example, the start boundary coordinate of the particular output region is the end boundary coordinate of the preceding output region plus 1.
In some embodiments, in the above sub-step S220, the tensor processing device determines the start boundary coordinates of each input region and expands each input region satisfying a first expansion condition toward the preceding input region, so as to update the start boundary coordinates of the corresponding input region; the distance between the updated start boundary coordinate and the start boundary coordinate of the tensor to be processed is an integer multiple of a preset width. The portion used to expand the current input area is likewise read in from the corresponding part of the original tensor to be processed.
Taking the situations shown in fig. 2d and fig. 3d as an example, the solid square on the left side of fig. 2d represents the overlapping portion of the current input area and the preceding input area, and the hollow square on the left side represents the portion added to the current input area by the preprocessing (expansion); the small square on the left side of fig. 3d represents the overlapping portion of the current output area and the preceding output area. In this case, as shown in fig. 2d, the current input area is partially expanded in the height direction, that is, expanded from the end boundary of the preceding input area toward the inside of the preceding input area, so as to complete the top alignment of the current input area. Here, "top-aligned" means that, after the hardware accelerator reads the current input region from the storage unit, it expands the current input region upward (e.g., toward the preceding input region in the height direction) so that the start boundary coordinate of the current input region is an integer multiple of the preset region height of the hardware accelerator (e.g., the 8 mentioned above; the hardware accelerator reads the current input region in units of the preset region height). In this way, the height of the region read into the hardware accelerator each time is an integer multiple of the preset region height (e.g., 8), which improves reading efficiency; combined with the optional clipping operation (please refer to the detailed description in other embodiments of the present application), the corresponding multiple outputs can be spliced seamlessly and without overlap, avoiding errors.
In some embodiments, a convolution operation is performed on the input region to obtain the corresponding output region. Specifically, a convolution kernel is gradually "scanned" across the two-dimensional input data; as the kernel "slides", the product of the weight matrix and the scanned data matrix is calculated and summed into one output value, and the input region thus "scanned" by the convolution kernel yields the corresponding output region after the convolution operation.
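The convolution "scan" described above can be sketched minimally in pure Python (no framework; the function name and argument layout are illustrative):

```python
def conv2d(x, w, stride=1):
    """Minimal 2-D convolution sketch: the kernel w 'scans' the input x;
    at each position the elementwise product of the weight matrix and the
    covered data is summed into a single output value."""
    kh, kw = len(w), len(w[0])
    H, W = len(x), len(x[0])
    out = []
    for i in range(0, H - kh + 1, stride):       # kernel offset per row
        row = []
        for j in range(0, W - kw + 1, stride):   # kernel 'slides' along the row
            s = sum(w[a][b] * x[i + a][j + b]
                    for a in range(kh) for b in range(kw))
            row.append(s)
        out.append(row)
    return out
```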
In particular, in some embodiments the first expansion condition mentioned above is: the start boundary coordinate of the current input area is not an integer multiple of the preset region height of the hardware accelerator (e.g., the 8 mentioned above), i.e., the start boundary coordinate of the current input area is not divisible by the preset region height.
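The first expansion condition and the resulting "top alignment" amount to snapping the start boundary coordinate down to the previous multiple of the preset region height; a sketch (the default value 8 and the function name are illustrative):

```python
def top_align(start, preset_h=8):
    """Expand toward the preceding input region (i.e., snap the start
    boundary coordinate downward) until it is an integer multiple of the
    accelerator's preset region height; the extra rows gained are re-read
    from the original tensor to be processed."""
    if start % preset_h != 0:          # first expansion condition
        start = (start // preset_h) * preset_h
    return start
```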
Referring to fig. 4a, the original two-dimensional tensor to be processed is to be convolved. Typically, the convolution kernel "slides" along the tensor to be processed row by row (it may equally slide by columns; illustrated by the solid arrow pointing right in the figure) to perform the convolution operation; after the "sliding" calculation of the current row is completed, the convolution kernel is offset by a certain amount (indicated by the downward wide arrow in the figure), and the "sliding" convolution calculation continues until the whole tensor to be processed has been calculated. The amount by which the convolution kernel is offset each time is also called the "stride" in some embodiments; during the convolution process, a portion of the information is compressed by the set stride, i.e., the size of the output is made smaller than the size of the input. In the case shown in the drawing, the stride is set to 3 (the same as the convolution kernel height/width) to increase the processing speed, and the original tensor to be processed is then processed as if it were divided into parts a1, a2, A3, a4, ..., the width of each part being equal to the height/width of the convolution kernel.
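When the stride equals the kernel height/width, the offsets visited by the kernel correspond exactly to the starts of the parts a1, a2, A3, ... in fig. 4a. A sketch of the visited offsets (function name and defaults are illustrative):

```python
def scan_offsets(total_h, k=3, stride=3):
    """Offsets at which a kernel of height k is placed when sliding with
    the given stride over an extent of total_h rows (or columns)."""
    return list(range(0, total_h - k + 1, stride))
```

With stride 3 and kernel height 3, consecutive offsets are disjoint and adjacent, so processing the tensor is equivalent to processing the parts a1, a2, A3, ... one after another.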
In order for the output values subsequently obtained by convolving the input area with a certain stride to be valid, in some embodiments the current input area needs to be further expanded, and the portion used for expansion is read in from ahead of the current input area in the original tensor to be processed. In the above step S300, the tensor processing device, through the hardware accelerator, fills the initial part of each read-in input area up to an integer multiple of the stride of the predetermined processing, and performs the predetermined processing on each input area to obtain the output areas respectively corresponding to the input areas. Specifically, in many cases (including the general case and the above-mentioned case where "top alignment" is needed), the starting position of the current input region may not coincide with a natural starting position in the original tensor to be processed; for example, it is not the starting position of one of the parts a2, A3, a4, etc. in fig. 4a, but somewhere in the middle of a part. When the original tensor to be processed is divided into a plurality of blocks that are processed separately, this causes "misalignment" (for example, when convolution with a stride other than 1 is performed on the tensor), the processing results are no longer consistent, and even if the corresponding outputs are spliced after each part is processed, the spliced output is invalid.
For example, referring to fig. 4b, the description is still based on a convolution operation with a kernel size of 3 x 3 and a stride of 3. If the tensor to be processed were processed directly without division, the convolution kernel would "scan" part a1 (coordinate 0 to coordinate 3 in the figure), part a2 (coordinate 4 to coordinate 6), part A3 (coordinate 7 to coordinate 9), part a4 (coordinate 10 to coordinate 12), and so on in sequence. As described above, after the current input area is expanded upward, its start boundary coordinate is an integer multiple of the preset atomic region height (e.g., the 8 mentioned above); assume that this coordinate is 8. If the convolution operation were performed directly from there, the convolution kernel would actually scan the area at coordinates 8 to 10, offset from the part A3 that should have been scanned, and the output would be invalid. In order to continue dividing the original tensor to be processed in the above manner, so as to reduce system requirements, improve processing efficiency, and avoid errors, one implementation is to fill the current input area (shown by the light-colored hatched grid) with the value 0 (or another fixed/variable value), so that the start boundary position of the current input area already read into the hardware accelerator makes the filled output consistent with the original output (ensuring the correctness of the final output). In practice, when the convolution kernel performs its first "scan" of the current input area, as shown in fig. 4c, the "scanned" part includes both the starting part of the first input area and the filled part.
In other words, the tensor processing device, through the hardware accelerator, fills the initial part of the current input area read into the hardware accelerator up to an integer multiple of the stride of the predetermined processing (e.g., convolution calculation), and performs the predetermined processing on the filled current input area to obtain the current output area corresponding to the current input area. Here, filling the start portion of the current input area to an integer multiple of the stride of the predetermined processing (e.g., calculating a convolution) means that the starting position of the current input area after filling, when projected onto the tensor to be processed (refer to fig. 4d), is an integer multiple of the stride of the predetermined processing (e.g., coordinate position 6 in fig. 4d).
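A sketch of this fill step, assuming fill rows are prepended until the padded block's projected start falls on a stride multiple (function name, fill value, and row representation are assumptions):

```python
def pad_to_stride(block_start, rows, stride=3, fill=0.0):
    """Prepend `block_start % stride` fill rows so that the padded block's
    start position, projected onto the original tensor, is an integer
    multiple of the stride (e.g., start coordinate 8 with stride 3 gains
    2 fill rows, projecting the start to coordinate 6 as in fig. 4d)."""
    pad = block_start % stride
    if pad:
        width = len(rows[0])
        rows = [[fill] * width for _ in range(pad)] + rows
    return rows, pad
```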
In some embodiments, after the current input area has been expanded (and optionally filled) as described above, the output area obtained by processing it (e.g., by computing the convolution) partially overlaps the preceding output area, and if the current output area were spliced directly with the preceding output area, errors would occur (e.g., for a two-dimensional tensor, the splicing result would contain a number of invalid lines); likewise, if the current input area read into the hardware accelerator has been filled, the output corresponding to the filled part is invalid. Therefore, the output area corresponding to the current input area needs to be clipped; after the unnecessary part is removed, the valid output area corresponding to the input area obtained by dividing the original tensor is obtained. For convenience, the pre-clipping region obtained by processing the current input area is referred to as the current region to be clipped. Accordingly, in step S300, the tensor processing device performs the predetermined processing on the input regions through the hardware accelerator to obtain the regions to be clipped respectively corresponding to the input regions, and clips, in each region to be clipped, the region corresponding to the expanded portion of the corresponding input region (corresponding to the initial portion of the corresponding input region; for example, removing the marked upper portion of the "temporary" current output region as shown in fig. 2c) to obtain the corresponding output region.
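A sketch of the clipping step. The exact count of invalid leading rows is an assumption made here for illustration (one invalid output row per `stride` rows of expanded or filled input); the text itself does not fix this formula:

```python
def crop_to_output(to_crop, expanded_rows, padded_rows, stride=3):
    """Drop the invalid leading rows of the region to be clipped (those
    produced by the expanded and/or filled start portion of the input);
    what remains is the valid output region ready for splicing.
    The row count below is an illustrative assumption."""
    invalid = (expanded_rows + padded_rows) // stride
    return to_crop[invalid:]
```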
In some embodiments, with reference to the description of "top alignment" above, the input region may also need to be "bottom-aligned". Specifically, in sub-step S210, the tensor processing device determines the end boundary coordinates of at least two input regions corresponding to the tensor to be processed based on the divided-region number information, and expands each input region satisfying a second expansion condition toward the subsequent input region, so as to update the end boundary coordinates of the corresponding input region; the distance between the updated end boundary coordinate and the start boundary coordinate of the tensor to be processed is an integer multiple of the preset width. Here, "bottom-aligned" means that, after the hardware accelerator reads the current input region from the storage unit, it expands the current input region downward (e.g., toward the subsequent input region in the height direction) so that the end boundary coordinate of the current input region is aligned to an integer multiple of the preset region height of the hardware accelerator (e.g., the 8 mentioned above; the hardware accelerator reads the current input region in units of the preset region height). In this way, the height of the region read into the hardware accelerator each time is an integer multiple of the preset region height (e.g., 8), which improves reading efficiency, allows the corresponding multiple outputs to be spliced seamlessly and without overlap, and avoids errors. The portion used to expand the current input area is likewise read in from the corresponding part of the original tensor to be processed.
In particular, in some embodiments the second expansion condition is: the end boundary coordinate of the current input area is not an integer multiple of the preset region height of the hardware accelerator (e.g., the 8 mentioned above), i.e., the end boundary coordinate of the current input area is not divisible by the preset region height.
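Symmetrically to "top alignment", the second expansion condition snaps the end boundary coordinate up to the next multiple of the preset region height. A sketch, assuming inclusive end coordinates so that `end + 1` is the distance from the start boundary (the default value 8 and the names are illustrative):

```python
def bottom_align(end, preset_h=8):
    """Expand toward the subsequent input region until the distance from
    the tensor's start boundary (end + 1 rows, with inclusive coordinates)
    is an integer multiple of preset_h."""
    if (end + 1) % preset_h != 0:      # second expansion condition
        end = ((end + preset_h) // preset_h) * preset_h - 1
    return end
```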
Further, in some embodiments, in the step S300, the predetermined processing is performed on the input areas by the hardware accelerator to obtain the regions to be clipped respectively corresponding to the input areas; in each region to be clipped, the region corresponding to the expanded portion of the corresponding input area (corresponding to the end portion of the corresponding input area; removing the trailing part of this portion yields the final current output area) is clipped to obtain the corresponding output area. Referring to the above description of processing the region to be clipped: in some embodiments, after the current input region has been expanded (and optionally filled) as described above, the output region obtained by processing it (e.g., by computing the convolution) partially overlaps the subsequent output region, and if the current output region were spliced directly with the subsequent output region, errors would occur (e.g., for a two-dimensional tensor, the splicing result would contain a number of invalid lines); likewise, if the current input area read into the hardware accelerator has been filled, the output corresponding to the filled part is invalid. Therefore, the output area corresponding to the current input area needs to be clipped; after the unnecessary part is removed, the valid output area corresponding to the input area obtained by dividing the original tensor is obtained. For convenience, the pre-clipping region obtained by processing the current input area is referred to as the current region to be clipped.
Optionally, in some embodiments, the method further includes a step S600 (not shown). In this step S600, the tensor processing device splices the target output tensor of the tensor to be processed from the respective output regions. Combining the division processing of the original large-size tensor in the above embodiments, the method determines the number of blocks of the large-size tensor according to the relevant hardware parameters, determines the start and end points of each block based on the number of blocks, processes each block separately, and finally splices the processing results; it features a small amount of calculation and a high processing speed. Moreover, since the size of each partial region divided from the original tensor is determined according to the processing capacity of the hardware accelerator, this processing mode reduces the division of the large-size tensor as much as possible and maximizes the utilization of computing resources and bandwidth.
As can be seen from the above description, by determining the number of divisions of the tensor to be processed, determining each input area based on that number, and processing each input area separately, the division and processing of the original tensor to be processed can be arranged as a whole. Existing large-size tensor processing strategies generally divide a large-size tensor based on a preset block size and cannot fully utilize the computing resources of the hardware; other schemes determine the block sizes sequentially and maximize the size of each block to reduce the number of blocks as much as possible, but this makes the output sizes corresponding to the input areas inconsistent, so the overall processing efficiency still needs improvement.
In addition, optionally, since the tensor processing method provided by the present application first determines the number of divisions of the tensor, energy can be saved by shutting down some of the hardware accelerators or processing cores. Specifically, after the above step S100, the tensor processing device shuts down part of the hardware accelerators based on the obtained divided-region number information and the number of available hardware accelerators, and in the above step S300 processes each input area with the hardware accelerators that remain in the operating state.
For further explanation and to facilitate understanding, the steps of one particular embodiment are listed below to illustrate a specific operational procedure. In this example the tensor to be processed is a two-dimensional tensor; other cases are handled analogously.
1) Acquiring the total height of the output tensor corresponding to the original tensor to be processed and the limit imposed by the hardware accelerator on the height of the tensor it can process, and rounding the quotient of the total height and the limit height upwards to obtain the number N of blocks of the tensor to be processed;
2) acquiring the total height of an output tensor corresponding to the original tensor to be processed and the height of the atomic region, and rounding the quotient of the total height and the height of the atomic region upwards to determine the number C of the atomic region heights contained in the output tensor;
3) distributing the atomic regions among the blocks (output regions corresponding to the input regions), wherein each block corresponds to (C/N) atomic regions if the number C is divisible by the number N of blocks; otherwise, some blocks (output regions corresponding to input regions) are allocated one more atomic region than the others;
4) multiplying the number of atomic regions allocated to each output region by the atomic region height to obtain the height of each output region; then accumulating the height of each output region with those of the output regions before it (if any; no other output region precedes the first output region) to obtain the end (bottom) boundary coordinates of the corresponding output region, and at the same time determining the start (top) boundary coordinates of each output region based on its preceding output region;
5) aligning upward the top coordinates of the input area corresponding to the top of each output area; after the top coordinates are aligned upward, a number of invalid lines are added to the current output area, which need to be clipped in subsequent steps; as required, a number of lines are filled at the top of the corresponding input area, and the filling likewise adds invalid lines to the corresponding output area, which also need to be clipped in subsequent steps;
6) determining the bottom coordinates of the corresponding input areas based on the bottom coordinates of the output areas, and aligning the bottom coordinates of the input areas downwards;
7) based on the output area bottom coordinates obtained by the above calculation, the bottom of the output area obtained by the actual processing is clipped.
Various embodiments of the present application are described in detail above. It should be noted that the embodiments of the present application do not exclude further dividing the divided portions (for example, the input regions obtained by dividing the tensor to be processed), for example, further dividing the input regions illustrated in the drawings into blocks in the horizontal direction.
According to another aspect of the present application, there is also provided a tensor processing system, wherein the system includes at least:
a number of regions determining device, wherein the number of regions determining device is configured to:
determining dividing region quantity information of a tensor to be processed based on working parameter information of a hardware accelerator;
an input region determination device, wherein the input region determination device is configured to:
determining at least two input areas corresponding to the tensor to be processed based on the number information of the divided areas;
a region processing apparatus, wherein the region processing apparatus is configured to:
and executing predetermined processing on the input areas through the hardware accelerator to obtain output areas respectively corresponding to the input areas.
The present application also provides a computer readable storage medium having stored thereon computer code which, when executed, performs the method of any one of the preceding embodiments.
The present application also provides a computer program product, which, when executed by a computer device, performs the method of any one of the preceding embodiments.
The present application further provides a computer device, comprising:
one or more processors;
a memory for storing one or more computer programs;
the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method of any preceding claim.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Those skilled in the art will appreciate that the form in which the computer program instructions reside on a computer-readable medium includes, but is not limited to, source files, executable files, installation package files, and the like, and that the manner in which the computer program instructions are executed by a computer includes, but is not limited to: the computer directly executes the instruction, or the computer compiles the instruction and then executes the corresponding compiled program, or the computer reads and executes the instruction, or the computer reads and installs the instruction and then executes the corresponding installed program. Computer-readable media herein can be any available computer-readable storage media or communication media that can be accessed by a computer.
Communication media includes media by which communication signals, including, for example, computer readable instructions, data structures, program modules, or other data, are transmitted from one system to another. Communication media may include conductive transmission media such as cables and wires (e.g., fiber optics, coaxial, etc.) and wireless (non-conductive transmission) media capable of propagating energy waves such as acoustic, electromagnetic, RF, microwave, and infrared. Computer readable instructions, data structures, program modules, or other data may be embodied in a modulated data signal, for example, in a wireless medium such as a carrier wave or similar mechanism such as is embodied as part of spread spectrum techniques. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. The modulation may be analog, digital or hybrid modulation techniques.
By way of example, and not limitation, computer-readable storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. For example, computer-readable storage media include, but are not limited to, volatile memory such as random access memory (RAM, DRAM, SRAM); and non-volatile memory such as flash memory, various read-only memories (ROM, PROM, EPROM, EEPROM), magnetic and ferromagnetic/ferroelectric memories (MRAM, FeRAM); and magnetic and optical storage devices (hard disk, tape, CD, DVD); or other now known media or later developed that can store computer-readable information/data for use by a computer system.
An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (11)

1. A tensor processing method, wherein the method comprises:
determining dividing region quantity information of a tensor to be processed based on working parameter information of a hardware accelerator;
determining starting boundary coordinates and ending boundary coordinates of at least two input areas corresponding to the tensor to be processed based on the number information of the divided areas, wherein the starting boundary coordinates and the ending boundary coordinates are used for determining the corresponding input areas;
and filling the initial part of each read input area to the integral multiple of the step size of the alignment preset processing through the hardware accelerator, and executing preset processing on each input area to obtain output areas respectively corresponding to the input areas.
2. The method according to claim 1, wherein the determining, based on the number of divided regions information, start boundary coordinates and end boundary coordinates of at least two input regions corresponding to the tensor to be processed includes:
b1 determining the end point boundary coordinates of at least two input areas corresponding to the tensor to be processed based on the number information of the divided areas;
b2 determining the start boundary coordinates of each input area;
wherein the start boundary coordinates and the end boundary coordinates are used to determine a corresponding input area.
3. The method of claim 2, wherein the method further comprises:
r, determining end point boundary coordinates of at least two output areas corresponding to the tensor to be processed based on the number information of the divided areas;
s determining the start boundary coordinates of the output area corresponding to each input area, wherein:
if the current input area is an initial input area, determining the initial boundary coordinates of the output area corresponding to the current input area as the initial boundary coordinates of the target output tensor corresponding to the to-be-processed tensor; if not, then,
determining a starting boundary coordinate of an output area corresponding to the current input area based on an end boundary coordinate of the preorder output area;
wherein, the step b1 includes:
determining the end point boundary coordinates of each input area based on the end point boundary coordinates of the output area corresponding to each input area;
the step b2 includes:
and determining the starting boundary coordinates of each input area based on the starting boundary coordinates of the output area corresponding to each input area.
4. The method of claim 2, wherein the step b2 includes:
determining the initial boundary coordinates of each input area, and expanding the input areas meeting the first expansion condition towards the direction of the forward sequence input areas so as to update the initial boundary coordinates of the corresponding input areas; and the distance between the updated initial boundary coordinate and the initial boundary coordinate of the tensor to be processed is integral multiple of the preset width.
5. The method of claim 4, wherein the filling, by the hardware accelerator, the initial portion of each input area read in to an integer multiple of a step size of an alignment predetermined process, and performing the predetermined process on each input area to obtain output areas respectively corresponding to the input areas comprises:
filling the initial part of each read-in input area to the integral multiple of the step length of the alignment preset processing through the hardware accelerator, and executing preset processing on the input area to obtain the areas to be clipped respectively corresponding to the input areas;
and clipping the area corresponding to the expanded part in the corresponding input area in the area to be clipped to obtain a corresponding output area.
6. The method of claim 2, wherein the step b1 includes:
determining end point boundary coordinates of at least two input areas corresponding to the tensor to be processed based on the number information of the divided areas, and expanding the input areas meeting a second expansion condition towards a subsequent input area direction to update the end point boundary coordinates of the corresponding input areas; and the distance between the updated end point boundary coordinate and the initial boundary coordinate of the tensor to be processed is integral multiple of the preset width.
7. The method of claim 6, wherein the filling, by the hardware accelerator, the initial portion of each input area read in to an integer multiple of a step size of an alignment predetermined process, and performing the predetermined process on each input area to obtain output areas respectively corresponding to the input areas comprises:
filling the initial part of each read input area to the integral multiple of the step length of the alignment preset processing through the hardware accelerator, and executing preset processing on each input area to obtain the areas to be clipped respectively corresponding to the input areas;
and clipping the area corresponding to the expanded part in the corresponding input area in the area to be clipped to obtain a corresponding output area.
8. The method of claim 1, wherein the method further comprises:
splicing the output areas to obtain the target output tensor of the tensor to be processed.
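The splicing step amounts to concatenating the per-region outputs back into one tensor along the split dimension. A 1-D sketch (real tensors would concatenate along the axis that was divided):

```python
def splice_output_areas(output_areas):
    """Concatenate per-region output areas, in order, into the target
    output tensor (1-D illustration of the claim's splicing step)."""
    target = []
    for area in output_areas:
        target.extend(area)
    return target
```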
9. A tensor processing system, wherein the system comprises at least:
a region-quantity determining device, wherein the region-quantity determining device is configured to:
determine divided-region quantity information of a tensor to be processed based on working parameter information of a hardware accelerator;
an input-region determining device, wherein the input-region determining device is configured to:
determine starting boundary coordinates and end boundary coordinates of at least two input areas corresponding to the tensor to be processed based on the divided-region quantity information, wherein the starting boundary coordinates and the end boundary coordinates are used to determine the corresponding input areas; and
a region processing device, wherein the region processing device is configured to:
pad, via the hardware accelerator, the initial portion of each read-in input area to an integer multiple of the stride of the predetermined processing, and perform the predetermined processing on each input area to obtain output areas respectively corresponding to the input areas.
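One plausible way the region-quantity determining device could derive the number of regions from the accelerator's working parameters is shown below; the specific parameter used (an on-chip buffer capacity) is an assumption for illustration, not taken from the patent:

```python
import math

def num_regions(tensor_size, buffer_capacity):
    """Split the tensor into the fewest regions that each fit within the
    accelerator's buffer. `buffer_capacity` is an assumed working parameter."""
    return max(1, math.ceil(tensor_size / buffer_capacity))
```

For instance, a tensor of 100 elements against a 32-element buffer would be divided into 4 regions, while a tensor that already fits is left as a single region.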
10. A computer-readable storage medium having stored thereon a computer program which, when executed, implements the tensor processing method of any one of claims 1 to 8.
11. An electronic device, wherein the electronic device comprises at least:
one or more processors; and
a memory for storing executable instructions;
wherein the one or more processors are configured to implement, via the executable instructions, the method of any one of claims 1 to 8.
CN202110458766.4A 2021-04-27 2021-04-27 Tensor processing method and system Active CN113240077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110458766.4A CN113240077B (en) 2021-04-27 2021-04-27 Tensor processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110458766.4A CN113240077B (en) 2021-04-27 2021-04-27 Tensor processing method and system

Publications (2)

Publication Number Publication Date
CN113240077A CN113240077A (en) 2021-08-10
CN113240077B true CN113240077B (en) 2022-04-05

Family

ID=77129401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110458766.4A Active CN113240077B (en) 2021-04-27 2021-04-27 Tensor processing method and system

Country Status (1)

Country Link
CN (1) CN113240077B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109656623A (en) * 2019-03-13 2019-04-19 北京地平线机器人技术研发有限公司 Method and apparatus for performing convolution operations, and method and apparatus for generating instructions
CN110209503A (en) * 2019-08-01 2019-09-06 上海燧原智能科技有限公司 Reduction calculation method, apparatus, device and medium for a multidimensional tensor
CN110826694A (en) * 2019-10-30 2020-02-21 瀚博半导体(上海)有限公司 Image processing method and device based on convolutional neural network
CN111160523A (en) * 2019-12-16 2020-05-15 上海交通大学 Dynamic quantization method, system and medium based on characteristic value region
CN111340201A (en) * 2018-12-19 2020-06-26 北京地平线机器人技术研发有限公司 Convolutional neural network accelerator and method for performing convolutional operation thereof
CN111598074A (en) * 2020-05-21 2020-08-28 杭州睿琪软件有限公司 Edge detection method and apparatus, electronic device, and storage medium
CN112149664A (en) * 2020-09-04 2020-12-29 浙江工业大学 Target detection method for optimizing classification and positioning tasks
CN112395547A (en) * 2019-08-14 2021-02-23 英特尔公司 Method and apparatus for tile-based traversal of a tensor for convolution operations

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102507567B1 (en) * 2015-06-09 2023-03-09 삼성전자주식회사 Electronic apparatus for processing image and mehotd for controlling thereof
CN106503365B (en) * 2016-11-03 2019-08-13 英特工程仿真技术(大连)有限公司 A sector search method for the SPH algorithm
US20200117981A1 (en) * 2018-10-11 2020-04-16 International Business Machines Corporation Data representation for dynamic precision in neural network cores

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340201A (en) * 2018-12-19 2020-06-26 北京地平线机器人技术研发有限公司 Convolutional neural network accelerator and method for performing convolutional operation thereof
CN109656623A (en) * 2019-03-13 2019-04-19 北京地平线机器人技术研发有限公司 Method and apparatus for performing convolution operations, and method and apparatus for generating instructions
CN110209503A (en) * 2019-08-01 2019-09-06 上海燧原智能科技有限公司 Reduction calculation method, apparatus, device and medium for a multidimensional tensor
CN112395547A (en) * 2019-08-14 2021-02-23 英特尔公司 Method and apparatus for tile-based traversal of a tensor for convolution operations
CN110826694A (en) * 2019-10-30 2020-02-21 瀚博半导体(上海)有限公司 Image processing method and device based on convolutional neural network
CN111160523A (en) * 2019-12-16 2020-05-15 上海交通大学 Dynamic quantization method, system and medium based on characteristic value region
CN111598074A (en) * 2020-05-21 2020-08-28 杭州睿琪软件有限公司 Edge detection method and apparatus, electronic device, and storage medium
CN112149664A (en) * 2020-09-04 2020-12-29 浙江工业大学 Target detection method for optimizing classification and positioning tasks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Area deficits and the Bel–Robinson tensor; Jacobson, Ted et al.; General Relativity and Quantum Cosmology; 2018-12-31; Vol. 35, No. 8; pp. 1-31 *
Design and FPGA verification of a TNN-based traffic sign recognition algorithm; Han Zhi; China Master's Theses Full-text Database (Engineering Science and Technology II); 2019-05-15 (No. 5); C034-570 *
Research and design of a dedicated convolutional neural network inference accelerator based on the YOLO face detection algorithm; Luo Cong; China Master's Theses Full-text Database (Information Science and Technology); 2018-12-15 (No. 12); I138-1764 *

Also Published As

Publication number Publication date
CN113240077A (en) 2021-08-10

Similar Documents

Publication Publication Date Title
CN110046702B (en) Neural network computing accelerator and executing method thereof
EP3486844A1 (en) Method and apparatus for adapting feature data in convolutional neural network
CN113485837B (en) Tensor processing method and system based on parallel branches and tensor segmentation
KR20200066952A (en) Method and apparatus for performing dilated convolution operation in neural network
CN112734641A (en) Training method and device of target detection model, computer equipment and medium
US9959670B2 (en) Method for rendering terrain
CN110827202A (en) Target detection method, target detection device, computer equipment and storage medium
CN110334798A (en) Characteristic extracting method and device, instruction generation method and device
US20230011757A1 (en) Method and apparatus for generating strategy of object transport-and-pack process, and computer device
CN111275633A (en) Point cloud denoising method, system and device based on image segmentation and storage medium
CN109313809B (en) Image matching method, device and storage medium
JP2022188301A (en) Information processing apparatus, and information processing method
US20230359145A1 (en) Method and apparatus for processing three-dimensional holographic image
CN110475078A (en) Camera shutter time method of adjustment and terminal device
CN115272683A (en) Central differential information filtering phase unwrapping method based on deep learning
CN110121068B (en) Point selection method and device of zoom following curve and storage medium
CN113240077B (en) Tensor processing method and system
CN111881985A (en) Stereo matching method, device, terminal and storage medium
CN113159295B (en) Tensor processing method and system based on hardware accelerator
CN113554157A (en) Data processing method and related product
CN110322388B (en) Pooling method and apparatus, pooling system, and computer-readable storage medium
US11657530B2 (en) Stereo matching method and apparatus of images
CN116070903A (en) Risk determination method and device for passing through obstacle region and electronic equipment
CN114580249A (en) Multi-loop FDTD electromagnetic field simulation analysis method, system, equipment and medium
CN115294361A (en) Feature extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant