CN117170588B - Method, apparatus and medium for converting a layout of tensor data - Google Patents

Method, apparatus and medium for converting a layout of tensor data

Info

Publication number
CN117170588B
CN117170588B
Authority
CN
China
Prior art keywords
tensor data
granularity
source
memory
memory granularity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311443517.3A
Other languages
Chinese (zh)
Other versions
CN117170588A (en)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd and Beijing Bilin Technology Development Co ltd
Priority to CN202311443517.3A
Publication of CN117170588A
Application granted
Publication of CN117170588B
Legal status: Active
Anticipated expiration

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present disclosure relates to a method, apparatus, and medium for converting the layout of tensor data. The method includes: determining a second memory granularity according to a first memory granularity associated with source tensor data and a target minimum memory granularity associated with target tensor data, where each dimension value of the second memory granularity is the least common multiple of the corresponding dimension values of the first memory granularity and the target minimum memory granularity, and the first memory granularity is obtained according to the source minimum memory granularity associated with the source tensor data; reading, from the source tensor data, sub-tensor data corresponding to an integer multiple of the second memory granularity for storage in a cache unit; and reading the sub-tensor data stored in the cache unit according to the target minimum memory granularity, so as to store it in a storage space corresponding to the target tensor data. The scheme of the disclosure can significantly improve the efficiency of converting tensor layouts.

Description

Method, apparatus and medium for converting a layout of tensor data
Technical Field
Embodiments of the present disclosure relate generally to the field of deep learning, and more particularly, to a method, apparatus, and medium for converting a layout of tensor data.
Background
Tensors are widely used in the field of deep neural networks. Tensors may be used, for example, to represent the input information of neurons. The training and inference processes of a neural network involve a large amount of tensor processing, and the efficiency of this processing has an important influence on the computation speed of the network. In some scenarios, the layout of a tensor needs to be converted. In conventional schemes, such layout conversion is inefficient.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a method, apparatus, and medium for converting the layout of tensor data, which can significantly improve the efficiency of tensor layout conversion.
According to a first aspect of the present disclosure, a method for converting a layout of tensor data is provided. The method comprises the following steps: determining a second memory granularity according to a first memory granularity associated with source tensor data and a target minimum memory granularity associated with target tensor data, wherein each dimension value of the second memory granularity is the least common multiple of the corresponding dimension values of the first memory granularity and the target minimum memory granularity, and the first memory granularity is obtained according to the source minimum memory granularity associated with the source tensor data; reading, from the source tensor data, sub-tensor data corresponding to an integer multiple of the second memory granularity for storage in a cache unit; and reading the sub-tensor data stored in the cache unit according to the target minimum memory granularity, so as to store it in a storage space corresponding to the target tensor data.
In some embodiments, the method further comprises: determining whether an instruction for a transition conversion of the source tensor data is received; in response to determining that no instruction for a transition conversion of the source tensor data has been received, determining the source minimum memory granularity of the source tensor data to be the first memory granularity; and in response to determining that an instruction for a transition conversion of the source tensor data is received, determining the memory granularity to which the source minimum memory granularity corresponds after the transition conversion to be the first memory granularity.
In some embodiments, the source minimum memory granularity is related to a source address layout pattern for elements in the source tensor data, and the target minimum memory granularity is related to a target address layout pattern for elements in the target tensor data.
In some embodiments, the source address layout pattern is different from the target address layout pattern.
In some embodiments, the method further comprises: determining a third memory granularity according to a predetermined memory granularity associated with the source tensor data and the target minimum memory granularity, wherein each dimension value of the third memory granularity is the least common multiple of the corresponding dimension values of the predetermined memory granularity and the target minimum memory granularity, the predetermined memory granularity is n times the first memory granularity, and n is a positive integer greater than 1; determining whether the capacity of the cache unit meets the third memory granularity; and in response to determining that the capacity of the cache unit meets the third memory granularity, reading sub-tensor data corresponding to the third memory granularity from the source tensor data for storage to the cache unit.
In some embodiments, the method further comprises: in response to determining that the capacity of the cache unit does not meet the third memory granularity, determining whether the capacity of the cache unit meets the second memory granularity; and in response to determining that the capacity of the cache unit meets the second memory granularity, reading sub-tensor data corresponding to the second memory granularity from the source tensor data for storage to the cache unit.
In some embodiments, the method further comprises: in response to determining that the capacity of the cache unit does not meet the second memory granularity, sub-tensor data corresponding to the source minimum memory granularity is read from the source tensor data for storage to the cache unit.
In some embodiments, reading sub-tensor data corresponding to an integer multiple of the second memory granularity from the source tensor data includes: determining a first number of source minimum memory granularities contained in the integer multiple of the second memory granularity; and reading the corresponding first number of pieces of sub-tensor data from the source tensor data in units of the source minimum memory granularity.
In some embodiments, reading the sub-tensor data stored in the cache unit according to the target minimum memory granularity includes: determining a second number of target minimum memory granularities contained in the sub-tensor data stored in the cache unit; and storing the sub-tensor data stored in the cache unit into the storage space corresponding to the target tensor data in units of the target minimum memory granularity.
In some embodiments, the transition conversion includes reshaping and/or transposing.
In some embodiments, the transition conversion includes converting the source tensor data based on an operator obtained by fusing a reshape operator and a transpose operator.
In some embodiments, the source minimum memory granularity is different from the target minimum memory granularity.
According to a second aspect of the present disclosure, an electronic device is provided. The electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method according to the first aspect of the present disclosure.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a machine, implements the method according to the first aspect of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements.
Fig. 1 illustrates a flow chart of a method for converting a layout of tensor data according to an embodiment of the present disclosure.
Fig. 2 shows the relationship of the order, shape, and dimension values of tensor data.
Fig. 3 shows a schematic diagram of tensor data conversion according to an embodiment of the present disclosure.
Fig. 4 shows a schematic diagram of tensor data conversion according to an embodiment of the present disclosure.
Fig. 5 illustrates a physical structure of a storage space corresponding to tensor data according to an embodiment of the present disclosure.
Fig. 6 shows a schematic diagram of tensor data conversion according to an embodiment of the present disclosure.
FIG. 7 illustrates a schematic diagram of converting a layout of tensor data using a method of an embodiment of the present disclosure.
FIG. 8 illustrates a schematic diagram of a physical storage structure of local data of source tensor data according to an embodiment of the present disclosure.
FIG. 9 illustrates a schematic diagram of a physical storage structure of the source minimum memory granularity in an embodiment of the disclosure.
FIG. 10 illustrates a schematic diagram of a physical storage structure of the target minimum memory granularity in an embodiment of the disclosure.
FIG. 11 illustrates a schematic diagram of a physical storage structure of local data of target tensor data according to an embodiment of the present disclosure.
Fig. 12 shows a schematic diagram of a tensor data transformation.
FIG. 13 illustrates a schematic diagram of converting a layout of tensor data using a method of an embodiment of the present disclosure.
FIG. 14 illustrates a schematic diagram of a physical storage structure of local data of source tensor data according to an embodiment of the present disclosure.
FIG. 15 illustrates a schematic diagram of a physical storage structure of the source minimum memory granularity in an embodiment of the disclosure.
FIG. 16 illustrates a schematic diagram of a physical storage structure of the first memory granularity in an embodiment of the disclosure.
FIG. 17 illustrates a schematic diagram of a physical storage structure of the target minimum memory granularity in an embodiment of the disclosure.
Fig. 18 illustrates a schematic diagram of a physical storage structure of local data of target tensor data according to an embodiment of the present disclosure.
Fig. 19 shows a flowchart of a method for converting a layout of tensor data according to an embodiment of the present disclosure.
FIG. 20 shows a schematic block diagram of an example electronic device that may be used to implement the methods of embodiments of the present disclosure.
FIG. 21 illustrates a schematic diagram of transforming a layout of tensor data using a method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "comprising" and variations thereof as used herein means open ended, i.e., "including but not limited to. The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment. The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like, may refer to different or the same object. Other explicit and implicit definitions are also possible below.
As described previously, in some scenarios, the layout of the tensors needs to be transformed. In conventional schemes, the conversion of the layout of tensors is inefficient.
To at least partially solve one or more of the above problems and other potential problems, a scheme for converting the layout of tensor data is proposed. In this scheme, a second memory granularity is determined according to a first memory granularity associated with source tensor data and a target minimum memory granularity associated with target tensor data, where each dimension value of the second memory granularity is the least common multiple of the corresponding dimension values of the first memory granularity and the target minimum memory granularity, and the first memory granularity is obtained according to the source minimum memory granularity associated with the source tensor data. Sub-tensor data corresponding to an integer multiple of the determined second memory granularity is then read from the source tensor data, which can significantly improve memory-access efficiency in the process of converting the layout of the tensor data.
The method of the embodiments of the present disclosure is described in detail below.
Fig. 1 illustrates a flow chart of a method 100 for converting a layout of tensor data according to an embodiment of the present disclosure. It should be understood that method 100 may also include additional steps not shown and/or that the illustrated steps may be omitted; the scope of the present disclosure is not limited in this respect. The method 100 may be implemented based on a controller or by the electronic device 2000 shown in fig. 20. The controller may be implemented using an MCU (micro controller unit), a CPU (central processing unit), a GPU (graphics processing unit), a GPGPU (general-purpose computing on graphics processing units) device, an FPGA (field-programmable gate array) or other programmable logic device, an ASIC (application-specific integrated circuit), discrete gate or transistor logic, discrete hardware components, and the like.
Referring to fig. 1, the method 100 includes:
at step 102, a second memory granularity is determined based on a first memory granularity associated with the source tensor data and a target minimum memory granularity associated with the target tensor data. Each dimension value of the second memory granularity is the least common multiple of the corresponding dimension values of the first memory granularity and the target minimum memory granularity, and the first memory granularity is obtained according to the source minimum memory granularity associated with the source tensor data.
At step 104, sub-tensor data corresponding to an integer multiple of the second memory granularity is read from the source tensor data for storage to the cache unit.
At step 106, the sub-tensor data stored in the cache unit is read according to the target minimum memory granularity, so as to be stored in the storage space corresponding to the target tensor data.
In some embodiments, the method 100 further comprises: determining whether an instruction for a transition conversion of the source tensor data is received; in response to determining that no instruction for a transition conversion of the source tensor data has been received, determining the source minimum memory granularity of the source tensor data to be the first memory granularity; and in response to determining that an instruction for a transition conversion of the source tensor data is received, determining the memory granularity to which the source minimum memory granularity corresponds after the transition conversion to be the first memory granularity.
In some embodiments, the source minimum memory granularity is related to a source address layout pattern for elements in the source tensor data, and the target minimum memory granularity is related to a target address layout pattern for elements in the target tensor data.
In some embodiments, the source address layout pattern is different from the target address layout pattern.
In some embodiments, reading sub-tensor data corresponding to an integer multiple of the second memory granularity from the source tensor data includes: determining a first number of source minimum memory granularities contained in the integer multiple of the second memory granularity; and reading the corresponding first number of pieces of sub-tensor data from the source tensor data in units of the source minimum memory granularity.
In some embodiments, the method 100 further comprises: determining a third memory granularity according to a predetermined memory granularity associated with the source tensor data and the target minimum memory granularity, wherein each dimension value of the third memory granularity is the least common multiple of the corresponding dimension values of the predetermined memory granularity and the target minimum memory granularity, the predetermined memory granularity is n times the first memory granularity, and n is a positive integer greater than 1; determining whether the capacity of the cache unit meets the third memory granularity; and in response to determining that the capacity of the cache unit meets the third memory granularity, reading sub-tensor data corresponding to the third memory granularity from the source tensor data for storage to the cache unit.
In some embodiments, the method 100 further comprises: in response to determining that the capacity of the cache unit does not meet the third memory granularity, determining whether the capacity of the cache unit meets the second memory granularity; and in response to determining that the capacity of the cache unit meets the second memory granularity, reading sub-tensor data corresponding to the second memory granularity from the source tensor data for storage to the cache unit.
In some embodiments, the method 100 further comprises: in response to determining that the capacity of the cache unit does not meet the second memory granularity, sub-tensor data corresponding to the source minimum memory granularity is read from the source tensor data for storage to the cache unit.
It is worth noting that tensor data may be stored in a memory. Taking TensorFlow (a deep learning framework) as an example, tensor data has attributes such as order (rank), shape, and dimension values. Fig. 2 shows the relationship of the order, shape, and dimension values of tensor data. It should be understood that fig. 2 is merely a schematic illustration based on a TensorFlow scenario, and the disclosure is not limited to the scenario shown in fig. 2. The order of tensor data may also be referred to as its "dimension". A dimension value of tensor data is the number of elements that the tensor data has in that dimension. For example, for tensor data T = 5, the tensor data T is a tensor of order 0, also called a "scalar", and represents a single numerical value (i.e., the value "5"). Tensor data T = [2] is a first-order tensor (or one-dimensional tensor), also referred to as a "vector" or "one-dimensional array"; it has one dimension with a dimension value D1 of 2 and characterizes, for example, a vector containing 2 data elements. Tensor data T = [5,3] is a second-order tensor, also referred to as a "matrix" or "two-dimensional array"; it has two dimensions, the first with a dimension value D1 of 5 and the second with a dimension value D2 of 3, and represents, for example, a matrix of 5 rows and 3 columns. Similarly, tensor data T = [D1, D2, D3, …, Dn] is described based on the shape of the tensor data T, and each element in the tensor data T may correspond to one data item.
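For illustration (not part of the patent text), these attributes can be inspected with numpy; the variable names are illustrative:

```python
# A small illustration of order, shape, and dimension values with numpy.
import numpy as np

scalar = np.array(5)         # order 0: a scalar, T = 5
vector = np.zeros(2)         # order 1, shape [2]: dimension value D1 = 2
matrix = np.zeros((5, 3))    # order 2, shape [5, 3]: D1 = 5, D2 = 3
print(scalar.ndim, vector.shape, matrix.shape)  # 0 (2,) (5, 3)
```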
It is worth noting that, in some embodiments, the data in tensor data is stored in memory sequentially and contiguously. For example, adjacent data in the tensor data are stored in storage units with adjacent addresses. The address at which a data item (or "element") of the tensor data is stored may be determined, for example, from the starting address (or "base address", i.e., the storage address of the first element of the tensor data) of the tensor data in the memory and the address offset corresponding to the data item. For example, for tensor data having the shape [D1, D2, D3], the data contained therein are stored in a contiguous storage space of D1×D2×D3 elements. For a plurality of tensor data having the same shape (i.e., the same logical structure), if the address layout patterns of the respective tensor data differ, the address offsets corresponding to elements with the same coordinates may differ between the tensor data. For example, for an element with coordinates (n1, n2, n3), according to a first address layout pattern the corresponding address offset is, for example, (n1×D2+n2)×D3+n3; according to a second address layout pattern the corresponding address offset is, for example, (n3×D2+n2)×D1+n1. That is, for tensor data with the shape [D1, D2, D3], the definitions of the innermost and outermost dimensions may differ, and so may the way the address offset of the data at coordinates (n1, n2, n3) is calculated.
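To make the two layout patterns concrete, here is a minimal sketch; the function names are illustrative assumptions, not the patent's implementation:

```python
# Address offsets under the two layout patterns described above,
# for a tensor of shape [D1, D2, D3].
def offset_first_pattern(n1, n2, n3, D1, D2, D3):
    # Innermost dimension is the third one (D3 varies fastest).
    return (n1 * D2 + n2) * D3 + n3

def offset_second_pattern(n1, n2, n3, D1, D2, D3):
    # Innermost dimension is the first one (D1 varies fastest).
    return (n3 * D2 + n2) * D1 + n1

# The element at coordinates (1, 2, 0) of a [5, 3, 4] tensor:
print(offset_first_pattern(1, 2, 0, 5, 3, 4))   # (1*3+2)*4+0 = 20
print(offset_second_pattern(1, 2, 0, 5, 3, 4))  # (0*3+2)*5+1 = 11
```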
It should be noted that, in embodiments of the present disclosure, converting the layout of tensor data includes, but is not limited to, converting the shape of the tensor data, converting the positions (e.g., coordinates) of elements in the tensor data, and converting the address layout pattern of the tensor data, and may be a combination of several such conversions.
Fig. 3 shows a schematic diagram of tensor data conversion according to an embodiment of the present disclosure. The tensor data T01 is a second-order tensor with the shape [N, HW]; that is, the first dimension of T01 has a dimension value of N and the second dimension has a dimension value of HW (for example, HW is the product of "H" and "W"). The tensor data T01 undergoes a reshaping conversion to obtain, for example, tensor data T02. The tensor data T02 is a third-order tensor with the shape [N/8, 8, HW]. The reshaping conversion may be implemented using a reshape operator. Then, the tensor data T02 is transposed to obtain tensor data T03, a third-order tensor with the shape [N/8, HW, 8]. The transpose conversion may be implemented using a transpose operator (e.g., a permute or transpose operator).
Fig. 4 shows a schematic diagram of tensor data conversion according to an embodiment of the present disclosure, and can be regarded as a specific example of fig. 3. For ease of understanding, the elements in the tensor data are labeled with numbers; for example, the tensor data T04, T05, and T06 each contain 32 elements in total, from element 1 to element 32. The tensor data T04 is a second-order tensor with the shape [16,2]. T04 is reshaped to form tensor data T05, whose shape is, for example, [2,8,2]. The tensor data T05 is then transposed to form tensor data T06, whose shape is, for example, [2,2,8]. It should be understood that the tensor data T04, T05, and T06 shown in the drawings are the logical structures of the corresponding tensor data, shown based on their shapes. Fig. 5 illustrates the physical structure of the storage space corresponding to tensor data T04 of an embodiment of the present disclosure. For example, the 32 elements in tensor data T04 are sequentially stored in 32 storage units in the memory, where the storage address corresponding to the first element (i.e., element 1) of T04 is the starting address of the storage space corresponding to T04. Converting tensor data changes the positions (or coordinates) of elements within the logical structure of the tensor data, which may in some cases change the relative positions of those elements in the storage space. The position of an element in the new tensor data formed after conversion can be determined from the element's address, the starting address of the new tensor data, and the shape of the new tensor data.
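The same conversion can be reproduced, for example, with numpy; the following is an illustrative sketch, not the patent's implementation:

```python
# Fig. 4 example: [16, 2] --reshape--> [2, 8, 2] --transpose--> [2, 2, 8].
import numpy as np

T04 = np.arange(1, 33).reshape(16, 2)  # elements 1..32, shape [16, 2]
T05 = T04.reshape(2, 8, 2)             # reshaping keeps the memory order
T06 = T05.transpose(0, 2, 1)           # transposing the last two dimensions
print(T06.shape)                       # (2, 2, 8)
```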
In some embodiments, the tensor data T01 may also be converted into tensor data T03 based on a fusion operator obtained by fusing the reshape operator with the transpose operator. Converting tensor data with the fusion operator can, for example, omit the intermediate result (such as tensor data T02) and thus significantly improve efficiency.
Fig. 6 shows a schematic diagram of tensor data conversion according to an embodiment of the present disclosure, taking the Conv1D (one-dimensional convolution) operator as an example. In some artificial intelligence models, the input and output tensors of the Conv1D operator are both third-order tensors (or "three-dimensional matrices"). In some embodiments, the functionality of the Conv1D operator is implemented with a Conv2D (two-dimensional convolution) operator, which may follow the flow shown in fig. 6. The input tensor data T07 = [N, H, W] first undergoes a reshaping conversion to obtain tensor data T08 = [N/8, 8, H, W] (a fourth-order tensor), and then a transpose conversion to obtain tensor data T09 = [N/8, H, W, 8]. Next, the tensor data T09 is processed by the Conv2D operator to obtain tensor data T10 = [N/8, H, W, 8]; a transpose conversion then yields tensor data T11 = [N/8, 8, H, W], and finally a reshaping conversion yields the output tensor data T12 = [N, H, W]. In this way, the Conv1D computation from the input tensor data T07 to the output tensor data T12 can be realized.
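To make the shape bookkeeping of this flow concrete, the following is a hedged numpy sketch; conv2d_stub is a hypothetical stand-in for a real Conv2D operator (assumed here to preserve shape, e.g., stride 1 with "same" padding), and the sizes are illustrative:

```python
# Fig. 6 flow: Conv1D realized via Conv2D through reshape/transpose.
import numpy as np

def conv2d_stub(x):
    # Placeholder for a real Conv2D; the text states the output shape
    # [N/8, H, W, 8] matches the input shape, so identity suffices here.
    return x

N, H, W = 16, 4, 5
T07 = np.zeros((N, H, W))
T08 = T07.reshape(N // 8, 8, H, W)  # reshape: [N,H,W] -> [N/8,8,H,W]
T09 = T08.transpose(0, 2, 3, 1)     # transpose -> [N/8,H,W,8]
T10 = conv2d_stub(T09)              # Conv2D keeps [N/8,H,W,8]
T11 = T10.transpose(0, 3, 1, 2)     # transpose back -> [N/8,8,H,W]
T12 = T11.reshape(N, H, W)          # reshape -> [N,H,W]
print(T12.shape)                    # (16, 4, 5)
```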
FIG. 7 illustrates a schematic diagram of converting a layout of tensor data using a method of an embodiment of the present disclosure. The source tensor data T71 is, for example, a third-order tensor. When converting the layout of the source tensor data T71, the data in T71 may first be loaded into the first memory M1. For convenience of explanation, only part of the data of T71 stored in the first memory M1 is illustrated in the figure. The first memory M1 is, for example, an on-chip register used to store the source tensor data T71. Fig. 8 illustrates a schematic diagram of a physical storage structure of local data of the source tensor data T71 of an embodiment of the present disclosure. The address corresponding to element 1 is, for example, the starting address BM1 of the first memory M1. The source minimum memory granularity of the source tensor data T71 is [2,6] (or 2×6); for example, the 12 (2×6) elements 1-12 of T71 form one source minimum memory granularity Tf71. Similarly, elements 13-24 of T71 form a source minimum memory granularity, and elements 25-36 form another. It is worth noting that the source minimum memory granularity is, for example, the smallest sub-tensor that can be accessed at a time when the source tensor data T71 is accessed (e.g., read). The sub-tensor corresponding to the source minimum memory granularity has a predetermined number of elements and a predetermined shape. In some embodiments, to achieve higher memory-access efficiency, the source minimum memory granularity is, for example, a sub-tensor of a first predetermined granularity defined to satisfy the memory alignment requirements of the source tensor data T71 in its respective dimensions. Memory alignment is not described in detail here. The source tensor data T71 is, for example, a third-order tensor while the source minimum memory granularity Tf71 corresponds to a second-order tensor; therefore, in the third dimension of T71, the source minimum memory granularity Tf71 takes a dimension value of 1 according to the memory alignment requirement. That is, when the order of the source minimum memory granularity is smaller than that of the source tensor data, the dimension value of the source minimum memory granularity in each dimension it "lacks" compared with the source tensor data is taken as 1.
It is worth noting that the source minimum memory granularity Tf71 is related to the source address layout pattern for the elements in the source tensor data. That is, the source minimum memory granularity Tf71 can reflect the source address layout pattern for the elements in the source tensor data.
First, the controller determines whether an instruction for a transition conversion of the source tensor data T71 is received. A transition conversion is, for example, a conversion required when, in the course of converting the source tensor data into the target tensor data, the source tensor data T71 is loaded from the memory corresponding to the source tensor data (e.g., the first memory M1) into the cache unit. Transition conversions may include, but are not limited to, reshaping and transposing. A transition conversion may also convert the source tensor data using an operator obtained by fusing the reshape operator and the transpose operator. In the example of fig. 7, no transition conversion is involved. Accordingly, the controller determines that no instruction for a transition conversion of the source tensor data T71 is received, and then determines the source minimum memory granularity Tf71 of the source tensor data to be the first memory granularity. Fig. 9 shows a schematic diagram of a physical storage structure of the source minimum memory granularity Tf71 in an embodiment of the present disclosure.
At step 102, the controller determines a second memory granularity based on the first memory granularity and a target minimum memory granularity associated with the target tensor data. Each dimension value of the second memory granularity is the least common multiple of the corresponding dimension values of the first memory granularity and the target minimum memory granularity.
For example, a target minimum memory granularity Tf72 for the target tensor data T72 is shown in fig. 7. The target minimum memory granularity Tf72 of T72 is [6,2] (or 6×2); for example, the 12 (6×2) elements of T72 including element 1, element 2, element 7, element 8, and so on form one target minimum memory granularity Tf72. Similarly, the 12 elements including element 3, element 4, element 9, element 10, and so on form a target minimum memory granularity, and the 12 elements including element 5, element 6, element 11, element 12, and so on form another. It is worth noting that the target minimum memory granularity Tf72 is, for example, the smallest sub-tensor that can be accessed at a time when the target tensor data T72 is accessed (e.g., read). The sub-tensor corresponding to Tf72 has a predetermined number of elements and a predetermined shape. In some embodiments, to obtain higher memory-access efficiency, the target minimum memory granularity Tf72 is, for example, a sub-tensor of a predetermined granularity defined to satisfy the memory alignment requirements of the target tensor data T72 in its respective dimensions. The target tensor data T72 is, for example, a third-order tensor while Tf72 corresponds to a second-order tensor; therefore, in the third dimension of T72, Tf72 takes a dimension value of 1 according to the memory alignment requirement. That is, when the order of the target minimum memory granularity Tf72 is smaller than that of the target tensor data T72, the dimension value of Tf72 in each dimension it "lacks" compared with T72 is taken as 1.
It is worth noting that the target minimum memory granularity Tf72 is related to the target address layout pattern for the elements in the target tensor data. That is, the target minimum memory granularity Tf72 can reflect the target address layout pattern for the elements in the target tensor data. Fig. 10 is a schematic diagram of a physical storage structure of a target minimum memory granularity Tf72 in an embodiment of the disclosure, where target tensor data T72 is stored in the third memory M3, and in the target tensor data T72, an address corresponding to the element 1 is an initial address BM3 of the third memory M3.
It is worth noting that the first memory granularity (i.e., the source minimum memory granularity Tf71) is [2,6] and the target minimum memory granularity Tf72 is [6,2]; the second memory granularity Tf2 is therefore [6,6]. The first dimension value "6" of the second memory granularity Tf2 is the least common multiple of the first dimension value "2" of the first memory granularity and the first dimension value "6" of the target minimum memory granularity Tf72; the second dimension value "6" of Tf2 is the least common multiple of the second dimension value "6" of the first memory granularity and the second dimension value "2" of Tf72.
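Computed per dimension, the least-common-multiple rule of step 102 can be sketched as follows (an illustrative helper, assuming Python 3.9+ for math.lcm):

```python
# Derive the second memory granularity as a per-dimension LCM.
from math import lcm

def second_granularity(first, target_min):
    # Elementwise LCM over corresponding dimension values.
    return [lcm(a, b) for a, b in zip(first, target_min)]

print(second_granularity([2, 6], [6, 2]))  # [6, 6], as in Fig. 7
print(second_granularity([6, 2], [3, 4]))  # [6, 4], as in Fig. 13
```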
Then, at step 104, the controller reads sub-tensor data corresponding to an integer multiple of the second memory granularity from the source tensor data for storage to the cache unit. For example, the controller reads sub-tensor data corresponding to N times the second memory granularity Tf2 from the first memory M1 storing the source tensor data T71, where N is a positive integer.
Taking fig. 7 as an example, at step 104 the controller reads sub-tensor data corresponding to one second memory granularity Tf2 into the cache unit M2.
In a particular implementation, at step 104, the controller determines a first number of source minimum memory granularities contained in the integer multiple of the second memory granularity, and reads the corresponding sub-tensor data from the source tensor data in units of the source minimum memory granularity. For example, the controller determines that the first number of source minimum memory granularities Tf71 contained in one second memory granularity Tf2 is 3, i.e., one second memory granularity Tf2 contains 3 source minimum memory granularities Tf71. The controller then reads in 3 cycles, each cycle reading one sub-tensor corresponding to the source minimum memory granularity Tf71. For example, the controller first reads the source minimum memory granularity Tf71 composed of the 12 elements 1-12, then sequentially reads the source minimum memory granularity composed of elements 13-24 and the source minimum memory granularity composed of elements 25-36 of the source tensor data T71. The elements corresponding to the 3 source minimum memory granularities Tf71 are then stored to the cache unit M2.
Next, at step 106, the controller reads the sub-tensor data stored in the cache unit according to the target minimum memory granularity, so as to store it into the storage space corresponding to the target tensor data. In a particular implementation, at step 106, the controller determines a second number of target minimum memory granularities Tf72 contained in the sub-tensor data stored in the cache unit M2, and stores the sub-tensor data from the cache unit into the storage space corresponding to the target tensor data (e.g., the memory M3) in units of the target minimum memory granularity Tf72.
Taking fig. 7 as an example, the controller determines that the second number of target minimum memory granularities Tf72 contained in the sub-tensor data stored in the cache unit M2 is 3; that is, the sub-tensor data in the cache unit M2 corresponds to 3 target minimum memory granularities Tf72. The controller then reads in 3 cycles, each cycle reading one sub-tensor corresponding to the target minimum memory granularity Tf72 and storing it into the memory M3. For example, the controller first reads the target minimum memory granularity Tf72 composed of the 12 elements including element 1, element 2, element 7, and element 8; it then sequentially reads the target minimum memory granularity composed of the 12 elements including element 3, element 4, element 9, and element 10, and the target minimum memory granularity Tf72 composed of the 12 elements including element 5, element 6, element 11, and element 12. The elements in the cache unit M2 corresponding to the 3 target minimum memory granularities Tf72 are thus stored to the memory M3. FIG. 11 illustrates a schematic diagram of a physical storage structure of local data of the target tensor data according to an embodiment of the present disclosure.
It should be appreciated that by repeatedly performing steps 104-106 as described above, the remaining elements of the source tensor data T71 can be progressively converted and stored into the memory M3, thereby realizing the conversion from the source tensor data T71 to the target tensor data T72. It should be noted that T71 and T72 have the same logical structure but different physical storage structures, and are therefore different tensor data.
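As an illustration of the gather/scatter pattern that steps 104-106 perform on the Fig. 7 data, the following is a hedged numpy sketch; it assumes both source and target storage are linear arrays, and the tile handling is illustrative rather than the patent's implementation:

```python
# Gather one [6,6] tile from source order, scatter it in target order.
import numpy as np

src = np.arange(1, 37)                    # elements 1..36 in source order
src_granules = src.reshape(3, 2, 6)       # three [2,6] source granules
tile = np.concatenate(src_granules, 0)    # cache unit holds a [6,6] tile

# Rewrite the [6,6] tile as three [6,2] target granules, flattening each
# granule into the target-order physical storage.
dst_granules = [tile[:, 2 * i:2 * i + 2] for i in range(3)]
dst = np.concatenate([g.reshape(-1) for g in dst_granules])
print(dst[:4])  # first target granule starts with elements 1, 2, 7, 8
```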
Fig. 12 shows a schematic diagram of a tensor data conversion used as a comparison. Here the controller stores sub-tensor data corresponding to only one source minimum memory granularity Tf71 in the cache unit M2, then reads data from M2 at the target minimum memory granularity Tf72 and stores it to the memory M3. In that case the controller can only read 4 elements at a time (element 1, element 2, element 7, element 8, which is only 1/3 of the data amount of one target minimum memory granularity Tf72) and store them to the memory M3, which is inefficient. Referring to fig. 7, when the tensor data conversion is performed with the method 100, each operation cycle of the controller handles the data amount of at least one complete target minimum memory granularity Tf72, which can significantly improve memory-access efficiency.
FIG. 13 illustrates a schematic diagram of converting a layout of tensor data using a method of an embodiment of the present disclosure. The source tensor data T81 is, for example, a third-order tensor. When the layout of T81 is converted, the data in T81 may first be loaded into the first memory M1. For convenience of explanation, only part of the data of T81 stored in the first memory M1 is illustrated in the figure. The first memory M1 is, for example, an on-chip register used to store the source tensor data T81. Fig. 14 shows a schematic diagram of a physical storage structure of local data of the source tensor data T81 of an embodiment of the present disclosure. The address corresponding to element 1 is, for example, the starting address BM1 of the first memory M1. The source minimum memory granularity Tf81 of T81 is [2,6] (or 2×6); for example, the 12 (2×6) elements 1-12 of T81 form one source minimum memory granularity Tf81. Similarly, elements 13-24 of T81 form a source minimum memory granularity.
First, the controller determines whether an instruction for a transition conversion of the source tensor data T81 is received. As shown in fig. 13, a transition conversion (a transpose conversion) is involved. Accordingly, the controller determines that an instruction for a transition conversion of T81 is received, and then determines the memory granularity to which the source minimum memory granularity corresponds after the transition conversion to be the first memory granularity Tf80. In this example, the first memory granularity Tf80 is [6,2], obtained from the source minimum memory granularity Tf81 of [2,6] through the transition conversion (a transpose conversion). Fig. 15 shows a schematic diagram of a physical storage structure of the source minimum memory granularity Tf81 in an embodiment of the present disclosure. Fig. 16 shows a schematic diagram of a physical storage structure of the first memory granularity Tf80 in an embodiment of the present disclosure.
At step 102, the controller determines a second memory granularity based on the first memory granularity and a target minimum memory granularity associated with the target tensor data. Each dimension value of the second memory granularity is the least common multiple of the corresponding dimension values of the first memory granularity and the target minimum memory granularity.
For example, a target minimum memory granularity Tf82 for the target tensor data T82 is shown in fig. 13. The target minimum memory granularity Tf82 of T82 is [3,4] (or 3×4); for example, the 12 (3×4) elements of T82 including element 1, element 7, element 13, element 19, and so on form one target minimum memory granularity Tf82. Similarly, the 12 elements including element 4, element 10, element 16, element 22, and so on form another target minimum memory granularity Tf82. Fig. 17 shows a schematic diagram of a physical storage structure of the target minimum memory granularity Tf82 in an embodiment of the present disclosure.
It is worth noting that the first memory granularity Tf80 is [6,2] and the target minimum memory granularity Tf82 is [3,4]; the second memory granularity Tf2 is therefore [6,4]. The first dimension value "6" of the second memory granularity Tf2 is the least common multiple of the first dimension value "6" of the first memory granularity Tf80 and the first dimension value "3" of the target minimum memory granularity Tf82; the second dimension value "4" of Tf2 is the least common multiple of the second dimension value "2" of Tf80 and the second dimension value "4" of Tf82.
Then, at step 104, the controller reads sub-tensor data corresponding to an integer multiple of the second memory granularity from the source tensor data for storage to the cache unit. For example, the controller reads sub-tensor data corresponding to N times the second memory granularity Tf2 from the first memory M1 storing the source tensor data T81, where N is a positive integer. In some embodiments, N is a positive integer greater than 1.
Taking fig. 13 as an example, at step 104 the controller reads sub-tensor data corresponding to one second memory granularity Tf2 into the cache unit M2.
In a particular implementation, at step 104, the controller determines a first number of source minimum memory granularities contained in the integer multiple of the second memory granularity, and reads the corresponding sub-tensor data from the source tensor data in units of the source minimum memory granularity. For example, the controller determines that the first number of source minimum memory granularities Tf81 contained in one second memory granularity Tf2 is 2, i.e., one second memory granularity Tf2 contains 2 source minimum memory granularities Tf81. The controller then reads in 2 cycles, each cycle reading one sub-tensor corresponding to the source minimum memory granularity Tf81. For example, the controller first reads the source minimum memory granularity Tf81 composed of elements 1-12 and then the source minimum memory granularity composed of elements 13-24. The elements corresponding to the 2 source minimum memory granularities Tf81 are then stored to the cache unit M2.
Next, at step 106, the controller reads the sub-tensor data stored in the cache unit according to the target minimum memory granularity, so as to store it into the storage space corresponding to the target tensor data. In a particular implementation, at step 106, the controller determines a second number of target minimum memory granularities Tf82 contained in the sub-tensor data stored in the cache unit M2, and stores the sub-tensor data from the cache unit into the storage space corresponding to the target tensor data (e.g., the memory M3) in units of the target minimum memory granularity Tf82.
Taking fig. 13 as an example, the controller determines that the second number of target minimum memory granularities Tf82 contained in the sub-tensor data stored in the cache unit M2 is 2; that is, the sub-tensor data in the cache unit M2 corresponds to 2 target minimum memory granularities Tf82. The controller then reads in 2 cycles, each cycle reading one sub-tensor corresponding to the target minimum memory granularity Tf82 and storing it into the memory M3. For example, the controller first reads the target minimum memory granularity Tf82 composed of the 12 elements including element 1, element 7, element 13, and element 19, and then reads the target minimum memory granularity Tf82 composed of the 12 elements including element 4, element 10, element 16, and element 22. The elements in the cache unit M2 corresponding to the 2 target minimum memory granularities Tf82 are thus stored to the memory M3. Fig. 18 illustrates a schematic diagram of a physical storage structure of local data of the target tensor data according to an embodiment of the present disclosure.
It should be appreciated that by repeatedly performing steps 104-106 as described above, the remaining elements of the source tensor data T81 can be progressively converted and stored into the memory M3, thereby realizing the conversion from the source tensor data T81 to the target tensor data T82.
As a general expression, consider, for example, source tensor data Ts = [Ds1, Ds2, Ds3, …, Dsn], an n-order tensor. The shape of the first memory granularity Tsf determined for the source tensor data Ts is [Da1, Da2, Da3, …, Dam]. The order of the first memory granularity Tsf is less than or equal to that of the source tensor data Ts, i.e., m ≤ n, where m and n are natural numbers. The dimension value of each dimension of the first memory granularity Tsf is less than or equal to the corresponding dimension value of the source tensor data Ts, i.e., Da1 ≤ Ds1, Da2 ≤ Ds2, …, Dam ≤ Dsm.
For example, the source tensor data Ts = [Ds1, Ds2, Ds3, …, Dsn] is converted to form the target tensor data Td = [Dd1, Dd2, Dd3, …, Ddp], where Td is a p-order tensor. The target minimum memory granularity Tdf of the target tensor data Td is, for example, the sub-tensor that the elements of one source minimum memory granularity of Ts form in Td after the conversion. Thus, in some embodiments, the target minimum memory granularity Tdf has the same number of elements as the source minimum memory granularity. It should be noted that the orders of the target minimum memory granularity and the source minimum memory granularity may differ, and the orders of the first memory granularity Tsf and the source minimum memory granularity may also differ.
For example, the target minimum memory granularity of the target tensor data Td is Tdf = [Db1, Db2, Db3, …, Dbq]. It should be appreciated that the shape of the target minimum memory granularity Tdf is related to the corresponding source minimum memory granularity and to the manner of conversion. Correspondingly, the order of the target minimum memory granularity Tdf is less than or equal to that of the target tensor data Td, i.e., q ≤ p, where q and p are natural numbers. The dimension value of each dimension of the target minimum memory granularity Tdf is less than or equal to the corresponding dimension value of the target tensor data Td, i.e., Db1 ≤ Dd1, Db2 ≤ Dd2, …, Dbq ≤ Ddq. In other words, the portion of the target tensor data Td corresponding to the target minimum memory granularity Tdf may be regarded as a sub-tensor of Td.
Therefore, from the first memory granularity Tsf = [Da1, Da2, Da3, …, Dam] and the target minimum memory granularity Tdf = [Db1, Db2, Db3, …, Dbq], a second memory granularity can be determined whose dimension values are respectively the least common multiples of the corresponding dimension values of the first memory granularity and the target minimum memory granularity. When the order of the first memory granularity Tsf is not equal to that of the target minimum memory granularity Tdf, the dimension value of any dimension "missing" from the lower-order granularity is taken as "1". That is, the order of the second memory granularity equals the higher of the orders of the first memory granularity Tsf and the target minimum memory granularity Tdf.
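This rule can be sketched in Python as follows; the helper name is an assumption, and the trailing position of the padding follows the Tf71 example above (the "missing" third dimension takes the value 1):

```python
# Generalized second-granularity rule: pad the lower-order granularity
# with dimension values of 1, then take per-dimension LCMs.
from math import lcm

def second_granularity_general(first, target_min):
    order = max(len(first), len(target_min))
    a = list(first) + [1] * (order - len(first))            # pad "missing" dims
    b = list(target_min) + [1] * (order - len(target_min))  # with value 1
    return [lcm(x, y) for x, y in zip(a, b)]

print(second_granularity_general([2, 6], [4, 6, 2]))  # [4, 6, 2]
```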
Fig. 19 illustrates a flow chart of a method 1900 for converting a layout of tensor data according to an embodiment of the disclosure. It should be understood that method 1900 may also include additional steps not shown and/or that the illustrated steps may be omitted; the scope of the disclosure is not limited in this respect. Method 1900 may be implemented based on a controller or by the electronic device 2000 shown in fig. 20.
FIG. 21 illustrates a schematic diagram of transforming a layout of tensor data using a method of an embodiment of the present disclosure. Referring to fig. 19, a method 1900 includes:
at step 1902, a third memory granularity is determined based on a predetermined memory granularity associated with the source tensor data and the target minimum memory granularity. Each dimension value of the third memory granularity is the least common multiple of the corresponding dimension values of the predetermined memory granularity and the target minimum memory granularity; the predetermined memory granularity is n times the first memory granularity, where n is a positive integer greater than 1.
It should be noted that, in some embodiments, to improve the efficiency of tensor data conversion, a first predetermined memory granularity Tf90 corresponding to a plurality of source minimum memory granularities Tf91 is taken as one data block, and the source tensor data T91 is accessed in units of such data blocks. In some embodiments, the source tensor data T91 may also be accessed in burst mode, in which each burst operation may access a plurality of data blocks from the source tensor data T91.
A second predetermined memory granularity may be determined for the source tensor data based on the first predetermined memory granularity. For example, the controller determines whether an instruction for a transition conversion of the source tensor data T91 is received. If no such instruction is received, the controller determines the first predetermined memory granularity Tf90 of the source tensor data to be the second predetermined memory granularity. If such an instruction is received, the controller determines the memory granularity to which the first predetermined memory granularity Tf90 corresponds after the transition conversion to be the second predetermined memory granularity.
The controller then determines a third memory granularity from the second predetermined memory granularity of the source tensor data T91 and the target minimum memory granularity Tf92. Each dimension value of the third memory granularity is the least common multiple of the corresponding dimension values of the second predetermined memory granularity and the target minimum memory granularity Tf92; the second predetermined memory granularity is n times the first memory granularity, where n is a positive integer greater than 1. Referring to fig. 21, the second predetermined memory granularity includes 3 first memory granularities, and the first memory granularity here is the source minimum memory granularity Tf91.
At step 1904, it is determined whether the capacity of the cache unit meets the third memory granularity.
At step 1906, in response to determining that the capacity of the cache unit meets the third memory granularity, sub-tensor data corresponding to the third memory granularity is read from the source tensor data for storage to the cache unit. In other words, if the cache unit M2 is large enough, data is staged at the third memory granularity, which can significantly improve the efficiency of tensor data conversion.
At step 1908, in response to determining that the capacity of the cache unit does not meet the third memory granularity, it is determined whether the capacity of the cache unit meets the second memory granularity. It should be appreciated that the second memory granularity is determined based on the first memory granularity and the target minimum memory granularity, the first memory granularity being determined based on the source minimum memory granularity.
At step 1910, in response to determining that the capacity of the cache unit meets the second memory granularity, sub-tensor data corresponding to the second memory granularity is read from the source tensor data for storage to the cache unit.
At step 1912, in response to determining that the capacity of the cache unit does not meet the second memory granularity, sub-tensor data corresponding to the source minimum memory granularity is read from the source tensor data for storage to the cache unit.
It should be appreciated that if the capacity of the cache unit M2 is less than the third memory granularity but greater than or equal to the second memory granularity, sub-tensor data corresponding to the second memory granularity is read from the source tensor data for storage to the cache unit M2, which keeps the conversion as efficient as possible. If the capacity of the cache unit M2 is smaller than even the second memory granularity, sub-tensor data corresponding to the source minimum memory granularity is read from the source tensor data, stored in the cache unit, and then subjected to the tensor data conversion processing.
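The capacity checks of steps 1904 through 1912 amount to choosing the largest of the three granularities that fits the cache unit. Below is a minimal sketch under the assumption that capacity is measured in elements; the helper names are illustrative, and a real device would compare byte sizes against the cache unit M2.

```python
from math import prod

def pick_read_granularity(cache_capacity, third_g, second_g, source_min_g):
    """Return the largest granularity whose footprint fits the cache unit."""
    for g in (third_g, second_g, source_min_g):
        if prod(g) <= cache_capacity:
            return g
    raise ValueError("cache unit cannot hold the source minimum granularity")
```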
At step 1914, the sub-tensor data stored in the cache unit is read according to the target minimum access granularity for storage in the storage space corresponding to the target tensor data.
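Putting the pieces together, the following self-contained NumPy sketch mirrors the staged dataflow of steps 1906 through 1914: each read-granularity-sized sub-tensor is loaded into a buffer standing in for the cache unit M2, then drained in units of the target minimum granularity. The physical address remapping between the source and target layouts is elided, the shapes are assumed to divide evenly, and all function and variable names are illustrative rather than part of the disclosure.

```python
import numpy as np
from itertools import product

def convert_layout(src, read_g, dst_min_g):
    """Stage read_g-sized sub-tensors through a cache buffer, then drain
    the buffer in dst_min_g units (assumes shapes divide evenly)."""
    dst = np.empty_like(src)
    cache = np.empty(read_g, dtype=src.dtype)  # stands in for cache unit M2
    for off in product(*(range(0, s, g) for s, g in zip(src.shape, read_g))):
        block = tuple(slice(o, o + g) for o, g in zip(off, read_g))
        cache[...] = src[block]  # read one sub-tensor at the chosen granularity
        for sub in product(*(range(0, g, m) for g, m in zip(read_g, dst_min_g))):
            csl = tuple(slice(s, s + m) for s, m in zip(sub, dst_min_g))
            dsl = tuple(slice(o + s, o + s + m)
                        for o, s, m in zip(off, sub, dst_min_g))
            dst[dsl] = cache[csl]  # write out at the target minimum granularity
    return dst

# Usage: a 4x6 tensor staged through (2, 6) blocks, drained in (1, 2) units.
x = np.arange(24).reshape(4, 6)
assert np.array_equal(convert_layout(x, (2, 6), (1, 2)), x)
```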
FIG. 20 shows a schematic block diagram of an example electronic device 2000 that may be used to implement methods of embodiments of the present disclosure. As shown, the electronic device 2000 includes a central processing unit (CPU) 2001 that can perform various suitable actions and processes in accordance with computer program instructions stored in a read-only memory (ROM) 2002 or loaded from a storage unit 2008 into a random access memory (RAM) 2003. The RAM 2003 can also store various programs and data required for the operation of the electronic device 2000. The CPU 2001, the ROM 2002, and the RAM 2003 are connected to one another by a bus 2004. An input/output (I/O) interface 2005 is also connected to the bus 2004.
Various components in the electronic device 2000 are connected to the I/O interface 2005, including: an input unit 2006 such as a keyboard, mouse, microphone, etc.; an output unit 2007 such as various types of displays, speakers, and the like; a storage unit 2008 such as a magnetic disk, an optical disk, or the like; and a communication unit 2009 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 2009 allows the electronic device 2000 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The various procedures and processes described above, e.g., method 100 and method 1900, may be performed by the CPU 2001. For example, in some embodiments, the methods 100 and 1900 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 2008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 2000 via the ROM 2002 and/or the communication unit 2009. When the computer program is loaded into the RAM 2003 and executed by the CPU 2001, one or more actions of the method 100 or the method 1900 described above may be performed.
The present disclosure relates to methods, electronic devices, computer-readable storage media, and/or computer program products. The computer program product may include computer readable program instructions for performing various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., a light pulse through a fiber optic cable), or an electrical signal transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device, or to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer program instructions for carrying out the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of the computer readable program instructions, the electronic circuitry being able to execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of devices, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
The foregoing are merely optional embodiments of the present disclosure and are not intended to limit it; various modifications and variations may be made by those skilled in the art. Any modifications, equivalent substitutions, improvements, and the like that fall within the spirit and principles of the present disclosure are intended to be included within its scope.

Claims (14)

1. A method for converting a layout of tensor data, comprising:
determining a second memory granularity according to a first memory granularity related to source tensor data and a target minimum memory granularity related to target tensor data, wherein a plurality of dimension values of the second memory granularity are respectively the least common multiple of the corresponding dimension values of the first memory granularity and the target minimum memory granularity, and the first memory granularity is obtained according to a source minimum memory granularity related to the source tensor data; the source minimum memory granularity is the smallest sub-tensor accessible in each access to the source tensor data; the target minimum memory granularity is the smallest sub-tensor accessible in each access to the target tensor data;
reading sub-tensor data corresponding to an integer multiple of the second memory granularity from the source tensor data for storage in a cache unit; and
reading the sub-tensor data stored in the cache unit according to the target minimum memory granularity so as to store the sub-tensor data into a storage space corresponding to the target tensor data.
2. The method as recited in claim 1, further comprising:
determining whether an instruction for transition conversion of source tensor data is received;
in response to determining that no instruction for transition conversion of the source tensor data has been received, determining that the source minimum memory granularity for the source tensor data is the first memory granularity; and
in response to determining that an instruction for transition conversion of the source tensor data has been received, determining that the memory granularity obtained by applying the transition conversion to the source minimum memory granularity is the first memory granularity.
3. The method of claim 1, wherein the source minimum memory granularity is related to a source address layout pattern for elements in the source tensor data, and the target minimum memory granularity is related to a target address layout pattern for elements in the target tensor data.
4. The method of claim 3, wherein the source address layout pattern is different from the target address layout pattern.
5. The method as recited in claim 1, further comprising:
determining a third memory granularity according to a predetermined memory granularity related to the source tensor data and the target minimum memory granularity, wherein a plurality of dimension values of the third memory granularity are respectively the least common multiple of the corresponding dimension values of the predetermined memory granularity and the target minimum memory granularity, the predetermined memory granularity is n times the first memory granularity, and n is a positive integer greater than 1;
determining whether the capacity of the cache unit meets a third access granularity; and
in response to determining that the capacity of the cache unit meets the third memory granularity, sub-tensor data corresponding to the third memory granularity is read from the source tensor data for storage to the cache unit.
6. The method as recited in claim 5, further comprising:
in response to determining that the capacity of the cache unit does not meet the third memory granularity, determining whether the capacity of the cache unit meets the second memory granularity; and
in response to determining that the capacity of the cache unit meets the second memory granularity, sub-tensor data corresponding to the second memory granularity is read from the source tensor data for storage to the cache unit.
7. The method as recited in claim 6, further comprising:
in response to determining that the capacity of the cache unit does not meet the second memory granularity, reading sub-tensor data corresponding to the source minimum memory granularity from the source tensor data for storage to the cache unit.
8. The method of claim 1, wherein reading sub-tensor data corresponding to an integer multiple of the second memory granularity from the source tensor data comprises:
determining a first number of source minimum memory granularities contained in the integer multiple of the second memory granularity; and
reading sub-tensor data corresponding to the first number from the source tensor data in units of the source minimum memory granularity.
9. The method of claim 1, wherein reading the sub-tensor data stored in the cache unit according to the target minimum memory granularity comprises:
determining a second number of target minimum memory granularities contained in the sub-tensor data stored in the cache unit; and
storing the sub-tensor data stored in the cache unit into the storage space corresponding to the target tensor data in units of the target minimum memory granularity.
10. The method of claim 2, wherein the transition conversion comprises reshaping and/or transposing.
11. The method of claim 2, wherein the transition conversion comprises converting the source tensor data based on an operator fused from a reshape operator and a transpose operator.
12. The method of claim 1, wherein the order of the source minimum memory granularity is different from the order of the target minimum memory granularity.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 12.
14. A computer readable storage medium having stored thereon a computer program which, when executed by a machine, implements the method according to any of claims 1 to 12.
CN202311443517.3A 2023-11-01 2023-11-01 Method, apparatus and medium for converting a layout of tensor data Active CN117170588B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311443517.3A CN117170588B (en) 2023-11-01 2023-11-01 Method, apparatus and medium for converting a layout of tensor data

Publications (2)

Publication Number Publication Date
CN117170588A 2023-12-05
CN117170588B 2024-01-26

Family

ID=88937850

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311443517.3A Active CN117170588B (en) 2023-11-01 2023-11-01 Method, apparatus and medium for converting a layout of tensor data

Country Status (1)

Country Link
CN (1) CN117170588B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107689948A (en) * 2016-08-22 2018-02-13 北京深鉴科技有限公司 Efficient data memory access managing device applied to neural network hardware acceleration system
CN115221102A (en) * 2021-04-16 2022-10-21 中科寒武纪科技股份有限公司 Method for optimizing convolution operation of system on chip and related product
CN115454923A (en) * 2021-06-09 2022-12-09 寒武纪(昆山)信息科技有限公司 Data calculation device, board card, method and storage medium
CN116579274A (en) * 2023-04-27 2023-08-11 北京大学 Automatic design method of tensor operation acceleration chip
CN116911366A (en) * 2023-07-20 2023-10-20 上海壁仞智能科技有限公司 Computing system neural network optimization method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4201034B2 (en) * 2006-09-04 2008-12-24 ソニー株式会社 Video recording video camera apparatus, video recording method, and program
US11170294B2 (en) * 2016-01-07 2021-11-09 Intel Corporation Hardware accelerated machine learning
WO2022030877A1 (en) * 2020-08-03 2022-02-10 Samsung Electronics Co., Ltd. Control information transmission method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant