WO2022068328A1 - Method, apparatus, processor and computing device for data migration - Google Patents

Method, apparatus, processor and computing device for data migration

Info

Publication number
WO2022068328A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
migrated
storage
address
row
Application number
PCT/CN2021/106966
Other languages
English (en)
French (fr)
Inventor
侯新宇
李涛
俞立呈
刘昊程
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2022068328A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/52 - Multiplying; Dividing
    • G06F 7/523 - Multiplying only

Definitions

  • the present application relates to the field of computers, and in particular, to a method, apparatus, processor and computing device for data migration.
  • In matrix multiplication, the row elements of the first matrix are multiplied element by element with the corresponding column elements of the second matrix and the products are accumulated.
  • In memory, a matrix is usually stored in one of two ways: row storage or column storage.
  • Row storage means that elements belonging to the same row of the matrix are stored contiguously;
  • column storage means that elements belonging to the same column of the matrix are stored contiguously.
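  • As a minimal illustration (not part of the patent text), the following C sketch shows how the same 2x3 matrix is laid out under the two storage modes; the array names and values are hypothetical.

    #include <stdio.h>

    int main(void) {
        enum { M = 2, N = 3 };
        /* Element (i, j) of an M x N matrix sits at index i*N + j in row
         * storage and at index j*M + i in column storage. */
        double row_major[M * N] = {1, 2, 3,    /* row 0 */
                                   4, 5, 6};   /* row 1 */
        double col_major[M * N] = {1, 4,       /* column 0 */
                                   2, 5,       /* column 1 */
                                   3, 6};      /* column 2 */
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                printf("(%d,%d): row-stored %.0f, column-stored %.0f\n",
                       i, j, row_major[i * N + j], col_major[j * M + i]);
        return 0;
    }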
  • the present application provides a data migration method, apparatus, processor and computing device, so as to improve the efficiency of the data migration method.
  • A first aspect provides a method for data migration. The method is executed by a data migration unit in a processor and includes: acquiring the storage location of a matrix to be migrated, where the matrix to be migrated is stored in a first storage mode; reading the elements of the matrix to be migrated; and storing the elements of the matrix to be migrated in a second storage mode, the second storage mode being a storage mode different from the first storage mode.
  • obtaining the storage location of the matrix to be migrated may be obtaining the source storage address of the matrix to be migrated, that is, the storage address of the first element of the matrix to be migrated before the migration.
  • The destination storage address may also be obtained, that is, the storage address of the first element of the matrix to be migrated after the migration.
  • the acquired size information may also include two pieces of information, the number of rows and the number of columns of the matrix to be migrated.
  • the matrix to be migrated is a matrix that needs to be migrated to change the storage mode.
  • For example, if the matrices in memory are stored in row storage,
  • the matrix on the right side of the multiplication sign needs its column elements to be read contiguously;
  • that matrix is therefore the matrix to be migrated, and it needs to be migrated so that its storage mode changes from row storage to column storage.
  • Likewise, if the matrices in memory are stored in column storage,
  • the matrix on the left side of the multiplication sign needs its row elements to be read contiguously;
  • that matrix is therefore the matrix to be migrated, and it needs to be migrated so that its storage mode changes from column storage to row storage.
  • the present application can realize the migration process of the matrix through the data migration unit, without the need for equipment and storage space other than the processor, and the data reading and processing speed is faster.
  • In a specific implementation, the data migration unit can send multiple read requests to the internal bus of the processor, each read request being used to read the elements of one row or one column of the matrix to be migrated. Then, the acquired elements are stored according to the second storage mode.
  • the whole process is implemented by the data migration unit inside the processor.
  • In this way, the matrix on the left side of the multiplication sign, stored in column storage, can be migrated as the matrix to be migrated into row storage, or the matrix on the right side of the multiplication sign, stored in row storage, can be migrated as the matrix to be migrated into column storage. Therefore, when calculating matrix multiplication, one row or one column of elements can be obtained contiguously for multiplication and accumulation without frequent memory accesses, which effectively improves the calculation efficiency.
  • the first storage manner and the second storage manner include a column storage manner and a row storage manner, respectively.
  • the to-be-migrated matrix stored in row storage can be migrated to convert the storage mode
  • the to-be-migrated matrix stored in column storage can also be migrated to convert the storage mode
  • Optionally, before acquiring the source storage address, destination storage address and size information of the matrix to be migrated, the method further includes: acquiring a data migration instruction, where the data migration instruction carries a first register identifier, a second register identifier and a third register identifier; the first register identifier is used to indicate the first register, which stores the source storage address of the matrix to be migrated,
  • the second register identifier is used to indicate the second register, which stores the destination storage address of the matrix to be migrated, and the third register identifier is used to indicate the third register, which stores the size information of the matrix to be migrated.
  • Correspondingly, obtaining the source storage address, destination storage address and size information of the matrix to be migrated includes: obtaining the source storage address of the matrix to be migrated from the first register, obtaining the destination storage address of the matrix to be migrated from the second register, and
  • obtaining the size information of the matrix to be migrated from the third register.
  • In a specific implementation, before performing matrix multiplication, the processor can obtain the source storage address and size information of the matrix to be migrated in the memory, and apply for a memory space to store the migrated matrix; the first address of the applied memory space is the above-mentioned destination storage address.
  • the processor then writes the source storage address into a register, the destination storage address into a register, and the size information into a register.
  • the processor may generate a data migration instruction, where the data migration instruction carries the first register identifier, the second register identifier and the third register identifier.
  • the register identification may be a register number.
  • the processor stores the generated data migration instruction in the memory, and the instruction fetch unit obtains the data migration instruction stored in the memory. After the instruction fetch unit acquires the data migration instruction, it sends the data migration instruction to the decoding unit. The decoding unit decodes the received data migration instruction and sends it to the data migration unit.
  • The data migration unit obtains the first register identifier, the second register identifier and the third register identifier carried in the data migration instruction, obtains the source storage address of the matrix to be migrated from the register indicated by the first register identifier, obtains the destination storage address of the matrix
  • to be migrated from the register indicated by the second register identifier, and obtains the size information of the matrix to be migrated from the register indicated by the third register identifier.
  • Optionally, the third register further stores a matrix type identifier and the leading dimension of array (LDA) of the matrix to be migrated.
  • Obtaining the size information of the matrix to be migrated from the third register then includes: obtaining the size information, the matrix type identifier and the LDA of the matrix to be migrated from the third register. Correspondingly, reading the matrix to be migrated stored in the first storage mode based on the size information and the source storage address includes: reading the matrix to be migrated stored in the first storage mode based on the size information, the matrix type identifier, the LDA and the source storage address.
  • The matrix types may include ordinary matrices, square matrices, uncompressed-storage upper triangular matrices, uncompressed-storage lower triangular matrices, uncompressed-storage diagonal matrices, compressed-storage (packed) upper triangular matrices, compressed-storage lower triangular matrices, compressed-storage diagonal matrices, and the like.
  • Optionally, the source storage address is taken as the address of the first element of the first row or of the first element of the first column of the matrix to be migrated, and N-1 offset values are determined,
  • where N is the number of rows or columns of the matrix to be migrated; the elements of the matrix to be migrated are then read row by row or column by column according to the address of the first element of each row or column and the offset addresses.
  • An offset address indicates an offset relative to the address of the first element of a row or column, and can be calculated from the preset size of the elements of each row or column.
  • the data migration unit can read the data of the entire row or column or the entire matrix of the matrix to be migrated at one time, avoiding the problem of frequent memory access caused by the one-by-one reading process, and the reading efficiency is higher.
  • Optionally, before reading the elements of the matrix to be migrated, the method further includes: when the first storage mode is row storage, using the source storage address as the first address for reading the elements of the first row of the matrix to be migrated, and determining N-1 offset values according to the size information, the matrix type identifier and the LDA, where N is the number of rows of the matrix to be migrated; then, according to the i-th offset value and the source storage address, determining the first address for reading the elements of the (i+1)-th row of the matrix to be migrated, where i is a positive integer greater than 0 and less than N.
  • Similarly, when the first storage mode is column storage, the source storage address is used as the first address for reading the elements of the first column of the matrix to be migrated;
  • N-1 offset values are determined according to the size information, the matrix type identifier and the LDA, where N is the number of columns of the matrix to be migrated, and the first address for reading the elements of the (i+1)-th column is determined according to the i-th offset value and the source storage address.
  • Optionally, reading the elements of the matrix to be migrated includes: sending N read requests to the internal bus of the processor, each read request carrying one first address, where each read request is used to read
  • one row or one column of elements of the matrix to be migrated; and receiving the elements of the matrix to be migrated returned by the internal bus based on each first address.
  • In this way, one read request reads a whole row or column of elements rather than a single element, which improves the read efficiency of the matrix to be migrated.
  • Optionally, storing the elements to be migrated of the matrix to be migrated in the second storage mode according to the size information, the matrix type identifier and the destination storage address includes:
  • storing the elements to be migrated of the matrix to be migrated in the second storage mode in the memory, with the destination storage address as the first address, and/or
  • storing the elements to be migrated of the matrix to be migrated in the second storage mode in the cache of the processor corresponding to the destination storage address.
  • the destination storage address is a memory address
  • the read element may be stored in a cache corresponding to the memory address before being stored in the memory.
  • the cache may be the processor's cache. In this way, when performing matrix multiplication, data can be directly read in the cache, and the reading speed is faster.
  • A second aspect provides an apparatus for data migration, where the apparatus includes modules for executing the data migration method in the first aspect or any possible implementation of the first aspect.
  • a third aspect provides a processor, where the processor includes a data migration unit, and the data migration unit is configured to execute the data migration method in the first aspect or any possible implementation manner of the first aspect.
  • A fourth aspect provides a computing device, where the computing device includes a processor, the processor includes a data migration unit, and the data migration unit is configured to execute the data migration method in the first aspect or any possible implementation of the first aspect.
  • the data migration unit can acquire the storage location of the matrix to be migrated and read the elements of the matrix to be migrated in the storage location, and the matrix to be migrated is stored in the first storage mode before the migration. Then, the data migration unit stores the elements of the matrix to be migrated according to the second storage mode, and the second storage mode is a storage mode different from the first storage mode.
  • In this way, the matrix on the left side of the multiplication sign, stored in column storage, can be migrated as the matrix to be migrated into row storage, or the matrix on the right side of the multiplication sign, stored in row storage, can be migrated as the matrix to be migrated into column storage, so that when calculating the matrix multiplication,
  • the data migration unit in the processor allows a row or a column of elements to be obtained contiguously for multiplication and accumulation, without using other external devices or frequently accessing the memory, which effectively improves the computing efficiency.
  • FIG. 1 is a schematic structural diagram of a processor provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of a method for data migration provided by an embodiment of the present application
  • FIG. 4 is a schematic diagram of a register storage format provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of an apparatus for data migration provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • a matrix in which the number of elements in each row is equal to the number of elements in each column can be called a square matrix.
  • A square matrix in which all elements other than those on the main diagonal are 0 can be called a diagonal matrix.
  • According to the storage method, diagonal matrices can be divided into uncompressed-storage diagonal matrices and compressed-storage diagonal matrices, where a compressed-storage diagonal matrix means that, when the matrix is stored, the elements other than those on the diagonal are not stored, and an uncompressed-storage diagonal matrix means that all elements are stored completely.
  • a square matrix whose elements below the diagonal are all 0 can be called an upper triangular matrix.
  • Upper triangular matrices can include two types: the uncompressed-storage upper triangular matrix and the compressed-storage upper triangular matrix.
  • A compressed-storage upper triangular matrix means that the 0 elements below the diagonal are not stored;
  • an uncompressed-storage upper triangular matrix means that all elements are stored completely.
  • a square matrix whose elements above the diagonal are all 0 can be called a lower triangular matrix.
  • Lower triangular matrices can include two types: the uncompressed-storage lower triangular matrix and the compressed-storage lower triangular matrix.
  • A compressed-storage lower triangular matrix means that the 0 elements above the diagonal are not stored;
  • an uncompressed-storage lower triangular matrix means that all elements are stored completely.
  • FIG. 1 is a schematic structural diagram of a processor 100 according to an embodiment of the present application.
  • The processor 100 includes an instruction fetch unit (fetch) 10, a decoding unit (decode) 20, an out-of-order sending unit (issue) 30 and a data migration unit 40.
  • The instruction fetch unit 10 is connected to the decoding unit 20,
  • the decoding unit 20 is connected to the out-of-order sending unit 30
  • the data migration unit 40 is connected to the out-of-order sending unit 30 .
  • the processor 100 may further include a cache 50 and other processing units 60 , wherein the cache 50 is connected to the data migration unit 40 , and the processing unit 60 may be connected to the out-of-order sending unit 30 .
  • the instruction fetching unit 10 can acquire the instruction, and send the acquired instruction to the decoding unit 20, and the decoding unit 20 decodes the received instruction and sends the decoded instruction to the processing unit.
  • the instruction may be a data migration instruction
  • the processing unit may be the data migration unit 40 .
  • The processor 100 may also include a plurality of registers 70, and the number of registers included may vary according to the number of bits of the processor. The specific number of registers 70 included in the processor 100 is not limited in this embodiment of the present application.
  • the register 70 may be connected to the data migration unit 40 described above.
  • the register 70 can be used to store data (for example, the storage location of the matrix to be migrated and parameters such as size information)
  • The processor 100 may also be connected to the memory 200 outside the processor 100.
  • the memory 200 may be connected to the above-mentioned cache unit 50 , and the data stored in the memory 200 (for example, the matrix to be migrated) may be cached by the cache unit 50 first.
  • The data migration unit 40 can obtain the matrix to be migrated from the memory 200 for migration, so that the storage mode of the matrix changes from row storage before the migration to column storage after the migration, or from
  • column storage before the migration to row storage after the migration.
  • the out-of-order sending unit 30 is configured to receive an instruction decoded by the decoding unit 20 and send the instruction to a corresponding processing unit for processing.
  • the processing unit may be the data migration unit 40 or other processing units 60 described above.
  • data migration unit is a hardware unit defined in this application, and its name is not limited in the embodiment of this application.
  • the embodiment of the present application provides a method for data migration, and the method can be implemented by a data migration unit in a processor. As shown in FIG. 2 , a process of a data migration method provided by an embodiment of the present application may include the following steps:
  • Step 201 Acquire the storage location of the matrix to be migrated, and the matrix to be migrated is stored in a first storage manner.
  • The conversion of the matrix storage format needs a new storage space to store the converted matrix. Therefore, the above-mentioned conversion of a matrix between row storage and column storage can also be called a migration process of the matrix data.
  • matrix data migration is taken as an example for description.
  • the data migration unit acquires the storage location of the matrix to be migrated, which may be to acquire the source storage address of the matrix to be migrated, that is, the storage address of the first element of the matrix to be migrated before the migration.
  • A destination storage address can also be obtained, where the destination storage address is the storage address of the first element of the matrix to be migrated after the migration.
  • the first element refers to the element located in the first row and first column of the matrix to be migrated.
  • Size information of the matrix to be migrated may also be acquired, and the size information may include the number of rows and columns of the matrix to be migrated, that is, the size information is used to indicate the size of the matrix to be migrated.
  • Specifically, the processor can obtain the source storage address and size information of the matrix to be migrated in the memory, and apply for a storage space in the memory for storing the migrated data; the first address of the applied storage space is the above-mentioned destination storage address.
  • the processor then writes the source storage address into a register, the destination storage address into a register, and the size information into a register.
  • the above-mentioned registers may correspond to the registers 70 in FIG. 1 .
  • For example, one register 70 may be used to store the above-mentioned source storage address, destination storage address and size information, or three different registers 70 may be used to store the source storage address, the destination storage address and the size information respectively.
  • the processing unit in the processor may call the compiler to generate a data migration instruction, where the data migration instruction carries the first register identifier, the second register identifier and the third register identifier.
  • the first register identifier is used to indicate the register that stores the source storage address of the matrix to be migrated
  • the second register identifier is used to indicate the register that stores the destination storage address of the to-be-migrated matrix
  • the third register identifier is used to indicate the register that stores the size information of the matrix to be migrated.
  • the register identification includes at least one of a register number and an address.
  • fields 0-4 are the identifiers of the second register
  • fields 5-9 are the identifiers of the first register
  • fields 16-20 are the identifiers of the third register
  • the remaining fields are reserved fields.
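  • As a hedged sketch (not from the patent), the following C helpers show how such an instruction word could be packed and unpacked using the field positions just described; the 32-bit width, the function names and the handling of the reserved/opcode bits are assumptions.

    #include <stdint.h>

    /* Hypothetical encoding helpers for the data migration instruction:
     * bits 0-4 hold the second (destination-address) register id,
     * bits 5-9 the first (source-address) register id,
     * bits 16-20 the third (size-information) register id;
     * the remaining bits are reserved and left as 0 here. */
    static uint32_t encode_migrate(uint32_t r_src, uint32_t r_dst, uint32_t r_size) {
        return ((r_dst  & 0x1Fu) << 0) |
               ((r_src  & 0x1Fu) << 5) |
               ((r_size & 0x1Fu) << 16);
    }

    static void decode_migrate(uint32_t insn,
                               uint32_t *r_src, uint32_t *r_dst, uint32_t *r_size) {
        *r_dst  = (insn >> 0)  & 0x1Fu;
        *r_src  = (insn >> 5)  & 0x1Fu;
        *r_size = (insn >> 16) & 0x1Fu;
    }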
  • the processor stores the generated data migration instruction in the memory, and the instruction fetch unit acquires the data migration instruction stored in the memory.
  • the instruction fetch unit acquires the data migration instruction, it sends the data migration instruction to the decoding unit.
  • the decoding unit decodes the received data migration instruction and sends it to the data migration unit.
  • The data migration unit obtains the first register identifier, the second register identifier and the third register identifier carried in the data migration instruction, obtains the source storage address of the matrix to be migrated from the register indicated by the first register identifier, obtains the destination storage address of the matrix to be migrated from the register indicated by the second register identifier, and obtains the size information of the matrix to be migrated from the register indicated by the third register identifier.
  • Step 202 Read the elements of the matrix to be migrated.
  • Before reading the elements of the matrix to be migrated, the data migration unit takes the source storage address as the address of the first element of the first row or of the first element of the first column of the matrix to be migrated, and determines N-1 offset values, where N is the number of rows or columns of the matrix to be migrated; the elements of the matrix to be migrated are then read row by row or column by column according to the address of the first element of each row or column and the offset addresses.
  • the data migration unit can read the data of the entire row or column of the matrix to be migrated at one time, avoiding the problem of frequent memory access caused by the one-by-one reading process, and the reading efficiency is higher.
  • the data migration unit may generate a read request and send the read request to the internal bus of the processor, and the internal bus reads the elements of the matrix to be migrated from the memory according to the read request and returns it to the data migration unit.
  • the internal bus may be a bus used inside the processor for transmitting requests, which is not limited herein.
  • When the first storage mode is row storage, the data migration unit can use the source storage address of the matrix to be migrated and the number of columns in the size information of the matrix to be migrated (that is, the number of elements in each row of the matrix to be migrated) as the offset value to determine the storage address corresponding to the first
  • element to be migrated of each row of the matrix to be migrated. At the same time, the number of read bytes corresponding to each row can also be determined, that is, the number of elements to be migrated in that row.
  • The determined storage address corresponding to the first element to be migrated in each row is used as the first address for reading, and the corresponding number of elements is read.
  • the first element to be migrated in each row may be the first element in each row, and all elements in each row may be elements to be migrated.
  • the data migration unit may sequentially generate m read requests that respectively carry the above-mentioned first address and the corresponding number of read bytes.
  • The first read request is used to read the elements to be migrated in the first row of the matrix to be migrated, and the first address carried in the first read request is the storage address of the first element to be migrated in the first row of the matrix to be migrated.
  • By analogy, the m-th read request is used to read the elements to be migrated in the m-th row of the matrix to be migrated, and the first address carried in the m-th read request is the storage address of the first element to be migrated in the m-th row of the matrix to be migrated.
  • Each read request thus carries its own first address and the corresponding number of read bytes.
  • the read request is used to continuously read Y bytes of data starting from the data stored in the storage location corresponding to the first address X.
  • the Y bytes of data are the Y elements to be migrated in the first row.
  • Alternatively, the data migration unit may also generate only one read request, and that read request is used to read the elements of the matrix to be migrated row by row or column by column.
  • the data migration unit can determine the N first addresses and the number of read bytes corresponding to each first address according to the source storage address, the size information of the matrix and the size of the storage space occupied by each element.
  • When the first storage mode is row storage, N is the number of rows of the matrix to be migrated,
  • and the i-th first address indicates the storage address of the first element to be migrated in the i-th row of the matrix to be migrated.
  • When the first storage mode is column storage, N is the number of columns of the matrix to be migrated, the i-th first address indicates the storage address of the first element to be migrated in the i-th column of the matrix to be migrated, and i is a positive integer greater than 0 and less than or equal to N.
  • the following describes a method for reading a matrix to be migrated by taking the data migration unit generating a read request for each row or column as an example when the matrix storage mode is row storage and column storage respectively.
  • the number of read requests required to read the matrix to be migrated stored in the row is the same as the number of rows of the matrix to be migrated, and each read request reads a row of elements of the matrix to be migrated.
  • To read a matrix to be migrated with m rows and n columns, the data migration unit can generate and send m read requests.
  • the first read request reads n elements starting with the source storage address (src_addr);
  • the second read request reads n elements of the address headed by src_addr+n;
  • the mth read request reads n elements whose address is headed by src_addr+(m-1)n.
  • the number of read requests required to read the matrix to be migrated stored in the column is the same as the number of columns of the matrix to be migrated, and each read request reads one column of elements of the matrix to be migrated.
  • To read a matrix to be migrated with m rows and n columns, the data migration unit can generate and send n read requests.
  • the first read request reads m elements starting with the source storage address (src_addr);
  • the second read request reads m elements of the address headed by src_addr+m;
  • the nth read request reads m elements whose address is headed by src_addr+(n-1)m.
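  • The address arithmetic above can be summarized with the following C sketch, which is illustrative only: addresses are counted in elements rather than bytes, and the structure and function names are assumptions rather than anything defined by the patent.

    #include <stddef.h>

    /* One planned read request: first address and number of elements to read. */
    typedef struct { size_t first_addr; size_t count; } read_req_t;

    /* Row storage: one request per row, row i starts at src_addr + i*n. */
    static void plan_row_reads(size_t src_addr, size_t m, size_t n, read_req_t *reqs) {
        for (size_t i = 0; i < m; i++) {
            reqs[i].first_addr = src_addr + i * n;
            reqs[i].count = n;
        }
    }

    /* Column storage: one request per column, column j starts at src_addr + j*m. */
    static void plan_col_reads(size_t src_addr, size_t m, size_t n, read_req_t *reqs) {
        for (size_t j = 0; j < n; j++) {
            reqs[j].first_addr = src_addr + j * m;
            reqs[j].count = m;
        }
    }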
  • Optionally, a leading dimension of array (LDA) can also be set.
  • For row storage, the sum of the number of bytes (number of elements) occupied by the elements of one row and the number of bytes (number of elements) separating the last element of that row from the first element of the next row is the LDA of the matrix.
  • For column storage, the sum of the number of bytes (number of elements) occupied by the elements of one column and the number of bytes (number of elements) separating the last element of that column from the first element of the next column is the LDA of the matrix.
  • the LDA can be acquired by the processor and stored in the same register as the size information.
  • In this case, the processing of step 202 may be as follows: the data migration unit reads the matrix to be migrated according to the LDA, the size information and the source storage address.
  • LDA is greater than or equal to the number of rows or columns of the matrix.
  • When the matrix is stored by rows, the LDA should be greater than or equal to the number of columns of the matrix, and when the matrix is stored by columns, the LDA should be greater than or equal to the number of rows of the matrix.
  • the following describes the method for reading the matrix to be migrated when the storage mode is row storage and column storage respectively.
  • the number of read requests required to read the matrix to be migrated stored in the row is the same as the number of rows of the matrix to be migrated, and each read request reads the elements to be migrated in a row of the matrix to be migrated. Unlike when LDA is not set, LDA needs to be considered when determining the first address.
  • the data migration unit determines the storage corresponding to the first element to be migrated in each row of the matrix to be migrated according to the source storage address of the matrix to be migrated and the offset value corresponding to each row (the sum of the LDAs corresponding to each row before the current row). address.
  • the number of read bytes corresponding to each row can also be determined, that is, the number of elements to be migrated in the row.
  • the determined storage address corresponding to the first element to be migrated in the row is used as the first address to read, and the corresponding elements of the number of read bytes are read.
  • the first element to be migrated in each row may be the first element in the row, and all elements in each row may be elements to be migrated.
  • Assuming the LDA is L, for an m-row, n-column matrix to be migrated that is stored by rows,
  • the data migration unit can generate and send m read requests.
  • The first read request reads n elements starting with the source storage address (src_addr);
  • the second read request reads n elements with src_addr+L as the first address;
  • by analogy, the m-th read request reads n elements with src_addr+(m-1)L as the first address.
  • For an m-row, n-column matrix to be migrated that is stored by columns, again assuming the LDA is L,
  • the data migration unit can generate and send n read requests.
  • The first read request reads m elements starting with the source storage address (src_addr);
  • the second read request reads m elements with src_addr+L as the first address;
  • by analogy, the n-th read request reads m elements whose address is headed by src_addr+(n-1)L.
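  • A minimal sketch of the same address computation with an LDA of L, again with element-granular addresses and illustrative function names:

    #include <stddef.h>

    /* With an LDA of L, adjacent rows (row storage) or adjacent columns
     * (column storage) are separated by L elements rather than by the
     * row/column length itself. */
    static size_t row_first_addr(size_t src_addr, size_t i, size_t L) {
        return src_addr + i * L;   /* valid when L >= number of columns n */
    }
    static size_t col_first_addr(size_t src_addr, size_t j, size_t L) {
        return src_addr + j * L;   /* valid when L >= number of rows m */
    }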
  • the matrix type identifier used to indicate the matrix type can be acquired by the processor and stored in the same register as the size information.
  • the processing of step 202 may be as follows: the data migration unit reads the matrix to be migrated according to the matrix type identifier, size information and source storage address.
  • the matrix type identifier is used to indicate the type of the matrix to be migrated, and the matrix type identifier may be a number, an English letter, or the like.
  • Matrix types can include: ordinary matrix, square matrix, upper triangular matrix, lower triangular matrix and diagonal matrix; according to the way the matrix is stored, these can be divided into uncompressed-storage upper triangular matrix, uncompressed-storage lower triangular matrix, uncompressed-storage diagonal matrix, compressed-storage upper triangular matrix, compressed-storage lower triangular matrix, compressed-storage diagonal matrix, and the like.
  • the matrix type identifier is a number, and the corresponding relationship between the matrix type identifier and the matrix type may be as shown in Table 1 below.
  • Table 1
        Matrix type identifier    Matrix type
        0                         Ordinary matrix
        1                         Square matrix
        2                         Uncompressed-storage upper triangular matrix
        3                         Uncompressed-storage lower triangular matrix
        4                         Uncompressed-storage diagonal matrix
        5                         Compressed-storage upper triangular matrix
        6                         Compressed-storage lower triangular matrix
        7                         Compressed-storage diagonal matrix
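  • For illustration only, Table 1 could be represented as the following C enumeration; the enumerator names are assumptions, and only the numeric values come from the table.

    /* Matrix type identifiers from Table 1. */
    enum matrix_type {
        MAT_ORDINARY           = 0,
        MAT_SQUARE             = 1,
        MAT_UPPER_TRI_UNPACKED = 2,
        MAT_LOWER_TRI_UNPACKED = 3,
        MAT_DIAGONAL_UNPACKED  = 4,
        MAT_UPPER_TRI_PACKED   = 5,
        MAT_LOWER_TRI_PACKED   = 6,
        MAT_DIAGONAL_PACKED    = 7
    };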
  • For example, for an uncompressed-storage upper triangular matrix, only the elements on and above the diagonal are read as the elements to be migrated, and the 0 elements below the diagonal need not be read, which improves the reading efficiency and saves storage space after the migration.
  • When the matrix is stored by rows, the data migration unit determines the storage address corresponding to the first element to be migrated in each row of the matrix to be migrated according to the source storage address of the matrix to be migrated and the offset value corresponding to each row (the sum of the number of elements in the rows before this row and the number of 0 elements below the diagonal in this row).
  • At the same time, the number of read bytes corresponding to each row can also be determined, that is, the number of elements to be migrated in that row.
  • the determined storage address corresponding to the first element to be migrated in the row is used as the first address to read, and the corresponding elements of the number of read bytes are read.
  • the first element to be migrated in each row may be the element on the first diagonal and above in the row, and the element on the diagonal and above in each row may be the element to be migrated.
  • the data migration unit can generate and send m read requests.
  • the first read request reads m elements starting with the source storage address (src_addr);
  • the second read request reads m-1 elements starting with src_addr+m+1;
  • the third read request reads m-2 elements of the address headed by src_addr+2m+2;
  • the mth read request reads 1 element of the address starting with src_addr+(m-1)m+m-1.
  • When the matrix is stored by columns, the data migration unit determines the storage address corresponding to the first element to be migrated in each column of the matrix to be migrated according to the source storage address of the matrix to be migrated and the offset value corresponding to each column (the sum of the numbers of elements in the columns before this column).
  • the number of read bytes corresponding to each column can also be determined, that is, the number of elements to be migrated in the column.
  • the determined storage address corresponding to the first element to be migrated in the column is used as the first address to read, and the corresponding elements of the number of read bytes are read.
  • the first element to be migrated in each column may be the element on the first diagonal and above in the column, and the element on the diagonal and above in each column may be the element to be migrated.
  • the data migration unit can generate and send m read requests.
  • the first read request reads 1 element with the source storage address (src_addr) as the first address;
  • the second read request reads 2 elements of the address headed by src_addr+m;
  • the third read request reads 3 elements of the address headed by src_addr+2m;
  • the mth read request reads m elements whose address is headed by src_addr+(m-1)m.
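  • The read plans above for an uncompressed-storage upper triangular matrix can be sketched as follows (illustrative C, element-granular addresses, assumed names; same illustrative read_req_t structure as in the earlier sketch):

    #include <stddef.h>

    typedef struct { size_t first_addr; size_t count; } read_req_t;

    /* Row storage: row i keeps m-i elements on/above the diagonal,
     * starting at src_addr + i*(m+1). */
    static void plan_upper_tri_row_reads(size_t src_addr, size_t m, read_req_t *reqs) {
        for (size_t i = 0; i < m; i++) {
            reqs[i].first_addr = src_addr + i * (m + 1);
            reqs[i].count = m - i;
        }
    }

    /* Column storage: column j keeps j+1 elements on/above the diagonal,
     * starting at src_addr + j*m. */
    static void plan_upper_tri_col_reads(size_t src_addr, size_t m, read_req_t *reqs) {
        for (size_t j = 0; j < m; j++) {
            reqs[j].first_addr = src_addr + j * m;
            reqs[j].count = j + 1;
        }
    }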
  • The processing of the data migration unit when reading a row-stored uncompressed-storage lower triangular matrix is similar to the processing when reading a column-stored uncompressed-storage upper triangular matrix,
  • and the processing when reading a column-stored uncompressed-storage lower triangular matrix is similar to that when reading a row-stored uncompressed-storage upper triangular matrix. Therefore, the reading of the uncompressed-storage lower triangular matrix will not be repeated here.
  • For a diagonal matrix, although row storage and column storage are identical, the diagonal matrix can still be migrated in this application: an uncompressed-storage diagonal matrix can be converted into a compressed-storage diagonal matrix, that is, the elements on the diagonal are read as the elements to be migrated, and the 0 elements off the diagonal are not read.
  • Specifically, the data migration unit determines the storage address corresponding to the first element to be migrated in each row of the matrix to be migrated according to the source storage address of the matrix to be migrated and the offset value corresponding to each row (the sum of the number of elements in the rows before this row and the number of 0 elements below the diagonal in this row).
  • the number of read bytes corresponding to each row can also be determined, that is, the number of elements to be migrated in the row.
  • the determined storage address corresponding to the first element to be migrated in the row is used as the first address to read, and the corresponding elements of the number of read bytes are read.
  • the first element to be migrated in each row may be the element on the first diagonal in the row, and the element on the diagonal in each row may be the element to be migrated.
  • the data migration unit can generate and send m read requests.
  • the first read request reads 1 element with the source storage address (src_addr) as the first address;
  • the second read request reads 1 element of the address headed by src_addr+m+1;
  • the third read request reads 1 element of the address headed by src_addr+2m+2;
  • the mth read request reads 1 element of the address starting with src_addr+(m-1)m+m-1.
  • For a compressed-storage upper triangular matrix stored by rows, the data migration unit determines the storage address corresponding to the first element to be migrated in each row of the matrix to be migrated according to the source storage address of the matrix to be migrated and the offset value corresponding to each row (the sum of the numbers of elements to be migrated in the rows before the current row).
  • At the same time, the number of read bytes corresponding to each row can also be determined, that is, the number of elements to be migrated in that row.
  • the determined storage address corresponding to the first element to be migrated in the row is used as the first address to read, and the corresponding elements of the number of read bytes are read.
  • the data migration unit can generate and send m read requests.
  • the first read request reads m elements starting with the source storage address (src_addr);
  • the second read request reads m-1 elements starting with src_addr+m;
  • the third read request reads m-2 elements with src_addr+m+(m-1) as the first address;
  • the mth read request reads 1 element of the address starting with src_addr+m+(m-1)+...+2.
  • When the compressed-storage upper triangular matrix is stored by columns, the data migration unit determines the storage address corresponding to the first element to be migrated in each column of the matrix to be migrated according to the source storage address of the matrix to be migrated and the offset value corresponding to each column (the sum of the numbers of elements to be migrated in the columns before this column).
  • the number of read bytes corresponding to each column can also be determined, that is, the number of elements to be migrated in the column.
  • the determined storage address corresponding to the first element to be migrated in the column is used as the first address to read, and the corresponding elements of the number of read bytes are read.
  • the data migration unit can generate and send m read requests.
  • the first read request reads 1 element with the source storage address (src_addr) as the first address;
  • the second read request reads 2 elements with src_addr+1 as the first address
  • the third read request reads 3 elements of the address headed by src_addr+1+2;
  • the mth read request reads m elements with addresses starting with src_addr+1+2...+m-1.
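  • A sketch of the corresponding read plans for a compressed-storage (packed) upper triangular matrix, again with element-granular addresses and assumed names; the running-offset loop simply skips over the elements already consumed, matching the address lists above.

    #include <stddef.h>

    typedef struct { size_t first_addr; size_t count; } read_req_t;

    /* Packed row storage: row i holds m-i elements; the offset of row i is the
     * total number of elements stored in the rows before it. */
    static void plan_packed_upper_row_reads(size_t src_addr, size_t m, read_req_t *reqs) {
        size_t offset = 0;
        for (size_t i = 0; i < m; i++) {
            reqs[i].first_addr = src_addr + offset;
            reqs[i].count = m - i;
            offset += m - i;          /* skip over this row's elements */
        }
    }

    /* Packed column storage: column j holds j+1 elements; the offset of column j
     * is the total number of elements stored in the columns before it. */
    static void plan_packed_upper_col_reads(size_t src_addr, size_t m, read_req_t *reqs) {
        size_t offset = 0;
        for (size_t j = 0; j < m; j++) {
            reqs[j].first_addr = src_addr + offset;
            reqs[j].count = j + 1;
            offset += j + 1;          /* skip over this column's elements */
        }
    }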
  • the processing of the data migration unit when reading the compressed storage lower triangular matrix stored in the row is similar to the processing when reading the compressed storage upper triangular matrix stored in the column, and the processing when reading the compressed storage lower triangular matrix stored in the column is the same as The processing is similar when reading a row-stored packed-store upper triangular matrix. Therefore, the reading of the triangular matrix under compressed storage will not be repeated here.
  • the migration operation may not be performed.
  • When an LDA is set, the matrix type can also be taken into account during reading.
  • the matrix type identification and LDA can both be obtained by the processor and stored in the same register as the size information described above.
  • the register format can be as shown in FIG. 4 .
  • the data migration unit can acquire the matrix type identifier and LDA at the same time.
  • the processing of step 202 may be as follows: the data migration unit reads the matrix to be migrated according to the LDA, the matrix type identifier, the size information and the source storage address.
  • For example, when reading a row-stored uncompressed-storage upper triangular matrix with m rows and an LDA of L, the data migration unit can generate and send m read requests.
  • the first read request reads m elements starting with the source storage address (src_addr);
  • the second read request reads m-1 elements starting with src_addr+L+1;
  • the third read request reads m-2 elements of the address headed by src_addr+2L+2;
  • the mth read request reads 1 element of the address starting with src_addr+(m-1)L+m-1.
  • When the uncompressed-storage upper triangular matrix is stored by columns, the data migration unit determines the storage address corresponding to the first element to be migrated in each column of the matrix to be migrated according to the source storage address of the matrix to be migrated and the offset value corresponding to each column (the sum of the LDAs corresponding to the columns before this column).
  • the number of read bytes corresponding to each column can also be determined, that is, the number of elements to be migrated in the column.
  • the determined storage address corresponding to the first element to be migrated in the column is used as the first address to read, and the corresponding elements of the number of read bytes are read.
  • the first element to be migrated in each column may be the element on the first diagonal and above in the column, and the element on the diagonal and above in each column may be the element to be migrated.
  • the data migration unit can generate and send m read requests.
  • the first read request reads 1 element with the source storage address (src_addr) as the first address;
  • the second read request reads 2 elements with src_addr+L as the first address
  • the third read request reads 3 elements of the address headed by src_addr+2L; by analogy, the m-th read request reads m elements of the address headed by src_addr+(m-1)L.
  • For a diagonal matrix with an LDA, the data migration unit determines the storage address corresponding to the first element to be migrated in each row of the matrix to be migrated according to the source storage address of the matrix to be migrated and the offset value corresponding to each row (the sum of the LDAs corresponding to the rows before this row and the number of 0 elements below the diagonal in this row).
  • the number of read bytes corresponding to each row can also be determined, that is, the number of elements to be migrated in the row.
  • the determined storage address corresponding to the first element to be migrated in the row is used as the first address to read, and the corresponding elements of the number of read bytes are read.
  • the first element to be migrated in each row may be the element on the first diagonal in the row, and the element on the diagonal in each row may be the element to be migrated.
  • the data migration unit can generate and send m read requests.
  • the first read request reads 1 element with the source storage address (src_addr) as the first address;
  • the second read request reads 1 element of the address headed by src_addr+L+1;
  • the third read request reads 1 element of the address headed by src_addr+2L+2;
  • the mth read request reads 1 element of the address starting with src_addr+(m-1)L+m-1.
  • Step 203 Store the elements of the matrix to be migrated according to a second storage mode, where the second storage mode is a storage mode different from the first storage mode.
  • the second storage mode when the first storage mode is row storage, the second storage mode is column storage, and when the first storage mode is column storage, the second storage mode is row storage.
  • When storing the elements of the matrix to be migrated read by each read request, it is necessary to ensure that the stored matrix is stored in the second storage mode. That is, if the second storage mode is column storage, the elements of the same column of the matrix to be migrated should be stored contiguously after the migration; if the second storage mode is row storage, the elements of the same row of the matrix to be migrated should be stored contiguously after the migration.
  • For example, when an ordinary matrix to be migrated with m rows and n columns is converted from row storage to column storage, the storage can be performed as follows.
  • For the n elements obtained by the first read request, they are stored in sequence at the destination storage address (dst_addr), dst_addr+m, dst_addr+2m, ..., dst_addr+(n-1)m;
  • for the n elements obtained by the second read request, they are stored in sequence at dst_addr+1, dst_addr+m+1, dst_addr+2m+1, ..., dst_addr+(n-1)m+1;
  • by analogy, for the n elements obtained by the m-th read request, they are stored in sequence at dst_addr+m-1, dst_addr+m+(m-1), dst_addr+2m+(m-1), ..., dst_addr+(n-1)m+(m-1).
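  • Taken together, the read and store addresses above amount to a transpose of the storage layout. A minimal software sketch of the same address mapping follows; in the patent this is performed by the data migration unit in hardware, so the loop and the function name are purely illustrative.

    #include <stddef.h>

    /* Migrate an ordinary m x n matrix from row storage to column storage:
     * the element in row i, column j, read at src offset i*n + j, is written
     * to dst offset j*m + i, matching the destination addresses listed above. */
    static void migrate_row_to_col(const double *src, double *dst, size_t m, size_t n) {
        for (size_t i = 0; i < m; i++)           /* one "read request" per row  */
            for (size_t j = 0; j < n; j++)       /* scatter the row column-wise */
                dst[j * m + i] = src[i * n + j];
    }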
  • For matrices such as the uncompressed-storage upper triangular matrix, uncompressed-storage diagonal matrix, compressed-storage upper triangular matrix, compressed-storage lower triangular matrix, compressed-storage diagonal matrix, and the like,
  • the storage address of each read element is determined according to the destination storage address and the size information, and the read element is then stored.
  • The storage methods of several types of matrices are described below.
  • For the m elements read by the first read request, they are stored in the storage spaces corresponding to dst_addr, dst_addr+m-(m-1), dst_addr+2m-[(m-1)+(m-2)], ..., dst_addr+(m-1)m-[(m-1)+(m-2)+...+1];
  • for the m-1 elements read by the second read request, they are stored in the storage spaces corresponding to dst_addr+m+1-(m-1), dst_addr+2m+1-[(m-1)+(m-2)], ..., dst_addr+(m-1)m+1-[(m-1)+(m-2)+...+1];
  • by analogy, for the m elements obtained by the m-th read request, they are stored in the storage spaces corresponding to dst_addr+m-1, dst_addr+m+(m-1)-1, dst_addr+2m+(m-1)-(1+2), ..., dst_addr+(m-1)m+(m-1)-[1+2+...+(m-1)].
  • The processing performed by the data migration unit when storing a compressed-storage or uncompressed-storage lower triangular matrix in column storage is similar to that when storing a compressed-storage or uncompressed-storage upper triangular matrix in row storage,
  • and the processing when storing a compressed-storage or uncompressed-storage lower triangular matrix in row storage is similar to that when storing a compressed-storage or uncompressed-storage upper triangular matrix in column storage. Therefore, the storage of the compressed-storage or uncompressed-storage lower triangular matrix will not be repeated here.
  • the above-mentioned destination storage address is a memory address
  • the read element may be stored in a cache corresponding to the memory address before being stored in the memory.
  • the cache may be the processor's cache.
  • In the above manner, data migration can be realized through the data migration unit inside the processor: the matrix on the left side of the multiplication sign, stored in column storage, can be converted to row storage, or the matrix on the right side of the multiplication sign, stored in row storage, can be converted to column storage, so that during matrix multiplication one row or one column of elements can be obtained contiguously for multiplication and accumulation without frequent memory accesses, which effectively improves the computational efficiency.
  • the data migration unit only needs one data migration instruction to trigger the processing of data migration, and the triggering is relatively simple.
  • the matrix to be migrated is the matrix on the right side of the multiplication sign in the two matrices to be multiplied.
  • In this case, the processing unit that performs the matrix multiplication can obtain the elements of the first row of the matrix on the left side of the multiplication sign, directly read the contiguously stored elements of the first column of the migrated matrix from the cache, multiply them element by element and accumulate the products to obtain the first element of the output matrix, and so on, to complete the multiplication of the two matrices and obtain the output matrix.
  • the matrix to be migrated is the matrix on the left side of the multiplication sign among the two matrices to be multiplied.
  • In this case, the processing unit that performs the matrix multiplication can obtain the elements of the first column of the matrix on the right side of the multiplication sign, directly read the contiguously stored elements of the first row of the migrated matrix from the cache, multiply them element by element and accumulate the products to obtain the first element of the output matrix, and so on, to complete the multiplication of the two matrices and obtain the output matrix.
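  • A sketch of how a multiplication kernel might consume the operands after migration, assuming the left matrix is row-stored and the right matrix has been migrated to column storage; the function name and signature are illustrative and not defined by the patent.

    #include <stddef.h>

    /* A (m x k) is row-stored, B (k x n) is column-stored after migration, so
     * every row of A and every column of B is a contiguous run of k elements
     * that can be multiplied element by element and accumulated without
     * strided memory accesses. C is produced row-stored. */
    static void matmul_after_migration(const double *a_rows, const double *b_cols,
                                       double *c_rows, size_t m, size_t k, size_t n) {
        for (size_t i = 0; i < m; i++) {
            for (size_t j = 0; j < n; j++) {
                const double *a_row = a_rows + i * k;   /* contiguous row of A    */
                const double *b_col = b_cols + j * k;   /* contiguous column of B */
                double acc = 0.0;
                for (size_t p = 0; p < k; p++)
                    acc += a_row[p] * b_col[p];
                c_rows[i * n + j] = acc;
            }
        }
    }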
  • an embodiment of the present application also provides a data migration device, which may be the above-mentioned data migration unit. As shown in FIG. 5 , the device includes:
  • the acquiring module 510 is configured to acquire the storage location of the matrix to be migrated, and the matrix to be migrated is stored in a first storage manner. Specifically, the acquisition function of the above step 201 and its implicit steps can be implemented.
  • the reading module 520 is configured to read the elements in the matrix to be migrated. Specifically, the reading function of the above step 202 and its implicit steps can be implemented.
  • the storage module 530 is configured to store the elements of the matrix to be migrated according to a second storage mode, where the second storage mode is a storage mode different from the first storage mode. Specifically, the storage function of the above step 203 and its implicit steps can be implemented.
  • In a specific implementation, the apparatus in the embodiments of the present application may be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), where the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the first storage manner and the second storage manner include a column storage manner and a row storage manner, respectively.
  • Optionally, the obtaining module 510 is further configured to: acquire a data migration instruction, where
  • the data migration instruction carries a first register identifier, a second register identifier and a third register identifier, the first register identifier being used to indicate the first register that stores the source storage
  • address of the matrix to be migrated, the second register identifier being used to indicate the second register that stores the destination storage address of the matrix to be migrated, and the third register identifier being used to indicate the register that stores the size information of the matrix to be migrated;
  • the source storage address and the destination storage address are the storage addresses of the internal memory;
  • the source storage address of the matrix to be migrated is obtained from the first register
  • the destination storage address of the matrix to be migrated is obtained from the second register
  • the size information of the matrix to be migrated is obtained from the third register.
  • the obtaining module 510 is used for:
  • the reading of the to-be-migrated matrix stored in the first storage mode based on the size information and the source storage address includes:
  • the elements of the matrix to be migrated stored in the first storage mode are read.
  • Optionally, the reading module 520 is configured to: when the first storage mode is row storage, use the source storage address as the first address for reading the elements of the first row of the matrix to be migrated, and, according to the size information,
  • the matrix type identifier and the LDA, determine N-1 offset values, where N is the number of rows of the matrix to be migrated;
  • according to the i-th offset value and the source storage address, determine the first address for reading the elements of the (i+1)-th row of the matrix to be migrated, where i is a positive integer greater than 0 and less than N;
  • when the first storage mode is column storage, use the source storage address as the first address for reading the elements of the first column of the matrix to be migrated;
  • according to the size information, the matrix type identifier and the LDA, determine N-1 offset values, where N is the number of columns of the matrix to be migrated;
  • and according to the i-th offset value and the source storage address, determine the first address for reading the elements of the (i+1)-th column of the matrix to be migrated, where i is a positive integer greater than 0 and less than N.
  • Optionally, the reading module 520 is configured to: send N read requests to the internal bus of the processor, where
  • each read request carries one first address and each read request is used to read the elements of one row or one column of the matrix to be migrated; and
  • receive the elements of the matrix to be migrated returned by the internal bus based on each first address.
  • Optionally, the storage module 530 is configured to:
  • store the elements to be migrated of the matrix to be migrated in the second storage mode in the memory with the destination storage address as the first address, and/or store, in the cache of the processor
  • corresponding to the destination storage address, the elements to be migrated of the matrix to be migrated in the second storage mode.
  • when the data migration apparatus provided in the above embodiments performs data migration, the division into the above functional modules is only used as an example for illustration; in practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the data migration unit may be divided into different functional modules to complete all or part of the functions described above.
  • the apparatus for data migration provided in the above embodiment and the method embodiment for data migration belong to the same concept, and the specific implementation process thereof can be seen in the method embodiment shown in FIG. 2 , which is not repeated here for brevity.
  • an embodiment of the present application provides a computing device 1300 .
  • the computing device 1300 includes at least a processor 1301 , a bus system 1302 , a memory 1303 , a communication interface 1304 and a memory unit 1305 .
  • the above-mentioned processor 1301 may be a general-purpose central processing unit (CPU), a network processor (NP), a graphics processing unit (GPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the program of the present application.
  • the above-mentioned processor 1301 includes the data migration unit 40 shown in FIG. 1 , so that the above-mentioned data migration unit 40 is used to implement the operation steps of the method shown in FIG. 2 , which will not be repeated here for brevity.
  • the bus system 1302 described above may include a path to transfer information between the above described components.
  • the above-mentioned memory 1303 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM) or another type of dynamic storage device that can store information and instructions; it may also be an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory can exist independently and be connected to the processor through a bus.
  • the memory can also be integrated with the processor.
  • the memory unit 1305 is used to store the application code for executing the solution of the present application, and the execution is controlled by the processor 1301 .
  • the processor 1301 is configured to execute the application program code stored in the memory unit 1305, thereby implementing the data migration method proposed in this application.
  • the communication interface 1304 is used to enable the connection and communication of the computing device 1300 with external devices.
  • the computing device can convert the matrix on the right side of the multiplication sign, as a matrix to be migrated, into column storage, or convert the matrix on the left side of the multiplication sign, as a matrix to be migrated, into row storage, so that when calculating a matrix multiplication, the elements of one row or one column can be obtained contiguously for multiplication and accumulation without frequent memory accesses, which effectively improves computational efficiency.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof, and when implemented in software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions, and when the computer program instructions are loaded and executed on a device, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave, etc.).

Abstract

A data migration method, performed by a data migration unit in a processor, comprising: obtaining the storage location of a to-be-migrated matrix, the to-be-migrated matrix being stored in a first storage mode (201); reading the elements of the to-be-migrated matrix (202); and storing the elements of the to-be-migrated matrix in a second storage mode, the second storage mode being a storage mode different from the first storage mode (203). When matrix multiplication is performed, the data migration unit in the processor can read the elements of one row or one column of a matrix contiguously, without external devices or frequent memory accesses, so the overall computational efficiency is higher.

Description

数据迁移的方法、装置、处理器和计算设备 技术领域
本申请涉及计算机领域,特别涉及一种数据迁移的方法、装置、处理器和计算设备。
背景技术
在机器学习技术,存在大量的矩阵乘法计算。矩阵乘法是将第一个矩阵的行元素与第二个矩阵的列元素对应相乘。在内存中通常对于矩阵采用列存储或者行存储中的一种。行存储,即对矩阵中属于同一行的元素连续存储,列存储,即对矩阵中属于同一列的元素连续存储。处理器在执行矩阵乘法时,需要连续读取乘号左侧矩阵的行元素和乘号右侧的列元素。由于传统技术中利用内存存储矩阵,执行矩阵乘的过程中处理器需要频繁访问内存,并将矩阵乘的过程数据也存储至内存,处理器和内存的频繁交互导致矩阵乘的过程耗时长、效率低,因此,如何提供一种高效的数据迁移方法成为亟待解决的技术问题。
发明内容
本申请提供一种数据迁移的方法、装置、处理器和计算设备,以此提升数据迁移方法的效率。
第一方面,提供了一种数据迁移的方法,该方法由处理器中的数据迁移单元执行,该方法包括获取待迁移矩阵的存储位置,待迁移矩阵以第一存储方式存储,读取所述待迁移矩阵中元素,按照第二存储方式存储待迁移矩阵的元素,第二存储方式是与第一存储方式不同的存储方式。
在上述方案中,获取待迁移矩阵的存储位置可以为获取待迁移矩阵的源存储地址,即待迁移矩阵进行迁移前,第一个元素的存储地址。还可以获取目的存储地址为待迁移矩阵迁移后,第一个元素的存储地址。还可以获取尺寸信息可以包括待迁移矩阵的行数和列数这两个信息。待迁移矩阵为需要进行迁移以改变存储方式的矩阵。在内存中矩阵以行存储的方式进行存储的情况下,在需要相乘的两矩阵中,因为乘号右侧的矩阵需要对列元素进行连续读取,所以该矩阵为待迁移矩阵,需要对该矩阵进行迁移,使得其存储方式由行存储变为列存储。在内存中矩阵以列存储的方式进行存储的情况下,在需要相乘的两矩阵中,因为乘号左侧的矩阵需要对行元素进行连续读取,所以该矩阵为待迁移矩阵,需要对该矩阵进行迁移,使得其存储方式由列存储变为行存储。本申请可以通过数据迁移单元实现矩阵的迁移过程,无需借助处理器以外的设备和存储空间,数据读取和处理速度更快。
数据迁移单元可以向处理器的内部总线发送多个读请求,每个读请求用于读取待迁移矩阵中的一行或者一列的元素。然后,将获取到的元素按照第二存储方式进行存储。整个过程由处理器内部的数据迁移单元实现,可以将乘号左侧矩阵作为待迁移矩阵转换为列存储的存储方式,或者将乘号右侧矩阵作为待迁移矩阵转换为行存储的存储方式,使得在计算矩阵乘时,可以连续获取一行或者一列元素进行相乘并累加,无需频繁访问内存,有效提高了计算效率。
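To make the migration flow above concrete, the following C sketch shows the data movement for the simplest case, an m×n matrix migrated from row storage to column storage. It is an illustration only: the patent describes a hardware data migration unit inside the processor, and the function name, element type and pointer-based addressing here are assumptions, not part of the disclosure.

```c
#include <stddef.h>

/* Illustrative sketch only: migrate an m x n matrix from row storage
 * (row-major) to column storage (column-major). In the patent this movement
 * is performed by the data migration unit in the processor, one read request
 * per row, rather than by library code. */
void migrate_row_to_col(const double *src, double *dst, size_t m, size_t n)
{
    for (size_t i = 0; i < m; i++) {          /* one "read request" per row        */
        const double *row = src + i * n;      /* first address of row i            */
        for (size_t j = 0; j < n; j++)
            dst[j * m + i] = row[j];          /* element (i, j) stored column-wise */
    }
}
```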
在一种可能的实现方式中,第一存储方式和第二存储方式分别包括列存储方式和行存储 方式。
在本申请所示的方案中,即可以将以行存储的方式存储的待迁移矩阵进行迁移,以转换存储方式,还可以将以列存储的方式存储的待迁移矩阵进行迁移,以转换存储方式。
在一种可能的实现方式中,所述获取待迁移矩阵的源存储地址、目的存储地址和尺寸信息之前,所述方法还包括:获取数据迁移指令,其中,数据迁移指令中携带有第一寄存器标识、第二寄存器标识和第三寄存器标识,其中,第一寄存器标识用于指示存储待迁移矩阵的源存储地址的第一寄存器、第二寄存器标识用于指示存储待迁移矩阵的目的存储地址的第二寄存器和所述第三寄存器标识用于指示存储待迁移矩阵的尺寸信息的寄存器。
则获取待迁移矩阵的源存储地址、目的存储地址和尺寸信息,包括:向第一寄存器获取所述待迁移矩阵的源存储地址,向第二寄存器获取待迁移矩阵的目的存储地址,向所述第三寄存器获取所述待迁移矩阵的尺寸信息。
在本申请所示的方案中,处理器在执行矩阵乘法之前,可以在内存中获取待迁移矩阵的源存储地址和尺寸信息,并申请内存空间存储迁移后的矩阵,申请的内存空间的首地址即为上述目的存储地址。
然后,处理器将源存储地址写入一个寄存器中,将目的存储地址写入一个寄存器中,将尺寸信息写入一个寄存器中。处理器可以生成数据迁移指令,该数据迁移指令中携带第一寄存器标识、第二寄存器标识和第三寄存器标识。此处,寄存器标识可以为寄存器编号。
处理器将生成的数据迁移指令存储在内存中,取指单元获取存储在内存中的数据迁移指令。取指单元获取到数据迁移指令后,将该数据迁移指令发送至译码单元。译码单元对接收到的数据迁移指令进行译码后发送至数据迁移单元。
数据迁移单元获取数据迁移指令中携带的第一寄存器标识、第二寄存器标识和第三寄存器标识,并向第一寄存器标识所指示的寄存器获取待迁移矩阵的源存储地址,向第二寄存器标识所指示的寄存器获取待迁移矩阵的目的存储地址,向第三寄存器标识所指示的寄存器获取待迁移矩阵的尺寸信息。
在一种可能的实现方式中，所述第三寄存器还存储所述待迁移矩阵的矩阵类型标识和LDA。
向所述第三寄存器获取待迁移矩阵的尺寸信息,包括:向第三寄存器获取待迁移矩阵的尺寸信息、矩阵类型标识和LDA。则所述基于所述尺寸信息和源存储地址,读取第一存储方式存储的待迁移矩阵,包括:基于尺寸信息、矩阵类型标识、LDA和源存储地址,读取第一存储方式存储的待迁移矩阵。
在本申请实施例所示的方案中，内存在存储矩阵时，可以设置有LDA，且对于一些类型的矩阵，可以进行压缩存储，因此，对于这些矩阵进行读取时，需要考虑到矩阵类型以及LDA。例如，矩阵类型可以包括普通矩阵、方阵、非压缩存储上三角矩阵、非压缩存储下三角矩阵、非压缩存储对角矩阵、压缩存储上三角矩阵、压缩存储下三角矩阵、压缩存储对角矩阵等。
在一种可能的实现方式中,在读取待迁移矩阵元素之前,将源存储地址作为待迁移矩阵中第一行的首个元素的地址或者第一列的首个元素的地址,并确定N-1个偏移值,N为待迁移矩阵的行数或列数;按照待迁移矩阵的行或列的首个元素的地址和偏移地址分别逐行或逐列读取所述待迁移矩阵中元素。其中,偏移地址用于指示以行或列的首个元素的地址为准,偏移的位置,具体可以根据预设的每行或每列元素的大小计算获得。也就是说,根据行或列的首个元素的地址、矩阵的尺寸信息和预设的每个元素的大小,可以逐个确定行或列中各个元素的位置。通过上述过程,数据迁移单元可以一次读取待迁移矩阵整行或整列或整个矩阵 的数据,避免逐个读取过程所带来的频繁访问内存的问题,读取效率更高。
在一种可能的实现方式中,读取待迁移矩阵中元素之前,方法还包括:在第一存储方式为行存储时,将源存储地址作为读取所述待迁移矩阵中的第一行的首个元素的地址,根据尺寸信息、矩阵类型标识和LDA,确定N-1个偏移值,N为待迁移矩阵的行数。再根据每i个偏移值和源存储地址,确定读取待迁移矩阵中的第i行的元素时的首地址,i为大于0小于N的正整数。在第一存储方式为列存储时,将源存储地址作为读取待迁移矩阵中的第一列的元素时的首地址。根据尺寸信息、矩阵类型标识和LDA,确定N-1个偏移值,N为待迁移矩阵的列数。根据每i个偏移值和源存储地址,确定读取待迁移矩阵中的第i+1列的元素时的首地址,i为大于0小于N的正整数。则读取所述待迁移矩阵中元素,包括:向处理器的内部总线发送N个读请求,每个读请求携带一个首地址,其中,每个读请求用于读取所述待迁移矩阵中一行或者一列的元素。接收内部总线基于每个首地址返回的待迁移矩阵中的元素。可选地,也可以仅通过一个读请求读取整个矩阵的所有元素。
在本申请的方案中,数据迁移单元在读取一个存储方式的待迁移矩阵时,数据迁移单元可以根据尺寸信息、LDA、矩阵类型等确定出偏移值,进而确定出待迁移矩阵中每行的第一个待迁移元素对应的存储地址。同时,还可以确定出每行对应的读取字节数,即该行中待迁移元素的个数。在读取某一行中的待迁移元素时以确定出的该行的第一个待迁移元素对应的存储地址作为首地址进行读取,读取对应的读取字节数个元素。其中,每行中的第一个待迁移元素可以为每行中的第一个元素,每行中的全部元素均可以为待迁移元素。
通过本申请一个读请求可以逐行或逐列读取矩阵的各个元素,而不是一个读请求读取一个元素,提高待迁移矩阵的读取效率。
在一种可能的实现方式中,所述根据所述尺寸信息、所述矩阵类型标识和所述目的存储地址,以第二存储方式存储所述待迁移矩阵中的待迁移元素,包括:
根据尺寸信息、所述矩阵类型标识,在以目的存储地址为首地址的内存中,以第二存储方式存储所述待迁移矩阵中的待迁移元素,并在对应的处理器的缓存中以第二存储方式存储待迁移矩阵中的待迁移元素。
在本申请实施例所示的方案中,目的存储地址为内存地址,将读取的元素存储在内存中之前可以先存储至内存地址对应的缓存中。缓存可以为处理器的高速缓存(cache)。这样,在进行矩阵乘法时,可以直接在缓存中读取数据,读取速度较快。
第二方面,提供了一种数据迁移的装置,该装置包括用于执行第一方面或第一方面任一种可能实现方式中的数据迁移方法的各个模块
第三方面,提供了一种处理器,所述处理器包括数据迁移单元,所述数据迁移单元用于执行第一方面或第一方面任一种可能实现方式中的数据迁移方法。
第四方面,提供了一种计算设备,所述计算设备包括处理器,处理器中包括数据迁移单元,所述数据迁移单元用于执行第一方面或第一方面任一种可能实现方式中的数据迁移方法。
本申请实施例提供的技术方案带来的有益效果是:
数据迁移单元可以获取待迁移矩阵的存储位置并在该存储位置读取待迁移矩阵的元素,且在迁移之前待迁移矩阵以第一存储方式。然后,数据迁移单元按照第二存储方式存储待迁移矩阵的元素,第二存储方式是与第一存储方式不同的存储方式。采用本申请的技术方案,可以将乘号左侧矩阵作为待迁移矩阵转换为列存储的存储方式,或者将乘号右侧矩阵作为待迁移矩阵转换为行存储的存储方式,使得在计算矩阵乘时,处理器中数据迁移单元可以连续获取一行或者一列元素进行相乘并累加,无需利用其它外部设备频繁访问内存,有效提高了 计算效率。
附图说明
图1是本申请实施例提供的一种处理器的结构示意图;
图2是本申请实施例提供的一种数据迁移的方法流程图;
图3是本申请实施例提供的一种数据迁移指令的格式示意图;
图4是本申请实施例提供的一种寄存器存储格式的示意图;
图5是本申请实施例提供的一种数据迁移的装置结构示意图;
图6是本申请实施例提供的一种计算设备的结构示意图。
具体实施方式
为了便于理解本申请实施例提供的技术方案,下面先对几种常见的矩阵以及矩阵乘法的规则进行介绍。
方阵:
如下所示,每行的元素个数和每列的元素相等的矩阵,即可以称为方阵。
[方阵示例图，原文图号 PCTCN2021106966-appb-000001]
对角矩阵:
如下所示,除对角线上的元素以外,其余所有元素均为0的方阵,即可以称为对角矩阵。
[对角矩阵示例图，原文图号 PCTCN2021106966-appb-000002]
对角矩阵按照存储方式又可以包括非压缩存储对角矩阵和压缩存储对角矩阵,其中,压缩存储对角矩阵即存储该矩阵时对于除对角线上的元素外,其余所有元素均不存储,非压缩存储对角矩阵即完整存储全部元素。
上三角矩阵:
如下所示,对角线以下的元素均为0的方阵,即可以称为上三角矩阵。
[上三角矩阵示例图，原文图号 PCTCN2021106966-appb-000003]
上三角矩阵按照存储方式又可以包括非压缩存储上三角矩阵和压缩存储上三角矩阵两种类型,其中,压缩存储上三角矩阵即存储时对于对角线以下的0元素不存储,非压缩上三角存储矩阵即完整存储全部元素。
下三角矩阵:
如下所示,对角线以上的元素均为0的方阵,即可以称为下三角矩阵。
[下三角矩阵示例图，原文图号 PCTCN2021106966-appb-000004]
下三角矩阵按照存储方式又可以包括非压缩存储下三角矩阵和压缩存储下三角矩阵两种类型,其中,压缩存储下三角矩阵即存储时对于对角线以上的0元素不存储,非压缩下三角存储矩阵即完整存储全部元素。
矩阵乘法规则:
矩阵A的列数等于矩阵B的行数时,两矩阵可以相乘;两矩阵相乘得到的矩阵C的行数等于矩阵A的行数,矩阵C的列数等于矩阵B的列数;矩阵C的第m行第n列的元素等于矩阵A的第m行的元素与矩阵B的第n列对应的元素乘积之和。示例如下:
[矩阵乘法示例图：矩阵A、矩阵B及其乘积C，原文图号 PCTCN2021106966-appb-000005 至 000007]
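The multiplication rule stated above can be expressed directly in code. The following C routine is a plain reference form for illustration; the row-major layout and the function name are assumptions and are not part of the patent.

```c
/* Reference form of the rule above: A is m x k, B is k x n, C = A * B is
 * m x n, and C[p][q] is the sum over t of A[p][t] * B[t][q].
 * All matrices are assumed to be stored in row-major order here. */
void matmul(const double *A, const double *B, double *C, int m, int k, int n)
{
    for (int p = 0; p < m; p++)
        for (int q = 0; q < n; q++) {
            double acc = 0.0;
            for (int t = 0; t < k; t++)
                acc += A[p * k + t] * B[t * n + q]; /* row p of A times column q of B */
            C[p * n + q] = acc;
        }
}
```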
下面结合附图进一步介绍本申请提供的数据迁移方法。
图1为本申请实施例提供的一种处理器100的结构示意图。如图1所示，处理器100包括取指单元(fetch)10、译码单元(decode)20、乱序发送单元(issue)30和数据迁移单元40。其中，取指单元10和译码单元20连接，译码单元20和乱序发送单元30连接，数据迁移单元40和乱序发送单元30连接。
此外,在处理器100中还可以包括高速缓存(cache)50以及其他处理单元60,其中,高速缓存50和数据迁移单元40连接,其中,处理单元60可以和乱序发送单元30连接。取指单元10可以获取指令,并将获取的指令发送至译码单元20,译码单元20对接收到的指令进行译码,并将译码后的指令发送给处理单元。如,指令可以为数据迁移指令,处理单元为的数据迁移单元40。另外,在处理器100中还可以包括多个寄存器70,根据处理器的位数不同,包括的寄存器的数量也可以不同,本申请实施例中对于处理器100包括的寄存器70的具体数量不做限定。寄存器70可以与上述数据迁移单元40连接。寄存器70可以用于存储数据(例如,待迁移矩阵的存储位置以及尺寸信息等参数)
此外，处理器100还可以与处理器100外部的内存200相连。其中，内存200可以和上述高速缓存单元50连接，内存200存储的数据（例如，待迁移矩阵）可以先由高速缓存单元50进行缓存。
数据迁移单元40在接收到译码后的数据迁移指令后,可以向内存200获取待迁移矩阵进行迁移,使该矩阵的存储方式由迁移前的行存储变为迁移后的列存储,或者由迁移前的列存储变为迁移后的行存储。
乱序发送单元30,用于接收译码单元20译码后发送的指令,并将指令发送至相应的处理单元进行处理,处理单元可以为上述数据迁移单元40或者其他处理单元60。
需要说明的是,上述数据迁移单元是本申请中所定义的一个硬件单元,对于其名称本申请实施例不做限定。
本申请实施例提供了一种数据迁移的方法,该方法可以由处理器中的数据迁移单元实现。如图2所示,本申请实施例提供的一种数据迁移的方法的流程可以包括如下步骤:
步骤201、获取待迁移矩阵的存储位置,待迁移矩阵以第一存储方式存储。
待迁移矩阵为需要进行迁移以改变存储方式的矩阵,该矩阵通常存储在内存的连续存储区域中。当内存中矩阵以行存储的方式进行存储时,在需要执行两矩阵相乘运算时,由于两矩阵相乘的过程是以乘号左侧矩阵的行元素和乘号右侧矩阵的列元素进行乘操作,因此需要将乘号右侧的矩阵转换为以列存储的方式存储的形式。也就是说,乘号右侧的矩阵为待迁移矩阵。当内存中矩阵以列存储时,基于类似的理由,需要将乘号左侧的矩阵转换为以行存储的方式存储的形式,以此完成两矩阵相乘操作。
值得说明的是,矩阵存储形式的转换过程需要利用新的存储空间存储转换后的矩阵,因此,也可以将上述矩阵以行或列形式存储方式称为矩阵数据的迁移过程,为了便于描述,以下实施例中以矩阵数据迁移为例进行描述。
数据迁移单元获取待迁移矩阵的存储位置,可以为获取待迁移矩阵的源存储地址,即待迁移矩阵进行迁移前,第一个元素的存储地址。
此外,还可以获取目的存储地址,该目的存储地址为待迁移矩阵迁移后,第一个元素的存储地址。第一个元素是指位于待迁移矩阵的第一行第一列的元素。
还可以获取待迁移矩阵的尺寸信息,尺寸信息可以包括待迁移矩阵的行数和列数,也就是说,尺寸信息用于指示待迁移矩阵的大小。
具体的,处理器在执行矩阵乘法之前,可以在内存中获取待迁移矩阵的源存储地址和尺寸信息,并在内存中申请用于存储迁移后的数据的存储空间,申请的存储空间的首地址即为上述目的存储地址。然后,处理器将源存储地址写入一个寄存器中,将目的存储地址写入一个寄存器中,将尺寸信息写入一个寄存器中。其中,上述寄存器可以对应图1中寄存器70,具体实施过程中,可以利用一个寄存器70分别存储上述源存储地址、目的存储地址和尺寸信息,也可以利用三个不同的寄存器70分别存储上述源存储地址、目的存储地址和尺寸信息。
然后,处理器中的处理单元可以调用编译器生成数据迁移指令,该数据迁移指令中携带第一寄存器标识、第二寄存器标识和第三寄存器标识。其中,第一寄存器标识用于指示存储待迁移矩阵的源存储地址的寄存器,第二寄存器标识用于指示存储所述待迁移矩阵的目的存储地址的寄存器,第三寄存器标识用于指示存储待迁移矩阵的尺寸信息的寄存器。此处,寄存器标识包括寄存器编号、地址中至少一种。
如图3所示,为一种可能的数据迁移指令的命令格式。其中,0-4字段为第二寄存器标识,5-9字段为第一寄存器标识,16-20字段为第三寄存器标识,其余字段为保留字段。
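Based on the field layout just described for Figure 3 (bits 0-4 carry the second register identifier, bits 5-9 the first register identifier, and bits 16-20 the third register identifier), the three identifiers could be extracted as sketched below. This is only a hedged illustration: the instruction width, the opcode position and the remaining reserved fields are not specified in the text, and the struct and function names are assumptions.

```c
#include <stdint.h>

/* Hedged sketch of extracting the register identifiers carried by the data
 * migration instruction, assuming a 32-bit encoding and the field positions
 * described for Figure 3. Everything outside these three fields is treated
 * as reserved here. */
typedef struct {
    unsigned dst_reg;   /* second register id: destination address register  */
    unsigned src_reg;   /* first register id: source address register        */
    unsigned size_reg;  /* third register id: size-information register      */
} dm_fields;

static dm_fields decode_dm_instruction(uint32_t instr)
{
    dm_fields f;
    f.dst_reg  = (instr >>  0) & 0x1F;   /* bits 0-4   */
    f.src_reg  = (instr >>  5) & 0x1F;   /* bits 5-9   */
    f.size_reg = (instr >> 16) & 0x1F;   /* bits 16-20 */
    return f;
}
```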
接下来,处理器将生成的数据迁移指令存储在内存中,取指单元获取存储在内存中的数据迁移指令。取指单元获取到数据迁移指令后,将该数据迁移指令发送至译码单元。译码单元对接收到的数据迁移指令进行译码后发送至数据迁移单元。数据迁移单元获取数据迁移指令中携带的第一寄存器标识、第二寄存器标识和第三寄存器标识,并向第一寄存器标识所指示的寄存器获取待迁移矩阵的源存储地址,向第二寄存器标识所指示的寄存器获取待迁移矩阵的目的存储地址,向第三寄存器标识所指示的寄存器获取待迁移矩阵的尺寸信息。
步骤202、读取待迁移矩阵的元素。
在读取待迁移矩阵元素之前,将源存储地址作为待迁移矩阵中第一行的首个元素的地址或者第一列的首个元素的地址,并确定N-1个偏移值,N为待迁移矩阵的行数或列数;按照待迁移矩阵的行或列的首个元素的地址和偏移地址分别逐行或逐列读取所述待迁移矩阵中元 素。通过上述过程,数据迁移单元可以一次读取待迁移矩阵整行或整列的数据,避免逐个读取过程所带来的频繁访问内存的问题,读取效率更高。
在具体实施中,数据迁移单元可以生成读请求,并将读请求发送至处理器的内部总线,内部总线根据读请求从内存读取待迁移矩阵的元素并返回给数据迁移单元。其中,内部总线可以是处理器内部用于传输请求的总线,在此不做限定。
示例地,当每个元素所占用的存储空间相同,且读取一个存储方式为行存储且存储时相邻行之间没有数据间隔的m行n列的待迁移矩阵时,数据迁移单元可以根据该待迁移矩阵的源存储地址以及待迁移矩阵的尺寸信息中的列数(即待迁移矩阵中每行的元素个数)作为偏移值,确定出待迁移矩阵中每行的第一个待迁移元素对应的存储地址。同时,还可以确定出每行对应的读取字节数,即该行中待迁移元素的个数。在读取某一行中的待迁移元素时,以确定出的该行的第一个待迁移元素对应的存储地址作为首地址进行读取,读取对应的读取字节数个元素。其中,每行中的第一个待迁移元素可以为每行中的第一个元素,每行中的全部元素均可以为待迁移元素。
然后,数据迁移单元可以依次生成分别携带有上述首地址以及对应的读取字节数的m个读请求。第一个读请求用于读取待迁移矩阵中的第一行的待迁移元素,该第一个读请求中携带的首地址为该待迁移矩阵中的第一行的第一个待迁移元素的存储地址。以此类推,生成的第m个读请求,用于读取待迁移矩阵中的第m行的待迁移元素,该第m个读请求中携带的首地址为该待迁移矩阵中的第m行的第一个待迁移元素的存储地址。例如,第一个读请求中携带的首地址为X,读取字节数为Y,则该读请求用于从首地址X对应的存储位置存储的数据开始,连续读取Y个字节的数据,这Y个字节的数据即为第一行的Y个待迁移元素。
可选地,数据存储单元也可以仅生产一个读请求,该读请求用于逐行或逐列读取待迁移矩阵中元素。
数据迁移单元可以根据源存储地址、矩阵的尺寸信息和每个元素所占用的存储空间大小，确定N个首地址以及每个首地址对应的读取字节数。当第一存储方式为行存储时，N为待迁移矩阵的行数，第i个首地址指示待迁移矩阵中第i行元素中第一个待迁移元素的存储地址，当第一存储方式为列存储时，N为所述待迁移矩阵的列数，第i个首地址指示所述待迁移矩阵中第i列元素中第一个待迁移元素的存储地址，i为大于0小于等于N的正整数。
下面分别针对矩阵的存储方式为行存储和列存储时,以数据迁移单元针对每行或每列分别生成一个读请求为例,对读取待迁移矩阵的方法进行说明。
读取行存储的待迁移矩阵:
读取行存储的待迁移矩阵所需要的读请求数量与待迁移矩阵的行数相同,每个读请求读取待迁移矩阵的一行元素。
读取一个m行n列的待迁移矩阵,数据迁移单元可以生成并发送m个读请求。
第一个读请求,读取以源存储地址(src_addr)为首地址的n个元素;
第二个读请求,读取以src_addr+n为首地址的n个元素;
以此类推,第m个读请求,读取以src_addr+(m-1)n为首地址的n个元素。
读取列存储的待迁移矩阵:
读取列存储的待迁移矩阵所需要的读请求数量与待迁移矩阵的列数相同,每个读请求读取待迁移矩阵的一列元素。
读取一个m行n列的待迁移矩阵,数据迁移单元可以生成并发送n个读请求。
第一个读请求,读取以源存储地址(src_addr)为首地址的m个元素;
第二个读请求，读取以src_addr+m为首地址的m个元素；
以此类推,第n个读请求,读取以src_addr+(n-1)m为首地址的m个元素。
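The two request sequences above (one request per row for row storage, one per column for column storage, with no gap between adjacent rows or columns) can be summarized by the following C sketch. The read_req structure, the function name and the byte-granular addressing are illustrative assumptions; in the patent the requests are issued by the data migration unit to the processor's internal bus.

```c
#include <stddef.h>

struct read_req {
    size_t first_addr;   /* first address carried by the read request    */
    size_t count;        /* number of elements to read with this request */
};

/* Build the read requests described above for a contiguously stored matrix.
 * row_storage != 0: one request per row,    n elements each, starting at src + i*n.
 * row_storage == 0: one request per column, m elements each, starting at src + j*m.
 * 'elem' is the element size in bytes. Returns the number of requests N. */
size_t build_read_requests(size_t src_addr, size_t m, size_t n, size_t elem,
                           int row_storage, struct read_req *out)
{
    size_t N   = row_storage ? m : n;
    size_t len = row_storage ? n : m;
    for (size_t i = 0; i < N; i++) {
        out[i].first_addr = src_addr + i * len * elem;
        out[i].count      = len;
    }
    return N;
}
```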
在一种可能的实现方式中,矩阵在内存中存储时,还可以设置矩阵主导维度(leading dimension of array,LDA)。以行存储方式存储的矩阵时,相邻行中在前行的最后一个元素与在后行的第一个元素之间相隔的字节数,与每行元素所占的字节数(元素个数)之和,即为该矩阵的LDA,同理,以列存储方式存储的矩阵时,相邻行中在前列的最后一个元素与在后列的第一个元素之间相隔的字节数,与每列的元素所占的字节数(元素个数)之和,即为该矩阵的LDA。该LDA可以被处理器获取,并与上述尺寸信息存储在同一寄存器中,数据迁移单元在获取尺寸信息时,可以同时获取到LDA。相应的,步骤202的处理可以如下:数据迁移单元根据LDA、尺寸信息和源存储地址,读取待迁移矩阵。
其中，LDA大于等于矩阵行数或者列数。例如，在矩阵为行存储时，LDA应大于等于矩阵的列数，在矩阵为列存储时，LDA应大于等于矩阵的行数。
在设置有LDA的情况下,下面分别针对存储方式为行存储和列存储时,对读取待迁移矩阵的方法进行说明。
读取行存储的待迁移矩阵:
与未设置LDA时相同的是，读取行存储的待迁移矩阵所需要的读请求数量与待迁移矩阵的行数相同，每个读请求读取待迁移矩阵的一行中的待迁移元素。与未设置LDA时不同的是，确定首地址时需要考虑到LDA。
数据迁移单元根据待迁移矩阵的源存储地址以及每行对应的偏移值(本行之前每行对应的LDA之和),确定出待迁移矩阵中每行的第一个待迁移元素对应的存储地址。同时,还可以确定出每行对应的读取字节数,即该行中待迁移元素的个数。在读取某一行中的待迁移元素时以确定出的该行的第一个待迁移元素对应的存储地址作为首地址进行读取,读取对应的读取字节数个元素。其中,每行中的第一个待迁移元素可以为该行中的第一个元素,每行中的全部元素均可以为待迁移元素。
读取一个m行n列,LDA为L的待迁移矩阵,数据迁移单元可以生成并发送m个读请求。
第一个读请求,读取以源存储地址(src_addr)为首地址的n个元素;
第二个读请求,读取以src_addr+L为首地址的n个元素;
以此类推,第m个读请求,读取以src_addr+(m-1)L为首地址的n个元素。
读取列存储待迁移矩阵:
数据迁移单元根据待迁移矩阵的源存储地址以及每列对应的偏移值(本列之前每列对应的LDA之和),确定出待迁移矩阵中每列的第一个待迁移元素对应的存储地址。同时,还可以确定出每列对应的读取字节数,即该列中待迁移元素的个数。在读取某一列中的待迁移元素时以确定出的该列的第一个待迁移元素对应的存储地址作为首地址进行读取,读取对应的读取字节数个元素。其中,每列中的第一个待迁移元素可以为每列中的第一个元素,每行中的全部元素均可以为待迁移元素。
读取一个m行n列,LDA为L的待迁移矩阵,数据迁移单元可以生成并发送n个读请求。
第一个读请求,读取以源存储地址(src_addr)为首地址的m个元素;
第二个读请求，读取以src_addr+L为首地址的m个元素；
以此类推,第n个读请求,读取以src_addr+(n-1)L为首地址的m个元素。
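With a leading dimension (LDA) of L elements, only the step between consecutive first addresses changes, as the request lists above show. The following C fragment restates that address rule; the names and the byte-granular addressing are assumptions for illustration.

```c
#include <stddef.h>

/* First addresses of the N read requests for a matrix stored with leading
 * dimension L (in elements), per the lists above:
 *   row storage:    request i reads the row elements starting at src + i*L;
 *   column storage: request j reads the column elements starting at src + j*L.
 * 'elem' is the element size in bytes. */
void first_addrs_with_lda(size_t src_addr, size_t elem, size_t L,
                          size_t N, size_t *first_addr)
{
    for (size_t i = 0; i < N; i++)
        first_addr[i] = src_addr + i * L * elem;  /* one LDA further per request */
}
```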
在又一种可能的实现方式中,矩阵在内存中存储时,对于一些类型的矩阵(例如,对角矩阵)可以仅读取部分元素,而无需获取全部元素。用于指示矩阵类型的矩阵类型标识可以被处理器获取,并与上述尺寸信息存储在同一个寄存器中,数据迁移单元在获取尺寸信息时,可以同时获取到矩阵类型标识。相应的,步骤202的处理可以如下:数据迁移单元根据矩阵类型标识、尺寸信息和源存储地址,读取待迁移矩阵。
其中,矩阵类型标识用于指示待迁移矩阵的类型,该矩阵类型标识可以为数字、英文字母等。矩阵类型可以包括:普通矩阵、方阵、对角矩阵,按照存储矩阵的方式,又可以将矩阵划分为非压缩存储上三角矩阵、非压缩存储对角矩阵、压缩存储上三角矩阵、压缩存储下三角矩阵、压缩存储对角矩阵等。例如,矩阵类型标识为数字,矩阵类型标识和矩阵类型的对应关系可以如下表1所示。
表1
矩阵类型标识 矩阵类型
0 普通矩阵
1 方阵
2 非压缩存储上三角矩阵
3 非压缩存储下三角矩阵
4 非压缩存储对角矩阵
5 压缩存储上三角矩阵
6 压缩存储下三角矩阵
7 压缩存储对角矩阵
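For reference, the identifier-to-type mapping of Table 1 can be captured as a simple enumeration; the enum and its member names are illustrative only.

```c
/* Matrix type identifiers following Table 1 (values 0-7). */
enum matrix_type {
    MT_GENERAL       = 0,  /* ordinary matrix                      */
    MT_SQUARE        = 1,  /* square matrix                        */
    MT_UPPER_FULL    = 2,  /* upper triangular, non-compressed     */
    MT_LOWER_FULL    = 3,  /* lower triangular, non-compressed     */
    MT_DIAG_FULL     = 4,  /* diagonal, non-compressed             */
    MT_UPPER_PACKED  = 5,  /* upper triangular, compressed storage */
    MT_LOWER_PACKED  = 6,  /* lower triangular, compressed storage */
    MT_DIAG_PACKED   = 7   /* diagonal, compressed storage         */
};
```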
下面对上述几种特殊类型的矩阵的读取进行说明。
读取行存储的非压缩存储上三角矩阵:
对于非压缩存储上三角矩阵在读取时,将对角线及以上的元素作为待迁移元素进行读取,而对角线以下的0元素可以不用读取,这样可以提高读取效率,节省迁移后的存储空间。
数据迁移单元根据待迁移矩阵的源存储地址以及每行对应的偏移值(本行之前各行的元素个数与本行中对角线以下的0元素的个数之和),确定出待迁移矩阵中每行的第一个待迁移元素对应的存储地址。同时,还可以确定出每行对应的读取字节数,即该行中待迁移元素的个数。在读取某一行中的待迁移元素时以确定出的该行的第一个待迁移元素对应的存储地址作为首地址进行读取,读取对应的读取字节数个元素。其中,每行中的第一个待迁移元素可以为该行中的第一个对角线及以上的元素,每行中的对角线及以上的元素可以为待迁移元素。
读取一个行列数均为m的非压缩存储上三角矩阵,数据迁移单元可以生成并发送m个读请求。
第一个读请求,读取以源存储地址(src_addr)为首地址的m个元素;
第二个读请求,读取以src_addr+m+1为首地址的m-1个元素;
第三个读请求,读取以src_addr+2m+2为首地址的m-2个元素;
以此类推,第m个读请求,读取以src_addr+(m-1)m+m-1为首地址的1个元素。
读取列存储的非压缩存储上三角矩阵:
数据迁移单元根据待迁移矩阵的源存储地址以及每列对应的偏移值(本列之前各列的元素之和),确定出待迁移矩阵中每列的第一个待迁移元素对应的存储地址。同时,还可以确定出每列对应的读取字节数,即该列中待迁移元素的个数。在读取某一列中的待迁移元素时以 确定出的该列的第一个待迁移元素对应的存储地址作为首地址进行读取,读取对应的读取字节数个元素。其中,每列中的第一个待迁移元素可以为该列中的第一个对角线及以上的元素,每列中的对角线及以上的元素可以为待迁移元素。
读取一个行列数均为m的非压缩存储上三角矩阵,数据迁移单元可以生成并发送m个读请求。
第一个读请求,读取以源存储地址(src_addr)为首地址的1个元素;
第二个读请求,读取以src_addr+m为首地址的2个元素;
第三个读请求,读取以src_addr+2m为首地址的3个元素;
以此类推,第m个读请求,读取以src_addr+(m-1)m为首地址的m个元素。
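The two request patterns above for a non-compressed upper triangular matrix of order m can be written out as follows. Addresses are counted in elements for brevity (a real implementation would multiply by the element size), and all names are illustrative assumptions.

```c
#include <stddef.h>

/* Requests for an order-m upper triangular matrix stored without compression,
 * migrating only the elements on and above the diagonal (indices start at 0):
 *   row storage:    request i reads m-i elements starting at src + i*(m+1);
 *   column storage: request j reads j+1 elements starting at src + j*m. */
void upper_triangular_requests(size_t src, size_t m, int row_storage,
                               size_t *first_addr, size_t *count)
{
    for (size_t i = 0; i < m; i++) {
        if (row_storage) {
            first_addr[i] = src + i * (m + 1);  /* skip the i leading zeros of row i     */
            count[i]      = m - i;
        } else {
            first_addr[i] = src + i * m;        /* column i starts one full column later */
            count[i]      = i + 1;
        }
    }
}
```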
数据迁移单元在读取行存储的非压缩存储下三角矩阵时的处理,与读取列存储的非压缩存储上三角矩阵时的处理相似,在读取列存储的非压缩存储下三角矩阵时的处理,与读取行存储的非压缩存储上三角矩阵时的处理相似。因此对于非压缩存储下三角矩阵的读取在此不做赘述。
读取非压缩存储的对角矩阵:
对于对角矩阵来说,虽然行存储和列存储是相同的,但是在本申请中仍然可以对对角矩阵进行迁移,可以使非压缩存储对角矩阵转换为压缩存储对角矩阵,即,将对角线上的元素作为待迁移元素进行读取,除对角线以外的0元素均不读取。
以读取行存储的非压缩存储的对角矩阵为例,数据迁移单元根据待迁移矩阵的源存储地址以及每行对应的偏移值(本行之前各行对应的元素与本行中对角线以下的0元素的个数之和),确定出待迁移矩阵中每行的第一个待迁移元素对应的存储地址。同时,还可以确定出每行对应的读取字节数,即该行中待迁移元素的个数。在读取某一行中的待迁移元素时以确定出的该行的第一个待迁移元素对应的存储地址作为首地址进行读取,读取对应的读取字节数个元素。其中,每行中的第一个待迁移元素可以为该行中的第一个对角线上的元素,每行中的对角线上的元素可以为待迁移元素。
读取一个行列数均为m的非压缩存储对角矩阵，数据迁移单元可以生成并发送m个读请求。
第一个读请求,读取以源存储地址(src_addr)为首地址的1个元素;
第二个读请求,读取以src_addr+m+1为首地址的1个元素;
第三个读请求,读取以src_addr+2m+2为首地址的1个元素;
以此类推,第m个读请求,读取以src_addr+(m-1)m+m-1为首地址的1个元素。
读取行存储的压缩存储上三角矩阵:
对于压缩存储上三角矩阵来说,其在存储时只存储对角线及以上的元素,即压缩存储上三角矩阵存储的全部元素均可以作为待迁移元素进行读取。
数据迁移单元根据待迁移矩阵的源存储地址以及每行对应的偏移值(本行之前的每行包括的待迁移元素的个数),确定出待迁移矩阵中每行的第一个待迁移元素对应的存储地址。同时,还可以确定出每行对应的读取字节数,即该行中待迁移元素的个数。在读取某一行中的待迁移元素时以确定出的该行的第一个待迁移元素对应的存储地址作为首地址进行读取,读取对应的读取字节数个元素。
读取一个行列数均为m的压缩存储上三角矩阵,数据迁移单元可以生成并发送m个读请求。
第一个读请求,读取以源存储地址(src_addr)为首地址的m个元素;
第二个读请求,读取以src_addr+m为首地址的m-1个元素;
第三个读请求,读取以src_addr+m+(m-1)为首地址的m-2个元素;
以此类推,第m个读请求,读取以src_addr+m+(m-1)+…+2为首地址的1个元素。
读取列存储的压缩存储上三角矩阵:
数据迁移单元根据待迁移矩阵的源存储地址以及每列对应的偏移值(本列之前的每行包括的待迁移元素的个数),确定出待迁移矩阵中每列的第一个待迁移元素对应的存储地址。同时,还可以确定出每列对应的读取字节数,即该列中待迁移元素的个数。在读取某一列中的待迁移元素时以确定出的该列的第一个待迁移元素对应的存储地址作为首地址进行读取,读取对应的读取字节数个元素。
读取一个行列数均为m的压缩存储上三角矩阵,数据迁移单元可以生成并发送m个读请求。
第一个读请求,读取以源存储地址(src_addr)为首地址的1个元素;
第二个读请求,读取以src_addr+1为首地址的2个元素;
第三个读请求,读取以src_addr+1+2为首地址的3个元素;
以此类推,第m个读请求,读取以src_addr+1+2…+m-1为首地址的m个元素。
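For the compressed (packed) upper triangular matrix, each row or column starts immediately after the previous one, so the first addresses are running sums of the previous lengths, exactly as in the request lists above. A hedged C sketch with element-granular addresses and illustrative names:

```c
#include <stddef.h>

/* Requests for an order-m upper triangular matrix in compressed storage:
 *   row storage:    request i reads m-i elements;
 *   column storage: request j reads j+1 elements;
 * in both cases the first address is src plus the sum of all previous lengths. */
void packed_upper_requests(size_t src, size_t m, int row_storage,
                           size_t *first_addr, size_t *count)
{
    size_t offset = 0;
    for (size_t i = 0; i < m; i++) {
        size_t len = row_storage ? (m - i) : (i + 1);
        first_addr[i] = src + offset;
        count[i]      = len;
        offset       += len;   /* next row/column starts right after this one */
    }
}
```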
数据迁移单元在读取行存储的压缩存储下三角矩阵时的处理,与读取列存储的压缩存储上三角矩阵时的处理相似,在读取列存储的压缩存储下三角矩阵时的处理,与读取行存储的压缩存储上三角矩阵时的处理相似。因此对于压缩存储下三角矩阵的读取在此不做赘述。
此外,如果矩阵类型标识指示的矩阵类型为压缩存储对角矩阵,则可以无需执行迁移操作。
在又一种可能的实现方式中,可以在设置有LDA的情况下,考虑矩阵类型进行读取。矩阵类型标识和LDA可以均被处理器获取,并与上述尺寸信息存储在同一个寄存器中。例如在存储LDA、矩阵类型标识和尺寸信息时,寄存器格式可以如图4所示。数据迁移单元在获取尺寸信息时,可以同时获取到矩阵类型标识和LDA。相应的,步骤202的处理可以如下:数据迁移单元根据LDA、矩阵类型标识、尺寸信息和源存储地址,读取待迁移矩阵。
对于压缩存储矩阵来说,是没有LDA的,因此,在矩阵类型标识指示的是压缩存储矩阵(如压缩存储上三角矩阵、压缩存储下三角矩阵、压缩存储对角矩阵等)时,则处理器可以不获取LDA,相应的,在寄存器中也不会存储有LDA。下面对于几种压缩存储矩阵,结合LDA读取的方法进行说明。
结合LDA读取行存储的非压缩存储上三角矩阵:
数据迁移单元根据待迁移矩阵的源存储地址以及每行对应的偏移值(本行之前各行对应的LDA与本行中对角线以下的0元素的个数之和),确定出待迁移矩阵中每行的第一个待迁移元素对应的存储地址。同时,还可以确定出每行对应的读取字节数,即该行中待迁移元素的个数。在读取某一行中的待迁移元素时以确定出的该行的第一个待迁移元素对应的存储地址作为首地址进行读取,读取对应的读取字节数个元素。其中,每行中的第一个待迁移元素可以为该行中的第一个对角线及以上的元素,每行中的对角线及以上的元素可以为待迁移元素。
同样的,对于非压缩存储上三角矩阵对角线以下的0元素可以不用读取。
读取一个行列数均为m,LDA为L的非压缩存储上三角矩阵,数据迁移单元可以生成并发送m个读请求。
第一个读请求,读取以源存储地址(src_addr)为首地址的m个元素;
第二个读请求,读取以src_addr+L+1为首地址的m-1个元素;
第三个读请求,读取以src_addr+2L+2为首地址的m-2个元素;
以此类推,第m个读请求,读取以src_addr+(m-1)L+m-1为首地址的1个元素。
结合LDA读取列存储的非压缩存储上三角矩阵:
数据迁移单元根据待迁移矩阵的源存储地址以及偏移值(本列之前各行对应的LDA之和),确定出待迁移矩阵中每列的第一个待迁移元素对应的存储地址。同时,还可以确定出每列对应的读取字节数,即该列中待迁移元素的个数。在读取某一列中的待迁移元素时以确定出的该列的第一个待迁移元素对应的存储地址作为首地址进行读取,读取对应的读取字节数个元素。其中,每列中的第一个待迁移元素可以为该列中的第一个对角线及以上的元素,每列中的对角线及以上的元素可以为待迁移元素。
读取一个行列数均为m,LDA为L的非压缩存储上三角矩阵,数据迁移单元可以生成并发送m个读请求。
第一个读请求,读取以源存储地址(src_addr)为首地址的1个元素;
第二个读请求,读取以src_addr+L为首地址的2个元素;
第三个读请求，读取以src_addr+2L为首地址的3个元素；
以此类推，第m个读请求，读取以src_addr+(m-1)L为首地址的m个元素。
结合LDA读取非压缩存储的对角矩阵:
以读取行存储为例,数据迁移单元根据待迁移矩阵的源存储地址以及每行对应的偏移值(本行之前各行对应的LDA与本行中对角线以下的0元素的个数之和),确定出待迁移矩阵中每行的第一个待迁移元素对应的存储地址。同时,还可以确定出每行对应的读取字节数,即该行中待迁移元素的个数。在读取某一行中的待迁移元素时以确定出的该行的第一个待迁移元素对应的存储地址作为首地址进行读取,读取对应的读取字节数个元素。其中,每行中的第一个待迁移元素可以为该行中的第一个对角线上的元素,每行中的对角线上的元素可以为待迁移元素。
读取一个行列数均为m,LDA为L的非压缩对角矩阵,数据迁移单元可以生成并发送m个读请求。
第一个读请求,读取以源存储地址(src_addr)为首地址的1个元素;
第二个读请求,读取以src_addr+L+1为首地址的1个元素;
第三个读请求,读取以src_addr+2L+2为首地址的1个元素;
以此类推,第m个读请求,读取以src_addr+(m-1)L+m-1为首地址的1个元素。
步骤203、按照第二存储方式存储待迁移矩阵的元素,所述第二存储方式是与所述第一存储方式不同的存储方式。
其中,在上述第一存储方式为行存储时,该第二存储方式为列存储,在上述第一存储方式为列存储时,该第二存储方式为行存储。在实施中,在对每个读请求读取的待迁移矩阵的元素进行存储时,需要保证存储后的矩阵,是以第二存储方式存储的。即,如果第二存储方式为列存储,则在迁移后存储时要使待迁移矩阵的同一列元素连续存储,如果第二存储方式为行存储,则在迁移后存储时要使待迁移矩阵的同一行元素连续存储。
对于矩阵在迁移之前均是非压缩存储矩阵,且迁移数据时也是按照非压缩矩阵存储的情况,可以按照如下方式进行存储。
以列存储的方式存储m行n列的待迁移矩阵,在待迁移矩阵是方阵时m=n。
对于第一个读请求获取的n个元素,依次存储在目的存储地址(dst_addr)、dst_addr+m、dst_addr+2m…dst_addr+(n-1)m;
对于第二个读请求获取的n个元素,依次存储在dst_addr+1、dst_addr+m+1、 dst_addr+2m+1…dst_addr+(n-1)m+1;
以此类推,对于第m个读请求获取的n个元素,依次存储在dst_addr+m-1、dst_addr+m+(m-1)、dst_addr+2m+(m-1)、…dst_addr+(n-1)m+(m-1)。
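The destination addresses listed above follow one simple pattern: when the n elements of row i (counting from 0) come back from a read request, element j is written to dst + j*m + i, which leaves the destination in column storage. A minimal C sketch of that store step, with illustrative names:

```c
#include <stddef.h>

/* Store the n elements returned for row i (0-based) of an m x n matrix so
 * that the destination buffer ends up in column storage, matching the
 * address pattern above (addresses in elements for brevity). */
void store_row_into_column_storage(const double *row_elems, double *dst,
                                   size_t i, size_t m, size_t n)
{
    for (size_t j = 0; j < n; j++)
        dst[j * m + i] = row_elems[j];   /* element (i, j) goes to dst + j*m + i */
}
```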
对于一些类型的矩阵(如非压缩存储上三角矩阵、非压缩存储对角矩阵、压缩存储上三角矩阵、压缩存储下三角矩阵、压缩存储对角矩阵等),在迁移后存储时可以根据矩阵类型、目的存储地址以及尺寸信息,确定读取的每个元素的存储地址,再对读取的元素进行存储。下面对几种类型的矩阵的存储方法进行说明。
以行存储的方式,对读取的压缩存储上三角矩阵或者非压缩存储上三角矩阵进行存储:
对于第一个读请求获取的1个元素,存储在dst_addr各自对应的存储空间;
对于第二个读请求获取的2个元素,依次存储在dst_addr+1、dst_addr+m+1-1各自对应的存储空间;
对于第三个读请求获取的3个元素,依次存储在dst_addr+2、dst_addr+m+2-1、dst_addr+2m+2-(1+2)各自对应的存储空间;
以此类推,对于第m个读请求获取的m个元素,依次存储在dst_addr+m-1、dst_addr+m+(m-1)-1、dst_addr+2m+(m-1)-(1+2)、…dst_addr+(m-1)m+(m-1)-[1+2+…(m-1)]各自对应的存储空间。
以列存储的方式,对读取的压缩存储上三角矩阵或者非压缩存储上三角矩阵进行存储:
对于第一个读请求读取的m个元素,依次存储在dst_addr、dst_addr+m-(m-1)、dst_addr+2m-[(m-1)+(m-2)]、…dst_addr+(m-1)m-[(m-1)+(m-2)+…1]各自对应的存储空间;
对于第二个读请求读取的m-1个元素,依次存储在dst_addr+m+1-(m-1)、dst_addr+2m+1-[(m-1)+(m-2)]、…dst_addr+(m-1)m+1-[(m-1)+(m-2)+…1]各自对应的存储空间;
对于第三个读请求读取的m-2个元素,依次存储在dst_addr+2m+2-[(m-1)+(m-2)]、dst_addr+3m+2-[(m-1)+(m-2)+(m-3)]、…dst_addr+(m-1)m+2-[(m-1)+(m-2)+…1]各自对应的存储空间;
以此类推,对于第m-1个读请求读取的2个元素,依次存储在dst_addr+(m-2)m+(m-2)-[(m-1)+(m-2)+…2]、dst_addr+(m-1)m+(m-2)-[(m-1)+(m-2)+…1]各自对应的存储空间;
对于第m个读请求读取的1个元素,存储在dst_addr+(m-1)m+(m-1)-[(m-1)+(m-2)+…1]各自对应的存储空间。
需要说明的是,数据迁移单元在以列存储的方式存储压缩存储下三角矩阵或者非压缩存储下三角矩阵时的处理,与以行存储的方式存储压缩存储上三角矩阵或者非压缩存储上三角矩阵时的处理相似,在以行存储的方式存储压缩存储下三角矩阵或者非压缩存储下三角矩阵时的处理,与以列存储的方式存储压缩存储上三角矩阵或者非压缩存储上三角矩阵时的处理相似。因此对于存储压缩存储下三角矩阵或者非压缩存储下三角矩阵的存储在此不做赘述。
对读取的非压缩存储对角矩阵进行存储:
对于第一个读请求获取的1个元素,存储在dst_addr对应的存储空间;
对于第二个读请求获取的1个元素,依次存储在dst_addr+1对应的存储空间;
对于第三个读请求获取的1个元素,存储在dst_addr+2对应的存储空间;
以此类推,对于第m个读请求获取的1个元素,存储在dst_addr+m-1对应的存储空间。
需要说明的是,上述目的存储地址为内存地址,将读取的元素存储在内存中之前可以先存储至内存地址对应的缓存中。缓存可以为处理器的高速缓存(cache)。
综上所述,可以通过处理器内部的数据迁移单元实现数据迁移,将乘号左侧矩阵转换为列存储的存储方式,或者将乘号右侧矩阵转换为行存储的存储方式,使得在计算矩阵乘时,可以连续获取一行或者一列元素进行相乘并累加,无需频繁访问内存,有效提高了计算效率。此外,该数据迁移单元只需一条数据迁移指令即可触发数据迁移的处理,触发相对简单。
在第一存储方式为行存储的情况下，待迁移矩阵为相乘的两矩阵中乘号右侧的矩阵，该待迁移矩阵迁移后，可以以列存储的方式存储在处理器的缓存以及内存中，执行矩阵乘法的处理单元可以获取乘号左侧的矩阵的第一行的元素，并直接在缓存中读取连续存储的待迁移矩阵的第一列的元素，进行对位相乘，并累加，作为输出矩阵的第一个元素，以此类推完成两矩阵的乘法运算，得到输出矩阵。
在第一存储方式为列存储的情况下，待迁移矩阵为相乘的两矩阵中乘号左侧的矩阵，该待迁移矩阵迁移后，可以以行存储的方式存储在处理器的缓存以及内存中，执行矩阵乘法的处理单元可以获取乘号右侧的矩阵的第一列的元素，并直接在缓存中读取连续存储的待迁移矩阵的第一行的元素，进行对位相乘，并累加，作为输出矩阵的第一个元素，以此类推完成两矩阵的乘法运算，得到输出矩阵。
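The payoff described in the two paragraphs above is that both operands of every inner product are read from contiguous memory. The following C sketch shows the multiplication loop once the right-hand matrix has been migrated to column storage; it illustrates the access pattern only, not the patent's hardware implementation, and the names and layouts are assumptions.

```c
#include <stddef.h>

/* Multiply with A kept in row storage and B migrated to column storage:
 * row i of A and column j of B are both contiguous, so each output element
 * is an inner product of two sequentially read vectors. */
void matmul_after_migration(const double *A_rowmajor, const double *B_colmajor,
                            double *C_rowmajor, int m, int k, int n)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            const double *a = A_rowmajor + (size_t)i * k;  /* row i of A    */
            const double *b = B_colmajor + (size_t)j * k;  /* column j of B */
            double acc = 0.0;
            for (int t = 0; t < k; t++)
                acc += a[t] * b[t];
            C_rowmajor[(size_t)i * n + j] = acc;
        }
}
```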
基于相同的技术构思,本申请实施例还提供了一种数据迁移的装置,该装置可以为上述数据迁移单元,如图5所示,该装置包括:
获取模块510,用于获取待迁移矩阵的存储位置,所述待迁移矩阵以第一存储方式存储。具体的,可以实现上述步骤201的获取功能及其隐含步骤。
读取模块520,用于读取所述待迁移矩阵中元素。具体的,可以实现上述步骤202的读取功能及其隐含步骤。
存储模块530,用于按照第二存储方式存储所述待迁移矩阵的元素,所述第二存储方式是与所述第一存储方式不同的存储方式。具体的,可以实现上述步骤203的存储功能及其隐含步骤。
应理解的是,本申请实施例的装置可以通过专用集成电路(application-specific integrated circuit,ASIC)实现,或可编程逻辑器件(programmable logic device,PLD)实现,上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD),现场可编程门阵列(field-programmable gate array,FPGA),通用阵列逻辑(generic array logic,GAL)或其任意组合。也可以通过软件实现图2所示的数据迁移的方法时,装置及其各个模块也可以为软件模块。
在一种可能的实现方式中,所述第一存储方式和所述第二存储方式分别包括列存储方式和行存储方式。
在一种可能的实现方式中,所述获取模块510还用于:
获取数据迁移指令,其中,所述数据迁移指令中携带有第一寄存器标识、第二寄存器标识和第三寄存器标识,其中,所述第一寄存器标识用于指示存储所述待迁移矩阵的源存储地址的第一寄存器、所述第二寄存器标识用于指示存储所述待迁移矩阵的目的存储地址的第二寄存器和所述第三寄存器标识用于指示存储所述待迁移矩阵的尺寸信息的寄存器;所述源存储地址和所述目的存储地址为内存的存储地址;
所述获取模块510,用于:
向所述第一寄存器获取所述待迁移矩阵的源存储地址,向所述第二寄存器获取所述待迁 移矩阵的目的存储地址,向所述第三寄存器获取所述待迁移矩阵的尺寸信息。
在一种可能的实现方式中,所述第三寄存器还存储所述待存储矩阵的矩阵类型标识和LDA;
所述获取模块510,用于:
向所述第三寄存器获取所述待迁移矩阵的尺寸信息、矩阵类型标识和LDA;
所述基于所述尺寸信息和所述源存储地址,读取第一存储方式存储的所述待迁移矩阵,包括:
基于所述尺寸信息、矩阵类型标识、LDA和所述源存储地址,读取第一存储方式存储的所述待迁移矩阵的元素。
在一种可能的实现方式中,所述读取模块520,用于:
在所述第一存储方式为行存储时,将所述源存储地址作为读取所述待迁移矩阵中的第一行的元素时的首地址;
根据所述尺寸信息、矩阵类型标识和LDA,确定N-1个偏移值,N为所述待迁移矩阵的行数;
根据每i个偏移值和所述源存储地址,确定读取所述待迁移矩阵中的第i行的元素时的首地址,i为大于0小于N的正整数;
在所述第一存储方式为列存储时,将所述源存储地址作为读取所述待迁移矩阵中的第一列的元素时的首地址;
根据所述尺寸信息、矩阵类型标识和LDA,确定N-1个偏移值,N为所述待迁移矩阵的列数;
根据每i个偏移值和所述源存储地址,确定读取所述待迁移矩阵中的第i+1列的元素时的首地址,i为大于0小于N的正整数;
所述读取模块520,用于:
向处理器的内部总线发送N个读请求,每个读请求携带一个首地址,其中,每个读请求用于读取所述待迁移矩阵中一行或者一列的元素;
接收所述内部总线基于每个首地址返回的所述待迁移矩阵中的元素。
根据所述尺寸信息、所述矩阵类型标识和所述目的存储地址,以第二存储方式存储所述待迁移矩阵中的待迁移元素。
在一种可能的实现方式中,所述存储模块530,用于:
根据所述尺寸信息、所述矩阵类型标识,在以所述目的存储地址为首地址的内存中,以第二存储方式存储所述待迁移矩阵中的待迁移元素,并在对应的所述处理器的缓存中以第二存储方式存储所述待迁移矩阵中的待迁移元素。
还需要说明的是,上述实施例提供的数据迁移的装置在数据迁移时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将数据迁移单元的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的数据迁移的装置与数据迁移的方法实施例属于同一构思,其具体实现过程详见图2所示方法实施例,为了简洁,在此不再赘述。
参见图6,本申请实施例提供了一种计算设备1300。该计算设备1300包括至少一个处理器1301,总线系统1302,存储器1303,通信接口1304和内存单元1305。
上述处理器1301可以是一个通用中央处理器（central processing unit，CPU），网络处理器（network processor，NP），图形处理器（graphics processing unit，GPU），微处理器，专用集成电路（application-specific integrated circuit，ASIC），或一个或多个用于控制本申请方案程序执行的集成电路。此外，上述处理器1301包括如图1所示的数据迁移单元40，使得上述数据迁移单元40用于实现如图2所示的方法的操作步骤，为了简洁，在此不再赘述。
上述总线系统1302可包括一通路,在上述组件之间传送信息。
上述存储器1303可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其他类型的动态存储设备,也可以是电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、只读光盘(compact disc read-only memory,CD-ROM)或其他光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。存储器可以是独立存在,通过总线与处理器相连接。存储器也可以和处理器集成在一起。
内存单元1305用于存储执行本申请方案的应用程序代码，并由处理器1301来控制执行。处理器1301用于执行内存单元1305中存储的应用程序代码，从而实现本申请提出的数据迁移方法。
在具体实现中,作为一种实施例,处理器1301可以包括一个或多个处理器1301。
通信接口1304用于实现计算设备1300与外部设备的连接和通信。
综上所述,计算设备可以将乘号左侧矩阵作为待迁移矩阵转换为列存储的存储方式,或者将乘号右侧矩阵作为待迁移矩阵转换为行存储的存储方式,使得在计算矩阵乘时,可以连续获取一行或者一列元素进行相乘并累加,无需频繁访问内存,有效提高了计算效率。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现,当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令,在设备上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴光缆、光纤、数字用户线)或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是设备能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(如软盘、硬盘和磁带等),也可以是光介质(如数字视盘(Digital Video Disk,DVD)等),或者半导体介质(如固态硬盘等)。
以上所述仅为本申请一个实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (10)

  1. 一种数据迁移的方法,其特征在于,所述方法包括:
    获取待迁移矩阵的存储位置,所述待迁移矩阵以第一存储方式存储;
    读取所述待迁移矩阵中元素;
    按照第二存储方式存储所述待迁移矩阵的元素,所述第二存储方式是与所述第一存储方式不同的存储方式。
  2. 根据权利要求1所述的方法,其特征在于,所述第一存储方式和所述第二存储方式分别包括列存储方式和行存储方式。
  3. 根据权利要求1或2所述的方法,其特征在于,在所述获取待迁移矩阵的存储位置之前,所述方法还包括:
    获取数据迁移指令,其中,所述数据迁移指令中携带有第一寄存器标识、第二寄存器标识和第三寄存器标识,其中,所述第一寄存器标识用于指示存储所述待迁移矩阵的源存储地址的第一寄存器;所述第二寄存器标识用于指示存储所述待迁移矩阵的目的存储地址的第二寄存器;所述第三寄存器标识用于指示存储所述待迁移矩阵的尺寸信息的寄存器;所述源存储地址和所述目的存储地址为内存的存储地址;
    则所述获取待迁移矩阵的存储位置,包括:
    向所述第一寄存器获取所述待迁移矩阵的源存储地址,向所述第二寄存器获取所述待迁移矩阵的目的存储地址,向所述第三寄存器获取所述待迁移矩阵的尺寸信息。
  4. 根据权利要求1至3中任一所述的方法,其特征在于,在所述读取所述待迁移矩阵中元素之前,所述方法还包括:
    将源存储地址作为所述待迁移矩阵中第一行的首个元素的地址或者第一列的首个元素的地址,并确定N-1个偏移值,N为待迁移矩阵的行数或列数;
    则所述读取所述待迁移矩阵中元素,包括:
    按照所述待迁移矩阵的行或列的首个元素的地址和偏移地址分别逐行或逐列读取所述待迁移矩阵中元素。
  5. 一种数据迁移的装置，其特征在于，所述装置应用于处理器中，所述装置包括：
    获取模块,用于获取待迁移矩阵的存储位置,所述待迁移矩阵以第一存储方式存储;
    读取模块,用于读取所述待迁移矩阵中元素;
    存储模块,用于按照第二存储方式存储所述待迁移矩阵的元素,所述第二存储方式是与所述第一存储方式不同的存储方式。
  6. 根据权利要求5所述的装置,其特征在于,所述第一存储方式和所述第二存储方式分别包括列存储方式和行存储方式。
  7. 根据权利要求5或6所述的装置,其特征在于,所述获取模块,还用于:
    获取数据迁移指令,其中,所述数据迁移指令中携带有第一寄存器标识、第二寄存器标识和第三寄存器标识,其中,所述第一寄存器标识用于指示存储所述待迁移矩阵的源存储地 址的第一寄存器、所述第二寄存器标识用于指示存储所述待迁移矩阵的目的存储地址的第二寄存器和所述第三寄存器标识用于指示存储所述待迁移矩阵的尺寸信息的寄存器;所述源存储地址和所述目的存储地址为内存的存储地址;
    向所述第一寄存器获取所述待迁移矩阵的源存储地址,向所述第二寄存器获取所述待迁移矩阵的目的存储地址,向所述第三寄存器获取所述待迁移矩阵的尺寸信息。
  8. 根据权利要求7中所述的装置,其特征在于,所述读取模块,还用于:
    在所述读取所述待迁移矩阵中元素之前,将源存储地址作为所述待迁移矩阵中第一行的首个元素的地址或者第一列的首个元素的地址,并确定N-1个偏移值,N为待迁移矩阵的行数或列数;
    按照所述待迁移矩阵的行或列的首个元素的地址和偏移地址分别逐行或逐列读取所述待迁移矩阵中元素。
  9. 一种处理器,其特征在于,所述处理器包括数据迁移单元,所述数据迁移单元用于执行如上述权利要求1-4中任一项所述的数据迁移的方法。
  10. 一种计算设备,其特征在于,所述计算设备包括权利要求9所述的处理器,所述处理器用于执行如权利要求1-4中任一项所述的数据迁移的方法。
PCT/CN2021/106966 2020-09-30 2021-07-17 数据迁移的方法、装置、处理器和计算设备 WO2022068328A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011062409.8A CN114327244A (zh) 2020-09-30 2020-09-30 数据迁移的方法、装置、处理器和计算设备
CN202011062409.8 2020-09-30

Publications (1)

Publication Number Publication Date
WO2022068328A1 true WO2022068328A1 (zh) 2022-04-07

Family

ID=80949164

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/106966 WO2022068328A1 (zh) 2020-09-30 2021-07-17 数据迁移的方法、装置、处理器和计算设备

Country Status (2)

Country Link
CN (1) CN114327244A (zh)
WO (1) WO2022068328A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115438114A (zh) * 2022-11-09 2022-12-06 浪潮电子信息产业股份有限公司 存储格式转换方法、系统、装置、电子设备及存储介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116304750B (zh) * 2023-05-19 2023-08-18 北京算能科技有限公司 数据处理方法及装置、电子设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103106183A (zh) * 2013-01-29 2013-05-15 福建天晴数码有限公司 基于mapreduce的大规模稀疏矩阵乘法运算的方法
CN108040257A (zh) * 2017-11-20 2018-05-15 深圳市维海德技术股份有限公司 一种二维dct硬件实现方法及装置
CN109522125A (zh) * 2018-11-19 2019-03-26 郑州云海信息技术有限公司 一种矩阵乘积转置的加速方法、装置及处理器
CN111401568A (zh) * 2020-03-24 2020-07-10 深圳职业技术学院 基于正规方程的高效机器学习算法、装置及可读存储介质


Also Published As

Publication number Publication date
CN114327244A (zh) 2022-04-12


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21873990

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21873990

Country of ref document: EP

Kind code of ref document: A1