WO2022161394A1 - Task mapping method, task processing method, processing core, and electronic device - Google Patents

Task mapping method, task processing method, processing core, and electronic device

Info

Publication number
WO2022161394A1
WO2022161394A1 PCT/CN2022/073984
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
layer
sub
processing
core
Prior art date
Application number
PCT/CN2022/073984
Other languages
English (en)
French (fr)
Inventor
王封
祝夭龙
Original Assignee
北京灵汐科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202110111060.0A external-priority patent/CN114791849A/zh
Priority claimed from CN202110103025.4A external-priority patent/CN114791786A/zh
Application filed by 北京灵汐科技有限公司
Publication of WO2022161394A1

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38: Concurrent instruction execution, e.g. pipeline or look ahead
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • the present application relates to the field of computer technology, and in particular, to a task mapping method, a task processing method, a processing core and an electronic device.
  • a sparse matrix is a matrix in which zero elements far outnumber non-zero elements and the non-zero elements are distributed irregularly. Sparse matrices are widely used in practice; in particular, they frequently appear in high-performance computing and machine learning, for example in count data, in encodings that map categorical data, and in sub-fields of machine learning such as natural language processing (NLP, Natural Language Processing).
  • the present application provides a task mapping method, a task processing method, a processing core and an electronic device.
  • the present application provides a task mapping method, comprising: determining a multi-layer second matrix according to a first matrix, and dividing each layer of the second matrix into at least one second sub-matrix, wherein the elements of the (N+1)-th-layer second matrix correspond one-to-one to the multiple second sub-matrices of the N-th-layer second matrix, the first-layer second sub-matrices are sub-matrices of the first matrix, and N is a positive integer; and mapping the second sub-matrices of each layer's second matrix to processing cores in a many-core system, each processing core corresponding to one second sub-matrix, so that the processing core performs the matrix operation on its corresponding second sub-matrix and stores the operation result.
  • the at least one second sub-matrix of each layer's second matrix includes at least one non-zero second sub-matrix; the mapping of the second sub-matrices of each layer's second matrix to processing cores in the many-core system includes: mapping the non-zero second sub-matrices of each layer's second matrix to processing cores in the many-core system, respectively.
  • the matrix calculation performed by the processing core is performed based on an operation instruction, and each processing core corresponds to one second sub-matrix and a corresponding operation instruction; mapping the second sub-matrices of each layer's second matrix to processing cores in the many-core system, each processing core corresponding to one second sub-matrix, so that the processing core performs the matrix operation on its corresponding second sub-matrix and stores the operation result, includes: determining the operation instruction corresponding to at least one second sub-matrix; and mapping each layer's second sub-matrices and their corresponding operation instructions to processing cores in the many-core system, so that each processing core performs the matrix operation on its corresponding second sub-matrix according to the corresponding operation instruction and stores the calculation result.
  • mapping each layer's second sub-matrices and their corresponding operation instructions to processing cores in the many-core system includes: mapping the non-zero second sub-matrices of each layer's second matrix to processing cores in the many-core system, respectively; and configuring the operation instructions corresponding to at least one non-zero second sub-matrix into the processing cores of the many-core system.
  • the processing core corresponding to an N-th-layer second sub-matrix is an N-th-layer processing core; mapping the non-zero second sub-matrices of each layer's second matrix to processing cores in the many-core system, respectively, includes: transmitting the first-layer non-zero second sub-matrices of the first-layer second matrix to first-layer processing cores, so that each first-layer processing core performs the matrix operation on its corresponding first-layer non-zero second sub-matrix; and transmitting the first correspondence between the N-th-layer processing cores and the first coordinates to the (N+1)-th-layer processing core, so that the (N+1)-th-layer processing core determines, according to the first correspondence, the second correspondence between the N-th-layer processing cores and the second coordinates; the first coordinate is the coordinate, in the (N+1)-th-layer second matrix, of the element corresponding to an N-th-layer non-zero second sub-matrix; the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)-th-layer second matrix.
  • the determining of a multi-layer second matrix according to the first matrix includes: determining a target size according to the size of the first matrix, where the target size is the size of each layer's second sub-matrices; and determining the multi-layer second matrix according to the first matrix and the target size.
  • the method further includes: determining a target processing core according to the mapping relationship between at least one second sub-matrix of the multi-layer second matrix and the multiple processing cores in the many-core system, the target processing core being at least one of the processing cores in which target data is stored, where the target data is the operation result matrix of the sub-matrix of the first matrix corresponding to the task data of a task to be processed; and transmitting the task data to the target processing core, so that the target processing core reads the target data and executes the operation corresponding to the task data.
  • the at least one second sub-matrix of each layer's second matrix includes at least one non-zero second sub-matrix; the mapping relationship between at least one second sub-matrix of the multi-layer second matrix and the multiple processing cores in the many-core system is the mapping relationship between at least one non-zero second sub-matrix of each layer's second matrix and the multiple processing cores; the processing core corresponding to an N-th-layer second sub-matrix is an N-th-layer processing core; a first-layer processing core stores the calculation result obtained by performing the matrix operation on its corresponding first-layer non-zero second sub-matrix; the (N+1)-th-layer processing core stores the second correspondence between the N-th-layer processing cores and the second coordinates, determined according to the first correspondence between the N-th-layer processing cores and the first coordinates; the first coordinate is the coordinate, in the (N+1)-th-layer second matrix, of the element corresponding to an N-th-layer non-zero second sub-matrix; the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)-th-layer second matrix.
  • determining the target processing core according to the mapping relationship between at least one second sub-matrix of the multi-layer second matrix and the multiple processing cores in the many-core system includes: when N is greater than 1, determining the (N-1)-th-layer processing core corresponding to the target data according to the second correspondence stored by the N-th-layer processing core corresponding to the target data; and when N is equal to 1, taking the first-layer processing core in which the target data is stored as the target processing core.
  • the task mapping method further includes: determining, according to the second correspondences stored in the processing cores of each layer, the address of the storage space in off-chip storage for at least one target calculation result, where a target calculation result is the calculation result stored by a first-layer processing core after it performs the matrix operation on its corresponding first-layer non-zero second sub-matrix; and controlling at least one first-layer processing core to write its stored target calculation result into the off-chip storage according to that address, wherein at least one target calculation result is spliced together in the off-chip storage into the operation result matrix of the first matrix.
  • each layer's second sub-matrices are square matrices; the matrix operation includes at least a matrix transposition operation.
  • an embodiment of the present application provides a task processing method, including: receiving a first correspondence between N-th-layer processing cores and first coordinates in a multi-layer second matrix, where N is a positive integer; and determining, according to the first correspondence, a second correspondence between the N-th-layer processing cores and second coordinates; wherein each layer of the multi-layer second matrix is divided into at least one second sub-matrix; the elements of the (N+1)-th-layer second matrix correspond one-to-one to the multiple second sub-matrices of the N-th-layer second matrix; each of the multiple processing cores in the many-core system corresponds to one second sub-matrix; the first coordinate is the coordinate, in the (N+1)-th-layer second matrix, of the element corresponding to an N-th-layer second sub-matrix; and the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)-th-layer second matrix.
  • embodiments of the present application provide an electronic device, including: one or more processors; a storage device on which one or more programs are stored, the one or more programs, when executed by the one or more processors, causing the one or more processors to implement the task mapping method described in the first aspect of the embodiments of the present application; and one or more I/O interfaces connected between the processors and the storage device and configured to realize information interaction between the processors and the storage device.
  • an embodiment of the present application provides a processing core, including a computing unit and a cache; the computing unit can implement the task mapping method of the first aspect of the embodiments of the present application and/or the task processing method of the second aspect of the embodiments of the present application.
  • embodiments of the present application provide a many-core system, including: a plurality of processing cores; and an on-chip network configured to exchange data among the plurality of processing cores and with the outside; one or more instructions are stored in one or more of the processing cores, and the one or more instructions are executed by one or more of the processing cores, so that the one or more processing cores can execute the task mapping method of the first aspect and/or the task processing method of the second aspect of the embodiments of the present application.
  • embodiments of the present application provide a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor executes the task mapping method of the first aspect and/or the task processing method of the second aspect of the embodiments of the present application.
  • a solution for performing matrix operations on sparse matrices using a many-core system is provided: a multi-layer second matrix is determined according to the first matrix on which the matrix operation is to be performed, and the second sub-matrices of each layer's second matrix are far smaller in scale than the first matrix; the second sub-matrices of at least one layer of the second matrix are mapped to multiple processing cores in the many-core system to perform the matrix operations, finally yielding the matrix operation result of the first matrix. This achieves a high compression rate for the coordinate dimensions of the matrix elements, greatly reducing memory overhead; and because the calculation results are stored in at least one processing core of the many-core system, they need not be written to off-chip storage such as memory, which reduces repeated data transfers and improves the efficiency of matrix operations on very large sparse matrices.
  • FIG. 1 is a flowchart of a task mapping method in an embodiment of the present application
  • FIG. 2 is a schematic diagram of a mapping from a multi-layer second matrix to a processing core in an embodiment of the present application
  • FIG. 7 is a flowchart of a task mapping method in an embodiment of the present application.
  • FIG. 10 is a flowchart of a task processing method in an embodiment of the present application.
  • FIG. 11 shows a flowchart of some steps in the task mapping method in the embodiment of the present application.
  • FIG. 13 is a flowchart of a task processing apparatus in an embodiment of the present application.
  • FIG. 15 is a block diagram of the composition of a processing core provided by an embodiment of the present application.
  • FIG. 16 is a block diagram of the composition of a many-core system provided by an embodiment of the present application.
  • when the size of a sparse matrix is larger, the sparse matrix occupies more storage space, the amount of computation is larger, and operations involving the sparse matrix are less efficient.
  • after a matrix operation is performed on a sparse matrix, the result of the matrix operation needs to be transferred to a memory (for example, a double data rate synchronous dynamic random access memory (DDR SDRAM)) and spliced together in the memory.
  • the above repeated data transfers further reduce the efficiency of computations involving sparse matrices.
  • FIG. 1 is a flowchart of a task mapping method in an embodiment of the present application.
  • an embodiment of the present application provides a task mapping method, and the method includes the following steps.
  • in step S110, a multi-layer second matrix is determined according to the first matrix, and each layer of the second matrix is divided into at least one second sub-matrix; the elements of the (N+1)-th-layer second matrix correspond one-to-one to the multiple second sub-matrices of the N-th-layer second matrix, and the first-layer second sub-matrices are sub-matrices of the first matrix; N is a positive integer.
  • in step S120, the second sub-matrices of each layer's second matrix are mapped to processing cores in the many-core system, each processing core corresponding to one second sub-matrix, so that the processing core performs the matrix operation on its corresponding second sub-matrix and stores the operation result.
  • the matrix operation in this embodiment of the present application may include: a matrix transposition operation, determining a memory storage location corresponding to the matrix, and the like. This embodiment of the present application does not specifically limit this.
  • An embodiment of the present application provides a solution for performing matrix operations on sparse matrices by using a many-core system.
  • the many-core system may be composed of a single chip having multiple processing cores, where a processing core is the smallest computing unit in the many-core system that can be independently scheduled and has complete computing capability; the many-core system may also consist of multiple chips, each of which may have multiple processing cores. This embodiment of the present application does not specifically limit this.
  • multiple processing cores in the many-core system can run program instructions independently, or can work together, using parallel computing to speed up program execution and to provide multi-task processing capability.
  • each processing core in the many-core system has an independent cache, which can store data such as calculation results obtained by the processing cores performing operations.
  • the matrix operation of the first matrix can be performed through steps S110 to S120.
  • the first matrix is a sparse matrix.
  • the first-layer second matrix is the first matrix.
  • the multiple first-layer second sub-matrices obtained by dividing the first-layer second matrix are the sub-matrices obtained by dividing the first matrix according to the same rule.
  • each layer's second matrix is a real-number matrix, i.e., its elements are all real numbers. For example, if the N-th-layer second sub-matrix corresponding to an element of the (N+1)-th-layer second matrix is a non-zero matrix, the element is 1; if the N-th-layer second sub-matrix corresponding to the element is a zero matrix, the element is 0. Other values may also be used as elements of the second matrix, which is not specifically limited in this embodiment of the present application.
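The layer-construction rule above (an element is 1 when the corresponding lower-layer block is a non-zero matrix, 0 when it is a zero matrix) can be sketched as follows. This is an illustrative Python/NumPy sketch, not part of the patent; the function names and the assumption that the block size evenly divides each layer's dimension are ours.

```python
import numpy as np

def next_layer(matrix, block):
    """Build the (N+1)-th-layer second matrix from an N-th-layer matrix:
    each element is 1 if the corresponding block x block sub-matrix is a
    non-zero matrix, and 0 if it is a zero matrix."""
    rows, cols = matrix.shape
    out = np.zeros((rows // block, cols // block), dtype=int)
    for i in range(rows // block):
        for j in range(cols // block):
            sub = matrix[i * block:(i + 1) * block, j * block:(j + 1) * block]
            out[i, j] = 1 if np.any(sub != 0) else 0
    return out

def build_layers(first_matrix, block):
    """Layer 1 is the first matrix itself; keep building indicator layers
    until a layer fits in a single block (the M-th layer)."""
    layers = [first_matrix]
    while layers[-1].shape[0] > block:
        layers.append(next_layer(layers[-1], block))
    return layers
```

For a 16*16 first matrix divided into 4*4 blocks, `build_layers` returns two layers: the first matrix itself and the 4*4 second-layer indicator matrix.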
  • FIG. 2 shows a schematic diagram of the mapping of the multi-layer second matrix to the processing core in the embodiment of the present application.
  • the first matrix is divided into 16 4*4 sub-matrices A, B, C, D, E, F, G, H, I, J, K, L, O, P, Q, and R, which correspond to the first-layer second matrix; the second-layer second matrix includes 16 elements, which correspond one-to-one to the 16 sub-matrices of the first matrix. The element in the first row and first column corresponds to the 4*4 sub-matrix A; the element in the first row and second column corresponds to the 4*4 sub-matrix B (because the 4*4 sub-matrix B is a zero matrix, the label B is not shown in the second-layer second matrix); and so on, up to the element in the fourth row and fourth column, which corresponds to the 4*4 sub-matrix R (because the 4*4 sub-matrix R is a zero matrix, the label R is not shown in the second-layer second matrix).
  • the submatrix obtained by dividing the first matrix may be a non-zero matrix or a zero matrix.
  • the number of zero matrices obtained by dividing the first matrix may be greater than the number of non-zero matrices.
  • the second submatrix obtained by dividing the second matrix of each layer may be a non-zero matrix or a zero matrix.
  • the non-zero matrices are, for example, the 4*4 sub-matrices A, D, K, and P in FIG. 2; the zero matrices are, for example, the 4*4 sub-matrices B, C, E, F, G, H, I, J, L, O, Q, and R in FIG. 2.
  • the processing cores in the many-core system can perform the corresponding 4*4 matrix transpositions in parallel; the zero matrices B, C, E, F, G, H, I, J, L, O, Q, and R need neither storage nor transposition, and after the processing cores transpose their corresponding non-zero matrices A, D, K, and P, the transposed matrices (operation result matrices) A1, D1, K1, and P1 are obtained.
  • a higher-level transposition can then be performed by one control core based on the transposition results obtained by the multiple cores in the many-core system.
  • the control core need only store the position coordinates of the non-zero matrices, so the amount of data to be stored is very small.
  • the transposition result of the control core (as shown in FIG. 2, the transposition result of the second-layer second matrix) determines the transposition result of the first-layer second matrix (the positions of the non-zero matrix D and the zero matrix O are swapped, and the positions of the non-zero matrix P and the zero matrix H are swapped).
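The two-level transpose walked through above can be illustrated with a single-process NumPy sketch: each non-zero 4*4 block is transposed on its own (in the many-core system, by a separate processing core), and the transposed layer-2 indicator matrix decides where each block lands. This sketch is ours, not the patent's implementation; the serial loop stands in for the parallel cores.

```python
import numpy as np

def hierarchical_transpose(first_matrix, block):
    """Blockwise transpose mirroring the FIG. 2 walk-through: transpose
    each non-zero block x block sub-matrix, then place it at the position
    given by transposing the layer-2 indicator matrix, i.e. block (i, j)
    lands at (j, i). Zero blocks are neither stored nor transposed."""
    rows, cols = first_matrix.shape
    result = np.zeros((cols, rows), dtype=first_matrix.dtype)
    for i in range(rows // block):
        for j in range(cols // block):
            sub = first_matrix[i * block:(i + 1) * block,
                               j * block:(j + 1) * block]
            if np.any(sub != 0):  # skip zero blocks entirely
                result[j * block:(j + 1) * block,
                       i * block:(i + 1) * block] = sub.T
    return result
```

Placing transposed block (i, j) at (j, i) is exactly the swap seen in FIG. 2: block D moves to O's position and P to H's, while A and K stay on the diagonal.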
  • in step S110, when there are M layers of second matrices in total and the first-layer second matrix is the first matrix, each of the first-layer through (M-1)-th-layer second matrices is divided into a plurality of second sub-matrices, while the M-th-layer second matrix is divided into one second sub-matrix, that is, the M-th-layer second sub-matrix is the M-th-layer second matrix itself.
  • M is an integer greater than or equal to N.
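Under the assumption that the first matrix is size*size with size a power of the block dimension, the total number of layers M follows directly from repeated division. A small illustrative helper (ours, not from the patent):

```python
def num_layers(size, block):
    """Total number of second-matrix layers M for a size x size first
    matrix divided into block x block sub-matrices at every layer
    (size is assumed to be a power of block). Layer N has dimension
    size / block**(N-1); M is reached once a layer fits in one block."""
    layers = 1
    while size > block:
        size //= block
        layers += 1
    return layers
```

For the 16*16 example with 4*4 blocks this gives M = 2; a 256*256 first matrix with 4*4 blocks would give M = 4.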
  • the multi-layered second sub-matrix can be mapped to multiple processing cores of the many-core system, respectively, and the processing core performs the matrix operation on the corresponding second sub-matrix, and stores the calculation result.
  • multiple processing cores in the many-core system respectively perform matrix operations on second sub-matrices of one or more layers, thereby obtaining the operation result matrix of the first matrix.
  • the multiple processing cores in step S120 have been configured with operation instructions, such as the operators and parameters required for performing matrix operations on the second sub-matrices; for example, a processing core can determine how large a matrix it is to compute.
  • the second sub-matrix is transmitted to the processing core through a network on chip (NOC, Network On Chip) in the form of a data stream.
  • steps S110-S120 may be performed by a control core in a many-core system.
  • the control core may be any processing core in the many-core system.
  • the control core may be a processing core other than the multiple processing cores in step S120; it may also be one of the multiple processing cores in step S120 — specifically, when there are M layers of second matrices and the first-layer second matrix is the first matrix, the M-th-layer processing core corresponding to the M-th-layer second matrix serves as the control core.
  • This embodiment of the present disclosure makes no special limitation on this.
  • an operation instruction corresponding to at least one second sub-matrix is determined by the control core by performing step S120, and the operation instruction includes an operator, parameters, etc. required for processing and checking the second sub-matrix to perform matrix operation.
  • an operation instruction indicates to the processing core, for example, how large a matrix it is to operate on.
  • the control core dynamically determines a processing core for performing matrix operations according to the first matrix, and determines an operation instruction of at least one processing core.
  • the size of the first matrix may be any dimension.
  • the control core performs step S120 to map the multi-layer second sub-matrices and operation instructions to multiple processing cores of the many-core system, and each processing core performs the matrix operation on its corresponding second sub-matrix and stores the calculation result. It follows from matrix algebra that, in the embodiments of the present disclosure, the multiple processing cores in the many-core system respectively perform matrix operations on at least one layer of second sub-matrices, thereby obtaining the operation result matrix of the first matrix.
  • since the scale of at least one second sub-matrix is smaller than that of the first matrix, the storage space required in the many-core system for at least one second sub-matrix and its matrix operation result is significantly smaller than the storage space required for the first matrix and its operation result, and the amount of computation of at least one processing core when performing the matrix operation of a second sub-matrix is also significantly smaller than that of performing the matrix operation of the first matrix.
  • the 16*16 first matrix is divided into 16 4*4 sub-matrices, corresponding to the second matrix of the first layer; the second matrix of the second layer is also a 4*4 matrix.
  • the processing core requires only 2 bits for the row index and 2 bits for the column index to store the coordinates of the elements of a 4*4 sub-matrix.
  • the layered matrix operation scheme, in which multiple first processing cores perform matrix operations on sub-matrices and a second processing core performs the matrix operation on the block matrix, can achieve a higher compression rate for the coordinate dimensions of the matrix elements and can greatly reduce memory overhead.
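The compression claim can be made concrete: the number of bits needed per coordinate index depends only on the dimension of the matrix a core actually stores. A hypothetical back-of-envelope sketch (names ours, not from the patent):

```python
def coord_bits(dim):
    """Bits needed to store one row (or column) index of a dim x dim matrix."""
    return max(1, (dim - 1).bit_length())

# Storing a non-zero element of the 16*16 first matrix directly needs a
# 4-bit row index plus a 4-bit column index.
flat_bits = coord_bits(16) + coord_bits(16)      # 8 bits per element
# In the layered scheme a core holds only a 4*4 sub-matrix, so each
# element needs just 2 + 2 coordinate bits; the control core keeps only
# the 2 + 2-bit block coordinates of the few non-zero blocks.
layered_bits = coord_bits(4) + coord_bits(4)     # 4 bits per element
```

Per stored element, the coordinate overhead is halved here, and the gap widens for larger first matrices since the flat cost grows with the full matrix dimension while the layered cost stays fixed at the block dimension.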
  • a scheme of using a many-core system to perform the matrix operation of a sparse matrix is provided: a multi-layer second matrix is determined according to the first matrix on which the matrix operation is to be performed, and the second sub-matrices of each layer's second matrix are far smaller in scale than the first matrix; the second sub-matrices of at least one layer of the second matrix are mapped to multiple processing cores in the many-core system to perform matrix operations, finally yielding the matrix operation result of the first matrix. This achieves a higher compression rate for the coordinate dimensions of the matrix elements and greatly reduces memory overhead; the calculation results of the matrix operations are stored in at least one processing core of the many-core system and need not be written to off-chip storage such as memory, which also reduces repeated data transfers and improves the efficiency of matrix operations on very large sparse matrices.
  • when the second sub-matrices of each layer's second matrix are mapped to processing cores in the many-core system through step S120, or when the second sub-matrices and their operation instructions are mapped to processing cores through step S120, all the second sub-matrices of each layer's second matrix may be mapped to multiple processing cores in the many-core system; alternatively, only the non-zero second sub-matrices of each layer's second matrix may be mapped to multiple processing cores in the many-core system.
  • the sub-matrices of the first matrix include multiple zero matrices; when determining the multi-layer second matrix, if the N-th-layer second sub-matrix corresponding to an element of the (N+1)-th-layer second matrix is a non-zero matrix, the element is 1; if the N-th-layer second sub-matrix corresponding to the element is a zero matrix, the element is 0. Therefore, the second sub-matrices of each layer's second matrix include non-zero matrices and/or zero matrices.
  • in step S120, mapping only the non-zero second sub-matrices of each layer's second matrix to multiple processing cores in the many-core system can further reduce the occupation of storage and computing resources.
  • FIG. 2 illustrates an alternative embodiment of mapping non-zero second sub-matrices in each layer of second matrices to multiple processing cores in a many-core system.
  • FIG. 3 is a flowchart of some steps in the task mapping method in the embodiment of the present application.
  • at least one second sub-matrix of each layer of second matrix includes at least one non-zero second sub-matrix.
  • in step S120, the step of mapping the second sub-matrices of each layer's second matrix to processing cores in the many-core system may specifically include: S121, mapping the non-zero second sub-matrices of each layer's second matrix to processing cores in the many-core system, respectively.
  • the processing core corresponding to the second sub-matrix of the Nth layer is the Nth layer processing core.
  • the second sub-matrix of the first layer is the sub-matrix of the first matrix.
  • the first-layer processing core computes the sub-matrix of the first matrix and stores the operation result matrix of that sub-matrix; the (N+1)-th-layer processing core determines, from the coordinates that the N-th-layer processing core corresponded to in the (N+1)-th-layer second matrix before the matrix operation (for example, transposition), the coordinates that the N-th-layer processing core corresponds to in the operation result matrix of the (N+1)-th-layer second matrix after the matrix operation.
  • in this way, the first-layer processing core that stores the operation result matrix of a first-layer second sub-matrix, that is, the operation result matrix of a sub-matrix of the first matrix, can be determined layer by layer according to the correspondences between processing cores and coordinates stored in the processing cores.
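The layer-by-layer lookup just described (follow the second correspondence stored at layer N down to the layer-1 core that holds the target result) can be modeled with a toy structure. Everything here, including the `Core` class and the dict-based correspondences, is an illustrative assumption, not the patent's actual data layout:

```python
class Core:
    """Toy model of a processing core. A layer-N core with N > 1 stores a
    second correspondence mapping a coordinate in its layer's operation
    result matrix to the layer-(N-1) core responsible for that block; a
    layer-1 core stores an operation result matrix in its cache."""
    def __init__(self, layer, second_corr=None, result=None):
        self.layer = layer
        self.second_corr = second_corr or {}
        self.result = result

def find_target_core(top_core, coords_per_layer):
    """Walk the second correspondences layer by layer: starting from the
    top-layer core, look up the coordinate of the target data at the
    current layer to reach the next core down, until a layer-1 core
    (the target processing core holding the target data) is reached."""
    core = top_core
    while core.layer > 1:
        core = core.second_corr[coords_per_layer[core.layer]]
    return core
```

This mirrors the claim language: when N is greater than 1 the stored second correspondence yields the (N-1)-th-layer core, and when N reaches 1 the core found is taken as the target processing core.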
  • the matrix operation of the second sub-matrix performed by the processing core is based on the calculation performed by the operation instruction, and each processing core corresponds to a second sub-matrix and the corresponding operation instruction;
  • the above step S120 may specifically include: S11, determining the operation instruction corresponding to at least one second sub-matrix; S12, mapping each second sub-matrix in each layer of the second matrix, together with its corresponding operation instruction, to a processing core in the many-core system, so that the processing core performs the matrix operation on its corresponding second sub-matrix according to the corresponding operation instruction and stores the calculation result.
  • In this way, the control core in the many-core system can, according to the matrix operation (such as a transposition) to be performed on the first matrix, determine the multi-layer second matrices and the operation instructions, where the scale of each second sub-matrix in each layer of the second matrix is far smaller than the scale of the first matrix; the second sub-matrices and operation instructions of at least one layer of the second matrix are mapped to multiple processing cores in the many-core system to perform the matrix operations, and finally the matrix operation result of the first matrix is obtained.
  • FIG. 4 is a flowchart of some steps in the task mapping method in the embodiment of the present application.
  • In some embodiments, the above step S12 may specifically include: S41, mapping the non-zero second sub-matrices in each layer of the second matrix to the processing cores in the many-core system, respectively; S42, allocating the operation instructions corresponding to the non-zero second sub-matrices to those processing cores of the many-core system.
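As a rough illustration of steps S41 and S42, the following sketch partitions a first matrix into layer-1 second sub-matrices and records, in a layer-2 second matrix, which of them are non-zero and therefore need to be mapped to processing cores. The function name, the 0/1 encoding of the layer-2 elements, and the use of NumPy are illustrative assumptions, not the patent's actual implementation.

```python
import numpy as np

def build_second_matrix(first, block):
    """Partition `first` into block x block layer-1 second sub-matrices and
    build a layer-2 second matrix whose elements correspond one-to-one with
    those sub-matrices. Hypothetical encoding: a layer-2 element is 1 when
    its sub-matrix is non-zero (and gets mapped to a core), else 0."""
    rows, cols = first.shape[0] // block, first.shape[1] // block
    nonzero_subs = {}
    layer2 = np.zeros((rows, cols), dtype=int)
    for i in range(rows):
        for j in range(cols):
            sub = first[i * block:(i + 1) * block, j * block:(j + 1) * block]
            if np.any(sub):  # only non-zero sub-matrices are mapped to cores
                nonzero_subs[(i, j)] = sub
                layer2[i, j] = 1
    return nonzero_subs, layer2
```

For a sparse first matrix, only the few non-zero blocks appear in `nonzero_subs`, which is the compression the embodiments exploit.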
  • the processing core corresponding to the second sub-matrix of the Nth layer is the Nth layer processing core.
  • the second sub-matrix of the first layer is the sub-matrix of the first matrix.
  • the first-layer processing core calculates its sub-matrix of the first matrix and stores the operation result matrix of that sub-matrix; the (N+1)th-layer processing core determines, according to the coordinates in the (N+1)th-layer second matrix corresponding to the Nth-layer processing core before the transposition, the coordinates of that Nth-layer processing core in the operation result matrix of the (N+1)th-layer second matrix after the transposition.
  • In this way, the first-layer processing core that stores the operation result matrix of a first-layer second sub-matrix, that is, the operation result matrix of a sub-matrix of the first matrix, can be determined layer by layer according to the correspondence between the processing cores and the coordinates stored in each layer of processing cores.
  • FIG. 5 is a flowchart of some steps in the task mapping method in the embodiment of the present application.
  • the processing core corresponding to the second sub-matrix of the Nth layer is the Nth layer processing core;
  • the foregoing step S121 or the foregoing step S41 may specifically include the following steps.
  • S51: transmit the first-layer non-zero second sub-matrices in the first-layer second matrix to the first-layer processing cores, so that each first-layer processing core performs the matrix operation on its corresponding first-layer non-zero second sub-matrix.
  • S52: transmit the first correspondence between the Nth-layer processing cores and the first coordinates to the (N+1)th-layer processing cores, so that each (N+1)th-layer processing core determines, according to the first correspondence, the second correspondence between the Nth-layer processing cores and the second coordinates. The first coordinate is the coordinate, in the (N+1)th-layer second matrix, of the element corresponding to an Nth-layer non-zero second sub-matrix; the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)th-layer second matrix.
  • the operation result matrix is a transpose matrix
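When the operation result matrix is a transpose matrix, the second correspondence of step S52 reduces to a plain coordinate swap. A minimal sketch; the dictionary layout and core identifiers are hypothetical:

```python
def second_correspondence(first_corr):
    """Derive the second correspondence for a transpose operation: an
    element at first coordinate (row, col) in the (N+1)th-layer second
    matrix sits at (col, row) in the operation result (transposed) matrix.
    `first_corr` maps a hypothetical core id to its first coordinate."""
    return {core: (col, row) for core, (row, col) in first_corr.items()}
```

Each (N+1)th-layer core can compute this locally from the first correspondence it receives, without any data from off-chip storage.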
  • The sizes of the second sub-matrices in different layers may be the same or different; this embodiment of the present application does not specifically limit this.
  • FIG. 6 is a flowchart of some steps in the task mapping method in the embodiment of the present application.
  • step S110 includes: S61, determining a target size according to the size of the first matrix, the target size being the size of the second sub-matrices of each layer; S62, determining the multi-layer second matrices according to the first matrix and the target size.
  • the size of the second sub-matrix is not particularly limited.
  • the size of the second sub-matrix may be determined according to the computing capability, storage size, and computing-efficiency requirements of the many-core system; this embodiment of the present application does not specifically limit the second sub-matrix.
  • the second sub-matrix of each layer is a square matrix, and the matrix operation at least includes a matrix transpose operation.
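The choice of square sub-matrices for a transpose operation can be motivated by the block identity: the block of the transposed matrix at block coordinate (j, i) equals the transpose of the block of the original matrix at (i, j). A minimal NumPy sketch of this property, with no real processing cores involved (the function name is an illustrative assumption):

```python
import numpy as np

def blockwise_transpose(first, block):
    """Transpose `first` by transposing each block x block sub-matrix in
    place (as a first-layer core would) and relocating it to the swapped
    block coordinate: result block (j, i) = (first block (i, j)).T."""
    n = first.shape[0] // block
    result = np.zeros_like(first.T)
    for i in range(n):
        for j in range(n):
            sub = first[i * block:(i + 1) * block, j * block:(j + 1) * block]
            # each "core" transposes its own sub-matrix; only the block
            # coordinate, not the block's contents, moves between cores
            result[j * block:(j + 1) * block, i * block:(i + 1) * block] = sub.T
    return result
```

Because the sub-matrices are square, the transposed block fits exactly into its new position, which is why only coordinate bookkeeping, not data movement, is needed between layers.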
  • In some embodiments, the calculation results of the matrix operation are stored in at least one processing core of the many-core system. When a subsequent operation involving the matrix operation result needs to be performed, the data of the subsequent operation is transmitted to the corresponding processing core, which performs the subsequent operation; there is no need to first transfer the operation result matrices of the sub-matrices of the first matrix calculated by the processing cores to off-chip storage such as memory and then read them back from off-chip storage. This reduces the probability of repeated data transfer and improves the efficiency of matrix operations on very large-scale sparse matrices.
  • FIG. 7 is a flowchart of a task mapping method in an embodiment of the present application.
  • The same reference numerals are used in FIG. 7 for the same steps as in FIG. 1. As shown in FIG. 7, in some embodiments, after the above step S120, the method further includes the following steps.
  • S130: determine a target processing core according to the mapping relationship between at least one second sub-matrix in the multi-layer second matrices and the multiple processing cores in the many-core system. The target processing core is the one, among the multiple processing cores, that stores the target data; the target data is the operation result matrix of the sub-matrix of the first matrix corresponding to the task data of the task to be processed.
  • S140 Transmit the task data to the target processing core, so that the target processing core reads the target data and executes an operation corresponding to the task data.
  • The multi-layer second matrices are determined according to the first matrix, and each layer of the second matrix is divided into at least one second sub-matrix; the elements in the (N+1)th-layer second matrix are in one-to-one correspondence with the multiple second sub-matrices in the Nth-layer second matrix, and the first-layer second sub-matrices are sub-matrices of the first matrix. Each of the multiple processing cores corresponds to a second sub-matrix and stores the calculation result obtained by performing the matrix operation on that second sub-matrix; N is a positive integer.
  • the operation result matrix is a transpose matrix
  • the obtained matrix operation result is stored in the processing core.
  • the control core executes steps S130 to S140, and transmits the task data of the task to be processed to the target processing core, and the target processing core executes the operation corresponding to the task data.
  • This embodiment of the present application does not specifically limit the operation corresponding to the task data; it may be, for example, a matrix multiplication or a matrix addition/subtraction between the operation result matrix of the sub-matrix of the first matrix and the task data.
  • the control core can be any processing core in the many-core system.
  • The control core may be any processing core other than the multiple processing cores that perform the matrix operations on the second sub-matrices and store the calculation results; it may also be one of those multiple processing cores.
  • This embodiment of the present application does not specifically limit this. For example, when there are M layers of second matrices in total, and the first layer of second matrices is the first matrix, the M-th layer processing core corresponding to the M-th layer of the second matrix is the control core.
  • In this way, the calculation results of the matrix operations are stored in at least one processing core of the many-core system. The task data is transmitted to the target processing core, which performs the operation corresponding to the task data, without first transferring the calculation results of the matrix operations on the sub-matrices of the first matrix to off-chip storage such as memory and then reading them back from off-chip storage. This reduces the probability of repeated data transfer and improves the efficiency of matrix operations on very large-scale sparse matrices.
  • At least one second sub-matrix of each layer of second matrix includes at least one non-zero second sub-matrix;
  • the mapping relationship of the processing cores is the mapping relationship between at least one non-zero second sub-matrix and multiple processing cores in the second matrix of each layer;
  • the processing core corresponding to the second sub-matrix of the Nth layer is the processing core of the Nth layer;
  • the first layer processing core stores the calculation result obtained by performing the matrix operation on the corresponding first layer non-zero second sub-matrix;
  • the (N+1)th-layer processing core stores the second correspondence between the Nth-layer processing cores and the second coordinates, determined from the first correspondence between the Nth-layer processing cores and the first coordinates; the first coordinate is the coordinate, in the (N+1)th-layer second matrix, of the element corresponding to an Nth-layer non-zero second sub-matrix; the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)th-layer second matrix;
  • the target processing core is one of the at least one layer 1 processing core.
  • the matrix operation is a transpose operation
  • the operation result matrix is a transpose matrix.
  • FIG. 8 shows a flowchart of some steps in the task mapping method according to an embodiment of the present disclosure.
  • the foregoing step S130 may specifically include the following steps.
  • S81: when N is greater than 1, determine the (N-1)th-layer processing core corresponding to the target data according to the second correspondence stored by the Nth-layer processing core corresponding to the target data; S82: when N is equal to 1, take the first-layer processing core that stores the target data as the target processing core. That is, the processing cores corresponding to the target data are determined layer by layer until N equals 1: if N is greater than 1, execute step S81 once and decrease N by 1; if N is still greater than 1 after the decrease, continue to execute step S81; if N is equal to 1 after the decrease, execute step S82.
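The layer-by-layer descent of steps S81 and S82 can be sketched as a simple loop. For illustration, each core's stored second correspondence is reduced here to a single pointer to the core one layer below; the core identifiers and the `cores` mapping are hypothetical:

```python
def find_target_core(cores, entry_core_id, n):
    """Descend from a layer-n processing core to the layer-1 core that
    stores the target data. `cores` maps a hypothetical core id to the
    core one layer below that its stored second correspondence points at
    (None at layer 1, where the descent stops)."""
    core_id = entry_core_id
    while n > 1:                 # S81: consult the current layer-N core
        core_id = cores[core_id]
        n -= 1
    return core_id               # S82: the layer-1 core is the target core
```

In the actual scheme the lookup at each layer is by coordinate rather than a single pointer, but the termination condition (N reaching 1) is the same.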
  • FIG. 9 shows a flowchart of some steps in the task mapping method according to an embodiment of the present disclosure.
  • the task mapping method further includes the following steps.
  • control core may also control multiple processing cores to output the results of the matrix operations in the multiple processing cores to off-chip storage such as a memory.
  • the operation result matrix of the first matrix may be obtained from off-chip storage. It should be noted that when the results of the matrix operations in the multiple processing cores are output to off-chip storage such as memory, they are directly stored as the operation result matrix of the first matrix.
  • the results of the matrix operations in the multiple processing cores are output to the memory.
  • In some embodiments, the control core determines the address in off-chip storage of the matrix operation result stored in each processing core, so as to ensure that after the processing cores write the elements of their stored matrix operation results into off-chip storage, the elements can be concatenated into the operation result matrix of the first matrix.
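Assuming a row-major layout of the operation result matrix in off-chip storage, the address calculation performed by the control core might be sketched as follows (the function name, parameters, and 4-byte element size are illustrative assumptions):

```python
def offchip_offset(block_row, block_col, block_size, result_cols, elem_bytes=4):
    """Byte offset in off-chip storage of element (0, 0) of the result
    block that a first-layer core stores at block coordinate
    (block_row, block_col), assuming the operation result matrix of the
    first matrix is laid out row-major with `result_cols` columns."""
    first_elem = block_row * block_size * result_cols + block_col * block_size
    return first_elem * elem_bytes
```

Each first-layer core would then write its stored result block row by row starting at this offset, so that the blocks concatenate into the full operation result matrix of the first matrix.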
  • FIG. 10 shows a flowchart of a task processing method according to an embodiment of the present application.
  • an embodiment of the present application provides a task processing method, and the method includes the following steps.
  • S1010 Receive the first correspondence between the processing core of the Nth layer and the first coordinate in the multi-layer second matrix; wherein, N is a positive integer.
  • S1020 Determine a second correspondence between the processing core of the Nth layer and the second coordinate according to the first correspondence.
  • Each layer of the second matrix in the multi-layer second matrices is divided into at least one second sub-matrix; the elements in the (N+1)th-layer second matrix are in one-to-one correspondence with the multiple second sub-matrices in the Nth-layer second matrix; each of the multiple processing cores in the many-core system corresponds to a second sub-matrix; the first coordinate is the coordinate, in the (N+1)th-layer second matrix, of the element corresponding to an Nth-layer second sub-matrix; the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)th-layer second matrix;
  • N is a positive integer.
  • In this way, a processing core of the many-core system can determine, according to the coordinates in the (N+1)th-layer second matrix corresponding to an Nth-layer processing core before the transposition, the coordinates of that core in the operation result matrix of the (N+1)th-layer second matrix after the transposition. The many-core system can thus determine the multi-layer second matrices according to the first matrix on which the matrix operation is to be performed, map the second sub-matrices of at least one layer of the second matrix to multiple processing cores to perform the matrix operations, and finally obtain the matrix operation result of the first matrix. This achieves a higher compression rate for the coordinate dimension of the elements in the matrix and greatly reduces memory overhead; the calculation results of the matrix operations are stored in the processing cores of the many-core system without being written to off-chip storage such as memory, which also reduces the probability of repeated data transfer and improves the efficiency of matrix operations on very large-scale sparse matrices.
  • FIG. 11 shows a flowchart of some steps in the task processing method in the embodiment of the present application.
  • each processing core corresponds to a second sub-matrix and corresponding operation instructions.
  • step S1020 may specifically include the following steps.
  • S1110 receiving an operation instruction
  • S1120 determining a second correspondence between the Nth layer processing core and the second coordinate according to the operation instruction and the first correspondence.
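A hedged sketch of step S1120, modeling only the transpose instruction described in the embodiments; the instruction encoding, function name, and core identifiers are assumptions:

```python
def determine_second_correspondence(instruction, first_corr):
    """Step S1120 sketch: derive the second correspondence from the
    received operation instruction and the first correspondence.
    `first_corr` maps a hypothetical core id to its first coordinate in
    the (N+1)th-layer second matrix."""
    if instruction != "transpose":
        raise NotImplementedError(f"unsupported operation instruction: {instruction}")
    # under a transpose, the element at (row, col) in the (N+1)th-layer
    # second matrix sits at (col, row) in the operation result matrix
    return {core: (col, row) for core, (row, col) in first_corr.items()}
```

Other operation instructions would supply a different coordinate rule here, which is why the second correspondence depends on both the instruction and the first correspondence.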
  • In this way, a processing core of the many-core system can determine, according to the operation instruction and the coordinates in the (N+1)th-layer second matrix corresponding to an Nth-layer processing core before the transposition, the coordinates of that core in the operation result matrix of the (N+1)th-layer second matrix after the transposition. The many-core system can thus determine the multi-layer second matrices according to the first matrix on which the matrix operation is to be performed, map the second sub-matrices of at least one layer of the second matrix to multiple processing cores to perform the matrix operations, and finally obtain the matrix operation result of the first matrix. This achieves a higher compression rate for the coordinate dimension of the elements in the matrix and greatly reduces memory overhead; the calculation results are stored in the processing cores of the many-core system without being written to off-chip storage such as memory, which also reduces the probability of repeated data transfer and improves the efficiency of matrix operations on very large-scale sparse matrices.
  • FIG. 12 is a block diagram of the task mapping apparatus in the embodiment of the present application; as shown in FIG. 12, in some embodiments, the task mapping apparatus 1200 may include the following modules.
  • the matrix determination module 1210 is configured to determine a multi-layer second matrix according to the first matrix, and each layer of the second matrix is divided into at least one second sub-matrix; wherein, the elements in the N+1th layer of the second matrix are the same as the Nth layer of the second matrix.
  • the multiple second sub-matrices of the second matrix are in one-to-one correspondence, and the second sub-matrix of the first layer is a sub-matrix of the first matrix; wherein, N is a positive integer.
  • the matrix mapping module 1220 is used to map the second sub-matrices in each layer of the second matrix to the processing cores in the many-core system, with each processing core corresponding to a second sub-matrix, so that the processing core performs the matrix operation on its corresponding second sub-matrix and stores the operation result.
  • In some embodiments, at least one second sub-matrix of each layer of the second matrix includes at least one non-zero second sub-matrix; the matrix mapping module 1220 is specifically configured to map the non-zero second sub-matrices in each layer of the second matrix to the processing cores in the many-core system, respectively.
  • the matrix calculation performed by the processing cores is based on calculation instructions, and each processing core corresponds to a second sub-matrix and a corresponding operation instruction;
  • the matrix mapping module 1220 includes: an instruction determination unit, configured to determine the operation instruction corresponding to at least one second sub-matrix; the matrix mapping module 1220 is specifically used to map each second sub-matrix in each layer of the second matrix and its corresponding operation instruction to a processing core in the many-core system, so that the processing core performs the matrix operation on its corresponding second sub-matrix according to its corresponding operation instruction and stores the calculation result.
  • In some embodiments, when mapping the second sub-matrices in each layer of the second matrix and their corresponding operation instructions to the processing cores in the many-core system, the matrix mapping module 1220 is specifically used to: map the non-zero second sub-matrices in each layer of the second matrix to the processing cores in the many-core system, respectively; and configure the operation instructions corresponding to the non-zero second sub-matrices to those processing cores of the many-core system.
  • the processing core corresponding to the second sub-matrix of the N-th layer is the processing core of the N-th layer;
  • the matrix mapping module 1220, when mapping the non-zero second sub-matrices in each layer of the second matrix to the processing cores in the many-core system, is specifically used to: transmit the first-layer non-zero second sub-matrices in the first-layer second matrix to the first-layer processing cores, so that each first-layer processing core performs the matrix operation on its corresponding first-layer non-zero second sub-matrix; and transmit the first correspondence between the Nth-layer processing cores and the first coordinates to the (N+1)th-layer processing cores, so that each (N+1)th-layer processing core determines the second correspondence between the Nth-layer processing cores and the second coordinates according to the first correspondence. The first coordinate is the coordinate, in the (N+1)th-layer second matrix, of the element corresponding to an Nth-layer non-zero second sub-matrix; the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)th-layer second matrix.
  • the matrix operation is a transpose operation
  • the operation result matrix is a transpose matrix.
  • the matrix determination module 1210 is specifically configured to: determine a target size according to the size of the first matrix, the target size being the size of the second sub-matrices of each layer; and determine the multi-layer second matrices according to the first matrix and the target size.
  • In some embodiments, the task mapping apparatus 1200 further includes: a target core determination module, configured to, after the second sub-matrices in each layer of the second matrix are mapped to the processing cores in the many-core system, determine the target processing core according to the mapping relationship between at least one second sub-matrix in the multi-layer second matrices and the multiple processing cores in the many-core system, where the target processing core is the one, among the multiple processing cores, that stores the target data, and the target data is the operation result matrix of the sub-matrix of the first matrix corresponding to the task data of the task to be processed; and a data transmission module, used to transmit the task data to the target processing core, so that the target processing core reads the target data and executes the operation corresponding to the task data.
  • the operation result matrix is a transpose matrix
  • At least one second sub-matrix of each layer of second matrix includes at least one non-zero second sub-matrix;
  • the mapping relationship of the processing cores is the mapping relationship between at least one non-zero second sub-matrix and multiple processing cores in the second matrix of each layer;
  • the processing core corresponding to the second sub-matrix of the Nth layer is the processing core of the Nth layer;
  • the first layer processing core stores the calculation result obtained by performing the matrix operation on the corresponding first layer non-zero second sub-matrix;
  • the (N+1)th-layer processing core stores the second correspondence between the Nth-layer processing cores and the second coordinates, determined from the first correspondence between the Nth-layer processing cores and the first coordinates; the first coordinate is the coordinate, in the (N+1)th-layer second matrix, of the element corresponding to an Nth-layer non-zero second sub-matrix; the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)th-layer second matrix;
  • the target processing core is one of the at least one layer 1 processing core.
  • the matrix operation is a transpose operation
  • the operation result matrix is a transpose matrix.
  • the target core determination module is specifically used to: when N is greater than 1, determine the (N-1)th-layer processing core corresponding to the target data according to the second correspondence stored by the Nth-layer processing core corresponding to the target data; and when N is equal to 1, take the first-layer processing core storing the target data as the target processing core.
  • In some embodiments, the task mapping apparatus 1200 further includes: an address determination module, configured to determine the address of the storage space in off-chip storage for at least one target calculation result according to the second correspondence stored in each layer of processing cores, where the target calculation result is the calculation result obtained and stored by a first-layer processing core performing the matrix operation on its corresponding first-layer non-zero second sub-matrix; and an address writing module, used to control at least one first-layer processing core to write its target calculation result to the determined address in off-chip storage, so that the written results can be concatenated into the operation result matrix of the first matrix.
  • the second sub-matrix of each layer is a square matrix; the matrix operation at least includes a matrix transpose operation.
  • In the embodiments of the present application, a scheme for performing matrix operations on sparse matrices using a many-core system is provided. The multi-layer second matrices are determined according to the first matrix on which the matrix operation is performed, and the scale of each second sub-matrix in each layer of the second matrix is much smaller than the scale of the first matrix; the second sub-matrices of at least one layer of the second matrix are mapped to multiple processing cores in the many-core system to perform the matrix operations, and finally the matrix operation result of the first matrix is obtained. The calculation results of the matrix operations are stored in at least one processing core of the many-core system without being written to off-chip storage such as memory, which reduces the probability of repeated data transfer and improves the efficiency of matrix operations on very large-scale sparse matrices.
  • FIG. 13 is a block diagram of the task processing apparatus in an embodiment of the present application; as shown in FIG. 13, in some embodiments, the task processing apparatus 1300 may include the following modules.
  • the receiving module 1310 is used to receive the first correspondence between the Nth-layer processing cores and the first coordinates in the multi-layer second matrices, where N is a positive integer; the relationship determination module 1320 is used to determine the second correspondence between the Nth-layer processing cores and the second coordinates according to the first correspondence.
  • Each layer of the second matrix in the multi-layer second matrices is divided into at least one second sub-matrix; the elements in the (N+1)th-layer second matrix are in one-to-one correspondence with the multiple second sub-matrices in the Nth-layer second matrix; each of the multiple processing cores in the many-core system corresponds to a second sub-matrix; the first coordinate is the coordinate, in the (N+1)th-layer second matrix, of the element corresponding to an Nth-layer second sub-matrix; the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)th-layer second matrix.
  • each processing core corresponds to a second sub-matrix and a corresponding operation instruction
  • the relationship determination module 1320 includes an instruction receiving module for receiving an operation instruction, and the relationship determination module 1320 is specifically used to determine the second correspondence between the Nth-layer processing cores and the second coordinates according to the operation instruction and the first correspondence.
  • In this way, the processing cores of the many-core system can determine, according to the coordinates in the (N+1)th-layer second matrix corresponding to the Nth-layer processing cores before the transposition, the coordinates of those cores in the operation result matrix of the (N+1)th-layer second matrix after the transposition. The many-core system can thus determine the multi-layer second matrices according to the first matrix on which the matrix operation is to be performed, map the second sub-matrices of at least one layer of the second matrix to multiple processing cores to perform the matrix operations, and finally obtain the matrix operation result of the first matrix. This achieves a higher compression rate for the coordinate dimension of the elements in the matrix and greatly reduces memory overhead; the calculation results of the matrix operations are stored in the processing cores of the many-core system without being written to off-chip storage such as memory, which also reduces the probability of repeated data transfer and improves the efficiency of matrix operations on very large-scale sparse matrices.
  • FIG. 14 is a block diagram of an electronic device provided by an embodiment of the present application.
  • As shown in FIG. 14, an embodiment of the present application provides an electronic device, including: one or more processors 1401; a memory 1402 on which one or more programs are stored, the one or more programs being executed by the one or more processors so that the one or more processors implement the task mapping method of the embodiments of the present application; and one or more I/O interfaces 1403, connected between the processors and the memory, configured to implement information exchange between the processors and the memory.
  • The processor 1401 is a device with data processing capability, including but not limited to a central processing unit (CPU); the memory 1402 is a device with data storage capability, including but not limited to random access memory (RAM, more specifically SDRAM, DDR, etc.), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), and flash memory (FLASH); the I/O interface (read/write interface) 1403 is connected between the processor 1401 and the memory 1402 and can realize information interaction between the processor 1401 and the memory 1402, including but not limited to a data bus (Bus) and the like.
  • processor 1401, memory 1402, and I/O interface 1403 are interconnected by bus 1404, which in turn is connected to other components of the computing device.
  • FIG. 15 is a block diagram of the composition of a processing core provided by an embodiment of the present application.
  • As shown in FIG. 15, an embodiment of the present application provides a processing core, including a computing unit 1501 and a cache 1502; the computing unit 1501 can implement the task mapping method and/or the task processing method of the embodiments of the present application.
  • FIG. 16 is a block diagram of the composition of a many-core system provided by an embodiment of the present application.
  • As shown in FIG. 16, an embodiment of the present application provides a many-core system, including: a plurality of processing cores 1601; and an on-chip network 1602 configured to exchange data among the plurality of processing cores 1601 and with external data. One or more instructions are stored in one or more of the processing cores 1601 and are executed by one or more processing cores 1601, so that the one or more processing cores 1601 can execute the task mapping method and/or the task processing method of the above embodiments of the present application.
  • Embodiments of the present invention also provide a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor of the electronic device executes the task mapping method or the task processing method of any embodiment of the present application.
  • Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
  • As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules, or other data.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer.
  • Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media, as is well known to those of ordinary skill in the art.


Abstract

The present application provides a task mapping method, a task processing method, a processing core, and an electronic device. The method includes: determining multiple layers of second matrices according to a first matrix, each layer of second matrix being divided into at least one second sub-matrix, wherein the elements of the (N+1)-th layer second matrix correspond one-to-one to the multiple second sub-matrices of the N-th layer second matrix, the first-layer second sub-matrices are sub-matrices of the first matrix, and N is a positive integer (S110); and mapping the second sub-matrices of each layer of second matrix to processing cores in a many-core system, each processing core corresponding to one second sub-matrix, so that each processing core performs the matrix operation of its corresponding second sub-matrix and stores the operation result (S120).

Description

Task mapping method, task processing method, processing core, and electronic device — Technical field
The present application relates to the field of computer technology, and in particular to a task mapping method, a task processing method, a processing core, and an electronic device.
Background
A sparse matrix is a matrix in which the number of zero elements far exceeds the number of non-zero elements and the non-zero elements are distributed irregularly. Sparse matrices are widely used in real life; in particular, they frequently appear in high-performance computing and machine learning, for example in count data, encodings of mapped categories, and machine-learning subfields such as natural language processing (NLP).
Summary
The present application provides a task mapping method, a task processing method, a processing core, and an electronic device.
In a first aspect, the present application provides a task mapping method, including: determining multiple layers of second matrices according to a first matrix, each layer of second matrix being divided into at least one second sub-matrix, wherein the elements of the (N+1)-th layer second matrix correspond one-to-one to the multiple second sub-matrices of the N-th layer second matrix, the first-layer second sub-matrices are sub-matrices of the first matrix, and N is a positive integer; and mapping the second sub-matrices of each layer of second matrix to processing cores in a many-core system, each processing core corresponding to one second sub-matrix, so that each processing core performs the matrix operation of its corresponding second sub-matrix and stores the operation result.
In some embodiments, the at least one second sub-matrix of each layer of second matrix includes at least one non-zero second sub-matrix, and mapping the second sub-matrices of each layer of second matrix to processing cores in the many-core system includes: mapping the non-zero second sub-matrices of each layer of second matrix to processing cores in the many-core system respectively.
In some embodiments, the matrix computation performed by a processing core is a computation based on an operation instruction, and each processing core corresponds to one second sub-matrix and a corresponding operation instruction.
Mapping the second sub-matrices of each layer of second matrix to processing cores in the many-core system, each processing core corresponding to one second sub-matrix, so that each processing core performs the matrix operation of its corresponding second sub-matrix and stores the operation result, includes: determining the operation instruction corresponding to at least one second sub-matrix; and mapping the second sub-matrices of each layer of second matrix and their corresponding operation instructions to processing cores in the many-core system, so that each processing core performs the matrix operation on its corresponding second sub-matrix according to its corresponding operation instruction and stores the computation result.
In some embodiments, mapping the second sub-matrices of each layer of second matrix and their corresponding operation instructions to processing cores in the many-core system includes: mapping the non-zero second sub-matrices of each layer of second matrix to processing cores in the many-core system respectively; and configuring the operation instructions corresponding to at least one non-zero second sub-matrix into the processing cores of the many-core system.
In some embodiments, the processing core corresponding to an N-th layer second sub-matrix is an N-th layer processing core. Mapping the non-zero second sub-matrices of each layer of second matrix to processing cores in the many-core system respectively includes: transmitting the first-layer non-zero second sub-matrices of the first-layer second matrix to first-layer processing cores, so that each first-layer processing core performs the matrix operation of its corresponding first-layer non-zero second sub-matrix; and transmitting a first correspondence between N-th layer processing cores and first coordinates to an (N+1)-th layer processing core, so that the (N+1)-th layer processing core determines, according to the first correspondence, a second correspondence between the N-th layer processing cores and second coordinates. The first coordinate is the coordinate, in the (N+1)-th layer second matrix, of the element of the (N+1)-th layer second matrix corresponding to an N-th layer non-zero second sub-matrix; the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)-th layer second matrix.
In some embodiments, determining multiple layers of second matrices according to the first matrix includes: determining a target size according to the size of the first matrix, the target size being the size of the second sub-matrices of each layer; and determining the multiple layers of second matrices according to the first matrix and the target size.
In some embodiments, after mapping the second sub-matrices of each layer of second matrix to processing cores in the many-core system, the method further includes: determining a target processing core according to the mapping relationship between at least one second sub-matrix of the multiple layers of second matrices and the multiple processing cores in the many-core system, the target processing core being at least one of the multiple processing cores that stores target data, and the target data being the operation result matrix of the sub-matrix of the first matrix corresponding to the task data of a task to be processed; and transmitting the task data to the target processing core, so that the target processing core reads the target data and executes the operation corresponding to the task data.
In some embodiments, the at least one second sub-matrix of each layer of second matrix includes at least one non-zero second sub-matrix; the mapping relationship between at least one second sub-matrix of the multiple layers of second matrices and the multiple processing cores in the many-core system is the mapping relationship between the at least one non-zero second sub-matrix of each layer of second matrix and the multiple processing cores; the processing core corresponding to an N-th layer second sub-matrix is an N-th layer processing core; each first-layer processing core stores the computation result obtained by performing the matrix operation on its corresponding first-layer non-zero second sub-matrix; each (N+1)-th layer processing core stores the second correspondence between the N-th layer processing cores and second coordinates determined according to the first correspondence between the N-th layer processing cores and first coordinates; the first coordinate is the coordinate, in the (N+1)-th layer second matrix, of the element of the (N+1)-th layer second matrix corresponding to an N-th layer non-zero second sub-matrix; the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)-th layer second matrix; and the target processing core is one of the at least one first-layer processing core.
In some embodiments, determining the target processing core according to the mapping relationship between at least one second sub-matrix of the multiple layers of second matrices and the multiple processing cores in the many-core system includes: when N is greater than 1, determining the (N-1)-th layer processing core corresponding to the target data according to the second correspondence stored in the N-th layer processing core corresponding to the target data; and when N equals 1, taking the first-layer processing core that stores the target data as the target processing core.
In some embodiments, the task mapping method further includes: determining, according to the second correspondences stored in the processing cores of each layer, the address of the storage space in off-chip storage of at least one target computation result, the target computation result being the computation result stored by a first-layer processing core after performing the matrix operation of its corresponding first-layer non-zero second sub-matrix; and controlling at least one first-layer processing core to write its stored target computation result into the off-chip storage according to the address of the storage space of that result in the off-chip storage, where the at least one target computation result is stitched together in the off-chip storage into the operation result matrix of the first matrix.
In some embodiments, each second sub-matrix of each layer is a square matrix, and the matrix operation includes at least a matrix transpose operation.
In a second aspect, an embodiment of the present application provides a task processing method, including: receiving a first correspondence between N-th layer processing cores of multiple layers of second matrices and first coordinates, where N is a positive integer; and determining, according to the first correspondence, a second correspondence between the N-th layer processing cores and second coordinates. Each layer of second matrix of the multiple layers of second matrices is divided into at least one second sub-matrix; the elements of the (N+1)-th layer second matrix correspond one-to-one to the multiple second sub-matrices of the N-th layer second matrix; each of the multiple processing cores in the many-core system corresponds to one second sub-matrix; the first coordinate is the coordinate, in the (N+1)-th layer second matrix, of the element of the (N+1)-th layer second matrix corresponding to an N-th layer second sub-matrix; and the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)-th layer second matrix.
In some embodiments, each processing core corresponds to one second sub-matrix and a corresponding operation instruction, and determining the second correspondence between the N-th layer processing cores and the second coordinates according to the first correspondence includes: receiving an operation instruction; and determining the second correspondence between the N-th layer processing cores and the second coordinates according to the operation instruction and the first correspondence.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the task mapping method of the first aspect; and one or more I/O interfaces connected between the processors and the memory and configured to enable information exchange between the processors and the memory.
In a fourth aspect, an embodiment of the present application provides a processing core, including a computing unit and a cache; the computing unit can implement the task mapping method of the first aspect and/or the task processing method of the second aspect.
In a fifth aspect, an embodiment of the present application provides a many-core system, including: a plurality of processing cores; and an on-chip network configured to exchange data among the plurality of processing cores and with the outside. One or more instructions are stored in one or more of the processing cores and are executed by one or more of the processing cores, so that one or more of the processing cores can execute the task mapping method of the first aspect and/or the task processing method of the second aspect.
In a sixth aspect, an embodiment of the present application provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor of the electronic device executes the task mapping method of the first aspect and/or the task processing method of the second aspect.
The embodiments of the present application provide a scheme for performing matrix operations on sparse matrices using a many-core system. Multiple layers of second matrices are determined according to a first matrix on which a matrix operation is to be performed, the second sub-matrices of each layer of second matrix being far smaller than the first matrix; the second sub-matrices of at least one layer of second matrix are mapped to multiple processing cores in the many-core system to perform the matrix operations, finally yielding the matrix operation result of the first matrix. This achieves a high compression ratio on the coordinate dimension of the matrix elements and greatly reduces memory overhead. The computation results of the matrix operations are stored in at least one processing core of the many-core system without being written to off-chip storage such as memory, which also reduces the probability of repeatedly moving data and improves the efficiency of performing matrix operations on ultra-large-scale sparse matrices.
It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present application, nor is it intended to limit the scope of the present application. Other features of the present application will become readily understood from the following description.
Brief description of the drawings
The accompanying drawings are provided for a further understanding of the present application and constitute a part of the specification; together with the embodiments of the present application, they serve to explain the present application and do not limit it. The above and other features and advantages will become more apparent to those skilled in the art from the detailed description of exemplary embodiments with reference to the drawings, in which:
FIG. 1 is a flowchart of a task mapping method in an embodiment of the present application;
FIG. 2 is a schematic diagram of the mapping of multiple layers of second matrices to processing cores in an embodiment of the present application;
FIG. 3 is a flowchart of some steps of the task mapping method in an embodiment of the present application;
FIG. 4 is a flowchart of some steps of the task mapping method in an embodiment of the present application;
FIG. 5 is a flowchart of some steps of the task mapping method in an embodiment of the present application;
FIG. 6 is a flowchart of some steps of the task mapping method in an embodiment of the present application;
FIG. 7 is a flowchart of the task mapping method in an embodiment of the present application;
FIG. 8 is a flowchart of some steps of the task mapping method in an embodiment of the present application;
FIG. 9 is a flowchart of some steps of the task mapping method in an embodiment of the present application;
FIG. 10 is a flowchart of a task processing method in an embodiment of the present application;
FIG. 11 is a flowchart of some steps of the task processing method in an embodiment of the present application;
FIG. 12 is a block diagram of a task mapping apparatus in an embodiment of the present application;
FIG. 13 is a block diagram of a task processing apparatus in an embodiment of the present application;
FIG. 14 is a block diagram of an electronic device provided by an embodiment of the present application;
FIG. 15 is a block diagram of a processing core provided by an embodiment of the present application;
FIG. 16 is a block diagram of a many-core system provided by an embodiment of the present application.
Detailed description
To enable those skilled in the art to better understand the technical solutions of the present application, exemplary embodiments of the present application are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding; they should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.
Where no conflict arises, the embodiments of the present application and one or more features in the embodiments may be combined with one another.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the present application. As used herein, the singular forms "a" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that when the terms "comprises" and/or "made of" are used in this specification, they specify the presence of the stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Words such as "connected" or "coupled" are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will also be understood that terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and the present application, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In some related art, when performing operations such as transposition on a sparse matrix and storing the sparse matrix, the values of one or more elements of the sparse matrix and their coordinates in the sparse matrix must be stored. The larger the sparse matrix, the more bits are needed to store the coordinates of each element; for example, storing the coordinates of an element in a ten-billion-dimension matrix requires 35 bits each for the row and the column. For an ultra-large-scale sparse matrix, not only is a large amount of storage space occupied by element coordinates, but operations involving the sparse matrix are also slow. The larger the sparse matrix, the more storage space it occupies, the larger the amount of computation, and the lower the efficiency of operations involving it. Moreover, in some related art, after a matrix operation on a sparse matrix is performed, the result must be transferred to memory (for example, double data rate synchronous dynamic random access memory (DDR)), where the complete operation result matrix of the sparse matrix is stitched together; subsequent operations must then read the operation result matrix of the sparse matrix back from memory. This repeated movement of data further reduces the efficiency of operations involving sparse matrices.
FIG. 1 is a flowchart of a task mapping method in an embodiment of the present application.
Referring to FIG. 1, an embodiment of the present application provides a task mapping method including the following steps.
In step S110, multiple layers of second matrices are determined according to a first matrix, each layer of second matrix being divided into at least one second sub-matrix, wherein the elements of the (N+1)-th layer second matrix correspond one-to-one to the multiple second sub-matrices of the N-th layer second matrix, the first-layer second sub-matrices are sub-matrices of the first matrix, and N is a positive integer.
In step S120, the second sub-matrices of each layer of second matrix are mapped to processing cores in a many-core system, each processing core corresponding to one second sub-matrix, so that each processing core performs the matrix operation of its corresponding second sub-matrix and stores the operation result.
The matrix operations in the embodiments of the present application may include matrix transpose operations, determining the memory storage location corresponding to a matrix, and so on; the embodiments of the present application place no particular limitation on this.
The embodiments of the present application provide a scheme for performing matrix operations on sparse matrices using a many-core system. In the embodiments, the many-core system may consist of a single chip with multiple processing cores, a processing core being the smallest computing unit in the many-core system that can be scheduled independently and possesses complete computing capability; the many-core system may also consist of multiple chips, each of which may have multiple processing cores. The embodiments of the present application place no particular limitation on this.
It should be noted that in the embodiments of the present application, the multiple processing cores of the many-core system may each run program instructions independently, or may work jointly, using parallel computing to speed up program execution and provide multitasking capability. It should also be noted that each processing core of the many-core system has an independent cache capable of storing data such as the computation results the core produces.
In the embodiments of the present application, the matrix operation of the first matrix can be performed through steps S110 to S120, where the first matrix is a sparse matrix. Among the multiple layers of second matrices determined in step S110, the first-layer second matrix is the first matrix itself; the multiple first-layer second sub-matrices obtained by dividing the first-layer second matrix are the sub-matrices of the first matrix obtained by dividing the first matrix according to the same rule.
As an optional implementation, each layer of second matrix is a real matrix whose elements are all real numbers. For example, if the N-th layer second sub-matrix corresponding to an element of the (N+1)-th layer second matrix is a non-zero matrix, the element is 1; if the corresponding N-th layer second sub-matrix is a zero matrix, the element is 0. Other values may also be used as elements of the second matrices; the embodiments of the present application place no particular limitation on this.
FIG. 2 shows a schematic diagram of the mapping of multiple layers of second matrices to processing cores in an embodiment of the present application.
As shown in FIG. 2, the first matrix is divided into sixteen 4*4 sub-matrices A, B, C, D, E, F, G, H, I, J, K, L, O, P, Q, and R, corresponding to the first-layer second matrix. The second-layer second matrix contains 16 elements, each corresponding one-to-one to one of the 16 sub-matrices of the first matrix: the element in the first row, first column corresponds to 4*4 sub-matrix A; the element in the first row, second column corresponds to 4*4 sub-matrix B (since B is a zero matrix, the label B is not shown in the second-layer second matrix); and so on, up to the element in the fourth row, fourth column, which corresponds to 4*4 sub-matrix R (likewise not shown, since R is a zero matrix). It should be noted that the sub-matrices obtained by dividing the first matrix may be non-zero matrices or zero matrices; when the first matrix, a sparse matrix, is sufficiently sparse, the number of zero matrices obtained by dividing it can exceed the number of non-zero matrices. The second sub-matrices obtained by dividing each layer of second matrix may likewise be non-zero or zero: in FIG. 2, the 4*4 sub-matrices A, D, K, and P are non-zero, while B, C, E, F, G, H, I, J, L, O, Q, and R are zero.
Taking matrix transposition (also called transpose or a transpose operation) as an example of the matrix operation, the processing cores of the many-core system can perform their respective 4*4 matrix transposes in parallel. The zero matrices B, C, E, F, G, H, I, J, L, O, Q, and R need neither storage nor transposition; after the processing cores of the many-core system transpose their corresponding non-zero matrices A, D, K, and P, the transposed matrices (operation result matrices) A1, D1, K1, and P1 are obtained.
In FIG. 2, a control core can perform the high-layer transpose based on the transpose results produced by the multiple cores of the many-core system. The control core needs to store only the position coordinates of the non-zero matrices, so the amount of data stored is very small. The result after the control core's transpose (the transpose result of the second-layer second matrix shown in FIG. 2) is the transpose result of the first-layer second matrix, in which non-zero matrix D swaps positions with zero matrix O, and non-zero matrix P swaps positions with zero matrix H.
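The two-level transpose illustrated by FIG. 2 can be sketched as a plain-Python host-side simulation (a hypothetical illustration; the function name and list-of-lists layout are our own, not from the patent): each non-zero b*b block is transposed locally, as a first-layer processing core would do, and the block coordinates are swapped, as the control core's high-layer transpose does.

```python
def transpose_blocked(matrix, b):
    """Transpose an n*n list-of-lists by b*b blocks: transpose each
    non-zero block locally (the layer-1 cores' work), and place the
    result at the transposed block position (the control core's work)."""
    n = len(matrix)
    result = [[0] * n for _ in range(n)]
    for bi in range(0, n, b):
        for bj in range(0, n, b):
            block = [row[bj:bj + b] for row in matrix[bi:bi + b]]
            if any(any(row) for row in block):   # zero blocks are skipped
                for i in range(b):
                    for j in range(b):
                        # element (i, j) of the block at (bi, bj) lands at
                        # (j, i) of the block at (bj, bi) in the result
                        result[bj + j][bi + i] = block[i][j]
    return result
```

Blocks that are entirely zero are never stored or transposed, mirroring how the patent's scheme skips zero second sub-matrices.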
It should be noted that in step S110, when there are M layers of second matrices in total and the first-layer second matrix is the first matrix, the first-layer, second-layer, ..., (M-1)-th layer second matrices are each divided into multiple second sub-matrices, while the M-th layer second matrix is divided into a single second sub-matrix, i.e., the M-th layer second sub-matrix is the M-th layer second matrix itself, where M is an integer greater than or equal to N.
In the embodiments of the present application, step S120 maps the second sub-matrices of the multiple layers to the multiple processing cores of the many-core system, and the processing cores perform the matrix operations of their corresponding second sub-matrices and store the computation results. As an example, it follows from the rules of matrix algebra that, in the embodiments of the present application, the multiple processing cores of the many-core system each perform the matrix operations of one or more layers of second sub-matrices, thereby obtaining the operation result matrix of the first matrix.
As an optional implementation, through precompilation, the multiple processing cores in step S120 are already configured with the operation instructions, such as operators and parameters, needed to perform matrix operations on the second sub-matrices; for example, a processing core can determine the size of the matrix it computes. The second sub-matrices are transmitted to the processing cores as data streams over the network on chip (NOC).
In the embodiments of the present application, steps S110-S120 may be executed by a control core in the many-core system, where the control core may be any processing core in the many-core system. For example, the control core may be any processing core other than the multiple processing cores of step S120, or it may be one of them; specifically, when there are M layers of second matrices in total and the first-layer second matrix is the first matrix, the M-th layer processing core corresponding to the M-th layer second matrix is the control core. The embodiments of the present disclosure place no particular limitation on this.
In some embodiments, the control core determines, by executing step S120, the operation instruction corresponding to at least one second sub-matrix; the operation instruction includes the operators, parameters, and so on needed for a processing core to perform the matrix operation on its second sub-matrix. For example, the operation instruction tells the processing core the size of the matrix operation to compute. In the embodiments of the present disclosure, the control core dynamically determines, according to the first matrix, the processing cores that execute the matrix operations, and determines the operation instruction of at least one processing core; the first matrix may be of arbitrary dimension.
In some embodiments, the control core, by executing step S120, maps the multiple layers of second sub-matrices and the operation instructions to the multiple processing cores of the many-core system, and the processing cores perform the matrix operations of their corresponding second sub-matrices and store the computation results. It follows from the rules of matrix algebra that, in the embodiments of the present disclosure, the multiple processing cores of the many-core system each perform the matrix operations of at least one layer of second sub-matrices, thereby obtaining the operation result matrix of the first matrix.
It should be noted that in the embodiments of the present application, the second sub-matrices of at least one layer are all smaller than the first matrix; the storage space the many-core system needs for at least one layer of second sub-matrices and their operation results is significantly smaller than that needed for the first matrix and its operation results, and the amount of computation a processing core performs on a second sub-matrix is significantly smaller than that of a matrix operation on the entire first matrix. For example, as shown in FIG. 2, a 16*16 first matrix is divided into sixteen 4*4 sub-matrices, corresponding to the first-layer second matrix; the second-layer second matrix is also a 4*4 matrix. Storing the coordinates of an element of the 16*16 first matrix requires 4 bits each for the row and the column, whereas a processing core storing the coordinates of an element of one 4*4 sub-matrix requires only 2 bits each. The layered matrix operation scheme of the embodiments of the present application, in which multiple first processing cores perform the matrix operations of the sub-matrices and a second processing core performs the matrix operation of the block matrix, achieves a high compression ratio on the coordinate dimension of matrix elements and greatly reduces memory overhead.
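The coordinate-compression arithmetic from the 16*16 example can be checked with a small hypothetical helper (our own illustration; the function name is not from the patent):

```python
import math

def coord_bits(dim):
    """Bits needed to store one row (or column) index of a dim x dim matrix."""
    return max(1, math.ceil(math.log2(dim)))

# FIG. 2 example: 16*16 first matrix vs. its 4*4 second sub-matrices.
flat_bits = 2 * coord_bits(16)       # 8 bits of coordinates per element
block_bits = 2 * coord_bits(4)       # 4 bits inside a 4*4 block
grid_bits = 2 * coord_bits(16 // 4)  # 4 bits for the block's own coordinate,
                                     # stored once per block, not per element
```

The per-element cost drops from 8 to 4 coordinate bits, with the block's own coordinate amortized over all elements of the block.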
The task mapping method provided by the embodiments of the present application offers a scheme for performing matrix operations on sparse matrices using a many-core system: multiple layers of second matrices are determined according to the first matrix on which a matrix operation is to be performed, the second sub-matrices of each layer being far smaller than the first matrix; the second sub-matrices of at least one layer are mapped to multiple processing cores of the many-core system to perform the matrix operations, finally yielding the matrix operation result of the first matrix. This achieves a high compression ratio on the coordinate dimension of matrix elements and greatly reduces memory overhead. The computation results of the matrix operations are stored in at least one processing core of the many-core system without being written to off-chip storage such as memory, which also reduces the probability of repeatedly moving data and improves the efficiency of performing matrix operations on ultra-large-scale sparse matrices.
In some embodiments of the present application, when step S120 maps the second sub-matrices of each layer of second matrix to processing cores in the many-core system, or maps the second sub-matrices of each layer and the operation instructions to processing cores in the many-core system, either all second sub-matrices of each layer may be mapped to the multiple processing cores, or only the non-zero second sub-matrices of each layer may be mapped to the multiple processing cores.
It should be noted that when the first matrix is a sparse matrix, its sub-matrices include multiple zero matrices. When determining the multiple layers of second matrices, if the N-th layer second sub-matrix corresponding to an element of the (N+1)-th layer second matrix is a non-zero matrix, the element is 1; if it is a zero matrix, the element is 0. Therefore, the second sub-matrices of each layer include non-zero matrices and/or zero matrices. In step S120, mapping only the non-zero second sub-matrices of each layer to the multiple processing cores further reduces the consumption of storage and computing resources. FIG. 2 shows one optional implementation of mapping the non-zero second sub-matrices of each layer to multiple processing cores of the many-core system.
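The layer-2 indicator matrix just described — 1 for a non-zero block, 0 for a zero block — can be built as follows (a hypothetical host-side sketch; the function name and layout are our own):

```python
def build_layer2(matrix, b):
    """Build the layer-2 indicator matrix of an n*n list-of-lists:
    entry (i, j) is 1 if the b*b block at block coordinates (i, j)
    contains any non-zero element, else 0."""
    n = len(matrix)
    layer2 = []
    for bi in range(0, n, b):
        row = []
        for bj in range(0, n, b):
            nonzero = any(matrix[bi + i][bj + j]
                          for i in range(b) for j in range(b))
            row.append(1 if nonzero else 0)
        layer2.append(row)
    return layer2
```

Only the blocks flagged 1 would be transmitted to processing cores; the 0 entries mark blocks that need neither storage nor computation.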
FIG. 3 is a flowchart of some steps of the task mapping method in an embodiment of the present application. In some embodiments, the at least one second sub-matrix of each layer of second matrix includes at least one non-zero second sub-matrix.
Referring to FIG. 3, the step in S120 of mapping the second sub-matrices of each layer of second matrix to processing cores in the many-core system may specifically include: S121, mapping the non-zero second sub-matrices of each layer of second matrix to processing cores in the many-core system respectively.
In the embodiments of the present application, the processing core corresponding to an N-th layer second sub-matrix is an N-th layer processing core. The first-layer second sub-matrices are the sub-matrices of the first matrix; the first-layer processing cores compute the sub-matrices of the first matrix and store their operation result matrices. The (N+1)-th layer processing core determines, from the coordinate in the (N+1)-th layer second matrix of an N-th layer processing core before the matrix operation (e.g., transpose), the coordinate of that N-th layer processing core in the operation result matrix of the (N+1)-th layer second matrix after the operation. When a subsequent operation needs the matrix operation result of the first matrix, the first-layer processing core storing the operation result matrix of a first-layer second sub-matrix, i.e., the operation result matrix of a sub-matrix of the first matrix, can be determined layer by layer from the core-to-coordinate correspondences stored in the processing cores.
As another optional implementation, the matrix operation a processing core performs on a second sub-matrix is a computation based on an operation instruction, and each processing core corresponds to one second sub-matrix and a corresponding operation instruction. Step S120 may then specifically include: S11, determining the operation instruction corresponding to at least one second sub-matrix; and S12, mapping the second sub-matrices of each layer of second matrix and their corresponding operation instructions to multiple processing cores in the many-core system, so that each processing core performs the matrix operation of its corresponding second sub-matrix according to its corresponding operation instruction and stores the computation result.
In the embodiments of the present application, a control core in the many-core system may determine the multiple layers of second matrices and the operation instructions according to the first matrix on which a matrix operation, for example a transpose, is to be performed, the second sub-matrices of each layer being far smaller than the first matrix; the second sub-matrices of at least one layer and the operation instructions are mapped to multiple processing cores of the many-core system to perform the matrix operations, finally yielding the matrix operation result of the first matrix. This achieves a high compression ratio on the coordinate dimension of matrix elements and significantly reduces memory overhead; the computation results are stored in at least one processing core of the many-core system without being written to off-chip storage such as memory, which also reduces the probability of repeatedly moving data and improves the efficiency of matrix operations on ultra-large-scale sparse matrices.
FIG. 4 is a flowchart of some steps of the task mapping method in an embodiment of the present application.
In some embodiments, referring to FIG. 4, step S12 may specifically include: S41, mapping the non-zero second sub-matrices of each layer of second matrix to processing cores in the many-core system respectively; and S42, configuring the operation instructions corresponding to the non-zero second sub-matrices into the processing cores of the many-core system.
In the embodiments of the present application, the processing core corresponding to an N-th layer second sub-matrix is an N-th layer processing core. The first-layer second sub-matrices are the sub-matrices of the first matrix; the first-layer processing cores compute the sub-matrices of the first matrix and store their operation result matrices. According to the determined operation instruction corresponding to at least one second sub-matrix, the (N+1)-th layer processing core determines, from the coordinate in the (N+1)-th layer second matrix of an N-th layer processing core before the transpose, the coordinate of that core in the operation result matrix of the (N+1)-th layer second matrix after the transpose. When a subsequent operation needs the matrix operation result of the first matrix, the first-layer processing core storing the operation result matrix of a first-layer second sub-matrix, i.e., the operation result matrix of a sub-matrix of the first matrix, can be determined layer by layer from the core-to-coordinate correspondences stored in at least one layer of processing cores.
FIG. 5 is a flowchart of some steps of the task mapping method in an embodiment of the present application. In some embodiments, the processing core corresponding to an N-th layer second sub-matrix is an N-th layer processing core.
Referring to FIG. 5, in some embodiments, step S121 or step S41 may specifically include the following steps.
S51, transmitting the first-layer non-zero second sub-matrices of the first-layer second matrix to first-layer processing cores, so that each first-layer processing core performs the matrix operation of its corresponding first-layer non-zero second sub-matrix.
S52, transmitting the first correspondence between N-th layer processing cores and first coordinates to the (N+1)-th layer processing core, so that the (N+1)-th layer processing core determines, according to the first correspondence, the second correspondence between the N-th layer processing cores and second coordinates. The first coordinate is the coordinate, in the (N+1)-th layer second matrix, of the element of the (N+1)-th layer second matrix corresponding to an N-th layer non-zero second sub-matrix; the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)-th layer second matrix.
In this embodiment, if the matrix operation is a transpose operation, the operation result matrix is the transposed matrix.
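For a transpose, the second correspondence follows from the first by swapping the row and column of each block coordinate. A minimal hypothetical sketch (the dict-of-tuples representation and core names are our own, not from the patent):

```python
def second_correspondence(first_corr):
    """Derive the second correspondence for a transpose.

    first_corr maps each layer-N core id to the first coordinate
    (row, col) of its block in the layer-(N+1) second matrix; in the
    transposed operation result matrix that block sits at (col, row)."""
    return {core: (c, r) for core, (r, c) in first_corr.items()}
```

In the FIG. 2 example, the core holding block D at first coordinate (0, 3) would be recorded at second coordinate (3, 0), matching D's position swap in the transpose result.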
In the embodiments of the present application, the sizes of the second sub-matrices of at least one layer may be the same or different; the embodiments of the present application place no particular limitation on this.
FIG. 6 is a flowchart of some steps of the task mapping method in an embodiment of the present application.
Referring to FIG. 6, in some embodiments, step S110 includes: S61, determining a target size according to the size of the first matrix, the target size being the size of the second sub-matrices of each layer; and S62, determining the multiple layers of second matrices according to the first matrix and the target size.
The embodiments of the present application place no particular limitation on the size of the second sub-matrices; the size may be determined according to the computing capability and storage capacity of the many-core system and the required operation efficiency. In some embodiments, each second sub-matrix of each layer is a square matrix, and the matrix operation includes at least a matrix transpose operation.
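The relation between the first matrix's dimension, the target size, and the resulting number of layers can be illustrated with a small hypothetical helper (our own sketch; it assumes square matrices and uses ceiling division when the dimension is not an exact power of the target size):

```python
import math

def num_layers(first_dim, target_dim):
    """Number of second-matrix layers when a first_dim x first_dim matrix
    is split recursively into target_dim x target_dim sub-matrices, until
    the top layer is a single target_dim x target_dim matrix."""
    layers = 1
    dim = first_dim
    while dim > target_dim:
        dim = math.ceil(dim / target_dim)   # one more layer of blocking
        layers += 1
    return layers
```

For the FIG. 2 example (a 16*16 first matrix with target size 4) this gives 2 layers, matching the two-layer decomposition shown there.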
In the embodiments of the present disclosure, the computation results of the matrix operations are stored in at least one processing core of the many-core system. When a subsequent operation involving a matrix operation result needs to be performed, the data of the subsequent operation is transmitted to the corresponding processing core, which executes it; there is no need to first transfer the operation result matrices of the first matrix's sub-matrices computed by at least one processing core to off-chip storage such as memory and then read the first matrix's operation result back from off-chip storage. This reduces the probability of repeatedly moving data and improves the efficiency of performing matrix operations on large-scale sparse matrices.
FIG. 7 is a flowchart of the task mapping method in an embodiment of the present application; steps identical to those in FIG. 1 use the same reference numerals. Referring to FIG. 7, in some embodiments, after step S120 the method further includes the following steps.
S130, determining a target processing core according to the mapping relationship between at least one second sub-matrix of the multiple layers of second matrices and the multiple processing cores in the many-core system; the target processing core is at least one of the multiple processing cores that stores target data, and the target data is the operation result matrix of the sub-matrix of the first matrix corresponding to the task data of a task to be processed.
S140, transmitting the task data to the target processing core, so that the target processing core reads the target data and executes the operation corresponding to the task data.
Here, the multiple layers of second matrices are determined according to the first matrix, each layer of second matrix being divided into at least one second sub-matrix; the elements of the (N+1)-th layer second matrix correspond one-to-one to the multiple second sub-matrices of the N-th layer second matrix, and the first-layer second sub-matrices are sub-matrices of the first matrix. Each of the multiple processing cores corresponds to one second sub-matrix and stores the computation result obtained by performing the matrix operation on its corresponding second sub-matrix; N is a positive integer.
In this embodiment, if the matrix operation is a transpose operation, the operation result matrix is the transposed matrix.
In the embodiments of the present application, after the multiple processing cores perform the matrix operations of the second sub-matrices, the resulting matrix operation results are stored in the processing cores. When the many-core system executes a task to be processed that involves a matrix operation result, the control core executes steps S130 to S140, transmitting the task data of the task to the target processing core, which executes the operation corresponding to the task data. The embodiments of the present application place no particular limitation on that operation; examples include matrix multiplication and matrix addition/subtraction between the operation result matrix of a sub-matrix of the first matrix and the task data.
The control core may be any processing core in the many-core system. In the embodiments of the present application, the control core may be any processing core other than the multiple processing cores that perform the matrix operations of the second sub-matrices and store the computation results, or it may be one of them; the embodiments of the present application place no particular limitation on this. For example, when there are M layers of second matrices in total and the first-layer second matrix is the first matrix, the M-th layer processing core corresponding to the M-th layer second matrix is the control core.
In the task mapping method provided by the embodiments of the present application, the computation results of matrix computations such as transpose operations are stored in at least one processing core of the many-core system. When a task to be processed that involves a matrix operation needs to be executed, its task data is transmitted to the target processing core, which executes the corresponding operation; there is no need to first transfer the matrix operation results of the first matrix's sub-matrices computed by at least one processing core to off-chip storage such as memory and then read the first matrix's operation result back from off-chip storage. This reduces the probability of repeatedly moving data and improves the efficiency of performing matrix operations on large-scale sparse matrices.
In some embodiments, the at least one second sub-matrix of each layer of second matrix includes at least one non-zero second sub-matrix; the mapping relationship between at least one second sub-matrix of the multiple layers of second matrices and the multiple processing cores in the many-core system is the mapping relationship between the at least one non-zero second sub-matrix of each layer of second matrix and the multiple processing cores; the processing core corresponding to an N-th layer second sub-matrix is an N-th layer processing core; each first-layer processing core stores the computation result obtained by performing the matrix operation on its corresponding first-layer non-zero second sub-matrix; each (N+1)-th layer processing core stores the second correspondence between the N-th layer processing cores and second coordinates determined according to the first correspondence between the N-th layer processing cores and first coordinates; the first coordinate is the coordinate, in the (N+1)-th layer second matrix, of the element of the (N+1)-th layer second matrix corresponding to an N-th layer non-zero second sub-matrix; the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)-th layer second matrix; and the target processing core is one of the at least one first-layer processing core. In this embodiment, if the matrix operation is a transpose operation, the operation result matrix is the transposed matrix.
FIG. 8 shows a flowchart of some steps of the task mapping method of an embodiment of the present disclosure.
Referring to FIG. 8, in some embodiments, step S130 may specifically include the following steps.
S81, when N is greater than 1, determining the (N-1)-th layer processing core corresponding to the target data according to the second correspondence stored in the N-th layer processing core corresponding to the target data.
S82, when N equals 1, taking the first-layer processing core that stores the target data as the target processing core.
It should be noted that, when there are M layers of second matrices in total and the first-layer second matrix is the first matrix, starting from the M-th layer processing core and while N is greater than 1, S81 is executed iteratively to determine, layer by layer, at least one layer of processing cores corresponding to the target data, until N equals 1, at which point the first-layer processing core storing the target data is taken as the target processing core. That is, while N is greater than 1, step S81 is executed once and N is decreased by 1; if N is still greater than 1 after the decrease, step S81 is executed again; if N equals 1 after the decrease, step S82 is executed.
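The iterative descent of S81/S82 can be sketched as a loop (a hypothetical simplification: it assumes one correspondence table per layer, ordered from the top layer down to layer 2, whereas in the patent each core holds its own second correspondence):

```python
def find_target_core(second_corrs, path):
    """Iteratively resolve the layer-1 core holding the target data.

    second_corrs[k] maps a second (post-operation) coordinate at one
    layer to the id of the core one layer below; path gives the target
    data's second coordinate at each of those layers, top-down."""
    core = None
    for corr, coord in zip(second_corrs, path):
        core = corr[coord]      # step S81: N > 1, descend one layer
    return core                 # step S82: N == 1, this core stores the data
```

Each loop iteration corresponds to one execution of S81 with N decreased by 1; the loop exits when N reaches 1.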
FIG. 9 shows a flowchart of some steps of the task mapping method of an embodiment of the present disclosure.
In some embodiments, referring to FIG. 9, the task mapping method further includes the following steps.
S91, determining, according to the second correspondences stored in the processing cores of each layer, the address of the storage space in off-chip storage of at least one target computation result; the target computation result is the computation result stored by a first-layer processing core after performing the matrix operation of its corresponding first-layer non-zero second sub-matrix.
S92, controlling at least one first-layer processing core to write its stored target computation result into the off-chip storage according to the address of the storage space of that result in the off-chip storage, where the at least one target computation result is stitched together in the off-chip storage into the operation result matrix of the first matrix.
In the embodiments of the present application, the control core may also control the multiple processing cores to output their matrix operation results to off-chip storage such as memory; in subsequent computations, the operation result matrix of the first matrix can then be obtained from the off-chip storage. It should be noted that when the matrix operation results of the multiple processing cores are output to off-chip storage such as memory, they are stored directly as the operation result matrix of the first matrix.
It should be noted that when only the non-zero second sub-matrices of each layer are mapped to the multiple processing cores of the many-core system, the zero matrices among the sub-matrices of the first matrix must be filled in when the matrix operation results of the multiple processing cores are output to off-chip storage such as memory. The control core determines the addresses in off-chip storage of the matrix operation results stored in at least one processing core, thereby ensuring that after at least one processing core writes the elements contained in its stored matrix operation results to off-chip storage, they can be stitched together into the operation result matrix of the first matrix.
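The address computation behind S91/S92 can be sketched for a row-major layout (a hypothetical illustration; byte addressing, element size, and parameter names are our own assumptions, not from the patent):

```python
def element_addresses(block_coord, b, n, base=0, elem_size=4):
    """Row-major off-chip byte addresses for the elements of one b*b
    result block inside the n*n operation result matrix. block_coord is
    the block's second (post-operation) coordinate (block_row, block_col)."""
    br, bc = block_coord
    addrs = []
    for i in range(b):
        for j in range(b):
            row, col = br * b + i, bc * b + j    # position in the full result
            addrs.append(base + (row * n + col) * elem_size)
    return addrs
```

With these addresses, each first-layer core can write its block directly to its final location, so the stitched result matrix appears in off-chip storage without any extra rearrangement pass.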
FIG. 10 shows a flowchart of the task processing method of an embodiment of the present application.
Referring to FIG. 10, an embodiment of the present application provides a task processing method including the following steps.
S1010, receiving the first correspondence between N-th layer processing cores of the multiple layers of second matrices and first coordinates, where N is a positive integer.
S1020, determining, according to the first correspondence, the second correspondence between the N-th layer processing cores and second coordinates.
Here, each layer of second matrix of the multiple layers of second matrices is divided into at least one second sub-matrix; the elements of the (N+1)-th layer second matrix correspond one-to-one to the multiple second sub-matrices of the N-th layer second matrix; each of the multiple processing cores in the many-core system corresponds to one second sub-matrix; the first coordinate is the coordinate, in the (N+1)-th layer second matrix, of the element of the (N+1)-th layer second matrix corresponding to an N-th layer second sub-matrix; the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)-th layer second matrix; and N is a positive integer.
In the task processing method provided by the embodiments of the present application, a processing core of the many-core system can determine, from the coordinate in the (N+1)-th layer second matrix of an N-th layer processing core before the transpose, the coordinate of that core in the operation result matrix of the (N+1)-th layer second matrix after the transpose. The many-core system can thus determine multiple layers of second matrices according to the first matrix on which a matrix operation is to be performed and map the second sub-matrices of at least one layer to multiple processing cores for the matrix operations, finally yielding the matrix operation result of the first matrix. This achieves a high compression ratio on the coordinate dimension of matrix elements and greatly reduces memory overhead; the computation results are stored in the processing cores of the many-core system without being written to off-chip storage such as memory, which also reduces the probability of repeatedly moving data and improves the efficiency of performing matrix operations on ultra-large-scale sparse matrices.
FIG. 11 shows a flowchart of some steps of the task processing method in an embodiment of the present application. In some embodiments, each processing core corresponds to one second sub-matrix and a corresponding operation instruction.
As shown in FIG. 11, in some embodiments, step S1020 may specifically include the following steps.
S1110, receiving an operation instruction; S1120, determining the second correspondence between the N-th layer processing cores and the second coordinates according to the operation instruction and the first correspondence.
In the task processing method provided by the embodiments of the present disclosure, a processing core of the many-core system can determine, according to the operation instruction and the coordinate in the (N+1)-th layer second matrix of an N-th layer processing core before the transpose, the coordinate of that core in the operation result matrix of the (N+1)-th layer second matrix after the transpose. The many-core system can thus determine multiple layers of second matrices according to the first matrix on which a matrix operation is to be performed and map the second sub-matrices of at least one layer to multiple processing cores for the matrix operations, finally yielding the matrix operation result of the first matrix. This achieves a high compression ratio on the coordinate dimension of matrix elements and greatly reduces memory overhead; the computation results are stored in the processing cores of the many-core system without being written to off-chip storage such as memory, which also reduces the probability of repeatedly moving data and improves the efficiency of performing matrix operations on ultra-large-scale sparse matrices.
FIG. 12 is a block diagram of a task mapping apparatus in an embodiment of the present application. As shown in FIG. 12, in some embodiments, the task mapping apparatus 1200 may include the following modules.
A matrix determining module 1210, configured to determine multiple layers of second matrices according to a first matrix, each layer of second matrix being divided into at least one second sub-matrix, wherein the elements of the (N+1)-th layer second matrix correspond one-to-one to the multiple second sub-matrices of the N-th layer second matrix, the first-layer second sub-matrices are sub-matrices of the first matrix, and N is a positive integer.
A matrix mapping module 1220, configured to map the second sub-matrices of each layer of second matrix to processing cores in the many-core system, each processing core corresponding to one second sub-matrix, so that each processing core performs the matrix operation of its corresponding second sub-matrix and stores the operation result.
In some embodiments, the at least one second sub-matrix of each layer of second matrix includes at least one non-zero second sub-matrix, and the matrix mapping module 1220 is specifically configured to map the non-zero second sub-matrices of each layer of second matrix to processing cores in the many-core system respectively.
In some embodiments, the matrix computation performed by a processing core is a computation based on an operation instruction, and each processing core corresponds to one second sub-matrix and a corresponding operation instruction. The matrix mapping module 1220 includes an instruction determining unit that determines the operation instruction corresponding to at least one second sub-matrix, and the matrix mapping module 1220 is specifically configured to map the second sub-matrices of each layer of second matrix and their corresponding operation instructions to processing cores in the many-core system, so that each processing core performs the matrix operation on its corresponding second sub-matrix according to its corresponding operation instruction and stores the computation result.
In some embodiments, when mapping the second sub-matrices of each layer of second matrix and their corresponding operation instructions to processing cores in the many-core system, the matrix mapping module 1220 is specifically configured to: map the non-zero second sub-matrices of each layer of second matrix to processing cores in the many-core system respectively; and configure the operation instructions corresponding to at least one non-zero second sub-matrix into the processing cores of the many-core system.
In some embodiments, the processing core corresponding to an N-th layer second sub-matrix is an N-th layer processing core. When mapping the non-zero second sub-matrices of each layer of second matrix to processing cores in the many-core system, the matrix mapping module 1220 is specifically configured to: transmit the first-layer non-zero second sub-matrices of the first-layer second matrix to first-layer processing cores, so that each first-layer processing core performs the matrix operation on its corresponding first-layer non-zero second sub-matrix; and transmit the first correspondence between N-th layer processing cores and first coordinates to the (N+1)-th layer processing core, so that the (N+1)-th layer processing core determines, according to the first correspondence, the second correspondence between the N-th layer processing cores and second coordinates. The first coordinate is the coordinate, in the (N+1)-th layer second matrix, of the element of the (N+1)-th layer second matrix corresponding to an N-th layer non-zero second sub-matrix; the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)-th layer second matrix. In this embodiment, if the matrix operation is a transpose operation, the operation result matrix is the transposed matrix.
In some embodiments, the matrix determining module 1210 is specifically configured to: determine a target size according to the size of the first matrix, the target size being the size of the second sub-matrices of each layer; and determine the multiple layers of second matrices according to the first matrix and the target size.
In some embodiments, the task mapping apparatus 1200 further includes: a target core determining module, configured to determine, after the second sub-matrices of each layer of second matrix have been mapped to processing cores in the many-core system, a target processing core according to the mapping relationship between at least one second sub-matrix of the multiple layers of second matrices and the multiple processing cores in the many-core system, the target processing core being at least one of the multiple processing cores that stores target data, and the target data being the operation result matrix of the sub-matrix of the first matrix corresponding to the task data of a task to be processed; and a data transmission module, configured to transmit the task data to the target processing core, so that the target processing core reads the target data and executes the operation corresponding to the task data.
In this embodiment, if the matrix operation is a transpose operation, the operation result matrix is the transposed matrix.
In some embodiments, the at least one second sub-matrix of each layer of second matrix includes at least one non-zero second sub-matrix; the mapping relationship between at least one second sub-matrix of the multiple layers of second matrices and the multiple processing cores in the many-core system is the mapping relationship between the at least one non-zero second sub-matrix of each layer of second matrix and the multiple processing cores; the processing core corresponding to an N-th layer second sub-matrix is an N-th layer processing core; each first-layer processing core stores the computation result obtained by performing the matrix operation on its corresponding first-layer non-zero second sub-matrix; each (N+1)-th layer processing core stores the second correspondence between the N-th layer processing cores and second coordinates determined according to the first correspondence between the N-th layer processing cores and first coordinates; the first coordinate is the coordinate, in the (N+1)-th layer second matrix, of the element of the (N+1)-th layer second matrix corresponding to an N-th layer non-zero second sub-matrix; the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)-th layer second matrix; and the target processing core is one of the at least one first-layer processing core. In this embodiment, if the matrix operation is a transpose operation, the operation result matrix is the transposed matrix.
The target core determining module is specifically configured to: when N is greater than 1, determine the (N-1)-th layer processing core corresponding to the target data according to the second correspondence stored in the N-th layer processing core corresponding to the target data; and when N equals 1, take the first-layer processing core that stores the target data as the target processing core.
In some embodiments, the task mapping apparatus 1200 further includes: an address determining module, configured to determine, according to the second correspondences stored in the processing cores of each layer, the address of the storage space in off-chip storage of at least one target computation result, the target computation result being the computation result stored by a first-layer processing core after performing the matrix operation of its corresponding first-layer non-zero second sub-matrix; and an address writing module, configured to control at least one first-layer processing core to write its stored target computation result into the off-chip storage according to the address of the storage space of that result in the off-chip storage, where the at least one target computation result is stitched together in the off-chip storage into the operation result matrix of the first matrix.
In some embodiments, each second sub-matrix of each layer is a square matrix, and the matrix operation includes at least a matrix transpose operation.
The task mapping apparatus according to the embodiments of the present invention provides a scheme for performing matrix operations on sparse matrices using a many-core system: multiple layers of second matrices are determined according to the first matrix on which a matrix operation is to be performed, the second sub-matrices of each layer being far smaller than the first matrix; the second sub-matrices of at least one layer are mapped to multiple processing cores of the many-core system to perform the matrix operations, finally yielding the matrix operation result of the first matrix. This achieves a high compression ratio on the coordinate dimension of matrix elements and greatly reduces memory overhead; the computation results are stored in at least one processing core of the many-core system without being written to off-chip storage such as memory, which also reduces the probability of repeatedly moving data and improves the efficiency of performing matrix operations on ultra-large-scale sparse matrices.
FIG. 13 is a block diagram of a task processing apparatus in an embodiment of the present application. As shown in FIG. 13, in some embodiments, the task processing apparatus 1300 may include the following modules.
A receiving module 1310, configured to receive the first correspondence between N-th layer processing cores of the multiple layers of second matrices and first coordinates, where N is a positive integer; and a relationship determining module 1320, configured to determine, according to the first correspondence, the second correspondence between the N-th layer processing cores and second coordinates.
Here, each layer of second matrix of the multiple layers of second matrices is divided into at least one second sub-matrix; the elements of the (N+1)-th layer second matrix correspond one-to-one to the multiple second sub-matrices of the N-th layer second matrix; each of the multiple processing cores in the many-core system corresponds to one second sub-matrix; the first coordinate is the coordinate, in the (N+1)-th layer second matrix, of the element of the (N+1)-th layer second matrix corresponding to an N-th layer second sub-matrix; and the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)-th layer second matrix.
In some embodiments, each processing core corresponds to one second sub-matrix and a corresponding operation instruction. The relationship determining module 1320 includes an instruction receiving module configured to receive an operation instruction, and the relationship determining module 1320 is specifically configured to determine the second correspondence between the N-th layer processing cores and the second coordinates according to the operation instruction and the first correspondence.
With the task processing apparatus according to the embodiments of the present invention, a processing core of the many-core system can determine, from the coordinate in the (N+1)-th layer second matrix of an N-th layer processing core before the transpose, the coordinate of that core in the operation result matrix of the (N+1)-th layer second matrix after the transpose. The many-core system can thus determine multiple layers of second matrices according to the first matrix on which a matrix operation is to be performed and map the second sub-matrices of at least one layer to multiple processing cores for the matrix operations, finally yielding the matrix operation result of the first matrix. This achieves a high compression ratio on the coordinate dimension of matrix elements and greatly reduces memory overhead; the computation results are stored in the processing cores of the many-core system without being written to off-chip storage such as memory, which also reduces the probability of repeatedly moving data and improves the efficiency of performing matrix operations on ultra-large-scale sparse matrices.
FIG. 14 is a block diagram of an electronic device provided by an embodiment of the present application.
Referring to FIG. 14, in some embodiments, an embodiment of the present application provides an electronic device, including: one or more processors 1401; a memory 1402 storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the task mapping method of the embodiments of the present application; and one or more I/O interfaces 1403 connected between the processors and the memory and configured to enable information exchange between the processors and the memory. The processor 1401 is a device with data processing capability, including but not limited to a central processing unit (CPU); the memory 1402 is a device with data storage capability, including but not limited to random access memory (RAM, e.g. SDRAM or DDR), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), and flash memory (FLASH); the I/O (read/write) interface 1403 is connected between the processor 1401 and the memory 1402, enables information exchange between the processor 1401 and the memory 1402, and includes but is not limited to a data bus (Bus).
In some embodiments, the processor 1401, the memory 1402, and the I/O interface 1403 are interconnected via a bus 1404, which in turn connects to the other components of the computing device.
FIG. 15 is a block diagram of a processing core provided by an embodiment of the present application. Referring to FIG. 15, in some embodiments, an embodiment of the present application provides a processing core, including a computing unit 1501 and a cache 1502; the computing unit 1501 can implement the task mapping method of the above embodiments of the present application and/or the task processing method of the embodiments of the present application.
FIG. 16 is a block diagram of a many-core system provided by an embodiment of the present application. Referring to FIG. 16, an embodiment of the present application provides a many-core system, including: a plurality of processing cores 1601; and an on-chip network 1602 configured to exchange data among the plurality of processing cores 1601 and with the outside. One or more instructions are stored in one or more of the processing cores 1601 and are executed by one or more of the processing cores 1601, so that the one or more processing cores 1601 can execute the task mapping method of the above embodiments of the present application and/or the task processing method of the embodiments of the present application.
Embodiments of the present invention also provide a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor of the electronic device executes the task mapping method or the task processing method of any embodiment of the present application.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods disclosed above and the functional modules/units of the systems and apparatuses may be implemented as software, firmware, hardware, or suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed cooperatively by several physical components. Some or all physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted only in a generic and descriptive sense and not for purposes of limitation. In some instances, it will be apparent to those skilled in the art that, unless expressly stated otherwise, features, characteristics, and/or elements described in connection with a particular embodiment may be used alone or in combination with features, characteristics, and/or elements described in connection with other embodiments. Accordingly, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of the present application as set forth in the appended claims.

Claims (18)

  1. A task mapping method, comprising:
    determining multiple layers of second matrices according to a first matrix, each layer of second matrix being divided into at least one second sub-matrix, wherein elements of an (N+1)-th layer second matrix correspond one-to-one to a plurality of second sub-matrices of an N-th layer second matrix, first-layer second sub-matrices are sub-matrices of the first matrix, and N is a positive integer; and
    mapping the second sub-matrices of each layer of second matrix to processing cores in a many-core system, each of the processing cores corresponding to one second sub-matrix, so that each processing core performs a matrix operation on its corresponding second sub-matrix and stores an operation result.
  2. The method according to claim 1, wherein the second sub-matrices of each layer of second matrix include at least one non-zero second sub-matrix, and mapping the second sub-matrices of each layer of second matrix to processing cores in the many-core system comprises:
    mapping the non-zero second sub-matrices of each layer of second matrix to processing cores in the many-core system respectively.
  3. The method according to claim 1, wherein the matrix computation performed by a processing core is a computation based on an operation instruction, and each processing core corresponds to one second sub-matrix and a corresponding operation instruction; and
    mapping the second sub-matrices of each layer of second matrix to processing cores in the many-core system, each processing core corresponding to one second sub-matrix, so that each processing core performs the matrix operation on its corresponding second sub-matrix and stores the operation result, comprises:
    determining the operation instruction corresponding to at least one second sub-matrix; and
    mapping the second sub-matrices of each layer of second matrix and their corresponding operation instructions to processing cores in the many-core system, so that each processing core performs the matrix operation on its corresponding second sub-matrix according to its corresponding operation instruction and stores the computation result.
  4. The method according to claim 3, wherein mapping the second sub-matrices of each layer of second matrix and their corresponding operation instructions to processing cores in the many-core system comprises:
    mapping the non-zero second sub-matrices of each layer of second matrix to processing cores in the many-core system respectively; and
    configuring the operation instructions corresponding to the non-zero second sub-matrices into the processing cores of the many-core system.
  5. The method according to claim 2 or 4, wherein the processing core corresponding to an N-th layer second sub-matrix is an N-th layer processing core, and mapping the non-zero second sub-matrices of each layer of second matrix to processing cores in the many-core system respectively comprises:
    transmitting the first-layer non-zero second sub-matrices of the first-layer second matrix to first-layer processing cores, so that each first-layer processing core performs the matrix operation of its corresponding first-layer non-zero second sub-matrix; and
    transmitting a first correspondence between N-th layer processing cores and first coordinates to an (N+1)-th layer processing core, so that the (N+1)-th layer processing core determines, according to the first correspondence, a second correspondence between the N-th layer processing cores and second coordinates, wherein the first coordinate is the coordinate, in the (N+1)-th layer second matrix, of the element of the (N+1)-th layer second matrix corresponding to an N-th layer non-zero second sub-matrix, and the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)-th layer second matrix.
  6. The method according to any one of claims 1 to 4, wherein determining multiple layers of second matrices according to the first matrix comprises:
    determining a target size according to the size of the first matrix, the target size being the size of the second sub-matrices of each layer; and
    determining the multiple layers of second matrices according to the first matrix and the target size.
  7. The method according to any one of claims 1 to 4, wherein after mapping the second sub-matrices of each layer of second matrix to processing cores in the many-core system, the method further comprises:
    determining a target processing core according to a mapping relationship between at least one second sub-matrix of the multiple layers of second matrices and the multiple processing cores in the many-core system, wherein the target processing core is at least one of the multiple processing cores that stores target data, and the target data is the operation result matrix of the sub-matrix of the first matrix corresponding to task data of a task to be processed; and
    transmitting the task data to the target processing core, so that the target processing core reads the target data and executes the operation corresponding to the task data.
  8. The method according to claim 7, wherein the at least one second sub-matrix of each layer of second matrix includes at least one non-zero second sub-matrix, and the mapping relationship between at least one second sub-matrix of the multiple layers of second matrices and the multiple processing cores in the many-core system is the mapping relationship between the at least one non-zero second sub-matrix of each layer of second matrix and the multiple processing cores;
    the processing core corresponding to an N-th layer second sub-matrix is an N-th layer processing core; each first-layer processing core stores the computation result obtained by performing the matrix operation on its corresponding first-layer non-zero second sub-matrix; each (N+1)-th layer processing core stores the second correspondence between the N-th layer processing cores and second coordinates determined according to the first correspondence between the N-th layer processing cores and first coordinates;
    the first coordinate is the coordinate, in the (N+1)-th layer second matrix, of the element of the (N+1)-th layer second matrix corresponding to an N-th layer non-zero second sub-matrix; the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)-th layer second matrix; and the target processing core is one of the at least one first-layer processing core;
    determining the target processing core according to the mapping relationship between at least one second sub-matrix of the multiple layers of second matrices and the multiple processing cores in the many-core system comprises:
    when N is greater than 1, determining the (N-1)-th layer processing core corresponding to the target data according to the second correspondence stored in the N-th layer processing core corresponding to the target data; and
    when N equals 1, taking the first-layer processing core that stores the target data as the target processing core.
  9. The method according to claim 8, wherein the task mapping method further comprises:
    determining, according to the second correspondences stored in the processing cores of each layer, the address of the storage space in off-chip storage of at least one target computation result, wherein the target computation result is the computation result stored by a first-layer processing core after performing the matrix operation of its corresponding first-layer non-zero second sub-matrix; and
    controlling at least one first-layer processing core to write its stored target computation result into the off-chip storage according to the address of the storage space of that result in the off-chip storage, wherein the at least one target computation result is stitched together in the off-chip storage into the operation result matrix of the first matrix.
  10. The method according to any one of claims 1 to 4, wherein each second sub-matrix of each layer is a square matrix, and the matrix operation comprises at least a matrix transpose operation.
  11. A task processing method, comprising:
    receiving a first correspondence between N-th layer processing cores of multiple layers of second matrices and first coordinates, where N is a positive integer; and
    determining, according to the first correspondence, a second correspondence between the N-th layer processing cores and second coordinates,
    wherein each layer of second matrix of the multiple layers of second matrices is divided into at least one second sub-matrix; elements of the (N+1)-th layer second matrix correspond one-to-one to a plurality of second sub-matrices of the N-th layer second matrix; each of the multiple processing cores in a many-core system corresponds to one second sub-matrix; the first coordinate is the coordinate, in the (N+1)-th layer second matrix, of the element of the (N+1)-th layer second matrix corresponding to an N-th layer second sub-matrix; and the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)-th layer second matrix.
  12. The method according to claim 11, wherein each processing core corresponds to one second sub-matrix and a corresponding operation instruction, and determining the second correspondence between the N-th layer processing cores and the second coordinates according to the first correspondence comprises:
    receiving an operation instruction; and
    determining the second correspondence between the N-th layer processing cores and the second coordinates according to the operation instruction and the first correspondence.
  13. A task mapping apparatus, comprising:
    a matrix determining module configured to determine multiple layers of second matrices according to a first matrix, each layer of second matrix being divided into at least one second sub-matrix, wherein elements of an (N+1)-th layer second matrix correspond one-to-one to a plurality of second sub-matrices of an N-th layer second matrix, first-layer second sub-matrices are sub-matrices of the first matrix, and N is a positive integer; and
    a matrix mapping module configured to map the second sub-matrices of each layer of second matrix to processing cores in a many-core system, each of the processing cores corresponding to one second sub-matrix, so that each processing core performs the matrix operation on its corresponding second sub-matrix and stores the operation result.
  14. A task processing apparatus, comprising:
    a receiving module configured to receive a first correspondence between N-th layer processing cores of multiple layers of second matrices and first coordinates, where N is a positive integer; and
    a relationship determining module configured to determine, according to the first correspondence, a second correspondence between the N-th layer processing cores and second coordinates,
    wherein each layer of second matrix of the multiple layers of second matrices is divided into at least one second sub-matrix; elements of the (N+1)-th layer second matrix correspond one-to-one to a plurality of second sub-matrices of the N-th layer second matrix; each of the multiple processing cores in a many-core system corresponds to one second sub-matrix; the first coordinate is the coordinate, in the (N+1)-th layer second matrix, of the element of the (N+1)-th layer second matrix corresponding to an N-th layer second sub-matrix; and the second coordinate is the coordinate of that element in the operation result matrix of the (N+1)-th layer second matrix.
  15. An electronic device, comprising:
    one or more processors;
    a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the task mapping method according to any one of claims 1 to 10; and
    one or more I/O interfaces connected between the processors and the memory and configured to enable information exchange between the processors and the memory.
  16. A processing core, comprising:
    a computing unit and a cache,
    wherein the computing unit can implement the task mapping method according to any one of claims 1-10 and/or the task processing method according to any one of claims 11-12.
  17. A many-core system, comprising:
    a plurality of processing cores; and
    an on-chip network configured to exchange data among the plurality of processing cores and with the outside,
    wherein one or more instructions are stored in one or more of the processing cores and are executed by one or more of the processing cores, so that one or more of the processing cores can execute the task mapping method according to any one of claims 1-10 and/or the task processing method according to any one of claims 11-12.
  18. A computer program product, comprising computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, wherein when the computer-readable code runs in a processor of an electronic device, the processor of the electronic device executes the task mapping method according to any one of claims 1-10 and/or the task processing method according to any one of claims 11-12.
PCT/CN2022/073984 2021-01-26 2022-01-26 Task mapping method, task processing method, processing core and electronic device WO2022161394A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110103025.4 2021-01-26
CN202110111060.0 2021-01-26
CN202110111060.0A CN114791849A (zh) 2021-01-26 2021-01-26 Task mapping and task processing methods, processing core, and electronic device
CN202110103025.4A CN114791786A (zh) 2021-01-26 2021-01-26 Task mapping, task control, and task processing methods, processing core, and electronic device

Publications (1)

Publication Number Publication Date
WO2022161394A1 true WO2022161394A1 (zh) 2022-08-04

Family

ID=82653010

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/073984 WO2022161394A1 (zh) 2021-01-26 2022-01-26 任务映射方法、任务处理方法、处理核和电子设备

Country Status (1)

Country Link
WO (1) WO2022161394A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6470368B1 (en) * 1999-05-21 2002-10-22 Sun Microsystems, Inc. Using tiling to improve performance in a sparse symmetric direct matrix solver
CN104636273A (zh) * 2015-02-28 2015-05-20 University of Science and Technology of China Sparse matrix storage method on an SIMD many-core processor with multi-level caches
CN108268424A (zh) * 2016-12-31 2018-07-10 Intel Corporation Heterogeneous hardware accelerator architecture for processing sparse matrix data with skewed non-zero distribution
CN108446253A (zh) * 2018-03-28 2018-08-24 Beihang University Parallel computing method for sparse matrix-vector multiplication targeting the Sunway architecture
CN111428192A (zh) * 2020-03-19 2020-07-17 Hunan University Method and system for optimizing sparse matrix-vector multiplication on high-performance computing architectures


Similar Documents

Publication Publication Date Title
CN110520870B (zh) 用于具有动态向量长度和码本大小的高吞吐量向量去量化的灵活硬件
JP6974270B2 (ja) 知能型高帯域幅メモリシステム及びそのための論理ダイ
US20190042611A1 (en) Technologies for structured database query for finding unique element values
US11436143B2 (en) Unified memory organization for neural network processors
US11403044B2 (en) Method and apparatus for performing multi-object transformations on a storage device
WO2018017282A1 (en) Techniques to provide a multi-level memory architecture via interconnects
JP2020519981A (ja) 専用ニューラルネットワークトレーニングチップ
US20150106574A1 (en) Performing Processing Operations for Memory Circuits using a Hierarchical Arrangement of Processing Circuits
US20150149857A1 (en) Error correction in memory
US10643126B2 (en) Systems, methods and devices for data quantization
US9804996B2 (en) Computation memory operations in a logic layer of a stacked memory
CN110442534A (zh) 用于相干消息的高带宽链路层
US20210209450A1 (en) Compressed weight distribution in networks of neural processors
CN105408875A (zh) 在存储器接口上的分布式过程执行和文件系统
CN111651383A (zh) 用于具有数据流管理器的处理器中的数据流的方法和装置
WO2022161394A1 (zh) 任务映射方法、任务处理方法、处理核和电子设备
KR20210151250A (ko) 확장 메모리 인터페이스
US11016765B2 (en) Bit string operations using a computing tile
US11768614B2 (en) Storage device operation orchestration
US11579882B2 (en) Extended memory operations
US10606775B1 (en) Computing tile
US20200341761A1 (en) Bit sting operations using a computing tile
CN114791786A (zh) 任务映射、任务控制、任务处理方法及处理核、电子设备
US20150071021A1 (en) Accessing independently addressable memory chips
US20240078046A1 (en) Computer system and method for data access

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22745267

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22745267

Country of ref document: EP

Kind code of ref document: A1