CN112765094A - Sparse tensor canonical decomposition method based on data division and calculation distribution
- Publication number
- CN112765094A (application CN202011639166.XA)
- Authority
- CN
- China
- Prior art keywords
- tensor
- data
- core
- slave
- sparse
- Prior art date
- Legal status
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
- G06F15/163—Interprocessor communication
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/14—Handling requests for interconnection or transfer
- G06F13/20—Handling requests for interconnection or transfer for access to input/output bus
- G06F13/28—Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
- G06F13/30—Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal with priority control
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
- G06F15/8053—Vector processors
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Complex Calculations (AREA)
Abstract
The invention relates to a sparse tensor canonical decomposition method based on data division and task allocation, comprising the following steps: initially performing multi-level division and task allocation over the processing cores of a core group according to the many-core characteristics of the Shenwei processor; initially performing multi-level segmentation of the sparse tensor data; designing communication strategies for sparse tensor canonical decomposition using the register communication feature of the Shenwei SW26010 processor; and, targeting the common performance bottleneck of different sparse tensor canonical decomposition methods, namely the differing requirements of the Matricized Tensor Times Khatri-Rao Product (MTTKRP) in concrete operation (whether tensor elements must be randomly sampled for calculation), designing different MTTKRP computation schemes that exploit the characteristics of the Shenwei processor. The invention fully exploits the characteristics of the Shenwei architecture, fully considers the computational requirements of sparse tensor decomposition, can execute a variety of sparse tensor canonical decomposition methods on the Shenwei architecture in parallel and efficiently, and ensures dynamic load balance to the maximum extent.
Description
Technical Field
The invention relates to the fields of tensor numerical algorithms, parallel computing, and the Shenwei architecture, and in particular to a sparse tensor canonical decomposition method based on data division and computation distribution.
Background
The "Shenwei TaihuLight" supercomputer system, developed by the National Research Center of Parallel Computer Engineering and Technology with the support of the national 863 Program, is the world's first supercomputer with a peak performance exceeding 100 PFlops. It was ranked the world's most powerful supercomputer by the international TOP500 organization four consecutive times during 2016-2017, and an application running on it won the Gordon Bell Prize. Ultra-large-scale parallel applications implemented by users on the system can span millions of cores, covering fields that include deep learning, earthquake simulation, quantum circuit simulation, and climate simulation.
The Shenwei TaihuLight supercomputer adopts the SW26010 heterogeneous many-core processor developed by the National High Performance Integrated Circuit (Shanghai) Design Center. In terms of basic architecture, 4 core groups are integrated on the processor chip; each core group comprises an array of 64 slave cores (computing cores) and one master core (management/control core), so the whole chip contains 4 control cores and 256 computing cores in total, all running at 1.45 GHz. Within each core group, the control core is responsible for task management and, like most current mainstream CPUs, supports memory management, interrupts, and out-of-order execution; the computing cores are responsible for high-efficiency calculation and are equipped with vectorization units and other components supporting high-speed operation. In terms of memory design, the SW26010 processor adopts a multi-level memory architecture that includes two levels of cache, sized 32KB and 256KB respectively, and provides 8GB of main memory for each core group. To give users a better parallel development environment, the Shenwei TaihuLight platform supports several programming languages, such as C++, C, and Fortran, and provides interfaces to parallel libraries and standards such as MPI and OpenMP.
In the computer field, tensors are generally regarded as multi-dimensional arrays: a scalar is a 0th-order tensor, a vector a 1st-order tensor, a matrix a 2nd-order tensor, and tensors of third order or higher are called high-order tensors. A sparse tensor is a tensor in which the majority of values are zero. In practical engineering and research, much data is represented by high-order sparse tensors. For example, user evaluation data such as Amazon's, which includes product information, user ratings, and other data dimensions, has non-zero values accounting for only about one billionth of the total data volume; the paper information published at the Neural Information Processing Systems (NIPS) conference, with dimensions such as paper ID, author information, and publication year, has a non-zero ratio of about one millionth; and e-mail sending/receiving data, containing multi-dimensional information such as mailbox names and dates, has a non-zero ratio of about one part per billion.
Due to the widespread use of tensors and the explosive development of big data analysis and data mining techniques, tensor analysis techniques are receiving more and more attention from researchers. One frequently used tensor analysis technique is the Canonical Polyadic Decomposition (CPD) algorithm, proposed by Hitchcock in 1927. CPD represents a tensor by a finite sum of rank-1 tensors, i.e. $\mathcal{X} \approx \sum_{r=1}^{F} a_r^{(1)} \circ a_r^{(2)} \circ \cdots \circ a_r^{(N)}$, where $\mathcal{X}$ denotes the tensor, $a_r^{(n)}$ denotes a vector, and $\circ$ denotes the vector outer product. CPD is currently used in many research areas, including radiation source location identification, the Internet of Things, and signal processing. Because sparse tensors and CPD are so widely used, there is much research implementing CPD on sparse tensors, such as the MATLAB Tensor Toolbox, GigaTensor, SPLATT (the Surprisingly ParalleL spArse Tensor Toolkit), DFacTo, and HyperTensor. However, all of these implementations target homogeneous single-core or multi-core CPUs or GPUs; no sparse tensor CPD algorithm adapted to the special many-core architecture of the domestic Shenwei processor has yet appeared, and current parallelized tensor canonical decomposition algorithms are relatively limited in variety. By deeply exploiting the heterogeneity, manually controlled cache, register communication, and other characteristics of the domestic Shenwei SW26010 processor, this invention completes a parallel design for several CPD algorithms and improves the performance of tensor canonical decomposition.
Disclosure of Invention
The invention addresses the following problem: overcoming the defects and shortcomings of the prior art, it provides a sparse tensor canonical decomposition method based on data division and task allocation that uses the characteristics of the Shenwei architecture to improve the performance of various sparse tensor canonical decomposition algorithms and fully exploits the potential of the Shenwei many-core architecture.
The technical solution of the invention is a sparse tensor canonical decomposition method based on data division and task allocation, comprising the following steps:
Step 1: read the sparse tensor data specified by the user into the core group main memory; according to the specified sparse tensor canonical decomposition algorithm type, convert the sparse tensor data on the SW26010 core group main core (Management Processing Element, MPE) into the sparse tensor storage format CSF (Compressed Sparse Fiber), and perform first-level and second-level data segmentation of the sparse tensor, the segmented data units being tensor blocks and tensor strips (bands), where one tensor block comprises a plurality of tensor strips; then determine the number of factor matrices of the decomposition result from the dimensionality of the input tensor, i.e., an N-dimensional tensor is decomposed into N factor matrices, the i-th factor matrix corresponding to the i-th dimension of the tensor and having as many row vectors as the length of the i-th dimension; then enter step 2 to obtain the factor matrices iteratively (an illustrative CSF layout is sketched below);
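For orientation, the following is a minimal sketch of what a CSF-style structure for a 3rd-order tensor can look like: CSF stores non-zeros as a hierarchy of compressed fiber pointers. The field names and layout are illustrative assumptions, not the patent's actual data structure.

```c
/* Illustrative CSF-style storage for a 3rd-order sparse tensor.
   All names are hypothetical; the patent does not fix a layout. */
typedef struct {
    int     nlines;     /* non-zero tensor lines (first coordinate fixed)   */
    int    *line_ids;   /* mode-0 coordinate of each line                   */
    int    *fiber_ptr;  /* fiber_ptr[l]..fiber_ptr[l+1]-1: fibers of line l */
    int    *fiber_ids;  /* mode-1 coordinate of each fiber                  */
    int    *nnz_ptr;    /* nnz_ptr[f]..nnz_ptr[f+1]-1: non-zeros of fiber f */
    int    *nnz_ids;    /* mode-2 coordinate of each non-zero               */
    double *vals;       /* non-zero values                                  */
} csf3_t;
```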
Step 2: determine, according to the specified sparse tensor canonical decomposition algorithm type, whether the MTTKRP process contained in the algorithm requires random sampling; if not, go directly to step 3; otherwise the MPE and the control slave cores (CPE controllers) in the slave core (Computing Processing Element, CPE) groups randomly sample tensor strips;
Step 3: the MPE determines the tensor dimension corresponding to the Factor Matrix to be updated this time and the allocation scheme of the tensor blocks, and stores them in the core group main memory;
Step 4: each control slave core in the slave core groups extracts, via DMA, the information of the tensor block allocated to its slave core group from the main memory, including the coordinate information of the sparse tensor non-zero elements contained in each tensor strip of the block, and stores it in its own Local Data Memory (LDM);
Step 5: the control slave core distributes factor matrix update tasks to the work slave cores (CPE workers) of its group and sends, via register communication, the coordinate information of the sparse tensor non-zero elements contained in the tensor strips of each task to the work slave core that will compute it;
Step 6: after obtaining the coordinates of the sparse tensor non-zero elements contained in its assigned tensor strips, the work slave core extracts the factor matrix data required by the calculation from the main memory into its own LDM via a software-managed cache, performs third-level segmentation of the assigned tensor strips according to the usage of its own LDM (the data unit being the tensor piece), extracts the tensor pieces in order for calculation, and returns the calculation result information and completion status to the control slave core in real time via register communication;
Step 7: after a work slave core finishes its calculation task, the control slave core dynamically allocates the next calculation task to it according to the computation status of the current tensor block, and sends the update results back to the core group main memory via DMA;
Step 8: the work slave cores and control slave core of each slave core group repeat steps 6 and 7 until the tensor block calculation tasks allocated to that slave core group are finished;
Step 9: the MPE determines, according to the user-specified sparse tensor canonical decomposition algorithm type, the matrix operations that must be executed in addition to the MTTKRP process; the slave core groups complete these computations using the basic linear algebra library BLAS, and the MPE determines a preliminary update of the factor matrix from the MTTKRP result and the matrix operation results;
Step 10: the MPE changes the tensor dimension corresponding to the factor matrix to be updated and returns to step 3 until all factor matrices have been updated;
Step 11: the MPE determines, according to the user-specified sparse tensor canonical decomposition algorithm type, whether to perform a line search; if not, the factor matrices are updated directly; otherwise they are updated through a line search process;
Step 12: the MPE evaluates the objective function with the updated factor matrix values; if the value satisfies the iteration termination condition, iteration stops, otherwise return to step 2 until it does.
In step 1, the sparse tensor canonical decomposition algorithm types that the user may specify include: the sparse tensor canonical decomposition method based on the Alternating Least Squares (ALS) method, the one based on Gradient Descent (GD), the one based on Random Block Sampling (RBS), and the one based on the fast Levenberg-Marquardt algorithm (fLM).
In step 1, the method for performing the first-level and second-level data segmentation of the sparse tensor is as follows (a partitioning sketch follows this list):
(1) First-level segmentation of the tensor: after removing all-zero tensor lines, the tensor is divided into 8 blocks, each comprising a number of complete non-zero tensor lines and an equal number of non-zero elements;
(2) second-level segmentation of the tensor: on the basis of the first-level segmentation, each non-zero tensor line within a tensor block is defined as a tensor strip;
wherein a tensor line refers to a substructure of a tensor formed while keeping one coordinate of the tensor fixed.
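As a concrete illustration of the first-level split, the sketch below cuts the sequence of non-zero tensor lines into 8 contiguous blocks of (approximately) equal non-zero count. It is a minimal sequential sketch under assumed inputs (`line_nnz`, the per-line non-zero counts); the patent only prescribes the equal-nnz property, not this particular algorithm.

```c
/* Sketch: split nlines non-zero tensor lines into 8 blocks with roughly
   equal numbers of non-zeros. Lines of block b are
   block_start[b] .. block_start[b+1]-1 (block_start has 9 entries). */
void split_blocks(const long *line_nnz, int nlines, int block_start[9])
{
    long total = 0;
    for (int l = 0; l < nlines; l++) total += line_nnz[l];

    long target = (total + 7) / 8;      /* ideal non-zeros per block */
    long acc = 0;
    int b = 1;
    block_start[0] = 0;
    for (int l = 0; l < nlines && b < 8; l++) {
        acc += line_nnz[l];
        if (acc >= (long)b * target)
            block_start[b++] = l + 1;   /* cut after line l */
    }
    while (b <= 8) block_start[b++] = nlines;
}
```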
In step 2, the CPE group is defined as follows:
since one Shenwei processor core group comprises 1 master core and 64 slave cores arranged in an 8 × 8 layout, the 64 slave cores of a core group are divided into 8 groups by row, and the 8 slave cores of each group fall into the following two types (a role-assignment sketch follows this list):
(1) CPE control slave cores: exactly one per group, all located in the same column of the core group's CPE grid; these slave cores are responsible for the following tasks:
(1-1) extracting the coordinates and range information of the tensor block needing to be calculated from the core group main memory, and extracting tensor dimension information corresponding to the factor matrix needing to be updated;
(1-2) dynamically assigning tensor strips to the work slave cores;
(1-3) collecting the calculation results of the work slave cores and sending them back to the main memory.
(2) CPE work slave cores: seven per group, disjoint from the control slave cores; these slave cores are responsible for the following tasks:
(2-1) responsible for completing the calculation of the MTTKRP process according to the data allocated from the control slave core;
(2-2) sending the calculation results to the control slave core.
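To make the grouping concrete, the sketch below derives a slave core's group and role from its index in the 8 × 8 CPE grid. It assumes row-major numbering 0-63 and places the controllers in column 0; the patent only requires that the controllers share one column, so the chosen column is an assumption.

```c
/* Sketch: group and role of a slave core in the 8x8 CPE grid.
   Assumes cores are numbered row-major 0..63; controllers sit in
   column 0 (an assumption -- the patent just says "the same column"). */
typedef enum { CPE_CONTROLLER, CPE_WORKER } cpe_role_t;

static inline int cpe_group(int core_id) { return core_id / 8; }

static inline cpe_role_t cpe_role(int core_id)
{
    return (core_id % 8 == 0) ? CPE_CONTROLLER : CPE_WORKER;
}
```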
In step 2, the method by which the MPE and the control slave cores of the CPE groups randomly sample tensor strips is as follows (a sequential sketch follows this list):
(1) since there are 8 CPE groups in total, the MPE randomly assigns a priority to each CPE group, denoted $P_i$, $i = 1, \ldots, 8$, and sets the number of tensor strips to be sampled each time to IBANDS_TOTAL;
(2) each CPE group determines the number of tensor strips it must compute according to its assigned priority, sampling as follows:
(2-1) for the CPE group with the highest priority ($P_i = 1$): its control slave core randomly generates a tensor strip count $\mathrm{IBANDS}_i$ satisfying $0 < \mathrm{IBANDS}_i < \mathrm{IBANDS\_TOTAL}$ and $0 < \mathrm{IBANDS}_i < \mathrm{IBLOCK\_LEN}_i$, where $\mathrm{IBLOCK\_LEN}_i$ is the number of tensor strips contained in the tensor block allocated to this CPE group; when the random generation finishes, this control slave core sends the remaining samplable count $\mathrm{IBANDS\_TOTAL} - \mathrm{IBANDS}_i$ to the control slave core of the CPE group with the second-highest priority ($P_i = 2$);
(2-2) for a CPE group that does not have the highest priority ($P_i \neq 1$): the first condition on the randomly generated $\mathrm{IBANDS}_i$ becomes $0 < \mathrm{IBANDS}_i <$ the remaining samplable count received from the previous group; the other conditions and the generation procedure are unchanged;
(2-3) repeat (2-2) until every CPE group has generated its $\mathrm{IBANDS}_i$ or the remaining samplable count reaches 0; the number of tensor strips each group must compute is thereby finally determined, completing the random sampling.
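The sketch below simulates this priority chain sequentially: each loop iteration stands in for one control slave core choosing its count and forwarding the remaining budget (which the patent does via register communication). `group_len[i]` plays the role of IBLOCK_LEN for the group of priority i+1; the handling of degenerate budgets is an assumption.

```c
/* Sketch of the priority-chained random sampling, run sequentially. */
#include <stdlib.h>

void sample_bands(const int group_len[8], int ibands_total, int ibands[8])
{
    int remaining = ibands_total;               /* budget handed down the chain */
    for (int i = 0; i < 8; i++) {
        int cap = remaining < group_len[i] ? remaining : group_len[i];
        if (i == 7 || cap <= 1)
            ibands[i] = cap > 0 ? cap : 0;      /* degenerate budget: take what is left */
        else
            ibands[i] = 1 + rand() % (cap - 1); /* 0 < IBANDS_i < cap */
        remaining -= ibands[i];                 /* forwarded to the next priority */
    }
}
```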
In step 3, the main core determines the tensor dimension and tensor blocks corresponding to the factor matrix to be updated this time. Concretely, the factor matrices are obtained by minimizing the tensor canonical decomposition objective function $\min_{A_1, \ldots, A_N} \frac{1}{2} \left\| \mathcal{X} - \sum_{r=1}^{F} A_1(:,r) \circ A_2(:,r) \circ \cdots \circ A_N(:,r) \right\|_F^2$ with one of the sparse tensor canonical decomposition algorithms listed above, where $\mathcal{X}$ is the sparse tensor to be decomposed, the $A_i$ are the factor matrices, $A_i(:,r)$ denotes the r-th column vector of a matrix, $\circ$ is the vector outer product, $\|\cdot\|_F$ denotes the Frobenius norm of a tensor, and F is the rank of the tensor, i.e., the smallest number of vector outer products whose sum can represent the tensor.
In step 5, the format in which the control slave core sends task assignment information to its attached work slave cores is as follows:
(1) the first datum is i_pos, which indicates which non-zero tensor line, i.e., which tensor strip, the sparse tensor data to be computed belongs to;
(2) the second datum is i_id, which indicates which row of the corresponding factor matrix to be updated the sparse tensor data to be computed maps to;
(3) the third datum is i_ptr[i_pos], which indicates the starting position of the tensor fibers contained in the tensor strip currently to be computed, where a tensor fiber is the substructure of a tensor formed by keeping two coordinates of the tensor fixed;
(4) the fourth datum is i_ptr[i_pos+1], which indicates the ending position of the tensor fibers contained in the tensor strip currently to be computed;
in summary, the purpose of the task assignment information is to give the work slave core the coordinate position of the tensor strip to be computed and its correspondence with the factor matrix to be updated (a message-layout sketch follows).
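One natural way to realize this 256-bit message is a struct of four doubles, as sketched below. Packing integer indices into double slots mirrors the patent's format; the type and field names are assumptions for illustration.

```c
/* Sketch of the 256-bit (4 x 64-bit double) task-assignment message. */
typedef struct {
    double i_pos;      /* which non-zero tensor line (band) to compute */
    double i_id;       /* row of the factor matrix this band updates   */
    double fib_begin;  /* i_ptr[i_pos]: first fiber of the band        */
    double fib_end;    /* i_ptr[i_pos+1]: one past the last fiber      */
} assign_msg_t;        /* sizeof(assign_msg_t) == 32 bytes == 256 bits */
```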
In step 6, the work slave core extracts the factor matrix data required by the calculation from the main memory into its own LDM via a software-managed cache, specifically as follows (a lookup sketch follows this list):
(1) a space of size $\mathrm{Memory}_{FM}$ is opened in the work slave core's own LDM as the cache space for factor matrix data. Suppose the factor matrix to be updated is $A_N$; then the factor matrices required by the calculation are $A_i$, $i = 1, \ldots, N-1$, where N is the order of the tensor. The program presets M, the number of row vectors of each factor matrix that can be fully stored in the LDM, which gives $\mathrm{Memory}_{FM} = (N-1) \times M$ row vectors; in addition, $(N-1) \times 2$ variables are kept in the LDM to store the maximum and minimum coordinates of the cached row vectors;
(2) when a task starts, the first M consecutive rows of each factor matrix are extracted into the cache space;
(3) when the computation for a tensor non-zero element $x_{a_1, a_2, \ldots, a_N}$ updates matrix $A_N$, rows $a_1, a_2, \ldots, a_{N-1}$ of the matrices $A_i$, $i = 1, \ldots, N-1$, are needed respectively; the work slave core therefore queries whether each row vector is present in the current cache space in the LDM: if so it is used directly, otherwise rows $a_i, a_i+1, \ldots, a_i+M$ of matrix $A_i$ are extracted from main memory and stored in the LDM.
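The lookup logic can be sketched as below. `fetch_rows()` is a hypothetical stand-in for the DMA transfer; F is the CP rank (row-vector length), and the `[lo, hi)` pair plays the role of the minimum/maximum cached coordinates kept per matrix.

```c
/* Sketch of the software-managed factor-matrix row cache in LDM. */
#define M 64             /* rows cached per matrix (illustrative value) */

typedef struct {
    double *rows;        /* M x F row block held in LDM                 */
    int     lo, hi;      /* main-memory row range currently cached      */
} fm_cache_t;

/* hypothetical DMA read: rows [first_row, first_row+nrows) of matrix m */
extern void fetch_rows(int m, int first_row, int nrows, double *dst);

const double *cache_get_row(fm_cache_t *c, int m, int row, int F)
{
    if (row < c->lo || row >= c->hi) {        /* miss: refill from `row` on */
        fetch_rows(m, row, M, c->rows);
        c->lo = row;
        c->hi = row + M;
    }
    return c->rows + (long)(row - c->lo) * F; /* hit: row resident in LDM   */
}
```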
In step 6, after extracting the factor matrix data required by the calculation from main memory into its own local data memory via the software-managed cache, the work slave core further performs third-level segmentation of the assigned tensor strips according to the usage of its own local data memory. The specific method: let $\mathrm{Memory}_{LDM}$ be the LDM storage remaining after non-array variables such as loop variables are excluded, and $\mathrm{Memory}_{FM}$ the total space reserved for the factor matrix data required by the calculation; the size of one tensor piece is then $\mathrm{Memory}_{LDM} - \mathrm{Memory}_{FM}$.
In step 6, the calculation result information and the completion status are returned to the control slave core in real time via register communication, in the following formats:
(1) the calculation-result message sent to the control slave core is 256 bits long and contains 4 double-type data items, formatted as follows:
(1-1) the first datum is i_pos, indicating which non-zero tensor line the currently updated data belongs to;
(1-2) the second datum is r, indicating which column of the factor matrix to be updated the currently updated data is located in;
(1-3) the third datum is result1, the first result datum;
(1-4) the fourth datum is result2, the second result datum.
(2) the computation-end message sent to the control slave core is 256 bits long and contains 4 double-type data items, formatted as follows:
(2-1) the first datum is i_pos, indicating which row of the corresponding factor matrix to be updated the currently updated data is located in;
(2-2) the second datum is -1, an indicator variable showing that the current line has been fully updated;
(2-3) the third and fourth data are padding used as placeholders.
In step 7, the control slave core dynamically allocates the next computation task to a work slave core according to the computation status of the current tensor block, implemented as follows (a dispatch sketch follows this list):
(1) when a work slave core finishes its calculation task, if uncomputed tensor strips remain, the next tensor strip to be computed is allocated to that work slave core directly;
(2) if all tensor strips have been computed, the control slave core sends, via register communication, a 256-bit message containing 4 double-type data items to all attached work slave cores, with the last datum set to -1, indicating that all computation of this slave core group is finished.
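A minimal controller-side sketch of this dynamic dispatch: `send_assign()` and `recv_done()` are hypothetical wrappers around the register-communication primitives, and handing out one band at a time is what yields the dynamic load balance (fast workers simply receive more bands).

```c
/* Sketch of the control slave core's dynamic dispatch loop. */
extern void send_assign(int worker, int band); /* register-comm send (stub)   */
extern int  recv_done(void);                   /* waits for a result; returns
                                                  the id of the finished worker */

void dispatch_block(int nbands, int nworkers)  /* nworkers == 7 per group */
{
    int next = 0;
    for (int w = 0; w < nworkers && next < nbands; w++)
        send_assign(w, next++);            /* prime every worker once      */

    for (int done = 0; done < nbands; done++) {
        int w = recv_done();               /* worker w reported its result */
        if (next < nbands)
            send_assign(w, next++);        /* immediately hand it more work */
    }
    /* afterwards: broadcast the 4-double end message whose last slot is -1 */
}
```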
In step 12, the iteration termination condition on the objective function value covers two cases: first, the number of iterations reaches its upper limit; second, the difference between the objective function values obtained in two consecutive iterations is smaller than a preset threshold.
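Stated as code, the test is a disjunction of the two cases; this sketch assumes the objective values of the previous and current iterations are tracked on the MPE.

```c
/* Sketch of the two termination tests of step 12. */
#include <math.h>

int should_stop(double f_prev, double f_curr, int iter,
                int max_iter, double threshold)
{
    return iter >= max_iter || fabs(f_prev - f_curr) < threshold;
}
```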
Compared with the prior art, the invention has the advantages that:
(1) By using three-level data division, the sparse tensor canonical decomposition method improves the data parallelism of the computation, makes maximal use of the slave-core local data memory of the Shenwei processor, and optimizes the performance of the algorithms. Existing sparse tensor canonical decomposition techniques generally use only one or two levels of data division; since the slave-core local data memory of the Shenwei processor is limited, directly copying large volumes of data would overflow it.
(2) By grouping the slave core array and using register communication for inter-slave-core interaction, the invention effectively reduces the total memory access latency and avoids load imbalance. Because the latency of register communication between slave cores is much lower than that of accessing main memory, storing part of the tensor data information on the control slave core and having it forwarded to the work slave cores reduces latency; meanwhile, because the distribution of the non-zero elements of a sparse tensor is uncertain, the non-zero counts of different tensor lines often differ greatly, so the invention performs dynamic load balancing by having the control slave core assign tasks to the work slave cores. The prior art neither effectively converts main memory accesses into register communication nor provides a dynamic load balancing scheme.
(3) By designing a random sampling method with master-slave core cooperation, the invention can handle both MTTKRP processes with random sampling and MTTKRP processes without it; existing parallel sparse tensor canonical decomposition techniques are typically optimized for only one of the two.
(4) With these optimizations, the invention outperforms the latest sparse tensor canonical decomposition methods on both generated and real data sets, achieving speedups of up to 25.5x for the ALS method, 37.21x for the GD method, 37.44x for the RBS method, and 39.57x for the fLM method.
Drawings
FIG. 1 is a process diagram of the method proposed by the present invention;
FIG. 2 is a diagram of a hardware architecture for implementing the proposed method of the present invention;
FIG. 3 is a schematic diagram of the multi-level tensor partitioning proposed by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific examples described herein are intended to be illustrative only and are not intended to be limiting. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The process diagram of the invention is shown in fig. 1, and the hardware architecture diagram is shown in fig. 2.
As shown in fig. 1: the method comprises the following specific implementation steps:
Step 1: read the sparse tensor data specified by the user into the core group main memory; according to the specified sparse tensor canonical decomposition algorithm type, convert the sparse tensor data on the SW26010 core group main core (Management Processing Element, MPE) into the sparse tensor storage format CSF (Compressed Sparse Fiber), and perform first-level and second-level data segmentation of the tensor, the segmented data units being tensor blocks and tensor strips (bands) respectively; the algorithm types that the user may specify are as follows:
in step 1, specifying a specific sparse tensor canonical decomposition algorithm type by a user includes: the sparse tensor canonical decomposition method based on the alternating least square method, the sparse tensor canonical decomposition method based on the gradient descent, the sparse tensor canonical decomposition method based on the random block sampling or the sparse tensor canonical decomposition method based on the fast Levenberg-Marquardt algorithm.
Meanwhile, the method for performing the first-level and second-level data segmentation of the sparse tensor is as follows:
(1) First-level segmentation of the tensor: after removing all-zero tensor lines, the tensor is divided into 8 blocks, each comprising a number of complete non-zero tensor lines and an equal number of non-zero elements;
(2) second-level segmentation of the tensor: on the basis of the first-level segmentation, each non-zero tensor line within a tensor block is defined as a tensor strip;
wherein a tensor line refers to a substructure of a tensor formed while keeping one coordinate of the tensor fixed.
Step 2: determine, according to the user-specified sparse tensor canonical decomposition algorithm type, whether the MTTKRP (Matricized Tensor Times Khatri-Rao Product) process contained in the algorithm requires random sampling; if not, go directly to step 3; otherwise the MPE and the control slave cores (CPE controllers) in the slave core (CPE) groups randomly sample tensor strips, the slave core groups being defined as follows:
since one Shenwei processor core group comprises 1 master core and 64 slave cores distributed in an 8 × 8 layout, the 64 slave cores distributed in the 8 × 8 layout in one core group are divided into 8 groups by rows, and the 8 slave cores in each group are divided into the following two types:
(1) CPE control slave cores: exactly one per group, all located in the same column of the core group's slave core grid; these slave cores are responsible for the following tasks:
(1-1) extracting the coordinates and range information of the tensor block needing to be calculated from the core group main memory, and extracting tensor dimension information corresponding to the factor matrix needing to be updated;
(1-2) dynamically assigning tensor strips to the work slave cores;
(1-3) collecting the calculation results of the work slave cores and sending them back to the main memory.
(2) CPE work slave cores: seven per group, disjoint from the control slave cores; these slave cores are responsible for the following tasks:
(2-1) responsible for completing the calculation of the MTTKRP process according to the data allocated from the control slave core;
(2-2) sending the calculation results to the control slave core.
Meanwhile, the random sampling comprises the following specific steps:
(1) since there are 8 slave core groups in total, the MPE randomly assigns a priority to each CPE group, denoted $P_i$, $i = 1, \ldots, 8$, and sets the number of tensor strips to be sampled each time to IBANDS_TOTAL;
(2) each CPE group determines the number of tensor strips it must compute according to its assigned priority, sampling as follows:
(2-1) for the CPE group with the highest priority ($P_i = 1$): the control slave core of this group randomly generates a tensor strip count $\mathrm{IBANDS}_i$ satisfying $0 < \mathrm{IBANDS}_i < \mathrm{IBANDS\_TOTAL}$ and $0 < \mathrm{IBANDS}_i < \mathrm{IBLOCK\_LEN}_i$, where $\mathrm{IBLOCK\_LEN}_i$ is the number of tensor strips contained in the tensor block allocated to this CPE group; when the random generation finishes, this control slave core sends the remaining samplable count $\mathrm{IBANDS\_TOTAL} - \mathrm{IBANDS}_i$ to the control slave core of the CPE group with the second-highest priority ($P_i = 2$);
(2-2) for a CPE group that does not have the highest priority ($P_i \neq 1$): the first condition on the randomly generated $\mathrm{IBANDS}_i$ becomes $0 < \mathrm{IBANDS}_i <$ the remaining samplable count received from the previous group; the other conditions and the generation procedure are unchanged;
(2-3) repeat (2-2) until every CPE group has generated its $\mathrm{IBANDS}_i$ or the remaining samplable count reaches 0; the number of tensor strips each group must compute is thereby finally determined, completing the random sampling.
Step 3: the MPE determines the tensor dimension corresponding to the Factor Matrix to be updated this time and the allocation scheme of the tensor blocks, and stores them in the core group main memory. Concretely, the factor matrices are updated by minimizing the tensor canonical decomposition objective function $\min_{A_1, \ldots, A_N} \frac{1}{2} \left\| \mathcal{X} - \sum_{r=1}^{F} A_1(:,r) \circ A_2(:,r) \circ \cdots \circ A_N(:,r) \right\|_F^2$ with one of the sparse tensor canonical decomposition algorithms listed in step 1, where $\mathcal{X}$ is the sparse tensor to be decomposed, the $A_i$ are the factor matrices, $A_i(:,r)$ denotes the r-th column vector of a matrix, $\circ$ is the vector outer product, $\|\cdot\|_F$ denotes the Frobenius norm of a tensor, and F is the rank of the tensor, i.e., the smallest number of vector outer products whose sum can represent the tensor;
Step 4: each control slave core in the slave core groups extracts, via DMA, the information of the tensor block allocated to its slave core group from the main memory, including the coordinate information of the sparse tensor non-zero elements contained in each tensor strip of the block, and stores it in its own reconfigurable Local Data Memory (LDM);
Step 5: the control slave core distributes factor matrix update tasks to the work slave cores (CPE workers) of its group and sends, via register communication, the coordinates of the sparse tensor non-zero elements contained in the tensor strips of each task to the work slave core that will compute it, in the following format:
(1) the first datum is i_pos, which indicates which non-zero tensor line, i.e., which tensor strip, the sparse tensor data to be computed belongs to;
(2) the second datum is i_id, which indicates which row of the corresponding factor matrix to be updated the sparse tensor data to be computed maps to;
(3) the third datum is i_ptr[i_pos], which indicates the starting position of the tensor fibers contained in the tensor strip currently to be computed, where a tensor fiber is the substructure of a tensor formed by keeping two coordinates of the tensor fixed;
(4) the fourth datum is i_ptr[i_pos+1], the ending position of the tensor fibers contained in the tensor strip currently to be computed;
in summary, the purpose of the task assignment information is to give the work slave core the coordinate position of the tensor strip to be computed and its correspondence with the factor matrix to be updated.
Step 6: after obtaining the coordinate information of the sparse tensor non-zero elements contained in its assigned tensor strips, the work slave core extracts the factor matrix data required by the calculation from the main memory into its own LDM via a software-managed cache, further performs third-level segmentation of the assigned tensor strips according to the usage of its own LDM (the data unit being the tensor piece (tile); one tensor strip contains a plurality of tensor pieces), extracts the tensor pieces in order for calculation, and returns the calculation result information and completion status to the control slave core in real time via register communication; the specific steps by which the work slave core extracts the required factor matrix data from main memory into its own LDM via the software-managed cache are as follows:
(1) a space of size $\mathrm{Memory}_{FM}$ is opened in the work slave core's own LDM as the cache space for factor matrix data. Suppose the factor matrix to be updated is $A_N$; then the factor matrices required by the calculation are $A_i$, $i = 1, \ldots, N-1$, where N is the order of the tensor. The program presets M, the number of row vectors of each factor matrix that can be fully stored in the LDM, which gives $\mathrm{Memory}_{FM} = (N-1) \times M$ row vectors; in addition, $(N-1) \times 2$ variables are kept in the LDM to store the maximum and minimum coordinates of the cached row vectors;
(2) when a task starts, the first M consecutive rows of each factor matrix are extracted into the cache space;
(3) when the computation for a tensor non-zero element $x_{a_1, a_2, \ldots, a_N}$ updates matrix $A_N$, rows $a_1, a_2, \ldots, a_{N-1}$ of the matrices $A_i$, $i = 1, \ldots, N-1$, are needed respectively; the work slave core therefore queries whether each row vector is present in the current cache space in the LDM: if so it is used directly, otherwise rows $a_i, a_i+1, \ldots, a_i+M$ of matrix $A_i$ are extracted from main memory and stored in the LDM;
meanwhile, the work slave core further performs third-level segmentation of the assigned tensor strips according to the usage of its own local data memory, as follows (a tiling sketch is given after this paragraph): let $\mathrm{Memory}_{LDM}$ be the LDM storage remaining after non-array variables such as loop variables are excluded, and $\mathrm{Memory}_{FM}$ the total space reserved for the factor matrix data required by the calculation; the size of one tensor piece is then $\mathrm{Memory}_{LDM} - \mathrm{Memory}_{FM}$.
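The third-level segmentation then amounts to streaming the band's non-zeros through a fixed-size LDM tile, as sketched below; `fetch_nnz()` and `compute_tile()` are hypothetical stand-ins for the DMA read and the MTTKRP kernel, and `cap` is the number of non-zero records that fit in one tensor piece.

```c
/* Sketch: process one tensor strip (band) tile by tile. The tile buffer
   occupies the Memory_LDM - Memory_FM bytes left over in the LDM. */
extern void fetch_nnz(int first, int count, void *tile_buf);  /* DMA read    */
extern void compute_tile(void *tile_buf, int count);          /* MTTKRP step */

void process_band(int nnz_begin, int nnz_end, void *tile_buf, int cap)
{
    for (int t = nnz_begin; t < nnz_end; t += cap) {
        int count = (nnz_end - t < cap) ? (nnz_end - t) : cap;
        fetch_nnz(t, count, tile_buf);   /* pull the next tensor piece in  */
        compute_tile(tile_buf, count);   /* consume it before the next one */
    }
}
```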
Meanwhile, the calculation result information and the completion status are returned to the control slave core via register communication, in the following formats:
(1) the calculation-result message sent to the control slave core is 256 bits long and contains 4 double-type data items, formatted as follows:
(1-1) the first datum is i_pos, indicating which non-zero tensor line the currently updated data belongs to;
(1-2) the second datum is r, indicating which column of the factor matrix to be updated the currently updated data is located in;
(1-3) the third datum is result1, the first result datum;
(1-4) the fourth datum is result2, the second result datum.
(2) the computation-end message sent to the control slave core is 256 bits long and contains 4 double-type data items, formatted as follows:
(2-1) the first datum is i_pos, indicating which row of the corresponding factor matrix to be updated the currently updated data is located in;
(2-2) the second datum is -1, an indicator variable showing that the current line has been fully updated;
(2-3) the third and fourth data are padding used as placeholders.
Step 7: after a work slave core finishes its calculation task, the control slave core dynamically allocates the next calculation task to it according to the computation status of the current tensor block, and sends the update results back to the core group main memory via DMA; the specific steps for allocating the next calculation task are:
(1) when a work slave core finishes its calculation task, if uncomputed tensor strips remain, the next tensor strip to be computed is allocated to that work slave core directly;
(2) if all tensor strips have been computed, the control slave core sends, via register communication, a 256-bit message containing 4 double-type data items to all attached work slave cores, with the last datum set to -1, indicating that all computation of this slave core group is finished;
Step 8: the work slave cores and control slave core of each slave core group repeat steps 6 and 7 until the tensor block calculation tasks allocated to that slave core group are finished;
Step 9: the MPE determines, according to the user-specified sparse tensor canonical decomposition algorithm type, the matrix operations that must be executed in addition to the MTTKRP process; the slave core groups complete these computations using the basic linear algebra library BLAS, and the MPE determines a preliminary update of the factor matrix from the MTTKRP result and the matrix operation results;
Step 10: the MPE changes the tensor dimension corresponding to the factor matrix to be updated and returns to step 3 until all factor matrices have been updated;
Step 11: the MPE determines, according to the user-specified sparse tensor canonical decomposition algorithm type, whether to perform a line search; if not, the factor matrices are updated directly; otherwise they are updated through a line search process;
Step 12: the MPE evaluates the objective function with the updated factor matrix values; if the value satisfies the iteration termination condition, iteration stops, otherwise return to step 2 until it does. The iteration termination condition covers two cases: first, the number of iterations reaches its upper limit; second, the difference between the objective function values obtained in two consecutive iterations is smaller than a preset threshold.
Claims (12)
1. A sparse tensor canonical decomposition method based on data partitioning and task allocation is characterized by comprising the following steps of:
step 1: reading the sparse tensor data specified by the user into the core group main memory according to the specified sparse tensor canonical decomposition algorithm type; converting the sparse tensor data on the main core of the Shenwei processor core group into the sparse tensor storage format CSF, and performing first-level and second-level data segmentation of the sparse tensor, the segmented data units being tensor blocks and tensor strips respectively, one tensor block comprising a plurality of tensor strips; then determining the number of factor matrices of the decomposition result from the dimensionality of the input tensor, i.e., an N-dimensional tensor is decomposed into N factor matrices, the i-th factor matrix corresponding to the i-th dimension of the tensor and having as many row vectors as the length of the i-th dimension; then entering step 2 to obtain the factor matrices iteratively;
step 2: determining, according to the specified sparse tensor canonical decomposition algorithm type, whether the MTTKRP process contained in the algorithm, i.e., the Matricized Tensor Times Khatri-Rao Product, requires random sampling; if not, going directly to step 3; otherwise the main core and the control slave cores in the slave core groups randomly sample tensor strips;
step 3: the main core determines the tensor dimension corresponding to the factor matrix needing to be updated this time and a distribution scheme of tensor blocks, and stores them into the core group main memory;
step 4: each control slave core in the slave core groups extracts, via DMA, the information of the tensor block distributed to its slave core group from the main memory, the information including the coordinate information of the sparse tensor non-zero elements contained in each tensor strip of the tensor block, and stores it in the local data memory of the control slave core;
step 5: the control slave core distributes factor matrix update calculation tasks to the work slave cores of its group and sends, via register communication, the coordinate information of the sparse tensor non-zero elements contained in the tensor strips of each task to the work slave core that will compute it;
step 6: after obtaining the coordinate information of the sparse tensor non-zero elements contained in its assigned tensor strips, the work slave core extracts the factor matrix data required by the calculation from the main memory into its own local data memory via a software-managed cache, and further performs third-level segmentation of the assigned tensor strips according to the usage of its own local data memory, the data unit being the tensor piece, one tensor strip containing a plurality of tensor pieces; the tensor pieces are extracted in order for calculation, and the calculation result information and completion status are returned to the control slave core in real time via register communication;
step 7: after a work slave core finishes its calculation task, the control slave core dynamically distributes the next calculation task to it according to the computation status of the current tensor block, and sends the update results back to the core group main memory via DMA;
step 8: the work slave cores and control slave core of each slave core group repeat steps 6 and 7 until the tensor block calculation tasks distributed to that slave core group are finished;
step 9: the main core determines, according to the user-specified sparse tensor canonical decomposition algorithm type, the matrix operations that must be executed in addition to the MTTKRP process; the slave core groups complete these computations using the basic linear algebra library BLAS, and the main core determines a preliminary update result of the factor matrix from the MTTKRP calculation result and the matrix operation results;
step 10: the main core changes the tensor dimension corresponding to the factor matrix needing to be updated and returns to step 3 until all factor matrices have been updated;
step 11: the main core determines, according to the user-specified sparse tensor canonical decomposition algorithm type, whether to perform a line search; if not, the factor matrix is updated directly; otherwise it is updated through a line search process;
step 12: the main core determines the value of the objective function from the updated factor matrix values; if the value satisfies the iteration termination condition, iteration is terminated, otherwise return to step 2 until the value of the objective function satisfies the iteration termination condition.
2. The sparse tensor canonical decomposition method based on data partitioning and task allocation according to claim 1, wherein: in step 1, the sparse tensor canonical decomposition algorithm types that the user may specify include: the sparse tensor canonical decomposition method based on the alternating least squares method, the one based on gradient descent, the one based on random block sampling, and the one based on the fast Levenberg-Marquardt algorithm.
3. The sparse tensor canonical decomposition method based on data partitioning and task allocation according to claim 1, wherein: in step 1, the method for performing the first-level and second-level data segmentation of the sparse tensor is as follows:
(1) first-level segmentation of the tensor: after removing all-zero tensor lines, the tensor is divided into 8 blocks, each comprising a number of complete non-zero tensor lines and an equal number of non-zero elements;
(2) second-level segmentation of the tensor: on the basis of the first-level segmentation, each non-zero tensor line within a tensor block is defined as a tensor strip; a tensor line refers to the substructure of a tensor formed while keeping one coordinate of the tensor fixed.
4. The sparse tensor canonical decomposition method based on data partitioning and task allocation according to claim 1, wherein: in step 2, the slave core group is defined as follows:
since one Shenwei processor core group comprises 1 master core and 64 slave cores arranged in an 8 × 8 layout, the 64 slave cores of a core group are divided into 8 groups by row, and the 8 slave cores of each group fall into the following two types:
(1) control slave cores: exactly one per group, all located in the same column of the core group's slave core grid; these slave cores are responsible for the following tasks:
(1-1) extracting from the core group main memory the coordinates and range information of the tensor block to be computed, and extracting the tensor dimension information corresponding to the factor matrix to be updated;
(1-2) dynamically assigning tensor strips to the work slave cores;
(1-3) collecting the calculation results of the work slave cores and sending them back to the main memory;
(2) work slave cores: seven per group, disjoint from the control slave cores; these slave cores are responsible for the following tasks:
(2-1) completing the calculation of the MTTKRP process according to the data allocated by the control slave core;
(2-2) sending the calculation results to the control slave core.
5. The sparse tensor canonical decomposition method based on data partitioning and task allocation according to claim 1, wherein: in step 2, the method by which the main core and the control slave cores of the slave core groups randomly sample tensor strips is as follows:
(1) since there are 8 slave core groups in total, the master core randomly assigns a priority to each slave core group, denoted $P_i$, $i = 1, \ldots, 8$, and sets the number of tensor strips to be sampled each time to IBANDS_TOTAL;
(2) each slave core group determines the number of tensor strips it must compute according to its assigned priority, sampling as follows:
(2-1) for the slave core group with the highest priority: its control slave core randomly generates a tensor strip count $\mathrm{IBANDS}_i$ satisfying $0 < \mathrm{IBANDS}_i < \mathrm{IBANDS\_TOTAL}$ and $0 < \mathrm{IBANDS}_i < \mathrm{IBLOCK\_LEN}_i$, where $\mathrm{IBLOCK\_LEN}_i$ is the number of tensor strips contained in the tensor block allocated to that slave core group; when the random generation finishes, this control slave core sends the remaining samplable count $\mathrm{IBANDS\_TOTAL} - \mathrm{IBANDS}_i$ to the control slave core of the slave core group with the second-highest priority;
(2-2) for a slave core group that does not have the highest priority: the first condition on the randomly generated $\mathrm{IBANDS}_i$ becomes $0 < \mathrm{IBANDS}_i <$ the remaining samplable count received from the previous group; the other conditions and the generation procedure are unchanged;
(2-3) (2-2) is repeated until every slave core group has generated its $\mathrm{IBANDS}_i$ or the remaining samplable count reaches 0, whereby the number of tensor strips each group must compute is finally determined and the random sampling is complete.
6. The sparse tensor canonical decomposition method based on data partitioning and calculation allocation according to claim 1, wherein: in step 3, the main core determines the tensor dimension and tensor blocks corresponding to the factor matrix to be updated this time, the factor matrix being updated by minimizing the tensor canonical decomposition objective function $\min_{A_1, \ldots, A_N} \frac{1}{2} \left\| \mathcal{X} - \sum_{r=1}^{F} A_1(:,r) \circ A_2(:,r) \circ \cdots \circ A_N(:,r) \right\|_F^2$ using the sparse tensor canonical decomposition algorithm of claim 2, where $\mathcal{X}$ is the sparse tensor to be decomposed, the $A_i$ are the factor matrices, $A_i(:,r)$ denotes the r-th column vector of a matrix, $\circ$ is the vector outer product, $\|\cdot\|_F$ denotes the Frobenius norm of a tensor, and F is the rank of the tensor, i.e., the smallest number of vector outer products whose sum can represent the tensor.
7. The sparse tensor canonical decomposition method based on data partitioning and calculation allocation according to claim 1, wherein: in step 5, the format in which the control slave core sends task assignment information to its attached work slave cores is as follows:
(1) the first datum is i_pos, which indicates which non-zero tensor line, i.e., which tensor strip, the sparse tensor data to be computed belongs to;
(2) the second datum is i_id, which indicates which row of the corresponding factor matrix to be updated the sparse tensor data to be computed maps to;
(3) the third datum is i_ptr[i_pos], which indicates the starting position of the tensor fibers contained in the tensor strip currently to be computed, where a tensor fiber is the substructure of a tensor formed by keeping two coordinates of the tensor fixed;
(4) the fourth datum is i_ptr[i_pos+1], which indicates the ending position of the tensor fibers contained in the tensor strip currently to be computed;
the purpose of the task assignment information is to give the work slave core the coordinate position of the tensor strip to be computed and its correspondence with the factor matrix to be updated.
8. The sparse tensor canonical decomposition method based on data partitioning and task allocation according to claim 1, wherein: in step 6, the work extracts the factor matrix data required by the calculation from the main memory to the local data storage in a self-constructed cache manner, and the specific extraction manner is as follows:
(1) a buffer of size Memory_FM is opened in the worker slave core's local data store to hold the factor matrix data; supposing the factor matrix to be updated is A_N, the factor matrices required by the computation are A_i, i = 1, …, N−1, where N is the order of the tensor; the program fixes at M the number of row vectors of each factor matrix kept fully resident in the local data store, which gives Memory_FM = (N−1) × M rows; in addition, (N−1) × 2 variables are kept in the local data store to record the maximum and minimum coordinates of the cached row vectors;
(2) when a task starts, the first M consecutive rows of each factor matrix are extracted into the cache space;
(3) when processing a non-zero tensor element x_{a_1,a_2,…,a_N} to update the matrix A_N, row a_i of each matrix A_i, i = 1, …, N−1, is needed; the cache space in the current local data store is therefore queried for each row vector: on a hit the row is used directly, otherwise rows a_i, a_i+1, …, a_i+M−1 of A_i are extracted from main memory into the local data store; a sketch of this cache follows.
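A hedged C sketch of the per-matrix software cache described in (1)–(3). The DMA helper `dma_get_rows` is a stand-in for the real SW26010 main-memory-to-LDM transfer interface, which the claim does not name.

```c
/* Per-factor-matrix software cache resident in the local data store:
 * M consecutive rank-F row vectors plus the cached coordinate range. */
typedef struct {
    double *rows;    /* M x F row vectors held in LDM              */
    long    lo, hi;  /* cached row coordinates: [lo, hi), hi-lo = M */
    long    M, F;
} row_cache_t;

/* Hypothetical DMA helper: copy rows [lo, lo+M) of factor matrix n
 * from main memory into the LDM buffer dst. */
extern void dma_get_rows(int n, long lo, long M, double *dst);

/* Return a pointer to row a of factor matrix n, refilling the cache
 * window from main memory on a miss. */
static double *cache_lookup(row_cache_t *c, int n, long a)
{
    if (a < c->lo || a >= c->hi) {        /* miss: refill the window */
        dma_get_rows(n, a, c->M, c->rows);
        c->lo = a;
        c->hi = a + c->M;
    }
    return c->rows + (a - c->lo) * c->F;  /* hit: offset inside LDM  */
}
```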
9. The sparse tensor canonical decomposition method based on data partitioning and task allocation according to claim 1, wherein: in step 6, after the worker slave core has set up the self-built cache of factor matrix data in its local data store, it further applies a third-level segmentation to the tensor strips assigned to it according to the usage of its local data store. The segmentation rule is: let Memory_LDM be the storage remaining in the local data store once non-array variables such as loop variables are excluded, and let Memory_FM be the total space reserved for the factor matrix data required by the computation; then the size of one tensor slice is Memory_LDM − Memory_FM.
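As a worked example (the numbers are assumed for illustration, not taken from the patent): each SW26010 compute core owns a 64 KB local data store; if loop variables and other scalars take 2 KB, so that Memory_LDM = 62 KB, and the factor matrix cache reserves Memory_FM = 24 KB, then each tensor slice may occupy at most 62 − 24 = 38 KB.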
10. The sparse tensor canonical decomposition method based on data partitioning and task allocation according to claim 1, wherein: in step 6, the worker slave core returns the computation-result information and the computation-finished information to the control slave core in real time via register communication, with the following specific formats:
(1) the computation-result message sent to the control slave core is 256 bits long and contains 4 double-type data, laid out as follows:
(1-1) the first datum is i_pos, indicating which non-zero tensor row the currently updated data lies in;
(1-2) the second datum is r, indicating which column of the factor matrix to be updated the currently updated data lies in;
(1-3) the third datum is result1, the first result value;
(1-4) the fourth datum is result2, the second result value;
(2) the computation-finished message sent to the control slave core is likewise 256 bits long and contains 4 double-type data, laid out as follows:
(2-1) the first datum is i_pos, indicating which row of the corresponding factor matrix to be updated the currently updated data lies in;
(2-2) the second datum is −1, an indicator variable showing that the current row has been fully updated;
(2-3) the third and fourth data are padding that only fills out the message; a sketch of both layouts follows.
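A hedged C sketch of the two reply messages sharing one 256-bit layout; the −1 sentinel in the second field follows the claim, while the struct itself and the constructor are illustrative assumptions.

```c
/* One 256-bit worker-to-control reply: 4 x 64-bit doubles. */
typedef struct {
    double i_pos;    /* tensor row / factor-matrix row being updated    */
    double r;        /* factor-matrix column, or -1.0 for "row done"    */
    double result1;  /* first result value (padding in "done" message)  */
    double result2;  /* second result value (padding in "done" message) */
} reply_msg_t;

/* Build the computation-finished notification for row i_pos. */
static reply_msg_t make_done_msg(double i_pos)
{
    reply_msg_t m = { i_pos, -1.0, 0.0, 0.0 };  /* -1 marks completion */
    return m;
}
```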
11. The sparse tensor canonical decomposition method based on data partitioning and task allocation according to claim 1, wherein: in step 7, the control slave core dynamically assigns the next computation task to a worker slave core according to the computation progress of the current tensor block, as follows:
(1) when a worker slave core finishes its computation task, if uncomputed tensor strips remain, the next tensor strip in order is assigned to that worker slave core directly;
(2) if all tensor strips have been computed, the control slave core sends, via register communication, a 256-bit message containing 4 double-type data to every attached worker slave core, with the last datum set to −1 to indicate that the entire computation of this slave core group has finished; a dispatch-loop sketch follows.
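A minimal C sketch of this dynamic dispatch loop on the control slave core, reusing the message layouts sketched above; `reg_recv_reply`, `reg_send_task`, and `reg_send_stop` are invented stand-ins for the register-communication primitives.

```c
typedef struct { double i_pos, i_id, fiber_beg, fiber_end; } task_msg_t;
typedef struct { double i_pos, r, result1, result2; } reply_msg_t;

extern reply_msg_t reg_recv_reply(int *worker);       /* blocking receive */
extern void reg_send_task(int worker, task_msg_t t);  /* 256-bit send     */
extern void reg_send_stop(int worker);                /* last datum = -1  */

void dispatch(task_msg_t *strips, long n_strips, int n_workers)
{
    long next = 0, done = 0;

    /* Prime every worker slave core with one tensor strip. */
    for (int w = 0; w < n_workers && next < n_strips; w++)
        reg_send_task(w, strips[next++]);

    /* Hand the next strip to whichever worker reports completion. */
    while (done < n_strips) {
        int w;
        reply_msg_t rep = reg_recv_reply(&w);
        if (rep.r == -1.0) {             /* "row done": strip finished */
            done++;
            if (next < n_strips)
                reg_send_task(w, strips[next++]);
        }
        /* messages with rep.r >= 0 carry partial results to merge */
    }

    /* All strips computed: broadcast the stop message. */
    for (int w = 0; w < n_workers; w++)
        reg_send_stop(w);
}
```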
12. The sparse tensor canonical decomposition method based on data partitioning and task allocation according to claim 1, wherein: in step 12, the value of the objective function satisfies the iteration-termination condition in either of two cases: first, the iteration count reaches its upper limit; second, the difference between the objective function values obtained in two consecutive iterations is smaller than a preset threshold Threshold. A sketch of this check follows.
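A one-function C sketch of this stopping rule; the variable names are assumptions.

```c
#include <math.h>

/* Stop when the iteration cap is reached, or when the objective
 * changed by less than the preset threshold between two consecutive
 * iterations. */
static int should_stop(long iter, long max_iter,
                       double f_prev, double f_cur, double threshold)
{
    return iter >= max_iter || fabs(f_prev - f_cur) < threshold;
}
```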
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011639166.XA | 2020-12-31 | 2020-12-31 | Sparse tensor canonical decomposition method based on data division and task allocation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112765094A (en) | 2021-05-07 |
CN112765094B (en) | 2022-09-30 |
Family
ID=75698380
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011639166.XA | Sparse tensor canonical decomposition method based on data division and task allocation | 2020-12-31 | 2020-12-31 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112765094B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2018128708A * | 2017-02-06 | 2018-08-16 | Nippon Telegraph and Telephone Corporation | Tensor factor decomposition processing apparatus, tensor factor decomposition processing method and tensor factor decomposition processing program |
CN108446253A * | 2018-03-28 | 2018-08-24 | Beihang University | A parallel computing method for sparse matrix-vector multiplication on the Shenwei many-core architecture |
CN110362780A * | 2019-07-17 | 2019-10-22 | Beihang University | A big data tensor canonical decomposition calculation method based on the Shenwei many-core processor |
Non-Patent Citations (1)
Title |
---|
Wu Yu et al.: "Parallel CP tensor decomposition algorithm combined with GPU technology" (结合GPU技术的并行CP张量分解算法), 《计算机科学》 (Computer Science) * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113704691A * | 2021-08-26 | 2021-11-26 | Institute of Software, Chinese Academy of Sciences | Small-scale symmetric matrix parallel tridiagonalization method for the Shenwei many-core processor |
CN113704691B * | 2021-08-26 | 2023-04-25 | Institute of Software, Chinese Academy of Sciences | Small-scale symmetric matrix parallel tridiagonalization method for the Shenwei many-core processor |
CN114416605A * | 2021-12-23 | 2022-04-29 | Shenzhen Intellifusion Technologies Co., Ltd. | Storage space allocation method, terminal device and computer readable storage medium |
CN114970294A * | 2022-08-02 | 2022-08-30 | Shandong Computer Science Center (National Supercomputer Center in Jinan) | Three-dimensional strain simulation PCG parallel optimization method and system based on the Shenwei architecture |
CN114970294B * | 2022-08-02 | 2022-10-25 | Shandong Computer Science Center (National Supercomputer Center in Jinan) | Three-dimensional strain simulation PCG parallel optimization method and system based on the Shenwei architecture |
CN115146780A * | 2022-08-30 | 2022-10-04 | Zhejiang Lab | Method and device for cooperative quantum tensor network transposition and contraction |
CN115146780B * | 2022-08-30 | 2023-07-11 | Zhejiang Lab | Quantum tensor network transposition and contraction cooperative method and device |
Also Published As
Publication number | Publication date |
---|---|
CN112765094B (en) | 2022-09-30 |
Similar Documents
Publication | Title |
---|---|
CN112765094B (en) | Sparse tensor canonical decomposition method based on data division and task allocation |
Humphrey et al. | CULA: hybrid GPU accelerated linear algebra routines |
US9038088B2 | Load balancing on hetrogenous processing cluster based on exceeded load imbalance factor threshold determined by total completion time of multiple processing phases |
Farhat et al. | Mesh partitioning for implicit computations via iterative domain decomposition: impact and optimization of the subdomain aspect ratio |
Liu et al. | Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors |
Ma et al. | Optimizing sparse tensor times matrix on GPUs |
EP2657842B1 | Workload optimization in a multi-processor system executing sparse-matrix vector multiplication |
Economon et al. | Towards high-performance optimizations of the unstructured open-source SU2 suite |
Chen et al. | aeSpTV: An adaptive and efficient framework for sparse tensor-vector product kernel on a high-performance computing platform |
Zhang et al. | Automatic irregularity-aware fine-grained workload partitioning on integrated architectures |
US7983890B2 | Method and apparatus performing automatic mapping for a multi-processor system |
Lee et al. | Optimization of GPU-based sparse matrix multiplication for large sparse networks |
Stripinis et al. | On MATLAB experience in accelerating DIRECT-GLce algorithm for constrained global optimization through dynamic data structures and parallelization |
Rolinger et al. | Performance considerations for scalable parallel tensor decomposition |
Chang et al. | A Hypergraph-Based Workload Partitioning Strategy for Parallel Data Aggregation |
Clarke et al. | Fupermod: A framework for optimal data partitioning for parallel scientific applications on dedicated heterogeneous hpc platforms |
Yu et al. | Numa-aware optimization of sparse matrix-vector multiplication on armv8-based many-core architectures |
Lu et al. | Tilesptrsv: a tiled algorithm for parallel sparse triangular solve on gpus |
Liu et al. | Parallel reconstruction of neighbor-joining trees for large multiple sequence alignments using CUDA |
Zhou et al. | FASTCF: FPGA-based accelerator for stochastic-gradient-descent-based collaborative filtering |
Tian et al. | swSuperLU: A highly scalable sparse direct solver on Sunway manycore architecture |
Ahmad et al. | Exploring data layout for sparse tensor times dense matrix on GPUs |
Sefidgar et al. | Parallelization of torsion finite element code using compressed stiffness matrix algorithm |
CN116167304B (en) | GMRES optimization method and system for reservoir numerical simulation based on the Shenwei architecture |
Matstoms | Parallel sparse QR factorization on shared memory architectures |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||