CN112765094A - Sparse tensor canonical decomposition method based on data division and calculation distribution
- Publication number
- CN112765094A (application CN202011639166.XA)
- Authority
- CN
- China
- Prior art keywords
- tensor
- data
- core
- slave
- sparse
- Prior art date
- Legal status
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
- G06F15/163—Interprocessor communication
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/14—Handling requests for interconnection or transfer
- G06F13/20—Handling requests for interconnection or transfer for access to input/output bus
- G06F13/28—Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
- G06F13/30—Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal with priority control
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
- G06F15/8053—Vector processors
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Complex Calculations (AREA)
Abstract
The invention relates to a sparse tensor canonical decomposition method based on data division and task allocation, comprising the following steps: initially performing multi-level division and task allocation over the processing cores of a core group according to the many-core characteristics of the Shenwei processor; initially performing multi-level segmentation of the sparse tensor data; designing communication strategies for sparse tensor canonical decomposition using the register communication feature of the Shenwei SW26010 processor; and, targeting the common performance bottleneck of different sparse tensor canonical decomposition methods, namely the differing requirements of the Matricized Tensor Times Khatri-Rao Product (MTTKRP) in concrete operation (whether tensor elements must be randomly sampled for calculation), designing different MTTKRP computation schemes that exploit the characteristics of the Shenwei processor. The invention fully exploits the characteristics of the Shenwei architecture, fully considers the computational requirements of sparse tensor decomposition, can execute a variety of sparse tensor canonical decomposition methods on the Shenwei architecture in parallel and efficiently, and ensures dynamic load balance to the maximum extent.
Description
Technical Field
The invention relates to the fields of tensor numerical algorithms, parallel computing, and the Shenwei architecture, and in particular to a sparse tensor canonical decomposition method based on data division and computation distribution.
Background
The "Shenwei TaihuLight" supercomputer system, developed by the National Research Center of Parallel Computer Engineering and Technology with the support of the national 863 Program, is the world's first supercomputer with a peak performance exceeding 100 PFlops. It was ranked the world's most powerful supercomputer by the international TOP500 organization four consecutive times during 2016-2017, and an application running on it won the Gordon Bell Prize. Ultra-large-scale parallel applications implemented by users on the system can span millions of cores, covering fields that include deep learning, earthquake simulation, quantum circuit simulation, and climate simulation.
The Shenwei TaihuLight supercomputer adopts the SW26010 heterogeneous many-core processor developed by the National High Performance Integrated Circuit (Shanghai) Design Center. In terms of basic architecture, 4 core groups are integrated on the processor chip; each core group comprises an array of 64 slave cores (computing cores) and one master core (management/control core), so the whole chip contains 4 control cores and 256 computing cores in total, all running at 1.45 GHz. Within each core group, the control core is responsible for task management and, like most current mainstream CPUs, supports memory management, interrupts, and out-of-order execution; the computing cores are responsible for high-efficiency calculation and are equipped with vectorization units and other components supporting high-speed operation. In terms of memory design, the SW26010 processor adopts a multi-level memory architecture that includes two levels of cache, sized 32KB and 256KB respectively, and provides 8GB of main memory for each core group. To give users a better parallel development environment, the Shenwei TaihuLight platform supports several programming languages, such as C++, C, and Fortran, and provides interfaces to parallel libraries and standards such as MPI and OpenMP.
In the computer field, tensors are generally regarded as multi-dimensional arrays: a scalar is a 0th-order tensor, a vector a 1st-order tensor, a matrix a 2nd-order tensor, and tensors of third order or higher are called high-order tensors. A sparse tensor is a tensor in which the majority of values are zero. In practical engineering and research, much data is represented by high-order sparse tensors. For example, user evaluation data such as Amazon's, which includes product information, user ratings, and other data dimensions, has non-zero values accounting for only about one billionth of the total data volume; the paper information published at the Neural Information Processing Systems (NIPS) conference, with dimensions such as paper ID, author information, and publication year, has a non-zero ratio of about one millionth; and e-mail sending/receiving data, containing multi-dimensional information such as mailbox names and dates, has a non-zero ratio of about one part per billion.
Due to the widespread use of tensors and the explosive development of big data analysis and data mining techniques, tensor analysis techniques are receiving more and more attention from researchers. One frequently used tensor analysis technique is the Canonical Polyadic Decomposition (CPD) algorithm, proposed by Hitchcock in 1927. CPD represents a tensor by a finite sum of rank-1 tensors, i.e. $\mathcal{X} \approx \sum_{r=1}^{F} a_r^{(1)} \circ a_r^{(2)} \circ \cdots \circ a_r^{(N)}$, where $\mathcal{X}$ denotes the tensor, $a_r^{(n)}$ denotes a vector, and $\circ$ denotes the vector outer product. CPD is currently used in many research areas, including radiation source location identification, the Internet of Things, and signal processing. Because sparse tensors and CPD are so widely used, there is much research implementing CPD on sparse tensors, such as the MATLAB Tensor Toolbox, GigaTensor, SPLATT (the Surprisingly ParalleL spArse Tensor Toolkit), DFacTo, and HyperTensor. However, all of these implementations target homogeneous single-core or multi-core CPUs or GPUs; no sparse tensor CPD algorithm adapted to the special many-core architecture of the domestic Shenwei processor has yet appeared, and current parallelized tensor canonical decomposition algorithms are relatively limited in variety. By deeply exploiting the heterogeneity, manually controlled cache, register communication, and other characteristics of the domestic Shenwei SW26010 processor, this invention completes a parallel design for several CPD algorithms and improves the performance of tensor canonical decomposition.
Disclosure of Invention
The invention addresses the following problem: overcoming the defects and shortcomings of the prior art, it provides a sparse tensor canonical decomposition method based on data division and task allocation that uses the characteristics of the Shenwei architecture to improve the performance of various sparse tensor canonical decomposition algorithms and fully exploits the potential of the Shenwei many-core architecture.
The technical solution of the invention is a sparse tensor canonical decomposition method based on data division and task allocation, comprising the following steps:
Step 1: read the sparse tensor data specified by the user into the core group main memory; according to the specified sparse tensor canonical decomposition algorithm type, convert the sparse tensor data on the SW26010 core group main core (Management Processing Element, MPE) into the sparse tensor storage format CSF (Compressed Sparse Fiber), and perform first-level and second-level data segmentation of the sparse tensor, the segmented data units being tensor blocks and tensor strips (bands), where one tensor block comprises a plurality of tensor strips; then determine the number of factor matrices of the decomposition result from the dimensionality of the input tensor, i.e., an N-dimensional tensor is decomposed into N factor matrices, the i-th factor matrix corresponding to the i-th dimension of the tensor and having as many row vectors as the length of the i-th dimension; then enter step 2 to obtain the factor matrices iteratively (an illustrative CSF layout is sketched below);
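For orientation, the following is a minimal sketch of what a CSF-style structure for a 3rd-order tensor can look like: CSF stores non-zeros as a hierarchy of compressed fiber pointers. The field names and layout are illustrative assumptions, not the patent's actual data structure.

```c
/* Illustrative CSF-style storage for a 3rd-order sparse tensor.
   All names are hypothetical; the patent does not fix a layout. */
typedef struct {
    int     nlines;     /* non-zero tensor lines (first coordinate fixed)   */
    int    *line_ids;   /* mode-0 coordinate of each line                   */
    int    *fiber_ptr;  /* fiber_ptr[l]..fiber_ptr[l+1]-1: fibers of line l */
    int    *fiber_ids;  /* mode-1 coordinate of each fiber                  */
    int    *nnz_ptr;    /* nnz_ptr[f]..nnz_ptr[f+1]-1: non-zeros of fiber f */
    int    *nnz_ids;    /* mode-2 coordinate of each non-zero               */
    double *vals;       /* non-zero values                                  */
} csf3_t;
```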
Step 2: determine, according to the specified sparse tensor canonical decomposition algorithm type, whether the MTTKRP process contained in the algorithm requires random sampling; if not, go directly to step 3; otherwise the MPE and the control slave cores (CPE controllers) in the slave core (Computing Processing Element, CPE) groups randomly sample tensor strips;
Step 3: the MPE determines the tensor dimension corresponding to the Factor Matrix to be updated this time and the allocation scheme of the tensor blocks, and stores them in the core group main memory;
Step 4: each control slave core in the slave core groups extracts, via DMA, the information of the tensor block allocated to its slave core group from the main memory, including the coordinate information of the sparse tensor non-zero elements contained in each tensor strip of the block, and stores it in its own Local Data Memory (LDM);
Step 5: the control slave core distributes factor matrix update tasks to the work slave cores (CPE workers) of its group and sends, via register communication, the coordinate information of the sparse tensor non-zero elements contained in the tensor strips of each task to the work slave core that will compute it;
Step 6: after obtaining the coordinates of the sparse tensor non-zero elements contained in its assigned tensor strips, the work slave core extracts the factor matrix data required by the calculation from the main memory into its own LDM via a software-managed cache, performs third-level segmentation of the assigned tensor strips according to the usage of its own LDM (the data unit being the tensor piece), extracts the tensor pieces in order for calculation, and returns the calculation result information and completion status to the control slave core in real time via register communication;
Step 7: after a work slave core finishes its calculation task, the control slave core dynamically allocates the next calculation task to it according to the computation status of the current tensor block, and sends the update results back to the core group main memory via DMA;
Step 8: the work slave cores and control slave core of each slave core group repeat steps 6 and 7 until the tensor block calculation tasks allocated to that slave core group are finished;
Step 9: the MPE determines, according to the user-specified sparse tensor canonical decomposition algorithm type, the matrix operations that must be executed in addition to the MTTKRP process; the slave core groups complete these computations using the basic linear algebra library BLAS, and the MPE determines a preliminary update of the factor matrix from the MTTKRP result and the matrix operation results;
Step 10: the MPE changes the tensor dimension corresponding to the factor matrix to be updated and returns to step 3 until all factor matrices have been updated;
Step 11: the MPE determines, according to the user-specified sparse tensor canonical decomposition algorithm type, whether to perform a line search; if not, the factor matrices are updated directly; otherwise they are updated through a line search process;
Step 12: the MPE evaluates the objective function with the updated factor matrix values; if the value satisfies the iteration termination condition, iteration stops, otherwise return to step 2 until it does.
In step 1, the sparse tensor canonical decomposition algorithm types that the user may specify include: the sparse tensor canonical decomposition method based on the Alternating Least Squares (ALS) method, the one based on Gradient Descent (GD), the one based on Random Block Sampling (RBS), and the one based on the fast Levenberg-Marquardt algorithm (fLM).
In step 1, the method for performing the first-level and second-level data segmentation of the sparse tensor is as follows (a partitioning sketch follows this list):
(1) First-level segmentation of the tensor: after removing all-zero tensor lines, the tensor is divided into 8 blocks, each comprising a number of complete non-zero tensor lines and an equal number of non-zero elements;
(2) second-level segmentation of the tensor: on the basis of the first-level segmentation, each non-zero tensor line within a tensor block is defined as a tensor strip;
wherein a tensor line refers to a substructure of a tensor formed while keeping one coordinate of the tensor fixed.
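As a concrete illustration of the first-level split, the sketch below cuts the sequence of non-zero tensor lines into 8 contiguous blocks of (approximately) equal non-zero count. It is a minimal sequential sketch under assumed inputs (`line_nnz`, the per-line non-zero counts); the patent only prescribes the equal-nnz property, not this particular algorithm.

```c
/* Sketch: split nlines non-zero tensor lines into 8 blocks with roughly
   equal numbers of non-zeros. Lines of block b are
   block_start[b] .. block_start[b+1]-1 (block_start has 9 entries). */
void split_blocks(const long *line_nnz, int nlines, int block_start[9])
{
    long total = 0;
    for (int l = 0; l < nlines; l++) total += line_nnz[l];

    long target = (total + 7) / 8;      /* ideal non-zeros per block */
    long acc = 0;
    int b = 1;
    block_start[0] = 0;
    for (int l = 0; l < nlines && b < 8; l++) {
        acc += line_nnz[l];
        if (acc >= (long)b * target)
            block_start[b++] = l + 1;   /* cut after line l */
    }
    while (b <= 8) block_start[b++] = nlines;
}
```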
In step 2, the CPE group is defined as follows:
since one Shenwei processor core group comprises 1 master core and 64 slave cores arranged in an 8 × 8 layout, the 64 slave cores of a core group are divided into 8 groups by row, and the 8 slave cores of each group fall into the following two types (a role-assignment sketch follows this list):
(1) CPE control slave cores: exactly one per group, all located in the same column of the core group's CPE grid; these slave cores are responsible for the following tasks:
(1-1) extracting the coordinates and range information of the tensor block needing to be calculated from the core group main memory, and extracting tensor dimension information corresponding to the factor matrix needing to be updated;
(1-2) dynamically assigning tensor strips to the work slave cores;
(1-3) collecting the calculation results of the work slave cores and sending them back to the main memory.
(2) CPE work slave cores: seven per group, disjoint from the control slave cores; these slave cores are responsible for the following tasks:
(2-1) responsible for completing the calculation of the MTTKRP process according to the data allocated from the control slave core;
(2-2) sending the calculation results to the control slave core.
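To make the grouping concrete, the sketch below derives a slave core's group and role from its index in the 8 × 8 CPE grid. It assumes row-major numbering 0-63 and places the controllers in column 0; the patent only requires that the controllers share one column, so the chosen column is an assumption.

```c
/* Sketch: group and role of a slave core in the 8x8 CPE grid.
   Assumes cores are numbered row-major 0..63; controllers sit in
   column 0 (an assumption -- the patent just says "the same column"). */
typedef enum { CPE_CONTROLLER, CPE_WORKER } cpe_role_t;

static inline int cpe_group(int core_id) { return core_id / 8; }

static inline cpe_role_t cpe_role(int core_id)
{
    return (core_id % 8 == 0) ? CPE_CONTROLLER : CPE_WORKER;
}
```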
In step 2, the method by which the MPE and the control slave cores of the CPE groups randomly sample tensor strips is as follows (a sequential sketch follows this list):
(1) since there are 8 CPE groups in total, the MPE randomly assigns a priority to each CPE group, denoted $P_i$, $i = 1, \ldots, 8$, and sets the number of tensor strips to be sampled each time to IBANDS_TOTAL;
(2) each CPE group determines the number of tensor strips it must compute according to its assigned priority, sampling as follows:
(2-1) for the CPE group with the highest priority ($P_i = 1$): its control slave core randomly generates a tensor strip count $\mathrm{IBANDS}_i$ satisfying $0 < \mathrm{IBANDS}_i < \mathrm{IBANDS\_TOTAL}$ and $0 < \mathrm{IBANDS}_i < \mathrm{IBLOCK\_LEN}_i$, where $\mathrm{IBLOCK\_LEN}_i$ is the number of tensor strips contained in the tensor block allocated to this CPE group; when the random generation finishes, this control slave core sends the remaining samplable count $\mathrm{IBANDS\_TOTAL} - \mathrm{IBANDS}_i$ to the control slave core of the CPE group with the second-highest priority ($P_i = 2$);
(2-2) for a CPE group that does not have the highest priority ($P_i \neq 1$): the first condition on the randomly generated $\mathrm{IBANDS}_i$ becomes $0 < \mathrm{IBANDS}_i <$ the remaining samplable count received from the previous group; the other conditions and the generation procedure are unchanged;
(2-3) repeat (2-2) until every CPE group has generated its $\mathrm{IBANDS}_i$ or the remaining samplable count reaches 0; the number of tensor strips each group must compute is thereby finally determined, completing the random sampling.
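The sketch below simulates this priority chain sequentially: each loop iteration stands in for one control slave core choosing its count and forwarding the remaining budget (which the patent does via register communication). `group_len[i]` plays the role of IBLOCK_LEN for the group of priority i+1; the handling of degenerate budgets is an assumption.

```c
/* Sketch of the priority-chained random sampling, run sequentially. */
#include <stdlib.h>

void sample_bands(const int group_len[8], int ibands_total, int ibands[8])
{
    int remaining = ibands_total;               /* budget handed down the chain */
    for (int i = 0; i < 8; i++) {
        int cap = remaining < group_len[i] ? remaining : group_len[i];
        if (i == 7 || cap <= 1)
            ibands[i] = cap > 0 ? cap : 0;      /* degenerate budget: take what is left */
        else
            ibands[i] = 1 + rand() % (cap - 1); /* 0 < IBANDS_i < cap */
        remaining -= ibands[i];                 /* forwarded to the next priority */
    }
}
```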
In step 3, the main core determines the tensor dimension and tensor blocks corresponding to the factor matrix to be updated this time. Concretely, the factor matrices are obtained by minimizing the tensor canonical decomposition objective function $\min_{A_1, \ldots, A_N} \frac{1}{2} \left\| \mathcal{X} - \sum_{r=1}^{F} A_1(:,r) \circ A_2(:,r) \circ \cdots \circ A_N(:,r) \right\|_F^2$ with one of the sparse tensor canonical decomposition algorithms listed above, where $\mathcal{X}$ is the sparse tensor to be decomposed, the $A_i$ are the factor matrices, $A_i(:,r)$ denotes the r-th column vector of a matrix, $\circ$ is the vector outer product, $\|\cdot\|_F$ denotes the Frobenius norm of a tensor, and F is the rank of the tensor, i.e., the smallest number of vector outer products whose sum can represent the tensor.
In step 5, the format in which the control slave core sends task assignment information to its attached work slave cores is as follows:
(1) the first datum is i_pos, which indicates which non-zero tensor line, i.e., which tensor strip, the sparse tensor data to be computed belongs to;
(2) the second datum is i_id, which indicates which row of the corresponding factor matrix to be updated the sparse tensor data to be computed maps to;
(3) the third datum is i_ptr[i_pos], which indicates the starting position of the tensor fibers contained in the tensor strip currently to be computed, where a tensor fiber is the substructure of a tensor formed by keeping two coordinates of the tensor fixed;
(4) the fourth datum is i_ptr[i_pos+1], which indicates the ending position of the tensor fibers contained in the tensor strip currently to be computed;
in summary, the purpose of the task assignment information is to give the work slave core the coordinate position of the tensor strip to be computed and its correspondence with the factor matrix to be updated (a message-layout sketch follows).
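One natural way to realize this 256-bit message is a struct of four doubles, as sketched below. Packing integer indices into double slots mirrors the patent's format; the type and field names are assumptions for illustration.

```c
/* Sketch of the 256-bit (4 x 64-bit double) task-assignment message. */
typedef struct {
    double i_pos;      /* which non-zero tensor line (band) to compute */
    double i_id;       /* row of the factor matrix this band updates   */
    double fib_begin;  /* i_ptr[i_pos]: first fiber of the band        */
    double fib_end;    /* i_ptr[i_pos+1]: one past the last fiber      */
} assign_msg_t;        /* sizeof(assign_msg_t) == 32 bytes == 256 bits */
```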
In step 6, the work slave core extracts the factor matrix data required by the calculation from the main memory into its own LDM via a software-managed cache, specifically as follows (a lookup sketch follows this list):
(1) a space of size $\mathrm{Memory}_{FM}$ is opened in the work slave core's own LDM as the cache space for factor matrix data. Suppose the factor matrix to be updated is $A_N$; then the factor matrices required by the calculation are $A_i$, $i = 1, \ldots, N-1$, where N is the order of the tensor. The program presets M, the number of row vectors of each factor matrix that can be fully stored in the LDM, which gives $\mathrm{Memory}_{FM} = (N-1) \times M$ row vectors; in addition, $(N-1) \times 2$ variables are kept in the LDM to store the maximum and minimum coordinates of the cached row vectors;
(2) when a task starts, the first M consecutive rows of each factor matrix are extracted into the cache space;
(3) when the computation for a tensor non-zero element $x_{a_1, a_2, \ldots, a_N}$ updates matrix $A_N$, rows $a_1, a_2, \ldots, a_{N-1}$ of the matrices $A_i$, $i = 1, \ldots, N-1$, are needed respectively; the work slave core therefore queries whether each row vector is present in the current cache space in the LDM: if so it is used directly, otherwise rows $a_i, a_i+1, \ldots, a_i+M$ of matrix $A_i$ are extracted from main memory and stored in the LDM.
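The lookup logic can be sketched as below. `fetch_rows()` is a hypothetical stand-in for the DMA transfer; F is the CP rank (row-vector length), and the `[lo, hi)` pair plays the role of the minimum/maximum cached coordinates kept per matrix.

```c
/* Sketch of the software-managed factor-matrix row cache in LDM. */
#define M 64             /* rows cached per matrix (illustrative value) */

typedef struct {
    double *rows;        /* M x F row block held in LDM                 */
    int     lo, hi;      /* main-memory row range currently cached      */
} fm_cache_t;

/* hypothetical DMA read: rows [first_row, first_row+nrows) of matrix m */
extern void fetch_rows(int m, int first_row, int nrows, double *dst);

const double *cache_get_row(fm_cache_t *c, int m, int row, int F)
{
    if (row < c->lo || row >= c->hi) {        /* miss: refill from `row` on */
        fetch_rows(m, row, M, c->rows);
        c->lo = row;
        c->hi = row + M;
    }
    return c->rows + (long)(row - c->lo) * F; /* hit: row resident in LDM   */
}
```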
In step 6, after extracting the factor matrix data required by the calculation from main memory into its own local data memory via the software-managed cache, the work slave core further performs third-level segmentation of the assigned tensor strips according to the usage of its own local data memory. The specific method: let $\mathrm{Memory}_{LDM}$ be the LDM storage remaining after non-array variables such as loop variables are excluded, and $\mathrm{Memory}_{FM}$ the total space reserved for the factor matrix data required by the calculation; the size of one tensor piece is then $\mathrm{Memory}_{LDM} - \mathrm{Memory}_{FM}$.
In step 6, the calculation result information and the completion status are returned to the control slave core in real time via register communication, in the following formats:
(1) the calculation-result message sent to the control slave core is 256 bits long and contains 4 double-type data items, formatted as follows:
(1-1) the first datum is i_pos, indicating which non-zero tensor line the currently updated data belongs to;
(1-2) the second datum is r, indicating which column of the factor matrix to be updated the currently updated data is located in;
(1-3) the third datum is result1, the first result datum;
(1-4) the fourth datum is result2, the second result datum.
(2) the computation-end message sent to the control slave core is 256 bits long and contains 4 double-type data items, formatted as follows:
(2-1) the first datum is i_pos, indicating which row of the corresponding factor matrix to be updated the currently updated data is located in;
(2-2) the second datum is -1, an indicator variable showing that the current line has been fully updated;
(2-3) the third and fourth data are padding used as placeholders.
In step 7, the control slave core dynamically allocates the next computation task to a work slave core according to the computation status of the current tensor block, implemented as follows (a dispatch sketch follows this list):
(1) when a work slave core finishes its calculation task, if uncomputed tensor strips remain, the next tensor strip to be computed is allocated to that work slave core directly;
(2) if all tensor strips have been computed, the control slave core sends, via register communication, a 256-bit message containing 4 double-type data items to all attached work slave cores, with the last datum set to -1, indicating that all computation of this slave core group is finished.
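A minimal controller-side sketch of this dynamic dispatch: `send_assign()` and `recv_done()` are hypothetical wrappers around the register-communication primitives, and handing out one band at a time is what yields the dynamic load balance (fast workers simply receive more bands).

```c
/* Sketch of the control slave core's dynamic dispatch loop. */
extern void send_assign(int worker, int band); /* register-comm send (stub)   */
extern int  recv_done(void);                   /* waits for a result; returns
                                                  the id of the finished worker */

void dispatch_block(int nbands, int nworkers)  /* nworkers == 7 per group */
{
    int next = 0;
    for (int w = 0; w < nworkers && next < nbands; w++)
        send_assign(w, next++);            /* prime every worker once      */

    for (int done = 0; done < nbands; done++) {
        int w = recv_done();               /* worker w reported its result */
        if (next < nbands)
            send_assign(w, next++);        /* immediately hand it more work */
    }
    /* afterwards: broadcast the 4-double end message whose last slot is -1 */
}
```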
In step 12, the iteration termination condition on the objective function value covers two cases: first, the number of iterations reaches its upper limit; second, the difference between the objective function values obtained in two consecutive iterations is smaller than a preset threshold.
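Stated as code, the test is a disjunction of the two cases; this sketch assumes the objective values of the previous and current iterations are tracked on the MPE.

```c
/* Sketch of the two termination tests of step 12. */
#include <math.h>

int should_stop(double f_prev, double f_curr, int iter,
                int max_iter, double threshold)
{
    return iter >= max_iter || fabs(f_prev - f_curr) < threshold;
}
```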
Compared with the prior art, the invention has the advantages that:
(1) By using three-level data division, the sparse tensor canonical decomposition method improves the data parallelism of the computation, makes maximal use of the slave-core local data memory of the Shenwei processor, and optimizes the performance of the algorithms. Existing sparse tensor canonical decomposition techniques generally use only one or two levels of data division; since the slave-core local data memory of the Shenwei processor is limited, directly copying large volumes of data would overflow it.
(2) By grouping the slave core array and using register communication for inter-slave-core interaction, the invention effectively reduces the total memory access latency and avoids load imbalance. Because the latency of register communication between slave cores is much lower than that of accessing main memory, storing part of the tensor data information on the control slave core and having it forwarded to the work slave cores reduces latency; meanwhile, because the distribution of the non-zero elements of a sparse tensor is uncertain, the non-zero counts of different tensor lines often differ greatly, so the invention performs dynamic load balancing by having the control slave core assign tasks to the work slave cores. The prior art neither effectively converts main memory accesses into register communication nor provides a dynamic load balancing scheme.
(3) By designing a random sampling method with master-slave core cooperation, the invention can handle both MTTKRP processes with random sampling and MTTKRP processes without it; existing parallel sparse tensor canonical decomposition techniques are typically optimized for only one of the two.
(4) With these optimizations, the invention outperforms the latest sparse tensor canonical decomposition methods on both generated and real data sets, achieving speedups of up to 25.5x for the ALS method, 37.21x for the GD method, 37.44x for the RBS method, and 39.57x for the fLM method.
Drawings
FIG. 1 is a process diagram of the method proposed by the present invention;
FIG. 2 is a diagram of a hardware architecture for implementing the proposed method of the present invention;
FIG. 3 is a schematic diagram of the multi-level tensor partitioning proposed by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific examples described herein are intended to be illustrative only and are not intended to be limiting. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The process diagram of the invention is shown in fig. 1, and the hardware architecture diagram is shown in fig. 2.
As shown in fig. 1: the method comprises the following specific implementation steps:
Step 1: read the sparse tensor data specified by the user into the core group main memory; according to the specified sparse tensor canonical decomposition algorithm type, convert the sparse tensor data on the SW26010 core group main core (Management Processing Element, MPE) into the sparse tensor storage format CSF (Compressed Sparse Fiber), and perform first-level and second-level data segmentation of the tensor, the segmented data units being tensor blocks and tensor strips (bands) respectively; the algorithm types that the user may specify are as follows:
in step 1, specifying a specific sparse tensor canonical decomposition algorithm type by a user includes: the sparse tensor canonical decomposition method based on the alternating least square method, the sparse tensor canonical decomposition method based on the gradient descent, the sparse tensor canonical decomposition method based on the random block sampling or the sparse tensor canonical decomposition method based on the fast Levenberg-Marquardt algorithm.
Meanwhile, the method for performing the first-level and second-level data segmentation of the sparse tensor is as follows:
(1) First-level segmentation of the tensor: after removing all-zero tensor lines, the tensor is divided into 8 blocks, each comprising a number of complete non-zero tensor lines and an equal number of non-zero elements;
(2) second-level segmentation of the tensor: on the basis of the first-level segmentation, each non-zero tensor line within a tensor block is defined as a tensor strip;
wherein a tensor line refers to a substructure of a tensor formed while keeping one coordinate of the tensor fixed.
Step 2: determine, according to the user-specified sparse tensor canonical decomposition algorithm type, whether the MTTKRP (Matricized Tensor Times Khatri-Rao Product) process contained in the algorithm requires random sampling; if not, go directly to step 3; otherwise the MPE and the control slave cores (CPE controllers) in the slave core (CPE) groups randomly sample tensor strips, the slave core groups being defined as follows:
since one Shenwei processor core group comprises 1 master core and 64 slave cores distributed in an 8 × 8 layout, the 64 slave cores distributed in the 8 × 8 layout in one core group are divided into 8 groups by rows, and the 8 slave cores in each group are divided into the following two types:
(1) CPE control slave cores: exactly one per group, all located in the same column of the core group's slave core grid; these slave cores are responsible for the following tasks:
(1-1) extracting the coordinates and range information of the tensor block needing to be calculated from the core group main memory, and extracting tensor dimension information corresponding to the factor matrix needing to be updated;
(1-2) dynamically assigning tensor strips to the work slave cores;
(1-3) collecting the calculation results of the work slave cores and sending them back to the main memory.
(2) CPE work slave cores: seven per group, disjoint from the control slave cores; these slave cores are responsible for the following tasks:
(2-1) responsible for completing the calculation of the MTTKRP process according to the data allocated from the control slave core;
(2-2) sending the calculation results to the control slave core.
Meanwhile, the random sampling comprises the following specific steps:
(1) since there are 8 slave core groups in total, the MPE randomly assigns a priority to each CPE group, denoted $P_i$, $i = 1, \ldots, 8$, and sets the number of tensor strips to be sampled each time to IBANDS_TOTAL;
(2) each CPE group determines the number of tensor strips it must compute according to its assigned priority, sampling as follows:
(2-1) for the CPE group with the highest priority ($P_i = 1$): the control slave core of this group randomly generates a tensor strip count $\mathrm{IBANDS}_i$ satisfying $0 < \mathrm{IBANDS}_i < \mathrm{IBANDS\_TOTAL}$ and $0 < \mathrm{IBANDS}_i < \mathrm{IBLOCK\_LEN}_i$, where $\mathrm{IBLOCK\_LEN}_i$ is the number of tensor strips contained in the tensor block allocated to this CPE group; when the random generation finishes, this control slave core sends the remaining samplable count $\mathrm{IBANDS\_TOTAL} - \mathrm{IBANDS}_i$ to the control slave core of the CPE group with the second-highest priority ($P_i = 2$);
(2-2) for a CPE group that does not have the highest priority ($P_i \neq 1$): the first condition on the randomly generated $\mathrm{IBANDS}_i$ becomes $0 < \mathrm{IBANDS}_i <$ the remaining samplable count received from the previous group; the other conditions and the generation procedure are unchanged;
(2-3) repeat (2-2) until every CPE group has generated its $\mathrm{IBANDS}_i$ or the remaining samplable count reaches 0; the number of tensor strips each group must compute is thereby finally determined, completing the random sampling.
Step 3: the MPE determines the tensor dimension corresponding to the Factor Matrix to be updated this time and the allocation scheme of the tensor blocks, and stores them in the core group main memory. Concretely, the factor matrices are updated by minimizing the tensor canonical decomposition objective function $\min_{A_1, \ldots, A_N} \frac{1}{2} \left\| \mathcal{X} - \sum_{r=1}^{F} A_1(:,r) \circ A_2(:,r) \circ \cdots \circ A_N(:,r) \right\|_F^2$ with one of the sparse tensor canonical decomposition algorithms listed in step 1, where $\mathcal{X}$ is the sparse tensor to be decomposed, the $A_i$ are the factor matrices, $A_i(:,r)$ denotes the r-th column vector of a matrix, $\circ$ is the vector outer product, $\|\cdot\|_F$ denotes the Frobenius norm of a tensor, and F is the rank of the tensor, i.e., the smallest number of vector outer products whose sum can represent the tensor;
Step 4: each control slave core in the slave core groups extracts, via DMA, the information of the tensor block allocated to its slave core group from the main memory, including the coordinate information of the sparse tensor non-zero elements contained in each tensor strip of the block, and stores it in its own reconfigurable Local Data Memory (LDM);
Step 5: the control slave core distributes factor matrix update tasks to the work slave cores (CPE workers) of its group and sends, via register communication, the coordinates of the sparse tensor non-zero elements contained in the tensor strips of each task to the work slave core that will compute it, in the following format:
(1) the first datum is i_pos, which indicates which non-zero tensor line, i.e., which tensor strip, the sparse tensor data to be computed belongs to;
(2) the second datum is i_id, which indicates which row of the corresponding factor matrix to be updated the sparse tensor data to be computed maps to;
(3) the third datum is i_ptr[i_pos], which indicates the starting position of the tensor fibers contained in the tensor strip currently to be computed, where a tensor fiber is the substructure of a tensor formed by keeping two coordinates of the tensor fixed;
(4) the fourth datum is i_ptr[i_pos+1], the ending position of the tensor fibers contained in the tensor strip currently to be computed;
in summary, the purpose of the task assignment information is to give the work slave core the coordinate position of the tensor strip to be computed and its correspondence with the factor matrix to be updated.
Step 6: after obtaining the coordinate information of the sparse tensor non-zero elements contained in its assigned tensor strips, the work slave core extracts the factor matrix data required by the calculation from the main memory into its own LDM via a software-managed cache, further performs third-level segmentation of the assigned tensor strips according to the usage of its own LDM (the data unit being the tensor piece (tile); one tensor strip contains a plurality of tensor pieces), extracts the tensor pieces in order for calculation, and returns the calculation result information and completion status to the control slave core in real time via register communication; the specific steps by which the work slave core extracts the required factor matrix data from main memory into its own LDM via the software-managed cache are as follows:
(1) a space of size $\mathrm{Memory}_{FM}$ is opened in the work slave core's own LDM as the cache space for factor matrix data. Suppose the factor matrix to be updated is $A_N$; then the factor matrices required by the calculation are $A_i$, $i = 1, \ldots, N-1$, where N is the order of the tensor. The program presets M, the number of row vectors of each factor matrix that can be fully stored in the LDM, which gives $\mathrm{Memory}_{FM} = (N-1) \times M$ row vectors; in addition, $(N-1) \times 2$ variables are kept in the LDM to store the maximum and minimum coordinates of the cached row vectors;
(2) when a task starts, the first M consecutive rows of each factor matrix are extracted into the cache space;
(3) when the computation for a tensor non-zero element $x_{a_1, a_2, \ldots, a_N}$ updates matrix $A_N$, rows $a_1, a_2, \ldots, a_{N-1}$ of the matrices $A_i$, $i = 1, \ldots, N-1$, are needed respectively; the work slave core therefore queries whether each row vector is present in the current cache space in the LDM: if so it is used directly, otherwise rows $a_i, a_i+1, \ldots, a_i+M$ of matrix $A_i$ are extracted from main memory and stored in the LDM;
meanwhile, the work slave core further performs third-level segmentation of the assigned tensor strips according to the usage of its own local data memory, as follows (a tiling sketch is given after this paragraph): let $\mathrm{Memory}_{LDM}$ be the LDM storage remaining after non-array variables such as loop variables are excluded, and $\mathrm{Memory}_{FM}$ the total space reserved for the factor matrix data required by the calculation; the size of one tensor piece is then $\mathrm{Memory}_{LDM} - \mathrm{Memory}_{FM}$.
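The third-level segmentation then amounts to streaming the band's non-zeros through a fixed-size LDM tile, as sketched below; `fetch_nnz()` and `compute_tile()` are hypothetical stand-ins for the DMA read and the MTTKRP kernel, and `cap` is the number of non-zero records that fit in one tensor piece.

```c
/* Sketch: process one tensor strip (band) tile by tile. The tile buffer
   occupies the Memory_LDM - Memory_FM bytes left over in the LDM. */
extern void fetch_nnz(int first, int count, void *tile_buf);  /* DMA read    */
extern void compute_tile(void *tile_buf, int count);          /* MTTKRP step */

void process_band(int nnz_begin, int nnz_end, void *tile_buf, int cap)
{
    for (int t = nnz_begin; t < nnz_end; t += cap) {
        int count = (nnz_end - t < cap) ? (nnz_end - t) : cap;
        fetch_nnz(t, count, tile_buf);   /* pull the next tensor piece in  */
        compute_tile(tile_buf, count);   /* consume it before the next one */
    }
}
```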
Meanwhile, the calculation result information and the completion status are returned to the control slave core via register communication, in the following formats:
(1) the calculation-result message sent to the control slave core is 256 bits long and contains 4 double-type data items, formatted as follows:
(1-1) the first datum is i_pos, indicating which non-zero tensor line the currently updated data belongs to;
(1-2) the second datum is r, indicating which column of the factor matrix to be updated the currently updated data is located in;
(1-3) the third datum is result1, the first result datum;
(1-4) the fourth datum is result2, the second result datum.
(2) the computation-end message sent to the control slave core is 256 bits long and contains 4 double-type data items, formatted as follows:
(2-1) the first datum is i_pos, indicating which row of the corresponding factor matrix to be updated the currently updated data is located in;
(2-2) the second datum is -1, an indicator variable showing that the current line has been fully updated;
(2-3) the third and fourth data are padding used as placeholders.
Step 7: after a work slave core finishes its calculation task, the control slave core dynamically allocates the next calculation task to it according to the computation status of the current tensor block, and sends the update results back to the core group main memory via DMA; the specific steps for allocating the next calculation task are:
(1) when a work slave core finishes its calculation task, if uncomputed tensor strips remain, the next tensor strip to be computed is allocated to that work slave core directly;
(2) if all tensor strips have been computed, the control slave core sends, via register communication, a 256-bit message containing 4 double-type data items to all attached work slave cores, with the last datum set to -1, indicating that all computation of this slave core group is finished;
Step 8: the work slave cores and control slave core of each slave core group repeat steps 6 and 7 until the tensor block calculation tasks allocated to that slave core group are finished;
Step 9: the MPE determines, according to the user-specified sparse tensor canonical decomposition algorithm type, the matrix operations that must be executed in addition to the MTTKRP process; the slave core groups complete these computations using the basic linear algebra library BLAS, and the MPE determines a preliminary update of the factor matrix from the MTTKRP result and the matrix operation results;
Step 10: the MPE changes the tensor dimension corresponding to the factor matrix to be updated and returns to step 3 until all factor matrices have been updated;
Step 11: the MPE determines, according to the user-specified sparse tensor canonical decomposition algorithm type, whether to perform a line search; if not, the factor matrices are updated directly; otherwise they are updated through a line search process;
Step 12: the MPE evaluates the objective function with the updated factor matrix values; if the value satisfies the iteration termination condition, iteration stops, otherwise return to step 2 until it does. The iteration termination condition covers two cases: first, the number of iterations reaches its upper limit; second, the difference between the objective function values obtained in two consecutive iterations is smaller than a preset threshold.
Claims (12)
1. A sparse tensor canonical decomposition method based on data partitioning and task allocation is characterized by comprising the following steps of:
step 1: reading the sparse tensor data specified by the user into the core group main memory according to the specified sparse tensor canonical decomposition algorithm type; converting the sparse tensor data on the main core of the Shenwei processor core group into the sparse tensor storage format CSF, and performing first-level and second-level data segmentation of the sparse tensor, the segmented data units being tensor blocks and tensor strips respectively, one tensor block comprising a plurality of tensor strips; then determining the number of factor matrices of the decomposition result from the dimensionality of the input tensor, i.e., an N-dimensional tensor is decomposed into N factor matrices, the i-th factor matrix corresponding to the i-th dimension of the tensor and having as many row vectors as the length of the i-th dimension; then entering step 2 to obtain the factor matrices iteratively;
step 2: determining, according to the specified sparse tensor canonical decomposition algorithm type, whether the MTTKRP process contained in the algorithm, i.e., the Matricized Tensor Times Khatri-Rao Product, requires random sampling; if not, going directly to step 3; otherwise the main core and the control slave cores in the slave core groups randomly sample tensor strips;
step 3: the main core determines the tensor dimension corresponding to the factor matrix needing to be updated this time and a distribution scheme of tensor blocks, and stores them into the core group main memory;
step 4: each control slave core in the slave core groups extracts, via DMA, the information of the tensor block distributed to its slave core group from the main memory, the information including the coordinate information of the sparse tensor non-zero elements contained in each tensor strip of the tensor block, and stores it in the local data memory of the control slave core;
step 5: the control slave core distributes factor matrix update calculation tasks to the work slave cores of its group and sends, via register communication, the coordinate information of the sparse tensor non-zero elements contained in the tensor strips of each task to the work slave core that will compute it;
step 6: after obtaining the coordinate information of the sparse tensor non-zero elements contained in its assigned tensor strips, the work slave core extracts the factor matrix data required by the calculation from the main memory into its own local data memory via a software-managed cache, and further performs third-level segmentation of the assigned tensor strips according to the usage of its own local data memory, the data unit being the tensor piece, one tensor strip containing a plurality of tensor pieces; the tensor pieces are extracted in order for calculation, and the calculation result information and completion status are returned to the control slave core in real time via register communication;
step 7: after a work slave core finishes its calculation task, the control slave core dynamically distributes the next calculation task to it according to the computation status of the current tensor block, and sends the update results back to the core group main memory via DMA;
step 8: the work slave cores and control slave core of each slave core group repeat steps 6 and 7 until the tensor block calculation tasks distributed to that slave core group are finished;
step 9: the main core determines, according to the user-specified sparse tensor canonical decomposition algorithm type, the matrix operations that must be executed in addition to the MTTKRP process; the slave core groups complete these computations using the basic linear algebra library BLAS, and the main core determines a preliminary update result of the factor matrix from the MTTKRP calculation result and the matrix operation results;
step 10: the main core changes the tensor dimension corresponding to the factor matrix needing to be updated and returns to step 3 until all factor matrices have been updated;
step 11: the main core determines, according to the user-specified sparse tensor canonical decomposition algorithm type, whether to perform a line search; if not, the factor matrix is updated directly; otherwise it is updated through a line search process;
step 12: the main core determines the value of the objective function from the updated factor matrix values; if the value satisfies the iteration termination condition, iteration is terminated, otherwise return to step 2 until the value of the objective function satisfies the iteration termination condition.
2. The sparse tensor canonical decomposition method based on data partitioning and task allocation according to claim 1, wherein: in step 1, the sparse tensor canonical decomposition algorithm types that the user may specify include: the sparse tensor canonical decomposition method based on the alternating least squares method, the one based on gradient descent, the one based on random block sampling, and the one based on the fast Levenberg-Marquardt algorithm.
3. The sparse tensor canonical decomposition method based on data partitioning and task allocation according to claim 1, wherein: in step 1, the method for performing the first-level and second-level data segmentation of the sparse tensor is as follows:
(1) first-level segmentation of the tensor: after removing all-zero tensor lines, the tensor is divided into 8 blocks, each comprising a number of complete non-zero tensor lines and an equal number of non-zero elements;
(2) second-level segmentation of the tensor: on the basis of the first-level segmentation, each non-zero tensor line within a tensor block is defined as a tensor strip; a tensor line refers to the substructure of a tensor formed while keeping one coordinate of the tensor fixed.
4. The sparse tensor canonical decomposition method based on data partitioning and task allocation according to claim 1, wherein: in step 2, the slave core group is defined as follows:
since one Shenwei processor core group comprises 1 master core and 64 slave cores arranged in an 8 × 8 layout, the 64 slave cores of a core group are divided into 8 groups by row, and the 8 slave cores of each group fall into the following two types:
(1) control slave cores: exactly one per group, all located in the same column of the core group's slave core grid; these slave cores are responsible for the following tasks:
(1-1) extracting from the core group main memory the coordinates and range information of the tensor block to be computed, and extracting the tensor dimension information corresponding to the factor matrix to be updated;
(1-2) dynamically assigning tensor strips to the work slave cores;
(1-3) collecting the calculation results of the work slave cores and sending them back to the main memory;
(2) work slave cores: seven per group, disjoint from the control slave cores; these slave cores are responsible for the following tasks:
(2-1) completing the calculation of the MTTKRP process according to the data allocated by the control slave core;
(2-2) sending the calculation results to the control slave core.
5. The sparse tensor canonical decomposition method based on data partitioning and task allocation according to claim 1, wherein: in step 2, the method by which the main core and the control slave cores of the slave core groups randomly sample tensor strips is as follows:
(1) since there are 8 slave core groups in total, the master core randomly assigns a priority to each slave core group, denoted $P_i$, $i = 1, \ldots, 8$, and sets the number of tensor strips to be sampled each time to IBANDS_TOTAL;
(2) each slave core group determines the number of tensor strips it must compute according to its assigned priority, sampling as follows:
(2-1) for the slave core group with the highest priority: its control slave core randomly generates a tensor strip count $\mathrm{IBANDS}_i$ satisfying $0 < \mathrm{IBANDS}_i < \mathrm{IBANDS\_TOTAL}$ and $0 < \mathrm{IBANDS}_i < \mathrm{IBLOCK\_LEN}_i$, where $\mathrm{IBLOCK\_LEN}_i$ is the number of tensor strips contained in the tensor block allocated to that slave core group; when the random generation finishes, this control slave core sends the remaining samplable count $\mathrm{IBANDS\_TOTAL} - \mathrm{IBANDS}_i$ to the control slave core of the slave core group with the second-highest priority;
(2-2) for a slave core group that does not have the highest priority: the first condition on the randomly generated $\mathrm{IBANDS}_i$ becomes $0 < \mathrm{IBANDS}_i <$ the remaining samplable count received from the previous group; the other conditions and the generation procedure are unchanged;
(2-3) (2-2) is repeated until every slave core group has generated its $\mathrm{IBANDS}_i$ or the remaining samplable count reaches 0, whereby the number of tensor strips each group must compute is finally determined and the random sampling is complete.
6. The sparse tensor canonical decomposition method based on data partitioning and calculation allocation according to claim 1, wherein: in step 3, the main core determines the tensor dimension and tensor blocks corresponding to the factor matrix to be updated this time, the factor matrix being updated by minimizing the tensor canonical decomposition objective function $\min_{A_1, \ldots, A_N} \frac{1}{2} \left\| \mathcal{X} - \sum_{r=1}^{F} A_1(:,r) \circ A_2(:,r) \circ \cdots \circ A_N(:,r) \right\|_F^2$ using the sparse tensor canonical decomposition algorithm of claim 2, where $\mathcal{X}$ is the sparse tensor to be decomposed, the $A_i$ are the factor matrices, $A_i(:,r)$ denotes the r-th column vector of a matrix, $\circ$ is the vector outer product, $\|\cdot\|_F$ denotes the Frobenius norm of a tensor, and F is the rank of the tensor, i.e., the smallest number of vector outer products whose sum can represent the tensor.
7. The sparse tensor canonical decomposition method based on data partitioning and calculation allocation according to claim 1, wherein: in step 5, the format in which the control slave core sends task assignment information to its attached work slave cores is as follows:
(1) the first datum is i_pos, which indicates which non-zero tensor line, i.e., which tensor strip, the sparse tensor data to be computed belongs to;
(2) the second datum is i_id, which indicates which row of the corresponding factor matrix to be updated the sparse tensor data to be computed maps to;
(3) the third datum is i_ptr[i_pos], which indicates the starting position of the tensor fibers contained in the tensor strip currently to be computed, where a tensor fiber is the substructure of a tensor formed by keeping two coordinates of the tensor fixed;
(4) the fourth datum is i_ptr[i_pos+1], which indicates the ending position of the tensor fibers contained in the tensor strip currently to be computed;
the purpose of the task assignment information is to give the work slave core the coordinate position of the tensor strip to be computed and its correspondence with the factor matrix to be updated.
8. The sparse tensor canonical decomposition method based on data partitioning and task allocation according to claim 1, wherein: in step 6, the work extracts the factor matrix data required by the calculation from the main memory to the local data storage in a self-constructed cache manner, and the specific extraction manner is as follows:
(1) a buffer of size Memory_FM is opened in the worker slave core's local data store to hold the factor matrix data; supposing the factor matrix to be updated is A_N, the factor matrices required by the computation are A_i, i = 1, …, N−1, where N is the order of the tensor; the program fixes at M the number of row vectors of each factor matrix kept fully resident in the local data store, which gives Memory_FM = (N−1) × M rows; in addition, (N−1) × 2 variables are kept in the local data store to record the maximum and minimum coordinates of the cached row vectors;
(2) when a task starts, the first M consecutive rows of each factor matrix are extracted into the cache space;
(3) when processing a non-zero tensor element x_{a_1,a_2,…,a_N} to update the matrix A_N, row a_i of each matrix A_i, i = 1, …, N−1, is needed; the cache space in the current local data store is therefore queried for each row vector: on a hit the row is used directly, otherwise rows a_i, a_i+1, …, a_i+M−1 of A_i are extracted from main memory into the local data store; a sketch of this cache follows.
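A hedged C sketch of the per-matrix software cache described in (1)–(3). The DMA helper `dma_get_rows` is a stand-in for the real SW26010 main-memory-to-LDM transfer interface, which the claim does not name.

```c
/* Per-factor-matrix software cache resident in the local data store:
 * M consecutive rank-F row vectors plus the cached coordinate range. */
typedef struct {
    double *rows;    /* M x F row vectors held in LDM              */
    long    lo, hi;  /* cached row coordinates: [lo, hi), hi-lo = M */
    long    M, F;
} row_cache_t;

/* Hypothetical DMA helper: copy rows [lo, lo+M) of factor matrix n
 * from main memory into the LDM buffer dst. */
extern void dma_get_rows(int n, long lo, long M, double *dst);

/* Return a pointer to row a of factor matrix n, refilling the cache
 * window from main memory on a miss. */
static double *cache_lookup(row_cache_t *c, int n, long a)
{
    if (a < c->lo || a >= c->hi) {        /* miss: refill the window */
        dma_get_rows(n, a, c->M, c->rows);
        c->lo = a;
        c->hi = a + c->M;
    }
    return c->rows + (a - c->lo) * c->F;  /* hit: offset inside LDM  */
}
```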
9. The sparse tensor canonical decomposition method based on data partitioning and task allocation according to claim 1, wherein: in step 6, after the worker slave core has set up the self-built cache of factor matrix data in its local data store, it further applies a third-level segmentation to the tensor strips assigned to it according to the usage of its local data store. The segmentation rule is: let Memory_LDM be the storage remaining in the local data store once non-array variables such as loop variables are excluded, and let Memory_FM be the total space reserved for the factor matrix data required by the computation; then the size of one tensor slice is Memory_LDM − Memory_FM.
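As a worked example (the numbers are assumed for illustration, not taken from the patent): each SW26010 compute core owns a 64 KB local data store; if loop variables and other scalars take 2 KB, so that Memory_LDM = 62 KB, and the factor matrix cache reserves Memory_FM = 24 KB, then each tensor slice may occupy at most 62 − 24 = 38 KB.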
10. The sparse tensor canonical decomposition method based on data partitioning and task allocation according to claim 1, wherein: in step 6, the worker slave core returns the computation-result information and the computation-finished information to the control slave core in real time via register communication, with the following specific formats:
(1) the computation-result message sent to the control slave core is 256 bits long and contains 4 double-type data, laid out as follows:
(1-1) the first datum is i_pos, indicating which non-zero tensor row the currently updated data lies in;
(1-2) the second datum is r, indicating which column of the factor matrix to be updated the currently updated data lies in;
(1-3) the third datum is result1, the first result value;
(1-4) the fourth datum is result2, the second result value;
(2) the computation-finished message sent to the control slave core is likewise 256 bits long and contains 4 double-type data, laid out as follows:
(2-1) the first datum is i_pos, indicating which row of the corresponding factor matrix to be updated the currently updated data lies in;
(2-2) the second datum is −1, an indicator variable showing that the current row has been fully updated;
(2-3) the third and fourth data are padding that only fills out the message; a sketch of both layouts follows.
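A hedged C sketch of the two reply messages sharing one 256-bit layout; the −1 sentinel in the second field follows the claim, while the struct itself and the constructor are illustrative assumptions.

```c
/* One 256-bit worker-to-control reply: 4 x 64-bit doubles. */
typedef struct {
    double i_pos;    /* tensor row / factor-matrix row being updated    */
    double r;        /* factor-matrix column, or -1.0 for "row done"    */
    double result1;  /* first result value (padding in "done" message)  */
    double result2;  /* second result value (padding in "done" message) */
} reply_msg_t;

/* Build the computation-finished notification for row i_pos. */
static reply_msg_t make_done_msg(double i_pos)
{
    reply_msg_t m = { i_pos, -1.0, 0.0, 0.0 };  /* -1 marks completion */
    return m;
}
```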
11. The sparse tensor canonical decomposition method based on data partitioning and task allocation according to claim 1, wherein: in step 7, the control slave core dynamically assigns the next computation task to a worker slave core according to the computation progress of the current tensor block, as follows:
(1) when a worker slave core finishes its computation task, if uncomputed tensor strips remain, the next tensor strip in order is assigned to that worker slave core directly;
(2) if all tensor strips have been computed, the control slave core sends, via register communication, a 256-bit message containing 4 double-type data to every attached worker slave core, with the last datum set to −1 to indicate that the entire computation of this slave core group has finished; a dispatch-loop sketch follows.
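A minimal C sketch of this dynamic dispatch loop on the control slave core, reusing the message layouts sketched above; `reg_recv_reply`, `reg_send_task`, and `reg_send_stop` are invented stand-ins for the register-communication primitives.

```c
typedef struct { double i_pos, i_id, fiber_beg, fiber_end; } task_msg_t;
typedef struct { double i_pos, r, result1, result2; } reply_msg_t;

extern reply_msg_t reg_recv_reply(int *worker);       /* blocking receive */
extern void reg_send_task(int worker, task_msg_t t);  /* 256-bit send     */
extern void reg_send_stop(int worker);                /* last datum = -1  */

void dispatch(task_msg_t *strips, long n_strips, int n_workers)
{
    long next = 0, done = 0;

    /* Prime every worker slave core with one tensor strip. */
    for (int w = 0; w < n_workers && next < n_strips; w++)
        reg_send_task(w, strips[next++]);

    /* Hand the next strip to whichever worker reports completion. */
    while (done < n_strips) {
        int w;
        reply_msg_t rep = reg_recv_reply(&w);
        if (rep.r == -1.0) {             /* "row done": strip finished */
            done++;
            if (next < n_strips)
                reg_send_task(w, strips[next++]);
        }
        /* messages with rep.r >= 0 carry partial results to merge */
    }

    /* All strips computed: broadcast the stop message. */
    for (int w = 0; w < n_workers; w++)
        reg_send_stop(w);
}
```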
12. The sparse tensor canonical decomposition method based on data partitioning and task allocation according to claim 1, wherein: in step 12, the value of the objective function satisfies the iteration-termination condition in either of two cases: first, the iteration count reaches its upper limit; second, the difference between the objective function values obtained in two consecutive iterations is smaller than a preset threshold Threshold. A sketch of this check follows.
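A one-function C sketch of this stopping rule; the variable names are assumptions.

```c
#include <math.h>

/* Stop when the iteration cap is reached, or when the objective
 * changed by less than the preset threshold between two consecutive
 * iterations. */
static int should_stop(long iter, long max_iter,
                       double f_prev, double f_cur, double threshold)
{
    return iter >= max_iter || fabs(f_prev - f_cur) < threshold;
}
```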
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011639166.XA | 2020-12-31 | 2020-12-31 | Sparse tensor canonical decomposition method based on data division and task allocation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112765094A (en) | 2021-05-07 |
CN112765094B (en) | 2022-09-30 |
Family
ID=75698380
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011639166.XA | Sparse tensor canonical decomposition method based on data division and task allocation | 2020-12-31 | 2020-12-31 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112765094B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2018128708A * | 2017-02-06 | 2018-08-16 | Nippon Telegraph and Telephone Corporation | Tensor factor decomposition processing apparatus, tensor factor decomposition processing method and tensor factor decomposition processing program |
CN108446253A * | 2018-03-28 | 2018-08-24 | Beihang University | A parallel computing method for sparse matrix-vector multiplication on the Shenwei many-core architecture |
CN110362780A * | 2019-07-17 | 2019-10-22 | Beihang University | A big data tensor canonical decomposition calculation method based on the Shenwei many-core processor |
Non-Patent Citations (1)
Title |
---|
Wu Yu et al.: "Parallel CP tensor decomposition algorithm combined with GPU technology" (结合GPU技术的并行CP张量分解算法), 《计算机科学》 (Computer Science) * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113704691A * | 2021-08-26 | 2021-11-26 | Institute of Software, Chinese Academy of Sciences | Small-scale symmetric matrix parallel tridiagonalization method for the Shenwei many-core processor |
CN113704691B * | 2021-08-26 | 2023-04-25 | Institute of Software, Chinese Academy of Sciences | Small-scale symmetric matrix parallel tridiagonalization method for the Shenwei many-core processor |
CN114416605A * | 2021-12-23 | 2022-04-29 | Shenzhen Intellifusion Technologies Co., Ltd. | Storage space allocation method, terminal device and computer readable storage medium |
CN114970294A * | 2022-08-02 | 2022-08-30 | Shandong Computer Science Center (National Supercomputer Center in Jinan) | Three-dimensional strain simulation PCG parallel optimization method and system based on the Shenwei architecture |
CN114970294B * | 2022-08-02 | 2022-10-25 | Shandong Computer Science Center (National Supercomputer Center in Jinan) | Three-dimensional strain simulation PCG parallel optimization method and system based on the Shenwei architecture |
CN115146780A * | 2022-08-30 | 2022-10-04 | Zhejiang Lab | Method and device for cooperative quantum tensor network transposition and contraction |
CN115146780B * | 2022-08-30 | 2023-07-11 | Zhejiang Lab | Quantum tensor network transposition and contraction cooperative method and device |
Also Published As
Publication number | Publication date |
---|---|
CN112765094B (en) | 2022-09-30 |
Similar Documents
Publication | Title |
---|---|
CN112765094B (en) | Sparse tensor canonical decomposition method based on data division and task allocation |
Humphrey et al. | CULA: hybrid GPU accelerated linear algebra routines |
US9038088B2 | Load balancing on hetrogenous processing cluster based on exceeded load imbalance factor threshold determined by total completion time of multiple processing phases |
Farhat et al. | Mesh partitioning for implicit computations via iterative domain decomposition: impact and optimization of the subdomain aspect ratio |
Liu et al. | Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors |
Ma et al. | Optimizing sparse tensor times matrix on GPUs |
EP2657842B1 | Workload optimization in a multi-processor system executing sparse-matrix vector multiplication |
Economon et al. | Towards high-performance optimizations of the unstructured open-source SU2 suite |
Chen et al. | aeSpTV: An adaptive and efficient framework for sparse tensor-vector product kernel on a high-performance computing platform |
Zhang et al. | Automatic irregularity-aware fine-grained workload partitioning on integrated architectures |
US7983890B2 | Method and apparatus performing automatic mapping for a multi-processor system |
Lee et al. | Optimization of GPU-based sparse matrix multiplication for large sparse networks |
Stripinis et al. | On MATLAB experience in accelerating DIRECT-GLce algorithm for constrained global optimization through dynamic data structures and parallelization |
Rolinger et al. | Performance considerations for scalable parallel tensor decomposition |
Chang et al. | A Hypergraph-Based Workload Partitioning Strategy for Parallel Data Aggregation |
Clarke et al. | Fupermod: A framework for optimal data partitioning for parallel scientific applications on dedicated heterogeneous hpc platforms |
Yu et al. | Numa-aware optimization of sparse matrix-vector multiplication on armv8-based many-core architectures |
Lu et al. | Tilesptrsv: a tiled algorithm for parallel sparse triangular solve on gpus |
Liu et al. | Parallel reconstruction of neighbor-joining trees for large multiple sequence alignments using CUDA |
Zhou et al. | FASTCF: FPGA-based accelerator for stochastic-gradient-descent-based collaborative filtering |
Tian et al. | swSuperLU: A highly scalable sparse direct solver on Sunway manycore architecture |
Ahmad et al. | Exploring data layout for sparse tensor times dense matrix on GPUs |
Sefidgar et al. | Parallelization of torsion finite element code using compressed stiffness matrix algorithm |
CN116167304B (en) | GMRES optimization method and system for reservoir numerical simulation based on the Shenwei architecture |
Matstoms | Parallel sparse QR factorization on shared memory architectures |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||