CN108170639B - Tensor CP decomposition implementation method based on distributed environment - Google Patents


Info

Publication number
CN108170639B
CN108170639B · CN201711426277.0A · CN201711426277A
Authority
CN
China
Prior art keywords: matrix, tensor, host, key, code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711426277.0A
Other languages
Chinese (zh)
Other versions
CN108170639A (en)
Inventor
周维
麦超
蔡莉
何靖
姚绍文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan University YNU
Original Assignee
Yunnan University YNU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan University YNU filed Critical Yunnan University YNU
Priority to CN201711426277.0A priority Critical patent/CN108170639B/en
Publication of CN108170639A publication Critical patent/CN108170639A/en
Application granted granted Critical
Publication of CN108170639B publication Critical patent/CN108170639B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/15: Correlation function computation including computation of convolution operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a tensor CP decomposition implementation method based on a distributed environment. Based on the ALS algorithm, for the update of the factor matrix $A^{(n)}$ in each iteration, $Y = X_{(n)}(A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)})$ is first calculated by splitting the Khatri-Rao product; then $V = (A^{(N)T}A^{(N)} \ast \cdots \ast A^{(n+1)T}A^{(n+1)} \ast A^{(n-1)T}A^{(n-1)} \ast \cdots \ast A^{(1)T}A^{(1)})^{\dagger}$ is calculated by computing outer products in parallel; finally, the matrices Y and V are blocked, the corresponding sub-blocks of Y and V are distributed to the hosts of the Spark cluster using a Map operation, matrix multiplication is performed using a Reduce operation, and the multiplication results are sent to one host using a Map operation and merged using a Reduce operation to obtain $A^{(n)} = YV$. The method realizes tensor CP decomposition based on the MapReduce and Spark technologies and can effectively improve the efficiency of tensor CP decomposition.

Description

Tensor CP decomposition implementation method based on distributed environment
Technical Field
The invention belongs to the technical field of tensor decomposition, and particularly relates to a tensor CP decomposition implementation method based on a distributed environment.
Background
In recent years, data scale has been growing rapidly in fields such as social networks, computational advertising, and e-commerce. These fields need to describe complex relationships (for example, the friend relationships and the rich per-person features in a social network), and such data are naturally modeled in a high-dimensional space. The appearance of these high-order data makes the conventional approach of describing data two-dimensionally with a matrix increasingly inapplicable, so a tool capable of describing the high-order relationships in high-dimensional data is urgently needed.
The tensor, as the generalization of the matrix to a high-dimensional space, is a better tool for describing the high-order relationships among multiple variables. As early as the 1940s, tensors were proposed in psychometrics, and they were later widely used in theoretical fields such as physics, numerical analysis, signal processing, and theoretical computer science. Because a tensor is a high-dimensional array, tensor-based algorithms often have exponential time complexity and require many iterations, so early computers could not complete the calculations at all.
With the development of hardware and software technologies, large servers have gradually ceased to be the first choice in industry owing to factors such as cost and maintenance, and clusters built from ordinary PCs have gradually become the mainstream data processing platform. Following the developments in the theoretical domain, tensors have again received much attention in the engineering domain because of their ability to describe and analyze high-order data. The appearance of programming models such as MapReduce turned algorithms that used to run on a single machine into algorithms that run scattered across multiple machines, using the parallel computing capability of many machines to improve computational efficiency. The rise of big data technologies such as distributed storage and computation makes it possible to process large-scale data. At present, the commonly used distributed computing frameworks are Hadoop and Spark. Hadoop, based on the MapReduce programming model, is the most widely used distributed computing framework, but each MapReduce task in Hadoop must read from and write to disk before and after execution, and the large amount of disk I/O makes Hadoop unsuitable for scenarios with many iterations. The Resilient Distributed Dataset (RDD) in Spark is stored in memory, so the overhead of accessing the disk is avoided in each iteration, which greatly improves iteration efficiency.
The calculation of tensors is easy to parallelize, and problems that could not be processed in the early days can now be completed in a distributed manner. CP decomposition (CANDECOMP/PARAFAC decomposition), a key topic in tensor research, is also used more and more widely: it can extract the topics implicit in data, remove noise, and reduce data dimensionality. Conventional CP decomposition algorithms are single-machine; although programs can be made to process larger-scale data by upgrading the machine's configuration, such upgrading is, after all, limited.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a tensor CP decomposition implementation method based on a distributed environment, and the efficiency of tensor CP decomposition is improved based on MapReduce and Spark technologies.
In order to achieve the above purpose, the tensor CP decomposition implementation method based on the distributed environment proceeds as follows. For an N-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ with rank R, initialize N factor matrices $A^{(n)}$ and alternately update $A^{(1)}, A^{(2)}, \ldots, A^{(N)}$ at each iteration, the other factor matrices being fixed during each calculation; repeat the iteration until the value of the objective function is zero or less than a given threshold. The N factor matrices $A^{(n)}$ are then the result of the CP decomposition of the tensor $\mathcal{X}$, where the update formula of the factor matrix $A^{(n)}$ is:

$$A^{(n)} = X_{(n)}\left(A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)}\right)\left(A^{(N)T}A^{(N)} \ast \cdots \ast A^{(n+1)T}A^{(n+1)} \ast A^{(n-1)T}A^{(n-1)} \ast \cdots \ast A^{(1)T}A^{(1)}\right)^{\dagger}$$

where the superscript † denotes the pseudo-inverse, ⊙ denotes the Khatri-Rao product, and * denotes the Hadamard product.

The factor matrix $A^{(n)}$ is updated with the following method:

S1: Let set D = {1, 2, …, N} - {n}, arrange the elements in set D in ascending order, and let the j-th element be $d_j$, j = 1, 2, …, N-1. Let matrix $Y = X_{(n)}(A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)})$ and $V = (A^{(N)T}A^{(N)} \ast \cdots \ast A^{(n+1)T}A^{(n+1)} \ast A^{(n-1)T}A^{(n-1)} \ast \cdots \ast A^{(1)T}A^{(1)})^{\dagger}$;

S2: Calculate $Y = X_{(n)}(A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)})$ by splitting the Khatri-Rao product. The specific steps are:

S2.1: Initialize the rank index r = 1;

S2.2: Initialize j = 1 and $\mathcal{Y}_1^{(r)} = \mathcal{X}$;

S2.3: Map: split the tensor $\mathcal{Y}_j^{(r)}$ along mode-$d_j$: emit each element with the tuple of its remaining indices (all indices except the mode-$d_j$ index; the exact form of the key differs according to whether $n > d_j$ or not, since mode n has already been contracted away) as key and the element as value, so that the elements sharing a key constitute one mode-$d_j$ fiber. Performing this map operation distributes the fibers of the tensor $\mathcal{Y}_j^{(r)}$ to the hosts of the Spark cluster. At the same time, transpose the column vector $a_r^{(d_j)}$ of the factor matrix $A^{(d_j)}$ to $a_r^{(d_j)T}$ and distribute it to each host of the Spark cluster as a broadcast variable;

S2.4: Reduce: after receiving the key/value data and the column vector $a_r^{(d_j)T}$, each host of the Spark cluster assembles the values sharing a key into a fiber and calculates the inner product of the fiber and the column vector, which yields the elements of

$$\mathcal{Y}_{j+1}^{(r)} = \mathcal{Y}_j^{(r)} \times_{d_j} a_r^{(d_j)T}$$

(the index at which each inner product is stored again takes one of two forms according to whether $n > d_{j+1}$ or $n < d_{j+1}$);

S2.5: Judge whether j < N-1: if so, go to step S2.6, otherwise go to step S2.7;

S2.6: Set j = j+1 and return to step S2.3;

S2.7: Map: each host of the Spark cluster performs a map operation with $code_1$ as key and its elements of $\mathcal{Y}_N^{(r)}$ as value, $code_1$ being a preset code; Reduce: the host that receives $code_1$ combines all the elements into the vector $y_r$;

S2.8: Judge whether r < R: if so, go to step S2.9, otherwise go to step S2.10;

S2.9: Set r = r+1 and return to step S2.2 (so that the loop over j restarts from $\mathcal{Y}_1^{(r)} = \mathcal{X}$);

S2.10: Merge the vectors $y_r$ obtained from the R loop calculations, taking $y_r$ as the r-th column vector of the matrix Y, thereby obtaining the matrix Y;

S3: Calculate the matrix $V = (A^{(N)T}A^{(N)} \ast \cdots \ast A^{(n+1)T}A^{(n+1)} \ast A^{(n-1)T}A^{(n-1)} \ast \cdots \ast A^{(1)T}A^{(1)})^{\dagger}$ by parallel outer product calculation. The specific method is:

S3.1: Initialize j = 1;

S3.2: Calculate $V^{(d_j)} = A^{(d_j)T}A^{(d_j)}$ based on MapReduce:

1) Map: first split the matrix $A^{(d_j)}$ and distribute its row vectors to the hosts of the Spark cluster, i.e., perform a map operation with the row index i as key and the row vector $a_i^{(d_j)}$ as value;

2) Reduce: after receiving the key/value data $(i, a_i^{(d_j)})$, each host of the Spark cluster calculates the outer product $a_i^{(d_j)T} \circ a_i^{(d_j)}$, sums all the outer products calculated on that host, and records the result as $V_m^{(d_j)}$, m = 1, 2, …, M, where M denotes the number of hosts of the Spark cluster;

3) Map: each host of the Spark cluster performs a map operation with $code_2$ as key and $V_m^{(d_j)}$ as value, $code_2$ being a preset code;

4) Reduce: the host that receives $code_2$ adds up all the $V_m^{(d_j)}$ to obtain:

$$V^{(d_j)} = A^{(d_j)T}A^{(d_j)} = \sum_{m=1}^{M} V_m^{(d_j)}$$

S3.3: Judge whether j < N-1: if so, go to step S3.4, otherwise go to step S3.5;

S3.4: Set j = j+1 and return to step S3.2;

S3.5: Calculate the matrix V from the N-1 calculation results $V^{(d_j)}$. The specific process is:

1) Map: after calculating $V^{(d_j)}$, the host performs a map operation with $code_3$ as key and $V^{(d_j)}$ as value, $code_3$ being a preset code;

2) Reduce: the host that receives $code_3$ calculates the Hadamard product of all the $V^{(d_j)}$ and then the pseudo-inverse of the result, obtaining the matrix V;

S4: Block the matrix Y and the matrix V, distribute the corresponding sub-blocks of Y and V to the hosts of the Spark cluster using a Map operation, perform the matrix multiplications using a Reduce operation, then send the multiplication results to one host using a Map operation and merge them using a Reduce operation to obtain $A^{(n)} = YV$.
The tensor CP decomposition implementation method based on the distributed environment is based on the ALS algorithm. For the update of the factor matrix $A^{(n)}$ in each iteration, $Y = X_{(n)}(A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)})$ is first calculated by splitting the Khatri-Rao product; then $V = (A^{(N)T}A^{(N)} \ast \cdots \ast A^{(n+1)T}A^{(n+1)} \ast A^{(n-1)T}A^{(n-1)} \ast \cdots \ast A^{(1)T}A^{(1)})^{\dagger}$ is calculated by computing outer products in parallel; finally, the matrices Y and V are blocked, the corresponding sub-blocks of Y and V are distributed to the hosts of the Spark cluster using a Map operation, matrix multiplication is performed using a Reduce operation, and the multiplication results are sent to one host using a Map operation and merged using a Reduce operation to obtain $A^{(n)} = YV$. The method realizes tensor CP decomposition based on the MapReduce and Spark technologies and can effectively improve the efficiency of tensor CP decomposition.
Drawings
FIG. 1 is a flowchart of an embodiment of updating a factor matrix in a distributed environment-based tensor CP decomposition implementation method according to the present invention;
FIG. 2 is a flow chart of the present invention for splitting the Khatri-Rao product calculation matrix Y;
FIG. 3 is a flow chart of the present invention for computing matrix V by parallel outer product computation;
FIG. 4 is a schematic flow diagram of the MapReduce-based calculation of $A^{(d_j)T}A^{(d_j)}$ in the present invention;
FIG. 5 is a comparison of the running times of the present invention and the comparison method for different tensor sizes;
FIG. 6 is a comparison of the running times of the present invention and the comparison method for different tensor densities.
Detailed Description
The following describes specific embodiments of the present invention with reference to the accompanying drawings so that those skilled in the art can better understand the present invention. It is expressly noted that in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the present invention.
To better explain the technical solution of the present invention, the tensor CP decomposition and the principle on which the present invention is based will be briefly explained.
For an N-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ with rank R, $I_n$ denoting the dimension of the n-th order, n = 1, 2, …, N, the goal is to compute a tensor $\hat{\mathcal{X}}$ that is nearest to $\mathcal{X}$ and whose rank is R, i.e., to calculate $\min_{\hat{\mathcal{X}}} \| \mathcal{X} - \hat{\mathcal{X}} \|$, where $\| \cdot \|$ denotes the norm and

$$\hat{\mathcal{X}} = \sum_{r=1}^{R} a_r^{(1)} \circ a_r^{(2)} \circ \cdots \circ a_r^{(N)} \quad (1)$$

wherein $A^{(1)}, \ldots, A^{(n-1)}, A^{(n)}, \ldots, A^{(N)}$ are the factor matrices of the tensor and $a_r^{(n)}$ denotes the r-th column of $A^{(n)}$.
The ALS (Alternating Least Squares) algorithm is a common algorithm for tensor CP decomposition at present. The method calculates the factor matrices $A^{(1)}, A^{(2)}, \ldots, A^{(N)}$ in turn, the other factor matrices being fixed while each one is calculated, so that each calculation is converted into the optimization problem shown by the following formula:

$$\min_{A^{(n)}} \left\| X_{(n)} - A^{(n)} \left( A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)} \right)^T \right\|$$

where $X_{(n)}$ denotes the mode-n matricization of the tensor $\mathcal{X}$, i.e., the matrix obtained by unfolding $\mathcal{X}$ along mode-n, the superscript T denotes the transpose, and ⊙ denotes the Khatri-Rao product.

When N factor matrices $A^{(1)}, \ldots, A^{(N)}$ are found that minimize the value of the objective function $\| X_{(n)} - A^{(n)} (A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)})^T \|$, these N factor matrices are the result of the CP decomposition.
In engineering, when the CP decomposition of the tensor $\mathcal{X}$ is calculated with the ALS algorithm, N factor matrices $A^{(n)}$ are initialized and multiple iterative calculations are performed; each iteration updates $A^{(1)}, A^{(2)}, \ldots, A^{(N)}$ in turn with the other factor matrices fixed, and the iteration is repeated until the value of the objective function is zero or less than a given threshold. Each iteration requires updating all the factor matrices; each factor matrix $A^{(n)}$ is updated using the following formula:

$$A^{(n)} = X_{(n)}\left(A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)}\right)\left(A^{(N)T}A^{(N)} \ast \cdots \ast A^{(n+1)T}A^{(n+1)} \ast A^{(n-1)T}A^{(n-1)} \ast \cdots \ast A^{(1)T}A^{(1)}\right)^{\dagger} \quad (2)$$

where the superscript † denotes the pseudo-inverse and * denotes the Hadamard product.
Thus, in each calculation the corresponding factor matrix is updated according to the above formula. Parallelizing and distributing the tensor decomposition algorithm therefore amounts to designing a distributed algorithm that completes the update of the factor matrix $A^{(n)}$.
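Before turning to the distributed design, formula (2) can be sanity-checked on a small dense tensor with a single-machine NumPy sketch. This is an illustration of the mathematics only, not the patent's distributed implementation, and the unfolding convention (NumPy's row-major order, paired with the matching ascending Khatri-Rao factor order) is an assumption of the sketch:

```python
import numpy as np

def unfold(X, n):
    # Mode-n matricization: rows indexed by i_n; with NumPy's row-major
    # layout, the columns run over the remaining modes in ascending order.
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def khatri_rao(mats):
    # Column-wise Kronecker product of a list of I_k x R matrices.
    R = mats[0].shape[1]
    out = mats[0]
    for M in mats[1:]:
        out = np.einsum('ir,jr->ijr', out, M).reshape(-1, R)
    return out

def als_update(X, factors, n):
    # One update of A^(n) per formula (2); ascending factor order matches
    # the unfolding above (the mirror image of the A^(N), ..., A^(1) order
    # that pairs with the Kolda-style unfolding used in the text).
    others = [factors[k] for k in range(X.ndim) if k != n]
    Y = unfold(X, n) @ khatri_rao(others)
    gram = np.ones((factors[n].shape[1],) * 2)
    for k, A in enumerate(factors):
        if k != n:
            gram *= A.T @ A          # Hadamard product of the Gram matrices
    return Y @ np.linalg.pinv(gram)  # A^(n) = Y V

# A few ALS sweeps on a random 4 x 5 x 6 tensor with R = 3.
rng = np.random.default_rng(0)
X = rng.random((4, 5, 6))
factors = [rng.random((I, 3)) for I in X.shape]
for _ in range(20):
    for n in range(X.ndim):
        factors[n] = als_update(X, factors, n)
```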
Examples
The invention is based on the ALS algorithm and improves the updating of the factor matrices in each iteration, realizing the update of the factor matrix in a distributed environment so as to improve the efficiency of tensor CP decomposition. Fig. 1 is a flowchart of an embodiment of updating a factor matrix in the distributed-environment-based tensor CP decomposition implementation method of the present invention. As shown in Fig. 1, the specific steps of updating a factor matrix in the method of the present invention are:

S101: Data arrangement:

Let set D = {1, 2, …, N} - {n} and arrange the elements in set D in ascending order, letting the j-th element be $d_j$; clearly there are N-1 elements in set D, i.e., j = 1, 2, …, N-1. Let matrix $Y = X_{(n)}(A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)})$ and $V = (A^{(N)T}A^{(N)} \ast \cdots \ast A^{(n+1)T}A^{(n+1)} \ast A^{(n-1)T}A^{(n-1)} \ast \cdots \ast A^{(1)T}A^{(1)})^{\dagger}$, so that $A^{(n)} = YV$.

S102: Splitting the Khatri-Rao product to calculate the matrix Y:

In formula (2), $Y = X_{(n)}(A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)})$ needs to be calculated first, and $A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)}$ is the Khatri-Rao product that leads to the surge in intermediate data. Calculating $X_{(n)}(A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)})$ is in fact calculating n-mode products of the tensor $\mathcal{X}$ with the column vectors of the factor matrices, so by the above analysis this calculation can be converted into the following algorithm:

Let $\mathcal{Y}_1 = \mathcal{X}$ and let r = 1, 2, …, R in turn, cyclically calculating:

$$\mathcal{Y}_{j+1} = \mathcal{Y}_j \times_{d_j} a_r^{(d_j)T}, \quad j = 1, 2, \ldots, N-1$$
$$y_r^T = \mathcal{Y}_N$$

where $a_r^{(d_j)}$ denotes the r-th column vector of the factor matrix $A^{(d_j)}$, $\times_{d_j}$ denotes the n-mode product with $n = d_j$, and $y_r^T$ denotes the transpose of the r-th column vector $y_r$ in the calculation result $Y = X_{(n)}(A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)})$.
The above method is illustrated with the following third-order tensor $\mathcal{X} \in \mathbb{R}^{2 \times 3 \times 2}$ with a rank of 2. Suppose the factor matrix to be updated this time is $A^{(1)}$ and the frontal slices of the third-order tensor $\mathcal{X}$ are respectively:

$$X_1 = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix}, \qquad X_2 = \begin{bmatrix} 7 & 9 & 11 \\ 8 & 10 & 12 \end{bmatrix}$$

and that $A^{(2)}$ and $A^{(3)}$ were obtained in the last update (their concrete entries appear in the original figures; the numbers below follow the detailed worked example later in this description). Since the factor matrix $A^{(1)}$ needs to be updated, the set D = {2, 3}. Let $\mathcal{Y}_1 = \mathcal{X}$.

Let r = 1; then:

$$\mathcal{Y}_2 = \mathcal{Y}_1 \times_2 a_1^{(2)T} = \begin{bmatrix} 6 & 18 \\ 8 & 20 \end{bmatrix}, \qquad y_1^T = \mathcal{Y}_3 = \mathcal{Y}_2 \times_3 a_1^{(3)T} = (24, 28)$$

Let r = 2; the same two n-mode products with $a_2^{(2)}$ and $a_2^{(3)}$ then yield $y_2$ in the same way. Combining gives $Y = [y_1 \; y_2]$.
It can be seen that the above calculation process requires R loop iterations and that N-1 n-mode product calculations are performed in each iteration; the calculation of the n-mode product does not need a large amount of storage and can be conveniently realized with MapReduce.
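As a concrete check, the two n-mode products of the example above can be reproduced with np.tensordot. The vectors used here for $a_1^{(2)}$ and $a_1^{(3)}$ are an assumption: one choice consistent with the values 6, 8, 18, 20 and 24, 28 reported in the detailed worked example below, since the patent's own factor matrices appear only in its figures:

```python
import numpy as np

# Frontal slices of the example tensor; shape (I1, I2, I3) = (2, 3, 2).
X = np.zeros((2, 3, 2))
X[:, :, 0] = [[1, 3, 5], [2, 4, 6]]
X[:, :, 1] = [[7, 9, 11], [8, 10, 12]]

a2 = np.array([1.0, 0.0, 1.0])  # assumed a_1^(2), consistent with the example
a3 = np.array([1.0, 1.0])       # assumed a_1^(3), consistent with the example

# Mode-2 product contracts axis 1: Y2[i1, i3] = sum_j X[i1, j, i3] * a2[j].
Y2 = np.tensordot(X, a2, axes=([1], [0]))   # [[ 6., 18.], [ 8., 20.]]
# Mode-3 of the original tensor is now axis 1 of Y2.
y1 = np.tensordot(Y2, a3, axes=([1], [0]))  # [24., 28.]
print(Y2, y1)
```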
According to the above analysis, the specific process of splitting the Khatri-Rao product so as to calculate $Y = X_{(n)}(A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)})$ is obtained. Fig. 2 is a flow chart of splitting the Khatri-Rao product to calculate the matrix Y in the present invention. As shown in Fig. 2, the specific process is as follows:

S201: Initialize the rank index r = 1.

S202: Initialize j = 1 and $\mathcal{Y}_1 = \mathcal{X}$.

S203: Split the tensor:

Map: split the tensor $\mathcal{Y}_j$ along mode-$d_j$. When $n > d_j$, the key takes one form; otherwise it takes the other (the exact key expressions appear in the original figures); in both cases the key is formed from all the indices of an element except its mode-$d_j$ index, the corresponding element of $\mathcal{Y}_j$ is the value, and the elements with the same key constitute one mode-$d_j$ fiber of $\mathcal{Y}_j$. Performing this map operation distributes the fibers of the tensor $\mathcal{Y}_j$ to the hosts of the Spark cluster. At the same time, using Spark's shared-variable feature, the column vector $a_r^{(d_j)}$ of the factor matrix $A^{(d_j)}$ is transposed to $a_r^{(d_j)T}$ and distributed to each host of the Spark cluster as a broadcast variable.

Since the invention calculates with the Spark technique, all the tensors $\mathcal{Y}_j$ except $\mathcal{Y}_1 = \mathcal{X}$ are in fact already distributed over different hosts, so splitting such a tensor, that is, performing the Map operation, is actually completed jointly by several hosts.

S204: Calculate the inner product of fiber and column vector:

Reduce: after receiving the key/value data and the column vector $a_r^{(d_j)T}$, each host of the Spark cluster assembles the values with the same key into a fiber and calculates the inner product of the fiber and the column vector; each inner product is one element of

$$\mathcal{Y}_{j+1} = \mathcal{Y}_j \times_{d_j} a_r^{(d_j)T}$$

and the index at which it is stored takes one of two forms according to whether $n > d_{j+1}$ or $n < d_{j+1}$ (the exact expressions appear in the original figures). The different hosts can thus calculate the elements of $\mathcal{Y}_{j+1}$ from the data distributed to them.

S205: Judge whether j < N-1: if so, go to step S206; otherwise go to step S207.

S206: Set j = j+1 and return to step S203.

S207: Obtain $y_r$:

Map: each host of the Spark cluster performs a map operation with $code_1$ as key and its elements of $\mathcal{Y}_N$ as value, $code_1$ being a preset code; that is, the elements calculated by the hosts are all mapped to one host.

Reduce: the host that receives $code_1$ combines all the elements into the vector $y_r$.

S208: Judge whether r < R: if so, go to step S209; otherwise go to step S210.

S209: Set r = r+1 and return to step S202.

S210: Merge the vectors:

Merge the vectors $y_r$ obtained from the R loop calculations, taking $y_r$ as the r-th column vector of the matrix Y, thereby obtaining the matrix Y.
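Steps S201-S210 can be rendered quite compactly in PySpark. The following is a minimal sketch under two assumptions: the tensor is stored sparsely as an RDD of (index-tuple, value) entries, and instead of materializing whole fibers on a host, the fiber/column-vector inner product of S204 is folded into a keyed sum with reduceByKey, an equivalent formulation that produces the same elements of $\mathcal{Y}_{j+1}$. The concrete vectors are the same assumed columns as in the earlier NumPy sketch:

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="split-khatri-rao-sketch")

def contract_mode(entries, a, d):
    # entries: RDD of (index_tuple, value) for the current tensor Y_j.
    # Dropping index d from the key reproduces S203's fiber keys; summing
    # value * a[i_d] per key is exactly the inner product of S204.
    a_b = sc.broadcast(a)
    return (entries
            .map(lambda kv: (kv[0][:d] + kv[0][d + 1:],
                             kv[1] * a_b.value[kv[0][d]]))
            .reduceByKey(lambda x, y: x + y))

# The example tensor as sparse (index, value) entries.
X = np.zeros((2, 3, 2))
X[:, :, 0] = [[1, 3, 5], [2, 4, 6]]
X[:, :, 1] = [[7, 9, 11], [8, 10, 12]]
entries = sc.parallelize([(idx, float(X[idx])) for idx in np.ndindex(X.shape)])

y = contract_mode(entries, np.array([1.0, 0.0, 1.0]), d=1)  # mode-2 product
y = contract_mode(y, np.array([1.0, 1.0]), d=1)             # mode-3, now axis 1
print(sorted(y.collect()))  # [((0,), 24.0), ((1,), 28.0)] -> y_1 = (24, 28)^T
```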
The process described above is illustrated with the same rank-2 third-order tensor $\mathcal{X}$ as before. Again suppose the factor matrix to be updated this time is $A^{(1)}$, that the frontal slices of the third-order tensor $\mathcal{X}$ are $X_1$ and $X_2$ as given above, and that $A^{(2)}$ and $A^{(3)}$ are the factor matrices obtained in the last update. Since the factor matrix $A^{(1)}$ needs to be updated, i.e., n = 1, the set D = {2, 3}.

Let r = 1, and initialize j = 1 and $\mathcal{Y}_1 = \mathcal{X}$.

Map: since $d_1 = 2$, $\mathcal{Y}_1$ is split by mode-2, yielding 4 fibers; since $\mathcal{Y}_1$ is of order 3, each fiber is in fact a vector. The key of each element is formed from $i_1$ and $i_3$. Table 1 shows the result of splitting $\mathcal{Y}_1$ by mode-2 in this example.

key 1+1: element values 1, 3, 5
key 1+2: element values 2, 4, 6
key 2+1: element values 7, 9, 11
key 2+2: element values 8, 10, 12

TABLE 1

Each row of element values in Table 1 constitutes one fiber. A map operation is performed with $i_1 + i_3$ as key and each element of $\mathcal{Y}_1$ as value, and the elements are distributed to the hosts of the Spark cluster, which completes the distribution of the fibers. The column vector $a_1^{(2)}$ of $A^{(2)}$ is transposed to $a_1^{(2)T}$ and distributed to each host of the Spark cluster as a broadcast variable.

Reduce: the hosts that obtained data each calculate the inner product of one of the 4 fibers with $a_1^{(2)}$, yielding the 4 values 6, 8, 18, 20. From these 4 values a 2nd-order tensor $\mathcal{Y}_2$ of size $I_1 \times I_3$ is formed, i.e., $\mathcal{Y}_2 = \mathcal{Y}_1 \times_2 a_1^{(2)T}$, each of whose elements is the inner product of the fiber keyed by $(i_1, i_3)$ with $a_1^{(2)}$.

Let j = 2.

Map: since $d_2 = 3$, $\mathcal{Y}_2$ must be split by mode-3. The elements of $\mathcal{Y}_2$, i.e., the calculation results of the last Reduce, are already dispersed over the hosts of the Spark cluster, so each host directly performs the Map operation: each host holding elements of $\mathcal{Y}_2$ uses $i_1$ as key and the element as value. Obviously, the fibers formed this time are (6, 18) and (8, 20). The column vector $a_1^{(3)}$ of $A^{(3)}$ is transposed to $a_1^{(3)T}$ and distributed to each host of the Spark cluster as a broadcast variable.

Reduce: the hosts that obtained data each calculate the inner product of one of the 2 fibers with $a_1^{(3)}$, yielding the 2 values 24 and 28. From these 2 values a 1st-order tensor $\mathcal{Y}_3$ of size $I_1$ is formed, i.e., $\mathcal{Y}_3 = \mathcal{Y}_2 \times_3 a_1^{(3)T} = (24, 28)$. Its elements are mapped to the same host and merged to obtain the vector $y_1 = (24, 28)^T$.

The vector $y_2$ can be calculated in a similar way; $y_1$ and $y_2$ are then combined to obtain the matrix $Y = [y_1 \; y_2]$.
S103: Calculate the matrix V by parallel outer product calculation:

Next, the calculation of the matrix $V = (A^{(N)T}A^{(N)} \ast \cdots \ast A^{(n+1)T}A^{(n+1)} \ast A^{(n-1)T}A^{(n-1)} \ast \cdots \ast A^{(1)T}A^{(1)})^{\dagger}$ is analyzed. Clearly, the key step is to be able to efficiently calculate $A^{(1)T}A^{(1)} \ast \cdots \ast A^{(n-1)T}A^{(n-1)} \ast A^{(n+1)T}A^{(n+1)} \ast \cdots \ast A^{(N)T}A^{(N)}$: because the result of this expression is a matrix of size R×R, and R is typically a small value, the computation of the pseudo-inverse is quite fast and easy. The expression can be calculated from left to right as the product of each matrix's transpose with itself: since each factor matrix has R columns, each result $A^{(d_j)T}A^{(d_j)}$ is a matrix of size R×R, and finally the Hadamard products of the N-1 matrices of size R×R are calculated, which completes the calculation of the expression. Calculating $A^{(d_j)T}A^{(d_j)}$ is just the process of calculating, for each row of $A^{(d_j)}$, the outer product of the transpose of that row with the row itself, and adding up the results of all the outer products. Thus the calculation of $A^{(d_j)T}A^{(d_j)}$ can be described by the following algorithm:

Initialize the R×R matrix $V^{(d_j)}$ as a zero matrix and, letting i = 1, 2, …, $I_{d_j}$ in turn, cyclically calculate:

$$V^{(d_j)} = V^{(d_j)} + a_i^{(d_j)T} \circ a_i^{(d_j)}$$

where $a_i^{(d_j)}$ denotes the i-th row vector of $A^{(d_j)}$ and $\circ$ denotes the outer product.

The above algorithm is illustrated with a matrix A of size 3×2 (its concrete entries appear in the original figures): the accumulator is initialized to the 2×2 zero matrix, and for i = 1, 2, 3 in turn the outer product $a_i^T \circ a_i$ of each row is added. Obviously, the result obtained with the above algorithm is the same as that of directly calculating $A^T A$.
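The row-wise accumulation is easy to verify in a few lines of NumPy; a random 3×2 matrix stands in for the example matrix, whose concrete entries appear only in the original figures:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((3, 2))            # stand-in for the 3 x 2 example matrix

V = np.zeros((2, 2))              # R x R accumulator
for i in range(A.shape[0]):
    V += np.outer(A[i], A[i])     # a_i^T o a_i, one outer product per row

assert np.allclose(V, A.T @ A)    # same result as computing A^T A directly
```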
In the above algorithm, the transpose of each row of the matrix $A^{(d_j)}$ must be multiplied, as an outer product, with that row, which requires multiple iterations. Observing the calculation in each iteration of the algorithm, i.e., $a_i^{(d_j)T} \circ a_i^{(d_j)}$, it can be found that the data used to compute each outer product is a single row vector of the matrix $A^{(d_j)}$, so each outer product calculation can be completed independently; meanwhile, the matrix obtained from each outer product has size R×R, which is very small and does not occupy a large amount of storage space. Based on this finding, the invention splits the matrix $A^{(d_j)}$, distributes its row vectors to the machines of the cluster, and executes the outer product calculations in parallel, thereby improving efficiency. Here the number of rows $I_{d_j}$ of the matrix $A^{(d_j)}$ must be considered: it may be very large, so the computation of the outer products cannot be completed by only one reducer. Partial outer products need to be merged in advance to reduce the pressure on the reducer and improve calculation efficiency; finally all the outer products are merged on one reducer, where the calculation of the pseudo-inverse is completed. The MapReduce implementation of this step is therefore divided into two MapReduce passes, namely calculating the outer products and merging all the outer products, each pass consisting of a Map step and a Reduce step.

From the above analysis, the specific process of calculating $V$ in the invention is obtained. Fig. 3 is a flow chart of calculating the matrix V by computing outer products in parallel in the present invention. As shown in Fig. 3, the specific steps of calculating the matrix V in the form of parallel outer products include:

S301: Initialize j = 1.

S302: Calculate $V^{(d_j)} = A^{(d_j)T}A^{(d_j)}$:

In the invention, two MapReduce passes are used for this calculation. Fig. 4 is a schematic flow diagram of the MapReduce-based calculation of $A^{(d_j)T}A^{(d_j)}$ in the present invention. As shown in Fig. 4, the specific method is:

1) Map: first split the matrix $A^{(d_j)}$ and distribute its row vectors to the hosts of the Spark cluster, i.e., perform a map operation with the row index i as key and the row vector $a_i^{(d_j)}$ as value.

2) Reduce: after receiving the key/value data $(i, a_i^{(d_j)})$, each host of the Spark cluster calculates the outer product $a_i^{(d_j)T} \circ a_i^{(d_j)}$, sums all the outer products calculated on that host, and records the result as $V_m^{(d_j)}$, m = 1, 2, …, M, where M denotes the number of hosts of the Spark cluster. This is because $I_{d_j}$ is usually large, so the outer products of more than one row vector are calculated on each host; each host therefore merges the partial outer products calculated on it in advance to reduce the subsequent workload.

3) Map: each host of the Spark cluster performs a map operation with $code_2$ as key and $V_m^{(d_j)}$ as value, $code_2$ being a preset code, i.e., the $V_m^{(d_j)}$ calculated by the hosts are all mapped to one host.

4) Reduce: the host that receives $code_2$ adds up all the $V_m^{(d_j)}$ to obtain:

$$V^{(d_j)} = A^{(d_j)T}A^{(d_j)} = \sum_{m=1}^{M} V_m^{(d_j)}$$

S303: Judge whether j < N-1: if so, go to step S304; otherwise go to step S305.

S304: Set j = j+1 and return to step S302.

S305: Calculate the matrix V:

From the N-1 calculation results $V^{(d_j)}$, the matrix V is calculated as follows:

1) Map: after calculating $V^{(d_j)}$, the host performs a map operation with $code_3$ as key and $V^{(d_j)}$ as value, $code_3$ being likewise a preset code.

2) Reduce: the host that receives $code_3$ calculates the Hadamard product of all the $V^{(d_j)}$ and then the pseudo-inverse of the result, obtaining the matrix V. According to the previous analysis, since each $V^{(d_j)}$ is an R×R matrix, the calculation is relatively simple and can therefore be completed with one Reduce.
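The two MapReduce passes of S302 and the merge of S305 map naturally onto mapPartitions and reduce in PySpark. The following is a minimal sketch assuming a local Spark context, with each partition playing the role of one host that pre-merges its partial outer products $V_m^{(d_j)}$:

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="parallel-gram-sketch")

def gram_via_spark(A, num_partitions=4):
    # S302: every partition ("host") sums the outer products of the rows it
    # holds (the partial result V_m); a single reduce then adds the M
    # partial R x R matrices together, giving V^(d_j) = A^T A.
    R = A.shape[1]
    rows = sc.parallelize(list(A), num_partitions)
    partials = rows.mapPartitions(
        lambda it: [sum((np.outer(a, a) for a in it), np.zeros((R, R)))])
    return partials.reduce(lambda u, v: u + v)

# S305: Hadamard product of the N-1 Gram matrices, then the pseudo-inverse.
R = 10
factors = [np.random.rand(500, R), np.random.rand(600, R)]  # e.g. A^(2), A^(3)
hadamard = np.ones((R, R))
for A in factors:
    hadamard *= gram_via_spark(A)
V = np.linalg.pinv(hadamard)
```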
S104: computing based on distributed caches(n)
In calculating the matrix multiplication $A^{(n)} = YV$, one key factor to consider is whether a single machine's storage can accommodate both matrices and operate efficiently. $Y = X_{(n)}(A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)})$ is a matrix of size $I_n \times R$: R is usually a small number, but $I_n$ is often so large that a single machine's memory cannot accommodate the matrix. If a disk is used for auxiliary storage (for example, in the manner of a swap partition), a large amount of disk I/O is generated while the program runs, greatly affecting operating efficiency. The size of V is R×R; since R is a small number, this matrix can be stored entirely in a single machine's memory. From the above analysis it can be concluded that $A^{(n)} = YV$ is a large matrix multiplication with severe data skew, and common matrix multiplication methods cannot be applied to such matrices.

For these reasons, the invention blocks the matrix Y and the matrix V, distributes the sub-blocks of Y and V to the hosts of the Spark cluster using a Map operation, performs the matrix multiplications using a Reduce operation, then sends the multiplication results to one host using a Map operation and merges them using a Reduce operation to obtain $A^{(n)} = YV$, thereby enabling more efficient calculation.

The common ways of blocking a matrix are: by rows, by columns, and by both rows and columns. The research of the invention finds that, because of the small scale of the matrix V, also blocking V is not the optimal way. It is therefore preferable to block only the matrix Y; i.e., the blocking method for the matrix Y and the matrix V is: block the matrix Y by rows to obtain sub-block matrices with R columns, the row size of each sub-block being set according to actual needs, while the block of the matrix V is the matrix V itself, i.e., V is not blocked.
With this blocking, the specific process of calculating $A^{(n)}$ based on the distributed cache is:

1) Map: first split the matrix Y, performing a map operation with the row index k as key and the row vector $y_k$ as value, k = 1, 2, …, $I_n$, which distributes the row vectors $y_k$ to the hosts of the Spark cluster. At the same time, set the matrix V as a Spark broadcast variable and distribute it to each host of the Spark cluster.

2) Reduce: each host of the Spark cluster that receives a row vector $y_k$ and the matrix V calculates $A_k^{(n)} = y_k V$.

3) Map: after calculating $A_k^{(n)}$, the host performs a map operation with $code_4$ as key and $A_k^{(n)}$ as value, $code_4$ being a preset code.

4) Reduce: the host that receives the $I_n$ results $A_k^{(n)}$ takes each $A_k^{(n)}$ as the k-th row of $A^{(n)}$, obtaining the factor matrix $A^{(n)}$.
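A minimal PySpark sketch of this skew-aware multiplication, again under the assumption of a local Spark context: only the tall matrix Y is split, while V travels to every host as a broadcast variable, mirroring steps 1) to 4) above:

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="skewed-yv-sketch")

def multiply_skewed(Y, V):
    # S104: split only the tall I_n x R matrix Y (row by row here); the
    # small R x R matrix V reaches every host as a broadcast variable, so
    # each host computes its rows A_k = y_k V locally.
    V_b = sc.broadcast(V)
    rows = sc.parallelize(list(enumerate(Y)))
    products = rows.map(lambda kv: (kv[0], kv[1] @ V_b.value))
    # Final merge: reassemble the rows of A^(n) in order on one host.
    return np.vstack([row for _, row in sorted(products.collect())])

Y = np.random.rand(10000, 10)   # I_n x R, with I_n typically huge
V = np.random.rand(10, 10)      # R x R, fits easily on every host
A_n = multiply_skewed(Y, V)
assert np.allclose(A_n, Y @ V)
```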
In order to better illustrate the technical effect of the invention, a specific example is used for experimental verification, and the results are compared with those of an existing tensor CP decomposition method. In the experimental verification, the Spark cluster contains 10 hosts, and the tensor data use the NELL data source of CMU (Carnegie Mellon University), which originates from CMU's "Read the Web" project and includes a large number of categories and relationships. Because real-world data are generally sparse, in order to test the performance of the tensor decomposition algorithm on data of different degrees of sparsity, in addition to the full NELL data, third-order tensors of different sizes and different densities were randomly generated for the experiment. Table 2 describes the data sets in this example.
TABLE 2: Data set description (the concrete tensor sizes, densities, and non-zero counts appear in the original figures)
The comparison method adopted in the experimental verification is a traditional tensor CP decomposition tool, the MATLAB Tensor Toolbox Version 2.6, implemented by Tamara G. Kolda of the Sandia National Laboratories in the USA. It provides CP decomposition, Tucker decomposition, and matrix calculation operations for dense, sparse, and structured tensors, but it does not support distributed tensor decomposition. In this experimental verification, the rank R of the tensor is set to 10 when performing the CP decomposition.
First, the tensor density is fixed and the running times of the invention and the comparison method are tested for different tensor sizes. The tensors in the experiment grow gradually from I = J = K = $10^3$ to I = J = K = $10^8$, with the number of non-zero elements being 10 × I. Fig. 5 compares the running times of the invention and the comparison method for different tensor sizes. As shown in Fig. 5, the running time of the comparison method increases with the scale of the tensor, and when the tensor size exceeds I = J = K = $10^6$, the comparison method cannot complete the CP decomposition of the tensor because of the CPU and memory limitations of a single machine (mainly the memory). For tensor sizes from I = J = K = $10^3$ to $10^6$ the running time of the invention is stable: when the tensor scale is not large, task scheduling and network data transmission occupy most of the program's running time, and this part is relatively stable. When the tensor size exceeds I = J = K = $10^6$, the running time of the invention begins to increase; when the tensor size reaches I = J = K = $10^8$, the increase in running time is an order of magnitude higher. At this point the memory occupation of the cluster reaches a peak: Spark starts to use the disk swap partition to store part of the data, and some temporarily unneeded RDDs are cleared from memory and recalculated from their lineage when needed later; these two factors significantly increase the running time. Although the running time of the invention increases when CP-decomposing large-scale tensors, it is acceptable for engineering applications.
Next, the size of the tensor is fixed and the running times of the invention and the comparison method are tested for different tensor densities. The tensor size used for the test is I = J = K = $10^5$. The density of the tensor is incremented from $10^{-9}$ to $10^{-5}$, i.e., the number of non-zero elements grows from $10^6$ to $10^{10}$. Fig. 6 compares the running times of the invention and the comparison method for different tensor densities. As shown in Fig. 6, as the density of the tensor increases from $10^{-9}$ to $10^{-7}$, the running time of the comparison method increases roughly linearly; when the density exceeds $10^{-7}$, the comparison method cannot complete the CP decomposition of the tensor. For tensor densities from $10^{-9}$ to $10^{-6}$ the running time of the invention grows stably; when the tensor density increases to $10^{-5}$, the program's running time increases significantly. Because the invention stores the tensor sparsely, the number of non-zero elements grows with the density and more memory is needed on Spark to store the RDDs; when memory runs short, the mechanisms of discarding temporarily unneeded RDDs and of using the disk swap partition begin to kick in, causing extra operating cost and increasing the running time. Although the running time of the invention increases when CP-decomposing high-density tensors, it is acceptable for engineering applications.
In conclusion, the tensor CP decomposition of the invention takes less running time than the traditional method, can break through the limitations of single-machine software and hardware conditions, realizes CP decomposition of large-scale and high-density tensors, and maintains good timeliness.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the present invention, the present invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are permissible as long as they remain within the spirit and scope of the present invention as defined and determined by the appended claims; everything that makes use of the inventive concept falls within the protection of the invention.

Claims (2)

1. A tensor CP decomposition implementation method based on a distributed environment, for an N-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ with rank R, $I_n$ denoting the dimension of the n-th order, n = 1, 2, …, N, with
$$\hat{\mathcal{X}} = \sum_{r=1}^{R} a_r^{(1)} \circ a_r^{(2)} \circ \cdots \circ a_r^{(N)}$$
wherein $A^{(1)}, \ldots, A^{(n-1)}, A^{(n)}, \ldots, A^{(N)}$ are the factor matrices of the tensor; initializing N factor matrices $A^{(n)}$, alternately updating $A^{(1)}, A^{(2)}, \ldots, A^{(N)}$ at each iteration, the other factor matrices being fixed during calculation, and repeating the iteration until the value of the objective function is zero or less than a given threshold, wherein the N factor matrices $A^{(n)}$ are the result of the CP decomposition of the tensor $\mathcal{X}$, and the update formula of the factor matrix $A^{(n)}$ is:
$$A^{(n)} = X_{(n)}\left(A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)}\right)\left(A^{(N)T}A^{(N)} \ast \cdots \ast A^{(n+1)T}A^{(n+1)} \ast A^{(n-1)T}A^{(n-1)} \ast \cdots \ast A^{(1)T}A^{(1)}\right)^{\dagger}$$
wherein the superscript † denotes the pseudo-inverse, ⊙ denotes the Khatri-Rao product, and * denotes the Hadamard product;

characterized in that the factor matrix $A^{(n)}$ is updated with the following method:

S1: letting set D = {1, 2, …, N} - {n}, arranging the elements in set D in ascending order, and letting the j-th element be $d_j$, j = 1, 2, …, N-1; letting matrix $Y = X_{(n)}(A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)})$ and $V = (A^{(N)T}A^{(N)} \ast \cdots \ast A^{(n+1)T}A^{(n+1)} \ast A^{(n-1)T}A^{(n-1)} \ast \cdots \ast A^{(1)T}A^{(1)})^{\dagger}$;

S2: calculating $Y = X_{(n)}(A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)})$ by splitting the Khatri-Rao product, the specific steps being:

S2.1: initializing the rank index r = 1;

S2.2: initializing j = 1 and $\mathcal{Y}_1^{(r)} = \mathcal{X}$;

S2.3: Map: splitting the tensor $\mathcal{Y}_j^{(r)}$ along mode-$d_j$, emitting each element with the tuple of its remaining indices (all indices except the mode-$d_j$ index, the exact form of the key depending on whether $n > d_j$ or not) as key and the element as value, the elements with the same key constituting one mode-$d_j$ fiber of $\mathcal{Y}_j^{(r)}$; performing this map operation distributes the fibers of the tensor $\mathcal{Y}_j^{(r)}$ to the hosts of the Spark cluster; at the same time, transposing the column vector $a_r^{(d_j)}$ of the factor matrix $A^{(d_j)}$ to $a_r^{(d_j)T}$ and distributing it to each host of the Spark cluster as a broadcast variable;

S2.4: Reduce: after receiving the key/value data and the column vector $a_r^{(d_j)T}$, each host of the Spark cluster assembling the values with the same key into a fiber and calculating the inner product of the fiber and the column vector, the inner products forming the elements of
$$\mathcal{Y}_{j+1}^{(r)} = \mathcal{Y}_j^{(r)} \times_{d_j} a_r^{(d_j)T}$$
the index at which each inner product is stored depending on whether $n > d_{j+1}$ or $n < d_{j+1}$;

S2.5: judging whether j < N-1: if so, going to step S2.6, otherwise going to step S2.7;

S2.6: setting j = j+1 and returning to step S2.3;

S2.7: Map: each host of the Spark cluster performing a map operation with $code_1$ as key and its elements of $\mathcal{Y}_N^{(r)}$ as value, $code_1$ being a preset code; Reduce: the host that receives $code_1$ combining all the elements into the vector $y_r$;

S2.8: judging whether r < R: if so, going to step S2.9, otherwise going to step S2.10;

S2.9: setting r = r+1 and returning to step S2.2;

S2.10: merging the vectors $y_r$ obtained from the R loop calculations, taking $y_r$ as the r-th column vector of the matrix Y, thereby obtaining the matrix Y;

S3: calculating the matrix $V = (A^{(N)T}A^{(N)} \ast \cdots \ast A^{(n+1)T}A^{(n+1)} \ast A^{(n-1)T}A^{(n-1)} \ast \cdots \ast A^{(1)T}A^{(1)})^{\dagger}$ by parallel outer product calculation, the specific method being:

S3.1: initializing j = 1;

S3.2: calculating $V^{(d_j)} = A^{(d_j)T}A^{(d_j)}$ based on MapReduce:

1) Map: first splitting the matrix $A^{(d_j)}$ and distributing its row vectors to the hosts of the Spark cluster, i.e., performing a map operation with the row index i as key and the row vector $a_i^{(d_j)}$ as value;

2) Reduce: after receiving the key/value data $(i, a_i^{(d_j)})$, each host of the Spark cluster calculating the outer product $a_i^{(d_j)T} \circ a_i^{(d_j)}$, where $\circ$ denotes the outer product, summing all the outer products calculated on that host, and recording the result as $V_m^{(d_j)}$, m = 1, 2, …, M, M denoting the number of hosts of the Spark cluster;

3) Map: each host of the Spark cluster performing a map operation with $code_2$ as key and $V_m^{(d_j)}$ as value, $code_2$ being a preset code;

4) Reduce: the host that receives $code_2$ adding up all the $V_m^{(d_j)}$ to obtain:
$$V^{(d_j)} = A^{(d_j)T}A^{(d_j)} = \sum_{m=1}^{M} V_m^{(d_j)}$$

S3.3: judging whether j < N-1: if so, going to step S3.4, otherwise going to step S3.5;

S3.4: setting j = j+1 and returning to step S3.2;

S3.5: calculating the matrix V from the N-1 calculation results $V^{(d_j)}$, the specific process being:

1) Map: after calculating $V^{(d_j)}$, the host performing a map operation with $code_3$ as key and $V^{(d_j)}$ as value, $code_3$ being a preset code;

2) Reduce: the host that receives $code_3$ calculating the Hadamard product of all the $V^{(d_j)}$ and then the pseudo-inverse of the result, obtaining the matrix V;

S4: blocking the matrix Y and the matrix V, distributing the corresponding sub-blocks of Y and V to the hosts of the Spark cluster using a Map operation, performing matrix multiplication using a Reduce operation, sending the multiplication results to one host using a Map operation, and merging them using a Reduce operation to obtain $A^{(n)} = YV$.
2. The tensor CP decomposition implementation method as claimed in claim 1, wherein the blocking method for the matrix Y and the matrix V in S4 is: blocking the matrix Y by rows to obtain sub-block matrices whose number of columns is R, the block of the matrix V being the matrix V itself.
CN201711426277.0A 2017-12-26 2017-12-26 Tensor CP decomposition implementation method based on distributed environment Active CN108170639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711426277.0A CN108170639B (en) 2017-12-26 2017-12-26 Tensor CP decomposition implementation method based on distributed environment

Publications (2)

Publication Number Publication Date
CN108170639A CN108170639A (en) 2018-06-15
CN108170639B true CN108170639B (en) 2021-08-17

Family

ID=62520749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711426277.0A Active CN108170639B (en) 2017-12-26 2017-12-26 Tensor CP decomposition implementation method based on distributed environment

Country Status (1)

Country Link
CN (1) CN108170639B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11567816B2 (en) 2017-09-13 2023-01-31 Hrl Laboratories, Llc Transitive tensor analysis for detection of network activities
CN109299725B (en) * 2018-07-27 2021-10-08 华中科技大学鄂州工业技术研究院 Prediction system and device for parallel realization of high-order principal eigenvalue decomposition by tensor chain
US10796225B2 (en) * 2018-08-03 2020-10-06 Google Llc Distributing tensor computations across computing devices
CN110362780B (en) * 2019-07-17 2021-03-23 北京航空航天大学 Large data tensor canonical decomposition calculation method based on Shenwei many-core processor
CN111276183B (en) * 2020-02-25 2023-03-21 云南大学 Tensor decomposition processing method based on parameter estimation
CN111461193B (en) * 2020-03-25 2023-04-18 中国人民解放军国防科技大学 Incremental tensor decomposition method and system for open source event correlation prediction
EP4185970A1 (en) * 2020-07-22 2023-05-31 HRL Laboratories, LLC Transitive tensor analysis for detection of network activities
CN112835552A (en) * 2021-01-26 2021-05-25 算筹信息科技有限公司 Method for solving inner product of sparse matrix and dense matrix by outer product accumulation
CN115146780B (en) * 2022-08-30 2023-07-11 之江实验室 Quantum tensor network transposition and contraction cooperative method and device


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105260554A (en) * 2015-10-27 2016-01-20 武汉大学 GPU cluster-based multidimensional big data factorization method
CN107015946A (en) * 2016-01-27 2017-08-04 常州普适信息科技有限公司 Distributed high-order SVD and its incremental computations a kind of method
CN105913085A (en) * 2016-04-12 2016-08-31 中国科学院深圳先进技术研究院 Tensor model-based multi-source data classification optimizing method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Tensor Decomposition for Signal Processing and Machine Learning; Nicholas D. Sidiropoulos et al.; arXiv:1607.01668v2; 2016-12-14; pp. 1-44 *
Research on Tensor Decomposition Algorithms in a Distributed Environment [分布式环境下的张量分解算法研究]; adsuhviusa; http://www.doc88.com/p-3197463411177.html; 2017-12-04; pp. 6-49 *
Research on Data Structures for the Multi-core Era Based on Shared Memory [基于共享内存的多核时代数据结构研究]; Zhou Wei et al.; Journal of Software; 2016-04-30; Vol. 27, No. 4; pp. 1009-1025 *

Also Published As

Publication number Publication date
CN108170639A (en) 2018-06-15

Similar Documents

Publication Publication Date Title
CN108170639B (en) Tensor CP decomposition implementation method based on distributed environment
Lu et al. SpWA: An efficient sparse winograd convolutional neural networks accelerator on FPGAs
Smith et al. SPLATT: Efficient and parallel sparse tensor-matrix multiplication
Albericio et al. Cnvlutin: Ineffectual-neuron-free deep neural network computing
CN109328361B (en) Accelerator for deep neural network
Dang et al. CUDA-enabled Sparse Matrix–Vector Multiplication on GPUs using atomic operations
JP2016119084A (en) Computer-implemented system and method for efficient sparse matrix representation and processing
Ma et al. Optimizing sparse tensor times matrix on GPUs
WO2012076379A2 (en) Data structure for tiling and packetizing a sparse matrix
CN109033030B (en) Tensor decomposition and reconstruction method based on GPU
US20200159810A1 (en) Partitioning sparse matrices based on sparse matrix representations for crossbar-based architectures
WO2012076377A2 (en) Optimizing output vector data generation using a formatted matrix data structure
Rungsawang et al. Fast pagerank computation on a gpu cluster
Conte et al. GPU-acceleration of waveform relaxation methods for large differential systems
D’Amore et al. Mathematical approach to the performance evaluation of matrix multiply algorithm
Gu et al. Efficient large scale distributed matrix computation with spark
US20180373677A1 (en) Apparatus and Methods of Providing Efficient Data Parallelization for Multi-Dimensional FFTs
US20220382829A1 (en) Sparse matrix multiplication in hardware
WO2022016261A1 (en) System and method for accelerating training of deep learning networks
Jain-Mendon et al. A hardware–software co-design approach for implementing sparse matrix vector multiplication on FPGAs
Wang et al. A novel parallel algorithm for sparse tensor matrix chain multiplication via tcu-acceleration
Wu et al. Optimizing dynamic programming on graphics processing units via data reuse and data prefetch with inter-block barrier synchronization
US9600446B2 (en) Parallel multicolor incomplete LU factorization preconditioning processor and method of use thereof
Caron et al. On the performance of parallel factorization of out-of-core matrices
CN114428936A (en) Allocating processing threads for matrix-matrix multiplication

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant