CN105260554A - GPU cluster-based multidimensional big data factorization method - Google Patents


Info

Publication number
CN105260554A
CN105260554A
Authority
CN
China
Prior art keywords
tensor
factor
subset
data
overbar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510708583.8A
Other languages
Chinese (zh)
Inventor
陈丹
胡阳阳
蔡畅
李小俚
王力哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201510708583.8A priority Critical patent/CN105260554A/en
Publication of CN105260554A publication Critical patent/CN105260554A/en
Pending legal-status Critical Current


Abstract

The invention discloses a GPU cluster-based multidimensional big data factorization method. It aims to solve the problem that the conventional grid parallel factor analysis model cannot handle large-scale, high-dimensional multidimensional data analysis, and provides an effective graphics-processing-unit-based multi-mode decomposition method for multidimensional big data, namely a hierarchical parallel factor analysis framework. The framework builds on the conventional grid parallel factor analysis model and comprises a coarse-grained process that integrates the tensor subsets and a fine-grained process that computes all tensor subsets and fuses the factor subsets. It runs on a cluster formed by multiple nodes, each of which contains several graphics processing units. Performing tensor decomposition on GPU devices fully exploits their powerful parallel computing capability and the parallelism arising in tensor decomposition. Experimental results show that the method greatly shortens the execution time for obtaining the tensor factors, improves the large-scale data processing capability, and effectively alleviates the problem of insufficient computing resources.

Description

A GPU cluster-based multidimensional big data factorization method
Technical field
The invention belongs to the technical field of signal analysis and relates to a multidimensional big data analysis method, in particular to an efficient GPU cluster-based multidimensional big data factorization method.
Background technology
In complex data-analysis applications, decomposing large-scale tensors so that the dynamics of the data are preserved without introducing large distortions faces growing challenges as data scale and dimensionality keep increasing. Extracting useful information from multidimensional data, for example for feature extraction and dimensionality reduction, is increasingly important in today's science and engineering. Two-dimensional decomposition methods such as singular value decomposition (SVD), principal component analysis (PCA), and independent component analysis (ICA) lose the correspondence between different dimensions when they are applied directly to high-dimensional data. In contrast, parallel factor analysis (PARAFAC), canonical polyadic decomposition (CPD), and the Tucker model are better suited to decomposing three-dimensional or higher-dimensional data and can be solved by the alternating least squares (ALS) method. Compared with Tucker, PARAFAC is easier to interpret, avoids the rotation-freedom problem typical of two-dimensional methods, and guarantees the uniqueness of the solution.
To improve the tensor decomposition of high-dimensional data, researchers have explored two directions: optimizing the factor-computation process, and accelerating the decomposition with high-performance computing devices. The most notable works are the following. The enhanced line search (ELS) proposed by Rajih [Document 1] accelerates convergence, its basic idea being to find an optimal relaxation factor. [Document 2, Document 3] compute Hadamard products of small matrices instead of multiplying the Khatri-Rao product with a large matrix and, based on a parallel framework, can process large-scale data.
PARAFAC can analyze data of arbitrary size and dimensionality, but its computational complexity is high and it places high demands on machine performance. Applications have therefore split the whole data set with a sliding window and analyzed the data along a certain dimension piece by piece with PARAFAC, which laid the foundation of dynamic tensor analysis for big data; however, the direct fusion of data under this scheme loses part of the correlation between the data, and the causal factors obtained can hardly reflect the dynamics of the raw data.
To handle large-scale data, the PARAFAC theory has been extended by regarding the large-scale data as a grid of small data sets, i.e. grid PARAFAC ([Document 4]). This method converts the decomposition of the tensor into the decomposition of independent tensor subsets, and fusing the outputs of the tensor subsets yields the full factors of the tensor. The scheme is effective, but it faces two major problems: insufficient computational resources, and divergence of results caused by splitting the tensor into subsets.
[Document 1] M. Rajih, P. Comon, and R. A. Harshman, "Enhanced line search: A novel method to accelerate PARAFAC," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 3, pp. 1128-1147, 2008.
[Document 2] A. H. Phan and A. Cichocki, "Advances in PARAFAC using parallel block decomposition," in Neural Information Processing, Springer Berlin Heidelberg, 2009, pp. 323-330.
[Document 3] A. Huy Phan and A. Cichocki, "PARAFAC algorithms for large-scale problems," Neurocomputing, vol. 74, no. 11, pp. 1970-1984, 2011.
[Document 4] R. A. Harshman and M. E. Lundy, "PARAFAC: Parallel factor analysis," Computational Statistics & Data Analysis, vol. 18, no. 1, pp. 39-72, 1994.
Summary of the invention
Aiming at the deficiencies of existing methods, the present invention proposes an efficient GPU cluster-based multidimensional big data factorization method, namely the H-PARAFAC framework. The framework is based on grid PARAFAC and comprises a coarse-grained model that integrates the tensor subsets and a fine-grained model that computes each tensor subset and fuses the factor subsets in parallel. The framework runs on a cluster composed of multiple nodes, each containing several GPUs. Tensor decomposition on GPU devices fully exploits their powerful parallel computing capability and parallel resources, greatly reduces the execution time for obtaining the tensor factors, improves the large-scale data processing capability, and effectively alleviates problems such as insufficient computing resources.
The technical solution adopted by the present invention is a GPU cluster-based multidimensional big data factorization method, characterized by comprising the following steps:
Step 1: build the PARAFAC model;
The PARAFAC model is given by the factor matrices A, B, C of the initialization model, with the weights represented by λ:
$$\chi = [\lambda; A, B, C] = \sum_{r=1}^{F} \lambda_r \, a_r \circ b_r \circ c_r \quad (1);$$
Step 2: build the grid PARAFAC model and m-ALS;
The grid PARAFAC model regards the tensor Y as a combination of tensor subsets $\bar{Y}_{(\bar{k})}$:
$$Y = \mathcal{I} \times_0 A^{(0)} \times_1 A^{(1)} \times_2 A^{(2)} \quad (2);$$
$$\bar{Y}_{(\bar{k})} \approx \bar{\mathcal{I}} \times_0 U_{(\bar{k})}^{(0)} \times_1 U_{(\bar{k})}^{(1)} \times_2 U_{(\bar{k})}^{(2)} \quad (3);$$
where $k_i = 1, \ldots, K_i$, $i = 1, \ldots, N$; $\times_n$ denotes the mode-$n$ product of a tensor; $A^{(n)}$ and $U^{(n)}_{(\bar{k})}$ are the $n$-th factors of the full tensor and of tensor subset $\bar{k}$, respectively; and $K_i$ is the number of subsets along the $i$-th mode;
The full factors $A^{(i)}$ of the tensor Y can be obtained from the factor subsets $U^{(i)}_{(\bar{k})}$ by m-ALS:
All tensor subsets are first decomposed by the traditional ALS method to obtain their factors:
$$[U_{(\bar{k})}^{(1)}, \ldots, U_{(\bar{k})}^{(N)}] = \mathrm{parafacALS}(\bar{Y}_{(\bar{k})}) \quad (5);$$
The intermediate variables P and Q are initialized from the factors of the tensor subsets:
$$P_{(\bar{k})} = (U_{(\bar{k})}^{(N)T} A_{(k_N)}^{(N)}) \circledast \cdots \circledast (U_{(\bar{k})}^{(1)T} A_{(k_1)}^{(1)}) \quad (6);$$
$$Q_{(\bar{k})} = (A_{(k_N)}^{(N)T} A_{(k_N)}^{(N)}) \circledast \cdots \circledast (A_{(k_1)}^{(1)T} A_{(k_1)}^{(1)}) \quad (7);$$
where $\circledast$ denotes the Hadamard product;
Step 3: build the hierarchical parallel factor analysis model H-PARAFAC;
The hierarchical parallel factor analysis model H-PARAFAC runs on a cluster composed of multiple nodes, each node containing a multi-core CPU and multiple GPUs; the cluster is a distributed shared-memory system (DSM), the nodes are connected via WiMAX and communicate with each other through MPI, and the coarse-grained model running on the cluster manages the computation procedure;
The hierarchical parallel factor analysis model H-PARAFAC contains multiple POSIX threads; each thread is assigned one GPU device and distributes data and computation tasks to the GPU in a multi-stream concurrent CUDA mode; H-PARAFAC synchronizes updated data across devices and across nodes; all computation tasks are executed on the GPUs by the fine-grained model; the hierarchical parallel factor analysis model H-PARAFAC decomposes the tensor subsets and realizes the m-ALS algorithm in a parallel fashion, and the parallel Hadamard computation mode allows the m-ALS algorithm to obtain all the factors of the tensor in parallel;
Step 4: build the coarse-grained model;
The coarse-grained model runs on the distributed shared-memory architecture; the hierarchical parallel factor analysis model H-PARAFAC is provided as a function transparent to the user, and this function hides the underlying message passing interface (MPI), the compute unified device architecture (CUDA) middleware and other hardware details; the coarse-grained model mainly covers the execution of the H-PARAFAC model it drives and the tensor partitioning method;
Step 5: build the fine-grained parallel model;
In the fine-grained parallel model, CUDA threads are responsible for evaluating the factors of the tensor subsets, mainly comprising the initialization of the factor subsets, symmetric data transmission, parallel Hadamard product computation, and intelligent partitioning.
Preferably, the m-ALS described in step 2 is computed as follows: it minimizes the standard Euclidean distance as the gradient cost function and thus connects the factor subsets along the horizontal direction; in the m-ALS iteration, when the $i$-th mode is computed, the factors of the tensor subsets with the same $k_i$ are fused into one factor set; once all modes have been iterated over, the factor subsets are connected by the following formulas:
$$T = T + U_{(\bar{k})}^{(n)} P_{(\bar{k})}; \quad S = S + Q_{(\bar{k})} \quad (9);$$
$$A_{(k_n)}^{(n)} = T S^{-1} \quad (10);$$
where $1 \leq k_n \leq K_n$ and the operator in formula (8) denotes selection; after all factor subsets have been updated, the intermediate tensors P and Q are updated as well, for use in the next iteration:
$$P_{(\bar{k})} = P_{(\bar{k})} \circledast (U_{(\bar{k})}^{(n)T} A_{(k_n)}^{(n)}); \quad Q_{(\bar{k})} = Q_{(\bar{k})} \circledast (A_{(k_n)}^{(n)T} A_{(k_n)}^{(n)}) \quad (11).$$
Preferably, the execution of the hierarchical parallel factor analysis model H-PARAFAC driven by the coarse-grained model described in step 4 comprises the following sub-steps:
(1) initialize the GPUs;
(2) evaluate the factors U of the tensor subsets according to formula (5);
(3) compute the intermediate tensors (P, Q) by the learning rules in formulas (6) and (7);
(4) compute the blocks of each mode, with the host-side threads controlling the devices:
1. combine the inter-block factors T, S by the learning rules of formulas (8) and (9);
2. update the factor subsets A by the learning rule of formula (10);
3. update the intermediate tensors (P, Q) by the learning rule of formula (11);
(5) retrieve the factor subsets A from the device side;
(6) update the factor subsets A and the intermediate tensors (P, Q) globally via WiMAX and MPI;
(7) check the stopping criterion, i.e. whether all factors A have been updated: if yes, end the procedure; otherwise return to sub-step (4).
Preferably, the tensor partitioning method described in step 4 is implemented as follows:
The hierarchical parallel factor analysis model H-PARAFAC distributes the blocks evenly to the devices to balance the computing resources; each device only needs to obtain the data required by the current iteration, and the node passes the blocks to the devices in sequence; the number of blocks on the k-th device is computed as:
$$S_k = \left[ \frac{S - \sum_{i=0}^{k-1} S_i}{n - k} \right], \quad S_0 = \left[ \frac{S}{n} \right] \quad (12)$$
where k denotes the k-th device, $k = 1, \ldots, n-1$ (n being the number of devices), S denotes the total number of blocks, and $S_k$ denotes the number of blocks on the k-th device; the numbers of blocks on the GPUs are equal or differ by one;
Each node holds intermediate data, and each node only needs to update part of the data; the remaining part is the data updated in the previous iteration.
Preferably, the evaluation of the tensor-subset factors and the initialization of the factor subsets described in step 5 are implemented as follows: the initial values of the tensor-subset factors are obtained by DTLD, and the conventional PARAFAC driven by ALS then obtains the tensor-subset factors; DTLD avoids local optima in the factor initialization and is composed of the Tucker model and the GRAM method; in the parallel DTLD method, the tensor subsets and their decomposition steps are unified into CUDA stream processes, so that multiple tensor subsets can be processed under the multi-stream concurrent CUDA computing framework on multiple GPUs; once these tasks are completed, the devices or hosts of all nodes synchronously exchange the results of the tensor-subset factors.
Preferably, the symmetric data transmission described in step 5 is implemented as follows:
In each iteration, to obtain the final full factors, m-ALS needs the P, Q, and U data of every mode; P and Q exist in tensor form, their dimensionality is consistent with that of the original tensor, both tensors are partitioned into tensor subsets, and their indexing is also consistent; the factor subsets correspond to specific modes, and P and Q are updated continuously; in the computation of the n-th mode, the intermediate tensors are partitioned into blocks according to the n-th index, and when the computation proceeds to the (n+1)-th mode, data that were stored contiguously will be assigned to different blocks;
In each iteration, the node transmits the intermediate data (P, Q, U) updated by the other devices to the local device (a node contains multiple devices; a device refers to a computing platform, and a node is the overall processing platform); because the local device only processes certain specific factor subsets, the node needs to locate the corresponding blocks in the possibly multidimensional data and send them to the device side at the same time;
In the hierarchical parallel factor analysis model H-PARAFAC, the device computes the indices of the tensor subsets as extended to the (n+1)-th mode, and when the n-th-mode data is completed the device copies these data directly to the specific location on the node; this step can be described by the permutation theory in formulas (13) and (14), in which the element $u_{(i,j,k)}$ denotes a data cell of the tensors P and Q and (i, j, k) is the coordinate of the data cell; in formula (13), $p_1$ is one transfer, and after one one-to-one mapping the index becomes (j, k, i); formula (14) shows that the indices pass through three transfers:
$$p_1 = \begin{pmatrix} u_{(0,0,0)} & \cdots & u_{(i,j,k)} & \cdots & u_{(I,J,K)} \\ \downarrow & \cdots & \downarrow & \cdots & \downarrow \\ u_{(0,0,0)} & \cdots & u_{(j,k,i)} & \cdots & u_{(J,K,I)} \end{pmatrix} \quad (13);$$
$$u_{(i,j,k)} \xrightarrow{p_1} u_{(j,k,i)}; \quad u_{(j,k,i)} \xrightarrow{p_1} u_{(k,i,j)}; \quad u_{(k,i,j)} \xrightarrow{p_1} u_{(i,j,k)} \quad (14);$$
According to the coordinate transfer, the offset of a data cell in linear memory can be computed by the device; within one round along one mode, everything is computed in CUDA streams, the updated intermediate data can be copied directly, according to the offsets, to the designated addresses in the host-side storage space, and the new blocks are automatically stored in contiguous storage space to facilitate proceeding to the data of the next mode.
Compared with the existing conventional multidimensional data analysis methods based on parallel factors, the present invention has the following advantages and beneficial effects:
(1) the present invention proposes a parallel computing framework, H-PARAFAC, which enables the algorithm to handle large-scale, high-dimensional data analysis problems quickly;
(2) the hierarchical parallel computing framework proposed by the present invention comprises a coarse-grained model that integrates the tensor subsets and a fine-grained computation model that computes and fuses the tensor subsets;
(3) the H-PARAFAC proposed by the present invention uses multiple GPU devices on a hybrid cluster computing platform to decompose the tensor subsets in a parallel fashion.
Brief description of the drawings
Figure 1: schematic diagram of the grid PARAFAC model of the embodiment of the present invention;
Figure 2: schematic diagram of the hierarchical parallel factor analysis model H-PARAFAC of the embodiment of the present invention;
Figure 3: flow chart of the tensor decomposition under H-PARAFAC of the embodiment of the present invention;
Figure 4: diagram of the symmetric data transmission mode of the embodiment of the present invention;
Figure 5: execution time for obtaining all factors of the tensors under different splitting schemes in the embodiment of the present invention; panel (a): execution time versus data size; panel (b): execution time versus block size.
Embodiment
To facilitate understanding and implementation of the present invention by those of ordinary skill in the art, the present invention is described in further detail below with reference to the drawings and embodiments; it should be understood that the embodiments described here are only used to illustrate and explain the present invention and are not intended to limit it.
Referring to Fig. 1, in view of the problem that the conventional grid parallel factor analysis model (grid PARAFAC) cannot handle large-scale, high-dimensional multidimensional data analysis, the present invention proposes an effective multi-mode decomposition method for multidimensional big data based on a graphics processing unit (GPU) cluster, namely the hierarchical parallel factor analysis (H-PARAFAC) framework. The framework is based on grid PARAFAC and comprises a coarse-grained model that integrates the tensor subsets and a fine-grained model that computes each tensor subset and fuses the factor subsets in parallel. The framework runs on a cluster composed of multiple nodes, each of which contains several GPUs. Tensor decomposition on GPU devices fully exploits their powerful parallel computing capability and parallel resources, greatly reduces the execution time for obtaining the tensor factors, improves the large-scale data processing capability, and effectively alleviates problems such as insufficient computing resources.
The effect of this embodiment was evaluated on a hybrid computing cluster composed of two workstations, each equipped with 4 NVIDIA Tesla C2050 graphics cards. The execution environment is configured as follows:
The operating system is 64-bit Windows 2008 Enterprise Edition, with two Intel Xeon E5620 2.40 GHz CPUs and 24 GB of RAM; the code is compiled under MS Visual Studio 2010; the bus is PCI-E at 5.0 Gbps and the network provides 32 Gbps transmission bandwidth; the cluster contains 8 NVIDIA Tesla C2050 cards in total, each with 448 CUDA cores, a core frequency of 1.5 GHz, 2.5 GB of standard memory, and 144 GB/s memory bandwidth.
The experimental data are three-dimensional tensors selected from basic sparse smooth signals, such as rectified half-wave cosine and sine signals. The data have the same size in all three dimensions; the sizes are 300, 600, 900, 1200, 1500, 1800, 2100, and 2400. A tensor is divided into a grid of K*K*K subsets, with K = 2, 3, 4, 5, 6, 8.
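As an illustration only, the following Python sketch builds a small cubic test tensor of the kind described above from rectified half-wave sine and cosine signals and splits it into a K*K*K grid of tensor subsets; the exact signal construction of the experiments is not specified here, so the helper names make_test_tensor and split_grid and the particular signals are assumptions, and the size is scaled down for readability.

```python
import numpy as np

def make_test_tensor(n=60, rank=3):
    """Cubic rank-`rank` tensor built from rectified half-wave sine/cosine signals."""
    t = np.linspace(0, 2 * np.pi, n)
    # rectified (half-wave) smooth signals used as factor columns
    A = np.stack([np.maximum(np.sin((r + 1) * t), 0) for r in range(rank)], axis=1)
    B = np.stack([np.maximum(np.cos((r + 1) * t), 0) for r in range(rank)], axis=1)
    C = np.stack([np.maximum(np.sin((r + 2) * t), 0) for r in range(rank)], axis=1)
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def split_grid(Y, K=2):
    """Split a cubic tensor into a K*K*K grid of tensor subsets."""
    blocks, n = {}, Y.shape[0] // K
    for a in range(K):
        for b in range(K):
            for c in range(K):
                blocks[(a, b, c)] = Y[a*n:(a+1)*n, b*n:(b+1)*n, c*n:(c+1)*n]
    return blocks

Y = make_test_tensor(n=60)
subsets = split_grid(Y, K=2)
print(len(subsets), subsets[(0, 0, 0)].shape)  # 8 subsets of shape (30, 30, 30)
```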
The GPU cluster-based multidimensional big data factorization method provided by the present invention comprises the following steps:
Step 1: build the PARAFAC model;
The PARAFAC model is given by the factor matrices A, B, C of the initialization model, with the weights represented by λ:
$$\chi = [\lambda; A, B, C] = \sum_{r=1}^{F} \lambda_r \, a_r \circ b_r \circ c_r \quad (1);$$
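To make formula (1) concrete, the following sketch (not part of the claimed implementation) reconstructs a third-order rank-F tensor from the factor matrices A, B, C and the weights λ in NumPy; the variable names simply mirror the symbols in formula (1).

```python
import numpy as np

def parafac_reconstruct(lam, A, B, C):
    """Rebuild chi = sum_r lam_r * a_r o b_r o c_r, see formula (1)."""
    I, F = A.shape
    J, K = B.shape[0], C.shape[0]
    chi = np.zeros((I, J, K))
    for r in range(F):
        # outer product of the r-th columns, scaled by lam_r
        chi += lam[r] * np.einsum('i,j,k->ijk', A[:, r], B[:, r], C[:, r])
    return chi

# tiny usage example with random factors
rng = np.random.default_rng(0)
A, B, C = rng.random((5, 3)), rng.random((6, 3)), rng.random((7, 3))
lam = np.ones(3)
chi = parafac_reconstruct(lam, A, B, C)
print(chi.shape)  # (5, 6, 7)
```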
Step 2: build the grid PARAFAC model and m-ALS;
The grid PARAFAC model regards the tensor Y as a combination of tensor subsets $\bar{Y}_{(\bar{k})}$:
$$Y = \mathcal{I} \times_0 A^{(0)} \times_1 A^{(1)} \times_2 A^{(2)} \quad (2);$$
$$\bar{Y}_{(\bar{k})} \approx \bar{\mathcal{I}} \times_0 U_{(\bar{k})}^{(0)} \times_1 U_{(\bar{k})}^{(1)} \times_2 U_{(\bar{k})}^{(2)} \quad (3);$$
where $k_i = 1, \ldots, K_i$, $i = 1, \ldots, N$; $\times_n$ denotes the mode-$n$ product of a tensor; $A^{(n)}$ and $U^{(n)}_{(\bar{k})}$ are the $n$-th factors of the full tensor and of tensor subset $\bar{k}$, respectively; and $K_i$ is the number of subsets along the $i$-th mode;
The full factors $A^{(i)}$ of the tensor Y can be obtained from the factor subsets $U^{(i)}_{(\bar{k})}$ by m-ALS:
All tensor subsets are first decomposed by the traditional ALS method to obtain their factors:
$$[U_{(\bar{k})}^{(1)}, \ldots, U_{(\bar{k})}^{(N)}] = \mathrm{parafacALS}(\bar{Y}_{(\bar{k})}) \quad (5);$$
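The parafacALS routine in formula (5) is the standard alternating least squares update for a third-order CP model; a minimal NumPy sketch is given below for reference. The in-patent version runs on GPU devices, so this serial version, its helper names, and the fixed iteration count are illustrative assumptions only.

```python
import numpy as np

def unfold(Y, mode):
    """Mode-n matricization of a 3-way tensor (row-major, last remaining index fastest)."""
    return np.moveaxis(Y, mode, 0).reshape(Y.shape[mode], -1)

def khatri_rao(X, Z):
    """Column-wise Khatri-Rao product; rows of X vary slowest."""
    r = X.shape[1]
    return (X[:, None, :] * Z[None, :, :]).reshape(-1, r)

def parafac_als(Y, rank, n_iter=50, seed=0):
    """Plain ALS for [U1, U2, U3] ~ parafacALS(Y), see formula (5)."""
    rng = np.random.default_rng(seed)
    U = [rng.random((Y.shape[n], rank)) for n in range(3)]
    for _ in range(n_iter):
        for n in range(3):
            others = [U[m] for m in range(3) if m != n]
            kr = khatri_rao(others[0], others[1])
            # Hadamard product of the Gram matrices of the other factors
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            U[n] = unfold(Y, n) @ kr @ np.linalg.pinv(gram)
    return U
```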
The intermediate variables P and Q are initialized from the factors of the tensor subsets:
$$P_{(\bar{k})} = (U_{(\bar{k})}^{(N)T} A_{(k_N)}^{(N)}) \circledast \cdots \circledast (U_{(\bar{k})}^{(1)T} A_{(k_1)}^{(1)}) \quad (6);$$
$$Q_{(\bar{k})} = (A_{(k_N)}^{(N)T} A_{(k_N)}^{(N)}) \circledast \cdots \circledast (A_{(k_1)}^{(1)T} A_{(k_1)}^{(1)}) \quad (7);$$
where $\circledast$ denotes the Hadamard product;
The m-ALS computation is given by formulas (8)-(11). It minimizes the standard Euclidean distance as the gradient cost function and thus connects the factor matrices along the horizontal direction; as shown in Figure 1, the matrices whose coordinate in a given mode is the same are assigned to the same group. In the m-ALS iteration, taking three-dimensional data as an example, the tensor is split along three directions; when the $i$-th mode is computed, the factors of the tensor subsets with the same $k_i$ are fused into one factor set; once all modes have been iterated over, the factor subsets are connected by formulas (8)-(11):
$$T = T + U_{(\bar{k})}^{(n)} P_{(\bar{k})}; \quad S = S + Q_{(\bar{k})} \quad (9);$$
In formula (8) the operator denotes selection; the matrices T and S of formula (9) are used to compute the final factor subset in formula (10):
$$A_{(k_n)}^{(n)} = T S^{-1} \quad (10);$$
After all factor subsets have been updated ($k_n$ running from 1 to $K_n$), the intermediate tensors P and Q are also updated, as in formula (11), for the next iteration:
$$P_{(\bar{k})} = P_{(\bar{k})} \circledast (U_{(\bar{k})}^{(n)T} A_{(k_n)}^{(n)}); \quad Q_{(\bar{k})} = Q_{(\bar{k})} \circledast (A_{(k_n)}^{(n)T} A_{(k_n)}^{(n)}) \quad (11);$$
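A condensed NumPy sketch of the fusion formulas (6), (7) and (9)-(11) is given below as a serial illustration. Formula (8) is not reproduced in the text above, so the selection is taken here to mean the subsets whose n-th index equals the current block index; the data layout in the comments and the helper names init_PQ and fuse_mode are assumptions, and in H-PARAFAC these updates run in parallel on the GPUs.

```python
import numpy as np

# Assumed illustrative data layout:
#   U[kbar]  : list of N subset factors U^{(n)}_{(kbar)}   (output of formula (5))
#   A[n][kn] : current block of the full factor A^{(n)}_{(kn)}
#   P[kbar], Q[kbar] : F x F intermediate matrices          (formulas (6)-(7))

def init_PQ(U, A, kbar):
    """Formulas (6)-(7): Hadamard products of cross-Gram matrices for subset kbar."""
    F = U[kbar][0].shape[1]
    P, Q = np.ones((F, F)), np.ones((F, F))
    for n, kn in enumerate(kbar):
        P = P * (U[kbar][n].T @ A[n][kn])      # formula (6)
        Q = Q * (A[n][kn].T @ A[n][kn])        # formula (7)
    return P, Q

def fuse_mode(n, K, U, A, P, Q, subsets):
    """One mode-n fusion pass: formulas (9)-(11)."""
    F = next(iter(P.values())).shape[0]
    for kn in range(K[n]):
        T = np.zeros_like(A[n][kn])
        S = np.zeros((F, F))
        for kbar in subsets:                   # selection: subsets whose n-th index is kn
            if kbar[n] != kn:
                continue
            T += U[kbar][n] @ P[kbar]          # formula (9)
            S += Q[kbar]
        A[n][kn] = T @ np.linalg.pinv(S)       # formula (10)
    for kbar in subsets:                       # formula (11): refresh the intermediates
        kn = kbar[n]
        P[kbar] = P[kbar] * (U[kbar][n].T @ A[n][kn])
        Q[kbar] = Q[kbar] * (A[n][kn].T @ A[n][kn])
```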
Step 3: build the hierarchical parallel factor analysis model H-PARAFAC;
Referring to Fig. 2, the hierarchical parallel factor analysis model H-PARAFAC runs on a cluster composed of multiple nodes, each node containing a multi-core CPU and multiple GPUs; the cluster is a distributed shared-memory system (DSM), the nodes are connected via WiMAX and communicate with each other through MPI, and the coarse-grained model running on the cluster manages the computation procedure;
The hierarchical parallel factor analysis model H-PARAFAC contains multiple POSIX threads; each thread is assigned one GPU device and distributes data and computation tasks to the GPU in a multi-stream concurrent CUDA mode; H-PARAFAC synchronizes updated data across nodes, devices, and threads; all computation tasks are executed on the GPUs by the fine-grained model; the hierarchical parallel factor analysis model H-PARAFAC decomposes the tensor subsets and realizes the m-ALS algorithm in a parallel fashion, and the parallel Hadamard computation mode allows the m-ALS algorithm to obtain all the factors of the tensor in parallel;
Step 4: build the coarse-grained model;
The coarse-grained model runs on the distributed shared-memory architecture; the hierarchical parallel factor analysis model H-PARAFAC is provided as a function transparent to the user, and this function hides the underlying message passing interface (MPI), the compute unified device architecture (CUDA) middleware and other hardware details; the coarse-grained model mainly covers the execution of the H-PARAFAC model it drives and the tensor partitioning method;
The execution of the hierarchical parallel factor analysis model H-PARAFAC driven by the coarse-grained model comprises the following sub-steps:
(1) initialize the GPUs;
(2) evaluate the factors U of the tensor subsets according to formula (5);
(3) compute the intermediate tensors (P, Q) by the learning rules in formulas (6) and (7);
(4) compute the blocks of each mode, with the host-side threads controlling the devices:
1. combine the inter-block factors T, S by the learning rules of formulas (8) and (9);
2. update the factor subsets A by the learning rule of formula (10);
3. update the intermediate tensors (P, Q) by the learning rule of formula (11);
(5) retrieve the factor subsets A from the device side;
(6) update the factor subsets A and the intermediate tensors (P, Q) globally via WiMAX and MPI;
(7) check the stopping criterion, i.e. whether all factors A have been updated: if yes, end the procedure; otherwise return to sub-step (4), as summarized in the sketch below.
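The sub-steps above can be condensed into the following serial reference loop. It illustrates only the control flow of the coarse-grained driver, not a validated solver: split_grid, parafac_als, init_PQ and fuse_mode are the sketches shown earlier, formula (8) is not reproduced in the text, the averaging used to seed the full-factor blocks is an assumption, and the device gathering and MPI exchange of sub-steps (5)-(6) are only noted in comments.

```python
import numpy as np

def h_parafac_reference(Y, rank=3, K=2, sweeps=5):
    """Serial reference of the coarse-grained driver, reusing the earlier sketches."""
    subsets = split_grid(Y, K)
    # (2) formula (5): one subset per CUDA stream / GPU in the real framework
    U = {k: parafac_als(Yk, rank) for k, Yk in subsets.items()}
    # seed the full-factor blocks by averaging subset factors sharing a block index (assumption)
    A = [[np.mean([U[k][n] for k in U if k[n] == kn], axis=0)
          for kn in range(K)] for n in range(3)]
    # (3) formulas (6)-(7)
    P, Q = {}, {}
    for kbar in subsets:
        P[kbar], Q[kbar] = init_PQ(U, A, kbar)
    for _ in range(sweeps):                                    # (7) fixed sweep count
        for n in range(3):                                     # (4) per-mode block computation
            fuse_mode(n, [K, K, K], U, A, P, Q, list(subsets)) # formulas (8)-(11)
        # (5)-(6): on the cluster, A, P, Q would now be gathered from the devices
        # and exchanged globally over the network and MPI; omitted in this serial sketch.
    return A
```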
There are three levels of synchronization in the whole procedure: 1. synchronization between nodes; 2. synchronization between host-side threads; 3. synchronization between CUDA threads. CUDA synchronization guarantees the correctness of the computation, while node synchronization and POSIX thread synchronization guarantee reliable data transmission; communication between nodes is realized by aggregated MPI transfer functions.
Because the computing resources of a GPU device are limited, the coarse-grained model uses CUDA streams to form multiple kernels so as to adapt to data of arbitrary scale. Each CUDA stream processes one or more data blocks, and the number of CUDA streams depends on the data size and the memory capacity of the GPU; a CUDA stream is represented on the device by a series of instructions run in a specified sequence, within the same computation stream a new kernel is formed once the previous one completes, and the use of concurrent streams makes full use of the computing resources of the GPU.
The tensor partitioning method is implemented as follows:
The hierarchical parallel factor analysis model H-PARAFAC distributes the blocks evenly to the devices to balance the computing resources; each device only needs to obtain the data required by the current iteration, and the node passes the blocks to the devices in sequence; the number of blocks on the k-th device is computed as:
$$S_k = \left[ \frac{S - \sum_{i=0}^{k-1} S_i}{n - k} \right], \quad S_0 = \left[ \frac{S}{n} \right] \quad (12)$$
where k denotes the k-th device, $k = 1, \ldots, n-1$ (n being the number of devices), S denotes the total number of blocks, and $S_k$ denotes the number of blocks on the k-th device; the numbers of blocks on the GPUs are equal or differ by one;
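A small sketch of the block-allocation rule of formula (12) follows, under the assumption that the square brackets denote rounding to the nearest integer; it distributes S blocks over n devices so that the per-device counts differ by at most one.

```python
def allocate_blocks(S, n):
    """Formula (12): number of blocks per device, balanced to within one block."""
    counts = []
    remaining = S
    for k in range(n):
        # S_k = round((blocks still unassigned) / (devices left))
        s_k = round(remaining / (n - k))
        counts.append(s_k)
        remaining -= s_k
    return counts

print(allocate_blocks(10, 3))  # e.g. [3, 4, 3]: counts differ by at most one
```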
For example, consider two hosts each connected to three devices, with the full tensor divided into 10 blocks along the first mode; the block indices run from 0 to 9 and the device index is k = (h*3|d), where h is the host and d the device; the tensor splitting scheme is given in Table 1:
Table 1: tensor splitting scheme
Each node holds intermediate data, such as the tensor-subset factors U, the factor subsets A, and the tensors P and Q; each node only needs to update part of the data, the remaining part being the data updated in the previous iteration. As shown in Table 1, device (2) of host (0) only updates the first-mode blocks assigned to it, and host (0) only needs to obtain the corresponding blocks from its devices; the final fusion yields the full factor A^(1) of the first mode.
The coarse-grained model of this embodiment is equivalent to managing the computation, while the fine-grained model is responsible for the concrete computation. The fine-grained model of this embodiment covers four parts: the initialization of the factors, the data transmission under a given storage layout, the computation of the parallel Hadamard product, and the intelligent partitioning. The Hadamard product is realized by the fine-grained model, but it is computed in parallel and this process is managed, like the other parallel work, by the coarse-grained model; the Hadamard product itself is a conventional computation, and how it processes the data has already been introduced in step 2. The intelligent partitioning scheme also belongs to the concrete execution and hence to the fine-grained model; the splitting itself is simple blocking and is not the emphasis; the key is the integration of the block data, which is again data management, so this embodiment discusses it within the coarse-grained model, as the tensor partitioning part of step 4.
Step 5: build the fine-grained parallel model;
In the fine-grained parallel model, CUDA threads are responsible for evaluating the factors of the tensor subsets, mainly comprising the initialization of the factor subsets, symmetric data transmission, the computation of the parallel Hadamard product, and intelligent partitioning;
For the evaluation of the tensor-subset factors and the initialization of the factor subsets, the implementation is as follows: the m-ALS method needs to estimate the factors of all tensor subsets and takes the obtained factor subsets as input. This step obtains the initial values of the tensor-subset factors by DTLD, and the conventional PARAFAC driven by ALS then obtains the tensor-subset factors; DTLD avoids local optima in the factor initialization and is composed of the Tucker model and the GRAM method. In the parallel DTLD method, the tensor subsets and their decomposition steps are unified into CUDA stream processes, so multiple tensor subsets can be processed under the multi-stream concurrent CUDA computing framework on multiple GPUs, because there is no dependency between tensor subsets. Once these tasks are completed, the devices or hosts of all nodes synchronously exchange the results of the tensor-subset factors. However, the high complexity of DTLD makes initializing the factor subsets of the whole large-scale tensor very difficult; based on H-PARAFAC, the factor subsets are therefore initialized by sampling, and the averaged tensor-subset factors are the input of the m-ALS method. In this way the execution time for obtaining the global factors is minimized.
For symmetric data transmission, the implementation is as follows: in each iteration, to obtain the final full factors, m-ALS needs the P, Q, and U data of every mode; P and Q exist in tensor form, their dimensionality is consistent with that of the original tensor, both tensors are partitioned into tensor subsets, and their indexing is also consistent, i.e. $k_i = 1, \ldots, K_i$, $i = 1, \ldots, N$. The factor subsets correspond to specific modes, and P and Q are updated continuously. In the computation of the n-th mode, the intermediate tensors are partitioned into blocks according to the n-th index: if the n-th element of their index is identical, the tensor subsets (of P and Q) are assigned to the same block, and the data belonging to the same block are contiguous in the storage space; when the computation proceeds to the (n+1)-th mode, data that were stored contiguously will be assigned to different blocks.
In each iteration, the node transmits the intermediate data (P, Q, U) updated by the other devices to the local device (a node contains multiple devices; a device refers to a computing platform, and a node is the overall processing platform). Because the local device only processes certain specific factor subsets, the node needs to locate the corresponding blocks in the possibly multidimensional data and send them to the device side at the same time. When the node fetches a data block, the overhead is huge if the source data are poorly organized in the storage space of the host. When the device initializes the computed values along a certain mode, restructuring the intermediate tensors is extremely important so that the node can fetch all blocks from contiguous storage space, yet re-indexing or restructuring P and Q on the node is very hard work; its time complexity is O(n^3), and this step becomes the bottleneck.
In the hierarchical parallel factor analysis model H-PARAFAC, the device computes the indices of the tensor subsets as extended to the (n+1)-th mode, and when the n-th-mode data is completed the device copies these data directly to the specific location on the node; the fine-grained model lets the data be transmitted directly from the device to the node and guarantees that the intermediate data are stored contiguously in the host storage space.
As shown in Figure 4, for a three-dimensional tensor the transfer can be regarded as a 120° counter-clockwise rotation about its own diagonal; since the data are three-dimensional, the tensor needs to be rotated three times before returning to its initial state. Assuming the blocks above are stored contiguously and the data are partitioned vertically along the X axis, the next partition is along the Y axis, and one rotation guarantees that the resulting data blocks are contiguous.
This step can be described by the permutation theory in formulas (13) and (14), in which the element $u_{(i,j,k)}$ denotes a data cell of the tensors P and Q and (i, j, k) is the coordinate of the data cell; in formula (13), $p_1$ is one transfer, and after one one-to-one mapping the index becomes (j, k, i); formula (14) shows that the indices pass through three transfers:
$$p_1 = \begin{pmatrix} u_{(0,0,0)} & \cdots & u_{(i,j,k)} & \cdots & u_{(I,J,K)} \\ \downarrow & \cdots & \downarrow & \cdots & \downarrow \\ u_{(0,0,0)} & \cdots & u_{(j,k,i)} & \cdots & u_{(J,K,I)} \end{pmatrix} \quad (13);$$
$$u_{(i,j,k)} \xrightarrow{p_1} u_{(j,k,i)}; \quad u_{(j,k,i)} \xrightarrow{p_1} u_{(k,i,j)}; \quad u_{(k,i,j)} \xrightarrow{p_1} u_{(i,j,k)} \quad (14);$$
According to the coordinate transfer, the offset of a data cell in linear memory can be computed by the device; within one round along one mode, everything is computed in CUDA streams, the updated intermediate data can be copied directly, according to the offsets, to the designated addresses in the host-side storage space, and the new blocks are automatically stored in contiguous storage space to facilitate proceeding to the data of the next mode.
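The cyclic index permutation of formulas (13)-(14) corresponds to rotating the coordinates of every data cell; a minimal NumPy sketch is shown below, in which three applications of the transfer p1 return the tensor to its original layout and the linear (row-major) offset of a cell after the transfer is computed from its new coordinates. The helper names p1 and linear_offset are illustrative only.

```python
import numpy as np

def p1(block):
    """One symmetric transfer: cell u(i,j,k) moves to position (j,k,i), formula (13)."""
    # moving axis 0 to the end realizes (i,j,k) -> (j,k,i)
    return np.ascontiguousarray(np.moveaxis(block, 0, 2))

def linear_offset(coord, shape):
    """Row-major offset of a data cell in linear memory, as computed on the device."""
    return int(np.ravel_multi_index(coord, shape))

u = np.arange(2 * 3 * 4).reshape(2, 3, 4)
once = p1(u)
assert once[1, 2, 0] == u[0, 1, 2]            # u(i,j,k) -> u(j,k,i), formula (14)
assert np.array_equal(p1(p1(once)), u)        # three transfers return to the start
print(linear_offset((1, 2, 0), once.shape))   # offset of the moved cell in the new layout
```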
This example is evaluated in three ways:
(1) the running time for obtaining the full tensor factors with m-ALS, for data of different scales and tensor subsets of different sizes;
(2) the load of the different parts;
(3) the ratio between computing load and communication load.
Fig. 5 shows the execution time for obtaining all factors of these tensors under different splitting schemes. Panel (a) shows that when the data scale grows linearly the execution time also grows linearly, which indicates that the H-PARAFAC of the present invention is stable: with K = 6, when the size grows from 600 to 2400 the execution time goes from 300 ms to 677 ms, i.e. when the data scale increases eightfold in all three dimensions simultaneously the execution time increases by less than a factor of 2.9. Panel (b) plots the execution time against the granularity of the tensor subsets; evidently the execution time rises steadily as the tensor subsets shrink, in other words excessive splitting increases the load. The facts show that:
(1) each tensor subset should be kept as large as possible;
(2) when configuring a computing platform for H-PARAFAC, the prior consideration is the capability of the GPUs rather than the total computing power of the whole cluster.
Meanwhile, compared with the original PARAFAC method, for sizes 300 and 600 the original method takes 158650 ms and 4491736 ms respectively, and the load grows with n^3, where n is the data size; the method of the present invention needs only roughly 1/80 and 1/800 of that time, respectively. Grid PARAFAC (G-PARAFAC) can obtain similar results, but H-PARAFAC can support fast decomposition of data of size 2400^3 and larger.
It should be understood that the parts not elaborated in this specification belong to the prior art.
It should be understood that the above description of the preferred embodiment is relatively detailed and therefore should not be regarded as limiting the scope of patent protection of the present invention; those of ordinary skill in the art may, under the teaching of the present invention and without departing from the scope protected by the claims of the present invention, make substitutions or variations, all of which fall within the protection scope of the present invention; the requested protection scope of the present invention shall be subject to the appended claims.

Claims (6)

1. A GPU cluster-based multidimensional big data factorization method, characterized by comprising the following steps:
Step 1: build the PARAFAC model;
The PARAFAC model is given by the factor matrices A, B, C of the initialization model, with the weights represented by λ:
$$\chi = [\lambda; A, B, C] = \sum_{r=1}^{F} \lambda_r \, a_r \circ b_r \circ c_r \quad (1);$$
Step 2: build the grid PARAFAC model and m-ALS;
The grid PARAFAC model regards the tensor Y as a combination of tensor subsets $\bar{Y}_{(\bar{k})}$:
$$Y = \mathcal{I} \times_0 A^{(0)} \times_1 A^{(1)} \times_2 A^{(2)} \quad (2);$$
$$\bar{Y}_{(\bar{k})} \approx \bar{\mathcal{I}} \times_0 U_{(\bar{k})}^{(0)} \times_1 U_{(\bar{k})}^{(1)} \times_2 U_{(\bar{k})}^{(2)} \quad (3);$$
where $k_i = 1, \ldots, K_i$, $i = 1, \ldots, N$; $\times_n$ denotes the mode-$n$ product of a tensor; $A^{(n)}$ and $U^{(n)}_{(\bar{k})}$ are the $n$-th factors of the full tensor and of tensor subset $\bar{k}$, respectively; and $K_i$ is the number of subsets along the $i$-th mode;
The full factors $A^{(i)}$ of the tensor Y can be obtained from the factor subsets $U^{(i)}_{(\bar{k})}$ by m-ALS:
All tensor subsets are first decomposed by the traditional ALS method to obtain their factors:
$$[U_{(\bar{k})}^{(1)}, \ldots, U_{(\bar{k})}^{(N)}] = \mathrm{parafacALS}(\bar{Y}_{(\bar{k})}) \quad (5);$$
The intermediate variables P and Q are initialized from the factors of the tensor subsets:
$$P_{(\bar{k})} = (U_{(\bar{k})}^{(N)T} A_{(k_N)}^{(N)}) \circledast \cdots \circledast (U_{(\bar{k})}^{(1)T} A_{(k_1)}^{(1)}) \quad (6);$$
$$Q_{(\bar{k})} = (A_{(k_N)}^{(N)T} A_{(k_N)}^{(N)}) \circledast \cdots \circledast (A_{(k_1)}^{(1)T} A_{(k_1)}^{(1)}) \quad (7);$$
where $\circledast$ denotes the Hadamard product;
Step 3: build the hierarchical parallel factor analysis model H-PARAFAC;
The hierarchical parallel factor analysis model H-PARAFAC runs on a cluster composed of multiple nodes, each node containing a multi-core CPU and multiple GPUs; the cluster is a distributed shared-memory system (DSM), the nodes are connected via WiMAX and communicate with each other through MPI, and the coarse-grained model running on the cluster manages the computation procedure;
The hierarchical parallel factor analysis model H-PARAFAC contains multiple POSIX threads; each thread is assigned one GPU device and distributes data and computation tasks to the GPU in a multi-stream concurrent CUDA mode; H-PARAFAC synchronizes updated data across devices and across nodes; all computation tasks are executed on the GPUs by the fine-grained model; the hierarchical parallel factor analysis model H-PARAFAC decomposes the tensor subsets and realizes the m-ALS algorithm in a parallel fashion, and the parallel Hadamard computation mode allows the m-ALS algorithm to obtain all the factors of the tensor in parallel;
Step 4: build the coarse-grained model;
The coarse-grained model runs on the distributed shared-memory architecture; the hierarchical parallel factor analysis model H-PARAFAC is provided as a function transparent to the user, and this function hides the underlying message passing interface (MPI), the compute unified device architecture (CUDA) middleware and other hardware details; the coarse-grained model mainly covers the execution of the H-PARAFAC model it drives and the tensor partitioning method;
Step 5: build the fine-grained parallel model;
In the fine-grained parallel model, CUDA threads are responsible for evaluating the factors of the tensor subsets, mainly comprising the initialization of the factor subsets, symmetric data transmission, parallel Hadamard product computation, and intelligent partitioning.
2. The GPU cluster-based multidimensional big data factorization method according to claim 1, characterized in that the m-ALS described in step 2 is computed as follows: it minimizes the standard Euclidean distance as the gradient cost function and thus connects the factor subsets along the horizontal direction; in the m-ALS iteration, when the $i$-th mode is computed, the factors of the tensor subsets with the same $k_i$ are fused into one factor set; once all modes have been iterated over, the factor subsets are connected by the following formulas:
$$T = T + U_{(\bar{k})}^{(n)} P_{(\bar{k})}; \quad S = S + Q_{(\bar{k})} \quad (9);$$
$$A_{(k_n)}^{(n)} = T S^{-1} \quad (10);$$
where $1 \leq k_n \leq K_n$ and the operator in formula (8) denotes selection; after all factor subsets have been updated, the intermediate tensors P and Q are updated as well, for use in the next iteration:
$$P_{(\bar{k})} = P_{(\bar{k})} \circledast (U_{(\bar{k})}^{(n)T} A_{(k_n)}^{(n)}); \quad Q_{(\bar{k})} = Q_{(\bar{k})} \circledast (A_{(k_n)}^{(n)T} A_{(k_n)}^{(n)}) \quad (11).$$
3. The GPU cluster-based multidimensional big data factorization method according to claim 2, characterized in that the execution of the hierarchical parallel factor analysis model H-PARAFAC driven by the coarse-grained model described in step 4 comprises the following sub-steps:
(1) initialize the GPUs;
(2) evaluate the factors U of the tensor subsets according to formula (5);
(3) compute the intermediate tensors (P, Q) by the learning rules in formulas (6) and (7);
(4) compute the blocks of each mode, with the node-side threads controlling the devices:
1. combine the inter-block factors T, S by the learning rules of formulas (8) and (9);
2. update the factor subsets A by the learning rule of formula (10);
3. update the intermediate tensors (P, Q) by the learning rule of formula (11);
(5) retrieve the factor subsets A from the device side;
(6) update the factor subsets A and the intermediate tensors (P, Q) globally via WiMAX and MPI;
(7) check the stopping criterion, i.e. whether all factors A have been updated: if yes, end the procedure; otherwise return to sub-step (4).
4. The GPU cluster-based multidimensional big data factorization method according to claim 2, characterized in that the tensor partitioning method described in step 4 is implemented as follows:
The hierarchical parallel factor analysis model H-PARAFAC distributes the blocks evenly to the devices to balance the computing resources; each device only needs to obtain the data required by the current iteration, and the node passes the blocks to the devices in sequence; the number of blocks on the k-th device is computed as:
$$S_k = \left[ \frac{S - \sum_{i=0}^{k-1} S_i}{n - k} \right], \quad S_0 = \left[ \frac{S}{n} \right] \quad (12)$$
where k denotes the k-th device, $k = 1, \ldots, n-1$, S denotes the total number of blocks, and $S_k$ denotes the number of blocks on the k-th device; the numbers of blocks on the GPUs are equal or differ by one;
Each node holds intermediate data, and each node only needs to update part of the data; the remaining part is the data updated in the previous iteration.
5. The GPU cluster-based multidimensional big data factorization method according to claim 2, characterized in that the evaluation of the tensor-subset factors and the initialization of the factor subsets described in step 5 are implemented as follows: the initial values of the tensor-subset factors are obtained by DTLD, and the conventional PARAFAC driven by ALS then obtains the tensor-subset factors; DTLD avoids local optima in the factor initialization and is composed of the Tucker model and the GRAM method; in the parallel DTLD method, the tensor subsets and their decomposition steps are unified into CUDA stream processes, so that multiple tensor subsets can be processed under the multi-stream concurrent CUDA computing framework on multiple GPUs; once these tasks are completed, the devices or hosts of all nodes synchronously exchange the results of the tensor-subset factors.
6. The GPU cluster-based multidimensional big data factorization method according to claim 2, characterized in that the symmetric data transmission described in step 5 is implemented as follows:
In each iteration, to obtain the final full factors, m-ALS needs the P, Q, and U data of every mode; P and Q exist in tensor form, their dimensionality is consistent with that of the original tensor, both tensors are partitioned into tensor subsets, and their indexing is also consistent; the factor subsets correspond to specific modes, and P and Q are updated continuously; in the computation of the n-th mode the intermediate tensors are partitioned into blocks according to the n-th index, and when the computation proceeds to the (n+1)-th mode, data that were stored contiguously will be assigned to different blocks;
In each iteration, the node transmits the intermediate data (P, Q, U) updated by the other devices to the local device; because the local device only processes certain specific factor subsets, the node needs to locate the corresponding blocks in the possibly multidimensional data and send them to the device side at the same time;
In the hierarchical parallel factor analysis model H-PARAFAC, the device computes the indices of the tensor subsets as extended to the (n+1)-th mode, and when the n-th-mode data is completed the device copies these data directly to the specific location on the node; this step can be described by the permutation theory in formulas (13) and (14), in which the element $u_{(i,j,k)}$ denotes a data cell of the tensors P and Q and (i, j, k) is the coordinate of the data cell; in formula (13), $p_1$ is one transfer, and after one one-to-one mapping the index becomes (j, k, i); formula (14) shows that the indices pass through three transfers:
$$p_1 = \begin{pmatrix} u_{(0,0,0)} & \cdots & u_{(i,j,k)} & \cdots & u_{(I,J,K)} \\ \downarrow & \cdots & \downarrow & \cdots & \downarrow \\ u_{(0,0,0)} & \cdots & u_{(j,k,i)} & \cdots & u_{(J,K,I)} \end{pmatrix} \quad (13);$$
$$u_{(i,j,k)} \xrightarrow{p_1} u_{(j,k,i)}; \quad u_{(j,k,i)} \xrightarrow{p_1} u_{(k,i,j)}; \quad u_{(k,i,j)} \xrightarrow{p_1} u_{(i,j,k)} \quad (14);$$
According to the coordinate transfer, the offset of a data cell in linear memory can be computed by the device; within one round along one mode, everything is computed in CUDA streams, the updated intermediate data can be copied directly, according to the offsets, to the designated addresses in the host-side storage space, and the new blocks are automatically stored in contiguous storage space to facilitate proceeding to the data of the next mode.
CN201510708583.8A 2015-10-27 2015-10-27 GPU cluster-based multidimensional big data factorization method Pending CN105260554A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510708583.8A CN105260554A (en) 2015-10-27 2015-10-27 GPU cluster-based multidimensional big data factorization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510708583.8A CN105260554A (en) 2015-10-27 2015-10-27 GPU cluster-based multidimensional big data factorization method

Publications (1)

Publication Number Publication Date
CN105260554A true CN105260554A (en) 2016-01-20

Family

ID=55100243

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510708583.8A Pending CN105260554A (en) 2015-10-27 2015-10-27 GPU cluster-based multidimensional big data factorization method

Country Status (1)

Country Link
CN (1) CN105260554A (en)


Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229966A (en) * 2016-03-25 2017-10-03 阿里巴巴集团控股有限公司 A kind of model data update method, apparatus and system
CN105931256A (en) * 2016-06-03 2016-09-07 中国地质大学(武汉) CUDA (compute unified device architecture)-based large-format remote sensing image fast segmentation method
CN107801149A (en) * 2017-08-25 2018-03-13 长江大学 The multipath parameter evaluation method that real value parallel factor decomposes
CN107801149B (en) * 2017-08-25 2020-02-18 长江大学 Multipath parameter estimation method for real value parallel factorization
CN111033500B (en) * 2017-09-13 2023-07-11 赫尔实验室有限公司 Systems, methods, and media for sensor data fusion and reconstruction
CN108170639B (en) * 2017-12-26 2021-08-17 云南大学 Tensor CP decomposition implementation method based on distributed environment
CN108170639A (en) * 2017-12-26 2018-06-15 云南大学 Tensor CP based on distributed environment decomposes implementation method
CN111819579B (en) * 2018-08-03 2022-02-08 谷歌有限责任公司 Method, system, and medium for distributed tensor computation across computing devices
CN111819579A (en) * 2018-08-03 2020-10-23 谷歌有限责任公司 Distribution tensor calculation across computing devices
WO2022100345A1 (en) * 2020-11-13 2022-05-19 中科寒武纪科技股份有限公司 Processing method, processing apparatus, and related product
WO2022151950A1 (en) * 2021-01-13 2022-07-21 华为技术有限公司 Tensor processing method, apparatus and device, and computer readable storage medium
CN112799852B (en) * 2021-04-12 2021-07-30 北京一流科技有限公司 Multi-dimensional SBP distributed signature decision system and method for logic node
CN112799852A (en) * 2021-04-12 2021-05-14 北京一流科技有限公司 Multi-dimensional SBP distributed signature decision system and method for logic node
CN116186522A (en) * 2023-04-04 2023-05-30 石家庄学院 Big data core feature extraction method, electronic equipment and storage medium
CN116186522B (en) * 2023-04-04 2023-07-18 石家庄学院 Big data core feature extraction method, electronic equipment and storage medium


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160120

RJ01 Rejection of invention patent application after publication