CN105260554A - GPU cluster-based multidimensional big data factorization method - Google Patents
- Publication number: CN105260554A (application CN201510708583.8A)
- Authority: CN (China)
- Prior art keywords: tensor, factor, subset, data
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses a GPU cluster-based factorization method for multidimensional big data. It aims to solve the problem that the conventional grid parallel factor analysis model cannot handle large-scale, high-dimensional multiway data analysis, and provides an effective graphics-processing-unit-based multi-mode decomposition method for multidimensional big data, namely a hierarchical parallel factor analysis framework. Building on the conventional grid parallel factor analysis model, the framework comprises a coarse-grained process that integrates the tensor subsets and a fine-grained process that computes every tensor subset and fuses the factor subsets, and it runs on a cluster formed by multiple nodes, each containing several graphics processing units. Tensor decomposition on GPU devices fully exploits their powerful parallel computing capability and the parallelism inherent in tensor decomposition. Experimental results show that the method greatly shortens the execution time needed to obtain the tensor factors, improves the large-scale data processing capability, and effectively relieves the shortage of computing resources.
Description
Technical field
The invention belongs to the field of signal analysis technology and relates to a multidimensional big-data analysis method, in particular to an efficient GPU cluster-based factorization method for multidimensional big data.
Background technology
In complex data-analysis applications, reflecting the dynamics of a large-scale tensor during decomposition without introducing severe data deformation faces ever-growing challenges as data scale and dimensionality keep increasing. Extracting useful information from multiway data, for example for feature extraction and dimensionality reduction, is increasingly important in today's science and engineering. Two-dimensional decomposition methods such as singular value decomposition (SVD), principal component analysis (PCA), and independent component analysis (ICA) lose the correspondence between different dimensions when applied directly to high-dimensional data. In contrast, parallel factor analysis (PARAFAC), canonical polyadic decomposition (CPD), and the Tucker model are better suited to decomposing three-way or higher-order data and can be solved by the ALS method. PARAFAC is easier to interpret than Tucker, avoids the rotation-freedom problem typical of two-dimensional approaches, and at the same time guarantees that the solution is unique.
To improve tensor decomposition of high-dimensional data, researchers have explored two directions: optimizing the factor-computation process, and accelerating the decomposition with high-performance computing devices. The most notable work: the enhanced line search (ELS) proposed by Rajih [document 1] accelerates convergence, its basic idea being to find an optimal relaxation factor. [Documents 2, 3] compute Hadamard products of small matrices instead of multiplying large matrices by Khatri-Rao products, and, built on a parallel framework, the method can process large-scale data.
PARAFAC can analyze data of arbitrary size and dimensionality, but its computational complexity is high and it places heavy demands on computer performance; therefore the whole data set is split with a sliding window and PARAFAC is applied to the data along a certain dimension one window at a time. This laid the foundation of dynamic tensor analysis for big data, but under this theory the direct fusion of data loses part of the correlation between the data, and the resulting causal factor combination hardly reflects the dynamics of the raw data.
To address the problem of large-scale data, the mathematical theory of PARAFAC has seen some innovation: a large data set is regarded as a grid of small data sets, i.e. grid PARAFAC [document 4]. The method converts the decomposition of a tensor into decompositions of independent tensor subsets, and fusing the outputs of the tensor-subset results yields all factors of the tensor. This approach is effective, but it faces two major problems: insufficient computing resources, and divergence of results caused by splitting the tensor into subsets.
[Document 1] M. Rajih, P. Comon, and R. A. Harshman, "Enhanced line search: A novel method to accelerate PARAFAC," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 3, pp. 1128–1147, 2008.
[Document 2] A. H. Phan and A. Cichocki, "Advances in PARAFAC using parallel block decomposition," in Neural Information Processing. Springer Berlin Heidelberg, 2009, pp. 323–330.
[Document 3] A. Huy Phan and A. Cichocki, "PARAFAC algorithms for large-scale problems," Neurocomputing, vol. 74, no. 11, pp. 1970–1984, 2011.
[Document 4] R. A. Harshman and M. E. Lundy, "PARAFAC: Parallel factor analysis," Computational Statistics & Data Analysis, vol. 18, no. 1, pp. 39–72, 1994.
Summary of the invention
To remedy the shortcomings of existing methods, the present invention proposes an efficient GPU cluster-based factorization method for multidimensional big data, namely the H-PARAFAC framework. Based on grid PARAFAC, this framework comprises a coarse-grained process that integrates the tensor subsets and a fine-grained process that computes every tensor subset and fuses the factor subsets in parallel. The framework runs on a cluster composed of multiple nodes, each containing several GPUs. Tensor decomposition on GPU devices fully exploits their powerful parallel computing capability and parallel resources, greatly reduces the execution time needed to obtain the tensor factors, improves the large-scale data processing capability, and effectively solves problems such as insufficient computing resources.
The technical solution adopted by the present invention is a GPU cluster-based factorization method for multidimensional big data, characterized by comprising the following steps:
Step 1: build the PARAFAC model;
The PARAFAC model is given, from the initialization model, by the factor matrices A, B, C and the weight vector λ; the formula is as follows:

Y ≈ Σ_{r=1..R} λ_r · (a_r ∘ b_r ∘ c_r)   (1)

where ∘ denotes the vector outer product and a_r, b_r, c_r are the r-th columns of A, B, C.
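As a concrete illustration, the PARAFAC reconstruction of formula (1) can be sketched in NumPy (a hedged example: the shapes and rank here are arbitrary, not taken from the patent, and `parafac_reconstruct` is an illustrative name):

```python
import numpy as np

def parafac_reconstruct(lmbda, A, B, C):
    """Rebuild a 3-way tensor from PARAFAC factors:
    Y[i, j, k] = sum_r lmbda[r] * A[i, r] * B[j, r] * C[k, r]."""
    return np.einsum('r,ir,jr,kr->ijk', lmbda, A, B, C)

rng = np.random.default_rng(0)
R = 3                                   # illustrative rank
A, B, C = rng.random((5, R)), rng.random((6, R)), rng.random((7, R))
lmbda = np.ones(R)
Y = parafac_reconstruct(lmbda, A, B, C)
print(Y.shape)  # (5, 6, 7)
```

The `einsum` subscripts spell out exactly the sum over rank-one outer products that formula (1) describes.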
Step 2: build grid PARAFAC model and m-ALS;
The grid PARAFAC model regards the tensor Y as a combination of tensor subsets Y_{k_1…k_N}:

Y = I ×_0 A^(0) ×_1 A^(1) ×_2 A^(2)   (2)

where k_i = 1, …, K_i and i = 1, …, N; ×_n denotes the mode-n product of a tensor; A^(n) and U^(n) are the n-th factors of the N-way tensor; and K_i is the number of tensor subsets along the i-th dimension.
The full factors A^(i) of the tensor Y can be obtained by m-ALS from the factor subsets: all tensor subsets are first decomposed by the traditional ALS method to obtain their factors, and the intermediate variables P and Q are initialized from the factors of the tensor subsets, where ⊛ denotes the Hadamard product.
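One ALS factor update, with the Gram matrix formed as a Hadamard product of small matrices (the trick the background section attributes to [documents 2, 3]), can be sketched as follows (a minimal NumPy sketch for a dense 3-way tensor; the function names are illustrative, not from the patent):

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker (Khatri-Rao) product of U (J x R) and V (K x R)."""
    r = U.shape[1]
    return (U[:, None, :] * V[None, :, :]).reshape(-1, r)

def als_update_mode0(Y, B, C):
    """One ALS update of the mode-0 factor. The Gram matrix is the
    Hadamard product (B'B) * (C'C) of small matrices, avoiding an
    explicit Gram of the large Khatri-Rao matrix."""
    Y1 = Y.reshape(Y.shape[0], -1)        # mode-0 unfolding (row-major)
    G = (B.T @ B) * (C.T @ C)             # Hadamard product of Grams
    return Y1 @ khatri_rao(B, C) @ np.linalg.pinv(G)

rng = np.random.default_rng(1)
A, B, C = rng.random((4, 2)), rng.random((5, 2)), rng.random((6, 2))
Y = np.einsum('ir,jr,kr->ijk', A, B, C)   # exact rank-2 tensor
A_est = als_update_mode0(Y, B, C)
print(np.allclose(A_est, A))  # True
```

Because the test tensor is exactly rank 2 and B, C are the true factors, a single update recovers A; in the iterative algorithm this step cycles over all modes.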
Step 3: build the hierarchical parallel factor analysis model H-PARAFAC;
The hierarchical parallel factor analysis model H-PARAFAC runs on a cluster composed of multiple nodes, each containing a multi-core CPU and several GPUs. The cluster is a distributed shared-memory (DSM) system; the nodes are interconnected by InfiniBand and communicate with each other through MPI, and the coarse-grained model running on this cluster manages the computation.
The hierarchical parallel factor analysis model H-PARAFAC contains multiple POSIX threads; each thread is assigned one GPU device and distributes data and computing tasks to the GPUs in a concurrent CUDA mode, and H-PARAFAC synchronizes the updated data across devices and nodes. All computing tasks are executed on the GPUs by the fine-grained model; H-PARAFAC decomposes the tensor subsets and implements the m-ALS algorithm in parallel, and the parallel Hadamard computation scheme allows m-ALS to obtain all factors in parallel.
Step 4: build the coarse-grained model;
The coarse-grained model runs on the distributed shared-memory architecture. The hierarchical parallel factor analysis model H-PARAFAC is provided as a function transparent to the user; this function hides the underlying message passing interface (MPI), the CUDA general-purpose parallel computing architecture middleware, and the other hardware. This step mainly comprises the coarse-grained-model-driven execution of H-PARAFAC and the tensor partitioning method.
Step 5: build the fine-grained parallel model;
In the fine-grained parallel model, CUDA threads are responsible for evaluating the tensor-subset factors, mainly comprising the initialization of the factor subsets, the symmetric data transmission, the parallel Hadamard product computation, and the intelligent splitting.
Preferably, the specific computation of the m-ALS described in step 2 minimizes the standard Euclidean-distance cost function by gradient so as to concatenate along the horizontal direction. In the m-ALS iterative process, when the i-th dimension is computed, the factors of the tensor subsets become available and the factor sets of tensor subsets with the same index are synthesized. After all dimensions have been iterated once, the factor subsets are concatenated by the following formula, where 1 ≤ k_n ≤ K_n and the bracket operator denotes selection. After all factor subsets are updated, the intermediate tensors P and Q are also updated for the next iteration:
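The concatenation of factor subsets after a full sweep can be sketched as follows (a hedged illustration: in the actual method the per-block factors come from m-ALS with scale alignment, while here they are dummy matrices, and `fuse_factor_subsets` is an assumed name):

```python
import numpy as np

def fuse_factor_subsets(subsets):
    """Stack the per-block factor matrices A_k (k = 1..K_n) vertically to
    form the full mode-n factor, mirroring the concatenation step that
    runs after every dimension has been iterated once."""
    return np.vstack(subsets)

# four dummy factor subsets, each covering 2 rows of the mode-n factor
parts = [np.full((2, 3), float(k)) for k in range(4)]
A_full = fuse_factor_subsets(parts)
print(A_full.shape)  # (8, 3)
```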
Preferably, the coarse-grained-model-driven execution of the hierarchical parallel factor analysis model H-PARAFAC described in step 4 comprises the following sub-steps:
(1) initialize the GPUs;
(2) evaluate the factors of the tensor subsets U in formula (5);
(3) compute the intermediate tensors (P, Q) by the learning rules in formulas (6) and (7);
(4) compute the blocks of each dimension, with the host-side threads controlling the devices:
1. combine the inter-block factors T and S by the learning rules of formulas (8) and (9);
2. update the factor subsets A by the learning rule of formula (10);
3. update the intermediate tensors (P, Q) by the learning rule of formula (11);
(5) fetch the factor subsets A from the device side;
(6) update the factor subsets A and the intermediate tensors (P, Q) globally via InfiniBand and MPI;
(7) check whether the stopping criterion is met and all factors A have been updated: if so, terminate the process; otherwise return to sub-step (4).
Preferably, the tensor partitioning method described in step 4 is implemented as follows:
The hierarchical parallel factor analysis model H-PARAFAC distributes blocks evenly over the devices to balance the computing resources; each device only needs to obtain the data required by the current iteration, and the node passes the blocks to the devices in sequence. The number of blocks on the k-th device is computed as follows, where k = 1, ..., n-1 indexes the device, S is the total number of blocks, and S_k is the number of blocks on the k-th device; the block counts of the GPUs are equal or differ by 1.
Each node holds intermediate data, and each node only needs to update part of the data; the remaining part is the data updated in the previous iteration.
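The balanced distribution rule (block counts equal or differing by 1) can be sketched as follows (an assumption-laden sketch, since the patent's exact formula for S_k is not preserved in the text; `blocks_per_device` is an illustrative name):

```python
def blocks_per_device(S, n):
    """Distribute S blocks over n devices so that per-device counts
    are equal or differ by at most 1."""
    base, extra = divmod(S, n)
    return [base + 1 if k < extra else base for k in range(n)]

counts = blocks_per_device(10, 6)
print(counts)  # [2, 2, 2, 2, 1, 1]
```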
Preferably, the evaluation of the tensor-subset factors and the initialization of the factor subsets described in step 5 are implemented as follows: the initial values of the tensor-subset factors are obtained by DTLD, and conventional PARAFAC driven by ALS then obtains the tensor-subset factors; DTLD avoids local optima in the factor initialization and is composed of the Tucker model and the GRAM method. In the parallel DTLD method, the tensor subsets and their decomposition steps are unified into CUDA-stream processes, so multiple tensor subsets can be handled under a concurrent CUDA computing framework spanning multiple GPUs. Once these tasks complete, the devices or hosts of all nodes synchronously exchange the results of the tensor-subset factors.
Preferably, the symmetric data transmission described in step 5 is implemented as follows:
In each m-ALS iteration toward the final full factors, the P, Q, U data of each dimension are needed. P and Q exist in tensor form with the same dimensionality as the original tensor; both are divided into tensor subsets and labeled consistently. The factor subsets correspond to specific dimensions, and P and Q are continually updated. In the computation of the n-th dimension, the intermediate tensors are partitioned into blocks by index; when the computation proceeds to the (n+1)-th dimension, contiguously stored data will be assigned to different blocks.
In each iteration, the node transmits the intermediate data (P, Q, U) updated by other devices to the current device (a node contains multiple devices; a device is a computer processing platform, while a node is the overall processing platform). Since the current device processes only certain factor subsets, the node must locate the corresponding blocks in the possibly multidimensional data and send them to the device side.
In the hierarchical parallel factor analysis model H-PARAFAC, the device computes the labels of the tensor subsets extended to the (n+1)-th dimension; when the n-th-dimension computation completes, the device copies these data directly to the designated locations of the node. This step can be described by the permutation theory in formulas (13) and (14), where the element u_(i,j,k) denotes a data cell of tensors P and Q and (i, j, k) is its coordinate. In formula (13), P1 is transmitted once: after a one-to-one mapping the label becomes (j, k, i); formula (14) shows that the index has undergone three transmissions.
According to the coordinate transmission, the offset of a data cell in linear memory can be computed by the device. Within one pass over a dimension, all P and Q subtensors are computed in CUDA streams; the updated intermediate data can be copied directly by offset to the designated addresses of the host-side storage space, and new blocks are automatically stored to contiguous storage space to facilitate proceeding to the next dimension.
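The label rotation of formulas (13) and (14) amounts to a cyclic axis permutation, which can be sketched with NumPy (hedged: `np.transpose` stands in for the device-side offset computation, and `rotate_labels` is an illustrative name):

```python
import numpy as np

def rotate_labels(T):
    """Relabel each cell (i, j, k) to (j, k, i): new[j, k, i] = old[i, j, k].
    Three such rotations restore the original layout, matching the
    three-transmission statement of formula (14)."""
    return np.transpose(T, (1, 2, 0))

T = np.arange(24).reshape(2, 3, 4)
once = rotate_labels(T)
back = rotate_labels(rotate_labels(once))   # three rotations in total
print(once.shape, np.array_equal(back, T))  # (3, 4, 2) True
```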
Compared with existing PARAFAC-based multidimensional data analysis methods, the present invention has the following advantages and beneficial effects:
(1) the present invention proposes a parallel computing framework, H-PARAFAC, that enables the algorithm to handle large-scale, high-dimensional data analysis problems quickly;
(2) the proposed hierarchical parallel computing framework comprises a coarse-grained model that integrates the tensor subsets and a fine-grained computation model that computes the tensor subsets and fuses their factors;
(3) the proposed H-PARAFAC decomposes the tensor subsets in parallel with multiple GPU devices on a hybrid cluster computing platform.
Brief description of the drawings
Figure 1: schematic of the grid PARAFAC model of the embodiment of the present invention;
Figure 2: schematic of the hierarchical parallel factor analysis model H-PARAFAC of the embodiment of the present invention;
Figure 3: flowchart of tensor decomposition under H-PARAFAC of the embodiment of the present invention;
Figure 4: symmetric data transmission diagram of the embodiment of the present invention;
Figure 5: execution time for obtaining all factors of the tensors under different splitting schemes of the embodiment of the present invention; plot (a): execution time versus data size; plot (b): execution time versus block size.
Embodiment
To help those of ordinary skill in the art understand and implement the present invention, it is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described here serve only to illustrate and explain the invention and are not intended to limit it.
Referring to Fig. 1: the present invention addresses the problem that the conventional grid parallel factor analysis model (grid PARAFAC) cannot handle large-scale, high-dimensional multiway data analysis, and proposes an effective GPU (graphics processing unit) cluster-based multi-mode decomposition method for multidimensional big data, namely the hierarchical parallel factor analysis (H-PARAFAC) framework. Based on grid PARAFAC, the framework comprises a coarse-grained process that integrates the tensor subsets and a fine-grained process that computes every tensor subset and fuses the factor subsets in parallel. The framework runs on a cluster composed of multiple nodes, each containing several GPUs. Tensor decomposition on GPU devices fully exploits their powerful parallel computing capability and parallel resources, greatly reduces the execution time needed to obtain the tensor factors, improves the large-scale data processing capability, and effectively solves problems such as insufficient computing resources.
The experimental results were evaluated on a hybrid computing cluster composed of two workstations, each configured with 4 NVIDIA Tesla C2050 graphics cards. The execution environment is configured as follows: the operating system is 64-bit Windows 2008 Enterprise; each machine contains two Intel Xeon E5620 2.40 GHz CPUs and 24 GB of RAM; compilation is done under MS Visual Studio 2010; the bus is PCI-E at 5.0 Gbps and the network provides 32 Gbps transmission bandwidth; the cluster contains 8 Tesla C2050 GPUs in total, each with 448 CUDA cores, a core frequency of 1.15 GHz, 2.5 GB of standard memory, and 144 GB/s of memory bandwidth.
The experimental data are three-way tensors sampled from basic sparse smooth signals such as rectified half-wave cosine and sine signals. The data have equal size in all three dimensions, with sizes 300, 600, 900, 1200, 1500, 1800, 2100, and 2400; each tensor is divided into a grid of K×K×K blocks, with K = 2, 3, 4, 5, 6, 8.
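The test-data construction and K×K×K gridding just described can be sketched as follows (a hedged reconstruction: the exact signals of the experiments are not specified beyond "rectified half-wave sine/cosine", and the frequency and helper names here are assumptions):

```python
import numpy as np

def split_grid(Y, K):
    """Split a cubic 3-way tensor into a K x K x K grid of equal subtensors."""
    n = Y.shape[0] // K
    return {(i, j, k): Y[i*n:(i+1)*n, j*n:(j+1)*n, k*n:(k+1)*n]
            for i in range(K) for j in range(K) for k in range(K)}

t = np.linspace(0.0, 1.0, 300)
sig = np.maximum(np.sin(8 * np.pi * t), 0.0)   # rectified half-wave sine
Y = np.einsum('i,j,k->ijk', sig, sig, sig)     # sparse, smooth 3-way tensor
blocks = split_grid(Y, 2)
print(len(blocks), blocks[(1, 1, 1)].shape)  # 8 (150, 150, 150)
```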
The GPU cluster-based factorization method for multidimensional big data provided by the invention comprises the following steps:
Step 1: build the PARAFAC model;
The PARAFAC model is given, from the initialization model, by the factor matrices A, B, C and the weight vector λ; the formula is as follows:

Y ≈ Σ_{r=1..R} λ_r · (a_r ∘ b_r ∘ c_r)   (1)

where ∘ denotes the vector outer product and a_r, b_r, c_r are the r-th columns of A, B, C.
Step 2: build grid PARAFAC model and m-ALS;
The grid PARAFAC model regards the tensor Y as a combination of tensor subsets Y_{k_1…k_N}:

Y = I ×_0 A^(0) ×_1 A^(1) ×_2 A^(2)   (2)

where k_i = 1, …, K_i and i = 1, …, N; ×_n denotes the mode-n product of a tensor; A^(n) and U^(n) are the n-th factors of the N-way tensor; and K_i is the number of tensor subsets along the i-th dimension.
The full factors A^(i) of the tensor Y can be obtained by m-ALS from the factor subsets: all tensor subsets are first decomposed by the traditional ALS method to obtain their factors, and the intermediate variables P and Q are initialized from the factors of the tensor subsets, where ⊛ denotes the Hadamard product.
The specific computation of m-ALS is given in formulas (8)–(11); it minimizes the standard Euclidean-distance cost function by gradient so as to concatenate along the horizontal direction. As shown in Fig. 1, matrix dimensions with the same coordinate are assigned to the same block. In the m-ALS iterative process, taking three-way data as an example, the tensor is split along three directions; when the i-th dimension is computed, the factors of the tensor subsets become available and the factor sets of tensor subsets with the same index are synthesized. After all dimensions have been iterated once, the factor subsets are concatenated by formulas (8)–(11): in formula (8) the bracket operator denotes selection, and the matrices T and S of formula (9) are used to compute the final factor subsets in formula (10). After all factor subsets are updated (k_n from 1 to K_n), the intermediate tensors P and Q are also updated, as in formula (11), for the next iteration.
Step 3: build the hierarchical parallel factor analysis model H-PARAFAC;
Referring to Fig. 2: the hierarchical parallel factor analysis model H-PARAFAC runs on a cluster composed of multiple nodes, each containing a multi-core CPU and several GPUs. The cluster is a distributed shared-memory (DSM) system; the nodes are interconnected by InfiniBand and communicate with each other through MPI, and the coarse-grained model running on this cluster manages the computation.
The hierarchical parallel factor analysis model H-PARAFAC contains multiple POSIX threads; each thread is assigned one GPU device and distributes data and computing tasks to the GPUs in a concurrent CUDA mode, and H-PARAFAC synchronizes the updated data across nodes and devices. All computing tasks are executed on the GPUs by the fine-grained model; H-PARAFAC decomposes the tensor subsets and implements the m-ALS algorithm in parallel, and the parallel Hadamard computation scheme allows m-ALS to obtain all factors in parallel.
Step 4: build the coarse-grained model;
The coarse-grained model runs on the distributed shared-memory architecture. The hierarchical parallel factor analysis model H-PARAFAC is provided as a function transparent to the user; this function hides the underlying message passing interface (MPI), the CUDA general-purpose parallel computing architecture middleware, and the other hardware. This step mainly comprises the coarse-grained-model-driven execution of H-PARAFAC and the tensor partitioning method.
The coarse-grained-model-driven execution of the hierarchical parallel factor analysis model H-PARAFAC comprises the following sub-steps:
(1) initialize the GPUs;
(2) evaluate the factors of the tensor subsets U in formula (5);
(3) compute the intermediate tensors (P, Q) by the learning rules in formulas (6) and (7);
(4) compute the blocks of each dimension, with the host-side threads controlling the devices:
1. combine the inter-block factors T and S by the learning rules of formulas (8) and (9);
2. update the factor subsets A by the learning rule of formula (10);
3. update the intermediate tensors (P, Q) by the learning rule of formula (11);
(5) fetch the factor subsets A from the device side;
(6) update the factor subsets A and the intermediate tensors (P, Q) globally via InfiniBand and MPI;
(7) check whether the stopping criterion is met and all factors A have been updated: if so, terminate the process; otherwise return to sub-step (4).
The whole procedure has three layers of synchronization: 1. synchronization between nodes; 2. synchronization between host-side threads; 3. synchronization between CUDA threads. CUDA synchronization guarantees the correctness of the computation, while node synchronization and POSIX thread synchronization guarantee reliable data transmission. Communication between nodes is realized through aggregation and the transfer functions of MPI.
Because the computing resources of a GPU device are limited, the coarse-grained model uses CUDA streams to form concurrent kernels and adapt to data of arbitrary scale. Each CUDA stream processes one or more data blocks, and the number of CUDA streams depends on the data size and the memory capacity of the GPU. A CUDA thread represents a series of instructions run on the device in a specified order; within the same computing stream, a new kernel starts once the previous kernel completes, and the use of concurrent streams makes full use of the GPU's computing resources.
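The one-host-thread-per-GPU arrangement can be imitated in plain Python (purely illustrative: real H-PARAFAC binds CUDA contexts and launches streams, which this sketch replaces with a dummy per-block computation, and all names are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

def process_on_device(dev_id, blocks):
    """Stand-in for per-device work; a real implementation would bind a
    CUDA context here and launch one stream per block."""
    return dev_id, [b * b for b in blocks]     # dummy per-block computation

# each (hypothetical) GPU gets its block assignment, one host thread per GPU
assignments = {0: [0, 1, 2], 1: [3, 4], 2: [5, 6]}
with ThreadPoolExecutor(max_workers=len(assignments)) as pool:
    results = dict(pool.map(lambda kv: process_on_device(*kv),
                            assignments.items()))
print(results[1])  # [9, 16]
```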
The tensor partitioning method is implemented as follows:
The hierarchical parallel factor analysis model H-PARAFAC distributes blocks evenly over the devices to balance the computing resources; each device only needs to obtain the data required by the current iteration, and the node passes the blocks to the devices in sequence. The number of blocks on the k-th device is computed as follows, where k = 1, ..., n-1 indexes the device, S is the total number of blocks, and S_k is the number of blocks on the k-th device; the block counts of the GPUs are equal or differ by 1.
For example, suppose two hosts each connect three devices, and the full tensor is divided into 10 blocks along the first dimension with indices from 0 to 9; the blocks are dealt in sequence over host h and device d (k = h·3 + d). The tensor splitting scheme is as shown in table 1:
Table 1: tensor splitting scheme
Each node holds intermediate data such as the tensor-subset factors U, the factor subsets A, and the tensors P and Q; each node only needs to update part of the data, the remainder being the data updated in the previous iteration. As shown in table 1, device (2) of host (0) updates only its assigned subsets of the first dimension; host (0) only needs to obtain these subsets from its devices, and the final fusion yields the full factor A^(1) of the first dimension.
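The host/device layout of table 1 can be sketched as follows (hedged: the garbled index expression in the text is read as a sequential round-robin assignment, and `assign_blocks` is an assumed name):

```python
def assign_blocks(S, hosts, devs):
    """Deal S block indices in sequence over hosts*devs devices so that
    per-device counts are equal or differ by at most 1.
    Returns {(host, device): [block indices]}."""
    n = hosts * devs
    table = {(h, d): [] for h in range(hosts) for d in range(devs)}
    for b in range(S):
        k = b % n
        table[(k // devs, k % devs)].append(b)
    return table

layout = assign_blocks(10, 2, 3)   # 10 blocks over 2 hosts x 3 devices
print(layout[(0, 0)], layout[(1, 2)])  # [0, 6] [5]
```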
The coarse-grained model of this embodiment is equivalent to managing the computation, while the fine-grained model is responsible for the concrete computation. The fine-grained model of this embodiment covers four parts: factor initialization, data transmission under a specific storage layout, the parallel Hadamard product computation, and intelligent splitting. The Hadamard product is realized by the fine-grained model, but since it is computed in parallel, the process is managed by the coarse-grained model like any other parallel work; the Hadamard product itself is a conventional computation, introduced concretely in step 2 where the data processing is described. The intelligent splitting scheme likewise belongs to the concrete execution of the fine-grained model; the splitting itself is simple blocking and is not the emphasis — the key is the integration of the block data, i.e. data management, which this embodiment discusses within the coarse-grained model as the tensor partitioning part of step 4.
Step 5: build the fine-grained parallel model;
In the fine-grained parallel model, CUDA threads are responsible for evaluating the tensor-subset factors, mainly comprising the initialization of the factor subsets, the symmetric data transmission, the parallel Hadamard product computation, and the intelligent splitting;
The evaluation of the tensor-subset factors and the initialization of the factor subsets are implemented as follows: the m-ALS method needs to estimate the factors of all tensor subsets and takes the obtained factor subsets as its input. This step obtains initial values of the tensor-subset factors by DTLD and then drives conventional PARAFAC by ALS to obtain the tensor-subset factors; DTLD avoids local optima in the factor initialization and is composed of the Tucker model and the GRAM method. In the parallel DTLD method, the tensor subsets and their decomposition steps are unified into CUDA-stream processes, and multiple tensor subsets can be handled under a concurrent CUDA computing framework spanning multiple GPUs, because there is no dependency between the tensor subsets. Once these tasks complete, the devices or hosts of all nodes synchronously exchange the results of the tensor-subset factors. However, the high complexity of DTLD makes initializing the factor subsets of the whole large-scale tensor very difficult; based on H-PARAFAC, the factor subsets are therefore initialized by sampling, and the averaged tensor-subset factors serve as the input of the m-ALS method. This minimizes the execution time for obtaining the global factors.
The symmetric data transmission is implemented as follows: in each m-ALS iteration toward the final full factors, the P, Q, U data of each dimension are needed. P and Q exist in tensor form with the same dimensionality as the original tensor; both are divided into tensor subsets and labeled consistently, with k_i = 1, ..., K_i and i = 1, ..., N. The factor subsets correspond to specific dimensions, and P and Q are continually updated. In the computation of the n-th dimension, the intermediate tensors are partitioned such that subtensors of P and Q whose n-th index coincides are assigned to the same block, and data belonging to the same block are contiguous in storage; when the computation proceeds to the (n+1)-th dimension, contiguously stored data will be assigned to different blocks.
In each iteration, the node transmits the intermediate data (P, Q, U) updated by other devices to the current device (a node contains multiple devices; a device is a computer processing platform, while a node is the overall processing platform). Since the current device processes only certain factor subsets, the node must locate the corresponding blocks in the possibly multidimensional data and send them to the device side. When a node fetches data blocks, the overhead is huge if the source data are poorly organized in the host's storage space. When values are initialized on a device along a certain dimension, restructuring the intermediate tensors is extremely important so that the node can obtain all blocks from contiguous storage; relabeling or restructuring P and Q at the node is very heavy work, with time complexity O(n^3), and this becomes the bottleneck.
In the hierarchical parallel factor analysis model H-PARAFAC, the device computes the labels of the tensor subsets extended to the (n+1)-th dimension; when the n-th-dimension computation completes, the device copies these data directly to the designated locations of the node. The fine-grained model lets data travel directly from device to node and guarantees that the intermediate data are stored contiguously in the host storage space.
As shown in Fig. 4, for a three-way tensor the transmission can be viewed as a 120° counterclockwise rotation about its main diagonal. Since the data are three-dimensional, the tensor needs to rotate 3 times to return to its initial state. Assuming the previous blocks are stored contiguously and the data are partitioned vertically along the X axis, the next partition is along the Y axis; one rotation guarantees that the resulting data blocks are contiguous.
This step can be described by the permutation theory in formulas (13) and (14), where the element u_(i,j,k) denotes a data cell of tensors P and Q and (i, j, k) is its coordinate. In formula (13), P1 is transmitted once: after a one-to-one mapping the label becomes (j, k, i); formula (14) shows that the index has undergone three transmissions.
According to the coordinate transmission, the offset of a data cell in linear memory can be computed by the device. Within one pass over a dimension, all P and Q subtensors are computed in CUDA streams; the updated intermediate data can be copied directly by offset to the designated addresses of the host-side storage space, and new blocks are automatically stored to contiguous storage space to facilitate proceeding to the next dimension.
Three aspects were evaluated in this example:
(1) the running time for obtaining the full tensor factors with m-ALS, for data of different scales and tensor subsets of different sizes;
(2) the load of the different parts;
(3) the ratio of computing power to communication capacity.
Fig. 5 shows the execution time for obtaining all tensor factors under different splitting schemes. Panel (a) shows that when the data scale grows linearly, the execution time also grows linearly, which indicates that the H-PARAFAC of the present invention is stable: with K=6, when the scale grows from 600 to 2400 the execution time changes from 300 ms to 677 ms, and when the data scale increases 8-fold across the three dimensions simultaneously, the execution time increases by less than a factor of 2.9. Panel (b) examines the execution time from the granularity of the tensor subsets: clearly, the execution time rises steadily as the tensor subsets shrink; in other words, excessive splitting increases the load. The facts show that:
(1) each tensor subset should be kept as large as possible;
(2) when configuring a computing platform for H-PARAFAC, the capability of the GPUs is the decisive prior factor, not the total computing power of the whole cluster.
Meanwhile, the method of the present invention is compared with the original PARAFAC method: when the tensor subset size equals 300 and 600, the original method takes 158650 ms and 4491736 ms respectively, its load growing with n³, where n is the data size; H-PARAFAC is faster by factors of 80 and 800 respectively. G-PARAFAC obtains similar results, but H-PARAFAC can also support the fast decomposition of data of size 2400³ and larger.
It should be understood that the parts not elaborated in this specification belong to the prior art.
It should be understood that the above description of the preferred embodiment is relatively detailed and therefore should not be regarded as limiting the scope of patent protection of the present invention; under the inspiration of the present invention, a person of ordinary skill in the art may also make substitutions or variations without departing from the scope protected by the claims of the present invention, all of which fall within the protection scope of the present invention; the requested protection scope of the present invention shall be determined by the appended claims.
Claims (6)
1. A GPU cluster-based multidimensional big data factorization method, characterized by comprising the following steps:
Step 1: build the PARAFAC model;
The PARAFAC model is given, based on an initialization model, by the matrices A, B, C, with the tensor weights represented by λ; the formula is as follows:
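The body of formula (1) is not reproduced in this text. For reference, the standard three-way PARAFAC model with factor matrices A, B, C and weight vector λ is conventionally written as follows (this is the textbook form, not necessarily the exact notation of the original patent):

```latex
\bar{Y} \approx \sum_{r=1}^{R} \lambda_r \, a_r \circ b_r \circ c_r ,
\qquad
y_{ijk} \approx \sum_{r=1}^{R} \lambda_r \, a_{ir}\, b_{jr}\, c_{kr}
```

Here ∘ denotes the vector outer product and R is the decomposition rank.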
Step 2: build grid PARAFAC model and m-ALS;
The grid PARAFAC model regards the tensor subsets as a combination of the tensor Ȳ:
Ȳ = Ī ×_0 A^(0) ×_1 A^(1) ×_2 A^(2)   (2);
wherein: k_i = 1, …, K_i, i = 1, …, N; ×_n denotes the mode-n product of a tensor; A^(n) and U^(n) are the n-th factors of the N-way tensor; and K_i is the number of tensor subsets along the i-th dimension.
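Formula (2) composes the tensor from mode-n products. A minimal NumPy sketch of the mode-n product, and of reconstructing Ȳ from a superdiagonal core Ī and three factor matrices, is given below (the dimensions and helper name are illustrative, not taken from the patent):

```python
import numpy as np

def mode_n_product(t, m, n):
    """Mode-n product t ×_n m: contract mode n of tensor t with the
    columns of matrix m (rows of m index the new mode)."""
    t = np.moveaxis(t, n, 0)
    shape = t.shape
    out = m @ t.reshape(shape[0], -1)
    return np.moveaxis(out.reshape((m.shape[0],) + shape[1:]), 0, n)

# Hypothetical small example: rank-2 superdiagonal core and three factors.
R = 2
core = np.zeros((R, R, R))
core[np.arange(R), np.arange(R), np.arange(R)] = 1.0
A0, A1, A2 = np.random.rand(4, R), np.random.rand(5, R), np.random.rand(6, R)
Y = mode_n_product(mode_n_product(mode_n_product(core, A0, 0), A1, 1), A2, 2)

# With a superdiagonal core this equals the CP/PARAFAC reconstruction:
Y_cp = np.einsum('ir,jr,kr->ijk', A0, A1, A2)
assert np.allclose(Y, Y_cp)
```

This shows why formula (2) with a superdiagonal identity core is equivalent to the sum-of-rank-one PARAFAC form.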
The total factor A^(i) of the tensor Ȳ can be obtained by m-ALS from the factor subsets:
all tensor subsets are first decomposed by the conventional ALS method to obtain their factors; the intermediate variables P and Q are initialized from the factors of the tensor subsets:
wherein ⊛ denotes the Hadamard product;
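The role of the intermediates P and Q can be illustrated with the standard CP-ALS learning rule, in which Q is the Hadamard product of the factor Gram matrices and P is the matricized-tensor-times-Khatri-Rao product. The sketch below is written under that assumption; the patent's exact update rules, formulas (6) and (7), are not reproduced here, and the function name is hypothetical:

```python
import numpy as np

def als_update_mode0(Y, A1, A2):
    """One ALS update of the mode-0 factor of a 3-way tensor Y."""
    # Khatri-Rao product of the other two factors (columnwise Kronecker).
    KR = np.einsum('jr,kr->jkr', A1, A2).reshape(-1, A1.shape[1])
    P = Y.reshape(Y.shape[0], -1) @ KR          # intermediate P (MTTKRP)
    Q = (A1.T @ A1) * (A2.T @ A2)               # intermediate Q: Hadamard of Grams
    return P @ np.linalg.pinv(Q)

# Recover the mode-0 factor of an exactly rank-2 tensor (synthetic data).
rng = np.random.default_rng(0)
A0, A1, A2 = rng.random((4, 2)), rng.random((5, 2)), rng.random((6, 2))
Y = np.einsum('ir,jr,kr->ijk', A0, A1, A2)
assert np.allclose(als_update_mode0(Y, A1, A2), A0)
```

Because Q is only R×R, the Hadamard formulation avoids ever forming the large Khatri-Rao Gram matrix, which is what makes the per-mode updates cheap to parallelize.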
Step 3: build the hierarchical parallel factor analysis model H-PARAFAC;
The hierarchical parallel factor analysis model H-PARAFAC runs on a cluster composed of multiple nodes, each node containing a multi-core CPU and multiple GPUs; the cluster is a distributed shared memory (DSM) system in which the nodes are interconnected by WiMAX and exchange feedback via MPI, and the coarse-grained model running on the cluster manages the computation procedure;
The hierarchical parallel factor analysis model H-PARAFAC contains multiple POSIX threads; each thread is assigned one GPU device and distributes data and computing tasks to the GPUs under a concurrent CUDA mode, and the model synchronizes updated data between devices and nodes; all computing tasks are executed on the GPUs under fine-grained control; the hierarchical parallel factor analysis model H-PARAFAC decomposes the tensor subsets in parallel to implement the m-ALS algorithm, and the parallel Hadamard computation mode allows m-ALS to obtain all tensor factors in parallel;
Step 4: build the coarse-grained model;
The coarse-grained model runs on the distributed shared memory architecture; the hierarchical parallel factor analysis model H-PARAFAC is implemented as a function transparent to the user, which hides the concrete underlying message passing interface (MPI), the compute unified device architecture (CUDA) middleware, and the other hardware; this step mainly comprises the execution of the coarse-grained-model-driven hierarchical parallel factor analysis model H-PARAFAC and the tensor partitioning method;
Step 5: build the fine-grained parallel model;
In the fine-grained parallel model, the CUDA threads are responsible for evaluating the factors of the tensor subsets, mainly comprising the initialization of the factor subsets, symmetric data transmission, parallel Hadamard product computation, and smart slicing.
2. The GPU cluster-based multidimensional big data factorization method according to claim 1, characterized in that the m-ALS described in step 2 is computed as follows: it minimizes the cost function, the standard Euclidean distance, by gradient descent, and concatenates the factors along the horizontal direction;
during the m-ALS iteration, when the i-th dimension is computed, the factors of the tensor subsets become available, and the factor subsets of the same dimension are synthesized;
when all dimensions have been iterated once, the factor subsets are concatenated by the following formula:
wherein 1 ≤ k_n ≤ K_n, and the bracket operator denotes selection; after all factor subsets are updated, the intermediate tensors P and Q are also updated for the next iteration:
3. The GPU cluster-based multidimensional big data factorization method according to claim 2, characterized in that the execution of the coarse-grained-model-driven hierarchical parallel factor analysis model H-PARAFAC described in step 4 comprises the following sub-steps:
(1) initialize the GPUs;
(2) evaluate the factors of the tensor subsets U in formula (5);
(3) compute the intermediate tensors (P, Q) by the learning rules in formulas (6) and (7);
(4) compute the blocks of each dimension, the threads on the node side controlling the devices:
1. combine the inter-block factors T, S by the learning rules of formulas (8) and (9);
2. update the factor subsets A by the learning rule of formula (10);
3. update the intermediate tensors (P, Q) by the learning rule of formula (11);
(5) obtain the factor subsets A from the device side;
(6) update the factor subsets A and the intermediate tensors (P, Q) globally via WiMAX and MPI;
(7) check the stopping criterion and whether all the factors A have been updated; if satisfied, end the process; otherwise, return to sub-step (4).
4. The GPU cluster-based multidimensional big data factorization method according to claim 2, characterized in that the tensor partitioning method described in step 4 is implemented as follows:
The hierarchical parallel factor analysis model H-PARAFAC balances the computing resources by allocating blocks evenly to the devices; each device only needs to obtain the data required by the current iteration, and the node passes the blocks to the devices in sequence; the number of blocks in the k-th device is computed as follows:
wherein k denotes the k-th device, k = 1, …, n−1; S denotes the total number of blocks, and S_k denotes the number of blocks in the k-th device; the block counts of the GPUs are equal or differ by at most 1;
each node holds the intermediate data, and each node only needs to update part of the data, the remaining part being the data updated in the previous iteration.
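The even-allocation rule ("equal or differ by at most 1") can be sketched as follows; the formula for S_k is not reproduced in this text, so the function below (name hypothetical) simply implements that stated property:

```python
def blocks_per_device(S, n):
    """Assign S blocks to n devices so every device gets either
    floor(S/n) or floor(S/n)+1 blocks (counts differ by at most 1)."""
    base, extra = divmod(S, n)
    return [base + (1 if k < extra else 0) for k in range(n)]

counts = blocks_per_device(10, 4)
assert counts == [3, 3, 2, 2]
assert sum(counts) == 10 and max(counts) - min(counts) <= 1
```

The first `S mod n` devices each receive one extra block, which keeps the per-GPU load balanced regardless of how S divides.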
5. The GPU cluster-based multidimensional big data factorization method according to claim 2, characterized in that the evaluation of the tensor subset factors and the initialization of the factor subsets described in step 5 are implemented as follows: the initial values of the tensor subset factors are obtained by DTLD, and conventional PARAFAC driven by ALS then obtains the tensor subset factors; DTLD avoids local optima during factor initialization and is composed of the Tucker model and the GRAM method; in the parallel DTLD method, the tensor subsets and the decomposition steps are unified into CUDA streams, so that multiple tensor subsets can be processed under a concurrent CUDA computing framework spanning multiple GPUs; once these tasks are completed, the devices or hosts on all nodes synchronously exchange the results of the tensor subset factors.
6. The GPU cluster-based multidimensional big data factorization method according to claim 2, characterized in that the symmetric data transmission described in step 5 is implemented as follows:
In each iteration of m-ALS toward the final total factors, the data P, Q, and U of each dimension are needed; P and Q exist in tensor form, their dimensionality is consistent with that of the original tensor, both are divided into tensor subsets, and their labeling is likewise consistent; the factor subsets correspond to specific dimensions, and P and Q are continuously updated; in the computation of the n-th dimension, the intermediate tensors are partitioned accordingly, and when the computation proceeds to the (n+1)-th dimension the contiguously stored data will be assigned to different blocks;
In each iteration, the node transmits the intermediate data (P, Q, U) updated by the other devices to the current device; because each device processes only certain specific factor subsets, the node must locate the corresponding blocks in the multidimensional data and send them to the device side;
In the hierarchical parallel factor analysis model H-PARAFAC, the device computes the labels of the tensor subsets extended to the (n+1)-th dimension, and when the n-th dimension is finished the device copies the data directly to the designated location on the node; this step can be described by the permutation formulas (13) and (14): the element u_(i,j,k) in the formulas denotes one data cell of the tensors P and Q, where (i,j,k) is the coordinate of the cell; in formula (13), P1 is transferred once, and after a one-to-one mapping the label becomes (j,k,i); formula (14) expresses the index after three transfers:
Following the coordinate transfer, the offsets of the data cells in linear memory can be computed by the device; within one pass over a dimension, all T̄ and S̄ are computed in CUDA streams, the updated intermediate data is copied directly, according to its offset, to the designated address in host memory, and the new blocks are automatically stored to contiguous memory, which makes moving on to the next dimension convenient.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510708583.8A CN105260554A (en) | 2015-10-27 | 2015-10-27 | GPU cluster-based multidimensional big data factorization method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105260554A true CN105260554A (en) | 2016-01-20 |
Family
ID=55100243
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510708583.8A Pending CN105260554A (en) | 2015-10-27 | 2015-10-27 | GPU cluster-based multidimensional big data factorization method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105260554A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105931256A (en) * | 2016-06-03 | 2016-09-07 | 中国地质大学(武汉) | CUDA (compute unified device architecture)-based large-format remote sensing image fast segmentation method |
CN107229966A (en) * | 2016-03-25 | 2017-10-03 | 阿里巴巴集团控股有限公司 | A kind of model data update method, apparatus and system |
CN107801149A (en) * | 2017-08-25 | 2018-03-13 | 长江大学 | The multipath parameter evaluation method that real value parallel factor decomposes |
CN108170639A (en) * | 2017-12-26 | 2018-06-15 | 云南大学 | Tensor CP based on distributed environment decomposes implementation method |
CN111819579A (en) * | 2018-08-03 | 2020-10-23 | 谷歌有限责任公司 | Distribution tensor calculation across computing devices |
CN112799852A (en) * | 2021-04-12 | 2021-05-14 | 北京一流科技有限公司 | Multi-dimensional SBP distributed signature decision system and method for logic node |
WO2022100345A1 (en) * | 2020-11-13 | 2022-05-19 | 中科寒武纪科技股份有限公司 | Processing method, processing apparatus, and related product |
WO2022151950A1 (en) * | 2021-01-13 | 2022-07-21 | 华为技术有限公司 | Tensor processing method, apparatus and device, and computer readable storage medium |
CN116186522A (en) * | 2023-04-04 | 2023-05-30 | 石家庄学院 | Big data core feature extraction method, electronic equipment and storage medium |
CN111033500B (en) * | 2017-09-13 | 2023-07-11 | 赫尔实验室有限公司 | Systems, methods, and media for sensor data fusion and reconstruction |
2015-10-27 | CN CN201510708583.8A patent/CN105260554A/en active Pending
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107229966A (en) * | 2016-03-25 | 2017-10-03 | 阿里巴巴集团控股有限公司 | A kind of model data update method, apparatus and system |
CN105931256A (en) * | 2016-06-03 | 2016-09-07 | 中国地质大学(武汉) | CUDA (compute unified device architecture)-based large-format remote sensing image fast segmentation method |
CN107801149A (en) * | 2017-08-25 | 2018-03-13 | 长江大学 | The multipath parameter evaluation method that real value parallel factor decomposes |
CN107801149B (en) * | 2017-08-25 | 2020-02-18 | 长江大学 | Multipath parameter estimation method for real value parallel factorization |
CN111033500B (en) * | 2017-09-13 | 2023-07-11 | 赫尔实验室有限公司 | Systems, methods, and media for sensor data fusion and reconstruction |
CN108170639B (en) * | 2017-12-26 | 2021-08-17 | 云南大学 | Tensor CP decomposition implementation method based on distributed environment |
CN108170639A (en) * | 2017-12-26 | 2018-06-15 | 云南大学 | Tensor CP based on distributed environment decomposes implementation method |
CN111819579B (en) * | 2018-08-03 | 2022-02-08 | 谷歌有限责任公司 | Method, system, and medium for distributed tensor computation across computing devices |
CN111819579A (en) * | 2018-08-03 | 2020-10-23 | 谷歌有限责任公司 | Distribution tensor calculation across computing devices |
WO2022100345A1 (en) * | 2020-11-13 | 2022-05-19 | 中科寒武纪科技股份有限公司 | Processing method, processing apparatus, and related product |
WO2022151950A1 (en) * | 2021-01-13 | 2022-07-21 | 华为技术有限公司 | Tensor processing method, apparatus and device, and computer readable storage medium |
CN112799852B (en) * | 2021-04-12 | 2021-07-30 | 北京一流科技有限公司 | Multi-dimensional SBP distributed signature decision system and method for logic node |
CN112799852A (en) * | 2021-04-12 | 2021-05-14 | 北京一流科技有限公司 | Multi-dimensional SBP distributed signature decision system and method for logic node |
CN116186522A (en) * | 2023-04-04 | 2023-05-30 | 石家庄学院 | Big data core feature extraction method, electronic equipment and storage medium |
CN116186522B (en) * | 2023-04-04 | 2023-07-18 | 石家庄学院 | Big data core feature extraction method, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105260554A (en) | GPU cluster-based multidimensional big data factorization method | |
Patania et al. | Topological analysis of data | |
Chen et al. | 64-qubit quantum circuit simulation | |
Lončar et al. | OpenMP, OpenMP/MPI, and CUDA/MPI C programs for solving the time-dependent dipolar Gross–Pitaevskii equation | |
Satarić et al. | Hybrid OpenMP/MPI programs for solving the time-dependent Gross–Pitaevskii equation in a fully anisotropic trap | |
US10484479B2 (en) | Integration of quantum processing devices with distributed computers | |
Pichon et al. | Sparse supernodal solver using block low-rank compression: Design, performance and analysis | |
Verstraete et al. | Criticality, the area law, and the computational power of projected entangled pair states | |
Hofmann et al. | Analytical characterization of the genuine multiparticle negativity | |
Daas et al. | Parallel algorithms for tensor train arithmetic | |
Li et al. | Faster tensor train decomposition for sparse data | |
DiCarlo et al. | Linear algebraic representation for topological structures | |
CN101086729A (en) | A dynamic reconfigurable high-performance computing method and device based on FPGA | |
Chen et al. | Pyomo. GDP: Disjunctive models in python | |
Messer et al. | MiniApps derived from production HPC applications using multiple programing models | |
Chen et al. | A hybrid GPU/CPU FFT library for large FFT problems | |
D’Azevedo et al. | Parallel LU factorization on GPU cluster | |
Luo et al. | Fractional chaotic maps with q–deformation | |
Sowkuntla et al. | MapReduce based improved quick reduct algorithm with granular refinement using vertical partitioning scheme | |
Li et al. | Efficient composing rough approximations for distributed data | |
Liu et al. | Research on k-means algorithm based on cloud computing | |
Li et al. | 2PCP: Two-phase CP decomposition for billion-scale dense tensors | |
Zhou et al. | Quantum multidimensional color images similarity comparison | |
Messer et al. | Developing miniapps on modern platforms using multiple programming models | |
Huang | Frame-groups based fractal video compression and its parallel implementation in Hadoop cloud computing environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160120