CN105260554A - GPU cluster-based multidimensional big data factorization method - Google Patents
- Publication number: CN105260554A (application CN201510708583.8A)
- Authority: CN (China)
- Prior art keywords: tensor, factor, subset, data
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses a GPU cluster-based factorization method for multidimensional big data. It aims to solve the problem that the conventional grid parallel factor analysis model cannot handle large-scale, high-dimensional multiway data analysis, and provides an effective graphics-processing-unit-based multi-mode decomposition method for multidimensional big data, namely a hierarchical parallel factor analysis framework. Building on the conventional grid parallel factor analysis model, the framework comprises a coarse-grained process that integrates the tensor subsets and a fine-grained process that computes every tensor subset and fuses the factor subsets, and it runs on a cluster formed by multiple nodes, each containing several graphics processing units. Tensor decomposition on GPU devices fully exploits their powerful parallel computing capability and the parallelism inherent in tensor decomposition. Experimental results show that the method greatly shortens the execution time needed to obtain the tensor factors, improves the large-scale data processing capability, and effectively relieves the shortage of computing resources.
Description
Technical field
The invention belongs to the field of signal analysis technology and relates to a multidimensional big-data analysis method, in particular to an efficient GPU cluster-based factorization method for multidimensional big data.
Background technology
In complex data-analysis applications, reflecting the dynamics of a large-scale tensor during decomposition without introducing severe data deformation faces ever-growing challenges as data scale and dimensionality keep increasing. Extracting useful information from multiway data, for example for feature extraction and dimensionality reduction, is increasingly important in today's science and engineering. Two-dimensional decomposition methods such as singular value decomposition (SVD), principal component analysis (PCA), and independent component analysis (ICA) lose the correspondence between different dimensions when applied directly to high-dimensional data. In contrast, parallel factor analysis (PARAFAC), canonical polyadic decomposition (CPD), and the Tucker model are better suited to decomposing three-way or higher-order data and can be solved by the ALS method. PARAFAC is easier to interpret than Tucker, avoids the rotation-freedom problem typical of two-dimensional approaches, and at the same time guarantees that the solution is unique.
To improve tensor decomposition of high-dimensional data, researchers have explored two directions: optimizing the factor-computation process, and accelerating the decomposition with high-performance computing devices. The most notable work: the enhanced line search (ELS) proposed by Rajih [document 1] accelerates convergence, its basic idea being to find an optimal relaxation factor. [Documents 2, 3] compute Hadamard products of small matrices instead of multiplying large matrices by Khatri-Rao products, and, built on a parallel framework, the method can process large-scale data.
PARAFAC can analyze data of arbitrary size and dimensionality, but its computational complexity is high and it places heavy demands on computer performance; therefore the whole data set is split with a sliding window and PARAFAC is applied to the data along a certain dimension one window at a time. This laid the foundation of dynamic tensor analysis for big data, but under this theory the direct fusion of data loses part of the correlation between the data, and the resulting causal factor combination hardly reflects the dynamics of the raw data.
To address the problem of large-scale data, the mathematical theory of PARAFAC has seen some innovation: a large data set is regarded as a grid of small data sets, i.e. grid PARAFAC [document 4]. The method converts the decomposition of a tensor into decompositions of independent tensor subsets, and fusing the outputs of the tensor-subset results yields all factors of the tensor. This approach is effective, but it faces two major problems: insufficient computing resources, and divergence of results caused by splitting the tensor into subsets.
[Document 1] M. Rajih, P. Comon, and R. A. Harshman, "Enhanced line search: A novel method to accelerate PARAFAC," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 3, pp. 1128–1147, 2008.
[Document 2] A. H. Phan and A. Cichocki, "Advances in PARAFAC using parallel block decomposition," in Neural Information Processing. Springer Berlin Heidelberg, 2009, pp. 323–330.
[Document 3] A. Huy Phan and A. Cichocki, "PARAFAC algorithms for large-scale problems," Neurocomputing, vol. 74, no. 11, pp. 1970–1984, 2011.
[Document 4] R. A. Harshman and M. E. Lundy, "PARAFAC: Parallel factor analysis," Computational Statistics & Data Analysis, vol. 18, no. 1, pp. 39–72, 1994.
Summary of the invention
To remedy the shortcomings of existing methods, the present invention proposes an efficient GPU cluster-based factorization method for multidimensional big data, namely the H-PARAFAC framework. Based on grid PARAFAC, this framework comprises a coarse-grained process that integrates the tensor subsets and a fine-grained process that computes every tensor subset and fuses the factor subsets in parallel. The framework runs on a cluster composed of multiple nodes, each containing several GPUs. Tensor decomposition on GPU devices fully exploits their powerful parallel computing capability and parallel resources, greatly reduces the execution time needed to obtain the tensor factors, improves the large-scale data processing capability, and effectively solves problems such as insufficient computing resources.
The technical solution adopted by the present invention is a GPU cluster-based factorization method for multidimensional big data, characterized by comprising the following steps:
Step 1: build the PARAFAC model;
The PARAFAC model is given, from the initialization model, by the factor matrices A, B, C and the weight vector λ; the formula is as follows:

Y ≈ Σ_{r=1..R} λ_r · (a_r ∘ b_r ∘ c_r)   (1)

where ∘ denotes the vector outer product and a_r, b_r, c_r are the r-th columns of A, B, C.
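As a concrete illustration, the PARAFAC reconstruction of formula (1) can be sketched in NumPy (a hedged example: the shapes and rank here are arbitrary, not taken from the patent, and `parafac_reconstruct` is an illustrative name):

```python
import numpy as np

def parafac_reconstruct(lmbda, A, B, C):
    """Rebuild a 3-way tensor from PARAFAC factors:
    Y[i, j, k] = sum_r lmbda[r] * A[i, r] * B[j, r] * C[k, r]."""
    return np.einsum('r,ir,jr,kr->ijk', lmbda, A, B, C)

rng = np.random.default_rng(0)
R = 3                                   # illustrative rank
A, B, C = rng.random((5, R)), rng.random((6, R)), rng.random((7, R))
lmbda = np.ones(R)
Y = parafac_reconstruct(lmbda, A, B, C)
print(Y.shape)  # (5, 6, 7)
```

The `einsum` subscripts spell out exactly the sum over rank-one outer products that formula (1) describes.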
Step 2: build grid PARAFAC model and m-ALS;
The grid PARAFAC model regards the tensor Y as a combination of tensor subsets Y_{k_1…k_N}:

Y = I ×_0 A^(0) ×_1 A^(1) ×_2 A^(2)   (2)

where k_i = 1, …, K_i and i = 1, …, N; ×_n denotes the mode-n product of a tensor; A^(n) and U^(n) are the n-th factors of the N-way tensor; and K_i is the number of tensor subsets along the i-th dimension.
The full factors A^(i) of the tensor Y can be obtained by m-ALS from the factor subsets: all tensor subsets are first decomposed by the traditional ALS method to obtain their factors, and the intermediate variables P and Q are initialized from the factors of the tensor subsets, where ⊛ denotes the Hadamard product.
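One ALS factor update, with the Gram matrix formed as a Hadamard product of small matrices (the trick the background section attributes to [documents 2, 3]), can be sketched as follows (a minimal NumPy sketch for a dense 3-way tensor; the function names are illustrative, not from the patent):

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker (Khatri-Rao) product of U (J x R) and V (K x R)."""
    r = U.shape[1]
    return (U[:, None, :] * V[None, :, :]).reshape(-1, r)

def als_update_mode0(Y, B, C):
    """One ALS update of the mode-0 factor. The Gram matrix is the
    Hadamard product (B'B) * (C'C) of small matrices, avoiding an
    explicit Gram of the large Khatri-Rao matrix."""
    Y1 = Y.reshape(Y.shape[0], -1)        # mode-0 unfolding (row-major)
    G = (B.T @ B) * (C.T @ C)             # Hadamard product of Grams
    return Y1 @ khatri_rao(B, C) @ np.linalg.pinv(G)

rng = np.random.default_rng(1)
A, B, C = rng.random((4, 2)), rng.random((5, 2)), rng.random((6, 2))
Y = np.einsum('ir,jr,kr->ijk', A, B, C)   # exact rank-2 tensor
A_est = als_update_mode0(Y, B, C)
print(np.allclose(A_est, A))  # True
```

Because the test tensor is exactly rank 2 and B, C are the true factors, a single update recovers A; in the iterative algorithm this step cycles over all modes.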
Step 3: build the hierarchical parallel factor analysis model H-PARAFAC;
The hierarchical parallel factor analysis model H-PARAFAC runs on a cluster composed of multiple nodes, each containing a multi-core CPU and several GPUs. The cluster is a distributed shared-memory (DSM) system; the nodes are interconnected by InfiniBand and communicate with each other through MPI, and the coarse-grained model running on this cluster manages the computation.
The hierarchical parallel factor analysis model H-PARAFAC contains multiple POSIX threads; each thread is assigned one GPU device and distributes data and computing tasks to the GPUs in a concurrent CUDA mode, and H-PARAFAC synchronizes the updated data across devices and nodes. All computing tasks are executed on the GPUs by the fine-grained model; H-PARAFAC decomposes the tensor subsets and implements the m-ALS algorithm in parallel, and the parallel Hadamard computation scheme allows m-ALS to obtain all factors in parallel.
Step 4: build the coarse-grained model;
The coarse-grained model runs on the distributed shared-memory architecture. The hierarchical parallel factor analysis model H-PARAFAC is provided as a function transparent to the user; this function hides the underlying message passing interface (MPI), the CUDA general-purpose parallel computing architecture middleware, and the other hardware. This step mainly comprises the coarse-grained-model-driven execution of H-PARAFAC and the tensor partitioning method.
Step 5: build the fine-grained parallel model;
In the fine-grained parallel model, CUDA threads are responsible for evaluating the tensor-subset factors, mainly comprising the initialization of the factor subsets, the symmetric data transmission, the parallel Hadamard product computation, and the intelligent splitting.
Preferably, the specific computation of the m-ALS described in step 2 minimizes the standard Euclidean-distance cost function by gradient so as to concatenate along the horizontal direction. In the m-ALS iterative process, when the i-th dimension is computed, the factors of the tensor subsets become available and the factor sets of tensor subsets with the same index are synthesized. After all dimensions have been iterated once, the factor subsets are concatenated by the following formula, where 1 ≤ k_n ≤ K_n and the bracket operator denotes selection. After all factor subsets are updated, the intermediate tensors P and Q are also updated for the next iteration:
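The concatenation of factor subsets after a full sweep can be sketched as follows (a hedged illustration: in the actual method the per-block factors come from m-ALS with scale alignment, while here they are dummy matrices, and `fuse_factor_subsets` is an assumed name):

```python
import numpy as np

def fuse_factor_subsets(subsets):
    """Stack the per-block factor matrices A_k (k = 1..K_n) vertically to
    form the full mode-n factor, mirroring the concatenation step that
    runs after every dimension has been iterated once."""
    return np.vstack(subsets)

# four dummy factor subsets, each covering 2 rows of the mode-n factor
parts = [np.full((2, 3), float(k)) for k in range(4)]
A_full = fuse_factor_subsets(parts)
print(A_full.shape)  # (8, 3)
```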
Preferably, the coarse-grained-model-driven execution of the hierarchical parallel factor analysis model H-PARAFAC described in step 4 comprises the following sub-steps:
(1) initialize the GPUs;
(2) evaluate the factors of the tensor subsets U in formula (5);
(3) compute the intermediate tensors (P, Q) by the learning rules in formulas (6) and (7);
(4) compute the blocks of each dimension, with the host-side threads controlling the devices:
1. combine the inter-block factors T and S by the learning rules of formulas (8) and (9);
2. update the factor subsets A by the learning rule of formula (10);
3. update the intermediate tensors (P, Q) by the learning rule of formula (11);
(5) fetch the factor subsets A from the device side;
(6) update the factor subsets A and the intermediate tensors (P, Q) globally via InfiniBand and MPI;
(7) check whether the stopping criterion is met and all factors A have been updated: if so, terminate the process; otherwise return to sub-step (4).
Preferably, the tensor partitioning method described in step 4 is implemented as follows:
The hierarchical parallel factor analysis model H-PARAFAC distributes blocks evenly over the devices to balance the computing resources; each device only needs to obtain the data required by the current iteration, and the node passes the blocks to the devices in sequence. The number of blocks on the k-th device is computed as follows, where k = 1, ..., n-1 indexes the device, S is the total number of blocks, and S_k is the number of blocks on the k-th device; the block counts of the GPUs are equal or differ by 1.
Each node holds intermediate data, and each node only needs to update part of the data; the remaining part is the data updated in the previous iteration.
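The balanced distribution rule (block counts equal or differing by 1) can be sketched as follows (an assumption-laden sketch, since the patent's exact formula for S_k is not preserved in the text; `blocks_per_device` is an illustrative name):

```python
def blocks_per_device(S, n):
    """Distribute S blocks over n devices so that per-device counts
    are equal or differ by at most 1."""
    base, extra = divmod(S, n)
    return [base + 1 if k < extra else base for k in range(n)]

counts = blocks_per_device(10, 6)
print(counts)  # [2, 2, 2, 2, 1, 1]
```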
Preferably, the evaluation of the tensor-subset factors and the initialization of the factor subsets described in step 5 are implemented as follows: the initial values of the tensor-subset factors are obtained by DTLD, and conventional PARAFAC driven by ALS then obtains the tensor-subset factors; DTLD avoids local optima in the factor initialization and is composed of the Tucker model and the GRAM method. In the parallel DTLD method, the tensor subsets and their decomposition steps are unified into CUDA-stream processes, so multiple tensor subsets can be handled under a concurrent CUDA computing framework spanning multiple GPUs. Once these tasks complete, the devices or hosts of all nodes synchronously exchange the results of the tensor-subset factors.
Preferably, the symmetric data transmission described in step 5 is implemented as follows:
In each m-ALS iteration toward the final full factors, the P, Q, U data of each dimension are needed. P and Q exist in tensor form with the same dimensionality as the original tensor; both are divided into tensor subsets and labeled consistently. The factor subsets correspond to specific dimensions, and P and Q are continually updated. In the computation of the n-th dimension, the intermediate tensors are partitioned into blocks by index; when the computation proceeds to the (n+1)-th dimension, contiguously stored data will be assigned to different blocks.
In each iteration, the node transmits the intermediate data (P, Q, U) updated by other devices to the current device (a node contains multiple devices; a device is a computer processing platform, while a node is the overall processing platform). Since the current device processes only certain factor subsets, the node must locate the corresponding blocks in the possibly multidimensional data and send them to the device side.
In the hierarchical parallel factor analysis model H-PARAFAC, the device computes the labels of the tensor subsets extended to the (n+1)-th dimension; when the n-th-dimension computation completes, the device copies these data directly to the designated locations of the node. This step can be described by the permutation theory in formulas (13) and (14), where the element u_(i,j,k) denotes a data cell of tensors P and Q and (i, j, k) is its coordinate. In formula (13), P1 is transmitted once: after a one-to-one mapping the label becomes (j, k, i); formula (14) shows that the index has undergone three transmissions.
According to the coordinate transmission, the offset of a data cell in linear memory can be computed by the device. Within one pass over a dimension, all P and Q subtensors are computed in CUDA streams; the updated intermediate data can be copied directly by offset to the designated addresses of the host-side storage space, and new blocks are automatically stored to contiguous storage space to facilitate proceeding to the next dimension.
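The label rotation of formulas (13) and (14) amounts to a cyclic axis permutation, which can be sketched with NumPy (hedged: `np.transpose` stands in for the device-side offset computation, and `rotate_labels` is an illustrative name):

```python
import numpy as np

def rotate_labels(T):
    """Relabel each cell (i, j, k) to (j, k, i): new[j, k, i] = old[i, j, k].
    Three such rotations restore the original layout, matching the
    three-transmission statement of formula (14)."""
    return np.transpose(T, (1, 2, 0))

T = np.arange(24).reshape(2, 3, 4)
once = rotate_labels(T)
back = rotate_labels(rotate_labels(once))   # three rotations in total
print(once.shape, np.array_equal(back, T))  # (3, 4, 2) True
```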
Compared with existing PARAFAC-based multidimensional data analysis methods, the present invention has the following advantages and beneficial effects:
(1) the present invention proposes a parallel computing framework, H-PARAFAC, that enables the algorithm to handle large-scale, high-dimensional data analysis problems quickly;
(2) the proposed hierarchical parallel computing framework comprises a coarse-grained model that integrates the tensor subsets and a fine-grained computation model that computes the tensor subsets and fuses their factors;
(3) the proposed H-PARAFAC decomposes the tensor subsets in parallel with multiple GPU devices on a hybrid cluster computing platform.
Brief description of the drawings
Figure 1: schematic of the grid PARAFAC model of the embodiment of the present invention;
Figure 2: schematic of the hierarchical parallel factor analysis model H-PARAFAC of the embodiment of the present invention;
Figure 3: flowchart of tensor decomposition under H-PARAFAC of the embodiment of the present invention;
Figure 4: symmetric data transmission diagram of the embodiment of the present invention;
Figure 5: execution time for obtaining all factors of the tensors under different splitting schemes of the embodiment of the present invention; plot (a): execution time versus data size; plot (b): execution time versus block size.
Embodiment
To help those of ordinary skill in the art understand and implement the present invention, it is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described here serve only to illustrate and explain the invention and are not intended to limit it.
Referring to Fig. 1: the present invention addresses the problem that the conventional grid parallel factor analysis model (grid PARAFAC) cannot handle large-scale, high-dimensional multiway data analysis, and proposes an effective GPU (graphics processing unit) cluster-based multi-mode decomposition method for multidimensional big data, namely the hierarchical parallel factor analysis (H-PARAFAC) framework. Based on grid PARAFAC, the framework comprises a coarse-grained process that integrates the tensor subsets and a fine-grained process that computes every tensor subset and fuses the factor subsets in parallel. The framework runs on a cluster composed of multiple nodes, each containing several GPUs. Tensor decomposition on GPU devices fully exploits their powerful parallel computing capability and parallel resources, greatly reduces the execution time needed to obtain the tensor factors, improves the large-scale data processing capability, and effectively solves problems such as insufficient computing resources.
The experimental results were evaluated on a hybrid computing cluster composed of two workstations, each configured with 4 NVIDIA Tesla C2050 graphics cards. The execution environment is configured as follows: the operating system is 64-bit Windows 2008 Enterprise; each machine contains two Intel Xeon E5620 2.40 GHz CPUs and 24 GB of RAM; compilation is done under MS Visual Studio 2010; the bus is PCI-E at 5.0 Gbps and the network provides 32 Gbps transmission bandwidth; the cluster contains 8 Tesla C2050 GPUs in total, each with 448 CUDA cores, a core frequency of 1.15 GHz, 2.5 GB of standard memory, and 144 GB/s of memory bandwidth.
The experimental data are three-way tensors sampled from basic sparse smooth signals such as rectified half-wave cosine and sine signals. The data have equal size in all three dimensions, with sizes 300, 600, 900, 1200, 1500, 1800, 2100, and 2400; each tensor is divided into a grid of K×K×K blocks, with K = 2, 3, 4, 5, 6, 8.
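The test-data construction and K×K×K gridding just described can be sketched as follows (a hedged reconstruction: the exact signals of the experiments are not specified beyond "rectified half-wave sine/cosine", and the frequency and helper names here are assumptions):

```python
import numpy as np

def split_grid(Y, K):
    """Split a cubic 3-way tensor into a K x K x K grid of equal subtensors."""
    n = Y.shape[0] // K
    return {(i, j, k): Y[i*n:(i+1)*n, j*n:(j+1)*n, k*n:(k+1)*n]
            for i in range(K) for j in range(K) for k in range(K)}

t = np.linspace(0.0, 1.0, 300)
sig = np.maximum(np.sin(8 * np.pi * t), 0.0)   # rectified half-wave sine
Y = np.einsum('i,j,k->ijk', sig, sig, sig)     # sparse, smooth 3-way tensor
blocks = split_grid(Y, 2)
print(len(blocks), blocks[(1, 1, 1)].shape)  # 8 (150, 150, 150)
```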
The GPU cluster-based factorization method for multidimensional big data provided by the invention comprises the following steps:
Step 1: build the PARAFAC model;
The PARAFAC model is given, from the initialization model, by the factor matrices A, B, C and the weight vector λ; the formula is as follows:

Y ≈ Σ_{r=1..R} λ_r · (a_r ∘ b_r ∘ c_r)   (1)

where ∘ denotes the vector outer product and a_r, b_r, c_r are the r-th columns of A, B, C.
Step 2: build grid PARAFAC model and m-ALS;
The grid PARAFAC model regards the tensor Y as a combination of tensor subsets Y_{k_1…k_N}:

Y = I ×_0 A^(0) ×_1 A^(1) ×_2 A^(2)   (2)

where k_i = 1, …, K_i and i = 1, …, N; ×_n denotes the mode-n product of a tensor; A^(n) and U^(n) are the n-th factors of the N-way tensor; and K_i is the number of tensor subsets along the i-th dimension.
The full factors A^(i) of the tensor Y can be obtained by m-ALS from the factor subsets: all tensor subsets are first decomposed by the traditional ALS method to obtain their factors, and the intermediate variables P and Q are initialized from the factors of the tensor subsets, where ⊛ denotes the Hadamard product.
The specific computation of m-ALS is given in formulas (8)–(11); it minimizes the standard Euclidean-distance cost function by gradient so as to concatenate along the horizontal direction. As shown in Fig. 1, matrix dimensions with the same coordinate are assigned to the same block. In the m-ALS iterative process, taking three-way data as an example, the tensor is split along three directions; when the i-th dimension is computed, the factors of the tensor subsets become available and the factor sets of tensor subsets with the same index are synthesized. After all dimensions have been iterated once, the factor subsets are concatenated by formulas (8)–(11): in formula (8) the bracket operator denotes selection, and the matrices T and S of formula (9) are used to compute the final factor subsets in formula (10). After all factor subsets are updated (k_n from 1 to K_n), the intermediate tensors P and Q are also updated, as in formula (11), for the next iteration.
Step 3: build the hierarchical parallel factor analysis model H-PARAFAC;
Referring to Fig. 2: the hierarchical parallel factor analysis model H-PARAFAC runs on a cluster composed of multiple nodes, each containing a multi-core CPU and several GPUs. The cluster is a distributed shared-memory (DSM) system; the nodes are interconnected by InfiniBand and communicate with each other through MPI, and the coarse-grained model running on this cluster manages the computation.
The hierarchical parallel factor analysis model H-PARAFAC contains multiple POSIX threads; each thread is assigned one GPU device and distributes data and computing tasks to the GPUs in a concurrent CUDA mode, and H-PARAFAC synchronizes the updated data across nodes and devices. All computing tasks are executed on the GPUs by the fine-grained model; H-PARAFAC decomposes the tensor subsets and implements the m-ALS algorithm in parallel, and the parallel Hadamard computation scheme allows m-ALS to obtain all factors in parallel.
Step 4: build the coarse-grained model;
The coarse-grained model runs on the distributed shared-memory architecture. The hierarchical parallel factor analysis model H-PARAFAC is provided as a function transparent to the user; this function hides the underlying message passing interface (MPI), the CUDA general-purpose parallel computing architecture middleware, and the other hardware. This step mainly comprises the coarse-grained-model-driven execution of H-PARAFAC and the tensor partitioning method.
The coarse-grained-model-driven execution of the hierarchical parallel factor analysis model H-PARAFAC comprises the following sub-steps:
(1) initialize the GPUs;
(2) evaluate the factors of the tensor subsets U in formula (5);
(3) compute the intermediate tensors (P, Q) by the learning rules in formulas (6) and (7);
(4) compute the blocks of each dimension, with the host-side threads controlling the devices:
1. combine the inter-block factors T and S by the learning rules of formulas (8) and (9);
2. update the factor subsets A by the learning rule of formula (10);
3. update the intermediate tensors (P, Q) by the learning rule of formula (11);
(5) fetch the factor subsets A from the device side;
(6) update the factor subsets A and the intermediate tensors (P, Q) globally via InfiniBand and MPI;
(7) check whether the stopping criterion is met and all factors A have been updated: if so, terminate the process; otherwise return to sub-step (4).
The whole procedure has three layers of synchronization: 1. synchronization between nodes; 2. synchronization between host-side threads; 3. synchronization between CUDA threads. CUDA synchronization guarantees the correctness of the computation, while node synchronization and POSIX thread synchronization guarantee reliable data transmission. Communication between nodes is realized through aggregation and the transfer functions of MPI.
Because the computing resources of a GPU device are limited, the coarse-grained model uses CUDA streams to form concurrent kernels and adapt to data of arbitrary scale. Each CUDA stream processes one or more data blocks, and the number of CUDA streams depends on the data size and the memory capacity of the GPU. A CUDA thread represents a series of instructions run on the device in a specified order; within the same computing stream, a new kernel starts once the previous kernel completes, and the use of concurrent streams makes full use of the GPU's computing resources.
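The one-host-thread-per-GPU arrangement can be imitated in plain Python (purely illustrative: real H-PARAFAC binds CUDA contexts and launches streams, which this sketch replaces with a dummy per-block computation, and all names are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

def process_on_device(dev_id, blocks):
    """Stand-in for per-device work; a real implementation would bind a
    CUDA context here and launch one stream per block."""
    return dev_id, [b * b for b in blocks]     # dummy per-block computation

# each (hypothetical) GPU gets its block assignment, one host thread per GPU
assignments = {0: [0, 1, 2], 1: [3, 4], 2: [5, 6]}
with ThreadPoolExecutor(max_workers=len(assignments)) as pool:
    results = dict(pool.map(lambda kv: process_on_device(*kv),
                            assignments.items()))
print(results[1])  # [9, 16]
```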
The tensor partitioning method is implemented as follows:
The hierarchical parallel factor analysis model H-PARAFAC distributes blocks evenly over the devices to balance the computing resources; each device only needs to obtain the data required by the current iteration, and the node passes the blocks to the devices in sequence. The number of blocks on the k-th device is computed as follows, where k = 1, ..., n-1 indexes the device, S is the total number of blocks, and S_k is the number of blocks on the k-th device; the block counts of the GPUs are equal or differ by 1.
For example, suppose two hosts each connect three devices, and the full tensor is divided into 10 blocks along the first dimension with indices from 0 to 9; the blocks are dealt in sequence over host h and device d (k = h·3 + d). The tensor splitting scheme is as shown in table 1:
Table 1: tensor splitting scheme
Each node holds intermediate data such as the tensor-subset factors U, the factor subsets A, and the tensors P and Q; each node only needs to update part of the data, the remainder being the data updated in the previous iteration. As shown in table 1, device (2) of host (0) updates only its assigned subsets of the first dimension; host (0) only needs to obtain these subsets from its devices, and the final fusion yields the full factor A^(1) of the first dimension.
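The host/device layout of table 1 can be sketched as follows (hedged: the garbled index expression in the text is read as a sequential round-robin assignment, and `assign_blocks` is an assumed name):

```python
def assign_blocks(S, hosts, devs):
    """Deal S block indices in sequence over hosts*devs devices so that
    per-device counts are equal or differ by at most 1.
    Returns {(host, device): [block indices]}."""
    n = hosts * devs
    table = {(h, d): [] for h in range(hosts) for d in range(devs)}
    for b in range(S):
        k = b % n
        table[(k // devs, k % devs)].append(b)
    return table

layout = assign_blocks(10, 2, 3)   # 10 blocks over 2 hosts x 3 devices
print(layout[(0, 0)], layout[(1, 2)])  # [0, 6] [5]
```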
The coarse-grained model of this embodiment is equivalent to managing the computation, while the fine-grained model is responsible for the concrete computation. The fine-grained model of this embodiment covers four parts: factor initialization, data transmission under a specific storage layout, the parallel Hadamard product computation, and intelligent splitting. The Hadamard product is realized by the fine-grained model, but since it is computed in parallel, the process is managed by the coarse-grained model like any other parallel work; the Hadamard product itself is a conventional computation, introduced concretely in step 2 where the data processing is described. The intelligent splitting scheme likewise belongs to the concrete execution of the fine-grained model; the splitting itself is simple blocking and is not the emphasis — the key is the integration of the block data, i.e. data management, which this embodiment discusses within the coarse-grained model as the tensor partitioning part of step 4.
Step 5: build the fine-grained parallel model;
In the fine-grained parallel model, CUDA threads are responsible for evaluating the tensor-subset factors, mainly comprising the initialization of the factor subsets, the symmetric data transmission, the parallel Hadamard product computation, and the intelligent splitting;
The evaluation of the tensor-subset factors and the initialization of the factor subsets are implemented as follows: the m-ALS method needs to estimate the factors of all tensor subsets and takes the obtained factor subsets as its input. This step obtains initial values of the tensor-subset factors by DTLD and then drives conventional PARAFAC by ALS to obtain the tensor-subset factors; DTLD avoids local optima in the factor initialization and is composed of the Tucker model and the GRAM method. In the parallel DTLD method, the tensor subsets and their decomposition steps are unified into CUDA-stream processes, and multiple tensor subsets can be handled under a concurrent CUDA computing framework spanning multiple GPUs, because there is no dependency between the tensor subsets. Once these tasks complete, the devices or hosts of all nodes synchronously exchange the results of the tensor-subset factors. However, the high complexity of DTLD makes initializing the factor subsets of the whole large-scale tensor very difficult; based on H-PARAFAC, the factor subsets are therefore initialized by sampling, and the averaged tensor-subset factors serve as the input of the m-ALS method. This minimizes the execution time for obtaining the global factors.
The symmetric data transmission is implemented as follows: in each m-ALS iteration toward the final full factors, the P, Q, U data of each dimension are needed. P and Q exist in tensor form with the same dimensionality as the original tensor; both are divided into tensor subsets and labeled consistently, with k_i = 1, ..., K_i and i = 1, ..., N. The factor subsets correspond to specific dimensions, and P and Q are continually updated. In the computation of the n-th dimension, the intermediate tensors are partitioned such that subtensors of P and Q whose n-th index coincides are assigned to the same block, and data belonging to the same block are contiguous in storage; when the computation proceeds to the (n+1)-th dimension, contiguously stored data will be assigned to different blocks.
In each iteration, the node transmits the intermediate data (P, Q, U) updated by other devices to the current device (a node contains multiple devices; a device is a computer processing platform, while a node is the overall processing platform). Since the current device processes only certain factor subsets, the node must locate the corresponding blocks in the possibly multidimensional data and send them to the device side. When a node fetches data blocks, the overhead is huge if the source data are poorly organized in the host's storage space. When values are initialized on a device along a certain dimension, restructuring the intermediate tensors is extremely important so that the node can obtain all blocks from contiguous storage; relabeling or restructuring P and Q at the node is very heavy work, with time complexity O(n^3), and this becomes the bottleneck.
In the hierarchical parallel factor analysis model H-PARAFAC, the device computes the labels of the tensor subsets extended to the (n+1)-th dimension; when the n-th-dimension computation completes, the device copies these data directly to the designated locations of the node. The fine-grained model lets data travel directly from device to node and guarantees that the intermediate data are stored contiguously in the host storage space.
As shown in Fig. 4, for a three-way tensor the transmission can be viewed as a 120° counterclockwise rotation about its main diagonal. Since the data are three-dimensional, the tensor needs to rotate 3 times to return to its initial state. Assuming the previous blocks are stored contiguously and the data are partitioned vertically along the X axis, the next partition is along the Y axis; one rotation guarantees that the resulting data blocks are contiguous.
This step can be described by the permutation theory in formulas (13) and (14), where the element u_(i,j,k) denotes a data cell of tensors P and Q and (i, j, k) is its coordinate. In formula (13), P1 is transmitted once: after a one-to-one mapping the label becomes (j, k, i); formula (14) shows that the index has undergone three transmissions.
According to the coordinate transmission, the offset of a data cell in linear memory can be computed by the device. Within one pass over a dimension, all P and Q subtensors are computed in CUDA streams; the updated intermediate data can be copied directly by offset to the designated addresses of the host-side storage space, and new blocks are automatically stored to contiguous storage space to facilitate proceeding to the next dimension.
Three aspects were evaluated in this example:
(1) the running time for obtaining the full tensor factors with m-ALS, for data of different scales and tensor subsets of different sizes;
(2) the load of the different parts;
(3) the ratio of computing power to communication capacity.
Fig. 5 shows the execution time for obtaining all tensor factors under different splitting schemes. Panel (a) shows that when the data scale grows linearly, the execution time also grows linearly, which indicates that the H-PARAFAC of the present invention is stable: with K=6, when the scale grows from 600 to 2400 the execution time changes from 300 ms to 677 ms, and when the data scale increases 8-fold across the three dimensions simultaneously, the execution time increases by less than a factor of 2.9. Panel (b) examines the execution time from the granularity of the tensor subsets: clearly, the execution time rises steadily as the tensor subsets shrink; in other words, excessive splitting increases the load. The facts show that:
(1) each tensor subset should be kept as large as possible;
(2) when configuring a computing platform for H-PARAFAC, the capability of the GPUs is the decisive prior factor, not the total computing power of the whole cluster.
Meanwhile, the method of the present invention is compared with the original PARAFAC method: when the tensor subset size equals 300 and 600, the original method takes 158650 ms and 4491736 ms respectively, its load growing with n³, where n is the data size; H-PARAFAC is faster by factors of 80 and 800 respectively. G-PARAFAC obtains similar results, but H-PARAFAC can also support the fast decomposition of data of size 2400³ and larger.
It should be understood that the parts not elaborated in this specification belong to the prior art.
It should be understood that the above description of the preferred embodiment is relatively detailed and therefore should not be regarded as limiting the scope of patent protection of the present invention; under the inspiration of the present invention, a person of ordinary skill in the art may also make substitutions or variations without departing from the scope protected by the claims of the present invention, all of which fall within the protection scope of the present invention; the requested protection scope of the present invention shall be determined by the appended claims.
Claims (6)
1. A GPU cluster-based multidimensional big data factorization method, characterized by comprising the following steps:
Step 1: build the PARAFAC model;
The PARAFAC model is given, based on an initialization model, by the matrices A, B, C, with the tensor weights represented by λ; the formula is as follows:
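The body of formula (1) is not reproduced in this text. For reference, the standard three-way PARAFAC model with factor matrices A, B, C and weight vector λ is conventionally written as follows (this is the textbook form, not necessarily the exact notation of the original patent):

```latex
\bar{Y} \approx \sum_{r=1}^{R} \lambda_r \, a_r \circ b_r \circ c_r ,
\qquad
y_{ijk} \approx \sum_{r=1}^{R} \lambda_r \, a_{ir}\, b_{jr}\, c_{kr}
```

Here ∘ denotes the vector outer product and R is the decomposition rank.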
Step 2: build grid PARAFAC model and m-ALS;
The grid PARAFAC model regards the tensor subsets as a combination of the tensor Ȳ:
Ȳ = Ī ×_0 A^(0) ×_1 A^(1) ×_2 A^(2)   (2);
wherein: k_i = 1, …, K_i, i = 1, …, N; ×_n denotes the mode-n product of a tensor; A^(n) and U^(n) are the n-th factors of the N-way tensor; and K_i is the number of tensor subsets along the i-th dimension.
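Formula (2) composes the tensor from mode-n products. A minimal NumPy sketch of the mode-n product, and of reconstructing Ȳ from a superdiagonal core Ī and three factor matrices, is given below (the dimensions and helper name are illustrative, not taken from the patent):

```python
import numpy as np

def mode_n_product(t, m, n):
    """Mode-n product t ×_n m: contract mode n of tensor t with the
    columns of matrix m (rows of m index the new mode)."""
    t = np.moveaxis(t, n, 0)
    shape = t.shape
    out = m @ t.reshape(shape[0], -1)
    return np.moveaxis(out.reshape((m.shape[0],) + shape[1:]), 0, n)

# Hypothetical small example: rank-2 superdiagonal core and three factors.
R = 2
core = np.zeros((R, R, R))
core[np.arange(R), np.arange(R), np.arange(R)] = 1.0
A0, A1, A2 = np.random.rand(4, R), np.random.rand(5, R), np.random.rand(6, R)
Y = mode_n_product(mode_n_product(mode_n_product(core, A0, 0), A1, 1), A2, 2)

# With a superdiagonal core this equals the CP/PARAFAC reconstruction:
Y_cp = np.einsum('ir,jr,kr->ijk', A0, A1, A2)
assert np.allclose(Y, Y_cp)
```

This shows why formula (2) with a superdiagonal identity core is equivalent to the sum-of-rank-one PARAFAC form.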
The total factor A^(i) of the tensor Ȳ can be obtained by m-ALS from the factor subsets:
all tensor subsets are first decomposed by the conventional ALS method to obtain their factors; the intermediate variables P and Q are initialized from the factors of the tensor subsets:
wherein ⊛ denotes the Hadamard product;
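The role of the intermediates P and Q can be illustrated with the standard CP-ALS learning rule, in which Q is the Hadamard product of the factor Gram matrices and P is the matricized-tensor-times-Khatri-Rao product. The sketch below is written under that assumption; the patent's exact update rules, formulas (6) and (7), are not reproduced here, and the function name is hypothetical:

```python
import numpy as np

def als_update_mode0(Y, A1, A2):
    """One ALS update of the mode-0 factor of a 3-way tensor Y."""
    # Khatri-Rao product of the other two factors (columnwise Kronecker).
    KR = np.einsum('jr,kr->jkr', A1, A2).reshape(-1, A1.shape[1])
    P = Y.reshape(Y.shape[0], -1) @ KR          # intermediate P (MTTKRP)
    Q = (A1.T @ A1) * (A2.T @ A2)               # intermediate Q: Hadamard of Grams
    return P @ np.linalg.pinv(Q)

# Recover the mode-0 factor of an exactly rank-2 tensor (synthetic data).
rng = np.random.default_rng(0)
A0, A1, A2 = rng.random((4, 2)), rng.random((5, 2)), rng.random((6, 2))
Y = np.einsum('ir,jr,kr->ijk', A0, A1, A2)
assert np.allclose(als_update_mode0(Y, A1, A2), A0)
```

Because Q is only R×R, the Hadamard formulation avoids ever forming the large Khatri-Rao Gram matrix, which is what makes the per-mode updates cheap to parallelize.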
Step 3: build the hierarchical parallel factor analysis model H-PARAFAC;
The hierarchical parallel factor analysis model H-PARAFAC runs on a cluster composed of multiple nodes, each node containing a multi-core CPU and multiple GPUs; the cluster is a distributed shared memory (DSM) system in which the nodes are interconnected by WiMAX and exchange feedback via MPI, and the coarse-grained model running on the cluster manages the computation procedure;
The hierarchical parallel factor analysis model H-PARAFAC contains multiple POSIX threads; each thread is assigned one GPU device and distributes data and computing tasks to the GPUs under a concurrent CUDA mode, and the model synchronizes updated data between devices and nodes; all computing tasks are executed on the GPUs under fine-grained control; the hierarchical parallel factor analysis model H-PARAFAC decomposes the tensor subsets in parallel to implement the m-ALS algorithm, and the parallel Hadamard computation mode allows m-ALS to obtain all tensor factors in parallel;
Step 4: build the coarse-grained model;
The coarse-grained model runs on the distributed shared memory architecture; the hierarchical parallel factor analysis model H-PARAFAC is implemented as a function transparent to the user, which hides the concrete underlying message passing interface (MPI), the compute unified device architecture (CUDA) middleware, and the other hardware; this step mainly comprises the execution of the coarse-grained-model-driven hierarchical parallel factor analysis model H-PARAFAC and the tensor partitioning method;
Step 5: build the fine-grained parallel model;
In the fine-grained parallel model, the CUDA threads are responsible for evaluating the factors of the tensor subsets, mainly comprising the initialization of the factor subsets, symmetric data transmission, parallel Hadamard product computation, and smart slicing.
2. The GPU cluster-based multidimensional big data factorization method according to claim 1, characterized in that the m-ALS described in step 2 is computed as follows: it minimizes the cost function, the standard Euclidean distance, by gradient descent, and concatenates the factors along the horizontal direction;
during the m-ALS iteration, when the i-th dimension is computed, the factors of the tensor subsets become available, and the factor subsets of the same dimension are synthesized;
when all dimensions have been iterated once, the factor subsets are concatenated by the following formula:
wherein 1 ≤ k_n ≤ K_n, and the bracket operator denotes selection; after all factor subsets are updated, the intermediate tensors P and Q are also updated for the next iteration:
3. The GPU cluster-based multidimensional big data factorization method according to claim 2, characterized in that the execution of the coarse-grained-model-driven hierarchical parallel factor analysis model H-PARAFAC described in step 4 comprises the following sub-steps:
(1) initialize the GPUs;
(2) evaluate the factors of the tensor subsets U in formula (5);
(3) compute the intermediate tensors (P, Q) by the learning rules in formulas (6) and (7);
(4) compute the blocks of each dimension, the threads on the node side controlling the devices:
1. combine the inter-block factors T, S by the learning rules of formulas (8) and (9);
2. update the factor subsets A by the learning rule of formula (10);
3. update the intermediate tensors (P, Q) by the learning rule of formula (11);
(5) obtain the factor subsets A from the device side;
(6) update the factor subsets A and the intermediate tensors (P, Q) globally via WiMAX and MPI;
(7) check the stopping criterion and whether all the factors A have been updated; if satisfied, end the process; otherwise, return to sub-step (4).
4. The GPU cluster-based multidimensional big data factorization method according to claim 2, characterized in that the tensor partitioning method described in step 4 is implemented as follows:
The hierarchical parallel factor analysis model H-PARAFAC balances the computing resources by allocating blocks evenly to the devices; each device only needs to obtain the data required by the current iteration, and the node passes the blocks to the devices in sequence; the number of blocks in the k-th device is computed as follows:
wherein k denotes the k-th device, k = 1, …, n−1; S denotes the total number of blocks, and S_k denotes the number of blocks in the k-th device; the block counts of the GPUs are equal or differ by at most 1;
each node holds the intermediate data, and each node only needs to update part of the data, the remaining part being the data updated in the previous iteration.
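The even-allocation rule ("equal or differ by at most 1") can be sketched as follows; the formula for S_k is not reproduced in this text, so the function below (name hypothetical) simply implements that stated property:

```python
def blocks_per_device(S, n):
    """Assign S blocks to n devices so every device gets either
    floor(S/n) or floor(S/n)+1 blocks (counts differ by at most 1)."""
    base, extra = divmod(S, n)
    return [base + (1 if k < extra else 0) for k in range(n)]

counts = blocks_per_device(10, 4)
assert counts == [3, 3, 2, 2]
assert sum(counts) == 10 and max(counts) - min(counts) <= 1
```

The first `S mod n` devices each receive one extra block, which keeps the per-GPU load balanced regardless of how S divides.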
5. The GPU cluster-based multidimensional big data factorization method according to claim 2, characterized in that the evaluation of the tensor subset factors and the initialization of the factor subsets described in step 5 are implemented as follows: the initial values of the tensor subset factors are obtained by DTLD, and conventional PARAFAC driven by ALS then obtains the tensor subset factors; DTLD avoids local optima during factor initialization and is composed of the Tucker model and the GRAM method; in the parallel DTLD method, the tensor subsets and the decomposition steps are unified into CUDA streams, so that multiple tensor subsets can be processed under a concurrent CUDA computing framework spanning multiple GPUs; once these tasks are completed, the devices or hosts on all nodes synchronously exchange the results of the tensor subset factors.
6. The GPU cluster-based multidimensional big data factorization method according to claim 2, characterized in that the symmetric data transmission described in step 5 is implemented as follows:
In each iteration of m-ALS toward the final total factors, the data P, Q, and U of each dimension are needed; P and Q exist in tensor form, their dimensionality is consistent with that of the original tensor, both are divided into tensor subsets, and their labeling is likewise consistent; the factor subsets correspond to specific dimensions, and P and Q are continuously updated; in the computation of the n-th dimension, the intermediate tensors are partitioned accordingly, and when the computation proceeds to the (n+1)-th dimension the contiguously stored data will be assigned to different blocks;
In each iteration, the node transmits the intermediate data (P, Q, U) updated by the other devices to the current device; because each device processes only certain specific factor subsets, the node must locate the corresponding blocks in the multidimensional data and send them to the device side;
In the hierarchical parallel factor analysis model H-PARAFAC, the device computes the labels of the tensor subsets extended to the (n+1)-th dimension, and when the n-th dimension is finished the device copies the data directly to the designated location on the node; this step can be described by the permutation formulas (13) and (14): the element u_(i,j,k) in the formulas denotes one data cell of the tensors P and Q, where (i,j,k) is the coordinate of the cell; in formula (13), P1 is transferred once, and after a one-to-one mapping the label becomes (j,k,i); formula (14) expresses the index after three transfers:
Following the coordinate transfer, the offsets of the data cells in linear memory can be computed by the device; within one pass over a dimension, all T̄ and S̄ are computed in CUDA streams, the updated intermediate data is copied directly, according to its offset, to the designated address in host memory, and the new blocks are automatically stored to contiguous memory, which makes moving on to the next dimension convenient.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510708583.8A CN105260554A (en) | 2015-10-27 | 2015-10-27 | GPU cluster-based multidimensional big data factorization method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105260554A true CN105260554A (en) | 2016-01-20 |
Family
ID=55100243
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510708583.8A Pending CN105260554A (en) | 2015-10-27 | 2015-10-27 | GPU cluster-based multidimensional big data factorization method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105260554A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105931256A (en) * | 2016-06-03 | 2016-09-07 | 中国地质大学(武汉) | CUDA (compute unified device architecture)-based large-format remote sensing image fast segmentation method |
CN107229966A (en) * | 2016-03-25 | 2017-10-03 | 阿里巴巴集团控股有限公司 | A kind of model data update method, apparatus and system |
CN107801149A (en) * | 2017-08-25 | 2018-03-13 | 长江大学 | The multipath parameter evaluation method that real value parallel factor decomposes |
CN108170639A (en) * | 2017-12-26 | 2018-06-15 | 云南大学 | Tensor CP based on distributed environment decomposes implementation method |
CN111819579A (en) * | 2018-08-03 | 2020-10-23 | 谷歌有限责任公司 | Distribution tensor calculation across computing devices |
CN112799852A (en) * | 2021-04-12 | 2021-05-14 | 北京一流科技有限公司 | Multi-dimensional SBP distributed signature decision system and method for logic node |
WO2022100345A1 (en) * | 2020-11-13 | 2022-05-19 | 中科寒武纪科技股份有限公司 | Processing method, processing apparatus, and related product |
WO2022151950A1 (en) * | 2021-01-13 | 2022-07-21 | 华为技术有限公司 | Tensor processing method, apparatus and device, and computer readable storage medium |
CN116186522A (en) * | 2023-04-04 | 2023-05-30 | 石家庄学院 | Big data core feature extraction method, electronic equipment and storage medium |
CN111033500B (en) * | 2017-09-13 | 2023-07-11 | 赫尔实验室有限公司 | Systems, methods, and media for sensor data fusion and reconstruction |
2015-10-27 | CN CN201510708583.8A patent/CN105260554A/en active Pending
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107229966A (en) * | 2016-03-25 | 2017-10-03 | 阿里巴巴集团控股有限公司 | A kind of model data update method, apparatus and system |
CN105931256A (en) * | 2016-06-03 | 2016-09-07 | 中国地质大学(武汉) | CUDA (compute unified device architecture)-based large-format remote sensing image fast segmentation method |
CN107801149A (en) * | 2017-08-25 | 2018-03-13 | 长江大学 | The multipath parameter evaluation method that real value parallel factor decomposes |
CN107801149B (en) * | 2017-08-25 | 2020-02-18 | 长江大学 | Multipath parameter estimation method for real value parallel factorization |
CN111033500B (en) * | 2017-09-13 | 2023-07-11 | 赫尔实验室有限公司 | Systems, methods, and media for sensor data fusion and reconstruction |
CN108170639B (en) * | 2017-12-26 | 2021-08-17 | 云南大学 | Tensor CP decomposition implementation method based on distributed environment |
CN108170639A (en) * | 2017-12-26 | 2018-06-15 | 云南大学 | Tensor CP based on distributed environment decomposes implementation method |
CN111819579B (en) * | 2018-08-03 | 2022-02-08 | 谷歌有限责任公司 | Method, system, and medium for distributed tensor computation across computing devices |
CN111819579A (en) * | 2018-08-03 | 2020-10-23 | 谷歌有限责任公司 | Distribution tensor calculation across computing devices |
WO2022100345A1 (en) * | 2020-11-13 | 2022-05-19 | 中科寒武纪科技股份有限公司 | Processing method, processing apparatus, and related product |
WO2022151950A1 (en) * | 2021-01-13 | 2022-07-21 | 华为技术有限公司 | Tensor processing method, apparatus and device, and computer readable storage medium |
CN112799852B (en) * | 2021-04-12 | 2021-07-30 | 北京一流科技有限公司 | Multi-dimensional SBP distributed signature decision system and method for logic node |
CN112799852A (en) * | 2021-04-12 | 2021-05-14 | 北京一流科技有限公司 | Multi-dimensional SBP distributed signature decision system and method for logic node |
CN116186522A (en) * | 2023-04-04 | 2023-05-30 | 石家庄学院 | Big data core feature extraction method, electronic equipment and storage medium |
CN116186522B (en) * | 2023-04-04 | 2023-07-18 | 石家庄学院 | Big data core feature extraction method, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105260554A (en) | GPU cluster-based multidimensional big data factorization method | |
Patania et al. | Topological analysis of data | |
Chen et al. | 64-qubit quantum circuit simulation | |
Lončar et al. | OpenMP, OpenMP/MPI, and CUDA/MPI C programs for solving the time-dependent dipolar Gross–Pitaevskii equation | |
Satarić et al. | Hybrid OpenMP/MPI programs for solving the time-dependent Gross–Pitaevskii equation in a fully anisotropic trap | |
US10484479B2 (en) | Integration of quantum processing devices with distributed computers | |
Pichon et al. | Sparse supernodal solver using block low-rank compression: Design, performance and analysis | |
Verstraete et al. | Criticality, the area law, and the computational power of projected entangled pair states | |
Hofmann et al. | Analytical characterization of the genuine multiparticle negativity | |
Daas et al. | Parallel algorithms for tensor train arithmetic | |
Li et al. | Faster tensor train decomposition for sparse data | |
DiCarlo et al. | Linear algebraic representation for topological structures | |
CN101086729A (en) | A dynamic reconfigurable high-performance computing method and device based on FPGA | |
Chen et al. | Pyomo. GDP: Disjunctive models in python | |
Messer et al. | MiniApps derived from production HPC applications using multiple programing models | |
Chen et al. | A hybrid GPU/CPU FFT library for large FFT problems | |
D’Azevedo et al. | Parallel LU factorization on GPU cluster | |
Luo et al. | Fractional chaotic maps with q–deformation | |
Sowkuntla et al. | MapReduce based improved quick reduct algorithm with granular refinement using vertical partitioning scheme | |
Li et al. | Efficient composing rough approximations for distributed data | |
Liu et al. | Research on k-means algorithm based on cloud computing | |
Li et al. | 2PCP: Two-phase CP decomposition for billion-scale dense tensors | |
Zhou et al. | Quantum multidimensional color images similarity comparison | |
Messer et al. | Developing miniapps on modern platforms using multiple programming models | |
Huang | Frame-groups based fractal video compression and its parallel implementation in Hadoop cloud computing environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160120