CN106407561B - Method for dividing parallel GPDT algorithm on multi-core SOC - Google Patents
- Publication number
- CN106407561B (grant of application CN201610832540.5A)
- Authority
- CN
- China
- Prior art keywords
- matrix
- core
- algorithm
- parallel
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/30—Circuit design
- G06F30/39—Circuit design at the physical level
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Abstract
The invention belongs to the technical field of integrated circuit design and relates to a method for partitioning the parallel GPDT algorithm on a multi-core SoC. The parallel GPDT algorithm consists of two nested iteration layers: the inner iteration solves the QP subproblem on the working set, and the outer iteration updates the working set. On the critical path for speed, the bottleneck of the outer loop is the gradient update, and the bottleneck of the inner loop is the matrix-vector product computed after each projection; these two matrix operations must be processed in parallel across the cores. The remaining operations need only be executed serially on the main core, including the gradient projection realized with the Dai-Fletcher algorithm and the working-set update realized by introducing a quicksort algorithm. The vector obtained at the end of the computation contains the support vectors for the GPDT training data.
Description
Technical Field
The invention belongs to the technical field of integrated circuit design, and particularly relates to a method for dividing a parallel GPDT algorithm on a multi-core SoC.
Background
The GPDT algorithm is a decomposition method for the original QP problem proposed by Zanni et al. The number of working-set variables in each iteration is on the order of 10^2 to 10^3, so the algorithm converges after only a few iterations. Although the computation per iteration is large, the complex calculations can be distributed over multiple processors in parallel, yielding a faster training speed.
The original expression of the support vector machine problem is:

min_α F(α) = (1/2) αᵀGα − αᵀ1,  s.t. yᵀα = 0, 0 ≤ α ≤ C1,

where G is the l × l kernel matrix with entries G_ij = y_i y_j K(x_i, x_j), y is the sample label vector and C is the penalty parameter.
The decomposition splits the vector of unknowns α into two parts: a working set, denoted B, and a non-working set, denoted N. The vector of unknowns, the sample label vector and the kernel matrix in the formula are decomposed into the following form:

α = [α_B; α_N],  y = [y_B; y_N],  G = [G_BB, G_BN; G_NB, G_NN].
After simplification, the decomposed QP subproblem takes the following form:

min_{α_B} (1/2) α_Bᵀ G_BB α_B + (G_BN α_N − 1)ᵀ α_B,  s.t. y_Bᵀ α_B = −y_Nᵀ α_N, 0 ≤ α_B ≤ C1.
The QP subproblem is solved in four main steps; the final result is obtained by loop iteration, and the iteration terminates when the KKT (Karush-Kuhn-Tucker) conditions are satisfied.
The specific steps of the algorithm are as follows:
step 1: and (5) initializing.
Will vectorInitialized to 0 and then two integers n are selectedBAnd nCLet 0 be equal to or less than nC≤nB≤1,nCIs even number, slave vectorIn (1) random selection of nBForming a working set B by one element, forming a non-working set N by the rest elements, and enabling the outer layer iteration number k to be 1;
step 2: and solving the QP subproblem.
Step 2.1: initialization
Order toRepresenting an initial gradient, and orderStep down ρ0∈[ρmin,ρmax],ρminAnd ρmaxIs a predetermined value and satisfies 0<ρmin<ρmaxMaking the inner layer iteration number k' equal to 0;
Step 2.2: projection.
Let P_Ω(·) denote the projection onto the feasible region Ω. First check whether the vector u^(k') satisfies the termination condition; if so, end the iteration; otherwise compute the gradient-descent direction using:

d^(k') = P_Ω(u^(k') − ρ_k' ∇f(u^(k'))) − u^(k').
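The projected step above can be sketched in a few lines, assuming for illustration a simple box feasible region [0, C]^n in place of the full region Ω that the Dai-Fletcher method handles (both function names are illustrative, not from the patent):

```python
import numpy as np

def box_projection(v, C):
    """Project v onto the box [0, C]^n -- a simplified stand-in for P_Omega;
    the true feasible region also carries an equality constraint, which the
    Dai-Fletcher projection method accounts for."""
    return np.clip(v, 0.0, C)

def descent_direction(u, grad, rho, C):
    """d = P_Omega(u - rho * grad) - u; d = 0 signals a stationary point."""
    return box_projection(u - rho * grad, C) - u
```

When `u` is already optimal on the box, the direction comes back as the zero vector, which is exactly the termination test of step 2.2.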
step 2.3: matrix multiplication
Step 2.4: line search
Step 2.5: updating
a new gradient descent step size p is then calculatedk′+1Let k '═ k' +1 for the number of iterations, and return to step 2.2.
Step 3: gradient update.
After the update, if α satisfies the KKT conditions, the iteration ends; otherwise proceed to the next step.
Step 4: working-set update.
The following auxiliary problem is solved first:
then, the result is obtainedα corresponding to the non-zero term in (1)iTaken out to form a working setThe maximum number of non-zero terms is ncThen take out the elements from the old working set B and fill inIn (1), up toIn to nBAn element, last orderk is k +1 and then returns to step 2.
The advantage of the GPDT algorithm is that the number of working-set elements solved in each iteration can reach the order of 10^3, which lets the algorithm converge quickly; however, because of the large number of matrix operations, the computation within a single iteration is very heavy.
Disclosure of Invention
The invention aims to provide a method for partitioning the parallel GPDT algorithm on a multi-core SoC, so that the computation time of a single iteration is greatly shortened and the efficiency of the whole training algorithm is improved.
The invention provides a method for partitioning the parallel GPDT algorithm on a multi-core SoC. The general idea is to distribute the n_B elements of the working set B evenly over N processors, each of which holds a local backup of the training data, so that the matrix operations can conveniently be distributed to the N processors. From the basic principle of the algorithm, its parallelism is concentrated in steps 2 and 3, where the matrix operations are concentrated.
The method for dividing the parallel GPDT algorithm on the multi-core SoC comprises two parts: row decomposition and column decomposition; the details are as follows.
Row decomposition method. It comprises three steps: decomposing the matrix by rows, computing in parallel, and splicing the results:
In the initialization of step 2.1, the initial gradient is obtained from the product A·u, where A is an n_B × n_B matrix and u is an n_B × 1 column vector, so the result is also an n_B × 1 column vector. First, the matrix A is decomposed by rows into blocks A_np, each an n_p × n_B matrix; then each core computes A_np · u; finally, the main core splices the per-core results together to obtain the value of A·u.
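The row decomposition can be emulated serially in NumPy as a sketch (illustrative names; `np.array_split` plays the role of distributing row blocks to the N cores, and the list comprehension stands in for the per-core computation):

```python
import numpy as np

def rowwise_matvec(A, u, n_cores=4):
    """Row-decomposition sketch of A @ u: split A into row blocks
    (one block A_np per core), multiply each block by the full vector u,
    then splice the partial results together on the main core."""
    blocks = np.array_split(A, n_cores, axis=0)   # A_np blocks, n_p x n_B each
    partials = [blk @ u for blk in blocks]        # per-core products
    return np.concatenate(partials)               # splice on the main core
```

Because every core needs the whole vector u but only its own rows of A, this scheme maps naturally onto cores that each hold a local copy of the training data.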
Column decomposition method. It comprises three steps: decomposing the matrix by columns, computing in parallel, and accumulating the results:
In the gradient update of step 3, the product G_LB · Δα_B is computed, where G_LB is an l × n_B matrix and Δα_B is an n_B × 1 column vector, so the result is an l × 1 column vector. Since the matrix G_LB has l rows and n_B columns, it is first decomposed by columns into blocks G_np, each l × n_p, and Δα_B is decomposed into the matching row segments Δα_np; then each core computes G_np · Δα_np; finally, the main core accumulates the per-core results to obtain the value of G_LB · Δα_B.
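The column decomposition differs from the row case only in the final reduction: partial results are summed rather than concatenated. A serial NumPy sketch (illustrative names, same emulation caveat as above):

```python
import numpy as np

def colwise_matvec(G, dalpha, n_cores=4):
    """Column-decomposition sketch of G_LB @ dalpha: split G into column
    blocks G_np (l x n_p) and dalpha into the matching row segments,
    multiply per core, then accumulate the l x 1 partials on the main core."""
    col_blocks = np.array_split(G, n_cores, axis=1)           # G_np blocks
    bounds = np.cumsum([b.shape[1] for b in col_blocks])[:-1]
    segments = np.split(dalpha, bounds)                       # row segments
    partials = [blk @ seg for blk, seg in zip(col_blocks, segments)]
    return np.sum(partials, axis=0)                           # accumulate
```

Each core contributes a full-length l × 1 vector, so the main core must add N vectors instead of splicing disjoint segments; that is why the patent distinguishes the two decomposition modes.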
According to the partitioning method, the improved parallel GPDT algorithm (i.e., the parallel GPDT algorithm partitioned on the multi-core SoC) specifically includes the following steps:
step 1: first, a vector is initialized on a primary coreIs 0, two integers n are selectedBAnd nCLet 0 be equal to or less than nC≤nB≤1,nCIs even number, slave vectorIn (1) random selection of nBEach element forms a working set B, and the outer iteration number k is 1.
2.1 On the main core, set the initial gradient ∇f(u^(0)) and the descent step ρ_0 ∈ [ρ_min, ρ_max], where ρ_min and ρ_max are preset values satisfying 0 < ρ_min < ρ_max; set the inner iteration counter k' = 0;
2.2 The row segments of the initial gradient are then computed in parallel on the individual cores, and the partial results are spliced together on the main core:
here A is an n_B × n_B matrix and u is an n_B × 1 column vector, so the result A·u is also an n_B × 1 column vector. First, the matrix A is decomposed by rows into blocks A_np, each an n_p × n_B matrix; then each core computes A_np · u; finally, the main core splices the per-core results together to obtain the initial gradient;
2.3 Complete the projection onto the feasible region Ω on the main core and check whether the vector u^(k') satisfies the termination condition; if so, end the iteration; otherwise compute the gradient-descent direction d^(k');
2.4 Then the row segments of the matrix-vector product z^(k') are computed in parallel on the individual cores, where the matrix A is decomposed by rows in the same way as in step 2.2; the main core then splices the per-core results together to obtain z^(k');
2.5 Then, on the main core, first compute the line-search coefficient λ_k', then compute the new step ρ_{k'+1} and the new iterate u^(k'+1), and set the inner iteration counter k' = k' + 1. Check whether u^(k'+1) satisfies the KKT termination condition; if so, proceed to the next step; otherwise return to step 2.2 and compute a new gradient-descent direction.
Step 3: after the solution of the QP subproblem is obtained, the gradient must be updated. The column segments of the gradient increment are computed in parallel on the individual cores, and the results are then accumulated on the main core to produce the new gradient:
the computation is G_LB · Δα_B, where G_LB is an l × n_B matrix and Δα_B is an n_B × 1 column vector, so the product is an l × 1 column vector. Since the matrix G_LB has l rows and n_B columns, it is first decomposed by columns into blocks G_np, each l × n_p, and Δα_B is decomposed into the matching row segments Δα_np; then each core computes G_np · Δα_np; finally, the main core accumulates the per-core results to obtain the gradient increment.
Step 4: check on the main core whether α satisfies the KKT conditions. If so, the computation ends; otherwise the working set is updated on the main core (see the Background section for details), k = k + 1, and the algorithm returns to step 2.
The parallel GPDT algorithm consists of two nested iteration layers: the inner iteration solves the QP subproblem on the working set B, and the outer iteration updates B. On the critical path for speed, the bottleneck of the outer loop is the gradient update, and the bottleneck of the inner loop is the vector z^(k') computed after each projection; these two matrix operations are distributed over the cores for parallel processing, by row decomposition and by column decomposition respectively. The remaining operations are executed serially on the main core and mainly comprise two parts: the gradient projection, realized with the Dai-Fletcher algorithm, and the update of the working set B, a step that fills the new working set efficiently by introducing a quicksort algorithm.
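Putting the serial and parallel pieces together, the inner loop of steps 2.1-2.5 can be sketched as a deliberately simplified serial model: a fixed step ρ and a plain box projection stand in for the Barzilai-Borwein-style step updates and the Dai-Fletcher projection, and the line-search coefficient is fixed at 1. All names are illustrative:

```python
import numpy as np

def inner_loop(A, b, C, rho, max_iter=200, tol=1e-8):
    """Minimal model of steps 2.1-2.5: minimize 0.5*u^T A u + b^T u over the
    box [0, C]^n by projected gradient descent. In the patent, the product
    A @ u is the row-parallel part; everything else runs on the main core."""
    u = np.zeros(A.shape[0])
    for _ in range(max_iter):
        g = A @ u + b                          # gradient (row-parallel part)
        d = np.clip(u - rho * g, 0.0, C) - u   # projected descent direction
        if np.linalg.norm(d) < tol:            # termination test
            break
        u = u + d                              # line-search coefficient = 1
    return u
```

The only step worth parallelizing in this loop is the `A @ u` product, which matches the patent's claim that the post-projection matrix-vector multiplication dominates the inner-loop critical path.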
Drawings
Fig. 1 Flow of the parallel GPDT algorithm.
Fig. 2 Matrix multiplication with row decomposition.
Fig. 3 Matrix multiplication with column decomposition.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in Fig. 1, the invention distributes three computations over multiple processors for parallel processing: the product A·u inside the initial gradient, the matrix-vector product z^(k') in the inner loop, and the gradient increment in the outer loop. This greatly reduces the time spent on matrix operations in each iteration. The other parts of the algorithm remain serial operations, including the gradient projection and the working-set update. According to Amdahl's law, the speedup of a parallelized algorithm depends not only on the speedup of the parallelizable part but also on its proportion of the total work; therefore, as the training data grows, the share of running time spent in the parallelizable part increases and the overall speedup gradually approaches the speedup of the parallelizable part.
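The Amdahl's-law argument can be checked numerically (a generic illustration, not part of the patent; `f` is the parallelizable fraction of the running time and `n` the number of cores):

```python
def amdahl_speedup(f, n):
    """Overall speedup S = 1 / ((1 - f) + f/n) under Amdahl's law,
    for a parallelizable fraction f executed on n cores."""
    return 1.0 / ((1.0 - f) + f / n)
```

For example, with 8 cores a 90% parallel fraction yields roughly a 4.7x overall speedup, and as f approaches 1 the overall speedup approaches the 8x speedup of the parallel part, which is the trend the paragraph above describes for growing training sets.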
1. The general idea of the parallel partitioning is to distribute the n_B elements of the working set B evenly over N processors. Denote the set of working-set indices assigned to processor p by I_p, p = 1, 2, …, N; after the assignment the sets I_p satisfy:
I_p ∩ I_q = ∅ for p ≠ q, i.e. the index sets assigned to the processors are pairwise disjoint. If the number of working-set elements assigned to each processor is n_p, then n_1 + n_2 + … + n_N = n_B. Each processor stores a local backup of the training data, so the matrix operations can conveniently be distributed to the N processors; the parallelism of the algorithm is concentrated in steps 2 and 3.
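The index assignment can be sketched as follows (illustrative; `np.array_split` produces the even split, with block sizes n_p differing by at most one):

```python
import numpy as np

def partition_working_set(B_indices, n_procs):
    """Split the working-set indices into N pairwise-disjoint sets I_p
    whose sizes n_p sum to n_B (block sizes differ by at most one)."""
    chunks = np.array_split(np.asarray(B_indices), n_procs)
    return [list(map(int, chunk)) for chunk in chunks]
```

The disjointness and the size constraint above are exactly the two conditions the text imposes on the sets I_p.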
2. Parallel computation of the initial gradient. In the parallelized Dai-Fletcher algorithm, the initial gradient is obtained from the product A·u, where A is an n_B × n_B matrix and u is an n_B × 1 column vector, so the result is also an n_B × 1 column vector. According to Fig. 2, the matrix A is decomposed by rows: each processor is assigned a segment of n_p rows of A and multiplies it by the column vector u; the final result is obtained by splicing the partial products together:
3. Parallel computation of the gradient update. The formula for the gradient update is:
Here G_LB is an l × n_B matrix and Δα_B denotes the difference between the vectors α_B of two adjacent iterations; their product is an l × 1 column vector. Because the matrix G_LB has l rows and n_B columns, the partitioning decomposes G_LB by columns, as shown in Fig. 3. Each processor is assigned a column block G_np of G_LB, an l × n_p matrix, and multiplies it by the corresponding row segment Δα_np of the column vector; each partial product is an l-row column vector, so the per-processor results must be accumulated to obtain the final result:
4. The other parts of the algorithm, including the gradient projection and the working-set update, are still executed serially on the main core; the overall flow of the improved parallel GPDT algorithm is shown in Fig. 1.
Claims (1)
1. A method for dividing a parallel GPDT algorithm on a multi-core SoC is characterized by comprising the following specific steps:
step 1: first, a vector is initialized on a primary coreIs 0, two integers n are selectedBAnd nCLet 0 be equal to or less than nC≤nB≤1,nCIs even number, slave vectorIn (1) random selection of nBForming a working set B by the elements, and enabling the outer layer iteration number k to be 1;
2.1 On the main core, set the initial gradient ∇f(u^(0)) and the descent step ρ_0 ∈ [ρ_min, ρ_max], where ρ_min and ρ_max are preset values satisfying 0 < ρ_min < ρ_max; set the inner iteration counter k' = 0;
2.2 The row segments of the initial gradient are then computed in parallel on the individual cores, and the partial results are spliced together on the main core:
here A is an n_B × n_B matrix and u is an n_B × 1 column vector, so the result A·u is also an n_B × 1 column vector. First, the matrix A is decomposed by rows into blocks A_np, each an n_p × n_B matrix; then each core computes A_np · u; finally, the main core splices the per-core results together to obtain the initial gradient;
2.3 Complete the projection onto the feasible region Ω on the main core and check whether the vector u^(k') satisfies the termination condition; if so, end the iteration; otherwise compute the gradient-descent direction d^(k');
2.4 Then the row segments of the matrix-vector product z^(k') are computed in parallel on the individual cores, where the matrix A is decomposed by rows in the same way as in step 2.2; the main core then splices the per-core results together to obtain z^(k');
2.5 Then, on the main core, first compute the line-search coefficient λ_k', then compute the new step ρ_{k'+1} and the new iterate u^(k'+1), and set the inner iteration counter k' = k' + 1. Check whether u^(k'+1) satisfies the KKT termination condition; if so, proceed to the next step; otherwise, return to step 2.2 and compute a new gradient-descent direction;
Step 3: after the solution of the QP subproblem is obtained, the gradient must be updated; the column segments of the gradient increment are computed in parallel on the individual cores, and the results are then accumulated on the main core to produce the new gradient:
the computation is G_LB · Δα_B, where G_LB is an l × n_B matrix and Δα_B is an n_B × 1 column vector, so the product is an l × 1 column vector. Since the matrix G_LB has l rows and n_B columns, it is first decomposed by columns into blocks G_np, each l × n_p, and Δα_B is decomposed into the matching row segments Δα_np; then each core computes G_np · Δα_np; finally, the main core accumulates the per-core results to obtain the value of the gradient increment;
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610832540.5A CN106407561B (en) | 2016-09-19 | 2016-09-19 | Method for dividing parallel GPDT algorithm on multi-core SOC |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN106407561A (en) | 2017-02-15 |
| CN106407561B (en) | 2020-07-03 |
Family
ID=57997635
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610832540.5A Active CN106407561B (en) | 2016-09-19 | 2016-09-19 | Method for dividing parallel GPDT algorithm on multi-core SOC |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106407561B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106897163A (en) * | 2017-03-08 | 2017-06-27 | 郑州云海信息技术有限公司 | A kind of algebra system method for solving and system based on KNL platforms |
EP3654208A1 (en) * | 2017-08-31 | 2020-05-20 | Cambricon Technologies Corporation Limited | Chip device and related products |
CN115619890B (en) * | 2022-12-05 | 2023-04-07 | 山东省计算中心(国家超级计算济南中心) | Tomography method and system for solving linear equation set based on parallel random iteration |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102844762A (en) * | 2010-01-22 | 2012-12-26 | 意法爱立信有限公司 | Secure environment management during switches between different modes of multicore systems |
CN104461467A (en) * | 2013-09-25 | 2015-03-25 | 广州中国科学院软件应用技术研究所 | Method for increasing calculation speed of SMP cluster system through MPI and OpenMP in hybrid parallel mode |
CN104820657A (en) * | 2015-05-14 | 2015-08-05 | 西安电子科技大学 | Inter-core communication method and parallel programming model based on embedded heterogeneous multi-core processor |
CN105550161A (en) * | 2015-12-16 | 2016-05-04 | 浪潮(北京)电子信息产业有限公司 | Parallel logic regression method and system for heterogeneous systems |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7363463B2 (en) * | 2005-05-13 | 2008-04-22 | Microsoft Corporation | Method and system for caching address translations from multiple address spaces in virtual machines |
US20150323975A1 (en) * | 2014-05-12 | 2015-11-12 | Qualcomm Innovation Center, Inc. | SYNCHRONIZATION OF ACTIVITY OF MULTIPLE SUBSYSTEMS IN A SoC TO SAVE STATIC POWER |
- 2016-09-19: application CN201610832540.5A filed in CN; patent CN106407561B, status Active
Non-Patent Citations (4)
Title |
---|
A parallel solver for large quadratic programs in training support vector machines; G. Zanghirati et al.; Parallel Computing; 2003-12-31; full text *
Gradient projection methods for quadratic programs and applications in training support vector machines; T. Serafini et al.; Optimization Methods and Software; 2014-05-31; full text *
Survey of support vector machine algorithms for large-scale problems; Wen Yimin et al.; Computer Science; 2009-07-31; Vol. 36, No. 7; full text *
Research on key technologies of a multi-core SoC platform for wireless security; Cao Dan; China Master's Theses Full-text Database (Information Science and Technology); 2015-08-15; No. 8; full text *
Legal Events

| Code | Title |
|---|---|
| C06 | Publication |
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |