CN106407561A - A division method of the parallel GPDT algorithm on a multi-core SOC - Google Patents

A division method of the parallel GPDT algorithm on a multi-core SOC Download PDF

Info

Publication number
CN106407561A
CN106407561A (application CN201610832540.5A; granted as CN106407561B)
Authority
CN
China
Prior art keywords
core
matrix
row
algorithm
main core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610832540.5A
Other languages
Chinese (zh)
Other versions
CN106407561B (en)
Inventor
韩军
轩四中
袁腾跃
曾晓洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University
Priority to CN201610832540.5A
Publication of CN106407561A
Application granted
Publication of CN106407561B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/30 Circuit design
    • G06F30/39 Circuit design at the physical level
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Geometry (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to the technical field of integrated-circuit design and provides a method for partitioning the parallel GPDT algorithm across a multi-core SoC. The parallel GPDT algorithm consists of two nested iteration loops: the inner iteration solves the QP subproblem on the working set, and the outer iteration updates the working set. Regarding the critical paths for computing speed, the critical path of the outer loop is the gradient update, and the critical path of the inner loop is the matrix-vector computation that follows each projection; both of these matrix operations require multi-core parallel processing. The remaining operations can only be carried out serially on the main core, including the gradient-projection operation, realized with the Dai-Fletcher algorithm, and the working-set update, realized by introducing the quicksort algorithm. The vector obtained when the computation finishes gives the support vectors of the training data of the GPDT algorithm.

Description

A method for partitioning the parallel GPDT algorithm on a multi-core SoC
Technical field
The invention belongs to the technical field of integrated-circuit design, and specifically relates to a method for partitioning the parallel GPDT algorithm on a multi-core SoC.
Background technology
The GPDT algorithm is a decomposition method for the original QP problem, proposed by Zanni et al. The number of working-set variables in each of its iterations is on the order of 10^2 to 10^3, so the algorithm reaches convergence after only a few iterations. Although the amount of computation per iteration is relatively large, the complex calculations can be distributed to multiple processors by means of parallelization, thereby obtaining a faster training speed.
The original expression of the support-vector-machine problem is the quadratic program

min f(α) = (1/2) αᵀGα − αᵀ1,  subject to yᵀα = 0, 0 ≤ α_i ≤ C, i = 1, …, l,

where G is an l × l matrix, referred to as the kernel matrix, with entries G_ij = y_i y_j K(x_i, x_j), and K(·, ·) is the kernel function.
The decomposition of the problem splits the vector to be solved, α, into two parts: one part is the working set, denoted B, and the other part is the non-working set, denoted N. Correspondingly, the vector to be solved, the sample-label vector and the kernel matrix are all decomposed into the following form:

α = [α_B, α_N],  y = [y_B, y_N],  G = [[G_BB, G_BN], [G_NB, G_NN]].
After simplification, the decomposed QP subproblem is converted into the following form:

min f_B(α_B) = (1/2) α_Bᵀ G_BB α_B + (G_BN α_N − 1)ᵀ α_B,  subject to y_Bᵀ α_B = −y_Nᵀ α_N, 0 ≤ α_i ≤ C, i ∈ B.
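The B/N block decomposition above can be sketched with NumPy index arithmetic; the matrix and the index sets below are illustrative stand-ins, not data from the patent:

```python
import numpy as np

l = 6
G = np.arange(l * l, dtype=float).reshape(l, l)   # stand-in for the kernel matrix
alpha = np.ones(l)
B = [0, 2, 5]                                     # working-set indices
N = [i for i in range(l) if i not in B]           # non-working-set indices

G_BB = G[np.ix_(B, B)]   # n_B x n_B block used by the QP subproblem
G_BN = G[np.ix_(B, N)]   # couples the working set to the fixed variables
alpha_B, alpha_N = alpha[B], alpha[N]

# The subproblem only optimizes alpha_B; alpha_N stays fixed during the solve,
# entering only through the linear term G_BN @ alpha_N.
linear_term = G_BN @ alpha_N
```

`np.ix_` builds the cross-product index mesh, so `G[np.ix_(B, B)]` extracts exactly the G_BB block of the decomposition.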
The solution procedure of the QP subproblem is divided into four main steps; the final result is obtained by loop iteration, and the termination condition of the iteration is the KKT (Karush-Kuhn-Tucker) condition.
The concrete steps of the algorithm are as follows:
Step 1: Initialization.
Initialize the vector α to 0, then select two integers n_B and n_C such that 0 < n_C ≤ n_B ≤ l and n_C is even. Randomly select n_B elements from α to form the working set B; the remaining elements form the non-working set N. Set the outer iteration counter k = 1.
Step 2: Solve the QP subproblem.
Let u denote the variable of the QP subproblem; its solution becomes the new working-set part α_B^(k), while the non-working-set part is kept unchanged.
Step 2.1: Initialization.
Let g^(0) denote the initial gradient, and set u^(0) to the current working-set part of α. Choose a descent step length ρ_0 ∈ [ρ_min, ρ_max], where ρ_min and ρ_max are preset values satisfying 0 < ρ_min < ρ_max. Set the inner iteration counter k′ = 0.
Step 2.2: Projection.
Let P_Ω(·) denote projection onto the feasible region Ω. First judge whether the vector u^(k′) meets the termination condition; if so, terminate the iteration, otherwise compute the gradient-descent direction with the following formula:

d^(k′) = P_Ω(u^(k′) − ρ_k′ g^(k′)) − u^(k′).
Step 2.3: Matrix multiplication.
Compute the matrix-vector product

z^(k′) = A d^(k′),

where A = G_BB is the n_B × n_B working-set block of the kernel matrix.
Step 2.4: Line search.
Compute the coefficient λ_k′ with a line-search method and update the vector to be solved:

u^(k′+1) = u^(k′) + λ_k′ d^(k′).
Step 2.5: Update.
Compute the quantities of iteration k′ + 1; in particular, the gradient can be updated incrementally as

g^(k′+1) = g^(k′) + λ_k′ z^(k′).

Then compute the new gradient-descent step length ρ_{k′+1}, set the inner iteration counter k′ = k′ + 1, and return to step 2.2.
Step 3: Gradient update.
After the k-th iteration, update the gradient of the objective function with respect to the vector α:

g^(k) = g^(k−1) + G_LB (α_B^(k) − α_B^(k−1)),

where G_LB is the l × n_B block of G formed by the columns corresponding to the working set B. After the update, if α^(k) meets the KKT condition, terminate the iteration; otherwise proceed to the next step.
Step 4: Working-set update.
First solve the selection problem that identifies a sparse steepest-descent direction with at most n_C nonzero components. Then take out the α_i corresponding to the nonzero terms of the result to form a new working set B̄; the number of nonzero terms is at most n_C. Next, take elements out of the old working set B and fill them into B̄ until B̄ reaches n_B elements. Finally set B = B̄ and k = k + 1, then return to step 2.
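A minimal sketch of this refill procedure is given below. Python's built-in `sorted` stands in for the quicksort that the description later mentions, and scoring candidates by |α_i| is a simplifying assumption in place of the actual selection subproblem:

```python
def update_working_set(alpha, old_B, n_B, n_C):
    # Pick at most n_C indices with nonzero alpha_i, preferring large |alpha_i|
    # (a simplified stand-in for the patent's selection subproblem) ...
    nonzero = [i for i, a in enumerate(alpha) if a != 0.0]
    new_B = sorted(nonzero, key=lambda i: -abs(alpha[i]))[:n_C]
    # ... then top the set up from the old working set until it has n_B elements.
    for i in old_B:
        if len(new_B) >= n_B:
            break
        if i not in new_B:
            new_B.append(i)
    return new_B

alpha = [0.0, 0.9, 0.0, 0.4, 0.7, 0.0]
B = update_working_set(alpha, old_B=[0, 2, 3, 5], n_B=4, n_C=2)
```

Sorting makes the "take the largest nonzero terms first" rule explicit; any O(n log n) sort serves the same role the quicksort plays in the serial main-core step.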
The advantage of the GPDT algorithm is that the number of working-set elements in each iteration can reach the order of 10^3, so the algorithm converges rapidly; however, since a single iteration contains a large number of matrix operations, the amount of computation is very large.
Content of the invention
It is an object of the present invention to provide a method for partitioning the parallel GPDT algorithm on a multi-core SoC, so as to greatly shorten the computation time of a single iteration and thereby improve the running efficiency of the whole training algorithm.
The overall idea of the partitioning method provided by the present invention is to distribute the n_B elements of the working set B evenly over N processors, with each processor holding a local backup of the training data, so that the matrix operations can easily be assigned to the N processors for execution. From the basic principle of the algorithm it can be seen that the parallelism of the algorithm is concentrated mainly in step 2 and step 3, the two steps in which the matrix operations are relatively concentrated.
The partitioning method of the parallel GPDT algorithm on the multi-core SoC comprises two parts, row decomposition and column decomposition, introduced in detail below.
Row decomposition method, comprising three steps: decompose the matrix by rows, compute in parallel, splice the results.
In the initialization of step 2.1, the initial gradient requires the product A u, where A is an n_B × n_B matrix and u is an n_B × 1 column vector, so the result A u is also an n_B × 1 column vector. First, matrix A is decomposed by rows into blocks A_n1, …, A_nN, where each A_ni is an n_p × n_B matrix. Then each core computes the value of A_ni u. Finally, the main core splices the results of the cores together; the concatenation [A_n1 u; …; A_nN u] is the result of A u.
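The row decomposition can be sketched as follows; `np.array_split` stands in for the distribution of row blocks A_ni to the cores, and the final concatenation plays the role of the splice performed by the main core:

```python
import numpy as np

def matvec_by_rows(A, u, n_cores):
    # Split A into row blocks A_ni, one per core ...
    row_blocks = np.array_split(A, n_cores, axis=0)
    # ... each "core" computes its fragment A_ni @ u independently ...
    fragments = [block @ u for block in row_blocks]
    # ... and the main core splices the fragments back together.
    return np.concatenate(fragments)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
u = rng.standard_normal(8)
assert np.allclose(matvec_by_rows(A, u, n_cores=4), A @ u)
```

`array_split` tolerates block counts that do not divide n_B evenly, matching the requirement that the n_p per core differ only slightly.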
Column decomposition method, comprising three steps: decompose the matrix by columns, compute in parallel, accumulate the results.
In the gradient update of step 3, the product G_LB Δα_B is computed, where G_LB is an l × n_B matrix and Δα_B is an n_B × 1 column vector, so their product is an l × 1 column vector. Since G_LB has l rows and n_B columns, the matrix is first decomposed by columns into G_1, …, G_N, and the vector Δα_B is decomposed by rows into Δα_1, …, Δα_N. Then each core computes G_p Δα_p. Finally, the main core accumulates the results of the cores; the sum Σ_p G_p Δα_p is the value of G_LB Δα_B.
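A corresponding sketch of the column decomposition, in which each "core" receives a column block of G_LB together with the matching row fragment of the vector, and the main core accumulates the l × 1 partial products:

```python
import numpy as np

def matvec_by_cols(G_LB, d_alpha, n_cores):
    # Split G_LB into column blocks G_p and d_alpha into matching row fragments.
    col_blocks = np.array_split(G_LB, n_cores, axis=1)
    vec_frags = np.array_split(d_alpha, n_cores)
    # Each "core" computes an l x 1 partial product G_p @ d_alpha_p ...
    partials = [Gp @ dp for Gp, dp in zip(col_blocks, vec_frags)]
    # ... and the main core accumulates (sums) them into the final result.
    return np.sum(partials, axis=0)

rng = np.random.default_rng(1)
G_LB = rng.standard_normal((10, 6))   # l x n_B
d_alpha = rng.standard_normal(6)      # n_B x 1
assert np.allclose(matvec_by_cols(G_LB, d_alpha, n_cores=3), G_LB @ d_alpha)
```

Unlike the row decomposition, every partial result here is already full length l, which is why the combining step is an accumulation rather than a splice.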
According to the above partitioning method, the concrete steps of the improved parallel GPDT algorithm (i.e. the parallel GPDT algorithm based on partitioning over the multi-core SoC) are as follows:
Step 1: First, on the main core, initialize the vector α to 0; select two integers n_B and n_C such that 0 < n_C ≤ n_B ≤ l and n_C is even; randomly select n_B elements from α to form the working set B; set the outer iteration counter k = 1.
Step 2: Solve the QP subproblem.
2.1 On the main core, set the initial gradient g^(0) and the descent step length ρ_0 ∈ [ρ_min, ρ_max], where ρ_min and ρ_max are preset values satisfying 0 < ρ_min < ρ_max; set the inner iteration counter k′ = 0.
2.2 Then compute the row fragments A_ni u of the initial gradient in parallel on the cores, and splice the results on the main core. Here A is an n_B × n_B matrix and u is an n_B × 1 column vector, so A u is also an n_B × 1 column vector. First, matrix A is decomposed by rows into blocks A_ni, each an n_p × n_B matrix; then each core computes the value of A_ni u; finally, the main core splices the partial results together, which gives the result of A u.
2.3 Complete the projection onto the feasible region Ω on the main core, and judge whether the vector u^(k′) meets the termination condition; if so, terminate the iteration, otherwise compute the gradient-descent direction d^(k′).
2.4 Then compute the row fragments of the matrix z^(k′) = A d^(k′) in parallel on the cores, using the same row partition of matrix A as in step 2.2; the main core then splices the partial results together, which gives the result of A d^(k′).
2.5 Then, on the main core, first compute the line-search coefficient λ_k′, and then the new step length ρ_{k′+1}, the vector u^(k′+1), etc.; set the inner iteration counter k′ = k′ + 1. Judge whether u^(k′+1) meets the KKT termination condition; if satisfied, proceed to the next step; otherwise return to step 2.2 and compute a new gradient-descent direction.
Step 3: After the solution α_B^(k) of the QP subproblem is obtained, the gradient needs to be updated. Compute the column fragments G_p Δα_p of the gradient increment in parallel on the cores, then accumulate the results on the main core to obtain the new gradient.
That is, compute G_LB Δα_B, where G_LB is an l × n_B matrix and Δα_B is an n_B × 1 column vector, so their product is an l × 1 column vector. Since G_LB has l rows and n_B columns, the matrix is first decomposed by columns into G_1, …, G_N and the vector by rows into Δα_1, …, Δα_N; then each core computes G_p Δα_p; finally, the main core accumulates the partial results, which gives the value of G_LB Δα_B.
Step 4: The main core judges whether α^(k) meets the KKT condition. If satisfied, the computation ends; otherwise the working set is updated on the main core (the concrete update procedure is introduced in the background-art section), k = k + 1 is set, and the algorithm returns to step 2.
This parallel GPDT algorithm mainly comprises two nested iteration loops: the inner iteration solves the QP subproblem on the working set B, and the outer iteration updates B. In terms of the critical paths for computing speed, the critical path of the inner loop is the computation of the vector z^(k′) after each projection, and the critical path of the outer loop is the gradient update; these two matrix operations need to be distributed over the cores for parallel processing, by "row decomposition" and "column decomposition" respectively. The remaining operations are executed serially on the main core and consist mainly of two parts: first, the projection of the gradient, implemented with the Dai-Fletcher algorithm; second, the update of the working set B, which fills the elements of the new working set efficiently by introducing the quicksort algorithm.
Brief description
Fig. 1: flow of the parallel GPDT algorithm.
Fig. 2: matrix multiplication decomposed by rows.
Fig. 3: matrix multiplication decomposed by columns.
Specific embodiment
The invention is further described below in conjunction with the accompanying drawings.
As shown in Fig. 1, the present invention parallelizes the computation of the initial gradient A u in step 2, the matrix-vector product z^(k′) in the inner loop, and the gradient increment G_LB Δα_B in the outer loop, distributing these operations over multiple processors, which greatly reduces the time spent on matrix operations in each iteration. The other parts of the algorithm, including the projection of the gradient and the update of the working set, remain serial. According to Amdahl's law, the speed-up of a parallelized algorithm depends not only on the speed-up of the parallelizable part but also on the proportion of the runtime that is parallelizable; therefore, as the training data grows, the runtime proportion of the parallelizable part increases, and the overall speed-up of the algorithm gradually approaches the speed-up of the parallelized part.
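Amdahl's law as used here can be checked numerically; the parallel fractions and core count below are illustrative values, not measurements from the patent:

```python
def amdahl_speedup(parallel_fraction, n_cores):
    # S = 1 / ((1 - p) + p / N): the serial part is unaffected,
    # while the parallelizable part is divided across N cores.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

# As the parallelizable share of the runtime grows, the overall speed-up
# approaches the N-fold speed-up of the parallelized part.
s_small = amdahl_speedup(0.50, 8)   # half the runtime parallelizable
s_large = amdahl_speedup(0.95, 8)   # matrix operations dominate the runtime
```

This mirrors the claim in the text: larger training data increases the matrix-operation (parallelizable) share, pulling the overall speed-up toward that of the parallel part.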
1. The overall idea of the parallel partition is to distribute the n_B elements of the working set B evenly over N processors. The working-set indices assigned to processor p are defined as the set I_p, p = 1, 2, …, N; the sets I_p obtained after the distribution satisfy I_p ∩ I_q = ∅ for p ≠ q,
i.e. the sets assigned to the processors are mutually disjoint. Assume that processor p is assigned n_p working-set elements, with n_1 + n_2 + … + n_N = n_B. Each processor holds a local backup of the training data, so the matrix operations can easily be assigned to the N processors for execution; the parallelism of the algorithm is concentrated mainly in step 2 and step 3.
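The even, disjoint distribution of working-set indices into the sets I_p can be sketched as follows; a strided slice is one simple way to make the set sizes n_p differ by at most one:

```python
def partition_working_set(B, n_processors):
    # Distribute the n_B indices of working set B over N processors as evenly
    # as possible; the resulting sets I_p are pairwise disjoint and cover B.
    return [B[p::n_processors] for p in range(n_processors)]

B = list(range(10))          # stand-in working-set indices, n_B = 10
I = partition_working_set(B, 4)
sizes = [len(s) for s in I]  # the sizes n_p differ by at most one
```

Disjointness and full coverage are exactly the conditions the text imposes on the sets I_p, and the test below checks both.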
2. Parallelization of the initial-gradient computation of the Dai-Fletcher algorithm. The initial gradient requires the product A u, where A denotes an n_B × n_B matrix and u denotes an n_B × 1 column vector, so the result A u is also an n_B × 1 column vector. As divided in Fig. 2, matrix A is decomposed by rows; each processor is assigned a fragment of n_p rows of A and multiplies it with the column vector u; the partial results are finally spliced together to obtain the final result.
In the same manner, the computation of the matrix-vector product z^(k′) = A d^(k′) in step 2.3 is decomposed with the identical method.
3. Parallelization of the gradient update. The formula of the gradient update is:
g^(k) = g^(k−1) + G_LB Δα_B, where Δα_B = α_B^(k) − α_B^(k−1).
Here G_LB is an l × n_B matrix and Δα_B denotes the difference of the vector α_B between two adjacent iterations, so their product G_LB Δα_B is an l × 1 column vector. Since G_LB has l rows and n_B columns, the dividing mode here is to decompose the matrix G_LB by columns, as shown in Fig. 3. The column fragment G_np assigned to each processor is a matrix of l rows and n_p columns; it is multiplied with the row fragment Δα_np of the column vector, and the result G_np Δα_np is an l × 1 column vector. The results computed by the processors therefore need to be accumulated to obtain the final result: Σ_p G_np Δα_np = G_LB Δα_B.
4. The other parts of the algorithm, including the projection of the gradient and the update of the working set, are still executed serially on the main core; the overall flow of the improved parallel GPDT algorithm is shown in Fig. 1.

Claims (1)

1. A method for partitioning the parallel GPDT algorithm on a multi-core SoC, characterized in that it comprises the following concrete steps:
Step 1: First, on the main core, initialize the vector α to 0; select two integers n_B and n_C such that 0 < n_C ≤ n_B ≤ l and n_C is even; randomly select n_B elements from α to form the working set B; set the outer iteration counter k = 1;
Step 2: Solve the QP subproblem:
2.1 On the main core, set the initial gradient g^(0) and the descent step length ρ_0 ∈ [ρ_min, ρ_max], where ρ_min and ρ_max are preset values satisfying 0 < ρ_min < ρ_max; set the inner iteration counter k′ = 0;
2.2 Then compute the row fragments A_ni u of the initial gradient in parallel on the cores and splice the results on the main core, where A is an n_B × n_B matrix and u is an n_B × 1 column vector, so that A u is also an n_B × 1 column vector: first, matrix A is decomposed by rows into blocks A_ni, each an n_p × n_B matrix; then each core computes the value of A_ni u; finally, the main core splices the partial results together, which gives the result of A u;
2.3 Complete the projection onto the feasible region Ω on the main core and judge whether the vector u^(k′) meets the termination condition; if so, terminate the iteration; otherwise compute the gradient-descent direction d^(k′);
2.4 Then compute the row fragments of the matrix z^(k′) = A d^(k′) in parallel on the cores, using the same row partition of matrix A as in step 2.2; the main core then splices the partial results together, which gives the result of A d^(k′);
2.5 Then, on the main core, first compute the line-search coefficient λ_k′, and then the new step length ρ_{k′+1}, the vector u^(k′+1), etc.; set the inner iteration counter k′ = k′ + 1; judge whether u^(k′+1) meets the KKT termination condition; if satisfied, proceed to the next step; otherwise return to step 2.2 and compute a new gradient-descent direction;
Step 3: After the solution α_B^(k) of the QP subproblem is obtained, the gradient needs to be updated: compute the column fragments G_p Δα_p of the gradient increment in parallel on the cores, then accumulate the results on the main core to obtain the new gradient; that is, compute G_LB Δα_B, where G_LB is an l × n_B matrix and Δα_B is an n_B × 1 column vector, so that their product is an l × 1 column vector; since G_LB has l rows and n_B columns, the matrix is first decomposed by columns into G_1, …, G_N and the vector by rows into Δα_1, …, Δα_N; then each core computes G_p Δα_p; finally, the main core accumulates the partial results, which gives the value of G_LB Δα_B;
Step 4: The main core judges whether α^(k) meets the KKT condition; if satisfied, the computation ends; otherwise the working set is updated on the main core, k = k + 1 is set, and the method returns to step 2.
CN201610832540.5A 2016-09-19 2016-09-19 Method for dividing parallel GPDT algorithm on multi-core SOC Active CN106407561B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610832540.5A CN106407561B (en) 2016-09-19 2016-09-19 Method for dividing parallel GPDT algorithm on multi-core SOC

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610832540.5A CN106407561B (en) 2016-09-19 2016-09-19 Method for dividing parallel GPDT algorithm on multi-core SOC

Publications (2)

Publication Number Publication Date
CN106407561A 2017-02-15
CN106407561B 2020-07-03

Family

ID=57997635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610832540.5A Active CN106407561B (en) 2016-09-19 2016-09-19 Method for dividing parallel GPDT algorithm on multi-core SOC

Country Status (1)

Country Link
CN (1) CN106407561B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106897163A (en) * 2017-03-08 2017-06-27 郑州云海信息技术有限公司 A kind of algebra system method for solving and system based on KNL platforms
CN110231958A (en) * 2017-08-31 2019-09-13 北京中科寒武纪科技有限公司 A kind of Matrix Multiplication vector operation method and device
CN115619890A (en) * 2022-12-05 2023-01-17 山东省计算中心(国家超级计算济南中心) Tomography method and system for solving linear equation set based on parallel random iteration

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080215848A1 (en) * 2005-05-13 2008-09-04 John Te-Jui Sheu Method and System For Caching Address Translations From Multiple Address Spaces In Virtual Machines
CN102844762A (en) * 2010-01-22 2012-12-26 意法爱立信有限公司 Secure environment management during switches between different modes of multicore systems
CN104461467A (en) * 2013-09-25 2015-03-25 广州中国科学院软件应用技术研究所 Method for increasing calculation speed of SMP cluster system through MPI and OpenMP in hybrid parallel mode
CN104820657A (en) * 2015-05-14 2015-08-05 西安电子科技大学 Inter-core communication method and parallel programming model based on embedded heterogeneous multi-core processor
US20150323975A1 (en) * 2014-05-12 2015-11-12 Qualcomm Innovation Center, Inc. SYNCHRONIZATION OF ACTIVITY OF MULTIPLE SUBSYSTEMS IN A SoC TO SAVE STATIC POWER
CN105550161A (en) * 2015-12-16 2016-05-04 浪潮(北京)电子信息产业有限公司 Parallel logic regression method and system for heterogeneous systems


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
G. Zanghirati et al., "A parallel solver for large quadratic programs in training support vector machines", Parallel Computing
Thomas Serafini et al., "Gradient projection methods for quadratic programs and applications in training support vector machines", Optimization Methods and Software
Wen Yimin et al., "A survey of algorithms for support vector machines handling large-scale problems", Computer Science
Cao Dan, "Research on key technologies of a multi-core SoC platform for wireless security", China Master's Theses Full-text Database, Information Science and Technology


Also Published As

Publication number Publication date
CN106407561B (en) 2020-07-03


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant