CN104461449B - Large integer multiplication implementation method and device based on vector instruction - Google Patents

Publication number
CN104461449B
Authority
CN
China
Prior art keywords
vector
carry
word
addition
multiplication
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410645961.8A
Other languages
Chinese (zh)
Other versions
CN104461449A (en
Inventor
林璟锵
赵原
荆继武
潘无穷
郑昉昱
向继
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Data Assurance and Communication Security Research Center of CAS
Original Assignee
Data Assurance and Communication Security Research Center of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Data Assurance and Communication Security Research Center of CAS filed Critical Data Assurance and Communication Security Research Center of CAS
Priority to CN201410645961.8A priority Critical patent/CN104461449B/en
Publication of CN104461449A publication Critical patent/CN104461449A/en
Application granted granted Critical
Publication of CN104461449B publication Critical patent/CN104461449B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

A method and device for implementing large integer multiplication based on vector instructions. The multiplicand and the multiplier of a large integer multiplication are each split into one or more vector-length integers, the products of these integers are computed, and all products are summed. When two vector-length integers are multiplied, the product vectors generated by all vector multiplication instructions are arranged in a specified order into two addition carry chains. Using the vector add-with-carry instruction, the carry produced by each vector addition is fed as the input of the next vector addition, which eliminates all addition carries inside each chain. Only two addition carries remain; adding them back yields the product of the two vector-length integers. In particular, if the multiplicand and the multiplier are both shorter than 1/n of the vector length, n integer multiplications are merged into one vector-length integer multiplication, raising the computation throughput n-fold. Based on this method, a high-speed large integer multiplication device based on the Intel Xeon Phi coprocessor is also disclosed. The invention reduces the number of instructions required for large integer multiplication, lowers computation latency, and improves computation throughput.

Description

Large integer multiplication implementation method and device based on vector instruction
Technical field
The present invention relates to the field of data encryption and decryption in computer technology, and in particular to a method and device for implementing large integer multiplication based on vector instructions.
Background
Large integer multiplication is widely used in public key cryptography computations to protect transmitted data. In cryptographic computation, large integer multiplication is commonly used to compute large integer modular multiplication, which is the basic operation of a class of public key algorithms (such as RSA and elliptic curve algorithms) and determines their computation speed. When the modulus is not known in advance, as in RSA, modular multiplication is typically implemented with the Montgomery algorithm. When the modulus can be determined in advance and is a Mersenne prime, as in elliptic curve algorithms such as SM2 and ECDSA, the large integer product can be computed first and then reduced modulo the prime with a fast reduction.
Large integer multiplication can be implemented in hardware or in software. Hardware implementations use field-programmable gate arrays (FPGA) or application-specific integrated circuits (ASIC). A hardware implementation is more flexible: the multiplier can be customized as required, for example its word length and pipeline design. However, hardware development is generally difficult, has a long development cycle, and is costly. A software implementation is programmed on a commercial processor and optimizes the algorithm for that processor's characteristics. Commercial processors fall mainly into the following classes: central processing units (CPU), digital signal processors (DSP), and graphics processors (GPU). These processors use different instruction sets and different memory organizations. CPUs and DSPs are usually single-core or multi-core processors; each core has strong computing capability, can execute different instructions, and supports a vector instruction set. A GPU is a many-core processor that typically has thousands of cores; each core has weaker computing capability, and multiple cores execute the same instruction simultaneously. What most affects the speed of large integer multiplication is the instruction set of the processor, which directly determines the number of instructions needed to complete one large integer multiplication.
Processor instructions fall into two major classes: scalar instructions and vector instructions. A scalar instruction processes one word per instruction. A vector instruction processes one vector per instruction; a vector contains multiple words, so a vector instruction performs the same operation on all words of the vector simultaneously. CPUs, including x86 and ARM, support vector instruction sets and are single instruction, multiple data (SIMD) processors. A GPU is a single instruction, multiple threads (SIMT) processor: multiple threads execute the same instruction, but the instruction executed by each thread processes only one word at a time. Moreover, CPUs and GPUs have taken different directions of development. CPUs keep strengthening their vector instructions, increasing the vector length and adding new vector instructions to raise single-core computing capability, whereas GPUs keep increasing the number of cores to raise thread parallelism and total computation throughput. Therefore, to exploit the computing capability of the CPU, the processor most commonly available to users, vector instructions must be fully used.
Over the past twenty years, x86 CPUs have continually strengthened their vector instruction sets, evolving from MMX and SSE to AVX and AVX2; AVX2 supports 256-bit integer and floating-point operations. The Xeon Phi coprocessor from Intel supports a 512-bit vector instruction set that operates on 16 single-precision floating-point numbers or 16 32-bit integers simultaneously. The vector multiplication and vector addition instructions of a vector instruction set operate on the words at the same position of two vectors. When the data of words at different positions are related, for example when an addition carry must be added to a word at a higher position, vector instructions have difficulty handling it.
Computing a large integer product requires summing multiple vector multiplication results. Each addition uses a vector addition instruction to add the words at the same position and must handle the carries these words produce. The most important tasks are to obtain and save the carry produced by each word and to add that carry into the sum vector. The carry produced by one word must be added to the word at the next higher position, which may in turn produce a new carry, so carries keep propagating toward higher words until no new carry is produced. How to handle addition carries is therefore both the key and the main difficulty in implementing large integer multiplication based on vector instructions.
The existing way to handle addition carries in large integer multiplication with vector instructions is the redundant representation: the large integer is split up and stored in the low-order bits of each word of a vector, leaving some high-order bits free. Carries produced by vector additions are stored in these high-order bits, so the word does not overflow and the carry need not be propagated to the higher word immediately. At the end, the sum vector is converted from the redundant representation back to an integer.
This method has two major drawbacks. First, converting the large integers into the redundant representation, and converting the multiplication result from the redundant representation back to an integer, requires many vector instructions. Second, because each word must reserve some high-order bits to hold the carries produced by additions, a large integer may have to be split into more words, which increases the number of vector multiplication and vector addition instructions executed.
The redundant representation therefore cannot sum vectors quickly in large integer multiplication, so its computational efficiency is relatively low.
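The redundant representation described above can be illustrated with a minimal Python sketch. The parameters are assumptions chosen for illustration only (32-bit words holding 28-bit limbs, so 4 spare bits per word); the function names are hypothetical and do not come from the patent.

```python
# Hypothetical parameters: 32-bit words that each hold a 28-bit limb,
# leaving 4 spare high-order bits to absorb deferred carries.
WORD_BITS = 32
LIMB_BITS = 28
LIMB_MASK = (1 << LIMB_BITS) - 1

def to_redundant(x, n_limbs):
    """Split integer x into n_limbs limbs of LIMB_BITS bits, lowest first."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(n_limbs)]

def add_lanes(a, b):
    """Lane-wise addition with no carry propagation between words."""
    out = [x + y for x, y in zip(a, b)]
    # The spare bits must not overflow the word.
    assert all(v < (1 << WORD_BITS) for v in out), "spare bits exhausted"
    return out

def from_redundant(limbs):
    """Convert back to an ordinary integer, resolving the deferred carries."""
    return sum(v << (LIMB_BITS * i) for i, v in enumerate(limbs))

# A 256-bit integer needs 10 limbs of 28 bits instead of 8 full 32-bit words,
# which is the second drawback named in the text.
n_limbs = -(-256 // LIMB_BITS)
```

The two conversion functions correspond to the first drawback (extra instructions for conversion), and the larger limb count corresponds to the second.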
Summary of the invention
In view of this, the invention provides a method for implementing large integer multiplication based on vector instructions, which improves the computational efficiency of large integer multiplication.
The invention also provides a device for implementing large integer multiplication based on vector instructions, which improves the computational efficiency of large integer multiplication.
To achieve the above objects, the invention is implemented as follows:
A method for implementing large integer multiplication based on vector instructions, applied to public key cryptography computation in a computer, comprising:
A. Split the multiplicand and the multiplier of the large integer multiplication into one or more vector-length integers;
B. Compute the product of each pair of vector-length integers as follows:
(1) multiply the multiplicand vector by each word of the multiplier vector in turn to obtain all product vectors;
(2) form the product vectors into a word-aligned (Type I) addition carry chain, and use the vector add-with-carry instruction to eliminate all addition carries generated inside the chain, producing a group of sum vectors and one addition carry;
(3) form the sum vectors into an addition carry chain staggered by one word (Type II), and use the vector add-with-carry instruction to eliminate all addition carries generated inside the chain, producing two sum vectors and one addition carry;
(4) add the two addition carries back into the sum vectors to obtain the product of the multiplicand vector and the multiplier vector;
Repeat steps (1) to (4) to compute the products of all vector-length integers split out in step A.
C. Sum the products of all vector-length integers.
A device for implementing large integer multiplication based on vector instructions, which implements the above method on a processor that supports the vector add-with-carry instruction and comprises a vector splitting and combining module, a vector-length integer multiplication module, and a product summation module, wherein:
The vector splitting and combining module splits the multiplicand and the multiplier each into one or more vectors, or combines n groups of integers whose lengths do not exceed 1/n of the vector length into two vectors;
The vector-length integer multiplication module comprises two submodules: a vector multiplication module and a vector addition module;
The vector multiplication module multiplies the multiplicand vector by each word of the multiplier vector to obtain all product vectors;
The vector addition module comprises a Type I addition carry chain module and a Type II addition carry chain module. It forms the product vectors into one Type I addition carry chain and computes it, forms the resulting sum vectors into one Type II addition carry chain, and finally adds the two addition carries back into the sum vectors to obtain the product of the multiplicand vector and the multiplier vector. If the multiplicand or the multiplier is split into multiple vectors, the vector-length integer multiplication module is called repeatedly;
The product summation module sums the product vectors produced by all calls to the vector-length integer multiplication module.
As can be seen from the above scheme, the invention splits the multiplicand and the multiplier of a large integer multiplication each into one or more vector-length integers, computes the products of these integers, and sums all products. When two vector-length integers are multiplied, the product vectors generated by all vector multiplication instructions are arranged in a specified order into two addition carry chains. Using the vector add-with-carry instruction, the carry produced by each vector addition is fed as the input of the next vector addition, which eliminates all addition carries inside each chain. Only two addition carries remain; adding them back yields the product of the two vector-length integers. In particular, if the multiplicand and the multiplier are both shorter than 1/n of the vector length, n integer multiplications are merged into one vector-length integer multiplication, raising the computation throughput n-fold. Based on this method, a high-speed large integer multiplication device based on the Intel Xeon Phi coprocessor is also disclosed. The invention reduces the number of instructions required for large integer multiplication, lowers computation latency, and improves computation throughput.
Brief description of the drawings
Fig. 1 is a flow chart of the method for implementing large integer multiplication based on vector instructions provided by an embodiment of the invention;
Fig. 2 is a detailed flow chart of step 102 provided by an embodiment of the invention, in which the multiplicand vector is multiplied by each word of the multiplier vector in turn;
Fig. 3 is a schematic diagram of the carry accumulation algorithm and of its simplified form provided by an embodiment of the invention;
Fig. 4 is a schematic diagram of the structure and computation of the Type I addition carry chain provided by an embodiment of the invention;
Fig. 5 is a schematic diagram of the structure and computation of the Type II addition carry chain provided by an embodiment of the invention;
Fig. 6 is a detailed flow chart of the Type I addition carry chain of step 103 provided by an embodiment of the invention;
Fig. 7 is a detailed flow chart of the Type II addition carry chain of step 104 provided by an embodiment of the invention;
Fig. 8 is a detailed flow chart of step 105 provided by an embodiment of the invention, in which the two addition carries are added back into the sum vectors;
Fig. 9 is a schematic structural diagram of the device for implementing large integer multiplication based on vector instructions provided by an embodiment of the invention;
Fig. 10 is a schematic diagram of the use of the Intel Xeon Phi mask registers in an embodiment of the invention;
Fig. 11 is a detailed flow chart of the Intel Xeon Phi based device provided by the second embodiment of the invention computing two 256-bit integer multiplications simultaneously.
Embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments.
To improve the computational efficiency of large integer multiplication, the invention does not use the redundant representation of large integers. This saves the many instructions needed to convert between the redundant representation and the integer representation and makes full use of the high-order bits of each word of a vector. In the disclosed method, the multiplicand and the multiplier of a large integer multiplication are each split into one or more vector-length integers, the products of these integers are computed, and all products are summed. When two vector-length integers are multiplied, the product vectors generated by all vector multiplication instructions are arranged in a specified order into two addition carry chains. Using the vector add-with-carry instruction, the carry produced by each vector addition is fed as the input of the next vector addition, which eliminates all addition carries inside each chain. Only two addition carries remain at the end; adding them back yields the product of the two vector-length integers. In particular, if the multiplicand and the multiplier are both shorter than 1/n of the vector length, n integer multiplications can be merged into one vector-length integer multiplication, raising the computation throughput n-fold. Based on this method, the invention also discloses a high-speed large integer multiplication device based on the Intel Xeon Phi coprocessor. The disclosed method and device greatly reduce the number of instructions required for large integer multiplication, lower computation latency, and improve computation throughput.
The large integer multiplication provided by the invention is applied to the encryption and decryption of transmitted data in computer technology.
The method provided by the invention can be carried out on a processor, for example on the Xeon Phi coprocessor from Intel, and can compute at high speed the large integer multiplications in many cryptographic computations such as SM2, ECDSA, and RSA.
For convenience of description, the following notation is defined:
A variable with an arrow denotes a vector, such as $\vec{X}$;
An array element without an arrow denotes one word of that vector; for example, X[i] denotes the i-th word of vector $\vec{X}$;
Vectors carrying different subscripts are related to one another; for example, $\vec{L_1}$ and $\vec{L_2}$ are both results of the low-order multiplication vector instruction, and $\vec{L_2}$ is one word higher than $\vec{L_1}$.
The vector length is u bits, the word length within a vector is w bits, and a vector consists of s words, so u = s*w.
Fig. 1 is a flow chart of the method for implementing large integer multiplication based on vector instructions provided by an embodiment of the invention. The specific steps are:
Step 101: split the multiplicand and the multiplier of the large integer multiplication into one or more vector-length integers;
Steps 102 to 105 compute the product of one pair of vector-length integers.
Step 102: multiply the multiplicand vector by each word of the multiplier vector in turn to obtain all product vectors;
Step 103: form the product vectors into a word-aligned (Type I) addition carry chain, and use the vector add-with-carry instruction to eliminate all addition carries generated inside the chain, producing a group of sum vectors and one addition carry;
Step 104: form the sum vectors into an addition carry chain staggered by one word (Type II), and use the vector add-with-carry instruction to eliminate all addition carries generated inside the chain, producing two sum vectors and one addition carry;
Step 105: add the two addition carries back into the sum vectors to obtain the product of the multiplicand vector and the multiplier vector;
Repeat steps 102 to 105 to compute the products of all vector-length integers split out in step 101.
Step 106: sum the products of all vector-length integers. The implementation of steps 101 to 106 is described in detail below.
The parameters used by the invention are as follows.
The vector length is u bits, the word length within a vector is w bits, and a vector consists of s words, so u = s*w. The multiplicand X is a bits long and the multiplier Y is b bits long.
Step 101 splits a multiplicand and a multiplier of arbitrary length into one or more vector-length integers. The detailed process is:
The a-bit multiplicand X is divided into m_x words, and the b-bit multiplier Y is divided into m_y words. If the highest word is shorter than w bits, its unused high-order bits are set to 0.
The m_x words of the multiplicand are divided into n_x vectors, and the m_y words of the multiplier are divided into n_y vectors. If the highest vector has fewer than s words, its unused high-order words are set to 0.
The n_x vector-length integers split from the multiplicand X are multiplied by each of the n_y vector-length integers split from the multiplier Y, for a total of n_x*n_y vector-length integer multiplications.
Step 106 adds the n_x*n_y multiplication results, each 2u bits long, according to the positions of their vectors to obtain the product of the multiplicand X and the multiplier Y.
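Steps 101 and 106 can be sketched in Python as follows. This is only an illustration under stated assumptions: Python integers stand in for the vector registers, the helper names `split` and `big_mul` are hypothetical, and the vector-length product of steps 102 to 105 is replaced here by an ordinary integer multiplication.

```python
def split(x, chunk_bits, count):
    """Split x into `count` chunks of chunk_bits bits, lowest chunk first.
    Unused high-order bits of the highest chunk are implicitly zero."""
    mask = (1 << chunk_bits) - 1
    return [(x >> (chunk_bits * i)) & mask for i in range(count)]

def big_mul(x, y, a_bits, b_bits, u=512):
    """Multiply an a_bits-bit x by a b_bits-bit y using u-bit pieces."""
    n_x = -(-a_bits // u)          # ceil(a / u) vector-length integers
    n_y = -(-b_bits // u)          # ceil(b / u) vector-length integers
    xs = split(x, u, n_x)          # step 101 for the multiplicand
    ys = split(y, u, n_y)          # step 101 for the multiplier
    total = 0
    for i, x_i in enumerate(xs):
        for j, y_j in enumerate(ys):
            partial = x_i * y_j    # stand-in for steps 102 to 105 (2u bits)
            # Step 106: add each partial product at its vector position.
            total += partial << (u * (i + j))
    return total
```

The sketch splits directly into u-bit pieces rather than first into words and then into vectors; the two are equivalent because u = s*w.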
The implementation of the vector-length integer multiplication, that is, of steps 102 to 105, is described in detail below.
Step 102 multiplies the multiplicand vector by each word of the multiplier vector in turn to obtain all product vectors. The detailed process is shown in Fig. 2.
Fig. 2 is a detailed flow chart of step 102 provided by an embodiment of the invention. As shown in the figure, the multiplicand vector $\vec{X}$ is multiplied by the vector spread from each word of the multiplier vector $\vec{Y}$. The s low-order vector multiplications yield the s low-order product vectors $\vec{L_1}$ to $\vec{L_s}$, and the s high-order vector multiplications yield the s high-order product vectors $\vec{H_1}$ to $\vec{H_s}$.
The vector instructions used in this process are the word spread instruction (SPREAD), the high-order multiplication vector instruction (MULHIGH), and the low-order multiplication vector instruction (MULLOW).
SPREAD assigns one word X[i] of vector $\vec{X}$ to all words of a result vector $\vec{Z}$, so that every word of $\vec{Z}$ has the value X[i], i.e.: $Z[j] \leftarrow X[i]$ for $1 \le j \le s$.
MULHIGH multiplies the corresponding words of vector $\vec{X}$ and vector $\vec{Y}$ and stores the high w bits of each product in the corresponding word of the high-order product vector $\vec{H}$, i.e.: $H[j] \leftarrow \lfloor X[j] \cdot Y[j] / 2^w \rfloor$ for $1 \le j \le s$.
MULLOW multiplies the corresponding words of vector $\vec{X}$ and vector $\vec{Y}$ and stores the low w bits of each product in the corresponding word of the low-order product vector $\vec{L}$, i.e.: $L[j] \leftarrow X[j] \cdot Y[j] \bmod 2^w$ for $1 \le j \le s$.
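The computation, with $\vec{X}$ the multiplicand vector and $\vec{Y}$ the multiplier vector, is:
FOR i ← 1 to s, repeat steps (1) to (3):
(1) $\vec{Z} \leftarrow \mathrm{SPREAD}(\vec{Y}, i)$
(2) $\vec{L_i} \leftarrow \mathrm{MULLOW}(\vec{X}, \vec{Z})$
(3) $\vec{H_i} \leftarrow \mathrm{MULHIGH}(\vec{X}, \vec{Z})$
This produces the s low-order product vectors $\vec{L_1}$ to $\vec{L_s}$ and the s high-order product vectors $\vec{H_1}$ to $\vec{H_s}$.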
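Step 102 can be simulated with the following Python sketch, in which a vector is modeled as a list of w-bit words with the lowest word first. The word size w = 32 and the helper `value`, which evaluates a vector at a given word offset, are assumptions added for illustration; indices are 0-based here, whereas the text counts from 1.

```python
W_BITS = 32                      # assumed word length w
MASK = (1 << W_BITS) - 1

def spread(vec, i):
    """SPREAD: copy word i of vec into every word of the result."""
    return [vec[i]] * len(vec)

def mullow(a, b):
    """MULLOW: low w bits of each word-wise product."""
    return [(x * y) & MASK for x, y in zip(a, b)]

def mulhigh(a, b):
    """MULHIGH: high w bits of each word-wise product."""
    return [(x * y) >> W_BITS for x, y in zip(a, b)]

def product_vectors(X, Y):
    """Return the s low-order and s high-order product vectors of step 102."""
    lows, highs = [], []
    for i in range(len(Y)):
        z = spread(Y, i)
        lows.append(mullow(X, z))
        highs.append(mulhigh(X, z))
    return lows, highs

def value(vec, offset=0):
    """Integer value of vec when its lowest word sits at word position offset."""
    return sum(w << (W_BITS * (offset + j)) for j, w in enumerate(vec))
```

In this model the low-order product vector with index i sits at word offset i and the high-order one at offset i + 1, so the 2s product vectors together carry the full product.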
Steps 103 to 105 quickly sum the 2s product vectors produced by step 102.
First, the fast vector summation algorithm provided by the invention is explained: the addition carry chain algorithm.
Some vector instruction sets, such as that of the Intel Xeon Phi, support a vector add-with-carry instruction. This instruction takes a carry vector as input and at the same time produces a new carry vector. The vector add-with-carry instruction (ADC) is written as:
$(\vec{Sum}, \vec{CarryOut}) \leftarrow \mathrm{ADC}(\vec{X}, \vec{Y}, \vec{CarryIn})$
The inputs of the ADC instruction are the vectors $\vec{X}$ and $\vec{Y}$ and the carry vector $\vec{CarryIn}$; the outputs are the sum vector $\vec{Sum}$ and the carry vector $\vec{CarryOut}$. The instruction adds the corresponding words of $\vec{X}$ and $\vec{Y}$ together with the carry at the corresponding position of $\vec{CarryIn}$, stores the addition results in the sum vector $\vec{Sum}$, and stores the new carries produced by all words in $\vec{CarryOut}$.
The addition performed by the instruction on the i-th word of the vectors, ADC_i, is expressed as:
(Sum[i], CarryOut[i]) ← ADC_i(X[i], Y[i], CarryIn[i]), 1 ≤ i ≤ s
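The per-word behaviour of ADC can be modeled with a short Python sketch. It assumes 32-bit words and represents the carry vector as a list of 0/1 values; it is a model of the semantics given above, not of any particular hardware instruction encoding.

```python
W_BITS = 32                      # assumed word length w
MASK = (1 << W_BITS) - 1

def adc(x, y, carry_in):
    """Model of the vector add-with-carry instruction.

    Each word is added independently: Sum[i] keeps the low w bits of
    X[i] + Y[i] + CarryIn[i], and CarryOut[i] is the overflow bit.
    No carry moves between word positions inside the instruction."""
    totals = [a + b + c for a, b, c in zip(x, y, carry_in)]
    return [t & MASK for t in totals], [t >> W_BITS for t in totals]
```

For example, `adc([MASK, 1, 0], [1, 2, MASK], [0, 1, 1])` returns the sum vector `[0, 4, 0]` and the carry vector `[1, 0, 1]`.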
Computing the sum of two vectors $\vec{X}$ and $\vec{Y}$ with this instruction can be divided into two steps:
Step one, vector addition: add the vectors $\vec{X}$ and $\vec{Y}$ with the vector add-with-carry instruction, producing the sum vector $\vec{Sum}$ and the carry vector $\vec{Carry}$, i.e.: $(\vec{Sum}, \vec{Carry}) \leftarrow \mathrm{ADC}(\vec{X}, \vec{Y}, \vec{0})$
Step two, carry processing: accumulate the carry vector $\vec{Carry}$ into the sum vector $\vec{Sum}$.
The basic rule of vector summation is that the carry produced by a word propagates to the next higher word. That is, when two vectors are added, the carry Carry[i] produced by the i-th word is added to the (i+1)-th word Sum[i+1] of the sum vector, producing a new (i+1)-th word Sum[i+1]* and a new carry Carry[i+1]*. If Carry[i+1]* is not 0, it is in turn accumulated into the (i+2)-th word Sum[i+2] of the sum vector, and so on until the carry produced is 0.
The first accumulation of the i-th carry is expressed as:
(Sum[i+1]*, Carry[i+1]*) ← ADC_{i+1}(Sum[i+1], 0, Carry[i])
The i-th carry Carry[i] of the carry vector is accumulated at most s-i times, after which the carry produced is certain to be 0. Carry[0] is accumulated at most s times.
Fig. 3 is a schematic diagram of the carry accumulation algorithm and of its simplified form provided by an embodiment of the invention.
As shown in the upper half of Fig. 3, the carry accumulation algorithm proceeds as follows:
To compute the sum of two vectors, the vector addition first yields the sum vector and the carry vector. The carry vector is then accumulated into the sum vector, and the final result is stored in a low result vector and a high result vector. The following steps are performed:
(1) assign the sum vector to the low result vector, set every word of the high result vector to 0, and save the carry vector as the working carry vector;
(2) shift the working carry vector s-1 words toward the low end and save the result as the high carry vector;
(3) add the high result vector and the high carry vector and store the result in the high result vector; this addition produces no new carry;
(4) shift the working carry vector 1 word toward the high end and save the result as the shifted carry vector;
(5) add the low result vector and the shifted carry vector, store the result in the low result vector, and save the newly produced carries as the working carry vector;
(6) check the working carry vector: if every word is 0, the algorithm ends and returns the result vectors; if at least one word is not 0, return to step (2).
The five steps (2) to (6) are called one round of carry accumulation. Because Carry[0] may be accumulated up to s times, the algorithm performs at most s rounds.
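The rounds of carry accumulation can be sketched in Python as follows. The sketch rests on an interpretation that is an assumption, since the vector symbols are not legible in the source text: the result is held in a low vector and a high vector, and the carry out of the highest word is moved into the lowest word of the high vector. All function and variable names are hypothetical.

```python
W_BITS = 32                      # assumed word length w
MASK = (1 << W_BITS) - 1

def adc(x, y, c):
    """Word-wise add with carry-in; returns (sum vector, carry vector)."""
    t = [a + b + k for a, b, k in zip(x, y, c)]
    return [v & MASK for v in t], [v >> W_BITS for v in t]

def value(vec, offset=0):
    """Integer value of vec with its lowest word at word position offset."""
    return sum(w << (W_BITS * (offset + j)) for j, w in enumerate(vec))

def add_with_carry_accumulation(x, y):
    """Add two s-word vectors; return (low vector, high vector, rounds)."""
    s = len(x)
    zero = [0] * s
    low, carry = adc(x, y, zero)          # vector addition, then step (1)
    high = list(zero)
    rounds = 0
    while any(carry):                     # step (6): stop when all carries are 0
        top = carry[s - 1:] + [0] * (s - 1)   # step (2): shift s-1 words lower
        high, _ = adc(high, top, zero)        # step (3): no new carry arises
        moved = [0] + carry[:-1]              # step (4): shift 1 word higher
        low, carry = adc(low, moved, zero)    # step (5)
        rounds += 1
    return low, high, rounds
```

With `x = [MASK] * 4` and `y = [1, 0, 0, 0]`, the single carry ripples through every word and the loop runs 4 rounds, the worst case of s rounds for s = 4.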
In particular, if it is known that the summation of the vectors produces no carry in the highest word, the carry accumulation algorithm can be simplified. As shown in the lower half of Fig. 3, the specific steps are:
To compute the sum of two vectors, the vector addition first yields the sum vector and the carry vector. The carry vector is then accumulated into the sum vector, and the final result is stored in the result vector. The following steps are performed:
(1) assign the sum vector to the result vector, and save the carry vector as the working carry vector;
(2) shift the working carry vector 1 word toward the high end and save the result as the shifted carry vector;
(3) add the result vector and the shifted carry vector, store the result in the result vector, and save the newly produced carries as the working carry vector;
(4) check the working carry vector: if every word is 0, the algorithm ends and returns the result vector; if at least one word is not 0, return to step (2).
The three steps (2) to (4) are called one round of carry accumulation of the simplified form, which performs at most s-1 rounds.
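A minimal Python sketch of the simplified form follows, assuming 32-bit words and the precondition stated in the text that the highest word never produces a carry. The function name is hypothetical, and the loop checks the carry vector before each round rather than after, which gives the same result.

```python
W_BITS = 32                      # assumed word length w
MASK = (1 << W_BITS) - 1

def adc(x, y, c):
    """Word-wise add with carry-in; returns (sum vector, carry vector)."""
    t = [a + b + k for a, b, k in zip(x, y, c)]
    return [v & MASK for v in t], [v >> W_BITS for v in t]

def accumulate_carry_simplified(sum_vec, carry):
    """Fold a carry vector into a sum vector; return (result, rounds)."""
    result = list(sum_vec)                # step (1)
    zero = [0] * len(result)
    rounds = 0
    while any(carry):                     # step (4)
        # Precondition of the simplified form: no carry out of the top word.
        assert carry[-1] == 0
        moved = [0] + carry[:-1]          # step (2): shift 1 word higher
        result, carry = adc(result, moved, zero)   # step (3)
        rounds += 1
    return result, rounds
```

For the sum vector `[0, MASK, MASK, 5]` with carry vector `[1, 0, 0, 0]`, the carry ripples through two saturated words and the result `[0, 0, 0, 6]` is reached after 3 rounds, the worst case of s-1 rounds for s = 4.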
The embodiment of the invention uses addition carry chains and ends up with only two addition carries. Adding these two carries into the sum vectors is certain not to produce a carry in the highest word, so the simplified form of the carry accumulation algorithm can be used.
Because handling one addition carry takes several vector instructions, handling the carries produced by every summation of product vectors would cost a large number of extra instructions and significantly slow down large integer multiplication. The addition carry chain algorithm disclosed by the invention arranges all product vectors in a set order into several addition carry chains. Within a chain, the carry produced by one vector addition is the input of the next vector addition instruction, and each chain finally produces only one carry, which considerably reduces the number of instructions spent on addition carries.
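An addition carry chain is described as follows. Adding vector $\vec{X_1}$ and vector $\vec{Y_1}$ produces the carry vector $\vec{Carry_1}$, which serves as the input carry vector of the summation of vector $\vec{X_2}$ and vector $\vec{Y_2}$; that addition produces the carry vector $\vec{Carry_2}$. This is repeated m-1 times, until $\vec{Carry_{m-1}}$ serves as the input carry vector of the summation of vector $\vec{X_m}$ and vector $\vec{Y_m}$, which produces the carry vector $\vec{Carry_m}$. The m vector additions form one addition carry chain. In formula form:
$(\vec{Sum_1}, \vec{Carry_1}) \leftarrow \mathrm{ADC}(\vec{X_1}, \vec{Y_1}, \vec{0})$
$(\vec{Sum_2}, \vec{Carry_2}) \leftarrow \mathrm{ADC}(\vec{X_2}, \vec{Y_2}, \vec{Carry_1})$
…………
$(\vec{Sum_m}, \vec{Carry_m}) \leftarrow \mathrm{ADC}(\vec{X_m}, \vec{Y_m}, \vec{Carry_{m-1}})$
An addition carry chain contains m vector additions and finally produces only one carry vector $\vec{Carry_m}$, eliminating m-1 carry vectors.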
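Why a chain may feed each carry vector directly into the next addition can be checked with a small Python sketch. It assumes 32-bit words and that each vector pair sits exactly one word higher than the previous pair, so a carry out of word j of one pair has the same weight as word j of the next pair. The helper names are hypothetical.

```python
W_BITS = 32                      # assumed word length w
MASK = (1 << W_BITS) - 1

def adc(x, y, c):
    """Word-wise add with carry-in; returns (sum vector, carry vector)."""
    t = [a + b + k for a, b, k in zip(x, y, c)]
    return [v & MASK for v in t], [v >> W_BITS for v in t]

def value(vec, offset=0):
    """Integer value of vec with its lowest word at word position offset."""
    return sum(w << (W_BITS * (offset + j)) for j, w in enumerate(vec))

def carry_chain(pairs):
    """Run m chained ADC additions; pair k is assumed to sit at word offset k.

    Returns the m sum vectors and the single carry vector left at the end."""
    carry = [0] * len(pairs[0][0])
    sums = []
    for a, b in pairs:
        total, carry = adc(a, b, carry)   # previous carry is the carry-in
        sums.append(total)
    return sums, carry
```

The total of all inputs equals the total of the sum vectors (sum vector k at offset k) plus the final carry vector at offset m, so only that one carry vector still has to be handled afterwards.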
The invention proposes two kinds of addition carry chains: the word-aligned (Type I) addition carry chain and the word-staggered (Type II) addition carry chain.
Fig. 4 is a schematic diagram of the structure and computation of the Type I addition carry chain provided by an embodiment of the invention. As shown in the upper half of Fig. 4, a Type I addition carry chain is formed as follows: sort the vectors to be summed by position from low to high, select s word-aligned vector pairs, and form the pairs into one addition carry chain in order from low to high. Adjacent vector pairs in the chain differ by one word.
As shown in the lower half of Fig. 4, in a Type I addition carry chain one summation of a vector pair is called one round of Type I addition carry chain computation; s vector pairs need s rounds of Type I addition carry chain computation.
Fig. 5 is a schematic diagram of the structure and computation of the Type II addition carry chain provided by an embodiment of the invention. As shown in the upper half of Fig. 5, a Type II addition carry chain is formed as follows: sort the vectors to be summed by position from low to high, and select s vectors that successively differ by one word to form one addition carry chain, so that adjacent vectors in the chain differ by one word.
As shown in the lower half of Fig. 5, in a Type II addition carry chain the following three operations are called one round of Type II addition carry chain computation:
(1) save the lowest word of the sum vector into the specified word of a low-order vector;
(2) shift the sum vector one word toward the low end;
(3) add the sum vector, the next vector, and the carry vector, producing a new sum vector and a new carry vector.
A Type II addition carry chain of s vectors needs s-1 rounds of Type II addition carry chain computation.
Due toOne wheel operation ratio of type addition carry chainOne wheel of type addition carry chain is simple to operate, enters with additive Position chain algorithm preferentially aligns multigroup word vectorial to compositionType addition carry chain, it is impossible to formDuring type addition carry chain, Algorithm just forms the vector of multiple front and rear one word of differenceType addition carry chain.
In summary, the calculating process of addition carry chain algorithm is as follows:
(1) n vector is sorted from low to high by word;
(2) the vectorial right of word alignment is chosen;
(3) if there is the vectorial right of multigroup word alignment, and these vectors by the order of word from low to high to being arranged, front and rear All it is one word of difference, performs step 4, otherwise, performs step 5;
(4) by institute's directed quantity to forming one by order from low to highType addition carry chain, perform take turns moreType adds Method carry chain calculates, and retains last caused addition carry;
(5) vector of one word of multiple front and rear differences is chosen by the order of word from low to high;
(6) institute's directed quantity is formed into one by order from low to high to addType addition carry chain, perform take turns moreType addition Carry chain calculates, and retains last caused addition carry.
The realization how addition carry chain algorithm is applied to the large integer multiplication based on vector instruction is specifically described below Method.Step 103 to step 105 realizes the product vector summation during vector length integer is multiplied.
Fig. 6 is a detailed calculation flow chart of the word-aligned addition carry chain step provided in an embodiment of the present invention. As shown in the figure, step 103 divides 2s-2 of the 2s product vectors into s-1 vector pairs, which form one word-aligned addition carry chain, and performs s-1 rounds of word-aligned addition carry chain calculation.
The i-th high-order product vector and the (i+1)-th low-order product vector are word-aligned, 1≤i≤s-1. Each such pair forms one group, giving s-1 groups in total, and the s-1 vector pairs are summed in order from low to high.
Specific calculation process is as follows:
(1)
FOR i ← 2 to s-1, repeat step (2)
(2)
(3)
Finally, s+1 sum vectors and 1 addition carry vector are produced. The s+1 sum vectors differ from one another by one word, and the addition carry vector is word-aligned with the highest sum vector.
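The word-aligned chain over the product vectors can be sketched as follows. This is a hedged illustration, not the patented instruction sequence: the formulas of the original steps are not reproduced in this text, so the indexing (0-based lists, with high[i] word-aligned with low[i+1]) is derived from the word weights of the partial products, and the names to_words, product_vectors, and aligned_chain are assumptions introduced here.

```python
W = 32                      # word width in bits (assumed)
WORD_MASK = (1 << W) - 1

def to_words(n, s):
    """Split integer n into s words, lowest word first."""
    return [(n >> (W * j)) & WORD_MASK for j in range(s)]

def value(vec, offset=0):
    """Integer value of a vector whose lowest word has weight 2**(W*offset)."""
    return sum(w << (W * (j + offset)) for j, w in enumerate(vec))

def adc(a, b, carry_in):
    """Lane-wise add with carry: returns (sum_vector, carry_out_vector)."""
    t = [x + y + c for x, y, c in zip(a, b, carry_in)]
    return [v & WORD_MASK for v in t], [v >> W for v in t]

def product_vectors(x, y):
    """For each multiplier word y[i]: low and high halves of x[j] * y[i] per lane."""
    low = [[(xj * yi) & WORD_MASK for xj in x] for yi in y]
    high = [[(xj * yi) >> W for xj in x] for yi in y]
    return low, high

def aligned_chain(low, high):
    """s-1 chained additions of (high[i], low[i+1]); returns s+1 sums and 1 carry."""
    s = len(low)
    sums = [low[0]]                     # lowest low-order vector passes through
    carry = [0] * len(low[0])
    for i in range(s - 1):
        t, carry = adc(high[i], low[i + 1], carry)
        sums.append(t)
    sums.append(high[s - 1])            # highest high-order vector passes through
    return sums, carry
```

In this model sums[k] has word offset k and the remaining carry vector has offset s, the same offset as the last sum vector, so the s+1 sum vectors and the carry together still represent the full product.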
Fig. 7 is a detailed calculation flow chart of the staggered-word addition carry chain step provided in an embodiment of the present invention. As shown in the figure, step 104 forms the s+1 sum vectors into one staggered-word addition carry chain and performs s rounds of staggered-word addition carry chain calculation.
Each round of staggered-word addition carry chain calculation requires three kinds of instruction: the vector blend instruction (BLEND), the vector shift instruction (SHIFT), and the vector add-with-carry instruction (ADC).
BLEND: as directed by a mask vector (Mask), mixes specified words of one source vector with specified words of another source vector and assigns the result to the destination vector. Mask has s words. If the i-th word of Mask is 0, the i-th word of one source vector is assigned to the i-th word of the destination vector; if the i-th word of Mask is non-zero, the i-th word of the other source vector is assigned to the i-th word of the destination vector:
SHIFT: shifts a vector toward the low end by t words, where t is the shift parameter and 0≤t≤s. The low t words of the vector are shifted out, and the vacated high t words are filled with the low t words of a second vector:
When a vector is shifted toward the high end by t words, the low t words are set to 0:
When a vector is shifted toward the low end by t words, the high t words are set to 0:
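The semantics of the logical BLEND and SHIFT instructions can be modeled as below. This is a minimal sketch under stated assumptions: vectors are Python lists with index 0 as the lowest word, the function names are hypothetical, and the choice of which source vector is selected by a zero mask word is an assumption, since the operand symbols are not legible in this text.

```python
def blend(mask, a, b):
    """Word i is taken from a where mask[i] == 0, otherwise from b (assumed order)."""
    return [bi if m else ai for m, ai, bi in zip(mask, a, b)]

def shift(a, b, t):
    """Shift a toward the low end by t words; the vacated high words
    are filled with the low t words of b."""
    return a[t:] + b[:t]

def shift_low(a, t):
    """Shift toward the low end by t words; high t words become 0."""
    return a[t:] + [0] * t

def shift_high(a, t):
    """Shift toward the high end by t words; low t words become 0."""
    return [0] * t + a[:len(a) - t]
```

For example, shift([1, 2, 3, 4], [5, 6, 7, 8], 1) yields [2, 3, 4, 5], and the two zero-filling variants correspond to calling shift with a zero vector on the appropriate side.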
One round of the staggered-word addition carry chain calculation in step 104, as shown in Fig. 5, specifically includes:
(1) saving the lowest word of the sum vector produced by the previous round of vector addition into the (i+1)-th word of the low-order result vector;
(2) shifting the sum vector one word toward the low end, setting the high word to 0, and storing it back into the vector;
(3) adding that vector, the next sum vector, and the carry vector produced by the previous round of vector addition, producing a sum vector and a new carry vector.
Perform s rounds of staggered-word addition carry chain calculation:
(1) mask_1 = 0, mask_j ≠ 0 IF j ≠ 1
(2)
(3)
FOR i ← 1 to s-1, repeat steps (4) to (7)
(4)
(5) mask_j ≠ 0 IF j ≠ i+1
(6)
(7)
This yields the low-order result vector, the high-order sum vector, and the addition carry vector.
The vector Temp is a temporary vector. Step (4) shifts the vector High_i left by i words and saves it into Temp, so that Temp[i+1] = High_i[1]; step (5) assigns Temp[i+1] to Low[i+1]. Two instructions are thus used to store the lowest word of the vector into the (i+1)-th word of the low-order result vector Low.
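The three operations of one staggered-word round can be sketched as follows. This is an illustrative model only: it replaces the BLEND and SHIFT instruction pair with direct list operations, and the function name staggered_chain, the 32-bit word width, and the list representation (index 0 is the lowest word) are assumptions introduced here.

```python
W = 32                      # word width in bits (assumed)
WORD_MASK = (1 << W) - 1

def value(vec, offset=0):
    """Integer value of a vector whose lowest word has weight 2**(W*offset)."""
    return sum(w << (W * (j + offset)) for j, w in enumerate(vec))

def adc(a, b, carry_in):
    """Lane-wise add with carry: returns (sum_vector, carry_out_vector)."""
    t = [x + y + c for x, y, c in zip(a, b, carry_in)]
    return [v & WORD_MASK for v in t], [v >> W for v in t]

def staggered_chain(sums):
    """sums[k] sits k words above sums[0]. Returns (low_words, high, carry)."""
    run = sums[0]
    carry = [0] * len(run)
    low = []
    for nxt in sums[1:]:
        low.append(run[0])                  # (1) save the lowest word
        run = run[1:] + [0]                 # (2) shift one word toward the low end
        run, carry = adc(run, nxt, carry)   # (3) add next vector and previous carry
    return low, run, carry
```

With s+1 input vectors the loop runs s rounds and emits s low-order words. In this model the returned high vector has word offset s and the remaining carry vector has offset s+1, which is why the carry must later be shifted one word toward the high end before it is added back.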
Fig. 8 is a detailed calculation flow chart of step 105 provided in an embodiment of the present invention, in which the two addition carries are added back to the sum vector.
As shown in Fig. 8, step 105 uses the carry accumulation algorithm to quickly add the carry vector produced by step 103 and the carry vector produced by step 104 to the high-order sum vector produced by step 104.
When two vector-length integers are multiplied, the length of the product does not exceed twice the vector length. Therefore, when the carry vectors are added to the high-order sum vector, the highest word does not produce a carry, and a simplified form of the carry accumulation algorithm is used.
The execution of step 105 includes:
(1)
(2)
(3)IFGOTO(6)
(4)
(5)GOTO(3)
(6)
Finally, the high-order result vector and the low-order result vector are output.
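A possible reading of the simplified carry accumulation is sketched below. It is an assumption-laden model rather than the patented step list, whose formulas are not legible here: carry_aligned stands for the carry vector that is word-aligned with the high-order sum vector, carry_above stands for the carry vector that sits one word higher, and carries that would leave the top word are dropped on the grounds, stated above, that the highest word cannot produce a carry.

```python
W = 32                      # word width in bits (assumed)
WORD_MASK = (1 << W) - 1

def adc(a, b, carry_in):
    """Lane-wise add with carry: returns (sum_vector, carry_out_vector)."""
    t = [x + y + c for x, y, c in zip(a, b, carry_in)]
    return [v & WORD_MASK for v in t], [v >> W for v in t]

def carry_accumulate(high, carry_aligned, carry_above):
    """Fold both carry vectors into high, rippling until no carry remains."""
    zero = [0] * len(high)
    pending = [0] + carry_above[:-1]        # move one word toward the high end
    high, pending = adc(high, carry_aligned, pending)
    while any(pending):                     # usually exits after very few rounds
        pending = [0] + pending[:-1]        # each new carry belongs one word up
        high, pending = adc(high, zero, pending)
    return high
```

The loop terminates after at most as many iterations as there are words, because every pending carry moves one word higher per iteration; in typical inputs long carry ripples are rare, which is what makes this form fast.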
In particular, when several high-order words of the multiplicand vector or the multiplier vector are 0, the vector-length integer multiplication method disclosed by the present invention can be optimized.
If the high s_x words of the multiplicand vector are 0 and the high s_y words of the multiplier vector are 0, and s_x > s_y, the values of the words of the two vectors are exchanged.
In the vector-length integer multiplication algorithm, the multiplicand vector is multiplied by the vector obtained by spreading each word of the multiplier. If a high-order word of the multiplier is 0, the vector multiplication of its spread vector with the multiplicand vector yields a product vector in which every word is 0. Such a product vector need not be summed, so both the vector multiplication and the vector summation can be omitted.
For the above case, the present invention specifically optimizes steps 102 to 105.
Optimization of step 102:
The high s_y words of the multiplier vector are 0, so only its low s-s_y words are spread and s-s_y rounds of vector multiplication are performed, producing s-s_y low-order product vectors and s-s_y high-order product vectors.
Optimization of step 103:
For the 2*(s-s_y) product vectors produced by step 102, step 103 needs to perform only s-s_y-1 rounds of word-aligned addition carry chain calculation, producing s-s_y+1 sum vectors and 1 addition carry vector.
Optimization of step 104:
For the s-s_y+1 sum vectors and 1 addition carry vector produced by step 103, step 104 performs s-s_y rounds of staggered-word addition carry chain calculation, obtaining the low-order result vector, the high-order sum vector, and an addition carry vector. Only the low s-s_y words of the vector Low are valid; the high s_y words are invalid.
Optimization of step 105:
Step 105 adds the carry vectors to the high-order sum vector, obtaining the high-order result vector. The high-order result vector and the low s-s_y words of the vector Low are merged to form the multiplication result.
The optimized algorithm above performs s-s_y rounds of vector multiplication, s-s_y-1 rounds of word-aligned addition carry chain calculation, s-s_y rounds of staggered-word addition carry chain calculation, and the carry accumulation. It saves s_y rounds of vector multiplication, s_y rounds of word-aligned addition carry chain calculation, and s_y rounds of staggered-word addition carry chain calculation, reducing the number of vector instructions that these operations require.
The present invention also proposes a method that can compute n groups of smaller-integer multiplications simultaneously. The method applies to the following situation: n groups of multiplicands and multipliers are to be multiplied, and the length of every multiplicand and multiplier is no greater than 1/n of the vector length; more specifically, the number of words into which each is divided does not exceed 1/n of the vector word count w. The method combines the n multiplicands into one multiplicand vector and the n multipliers into one multiplier vector, and computes all n multiplications at once.
The vector combination process is as follows:
First, the n multiplicands X1 to Xn and the n multipliers Y1 to Yn are each divided into words. Because the length of every multiplicand and multiplier is no greater than 1/n of the vector length, the number of words of every integer does not exceed 1/n of the vector word count w. The multiplicands X1 to Xn are saved into the multiplicand vector and the multipliers Y1 to Yn are saved into the multiplier vector; the vacated high bits and high words are set to 0. The detailed process is:
FOR i←1 to n
Step (1) is performed repeatedly
(1)
Step (2) is performed repeatedly
(2)
Step (3) is performed repeatedly
(3)
Each word of the i-th multiplicand Xi is stored into the words of the multiplicand vector that belong to group i; the remaining words of that group are set to 0, and the remaining high-order words of the vector, up to word w, are set to 0.
Similarly, each word of the i-th multiplier Yi is stored into the words of the multiplier vector that belong to group i; the remaining words of that group are set to 0, and the remaining high-order words of the vector, up to word w, are set to 0.
After the n multiplicands and n multipliers have been saved into the two vectors by the above operations, the first words of every multiplicand/multiplier group (Xi, Yi) are aligned.
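The combination of n smaller integers into one vector might look like the following sketch. The exact word indices of the original procedure are not legible in this text, so the layout used here (w divisible by n, and integer i occupying the w/n consecutive words starting at word i*w/n, counted from 0) is an assumption, as is the function name pack.

```python
def pack(values, w=16, word_bits=32):
    """Pack n integers into one w-word vector, lowest word first.
    Integer i is placed in words i*(w//n) .. (i+1)*(w//n)-1 (assumed layout)."""
    n = len(values)
    group = w // n                          # words available to each integer
    mask = (1 << word_bits) - 1
    vec = [0] * w                           # unused high bits and words stay 0
    for i, v in enumerate(values):
        if v >> (group * word_bits):
            raise ValueError("integer does not fit in 1/n of the vector")
        for j in range(group):
            vec[i * group + j] = (v >> (word_bits * j)) & mask
    return vec
```

With w = 16 and two 256-bit integers, the first integer lands in the low 256 bits and the second in the high 256 bits, and the first words of the corresponding multiplicand and multiplier fall in the same lane when both vectors are packed this way.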
The following analyzes how the two combined vectors are multiplied. For any multiplicand/multiplier group (Xi, Yi), products are computed only within the group, and no products are computed between multiplicands and multipliers of different groups; multiplying the two combined vectors is therefore not the same as multiplying two vector-length integers. Consequently, the method of computing n groups of smaller-integer multiplications simultaneously requires modifications to steps 102 to 105.
Step 102 is rightIn word carry out group in spread, i.e., by multiplier YiJ-th of word Yi[j] be diffused into vector fromArrive'sIndividual word.So step 102 needs to carry outVector multiplication is taken turns, is producedIndividual low level Product vectorArriveWithIndividual high-order product vectorArrive
Step 103 willA high position and low level product vector composition one for group word alignmentType addition carry chain, carry outWheelType addition carry chain operates;
Caused by step 104It is individual and vectorialArriveComposition oneType addition carry chain, carry outWheelType addition carry chain operates;
Step 105 keeps constant.
Finally, the resulting vectors are taken apart according to the different multiplicand/multiplier groups, and the high-order result and low-order result of each group are merged to form the multiplication result of each multiplicand/multiplier group.
Compared with computing a single integer multiplication by this method, the simultaneous n-group integer multiplication algorithm keeps the latency unchanged, with the same number of rounds of vector multiplication, word-aligned addition carry chain operation, and staggered-word addition carry chain operation, while increasing the computation throughput n-fold.
The invention also discloses a high-speed large integer multiplication device based on vector instructions. Fig. 9 is a schematic structural diagram of the large integer multiplication implementation device based on vector instructions provided in an embodiment of the present invention.
As shown in the figure, the device includes a vector split/combine module, a vector-length integer multiplication module, and a product summation module. All modules can be implemented on Intel Xeon Phi coprocessors.
The vector split/combine module has two functions. For the multiplication of two large integers, the module splits each of the two large integers into one or more vectors. For the multiplication of n groups of integers whose lengths are all no greater than 1/n of the vector length, the module combines the two sets of integers into two vectors.
The vector-length integer multiplication module also has two functions: computing the product of vector-length integers, and computing the product of two combined vectors.
The vector-length integer multiplication module includes two submodules: a vector multiplication module and a vector addition module.
The vector multiplication module calls the high-order multiplication vector instruction (MULHIGH) and the low-order multiplication vector instruction (MULLOW) to multiply the multiplicand vector by each word of the multiplier vector, obtaining all product vectors.
The vector addition module includes a word-aligned addition carry chain module and a staggered-word addition carry chain module.
The word-aligned addition carry chain module uses the vector add-with-carry instruction (ADC) to form the product vectors, in a specified order, into one word-aligned addition carry chain, eliminates all addition carries produced within the chain, and produces one set of sum vectors and one addition carry.
The staggered-word addition carry chain module uses the vector add-with-carry instruction (ADC) to form the sum vectors, in a specified order, into one staggered-word addition carry chain, eliminates all addition carries produced within the chain, and produces two sum vectors and one addition carry.
Finally, the vector addition module adds the two addition carries back to the sum vector, obtaining the product of the multiplicand vector and the multiplier vector.
If the multiplicand or multiplier is split into multiple vectors, the vector-length integer multiplication module must be called repeatedly.
The product summation module sums the product vectors produced by all calls to the vector-length integer multiplication module.
The high-speed large integer multiplication device based on vector instructions can be implemented on Intel's Xeon Phi coprocessor.
Two specific embodiments are given below to illustrate the high-speed large integer multiplication device based on the Intel Xeon Phi coprocessor.
The first embodiment is the implementation of 512-bit (vector-length) integer multiplication on the Intel Xeon Phi coprocessor; the second embodiment is the implementation of 256-bit (half vector length) integer multiplication on the Intel Xeon Phi coprocessor. The Intel Xeon Phi coprocessor is a many-core coprocessor released by Intel. It is based on the x86 architecture and has up to 61 cores; each core supports 4 hardware threads and a 512-bit vector instruction set (IMCI, Intel Initial Many Core Instructions). IMCI supports simultaneous computation on 16 single-precision floating-point numbers or 16 32-bit integers, and also on 8 double-precision floating-point numbers.
IMCI introduces a new kind of vector register, the 16-bit mask vector register. Each hardware thread has 32 512-bit vector registers and 8 16-bit mask vector registers.
In IMCI the mask vector register has several uses, including storing addition carries, serving as a write mask, and serving as a blend mask. Figure 10 is a schematic diagram of the use of the Intel Xeon Phi mask registers in an embodiment of the present invention.
These uses are described as follows:
Storing addition carries: as an input and output of the vector add-with-carry instruction, bit i of the mask vector register stores the carry produced by adding the i-th words of the augend vector and the addend vector.
Write mask: as an input of most IMCI vector instructions, indicates which words of the destination vector are modified. If bit i of the mask vector register is 0, the i-th word of the destination vector is unchanged; if bit i is 1, the i-th word of the destination vector is modified by the operation of the vector instruction.
Blend mask: as an input of the vector blend instruction, indicates whether the word of one source vector or of the other is saved into the destination vector. If bit i of the mask vector register is 0, the i-th word of one source vector is saved into the i-th word of the destination vector; if bit i is 1, the i-th word of the other source vector is saved into the i-th word of the destination vector.
When the embodiments of the present invention are implemented with the vector instruction set IMCI of the Xeon Phi coprocessor, the instructions used fall into the following classes: vector spread and ordering instructions, vector multiplication instructions, vector addition instructions, vector shift instructions, vector blend instructions, mask vector assignment instructions, and instructions that convert between mask vectors and integers.
The functions of these instructions, and their correspondence to the logical instructions used in the Summary of the Invention, are explained below.
(1) Vector spread and ordering instructions
The Xeon Phi coprocessor divides the 16 words of a 512-bit vector, from low to high, into 4 groups {A, B, C, D}; the 4 words within each group are likewise labeled {A, B, C, D} from low to high. Two instructions are used to spread and reorder the 4 groups and the 4 words within each group, respectively.
Group spread and ordering instruction:
The 2nd parameter is an enumeration value indicating how the 4 groups are spread and reordered. For example, _MM_PERM_CDAA indicates that groups A and B of the destination vector are both assigned group A of the source vector, group C of the destination vector is assigned group D of the source vector, and group D of the destination vector is assigned group C of the source vector.
In-group word spread and ordering instruction, for example:
The 2nd parameter is an enumeration value indicating how the words within each of the 4 groups are spread and reordered; the same operation is applied to all 4 groups. _MM_SWIZ_REG_ABDC indicates that, in every group, word A of the destination vector is assigned word C of the source vector, word B is assigned word D, word C is assigned word B, and word D is assigned word A.
Combining these two instructions realizes the function of the logical instruction SPREAD.
For example, spreading the 7th word of one vector across another vector, i.e.:
can be realized with the following two instructions:
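The two-stage spread can be modeled in Python as follows. This is a model of the semantics only and does not reproduce the IMCI instructions themselves: the function names are hypothetical, vectors are 16-element lists with index 0 as the lowest word, and selectors are given as index lists from low to high (0 = A, 1 = B, 2 = C, 3 = D) rather than as the enumeration constants.

```python
def permute_groups(src, sel):
    """sel[g] is the source group copied into destination group g."""
    return [src[4 * sel[g] + k] for g in range(4) for k in range(4)]

def swizzle_in_group(src, sel):
    """sel[k] is the in-group source word copied into word k of every group."""
    return [src[4 * g + sel[k]] for g in range(4) for k in range(4)]

def spread(src, i):
    """Broadcast word i (0-based) of src to all 16 words in two stages."""
    g, k = divmod(i, 4)                     # group index and word-in-group index
    return swizzle_in_group(permute_groups(src, [g] * 4), [k] * 4)
```

The 7th word of the text is index 6 in this 0-based model, i.e. word C of group B, so the first stage copies group B into every group and the second stage copies word C into every word of each group. The _MM_PERM_CDAA example above corresponds to permute_groups(src, [0, 0, 3, 2]) in this notation.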
(2) Vector multiplication instructions
Low-order vector multiplication, which realizes the function of the logical instruction MULLOW:
Multiplies the corresponding words of the two vectors and saves the low 32 bits of each product in the corresponding word of the result vector.
High-order vector multiplication, which realizes the function of the logical instruction MULHIGH:
Multiplies the corresponding words of the two vectors and saves the high 32 bits of each product in the corresponding word of the result vector.
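As a small illustration, the two multiplication instructions can be modeled lane by lane. The sketch assumes unsigned 32-bit words and uses hypothetical function names; it models only the arithmetic, not the IMCI intrinsics.

```python
WORD_MASK = 0xFFFFFFFF      # 32-bit words

def mullow(a, b):
    """Low 32 bits of each lane-wise 64-bit product."""
    return [(x * y) & WORD_MASK for x, y in zip(a, b)]

def mulhigh(a, b):
    """High 32 bits of each lane-wise 64-bit product (unsigned)."""
    return [(x * y) >> 32 for x, y in zip(a, b)]
```

For every lane, (mulhigh << 32) | mullow reconstructs the full 64-bit product, which is why each round of vector multiplication yields one low-order and one high-order product vector.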
(3) Vector addition instructions
This group of instructions realizes the function of the logical instruction ADC.
Vector addition instruction with carry:
Adds the two vectors and the mask vector (as the input carry), saves the result in the sum vector, and saves the newly produced carries in the mask vector.
Vector addition instruction with carry and write mask, for example:
Under the direction of the mask vector WriteMask, the instruction modifies the selected words of the sum vector and the corresponding bits of the carry mask vector.
Vector addition instruction without carry input, for example:
Adds the two vectors, saves the result in the sum vector, and saves the newly produced carries in the mask vector.
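The interplay of the carry mask and the write mask can be sketched as follows. This is a model under stated assumptions, not a description of the exact IMCI behavior: masks are modeled as 16-bit Python integers, the function name adc_mask is hypothetical, and lanes not selected by the write mask are assumed to keep the first operand's word and their existing carry bit.

```python
def adc_mask(a, b, carry_mask, write_mask=0xFFFF):
    """16-lane add with carry; carries travel in a 16-bit mask integer.
    Returns (sum_vector, new_carry_mask)."""
    sums = list(a)                          # unselected lanes keep a[i] (assumed)
    out = carry_mask                        # unselected carry bits kept (assumed)
    for i in range(16):
        if (write_mask >> i) & 1:
            t = a[i] + b[i] + ((carry_mask >> i) & 1)
            sums[i] = t & 0xFFFFFFFF
            out = (out & ~(1 << i)) | ((t >> 32) << i)
    return sums, out
```

Passing 0 as carry_mask gives the behavior of the addition without carry input; passing the mask returned by a previous call links two additions into a carry chain.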
(4) Vector shift instruction
Saves the low t words of one source vector in the high t words of the destination vector, and the high 16-t words of the other source vector in the low 16-t words of the destination vector. This realizes the function of the logical instruction SHIFT.
(5) Vector blend instruction
If bit i of the mask vector BlendMask is 0, the i-th word of one source vector is saved in the i-th word of the destination vector; if bit i of BlendMask is 1, the i-th word of the other source vector is saved in the i-th word of the destination vector. This realizes the function of the logical instruction BLEND.
(6) Mask vector and integer conversion instructions
Instruction converting a mask vector to an integer:
Assigns the mask vector to the low 16 bits of a 32-bit integer.
Instruction converting an integer to a mask vector:
k is a 32-bit integer whose low 16 bits are assigned to the mask vector.
Combining the above two instructions with an integer shift instruction realizes the function of the SHIFT instruction for shifting a mask vector. For example, shifting a mask vector left by one bit can be realized with the following three instructions:
k = k << 1
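The three-step mask shift (mask to integer, integer shift, integer to mask) can be modeled as below. The function names and the representation of a mask register as a list of 16 bits (index 0 is the lowest bit) are assumptions made for illustration.

```python
def mask2int(mask_bits):
    """Pack 16 mask bits (lowest first) into an integer."""
    return sum(bit << i for i, bit in enumerate(mask_bits))

def int2mask(k):
    """Keep only the low 16 bits of k as a mask."""
    return [(k >> i) & 1 for i in range(16)]

def shift_mask_left(mask_bits):
    k = mask2int(mask_bits)     # mask vector -> integer
    k = k << 1                  # ordinary integer shift
    return int2mask(k)          # integer -> mask vector (bit 16 is dropped)
```

Shifting the carry mask left by one bit moves each carry to the next higher word, which is how the carry accumulation step realigns pending carries.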
The first embodiment is described below: the vector-length integer multiplication method disclosed by the present invention is implemented on the Intel Xeon Phi coprocessor, i.e. 512-bit integer multiplication is realized with the IMCI instruction set.
The product of the multiplicand vector and the multiplier vector is computed as follows:
(1) Implementing step 102: 16 rounds of vector multiplication
(1) Flag[4] = {_MM_SWIZ_REG_AAAA, _MM_SWIZ_REG_BBBB, _MM_SWIZ_REG_CCCC, _MM_SWIZ_REG_DDDD}
(2)
FOR i ← 1 to 4, repeat steps (3) (4) (5)
(3)
(4)
(5)
(6)
FOR i ← 5 to 8, repeat steps (7) (8) (9)
(7)
(8)
(9)
(10)
FOR i ← 9 to 12, repeat steps (11) (12) (13)
(11)
(12)
(13)
(14)
FOR i ← 13 to 16, repeat steps (15) (16) (17)
(15)
(16)
(17)
Step (1) spreads each word of the multiplier vector and performs high-order vector multiplication and low-order vector multiplication with the multiplicand vector.
(2) Implementing step 103: 15 rounds of word-aligned addition carry chain operation
(1)
(2)
FOR i ← 2 to 15, repeat step (3)
(3)
(4)
The 15 word-aligned vector pairs form one word-aligned addition carry chain; 15 rounds of word-aligned addition carry chain operation are performed, producing a carry vector.
(3) Implementing step 104: 16 rounds of staggered-word addition carry chain calculation
(1) k[4] = {0x1, 0x2, 0x4, 0x8}
(2)
(3)
FOR i ← 0 to 15, repeat steps (4) (5) (6)
(4)
(5)
(6)
The zero vector is a vector in which every word is 0. Step (4) saves the lowest word of the sum vector into the low-order result vector in turn; steps (5) and (6) shift the sum vector one word toward the low end and add it to the next vector. The 17 vectors, each staggered by one word, form one staggered-word addition carry chain; 16 rounds of staggered-word addition carry chain calculation are performed, producing the sum vector, the low-order result vector, and a carry vector.
(4) Implementing step 105: carry accumulation
(1)
(2)
(3)
(4)
(5) IF k == 0, step B4 ends
(6) ELSE k = k << 1
(7)
(8)
GOTO step (4)
The first carry vector is added to two zero vectors and the result is saved as a vector; all carries newly produced by this addition are 0. The other carry is shifted one word toward the high end and summed with it, and the carries produced are saved. The carry accumulation algorithm then repeatedly shifts the carry left and adds it to the sum vector until all carries produced are 0.
When the above 512-bit integer multiplication embodiment is executed on the Intel Xeon Phi coprocessor, one 512-bit multiplication is completed in 128 instructions on average, and 630 million 512-bit multiplications are completed per second.
The SM2 Chinese national cryptographic algorithm and the ECDSA-P256 algorithm can compute a 256-bit modular multiplication using a 256-bit multiplication and fast reduction, so 256-bit multiplication is the core computation of both algorithms.
The second embodiment is described below: 256-bit integer multiplication is implemented on the Intel Xeon Phi coprocessor. Taking the simultaneous computation of two 256-bit multiplications as an example, this embodiment illustrates how the simultaneous n-group integer multiplication algorithm is implemented.
Figure 11 is a detailed calculation flow chart of the device based on Intel Xeon Phi computing two 256-bit integer multiplications simultaneously. Two groups of 256-bit integers are multiplied simultaneously: X1 by Y1 and X2 by Y2. X1 and X2 are combined into one vector, with X2 in the high 256 bits and X1 in the low 256 bits; Y1 and Y2 are combined into one vector, with Y2 in the high 256 bits and Y1 in the low 256 bits.
The calculation process for multiplying two groups of 256-bit integers simultaneously is as follows:
(1) Implementing step 102: 8 rounds of vector multiplication
(1) Flag[4] = {_MM_SWIZ_REG_AAAA, _MM_SWIZ_REG_BBBB, _MM_SWIZ_REG_CCCC, _MM_SWIZ_REG_DDDD}
(2)
FOR i ← 1 to 4, repeat steps (3) (4) (5)
(3)
(4)
(5)
(6)
FOR i ← 5 to 8, repeat steps (7) (8) (9)
(7)
(8)
(9)
Step (1) spreads each word of Y1 and of Y2 across 8 words and performs high-order vector multiplication and low-order vector multiplication with the multiplicand vector.
(2) Implementing step 103: 7 rounds of word-aligned addition carry chain operation
(1)
(2)
FOR i ← 2 to 8, repeat step (3)
(3)
(4)
The 8 word-aligned vector pairs form one word-aligned addition carry chain; 7 rounds of word-aligned addition carry chain operation are performed, producing a carry vector.
(3) Implementing step 104: 8 rounds of staggered-word addition carry chain calculation
k[4] = {0x0101, 0x0202, 0x0404, 0x0808}
(1)
(2)
(3)
(4)
(5)
(6)
FOR i ← 1 to 3, repeat steps (7) (8) (9) (10) (11)
(7)
(8)
(9)
(10)
(11)
(12)
FOR i ← 4 to 7, repeat steps (13) (14) (15) (16)
(13)
(14)
(15)
(16)
(17)
The zero vector is a vector in which every word is 0. Steps (8) and (13) first spread the word before saving it into the designated position. Step (12) exchanges the group B words with the group A words and the group D words with the group C words, and the lowest word is spread and then saved into the designated position of the vector. Step (17) restores the order of the 4 groups of words.
(4) Implementing step 105: carry accumulation
(1)
(2)
(3)
(4)
(5) IF k == 0, step B4 ends
(6) ELSE k = k << 1
(7)
(8)
GOTO step (4)
The first carry vector is added to two zero vectors and the result is saved as a vector; all carries newly produced by this addition are 0. The other carry is shifted one word toward the high end and summed with it, and the carries produced are saved. The carry accumulation algorithm then repeatedly shifts the carry left and adds it to the sum vector until all carries produced are 0.
When this embodiment is executed on the Intel Xeon Phi coprocessor, two 256-bit multiplications are completed in 66 instructions, i.e. one 256-bit multiplication per 33 instructions on average, and 2.45 billion 256-bit multiplications are completed per second.
The preferred embodiments above further describe the objects, technical solutions, and advantages of the present invention in detail. It should be understood that the foregoing are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (9)

1. A large integer multiplication implementation method based on vector instructions, applied in the public key cryptography computation process of a computer, characterized by comprising:
A. splitting the multiplicand and the multiplier of the large integer multiplication into one or more integers of vector length;
B. computing the product of the vector-length integers, the process being:
(1) multiplying the multiplicand vector by each word of the multiplier vector in turn to obtain all product vectors;
(2) forming the product vectors into a word-aligned addition carry chain, hereinafter the word-aligned type addition carry chain, and using the vector add-with-carry instruction to eliminate all addition carries produced within the chain, producing one set of sum vectors and one addition carry;
(3) forming the sum vectors into a staggered-word addition carry chain, hereinafter the staggered-word type addition carry chain, and using the vector add-with-carry instruction to eliminate all addition carries produced within the chain, producing two sum vectors and one addition carry;
(4) adding the two addition carries to the sum vector to obtain the product of the multiplicand vector and the multiplier vector;
repeating steps (1) to (4) to compute the products of all the vector-length integers split out in step A;
C. summing the products of all the vector-length integers.
2. The implementation method as described in claim 1, characterized in that the addition carry chain algorithm uses the vector add-with-carry instruction to form multiple vectors to be summed, in a specified order, into one addition carry chain and eliminates all carries in the chain, wherein:
the vector add-with-carry instruction ADC can be expressed as follows:
the ADC instruction adds two vectors and a carry vector, producing a sum vector and a new carry vector;
in the addition carry chain, the first pair of vectors is added, producing a carry vector, which serves as the input carry vector for the sum of the second pair of vectors, which in turn produces a new carry vector; this is repeated m-1 times, so that the carry vector produced by the (m-1)-th addition is the input carry vector for the sum of the m-th pair of vectors, which produces the final carry vector; the m vector additions form one said addition carry chain; expressed as formulas:
…………
the addition carry chain contains m vector additions but ultimately produces only one carry vector, eliminating m-1 carry vectors.
3. The implementation method as claimed in claim 2, characterized in that the word-aligned type addition carry chain satisfies the following requirement: the 2s vectors to be summed are sorted by word position and can be arranged into s word-aligned vector pairs, with successive vector pairs differing by one word; the vector pairs are formed into one addition carry chain in order from low to high; the process of computing the vector-length integer multiplication is: the vector add-with-carry instruction is used to sum each vector pair of the word-aligned type addition carry chain, the instruction being executed s times in total, producing s sum vectors and one carry vector;
the staggered-word type addition carry chain satisfies the following requirement: the s vectors to be summed are sorted by word position and can be arranged into an addition carry chain in which successive vectors differ by one word; the staggered-word type addition carry chain defines the following three operations as one round of staggered-word type addition carry chain calculation:
(1) saving the lowest word of the sum vector into the designated word of a low-order vector;
(2) shifting the sum vector one word toward the low end;
(3) adding the sum vector, the next vector, and the carry vector, producing a new sum vector and a new carry vector;
the staggered-word type addition carry chain performs the staggered-word type addition carry chain calculation operation on the s vectors in the chain in order from low to high, s-1 times in total, producing one high-order vector, several low-order vectors, and one addition carry vector.
4. The implementation method as claimed in claim 2, characterized in that, for the multiplication in the vector-length integer multiplication, vector multiplication instructions are used to multiply the multiplicand vector by each word of the multiplier vector in turn to obtain all product vectors, the process being:
a vector consists of multiple words, and Y[i] denotes the i-th word of the vector; the word spread instruction SPREAD extends the i-th word of the vector to the whole vector and saves the result, i.e. the value of every word of the resulting vector is Y[i];
the high-order multiplication vector instruction MULHIGH multiplies each pair of aligned words in two vectors and stores the high-order word-length portion of each product in the corresponding word of the high-order multiplication result vector;
the low-order multiplication vector instruction MULLOW multiplies each pair of aligned words in two vectors and stores the low-order word-length portion of each product in the corresponding word of the low-order multiplication result vector;
the word spread instruction SPREAD, the high-order multiplication vector instruction MULHIGH, and the low-order multiplication vector instruction MULLOW are used;
the multiplicand vector and the multiplier vector each have s words, and the calculation process is:
FOR i ← 1 to s, repeat steps (1) to (3), with i running from 1 to s:
(1)
(2)
(3)
producing s low-order product vectors and s high-order product vectors, 2s product vectors in total.
5. The implementation method as claimed in claim 4, characterized in that, for the addition computation in the vector-length integer multiplication, all product vectors obtained by multiplying the multiplicand vector by each word of the multiplier vector in turn are summed quickly using the addition carry chain algorithm;
The vector add-with-carry instruction ADC, the vector blend instruction BLEND, and the vector shift instruction SHIFT are used;
B2, of the 2s product vectors, 2s-2 vectors are divided into s-1 vector pairs that form one [...]-type addition carry chain; the s-1 vector pairs are summed successively from low order to high order, yielding s-1 sum vectors and 1 addition carry vector. The two remaining product vectors take no part in the addition and are renamed as sum vectors, so that s+1 sum vectors and 1 addition carry vector are obtained.
(1)
FOR i ← 2 to s-1, perform step (2) repeatedly:
(2)
(3)
B3, the s+1 sum vectors form one [...]-type addition carry chain, on which s rounds of [...]-type addition carry chain calculation are carried out, yielding the low-order result vector, the high-order sum vector, and the addition carry vector.
(1)
mask_j ≠ 0 IF j ≠ 1
(2)
(3)
FOR i ← 1 to s-1, perform steps (4) to (7) repeatedly:
(4)
(5)
mask_j ≠ 0 IF j ≠ i+1
(6)
(7)
B4, the carry vector is quickly added to the high-order sum vector to obtain the multiplication result vector.
(1)
(2)
(3) IF [...], go to step (6)
(4)
(5)
Go to step (3)
(6)
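The individual steps of B4 are not recoverable from the text above, so the following Python sketch shows only one plausible way to add a carry vector back into a vector of words with a loop that exits once no carry remains. It assumes that the carry vector holds one carry bit per word, that `adc` imitates a lane-wise add-with-carry, and that `shift_up` imitates a one-word shift toward the high end; the names `adc`, `shift_up`, and `add_back` and the word length `W = 32` are hypothetical.

```python
W = 32                      # assumed word length in bits
MASK = (1 << W) - 1

def adc(a, b, carry_in):
    """Lane-wise a + b + carry_in; returns (sum words, carry-out bits)."""
    out, carry_out = [], []
    for p, q, c in zip(a, b, carry_in):
        t = p + q + c
        out.append(t & MASK)
        carry_out.append(t >> W)
    return out, carry_out

def shift_up(carry):
    """Move each carry bit one word toward the high end; the top bit is dropped."""
    return [0] + carry[:-1]

def add_back(high, carry):
    """Add carry[j] into word j of high, rippling new carries upward."""
    zero = [0] * len(high)
    while any(carry):                           # exit test: no carry left
        high, carry = adc(high, zero, carry)    # absorb the current carries
        carry = shift_up(carry)                 # carry out of word j feeds word j+1
    return high
```

A carry out of word j can only arise from a carry into word j, so the carry positions move strictly upward and the loop ends after at most s passes. Dropping the topmost carry assumes the result fits in the vector, which holds for the high half of a product of two s-word integers.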
6. The implementation method as claimed in claim 1, characterized in that, when several high-order words of the multiplicand vector X or the multiplier vector Y are 0, the vector-length integer multiplication is optimized:
If the high s_x words of vector X are 0 and the high s_y words of vector Y are 0, and s_x > s_y, then the values of the words of vector X and vector Y are exchanged;
The high s_y words of vector Y are 0, so only the low s-s_y words of vector Y are spread and s-s_y rounds of vector multiplication are carried out, producing s-s_y low-order product vectors and s-s_y high-order product vectors;
For the 2(s-s_y) product vectors produced, step 103 then only needs to carry out s-s_y-1 rounds of [...]-type addition carry chain calculation, producing s-s_y+1 sum vectors and 1 addition carry vector;
On the s-s_y+1 sum vectors and the 1 addition carry vector produced above, s-s_y rounds of [...]-type addition carry chain calculation are performed, yielding the low-order result vector, the high-order sum vector, and the addition carry vector, where only the low s-s_y words of the low-order result vector are valid and its high s_y words are invalid;
The addition carry vector is added to the high-order sum vector to obtain the high-order result vector; the high-order result vector is merged with the low s-s_y words of the low-order result vector to form the multiplication result of the two vectors;
The above optimization performs s-s_y rounds of vector multiplication, s-s_y-1 rounds of [...]-type addition carry chain calculation, and s-s_y rounds of [...]-type addition carry chain calculation, saving s_y rounds of vector multiplication, s_y rounds of [...]-type addition carry chain calculation, and s_y rounds of [...]-type addition carry chain calculation.
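The saving described in claim 6 can be illustrated with a short Python sketch that counts the zero words at the high end of each operand, exchanges the operands when needed, and runs only s-s_y multiplication rounds. The sketch is an assumption-laden simplification: ordinary integer addition stands in for both addition carry chains, and the names `high_zero_words` and `multiply_skipping_zero_words` and the word length `W = 32` are hypothetical.

```python
W = 32                      # assumed word length in bits

def high_zero_words(v):
    """Number of consecutive zero words at the high end of v."""
    n = 0
    for word in reversed(v):
        if word != 0:
            break
        n += 1
    return n

def multiply_skipping_zero_words(x, y):
    """Return (product as an integer, number of multiplication rounds run)."""
    s = len(x)
    sx, sy = high_zero_words(x), high_zero_words(y)
    if sx > sy:
        # Exchange the operands so the one with more high zero words is spread.
        x, y, sy = y, x, sx
    rounds = s - sy
    total = 0
    for i in range(rounds):                 # only the low s - sy words of y
        for j in range(s):
            total += (x[j] * y[i]) << (W * (i + j))
    return total, rounds
```

For example, with s = 4, a multiplicand that occupies one word, and a multiplier that occupies two words, the operands are exchanged and a single round suffices instead of four.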
7. The implementation method as claimed in claim 1, characterized in that the method further includes: forming two vectors from n groups of multiplicands and multipliers, each no longer than 1/n of the vector length, and computing the n integer multiplications at once, where n is a natural number greater than or equal to 1;
The n multiplicands form one multiplicand vector and the n multipliers form one multiplier vector; every multiplicand and multiplier is word-aligned;
The length of a word in a vector is w bits. The words of the multiplier vector are spread only within their own group, i.e. the j-th word Y_i[j] of multiplier Y_i is spread into the words [...] to [...] of the vector; thus only [...] rounds of vector multiplication are needed, producing [...] low-order product vectors and [...] high-order product vectors;
The [...] groups of word-aligned high-order and low-order product vectors form one [...]-type addition carry chain, on which [...] rounds of [...]-type addition carry chain operations are carried out, producing [...] sum vectors;
The [...] sum vectors form one [...]-type addition carry chain, on which [...] rounds of [...]-type addition carry chain operations are carried out, producing [...] and [...]; the two addition carries are then added back;
[...] and [...] are taken apart according to the grouping of the multiplicands and multipliers and combined into the multiplication result of each group of integers.
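The packing idea of claim 7 can be sketched in Python as follows. The round counts are missing from the text above, so the sketch assumes that each of the n groups occupies g = s/n words and that g rounds are enough; it also replaces the addition carry chains with ordinary integer addition per group. The names `pack`, `spread_in_group`, and `packed_multiply` and the word length `W = 32` are hypothetical.

```python
W = 32                      # assumed word length in bits
MASK = (1 << W) - 1

def pack(values, g):
    """Concatenate the g-word representations of several integers."""
    return [(v >> (W * k)) & MASK for v in values for k in range(g)]

def spread_in_group(y, j, g):
    """Copy word j of each group into all g words of that same group."""
    return [y[(lane // g) * g + j] for lane in range(len(y))]

def packed_multiply(xs, ys, g):
    """Multiply xs[i] * ys[i] for every group i using shared vector rounds."""
    x, y = pack(xs, g), pack(ys, g)
    products = [0] * len(xs)
    for j in range(g):                          # g rounds, not len(x) rounds
        yj = spread_in_group(y, j, g)
        for lane, (p, q) in enumerate(zip(x, yj)):
            group, k = divmod(lane, g)
            products[group] += (p * q) << (W * (j + k))
    return products
```

With four 64-bit operand pairs and g = 2, the four products are obtained in two rounds over one 8-word vector, whereas treating each pair as a full-length vector would take eight rounds per pair.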
8. A large integer multiplication implementation device based on vector instructions, characterized in that it implements any of the methods of claims 1 to 7 on a processor that supports the vector add-with-carry instruction, and comprises: a vector splitting and combining module, a vector-length integer multiplication module, and a product summation module, wherein,
The vector splitting and combining module splits the multiplicand and the multiplier into one or more vectors each, or combines n groups of integers, each no longer than 1/n of the vector length, into two vectors, where n is a natural number greater than or equal to 1;
The vector-length integer multiplication module comprises two submodules: a vector multiplication module and a vector addition module;
The vector multiplication module multiplies the multiplicand vector by each word of the multiplier vector to obtain all product vectors;
The vector addition module comprises a [...]-type addition carry chain module and a [...]-type addition carry chain module; it forms the product vectors into one [...]-type addition carry chain and evaluates it, forms the resulting vectors into one [...]-type addition carry chain, and finally adds the two addition carries back to the sum vectors to obtain the product of the multiplicand vector and the multiplier vector; if the multiplicand or the multiplier is split into multiple vectors, the vector-length integer multiplication module must be called repeatedly;
The product summation module sums the product vectors produced by all calls to the vector-length integer multiplication module.
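The division of labour among the three modules of claim 8 can be sketched in Python. This is a structural illustration only: the vector length `VECTOR_BITS = 512` is an assumption, the function names `split`, `vector_length_multiply`, and `big_multiply` are hypothetical, and `vector_length_multiply` is a stand-in that uses Python's built-in multiplication rather than the vector multiplication and addition carry chains.

```python
VECTOR_BITS = 512                       # assumed vector length in bits
CHUNK_MASK = (1 << VECTOR_BITS) - 1

def split(n):
    """Vector splitting: cut n into vector-length integers, lowest first."""
    chunks = []
    while True:
        chunks.append(n & CHUNK_MASK)
        n >>= VECTOR_BITS
        if n == 0:
            return chunks

def vector_length_multiply(a, b):
    """Stand-in for the vector-length integer multiplication module."""
    return a * b

def big_multiply(x, y):
    """Product summation: add every partial product at its chunk offset."""
    total = 0
    for i, a in enumerate(split(x)):
        for j, b in enumerate(split(y)):
            total += vector_length_multiply(a, b) << (VECTOR_BITS * (i + j))
    return total
```

When either operand spans several vectors, `vector_length_multiply` is called once per pair of chunks, which mirrors the repeated calls to the vector-length integer multiplication module described in the claim.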
9. The implementation device as claimed in claim 8, characterized in that it is implemented on an Intel Xeon Phi coprocessor.
CN201410645961.8A 2014-11-14 2014-11-14 Large integer multiplication implementation method and device based on vector instruction Active CN104461449B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410645961.8A CN104461449B (en) 2014-11-14 2014-11-14 Large integer multiplication implementation method and device based on vector instruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410645961.8A CN104461449B (en) 2014-11-14 2014-11-14 Large integer multiplication implementation method and device based on vector instruction

Publications (2)

Publication Number Publication Date
CN104461449A (en) 2015-03-25
CN104461449B (en) 2018-02-27

Family

ID=52907570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410645961.8A Active CN104461449B (en) 2014-11-14 2014-11-14 Large integer multiplication implementation method and device based on vector instruction

Country Status (1)

Country Link
CN (1) CN104461449B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104699449B (en) * 2015-04-03 2017-09-29 中国科学院软件研究所 A kind of big addition of integer and subtraction multi-core parallel concurrent implementation method based on GMP
CN104793922B (en) * 2015-05-04 2017-08-25 中国科学院软件研究所 A kind of Parallel Implementation method of large integer multiplication Comba algorithms based on OpenMP
US10152321B2 (en) * 2015-12-18 2018-12-11 Intel Corporation Instructions and logic for blend and permute operation sequences
CN111651201B (en) * 2016-04-26 2023-06-13 中科寒武纪科技股份有限公司 Apparatus and method for performing vector merge operation
CN105930128B (en) * 2016-05-17 2018-11-06 中国科学院数据与通信保护研究教育中心 It is a kind of to realize that large integer multiplication calculates accelerated method using floating number computations
GB2553783B (en) * 2016-09-13 2020-11-04 Advanced Risc Mach Ltd Vector multiply-add instruction
CN109062604B (en) * 2018-06-26 2021-07-23 飞腾技术(长沙)有限公司 Emission method and device for mixed execution of scalar and vector instructions
CN111752528B (en) * 2020-06-30 2021-12-07 无锡中微亿芯有限公司 Basic logic unit supporting efficient multiplication operation
CN111752529B (en) * 2020-06-30 2021-12-07 无锡中微亿芯有限公司 Programmable logic unit structure supporting efficient multiply-accumulate operation
CN111966327A (en) * 2020-08-07 2020-11-20 南方科技大学 Mixed precision space-time multiplexing multiplier based on NAS (network attached storage) search and control method thereof

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411558A (en) * 2011-10-31 2012-04-11 中国人民解放军国防科学技术大学 Vector processor oriented large matrix multiplied vectorization realizing method
CN103942028A (en) * 2014-04-15 2014-07-23 中国科学院数据与通信保护研究教育中心 Large integer multiplication method and device applied to password technology

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104011661B (en) * 2011-12-23 2017-04-12 英特尔公司 Apparatus And Method For Vector Instructions For Large Integer Arithmetic

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
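Research and Fast Implementation of Large Integer Multiplication Algorithms; Sang Bo; China Master's Theses Full-text Database, Information Science and Technology; 2013-01-15 (No. 1); 1-97 *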

Also Published As

Publication number Publication date
CN104461449A (en) 2015-03-25

Similar Documents

Publication Publication Date Title
CN104461449B (en) Large integer multiplication implementation method and device based on vector instruction
CN107704916A (en) A kind of hardware accelerator and method that RNN neutral nets are realized based on FPGA
US20210350204A1 (en) Convolutional neural network accelerator
TW201905768A (en) Performing matrix multiplication in hardware
US20210349692A1 (en) Multiplier and multiplication method
CN103942028B (en) Apply large integer multiplication operation method and device in cryptographic technique
CN109716287A (en) The arithmetical circuit of reduced floating point precision
CN110413254B (en) Data processor, method, chip and electronic equipment
US20140136588A1 (en) Method and apparatus for multiplying binary operands
CN101847137B (en) FFT processor for realizing 2FFT-based calculation
CN104063357B (en) Processor and processing method
CN111966324A (en) Multi-elliptic curve scalar multiplier oriented implementation method, device and storage medium
JP2597736B2 (en) Fast multiplier
US7653676B2 (en) Efficient mapping of FFT to a reconfigurable parallel and pipeline data flow machine
US11861327B2 (en) Processor for fine-grain sparse integer and floating-point operations
CN106371808B (en) A kind of method and terminal of parallel computation
CN116561819A (en) Encryption and decryption method based on from-Cook on-loop polynomial multiplication and on-loop polynomial multiplier
CN104793922B (en) A kind of Parallel Implementation method of large integer multiplication Comba algorithms based on OpenMP
Jadhav et al. A novel high speed FPGA architecture for FIR filter design
KR20200072666A (en) Selective data processing method of convolution layer and neural network processor using thereof
CN114756203A (en) Base 4Booth multiplier and implementation method, arithmetic circuit and chip thereof
JP2022181161A (en) Sparse matrix multiplication in hardware
CA2055900C (en) Binary tree multiplier constructed of carry save adders having an area efficient floor plan
CN107220702A (en) A kind of Neural network optimization and device
JP2022101472A (en) Systems and methods for low latency modular multiplication

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant