A hardware-software combined accelerator and its implementation method
Technical field
The present invention relates to the field of implementing computing methods, and in particular to a hardware-software combined accelerator and its implementation method.
Background art
With the development of communication technology, information security has become increasingly important. Guaranteeing the security, integrity and non-repudiation of transmitted information has become a major problem to be solved in transmission. Various information encryption and decryption techniques have emerged for this purpose.
The RSA public key encryption algorithm is currently the most influential public key algorithm. It was developed in 1977 by Ron Rivest, Adi Shamir and Leonard Adleman at the Massachusetts Institute of Technology; it can resist all cryptographic attacks known to date and has been recommended by ISO as the public key data encryption standard. The RSA algorithm is based on a very simple number-theoretic fact: multiplying two large primes is easy, but factoring their product is extremely difficult, so the product can be made public as the encryption key. The data involved in the RSA algorithm are all very large, and the larger they are, the harder they are to crack; in binary representation they generally have 512 bits, 1024 bits, or even 2048 bits. We call such numbers large numbers. The ECC algorithm is also a public key algorithm; its principle is not elaborated here, but the data it operates on are also large numbers. Other encryption and decryption algorithms and similar applications also require large number computation.
The word size of a general-purpose processor is 32 bits or 64 bits, far smaller than the number of bits in a large number, so the processor cannot operate on large numbers directly. Because large numbers have many bits, the amount of computation involved is very large. Implementing large number computation in software (Fig. 1) is low in cost, but because of the large amount of computation it is slow, performs poorly and places high demands on the processor, which is unacceptable in some settings (for example, embedded applications). Implementing large number computation entirely in hardware (Fig. 2) is very fast and performs well, but requires more hardware resources, so the cost is higher.
A processor that performs large number computation already has certain hardware resources of its own, such as a multiplier and high-speed RAM. Large number computation in software in effect just calls the processor's existing resources. When invoked by software, however, only one particular hardware resource works at any given moment, and several hardware resources cannot work simultaneously: for example, at one moment only the multiplier runs a multiplication, and at another moment only the RAM is read. The processor's hardware resources therefore cannot work in parallel and cannot deliver their maximum performance. Based on the needs of large number computation and the characteristics of the processor's existing hardware resources, the processor is modified by adding auxiliary logic so that its hardware resources deliver maximum performance during large number computation. In this way, adding only a little hardware at a small extra cost significantly improves large number computation performance, achieving the best cost-performance ratio.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the above techniques by providing a hardware-software combined accelerator and its implementation method, which use a combined hardware-software approach to implement multiplication and division of large numbers, achieving high performance at only a small additional cost and striking a balance between cost and performance.
The technical solution adopted by the present invention to solve its technical problem is as follows: in this hardware-software combined accelerator, a large number multiplication accelerator is added to and connected with the processor. Within the large number multiplication accelerator, hardware logic is added to load data from the high-speed RAM into the single-cycle multiplier, and hardware logic is added to add the multiplication result to the data in the target high-speed RAM and write the sum back to the target high-speed RAM.
The implementation method of the hardware-software combined accelerator of the present invention is as follows:
(1) Large number A of length n, {A[n-1] ... A[2] A[1] A[0]}, is multiplied by large number B of length m, {B[m-1] ... B[2] B[1] B[0]}. Word B[0] of B is multiplied by large number A, giving an intermediate large number of length n+1, {C[n][0] ... C[2][0] C[1][0] C[0][0]}. This process is repeated with B[1], B[2] ... B[m-1], each multiplied by A, giving m intermediate large numbers in total. Finally, the intermediate results are each shifted left accordingly and added, giving a result of length m+n;
(2) Within the large number multiplication accelerator, hardware logic is added to load data from the high-speed RAM into the single-cycle multiplier, and hardware logic is added to add the multiplication result to the data in the target high-speed RAM and write the sum back to the target high-speed RAM;
(3) While the single-cycle multiplier is running, the data for the next multiplication are read and the target RAM data are read at the same time, so that on average each operation needs 1 cycle for the multiplication and 1 cycle for the addition and write-back to the target RAM, 2 cycles in total.
The single-cycle multiplier completes one 32bit*32bit multiplication in a single cycle, but outputting the result takes 2 cycles; the high-speed RAM completes one read operation or one write operation in a single cycle.
The beneficial effects of the present invention are as follows: the invention proposes a method of implementing large number computation by combining software and hardware. While making full use of the processor's existing hardware resources, only a small amount of hardware is added to handle the most time-consuming part of large number computation, and the remaining parts are done in software. The speed of large number computation is thus significantly improved with only a slight increase in cost, achieving a balance between cost and performance. The invention is therefore suitable for embedded applications and other settings with strict cost requirements.
Description of drawings
Fig. 1 is a schematic diagram of large number computation performed in software;
Fig. 2 is a schematic diagram of large number computation performed in hardware;
Fig. 3 is a schematic diagram of large number computation performed according to the present invention;
Fig. 4 is a schematic diagram of the principle of large number multiplication;
Fig. 5 is a schematic diagram of the principle of n*1 large number multiplication;
Fig. 6 is a schematic diagram of the operation of the n*1 large number multiplication accelerator;
Fig. 7 is a schematic diagram of decimal multiplication;
Fig. 8 shows the processor structure of the prior art;
Fig. 9 shows the processor structure of the present invention.
Embodiment
The present invention is further described below with reference to the accompanying drawings and an embodiment:
An example implementation of the large number multiplication accelerator:
1. Principle of large number multiplication:
As is well known, ordinary decimal multiplication can be carried out in column form (Fig. 7). Large number multiplication follows the same principle as ordinary decimal multiplication and can also be carried out in column form (Fig. 4);
If large number A consists of n 32-bit words, A is said to have length n. Large number A of length n, {A[n-1] ... A[2] A[1] A[0]}, is multiplied by large number B of length m, {B[m-1] ... B[2] B[1] B[0]}. Word B[0] of B is multiplied by large number A, giving an intermediate large number of length n+1, {C[n][0] ... C[2][0] C[1][0] C[0][0]}. This process is repeated with B[1], B[2] ... B[m-1], each multiplied by A, giving m intermediate large numbers in total. Finally, the intermediate results are each shifted left accordingly and added, giving a result of length m+n.
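The column multiplication described above can be sketched in software as follows. This is only an illustrative sketch, not code from the invention: the function names (`to_words`, `from_words`, `mul_n_by_1`, `big_mul`) and the little-endian word order (index 0 is the least significant 32-bit word) are assumptions made for the example. Instead of keeping all m intermediate results and adding them at the end, each intermediate result is added into the final result at its word offset, which is arithmetically equivalent.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


def to_words(value, length):
    """Split a non-negative integer into `length` 32-bit words, least significant first."""
    return [(value >> (WORD_BITS * i)) & WORD_MASK for i in range(length)]


def from_words(words):
    """Rebuild an integer from 32-bit words, least significant first."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))


def mul_n_by_1(a, b_word):
    """Multiply large number a (length n) by one word; the result has length n+1."""
    result = []
    carry = 0
    for a_word in a:
        t = a_word * b_word + carry
        result.append(t & WORD_MASK)
        carry = t >> WORD_BITS
    result.append(carry)
    return result


def big_mul(a, b):
    """Multiply a (length n) by b (length m); the result has length n+m."""
    n, m = len(a), len(b)
    c = [0] * (n + m)
    for j, b_word in enumerate(b):
        partial = mul_n_by_1(a, b_word)       # intermediate result of length n+1
        carry = 0
        for i, p_word in enumerate(partial):  # add at word offset j (the left shift)
            t = c[i + j] + p_word + carry
            c[i + j] = t & WORD_MASK
            carry = t >> WORD_BITS
    return c
```

For example, `from_words(big_mul(to_words(x, 4), to_words(y, 3)))` equals `x * y` for any `x` below 2**128 and `y` below 2**96.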
2. Resources available on the actual processor:
1) A single-cycle multiplier, which completes one 32bit*32bit multiplication in a single cycle, but takes 2 cycles to output the result.
2) High-speed RAM, which completes one read operation or one write operation in a single cycle.
3. Computation cost analysis:
Suppose m and n are both 32 and the processor word size is 32 bits. The above computation requires about 1024 multiplications and 2080 additions in total. Each operation must load 2 operands, taking 2 cycles, and store 1 result, taking 1 cycle; obtaining the result of each multiplication takes 2 cycles; each multiplication itself takes 1 cycle and each addition takes 1 cycle. At least 1024*(2+1+2+1)+2080*(2+1+1)=14464 cycles are therefore needed.
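The software cycle estimate above can be checked with a few lines of arithmetic. The sketch below simply restates the per-operation costs given in the text; the function and parameter names are illustrative assumptions.

```python
def software_cycles(multiplications=1024, additions=2080):
    """Minimum cycle count for the pure software approach, per the figures in the text."""
    load_operands = 2   # 2 operands loaded, 1 cycle each
    store_result = 1    # 1 result stored
    mul_result_out = 2  # multiplier result output latency
    mul_execute = 1     # the multiplication itself
    add_execute = 1     # the addition itself

    per_multiplication = load_operands + store_result + mul_result_out + mul_execute
    per_addition = load_operands + store_result + add_execute
    return multiplications * per_multiplication + additions * per_addition


print(software_cycles())  # 14464
```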
4. Design principle of the large number multiplication accelerator:
The processor's multiplication itself is very fast, taking only 1 cycle, but outputting the multiplication result takes 2 cycles; for each multiplication or addition, every operand input takes 1 cycle and the data output takes 1 cycle. The time spent outside the actual arithmetic thus exceeds the time spent on the arithmetic itself.
The large number multiplication accelerator is therefore designed as follows: hardware logic is added to load data automatically from the high-speed RAM into the single-cycle multiplier, and hardware logic is added to automatically add the multiplication result to the data in the target high-speed RAM and write the sum back to the target high-speed RAM (Fig. 8, Fig. 9).
While the multiplier is running, the data for the next multiplication can be read and the target RAM data can be read at the same time, so on average each operation needs only 1 cycle for the multiplication and 1 cycle for the addition and write-back to the target RAM, 2 cycles in total. With m and n both 32, one large number multiplication then takes about 32*32*2=2048 cycles, only 14.16% of the theoretical cycle count of the software computation, so the computation speed is greatly improved.
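The accelerator's cycle count and its ratio to the software estimate can be reproduced in the same way. This is a sketch of the arithmetic only; the function name and the constant name are assumptions, and the 14464-cycle software figure is taken from the computation cost analysis in section 3.

```python
SOFTWARE_CYCLES = 14464  # theoretical software cycle count from section 3


def accelerator_cycles(n=32, m=32, cycles_per_operation=2):
    """Approximate cycles for an n-word by m-word multiplication on the accelerator."""
    return n * m * cycles_per_operation


accel = accelerator_cycles()
ratio_percent = round(100 * accel / SOFTWARE_CYCLES, 2)
print(accel, ratio_percent)  # 2048 14.16
```

Put the other way round, the software estimate is roughly seven times the accelerator's cycle count.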
The following demonstrates how the accelerator works when multiplicand A of length n is multiplied by multiplier B of length 1:
Let A have length n, with 32-bit words A[0], A[1], A[2] ... A[n-1], and let B have length 1, with data B[0]. A is stored at high-speed RAM address A and B at high-speed RAM address B. The product C has length (n+1), with 32-bit words C[0], C[1], C[2] ... C[n], and is stored at high-speed RAM address C.
As shown in Fig. 6, result C[0] is obtained in the 3rd cycle, C[1] in the 5th cycle, and C[2] in the 7th cycle.
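The behaviour of the n*1 accelerator can be approximated with a small software model. This is a hedged sketch rather than a description of the actual hardware: the function name `accel_mac_n_by_1`, the use of a Python list as the high-speed RAM, and the carry handling are all assumptions. The timing rule (word i is ready in cycle 2*i+3) merely extrapolates the three data points stated for Fig. 6, and its extension to the last word C[n] is likewise assumed.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


def accel_mac_n_by_1(ram, addr_a, addr_b, addr_c, n):
    """Model of C[0..n] += A[0..n-1] * B[0] on RAM held as a list of 32-bit words.

    Returns a dict mapping each result index i to the cycle in which C[i]
    is assumed to become available.
    """
    b0 = ram[addr_b]
    carry = 0
    ready_cycle = {}
    for i in range(n):
        # multiply, then add the word already stored in the target RAM
        t = ram[addr_a + i] * b0 + ram[addr_c + i] + carry
        ram[addr_c + i] = t & WORD_MASK
        carry = t >> WORD_BITS
        ready_cycle[i] = 2 * i + 3  # cycles 3, 5, 7, ... as in Fig. 6
    # last word: only the carry is added (assumed not to overflow one word)
    ram[addr_c + n] = (ram[addr_c + n] + carry) & WORD_MASK
    ready_cycle[n] = 2 * n + 3      # assumed continuation of the pattern
    return ready_cycle


# Usage: A = 3 words at address 0, B = 1 word at address 4, C = 4 words at address 8.
ram = [0] * 16
ram[0:3] = [0xFFFFFFFF, 0xFFFFFFFF, 0x00000001]
ram[4] = 0xFFFFFFFF
cycles = accel_mac_n_by_1(ram, addr_a=0, addr_b=4, addr_c=8, n=3)
print(cycles[0], cycles[1], cycles[2])  # 3 5 7
```

Because the model adds into the target RAM instead of overwriting it, software can call it once per word of B with the target address advanced by one word each time to build up a full n*m product.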
5. Benefit analysis of the large number multiplication accelerator:
1) The existing CPU resources, namely the single-cycle 32-bit multiplier and the high-speed RAM, are fully used; only 3 RAM read logic blocks, 1 RAM write logic block and 1 adder logic block are added, so very few hardware resources are needed. As is well known, a single-cycle 32-bit multiplier and high-speed RAM require far more hardware resources than adder logic and read/write logic. The final accelerator design uses only about 5,000, whereas a pure hardware implementation needs about 80,000.
2) The changes to the CPU are small: only some new control logic is added, and the use of the CPU's original logic is not affected.
Analysis identifies the most time-consuming part of large number computation. Auxiliary hardware logic is then added, making full use of the processor's existing hardware resources, to implement this most time-consuming part, which greatly improves large number computation performance while the cost increases only slightly.
In addition to the above embodiment, the present invention may have other embodiments. All technical solutions formed by equivalent substitution or equivalent transformation fall within the scope of protection claimed by the present invention.