CN102520907A - Software and hardware integrated accelerator and implementation method for same - Google Patents

Info

Publication number
CN102520907A
Authority
CN
China
Prior art keywords
multiplication
data
accelerator
speed ram
software
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011104140657A
Other languages
Chinese (zh)
Inventor
杨波
徐功益
邱柏云
贺晓明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HANGZHOU SHENGYUAN CHIP TECHNIQUE CO Ltd
Original Assignee
HANGZHOU SHENGYUAN CHIP TECHNIQUE CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HANGZHOU SHENGYUAN CHIP TECHNIQUE CO Ltd
Priority to CN2011104140657A
Publication of CN102520907A
Legal status: Pending

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention relates to a combined software and hardware accelerator and a method for implementing it. A big-number multiplication accelerator is attached to a processor. Hardware logic added in the accelerator loads data from high-speed RAM (random access memory) into a single-cycle multiplier; further added hardware logic adds each multiplication result to the data in the target high-speed RAM and writes the sum back to the target high-speed RAM. While the single-cycle multiplier is running, the operands for the next multiplication and the target RAM data are read at the same time, so each operation needs on average one cycle for the multiplication and one cycle for the addition and write-back to the target RAM, two cycles in total. The advantage is that, while making full use of the processor's existing hardware resources, only a small amount of hardware is added to handle the most time-consuming part of big-number computation, and the remaining parts are done in software. Big-number computation is thus greatly accelerated at only a slight increase in cost, balancing cost and performance.

Description

A combined software and hardware accelerator and its implementation method
Technical field
The present invention relates to the field of implementing computing methods, and in particular to a combined software and hardware accelerator and its implementation method.
Background art
With the development of communication technology, information security has become increasingly important. Ensuring the security, integrity, and non-repudiation of transmitted information has become a major problem to be solved in transmission. Various information encryption and decryption techniques have emerged for this purpose.
The RSA public-key encryption algorithm is currently the most influential public-key algorithm. It was developed in 1977 by Ron Rivest, Adi Shamir, and Len Adleman at the Massachusetts Institute of Technology; it can resist all cryptographic attacks known so far and has been recommended by ISO as a public-key data encryption standard. RSA is based on a very simple number-theoretic fact: multiplying two large primes is easy, but factoring their product is extremely difficult, so the product can be made public as the encryption key. The numbers involved in RSA are all very large, and the larger they are the harder they are to crack; in binary they are generally 512, 1024, or even 2048 bits long. We call such numbers big numbers. ECC is also a public-key algorithm; its principle is not elaborated here, but the data it operates on are likewise big numbers. Other encryption and decryption algorithms also require big-number arithmetic.
The word size of a general-purpose processor is 32 or 64 bits, far smaller than the bit length of a big number, so big numbers cannot be computed on directly. Because big numbers have many bits, big-number computation involves a very large number of operations. Implementing big-number arithmetic in software (Fig. 1) is inexpensive, but because of the large operation count it is slow, performs poorly, and places high demands on the processor, which is unacceptable in some settings (for example, embedded applications). Implementing it entirely in hardware (Fig. 2) is very fast and performs well, but requires more hardware resources, that is, the cost is higher.
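To make the gap between processor word size and big-number size concrete, the following minimal Python sketch splits a big number into 32-bit words. The helper names `to_words` and `from_words` and the least-significant-word-first ordering are assumptions chosen for illustration; they do not appear in the patent.

```python
WORD_BITS = 32                      # processor word size assumed in the text
WORD_MASK = (1 << WORD_BITS) - 1

def to_words(value, length):
    """Split a non-negative integer into `length` 32-bit words, least significant first."""
    if value < 0 or value >> (WORD_BITS * length):
        raise ValueError("value does not fit in the requested number of words")
    return [(value >> (WORD_BITS * i)) & WORD_MASK for i in range(length)]

def from_words(words):
    """Recombine least-significant-first 32-bit words into one integer."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))

# A 1024-bit number occupies 1024 / 32 = 32 words on a 32-bit processor.
x = (1 << 1024) - 12345
words = to_words(x, 32)
print(len(words))                   # 32
print(from_words(words) == x)       # True
```

Every big-number operation then becomes a loop over these words, which is why the operation count grows so quickly.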
A processor that performs big-number arithmetic already has certain hardware resources, such as a multiplier and high-speed RAM. A software implementation of big-number arithmetic in effect just invokes these existing processor resources. When they are invoked from software, only one particular hardware resource can work at any given moment; several hardware resources cannot work simultaneously. For example, at one moment only the multiplier runs a multiplication, and at another only the RAM is read. Because the hardware resources cannot work in parallel, the processor cannot deliver its maximum performance. Based on the needs of big-number arithmetic and the characteristics of the processor's existing hardware resources, the processor is modified by adding auxiliary logic so that its hardware resources deliver maximum performance during big-number computation. Adding only a little hardware, at a small increase in cost, significantly improves big-number performance and thus achieves the best cost-performance ratio.
Summary of the invention
The object of the present invention is to overcome the shortcomings of the above techniques by providing a combined software and hardware accelerator and its implementation method. A combined software and hardware approach is used to implement big-number multiplication and division, achieving high performance at only a small increase in cost and striking a balance between cost and performance.
The technical solution adopted by the present invention is as follows. In this combined software and hardware accelerator, a big-number multiplication accelerator is added and connected to the processor. Hardware logic added in the big-number multiplication accelerator loads data from high-speed RAM into the single-cycle multiplier; further added hardware logic adds the multiplication result to the data in the target high-speed RAM and outputs the sum back to the target high-speed RAM.
The implementation method of the combined software and hardware accelerator of the present invention is as follows:
(1) A big number A of length n, {A[n-1] ... A[2] A[1] A[0]}, is multiplied by a big number B of length m, {B[m-1] ... B[2] B[1] B[0]}. B[0] of big number B is multiplied by big number A, giving an intermediate big number of length n+1, {C[n][0] ... C[2][0] C[1][0] C[0][0]}. The process is repeated with B[1], B[2], ..., B[m-1], each multiplied by big number A, giving m intermediate big numbers in total. Finally, these intermediate results are each shifted left and added, giving a result big number of length m+n;
(2) Hardware logic added in the big-number multiplication accelerator loads data from high-speed RAM into the single-cycle multiplier; further added hardware logic adds the multiplication result to the data in the target high-speed RAM and outputs the sum back to the target high-speed RAM;
(3) While the single-cycle multiplier is running, the operands for the next multiplication are read and the target RAM data are read at the same time; each operation needs on average 1 cycle for the multiplication and 1 cycle for the addition and write-back to the target RAM, 2 cycles in total.
The single-cycle multiplier completes one 32-bit × 32-bit multiplication in a single cycle, but outputting the result takes 2 cycles. The high-speed RAM completes one read operation or one write operation in a single cycle.
The beneficial effects of the present invention are as follows. The invention proposes a method that combines software and hardware to implement big-number computation. While making full use of the processor's existing hardware resources, only a small amount of hardware is added to handle the most time-consuming part of big-number computation, and the remaining parts are done in software. The speed of big-number computation is thus significantly improved at only a slight increase in cost, balancing cost and performance. The invention is therefore suitable for embedded applications and other settings with strict cost requirements.
Description of drawings
Fig. 1 is a schematic diagram of big-number arithmetic performed in software;
Fig. 2 is a schematic diagram of big-number arithmetic performed in hardware;
Fig. 3 is a schematic diagram of big-number arithmetic performed according to the present invention;
Fig. 4 is a schematic diagram of the principle of big-number multiplication;
Fig. 5 is a schematic diagram of the principle of n*1 big-number multiplication;
Fig. 6 is a schematic diagram of the operation of the n*1 big-number multiplication accelerator;
Fig. 7 is a schematic diagram of decimal multiplication;
Fig. 8 shows the processor structure of the prior art;
Fig. 9 shows the processor structure of the present invention.
Embodiment
The present invention is further described below with reference to the drawings and an embodiment:
An example implementation of the big-number multiplication accelerator:
1. Principle of big-number multiplication:
As is well known, ordinary decimal multiplication can be carried out as long (column) multiplication (Fig. 7). The principle of big-number multiplication is the same as that of ordinary decimal multiplication, and long multiplication can likewise be used (Fig. 4);
If a big number A consists of n 32-bit words, A is said to have length n. A big number A of length n, {A[n-1] ... A[2] A[1] A[0]}, is multiplied by a big number B of length m, {B[m-1] ... B[2] B[1] B[0]}. B[0] of big number B is multiplied by big number A, giving an intermediate big number of length n+1, {C[n][0] ... C[2][0] C[1][0] C[0][0]}. The process is repeated with B[1], B[2], ..., B[m-1], each multiplied by big number A, giving m intermediate big numbers in total. Finally, these intermediate results are each shifted left and added, giving a result big number of length m+n.
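The long-multiplication procedure described above can be sketched in Python as follows. This is an illustrative software model only, not the patent's implementation; the function names `mul_by_word`, `big_mul`, and `words_to_int` are hypothetical, and words are stored least significant first.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1

def words_to_int(words):
    """Helper for checking results: combine least-significant-first words."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))

def mul_by_word(a_words, b_word):
    """Multiply an n-word number by one word, giving an intermediate result of n+1 words."""
    row, carry = [], 0
    for a in a_words:
        t = a * b_word + carry          # always fits in 64 bits
        row.append(t & WORD_MASK)
        carry = t >> WORD_BITS
    row.append(carry)
    return row

def big_mul(a_words, b_words):
    """Multiply an n-word number by an m-word number, giving n+m words."""
    n, m = len(a_words), len(b_words)
    result = [0] * (n + m)
    for j, b in enumerate(b_words):
        row = mul_by_word(a_words, b)   # intermediate result for B[j]
        carry = 0
        for i, r in enumerate(row):     # add the row shifted left by j words
            t = result[i + j] + r + carry
            result[i + j] = t & WORD_MASK
            carry = t >> WORD_BITS
    return result

a = [0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF]
b = [0xFFFFFFFF, 0x12345678]
print(words_to_int(big_mul(a, b)) == words_to_int(a) * words_to_int(b))  # True
```

The inner loops make the cost visible: an n-word by m-word product needs n*m word multiplications plus the additions that combine the shifted rows.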
2. Resources available on the actual processor:
1) A single-cycle multiplier, which can complete one 32-bit × 32-bit multiplication in a single cycle, although outputting the result takes 2 cycles.
2) High-speed RAM, which can complete one read operation or one write operation in a single cycle.
3. Analysis of the amount of computation:
Assume m and n are both 32 and the processor word size is 32 bits. The computation above needs about 1024 multiplications and 2080 additions in total. Each operation must input 2 data items, taking 2 cycles, and save 1 data item, taking 1 cycle; each multiplication needs 2 cycles to output its result; each multiplication itself needs 1 cycle and each addition needs 1 cycle. The minimum is therefore 1024*(2+1+2+1) + 2080*(2+1+1) = 14464 cycles.
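The software cycle estimate above can be reproduced with a few lines of Python. The function name `software_cycles` and the variable names are illustrative assumptions; the per-step cycle counts and the operation counts (1024 multiplications, 2080 additions) are taken directly from the text.

```python
def software_cycles(multiplications=1024, additions=2080):
    """Cycle estimate for the pure software approach, using the figures in the text."""
    load = 2         # input 2 data items, 1 cycle each
    store = 1        # save 1 data item
    mul_output = 2   # multiplier result output
    mul_exec = 1     # the multiplication itself
    add_exec = 1     # the addition itself
    per_mul = load + store + mul_output + mul_exec   # 2+1+2+1 = 6
    per_add = load + store + add_exec                # 2+1+1 = 4
    return multiplications * per_mul + additions * per_add

print(software_cycles())  # 14464
```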
4. Design principle of the big-number multiplication accelerator:
The processor multiplies very quickly, needing only 1 cycle, but outputting the multiplication result needs 2 cycles. For each multiplication and addition, inputting data takes 1 cycle and outputting data takes 1 cycle, so the time spent outside the actual computation exceeds the computation time itself.
The big-number multiplication accelerator is therefore designed as follows: hardware logic is added to load data from high-speed RAM into the single-cycle multiplier automatically, and further hardware logic is added to add the multiplication result to the data in the target high-speed RAM automatically and output the sum back to the target high-speed RAM. (Fig. 8, Fig. 9)
While the multiplier is running, the operands for the next multiplication and the target RAM data can be read at the same time, so each operation needs on average only 1 cycle for the multiplication and 1 cycle for the addition and write-back to the target RAM, 2 cycles in total. With m and n both 32, one big-number multiplication takes about 32*32*2 = 2048 cycles, only 14.16% of the theoretical cycle count of the software computation, so computation speed is greatly improved.
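The accelerator estimate and its ratio to the software figure can be checked the same way. The names below are illustrative assumptions; the 2 cycles per operation and the 14464-cycle software figure come from the text.

```python
def accelerator_cycles(n_words=32, m_words=32, cycles_per_operation=2):
    """Approximate cycle count with the accelerator: one multiply-add step per word pair."""
    return n_words * m_words * cycles_per_operation

SOFTWARE_CYCLES = 14464   # software estimate stated in the text

accel = accelerator_cycles()
print(accel)                                      # 2048
print(f"{100 * accel / SOFTWARE_CYCLES:.2f}%")    # 14.16%
```

Expressed the other way round, this corresponds to a speedup of roughly 7 times over the software estimate.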
The following demonstrates how the accelerator works when a multiplicand A of length n is multiplied by a multiplier B of length 1:
Denote the multiplicand A of length n, with 32-bit words A[0], A[1], A[2], ..., A[n-1], and the multiplier B of length 1, with data B[0]. A is stored at high-speed RAM address A and B at high-speed RAM address B. The product C has length (n+1), with 32-bit words C[0], C[1], C[2], ..., C[n], and is stored at high-speed RAM address C.
As shown in Fig. 6, the result C[0] is obtained in the 3rd cycle, C[1] in the 5th cycle, and C[2] in the 7th cycle.
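The timing stated for Fig. 6 follows a simple pattern: one result word every 2 cycles after an initial latency. The sketch below extrapolates that pattern; the formula 3 + 2*i is an assumption inferred from the three data points given in the text (cycles 3, 5, and 7), not something the patent states for later words.

```python
def result_ready_cycle(i):
    """Cycle in which C[i] is obtained, extrapolated from C[0]:3, C[1]:5, C[2]:7."""
    return 3 + 2 * i

for i in range(4):
    print(f"C[{i}] obtained in cycle {result_ready_cycle(i)}")
```

Under this assumption, the steady-state cost is 2 cycles per word, which matches the per-operation average given in the design principle.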
5. Benefit analysis of the big-number multiplication accelerator
1) The CPU's existing resources, the single-cycle 32-bit multiplier and the high-speed RAM, are fully used. Only 3 RAM read logic blocks, 1 RAM write logic block, and 1 adder logic block are added, so very few hardware resources are needed. As is well known, a single-cycle 32-bit multiplier and high-speed RAM require far more hardware than adder logic and read/write logic. The final accelerator design uses only about 5,000, whereas a full hardware implementation needs about 80,000.
2) Few changes are made to the CPU: only some new control logic is added, and the use of the CPU's existing logic is not affected.
Analysis identifies the most time-consuming part of big-number arithmetic. Auxiliary hardware logic is added so that the processor's existing hardware resources are fully used to carry out this most time-consuming part, which substantially improves big-number performance while cost increases only slightly.
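The division of labor described here can be sketched as a software outer loop that drives a modeled accelerator step. This is a behavioral model under stated assumptions, not the patent's hardware: the function `accel_mul_acc`, the RAM layout, passing the multiplier word as an argument rather than reading it from RAM, and the carry handling between words are all hypothetical choices made for illustration.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1

def accel_mul_acc(ram, a_addr, n, b_word, c_addr):
    """Modeled accelerator step: add A[0..n-1] * b_word into target RAM words C[0..n]."""
    carry = 0
    for i in range(n):
        # multiply, add the existing target RAM word, write the low word back
        t = ram[a_addr + i] * b_word + ram[c_addr + i] + carry
        ram[c_addr + i] = t & WORD_MASK
        carry = t >> WORD_BITS
    t = ram[c_addr + n] + carry
    ram[c_addr + n] = t & WORD_MASK

def big_mul_with_accel(a_words, b_words):
    """Software part: loop over the words of B and invoke the accelerator once per word."""
    n, m = len(a_words), len(b_words)
    a_addr, c_addr = 0, n                     # assumed layout: A first, then result C
    ram = list(a_words) + [0] * (n + m)
    for j, b in enumerate(b_words):
        # a target offset of j words plays the role of the left shift
        accel_mul_acc(ram, a_addr, n, b, c_addr + j)
    return ram[c_addr:c_addr + n + m]

def words_to_int(words):
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))

a = [0x89ABCDEF, 0x01234567, 0xFFFFFFFF, 0xDEADBEEF]
b = [0xFFFFFFFF, 0xCAFEBABE, 0x00000001]
print(words_to_int(big_mul_with_accel(a, b)) == words_to_int(a) * words_to_int(b))  # True
```

Accumulating directly into the target RAM means the intermediate results never need to be stored separately, which is one plausible reading of why the added adder and read/write logic suffice.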
In addition to the embodiment described above, the present invention may have other embodiments. All technical solutions formed by equivalent replacement or equivalent transformation fall within the scope of protection claimed by the present invention.

Claims (4)

1. A combined software and hardware accelerator, characterized in that: a big-number multiplication accelerator is added and connected to a processor; hardware logic added in the big-number multiplication accelerator loads data from high-speed RAM into a single-cycle multiplier; and further added hardware logic adds the multiplication result to the data in the target high-speed RAM and outputs the sum back to the target high-speed RAM.
2. An implementation method using the combined software and hardware accelerator of claim 1, characterized in that:
(1) A big number A of length n, {A[n-1] ... A[2] A[1] A[0]}, is multiplied by a big number B of length m, {B[m-1] ... B[2] B[1] B[0]}.
3. B[0] of big number B is multiplied by big number A, giving an intermediate big number of length n+1, {C[n][0] ... C[2][0] C[1][0] C[0][0]}; the process is repeated with B[1], B[2], ..., B[m-1], each multiplied by big number A, giving m intermediate big numbers in total; finally, these intermediate results are each shifted left and added, giving a result big number of length m+n;
(2) Hardware logic added in the big-number multiplication accelerator loads data from high-speed RAM into the single-cycle multiplier; further added hardware logic adds the multiplication result to the data in the target high-speed RAM and outputs the sum back to the target high-speed RAM;
(3) While the single-cycle multiplier is running, the operands for the next multiplication are read and the target RAM data are read at the same time; each operation needs on average 1 cycle for the multiplication and 1 cycle for the addition and write-back to the target RAM, 2 cycles in total.
4. The implementation method of the combined software and hardware accelerator according to claim 1, characterized in that: the single-cycle multiplier completes one 32-bit × 32-bit multiplication in a single cycle, but outputting the result takes 2 cycles; and the high-speed RAM completes one read operation or one write operation in a single cycle.
CN2011104140657A 2011-12-13 2011-12-13 Software and hardware integrated accelerator and implementation method for same Pending CN102520907A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011104140657A CN102520907A (en) 2011-12-13 2011-12-13 Software and hardware integrated accelerator and implementation method for same

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011104140657A CN102520907A (en) 2011-12-13 2011-12-13 Software and hardware integrated accelerator and implementation method for same

Publications (1)

Publication Number Publication Date
CN102520907A true CN102520907A (en) 2012-06-27

Family

ID=46291850

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011104140657A Pending CN102520907A (en) 2011-12-13 2011-12-13 Software and hardware integrated accelerator and implementation method for same

Country Status (1)

Country Link
CN (1) CN102520907A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1310816A (en) * 1998-07-22 2001-08-29 Motorola Circuit and method of modulo multiplication
CN1411578A (en) * 2000-03-27 2003-04-16 Infineon Technologies Method and apparatus for adding user-defined execution units to processor using configurable long instruction word (CLIW)
CN1886744A (en) * 2002-05-13 2006-12-27 Tensilica Method and apparatus for adding advanced instructions in an extensible processor architecture

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
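李占才等 (Li Zhancai et al.): "RSA快速硬件实现研究" [Research on fast hardware implementation of RSA], 《计算机研究与发展》 [Journal of Computer Research and Development], vol. 38, no. 11, 30 November 2001 (2001-11-30), pages 1360 - 1365 *
温暖等 (Wen Nuan et al.): "大数乘法器的设计与硬件实现" [Design and hardware implementation of a big-number multiplier], 《微电子学与计算机》 [Microelectronics & Computer], vol. 21, no. 5, 31 May 2004 (2004-05-31), pages 154 - 156 *
蔡敏等 (Cai Min et al.): "基于RSA算法的大数乘法器设计" [Design of a big-number multiplier based on the RSA algorithm], 《半导体技术》 [Semiconductor Technology], vol. 30, no. 8, 31 August 2005 (2005-08-31), pages 65 - 68 *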

Similar Documents

Publication Publication Date Title
CN112988237B (en) Paillier decryption system, chip and method
Mert et al. FPGA implementation of a run-time configurable NTT-based polynomial multiplication hardware
Cao et al. High-speed fully homomorphic encryption over the integers
CN104461449A (en) Large integer multiplication realizing method and device based on vector instructions
CN104579656A (en) Hardware acceleration coprocessor for elliptic curve public key cryptosystem SM2 algorithm
Fadhil et al. Parallelizing RSA algorithm on multicore CPU and GPU
Khan et al. High speed ECC implementation on FPGA over GF(2^m)
Bosmans et al. A tiny coprocessor for elliptic curve cryptography over the 256-bit NIST prime field
Bos Low-latency elliptic curve scalar multiplication
Seo et al. Parallel implementations of LEA
Néto et al. A Parallel and Uniform k-Partition Method for Montgomery Multiplication
Lin et al. Efficient parallel RSA decryption algorithm for manycore GPUs with CUDA
CN102520907A (en) Software and hardware integrated accelerator and implementation method for same
Judge et al. A Hardware‐Accelerated ECDLP with High‐Performance Modular Multiplication
Reymond et al. A hardware pipelined architecture of a scalable Montgomery modular multiplier over GF(2^m)
Li et al. High-speed implementation of SM2 based on fast modulus inverse algorithm
Zhuo et al. High-performance and area-efficient reduction circuits on FPGAs
CN104461469A (en) Method for achieving SM2 algorithm through GPU in parallelization mode
Zeng et al. The implementation of polynomial multiplication for lattice-based cryptography: A survey
Chen et al. Integer number crunching on the cell processor
Lee et al. Acceleration of differential power analysis through the parallel use of gpu and cpu
Wang et al. High radix montgomery modular multiplier on modern fpga
Shi et al. A 28nm 68MOPS 0.18 μJ/Op Paillier Homomorphic Encryption Processor with Bit-Serial Sparse Ciphertext Computing
US11792004B2 (en) Polynomial multiplication for side-channel protection in cryptography
US20220060315A1 (en) Sign-based partial reduction of modular operations in arithmetic logic units

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent of invention or patent application
CB02 Change of applicant information

Address after: The city of Hangzhou in West Zhejiang province 311121 No. 998 Building 9 East Sea Park

Applicant after: Hangzhou Shengyuan Chip Technique Co., Ltd.

Address before: 310012, room 17, building 176, 203 Tianmu Mountain Road, Hangzhou, Zhejiang, Xihu District

Applicant before: Hangzhou Shengyuan Chip Technique Co., Ltd.

C53 Correction of patent of invention or patent application
CB02 Change of applicant information

Address after: Hangzhou City, Zhejiang province 311121 Yuhang Wuchang Street No. 998 West Sea Park Building 9 East

Applicant after: Hangzhou Shengyuan Chip Technique Co., Ltd.

Address before: The city of Hangzhou in West Zhejiang province 311121 No. 998 Building 9 East Sea Park

Applicant before: Hangzhou Shengyuan Chip Technique Co., Ltd.

C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20120627