WO2023236899A1 - Data processing method, apparatus, device, and storage medium - Google Patents

Data processing method, apparatus, device, and storage medium

Info

Publication number
WO2023236899A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
processing
bits
computing
calculation unit
Prior art date
Application number
PCT/CN2023/098288
Inventor
袁壄
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2023236899A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/30 Public key, i.e. encryption algorithm being computationally infeasible to invert or user's encryption keys not requiring secrecy
    • H04L9/3006 Public key, i.e. encryption algorithm being computationally infeasible to invert or user's encryption keys not requiring secrecy underlying computational problems or public-key parameters
    • H04L9/3026 Public key, i.e. encryption algorithm being computationally infeasible to invert or user's encryption keys not requiring secrecy underlying computational problems or public-key parameters details relating to polynomials generation, e.g. generation of irreducible polynomials
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/08 Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/08 Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
    • H04L9/0861 Generation of secret information including derivation or calculation of cryptographic keys or passwords
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/30 Public key, i.e. encryption algorithm being computationally infeasible to invert or user's encryption keys not requiring secrecy

Definitions

  • the present application relates to the field of computer technology, and in particular to a data processing method, device, equipment and storage medium.
  • Embodiments of the present application provide a data processing method, device, equipment and storage medium, which can improve the efficiency of running number theory transformations.
  • the technical solution is as follows.
  • a data processing method is provided, which is executed by a computing device.
  • the computing device is used to run a number theory transformation of data.
  • the number theory transformation of the data includes a plurality of calculation units, and the method includes:
  • a first calculation unit is determined from the plurality of calculation units, where the first calculation unit is the calculation unit used to perform reduction processing on the processing result of a second calculation unit, and the estimated number of bits of the processing result of the second calculation unit reaches the preset number of bits.
  • the calculation unit responsible for the reduction processing is determined in advance based on the estimated number of bits, so that there is no need to introduce logical branch statements.
  • the value of the processing result can also be made smaller at the appropriate position, thereby reducing the number of bits in the processing result and preventing the number of bits in the processing result from exceeding the upper limit of the number of bits that the computing device can represent, thus avoiding overflow.
  • this method can remove logical branch statements and optimize the structure of number theory transformations, thereby improving the efficiency of running number theory transformations.
  • the reduction process includes:
  • the processing result of the second calculation unit is subjected to redundant modular multiplication processing.
  • when the reduction process is implemented using redundant modular multiplication, on the one hand, the reduction process does not have to be bound to the Montgomery algorithm, and the data does not need to remain in Montgomery representation.
  • the scheme is available regardless of whether the data is expressed in Montgomery representation or non-Montgomery representation, thus improving the flexibility and practicality of the scheme.
  • on the other hand, redundant modular multiplication also increases the speed of reduction processing, thereby improving efficiency. In particular, it helps to significantly accelerate the computing process of computing devices in scenarios such as large number operations.
  • performing redundant modular multiplication processing on the processing results of the second computing unit includes:
  • the processing result of the second calculation unit is subjected to a redundant modular multiplication process based on a twiddle factor, the twiddle factor having the same representation form as the data.
  • the representation is a Montgomery representation or a non-Montgomery representation.
  • the method further includes:
  • Encryption processing or decryption processing is performed on the processing result after the reduction processing by the second calculation unit.
  • the parameters include a modulus used by each of the plurality of computing units when performing a modulo operation, a redundancy multiple of the data relative to the modulus, and the polynomial dimension of the data.
  • the redundancy of the input data is described by the modulus and the redundancy multiple. Therefore, even if the data has redundancy, the calculation units that require reduction processing can be located more accurately, thereby reducing unnecessary reduction processing.
  • the preset number of bits is determined based on the number of bits of a processor in the computing device, and the preset number of bits is 1 or 2 less than the number of bits of the processor.
  • the preset number of bits is determined based on hardware factors such as the number of processor bits, so that the preset number of bits adapts to the capabilities of the hardware. Different preset numbers of bits can be determined for hardware with different capabilities, and the calculation unit that requires reduction processing is found based on the preset number of bits. This locates the calculation units that require reduction processing more accurately, thereby reducing unnecessary reduction processing.
  • each of the plurality of computing units is further configured to perform subtraction processing based on a redundant value, where the redundant value is a value greater than or equal to the subtrahend in the subtraction processing.
  • the number theory transformation function includes a forward number theory transformation function and an inverse number theory transformation function.
  • the subtraction processing of the forward number theory transformation function is x-y*w mod 2q or x mod 2q-y*w mod 2q.
  • the subtraction processing of the inverse number theory transformation function is [formula lost in extraction], where mod represents the modulo operation, * represents multiplication, and - represents subtraction.
  • this method determines the redundancy value based on data-related parameters, so that the determined redundancy value adapts to the values of the parameters, thereby improving accuracy. Additionally, the redundancy value does not need to be tied to a single set of parameters, but can be adjusted as the parameter values change, so the solution supports more parameter choices, improving scalability and practicality.
  • the number theory transformation includes a forward number theory transformation, and the redundancy value is equal to 2q, where q represents the modulus used by each of the plurality of calculation units when performing a modulo operation, and q is a positive integer.
  • the characteristic of the forward number theory transformation is that the modular multiplication processing is performed first, and then the addition processing and subtraction processing are performed.
  • the value range of modular multiplication processing is controllable. For example, when the multiplication is implemented using redundant modular multiplication, the result of the modular multiplication processing lies within [0, 2q); when the multiplication is implemented using modular multiplication without redundancy, the result lies within [0, q), where q is the modulus.
  • the subtrahend therefore lies within [0, 2q), so a redundancy value of 2q is always greater than the subtrahend. This guarantees that the result of the subtraction is not negative, which contributes to the correctness of the operation.
  • the redundancy value used is as small as possible to avoid excessive processing overhead and storage overhead due to excessive redundancy values.
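  • The forward-transform subtraction with a redundancy value of 2q can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the function name `ntt_crossover_lazy` is hypothetical, and the modular product t = y*w is assumed to already lie in [0, 2q), as described above for redundant modular multiplication.

```python
def ntt_crossover_lazy(x, t, q):
    """One radix-2 forward crossover without comparisons or branches.

    x : first input, non-negative (may carry redundancy, i.e. x >= q is allowed)
    t : modular product y*w, assumed to lie in [0, 2q)
    q : modulus
    Returns (upper, lower), congruent modulo q to x + y*w and x - y*w.
    """
    redundancy = 2 * q           # redundancy value for the forward transform
    upper = x + t                # addition needs no correction
    lower = x + redundancy - t   # t < 2q, so lower > x >= 0
    return upper, lower
```

  • Because the correction is a fixed addition rather than a test such as "if the result is negative, add q", the crossover contains no logical branch. The price is that both outputs grow by up to 2q per crossover, which is why reduction must still happen at some planned positions.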
  • the number theory transformation includes an inverse number theory transformation, and the redundancy value is equal to (t+n)*q, where q represents the modulus used by each of the plurality of calculation units when performing a modulo operation, t represents the redundancy multiple of the data relative to the modulus, n represents the polynomial dimension of the data, and t, n and q are positive integers.
  • adding n*q on top of t*q as the redundancy value supports data of any redundancy multiple as input, while ensuring that the result of the subtraction is not negative, which contributes to the correctness of the operation.
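  • A short worked check of why adding the redundancy value preserves correctness. It assumes only that the minuend x is non-negative and that the subtrahend y does not exceed (t+n)q; the symbol names x and y are chosen here for illustration.

```latex
\begin{aligned}
x + (t+n)\,q - y &\equiv x - y \pmod{q}, \\
x \ge 0,\quad 0 \le y \le (t+n)\,q &\;\Longrightarrow\; x + (t+n)\,q - y \ge 0 .
\end{aligned}
```

  • The first line holds because (t+n)q is a multiple of q; the second shows that the result never becomes negative, so no sign check is needed.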
  • each of the plurality of computing units is used to process k data to generate k processing results, where k is a positive integer.
  • in a second aspect, a data processing device is provided, which has the function of implementing the above-mentioned first aspect or any optional manner of the first aspect.
  • the device includes at least one module, and the at least one module is used to implement the method provided by the above-mentioned first aspect or any optional manner of the first aspect.
  • in some embodiments, the modules in the device are implemented by software, and the modules in the device are program modules. In other embodiments, the modules in the device are implemented in hardware or firmware.
  • in a third aspect, a computing device is provided. The computing device includes a processor coupled to a memory, and at least one computer program instruction is stored in the memory. The at least one computer program instruction is loaded and executed by the processor, so that the computing device implements the method provided by the above-mentioned first aspect or any optional manner of the first aspect.
  • in a fourth aspect, a computer-readable storage medium is provided, which stores at least one instruction. When the instruction is run on a computer, it causes the computer to execute the method provided by the above-mentioned first aspect or any optional manner of the first aspect.
  • in a fifth aspect, a computer program product is provided, including one or more computer program instructions. When the computer program instructions are loaded and run by a computer, they cause the computer to execute the method provided by the above-mentioned first aspect or any optional manner of the first aspect.
  • in a sixth aspect, a chip is provided. The chip includes programmable logic circuits and/or program instructions. When the chip runs, it is used to implement the method provided in the above-mentioned first aspect or any optional manner of the first aspect.
  • Figure 1 is a flow chart of an NTT provided by an embodiment of the present application.
  • FIG. 2 is a flow chart of an INTT provided by an embodiment of the present application.
  • Figure 3 is a schematic diagram of the calculation process of a crossover in a butterfly in Radix-2 NTT provided by the embodiment of the present application;
  • Figure 4 is a schematic diagram of the calculation process of a crossover in a butterfly in Radix-2 INTT provided by the embodiment of the present application;
  • Figure 5 is a schematic diagram of a principle for calculating polynomial multiplication provided by an embodiment of the present application.
  • Figure 6 is a flow chart of a data processing method provided by an embodiment of the present application.
  • Figure 7 is a schematic diagram of changes in the number of bits of data when running NTT provided by an embodiment of the present application.
  • Figure 8 is a schematic diagram of changes in the number of bits of data when running INTT provided by an embodiment of the present application.
  • Figure 9 is an architectural diagram of an NTT provided by an embodiment of the present application.
  • Figure 10 is an architectural diagram of an NTT precomputation module provided by an embodiment of the present application.
  • Figure 11 is an architectural diagram of an NTT generation module provided by an embodiment of the present application.
  • Figure 12 is an architecture diagram of an INTT provided by an embodiment of the present application.
  • Figure 13 is an architectural diagram of an INTT pre-computation module provided by an embodiment of the present application.
  • Figure 14 is an architectural diagram of an INTT generation module provided by an embodiment of the present application.
  • Figure 15 is a schematic diagram of a calculation method of redundant growth crossover in Radix-2 NTT provided by the embodiment of the present application.
  • Figure 16 is a schematic diagram of a redundancy reduction crossover calculation method in Radix-2 NTT provided by the embodiment of the present application.
  • Figure 17 is a schematic diagram of a calculation method of redundant growth crossover in Radix-2 INTT provided by the embodiment of the present application.
  • Figure 18 is a schematic diagram of a redundancy reduction crossover calculation method in Radix-2 INTT provided by an embodiment of the present application.
  • Figure 19 is a schematic structural diagram of a data processing device 800 provided by an embodiment of the present application.
  • Figure 20 is a schematic structural diagram of a computing device 900 provided by an embodiment of the present application.
  • g is called a primitive n-th root of unity.
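  • As a concrete reading of this term, the sketch below checks the standard definition of a primitive n-th root of unity modulo q: g^n ≡ 1 mod q, and no smaller positive power of g equals 1. The function name and the sample values (q = 17, n = 8) are illustrative assumptions, not values taken from the application.

```python
def is_primitive_root_of_unity(g, n, q):
    """Return True if g is a primitive n-th root of unity modulo q."""
    if pow(g, n, q) != 1:
        return False
    # No smaller positive power of g may already equal 1.
    return all(pow(g, k, q) != 1 for k in range(1, n))


# 2^8 = 256 = 15*17 + 1, and no smaller power of 2 is 1 modulo 17.
print(is_primitive_root_of_unity(2, 8, 17))
# 4^4 = 256 is already 1 modulo 17, so 4 has order 4, not 8.
print(is_primitive_root_of_unity(4, 8, 17))
```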
  • Number theoretic transform (NTT)
  • a stage is all processing steps within the same time period in the number theory transformation.
  • FIG. 1 is a flow chart of an NTT provided by an embodiment of the present application.
  • the NTT shown in Figure 1 is divided into three stages. According to the order of time period from first to last (that is, from left to right), these three stages are called stage 1, stage 2 and stage 3 respectively.
  • Stage 1 includes all processing steps of NTT in the first time period, stage 2 includes all processing steps of NTT in the second time period, and so on.
  • FIG. 2 is a flow chart of an INTT provided by an embodiment of the present application.
  • the INTT shown in Figure 2 is divided into three stages. According to the order of time period from first to last, these three stages are called stage 1, stage 2 and stage 3 respectively.
  • Stage 1 includes all processing steps of INTT in the first time period, stage 2 includes all processing steps of INTT in the second time period, and so on.
  • The input is a polynomial with n coefficients (n ≥ 2, usually a power of 2).
  • Each stage in the number theory transformation processes disjoint groups of m (2 ≤ m ≤ n, usually a power of 2) data items.
  • This kind of processing is called butterfly, also called butterfly computation unit or butterfly operation.
  • Butterflies are often the main computational unit in number theory transformations.
  • a stage in a number theory transformation consists of one or more butterflies. For example, in the NTT shown in Figure 1, stage 1 includes 4 butterflies, stage 2 includes 2 butterflies, and stage 3 includes 1 butterfly. In the INTT shown in Figure 2, stage 1 includes 1 butterfly, stage 2 includes 2 butterflies, and stage 3 includes 4 butterflies.
  • a crossover refers to a computing unit that takes k data items as input and outputs k data items.
  • for example, each butterfly in stage 1 of the NTT includes 1 crossover, each butterfly in stage 2 of the NTT includes 2 crossovers, and the one butterfly in stage 3 of the NTT includes 4 crossovers.
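  • The stage, butterfly and crossover counts above follow a simple pattern for a radix-2 NTT of size n: stage s has n/2^s butterflies, each with 2^(s-1) crossovers. The sketch below reproduces the counts given for the 8-point example; the function name `stage_layout` is hypothetical.

```python
def stage_layout(n):
    """List (stage, butterflies, crossovers_per_butterfly) for a radix-2 NTT of size n."""
    assert n >= 2 and (n & (n - 1)) == 0, "n is assumed to be a power of 2"
    layout = []
    stages = n.bit_length() - 1          # log2(n)
    for s in range(1, stages + 1):
        butterflies = n >> s             # n / 2^s butterflies in stage s
        crossovers = 1 << (s - 1)        # 2^(s-1) crossovers per butterfly
        layout.append((s, butterflies, crossovers))
    return layout


for stage, butterflies, crossovers in stage_layout(8):
    print(stage, butterflies, crossovers)
```

  • Every stage contains n/2 crossovers in total. For the INTT described above, the butterfly counts appear in the reverse order (1, 2, 4 butterflies for n = 8).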
  • Figure 3 is the calculation process of a crossover in a butterfly in Radix-2 NTT provided by the embodiment of the present application.
  • the calculation formula for a crossover in a butterfly in Radix-2 NTT is:
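  • A hedged reconstruction, assuming the standard Cooley-Tukey form of a radix-2 crossover, which matches the description elsewhere in this text (modular multiplication first, then addition and subtraction, with subtraction x - y*w). The names x, y for the inputs, w for the twiddle factor, and X, Y for the outputs are assumptions.

```latex
\begin{aligned}
X &= x + y \cdot w \bmod q, \\
Y &= x - y \cdot w \bmod q .
\end{aligned}
```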
  • Figure 4 is the calculation process of a crossover in a butterfly in Radix-2 INTT provided by the embodiment of the present application.
  • the calculation formula for a crossover in a butterfly in Radix-2 INTT is
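  • The sketch below shows how such crossovers compose into complete transforms. It makes several assumptions that are not confirmed by this text: the forward crossover is the Cooley-Tukey form (X = x + y*w, Y = x - y*w), the inverse crossover is the Gentleman-Sande form (X = x + y, Y = (x - y)*w), the transform is cyclic, and every value is fully reduced modulo q (no redundancy). All function names and the parameters q = 17, n = 8, g = 2 are illustrative. It requires Python 3.8 or later for `pow(x, -1, q)`.

```python
def bit_reverse_permute(a):
    """Return a copy of a with indices bit-reversed (len(a) is a power of 2)."""
    n = len(a)
    bits = n.bit_length() - 1
    out = [0] * n
    for i in range(n):
        rev = 0
        for b in range(bits):
            if (i >> b) & 1:
                rev |= 1 << (bits - 1 - b)
        out[rev] = a[i]
    return out


def ntt(a, g, q):
    """Forward transform: butterflies grow from size 2 up to size n."""
    n = len(a)
    a = bit_reverse_permute(a)
    m = 2
    while m <= n:
        w_m = pow(g, n // m, q)              # primitive m-th root of unity
        for start in range(0, n, m):         # one butterfly per block of m items
            w = 1
            for j in range(m // 2):          # one crossover per pair
                x = a[start + j]
                t = a[start + j + m // 2] * w % q   # multiply first
                a[start + j] = (x + t) % q          # then add
                a[start + j + m // 2] = (x - t) % q # and subtract
                w = w * w_m % q
        m *= 2
    return a


def intt(a, g, q):
    """Inverse transform: butterflies shrink from size n down to size 2."""
    n = len(a)
    a = list(a)
    g_inv = pow(g, -1, q)
    m = n
    while m >= 2:
        w_m = pow(g_inv, n // m, q)
        for start in range(0, n, m):
            w = 1
            for j in range(m // 2):
                x = a[start + j]
                y = a[start + j + m // 2]
                a[start + j] = (x + y) % q            # add and subtract first
                a[start + j + m // 2] = (x - y) * w % q  # then multiply
                w = w * w_m % q
        m //= 2
    a = bit_reverse_permute(a)
    n_inv = pow(n, -1, q)
    return [v * n_inv % q for v in a]


coeffs = [1, 2, 3, 4, 5, 6, 7, 8]
transformed = ntt(coeffs, 2, 17)
assert intt(transformed, 2, 17) == coeffs   # round trip recovers the input
```

  • With n = 8 the forward loop visits 4, 2, then 1 butterflies per stage and the inverse loop visits 1, 2, then 4, matching the stage structure described for Figures 1 and 2.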
  • a logical branch generally includes one or more judgment conditions and the processing steps corresponding to each judgment condition.
  • when the computing device executes a logical branch, it determines whether the judgment conditions in the logical branch are met based on the current operating status. If the computing device determines that the operating status satisfies a certain judgment condition in the logical branch, it executes the processing step corresponding to that judgment condition.
  • Overflow occurs when the number of bits in a processing result produced by a computing device exceeds the machine word length of the computing device. For example, if the machine word length of the computing device is 32, and the processing result is 33 bits of data, this situation is an overflow. When an overflow occurs, the computing device will transform the processing result to obtain data with a number of bits within the machine word length range, and then continue processing based on the transformed data, resulting in an operation error. Therefore, overflow needs to be avoided.
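  • The effect of overflow on a modular computation can be illustrated with a small simulation. It assumes that the machine keeps only the low 32 bits of a result (unsigned wrap-around); the text above does not specify the exact behaviour, and the names `WORD_BITS`, `add_wrapping` and the sample values are hypothetical.

```python
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1


def add_wrapping(x, y):
    """Add as a 32-bit unsigned machine would: keep only the low 32 bits."""
    return (x + y) & MASK


q = 12289                       # illustrative modulus
x, y = 4_000_000_000, 500_000_000

exact = x + y                   # needs 33 bits
wrapped = add_wrapping(x, y)    # what a 32-bit word would hold

print(exact.bit_length())
print(exact % q == wrapped % q)  # the remainders no longer agree
```

  • The wrapped value differs from the exact sum by 2^32, which is not a multiple of q, so every later modular result derived from it is wrong. This is the operation error that reduction at the right positions is meant to prevent.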
  • Precomputation is a way to speed up processing tasks. Precomputation refers to performing some processing steps in advance and storing the resulting processing results in a location before executing the processing task.
  • the location used to save precomputed processing results is generally called a look-up table (LUT).
  • when the processing task is executed, the precomputed processing results can be obtained by querying the look-up table, and the processing task is executed based on these results, without having to execute the precomputed processing steps during the task itself. This shortens the time needed to complete the processing task and improves efficiency.
  • Just-in-time computing is a concept opposite to precomputation. Just-in-time computing refers to performing processing steps during the execution of the processing task.
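  • A minimal sketch of precomputation versus just-in-time computing, using powers of a root of unity (twiddle factors) as the precomputed values. The function name `precompute_twiddles` and the parameters g = 2, n = 8, q = 17 are illustrative assumptions.

```python
def precompute_twiddles(g, n, q):
    """Look-up table of g^0, g^1, ..., g^(n/2 - 1) modulo q."""
    table = [1] * (n // 2)
    for i in range(1, n // 2):
        table[i] = table[i - 1] * g % q
    return table


g, n, q = 2, 8, 17
lut = precompute_twiddles(g, n, q)      # done once, before the task runs

# During the task: a table lookup replaces a modular exponentiation.
for i in range(n // 2):
    just_in_time = pow(g, i, q)         # computed on demand
    assert lut[i] == just_in_time
```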
  • a, b and q are all integers.
  • the process of calculating a*b mod q is called modular multiplication processing, or modular multiplication for short.
  • if a and b are congruent modulo q, that is, b ≡ a mod q, where 0 ≤ a < q and b ≥ a, then a is the remainder of b mod q (that is, the remainder of b divided by q is equal to a), and b is numerically redundant relative to the modulus q, referred to as b being redundant.
  • the redundancy multiple is used to indicate the amount of numerical redundancy in the data relative to the modulus.
  • Reduction processing is the collective name for the two operations of modulo operation and congruence operation.
  • the modulo operation refers to determining the remainder of a data relative to a modulus. Expressed mathematically, given an integer a and an integer q, the modulo operation determines the remainder obtained by dividing the integer a by the integer q. By performing modulo processing on input data whose value is greater than the modulus, the computer can reduce the input data to fall within the range of the modulus, thereby limiting the value size of the data and avoiding overflow caused by excessive data values.
  • Modulo operations include modular addition, modular multiplication, modular subtraction, and modular division. Modular addition refers to determining the remainder of the sum of two data relative to a modulus, that is, a+b mod q.
  • Modular subtraction refers to determining the remainder of the difference between two data relative to a modulus, that is, a-b mod q.
  • Modular multiplication refers to determining the remainder of the product of two data relative to a modulus, that is, a*b mod q.
  • Modular division refers to determining the remainder of the ratio of two data relative to a modulus.
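  • The four modular operations can be sketched as follows. The helper names are hypothetical. For modular division, the sketch assumes the usual interpretation that the "ratio" a/b means a multiplied by the modular inverse of b, which exists only when b and q are coprime; `pow(b, -1, q)` requires Python 3.8 or later.

```python
def mod_add(a, b, q):
    return (a + b) % q


def mod_sub(a, b, q):
    return (a - b) % q      # Python's % always returns a value in [0, q)


def mod_mul(a, b, q):
    return (a * b) % q


def mod_div(a, b, q):
    # Assumption: division is multiplication by the inverse of b modulo q.
    return a * pow(b, -1, q) % q


q = 17
print(mod_add(12, 7, q), mod_sub(12, 7, q), mod_mul(12, 7, q), mod_div(12, 7, q))
```

  • The division result can be checked by multiplying back: `mod_mul(mod_div(12, 7, q), 7, q)` returns 12.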
  • Congruence means that the remainders obtained after dividing two integers by the same modulus are the same.
  • the integers a and b are congruent with respect to the modulus q, usually written as b ≡ a mod q, where 0 ≤ a < q and b ≥ a.
  • the congruence operation refers to determining data that is congruent with given data and has a smaller value than the given data.
  • the process of determining a congruent value of the integer a with respect to the integer q is the congruence operation, where the integer a is the input data of the number theory transformation and the integer q is the modulus (that is, the parameter corresponding to the data).
  • Redundant modular multiplication (also called fast redundant modular multiplication or lazy modular multiplication)
  • Redundant modular multiplication is a specific implementation of modular multiplication.
  • Redundant modular multiplication refers to determining the modular multiplication result of x and y relative to the modulus q by a formula [formula lost in extraction], in which r represents the modular multiplication result, x and y represent the input data and are both positive integers, q represents the modulus, the formula additionally uses a positive integer constant of which q is less than half, and y < q.
  • compared with the ordinary modular multiplication method (i.e. x*y mod q), redundant modular multiplication has two main characteristics.
  • second, the result of redundant modular multiplication has twice the range of the result of ordinary modular multiplication: the result of ordinary modular multiplication lies in [0, q), while the result of redundant modular multiplication lies in [0, 2q). It can be understood that redundant modular multiplication may sacrifice a certain degree of accuracy (the result is not fully reduced), in exchange for an increase in calculation speed.
  • x and y in the above redundant modular multiplication formula can both be input data; or, one of x and y can be input data, and the other can be a twiddle factor.
  • Redundant modular multiplication is usually used when the value of the multiplicand is known before the immediate calculation stage of the number theory transformation, and the multiplicand is smaller than the modulus.
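  • One widely used construction with the properties described above (a multiplicand known in advance and smaller than the modulus, a result in [0, 2q)) is Shoup-style lazy modular multiplication. The sketch below shows that technique; it is not confirmed to be the exact formula of this application. The names `beta` (the word base, for example 2^32), `precompute_scaled` and `lazy_mulmod` are assumptions.

```python
def precompute_scaled(y, q, beta):
    """Precompute floor(y * beta / q) for a multiplicand y known in advance."""
    assert 0 <= y < q < beta // 2
    return (y * beta) // q


def lazy_mulmod(x, y, y_scaled, q, beta):
    """Return r with r congruent to x*y modulo q and 0 <= r < 2q.

    Uses one multiplication to estimate the quotient instead of a division
    by q, and skips the final conditional subtraction.
    """
    assert 0 <= x < beta
    quotient_estimate = (x * y_scaled) // beta   # at most 1 below floor(x*y/q)
    return x * y - quotient_estimate * q


q, beta = 97, 256            # small illustrative values with q < beta/2
y = 45
y_scaled = precompute_scaled(y, q, beta)
r = lazy_mulmod(200, y, y_scaled, q, beta)
assert 0 <= r < 2 * q and r % q == (200 * y) % q
```

  • The quotient estimate can be too small by at most one, which is why the result may exceed q by less than q. In a fixed-width implementation, dividing by beta is a shift that keeps the high word of the product, and the condition q < beta/2 keeps the result within one word.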
  • Montgomery's algorithm is a commonly used algorithm for quickly calculating modular multiplication of positive integers.
  • The Montgomery algorithm can multiply a positive integer x by r^{-1} modulo q to obtain x' ≡ x*r^{-1} mod q. Therefore, when it is necessary to calculate x mod q, by selecting an appropriate r value, changing the value of x to x*r, and then calling the Montgomery algorithm, the value of x mod q can be output.
  • the value of x*r is called the Montgomery representation of x.
  • the Montgomery representation of x can be redundant, for example it can be equal to x*r+i*q (integer i ≥ 0).
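  • A minimal sketch of textbook Montgomery reduction (REDC), assuming r is a power of two and q is odd. The function names and the sample parameters q = 97, r = 2^8 are illustrative assumptions, and the Montgomery representation here is stored reduced modulo q for simplicity. It requires Python 3.8 or later for `pow(q, -1, r)`.

```python
def montgomery_setup(q, r_bits):
    """Return r = 2^r_bits and -q^{-1} mod r (q must be odd)."""
    r = 1 << r_bits
    q_neg_inv = (-pow(q, -1, r)) % r
    return r, q_neg_inv


def montgomery_reduce(t, q, r_bits, q_neg_inv):
    """Return t * r^{-1} mod q for 0 <= t < r*q, using only shifts and masks."""
    r_mask = (1 << r_bits) - 1
    m = ((t & r_mask) * q_neg_inv) & r_mask   # makes t + m*q divisible by r
    u = (t + m * q) >> r_bits                 # exact division by r; u < 2q
    return u - q if u >= q else u             # textbook final correction


def to_montgomery(x, q, r_bits):
    """Montgomery representation x*r, here stored reduced modulo q."""
    return (x << r_bits) % q


q, r_bits = 97, 8
r, q_neg_inv = montgomery_setup(q, r_bits)
xm, ym = to_montgomery(5, q, r_bits), to_montgomery(7, q, r_bits)
prod_m = montgomery_reduce(xm * ym, q, r_bits, q_neg_inv)   # (5*7)*r mod q
assert montgomery_reduce(prod_m, q, r_bits, q_neg_inv) == 35
```

  • The final correction line is itself a logical branch. Leaving it out yields a result in [0, 2q), which is a redundant value in the sense used above.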
  • a word refers to a set of binary numbers that are accessed, transferred, and processed as a whole in a computer.
  • the number of binary digits in a word is called the word length.
  • the machine word length refers to the number of bits of binary data that the processor can process for one integer operation. It is usually also the width of the internal data channel of the processor. For example, a 32-bit processor has a machine word size of 32, and a 64-bit processor has a machine word size of 64.
  • the instruction word length refers to the total number of bits in the binary code of the machine instruction.
  • the instruction word length depends on the length of the opcode, the length of the operand address, and the number of operand addresses. The word lengths of different instructions are different.
  • Data word length refers to the number of bits occupied by stored data.
  • the polynomial dimension is related to the degree (i.e. order) of the polynomial.
  • Some lattice-based cryptographic algorithms are built on polynomials whose coefficients lie in a finite field. Such algorithms define polynomial rings.
  • the degree of a polynomial on R_q is n-1, the dimension is set to n (n is a power of 2), and the prime number q ≡ 1 mod 2n.
  • the coefficients of the polynomials involved in the calculation of NTT and INTT are all on the ring R_q modulo q.
  • Quantum-resistant cryptography, also called post-quantum cryptography (PQC), is a class of cryptographic algorithms designed to resist attacks by quantum computers, especially public-key (asymmetric) encryption algorithms.
  • Some PQC algorithms, such as lattice-based cryptography, study the properties of lattices, that is, discrete subgroups of the additive group in the n-dimensional space R^n.
  • This mathematical object has many applications, among which are several so-called "lattice problems", such as the shortest vector problem and the closest vector problem.
  • Many lattice-based cryptosystems rely on the hardness of these problems.
  • Lattice-based cryptographic algorithms require a large number of polynomial calculations, among which number theory transformation is one of the most important calculations.
  • Homomorphic encryption is a form of encryption that allows users to perform computations on data while it is encrypted, without first decrypting it.
  • the results of homomorphic encryption calculations are retained in encrypted form and, when decrypted, produce the same output as the calculations on unencrypted data.
  • the homomorphic encryption scheme focuses on the security of data processing and provides a function for processing encrypted data.
  • the characteristic of homomorphic encryption scheme is that it allows data to perform mathematical or logical operations while being encrypted.
  • Homomorphism refers to homomorphisms in algebra, and encryption and decryption functions can be thought of as homomorphisms between plaintext and ciphertext spaces.
  • Fully homomorphic encryption (FHE) allows any operation that can be performed on plaintext to be performed on encrypted data without decrypting it, so fully homomorphic encryption can be run by an untrusted party without revealing its input and internal state. Based on these characteristics, it can be used for privacy-protecting outsourced storage and computing, and to perform operations such as retrieval and comparison on encrypted data to obtain correct results without decrypting the data at any point during processing. Its significance is that it can solve the data security problem when entrusting data and its calculation to a third party, such as in cloud computing scenarios.
  • the value of the data becomes smaller due to the reduction process at the appropriate position, thereby reducing the number of bits used to represent the data internally in the computing device and preventing the number of bits from exceeding the upper limit that the computing device can represent, thus avoiding overflow.
  • some embodiments provided by this application propose a data processing method that supports number theory transformation without logical branch statements.
  • the positions where running the number theory transformation may cause overflow are first found through the data-related parameters, and the computing unit corresponding to each such position is used as a computing unit that needs to perform reduction processing. Then, while the number theory transformation is running, the computing units found in advance perform the reduction processing, so that there is no need to introduce logical branch statements, and the reduction processing is still performed in time during the operation of the number theory transformation to avoid overflow.
  • the number theory transformation can again be compared to a road.
  • the method provided by this embodiment is equivalent to planning in advance, before building the road, which locations need reduction processing, so that a road without intersections can be constructed. When the computing device runs the number theory transformation, which corresponds to driving along the road, there is no need to pause at an intersection and spend time deciding whether reduction processing is needed. Instead, the reduction processing is performed at the pre-planned locations, which speeds up the operation of the number theory transformation. It can be seen that the method provided by this embodiment solves the performance problem caused by the introduction of logical branch statements in the prior art.
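  • The idea of planning reduction positions in advance can be sketched as follows. This is an illustration under strong assumptions, not the application's estimation procedure: it assumes that the input lies below t*q, that each stage without reduction raises the upper bound on values by 2q, that a reduction brings values back below 2q, and that the preset number of bits is the word size minus a margin of 1. The function name `plan_reductions` and the sample modulus are hypothetical.

```python
def plan_reductions(q, t, stages, word_bits, margin=1):
    """Return the stages that must reduce their input, decided before running.

    q         : modulus
    t         : redundancy multiple of the input (values assumed below t*q)
    stages    : number of stages in the transform
    word_bits : machine word length in bits
    margin    : preset number of bits is word_bits - margin
    """
    preset_bits = word_bits - margin
    bound = t * q                    # exclusive upper bound on incoming values
    reduce_at = []
    for stage in range(1, stages + 1):
        # Has the estimated size of the previous result reached the preset bits?
        if (bound - 1).bit_length() >= preset_bits:
            reduce_at.append(stage)  # this stage reduces its input first
            bound = 2 * q            # assumed range after reduction
        bound += 2 * q               # assumed growth of one stage
    return reduce_at


q = 2**28 - 2**16 + 1                # illustrative 28-bit modulus
print(plan_reductions(q, t=1, stages=8, word_bits=32))
```

  • The resulting list depends only on the parameters (modulus, redundancy multiple, number of stages, word size), not on the data, so it can be fixed before the transform runs, for example when generating straight-line code for each stage.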
  • the embodiments of this application can be applied in data encryption and decryption scenarios, such as encrypted data transmission, privacy calculation, key generation, identity authentication, etc.
  • the embodiments of this application are applied in scenarios where encryption and decryption are performed based on PQC or FHE.
  • Data encryption and decryption schemes are usually implemented based on cryptographic algorithms, and cryptographic algorithms, especially the PQC algorithm, usually require the use of number theory transformations. Through the method provided by the embodiment of the present application, the operation of the number theory transformation can be accelerated, thereby improving the overall speed of the encryption and decryption scheme.
  • after the sending end obtains the plaintext data to be encrypted, it encrypts the plaintext data through the PQC algorithm to obtain ciphertext data, and sends the ciphertext data to the receiving end.
  • the data receiving end receives the ciphertext data, decrypts the ciphertext data through the PQC algorithm, and obtains the plaintext data. Security is improved because the data is transmitted in ciphertext on the link from the sender to the receiver.
  • number theory transformation is, for example, a module in the PQC algorithm.
  • the sending end performs the method provided in this embodiment, performs number theory transformation on the plaintext data, and performs other steps of the PQC algorithm on the transformed plaintext data to obtain ciphertext data.
  • the receiving end executes the method provided in this embodiment, performs number theory transformation on the ciphertext data, and performs other steps of the PQC algorithm on the transformed ciphertext data to obtain plaintext data.
  • the speed of number theory transformation can be improved, thereby improving the speed of the PQC algorithm.
  • the method provided by this embodiment can reduce the delay in data encryption and decryption at the sending end and the receiving end, helping to meet the delay requirements of both communicating parties.
  • public key signature algorithms can be used between different nodes in the power grid to ensure data transmission security.
  • the existing public key signature algorithm has a large delay and cannot meet the delay requirements of the standard.
  • Since lattice-based public key signature algorithms may become next-generation cryptographic standard algorithms, the methods provided by some embodiments of this application can improve the performance of the number theory transformation and thereby increase the speed of lattice-based public key signature algorithms.
  • These public key signature algorithms can then meet the communication delay requirements of relevant international standards, and may be adopted by those standards to protect power grid data.
  • The product form of the number theory transformation is software.
  • The number theory transformation takes the form of a piece of program code.
  • A CPU (such as a 32-bit CPU or a 64-bit CPU) runs the code and thus performs the number theory transformation.
  • the product form of the number theory transformation is hardware, such as a dedicated processor to undertake the number theory transformation.
  • the dedicated processor is a processor dedicated to encryption and decryption, such as an encryption chip (also called an encryption co-processor or a security chip).
  • the dedicated processor performs number theory transformations during the encryption and decryption process.
  • a dedicated processor assists the CPU in encrypting and decrypting data, and the dedicated processor and the CPU work together to complete the encryption and decryption operations.
  • the CPU transmits data related information to the dedicated processor.
  • the dedicated processor performs number theory transformation on the data based on the parameters passed in by the CPU, returns the number theory transformed data to the CPU, and then the CPU continues to perform encryption and decryption steps based on the number theory transformed data.
  • the number theory transformation is offloaded from the CPU to the dedicated processor, thereby reducing the computational burden of the CPU and increasing the speed of the CPU for encryption and decryption.
  • FIG. 5 is a schematic diagram of the principle of calculating polynomial multiplication by using number theory transformation (NTT) and its inverse transformation (INTT) according to an embodiment of the present application.
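  • The principle of Figure 5 can be illustrated with a minimal sketch. It assumes a toy modulus, a toy polynomial dimension and cyclic multiplication (modulo x^n - 1); the names `Q`, `N`, `W`, `ntt`, `intt` and `poly_mul_ntt` are hypothetical and are not taken from this application. The transforms are written in the naive quadratic form for readability and do not use the staged butterfly structure described later.

```python
Q = 17  # toy prime modulus (assumption for illustration)
N = 4   # toy polynomial dimension (assumption)
W = 4   # primitive N-th root of unity modulo Q: 4**2 % 17 == 16, 4**4 % 17 == 1


def ntt(a, w=W, q=Q):
    """Forward transform: evaluate the polynomial at the powers of w."""
    n = len(a)
    return [sum(a[j] * pow(w, i * j, q) for j in range(n)) % q for i in range(n)]


def intt(values, w=W, q=Q):
    """Inverse transform: same shape, using w^-1 and a final scaling by n^-1."""
    n = len(values)
    w_inv = pow(w, -1, q)  # modular inverse, Python 3.8+
    n_inv = pow(n, -1, q)
    return [
        n_inv * sum(values[j] * pow(w_inv, i * j, q) for j in range(n)) % q
        for i in range(n)
    ]


def poly_mul_ntt(a, b, q=Q):
    """Multiply two polynomials: transform, multiply point-wise, transform back."""
    pointwise = [x * y % q for x, y in zip(ntt(a), ntt(b))]
    return intt(pointwise)


def poly_mul_schoolbook(a, b, q=Q):
    """Reference cyclic multiplication used only to check the sketch."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[(i + j) % n] = (c[(i + j) % n] + a[i] * b[j]) % q
    return c
```

  • The point of the sketch is only that the point-wise product in the transformed domain equals the polynomial product; lattice-based schemes often use a negacyclic variant instead, which is not shown here.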
  • Figure 6 is a flow chart of a data processing method provided by an embodiment of the present application.
  • the method shown in Figure 6 is executed by a computing device.
  • the computing device is used to run the number theory transformation of the data.
  • the step of number theory transformation of the data includes multiple computing units.
  • the method shown in Figure 6 includes the following steps S201 to S202.
  • Step S201: The computing device determines the estimated number of bits of the processing result generated by each computing unit based on the parameters of the data.
  • the above data is the input data of the number theory transformation.
  • The number theory transformation includes at least one of the forward number theory transformation (NTT, also rendered as the positive number theory transformation) or the inverse number theory transformation (INTT).
  • the above-mentioned data are polynomial coefficients, for example.
  • the above-mentioned data is, for example, data to be encrypted or data to be decrypted.
  • the above data is plain text, cipher text or data required to generate a key.
  • the above parameter indicates the number of bits of data.
  • the above data is represented internally in the computing device in the form of a binary sequence, and the above parameter indicates the length of the binary sequence.
  • The purpose of obtaining the parameters is as follows. The number of bits of the input data affects the number of bits of the processing result generated by each calculation unit in the number theory transformation, which in turn determines which calculation units produce processing results whose number of bits exceeds the range the hardware can represent, that is, which computing units may overflow. Obtaining the above parameters therefore helps to determine more accurately how many bits the input data of the number theory transformation has, to estimate more accurately how many bits the result generated by each computing unit has, and thus to locate the computing units that need to perform reduction.
  • the parameters include a modulus used by each of the plurality of computing units when performing a modulo operation, a redundancy multiple of the data relative to the modulus, and a polynomial dimension of the data.
  • the above parameters also include the number of processor bits.
  • The redundancy multiple indicates how redundant the data is relative to the modulus. For example, a redundancy multiple of 1 indicates that the value of the data lies between 0 and the modulus, that is, the value has no redundancy; a redundancy multiple of 2 indicates that the value lies between 0 and twice the modulus; and so on, a redundancy multiple of k indicates that the value lies between 0 and k times the modulus, where k is a positive integer.
  • The polynomial dimension indicates the number of stages involved in the number theory transformation. For example, if the polynomial dimension is n, the number theory transformation has a total of $\log_2 n$ stages.
  • the above-mentioned processor is hardware used to run number theory transformations, such as a CPU.
  • the number of bits in a processor is used to indicate the range of values that the processor can represent.
  • the number of bits of the processor is, for example, the word length of the processor, such as machine word length, instruction word length, data word length or storage word length, etc.
  • the above parameters include the number of bits of data.
  • the above parameters are the maximum value of the data or the value range of the data.
  • the above parameters are provided by the user.
  • For example, when the computing device is a terminal, the user inputs the above parameters on the terminal, and the terminal executes subsequent processes based on them; for another example, when the computing device is a server, the user inputs the above parameters on a terminal, the terminal sends them to the server, and the server executes subsequent processes based on the parameters received from the terminal.
  • In another possible implementation, the above parameters are pre-stored in the computing device.
  • the above parameters are pre-programmed into the processor responsible for running the number theory transformation.
  • the above-mentioned computing unit is equivalent to a component or a data processing unit in the number theory transformation.
  • one calculation unit is used to perform addition processing, subtraction processing, and modular multiplication processing, and the modular multiplication processing includes a modulo operation.
  • each of the plurality of computing units is used to process k data to generate k processing results, where k is a positive integer.
  • a computing unit is one or more stages. In other embodiments, one computing unit is one or more butterflies. In other embodiments, a computing unit is one or more intersections.
  • a calculation unit is used to first perform modular multiplication processing, and then perform addition processing and subtraction processing to generate processing results.
  • a computing unit is used to first perform addition processing and subtraction processing, and then perform modular multiplication processing to generate processing results.
  • the above computing unit is software.
  • The number theory transformation is a piece of code, and the above-mentioned calculation units are statements in the code.
  • the above computing unit is hardware.
  • The number theory transformation is implemented as a chip, and the above-mentioned computing unit is a processing circuit in the chip.
  • The estimated number of bits indicates the number of bits of the processing result generated by the calculation unit based on the data. Take input data of 2^16 polynomial coefficients, each with 58 bits, as an example. As shown in Figure 7, after each computing unit in the first stage of NTT processes the data, the processing result has 59 bits; that is, the estimated number of bits corresponding to each computing unit in stage 1 is 59. As shown in Figure 8, after each computing unit in the first stage of INTT processes the data, the processing result has 62 bits or 60 bits; that is, the estimated number of bits corresponding to each computing unit in the first stage is 62 or 60.
  • The computing device determines the number of bits of the data based on the modulus and the redundancy multiple, and then determines the estimated number of bits for each computing unit based on the number of bits of the data and the bit increment corresponding to each computing unit.
  • One possible way to determine the number of bits of the data is as follows: if the redundancy multiple is 1, the number of bits of the modulus is taken as the number of bits of the data; if the redundancy multiple is greater than 1, the number of bits of the product of the modulus and the redundancy multiple is taken as the number of bits of the data.
  • For example, if the modulus is q and the redundancy multiple is 1, the value of the data is less than the modulus, so $\log_2 q$ (rounded up to an integer) is determined to be the number of bits of the data. If the modulus is q and the redundancy multiple is n (n is a positive integer greater than 1), the value of the data is less than n times the modulus, so $\log_2 (qn)$ (rounded up to an integer) is determined to be the number of bits of the data.
  • The effect of this method is as follows. Because the value of the data lies between 0 and the product of the modulus and the redundancy multiple, the number of bits of that product is the theoretical maximum number of bits of the data. Estimating the number of bits of the processing result from this theoretical maximum takes the worst case into account and ensures that no overflow occurs.
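  • The rule above for deriving the number of bits of the data can be sketched as follows. The function name `data_bits` is hypothetical, and the sketch assumes that the data takes values in the half-open range from 0 up to, but not including, the product of the modulus and the redundancy multiple.

```python
def data_bits(modulus: int, redundancy_multiple: int = 1) -> int:
    """Worst-case number of bits of a value in [0, modulus * redundancy_multiple)."""
    if modulus < 2 or redundancy_multiple < 1:
        raise ValueError("modulus must be >= 2 and redundancy_multiple >= 1")
    bound = modulus * redundancy_multiple
    # The largest possible value is bound - 1; its bit length is ceil(log2(bound)).
    return (bound - 1).bit_length()
```

  • For instance, `data_bits(17)` gives 5 and `data_bits(17, 2)` gives 6, since values below 34 need one more bit than values below 17.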
  • the increment in the number of bits refers to the increment in the number of bits after the data is processed by the computing unit, that is, the difference in the number of bits in the output result generated by the computing unit and the number of bits in the input data obtained by the computing unit.
  • For example, if the calculation unit is an addition unit that adds data x and data y, the sum has at most 1 bit more than the operands, so 1 is used as the bit increment of the addition unit.
  • the correspondence between the calculation unit and the bit increment is preset and saved, and the bit increment is determined by querying the corresponding relationship.
  • In general, step S201 estimates, from the parameters related to the actual input data, how many bits the processing result of each computing unit would theoretically have if the actual input data were fed into the number theory transformation, thereby identifying the computing units that could theoretically cause overflow.
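  • A minimal sketch of the estimation in step S201, assuming that the computing units are visited in order and that each one has a known bit increment. The table `BIT_INCREMENT`, its keys and the function `estimate_bits` are hypothetical; the increment of 1 for an addition is taken from the description above, and the increment of 1 per stage is an assumption of this sketch.

```python
# Hypothetical correspondence between a kind of computing unit and its bit increment.
BIT_INCREMENT = {
    "add": 1,    # a sum has at most one bit more than its operands
    "stage": 1,  # assumed growth of one bit per stage when no reduction is done
}


def estimate_bits(input_bits: int, units: list) -> list:
    """Return the estimated number of bits after each computing unit, in order."""
    estimates = []
    bits = input_bits
    for unit in units:
        bits += BIT_INCREMENT[unit]  # look up the preset increment for this unit
        estimates.append(bits)
    return estimates
```

  • With 58-bit input coefficients, `estimate_bits(58, ["stage", "stage", "stage"])` yields 59, 60 and 61, which agrees with the 59 bits quoted for stage 1 of the NTT example.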
  • Step S202: The computing device determines a first computing unit from multiple computing units based on the estimated number of bits.
  • the first calculation unit is a calculation unit for reducing processing results of the second calculation unit.
  • the second calculation unit is one of the plurality of calculation units mentioned above.
  • the estimated number of bits of the processing result of the second calculation unit satisfies the preset number of bits.
  • the processing results generated by the second calculation unit are used as input data of the first calculation unit.
  • the second calculation unit is equivalent to the previous calculation unit of the first calculation unit, and the output of the second calculation unit is equivalent to the input of the first calculation unit.
  • the preset number of bits is a threshold, and the estimated number of bits of the processing result of the second calculation unit is greater than or equal to the threshold.
  • the above-mentioned preset number of bits is a value, and the estimated number of bits of the processing result of the second calculation unit is equal to this value.
  • the preset number of bits is the number of data bits when the overflow condition is met.
  • The preset number of bits effectively provides an upper limit. If the estimated number of bits of the processing result generated by a certain computing unit reaches the upper limit, it is determined that the next computing unit must reduce the data when the NTT is actually run, so that the number of bits of the result does not exceed the upper limit when the number theory transformation is performed on the data.
  • the above-mentioned preset number of bits is determined based on the number of bits of the processor in the computing device.
  • Because the preset number of bits is determined by hardware factors such as the number of bits of the processor, it adapts to the capabilities of the hardware, and different preset numbers of bits can be determined for hardware with different capabilities (such as CPUs with different numbers of bits). Using the preset number of bits as the basis for finding the computing units that need reduction locates those units more accurately, thereby reducing unnecessary reduction processing.
  • The preset number of bits is 1 less than the number of bits of the processor.
  • For example, if the processor responsible for running the number theory transformation is a 64-bit CPU, the preset number of bits is set to 63; if the processor is a 32-bit CPU, the preset number of bits is set to 31.
  • Taking a preset number of bits of 63 as an example, if the input data of a calculation unit is estimated to reach 63 bits, reduction processing is performed at this calculation unit, and the calculation units before it do not need to perform reduction processing. In this way, while overflow is avoided, the restrictions on the value of the data are relaxed as much as possible, the capabilities of the hardware are fully used, resource utilization is improved, and the number of reductions is reduced.
  • The number theory transformation is usually an intermediate module in the encryption and decryption scheme rather than the first or the last module. If the preset number of bits is set too large, then when the output of the entire number theory transformation enters the next module of the scheme, the value is likely to grow because of the operations in that module, resulting in overflow. If the preset number of bits is set too small, the hardware capabilities may not be fully utilized, resulting in a waste of resources. Based on this, the preset number of bits is designed to be 2 less than the number of bits of the processor.
  • For example, if the processor responsible for running the number theory transformation is a 64-bit CPU, the preset number of bits is set to 62; if the processor is a 32-bit CPU, the preset number of bits is set to 30.
  • the computing device reduces the processing result of the second computing unit through the first computing unit.
  • The effect of the reduction processing performed by the first calculation unit is twofold. On the one hand, reduction lowers the value of the processing result and therefore its number of bits. Because the first calculation unit reduces the processing result of the second calculation unit, the processing result generated by the first calculation unit does not exceed the preset number of bits, and overflow is avoided.
  • On the other hand, computing units other than the first computing unit do not need to perform reduction processing. While the data is processed by those units, its value is allowed to remain redundant, and the data is reduced only when it is input to the first computing unit, that is, only when its number of bits reaches the preset number of bits. This avoids redundant reduction processing in the number theory transformation as far as possible and improves processing efficiency.
  • the NTT operation process is divided into 16 stages.
  • the number of data bits is 63.
  • the computing device estimates that the number of bits of the processing result generated by each intersection in the 15th stage is 63, that is, the number of bits of the input data for each intersection in the 16th stage reaches 63.
  • the computing device treats each intersection of stage 16 as the first computing unit.
  • the computing device performs reduction processing at each intersection in the 16th stage, so that the number of bits of the processing result is reduced from 63 to 60, and finally the number of bits of the output result of the entire NTT running process is controlled within 60 , thus avoiding overflow.
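  • The way the first computing unit is located ahead of time can be sketched as follows. This is a simplified model under stated assumptions, not the procedure of this application: each stage without reduction adds one bit, a stage that reduces produces output of `reduced_bits` bits, and the 48-bit input width is a hypothetical value chosen so that the estimate reaches 63 bits after stage 15, as in the example above. The name `plan_reductions` is hypothetical.

```python
def plan_reductions(input_bits: int, num_stages: int,
                    preset_bits: int, reduced_bits: int):
    """Return (stages that must reduce, estimated bits of the final output)."""
    reducing_stages = []
    bits = input_bits  # estimated bits of the input to the current stage
    for stage in range(1, num_stages + 1):
        if bits >= preset_bits:
            # The input of this stage reaches the preset number of bits,
            # so this stage is chosen to perform reduction.
            reducing_stages.append(stage)
            bits = reduced_bits
        else:
            bits += 1  # assumed growth of one bit per stage
    return reducing_stages, bits
```

  • Under these assumptions, `plan_reductions(48, 16, 63, 60)` returns `([16], 60)`: only stage 16 reduces. The plan is computed once before the transformation runs, so the code that later processes the data needs no data-dependent branch to decide whether to reduce.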
  • the INTT operation process is divided into four stages.
  • the number of data bits is 62.
  • the computing device estimates that the number of bits of the input data for 5 crossovers in INTT is 62.
  • These 5 crossovers are the 1st butterfly and 1st crossover in stage 2 and the 2nd butterfly and 2nd crossover in stage 2.
  • When running INTT, the computing device performs reduction processing through these five intersections, reducing the number of bits of the processing result from 62 to 60, and finally keeping the number of bits of the output result of the entire INTT running process within 62, thereby avoiding overflow.
  • The computing device can perform other processing steps on the reduced processing result through the first computing unit, and then continue processing through the next computing unit after the first computing unit, until all computing units have finished, thereby converting the data into number-theory-transformed data.
  • the application of data transformed by number theory in encryption and decryption schemes includes a variety of scenarios.
  • the above data is plain text.
  • After the computing device performs a forward number theory transformation on the plain text, it encrypts the plain text based on the transformed plain text to obtain a part of the ciphertext.
  • the above data is ciphertext.
  • After the computing device performs an inverse number theory transformation on the ciphertext, it decrypts the ciphertext based on the transformed ciphertext to obtain a part of the plaintext.
  • the above data is the data required to generate a key (public key or private key).
  • After the computing device performs a number theory transformation on the data, it generates the key based on the transformed data.
  • the computing device performs encryption processing or decryption processing on the processing result after the reduction processing by the second computing unit.
  • The determined computing unit performs data reduction processing, so that there is no need to introduce logical branch statements.
  • The value of the data can also be made smaller at the appropriate position, thereby reducing the number of bits used to represent the data within the computing device and preventing the number of bits of the data from exceeding the upper limit that the computing device can represent, thus avoiding overflow.
  • this method can remove logical branch statements and optimize the structure of number theory transformations, thereby improving the efficiency of running number theory transformations.
  • the calculation unit (first calculation unit) where overflow may occur can be accurately located based on the estimated number of bits, so that the calculation unit where overflow may occur can perform reduction processing.
  • other computing units do not need reduction processing, thereby reducing the number of calls to reduction processing in number theory transformations, minimizing the amount of redundant calculations in number theory transformations, and improving efficiency.
  • the above-mentioned first computing unit uses a modulo operation to perform reduction processing on the processing results of the second computing unit.
  • the function of the modulo operation is to ensure correct calculation and to reduce the size of the data.
  • the computing device uses addition processing and subtraction processing instead of modulo operation, and uses subtraction processing to reduce the value size of the data.
  • the first calculation unit uses Montgomery modular multiplication to perform reduction processing on the processing results of the second calculation unit.
  • the processing result of the second calculation unit is first converted into Montgomery form, and then Montgomery modular multiplication is performed on the processing result in Montgomery representation form, thereby realizing reduction processing.
  • For example, if the data includes x and y, a parameter r is introduced. According to r, x is converted into x*r (that is, x in Montgomery representation) and y is converted into y*r (that is, y in Montgomery representation), and Montgomery modular multiplication is then performed on x*r and y*r.
  • Reduction processing through Montgomery modular multiplication can reduce the value of the processing result, achieve the purpose of reduction, and help improve the speed of reduction processing.
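  • The conversion into Montgomery form and the Montgomery modular multiplication described above can be sketched with the textbook reduction step. The sketch assumes an odd modulus q and r equal to a power of two larger than q; the function names and the parameter `r_bits` are hypothetical and are not taken from this application.

```python
def montgomery_setup(q: int, r_bits: int) -> int:
    """Return -q^-1 mod r, where r = 2**r_bits and q is odd with q < r."""
    r = 1 << r_bits
    return (-pow(q, -1, r)) % r  # modular inverse, Python 3.8+


def redc(t: int, q: int, r_bits: int, q_neg_inv: int) -> int:
    """Montgomery reduction: return t * r^-1 mod q for 0 <= t < r * q."""
    mask = (1 << r_bits) - 1
    m = ((t & mask) * q_neg_inv) & mask  # chosen so that t + m*q is divisible by r
    u = (t + m * q) >> r_bits            # exact division by r
    return u - q if u >= q else u        # final correction into [0, q)


def to_montgomery(x: int, q: int, r_bits: int) -> int:
    """Convert x into Montgomery representation x*r mod q."""
    return (x << r_bits) % q


def montgomery_mul(a_m: int, b_m: int, q: int, r_bits: int, q_neg_inv: int) -> int:
    """Multiply two values given in Montgomery representation."""
    return redc(a_m * b_m, q, r_bits, q_neg_inv)


def from_montgomery(a_m: int, q: int, r_bits: int, q_neg_inv: int) -> int:
    """Convert back to the ordinary representation."""
    return redc(a_m, q, r_bits, q_neg_inv)
```

  • For q = 17 and r = 32, multiplying the Montgomery forms of 7 and 12 and converting back gives 16, which equals 7*12 mod 17. The reduction replaces a division by q with shifts and masks, which is why it helps the speed of reduction processing.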
  • the first computing unit performs redundant modular multiplication processing on the processing results of the second computing unit, thereby implementing reduction processing.
  • the redundant modular multiplication processing is a modular multiplication processing with a value range of [0, 2q), q represents the modulus, and q is a positive integer.
  • When the reduction process is implemented using redundant modular multiplication, on the one hand, the reduction process does not need to be bound to the Montgomery algorithm, and the data does not need to stay in the Montgomery representation. In other words, the scheme is available whether the data is in the Montgomery representation or in a non-Montgomery representation, which improves the flexibility and practicality of the scheme. On the other hand, it also increases the speed of reduction processing, thereby improving efficiency. In particular, it helps to significantly accelerate the computing process of computing devices in scenarios such as large number operations.
  • redundant modular multiplication when redundant modular multiplication is not used, the use of reduction algorithms is usually limited to Barrett reduction and Montgomery modular multiplication (requiring that polynomial coefficients must be in Montgomery representation).
  • With redundant modular multiplication, the modular multiplication of coefficients in non-Montgomery representation can also be calculated, and the redundant modular multiplication can share the same calculation module with Montgomery modular multiplication.
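  • The text does not state how the redundant modular multiplication is computed. One technique known from the NTT literature that produces a result in [0, 2q) without a final conditional subtraction uses a precomputed quotient for the fixed factor; the sketch below uses it purely as an illustration. The names `BETA_BITS`, `precompute_quotient` and `redundant_mod_mul` are hypothetical, and the sketch assumes 0 <= w < q and 0 <= x < 2**BETA_BITS.

```python
BETA_BITS = 64  # assumed word size of the processor


def precompute_quotient(w: int, q: int, beta_bits: int = BETA_BITS) -> int:
    """Precompute floor(w * 2**beta_bits / q) for a fixed factor w in [0, q)."""
    return (w << beta_bits) // q


def redundant_mod_mul(x: int, w: int, w_quot: int, q: int,
                      beta_bits: int = BETA_BITS) -> int:
    """Return a value in [0, 2q) that is congruent to x*w modulo q.

    Requires 0 <= x < 2**beta_bits and 0 <= w < q.
    """
    quotient_estimate = (x * w_quot) >> beta_bits  # at most 1 below floor(x*w/q)
    return x * w - quotient_estimate * q           # no correction step, so < 2q
```

  • The result is allowed to stay redundant (up to twice the modulus), which matches the value range [0, 2q) stated above and avoids a data-dependent branch.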
  • the computing device generates a rotation factor that has the same representation as the data according to the representation of the data; the first computing unit processes the second computing unit based on the rotation factor The result is subjected to redundant modular multiplication.
  • the representation is a Montgomery representation or a non-Montgomery representation.
  • The computing device determines the representation form of the data. If the data is in the Montgomery representation, it generates rotation factors in the Montgomery representation; if the data is in a non-Montgomery representation, it generates rotation factors in the non-Montgomery representation. The computing device saves the resulting rotation factors to a precomputed table. In the real-time calculation stage, the computing device obtains the rotation factors from the precomputed table and performs redundant modular multiplication based on the obtained rotation factors and the data.
  • The computing device correspondingly adjusts the rotation factors saved in the precomputed table so that their representation form stays consistent with that of the data. For example, if the representation of the data is changed from the Montgomery representation to a non-Montgomery representation, the computing device changes the rotation factors saved in the precomputed table from the Montgomery representation to the non-Montgomery representation; if the representation of the data is changed from a non-Montgomery representation to the Montgomery representation, the computing device changes the rotation factors saved in the precomputed table from the non-Montgomery representation to the Montgomery representation.
  • the rotation factor with the Montgomery representation is used for the operation. If the requirement of the task is that the data has a non-Montgomery representation, then the rotation factor with the non-Montgomery representation is used.
  • This method can dynamically adjust the representation of values in NTT/INTT operations according to the requirements of specific computing tasks. In addition, it does not affect the structure of the butterfly operation in NTT/INTT operations, does not introduce additional algorithms, is not bound to the Montgomery algorithm, and does not increase the amount of calculation.
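  • Generating the precomputed table of rotation factors in the same representation as the data can be sketched as follows. The sketch assumes that the rotation factors are the powers of a root of unity w and that the Montgomery representation of a value v is v*r mod q with r = 2**r_bits; the function name `build_rotation_table` and its parameters are hypothetical.

```python
def build_rotation_table(w: int, count: int, q: int,
                         montgomery: bool, r_bits: int = 0) -> list:
    """Return w**0 .. w**(count-1) modulo q, in the requested representation."""
    # In Montgomery representation every factor is scaled by r mod q.
    scale = (1 << r_bits) % q if montgomery else 1
    return [pow(w, i, q) * scale % q for i in range(count)]
```

  • For q = 17, w = 4 and four factors, the non-Montgomery table is `[1, 4, 16, 13]`, and with r = 32 the Montgomery table is `[15, 9, 2, 8]`. Switching the representation of the data only requires rebuilding this table; the butterfly code that consumes it stays the same.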
  • the result of the subtraction process is a negative number.
  • negative numbers in the processing results may lead to operational errors and incorrect calculations.
  • the computing device determines the redundant value based on the parameters; in the process of running the number theory transformation based on the data, each of the multiple computing units performs subtraction processing based on the redundant value.
  • the redundant value is a value greater than or equal to the subtrahend in the subtraction process.
  • the redundancy value is greater than or equal to the maximum value of the data.
  • The subtraction processing in the forward number theory transformation includes the subtraction processing in the redundant growth operation and the subtraction processing in the redundant reduction processing.
  • In one of these cases, the minuend is the data, and the subtrahend is the result of the redundant modular multiplication of the data.
  • In the other case, the minuend is the result of the redundant modular multiplication of the data, and the subtrahend is the result of the redundant modular multiplication of the data and the rotation factor.
  • The subtraction processing of the redundant growth operation is, for example, x - (y*w mod 2q), where x and y are both data, w is the rotation factor, and q is the modulus.
  • The subtraction processing of the redundant reduction processing is, for example, (x mod 2q) - (y*w mod 2q), where x and y are both data, w is the rotation factor, and q is the modulus.
  • The subtraction in the inverse number theory transformation is, in principle, a subtraction between two data, for example, x - y, where x and y are both data.
  • the function of subtraction based on redundant values is that not only the data itself but also the redundant values are substituted during subtraction, which is equivalent to adding redundant values to the minuend and amplifying the value of the minuend. Therefore, it helps to prevent the processing result of the subtraction process from being a negative number, thereby contributing to the correctness of the operation.
  • this method determines the redundancy value based on data-related parameters, so that the determined redundancy value can be adapted to the value of the parameter, thereby improving accuracy.
  • the redundant value does not need to be bound to a single parameter, but can be adjusted accordingly with the value of the parameter. Therefore, the solution has more available parameters, improving scalability and practicality.
  • the above redundancy value is equal to 2q, q represents the modulus, and q is a positive integer.
  • The reason for choosing 2q as the redundant value is that the forward number theory transformation first performs modular multiplication processing and then performs addition processing and subtraction processing.
  • The value range of modular multiplication processing is controllable. For example, when the multiplication is implemented with redundant modular multiplication, the result lies within [0, 2q); when it is implemented with modular multiplication without redundancy, the result lies within [0, q), where q is the modulus.
  • In either case the subtrahend lies within [0, 2q), so a redundant value of 2q is larger than the subtrahend. This guarantees that the result of the subtraction is non-negative, thus contributing to the correctness of the operation.
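  • Subtraction with the redundant value 2q can be sketched as a single butterfly. The sketch assumes that x is non-negative and uses an ordinary remainder modulo 2q as a stand-in for the redundant modular multiplication, since both give a value in [0, 2q) that is congruent to y*w modulo q; the function name `redundant_growth_butterfly` is hypothetical.

```python
def redundant_growth_butterfly(x: int, y: int, w: int, q: int):
    """Return (x + t, x + 2q - t), where t is y*w reduced into [0, 2q)."""
    t = (y * w) % (2 * q)   # stand-in for the redundant modular multiplication
    upper = x + t
    lower = x + 2 * q - t   # adding 2q keeps the difference non-negative
    return upper, lower
```

  • Both outputs are correct modulo q, and `lower` never becomes negative because t is smaller than 2q. Each output may be up to 2q larger than x, which is the growth that the planned reduction later removes.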
  • the redundancy value used is as small as possible to avoid excessive processing overhead and storage overhead due to excessive redundancy values.
  • the redundancy value is equal to (t+n)*q, t represents the redundancy multiple, n represents the polynomial dimension, q represents the modulus, and t, n and q are positive integers.
  • the embodiment shown in Figure 6 describes the case of data reduction processing.
  • For the forward number theory transformation, if the computing device determines that the parameters satisfy the condition, it determines that no computing unit needs to perform reduction processing, and the reduction of the data is omitted when the number theory transformation is run on the data.
  • For the inverse number theory transformation, if the computing device determines that the parameters satisfy the condition, it determines the computing unit in the last stage to be the computing unit that needs to perform reduction, and the data is reduced through the computing unit of the last stage when the number theory transformation is run on the data.
  • the parameters satisfy the condition, for example, the sum of the number of bits of the modulus and the number of stages is less than the preset number of bits, for example, the condition is log 2 n +log 2 q ⁇ 60.
  • n represents the polynomial dimension
  • q represents the modulus.
  • alternatively, the parameters satisfy the condition when the number of bits of the product of the modulus and the redundancy multiple, plus the number of stages, is less than the preset number of bits.
  • the reasoning is as follows. The number of bits of the product of the modulus and the redundancy multiple is the theoretical maximum number of bits of the input data. Without reduction processing, the number of bits grows by one each time the data passes through a stage, so the number of stages is the maximum number of bits that can be added once the data has passed through all stages of the transform. If the parameters satisfy the condition, no overflow occurs even in the worst case and no reduction processing is needed. Reduction processing and logical branch statements can then be removed completely while still guaranteeing that no overflow occurs, which improves the performance and efficiency of running the number theoretic transform.
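The condition above can be checked in a few lines. The following is a minimal sketch, not the patent's implementation: the function name `can_skip_reduction` and its parameter names are hypothetical, and the default threshold of 60 bits is taken from the example condition in the text.

```python
def can_skip_reduction(n: int, q: int, t: int = 1, preset_bits: int = 60) -> bool:
    """Return True if the worst-case bit growth stays below preset_bits.

    n: polynomial dimension (assumed to be a power of two)
    q: modulus
    t: redundancy multiple of the input data (inputs lie in [0, t*q))
    """
    assert n >= 2 and n & (n - 1) == 0, "n is assumed to be a power of two"
    stages = n.bit_length() - 1        # log2(n) stages in a radix-2 transform
    input_bits = (t * q).bit_length()  # bits of modulus times redundancy multiple
    # Worst case assumed in the text: one extra bit per stage.
    return input_bits + stages < preset_bits


if __name__ == "__main__":
    print(can_skip_reduction(1024, 12289))             # small parameters
    print(can_skip_reduction(1 << 16, (1 << 58) - 27)) # large parameters
```

When the function returns False, the per-stage or per-crossover precomputation described later is the fallback.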
  • the redundant reduction crossover is an example of a computing unit that performs data reduction processing (i.e., the first computing unit)
  • the redundant growth crossover is an example of a computing unit that does not need to perform data reduction processing (i.e., a computing unit other than the first computing unit)
  • the polynomial coefficient is an example of the data
  • max is an example of the preset number of bits
  • min is an example of the redundant value.
  • when constructing the NTT or INTT algorithm, each is divided into two parts: precomputation and on-the-fly computation.
  • Figure 9 is an architectural diagram of an NTT provided in this embodiment. As shown in Figure 9, the NTT internally includes a rotation factor generation module, an NTT precomputation module, an NTT generation module and an NTT operation module. The flow chart of the NTT generation module is shown in Figure 11.
  • Figure 12 is an architectural diagram of an INTT provided by this embodiment.
  • the INTT internally includes a rotation factor generation module, an INTT precomputation module, an INTT generation module and an INTT operation module.
  • the flow chart of the INTT precomputation module is shown in Figure 13.
  • the flow chart of the INTT generation module is shown in Figure 14.
  • the coefficients of each term in the input polynomial a of NTT are sorted in ascending order according to degree, with the term with the lowest degree at the front and the term with the highest degree at the end, resulting in a sequence.
  • x and y are the coefficients of terms of degree j and j+t respectively.
  • w is the rotation factor used in the crossover calculation, and w is obtained from a precomputed table.
  • the code for redundant growth crossover in Radix-2 NTT is as follows.
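  • x = a[j];
  • y = a[j+t];
  • tx = x;
  • ty = FastModMultiLazy(y, w, q); //Remarks: ty = y*w mod 2q;
  • a[j] = tx+ty;
  • a[j+t] = 2*q-ty+tx;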
  • the meaning of the NTT redundant growth crossover code shown above is to assign the coefficient x (i.e. a[j]) to the intermediate variable tx, and assign the result of the redundant modular multiplication of the coefficient y and the rotation factor w to the intermediate variable ty , then the coefficient x (i.e. a[j]) is modified to tx+ty, and the coefficient y (i.e. a[j+t]) is modified to 2*q-ty+tx. After such calculation, the value of coefficient x will become larger, which means growth.
  • the code for redundancy reduction crossover in Radix-2 NTT is as follows.
  • tx = FastModMultiLazy(x, 1, q); //Remarks: tx = x mod 2q;
  • ty = FastModMultiLazy(y, w, q); //Remarks: ty = y*w mod 2q;
  • a[j] = tx+ty;
  • a[j+t] = 2*q-ty+tx;
  • the meaning of the NTT redundancy reduction crossover code shown above is as follows: the result of the redundant modular multiplication of the coefficient x and the integer 1 is assigned to the intermediate variable tx, and the result of the redundant modular multiplication of the coefficient y and w is assigned to the intermediate variable ty; then the coefficient x (i.e. a[j]) is modified to tx+ty, and the coefficient y (i.e. a[j+t]) is modified to 2*q-ty+tx. After this calculation the value of coefficient x may become smaller, which is the meaning of reduction.
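The two NTT crossovers can be compared side by side in a runnable form. This is a minimal sketch under stated assumptions: `fast_mod_multi_lazy` is a stand-in that simply follows the remark "y*w mod 2q" and is not the fast routine the description refers to, and the Python function names are hypothetical.

```python
def fast_mod_multi_lazy(x: int, w: int, q: int) -> int:
    """Stand-in for FastModMultiLazy: returns x*w reduced into [0, 2q)."""
    return (x * w) % (2 * q)


def ntt_growth_cross(a, j, t, w, q):
    """Redundant growth crossover: a[j] is not reduced, so its bound grows."""
    x, y = a[j], a[j + t]
    tx = x
    ty = fast_mod_multi_lazy(y, w, q)
    a[j] = tx + ty
    a[j + t] = 2 * q - ty + tx     # adding 2q keeps the difference non-negative


def ntt_reduction_cross(a, j, t, w, q):
    """Redundancy reduction crossover: x is first brought back into [0, 2q)."""
    x, y = a[j], a[j + t]
    tx = fast_mod_multi_lazy(x, 1, q)
    ty = fast_mod_multi_lazy(y, w, q)
    a[j] = tx + ty                 # always below 4q
    a[j + t] = 2 * q - ty + tx     # always below 4q


if __name__ == "__main__":
    q, w = 17, 3
    grown, reduced = [40, 25], [40, 25]
    ntt_growth_cross(grown, 0, 1, w, q)
    ntt_reduction_cross(reduced, 0, 1, w, q)
    print(grown, reduced)
```

Both variants yield values congruent modulo q to x + y*w and x - y*w; they differ only in how large the stored representatives are allowed to become.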
  • Figure 15 shows a schematic diagram of the calculation method of redundancy growth crossover in Radix-2 NTT.
  • the calculation formula for Radix-2 NTT redundancy growth crossover is the following formula A.
  • Figure 16 shows a schematic diagram of the calculation method of redundancy reduction crossover in Radix-2 NTT.
  • the calculation formula for Radix-2 NTT redundancy reduction crossover is the following equation B.
  • the NTT precomputation module receives the input parameters, calculates from them the positions where overflow would occur in the NTT, and generates a sequence S.
  • the length of sequence S is equal to the number of stages in which overflow occurs in the NTT.
  • each value in the sequence is the identifier of such an NTT stage. The specific process is as follows.
  • n (n ≥ 4) is a power of 2
  • n = 2^m
  • the number of NTT stages is log2 n
  • the modulus has log2 q bits
  • the maximum redundancy value allowed by the NTT (determined by the machine word length, instruction word length or data word length) is 2^max
  • the minimum redundancy value min = 2q.
  • the embodiment of the present application supports redundancy of any size as NTT input, so the value of an input coefficient satisfies 2^(h-1) ≤ input < 2^h (h is a positive integer, h ≤ max).
  • NTT executes the s 1 phase (s 1 is a positive integer, 1 ⁇ s 1 ⁇ m), if (That is, in the s 1 -1 stage, there is ), it means that the s 1 stage needs to be reduced.
  • if the NTT is not yet complete, continue executing it.
  • s2 is a positive integer, 1 ≤ s1 < s2 ≤ m
  • in this way the sequence S = [s1, s2, ...] of the stages in which the coefficients may theoretically overflow is obtained.
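The stage-level precomputation can be sketched as a small simulation. Equations C and D are not reproduced in this text, so the bound used here is an assumption: a growth crossover adds at most 2q to a coefficient (x + ty with ty < 2q, or 2q - ty + tx), and a reduction crossover leaves a value below 4q. The function name and parameters are hypothetical.

```python
def ntt_reduction_stages(n: int, q: int, t: int, max_bits: int) -> list[int]:
    """Return the 1-based stage numbers that must use the reduction crossover.

    n: polynomial dimension, a power of two with n >= 4
    q: modulus
    t: redundancy multiple of the input (coefficients lie in [0, t*q))
    max_bits: largest number of bits a coefficient may occupy
    """
    assert n >= 4 and n & (n - 1) == 0
    m = n.bit_length() - 1            # number of stages, log2(n)
    bound = t * q                     # exclusive upper bound on any coefficient
    stages = []
    for s in range(1, m + 1):
        grown = bound + 2 * q         # assumed bound after a growth crossover
        if grown.bit_length() > max_bits:
            stages.append(s)          # this stage must reduce instead
            bound = 4 * q             # tx < 2q and ty < 2q after reduction
        else:
            bound = grown
    return stages


if __name__ == "__main__":
    # A 58-bit modulus, n = 2^16 and a 63-bit limit, as in the worked example.
    print(ntt_reduction_stages(1 << 16, (1 << 58) - 27, 1, 63))
```

Under this assumed bound, the parameters of the worked example in the text (58-bit coefficients, n = 2^16, 63-bit limit) lead to a reduction only in stage 16, with outputs below 4q, that is, within 60 bits.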
  • each term of the input polynomial a of INTT is sorted in ascending order according to its degree, with the term with the lowest degree at the front and the term with the highest degree at the bottom, thereby obtaining a sequence.
  • x and y are the coefficients of terms of degree j and j+t respectively.
  • w is the rotation factor used in this crossover calculation, and the rotation factor is obtained from a precomputed table.
  • Specify a minimum redundancy value min which is a positive integer; the function of the minimum redundancy value is to ensure that no matter what Y is equal to, the minimum redundancy value will be greater than Y.
  • the meaning of the INTT redundant growth crossover code shown above is as follows: the sum of x and y is assigned to tx, and the result of min-y+x is assigned to ty; then the coefficient x (i.e. a[j]) is modified to tx, and the coefficient y (i.e. a[j+t]) is modified to the result of the redundant modular multiplication of ty and w. After this calculation the value of coefficient x becomes larger, which is the meaning of growth.
  • the code for redundancy reduction crossover in Radix-2 INTT is as follows.
  • tx = FastModMultiLazy(x+y, 1, q); //Remarks: tx = x+y mod 2q;
  • ty = min-y+x;
  • a[j] = tx;
  • a[j+t] = FastModMultiLazy(ty, w, q);
  • the meaning of the INTT redundancy reduction crossover code shown above is as follows: the result of the redundant modular multiplication of x+y and the integer 1 is assigned to tx, and the result of min-y+x is assigned to ty; then the coefficient x (i.e. a[j]) is modified to tx, and the coefficient y (i.e. a[j+t]) is modified to the result of the redundant modular multiplication of ty and w. After this calculation the value of coefficient x may become smaller, which is the meaning of reduction.
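The INTT crossovers can be sketched in the same way. As before, `fast_mod_multi_lazy` is a simple stand-in for the fast routine, and the function names and the `min_red` parameter name are hypothetical. The example sets the minimum redundancy value to (4+n)*q for an input redundancy multiple of 4.

```python
def fast_mod_multi_lazy(x: int, w: int, q: int) -> int:
    """Stand-in for FastModMultiLazy: returns x*w reduced into [0, 2q)."""
    return (x * w) % (2 * q)


def intt_growth_cross(a, j, t, w, q, min_red):
    """Redundant growth crossover: the sum x+y is stored without reduction."""
    x, y = a[j], a[j + t]
    tx = x + y
    ty = min_red - y + x            # min_red > y keeps this non-negative
    a[j] = tx
    a[j + t] = fast_mod_multi_lazy(ty, w, q)


def intt_reduction_cross(a, j, t, w, q, min_red):
    """Redundancy reduction crossover: the sum x+y is brought into [0, 2q)."""
    x, y = a[j], a[j + t]
    tx = fast_mod_multi_lazy(x + y, 1, q)
    ty = min_red - y + x
    a[j] = tx
    a[j + t] = fast_mod_multi_lazy(ty, w, q)


if __name__ == "__main__":
    q, n, w = 17, 8, 3
    min_red = (4 + n) * q           # redundancy multiple 4, polynomial dimension 8
    grown, reduced = [60, 50], [60, 50]
    intt_growth_cross(grown, 0, 1, w, q, min_red)
    intt_reduction_cross(reduced, 0, 1, w, q, min_red)
    print(grown, reduced)
```

Because min_red is a multiple of q, adding it changes only the stored representative and not the residue modulo q.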
  • Figure 17 shows a schematic diagram of the calculation method of redundancy growth crossover in Radix-2 INTT.
  • the calculation formula for Radix-2 INTT redundancy growth crossover is the following equation E.
  • Figure 18 shows a schematic diagram of the calculation method of redundancy reduction crossover in Radix-2 INTT.
  • the calculation formula for Radix-2 INTT redundancy reduction crossover is the following formula F.
  • INTT requires an algorithm to accurately find which stage, which butterfly, and which intersection requires reduction processing.
  • the algorithm used in INTT to determine the need for reduction processing is as follows.
  • the precomputation algorithm shown above is used to calculate the specific positions of intersections that require reduction processing in INTT.
  • the logic of this algorithm is as follows.
  • after equation E, the X term grows by 1 bit, because the sum x+y has at most 1 bit more than the larger of x and y; the Y term has 1 bit more than the modulus q, because the output of the redundant modular multiplication always lies in [0, 2q), so the number of bits of the Y term is always less than or equal to the number of bits of 2q.
  • the number of bits of the computed Y term is taken directly as the number of bits of 2q.
  • the X term and the Y term may exchange positions with each other, so the numbers of bits of both terms are obtained and substituted into equation E to calculate how the number of bits of the X term changes after equation E. If, before substitution into equation E, the number of bits of the X term is found to equal the number of bits of the maximum redundancy value, that crossover must perform reduction and is called a redundancy reduction crossover.
  • the number of bits of the Y term does not need to be considered, because the Y term is passed through redundant modular multiplication. As long as the Y term does not exceed the machine word length, it never reaches or exceeds the maximum redundancy value.
  • even in the worst case, the redundant modular multiplication completes the calculation and outputs a value with the same number of bits as 2q, so the Y term has no overflow problem.
  • the INTT is simulated according to the above logic and the number of bits at each crossover is calculated, in order to find every crossover at which, in theory, the number of bits of the X term at input equals the number of bits of the maximum redundancy value. These crossovers are recorded in the precomputation table, which the INTT generation module uses to generate a customized INTT.
  • the INTT precomputation module receives the input parameters, calculates from them the positions where overflow would occur in the INTT, and generates a sequence T.
  • the length of sequence T is equal to the number of crossovers at which overflow occurs in the INTT, and each value in the sequence is the identifier of such a crossover; the sequence T is saved to the precomputation table.
  • red_position contains all crossover positions that need to be reduced; by traversing red_position, each position that needs reduction can be located exactly as [t, <s, b, c>], where t means that this is the t-th crossover of the INTT, and <s, b, c> means that the reduction occurs at the c-th crossover of the b-th butterfly in the s-th stage of the INTT.
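The crossover-level simulation can be sketched as follows. The precomputation algorithm itself is not reproduced in this text, so several points are assumptions: the stage layout (one butterfly in stage 1, doubling in each later stage), the trigger (reduce when the bound on x+y would need more than max_bits bits, which presumes the machine word has room for the unreduced sum), and the names `intt_reduction_positions` and `bound`. Only `red_position` and the [t, <s, b, c>] layout come from the description.

```python
def intt_reduction_positions(n: int, q: int, t: int, max_bits: int):
    """Return red_position as a list of (t, (s, b, c)) entries.

    t in each entry is the running crossover index over the whole INTT;
    (s, b, c) is the c-th crossover of the b-th butterfly in stage s.
    The parameter t is the redundancy multiple of the input coefficients.
    """
    assert n >= 2 and n & (n - 1) == 0
    bound = [t * q] * n              # exclusive upper bound per coefficient
    red_position = []
    counter = 0
    half = n // 2                    # distance between the two crossover inputs
    stage = 1
    while half >= 1:
        butterflies = n // (2 * half)
        for b in range(1, butterflies + 1):
            base = (b - 1) * 2 * half
            for c in range(1, half + 1):
                counter += 1
                i = base + c - 1
                k = i + half
                x_sum = bound[i] + bound[k]
                if x_sum.bit_length() > max_bits:
                    red_position.append((counter, (stage, b, c)))
                    bound[i] = 2 * q  # X term after redundant modular multiplication
                else:
                    bound[i] = x_sum  # X term grows
                bound[k] = 2 * q      # Y term always passes through the multiplication
        half //= 2
        stage += 1
    return red_position


if __name__ == "__main__":
    for entry in intt_reduction_positions(8, 13, 1, 6):
        print(entry)
```

With n = 8, q = 13 and a 6-bit limit, only the crossovers whose two inputs are both accumulated sums are flagged; the others keep growing without reduction.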
  • NTT and INTT are the main calculations of this type of algorithm.
  • the parameters of this type of algorithm are relatively fixed, and the parameter size satisfies log2 n + log2 q < 60. Therefore, only redundant modular multiplication is called in the butterflies, and all reduction processing and all logical branch statements in the NTT and INTT butterflies are removed completely, which improves the performance of polynomial multiplication.
  • Example 1 is as follows:
  • redundant modular multiplication can be used to control the modular multiplication output within the range of [0, 2q).
  • the calculation process is the same as that of the butterfly in the main stage.
  • the main purpose is to control the redundancy of the output results. Therefore, although the coefficients do not go out of bounds, this step can still be regarded as a redundancy reduction crossover.
  • the overall output can be kept within the range [0, 2q) by calling the redundant modular multiplication function only when the butterfly calculates the integer modular multiplication.
  • the above example builds a more efficient NTT/INTT through precomputation; after all logical branch statements are removed, only the necessary reduction processing is retained, and the output values stay within the range allowed by the data type or instruction set without going out of bounds; using redundant modular multiplication for modular multiplication is also more efficient.
  • NTT/INTT need to be split. NTT is split by stage and INTT is split by crossover, based on the precomputed positions that require reduction.
  • the maximum redundancy value 2^62 has 63 bits, that is, at every point during the NTT calculation the theoretical number of bits of each coefficient must be less than or equal to 63.
  • n = 2^16
  • the coefficients of the input polynomial a are in the interval [0, q-1), that is, the maximum value of log2 a_i is 57, so the polynomial coefficients are treated as 58-bit data.
  • the minimum redundancy value is 2q. In precomputation, all the required parameters and the number of stages are substituted into equation C and equation D. The result shows that the NTT needs to call the redundancy reduction branch only in the last stage, stage 16, and calls the redundancy growth branch in all other stages; the polynomial coefficients finally output by the NTT are kept within 60 bits.
  • Figure 7 shows the calculation process of NTT in Example 2.
  • the redundancy multiple of the input polynomial coefficients is 4 times.
  • the coefficients of the polynomial output by INTT are controlled within 62 bits, which is within the allowable range.
  • the minimum redundancy value is derived as follows: when the redundancy multiple of the input polynomial coefficients is 1 (that is, the coefficients have no redundancy), the minimum redundancy value is n*q; if the redundancy multiple of the input polynomial coefficients is 4, the maximum value of an input coefficient does not exceed 4q, so setting the minimum redundancy value min to (4+n)q ensures that min is greater than Y.
  • Figure 8 shows the INTT calculation process of Example 2.
  • Example 2 has the same effect as Example 1, but relaxes the redundancy multiple of the polynomial coefficient value.
  • the input data and output results support a larger redundancy range; the positions where reduction processing occurs are located precisely, unnecessary reduction processing is removed, and the reduction processing is not bound to the Montgomery algorithm; more parameter sets are available.
  • Figure 19 is a schematic structural diagram of a data processing device 800 provided by an embodiment of the present application.
  • the apparatus 800 includes a first determination module 801 and a second determination module 802 .
  • the device 800 is provided on the computing device shown in Figure 6.
  • the first determination module 801 is used to perform S201
  • the second determination module 802 is used to perform S202.
  • the device embodiment described in Figure 19 is only illustrative.
  • the division of the above modules is only a logical function division. In actual implementation, there may be other division methods.
  • multiple modules or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • Each functional module in each embodiment of the present application can be integrated into one module, or each module can exist physically alone, or two or more modules can be integrated into one module.
  • Each module in the data processing device 800 is implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • the above-mentioned first determination module 801 and second determination module 802 are implemented as software function modules generated after at least one processor 901 in Figure 20 reads the program code stored in the memory 902.
  • the above-mentioned modules in Figure 19 are respectively implemented by different hardware in the computing device.
  • the first determination module 801 is implemented by part of the processing resources of at least one processor 901 in Figure 20 (such as one or two cores of a multi-core processor), while the second determination module 802 is implemented by the remaining processing resources of the at least one processor 901 (such as the other cores of the multi-core processor), or by a programmable device such as a field-programmable gate array (FPGA) or a coprocessor.
  • FIG 20 is a schematic structural diagram of a computing device 900 provided by an embodiment of the present application.
  • the computing device 900 is used to perform the method shown in Figure 6.
  • Computing device 900 includes processor 901, memory 902, and network interface 903.
  • the processor 901 is, for example, a general-purpose central processing unit (CPU), a network processor (NP), a graphics processing unit (GPU), a neural-network processing unit (NPU), a data processing unit (DPU), a microprocessor, or one or more integrated circuits used to implement the solution of the present application.
  • the processor 901 includes an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • the PLD is, for example, a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the memory 902 is, for example, a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, without limitation.
  • the memory 902 exists independently and is connected to the processor 901 through an internal connection 904.
  • memory 902 and processor 901 are optionally integrated together.
  • Network interface 903 uses any transceiver-like device for communicating with other devices or communications networks.
  • the network interface 903 includes, for example, at least one of a wired network interface or a wireless network interface.
  • the wired network interface is, for example, an Ethernet interface.
  • the Ethernet interface is, for example, an optical interface, an electrical interface or a combination thereof.
  • the wireless network interface is, for example, a wireless local area network (WLAN) interface, a cellular network interface or a combination thereof.
  • processor 901 includes one or more CPUs, such as CPU0 and CPU1 as shown in Figure 20.
  • computing device 900 optionally includes multiple processors, such as processor 901 and processor 905 shown in FIG. 20 .
  • processors are, for example, a single-core processor (single-CPU) or a multi-core processor (multi-CPU).
  • Processor here optionally refers to one or more devices, circuits, and/or processing cores for processing data (eg, computer program instructions).
  • computing device 900 also includes internal connection 904.
  • the processor 901, the memory 902 and at least one network interface 903 are connected through an internal connection 904.
  • Internal connections 904 include pathways that carry information between the components described above.
  • internal connection 904 is a single board or bus.
  • the internal connections 904 are divided into address bus, data bus, control bus, etc.
  • computing device 900 also includes an input-output interface 906.
  • Input/output interface 906 is connected to internal connection 904 .
  • the input/output interface 906 is used to connect to an input device and receive commands or data input by the user through the input device related to the above embodiments, such as modulus, redundancy multiples, polynomial dimensions and other parameters.
  • Input devices include but are not limited to keyboards, touch screens, microphones, mice or sensing devices.
  • the input and output interface 906 is also used to connect to an output device.
  • the input and output interface 906 outputs, through an output device, the processing results generated by the processor 901 by executing the above method, such as the data after the number theoretic transform.
  • Output devices include but are not limited to monitors, printers, projectors, etc.
  • the processor 901 implements the method in the above embodiment by reading the program code 910 stored in the memory 902, or the processor 901 implements the method in the above embodiment by using the internally stored program code.
  • the memory 902 stores the program code that implements the method provided by the embodiment of the present application.
  • the processor 901 is used to instruct the input and output interface 906 or the network interface 903 to perform S201, and the processor 901 is also used to perform S202.
  • the processor 901 is used to instruct the input and output interface 906 or the network interface 903 to perform S201, and the processor 905 is used to perform S202.
  • for more details on how the processor 901 implements the above functions, refer to the descriptions in the previous method embodiments, which are not repeated here.
  • A refers to B, which means that A is the same as B or that A is a simple transformation of B.
  • the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in the embodiments of this application are all processed.
  • the data to be encrypted and decrypted and the parameters corresponding to the data involved in this application were obtained with full authorization.
  • “at least one” means one or more, and “plurality” means two or more.
  • multiple computing units refer to two or more computing units.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • when implemented using software, the embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • when the computer program instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present application are generated in whole or in part.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server or data center by wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains one or more available media integrated.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media (eg, Solid State Disk (SSD)), etc.


Abstract

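This application provides a data processing method, apparatus, device, and storage medium, and belongs to the field of computer technology. Based on parameters, the application determines the estimated number of bits of the processing result produced by each computing unit in the steps of a number theoretic transform, and on that basis identifies the computing units responsible for reduction processing. The value of a processing result can thus be made smaller at the appropriate position without introducing logical branch statements, which reduces the number of bits of the processing result, prevents it from exceeding the upper limit of bits that the computing device can represent, and avoids overflow. Compared with introducing logical branch statements to perform reduction, this method removes the logical branch statements and optimizes the structure of the number theoretic transform, thereby improving the efficiency of running it.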

Description

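Data processing method, apparatus, device, and storage medium
This application claims priority to Chinese patent application No. 202210656082.X, filed on June 10, 2022 and entitled "Data processing method, apparatus, device, and storage medium", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of computer technology, and in particular to a data processing method, apparatus, device, and storage medium.
Background
In many encryption and decryption schemes, polynomial multiplication is the main component. Number theoretic transforms help implement polynomial multiplication more efficiently and thereby improve the efficiency of these schemes.
In a classical number theoretic transform algorithm, logical branch statements are introduced to avoid overflow. While running the algorithm, the computing device executes these branch statements to reduce the data so that its value becomes smaller, which avoids overflow.
However, executing logical branch statements takes a relatively long time, which makes running the number theoretic transform inefficient.
Summary
Embodiments of this application provide a data processing method, apparatus, device, and storage medium that can improve the efficiency of running a number theoretic transform. The technical solutions are as follows.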
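In a first aspect, a data processing method is provided. The method is performed by a computing device used to run a number theoretic transform on data, the steps of the number theoretic transform of the data include multiple computing units, and the method includes:
determining, based on parameters of the data, the estimated number of bits of the processing result produced by each computing unit, where the parameters indicate the number of bits of the data; and
determining, based on the estimated numbers of bits, a first computing unit from the multiple computing units, where the first computing unit is a computing unit used to perform reduction processing on the processing result of a second computing unit, and the estimated number of bits of the processing result of the second computing unit satisfies a preset number of bits.
In the method of the first aspect, the estimated number of bits of the processing result produced by each computing unit in the steps of the number theoretic transform is determined from the parameters, and the computing units responsible for reduction processing are identified on that basis. The value of a processing result can thus be made smaller at the appropriate position without introducing logical branch statements, which reduces the number of bits of the processing result, prevents it from exceeding the upper limit of bits that the computing device can represent, and avoids overflow. Compared with introducing logical branch statements to perform reduction, this method removes the logical branch statements and optimizes the structure of the number theoretic transform, thereby improving the efficiency of running it.
In some implementations, the reduction processing includes:
performing redundant modular multiplication on the processing result of the second computing unit.
In the above implementation, reduction is implemented by redundant modular multiplication. On the one hand, reduction need not be bound to the Montgomery algorithm and the data need not be kept in Montgomery representation; in other words, the solution is usable whether the data is in Montgomery representation or not, which makes it more flexible and practical. On the other hand, it also speeds up reduction and thus improves efficiency; in particular, it helps significantly accelerate the computing device in scenarios such as big-number arithmetic.
In some implementations, performing redundant modular multiplication on the processing result of the second computing unit includes:
performing redundant modular multiplication on the processing result of the second computing unit based on a rotation factor, where the rotation factor has the same representation as the data.
The above implementation supports dynamically adjusting the representation of the data according to the needs of the specific computing task.
In some implementations, the representation is a Montgomery representation or a non-Montgomery representation.
In some implementations, the method further includes:
performing encryption or decryption on the reduced processing result of the second computing unit.
In some implementations, the parameters include the modulus used by each of the multiple computing units when performing the modulo operation, the redundancy multiple of the data relative to the modulus, and the polynomial dimension of the data.
In the above implementation, the input data may carry some amount of redundancy, and the modulus and the redundancy multiple describe how large that redundancy is. Even when the data is redundant, the computing units that need to perform reduction can therefore be located fairly precisely, which reduces superfluous reduction processing.
In some implementations, the preset number of bits is determined based on the bit width of the processor in the computing device, and is 1 or 2 less than the bit width of the processor.
In the above implementation, rather than setting the preset number of bits from experience, it is determined from a hardware factor, the bit width of the processor. The preset number of bits then matches the capability of the hardware, and different values can be determined for hardware of different capabilities. Using this preset number of bits as the reference for finding the computing units that need to perform reduction locates them more precisely and reduces unnecessary reduction processing.
Taking a preset number of bits of 63 as an example, reduction is performed in a computing unit only if the estimated number of bits of its input data reaches 62 or 63, and the computing units before it do not need to perform reduction. In this way, while overflow is avoided, the restriction on data values is relaxed as far as possible, the capability of the hardware is fully used, resource utilization improves, and the number of reductions decreases.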
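In some implementations, each of the multiple computing units is further used to perform subtraction based on a redundancy value, where the redundancy value is greater than or equal to the subtrahend of the subtraction.
The number theoretic transform function includes a forward number theoretic transform function and an inverse number theoretic transform function. The subtraction in the forward function is x-y*w mod 2q or x mod 2q-y*w mod 2q, and the subtraction in the inverse function is x-y, where x and y both denote data, q denotes the modulus used when the data is taken modulo in the transform, w denotes the rotation factor, mod denotes the modulo operation, * denotes multiplication, and - denotes subtraction.
In the above implementation, the redundancy value is brought into the subtraction, which amounts to adding the redundancy value to the minuend and enlarging it. This helps prevent the result of the subtraction from being negative and so helps preserve correctness. In addition, rather than being set from experience, the redundancy value is determined from parameters related to the data, so it adapts to the parameter values and is more accurate. The redundancy value also need not be bound to a single parameter and can be adjusted as the parameters change, so more parameters are usable with the solution, which improves extensibility and practicality.
In some implementations, the number theoretic transform includes a forward number theoretic transform, and the redundancy value is equal to 2q, where q denotes the modulus used by each of the multiple computing units when performing the modulo operation and q is a positive integer.
In the above implementation, the forward number theoretic transform first performs modular multiplication and then performs addition and subtraction, and the value range of the modular multiplication is controllable: when multiplication is implemented with redundant modular multiplication the result lies within [0, 2q), and when it is implemented with modular multiplication without redundancy the result lies within [0, q), where q is the modulus. Because the subtrahend of the subtraction is the output of the modular multiplication, its value lies within [0, 2q), so bringing 2q into the subtraction means the redundancy value is always larger than the subtrahend. This guarantees that the result of the subtraction is not negative and helps preserve correctness. In addition, the redundancy value is as small as possible, which avoids excessive processing and storage overhead caused by an overly large redundancy value.
In some implementations, the number theoretic transform includes an inverse number theoretic transform, and the redundancy value is equal to (t+n)*q, where q denotes the modulus used by each of the multiple computing units when performing the modulo operation, t denotes the redundancy multiple of the data relative to the modulus, n denotes the polynomial dimension of the data, and t, n and q are positive integers.
In the above implementation, when the data has no redundancy, the input data of any computing unit does not exceed n*q. To allow for redundant data, t*q is added to n*q to form the redundancy value, which supports input data with any redundancy multiple while ensuring that the result of the subtraction is not negative, and helps preserve correctness.
In some implementations, each of the multiple computing units is used to process k data items and produce k processing results, where k is a positive integer.
In the above implementation, the number theoretic transform is in effect split, crossover by crossover, into computing units that need to perform reduction and computing units that do not, which helps locate the positions that require reduction at a fine granularity.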
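In a second aspect, a data processing apparatus is provided. The apparatus has the function of implementing the first aspect or any optional implementation of the first aspect. The apparatus includes at least one module, and the at least one module is used to implement the method provided by the first aspect or any optional implementation of the first aspect.
In some embodiments, the modules in the apparatus are implemented in software and are program modules. In other embodiments, the modules in the apparatus are implemented in hardware or firmware. For details of the apparatus provided in the second aspect, refer to the first aspect or any optional implementation of the first aspect; they are not repeated here.
In a third aspect, a computing device is provided. The computing device includes a processor coupled to a memory, the memory stores at least one computer program instruction, and the at least one computer program instruction is loaded and executed by the processor so that the computing device implements the method provided by the first aspect or any optional implementation of the first aspect. For details of the computing device provided in the third aspect, refer to the first aspect or any optional implementation of the first aspect; they are not repeated here.
In a fourth aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction which, when run on a computer, causes the computer to perform the method provided by the first aspect or any optional implementation of the first aspect.
In a fifth aspect, a computer program product is provided. The computer program product includes one or more computer program instructions which, when loaded and run by a computer, cause the computer to perform the method provided by the first aspect or any optional implementation of the first aspect.
In a sixth aspect, a chip is provided. The chip includes programmable logic circuits and/or program instructions, and when the chip runs it implements the method provided by the first aspect or any optional implementation of the first aspect.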
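Brief Description of the Drawings
Figure 1 is a flowchart of an NTT provided by an embodiment of this application;
Figure 2 is a flowchart of an INTT provided by an embodiment of this application;
Figure 3 is a schematic diagram of the calculation of one crossover in one butterfly of a Radix-2 NTT provided by an embodiment of this application;
Figure 4 is a schematic diagram of the calculation of one crossover in one butterfly of a Radix-2 INTT provided by an embodiment of this application;
Figure 5 is a schematic diagram of the principle of computing polynomial multiplication provided by an embodiment of this application;
Figure 6 is a flowchart of a data processing method provided by an embodiment of this application;
Figure 7 is a schematic diagram of how the number of bits of the data changes while an NTT runs, provided by an embodiment of this application;
Figure 8 is a schematic diagram of how the number of bits of the data changes while an INTT runs, provided by an embodiment of this application;
Figure 9 is an architectural diagram of an NTT provided by an embodiment of this application;
Figure 10 is an architectural diagram of an NTT precomputation module provided by an embodiment of this application;
Figure 11 is an architectural diagram of an NTT generation module provided by an embodiment of this application;
Figure 12 is an architectural diagram of an INTT provided by an embodiment of this application;
Figure 13 is an architectural diagram of an INTT precomputation module provided by an embodiment of this application;
Figure 14 is an architectural diagram of an INTT generation module provided by an embodiment of this application;
Figure 15 is a schematic diagram of the calculation of a redundant growth crossover in a Radix-2 NTT provided by an embodiment of this application;
Figure 16 is a schematic diagram of the calculation of a redundancy reduction crossover in a Radix-2 NTT provided by an embodiment of this application;
Figure 17 is a schematic diagram of the calculation of a redundant growth crossover in a Radix-2 INTT provided by an embodiment of this application;
Figure 18 is a schematic diagram of the calculation of a redundancy reduction crossover in a Radix-2 INTT provided by an embodiment of this application;
Figure 19 is a schematic structural diagram of a data processing apparatus 800 provided by an embodiment of this application;
Figure 20 is a schematic structural diagram of a computing device 900 provided by an embodiment of this application.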
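Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the implementations of this application are described in further detail below with reference to the drawings.
Some terms and concepts involved in the embodiments of this application are explained below.
(1) Primitive root of unity
Let Z_q be a finite field. Given a positive integer g and a positive integer q (g, q ≥ 2) that are coprime, if there is a smallest integer n > 1 such that g^n ≡ 1 mod q holds, that is, g^k ≠ 1 mod q for every integer k with 1 ≤ k ≤ n-1, then g is called a primitive n-th root of unity over Z_q.
(2) Forward number theoretic transform (NTT)
Let the positive integer n be a power of 2, and let q be a prime satisfying q ≡ 1 mod 2n. Let ω be a primitive n-th root of unity over Z_q; then ω^n ≡ 1 mod q, and the powers of ω modulo q satisfy ω^0 ≠ ω ≠ … ≠ ω^(n-1) mod q. Define a polynomial ring R_q and a polynomial a(x) ∈ R_q, where the a_i are the coefficients of a(x).
Substituting ω^0, ω, …, ω^(n-1) mod q into the polynomial a(x) gives the n values a(ω^j) mod q, j = 0, 1, …, n-1.
Taking these values as the coefficients of a polynomial, the NTT of a(x) is defined as that polynomial.
(3) Inverse number theoretic transform (inverse NTT, INTT)
The INTT is defined by a(x) = INTT(NTT(a(x))).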
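The definitions in (2) and (3) can be illustrated with a direct O(n^2) evaluation. This is a minimal sketch with hypothetical function names, not the fast butterfly form used in the rest of the description. The explicit inverse formula (scaling by n^(-1) and using ω^(-1)) is the standard one and is an assumption here, since the text defines the INTT only as the inverse of the NTT.

```python
def ntt_naive(a, omega, q):
    """Evaluate a(x) at omega^0, ..., omega^(n-1) modulo q."""
    n = len(a)
    return [sum(a[i] * pow(omega, i * j, q) for i in range(n)) % q
            for j in range(n)]


def intt_naive(a_hat, omega, q):
    """Invert ntt_naive using omega^(-1) and the factor n^(-1) modulo q."""
    n = len(a_hat)
    n_inv = pow(n, -1, q)            # modular inverse, Python 3.8+
    omega_inv = pow(omega, -1, q)
    return [n_inv * sum(a_hat[j] * pow(omega_inv, i * j, q) for j in range(n)) % q
            for i in range(n)]


if __name__ == "__main__":
    q, n, omega = 17, 8, 2           # 17 ≡ 1 mod 16, and 2 has order 8 modulo 17
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    a_hat = ntt_naive(a, omega, q)
    print(a_hat)
    print(intt_naive(a_hat, omega, q))
```

The round trip returns the original coefficients, which is the defining property a(x) = INTT(NTT(a(x))).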
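(4) Rotation factor (twiddle factor)
The rotation factor originally referred to the complex constant multiplied in the butterfly operation of the Cooley-Tukey fast Fourier transform algorithm. Because the constant lies on the unit circle in the complex plane, multiplying by it rotates the multiplicand in the complex plane, hence the name. Later, the term came to denote any constant multiplication in the FFT and its variants. The name originates from W.M.Gentleman and G.Sande, "Fast Fourier transforms - for fun and profit", Proc.AFIPS 29, pp.563–578, 1966, and has since been used widely in the literature. For number theoretic transforms, in view of the NTT definition introduced in (2) above, a rotation factor is, for example, ω^k mod q (k = 0, 1, …, n-1).
(5) Stage
A stage is all the processing steps of a number theoretic transform within the same time period.
Figure 1 is a flowchart of an NTT provided by an embodiment of this application. The NTT shown in Figure 1 is divided into 3 stages. In order of time period from earliest to latest (that is, from left to right), the 3 stages are called stage 1, stage 2 and stage 3. Stage 1 includes all processing steps of the NTT in the 1st time period, stage 2 includes all processing steps of the NTT in the 2nd time period, and so on.
Figure 2 is a flowchart of an INTT provided by an embodiment of this application. The INTT shown in Figure 2 is divided into 3 stages. In order of time period from earliest to latest, the 3 stages are called stage 1, stage 2 and stage 3. Stage 1 includes all processing steps of the INTT in the 1st time period, stage 2 includes all processing steps of the INTT in the 2nd time period, and so on.
(6) Butterfly
Given an input polynomial with n coefficients (n ≥ 2, usually a power of 2), each stage of the number theoretic transform exhibits regular processing among disjoint groups of m data items (2 ≤ m ≤ n, usually a power of 2). This processing is called a butterfly, also a butterfly computation unit or butterfly operation. The butterfly is usually the main computing unit of a number theoretic transform. A stage of a number theoretic transform includes one or more butterflies. For example, in the NTT shown in Figure 1, stage 1 includes 4 butterflies, stage 2 includes 2 butterflies, and stage 3 includes 1 butterfly. In the INTT shown in Figure 2, stage 1 includes 1 butterfly, stage 2 includes 2 butterflies, and stage 3 includes 4 butterflies.
(7) Crossover (cross)
A crossover is a computing unit that takes k data items as input and outputs k data items. A butterfly that processes m data items includes one or more crossovers, where 2 ≤ m ≤ n, m is usually a power of 2, 2 ≤ k ≤ m, and k is usually a power of 2. If k = 2 and m is a power of k, the NTT/INTT can each be divided into log2 n stages, and such an NTT/INTT is called a Radix-2 NTT/INTT. Unless otherwise stated, NTT/INTT in this specification refers to Radix-2 NTT/INTT.
Different butterflies in the same NTT or INTT can include different numbers of crossovers. As shown in Figure 1, each butterfly in stage 1 of the NTT includes 1 crossover, each butterfly in stage 2 includes 2 crossovers, and the butterfly in stage 3 includes 4 crossovers.
Figure 3 shows the calculation of one crossover in one butterfly of a Radix-2 NTT provided by an embodiment of this application; the crossover computes X = x + y*w mod q and Y = x - y*w mod q. Figure 4 shows the calculation of one crossover in one butterfly of a Radix-2 INTT provided by an embodiment of this application; the crossover computes X = x + y mod q and Y = (x - y)*w mod q.
(8) Logical branch
A logical branch generally includes one or more conditions and the processing steps corresponding to each condition. When the computing device is about to execute a logical branch while running, it checks, based on the current running state, whether the conditions in the branch are satisfied. If the computing device determines that the running state satisfies one of the conditions, it executes the processing steps corresponding to that condition.
(9) Overflow (also called numerical out-of-bounds)
Overflow means that the number of bits of a processing result produced by the computing device exceeds the machine word length of the computing device. For example, if the machine word length is 32 and the processing result is 33 bits of data, overflow occurs. When overflow occurs, the computing device transforms the processing result into data whose number of bits fits within the machine word length and continues processing with the transformed data, which leads to incorrect results. Overflow therefore needs to be avoided.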
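The effect of overflow on modular arithmetic can be shown with a short simulation. This is a minimal sketch that assumes unsigned wraparound, where bits above the machine word are simply dropped; the description only says that the result is transformed to fit the word, so that behaviour and the names below are assumptions.

```python
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1


def wrapped_add(x: int, y: int) -> int:
    """Add two values the way a 32-bit unsigned register would."""
    return (x + y) & MASK            # bits above the word length are lost


if __name__ == "__main__":
    q = 2_147_483_647                # 2^31 - 1, used here only as an example modulus
    x, y = 3_000_000_000, 2_000_000_000
    exact = x + y                    # needs 33 bits
    wrapped = wrapped_add(x, y)
    print(exact.bit_length(), exact % q, wrapped % q)
```

The two residues differ, which is why the stored values have to be reduced before their bit count exceeds the word length.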
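(10) Precomputation
Precomputation is a way to speed up a processing task. It means executing some processing steps in advance, before the processing task runs, and storing the results in some location. The location that holds the precomputed results is generally called a precomputation table (look-up table, LUT). While the processing task runs, the precomputed results can then be looked up in the table and used directly, without executing the precomputation steps on the fly, which shortens the time needed to complete the task and improves efficiency.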
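A typical use of precomputation in this setting is a table of rotation factors. The following is a minimal sketch with a hypothetical function name; it stores ω^k mod q for k = 0, …, n-1 so that the transform can look the values up instead of recomputing powers.

```python
def build_twiddle_table(omega: int, n: int, q: int) -> list[int]:
    """Precompute omega^k mod q for k = 0, ..., n-1."""
    table = [1] * n
    for k in range(1, n):
        table[k] = table[k - 1] * omega % q   # one multiplication per entry
    return table


if __name__ == "__main__":
    # 2 is a primitive 8th root of unity modulo 17.
    print(build_twiddle_table(2, 8, 17))      # [1, 2, 4, 8, 16, 15, 13, 9]
```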
(11)即时计算
即时计算是与预计算相对的概念,即时计算是指执行处理任务的过程。
(12)模数和取模运算
给定a和q,a和q均为整数,q≥1。计算a÷q(也可写为a/q),如果余数等于r(0≤r<q),则称r是a除以q的余数,称q是模数。求r的处理过程可写为r≡a mod q(或者写为r=a%q),其中mod和%是模运算符,该处理过程称为取模运算,也称模运算或模算。
(13)模乘处理
给定a、b和q(q≥1),a、b和q均为整数。计算a*b mod q(也可写为ab mod q,a*b%q,ab%q)的过程称为模乘处理,简称模乘。
(14)冗余
若整数a和整数b关于模数q同余,即b≡a mod q,其中0≤a<q并且b≥a,则a是b模q的余数(即b除以q的余数等于a),b相对于模数q数值冗余,简称b冗余。
(15)冗余倍数
冗余倍数用于指示数据相对于模数而言数值冗余的大小。以数学的方式表述,若整数b、整数a和模数q满足b=(k-1)*q+a,其中0≤a<q,整数k≥1,称整数b相对于模数q有k倍数值冗余,简称b有k倍冗余,即冗余倍数为k。b有1倍冗余时,等价于b=a。
(16)约减处理
约减处理是取模运算和求同余运算这两种运算的统称。
取模运算是指确定一个数据相对于一个模数的余数。以数学的方式表达,给定一个整数a和整数q,取模运算即确定整数a除以整数q所得到的余数。计算机通过对取值大于模数的输入数据进行取模处理,能够将输入数据减小到落入模数的范围内,从而限制数据的取值大小,避免数据取值过大导致溢出。取模运算包括模加、模乘、模减以及模除。模加是指确定两个数据的和值相对于一个模数的余数,即a+b mod q。模减是指确定两个数据的差值相对于一个模数的余数,即a-b mod q。模乘是指确定两个数据的乘积相对于一个模数的余数,即a*b mod q。模除是指确定两个数据的比值相对于一个模数的余数。
同余是指两个整数除以同一个模数后所得的余数相同。以数学的方式表达,整数a和整数b关于模数q同余,通常记为b≡a mod q,其中0≤a<q并且b≥a。求同余运算是指确定与一个数据具有同余关系、且取值小于该数据的数据。以数学的方式表达,给定整数a和整数q,确定整数a关于整数q同余的值(此值小于等于整数a)的处理过程即求同余运算。
针对数论变换而言,上述整数a为数论变换的输入数据,上述整数q为模数(即数据对应的参数)。
(17)冗余模乘(也称快速冗余模乘,lazy modulo multiplication)
冗余模乘是模乘的一种具体实现方式。冗余模乘是指通过以下公式确定x和y相对于模数q的模乘结果。
其中,r表示模乘结果,x和y表示输入数据,x和y均为正整数,β为正整数,q表示模数,q<β/2,y<q。
From the above formula it follows that r ≡ x*y (mod q) with 0 ≤ r < 2q; this document abbreviates that as r = x*y mod 2q.
相对于普通的模乘方式(即x*y mod q)而言,冗余模乘的特点主要有两个。第一,由于冗余模乘的计算公式利用了计算机的一些特性,计算机实现冗余模乘的速度比实现普通的模乘速度更快,能够提高模乘的效率;第二,冗余模乘的结果相对于普通模乘的结果具有2倍冗余,即,普通模乘的结果的值域在[0,q)上,而冗余模乘的值域在[0,2q)上。可以这样理解,冗余模乘可能牺牲了一定程度的精确性,但换来计算速度的提高。
针对数论变换而言,在实现冗余模乘时,上述冗余模乘的公式中x和y可以均为输入数据;或者,x和y中其中一者为输入数据,另一者为旋转因子。冗余模乘通常应用在数论变换的即时计算阶段之前,已知被乘数的值、且该被乘数小于模数的情况。
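The text above does not reproduce the lazy modular multiplication formula itself, so the following is only a minimal sketch under stated assumptions. It assumes the well-known precomputed-quotient form with β = 2^64, where w′ = ⌊w·β/q⌋ is prepared in advance for a known multiplicand w < q. The names `lazy_precompute` and `lazy_mulmod` are hypothetical, and `unsigned __int128` is a GCC/Clang extension rather than standard C.

```c
#include <assert.h>
#include <stdint.h>

/* Precompute w' = floor(w * 2^64 / q) for a fixed multiplicand w < q.
 * Assumption: q < 2^63, matching the requirement q < beta/2 with beta = 2^64. */
uint64_t lazy_precompute(uint64_t w, uint64_t q)
{
    assert(q != 0 && q < (UINT64_C(1) << 63) && w < q);
    return (uint64_t)(((unsigned __int128)w << 64) / q);
}

/* Returns r with r congruent to x*w modulo q and 0 <= r < 2q.
 * No comparison or branch is needed, which is the point of the lazy form. */
uint64_t lazy_mulmod(uint64_t x, uint64_t w, uint64_t w_pre, uint64_t q)
{
    /* t approximates floor(x*w/q) and is too small by at most 1. */
    uint64_t t = (uint64_t)(((unsigned __int128)x * w_pre) >> 64);
    /* Both products wrap modulo 2^64; the true difference lies in [0, 2q). */
    return x * w - t * q;
}
```

Because w′ depends only on w and q, it would typically be stored in the precomputation table next to each twiddle factor, leaving one high multiplication, two low multiplications and one subtraction for the on-the-fly phase.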
(18)蒙哥马利算法
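Montgomery's algorithm is a commonly used algorithm for fast modular multiplication of positive integers. Its basic idea is to convert the computation of xy mod q into the computation of x·y·r^{-1} mod q, where r > q, gcd(r, q) = 1 and r·r^{-1} ≡ 1 mod q. By the extended Euclidean algorithm, there exists a positive integer q′ such that r·r^{-1} − q·q′ = 1, hence r·r^{-1} ≡ 1 mod q and q·q′ ≡ −1 mod r.
(19) Montgomery representation
Montgomery's algorithm can multiply a positive integer x by r^{-1} mod q to obtain x′ ≡ x·r^{-1} mod q. Therefore, when x mod q is needed, one chooses a suitable r, replaces x by x*r, and then calls Montgomery's algorithm, which outputs x mod q. The value x*r is called the Montgomery representation of x. The Montgomery representation of x may be redundant; for example, it may equal x*r + i*q (integer i ≥ 0).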
(20)字长
字是指在计算机中作为一个整体被存取、传送、处理的一组二进制数。一个字中的二进制数字的数量称为字长。
(21)机器字长(machine word length)
机器字长是指处理器进行一次整数运算所能处理的二进制数据的比特位数,通常也是处理器内部数据通道的宽度。例如,一个32位处理器的机器字长为32,一个64位处理器的机器字长为64。
(22)指令字长
指令字长是指机器指令中二进制代码的总位数。指令字长取决于从操作码的长度、操作数地址的长度和操作数地址的个数。不同的指令的字长是不同的。
(23)数据字长
数据字长是指存储数据所占的比特位数。
(24)多项式维度
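The polynomial dimension is related to the degree (order) of the polynomial. Some lattice-based cryptographic algorithms are built on arithmetic over polynomials with finite-field coefficients. Such algorithms define the polynomial ring $R_q = \mathbb{Z}_q[x]/(x^n + 1)$; polynomials in $R_q$ have degree at most n−1, the dimension is set to n (n a power of 2), and the prime q satisfies q ≡ 1 mod 2n. In the embodiments of this application, the polynomials taking part in NTT and INTT computation lie in the ring $R_q$ once their coefficients are reduced modulo q.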
(25)抗量子计算密码(post-quantum cryptography,PQC)
抗量子计算密码是一种专门研究能够抵抗量子计算机的加密算法,特别是公钥加密(非对称加密)算法。部分PQC算法如格密码学(lattice-based cryptography)研究格(lattice)即n维空间Rn中加法群的离散子群的性质,这一数学对象有许多应用,其中存在几个称为“格问题”的难题,如最短向量问题(shortest vector problem)和最近向量问题(closest vector problem)。许多基于格的密码系统利用到了这些难题。基于格的密码算法需要用到大量的多项式的计算,其中数论变换是最重要的计算之一。
(26)同态加密(homomorphic encryption)
同态加密是一种加密形式,它允许用户在加密情况下对数据执行计算,而无需先进行解密。同态加密计算产生的结果以加密形式保留,当解密后,产生的计算结果与对未加密数据计算产生的输出结果相同。
同态加密方案关注的是数据处理的安全,提供一种对加密数据进行处理的功能。同态加密方案的特点是允许数据在加密情况下实现数学或逻辑运算。同态是指代数中的同态性,加密和解密函数可以被认为是明文和密文空间之间的同态。
(27)全同态加密(fully homomorphic encryption,FHE)
全同态加密用于在不解密的条件下对加密数据进行任何能在明文上进行的运算,因此全同态加密可以由不受信任的一方运行,而不会泄露其输入和内部状态。基于全同态加密的特性,它可被用于保护隐私的外包存储和计算以及在加密的数据中进行诸如检索、比较等操作,得出正确的结果,而在整个处理过程中无需对数据进行解密。它的意义在于,能够解决将数据及其计算委托给第三方时的数据安全问题,例如应用在云计算场景。
在经典的数论变换算法中,计算设备需要运行一些逻辑分支语句,导致运行数论变换的效率低下。
打个比方,如果将数论变换比作一条路,将计算设备运行数论变换比作在这条路上行驶,那么引入一些逻辑分支语句,相当于建立了一条包含一些十字路口的路,导致行驶过程中,每当在路上走到一个十字路口,就需要暂停一下,判断应当继续往哪个方向走,进入哪一个分支,因此拖慢了行驶流程。同理可见,如果在数论变换中引入逻辑分支语句,就会造成计算设备运行数论变换耗费的时间过长。经过研究发现,在中央处理器(central processing unit,CPU)上运行经典的数论变换时,任意一条逻辑分支语句的运行时间大概占数论变换整体运行时间的15%左右,显而易见,逻辑分支语句的存在会造成计算设备运行数论变换的时间过长,极大地影响计算设备运行数论变换的效率。
而传统的数论变换引入逻辑分支语句的主要原因在于,由于数论变换中通常包含大量的加法处理和乘法处理,因此随着数论变换的运行,数据的取值会越来越大,那么计算设备内部表示数据所需使用的比特位数也会越来越多,造成溢出的风险。因此,会引入一些逻辑分支语句,逻辑分支语句中的判断条件为判断当前处理的数据的取值是否超过设定的上限,如果是,则对数据进行约减处理,再基于约减后的数据执行后续的处理步骤。通过这种方式,计算设备在运行数论变换的过程中,由于在合适的位置进行约减处理,使得数据的取值变小,从而减少计算设备内部表示数据所使用的比特位数,防止数据的比特位数超过计算设备所能表示的数据的比特位数上限,避免溢出。
基于上述研究分析,本申请提供的一些实施例中,提出了一种数据处理方法,该数据处理方法支持无逻辑分支语句的数论变换。在该实施例提供的方法中,先通过数据相关的参数,找出运行数论变换可能导致溢出的位置,将该位置对应的计算单元作为需要进行约减处理的计算单元,然后在运行数论变换的过程中,由预先找到的计算单元执行约减处理,使得无需引入逻辑分支语句,也能在数论变换的运行中及时进行约减处理,避免溢出。
打个比方,还是将数论变换比作一条路,该实施例提供的方法,相当于在建路之前,就提前规划好哪些位置需要执行约减处理,从而能够构建一条没有十字路口的路,使得计算设备运行数论变换的过程中,相当于行驶过程中,无需在遇到路口时暂停下花时间考虑当前是否需要执行约减处理,而是在预先规划好的位置进行约减处理即可,显然加速了数论变换的运行。由此可见,该实施例提供的方法,解决了现有技术中由于引入逻辑分支语句导致运算 效率低下的问题,改进了计算设备运行数论变换的方式,提升计算设备运行数论变换的性能,节省了计算设备运行数论变换的时间,提高计算设备运行数论变换的效率,扩展了数论变换的适用场景。
下面对本申请实施例的应用场景举例说明。
本申请实施例可以应用在数据加解密的场景,例如数据的加密传输、隐私计算、密钥的生成、身份认证等。可选地,本申请实施例应用在基于PQC或者FHE进行加解密的场景。数据加解密的方案通常是基于密码学算法实现的,而密码学算法,特别是PQC算法通常需要使用数论变换。通过本申请实施例提供的方法,能够加速数论变换的运行,从而提高加解密方案整体的速度。
例如,在数据加密传输的场景下,发送端获得待加密的明文数据后,通过PQC算法对明文数据进行加密,得到密文数据,将密文数据发送给接收端。数据的接收端接收密文数据,通过PQC算法对密文数据进行解密,得到明文数据。由于数据以密文的形式在发送端至接收端的链路上传输,从而提高安全性。
在上述场景下,数论变换例如是PQC算法中的一个模块。发送端在通过PQC算法对明文数据进行加密的过程中,执行本实施例提供的方法,对明文数据进行数论变换,通过变换后的明文数据执行PQC算法的其他步骤,得到密文数据。接收端在通过PQC算法对密文数据进行解密的过程中,执行本实施例提供的方法,对密文数据进行数论变换,通过变换后的密文数据执行PQC算法的其他步骤,得到明文数据。通过本实施例提供的方法,能够提高数论变换的速度,从而提高PQC算法的速度。
特别是,在一些时延敏感型网络中加密传输数据的场景下,目前很多密码学算法的运行速度较慢,导致加密传输数据的时延较大,难以满足通信双方对时延的需求。而通过本实施例提供的方法,能够降低数据在发送端和接收端加解密的时延,有助于满足通信双方对时延的需求。
在一个示例性场景中,根据一些国际标准,电网不同节点间可使用公钥签名算法来保证数据传输安全,然而现有的公钥签名算法的时延较大,不能满足标准要求的时延。如果未来基于格的公钥签名算法能够成为新一代密码学标准算法,通过本申请一些实施例提供的方法,可以提高运行数论变换的性能,从而提高运行基于格的公钥签名算法的速度,从而使这些公钥签名算法能够满足相关国际标准的通信时延要求,进而可能被相关国际标准采用,用于保护电网数据。
其中,数论变换的产品形态包括很多种。在一种可能的实现中,数论变换的产品形态是软件,比如数论变换的形态是一段程序代码,CPU(如在32位CPU或者64位CPU)在加解密过程中,读取并执行该程序代码从而运行数论变换。
在另一种可能的实现中,数论变换的产品形态是硬件,比如通过一个专用处理器来承担数论变换。例如,该专用处理器是一个专用于加解密的处理器,比如是一个加密芯片(也称为加密协处理器或者安全芯片),该专用处理器在加解密过程中,执行数论变换。又如,通过一个专用处理器协助CPU对数据加解密,由专用处理器和CPU共同配合完成加解密操作。在一种可能的实现中,当CPU需要对数据进行加解密时,CPU向该专用处理器传入数据相关 的参数,该专用处理器基于CPU传入的参数对数据进行数论变换,将数论变换后的数据返回给CPU,然后CPU基于数论变换后的数据继续执行加解密步骤。通过这种方式,将数论变换从CPU卸载至专用处理器,从而减轻CPU的计算负担,并提高CPU进行加解密的速度。
图5是本申请实施例提供的一种利用数论变换(NTT)及其逆变换(INTT)计算多项式乘法的原理示意图。一些加解密方案,如PQC和FHE,可以通过多项式来表示密钥、密文、明文等数据,因此利用数论变换,可以加速加解密方案中的处理过程,如密钥生成、加密、解密、对密文处理的过程。
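As shown in Figure 5, let the positive integer n be a power of 2, and let q be a prime satisfying q ≡ 1 mod 2n. Let ω be a primitive n-th root of unity in $\mathbb{Z}_q$; then $\psi = \sqrt{\omega}$ is a primitive 2n-th root of unity in $\mathbb{Z}_q$. Define the polynomial ring $R_q = \mathbb{Z}_q[x]/(x^n + 1)$ and two polynomials a, b ∈ R_q. Let a = (a[0], a[1], …, a[n−1]) and b = (b[0], b[1], …, b[n−1]) be the vectors formed by the coefficients of a and b, and define two further vectors $\bar{a} = (a[0], \psi a[1], \dots, \psi^{n-1} a[n-1])$ and $\bar{b} = (b[0], \psi b[1], \dots, \psi^{n-1} b[n-1])$. Computing the polynomial product c = ab is equivalent to computing the negative wrapped convolution of a and b, i.e. $c = (1, \psi^{-1}, \dots, \psi^{-(n-1)}) \circ \mathrm{INTT}(\mathrm{NTT}(\bar{a}) \circ \mathrm{NTT}(\bar{b}))$, where $\circ$ denotes the Hadamard product. Here a, b and c are polynomial coefficients, for example the keys, ciphertexts and plaintexts in PQC and FHE schemes.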
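As a reference point for the NTT-based route, the negative wrapped convolution can also be computed directly from its definition. The sketch below is a plain O(n²) schoolbook version and is not part of the described method; it is the kind of oracle an NTT/INTT implementation can be checked against. The function name `negacyclic_mul` and the restriction q < 2^32 (so that products fit in 64 bits) are assumptions made for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* c = a * b in Z_q[x]/(x^n + 1).
 * Assumptions: all coefficients are already in [0, q), q < 2^32,
 * and c does not alias a or b. */
void negacyclic_mul(const uint64_t *a, const uint64_t *b, uint64_t *c,
                    size_t n, uint64_t q)
{
    for (size_t k = 0; k < n; ++k) {
        c[k] = 0;
    }
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            uint64_t p = (a[i] * b[j]) % q;        /* fits because q < 2^32 */
            size_t k = i + j;
            if (k < n) {
                c[k] = (c[k] + p) % q;             /* ordinary term */
            } else {
                c[k - n] = (c[k - n] + q - p) % q; /* wrapped term: x^n = -1 */
            }
        }
    }
}
```

For example, with n = 2 and q = 17, (1 + 2x)(3 + 4x) = 3 + 10x + 8x², and replacing x² by −1 gives −5 + 10x, i.e. the coefficient vector (12, 10).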
图6是本申请实施例提供的一种数据处理方法的流程图。图6所示方法由计算设备执行,计算设备用于运行数据的数论变换,数据的数论变换的步骤包括多个计算单元,图6所示方法包括以下步骤S201至步骤S202。
步骤S201、计算设备基于数据的参数,确定每个计算单元产生的处理结果的预估比特位数。
上述数据是数论变换的输入数据。数论变换包括正数论变换(NTT)或逆数论变换(INTT)中至少一项。上述数据例如是多项式系数。上述数据例如是待进行加密的数据或待进行解密的数据。可选地,上述数据是明文、密文或者生成密钥所需的数据。
上述参数指示数据的比特位数。例如,计算设备内部以二进制序列的形式表示上述数据,上述参数指示该二进制序列的长度。
获取参数的作用在于,由于输入数据的比特位数会影响数论变换中各个计算单元产生的处理结果的比特位数,进而影响到哪些计算单元产生的处理结果的比特位数会超过硬件所能表示的数据的范围,即哪些计算单元可能发生溢出,因此通过获取上述参数,有助于较为准确地确定数论变换的输入数据有多少个比特,从而较为准确地预估出数论变换过程中各个计算单元基于该数据产生的结果有多少个比特,从而定位出需要进行约减处理的计算单元。
在一些实施例中,参数包括多个计算单元中每个计算单元进行取模运算时使用的模数、数据相对于模数的冗余倍数以及数据的多项式维度。可选地,上述参数还包括处理器的位数。
冗余倍数指示数据相对于模数的冗余程度。例如,如果冗余倍数为1,表明数据的取值范 围在0至模数之间,相当于数据的取值不存在冗余;如果冗余倍数为2,表明数据的取值范围在0至模数的二倍之间;以此类推,如果冗余倍数为k,表明数据的取值范围在0至模数的k倍之间,k为正整数。
多项式维度指示数论变换包括的阶段数。例如,如果多项式维度为n,指示数论变换一共具有log2n个阶段。
上述处理器为用于运行数论变换的硬件,例如为CPU。处理器的位数用于指示处理器能够表示的数据的取值范围。处理器的位数例如为处理器的字长,比如说是机器字长、指令字长、数据字长或存储字长等。
可替代地,上述参数包括数据的比特位数。或者,上述参数为数据的最大值或者数据的取值范围。
如何获取上述参数包括多种实现方式。在一种可能的实现方式中,由用户提供上述参数。例如,在计算设备为终端的情况下,由用户在终端上输入上述参数,然后终端基于用户输入的参数执行后续流程;又如,在计算设备为服务器的情况下,由用户在终端上输入上述参数,然后终端将用户输入的参数发送至服务器,由服务器基于从终端接收到的参数执行后续流程;在另一种可能的实现中,上述参数预先保存在计算设备中。例如,上述参数预先烧录至负责运行数论变换的处理器中。
上述计算单元相当于数论变换中的一个组件或者说一个数据处理单位。例如,一个计算单元用于进行加法处理、减法处理和模乘处理,该模乘处理包括取模运算。可选地,多个计算单元中每个计算单元用于基于k个数据进行处理产生k个处理结果,k为正整数。
计算单元的粒度包括很多种。在一些实施例中,一个计算单元为一个或多个阶段。在另一些实施例中,一个计算单元为一个或多个蝴蝶。在另一些实施例中,一个计算单元为一个或多个交叉。
以计算单元为一个交叉为例,如图3所示,对于NTT而言,一个计算单元用于先执行模乘处理,再执行加法处理和减法处理,产生处理结果。如图4所示,对于INTT而言,一个计算单元用于先执行加法处理和减法处理,再执行模乘处理,产生处理结果。
可选地,上述计算单元为软件。例如,数论变换为一段代码,上述计算单元为代码中的语句。或者,上述计算单元为硬件。例如,数论变换为一个芯片,上述计算单元为芯片中的处理电路。
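The estimated bit width indicates the bit width of the result that a computing unit produces from the data. Take input data consisting of 2^16 polynomial coefficients of 58 bits each as an example. As shown in Figure 7, after each computing unit in stage 1 of the NTT processes this data, the result has 59 bits, so the estimated bit width of each stage-1 computing unit is 59. As shown in Figure 8, after each computing unit in stage 1 of the INTT processes its data, the result has 62 bits or 60 bits, so the estimated bit width of each stage-1 computing unit is 62 or 60.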
在一种可能的实现中,计算设备基于模数和冗余倍数,确定数据的比特位数;计算设备基于数据的比特位数以及每个计算单元对应的比特位数增量,确定每个计算单元的预估比特位数。
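One possible way to determine the bit width of the data is as follows: if the redundancy factor is 1, the bit width of the modulus is used as the bit width of the data; if the redundancy factor is greater than 1, the bit width of the product of the modulus and the redundancy factor is used. For example, with modulus q and redundancy factor 1, the data is smaller than the modulus, so the bit width of the data is ⌊log2 q⌋ + 1; with modulus q and redundancy factor k (k an integer greater than 1), the data is smaller than k times the modulus, so the bit width of the data is ⌊log2(k·q)⌋ + 1. The rationale is that the data lies between 0 and the product of the modulus and the redundancy factor, so the bit width of that product is the theoretical maximum bit width of the data. Estimating the bit width of the results from this theoretical maximum amounts to considering the worst case, which guarantees that no overflow occurs.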
比特位数增量是指数据经过计算单元处理后比特位数的增量,即计算单元产生的输出结果的比特位数与计算单元获得的输入数据的比特位数之差。例如,如果计算单元是一个加法单元,用于对数据x和数据y进行相加,那么由于两个数据相加后,理论上结果最多比数据多1比特,则将1作为加法单元的比特位数增量。在一种可能的实现中,预先设定和保存计算单元和比特位数增量之间的对应关系,通过查询该对应关系从而确定比特位数增量。
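The two estimates described above (worst-case input width from the modulus and redundancy factor, then a fixed increment per computing unit) can be sketched as follows. This is an illustration only: the helper names `bit_length`, `input_bits` and `estimated_bits` are hypothetical, and a uniform increment per unit is an assumption; the text only says the increment is looked up from a stored mapping.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits needed to represent v (0 for v == 0). */
unsigned bit_length(uint64_t v)
{
    unsigned bits = 0;
    while (v != 0) {
        ++bits;
        v >>= 1;
    }
    return bits;
}

/* Worst-case input width for redundancy factor k: the bit width of k*q. */
unsigned input_bits(uint64_t q, uint64_t k)
{
    assert(q >= 1 && k >= 1 && q <= UINT64_MAX / k);
    return bit_length(k * q);
}

/* Width after `units` consecutive units that each add `increment` bits. */
unsigned estimated_bits(unsigned in_bits, unsigned increment, unsigned units)
{
    return in_bits + increment * units;
}
```

With 58-bit inputs and an increment of 1, `estimated_bits(58, 1, 1)` gives 59, which matches the stage-1 figure quoted for the NTT example.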
步骤S201的作用,相当于在给定实际的输入数据相关的参数的情况下,去预估如果将实际的输入数据代入至数论变换后,理论上各个计算单元产生的处理结果的比特位数,从而找出理论上会导致溢出的计算单元。
步骤S202、计算设备基于预估比特位数,从多个计算单元中确定第一计算单元。
第一计算单元为用于对第二计算单元的处理结果约减处理的计算单元。
第二计算单元为上述多个计算单元中的一个计算单元。第二计算单元的处理结果的预估比特位数满足预设比特位数。第二计算单元产生的处理结果用于作为第一计算单元的输入数据。第二计算单元相当于第一计算单元的上一个计算单元,第二计算单元的输出相当于第一计算单元的输入。
可选地,上述预设比特位数是一个阈值,第二计算单元的处理结果的预估比特位数大于或等于该阈值。或者,上述预设比特位数是一个数值,第二计算单元的处理结果的预估比特位数等于该数值。
在一些实施方式中,上述预设比特位数为满足溢出条件时数据的比特位数。上述预设比特位数的作用相当于提供一个上限,如果发现某一个计算单元产生的处理结果的预估比特位数达到该上限,则确定该计算单元的下一个计算单元在实际运行NTT时需要对数据进行约减处理,从而避免对数据进行数论变换时,产生的处理结果的比特位数超过上限。
可选地,上述预设比特位数是基于计算设备中处理器的位数确定的。相较于根据经验来设定预设比特位数而言,通过处理器的位数这一硬件方面的因素来确定预设比特位数,使得预设比特位数能够适应于硬件的能力,能够针对不同能力的硬件(如不同位数的CPU)分别确定不同的预设比特位数,以该预设比特位数为基准来寻找需要进行约减处理的计算单元,能更加精确地定位需要进行约减处理的计算单元,从而减少不必要的约减处理。
在一些实施例中,上述预设比特位数比处理器的位数少1。例如,如果负责运行数论变换的处理器是64位CPU,则将预设比特位数设置为63,如果负责运行数论变换的处理器是32位CPU,则将预设比特位数设置为31。以预设比特位数为63为例,如果预估出一个计算单元的输入数据达到63比特,才在该计算单元进行约减处理,而该计算单元之前的计算单元无需进行约减处理。通过这种方式,在避免溢出的同时,尽可能放宽对数据的取值的限制,充 分发挥硬件的能力,提高资源利用率,减少约减处理的次数。
在另一些实施例中,考虑到数论变换在加解密方案中,通常相当于一个中间模块,通常不是第一个模块或者最后一个模块。如果将预设比特位数设置地过大,那么整个数论变换的输出结果进入至加解密方案的下一个模块时,很有可能出现的一种情况是,下一个模块由于执行引起数值大小增加的运算,导致溢出。而如果将预设比特位数设置的过小,则可能没有充分发挥硬件能力,造成资源浪费。基于此,将预设比特位数设计为比处理器的位数少2。例如,如果负责运行数论变换的处理器是64位CPU,则将预设比特位数设置为62,如果负责运行数论变换的处理器是32位CPU,则将预设比特位数设置为30。
通过这种方式,在避免溢出的同时,为数论变换的下一个模块留出了余地,允许数论变换的下一个模块产生的处理结果继续增长一个比特位,降低在后一个模块发生数值越界的风险;此外,一定程度上放宽对数据的取值的限制,提高资源利用率。
在一些实施例中,在基于数据运行数论变换的过程中,计算设备通过第一计算单元对第二计算单元的处理结果约减处理。
通过第一计算单元进行约减处理的作用在于,一方面,约减处理能够减少处理结果的取值,进而减少处理结果的比特位数。因此,通过第一计算单元进行约减处理,使得第二计算单元的处理结果的比特位数得以减少,避免第一计算单元产生的处理结果超过上述预设比特位数,因此避免溢出。另一方面,第一计算单元之外的其他计算单元无需进行约减处理,因此数据在经过其他计算单元处理时,允许数据的取值保持冗余状态,直至数据输入至第一计算单元,即数据的比特位数达到预设比特位数时,才对数据进行约减处理,因此减少了多余的约减处理,尽可能避免数论变换包含多余的计算量,提升处理效率。
请参考图7,图7示出的场景中,NTT运行过程分为16个阶段,NTT运行过程中满足溢出条件时数据的比特位数为63。计算设备根据数据对应的参数,预估出第15阶段的每个交叉产生的处理结果的比特位数是63,即第16阶段的每个交叉的输入数据的比特位数达到63。在这一场景下,计算设备将第16阶段的每个交叉作为第一计算单元。在运行NTT中,计算设备在第16阶段的每个交叉进行约减处理,使得处理结果的比特位数从63减少至60,最终将整个NTT运行过程的输出结果的比特位数控制在60内,因此避免溢出。
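Refer to Figure 8. In the scenario shown in Figure 8, the INTT runs in 4 stages, and the bit width at which the overflow condition is met during the INTT is 62. From the parameters of the data, the computing device estimates that 5 crosses in the INTT have 62-bit input data: stage 2 butterfly 1 cross 1, stage 2 butterfly 2 cross 1, stage 2 butterfly 3 cross 1, stage 2 butterfly 4 cross 1, and stage 4 butterfly 1 cross 2. The computing device takes each of these 5 crosses as a first computing unit. While running the INTT, the computing device performs reduction at these 5 crosses, reducing the bit width of the result from 62 to 60, so that the output of the whole INTT stays within 62 bits and overflow is avoided.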
在进行约减处理后,计算设备可以通过第一计算单元对约减处理后的处理结果进行其他处理步骤,再通过第一计算单元的下一个计算单元继续处理,直至所有计算单元处理完成,从而将数据转换为数论变换后的数据。
数论变换后的数据在加解密方案中的应用包括多种场景。例如,在加密场景下,上述数据为明文,计算设备对明文进行正数论变换后,基于正数论变换后的明文进行加密,得到密 文的一部分。在解密场景下,上述数据为密文,计算设备对密文进行逆数论变换后,基于逆数论变换后的密文进行解密,得到明文的一部分。再如,在密钥生成场景,上述数据为生成密钥(公钥或私钥)所需的数据,计算设备对数据进行数论变换后,基于数论变换后的数据生成密钥。例如,计算设备对第二计算单元的约减处理后的处理结果进行加密处理或解密处理。
在图6所示的实施例中,由于通过参数确定出需要负责约减处理的计算单元,在运行数论变换的过程中,由确定出的计算单元对数据进行约减处理,从而在无需引入逻辑分支语句的情况下,也能让数据的取值在合适的位置变小,从而减少计算设备内部表示数据所使用的比特位数,防止数据的比特位数超过计算设备所能表示的数据的比特位数上限,避免溢出。相较于引入逻辑分支语句以进行约减处理的方式而言,该方法能够移除逻辑分支语句,优化数论变换的结构,从而提高运行数论变换的效率。
此外,通过预估计算单元产生的处理结果的比特位数,依据预估比特位数能够准确地定位可能发生溢出的计算单元(第一计算单元),让可能发生溢出的计算单元进行约减处理,而其他计算单元无需约减处理,从而减少数论变换中约减处理的调用次数,尽可能减少数论变换中多余的计算量,提升效率。
在图6所示的实施例中,针对如何进行约减处理存在多种实现方式,下面对约减处理的一些实现方式进行描述。
在一些实施例中,上述第一计算单元采用取模运算的方式,对第二计算单元的处理结果进行约减处理。取模运算的作用一是保证计算正确,二是减少数据的取值大小。在另一些实施例中,计算设备采用加法处理和减法处理代替取模运算,通过减法处理来减少数据的取值大小。
在一些实施例中,上述第一计算单元采用蒙哥马利模乘的方式,对第二计算单元的处理结果进行约减处理。例如,先将第二计算单元的处理结果转换为蒙哥马利形式,再对具有蒙哥马利表示形式的处理结果执行蒙哥马利模乘,从而实现约减处理。例如,数据包括x和y,引入一个参数r,根据参数r,将x转换为x*r(即蒙哥马利表示形式的x),将y转化为y*r(即蒙哥马利表示形式的y),再根据x*r以及y*r进行蒙哥马利模乘。
通过蒙哥马利模乘的方式进行约减处理,能够减少处理结果的取值,实现约减的目的,同时有助于提高约减处理的速度。
考虑到上述实施例中,由于采用蒙哥马利模乘的方式进行约减处理,数据的表示形式需要保持为蒙哥马利表示形式,约减处理需要绑定蒙哥马利算法,导致局限性强,无法满足运行数论变换中调整数据表示形式的需求。
基于此,在另一些实施例中,第一计算单元对第二计算单元的处理结果进行冗余模乘处理,从而实现约减处理。冗余模乘处理为值域为[0,2q)的模乘处理,q表示模数,q为正整数。
由于采用冗余模乘的方式实现约减处理,一方面,使得约减处理不必绑定蒙哥马利算法,数据的表示形式也无需保持为蒙哥马利表示形式,换句话说,无论数据的表示形式为蒙哥马利表示形式还是非蒙哥马利表示形式,方案都具有可用性,从而提高方案的灵活性和实用性。 另一方面,同样能起到提高约减处理的速度这一作用,从而提高效率,尤其是,在大数运算等场景下有助于显著加速计算设备的运算流程。
在一个示例性场景中,在不采用冗余模乘的方式时,约减算法的使用通常局限于barrett reduction和蒙哥马利模乘(要求多项式系数必须采用蒙哥马利表示形式)。而通过使用冗余模乘的方式,可以计算非蒙哥马利表示形式的系数的模乘,且冗余模乘可以与蒙哥马利模乘共用相同的计算模块。
针对如何支持动态调整数据表示形式,本申请的一些实施例中,计算设备根据数据的表示形式,生成与数据具有相同表示形式的旋转因子;第一计算单元基于旋转因子对第二计算单元的处理结果进行冗余模乘处理。可选地,表示形式为蒙哥马利表示形式或者非蒙哥马利表示形式。
示例性地,在预计算阶段,计算设备确定数据的表示形式,如果数据的表示形式为蒙哥马利表示形式,则生成蒙哥马利表示形式的旋转因子;如果数据的表示形式为非蒙哥马利表示形式,则生成非蒙哥马利表示形式的旋转因子;计算设备将生成的旋转因子保存至预计算表。在即时计算阶段,计算设备从预计算表中获取旋转因子,基于获取的旋转因子以及数据进行冗余模乘处理。
可选地,如果输入数据的表示形式发生调整,则计算设备对应调整预计算表中保存的旋转因子,以使旋转因子的表示形式与数据的表示形式保持一致。例如,如果数据的表示形式从蒙哥马利表示形式调整为非蒙哥马利表示形式,则计算设备将预计算表中保存的旋转因子从蒙哥马利表示形式调整为非蒙哥马利表示形式;如果数据的表示形式从非蒙哥马利表示形式调整为蒙哥马利表示形式,则计算设备将预计算表中保存的旋转因子从非蒙哥马利表示形式调整为蒙哥马利表示形式。
通过上述实施方式,如果任务的需求是数据具有蒙哥马利表示形式,则使用具有蒙哥马利表示形式的旋转因子进行运算,如果任务的需求是数据具有非蒙哥马利表示形式,则使用具有非蒙哥马利表示形式的旋转因子进行运算,因此该方式可以根据具体计算任务的要求,动态的调整NTT/INTT运算中数值的表示形式,此外不会影响NTT/INTT运算中蝴蝶操作的结构,不会引入额外算法,不绑定蒙哥马利算法,也不会增加计算量。
在一些实施例中,考虑到数论变换包含减法处理,如果减数大于被减数,则减法处理的结果为负数。对于计算设备而言,处理结果出现负数就可能导致运行出错,导致计算不正确。
基于此,本申请的一些实施例中,计算设备会基于参数,确定冗余值;在基于数据运行数论变换的过程中,多个计算单元中每个计算单元基于冗余值进行减法处理。冗余值为大于或等于减法处理中减数的数值。可选地,冗余值大于或等于数据的最大值。
正数论变换中的减法处理包括冗余增长运算中的减法处理以及冗余约减处理中的减法处理。冗余增长运算的减法处理中被减数为数据,减数为数据经过冗余模乘的结果。冗余约减处理的减法处理中被减数为数据经过冗余模乘的结果,减数为数据和旋转因子经过冗余模乘的结果。以数学的方式表达,冗余增长运算的减法处理例如是x与y*w mod 2q进行相减,其中x和y均为数据,w是旋转因子,q是模数。冗余约减处理的减法处理例如是x mod 2q与y*w mod 2q进行相减,其中x和y均为数据,w是旋转因子,q是模数。逆数论变换中的减法处 理为两个数据之间相减,例如是x-y,其中x和y均为数据。
基于冗余值进行减法处理的作用在于,由于减法处理时不仅代入了数据本身,还代入了冗余值,相当于给被减数加上了冗余值,放大了被减数的取值,因此有助于避免减法处理的处理结果为负数,从而有助于运算正确性。此外,相较于根据经验设定冗余值而言,该方式由于以数据相关的参数为依据来确定冗余值,使得确定的冗余值能够适应于参数的取值,从而提高准确性。此外,冗余值无需绑定于单一的参数,而是能够随着参数的取值相应调整,因此方案可用的参数更多,提高扩展性和实用性。
针对如何设计上述冗余值的取值大小,本申请的一些实施例中,通过分析正数论变换以及逆数论变换各自的特点,为冗余值提供了效果较佳的取值。
可选地,针对正数论变换,上述冗余值等于2q,q表示模数,q为正整数。
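The reason for choosing 2q as the redundancy value is that the forward NTT performs modular multiplication first and then addition and subtraction. The range of the modular multiplication is controllable: with lazy modular multiplication the result lies in [0, 2q), and with non-redundant modular multiplication it lies in [0, q), where q is the modulus. Because the subtrahend in the subtraction is the output of the modular multiplication, it lies in [0, 2q), so the redundancy value 2q is always greater than the subtrahend. This guarantees that the result of the subtraction is not negative, which helps keep the computation correct. In addition, the redundancy value is kept as small as possible, which avoids the processing and storage overhead of an unnecessarily large value.
Optionally, for the inverse NTT, the redundancy value equals (t+n)*q, where t denotes the redundancy factor, n denotes the polynomial dimension and q denotes the modulus; t, n and q are positive integers.
The reason for choosing (t+n)*q as the redundancy value is that, when the data has no redundancy, the input of any computing unit does not exceed n*q. To allow for redundant data, t*q is added on top of n*q. This supports input data with any redundancy factor while guaranteeing that the result of the subtraction is not negative, which helps keep the computation correct.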
图6所示实施例描述了对数据进行约减处理的情况。在另一些实施例中,对于正数论变换,如果计算设备确定参数满足条件,则确定不存在需要进行约减处理的计算单元。在基于数据运行数论变换的过程中,省略对数据进行约减处理。对于逆数论变换,如果计算设备确定参数满足条件,则确定最后一个阶段的计算单元为需要进行约减处理的计算单元。在基于数据运行数论变换的过程中,通过最后一个阶段的计算单元对数据进行约减处理。
其中,在冗余倍数为1或者说输入数据没有冗余的情况下,参数满足条件例如为模数的比特位数与阶段数之和小于预设比特位数,比如说满足条件是log2n+log2q<60。其中,n表示多项式维度,q表示模数。在冗余倍数大于1的情况下,参数满足条件例如为模数与冗余倍数的乘积的比特位数与阶段数之和小于预设比特位数。
上述方式的作用在于,模数与冗余倍数的乘积的比特位数,相当于数据在理论上最多具有的比特位数,而在不约减处理的情况下,数据每经过一个阶段的处理,则比特位数会增加一个比特,因此阶段数相当于数据经过数论变换中所有阶段处理完成后最多增长的比特位数,因此参数满足上述条件,相当于最坏情况下也不会发生溢出,无需对数据约减处理,因此通过上述方式,能够在保证不会溢出的同时,完全移除约减处理以及逻辑分支语句,从而提高 运行数论变换的性能和效率。
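The "parameters satisfy the condition" test described above reduces to one comparison. A minimal sketch follows, assuming n is a power of two and using the bit width of k·q as the worst-case data width; that is slightly more conservative than the log2 n + log2 q form quoted in the text. The names `bit_length` and `can_skip_reduction` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of bits needed to represent v (0 for v == 0). */
unsigned bit_length(uint64_t v)
{
    unsigned bits = 0;
    while (v != 0) {
        ++bits;
        v >>= 1;
    }
    return bits;
}

/* True if (number of stages) + (worst-case data width) < preset_bits.
 * n: polynomial dimension (power of two), q: modulus, k: redundancy factor.
 * Assumption: k*q fits in 64 bits. */
bool can_skip_reduction(uint64_t n, uint64_t q, uint64_t k, unsigned preset_bits)
{
    unsigned stages = bit_length(n) - 1;   /* log2(n) for a power of two */
    unsigned data_bits = bit_length(k * q);
    return stages + data_bits < preset_bits;
}
```

For NewHope-sized parameters (n = 1024, q = 12289) the sum is 10 + 14 = 24, far below 60, so the check passes; for n = 2^16 with a 58-bit modulus it does not.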
下面结合一些代码以及公式对图6所示方法的实现方式举例说明。下述实现方式中,冗余约减交叉是对数据进行约减处理的计算单元(即第一计算单元)的举例说明,冗余增长交叉是无需对数据进行约减处理的计算单元(即非第一计算单元)的举例说明,多项式系数是对数据的举例说明,max是对预设比特位数的举例说明,min是对冗余值的举例说明。
在一些实施例中,在构造NTT或INTT算法时,各分成两个部分:预计算和即时计算。
图9是本实施例提供的一种NTT的架构图。如图9所示,NTT内部包括旋转因子生成模块、NTT预计算模块、NTT生成模块以及NTT运行模块。NTT生成模块的流程图如图11所示。
图12是本实施例提供的INTT的架构图。INTT内部包括旋转因子生成模块、INTT预计算模块、INTT生成模块以及INTT运行模块。INTT预计算模块的流程图如图13所示。INTT生成模块的流程图如图14所示。
下面分别对NTT和INTT的预计算过程和即时计算过程举例说明。
NTT
NTT中蝴蝶中的交叉分为两种,一种是冗余增长交叉,另一种是冗余约减交叉。
在一个示例性实施例中,将NTT的输入多项式a中每一项的系数按照次数大小进行升序排序,次数最低的项在最前面,次数最高的项在最后面,得到一个序列。x和y分别是次数为j和j+t的项的系数。w是该次交叉计算中用到的旋转因子,w从一个预计算表中取得。
在该实施例中,Radix-2 NTT中冗余增长交叉的代码如下所示。
x=a[j],y=a[j+t];
tx=x;
ty=FastModMultiLazy(y,w,q);//备注:ty=y*w mod 2q;
a[j]=tx+ty;
a[j+t]=2*q-ty+tx;
以上示出的NTT冗余增长交叉的代码的含义为,将系数x(即a[j])赋值给中间变量tx,将系数y与旋转因子w的冗余模乘的结果赋值给中间变量ty,则系数x(即a[j])被修改为tx+ty,系数y(即a[j+t])被修改为2*q-ty+tx。经过这样的计算后,系数x的值会变大,此为增长的含义。
在该实施例中,Radix-2 NTT中冗余约减交叉的代码如下所示。
x=a[j],y=a[j+t];
tx=FastModMultiLazy(x,1,q);//备注:tx=x mod 2q;
ty=FastModMultiLazy(y,w,q);//备注:ty=y*w mod 2q;
a[j]=tx+ty;
a[j+t]=2*q-ty+tx;
以上示出的NTT冗余约减交叉的含义为,将系数x与整数1的冗余模乘的结果赋值给中间变量tx,将系数y与w的冗余模乘的结果赋值给中间变量ty,则系数x(即a[j])被修改为tx+ty,系数y(即a[j+t])被修改为2*q-ty+tx。经过这样的计算后,系数x的值有可能变小, 此为约减的含义。
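Figure 15 is a schematic diagram of how the redundancy-growing cross of the Radix-2 NTT is computed. Its formula is Formula A: $X = x + (y \cdot w \bmod 2q)$, $Y = 2q - (y \cdot w \bmod 2q) + x$.
Figure 16 is a schematic diagram of how the redundancy-reducing cross of the Radix-2 NTT is computed. Its formula is Formula B: $X = (x \bmod 2q) + (y \cdot w \bmod 2q)$, $Y = 2q - (y \cdot w \bmod 2q) + (x \bmod 2q)$.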
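The two NTT crosses above can be wrapped into self-contained C functions as sketched below. The document does not show the internals of `FastModMultiLazy`, so `lazy_mulmod` here is a hypothetical stand-in that assumes the precomputed-quotient form with β = 2^64; it recomputes the quotient on every call for brevity, whereas a real implementation would read it from the precomputation table. `unsigned __int128` is a GCC/Clang extension.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for FastModMultiLazy: returns a value congruent to
 * x*w modulo q in [0, 2q). Assumptions: 1 < q < 2^62 and w < q. */
uint64_t lazy_mulmod(uint64_t x, uint64_t w, uint64_t q)
{
    uint64_t w_pre = (uint64_t)(((unsigned __int128)w << 64) / q);
    uint64_t quot = (uint64_t)(((unsigned __int128)x * w_pre) >> 64);
    return x * w - quot * q;
}

/* Redundancy-growing cross (Formula A): a[j] is passed through unreduced. */
void ntt_cross_grow(uint64_t *a, size_t j, size_t t, uint64_t w, uint64_t q)
{
    uint64_t tx = a[j];
    uint64_t ty = lazy_mulmod(a[j + t], w, q);   /* ty = y*w mod 2q */
    a[j] = tx + ty;
    a[j + t] = 2 * q - ty + tx;                  /* 2q > ty, so never negative */
}

/* Redundancy-reducing cross (Formula B): a[j] is first brought into [0, 2q). */
void ntt_cross_reduce(uint64_t *a, size_t j, size_t t, uint64_t w, uint64_t q)
{
    uint64_t tx = lazy_mulmod(a[j], 1, q);       /* tx = x mod 2q */
    uint64_t ty = lazy_mulmod(a[j + t], w, q);   /* ty = y*w mod 2q */
    a[j] = tx + ty;                              /* below 4q */
    a[j + t] = 2 * q - ty + tx;                  /* below 4q */
}
```

Neither function contains a comparison or a conditional; which of the two is called at a given stage is decided once, ahead of time, by the precomputation.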
预计算
如图10所示,NTT预计算模块接收输入的参数,然后根据参数计算NTT中发生溢出的位置,生成序列S,序列S的长度等于NTT发生溢出的阶段的个数,序列值为发生溢出的NTT阶段的标识,具体流程如下。
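Let n (n ≥ 4) be a power of 2 with n = 2^m, so the NTT has log2 n = m stages. The modulus has log2 q bits, the maximum redundancy value allowed by the NTT (derived from the machine word length, instruction word length or data word length) is 2^max, and the minimum redundancy value is min = 2q.
The embodiments of this application support input with any amount of redundancy, so an input coefficient `input` satisfies 2^(h−1) < input < 2^h (h a positive integer, h < max). When the NTT reaches stage s1 (s1 a positive integer, 1 ≤ s1 ≤ m), if $q \ge (2^{max-1} - 2^{h-1})/s_1$ holds (that is, the bound was still respected at stage s1−1), then stage s1 requires reduction.
This is derived as follows. According to Formula A, after stage 1 the theoretical maximum of both the X term and the Y term is at most 2^h + 2q; after stage 2 it is at most (2^h + 2q) + 2q = 2^h + 2·2q. After stage s, if no coefficient has overflowed yet, the theoretical maximum of both terms is at most 2^h + s·2q, and 2^h + s·2q < 2^max. In other words, if the coefficients still do not overflow after stage s completes, then s satisfies Formula C: $q < (2^{max-1} - 2^{h-1})/s$.
Therefore, on reaching stage s1, if Formula C does not hold for s1, i.e. $q \ge (2^{max-1} - 2^{h-1})/s_1$, an overflow would in theory occur in the worst case after stage s1 completes, so stage s1 requires reduction, i.e. its crosses switch to the redundancy-reducing cross.
If the NTT has not finished, it continues. Let log2 4q = g−1. If at a later stage s2 before the NTT finishes (s2 a positive integer, 1 ≤ s1 < s2 ≤ m) it is found that $q \ge (2^{max-1} - 2^{g-1})/(s_2 - s_1)$, then stage s2 requires reduction. This continues until all stages of the NTT have been covered, yielding a sequence S = [s1, s2, …] of the stages at which the coefficients could in theory overflow.
The derivation of this stage sequence is as follows.
Because stage s1 uses the redundancy-reducing cross, according to Formula B the theoretical maximum of both the X term and the Y term after stage s1 is at most 4q. Let log2 4q = g−1, so 2^(g−1) < 4q < 2^g; that is, after stage s1 all coefficients are smaller than 2^g. Assuming no coefficient overflows in the following stages, after stage s1+1 the theoretical maximum of X and Y is at most 2^g + 2q, and after stage s1+2 it is at most (2^g + 2q) + 2q = 2^g + 2·2q. After stage s1+s, if no coefficient has gone out of range, the theoretical maximum of X and Y does not exceed 2^g + s·2q, so 2^g + s·2q < 2^max. In other words, if the coefficients still do not overflow after stage s1+s completes, then s satisfies Formula D: $q < (2^{max-1} - 2^{g-1})/s$.
Therefore, on reaching stage s2 (s2 a positive integer, 1 ≤ s1 < s2 ≤ m), if Formula D does not hold for s = s2 − s1, i.e. $q \ge (2^{max-1} - 2^{g-1})/(s_2 - s_1)$, an overflow would in theory occur after stage s2 completes, so stage s2 requires reduction, i.e. it switches to the redundancy-reducing cross. After stage s2, if the NTT has not finished, this part of the computation is repeated until the NTT finishes.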
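The precomputation of the sequence S can be sketched as a single loop over the stages. The code below follows the worst-case bound stated in the text (2^b + s·2q < 2^max, with b = h before any reduction and b = g afterwards); it makes no claim to reproduce the figures of any particular embodiment. The names `bit_length`, `stage_fits` and `plan_ntt_reductions` are hypothetical, and q < 2^62 is assumed so that 4q fits in 64 bits.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bits needed to represent v (0 for v == 0). */
unsigned bit_length(uint64_t v)
{
    unsigned bits = 0;
    while (v != 0) {
        ++bits;
        v >>= 1;
    }
    return bits;
}

/* True if 2^bound_bits + s*2q < 2^max_bits, i.e.
 * s*q < 2^(max_bits-1) - 2^(bound_bits-1). Requires s >= 1. */
bool stage_fits(uint64_t q, unsigned bound_bits, unsigned s, unsigned max_bits)
{
    if (bound_bits == 0 || bound_bits >= max_bits || max_bits > 64) {
        return false;
    }
    uint64_t d = (UINT64_C(1) << (max_bits - 1))
               - (UINT64_C(1) << (bound_bits - 1));
    return q <= (d - 1) / s;   /* same as s*q < d, but cannot overflow */
}

/* Writes the stages that need the reducing cross into out[] (capacity:
 * `stages` entries) and returns how many there are. */
size_t plan_ntt_reductions(uint64_t q, unsigned h, unsigned max_bits,
                           unsigned stages, unsigned *out)
{
    size_t count = 0;
    unsigned bound_bits = h;   /* inputs are below 2^h */
    unsigned last = 0;         /* most recent reducing stage, 0 = none yet */
    for (unsigned stage = 1; stage <= stages; ++stage) {
        if (!stage_fits(q, bound_bits, stage - last, max_bits)) {
            out[count++] = stage;
            last = stage;
            bound_bits = bit_length(4 * q);   /* g: outputs are below 4q < 2^g */
        }
    }
    return count;
}
```

The branch in this loop runs only during precomputation; the generated NTT itself then calls either the growing or the reducing cross at each stage without testing anything at run time.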
即时计算(构造NTT算法)
如图11所示,若序列S为空,则构造一个NTT,它的所有阶段的蝴蝶只需要调用冗余增长交叉;否则,对照序列S构造一个NTT,使得它在第s1、第s2等阶段时调用冗余约减交叉,其余阶段仅调用冗余增长交叉。
INTT
INTT中蝴蝶中的交叉分为两种,一种是冗余增长交叉,另一种是冗余约减交叉。
在一个示例性实施例中,将INTT的输入多项式a的每一项按照其次数大小进行升序排序,次数最低的项在最前面,次数最高的项在最后面,得到一个序列。x和y分别是次数为j和j+t的项的系数。w是该次交叉计算中用到的旋转因子,旋转因子从一个预计算表中取得。规定一个最小冗余值min,min是一个正整数;最小冗余值的作用在于,保证无论Y等于多少,最小冗余值会大于Y。
在该实施例中,Radix-2 INTT中冗余增长交叉的代码如下所示。
x=a[j],y=a[j+t];
tx=x+y;
ty=min-y+x;
a[j]=tx;
a[j+t]=FastModMultiLazy(ty,w,q);//备注:a[j+t]=ty*w mod 2q;
以上示出的INTT冗余增长交叉的代码的含义为,x与y的和赋值给tx,min-y+x的结果赋值给ty,则系数x(即a[j])被修改为tx,系数y(即a[j+t])被修改为ty与w的冗余模乘的结果。经过这样的计算后,系数x的值会变大,此为增长的含义。
在该实施例中,Radix-2 INTT中冗余约减交叉的代码如下所示。
x=a[j],y=a[j+t];
tx=FastModMultiLazy(x+y,1,q);//备注:tx=x+y mod 2q;
ty=min-y+x;
a[j]=tx;
a[j+t]=FastModMultiLazy(ty,w,q);//备注:a[j+t]=ty*w mod 2q;
以上示出的INTT冗余约减交叉的代码的含义为,x+y与整数1的冗余模乘的结果赋值给tx,min-y+x的结果赋值给ty,则系数x(即a[j])被修改为tx,系数y(即a[j+t])被修改为ty与w的冗余模乘的结果。经过这样的计算后,系数x的值有可能变小,此为约减的含义。
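Figure 17 is a schematic diagram of how the redundancy-growing cross of the Radix-2 INTT is computed. Its formula is Formula E: $X = x + y$, $Y = (min - y + x) \cdot w \bmod 2q$.
Figure 18 is a schematic diagram of how the redundancy-reducing cross of the Radix-2 INTT is computed. Its formula is Formula F: $X = (x + y) \bmod 2q$, $Y = (min - y + x) \cdot w \bmod 2q$.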
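The two INTT crosses can be sketched in the same way, together with the minimum redundancy value min = (t+n)*q used in the subtraction. As before, `lazy_mulmod` is a hypothetical stand-in for `FastModMultiLazy` that assumes the precomputed-quotient form with β = 2^64 and recomputes the quotient per call; `min_red` corresponds to min in the text, and `unsigned __int128` is a GCC/Clang extension.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for FastModMultiLazy: returns a value congruent to
 * x*w modulo q in [0, 2q). Assumptions: 1 < q < 2^62 and w < q. */
uint64_t lazy_mulmod(uint64_t x, uint64_t w, uint64_t q)
{
    uint64_t w_pre = (uint64_t)(((unsigned __int128)w << 64) / q);
    uint64_t quot = (uint64_t)(((unsigned __int128)x * w_pre) >> 64);
    return x * w - quot * q;
}

/* Minimum redundancy value (t + n) * q: a multiple of q, so adding it does
 * not change the result modulo q. Assumes the product fits in 64 bits. */
uint64_t intt_min_redundancy(uint64_t t_factor, uint64_t n, uint64_t q)
{
    return (t_factor + n) * q;
}

/* Redundancy-growing cross (Formula E): the sum is stored unreduced. */
void intt_cross_grow(uint64_t *a, size_t j, size_t t, uint64_t w,
                     uint64_t q, uint64_t min_red)
{
    uint64_t x = a[j], y = a[j + t];
    a[j] = x + y;
    a[j + t] = lazy_mulmod(min_red - y + x, w, q);   /* (x - y)*w mod 2q */
}

/* Redundancy-reducing cross (Formula F): the sum is brought into [0, 2q). */
void intt_cross_reduce(uint64_t *a, size_t j, size_t t, uint64_t w,
                       uint64_t q, uint64_t min_red)
{
    uint64_t x = a[j], y = a[j + t];
    a[j] = lazy_mulmod(x + y, 1, q);                 /* (x + y) mod 2q */
    a[j + t] = lazy_mulmod(min_red - y + x, w, q);   /* (x - y)*w mod 2q */
}
```

The subtraction stays non-negative only while y does not exceed `min_red`, which is the property the (t+n)*q choice is meant to provide.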
INTT需要一个算法来准确找到哪个阶段、哪个蝴蝶、哪个交叉才需要进行约减处理。INTT中确定需要进行约减处理的算法如下。

以上示出的预计算算法用于计算INTT中需要约减处理的交叉的具体位置,该算法的逻辑如下。
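According to Formula E, that is, the add/subtract-then-multiply principle, the X term grows by 1 bit: X is a sum, and the sum of x and y is in theory at most 1 bit wider than the larger of x and y. The Y term ends up 1 bit wider than the modulus q: the output of lazy modular multiplication always lies in [0, 2q), so the bit width of Y is always less than or equal to that of 2q, and in the worst case the bit width of Y after the computation is simply taken to be that of 2q. In the next stage the X and Y terms may swap positions, so their bit widths are obtained and substituted into Formula E to derive how they change once Formula E has been applied. If, before substitution into Formula E, the bit width of the X term already equals the bit width of the maximum redundancy value, the X term must be reduced to guarantee that the cross does not overflow; in that case Formula F is used instead, i.e. the redundancy-reducing cross is called. The bit width of the Y term need not be considered. Y is fed into lazy modular multiplication, and the maximum redundancy value is at most the machine word length, so as long as Y does not exceed the machine word length, the lazy modular multiplication completes and outputs a value that in the worst case has the same bit width as 2q. The Y term therefore has no overflow problem. In summary, once the parameters are given, the algorithm simulates one run of the INTT following the logic above, derives the bit widths at every cross, finds every cross whose X input would in theory already reach the bit width of the maximum redundancy value, records these positions, and stores them in the precomputation table for the INTT generation module to generate a customized INTT.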
预计算
如图13所示,INTT预计算模块接收输入的参数,然后根据参数计算INTT中发生溢出的位置,生成序列T,序列T的长度等于INTT发生溢出的交叉的个数,序列值为发生溢出的INTT中交叉的标识,将序列T保存至预计算表,具体流程如下。
设n(n≥4)为2的幂,INTT阶段数为log2n,模数为log2q比特,INTT允许的最大冗余值(根据机器字长得出)为2max。输入前面这些参数和多项式a,通过上述算法返回一个map  red_position,red_position中包含所有需要约减处理的交叉位置;通过遍历red_position,可以准确找到需要约减处理的位置[t,<s,b,c>],其中t表示这是该INTT的第t个交叉,<s,b,c>表示这次约减处理发生的具体位置是INTT的第s阶段第b个蝴蝶的第c交叉,将这些位置存放在序列T中,如图13所示。
即时计算(构造INTT算法)
如图14所示,若序列T为空,则构造一个INTT,它的所有阶段的蝴蝶只需要调用冗余增长交叉;否则,对照序列T(见图13)构造一个INTT,使得它在red_position所包含的交叉位置上才调用冗余约减交叉,其余交叉仅调用冗余增长交叉。
下面结合一个具体应用场景对本申请的实现方式进行描述,参见下述实例1。
实例1
对于基于理想格的后量子密码算法(如NewHope和Kyber),NTT和INTT是其主要计算。这类算法的参数比较固定,同时参数大小满足log2n+log2q<60,因此只在蝴蝶上调用冗余模乘,完全移除NTT和INTT蝴蝶上的所有约减处理和所有逻辑分支语句,从而提高多项式乘法的性能。
基于上述的基本思路,实例1如下:
NTT
(1)无逻辑分支语句的NTT的所有计算的输出结果均为非负整数。
(2)NTT蝴蝶先算乘法后算加减法,因此需要放宽对X值的限制,使其输出更大的冗余。
(3)计算整数模乘时,可通过冗余模乘,令模乘输出控制在[0,2q)范围内。
(4)整数加法与减法后输出的冗余数值可根据具体计算任务处理。
(5)进入当前阶段的下一个蝴蝶,如果当前阶段的所有蝴蝶都计算完毕,则进入下一个阶段,重复步骤(2)—(5),直到所有阶段的计算都完成为止。这样整体输出是冗余的,但是在允许范围内。
NTT的算法如下。

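INTT
(1) All computation outputs of the INTT without logical branch statements are non-negative integers. Let n (n ≥ 4) be a power of 2; the INTT has log2 n stages in total. The butterflies of the last stage are split off separately, so the INTT is divided into the main stages (stage number less than log2 n) and the final stage (stage log2 n).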
(2)由于INTT需要支持任意冗余大小的多项式系数输入,最小冗余值为min=(t+n)*q,其中t是输入数据的冗余倍数。
(3)INTT的蝴蝶先算加减法后算乘法,因此整数加法与减法后输出的的冗余数值不必处理;待到计算整数模乘时,利用冗余模乘,令输出结果的取值控制在[0,2q)范围内。
(4)进入当前阶段的下一个蝴蝶,如果当前阶段的所有蝴蝶都计算完毕,则进入下一个阶段,重复步骤(2)—(3),直到主阶段的计算都完成为止。
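(5) Enter the final stage. The computation is the same as for the butterflies of the main stages; the main purpose is to control the redundancy of the output, so the redundancy-reducing cross may be used here even though no coefficient has gone out of range. Because the INTT butterfly adds and subtracts first and multiplies afterwards, calling the lazy modular multiplication function only when the butterfly performs its integer modular multiplication is enough to keep the overall output within [0, 2q).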
INTT的算法如下。

总结来看,上述实例通过预计算构建更高效的NTT/INTT;在去掉所有的逻辑分支语句后,只保留了必要的约减处理,输出结果的取值依然保持在数据类型或指令集允许的范围内而不会越界;用冗余模乘的方式进行模乘处理,效率较高。
上述实例1达到的效果包括而不限于以下四点。
第一,通过引入数值的冗余表示,在去掉所有的逻辑分支语句后,输出结果的取值依然保持在指令集允许的范围内。
第二,NTT和INTT过程中没有额外的约减处理,多项式计算的性能得到提升。
第三,计算过程不与蒙哥马利算法相绑定,可根据具体计算任务随意调整多项式系数的表示形式。
第四,能够与其他NTT和INTT算法混合使用。
实例2
当参数很大的时候(比如在64位CPU上,log2n+log2q≥60),此时需要将NTT/INTT进行拆分,NTT按照阶段拆分,INTT按照交叉拆分,拆分的依据是预计算得到的待进行约减处理的位置。
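For example, for the NTT on a 64-bit CPU, let the maximum redundancy value allowed by the NTT be 2^62, i.e. max = 62. The maximum redundancy value 2^62 has 63 bits, so the theoretical bit width of every coefficient during the NTT must be at most 63 bits. Let n = 2^16 and log2 q = 57, i.e. q has 58 bits. The coefficients of the input polynomial a lie in [0, q−1], so log2 a_i is at most 57 and the polynomial coefficients are treated as 58-bit data. The minimum redundancy value is 2q. Precomputation, i.e. substituting the necessary parameters and stage numbers into Formula C and Formula D, shows that the NTT needs to call the redundancy-reducing cross only in the last stage, stage 16, and calls the redundancy-growing cross in all other stages. The polynomial coefficients finally output by the NTT stay within 60 bits. Figure 7 shows the NTT computation in Example 2.
For the INTT on a 64-bit CPU, let the maximum redundancy value allowed by the INTT be 2^62 (i.e. max = 62), n = 16 and log2 q = 58, i.e. q has 59 bits. The input polynomial coefficients have a redundancy factor of 4. Precomputation shows that this INTT needs 5 reductions in total, at the following positions: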
[9,<2,1,1>]//第9个交叉需要约减,位置是第2阶段第1蝴蝶第1交叉;
[11,<2,2,1>]//第11个交叉需要约减,位置是第2阶段第2蝴蝶第1交叉;
[13,<2,3,1>]//第13个交叉需要约减,位置是第2阶段第3蝴蝶第1交叉;
[15,<2,4,1>]//第15个交叉需要约减,位置是第2阶段第4蝴蝶第1交叉;
[26,<4,1,2>]//第26个交叉需要约减,位置是第4阶段第1蝴蝶第2交叉;
因此,在构造INTT算法时,只对上述位置的交叉调用冗余约减交叉,其余交叉调用冗余增长交叉。最小冗余值可以取min=(4+16)q=20q。最后INTT输出的多项式的系数控制在62比特以内,在允许范围内。
其中,最小冗余值的取值推导过程如下:在输入的多项式系数的冗余倍数为1倍(等价 于多项式系数没有冗余)时,最小冗余值是n*q;若输入的多项式系数的冗余倍数是4倍,说明输入的多项式系数的最大值不超过4q,那么最小冗余值min设为(4+n)q时,就能确保最小冗余值min大于Y。图8示出了实例2的INTT计算过程。
上述实例2在具有和实例1相同的效果的基础上,放宽了对多项式系数的取值的冗余倍数,在指令集允许的范围内,输入数据输出结果都支持更大的冗余范围;精确定位了约减处理发生的位置,去掉不必要的约减处理,且约减处理不与蒙哥马利算法绑定;可用的参数组更多。
图19是本申请实施例提供的一种数据处理装置800的结构示意图。装置800包括第一确定模块801和第二确定模块802。
结合图6所示方法流程来看,装置800设于图6所示计算设备上,第一确定模块801用于执行S201,第二确定模块802用于执行S202。
图19所描述的装置实施例仅仅是示意性的,例如,上述模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个模块或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。在本申请各个实施例中的各功能模块可以集成在一个模块中,也可以是各个模块单独物理存在,也可以两个或两个以上模块集成在一个模块中。
数据处理装置800中的各个模块全部或部分地通过软件、硬件、固件或者其任意组合来实现。
下面结合后文描述的计算设备900,描述使用硬件或软件来实现数据处理装置800中的各个功能模块的一些可能实现方式。
在采用软件实现的情况下,例如,上述第一确定模块801和第二确定模块802是由图20中的至少一个处理器901读取存储器902中存储的程序代码后,生成的软件功能模块来实现。
在采用硬件实现的情况下,例如,图19中上述各个模块由计算设备中的不同硬件分别实现,例如第一确定模块801由图20中的至少一个处理器901中的一部分处理资源(例如多核处理器中的一个核或两个核)实现,而第二确定模块802由图20中至少一个处理器901中的其余部分处理资源(例如多核处理器中的其他核),或者采用现场可编程门阵列(field-programmable gate array,FPGA)、或协处理器等可编程器件来完成。
图20是本申请实施例提供的一种计算设备900的结构示意图。计算设备900用于执行图6所示方法。计算设备900包括处理器901、存储器902以及网络接口903。
处理器901例如是通用中央处理器(central processing unit,CPU)、网络处理器(network processer,NP)、图形处理器(graphics processing unit,GPU)、神经网络处理器(neural-network processing units,NPU)、数据处理单元(data processing unit,DPU)、微处理器或者一个或多个用于实现本申请方案的集成电路。例如,处理器901包括专用集成电路(application-specific integrated circuit,ASIC),可编程逻辑器件(programmable logic device,PLD)或其组合。PLD例如是复杂可编程逻辑器件(complex programmable logic device,CPLD)、现场可编程逻辑门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其 任意组合。
存储器902例如是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其它类型的静态存储设备,又如是随机存取存储器(random access memory,RAM)或者可存储信息和指令的其它类型的动态存储设备,又如是电可擦可编程只读存储器(electrically erasable programmable read-only Memory,EEPROM)、只读光盘(compact disc read-only memory,CD-ROM)或其它光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其它磁存储设备,或者是能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其它介质,但不限于此。可选地,存储器902独立存在,并通过内部连接904与处理器901相连接。或者,可选地存储器902和处理器901集成在一起。
网络接口903使用任何收发器一类的装置,用于与其它设备或通信网络通信。网络接口903例如包括有线网络接口或者无线网络接口中的至少一项。其中,有线网络接口例如为以太网接口。以太网接口例如是光接口,电接口或其组合。无线网络接口例如为无线局域网(wireless local area networks,WLAN)接口,蜂窝网络网络接口或其组合等。
在一些实施例中,处理器901包括一个或多个CPU,如图20中所示的CPU0和CPU1。
在一些实施例中,计算设备900可选地包括多个处理器,如图20中所示的处理器901和处理器905。这些处理器中的每一个例如是一个单核处理器(single-CPU),又如是一个多核处理器(multi-CPU)。这里的处理器可选地指一个或多个设备、电路、和/或用于处理数据(如计算机程序指令)的处理核。
在一些实施例中,计算设备900还包括内部连接904。处理器901、存储器902以及至少一个网络接口903通过内部连接904连接。内部连接904包括通路,在上述组件之间传送信息。可选地,内部连接904是单板或总线。可选地,内部连接904分为地址总线、数据总线、控制总线等。
在一些实施例中,计算设备900还包括输入输出接口906。输入输出接口906连接到内部连接904上。
在一些实施例中,输入输出接口906用于与输入设备连接,接收用户通过输入设备输入的上述实施例涉及的命令或数据,例如模数、冗余倍数、多项式维度等参数。输入设备包括但不限于键盘、触摸屏、麦克风、鼠标或传感设备等。
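In some embodiments, the input/output interface 906 is also used to connect to an output device. Through the output device, the input/output interface 906 outputs the results produced by processor 901 executing the above method, such as the data after the number-theoretic transform. Output devices include but are not limited to displays, printers, projectors and the like.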
可选地,处理器901通过读取存储器902中保存的程序代码910实现上述实施例中的方法,或者,处理器901通过内部存储的程序代码实现上述实施例中的方法。在处理器901通过读取存储器902中保存的程序代码910实现上述实施例中的方法的情况下,存储器902中保存实现本申请实施例提供的方法的程序代码。
结合图6所示方法来看,在一种可能的实现方式中,处理器901用于指示输入输出接口906或者网络接口903执行S201,处理器901还用于执行S202。在另一种可能的实现中,处理器901用于指示输入输出接口906或者网络接口903执行S201,处理器905用于执行S202。 处理器901实现上述功能的更多细节请参考前面各个方法实施例中的描述,在这里不再重复。
本说明书中的各个实施例均采用递进的方式描述,各个实施例之间相同相似的部分可互相参考,每个实施例重点说明的都是与其他实施例的不同之处。
A参考B,指的是A与B相同或者A为B的简单变形。
本申请实施例所涉及的信息(包括但不限于用户设备信息、用户个人信息等)、数据(包括但不限于用于分析的数据、存储的数据、展示的数据等)以及信号,均为经用户授权或者经过各方充分授权的,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。例如,本申请中涉及到的待进行加解密的数据以及数据对应的参数都是在充分授权的情况下获取的。
本申请实施例,除非另有说明,“至少一个”的含义是指一个或多个,“多个”的含义是指两个或两个以上。例如,多个计算单元是指两个或两个以上的计算单元。
上述实施例可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例描述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘Solid State Disk(SSD))等。
以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的范围。

Claims (23)

  1. 一种数据处理方法,其特征在于,由计算设备执行,所述计算设备用于运行数据的数论变换,所述数据的数论变换的步骤包括多个计算单元,所述方法包括:
    基于所述数据的参数,确定每个所述计算单元产生的处理结果的预估比特位数,所述参数指示所述数据的比特位数;
    基于所述预估比特位数,从所述多个计算单元中确定第一计算单元,所述第一计算单元为用于对第二计算单元的处理结果约减处理的计算单元,所述第二计算单元的处理结果的预估比特位数满足预设比特位数。
  2. 根据权利要求1所述的方法,其特征在于,所述约减处理,包括:
    对所述第二计算单元的处理结果进行冗余模乘处理。
  3. 根据权利要求2所述的方法,其特征在于,所述对所述第二计算单元的处理结果进行冗余模乘处理包括:
    基于旋转因子对所述第二计算单元的处理结果进行冗余模乘处理,所述旋转因子具有和所述数据相同的表示形式。
  4. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    对所述第二计算单元的约减处理后的处理结果进行加密处理或解密处理。
  5. 根据权利要求1至4中任一项所述的方法,其特征在于,所述参数包括所述多个计算单元中每个计算单元进行取模运算时使用的模数、所述数据相对于所述模数的冗余倍数以及所述数据的多项式维度。
  6. 根据权利要求1至5中任一项所述的方法,其特征在于,所述预设比特位数是基于所述计算设备中处理器的位数确定的,所述预设比特位数比所述处理器的位数少1或2。
  7. 根据权利要求1至6中任一项所述的方法,其特征在于,所述多个计算单元中每个计算单元还用于基于冗余值进行减法处理,所述冗余值为大于或等于所述减法处理中减数的数值。
  8. 根据权利要求7所述的方法,其特征在于,所述数论变换包括正数论变换,所述冗余值等于2q,所述q表示所述多个计算单元中每个计算单元进行取模运算时使用的模数,所述q为正整数。
  9. 根据权利要求7所述的方法,其特征在于,所述数论变换包括逆数论变换,所述冗余值 等于(t+n)*q,所述q表示所述多个计算单元中每个计算单元进行取模运算时使用的模数,所述t表示所述数据相对于所述模数的冗余倍数,所述n表示所述数据的多项式维度,所述t、所述n和所述q为正整数。
  10. 根据权利要求1至9中任一项所述的方法,其特征在于,所述多个计算单元中每个计算单元用于基于k个数据进行处理产生k个处理结果,所述k为正整数。
  11. 一种数据处理装置,其特征在于,设于计算设备,所述计算设备用于运行数据的数论变换,所述数据的数论变换的步骤包括多个计算单元,所述装置包括:
    第一确定模块,用于基于所述数据的参数,确定每个所述计算单元产生的处理结果的预估比特位数,所述参数指示所述数据的比特位数;
    第二确定模块,用于基于所述预估比特位数,从所述多个计算单元中确定第一计算单元,所述第一计算单元为用于对第二计算单元的处理结果约减处理的计算单元,所述第二计算单元的处理结果的预估比特位数满足预设比特位数。
  12. 根据权利要求11所述的装置,其特征在于,所述第一计算单元用于对所述第二计算单元的处理结果进行冗余模乘处理。
  13. 根据权利要求12所述的装置,其特征在于,所述第一计算单元用于基于旋转因子对所述第二计算单元的处理结果进行冗余模乘处理,所述旋转因子具有和所述数据相同的表示形式。
  14. 根据权利要求11所述的装置,其特征在于,所述装置还包括:处理模块,用于对所述第二计算单元的约减处理后的处理结果进行加密处理或解密处理。
  15. 根据权利要求11至14中任一项所述的装置,其特征在于,所述参数包括所述多个计算单元中每个计算单元进行取模运算时使用的模数、所述数据相对于所述模数的冗余倍数以及所述数据的多项式维度。
  16. 根据权利要求11至15中任一项所述的装置,其特征在于,所述预设比特位数是基于所述计算设备中处理器的位数确定的,所述预设比特位数比所述处理器的位数少1或2。
  17. 根据权利要求11至16中任一项所述的装置,其特征在于,所述多个计算单元中每个计算单元还用于基于冗余值进行减法处理,所述冗余值为大于或等于所述减法处理中减数的数值。
  18. 根据权利要求17所述的装置,其特征在于,所述数论变换包括正数论变换,所述冗余值等于2q,所述q表示所述多个计算单元中每个计算单元进行取模运算时使用的模数,所述 q为正整数。
  19. 根据权利要求17所述的装置,其特征在于,所述数论变换包括逆数论变换,所述冗余值等于(t+n)*q,所述q表示所述多个计算单元中每个计算单元进行取模运算时使用的模数,所述t表示所述数据相对于所述模数的冗余倍数,所述n表示所述数据的多项式维度,所述t、所述n和所述q为正整数。
  20. 根据权利要求11至19中任一项所述的装置,其特征在于,所述多个计算单元中每个计算单元用于基于k个数据进行处理产生k个处理结果,所述k为正整数。
  21. 一种计算设备,其特征在于,所述计算设备包括:处理器,所述处理器与存储器耦合,所述存储器中存储有至少一条计算机程序指令,所述至少一条计算机程序指令由所述处理器加载并执行,以使所述计算设备实现权利要求1-10中任一项所述的方法。
  22. 一种计算机可读存储介质,其特征在于,所述存储介质中存储有至少一条指令,所述指令在计算机上运行时,使得计算机执行如权利要求1-10中任一项所述的方法。
  23. 一种计算机程序产品,其特征在于,所述计算机程序产品包括一个或多个计算机程序指令,当所述计算机程序指令被计算机加载并运行时,使得所述计算机执行权利要求1-10中任一项所述的方法。
PCT/CN2023/098288 2022-06-10 2023-06-05 数据处理方法、装置、设备及存储介质 WO2023236899A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210656082.X 2022-06-10
CN202210656082.XA CN117254902A (zh) 2022-06-10 2022-06-10 数据处理方法、装置、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2023236899A1 true WO2023236899A1 (zh) 2023-12-14

Family

ID=89117530

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/098288 WO2023236899A1 (zh) 2022-06-10 2023-06-05 数据处理方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN117254902A (zh)
WO (1) WO2023236899A1 (zh)

Also Published As

Publication number Publication date
CN117254902A (zh) 2023-12-19

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23819064

Country of ref document: EP

Kind code of ref document: A1