CN117254902A

CN117254902A - Data processing method, device, equipment and storage medium

Info

Publication number: CN117254902A
Application number: CN202210656082.XA
Authority: CN
Inventors: 袁壄
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2022-06-10
Filing date: 2022-06-10
Publication date: 2023-12-19
Also published as: WO2023236899A1

Abstract

The application provides a data processing method, apparatus, device, and storage medium, and belongs to the field of computer technologies. Based on parameters of the data, the method determines the estimated number of bits of the processing result generated by each computing unit in the steps of a number theoretic transform, and determines, based on the estimated number of bits, the computing unit responsible for reduction processing. The processing result can thus be reduced at an appropriate position without introducing a logical branch statement, which lowers its number of bits, keeps that number from exceeding the upper limit that the computing device can represent, and avoids overflow. Compared with performing reduction through logical branch statements, the method removes the branch statements and optimizes the structure of the number theoretic transform, thereby improving the efficiency of running it.

Description

Data processing method, device, equipment and storage medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a data processing method, apparatus, device, and storage medium.

Background

In many encryption and decryption schemes, polynomial multiplication is the dominant computation. The number theoretic transform helps implement polynomial multiplication more efficiently, thereby improving the efficiency of these schemes.

In a classical number theoretic transform algorithm, logical branch statements are introduced to avoid overflow. While running the algorithm, the computing device executes these branch statements to reduce the data, which lowers the value of the data and avoids overflow.

However, executing logical branch statements is time-consuming, which makes running the number theoretic transform inefficient.

Disclosure of Invention

The embodiments of the application provide a data processing method, apparatus, device, and storage medium that can improve the efficiency of running the number theoretic transform. The technical solution is as follows.

In a first aspect, a data processing method is provided, performed by a computing device configured to run a number theoretic transform on data, where the steps of the number theoretic transform include a plurality of computing units. The method includes:

determining an estimated number of bits of the processing result generated by each of the computing units based on a parameter of the data, the parameter indicating the number of bits of the data;

and determining a first computing unit from the plurality of computing units based on the estimated number of bits, wherein the first computing unit is a computing unit that performs reduction processing on the processing result of a second computing unit, and the estimated number of bits of the processing result of the second computing unit meets the preset number of bits.

In the method provided in the first aspect, the estimated number of bits of the processing result generated by each computing unit in the steps of the number theoretic transform is determined from the parameters, and the computing unit responsible for reduction processing is determined based on the estimated number of bits. The processing result can therefore be reduced at an appropriate position without introducing a logical branch statement, which lowers its number of bits, keeps that number from exceeding the upper limit that the computing device can represent, and avoids overflow. Compared with performing reduction through logical branch statements, the method removes the branch statements and optimizes the structure of the number theoretic transform, thereby improving the efficiency of running it.

In some embodiments, the reducing process comprises:

and performing redundant modular multiplication processing on the processing result of the second computing unit.

In the above embodiment, the reduction processing is implemented by redundant modular multiplication. On the one hand, the reduction processing is not tied to the Montgomery algorithm, and the data does not need to be kept in the Montgomery representation. In other words, the scheme is usable whether the data is in the Montgomery representation or a non-Montgomery representation, which improves its flexibility and practicability. On the other hand, redundant modular multiplication speeds up the reduction processing and thereby improves efficiency; in particular, it helps significantly accelerate the computing device in scenarios such as large-number arithmetic.

In some embodiments, the performing redundancy modular multiplication on the processing result of the second computing unit includes:

and performing redundant modular multiplication processing on the processing result of the second computing unit based on a twiddle factor, wherein the twiddle factor has the same representation form as the data.

Through the above embodiments, dynamic adjustment of the representation of data according to the needs of a particular computing task is supported.

In some embodiments, the representation is a Montgomery representation or a non-Montgomery representation.

In some embodiments, the method further comprises:

and carrying out encryption processing or decryption processing on the processing result after the reduction processing of the second computing unit.

In some embodiments, the parameters include a modulus used when each of the plurality of computing units performs a modulo operation, a redundancy of the data relative to the modulus, and a polynomial dimension of the data.

In the above embodiment, considering that the input data may carry a certain amount of redundancy, the amount of redundancy of the input data is described by the modulus and the redundancy multiple. When the data is redundant, the computing unit that needs to perform the reduction processing can therefore be located relatively accurately, which reduces unnecessary reduction processing.

In some embodiments, the preset number of bits is determined based on the number of bits of a processor in the computing device, the preset number of bits being 1 or 2 less than the number of bits of the processor.

In the above embodiment, rather than being set empirically, the preset number of bits is determined by a hardware factor, namely the number of bits of the processor. The preset number of bits is thus adapted to the capability of the hardware, and different preset numbers of bits can be determined for hardware with different capabilities. Locating the computing unit that needs reduction processing based on this preset number of bits is more accurate, which reduces unnecessary reduction processing.

Taking a preset number of bits of 63 as an example, if the number of bits of the input data of a computing unit is estimated to reach 62 or 63, the reduction processing is performed in that computing unit, and the computing units before it do not need to perform reduction processing. In this way, overflow is avoided while the limit on the value of the data is relaxed as far as possible, so that the capability of the hardware is fully used, resource utilization is improved, and the number of reduction operations is reduced.
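The placement rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of the application: the function name `plan_reductions`, the growth model (each stage adds at most 2q to the bound on a value), and the bound of 2q after a reduction are all hypothetical choices made for the example.

```python
def plan_reductions(q, t, n, word_bits=64, margin=1):
    """Return the stages (1-based) whose input must be reduced.

    q: modulus, t: redundancy multiple of the input, n: polynomial dimension
    (assumed to be a power of two). margin is 1 or 2, so the preset number of
    bits is word_bits - margin.
    """
    preset_bits = word_bits - margin
    num_stages = n.bit_length() - 1        # log2(n) for a power of two
    bound = t * q                          # inputs assumed to lie in [0, t*q)
    reducing_stages = []
    for stage in range(1, num_stages + 1):
        # Estimated number of bits of the largest possible input to this stage.
        if (bound - 1).bit_length() >= preset_bits:
            reducing_stages.append(stage)
            bound = 2 * q                  # ASSUMPTION: reduced values lie in [0, 2q)
        bound += 2 * q                     # ASSUMPTION: growth of at most 2q per stage
    return reducing_stages


# Toy 8-bit "processor" so that the threshold is reached quickly.
print(plan_reductions(q=17, t=1, n=8, word_bits=8, margin=1))
```

Because the plan depends only on the parameters, it can be computed once before the transform runs, so the transform itself needs no data-dependent branch to decide where to reduce.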

In some embodiments, each of the plurality of computing units is further configured to perform subtraction processing based on a redundancy value, the redundancy value being a value greater than or equal to the subtrahend in the subtraction processing.

The number theoretic transform functions include a forward number theoretic transform function and an inverse number theoretic transform function. The subtraction processing of the forward function is x - y×w mod 2q or x mod 2q - y×w mod 2q, and the subtraction processing of the inverse function is x - y, where x and y denote data, q denotes the modulus used when the modulo operation is performed on the data in the number theoretic transform function, w denotes a twiddle factor, mod denotes the modulo operation, × denotes multiplication, and - denotes subtraction.

In the above embodiment, the redundancy value is added during the subtraction processing, which amounts to adding the redundancy value to the minuend and enlarging it. This prevents the subtraction from producing a negative number and improves calculation accuracy. In addition, rather than being set empirically, the redundancy value is determined from the parameters related to the data, so it suits the actual parameter values, which improves accuracy. Moreover, the redundancy value is not bound to a single parameter setting but is adjusted along with the parameter values, so the scheme supports more parameter choices, which improves extensibility and practicability.

In some embodiments, the number theoretic transform includes a forward number theoretic transform, and the redundancy value is equal to 2q, where q denotes the modulus used when each of the plurality of computing units performs a modulo operation, and q is a positive integer.

In the above embodiment, the forward number theoretic transform is characterized by performing the modular multiplication first and then the addition and the subtraction. The value range of the modular multiplication is controllable: when it is implemented as redundant modular multiplication, its result lies in [0, 2q), and when it is implemented as modular multiplication without redundancy, its result lies in [0, q), where q is the modulus. Because the subtrahend in the subtraction processing is the output of the modular multiplication, it lies in [0, 2q). Adding 2q in the subtraction processing therefore guarantees that the redundancy value is larger than the subtrahend, so the result of the subtraction is never negative, which contributes to operation accuracy. In addition, the redundancy value is kept as small as possible, which avoids the extra processing and storage overhead that an excessively large redundancy value would cause.
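As a small illustration of the 2q redundancy value, the following Python sketch performs one forward cross. The function name `forward_cross` is hypothetical, and the expression `(y * w) % (2 * q)` is only a stand-in for a redundant modular multiplication whose result lies in [0, 2q); the input x is assumed to be non-negative.

```python
def forward_cross(x, y, w, q):
    """One forward cross with the redundancy value 2q added in the subtraction."""
    p = (y * w) % (2 * q)      # stand-in for a redundant modular product in [0, 2q)
    upper = x + p              # addition output
    lower = x - p + 2 * q      # subtraction output; 2q > p, so this is never negative
    return upper, lower


upper, lower = forward_cross(x=5, y=7, w=3, q=17)
print(upper, lower)            # both outputs stay congruent to x +/- y*w modulo q
```

No comparison or branch is needed to keep the subtraction output non-negative; the fixed offset 2q handles every case.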

In some embodiments, the number theoretic transform includes an inverse number theoretic transform, and the redundancy value is equal to (t+n)×q, where q denotes the modulus used when each of the plurality of computing units performs a modulo operation, t denotes the redundancy multiple of the data with respect to the modulus, n denotes the polynomial dimension of the data, and t, n, and q are positive integers.

In the above embodiment, when the data has no redundancy, the input data of any computing unit does not exceed n×q. Adding t×q on top of n×q as the redundancy value therefore supports input data with any redundancy multiple and ensures that the result of the subtraction is never negative, which contributes to operation accuracy.
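The inverse case can be sketched in the same way. The function name `inverse_cross` is hypothetical, and the sketch assumes that the inputs are non-negative and smaller than (t+n)×q, and that the twiddle-factor multiplication follows the subtraction; `% (2 * q)` again stands in for a redundant modular multiplication.

```python
def inverse_cross(x, y, w, q, t, n):
    """One inverse cross with the redundancy value (t + n) * q in the subtraction."""
    redundancy = (t + n) * q
    upper = x + y                        # addition output, left unreduced
    diff = x - y + redundancy            # never negative while y < (t + n) * q
    lower = (diff * w) % (2 * q)         # ASSUMPTION: multiply after subtracting
    return upper, lower


print(inverse_cross(x=3, y=100, w=5, q=17, t=1, n=8))
```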

In some embodiments, each of the plurality of computing units is configured to process based on k data to generate k processing results, where k is a positive integer.

In the above embodiment, the number theoretic transform is divided, at the granularity of a cross, into computing units that require reduction and computing units that do not, which makes it easier to locate precisely the positions where reduction is required.

In a second aspect, there is provided a data processing apparatus having functionality to implement the above-described first aspect or any of the alternatives of the first aspect. The apparatus comprises at least one module for implementing the method as provided in the first aspect or any of the alternatives of the first aspect.

In some embodiments, the modules in the apparatus are implemented in software, and the modules in the apparatus are program modules. In other embodiments, the modules in the apparatus are implemented in hardware or firmware. The details of the apparatus provided in the second aspect may be found in the first aspect or any of the alternatives of the first aspect, and are not described here again.

In a third aspect, there is provided a computing device comprising a processor coupled to a memory having stored therein at least one computer program instruction that is loaded and executed by the processor to cause the computing device to implement the method provided by the first aspect or any of the alternatives of the first aspect. For specific details of the computing device provided in the third aspect, refer to the first aspect or any of the alternatives of the first aspect; they are not described here again.

In a fourth aspect, there is provided a computer readable storage medium having stored therein at least one instruction which when executed on a computer causes the computer to perform the method provided in the first aspect or any of the alternatives of the first aspect.

In a fifth aspect, there is provided a computer program product comprising one or more computer program instructions which, when loaded and run by a computer, cause the computer to carry out the method as provided in the first aspect or any of the alternatives of the first aspect.

In a sixth aspect, there is provided a chip comprising programmable logic circuitry and/or program instructions for implementing the method as provided in the first aspect or any of the alternatives of the first aspect, when the chip is run.

Drawings

Fig. 1 is a flowchart of an NTT provided in an embodiment of the present application;

FIG. 2 is a flowchart of an INTT provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of the calculation process of a cross in a butterfly of a radix-2 NTT according to an embodiment of the present application;

FIG. 4 is a schematic diagram of the calculation process of a cross in a butterfly of a radix-2 INTT according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a computational polynomial multiplication provided in an embodiment of the present application;

FIG. 6 is a flow chart of a data processing method according to an embodiment of the present application;

FIG. 7 is a schematic diagram of a change in the number of bits of data during NTT operation according to an embodiment of the present disclosure;

FIG. 8 is a schematic diagram of a change in the number of bits of data during INTT operation according to an embodiment of the present application;

FIG. 9 is a schematic diagram of an NTT according to an embodiment of the present application;

FIG. 10 is a block diagram of an NTT pre-calculation module according to an embodiment of the present application;

FIG. 11 is an architecture diagram of an NTT generation module provided in an embodiment of the present application;

FIG. 12 is a schematic diagram of an INTT provided in an embodiment of the present application;

FIG. 13 is a schematic diagram of an INTT pre-calculation module according to an embodiment of the present application;

FIG. 14 is an architecture diagram of an INTT generation module provided in an embodiment of the present application;

FIG. 15 is a schematic diagram of a calculation method of a redundancy growth cross in a radix-2 NTT according to an embodiment of the present application;

FIG. 16 is a schematic diagram of a calculation method of a redundancy reduction cross in a radix-2 NTT according to an embodiment of the present application;

FIG. 17 is a schematic diagram of a calculation method of a redundancy growth cross in a radix-2 INTT according to an embodiment of the present application;

FIG. 18 is a schematic diagram of a calculation method of a redundancy reduction cross in a radix-2 INTT according to an embodiment of the present application;

FIG. 19 is a schematic structural diagram of a data processing apparatus 800 according to an embodiment of the present application;

FIG. 20 is a schematic structural diagram of a computing device 900 according to an embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

Some term concepts related to the embodiments of the present application are explained below.

(1) Forward number theoretic transform (number theoretic transform, NTT)

Let the positive integer $n$ be a power of 2, and let $q$ be a prime satisfying $q \equiv 1 \bmod 2n$. Let $\omega$ be a primitive $n$-th root of unity in $\mathbb{Z}_q$, so that $\omega^n \equiv 1 \bmod q$ and the powers $\omega^0, \omega^1, \ldots, \omega^{n-1}$ are pairwise distinct modulo $q$. Define the polynomial ring $R_q$ and a polynomial $a(x) = \sum_{i=0}^{n-1} a_i x^i \in R_q$, where $a_i \in \mathbb{Z}_q$ are the coefficients of $a(x)$.

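Substituting $\omega^0, \omega^1, \ldots, \omega^{n-1} \bmod q$ into the polynomial $a(x)$ gives:

$$\hat{a}_j = a(\omega^j) = \sum_{i=0}^{n-1} a_i \omega^{ij} \bmod q, \quad j = 0, 1, \ldots, n-1.$$

Let $\hat{a}(x) = \sum_{j=0}^{n-1} \hat{a}_j x^j$; then $\hat{a}(x) = \mathrm{NTT}(a(x))$ is defined as the NTT of the polynomial $a(x)$.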

(2) Inverse number theoretic transform (inverse NTT, INTT)

The INTT is the inverse transform of the NTT, defined by $a(x) = \mathrm{INTT}(\mathrm{NTT}(a(x)))$.
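The two definitions above can be checked with a direct, quadratic-time Python sketch. The function names `ntt_naive` and `intt_naive` are illustrative, and the toy parameters n = 4, q = 17, ω = 4 are chosen only for the example. The inverse uses the standard formula with ω⁻¹ and n⁻¹ modulo q, which is a textbook fact rather than something stated in this section.

```python
def ntt_naive(a, omega, q):
    """Evaluate a(x) at omega^0 .. omega^(n-1) modulo q (direct definition)."""
    n = len(a)
    return [sum(a[i] * pow(omega, i * j, q) for i in range(n)) % q
            for j in range(n)]


def intt_naive(a_hat, omega, q):
    """Invert ntt_naive using omega^-1 and n^-1 modulo q (needs Python 3.8+)."""
    n = len(a_hat)
    n_inv = pow(n, -1, q)
    omega_inv = pow(omega, -1, q)
    return [n_inv * sum(a_hat[j] * pow(omega_inv, i * j, q) for j in range(n)) % q
            for i in range(n)]


a = [1, 2, 3, 4]                       # coefficients a_0 .. a_3
a_hat = ntt_naive(a, omega=4, q=17)    # 4 is a primitive 4th root of unity mod 17
print(a_hat, intt_naive(a_hat, omega=4, q=17))
```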

(3) Primitive root

Let $\mathbb{Z}_q$ be a finite field. Given a positive integer $g$ and a positive integer $q$ ($g, q \ge 2$) that are coprime, if $n > 1$ is the smallest integer such that $g^n \equiv 1 \bmod q$, that is, $g^k \not\equiv 1 \bmod q$ for every integer $k$ with $1 \le k \le n-1$, then $g$ is called a primitive $n$-th root of unity in $\mathbb{Z}_q$.
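The definition translates directly into a brute-force check. The helper name `is_primitive_root_of_unity` is hypothetical, and the linear scan is meant for small examples only.

```python
def is_primitive_root_of_unity(g, n, q):
    """Return True if n > 1 is the smallest exponent with g^n = 1 (mod q)."""
    if n <= 1 or pow(g, n, q) != 1:
        return False
    # No smaller positive exponent may give 1.
    return all(pow(g, k, q) != 1 for k in range(1, n))


print(is_primitive_root_of_unity(4, 4, 17))    # 4 has order 4 modulo 17
print(is_primitive_root_of_unity(16, 4, 17))   # 16 has order 2, so this fails
```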

(4) Twiddle factor (twiddle factor)

The twiddle factor originally refers to the complex constant multiplied in the butterfly operation of the Cooley-Tukey fast Fourier transform (FFT) algorithm. Because this constant lies on the unit circle in the complex plane, multiplying by it rotates the multiplicand in the complex plane, which is why the constant is called a twiddle factor. The term was later extended to refer to any constant multiplier in the FFT, including variants of the FFT algorithm. The name comes from W. M. Gentleman and G. Sande, "Fast Fourier transforms-for fun and profit," Proc. AFIPS 29, pp. 563-578, 1966, and has since been used widely in the literature. For the number theoretic transform, in terms of the NTT formula described in (1) above, the twiddle factors are, for example, $\omega^k \bmod q$ ($k = 0, 1, \ldots, n-1$).

(5) Stage (stage)

A stage consists of all processing steps of the number theoretic transform within the same time period.

FIG. 1 is a flowchart of an NTT provided in an embodiment of the present application. The NTT shown in FIG. 1 is divided into 3 stages, referred to as stage 1, stage 2, and stage 3 in chronological order (that is, from left to right). Stage 1 includes all processing steps of the NTT during time period 1, stage 2 includes all processing steps of the NTT during time period 2, and so on.

FIG. 2 is a flowchart of an INTT provided in an embodiment of the present application. The INTT shown in FIG. 2 is divided into 3 stages, referred to as stage 1, stage 2, and stage 3 in chronological order. Stage 1 includes all processing steps of the INTT during time period 1, stage 2 includes all processing steps of the INTT during time period 2, and so on.

(6) Butterfly (butterfly)

Given an input polynomial with n coefficients (n ≥ 2, usually a power of 2), each stage of the number theoretic transform applies a regular pattern of processing to disjoint groups of m data items (m ≥ 2, usually a power of 2). This processing is called a butterfly operation, or simply a butterfly. The butterfly is typically the main computational element of the number theoretic transform. One stage of the number theoretic transform includes one or more butterflies. For example, in the NTT shown in FIG. 1, stage 1 includes 4 butterflies, stage 2 includes 2 butterflies, and stage 3 includes 1 butterfly. In the INTT shown in FIG. 2, stage 1 includes 1 butterfly, stage 2 includes 2 butterflies, and stage 3 includes 4 butterflies.

(7) Cross (cross)

A cross is a computing unit that takes k data items as input and outputs k data items. A butterfly that processes m data items includes one or more crosses. Here 2 ≤ m ≤ n, where m is usually a power of 2, and 2 ≤ k ≤ m, where k is usually a power of 2. If k = 2 and m is a power of k, the NTT/INTT can be divided into log₂ n stages, and such an NTT/INTT is referred to as a radix-2 NTT/INTT. Unless otherwise specified, NTT/INTT in this specification refers to the radix-2 NTT/INTT.

Different butterflies in the same NTT or INTT may include different numbers of crosses. As shown in FIG. 1, each butterfly in stage 1 of the NTT includes 1 cross, each butterfly in stage 2 of the NTT includes 2 crosses, and the butterfly in stage 3 of the NTT includes 4 crosses.

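FIG. 3 shows the calculation process of one cross in one butterfly of the radix-2 NTT provided in the embodiment of the present application. The calculation formula of one cross in one butterfly of the radix-2 NTT is:

$$(x, y) \mapsto (x + y \times w \bmod q,\; x - y \times w \bmod q)$$

FIG. 4 shows the calculation process of one cross in one butterfly of the radix-2 INTT provided in the embodiment of the present application. The calculation formula of one cross in one butterfly of the radix-2 INTT is:

$$(x, y) \mapsto (x + y \bmod q,\; (x - y) \times w \bmod q)$$

Here x and y denote the input data, w denotes the twiddle factor, and q denotes the modulus.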
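To make the stage, butterfly, and cross structure concrete, the following Python sketch implements a textbook iterative radix-2 NTT in which the three nested loops correspond to stages, butterflies, and crosses. It is a plain, fully reduced version (every cross ends with `% q`), not the branch-free redundant scheme of the application; the function names and the bit-reversed input ordering are choices made for this example.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def ntt_radix2(a, omega, q):
    """Radix-2 NTT of a (length n = power of two); omega is a primitive n-th root mod q."""
    n = len(a)
    bits = n.bit_length() - 1
    out = [a[bit_reverse(i, bits)] for i in range(n)]   # reorder input, keep a intact
    m = 2
    while m <= n:                                       # one iteration per stage
        w_m = pow(omega, n // m, q)                     # primitive m-th root of unity
        for start in range(0, n, m):                    # one butterfly over m values
            w = 1
            for j in range(m // 2):                     # one cross on a pair (x, y)
                x = out[start + j]
                y = out[start + j + m // 2]
                p = y * w % q
                out[start + j] = (x + p) % q
                out[start + j + m // 2] = (x - p) % q
                w = w * w_m % q
        m *= 2
    return out


print(ntt_radix2([1, 2, 3, 4], omega=4, q=17))
```

With n = 8 this loop structure yields 4 butterflies of 1 cross in stage 1, 2 butterflies of 2 crosses in stage 2, and 1 butterfly of 4 crosses in stage 3, matching the counts given for FIG. 1.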

(8) Logic branching

The logical branch generally includes one or more predicate conditions and a processing step corresponding to each predicate condition. When the computing device is to execute a logic branch during the operation of the computing device, the computing device determines whether the determination condition in the logic branch is satisfied according to the current operation condition. If the computing device determines that the running condition meets a certain judgment condition in the logic branches, the computing device executes a processing step corresponding to the judgment condition.

(9) Overflow (overflow, also known as value out of range)

Overflow refers to the number of bits of a processing result produced by a computing device exceeding the machine word length of the computing device. For example, the machine word length of the computing device is 32, and the processing result is 33 bits of data, which is an overflow. When overflow occurs, the computing device transforms the processing result to obtain data with a bit number within the range of the word length of the machine, and then continues processing based on the transformed data, resulting in operation errors. Therefore, it is necessary to avoid overflow.
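Python integers do not overflow, so the effect has to be simulated. The sketch below masks a product to 32 bits, as an unsigned 32-bit machine word would, and shows that the wrapped value no longer has the correct remainder modulo q. The names `WORD_BITS`, `MASK`, and `wrapping_mul` and the sample values are illustrative only.

```python
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1


def wrapping_mul(a, b):
    """Multiply like an unsigned 32-bit machine word, discarding the high bits."""
    return (a * b) & MASK


q = 12289
a, b = 100000, 70000                  # the exact product needs 33 bits
exact = (a * b) % q                   # remainder of the true product
wrapped = wrapping_mul(a, b) % q      # remainder after the silent wrap-around
print(exact, wrapped)                 # the two remainders differ
```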

(10) Pre-calculation

Pre-calculation is a way to speed up processing tasks. It refers to performing some processing steps in advance, before the processing task is executed, and storing the results in one location. That location is generally referred to as a pre-calculation table or lookup table (LUT). While executing the processing task, the computing device can look up the pre-calculated results in the table and use them directly instead of performing the pre-calculated steps on the fly, which shortens the time needed to complete the task and improves efficiency.
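A typical use of pre-calculation in this context is a table of twiddle factors. The sketch below, with the hypothetical helper name `build_twiddle_table`, fills such a table once so that later processing only performs lookups.

```python
def build_twiddle_table(omega, n, q):
    """Pre-calculate omega^k mod q for k = 0 .. n-1."""
    table = [1] * n
    for k in range(1, n):
        table[k] = table[k - 1] * omega % q
    return table


TWIDDLES = build_twiddle_table(omega=4, n=4, q=17)   # computed once, ahead of time
print(TWIDDLES[3])                                   # later: a lookup, not a power
```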

(11) Instant computation

Instant computation (on-the-fly computation) is the counterpart of pre-calculation. It refers to the processing performed while the processing task itself is being executed.

(12) Modulus and modulo arithmetic

Given integers a and q with q ≥ 1, compute a ÷ q (also written a/q). If the remainder equals r (0 ≤ r < q), then r is the remainder of a divided by q, and q is the modulus. The process of solving for r, which may be written as r = a mod q (or r = a % q), where mod and % are modulo operators, is called the modulo operation.

(13) Modulo multiplication processing

Given integers a, b, and q (q ≥ 1), the process of calculating a×b mod q (also written ab mod q, a×b % q, or ab % q) is called modular multiplication processing, or modular multiplication for short.

(14) Redundancy

If the integers a and b are congruent with respect to the modulus q, that is, b ≡ a mod q, where 0 ≤ a < q and b ≥ a, then a is the remainder of b modulo q (the remainder of b divided by q equals a), and b is said to be numerically redundant with respect to the modulus q, or simply redundant.

(15) Redundancy factor

The redundancy factor indicates the magnitude of the numerical redundancy of the data with respect to the modulus. Expressed mathematically, if the integer b, the integer a, and the modulus q satisfy b = (k-1)×q + a, where 0 ≤ a < q and the integer k ≥ 1, then b is said to have k-fold redundancy relative to the modulus q, or simply k-fold redundancy; that is, the redundancy multiple is k. 1-fold redundancy is equivalent to b = a.
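Under this definition, the redundancy multiple of a non-negative value follows from one integer division. The helper name `redundancy_multiple` is illustrative.

```python
def redundancy_multiple(b, q):
    """Return k such that b = (k - 1) * q + a with 0 <= a < q."""
    if b < 0 or q < 1:
        raise ValueError("expected b >= 0 and q >= 1")
    return b // q + 1


b, q = 40, 17
k = redundancy_multiple(b, q)          # 40 = 2 * 17 + 6, so the value is 3-fold redundant
print(k, b == (k - 1) * q + b % q)
```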

(16) Reduction treatment

Reduction processing is a generic term for modulo operations and congruence operations.

A modulo operation determines the remainder of a data item relative to a modulus. Expressed mathematically, given an integer a and an integer q, the modulo operation determines the remainder of a divided by q. By applying the modulo operation to input data whose value is larger than the modulus, the computer reduces the data into the range of the modulus, which limits the value of the data and avoids the overflow that an excessively large value would cause. Modulo operations include modular addition, modular subtraction, modular multiplication, and modular division. Modular addition determines the remainder of the sum of two data items relative to a modulus, that is, a+b mod q. Modular subtraction determines the remainder of the difference of two data items relative to a modulus, that is, a-b mod q. Modular multiplication determines the remainder of the product of two data items relative to a modulus, that is, a×b mod q. Modular division determines the remainder of the ratio of two data items relative to a modulus.

Congruence means that two integers leave the same remainder when divided by the same modulus. Expressed mathematically, the integer a and the integer b are congruent with respect to the modulus q, commonly denoted b ≡ a mod q, where 0 ≤ a < q and b ≥ a. A congruence operation determines a value that is congruent to a given data item and is no larger than it. Expressed mathematically, given an integer a and an integer q, the congruence operation is the process of determining a value that is congruent to a with respect to q and is less than or equal to a.

For the number-theory transformation, the integer a is the input data of the number-theory transformation, and the integer q is the modulus (i.e., the parameter corresponding to the data).

(17) Redundancy modular multiplication (also known as fast redundancy modular multiplication, lazy modulo multiplication)

Redundancy modular multiplication is a specific implementation of modular multiplication. Redundancy modular multiplication refers to determining the modular multiplication result of x and y with respect to the modulus q by the following formula.

Here r denotes the modular multiplication result, x and y denote the input data and are positive integers, β is a positive integer, q denotes the modulus, q < β/2, and y < q.

Based on the above formula, it can be deduced that r = x×y mod 2q, that is, r is congruent to x×y modulo q and lies in the range [0, 2q).

Compared with ordinary modular multiplication (that is, x×y mod q), redundancy modular multiplication has two main characteristics. First, because its calculation formula exploits certain properties of the computer, a computer can perform redundancy modular multiplication faster than ordinary modular multiplication, which improves the efficiency of modular multiplication. Second, the result of redundancy modular multiplication has 2-fold redundancy relative to the result of ordinary modular multiplication: the result of ordinary modular multiplication lies in [0, q), whereas the result of redundancy modular multiplication lies in [0, 2q). In other words, redundancy modular multiplication gives up a fully reduced result in exchange for higher calculation speed.
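The formula itself is not reproduced in the text above. The following Python sketch therefore assumes a well-known scheme that matches the stated constraints (q < β/2, y < q, result in [0, 2q)): a quotient estimate y' = ⌊y×β/q⌋ is pre-calculated for the known operand y, and the product is corrected with a single multiplication by q. The names `BETA`, `precompute_quotient`, and `lazy_mulmod` are hypothetical, β is taken as 2^64, and whether this is exactly the formula of the application is an assumption.

```python
BETA = 1 << 64      # ASSUMPTION: beta is the range of a 64-bit machine word


def precompute_quotient(y, q, beta=BETA):
    """Pre-calculate y' = floor(y * beta / q) for a known operand y < q."""
    return (y * beta) // q


def lazy_mulmod(x, y, y_prime, q, beta=BETA):
    """Return r congruent to x*y modulo q with 0 <= r < 2q, for 0 <= x < beta."""
    quotient = (x * y_prime) // beta     # estimate of floor(x * y / q), low by at most 1
    return (x * y - quotient * q) % beta # on hardware: low word of the difference


q, y = 17, 7
y_prime = precompute_quotient(y, q)      # done once, e.g. for a twiddle factor
r = lazy_mulmod(1000, y, y_prime, q)
print(r, r % q == (1000 * y) % q)
```

The function contains no comparison or conditional subtraction; the price is that the result may be off from the fully reduced value by one multiple of q.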

For the number theoretic transform, when redundancy modular multiplication is implemented, both x and y in the redundancy modular multiplication formula may be input data; alternatively, one of x and y is input data and the other is a twiddle factor. Redundancy modular multiplication is typically applied when the value of one multiplicand is already known before the instant computation phase of the number theoretic transform and that multiplicand is smaller than the modulus.

(18) Montgomery algorithm

The Montgomery algorithm is a commonly used algorithm for fast modular multiplication of positive integers. Its basic idea is to convert the computation of $xy \bmod q$ into the computation of $xyr^{-1} \bmod q$, where $r > q$, $\gcd(r, q) = 1$, and $rr^{-1} \equiv 1 \bmod q$. By the extended Euclidean algorithm, there exists a positive integer $q'$ such that $rr^{-1} - qq' = 1$, so $rr^{-1} \equiv 1 \bmod q$ and $qq' \equiv -1 \bmod r$.

(19) Montgomery representation

For a positive integer $x$, the Montgomery algorithm computes $x' \equiv xr^{-1} \bmod q$. Therefore, when $x \bmod q$ needs to be calculated, one can select an appropriate $r$, replace $x$ with $x \times r$, and then invoke the Montgomery algorithm, which outputs the value of $x \bmod q$. The value $x \times r$ is known as the Montgomery representation of $x$. The Montgomery representation of $x$ may be redundant; for example, it may equal $x \times r + i \times q$ (integer $i \ge 0$).
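The relationship between r, q', and the Montgomery representation can be illustrated with the classical Montgomery reduction (REDC). This is the textbook algorithm, not a procedure taken from the application; the function names are illustrative, r is assumed to be a power of two, and q is assumed to be odd so that gcd(r, q) = 1. Note that the classical version ends with a conditional subtraction.

```python
def montgomery_setup(q, r_bits):
    """Return (r, q_prime) with r = 2**r_bits > q and q * q_prime = -1 (mod r)."""
    r = 1 << r_bits
    q_prime = (-pow(q, -1, r)) % r       # modular inverse needs Python 3.8+
    return r, q_prime


def to_montgomery(x, q, r_bits):
    """Montgomery representation x * r mod q."""
    return (x << r_bits) % q


def redc(value, q, r_bits, q_prime):
    """Return value * r^-1 mod q for 0 <= value < r * q."""
    r_mask = (1 << r_bits) - 1
    m = ((value & r_mask) * q_prime) & r_mask
    t = (value + m * q) >> r_bits        # exact division: value + m*q is a multiple of r
    return t - q if t >= q else t        # classical final conditional subtraction


q, r_bits = 17, 5
r, q_prime = montgomery_setup(q, r_bits)
x_m, y_m = to_montgomery(5, q, r_bits), to_montgomery(7, q, r_bits)
product_m = redc(x_m * y_m, q, r_bits, q_prime)   # Montgomery form of 5 * 7
print(redc(product_m, q, r_bits, q_prime))        # back to the ordinary value 5*7 mod 17
```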

(20) Word length

A word refers to a set of binary numbers that are accessed, transferred, and processed as a whole in a computer. The number of binary digits in a word is called the word length.

(21) Machine word length

The machine word length refers to the number of bits of binary data that can be processed by a processor for an integer operation, and is typically the width of the data channel within the processor. For example, a 32-bit processor has a machine word length of 32 and a 64-bit processor has a machine word length of 64.

(22) Instruction word length

The instruction word length refers to the total number of bits of binary code in a machine instruction. The instruction word length depends on the length of the opcode, the length of an operand address, and the number of operand addresses. Different instructions have different word lengths.

(23) Data word length

The data word length refers to the number of bits occupied by stored data.

(24) Polynomial dimension

The polynomial dimension is related to the degree (i.e., order) of the polynomial. Some lattice-based cryptographic algorithms are built on polynomials with finite-field coefficients. Such algorithms define a polynomial ring $R_q$ in which the degree of a polynomial is $n-1$; the dimension is set to $n$ ($n$ is a power of 2), and the prime $q$ satisfies $q \equiv 1 \bmod 2n$. In the embodiments of the application, the coefficients of the polynomials participating in NTT and INTT calculation are all on the ring $R_q$ after reduction modulo $q$.

(25) Post-quantum cryptography (PQC)

Post-quantum cryptography refers to cryptographic algorithms, in particular public key (asymmetric) encryption algorithms, that are specifically designed to resist attacks by quantum computers. Some PQC algorithms, such as lattice-based cryptography, study the properties of lattices, i.e., discrete additive subgroups of the n-dimensional space $R^n$. This mathematical object has many applications, and several problems defined on it are known as "lattice problems", such as the shortest vector problem and the closest vector problem. Many lattice-based cryptographic systems rely on the difficulty of these problems. Lattice-based cryptographic algorithms require a large number of polynomial computations, among which the number theory transformation is one of the most important.

(26) Homomorphic encryption

Homomorphic encryption is a form of encryption that allows a user to perform calculations on data under encryption without first decrypting. The results of homomorphic encryption calculations remain in encrypted form, and when decrypted, the results of the calculations are identical to the output results of the calculations on the unencrypted data.

A homomorphic encryption scheme focuses on the security of data processing and provides the ability to process encrypted data. Homomorphic encryption schemes are characterized by allowing mathematical or logical operations to be performed on data while it remains encrypted. Homomorphism here refers to the notion of homomorphism in algebra: the encryption and decryption functions can be regarded as homomorphisms between the plaintext space and the ciphertext space.
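As a toy illustration of the homomorphic property, textbook RSA without padding is multiplicatively homomorphic: the product of two ciphertexts decrypts to the product of the two plaintexts. This sketch is not a scheme described in this application, it is insecure, and the parameters and function names below are hypothetical and far too small for real use.

```python
# Toy parameters (hypothetical; chosen only so the arithmetic is easy to follow).
P, Q = 61, 53
N = P * Q
E = 17
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent; requires Python 3.8+


def encrypt(m: int) -> int:
    """Textbook RSA encryption of a message m < N."""
    return pow(m, E, N)


def decrypt(c: int) -> int:
    """Textbook RSA decryption."""
    return pow(c, D, N)


def multiply_ciphertexts(c1: int, c2: int) -> int:
    """Combine two ciphertexts without decrypting either of them."""
    return (c1 * c2) % N
```

Decrypting `multiply_ciphertexts(encrypt(6), encrypt(7))` gives the same value as multiplying 6 by 7 in the clear, provided the product is smaller than N.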

(27) Fully homomorphic encryption (FHE)

Fully homomorphic encryption allows any operation that can be performed on plaintext to be performed on encrypted data without decryption, so the computation can be carried out by an untrusted party without revealing the inputs or internal state. Based on this property, fully homomorphic encryption can be used for privacy-preserving outsourced storage and computation, and operations such as retrieval and comparison can be performed on encrypted data and yield correct results without decrypting the data at any point in the process. Its significance lies in solving the data security problem that arises when data and computation on the data are entrusted to a third party, for example in cloud computing scenarios.

In classical number-theory transformation algorithms, the computing device needs to run some logical branch statements, resulting in inefficiency in running the number-theory transformation.

If the number theory transformation is likened to a road, and the computing device running the number theory transformation is likened to a vehicle travelling on that road, then introducing logical branch statements is equivalent to building a road with crossroads: each time the vehicle reaches a crossroads, it has to stop and decide which direction to continue in and which branch to enter, which slows the journey. Similarly, if logical branch statements are introduced into the number theory transformation, the computing device takes too long to run it. Research shows that when a classical number theory transformation is run on a central processing unit (CPU), the running time of any one logical branch statement is about 15% of the total running time of the number theory transformation. The presence of logical branch statements therefore makes the running time too long and greatly reduces the efficiency with which the computing device runs the number theory transformation.

The main reason conventional number theory transformations introduce logical branch statements is as follows. A number theory transformation generally includes a large number of additions and multiplications, so the values of the data grow as the transformation proceeds, and the number of bits needed to represent the data in the computing device grows with them, which creates a risk of overflow. Logical branch statements are therefore introduced whose condition tests whether the value of the data currently being processed exceeds a set upper limit; if it does, the data is reduced, and the subsequent processing steps are executed on the reduced data. In this way, while running the number theory transformation, the computing device performs reduction processing at suitable positions, so that the value of the data becomes smaller, the number of bits used to represent the data is reduced, the number of bits is kept from exceeding the upper limit that the computing device can represent, and overflow is avoided.
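A minimal sketch of the conventional, branch-based approach described above is shown here for a single addition. The function name `add_with_branch` and the parameter `limit_bits` are hypothetical; Python integers do not overflow, so the bit limit is only simulated.

```python
def add_with_branch(x: int, y: int, q: int, limit_bits: int = 63) -> int:
    """Add two values and reduce modulo q only if the sum reaches the bit limit."""
    s = x + y
    if s.bit_length() >= limit_bits:   # data-dependent logical branch
        s %= q                         # reduction processing
    return s
```

The test on `s` is evaluated every time the function runs, whether or not a reduction turns out to be needed; this is the per-operation cost that the passage attributes to logical branch statements.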

Based on the above analysis, some embodiments of the application provide a data processing method that supports a number theory transformation free of logical branch statements. In the method, the positions in the number theory transformation at which overflow may occur are found from parameters related to the data, and the computing units corresponding to those positions are taken as the computing units that need to perform reduction processing. During the number theory transformation, reduction processing is then performed by the computing units identified in advance, so that reduction is carried out in time without introducing logical branch statements, and overflow is avoided.

The method provided by the embodiment is equivalent to planning, before the road is built, the positions at which reduction processing is needed, so that a road without crossroads can be constructed. While running the number theory transformation, the computing device no longer spends time at each crossroads deciding whether reduction processing is currently needed; it simply performs reduction processing at the positions planned in advance, which significantly speeds up the number theory transformation. The method therefore solves the problem of low running efficiency caused by introducing logical branch statements in the prior art, improves the way the computing device runs the number theory transformation, reduces its running time, improves its efficiency, and extends the scenarios to which the number theory transformation is applicable.

The application scenario of the embodiment of the present application is illustrated below.

The embodiment of the application can be applied to the scenes of data encryption and decryption, such as encryption transmission of data, privacy calculation, generation of a secret key, identity authentication and the like. Optionally, the embodiment of the application is applied to a scene of encrypting and decrypting based on the PQC or FHE. The scheme of data encryption and decryption is usually implemented based on a cryptographic algorithm, and the cryptographic algorithm, especially the PQC algorithm, usually requires the use of a number theory transformation. By the method provided by the embodiment of the application, the running of the number theory transformation can be accelerated, so that the overall speed of the encryption and decryption scheme is improved.

For example, in the data encryption transmission scenario, after the transmitting end obtains the plaintext data to be encrypted, the plaintext data is encrypted by the PQC algorithm to obtain ciphertext data, and the ciphertext data is transmitted to the receiving end. And the receiving end of the data receives the ciphertext data, decrypts the ciphertext data through a PQC algorithm, and obtains plaintext data. The data is transmitted in the form of ciphertext on a link from a transmitting end to a receiving end, so that the safety is improved.

In the above scenario, the number theory transformation is, for example, a module in the PQC algorithm. In the process of encrypting the plaintext data through the PQC algorithm, the transmitting end executes the method provided by the embodiment to perform the number theory transformation on the plaintext data, and performs the other steps of the PQC algorithm on the transformed plaintext data to obtain ciphertext data. In the process of decrypting the ciphertext data through the PQC algorithm, the receiving end executes the method provided by the embodiment to perform the number theory transformation on the ciphertext data, and performs the other steps of the PQC algorithm on the transformed ciphertext data to obtain plaintext data. The method provided by the embodiment can increase the speed of the number theory transformation and thereby the speed of the PQC algorithm.

Particularly, in some situations of encrypting transmission data in a delay-sensitive network, the running speed of many cryptographic algorithms is low at present, so that the delay of the encrypting transmission data is large, and the requirement of two communication parties on the delay is difficult to meet. By the method provided by the embodiment, the encryption and decryption time delay of the data at the sending end and the receiving end can be reduced, and the requirements of both communication parties on the time delay can be met.

In an exemplary scenario, according to some international standards, a public key signature algorithm can be used between different nodes of a power grid to secure data transmission. However, existing public key signature algorithms have a relatively large delay and cannot meet the delay required by the standards. If a lattice-based public key signature algorithm becomes a new-generation cryptographic standard algorithm in the future, the method provided by some embodiments of the application can improve the performance of running the number theory transformation and thus the speed of running the lattice-based public key signature algorithm, so that the algorithm can meet the communication delay requirements of the relevant international standards and may be adopted by those standards to protect power grid data.

The number theory transformation can take a variety of product forms. In one possible implementation, the product form of the number theory transformation is software. For example, the number theory transformation is a piece of program code, and a CPU (such as a 32-bit CPU or a 64-bit CPU) reads and executes the program code during encryption and decryption so as to run the number theory transformation.

In another possible implementation, the product form of the number theory transformation is hardware; for example, a dedicated processor undertakes the number theory transformation. For example, the dedicated processor is a processor dedicated to encryption and decryption, such as an encryption chip (also called an encryption coprocessor or a security chip), which performs the number theory transformation during encryption and decryption. For another example, a dedicated processor assists the CPU in encrypting and decrypting data, and the dedicated processor and the CPU cooperate to complete the encryption and decryption operations. In one possible implementation, when the CPU needs to encrypt or decrypt data, the CPU transmits parameters related to the data to the dedicated processor; the dedicated processor performs the number theory transformation on the data based on those parameters and returns the transformed data to the CPU; the CPU then continues the encryption or decryption steps based on the transformed data. In this way, the number theory transformation is offloaded from the CPU to the dedicated processor, which reduces the computational load of the CPU and improves the encryption and decryption speed of the CPU.

Fig. 5 is a schematic diagram of a principle of a polynomial multiplication calculated using a number-theory transform (NTT) and its Inverse (INTT) according to an embodiment of the present application. Some encryption and decryption schemes, such as PQC and FHE, can represent data such as a key, a ciphertext, and a plaintext by using a polynomial, so that processing procedures in the encryption and decryption scheme, such as key generation, encryption, decryption, and ciphertext processing, can be accelerated by using a number theory transformation.

As shown in Fig. 5, let the positive integer $n$ be a power of 2, and let a prime $q$ satisfying $q \equiv 1 \bmod 2n$ be given. Let $\omega$ be a primitive $n$-th root of unity over $\mathbb{Z}_q$; then $\psi = \sqrt{\omega}$ is a primitive $2n$-th root of unity over $\mathbb{Z}_q$. Define the polynomial ring $R_q = \mathbb{Z}_q[x]/(x^n + 1)$ and two polynomials $a, b \in R_q$. Let $a = (a[0], a[1], \ldots, a[n-1])$ and $b = (b[0], b[1], \ldots, b[n-1])$ be the coefficient vectors of $a$ and $b$, and define two further vectors $\bar{a} = (a[0], \psi a[1], \ldots, \psi^{n-1} a[n-1])$ and $\bar{b} = (b[0], \psi b[1], \ldots, \psi^{n-1} b[n-1])$. Computing the polynomial multiplication $c = ab$ is equivalent to computing the negative wrapped convolution of $a$ and $b$, i.e., $c = (1, \psi^{-1}, \ldots, \psi^{-(n-1)}) \odot \mathrm{INTT}(\mathrm{NTT}(\bar{a}) \odot \mathrm{NTT}(\bar{b}))$, where $\odot$ denotes the Hadamard product. Here $a$, $b$, and $c$ are polynomial coefficients, for example data such as keys, ciphertext, and plaintext in PQC and FHE schemes.
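The negative wrapped convolution can be checked on a small example. The sketch below assumes the ring is Z_q[x]/(x^n + 1), uses a naive quadratic-time transform instead of a butterfly network, and takes psi as a primitive 2n-th root of unity modulo q; all function names are hypothetical, and the toy values n = 4, q = 17, psi = 9 are chosen only for illustration.

```python
def ntt(vec, omega, q):
    """Naive forward transform: out[i] = sum_j vec[j] * omega^(i*j) mod q."""
    n = len(vec)
    return [sum(vec[j] * pow(omega, i * j, q) for j in range(n)) % q
            for i in range(n)]


def intt(vec, omega, q):
    """Naive inverse transform, including the factor n^-1 mod q."""
    n = len(vec)
    n_inv = pow(n, -1, q)
    omega_inv = pow(omega, -1, q)
    return [n_inv * sum(vec[j] * pow(omega_inv, i * j, q) for j in range(n)) % q
            for i in range(n)]


def negacyclic_multiply(a, b, psi, q):
    """Multiply a and b modulo x^n + 1 and q via NTT, Hadamard product, INTT."""
    n = len(a)
    omega = psi * psi % q
    a_bar = [a[i] * pow(psi, i, q) % q for i in range(n)]
    b_bar = [b[i] * pow(psi, i, q) % q for i in range(n)]
    prod = [x * y % q for x, y in zip(ntt(a_bar, omega, q), ntt(b_bar, omega, q))]
    c_bar = intt(prod, omega, q)
    psi_inv = pow(psi, -1, q)
    return [c_bar[i] * pow(psi_inv, i, q) % q for i in range(n)]


def schoolbook_negacyclic(a, b, q):
    """Reference result: direct multiplication with x^n replaced by -1."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c
```

With q = 17 and psi = 9 (so that psi^4 ≡ -1 mod 17), `negacyclic_multiply` and `schoolbook_negacyclic` return the same coefficient vector for any two length-4 inputs.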

Fig. 6 is a flowchart of a data processing method according to an embodiment of the present application. The method shown in Fig. 6 is performed by a computing device that runs a number theory transformation on data, where the number theory transformation includes a plurality of computing units. The method shown in Fig. 6 includes the following steps S201 to S202.

Step S201, the computing device determines, based on the parameters of the data, the estimated bit number of the processing result generated by each computing unit.

The data is the input data of the number theory transformation. The number theory transformation includes at least one of a forward number theory transformation (NTT) or an inverse number theory transformation (INTT). The data is, for example, polynomial coefficients. The data is, for example, data to be encrypted or data to be decrypted. Optionally, the data is plaintext, ciphertext, or data required to generate a key.

The above parameters indicate the number of bits of the data. For example, the data is represented internally by the computing device in the form of a binary sequence, and the parameters indicate the length of the binary sequence.

The parameters are obtained for the following reason. The number of bits of the input data affects the number of bits of the processing result generated by each computing unit in the number theory transformation, and therefore affects which computing units may overflow, i.e., generate a processing result whose number of bits exceeds the range that the hardware can represent. Obtaining the parameters therefore helps determine more accurately how many bits the input data of the number theory transformation has, so that the number of bits of the processing result generated by each computing unit during the number theory transformation can be estimated and the computing units that need to perform reduction processing can be located.

In some embodiments, the parameters include the modulus used when each of the plurality of computing units performs a modulo operation, the redundancy factor of the data relative to the modulus, and the polynomial dimension of the data. Optionally, the parameters further include the number of bits of the processor.

The redundancy factor indicates the degree of redundancy of the data with respect to the modulus. For example, if the redundancy factor is 1, it indicates that the value of the data ranges from 0 to modulus, which is equivalent to the value of the data without redundancy; if the redundancy factor is 2, the value range of the data is between 0 and two times of the modulus; and so on, if the redundancy factor is k, the value range of the data is between 0 and k times of the modulus, and k is a positive integer.

The polynomial dimension indicates the number of stages included in the number theory transformation. For example, if the polynomial dimension is $n$, the number theory transformation has $\log_2 n$ stages in total.

The processor is hardware, such as a CPU, for running the number theory transformation. The number of bits of the processor is used to indicate the range of values of the data that the processor is capable of representing. The number of bits of the processor is, for example, the word size of the processor, such as the machine word size, the instruction word size, the data word size, the memory word size, or the like.

Alternatively, the above parameters include the number of bits of data. Alternatively, the parameter is the maximum value of the data or the range of the data.

How the above parameters are obtained includes a variety of implementations. In one possible implementation, the above parameters are provided by the user. For example, in the case where the computing device is a terminal, the above-described parameters are input on the terminal by the user, and then the terminal performs the subsequent flow based on the parameters input by the user; for another example, in the case that the computing device is a server, the above parameters are input on the terminal by the user, and then the terminal transmits the parameters input by the user to the server, and the server performs the subsequent procedure based on the parameters received from the terminal; in another possible implementation, the above parameters are pre-stored in the computing device. For example, the parameters are pre-programmed into the processor responsible for running the number theory transformation.

The computation unit corresponds to a component or a data processing unit in the number theory transformation. For example, one calculation unit is used to perform addition processing, subtraction processing, and modular multiplication processing including modulo arithmetic. Optionally, each of the plurality of computing units is configured to process k data to generate k processing results, where k is a positive integer.

The granularity of the computational unit includes a wide variety. In some embodiments, one computing unit is one or more stages. In other embodiments, one computing unit is one or more butterflies. In other embodiments, one computing unit is one or more intersections.

Taking the computing unit as an intersection, as shown in fig. 3, for the NTT, one computing unit is configured to perform the modular multiplication process first, and then perform the addition process and the subtraction process, to generate a processing result. As shown in fig. 4, for INTT, one calculation unit is used to perform addition processing and subtraction processing first, and then perform modular multiplication processing, producing a processing result.
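The two orders of operation can be sketched as follows. For clarity this sketch fully reduces every result modulo q, which is exactly the redundant work that the method described in this application aims to defer; the function names and the twiddle-factor argument `w` are hypothetical.

```python
def ntt_intersection(u: int, v: int, w: int, q: int):
    """NTT order: modular multiplication first, then addition and subtraction."""
    t = (v * w) % q
    return (u + t) % q, (u - t) % q


def intt_intersection(u: int, v: int, w_inv: int, q: int):
    """INTT order: addition and subtraction first, then modular multiplication."""
    return (u + v) % q, ((u - v) * w_inv) % q
```

Applying `intt_intersection` with the inverse twiddle factor to the output of `ntt_intersection` returns twice the original pair modulo q; the factor of 2 per stage is what the final multiplication by the inverse of n removes in a complete INTT.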

Optionally, the computing unit is software. For example, the number theory transformation is a piece of code, and the computing unit is a statement in the code. Alternatively, the computing unit is hardware. For example, the number theory transformation is implemented as a chip, and the computing unit is a processing circuit in the chip.

The estimated bit number indicates the number of bits of the processing result generated by a computing unit after processing the data. Taking input data of $2^{16}$ as an example, as shown in Fig. 7, after each computing unit in stage 1 of the NTT processes the data, the generated processing result has 59 bits; that is, the estimated bit number corresponding to each computing unit in stage 1 is 59. As shown in Fig. 8, after each computing unit in stage 1 of the INTT processes the data, the generated processing result has 62 bits or 60 bits; that is, the estimated bit number corresponding to each computing unit in stage 1 is 62 or 60.

In one possible implementation, a computing device determines a number of bits of data based on a modulus and a redundancy factor; the computing device determines a predicted bit number for each computing unit based on the bit number of the data and the bit number increment corresponding to each computing unit.

One possible way to determine the number of bits of the data is as follows: if the redundancy factor is 1, the number of bits of the modulus is determined as the number of bits of the data; if the redundancy factor is greater than 1, the number of bits of the product of the modulus and the redundancy factor is determined as the number of bits of the data. For example, if the modulus is $q$ and the redundancy factor is 1, indicating that the value of the data is less than the modulus, $\log_2 q$ is determined as the number of bits of the data; if the modulus is $q$ and the redundancy factor is $n$ ($n$ is a positive integer greater than 1), indicating that the value of the data is less than $n$ times the modulus, $\log_2(qn)$ is determined as the number of bits of the data. Because the value of the data lies between 0 and the product of the modulus and the redundancy factor, the number of bits of that product is the theoretical maximum number of bits of the data. Estimating the number of bits of the processing result from this maximum amounts to considering the worst case, which ensures that no overflow occurs.
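A minimal sketch of this worst-case bit count follows. Rounding the logarithm up to a whole number of bits by means of `int.bit_length` is an assumption of this sketch, and the function name `data_bits` is hypothetical.

```python
def data_bits(q: int, redundancy_factor: int = 1) -> int:
    """Bits needed to hold any value in [0, redundancy_factor * q)."""
    assert q > 1 and redundancy_factor >= 1
    largest_value = q * redundancy_factor - 1
    return largest_value.bit_length()
```

For the 14-bit modulus 12289, `data_bits(12289)` is 14 and `data_bits(12289, 2)` is 15, reflecting that a redundancy factor of 2 costs one additional bit.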

The bit number increment is the increase in the number of bits of the data after it is processed by a computing unit, i.e., the difference between the number of bits of the output result generated by the computing unit and the number of bits of the input data received by the computing unit. For example, if the computing unit is an addition unit that adds data x and data y, the sum of two values has theoretically at most 1 bit more than the operands, so 1 is used as the bit number increment of the addition unit. In one possible implementation, the correspondence between computing units and bit number increments is preset and stored, and the bit number increment is determined by looking up this correspondence.

The function of step S201 corresponds to estimating the number of bits of the processing result theoretically generated by each calculation unit after substituting the actual input data into the number-theory transformation, given the parameters related to the actual input data, and thus finding the calculation unit that theoretically would cause overflow.
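The estimate in step S201 can be sketched as a running sum, assuming for simplicity that the computing units form a linear chain and that no reduction occurs between them. The function name `estimate_bits` and the example increments are hypothetical and are not taken from the figures.

```python
def estimate_bits(input_bits: int, increments: list) -> list:
    """Estimated bit number of the result after each computing unit, in order."""
    estimates = []
    bits = input_bits
    for inc in increments:
        bits += inc              # each unit adds its preset bit number increment
        estimates.append(bits)
    return estimates
```

For instance, 60-bit input data passing through three units that each add one bit gives the estimates 61, 62, and 63.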

Step S202, the computing device determines a first computing unit from a plurality of computing units based on the estimated bit number.

The first calculation unit is a calculation unit for performing reduction processing on the processing result of the second calculation unit.

The second computing unit is one of the plurality of computing units. The estimated bit number of the processing result of the second computing unit satisfies the preset bit number. The processing result generated by the second computing unit is used as the input data of the first computing unit. In other words, the second computing unit is the computing unit immediately preceding the first computing unit, and the output of the second computing unit is the input of the first computing unit.

Optionally, the predetermined number of bits is a threshold, and the estimated number of bits of the processing result of the second calculation unit is greater than or equal to the threshold. Alternatively, the predetermined number of bits is a value, and the estimated number of bits of the processing result of the second calculation unit is equal to the value.

In some embodiments, the predetermined number of bits is the number of bits of the data when the overflow condition is satisfied. The function of the preset bit number is equivalent to providing an upper limit, if the estimated bit number of the processing result generated by one computing unit is found to reach the upper limit, determining that the next computing unit of the computing unit needs to perform reduction processing on the data when the NTT is actually operated, so as to avoid that the bit number of the processing result generated by the data is beyond the upper limit when the data is subjected to the number theory transformation.

Optionally, the predetermined number of bits is determined based on the number of bits of a processor in the computing device. Compared with setting the preset bit number according to experience, the preset bit number is determined by the hardware factor of the bit number of the processor, so that the preset bit number can be adapted to the capability of the hardware, different preset bit numbers can be respectively determined for the hardware with different capabilities (such as CPUs with different bit numbers), the calculation unit needing the reduction processing is searched based on the preset bit number, and the calculation unit needing the reduction processing can be more accurately positioned, so that unnecessary reduction processing is reduced.

In some embodiments, the preset bit number is 1 less than the number of bits of the processor. For example, if the processor responsible for running the number theory transformation is a 64-bit CPU, the preset bit number is set to 63; if it is a 32-bit CPU, the preset bit number is set to 31. Taking a preset bit number of 63 as an example, if it is estimated that the input data of a computing unit reaches 63 bits, reduction processing is performed by that computing unit, and the computing units before it do not need to perform reduction processing. In this way, overflow is avoided while the limit on the value of the data is relaxed as far as possible, so that the capability of the hardware is fully used, resource utilization is improved, and the number of reduction operations is reduced.

In other embodiments, it is contemplated that the number-wise transformation in the encryption and decryption scheme will generally correspond to an intermediate module, typically not the first or last module. If the preset bit number is set to be too large, when the output result of the whole number theory transformation enters the next module of the encryption and decryption scheme, a situation is likely to occur that the next module overflows due to executing the operation causing the increase of the value. If the preset bit number is set too small, the hardware capability may not be fully exerted, resulting in resource waste. Based on this, the preset number of bits is designed to be 2 less than the number of bits of the processor. For example, if the processor responsible for running the number-theory transformation is a 64-bit CPU, the preset number of bits is set to 62, and if the processor responsible for running the number-theory transformation is a 32-bit CPU, the preset number of bits is set to 30.

In this way, while avoiding overflow, a room is reserved for the next module of the number theory transformation, the processing result generated by the next module of the number theory transformation is allowed to continue to increase by one bit, and the risk of value out-of-range in the next module is reduced; in addition, the limitation on the value of the data is relaxed to a certain extent, and the resource utilization rate is improved.
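Step S202 can be sketched as an offline planning pass, again assuming a linear chain of computing units. The names `preset_bits` and `plan_reductions`, the `headroom` parameter, and the assumption that a reduction brings the data back to `reduced_bits` bits are all hypothetical.

```python
def preset_bits(processor_bits: int, headroom: int = 1) -> int:
    """Preset bit number: 1 (or 2) less than the number of bits of the processor."""
    return processor_bits - headroom


def plan_reductions(input_bits: int, increments: list,
                    preset: int, reduced_bits: int) -> list:
    """Return the indices of the units that must reduce their input data."""
    plan = []
    bits = input_bits                # estimated bits of the input to unit i
    for i, inc in enumerate(increments):
        if bits >= preset:           # the preceding result reached the preset bit number
            plan.append(i)           # unit i acts as a "first computing unit"
            bits = reduced_bits      # estimated bits after reduction processing
        bits += inc
    return plan
```

The comparison against the preset bit number is made once, during planning, on estimated bit numbers rather than on the data itself, so it does not appear in the transformation that later processes the data.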

In some embodiments, in the process of running the number theory transformation on the data, the computing device performs reduction processing on the processing result of the second computing unit through the first computing unit.

The effect of performing the reduction processing by the first calculation unit is that, on the one hand, the reduction processing can reduce the value of the processing result, and thus reduce the number of bits of the processing result. Therefore, the first computing unit performs the reduction processing, so that the bit number of the processing result of the second computing unit is reduced, the processing result generated by the first computing unit is prevented from exceeding the preset bit number, and overflow is avoided. On the other hand, other computing units except the first computing unit do not need to perform reduction processing, so that when the data is processed by the other computing units, the value of the data is allowed to keep a redundant state, and the reduction processing is not performed on the data until the data is input to the first computing unit, namely, the bit number of the data reaches the preset bit number, so that redundant reduction processing is reduced, redundant calculation amount contained in the number theory transformation is avoided as much as possible, and the processing efficiency is improved.
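The runtime behaviour can be sketched by fixing, before any data is seen, which units include a reduction. In this sketch a chain of additions stands in for the stages of the transformation, and the names `build_schedule` and `run_chain` are hypothetical; a real implementation would operate on butterflies rather than on a single value.

```python
def build_schedule(num_units: int, plan: set, q: int) -> list:
    """Choose the operation of every unit in advance from the reduction plan."""
    def plain(x, y):
        return x + y                    # no reduction: the value stays redundant

    def reducing(x, y):
        return (x % q) + (y % q)        # reduce the inputs first, then add

    return [reducing if i in plan else plain for i in range(num_units)]


def run_chain(x: int, schedule: list) -> int:
    """Feed each unit's result into the next without testing the data values."""
    for unit in schedule:
        x = unit(x, x)
    return x
```

With q = 17 and a plan containing only unit 2, a four-unit chain started at 5 ends with a value congruent to 5·2^4 modulo 17, and only one of the four units performs a reduction.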

Referring to fig. 7, in the scenario shown in fig. 7, the NTT operation process is divided into 16 stages, and the bit number of the data when the overflow condition is satisfied in the NTT operation process is 63. The computing device predicts that the bit number of the processing result generated by each intersection in the 15 th stage is 63 according to the parameters corresponding to the data, namely, the bit number of the input data of each intersection in the 16 th stage reaches 63. In this scenario, the computing device takes each intersection of stage 16 as a first computing unit. In running the NTT, the computing device performs a reduction process at each intersection of stage 16 such that the number of bits of the processed result is reduced from 63 to 60, and eventually the number of bits of the output result of the entire NTT running process is controlled within 60, thus avoiding overflow.

Referring to fig. 8, in the scenario shown in fig. 8, the INTT operation process is divided into 4 stages, and the overflow condition is satisfied when the bit number of the data reaches 62 during the INTT operation. According to the parameters corresponding to the data, the computing device estimates that the input data of 5 crossovers in the INTT has 62 bits. The 5 crossovers are the 1st crossover of the 1st butterfly of the 2nd stage, the 1st crossover of the 2nd butterfly of the 2nd stage, the 1st crossover of the 3rd butterfly of the 2nd stage, the 1st crossover of the 4th butterfly of the 2nd stage, and the 2nd crossover of the 1st butterfly of the 4th stage. The computing device takes each of these 5 crossovers as a first computing unit. When running the INTT, the computing device performs reduction processing at these 5 crossovers, so that the bit number of the processing result is reduced from 62 to 60. The bit number of the output of the entire INTT is thus kept within 62, and overflow is avoided.

After the reduction processing is performed, the computing device may perform the remaining processing steps of the first computing unit on the reduced processing result, and then continue processing through the computing unit that follows the first computing unit, until all computing units have completed processing and the data has been converted into data after the number theory transformation.

The data after the number theory transformation is used in encryption and decryption schemes in various scenarios. For example, in an encryption scenario, the data is plaintext; after performing the forward number theory transformation on the plaintext, the computing device encrypts based on the transformed plaintext to obtain a portion of the ciphertext. In a decryption scenario, the data is ciphertext; after performing the inverse number theory transformation on the ciphertext, the computing device decrypts based on the transformed ciphertext to obtain a portion of the plaintext. As another example, in a key generation scenario, the data is the data required for generating a key (public key or private key); the computing device performs the number theory transformation on the data and generates the key based on the transformed data. For example, the computing device performs encryption processing or decryption processing on the processing result after the reduction processing of the second computing unit.

In the embodiment shown in fig. 6, the computing unit responsible for the reduction processing is determined from the parameters, and the determined computing unit performs the reduction processing on the data while the number theory transformation is running. The value of the data can therefore be reduced at a suitable position without introducing a logic branch statement. This reduces the number of bits used to represent the data in the computing device, prevents the number of bits of the data from exceeding the upper limit that the computing device can represent, and avoids overflow. Compared with introducing a logic branch statement to perform the reduction processing, the method removes the logic branch statement and optimizes the structure of the number theory transformation, thereby improving the efficiency of running the number theory transformation.

In addition, by estimating the number of bits of the processing result generated by each computing unit, the computing unit that is likely to overflow (the first computing unit) can be located accurately from the estimated number of bits. Only the computing unit that is likely to overflow performs reduction processing, and the other computing units do not. This reduces the number of times reduction processing is called in the number theory transformation, reduces the redundant calculation in the number theory transformation as far as possible, and improves efficiency.

In the embodiment shown in fig. 6, there are a number of implementations of how the reduction process is performed, some of which are described below.

In some embodiments, the first computing unit performs the reduction processing on the processing result of the second computing unit by using a modulo operation. The function of the modulo operation is to ensure the correct calculation and to reduce the value of the data. In other embodiments, the computing device employs addition and subtraction processes instead of modulo arithmetic, the subtraction process being used to reduce the size of the data.

In some embodiments, the first computing unit performs the reduction processing on the processing result of the second computing unit by using Montgomery modular multiplication. For example, the processing result of the second computing unit is converted into the Montgomery representation, and Montgomery modular multiplication is then performed on the processing result in the Montgomery representation, thereby realizing the reduction processing. For example, the data includes x and y. A parameter r is introduced, x is converted to x × r (the Montgomery representation of x) based on r, y is converted to y × r (the Montgomery representation of y), and Montgomery modular multiplication is performed on x × r and y × r.
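The conversion to Montgomery representation and the Montgomery modular multiplication described above can be sketched as follows. This is a minimal textbook-style illustration, not the implementation of the application; the function names, the choice r = 2^r_bits, and the toy modulus 17 are assumptions made for the example.

```python
def montgomery_setup(q, r_bits):
    """Precompute r = 2**r_bits and -q^{-1} mod r for an odd modulus q < r."""
    r = 1 << r_bits
    q_neg_inv = (-pow(q, -1, r)) % r  # modular inverse via pow needs Python 3.8+
    return r, q_neg_inv


def to_montgomery(x, q, r):
    """Convert x to its Montgomery representation x*r mod q."""
    return (x * r) % q


def montgomery_mul(xm, ym, q, r_bits, q_neg_inv):
    """Given x*r and y*r (mod q), return x*y*r mod q without dividing by q."""
    mask = (1 << r_bits) - 1
    t = xm * ym
    m = ((t & mask) * q_neg_inv) & mask  # chosen so that t + m*q is divisible by r
    u = (t + m * q) >> r_bits            # exact division by r
    return u - q if u >= q else u        # final correction into [0, q)


# Hypothetical toy parameters, for illustration only.
q, r_bits = 17, 5
r, q_neg_inv = montgomery_setup(q, r_bits)
xm, ym = to_montgomery(5, q, r), to_montgomery(7, q, r)
zm = montgomery_mul(xm, ym, q, r_bits, q_neg_inv)  # Montgomery form of 5*7
z = montgomery_mul(zm, 1, q, r_bits, q_neg_inv)    # multiplying by 1 leaves Montgomery form
```

In this sketch both operands must already be in Montgomery form, which illustrates the binding between data representation and algorithm discussed in the following paragraphs.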

The reduction processing is performed by the Montgomery modular multiplication method, so that the value of the processing result can be reduced, the purpose of reduction is realized, and the speed of the reduction processing is improved.

In the above embodiments, because the reduction processing is performed by Montgomery modular multiplication, the data must be kept in the Montgomery representation and the reduction processing is tied to the Montgomery algorithm. This is a strong limitation and cannot meet the requirement of adjusting the representation of the data while the number theory transformation is running.

Based on this, in other embodiments, the first computing unit performs redundant modular multiplication on the processing result of the second computing unit, thereby realizing the reduction processing. Redundant modular multiplication is modular multiplication whose result lies in the value range [0, 2q), where q represents the modulus and q is a positive integer.
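One possible way to obtain a product in [0, 2q) is sketched below. The text does not specify how the redundant modular multiplication is computed internally, so this sketch assumes a quotient-estimation approach with a scaled twiddle factor; the name `lazy_mod_mul` and the `word_bits` parameter are hypothetical.

```python
def lazy_mod_mul(y, w, q, word_bits=64):
    """Return a value congruent to y*w modulo q that lies in [0, 2q).

    Assumes 0 <= w < q and 0 <= y < 2**word_bits. The scaled factor
    would normally be precomputed once per twiddle factor.
    """
    w_scaled = (w << word_bits) // q        # floor(w * 2^word_bits / q)
    q_est = (y * w_scaled) >> word_bits     # never exceeds floor(y*w/q)
    return y * w - q_est * q                # remainder, possibly with one extra q


# With a toy 8-bit word, the result may be q itself rather than 0:
r = lazy_mod_mul(255, 16, 17, word_bits=8)
```

The result is not fully reduced, which is what allows the final conditional subtraction to be skipped; the value stays redundant until a later step reduces it.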

Because the reduction processing is realized by adopting a redundancy modular multiplication mode, on one hand, the reduction processing is not required to bind a Montgomery algorithm, the representation form of the data is not required to be kept as a Montgomery representation form, in other words, whether the representation form of the data is a Montgomery representation form or a non-Montgomery representation form, the scheme has usability, and therefore the flexibility and the practicability of the scheme are improved. On the other hand, the method also has the function of improving the speed of the reduction processing, thereby improving the efficiency, and particularly, the method is helpful for remarkably accelerating the operation flow of the computing equipment in the scenes of large number operation and the like.

In one exemplary scenario, when redundant modular multiplication is not employed, the choice of reduction algorithm is generally limited to Barrett reduction and Montgomery modular multiplication (the latter requires the polynomial coefficients to be in the Montgomery representation). With redundant modular multiplication, the modular multiplication of coefficients in a non-Montgomery representation can also be calculated, and the redundant modular multiplication can share the same calculation module as the Montgomery modular multiplication.

For how to support dynamically adjusting data representations, in some embodiments of the present application, a computing device generates a twiddle factor having the same representation as data from a representation of the data; the first calculation unit performs redundant modular multiplication processing on the processing result of the second calculation unit based on the twiddle factor. Optionally, the representation is a Montgomery representation or a non-Montgomery representation.

Illustratively, in the pre-calculation phase, the computing device determines a representation of the data, and if the representation of the data is a Montgomery representation, generates a twiddle factor for the Montgomery representation; if the representation of the data is a non-Montgomery representation, generating a twiddle factor for the non-Montgomery representation; the computing device saves the generated twiddle factors to a pre-computation table. In the instant calculation phase, the computing device acquires the twiddle factors from the pre-calculation table, and performs redundant modular multiplication processing based on the acquired twiddle factors and data.

Optionally, if the representation of the input data is adjusted, the computing device correspondingly adjusts the twiddle factors stored in the pre-calculation table to keep the representation of the twiddle factors consistent with the representation of the data. For example, if the representation of the data is adjusted from a Montgomery representation to a non-Montgomery representation, the computing device adjusts the twiddle factors held in the pre-calculation table from a Montgomery representation to a non-Montgomery representation; if the representation of the data is adjusted from a non-Montgomery representation to a Montgomery representation, the computing device adjusts the twiddle factors held in the pre-calculation table from the non-Montgomery representation to the Montgomery representation.
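Generating twiddle factors in the same representation as the data can be sketched as follows. The function name, the natural ordering of the powers, and the parameters are assumptions made for illustration; the pre-calculation table of the application may be organized differently.

```python
def build_twiddle_table(root, n, q, montgomery=False, r_bits=64):
    """Return the powers root^0 .. root^(n-1) modulo q.

    If montgomery is True, every entry is stored as w * r mod q with
    r = 2**r_bits, matching data held in Montgomery representation;
    otherwise the entries are stored as plain residues.
    """
    scale = (1 << r_bits) % q if montgomery else 1
    return [(pow(root, k, q) * scale) % q for k in range(n)]


# Hypothetical toy parameters: q = 17, root = 3.
plain_table = build_twiddle_table(3, 4, 17)
mont_table = build_twiddle_table(3, 4, 17, montgomery=True, r_bits=5)
```

When the representation of the data changes, the table can be rebuilt with the other flag value while the butterfly code that consumes the table stays the same.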

With the above embodiment, if the task uses data in the Montgomery representation, twiddle factors in the Montgomery representation are used for the operation; if the task uses data in a non-Montgomery representation, twiddle factors in the non-Montgomery representation are used. The method can therefore dynamically adjust the representation of the values in the NTT/INTT operation according to the requirements of a specific calculation task. In addition, the butterfly structure of the NTT/INTT operation is not affected, no additional algorithm is introduced, the method is not tied to the Montgomery algorithm, and the amount of calculation is not increased.

In some embodiments, the number theory transformation includes subtraction processing. If the subtrahend is greater than the minuend, the result of the subtraction is a negative number. For computing devices, negative numbers in the processing results may lead to operational errors and thus to incorrect computation.

Based on this, in some embodiments of the present application, the computing device determines a redundancy value based on the parameters. In the process of running the number theory transformation based on the data, each of the plurality of computing units performs subtraction processing based on the redundancy value. The redundancy value is a value greater than or equal to the subtrahend in the subtraction processing. Optionally, the redundancy value is greater than or equal to the maximum value of the data.

The subtraction processing in the forward number theory transformation includes the subtraction in the redundancy growth operation and the subtraction in the redundancy reduction operation. In the redundancy growth operation, the minuend is the data itself, and the subtrahend is the result of the redundant modular multiplication of the data and the twiddle factor. In the redundancy reduction operation, the minuend is the result of the redundant modular multiplication of the data, and the subtrahend is the result of the redundant modular multiplication of the data and the twiddle factor. Expressed mathematically, the subtraction in the redundancy growth operation is, for example, x − (y × w mod 2q), and the subtraction in the redundancy reduction operation is, for example, (x mod 2q) − (y × w mod 2q), where x and y are both data, w is a twiddle factor, and q is the modulus. The subtraction in the inverse number theory transformation is a subtraction between two data, for example, x − y, where x and y are both data.

Subtracting based on the redundancy value means that the subtraction involves not only the data itself but also the redundancy value. This is equivalent to adding the redundancy value to the minuend and enlarging its value, so that the result of the subtraction is not negative, which contributes to operation accuracy. In addition, compared with setting the redundancy value empirically, determining the redundancy value from the data-related parameters makes the redundancy value suit the values of the parameters and improves accuracy. Furthermore, the redundancy value is not bound to a single parameter set but is adjusted along with the values of the parameters, so that more parameters are available to the scheme, which improves extensibility and practicability.

Aiming at how to design the value of the redundancy value, in some embodiments of the application, the value with better effect is provided for the redundancy value by analyzing the respective characteristics of the positive number theory transformation and the inverse number theory transformation.

Optionally, for the forward number theory transformation, the redundancy value is equal to 2q, where q represents the modulus and q is a positive integer.

2q is selected as the redundancy value because the forward number theory transformation performs the modular multiplication first and the addition and subtraction afterwards, and the value range of the modular multiplication is controllable. For example, when the multiplication is implemented as a redundant modular multiplication, its result lies in [0, 2q), and when it is implemented as a modular multiplication without redundancy, its result lies in [0, q), where q is the modulus. Since the subtrahend in the subtraction is the output of the modular multiplication, its value lies in [0, 2q). With 2q substituted into the subtraction, the redundancy value is necessarily larger than the subtrahend, which ensures that the result of the subtraction is not negative and thus contributes to operation accuracy. In addition, the redundancy value used is as small as possible, which avoids the excessive processing and storage overhead that an overly large redundancy value would cause.

Optionally, for the inverse number theory transform, the redundancy value is equal to (t+n) ×q, t represents a redundancy multiple, n represents a polynomial dimension, q represents a modulus, and t, n, and q are positive integers.

(t+n) × q is selected as the redundancy value for the following reason. When the data has no redundancy, the input data of any computing unit does not exceed n × q. To allow for redundancy in the data, t × q is added on top of n × q, so that data with any redundancy multiple is supported as input and the result of the subtraction is guaranteed not to be negative, which contributes to operation accuracy.
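The two redundancy values and their use in a subtraction can be illustrated with a short sketch. The function names are hypothetical, and the numbers are toy values chosen only to show that the result stays non-negative and congruent to x − y modulo q.

```python
def ntt_redundancy_value(q):
    """Redundancy value for the forward transform: 2q."""
    return 2 * q


def intt_redundancy_value(q, t, n):
    """Redundancy value for the inverse transform: (t + n) * q."""
    return (t + n) * q


def guarded_sub(x, y, redundancy):
    """Compute x - y shifted up by a multiple of q so the result is not negative.

    Assumes redundancy >= y and redundancy is a multiple of q.
    """
    return redundancy - y + x


q = 17
forward_diff = guarded_sub(3, 33, ntt_redundancy_value(q))               # subtrahend below 2q
inverse_diff = guarded_sub(5, 160, intt_redundancy_value(q, t=2, n=8))   # subtrahend below (t+n)*q
```

Because each redundancy value is a multiple of q, adding it does not change the residue of the difference modulo q.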

The embodiment shown in fig. 6 describes the case where reduction processing is performed on the data. In other embodiments, for the forward number theory transformation, if the computing device determines that the parameter satisfies a condition, it determines that no computing unit needs to perform reduction processing, and the reduction processing on the data is omitted while the number theory transformation runs on the data. For the inverse number theory transformation, if the computing device determines that the parameter satisfies the condition, it determines that the computing units of the last stage are the computing units that need to perform reduction processing, and the data is reduced by the computing units of the last stage while the number theory transformation runs on the data.

When the redundancy multiple of the input data is 1 or the input data has no redundancy, the parameter satisfies the condition if, for example, the sum of the bit number of the modulus and the number of stages is smaller than the preset bit number, for example $\log_2 n + \log_2 q < 60$, where n represents the polynomial dimension and q represents the modulus. When the redundancy multiple is greater than 1, the parameter satisfies the condition if, for example, the sum of the bit number of the product of the modulus and the redundancy multiple and the number of stages is smaller than the preset bit number.

The reasoning is as follows. The bit number of the product of the modulus and the redundancy multiple corresponds to the largest bit number the data can theoretically have at input. Without reduction, the bit number grows by at most one bit each time the data passes through a stage, so the number of stages corresponds to the largest number of bits that can be added after the data has passed through all stages of the number theory transformation. Therefore, when the parameter satisfies the above condition, no overflow occurs even in the worst case, and the data does not need to be reduced. In this way, the reduction processing and the logic branch statements can be removed entirely while still guaranteeing that no overflow occurs, which improves the performance and efficiency of running the number theory transformation.
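The parameter check described above can be sketched as a small predicate. The function name and the use of the integer bit length as the bit number are assumptions for illustration; the preset bit number 60 is taken from the example in the text.

```python
def reduction_can_be_skipped(n, q, max_bits=60, redundancy=1):
    """Return True if stages + bits(redundancy * q) < max_bits.

    n is assumed to be a power of two, so n.bit_length() - 1 equals log2(n),
    the number of stages. Each stage adds at most one bit in the worst case.
    """
    stages = n.bit_length() - 1
    value_bits = (redundancy * q).bit_length()
    return stages + value_bits < max_bits


small = reduction_can_be_skipped(1024, 12289)              # 10 stages, 14-bit modulus
large = reduction_can_be_skipped(1 << 16, (1 << 50) + 1)   # 16 stages, 51-bit modulus
```

Such a check would run once in the pre-calculation phase, so the transform itself contains no branch on the data.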

The implementation of the method shown in fig. 6 is illustrated below in conjunction with some code and formulas. In the following implementation, the redundancy reduction crossover is an illustration of a calculation unit (i.e., a first calculation unit) that performs reduction processing on data, the redundancy increase crossover is an illustration of a calculation unit (i.e., a non-first calculation unit) that does not perform reduction processing on data, the polynomial coefficient is an illustration of data, max is an illustration of a preset number of bits, and min is an illustration of a redundancy value.

In some embodiments, in constructing the NTT or INTT algorithm, each is split into two parts: pre-calculation and instant calculation.

Fig. 9 is a schematic diagram of NTT according to the present embodiment. As shown in fig. 9, the NTT includes a twiddle factor generation module, an NTT pre-calculation module, an NTT generation module, and an NTT operation module. A flowchart of the NTT generation module is shown in fig. 11.

Fig. 12 is a schematic diagram of INTT provided in this embodiment. The INTT internally comprises a twiddle factor generation module, an INTT pre-calculation module, an INTT generation module and an INTT operation module. A flowchart of the INTT pre-computation module is shown in fig. 13. The flow chart of the INTT generation module is shown in fig. 14.

The pre-calculation process and the instant calculation process of NTT and INTT, respectively, are exemplified below.

NTT

The crossovers in the butterflies of the NTT are of two types: redundancy growth crossovers and redundancy reduction crossovers.

In an exemplary embodiment, the coefficients of the terms of the input polynomial a of the NTT are sorted in ascending order of degree, with the lowest-degree term first and the highest-degree term last, to obtain a sequence. x and y are the coefficients of the terms of degree j and j+t, respectively. w is the twiddle factor used in this crossover calculation and is taken from the pre-calculation table.

In this embodiment, the code for redundancy growth crossover in the Radix-2 NTT is shown below.

x = a[j]; y = a[j+t];

tx = x;

ty = FastModMultLazy(y, w, q); // Note: ty = y × w mod 2q

a[j] = tx + ty;

a[j+t] = 2*q - ty + tx;

The code of the NTT redundancy growth crossover shown above means the following: the coefficient x (i.e., a[j]) is assigned to the intermediate variable tx, the result of the redundant modular multiplication of the coefficient y and the twiddle factor w is assigned to the intermediate variable ty, the coefficient x (i.e., a[j]) is updated to tx + ty, and the coefficient y (i.e., a[j+t]) is updated to 2*q − ty + tx. After this calculation, the value of the coefficient x becomes larger, hence the term "growth".

In this embodiment, the code for redundancy reduced crossover in the Radix-2 NTT is shown below.

x = a[j]; y = a[j+t];

tx = FastModMultLazy(x, 1, q); // Note: tx = x mod 2q

ty = FastModMultLazy(y, w, q); // Note: ty = y × w mod 2q

a[j] = tx + ty;

a[j+t] = 2*q - ty + tx;

The code of the NTT redundancy reduction crossover shown above means the following: the result of the redundant modular multiplication of the coefficient x and the integer 1 is assigned to the intermediate variable tx, the result of the redundant modular multiplication of the coefficient y and the twiddle factor w is assigned to the intermediate variable ty, the coefficient x (i.e., a[j]) is updated to tx + ty, and the coefficient y (i.e., a[j+t]) is updated to 2*q − ty + tx. After this calculation, the value of the coefficient x may become smaller, hence the term "reduction".
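The two NTT crossovers above can be mirrored in a small runnable sketch. The helper `lazy_mod_mul` stands in for the redundant modular multiplication and is an assumed implementation (quotient estimation with a scaled factor), not taken from the application; the numeric values are toy inputs.

```python
def lazy_mod_mul(y, w, q, word_bits=64):
    """Assumed stand-in for redundant modular multiplication: result in [0, 2q)."""
    w_scaled = (w << word_bits) // q
    return y * w - ((y * w_scaled) >> word_bits) * q


def ntt_growth_cross(x, y, w, q):
    """Redundancy growth crossover: x is used as-is, so the outputs grow by up to 2q."""
    tx = x
    ty = lazy_mod_mul(y, w, q)
    return tx + ty, 2 * q - ty + tx


def ntt_reduction_cross(x, y, w, q):
    """Redundancy reduction crossover: x is first brought into [0, 2q)."""
    tx = lazy_mod_mul(x, 1, q)
    ty = lazy_mod_mul(y, w, q)
    return tx + ty, 2 * q - ty + tx


q, x, y, w = 17, 100, 50, 3
grown = ntt_growth_cross(x, y, w, q)
reduced = ntt_reduction_cross(x, y, w, q)
```

Both variants produce the same residues modulo q (x + y·w and x − y·w); they differ only in how large the unreduced outputs are allowed to become.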

FIG. 15 shows a schematic diagram of the calculation of the redundancy growth crossover in the Radix-2 NTT. The calculation formula of the Radix-2 NTT redundancy growth crossover is the following formula A.

FIG. 16 shows a schematic diagram of the calculation of the redundancy reduction crossover in the Radix-2 NTT. The calculation formula of the Radix-2 NTT redundancy reduction crossover is the following formula B.
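Formulas A and B are not reproduced in this text. Read off from the two code listings above, they would presumably take the following form, where X and Y denote the updated values of a[j] and a[j+t], and "mod 2q" denotes the redundant result in [0, 2q). This is a reconstruction under that assumption, not the original figures.

```latex
% Formula A (redundancy growth crossover), reconstructed from the code
X = x + (y \cdot w \bmod 2q), \qquad
Y = 2q - (y \cdot w \bmod 2q) + x

% Formula B (redundancy reduction crossover), reconstructed from the code
X = (x \bmod 2q) + (y \cdot w \bmod 2q), \qquad
Y = 2q - (y \cdot w \bmod 2q) + (x \bmod 2q)
```

Under this reading, formula A raises the worst-case bound of X and Y by 2q per stage, while formula B bounds both outputs by 4q, which agrees with the derivation that follows.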

Pre-calculating:

As shown in fig. 10, the NTT pre-calculation module receives the input parameters, calculates from them the positions at which overflow would occur in the NTT, and generates a sequence S. The length of S equals the number of stages at which overflow would occur in the NTT, and each value in S identifies one such stage. The specific flow is as follows.

Let $n$ ($n \ge 4$) be a power of 2, $n = 2^m$, so that the NTT has $\log_2 n = m$ stages. The modulus $q$ has $\log_2 q$ bits, the maximum redundancy value allowed by the NTT (determined by the machine word length, instruction word length, or data word length) is $2^{\text{max}}$, and the minimum redundancy value is $\text{min} = 2q$.

The embodiment of the application supports input of any redundancy size to the NTT, so the value of an input coefficient satisfies $2^{h-1} < \text{input} < 2^{h}$ ($h$ is a positive integer, $h < \text{max}$). When the NTT executes the $s_1$-th stage ($s_1$ is a positive integer, $1 \le s_1 \le m$), if $q \ge (2^{\text{max}-1} - 2^{h-1})/s_1$ (that is, this condition is not yet met at stage $s_1 - 1$), the $s_1$-th stage requires reduction processing.

The derivation is as follows. According to formula A, after the calculation of the 1st stage is completed, the theoretical maximum of the X term and the Y term is at most $2^{h} + 2q$; after the 2nd stage, it is at most $(2^{h} + 2q) + 2q = 2^{h} + 2 \cdot 2q$. After the $s$-th stage, if no coefficient has overflowed, the theoretical maximum of the X term and the Y term is at most $2^{h} + s \cdot 2q$, with $2^{h} + s \cdot 2q < 2^{\text{max}}$. In other words, if no coefficient has overflowed after the $s$-th stage, then $s$ satisfies the following expression (formula C): $q < (2^{\text{max}-1} - 2^{h-1})/s$.

Therefore, when execution reaches the $s_1$-th stage, if formula C does not hold after substituting $s_1$, that is, $q \ge (2^{\text{max}-1} - 2^{h-1})/s_1$, then in the worst case overflow would theoretically occur after the $s_1$-th stage is completed. The $s_1$-th stage therefore requires reduction processing, that is, its crossovers are changed to redundancy reduction crossovers.

If the NTT has not finished at this point, execution continues. Let $\lfloor \log_2 4q \rfloor = g - 1$. If, before the NTT reaches the $s_2$-th stage ($s_2$ is a positive integer, $1 \le s_1 < s_2 \le m$), it is found that $q \ge (2^{\text{max}-1} - 2^{g-1})/(s_2 - s_1)$, the $s_2$-th stage requires reduction processing. Proceeding in the same way until all stages of the NTT have been covered yields the sequence $S = [s_1, s_2, \dots]$ of stages at which a coefficient would theoretically overflow.

The derivation of the manner of determining the phase number sequence is as follows.

Because the crossovers of the $s_1$-th stage are changed to redundancy reduction crossovers, according to formula B the theoretical maximum of the X term and the Y term after the $s_1$-th stage is at most $4q$. With $\lfloor \log_2 4q \rfloor = g - 1$, it follows that $2^{g-1} < 4q < 2^{g}$; that is, after the $s_1$-th stage all coefficients are smaller than $2^{g}$. Assuming no coefficient overflows in the following stages, after stage $s_1 + 1$ the theoretical maximum of X and Y is at most $2^{g} + 2q$; after stage $s_1 + 2$ it is at most $(2^{g} + 2q) + 2q = 2^{g} + 2 \cdot 2q$; and after stage $s_1 + s$, if no coefficient is out of range, it is at most $2^{g} + s \cdot 2q$, with $2^{g} + s \cdot 2q < 2^{\text{max}}$. In other words, if no coefficient has overflowed after stage $s_1 + s$, then $s$ satisfies the following expression (formula D): $q < (2^{\text{max}-1} - 2^{g-1})/s$.

Therefore, when execution reaches the $s_2$-th stage ($s_2$ is a positive integer, $1 \le s_1 < s_2 \le m$), if formula D does not hold after substituting $s = s_2 - s_1$, that is, $q \ge (2^{\text{max}-1} - 2^{g-1})/(s_2 - s_1)$, then overflow would theoretically occur after the $s_2$-th stage is completed, and the $s_2$-th stage requires reduction processing, that is, redundancy reduction crossovers are used instead. After the $s_2$-th stage, if the NTT has not finished, this calculation is repeated until the NTT is complete.
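The pre-calculation of the sequence S described above can be sketched as follows. The function name and signature are hypothetical. The overflow test is written as 2^base + k·2q ≥ 2^max, which is the same inequality as q ≥ (2^(max−1) − 2^(base−1))/k after multiplying through; base is h before the first reduction and g afterwards.

```python
def ntt_reduction_stages(n, q, h, max_bits):
    """Return the list S of NTT stages (1-based) that need redundancy reduction.

    n        : polynomial dimension, a power of two (m = log2(n) stages)
    q        : modulus
    h        : input coefficients are assumed to be below 2**h
    max_bits : the largest allowed value is below 2**max_bits
    """
    m = n.bit_length() - 1
    g = (4 * q).bit_length()          # 2**(g-1) <= 4q < 2**g
    stages = []
    base_bits, last = h, 0            # bound after the last reduction, and its stage
    for s in range(1, m + 1):
        worst = (1 << base_bits) + (s - last) * 2 * q
        if worst >= (1 << max_bits):  # would theoretically overflow after stage s
            stages.append(s)
            base_bits, last = g, s    # after reduction all values are below 4q < 2**g
    return stages


# Toy parameters chosen so that overflow is predicted twice.
S_small_word = ntt_reduction_stages(1024, 12289, h=14, max_bits=17)
S_large_word = ntt_reduction_stages(1024, 12289, h=14, max_bits=60)
```

With a generous word size the returned sequence is empty, which corresponds to the case in which every stage uses only redundancy growth crossovers.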

Instant calculation (construction NTT algorithm)

As shown in FIG. 11, if the sequence S is empty, an NTT is constructed in which the butterflies of all stages invoke only redundancy growth crossovers. Otherwise, the NTT is constructed according to the sequence S, so that the stages $s_1$, $s_2$, and so on invoke redundancy reduction crossovers and the remaining stages invoke only redundancy growth crossovers.
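A minimal sketch of constructing an NTT from the sequence S is given below. It assumes a negacyclic Radix-2 layout with twiddle factors stored in bit-reversed order and an assumed `lazy_mod_mul` helper; these choices, the function names, and the toy parameters (q = 17, ψ = 3) are illustrative and are not stated in the text. The crossover for each stage is selected once when the transform is built, so the inner loop contains no overflow check.

```python
def lazy_mod_mul(y, w, q, word_bits=64):
    """Assumed redundant modular multiplication: congruent to y*w, in [0, 2q)."""
    w_scaled = (w << word_bits) // q
    return y * w - ((y * w_scaled) >> word_bits) * q


def growth_cross(a, j, t, w, q):
    tx, ty = a[j], lazy_mod_mul(a[j + t], w, q)
    a[j], a[j + t] = tx + ty, 2 * q - ty + tx


def reduction_cross(a, j, t, w, q):
    tx, ty = lazy_mod_mul(a[j], 1, q), lazy_mod_mul(a[j + t], w, q)
    a[j], a[j + t] = tx + ty, 2 * q - ty + tx


def bit_reverse(k, bits):
    return int(format(k, f"0{bits}b")[::-1], 2)


def build_ntt(n, q, psi, reduce_stages):
    """Return an NTT function whose stages listed in reduce_stages use reduction crossovers."""
    bits = n.bit_length() - 1
    tw = [pow(psi, bit_reverse(k, bits), q) for k in range(n)]   # pre-calculation table
    plan = [reduction_cross if s in reduce_stages else growth_cross
            for s in range(1, bits + 1)]                         # fixed per-stage choice

    def ntt(coeffs):
        a = list(coeffs)
        t, m = n, 1
        for cross in plan:
            t //= 2
            for i in range(m):                 # butterflies of this stage
                for j in range(2 * i * t, 2 * i * t + t):
                    cross(a, j, t, tw[m + i], q)
            m *= 2
        return a                               # redundant values, bit-reversed order

    return ntt


ntt_growth_only = build_ntt(8, 17, 3, reduce_stages=set())
ntt_with_reduction = build_ntt(8, 17, 3, reduce_stages={2})
```

The outputs of the two transforms may differ as integers but agree modulo q, since the reduction crossover only changes how much redundancy the values carry.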

INTT

The crossovers in the butterflies of the INTT are of two types: redundancy growth crossovers and redundancy reduction crossovers.

In an exemplary embodiment, the terms of the input polynomial a of the INTT are sorted in ascending order of degree, with the lowest-degree term first and the highest-degree term last, to obtain a sequence. x and y are the coefficients of the terms of degree j and j+t, respectively. w is the twiddle factor used in this crossover calculation and is taken from the pre-calculation table. A minimum redundancy value min is defined, where min is a positive integer; its purpose is to ensure that min is greater than y whatever the value of y.

In this embodiment, the code for the redundancy growth crossover in the Radix-2 INTT is shown below.

x = a[j]; y = a[j+t];

tx = x + y;

ty = min - y + x;

a[j] = tx;

a[j+t] = FastModMultLazy(ty, w, q); // Note: a[j+t] = ty × w mod 2q

The code of the INTT redundancy growth crossover shown above means the following: the sum of x and y is assigned to tx, the result of min − y + x is assigned to ty, the coefficient x (i.e., a[j]) is updated to tx, and the coefficient y (i.e., a[j+t]) is updated to the result of the redundant modular multiplication of ty and w. After this calculation, the value of the coefficient x becomes larger, hence the term "growth".

In this embodiment, the code for the redundancy reduction crossover in the Radix-2 INTT is shown below.

x = a[j]; y = a[j+t];

tx = FastModMultLazy(x + y, 1, q); // Note: tx = (x + y) mod 2q

ty = min - y + x;

a[j] = tx;

a[j+t] = FastModMultLazy(ty, w, q); // Note: a[j+t] = ty × w mod 2q

The code of the INTT redundancy reduction crossover shown above means the following: the result of the redundant modular multiplication of x + y and the integer 1 is assigned to tx, the result of min − y + x is assigned to ty, the coefficient x (i.e., a[j]) is updated to tx, and the coefficient y (i.e., a[j+t]) is updated to the result of the redundant modular multiplication of ty and w. After this calculation, the value of the coefficient x may become smaller, hence the term "reduction".
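The two INTT crossovers can be mirrored in the same way. As before, `lazy_mod_mul` is an assumed stand-in for the redundant modular multiplication, and the redundancy value (t + n)·q is computed with toy values t = 2 and n = 8; none of these names or numbers come from the application.

```python
def lazy_mod_mul(y, w, q, word_bits=64):
    """Assumed redundant modular multiplication: congruent to y*w, in [0, 2q)."""
    w_scaled = (w << word_bits) // q
    return y * w - ((y * w_scaled) >> word_bits) * q


def intt_growth_cross(x, y, w, q, min_red):
    """Redundancy growth crossover: the sum is kept unreduced."""
    tx = x + y
    ty = min_red - y + x                    # non-negative as long as min_red >= y
    return tx, lazy_mod_mul(ty, w, q)


def intt_reduction_cross(x, y, w, q, min_red):
    """Redundancy reduction crossover: the sum is brought into [0, 2q)."""
    tx = lazy_mod_mul(x + y, 1, q)
    ty = min_red - y + x
    return tx, lazy_mod_mul(ty, w, q)


q, x, y, w = 17, 100, 50, 3
min_red = (2 + 8) * q                       # (t + n) * q with toy t = 2, n = 8
grown = intt_growth_cross(x, y, w, q, min_red)
reduced = intt_reduction_cross(x, y, w, q, min_red)
```

In both variants the second output is always below 2q, so only the first output can accumulate redundancy from stage to stage.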

Fig. 17 shows a schematic diagram of the calculation of the redundancy growth crossover in the Radix-2 INTT. The calculation formula of the Radix-2 INTT redundancy growth crossover is the following formula E.

Fig. 18 shows a schematic diagram of the calculation of the redundancy reduction crossover in the Radix-2 INTT. The calculation formula of the Radix-2 INTT redundancy reduction crossover is the following formula F.

The INTT requires an algorithm to determine exactly which stage, which butterfly, and which crossover require reduction. The algorithm in the INTT for determining where reduction processing is needed is as follows.
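Formulas E and F are likewise not reproduced in this text. Read off from the two INTT code listings above, they would presumably take the following form, where X and Y denote the updated values of a[j] and a[j+t]. This is a reconstruction from the code, not the original figures.

```latex
% Formula E (INTT redundancy growth crossover), reconstructed from the code
X = x + y, \qquad
Y = \bigl((\mathrm{min} - y + x) \cdot w\bigr) \bmod 2q

% Formula F (INTT redundancy reduction crossover), reconstructed from the code
X = (x + y) \bmod 2q, \qquad
Y = \bigl((\mathrm{min} - y + x) \cdot w\bigr) \bmod 2q
```

Under this reading, X gains at most one bit per growth crossover, while Y always has at most as many bits as 2q.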

The pre-calculation algorithm shown above is used to calculate the specific location of the intersection in INTT that requires a reduction process, the logic of which is as follows.

According to equation E, in which the addition and subtraction are performed before the modular multiplication, the bit counts change as follows. The X term grows by 1 bit, because the X term is computed by addition, and the sum of x and y is in theory at most 1 bit wider than the larger of x and y. The Y term becomes at most as wide as 2q, that is, 1 bit wider than the modulus q, because the output of the redundant modular multiplication always lies in [0, 2q); in the worst case, the bit number of the Y term after the calculation is taken to be exactly the bit number of 2q. In the next stage, the X term and the Y term may exchange positions, so the bit numbers obtained for the X term and the Y term are substituted into equation E again to calculate how they change after the next crossover. If, before substitution into equation E, the bit number of the X term is found to be equal to the bit number of the maximum redundancy value, the X term must be reduced to ensure that no overflow occurs after the crossover calculation; that is, equation F is used instead and a redundancy reduction crossover is invoked.

The bit number of the Y term does not need to be considered, because the Y term is fed into the redundant modular multiplication. Since the maximum redundancy value is less than or equal to the machine word length, the Y term does not reach or exceed the bit number of the maximum redundancy value as long as it does not exceed the machine word length; the redundant modular multiplication can then complete the calculation and, in the worst case, output a value with the same bit number as 2q, so the Y term has no overflow problem. In summary, once the parameters are given, the algorithm simulates the INTT according to the above logic, calculates the bit number at each crossover, finds every crossover whose X-term input could theoretically reach the bit number of the maximum redundancy value, and records these positions in the pre-calculation table, which the INTT generation module uses to generate a customized INTT.

Pre-calculation:

As shown in fig. 13, the INTT pre-calculation module receives the input parameters, calculates from them the positions at which overflow would occur in the INTT, and generates a sequence T. The length of the sequence T is equal to the number of crosses in the INTT at which overflow would occur, and each value in the sequence identifies one such cross. The sequence T is stored in the pre-calculation table, as described in detail below.

Let n (n ≥ 4) be a power of 2, so that the number of INTT stages is log₂ n; let the size of the modulus q be given by log₂ q, and let the maximum redundancy value allowed by the INTT (derived from the machine word length) be 2^max. Given these parameters and the polynomial a as input, the algorithm returns a map red_position that contains all cross positions requiring reduction processing. By traversing red_position, each position requiring reduction processing can be found exactly as an entry [t, <s, b, c>], where t denotes the t-th cross of the INTT and <s, b, c> denotes the specific location of the reduction, namely the c-th cross of the b-th butterfly of the s-th stage of the INTT. These positions are stored in the sequence T, as shown in fig. 13.
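The pre-calculation described above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not the pre-calculation algorithm of the application itself. The function name find_reduction_crosses, its parameter names, and the data layout (stage s consists of n/2^s butterflies with 2^(s-1) crosses each, with crosses numbered consecutively) are hypothetical. The bit-growth rules are simplified to "the X term gains 1 bit, the Y term takes the bit length of 2q", and the reduction cross is assumed to return an X term no longer than 2q. The demonstration uses the INTT parameters of example 2 (n = 16, a 59-bit q, inputs up to 4q, max = 62).

```python
def find_reduction_crosses(n, q_bits, input_bits, limit_bits):
    """Track worst-case bit lengths through a radix-2 INTT and return
    {t: (s, b, c)} for every cross that needs reduction.

    Assumed layout (hypothetical): stage s has n / 2**s butterflies with
    2**(s-1) crosses each; crosses are numbered t = 1, 2, ... in order.
    """
    two_q_bits = q_bits + 1              # worst-case bit length of 2q
    bits = [input_bits] * n              # worst-case bit length per coefficient
    red_position = {}
    t = 0
    for s in range(1, n.bit_length()):   # stages 1 .. log2(n)
        half = 1 << (s - 1)              # crosses per butterfly in stage s
        for b in range(1, n // (2 * half) + 1):
            base = (b - 1) * 2 * half
            for c in range(1, half + 1):
                t += 1
                i = base + c - 1         # position of the X term
                j = i + half             # position of the Y term
                widest = max(bits[i], bits[j])
                if widest >= limit_bits:
                    # The sum would exceed the limit: record a reduction cross,
                    # whose X output is assumed to be as short as 2q.
                    red_position[t] = (s, b, c)
                    bits[i] = two_q_bits
                else:
                    bits[i] = widest + 1  # growth cross: X gains 1 bit
                bits[j] = two_q_bits      # Y is the output of the redundant multiplication
    return red_position


positions = find_reduction_crosses(n=16, q_bits=59, input_bits=61, limit_bits=62)
print(positions)
```

Under these assumptions the sketch reports reductions at crosses 9, 11, 13, 15 and 26, which are the cross numbers listed for the INTT in example 2.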

Instant calculation (construct INTT algorithm):

If the sequence T is empty, as shown in fig. 14, an INTT is constructed in which the butterflies of all stages invoke only the redundancy growth cross. Otherwise, an INTT is constructed according to the sequence T (see fig. 13) such that it invokes the redundancy reduction cross at the cross positions contained in red_position, and the remaining crosses invoke only the redundancy growth cross.
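A minimal sketch of this construction step is given below, assuming that the INTT has log₂ n stages of n/2 crosses each and that the crosses are numbered from 1. The function name build_cross_plan and the labels "grow" and "reduce" are hypothetical placeholders for the redundancy growth cross and the redundancy reduction cross. The point of the sketch is that the choice of cross is fixed once, when the INTT is constructed, so the code that later runs the transform does not need to test coefficient values.

```python
def build_cross_plan(n, red_position):
    """Return one label per cross, in cross order t = 1, 2, ...

    red_position maps a cross number t to its <s, b, c> location; an empty
    map corresponds to an empty sequence T.
    """
    stages = n.bit_length() - 1          # log2(n) for a power of two
    plan = ["grow"] * (stages * n // 2)  # default: redundancy growth cross
    for t in red_position:
        plan[t - 1] = "reduce"           # redundancy reduction cross at position t
    return plan


# Positions taken from the INTT case of example 2 (n = 16).
plan = build_cross_plan(16, {9: (2, 1, 1), 11: (2, 2, 1), 13: (2, 3, 1),
                             15: (2, 4, 1), 26: (4, 1, 2)})
print(plan.count("reduce"), "of", len(plan), "crosses use the reduction cross")
```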

The implementation of the present application is described below in connection with a specific application scenario, see example 1 below.

Example 1

For post-quantum cryptography algorithms based on ideal lattices (such as NewHope and Kyber), the NTT and INTT are the primary computations. The parameters of these algorithms are relatively fixed, and the parameter sizes satisfy log₂ n + log₂ q < 60. Therefore, only redundant modular multiplication is invoked in the butterflies, and all reduction processing and all logical branch statements in the NTT and INTT butterflies are completely removed, which improves the performance of polynomial multiplication.
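The size condition can be checked with a few lines of Python. This is an illustrative sketch: the function name fits_without_reduction is hypothetical, and the parameter pairs are the publicly specified values for Kyber (n = 256, q = 3329) and NewHope1024 (n = 1024, q = 12289), which are not stated in this application.

```python
import math


def fits_without_reduction(n, q, limit=60):
    """Return True when log2(n) + log2(q) is below the limit used in example 1."""
    return math.log2(n) + math.log2(q) < limit


for name, n, q in [("Kyber", 256, 3329), ("NewHope1024", 1024, 12289)]:
    total = math.log2(n) + math.log2(q)
    print(f"{name}: log2 n + log2 q = {total:.1f}, fits = {fits_without_reduction(n, q)}")
```

Both parameter sets lie far below the limit, whereas the NTT parameters of example 2 (n = 2¹⁶, log₂ q = 57) do not satisfy it.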

Based on the basic idea described above, example 1 is as follows:

NTT

(1) All calculated output results of the NTT without logical branch statements are non-negative integers.

(2) The NTT butterfly multiplies first and then adds and subtracts, so the restriction on the value of X needs to be relaxed, which makes the output of the NTT butterfly more redundant.

(3) When integer modular multiplication is calculated, redundant modular multiplication keeps the output of the modular multiplication within the range [0, 2q].

(4) The redundant values output after the addition and subtraction of integers can be processed according to the specific calculation task.

(5) Proceed to the next butterfly of the current stage; if all butterflies of the current stage have been calculated, proceed to the next stage. Repeat steps (2) to (5) until all stages have been calculated, so that the overall output is redundant but within the allowable range.

The algorithm of NTT is as follows.

INTT

(1) All calculated output results of the INTT without logical branch statements are non-negative integers. Let n (n ≥ 4) be a power of 2, so that the INTT has log₂ n stages in total. The butterflies of the last stage are handled separately, so the INTT is divided into the main stages (stage numbers smaller than log₂ n) and the final stage (the log₂ n-th stage).

(2) Since the INTT needs to support polynomial coefficient inputs of arbitrary redundancy, the minimum redundancy value is min = (t + n) × q, where t is the redundancy multiple of the input data.

(3) The INTT butterfly adds and subtracts first and then multiplies, so the redundant values output by the integer addition and subtraction do not need to be processed; when integer modular multiplication is calculated, redundant modular multiplication keeps the output within the range [0, 2q].

(4) Proceed to the next butterfly of the current stage; if all butterflies of the current stage have been calculated, proceed to the next stage. Repeat steps (2) to (3) until the main stages have been calculated.

(5) Calculate the final stage. The calculation is the same as for the butterflies of the main stages, and its main purpose is to control the redundancy of the output result; therefore, although the coefficients do not go out of range, the redundancy reduction cross may be used. By exploiting the fact that the INTT butterfly adds and subtracts first and then multiplies, the overall output can be kept within the range [0, 2q) by invoking the redundant modular multiplication function only when the butterfly calculates the integer modular multiplication.

The algorithm of INTT is as follows.

In summary, the above example builds a more efficient NTT/INTT through pre-calculation. After all logical branch statements are removed, only the necessary reduction processing is retained, and the value of the output result still stays within the range allowed by the data type or instruction set without overflowing. In addition, modular multiplication is performed as redundant modular multiplication, which is efficient.
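This section does not spell out how the redundant modular multiplication is computed. As an illustration only, the sketch below uses one well-known technique that yields a product below 2q without a conditional final subtraction: multiplication by a constant w with a precomputed quotient floor(w·β/q), where β is the machine word size (often called Shoup-style multiplication). It is a swapped-in example, not necessarily the method of the application. The names BETA, precompute and redundant_mulmod are hypothetical, and a 64-bit word with q < 2⁶³ is assumed.

```python
import random

BETA_BITS = 64                 # assumed machine word length
BETA = 1 << BETA_BITS
MASK = BETA - 1


def precompute(w, q):
    """Precompute floor(w * BETA / q) for a fixed factor w with 0 <= w < q."""
    return (w * BETA) // q


def redundant_mulmod(x, w, w_pre, q):
    """Return r with r = w * x (mod q) and 0 <= r < 2q, for any 0 <= x < BETA.

    No comparison against q is made, so no branch is needed.
    """
    hi = (w_pre * x) >> BETA_BITS       # estimate of floor(w * x / q)
    return (w * x - hi * q) & MASK      # low word of the difference


random.seed(0)
q = (1 << 58) + 27                      # arbitrary 59-bit modulus for the demo
w = random.randrange(q)
w_pre = precompute(w, q)
x = random.randrange(BETA)
r = redundant_mulmod(x, w, w_pre, q)
print(0 <= r < 2 * q, r % q == (w * x) % q)
```

Because the estimated quotient is at most 1 too small, the result is either the fully reduced product or that product plus q, which is the kind of redundant output that the butterflies above tolerate.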

The effects achieved by the above example 1 include, but are not limited to, the following four points.

First, by introducing redundant representations of values, after all logical branch statements are removed, the value of the output result remains within the range allowed by the instruction set.

Second, there is no additional reduction process during NTT and INTT, and the performance of polynomial computation is improved.

Third, the calculation process is not bound to the Montgomery algorithm, and the representation of the polynomial coefficients can be adjusted freely according to the specific calculation task.

Fourth, it can be used in combination with other NTT and INTT algorithms.

Example 2

When the parameters are large (for example, log₂ n + log₂ q ≥ 60 on a 64-bit CPU), the NTT/INTT needs to be split: the NTT is split by stage and the INTT is split by cross, and the splitting is based on the pre-calculated positions that require reduction processing.

For example, for the NTT calculation on a 64-bit CPU, let the maximum redundancy value allowed by the NTT be 2⁶² (i.e., max = 62). The maximum redundancy value 2⁶² has 63 bits, i.e., the theoretical number of bits of each coefficient during the NTT calculation needs to be less than or equal to 63. Let n = 2¹⁶ and log₂ q = 57, i.e., q has 58 bits. The coefficients of the input polynomial a lie in the interval [0, q-1], i.e., the maximum value of log₂ a_i is 57, so the polynomial coefficients are 58-bit data. The minimum redundancy value is 2q. Through pre-calculation, i.e., by substituting the necessary parameters and the stage number into expression C and expression D, it is found that the NTT invokes the redundancy reduction branch only in the last stage, i.e., the 16th stage, and invokes the redundancy growth branch in the remaining stages. Finally, the polynomial coefficients output by the NTT are kept within 60 bits. Fig. 7 shows the calculation process of the NTT in example 2.

For the INTT calculation on a 64-bit CPU, let the maximum redundancy value allowed by the INTT be 2⁶² (i.e., max = 62), n = 16, and log₂ q = 58, i.e., q has 59 bits. The redundancy multiple of the input polynomial coefficients is 4. Through pre-calculation, reduction is found to be needed 5 times in total, at the following positions:

[9, <2,1,1>] // the 9th cross requires reduction: stage 2, 1st butterfly, 1st cross;

[11, <2,2,1>] // the 11th cross requires reduction: stage 2, 2nd butterfly, 1st cross;

[13, <2,3,1>] // the 13th cross requires reduction: stage 2, 3rd butterfly, 1st cross;

[15, <2,4,1>] // the 15th cross requires reduction: stage 2, 4th butterfly, 1st cross;

[26, <4,1,2>] // the 26th cross requires reduction: stage 4, 1st butterfly, 2nd cross;

Thus, when the INTT algorithm is constructed, only the crosses at the above positions invoke the redundancy reduction cross, and the remaining crosses invoke the redundancy growth cross. The minimum redundancy value may be taken as min = (4 + 16)q = 20q. The coefficients of the polynomial finally output by the INTT are kept within 62 bits, which is within the allowable range.
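The relation between the cross number t and the location <s, b, c> in the list above can be expressed compactly. The numbering scheme used in the following sketch is an assumption inferred from the five listed positions (each stage has n/2 crosses, and stage s groups them into butterflies of 2^(s-1) crosses); the function name locate_cross is hypothetical.

```python
def locate_cross(t, n):
    """Map the 1-based cross number t of an n-point radix-2 INTT to (s, b, c)."""
    stage0, offset = divmod(t - 1, n // 2)       # zero-based stage and offset within it
    crosses_per_butterfly = 1 << stage0          # 2**(s-1) crosses per butterfly
    butterfly0, cross0 = divmod(offset, crosses_per_butterfly)
    return stage0 + 1, butterfly0 + 1, cross0 + 1


for t in (9, 11, 13, 15, 26):
    print(t, locate_cross(t, 16))
```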

The minimum redundancy value is derived as follows. When the redundancy of the input polynomial coefficients is 1 (equivalent to the polynomial coefficients having no redundancy), the minimum redundancy value is n × q. If the redundancy of the input polynomial coefficients is 4, the maximum value of the input polynomial coefficients does not exceed 4q, and setting the minimum redundancy value min to (4 + n)q ensures that min is greater than Y. Fig. 8 shows the INTT calculation process of example 2.
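The role of the minimum redundancy value in the subtraction can be illustrated with a short sketch. The function names min_redundancy and redundant_sub are hypothetical, and the small modulus q = 12289 is chosen only to keep the numbers readable; it is not a parameter of example 2. Because min is a multiple of q, adding it does not change the result modulo q, and because min is at least as large as the subtrahend, the difference never becomes negative, so no sign test is needed.

```python
def min_redundancy(t, n, q):
    """Minimum redundancy value (t + n) * q for redundancy multiple t and dimension n."""
    return (t + n) * q


def redundant_sub(x, y, min_red):
    """Compute x - y + min_red, which is non-negative whenever 0 <= y <= min_red."""
    return x - y + min_red


q = 12289                       # illustrative modulus, not from example 2
min_red = min_redundancy(t=4, n=16, q=q)
x, y = 0, 4 * q - 1             # smallest x and a large 4-fold redundant y
d = redundant_sub(x, y, min_red)
print(min_red == 20 * q, d >= 0, d % q == (x - y) % q)
```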

On the basis of having the same effects as example 1, example 2 relaxes the redundancy multiple of the polynomial coefficient values, so that, within the range allowed by the instruction set, the input data and the output results support a larger redundancy range. It locates exactly where reduction processing occurs and removes unnecessary reduction processing, and the reduction processing is not bound to the Montgomery algorithm. More parameter sets are also available.

Fig. 19 is a schematic structural diagram of a data processing apparatus 800 according to an embodiment of the present application. The apparatus 800 comprises a first determination module 801 and a second determination module 802.

As seen in connection with the method flow shown in fig. 6, the apparatus 800 is provided on the computing device shown in fig. 6, the first determining module 801 is used for executing S201, and the second determining module 802 is used for executing S202.

The embodiment of the apparatus depicted in fig. 19 is merely illustrative, and for example, the above-described division of modules is merely a logical function division, and there may be other manners of division in actual implementation, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not performed. The functional modules in the embodiments of the present application may be integrated into one module, or each module may exist alone physically, or two or more modules may be integrated into one module.

The various modules in data processing apparatus 800 are implemented in whole or in part by software, hardware, firmware, or any combination thereof.

Some possible implementations using hardware or software to implement the various functional modules in the data processing apparatus 800 are described below in connection with the computing device 900 described below.

In the case of a software implementation, for example, the first determination module 801 and the second determination module 802 are implemented by software functional modules generated after the program codes stored in the memory 902 are read by the at least one processor 901 in fig. 20.

In the case of a hardware implementation, for example, each of the above modules in fig. 19 is implemented by different hardware in the computing device. For example, the first determining module 801 is implemented by a portion of the processing resources of the at least one processor 901 in fig. 20 (for example, one core or two cores of a multi-core processor), and the second determining module 802 is implemented by the remaining processing resources of the at least one processor 901 in fig. 20 (for example, the other cores of the multi-core processor), or by a programmable device such as a field-programmable gate array (FPGA) or a coprocessor.

Fig. 20 is a schematic structural diagram of a computing device 900 according to an embodiment of the present application. The computing device 900 is used to perform the method shown in fig. 6. Computing device 900 includes a processor 901, memory 902, and network interface 903.

The processor 901 is, for example, a general-purpose central processing unit (CPU), a network processor (NP), a graphics processing unit (GPU), a neural-network processing unit (NPU), a data processing unit (DPU), a microprocessor, or one or more integrated circuits for implementing the aspects of the present application. For example, the processor 901 includes an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD is, for example, a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

The memory 902 is, for example, but not limited to, a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Optionally, the memory 902 is independent and is coupled to the processor 901 via the internal connection 904. Optionally, the memory 902 and the processor 901 are integrated together.

The network interface 903 uses any transceiver-like device to communicate with other devices or communication networks. The network interface 903 includes, for example, at least one of a wired network interface or a wireless network interface. The wired network interface is, for example, an Ethernet interface. The Ethernet interface is, for example, an optical interface, an electrical interface, or a combination thereof. The wireless network interface is, for example, a wireless local area network (WLAN) interface, a cellular network interface, or a combination thereof.

In some embodiments, processor 901 includes one or more CPUs, such as CPU0 and CPU1 shown in fig. 20.

In some embodiments, computing device 900 optionally includes multiple processors, such as processor 901 and processor 905 shown in fig. 20. Each of these processors is, for example, a single-core processor (single-CPU) or a multi-core processor (multi-CPU). A processor herein may optionally refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).

In some embodiments, computing device 900 also includes internal connections 904. The processor 901, the memory 902 and the at least one network interface 903 are connected by an internal connection 904. The internal connections 904 include pathways to communicate information between the components described above. Optionally, the internal connection 904 is a board or bus. Optionally, the internal connections 904 are divided into address buses, data buses, control buses, etc.

In some embodiments, computing device 900 also includes an input-output interface 906. An input-output interface 906 is connected to the internal connection 904.

In some embodiments, the input-output interface 906 is configured to connect to an input device, and receive commands or data related to the above embodiments, such as modulus, redundancy, polynomial dimensions, and the like, input by a user via the input device. Input devices include, but are not limited to, a keyboard, touch screen, microphone, mouse, or sensing device, among others.

In some embodiments, the input-output interface 906 is also used to connect to an output device. Through the output device, the input-output interface 906 outputs the processing result generated by the processor 901 executing the above method, such as the data after the digital-to-analog conversion. Output devices include, but are not limited to, displays, printers, projectors, and so forth.

Alternatively, the processor 901 implements the method in the above embodiment by reading the program code 910 stored in the memory 902, or the processor 901 implements the method in the above embodiment by internally storing the program code. In the case where the processor 901 implements the method in the above embodiment by reading the program code 910 stored in the memory 902, the program code implementing the method provided in the embodiment of the present application is stored in the memory 902.

As seen in connection with the method shown in fig. 6, in one possible implementation, the processor 901 is configured to instruct the input-output interface 906 or the network interface 903 to perform S201, and the processor 901 is further configured to perform S202 to S204. In another possible implementation, the processor 901 is configured to instruct the input-output interface 906 or the network interface 903 to perform S201, and the processor 901 is further configured to perform S202 to S203. The processor 905 is used to execute S204. For more details on the implementation of the above-mentioned functions by the processor 901, reference is made to the description of the previous method embodiments, which is not repeated here.

In this specification, the embodiments are described in a progressive manner. For identical or similar parts, the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments.

"A refers to B" means that A is the same as B, or that A is a simple variation of B.

The information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals involved in the present application are all authorized by the user or fully authorized by all parties, and the collection, use, and processing of the relevant data comply with the relevant laws, regulations, and standards of the relevant countries and regions. For example, the data to be encrypted and decrypted and the parameters corresponding to the data are all acquired with full authorization.

In the examples herein, unless otherwise indicated, the meaning of "at least one" means one or more and the meaning of "a plurality" means two or more. For example, the plurality of computing units refers to two or more computing units.

The above-described embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), etc.

The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A data processing method, characterized by being executed by a computing device for running a number theoretic transform on data, the number theoretic transform of the data comprising a plurality of computing units, the method comprising:

2. The method of claim 1, wherein the reducing process comprises:

3. The method according to claim 2, wherein the performing redundancy modular multiplication processing on the processing result of the second computing unit includes:

4. The method according to claim 1, wherein the method further comprises:

5. The method of any of claims 1-4, wherein the parameters include a modulus used when each of the plurality of computing units performs a modulo operation, a redundancy of the data relative to the modulus, and a polynomial dimension of the data.

6. The method of any of claims 1-5, wherein the preset number of bits is determined based on a number of bits of a processor in the computing device, the preset number of bits being 1 or 2 less than the number of bits of the processor.

7. The method according to any one of claims 1 to 6, wherein each of the plurality of computing units is further configured to perform subtraction processing based on a redundancy value, the redundancy value being a value greater than or equal to the subtrahend in the subtraction processing.

8. The method of claim 7, wherein the number theoretic transform comprises a forward number theoretic transform, the redundancy value being equal to 2q, the q representing a modulus used when each of the plurality of computing units performs a modulo operation, the q being a positive integer.

9. The method of claim 7, wherein the number theoretic transform comprises an inverse number theoretic transform, the redundancy value being equal to (t+n)q, the q representing a modulus used when each of the plurality of computing units performs a modulo operation, the t representing a redundancy multiple of the data relative to the modulus, the n representing a polynomial dimension of the data, the t, the n, and the q being positive integers.

10. The method of any one of claims 1 to 9, wherein each of the plurality of computing units is configured to process based on k data to produce k processing results, where k is a positive integer.

11. A data processing apparatus, characterized in that it is provided in a computing device for running a number theoretic transform on data, the number theoretic transform of the data comprising a plurality of computing units, the apparatus comprising:

a first determining module configured to determine an estimated number of bits of the processing result generated by each of the computing units based on a parameter of the data, the parameter indicating the number of bits of the data;

a second determining module configured to determine, from the plurality of computing units, a first computing unit based on the estimated number of bits, where the first computing unit is a computing unit for performing reduction processing on a processing result of a second computing unit, and the estimated number of bits of the processing result of the second computing unit meets a preset number of bits.

12. The apparatus of claim 11, wherein the first computing unit is configured to perform redundancy modular multiplication processing on a processing result of the second computing unit.

13. The apparatus according to claim 12, wherein the first computing unit is configured to perform redundant modular multiplication processing on a processing result of the second computing unit based on a twiddle factor having the same representation form as the data.

14. The apparatus of claim 11, wherein the apparatus further comprises: a processing module configured to perform encryption processing or decryption processing on the processing result of the second computing unit after the reduction processing.

15. The apparatus of any of claims 11 to 14, wherein the parameters include a modulus used when each of the plurality of computing units performs a modulo operation, a redundancy of the data relative to the modulus, and a polynomial dimension of the data.

16. The apparatus of any of claims 11 to 15, wherein the preset number of bits is determined based on a number of bits of a processor in the computing device, the preset number of bits being 1 or 2 less than the number of bits of the processor.

17. The apparatus according to any one of claims 11 to 16, wherein each of the plurality of computing units is further configured to perform subtraction processing based on a redundancy value, the redundancy value being a value greater than or equal to the subtrahend in the subtraction processing.

18. The apparatus of claim 17, wherein the number theoretic transform comprises a forward number theoretic transform, the redundancy value being equal to 2q, the q representing a modulus used when each of the plurality of computing units performs a modulo operation, the q being a positive integer.

19. The apparatus of claim 17, wherein the number theoretic transform comprises an inverse number theoretic transform, the redundancy value being equal to (t+n)q, the q representing a modulus used when each of the plurality of computing units performs a modulo operation, the t representing a redundancy multiple of the data relative to the modulus, the n representing a polynomial dimension of the data, the t, the n, and the q being positive integers.

20. The apparatus of any one of claims 11 to 19, wherein each of the plurality of computing units is configured to process based on k data to produce k processing results, where k is a positive integer.

21. A computing device, the computing device comprising: a processor coupled with a memory having stored therein at least one computer program instruction that is loaded and executed by the processor to cause the computing device to implement the method of any of claims 1-10.

22. A computer readable storage medium, characterized in that at least one instruction is stored in the storage medium, which instructions, when run on a computer, cause the computer to perform the method according to any of claims 1-10.

23. A computer program product comprising one or more computer program instructions which, when loaded and run by a computer, cause the computer to perform the method of any of claims 1-10.