WO2006120691A1

WO2006120691A1 - Galois field arithmetic unit for error detection and correction in processors

Info

Publication number: WO2006120691A1
Application number: PCT/IN2005/000150
Authority: WO
Inventors: Sourav Roy
Original assignee: Analog Devices Inc.
Priority date: 2005-05-06
Filing date: 2005-05-06
Publication date: 2006-11-16

Abstract

A GFU that performs multiply-accumulate operation for error detection and correction in processors by using sub-word-parallelism (SWP) to enhance system performance. The GFU performs error detection through parallel computation of Cyclic Redundancy Checks (CRC). The CRC for a message is computed using i bits at a time, wherein i is less than or equal to a degree of the generator polynomial. The GFU also performs error correction employing Reed-Solomon codes.

Description

GALOIS FIELD ARITHMETIC UNIT FOR ERROR DETECTION AND CORRECTION IN PROCESSORS

Field of the Invention

[0001] This invention relates to error control coding in electronic and communication systems and more specifically to a method and apparatus for a Galois field arithmetic unit (GFU) for error detection and correction in such systems.

Background of the Invention

[0002] As is known, communication systems include a plurality of communication devices (e.g., modems, cable modems, personal computers, laptops, cellular telephones, radios, telephones, facsimile machines, and so on) that communicate directly (i.e., point-to-point) or indirectly via communication system infrastructure (e.g., wire line channels, wireless channels, bridges, switches, routers, gateways, servers, and so on). As is also well known, a communication system may include one or more local area networks and/or one or more wide area networks to support at least one of the Internet, cable services (e.g., modem functionality and television), wireless communications systems (e.g., radio, cellular telephones), satellite services, wire line telephone services, digital television, and so on.

[0003] In any type of communication system, information (e.g., voice, audio, video, text, data, and so on) is transmitted from one communication device to another via the infrastructure. Accordingly, the transmitting communication device prepares the information for transmission to the other device and provides the prepared information to the infrastructure for direct or indirect routing to the receiving communication device. Once the information is received, the receiving communication device reverses the processing steps used by the transmitting communication device to recapture the original information.

[0004] As is further known, transmission of information between communication devices is not performed under an ideal environment where the received information exactly matches the transmitted information. In practice, the infrastructure can introduce errors, which can distort the transmitted information such that the received information does not exactly match the transmitted information. To compensate for the errors introduced by the infrastructure, the transmitting communication device includes an encoder, which adds redundancy to the original data to make the original data more unique, and the receiving communication device includes a corresponding decoder, which uses the redundancy information to recover the original data from the received data that includes transmission errors.

[0005] In general, the encoder and decoder employ an error detection and correction technique to reduce the adverse effects of transmission errors. One particular type of error detection technique is called cyclic redundancy checking (CRC). Various forms of CRC are employed in the communication and consumer electronics arena. For example, a 16 bit CRC is employed in MPEG audio standards, whereas a 32 bit CRC is employed in Ethernet protocols. CRC involves generating redundancy bits by partitioning the bit stream of the original data into blocks of data. The blocks of data are processed sequentially, with the data from each block being divided by a polynomial. The remainder from the division process becomes the redundancy bits, which are appended to, and transmitted with, the block of data from which they were generated. The decoder, upon receiving a block of data, divides the block of data and the appended redundancy bits by the same polynomial. If the remainder of this division is zero, there are no errors in the received block of data. If, however, there is a remainder, an error exists. For CRC, when an error exists in the block of data, the decoder typically requests retransmission of the block of data.

[0006] Though such serial computation of CRC with linear feedback shift registers is used in hardwired circuits, parallel computation is much more efficient, especially in software implementations. Currently, there are various approaches to performing such CRC computations in parallel. One such technique proposes an empirical approach to byte-wise parallel CRC calculation, which uses LFSR contents after every eight shifts. Another such technique uses parallel CRC encoders based on digital system theory and Z-transforms. Generally, such techniques require application specific circuits (ASICs), which can be expensive and can consume large silicon area. Yet another technique employs GF arithmetic to compute parallel CRC. In this technique, the number of bits processed in parallel is equal to m, which is the degree of a generator polynomial. Using this technique for a large value of m (for example, the value of m being 32 or higher), requires an (m bit x m bit) MAC (multiply and accumulate) architecture. To accommodate such a large MAC architecture the ASIC can require a large silicon area and can consume significant amount of processor time. This is generally not desirable for use in processors, especially in digital signal processors (DSPs).

[0007] As is known, there are a number of popular error correction techniques. One such technique, that is widely used, is generally known as forward error correction (FEC). The FEC involves an encoder generating error correction data as a function of the data to be sent and then transmitting the error correction data along with the data. A decoder within the receiving communication device utilizes the error correction data to identify any errors in the original data that may have occurred during transmission. A popular FEC algorithm is called Reed Solomon (RS) encoding and decoding. Like CRC, RS partitions a data stream into sequential blocks of data and then divides a block of data by a polynomial to obtain parity, or check data. However, RS operates on a byte stream rather than a bit stream, so it creates check bytes, which are appended to each block of data. The decoding process at the receiver is considerably more complex than that of the CRC algorithm. First, a set of syndromes is calculated. If the calculated syndromes have a zero value, the received block of data is then deemed to have no errors. If one or more of the calculated syndromes are not zero, then the existence of one or more errors is indicated. The non-zero values of the syndrome are then used to determine the location of the errors and, from there, correct values of data can be determined to correct the errors.

[0008] Generally, the syndromes are computed based on Horner's Rule, using GF MAC operations. Finding error locations in the code word and their corresponding magnitudes is achieved by computing the error locator and evaluator polynomials by the Euclidean or Berlekamp-Massey algorithm. The roots of the error locator are calculated using the Chien Search method, which employs constant GF multiplications. Finally, the error values are found using the Forney algorithm. This step requires a GF inversion operation, which is generally performed with a look-up table. RS codes require an (m bit x m bit) multiplication. Generally, the value of m that is used is not more than eight bits for RS codes.

[0009] Though Reed Solomon encoding/decoding requires 8 bit GF multiplications, CRC computation requires using larger values for m (for example, in the neighborhood of about 32 or more). Therefore, to use large values of m in CRC computation the MAC architecture can become significantly large, which can result in requiring a larger silicon area for the processors. Further, using such large silicon area can significantly lower the performance of the processors. Furthermore, the current MAC architectures can either perform error detection or data correction but not both.

Summary of the Invention

A Galois field arithmetic unit (GFU) to perform multiply-accumulate operation to calculate CRC as well as Reed-Solomon encoding/decoding. The GFU of the present invention uses sub-word-parallelism to enhance the system performance. According to an aspect of the present invention, there is provided a method for performing a cyclic redundancy check (CRC), the method including the steps of receiving a message of length n bits, partitioning the n bits into one or more blocks, wherein each block has i input bits such that n=k*i and i is less than or equal to m, wherein m is the degree of the generator polynomial used to compute the CRC, and computing a CRC value for the received message of n bits using the one or more blocks.

Brief description of the Drawing

[0010] FIG. 1 is a block diagram of a digital signal processor according to an embodiment of the present invention.

[0011] FIG. 2 illustrates a flowchart of an example embodiment of a method for calculating CRC to be implemented using the GFU shown in FIG 2.

[0012] FIG. 3 is a schematic diagram of a GF multiplier according to an embodiment of the present invention.

[0013] FIG. 4 is a schematic diagram of a sub-cell array of the GF array, shown in FIG. 3, according to an embodiment of the present invention.

Description of Preferred Embodiments

[0014] In the following detailed description of the preferred embodiments, reference is made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments in which the invention may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.

[0015] The leading digit(s) of reference numbers appearing in the Figures generally corresponds to the Figure number in which that component is first introduced, such that the same reference number is used throughout to refer to an identical component which appears in multiple Figures. The same reference number or label may refer to signals and connections, and the actual meaning will be clear from its use in the context of the description.

[0016] FIG. 1 illustrates an example block diagram of a digital signal processor (DSP) 40. The DSP 40 shown in FIG. 1 is used to perform an encoding and decoding operation for the Reed-Solomon (RS) codes. As shown in FIG. 1, the DSP 40 has a processing circuit 42 and a Galois field multiplier unit (GFU) 44. Further as shown in FIG. 1, the processing circuit 42 has a processing module 46, a controller 48, and an input/output port 50. The input/output port 50 receives input data from an input terminal 52 of the DSP 40. The controller 48 transmits the input data to the processing module 46. The processing module 46 is used to perform the GF addition on the input data. However, when GF multiplication is required on the input data, the controller 48 transmits the input data to the GFU 44 via the input/output port 50. After the GFU 44 finishes processing the input data, the input data will be transmitted back to the processing module 46 for the following operations. In the end, the calculation result is outputted to an output terminal 54 of the DSP 40 via the input/output port 50. The GF addition is equivalent to an XOR logic operation.

[0017] Referring now to FIG. 2, there is illustrated an example method 200 of performing a CRC. At step 210, this example method 200 begins by receiving a message of n bits.

[0018] The algorithm proposed in this example embodiment enables parallel computation of CRC-m (i.e. the degree of the primitive polynomial is m) using chunks of i message bits at a time, where i is less than or equal to m. Processing fewer than m message bits in parallel can significantly reduce the required processor core area. If multiple MAC units are used in the processor, the silicon area saving can also be significant. Moreover, from a timing point of view, a true single-cycle MAC can be easily accomplished to eliminate data dependency stalls in the processor pipeline and hence can significantly improve system performance.

[0019] The following illustrates the serial computation of the CRC for a received message A(x) of length n bits, which is denoted by the polynomial in x,

A(x) = a_{n-1}x^{n-1} + a_{n-2}x^{n-2} + ... + a_1x + a_0, wherein a_i ∈ {0, 1}

[0020] The generator polynomial of degree m can be denoted as

P(x) = x^m + p_{m-1}x^{m-1} + ... + p_1x + p_0, where p_i ∈ {0, 1}

[0021] Then the serial Cyclic Redundancy Check (CRC) is computed using the equation,

CRC[A(x)] = [A(x)x^m] mod P(x) (1)
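Equation (1) can be illustrated with a minimal bit-serial sketch. The function name `crc_serial`, the MSB-first bit list, and the convention of passing P(x) without its leading x^m term (`poly_low`) are assumptions made for this illustration only; they are not taken from the specification.

```python
def crc_serial(message_bits, poly_low, m):
    """Return [A(x) * x^m] mod P(x) for a message given MSB first.

    poly_low holds P(x) without its leading x^m term (an assumed
    storage convention for this sketch).
    """
    mask = (1 << m) - 1
    reg = 0  # holds (bits consumed so far) * x^m mod P(x)
    for bit in message_bits:
        # Shifting multiplies the register by x; the incoming bit adds
        # bit * x^m. Both contribute x^m, which reduces to poly_low.
        feedback = ((reg >> (m - 1)) & 1) ^ bit
        reg = (reg << 1) & mask
        if feedback:
            reg ^= poly_low
    return reg


# Example with the small generator P(x) = x^3 + x + 1.
message = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0]
print(bin(crc_serial(message, 0b011, 3)))
```

Each loop iteration corresponds to one shift of a linear feedback shift register, which is why the serial form needs n iterations for an n bit message.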

[0022] At step 220, the received message A(x) including n bits is first divided into blocks of i bits, where i is less than or equal to m, such that the length of the message is a multiple of i. Otherwise, the necessary number of zeros is inserted before the most significant bit (MSB) of the message A(x) to make its length a multiple of i, wherein n = k*i and k is an integer. Then A(x) is expressed as follows:

A(x) = (a_{ki-1}x^{i-1} + ... + a_{(k-1)i+1}x + a_{(k-1)i})x^{(k-1)i}

+ ...

+ (a_{2i-1}x^{i-1} + ... + a_{i+1}x + a_i)x^i

+ (a_{i-1}x^{i-1} + ... + a_1x + a_0)

= W_{k-1}(x)x^{(k-1)i} + ... + W_1(x)x^i + W_0(x),

[0023] where each block of i message bits, denoted by W_j(x), is a polynomial of degree (i - 1), for all j. Applying equation (1), the CRC of A(x) is given by,

CRC[A(x)] = [W_{k-1}(x)x^{(k-1)i+m}] mod P(x) + ... + [W_1(x)x^{i+m}] mod P(x) + [W_0(x)x^m] mod P(x) (2)

[0024] Equation (2) is further reduced using the following property of two polynomials F_1(x) and F_2(x) represented in quotient-remainder form when divided by P(x).

F_1(x) = Q_1(x)P(x) + R_1(x)

F_2(x) = Q_2(x)P(x) + R_2(x)

[F_1(x)F_2(x)] mod P(x) = [R_1(x)R_2(x)] mod P(x).

[0025] Further, the degree of each of the message blocks W_j(x) in equation (2) is (i - 1), which is less than the degree of P(x).

W_j(x) mod P(x) = W_j(x), ∀ j ∈ {0, 1, ..., k - 1}

[0026] Using the above equations, each product term in equation (2) can be expressed as,

[W_j(x)x^{ji+m}] mod P(x) = [W_j(x)[x^{ji+m} mod P(x)]] mod P(x) (3)

[0027] Consider the extension field of m bits, denoted as GF(2^m), whose primitive generator polynomial is P(x), wherein α is the root of P(x), i.e. the primitive element of the field. Then x^{ji+m} mod P(x) is equivalent to α^{ji+m} when a canonical or standard basis is used to represent the field. Again, since W_j(x) is a polynomial of degree (i - 1), we expand it to degree (m - 1) by filling the MSB positions with zeros.

W_j(x) = 0x^{m-1} + ... + 0x^i + a_{(j+1)i-1}x^{i-1} + ... + a_{ji+1}x + a_{ji}.

[0028] Using the above relations in equations (2) and (3), the CRC of A(x) can be computed as a multiply-accumulate in GF(2^m) as follows:

CRC[A(x)] = (W_{k-1}(x) α^{(k-1)i+m}) + ... + (W_1(x) α^{m+i}) + (W_0(x) α^m)

[0029] Therefore the value of the CRC for the received message A(x) can be computed using the equation,

CRC[A(x)] = Σ_{j=0}^{k-1} W_j(x) · α^{m+ji} (4)

[0030] At step 230, the value of k and the intermediate CRC value are initialized to 0. At step 240, a current block of i input bits is multiplied with a GF coefficient to obtain a current multiplied CRC value. The GF coefficient is a power of a primitive element α in the finite field GF(2^m). At step 250, the current multiplied CRC value is added to a previously obtained intermediate CRC value associated with a block of i inputs to obtain a new intermediate CRC value. At step 260, the value of k is incremented by a predetermined value. In some embodiments, the value of k is incremented by 1. At step 270, the method 200 determines whether k = n/i. If k is not equal to n/i, the method 200 returns to step 240 and repeats steps 240-270. If k = n/i, the method goes to step 280 and outputs the new intermediate CRC value as the CRC value.

[0031] The summation and product symbols in equation (4) represent GF operations. In these embodiments, the CRC can be computed in parallel using chunks of i input bits of the message at a time. The i bit input word is multiplied with a coefficient, which is a power of the primitive element α in GF(2^m), and the product is then added to the previously accumulated result. The multiplication and addition operations are repeated for k iterations to get the final CRC value, as described above with reference to FIG. 2.
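The multiply-accumulate form of the CRC can be sketched in software as follows. This is an illustrative model under stated assumptions, not the hardware of the embodiment: the names `gf_mul` and `crc_parallel` are hypothetical, P(x) is passed without its leading term as `poly_low`, α is represented by the standard-basis element x (the integer 2, so m is assumed to be at least 2), and the coefficients α^{m+ji} are generated on the fly rather than read from a precomputed table.

```python
def gf_mul(a, b, poly_low, m):
    """Multiply a and b modulo P(x) in a standard basis (LSB-first)."""
    y, mask = 0, (1 << m) - 1
    for k in range(m):
        if (b >> k) & 1:
            y ^= a
        top = (a >> (m - 1)) & 1
        a = ((a << 1) & mask) ^ (poly_low if top else 0)
    return y


def crc_parallel(message_bits, poly_low, m, i):
    """CRC as the GF sum over j of W_j * alpha^(m + j*i)."""
    pad = (-len(message_bits)) % i
    bits = [0] * pad + list(message_bits)  # zeros inserted before the MSB
    k = len(bits) // i

    step = 1                               # becomes alpha^i
    for _ in range(i):
        step = gf_mul(step, 2, poly_low, m)

    coeff = poly_low                       # alpha^m equals P(x) minus x^m
    crc = 0
    for j in range(k):                     # j = 0 is the least significant block
        chunk = bits[len(bits) - (j + 1) * i: len(bits) - j * i]
        w_j = int("".join(str(b) for b in chunk), 2)
        crc ^= gf_mul(coeff, w_j, poly_low, m)    # one GF MAC per block
        coeff = gf_mul(coeff, step, poly_low, m)  # next coefficient
    return crc


message = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0]
print(bin(crc_parallel(message, 0b011, 3, 2)))
```

With P(x) = x^3 + x + 1, the same CRC is obtained for every block size i from 1 to 3, which reflects the point that i only changes how many message bits are consumed per MAC, not the result.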

[0032] The coefficients are constant for a particular field dimension m and a primitive polynomial P(x), and are computed prior to the MAC operations. The total number of coefficients that need to be stored in the memory of a processor is given by n/i = k. But if the message length is large, the field elements wrap around, the maximum number of coefficients being 2^m/i = k'. Hence the number of coefficients is given by min(k, k'). The important point to note here is that even though the multiplication and addition are performed in GF(2^m), one of the operands, W_j(x), can be treated as an i bit number in GF(2^m), since the higher bit positions are all zeros. Hence, unlike in RS CODECs, parallel CRC can be calculated with an (m bit x i bit) standard basis MAC structure in GF(2^m), where i < m. For a large m like 32, an (m bit x m bit) MAC structure for CRC can be expensive in terms of silicon area and speed, particularly when multiple such compute blocks are used in a general purpose processor, such as a DSP. Using i equal to 8 facilitates byte-wise parallel CRC computation.

[0033] FIG. 3 illustrates an example GF multiplier 300 including a MAC architecture implementing the parallel CRC computation scheme described above. As shown in FIG. 3, the GF multiplier 300 includes a pre-shift stage 310, a GF multiplier array 320, and a post-shift-add stage 330. Also as shown in FIG. 3, the GF multiplier array 320 includes a plurality of sub-cell matrices in the sub-cell array 340 that are arranged in one or more rows and columns.

[0034] Reed Solomon (RS) codes are constructed and decoded through the use of GF arithmetic. Encoding is done through polynomial division, which employs a linear feedback shift register (LFSR) structure like in the serial CRC computation, but operates on words of m bits in GF (2^m), instead of individual bits. The above-described RS encoder can be easily implemented in software with the aid of a GF MAC unit. Generally, the decoding of RS codes consists of the following two steps:

1. Syndrome Evaluation - A non-zero syndrome signifies an error in the received code word. The syndromes are calculated based on Horner's Rule, using GF MAC operations.

2. Finding error locations in the code word and their corresponding magnitudes - This is achieved by computing the error locator and evaluator polynomials by the Euclidean or Berlekamp-Massey algorithm. The roots of the error locator polynomial are calculated using the Chien Search method, which employs constant GF multiplications. Finally the error values are found using the Forney algorithm. This step requires a GF inversion operation, which is generally performed with a look-up table. Hardware support for inversion is expensive, and is generally not required as it does not affect the overall performance of the CODEC. Unlike CRC, RS CODECs must use an (m bit x m bit) MAC unit. This is generally fine from a hardware point of view, because the field dimension m for almost all of the practical RS codes in error-control coding is not greater than 8.

[0035] The following illustrates various multiplier architectures generally used for CRC computation. The finite field addition is performed using a bitwise XOR of the two operands. As described above, parallel CRC computation requires standard basis multipliers.
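As an illustration of the syndrome evaluation in step 1 of the decoding procedure, the following minimal sketch evaluates the received polynomial by Horner's Rule, using one GF multiply-accumulate per received symbol. Several choices here are assumptions for illustration only: the field GF(2^8) with P(x) = x^8 + x^4 + x^3 + x^2 + 1, the element α = 0x02, the syndrome roots α^1 ... α^count, and the names `gf256_mul` and `syndromes`.

```python
def gf256_mul(a, b, poly_low=0x1D):
    """Multiply in GF(2^8); poly_low is P(x) without the x^8 term."""
    y = 0
    for k in range(8):
        if (b >> k) & 1:
            y ^= a
        top = a & 0x80
        a = (a << 1) & 0xFF
        if top:
            a ^= poly_low
    return y


def syndromes(received, count):
    """Return S_j = R(alpha^j) for j = 1..count using Horner's Rule.

    `received` lists the code word symbols, highest degree first.
    """
    result = []
    root = 1
    for _ in range(count):
        root = gf256_mul(root, 2)           # next power of alpha
        acc = 0
        for symbol in received:
            acc = gf256_mul(acc, root) ^ symbol   # acc = acc * root + symbol
        result.append(acc)
    return result


# g(x) = (x + alpha)(x + alpha^2) = x^2 + 6x + 8 is itself a code word.
print(syndromes([1, 6, 8], 2))   # error-free word
print(syndromes([1, 6, 9], 2))   # same word with the last symbol corrupted
```

All-zero syndromes indicate an error-free word, while the corrupted word yields non-zero syndromes that the later decoding steps would use to locate and correct the error.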

[0036] Assume that Y(x) denotes the product of A(x) and B(x), where A, B, Y ∈ GF(2^m). Further assume that P(x) = x^m + p_{m-1}x^{m-1} + ... + p_1x + p_0 denotes the primitive polynomial of the field, with α being its root. Hence A(x) and B(x) can be represented as polynomials in α as follows:

A = a_{m-1}α^{m-1} + a_{m-2}α^{m-2} + ... + a_1α + a_0

B = b_{m-1}α^{m-1} + b_{m-2}α^{m-2} + ... + b_1α + b_0

[0037] Wherein a_i, b_i ∈ GF(2) = {0, 1}. There are generally two types of array multipliers, depending on the order in which the multiplier bits are processed, viz. least significant bit (LSB)-first and MSB-first multipliers. The MSB-first multiplier has a longer critical path of m - 1 and requires more XOR gates. Hence the LSB-first multiplier is superior as it reduces the computation delay considerably, without adding any extra hardware. Apart from array multipliers, there are parallel irregular multipliers, which perform the polynomial multiplication and the degree reduction separately. The silicon area and the delay of this multiplier are similar to that of the LSB-first multiplier, though it can potentially provide lower power dissipation.

[0038] The GF multiplier array 320 shown in FIG. 3 is based on the LSB-first array multiplier to exploit the regularity of the array structure. In the LSB-first multiplier, multiplication starts with the least significant bit of the multiplier B as outlined below.

Y(x) = A(x)B(x) mod P(x)

Y(x) = b_0A + b_1[Aα mod P(x)] + b_2[Aα^2 mod P(x)] + ... + b_{m-1}[Aα^{m-1} mod P(x)]

[0039] Two intermediate polynomials, V(x) and W(x), of degree (m - 1) are introduced to describe the basic computation steps involved in each iteration. In the k-th iteration, for 1 ≤ k ≤ m, the following computations are performed in parallel:

V^{(k)} = [V^{(k-1)}α] mod P(x)

W^{(k)} = V^{(k-1)}b_{k-1} + W^{(k-1)}

[0040] Wherein W^{(0)} = 0 and V^{(0)} = A. Since α is a root of P(x), P(α) = 0.

Therefore, α^m = p_{m-1}α^{m-1} + ... + p_1α + p_0 (5)

Again,

V^{(1)} = [V^{(0)}α] mod P(x)

= (a_{m-1}α^m + ... + a_1α^2 + a_0α) mod P(x) (5a)

[0041] Using equation (5) in (5a), the relation for each coefficient of the polynomial V(x) in the first iteration can be deduced as,

v^{(1)}_i = a_{i-1} + a_{m-1}p_i, ∀ i ∈ {0, 1, ..., m - 1}

[0042] In the above equation, a_{-1} = 0. The following outlines the basic computation steps performed in each iteration in the LSB-first multiplier.

[0043] For all i ∈ {0, 1, ..., m - 1},

v^{(k)}_i = v^{(k-1)}_{i-1} + v^{(k-1)}_{m-1}p_i (6)

w^{(k)}_i = v^{(k-1)}_ib_{k-1} + w^{(k-1)}_i (7)

[0044] The above computation is repeated for m iterations to compute the final product given by W^{(m)}. In these embodiments, computing V^{(m)} is not necessary. The multiplier can be converted to a multiply-accumulate Y = A B + C by making W^{(0)} = C.

[0045] The critical path in a GF(2^m) MAC unit comprises m XOR + m AND gates. The path starts from A_{m-1} and ends at Y_{m-1}. The AND-XOR combination may be replaced by NAND-XNOR for higher speed. A basic cell consisting of a pair of NAND-XNOR logic circuits is repeated m^2 times in a two-dimensional array structure to build the GF multiplier.
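The LSB-first iteration described above can be modelled word-wise in a few lines. This is a behavioural sketch only, assuming the V and W recurrences given for the k-th iteration with W initialised to the summand C; the name `gf_mac` and the `poly_low` convention (P(x) without its x^m term) are hypothetical and chosen for illustration.

```python
def gf_mac(a, b, c, poly_low, m):
    """Return a*b + c in GF(2^m) using the LSB-first V/W iteration."""
    mask = (1 << m) - 1
    v, w = a, c                        # V(0) = A, W(0) = C
    for k in range(1, m + 1):
        b_bit = (b >> (k - 1)) & 1
        # W(k) = V(k-1) * b_{k-1} + W(k-1); GF addition is XOR.
        w ^= v if b_bit else 0
        # V(k) = V(k-1) * alpha mod P(x): shift, then fold x^m back in.
        top = (v >> (m - 1)) & 1
        v = ((v << 1) & mask) ^ (poly_low if top else 0)
    return w


# The last update of v is the V(m) value that the text notes is not needed.
print(hex(gf_mac(0x57, 0x83, 0x00, 0x1B, 8)))
```

Each pass of the loop corresponds to one row of the two-dimensional cell array, so a hardware array evaluates all m rows combinationally rather than sequentially.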

[0046] For a processor, such as a DSP, it is not enough to build a multiplier with a programmable primitive polynomial; it should also be programmable with respect to the field dimension m. With appropriate pre-shift and post-shift, the multiplier architecture can be extended to perform multiplications over GF(2^{m'}), where m' < m. The output Y(x) of degree (m' - 1) cannot be computed directly with the GF(2^m) multiplier. But when it is extended to degree (m - 1) as Y(x)x^{m-m'}, it can be calculated as follows:

Y(x) = A(x) B(x) mod P(x)

i.e., A(x) B(x) = Q(x) P(x) + Y(x)

[0047] Multiplying both sides of the above equation by x^{m-m'} yields the following equation,

[A(x)x^{m-m'}]B(x) = Q(x)[P(x)x^{m-m'}] + Y(x)x^{m-m'}

[0048] The above relation shows that if one of the input operands and the primitive polynomial are left-justified by shifting by m - m' bit positions, then the product in GF(2^{m'}) also appears in left-justified format. The product is then right-shifted back by m - m' bit positions. For a MAC operation, the summand C(x) also needs to be left-shifted like A(x) and P(x). But if silicon area is a concern, then C(x) can be added to the product finally after the right-shift operation. However, this can increase the critical path of the MAC structure by a further XOR gate delay.
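The left-justification argument can be checked with a small software model of a fixed-width multiplier that serves a smaller field. The sketch below is an assumption-laden illustration: the function name `gf_mul_left_justified`, the default 8 bit width, and the `poly_low` convention (P(x) stored without its leading term) are hypothetical choices, and the summand is omitted.

```python
def gf_mul_left_justified(a, b, poly_low, m_small, width=8):
    """Multiply in GF(2^m_small) on a multiplier that is `width` bits wide."""
    shift = width - m_small
    mask = (1 << width) - 1
    v = a << shift            # pre-shift: left-justify A(x)
    p = poly_low << shift     # pre-shift: left-justify P(x)
    w = 0
    for k in range(width):    # the array always runs `width` iterations
        if (b >> k) & 1:      # upper bits of b are zero, so they add nothing
            w ^= v
        top = (v >> (width - 1)) & 1
        v = ((v << 1) & mask) ^ (p if top else 0)
    return w >> shift         # post-shift: right-justify the product


# GF(2^3) with P(x) = x^3 + x + 1 on an 8 bit wide multiplier.
print(gf_mul_left_justified(5, 3, 0b011, 3))
```

Because both A(x) and P(x) carry the same factor x^{m-m'}, the low `shift` bits of every intermediate value stay zero, which is why a plain right shift recovers the product.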

[0049] The GFU described above is a multiplier which is programmable with respect to both the primitive polynomial and the field dimension. As seen earlier, RS codes require (m bit x m bit) multiplication where 1 ≤ m ≤ 8. But CRC requires up to 32 bit multipliers. However, as shown earlier, CRC can be achieved by (m bit x i bit) multiplication where i < m. Hence an array of (32 bit x 8 bit) can support both applications. The register size of a processor, such as a DSP, is generally 32 bits. The architecture of the DSP needs to support packed arithmetic or sub-word-parallelism (SWP), so that a 32 bit register can be accessed as quad 8 bit fields. To improve the performance of the MAC in a SWP architecture, the MAC shown in FIG. 3 supports the following modes in GF(2^m):

1. quad ((m bit x m bit) + m bit), for 1 ≤ m ≤ 8

2. dual ((m bit x i bit) + m bit), for 8 < m ≤ 16, 1 ≤ i ≤ 8

3. single ((m bit x i bit) + m bit), for 16 < m ≤ 32, 1 ≤ i ≤ 8.

[0050] The first mode is suitable for RS coding/decoding, where four parallel MAC operations can be performed in a single cycle. The last two are suitable for CRC computation by the algorithm described earlier, using chunks of 8 message bits or less at a time. The MAC unit shown in FIG. 3 includes three sub-units, namely the pre-shift stage 310, the GF multiplier array 320, and the post-shift-add stage 330. The addition is performed last to reduce silicon area; otherwise, a further pre-shift of the summand may be required.

[0051] The multiplier structure automatically configures itself based on the field dimension m. The input operands A(x), B(x), C(x), P(x) and the output operand Y(x) are all stored in 32 bit registers. The GF multiplier array is of the form (32 bit x 8 bit). The 32 bit multiplier A(x) is directly fed to the multiply unit. Since B(x) is 32 bits, but the GF multiplier array only handles 8 bits, appropriate bytes are chosen from B(x) for the multiplication. The data packing in the various modes of multiplication, i.e. quad, dual and single multiplication, is shown in the table below. P(x) is stored in a register occupying m bits. The coefficient for x^m, which is always unity, is not stored. The following illustrates the automatic configuration of the (32 bit x 8 bit) GF multiplier array.

[Table: data packing of the 32 bit operands in the quad (A3 A2 A1 A0), dual (A1 A0) and single (A0) multiplication modes]

[0052] The pre-shift stage 310 left-justifies A(x) and P(x) to the nearest byte boundary, to generate A'(x) and P'(x), respectively. For example, if m = 14, they are left-shifted by 2 bit positions, to align with the 16 bit boundary.

A'(x) = A(x)x^{(32-m) mod 8}; P'(x) = P(x)x^{(32-m) mod 8}

[0053] Further, the contents of B(x) and P'(x) are chosen appropriately as shown in the above table, depending on the value of m. To achieve the above, the 32 bit operand B(x) is divided into groups of 8 bits as follows:

B(x) = (b_{31}x^{31} + ... + b_{24}x^{24}) + (b_{23}x^{23} + ... + b_{16}x^{16}) + (b_{15}x^{15} + ... + b_8x^8) + (b_7x^7 + ... + b_0) = B_3(x) + B_2(x) + B_1(x) + B_0(x)

[0054] Similarly,

P'(x) = P'_3(x) + P'_2(x) + P'_1(x) + P'_0(x)

[0055] The four control variables depending on m are defined as follows.

δ_0 = 1, if 1 ≤ m ≤ 8; = 0, otherwise;

δ_1 = 1, if 8 < m ≤ 16; = 0, otherwise;

δ_2 = 1, if 16 < m ≤ 24; = 0, otherwise;

δ_3 = 1, if 24 < m ≤ 32; = 0, otherwise;

[0056] Then the contents of B'(x) and P''(x) are chosen as follows:

B'(x) = δ_3[B_0(x)x^{24} + B_0(x)x^{16} + B_0(x)x^8 + B_0(x)]

+ δ_2[B_0(x)x^{16} + B_0(x)x^8 + B_0(x)]

+ δ_1[B_2(x)x^8 + B_2(x) + B_0(x)x^8 + B_0(x)]

+ δ_0B(x).

P''(x) = δ_0[P'_0(x)x^{24} + P'_0(x)x^{16} + P'_0(x)x^8 + P'_0(x)] + δ_1[P'_1(x)x^{16} + P'_0(x)x^{16} + P'_1(x) + P'_0(x)] + δ_2P'(x) + δ_3P'(x).

[0057] In this embodiment, the outputs of the pre-shift sub-block are A'(x), B'(x) and P''(x). These are then passed on to the GF multiplier array of size (32 bit x 8 bit). The GF multiplier array 320 multiplies the inputs A'(x) and B'(x) according to the field parameters m and P(x). The intermediate polynomials V(x) and W(x) described above will have a degree of 31. The computation stage or iteration number is denoted by k, ranging from about 1 to 8. The initial values for the inputs to the GF multiplier array are given by,

V^{(0)}(x) = A'(x), W^{(0)}(x) = 0

[0058] The above outlined equation (6) is modified by introducing two new polynomials u(x) and l(x). Then, for all bit positions i ∈ {0, 1, ..., 31}, and for all iterations 1 ≤ k ≤ 8,

v^{(k)}_i = l_iv^{(k-1)}_{i-1} + u^{(k-1)}_ip''_i (8)

wherein,

u^{(k)}_i = δ_3v^{(k)}_{31} + δ_2v^{(k)}_{23} + δ_1v^{(k)}_{15} + δ_0v^{(k)}_7, 0 ≤ i ≤ 7

= δ_3v^{(k)}_{31} + δ_2v^{(k)}_{23} + (δ_1 + δ_0)v^{(k)}_{15}, 8 ≤ i ≤ 15

= (δ_3 + δ_1)v^{(k)}_{31} + (δ_2 + δ_0)v^{(k)}_{23}, 16 ≤ i ≤ 23

= v^{(k)}_{31}, 24 ≤ i ≤ 31.

l_i = δ_3 + δ_2 + δ_1, i = 8

= δ_3 + δ_2, i = 16

= δ_3 + δ_1, i = 24

= 1, otherwise.

[0059] As before, v_{-1} = 0. Also, equation (7) remains unaltered except for the change due to data packing in operand B, which is one of the inputs to the GF multiplier 300.

[0060] For iteration k ranging from 1 to 8,

w^{(k)}_i = v^{(k-1)}_ib_{k-1} + w^{(k-1)}_i, ∀ i ∈ {0, ..., 7} (9a)

w^{(k)}_i = v^{(k-1)}_ib_{k+7} + w^{(k-1)}_i, ∀ i ∈ {8, ..., 15} (9b)

w^{(k)}_i = v^{(k-1)}_ib_{k+15} + w^{(k-1)}_i, ∀ i ∈ {16, ..., 23} (9c)

w^{(k)}_i = v^{(k-1)}_ib_{k+23} + w^{(k-1)}_i, ∀ i ∈ {24, ..., 31} (9d)

[0061] The product is Y'(x) = W^{(8)}(x).

Thus the above equations describe the GF multiplier array 320 shown in FIG. 3, which can perform various modes of packed arithmetic multiplication depending on the field size. For example, it can be envisioned as a GF multiplier array of (32 bit x 8 bit), which is configured as either four independent (8 bit x 8 bit) arrays arranged side-by-side, as two independent (16 bit x 8 bit) arrays, or as a single (32 bit x 8 bit) array. This configuration is done with the new polynomials u(x) and l(x). Although the above described GF multiplier array is of size (32 bit x 8 bit), the above-described technique can be extended to any different GF multiplier array size, such as (40 bit x 8 bit) or (32 bit x 16 bit).

[0062] The last sub-block in the GF arithmetic unit 300 shown in FIG. 3 is the post-shift-add stage 330. The shift is performed to right-shift the product back to the right-justified packed format. After the right shift, the summand C(x) is added to give the final MAC result. As noted earlier, if area is not a concern, the extra latency due to the XOR of C(x) can be avoided by pre-shifting C(x) and adding it in the GF multiplier array as W^{(0)}.

Y(x) = Y'(x)x^{(m-32) mod 8} + C(x)
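The three stages can be modelled together at bit level to see how one (32 bit x 8 bit) array serves the quad, dual and single modes. The sketch below is a behavioural illustration under explicit assumptions, not the circuit of FIG. 3: it assumes the per-bit recurrence v_i ← l_i·v_{i-1} + u_i·p''_i with the feedback bits u_i and link bits l_i selected by the control variables, treats the post-shift as a plain right shift by (32 - m) mod 8, and uses hypothetical names (`swp_gf_mac`, `u_bit`, `l_bit`, `gf_mul_ref`) throughout.

```python
MASK32 = 0xFFFFFFFF

def gf_mul_ref(a, b, poly_low, m):
    """Reference a*b mod P(x); poly_low is P(x) without the x^m term."""
    y, mask = 0, (1 << m) - 1
    for k in range(m):
        if (b >> k) & 1:
            y ^= a
        top = (a >> (m - 1)) & 1
        a = ((a << 1) & mask) ^ (poly_low if top else 0)
    return y

def u_bit(v, i, d):
    """Feedback bit u_i: the top bit of the sub-array that bit i belongs to."""
    v7, v15, v23, v31 = ((v >> n) & 1 for n in (7, 15, 23, 31))
    if i < 8:
        return (d[3] & v31) ^ (d[2] & v23) ^ (d[1] & v15) ^ (d[0] & v7)
    if i < 16:
        return (d[3] & v31) ^ (d[2] & v23) ^ ((d[1] ^ d[0]) & v15)
    if i < 24:
        return ((d[3] ^ d[1]) & v31) ^ ((d[2] ^ d[0]) & v23)
    return v31

def l_bit(i, d):
    """Link bit l_i: joins or splits neighbouring byte lanes."""
    if i == 8:
        return d[3] ^ d[2] ^ d[1]
    if i == 16:
        return d[3] ^ d[2]
    if i == 24:
        return d[3] ^ d[1]
    return 1

def swp_gf_mac(A, B, C, P, m):
    """Packed A*B + C for field dimension m on a 32 x 8 array model."""
    s = (32 - m) % 8                                   # pre-shift amount
    d = [int(1 <= m <= 8), int(8 < m <= 16),
         int(16 < m <= 24), int(24 < m <= 32)]         # delta_0 .. delta_3
    a1, p1 = (A << s) & MASK32, (P << s) & MASK32
    b0, b2 = B & 0xFF, (B >> 16) & 0xFF
    if d[0]:
        b_sel, p_sel = B & MASK32, (p1 & 0xFF) * 0x01010101
    elif d[1]:
        b_sel, p_sel = b2 * 0x01010000 | b0 * 0x0101, (p1 & 0xFFFF) * 0x00010001
    elif d[2]:
        b_sel, p_sel = b0 * 0x010101, p1
    else:
        b_sel, p_sel = b0 * 0x01010101, p1
    v, w = a1, 0
    for k in range(1, 9):                              # eight array rows
        nv = nw = 0
        for i in range(32):
            b_bit = (b_sel >> (8 * (i // 8) + k - 1)) & 1
            v_i = (v >> i) & 1
            v_prev = (v >> (i - 1)) & 1 if i else 0
            nw |= ((v_i & b_bit) ^ ((w >> i) & 1)) << i
            nv |= ((l_bit(i, d) & v_prev)
                   ^ (u_bit(v, i, d) & ((p_sel >> i) & 1))) << i
        v, w = nv, nw
    return (w >> s) ^ C                                # post-shift, then add C
```

For example, `swp_gf_mac(0x12345678, 0x9ABCDEF0, 0, 0x1D, 8)` performs four independent GF(2^8) products in one call, while `swp_gf_mac(a, chunk, acc, 0x04C11DB7, 32)` performs the single (32 bit x 8 bit) MAC that a byte-wise CRC-32 step would use; in both cases the result can be compared lane by lane against `gf_mul_ref`.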

[0063] FIG. 4 shows an example implementation of the sub-cell matrices in the sub-cell array 340 used in the GF multiplier array 320 using AND and XOR logic circuits. FIG. 4 shows two neighboring sub-cells 410 and 420 arranged in a row. As shown in FIG. 4, each of the sub-cells includes 8 cells 430. The new variables u and l translate to logic gates at the byte boundaries of the GF multiplier array. The MAC architecture shown in FIGS. 3 and 4 can be pipelined. The summand C(x) should preferably be passed through the pre-shifter and the GF multiplier array as W^{(0)}. Also, the various bits of B'(x) should be delayed appropriately to prevent previous data erasure. This depends on the level of pipelining.

[0064] In some embodiments, the pre-shift stage 310 shown in FIG. 3 receives first, second, and third operands (A), (B), and (P) 350, 360, and 370, respectively. Each of the received operands has m bits. The pre-shift stage 310 left-justifies the operands A and P to the nearest byte boundary. It also divides the operand B and the pre-shifted operand P into sub-words and selects the appropriate sub-words depending on the field size m.

[0065] The GF multiplier array 320 receives the sub-words associated with each operand, and performs GF multiplication on a sub-word-parallel basis and outputs the multiplied value (A x B) in GF (i.e., outputs a GF multiplied value). The post-shift-add stage 330 receives a fourth operand (C) 380, which has m bits. The post-shift-add stage 330 divides the fourth operand 380 into sub-words. Further, the post-shift-add stage 330 receives the GF multiplied value from the GF multiplier array 320 and right-justifies the GF multiplied value. Furthermore, the post-shift-add stage 330 adds the right-justified GF multiplied value to the sub-words associated with the operand C 380 and outputs the multiply-accumulate value of ((A x B) + C) in the GF. In some embodiments, the m bits can be in the range of about 8 bits to 40 bits.
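The multiply-accumulate ((A x B) + C) can be sketched for a single sub-word as follows. This is a minimal behavioural model under stated assumptions: the function name `gf_mac` is hypothetical, the variable names v, w, and u only mirror the polynomials named in this description (the exact form of equations (8) and (9) is not reproduced here), and the polynomial 0x11B is an illustrative choice. The sketch preloads C as the initial value of w, which gives the same result as adding C after the multiplication because GF addition is XOR.

```python
def gf_mac(a, b, c, poly, m):
    """Return (a * b) + c in GF(2^m), with c preloaded into w."""
    mask = (1 << m) - 1
    v = a & mask                 # v starts as operand A
    w = c & mask                 # w starts as the summand C (0 gives a plain multiply)
    for j in range(m):           # one row of the array per bit of B
        if (b >> j) & 1:
            w ^= v               # accumulate v when bit j of B is set
        u = (v >> (m - 1)) & 1   # top coefficient of v, fed back for reduction
        v = (v << 1) & mask      # multiply v by x
        if u:
            v ^= poly & mask     # fold in the field polynomial P
    return w


if __name__ == "__main__":
    product = gf_mac(0x57, 0x83, 0x00, 0x11B, 8)
    # Preloading C matches adding C after the multiplication.
    assert gf_mac(0x57, 0x83, 0x0F, 0x11B, 8) == product ^ 0x0F
    print(hex(product))
```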

[0066] In these embodiments, each of the sub-cell matrices in the sub-cell array 340 in the GF multiplier array 320, as shown in FIGS. 3 and 4, includes 8 GF cells 430 arranged in a row. As shown in FIG. 4, each GF cell 430 includes a first and a second AND logic circuit 440 and 445. The first AND logic circuit 440 has first and second inputs 442 and 443 and an output 444, and the second AND logic circuit 445 has first and second inputs 447 and 448 and an output 449.

[0067] Also as shown in FIG. 4, each GF cell 430 includes a first and a second XOR logic circuit 450 and 455. The first XOR logic circuit 450 has first and second inputs 452 and 453 (the latter being the same as 447) and an output 454, and the second XOR logic circuit 455 has first and second inputs 457 and 458 and an output 459. The outputs 454 and 459 of the first and second XOR logic circuits form the bits of the intermediate polynomials v and w as described in the above outlined equations (8) and (9). Further as shown in FIG. 4, the first input 442 of the first AND logic circuit 440 is connected to receive one of the m bits associated with the pre-shifted operand P 370 (shown in FIG. 3) and the second input 443 of the first AND logic circuit 440 is connected to receive one of the bits associated with the new polynomial u as described in the above equation (8). Furthermore as shown in FIG. 4, the first input 447 of the second AND logic circuit 445 is connected to receive one of the bits associated with the intermediate polynomial v and the second input 448 of the second AND logic circuit 445 is connected to receive one of the bits associated with the second operand B 360 (shown in FIG. 3).

[0068] Furthermore as shown in FIG. 4, the first input 452 of the first XOR logic circuit 450 is connected to the output 444 of the first AND logic circuit 440 and the second input 453 of the first XOR logic circuit 450 is connected to one of the bits associated with the intermediate polynomial v which is modified logically by a sub-cell matrix AND logic circuit 480 with the new polynomial l, as described in the equation (8). Further, the first input 457 of the second XOR logic circuit 455 is connected to the output 449 of the second AND logic circuit 445 and the second input 458 of the second XOR logic circuit 455 is connected to one of the bits associated with the intermediate polynomial w as described in the equation (9).

[0069] Moreover as shown in FIG. 4, each sub-cell matrix in the sub-cell array 340 further includes a MUX 470 and the sub-cell matrix AND logic circuit 480. As shown in FIG. 4, the MUX 470 has one or more inputs 472 and an output 474. The output 474 of the MUX 470 represents 8 bits of the new polynomial u, as described in the equation (8). The number of inputs 472 of the MUX 470 in the sub-cell array 340 (shown in FIG. 3) varies from 4 to 1 depending on the position of the sub-cell 410 (the right most sub-cell 410 has 4 inputs). Further as shown in FIG. 4, the sub-cell matrix AND logic circuit 480 has first and second inputs 482 and 484 and an output 486. As shown in FIG. 4, each of the one or more inputs 472 of the MUX 470 is connected to one bit associated with the intermediate polynomial v and the output 474 of the MUX 470 is connected to each second input 443 of the first AND logic circuit 440 in the sub-cell array 340. Furthermore, the second input 484 of the sub-cell matrix AND logic circuit 480 is connected to the one or more inputs 472 of the MUX 470 and the first input 482 of the sub-cell matrix AND logic circuit 480 is connected to one of the bits associated with the new polynomial l. As shown in FIG. 4, the MUX 470 and the sub-cell AND logic circuit 480 are included at byte-boundaries (after every 8^th GF cell) of the sub-cell array 340.

[0070] In these embodiments, the polynomials v and w represent each stage or row of computation of the product of operands A and B, i.e., (A x B). The initial value of polynomial v is the input to operand A. The final value of the polynomial w is a computed product (A x B). The new polynomial u has coefficients consisting of byte-boundary coefficients of the polynomial v, i.e., a combination of the coefficients v₃₁, v₂₃, v₁₅, v₇. This combination of byte-boundary coefficients of the polynomial v in the polynomial u is determined by the field size m. The new polynomial l signifies a connection between several sub-cell matrices of 8 GF cells in a row of the sub-cell array 340. The polynomial l has unity coefficients in all positions in the sub-cell array 340 except at the byte boundary, which depends on the field size m. The new polynomials u and l are introduced to facilitate sub-word parallelism in the GF multiplier array 320.

[0071] As shown in FIGS. 3 and 4, each of the first and the second XOR logic circuits 450 and 455 associated with a sub-cell array 340 in the GF multiplier array 320 has its output 454 and 459 connected to the first input 447 of the second AND logic circuit 445 and the second input 458 of the second XOR logic circuit 455, respectively, of a next successive sub-cell array, except that the first and the second XOR logic circuits of a last sub-cell array are connected to an output of the GF multiplier array.

[0072] Further as shown in FIGS. 3 and 4, the first inputs 442 and 447 of the first and second AND logic circuits 440 and 445, respectively, of a sub-cell array 340 are connected to receive a bit in the pre-shifted third operand P 370 and the output 454 of the first XOR logic circuit 450, respectively, of a substantially previous sub-cell array in the GF multiplier array, except for the first and second AND logic circuits of a first sub-cell array, which are connected to an input of the GF multiplier array.

[0073] It can be envisioned that the above-described MAC architecture including AND-XOR logic circuits in each of the sub-cell matrices in the sub-cell array 340 in the GF multiplier array 320 can be built using NAND-XNOR logic circuits, which can provide a higher system performance. Each sub-cell array 340 in the GF multiplier array 320 consisting of the NAND-XNOR logic circuits is repeated m² times in a two-dimensional array structure to obtain the above-described MAC architecture.
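One Boolean identity that makes such a substitution possible is that an AND followed by an XOR equals a NAND followed by an XNOR, since the two inversions cancel. The short Python check below verifies this exhaustively for one-bit inputs. It is offered only as an illustration: the function names are hypothetical, and the description above does not state how the NAND-XNOR cells are actually wired.

```python
from itertools import product


def cell_and_xor(a, b, c):
    """One AND gate followed by one XOR gate: (a AND b) XOR c."""
    return (a & b) ^ c


def cell_nand_xnor(a, b, c):
    """One NAND gate followed by one XNOR gate."""
    nand = 1 - (a & b)
    return 1 - (nand ^ c)


def cells_equivalent():
    """Compare both cells over all eight one-bit input combinations."""
    return all(
        cell_and_xor(a, b, c) == cell_nand_xnor(a, b, c)
        for a, b, c in product((0, 1), repeat=3)
    )


if __name__ == "__main__":
    print(cells_equivalent())
```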

[0074] The above-described GFU was implemented in RTL using Verilog HDL. The RTL description was synthesized using a standard cell library, with proper wire load models and timing constraints. The paths from inputs m and P(x) were treated as multicycle paths of two cycles. This is because the configuration registers are not meant to be changed on-the-fly along with the inputs, but are set before the MAC operations begin. A commercial CAD tool was used to place-and-route the MAC unit. The delay of the entire MAC unit was found to be about 1.5 ns under typical conditions (i.e., typical process, 1.2V power supply, 125°C temperature) in 0.13μm technology. The total area required was found to be about 0.05 mm². Using custom designs for the shifters and the array can considerably improve the speed and reduce the silicon area of the above-described MAC unit further.

[0075] Although the above embodiments describe performing error detection and correction techniques with reference to CRC and RS algorithms, the present invention is not limited to such. Thus, other embodiments may employ other types of forward error correction algorithms. As one of average skill in the art will appreciate, other embodiments may be derived from the teachings of the above-described techniques without departing from the scope of the claims.

[0076] The above-described technique uses a sub-word parallel architecture to improve system performance when encoding/decoding using CRC and RS algorithms. This process uses a fast parallel CRC computation algorithm to enhance system performance. In addition, the above-described technique can be used to perform both error detection and data correction.
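The block-wise CRC computation referred to above, in which i message bits are processed at a time with i no greater than the degree m of the generator polynomial, can be sketched as a sum of GF multiplications. The following Python sketch rests on several assumptions: all function names are hypothetical, block 0 is taken to be the least significant i bits of the message, the CRC is the plain remainder of A(x)·x^m (zero initial value, no bit reflection, no final XOR), and the CRC-16 polynomial 0x11021 is used only as an example. A bit-serial reference is included so that the two results can be compared.

```python
def poly_mod(value, g, m):
    """Remainder of value modulo g over GF(2); g has degree m."""
    while value.bit_length() > m:
        value ^= g << (value.bit_length() - m - 1)
    return value


def gf_mul(a, b, g, m):
    """Multiply a and b modulo g, for a and b below 2^m."""
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= g
    return result


def crc_blockwise(message, n, i, g, m):
    """CRC of an n-bit message, consuming one i-bit block per multiply-accumulate."""
    assert n % i == 0 and i <= m
    coeff = poly_mod(1 << m, g, m)    # alpha^m, coefficient for block 0
    step = poly_mod(1 << i, g, m)     # alpha^i, ratio between successive coefficients
    crc = 0
    for j in range(n // i):
        block = (message >> (j * i)) & ((1 << i) - 1)
        crc ^= gf_mul(block, coeff, g, m)   # multiply, then accumulate by XOR
        coeff = gf_mul(coeff, step, g, m)   # advance to alpha^(m + (j + 1) * i)
    return crc


def crc_reference(message, g, m):
    """Bit-serial reference: remainder of message * x^m modulo g."""
    return poly_mod(message << m, g, m)


if __name__ == "__main__":
    msg, g, m = 0x12345678, 0x11021, 16
    for i in (4, 8, 16):
        assert crc_blockwise(msg, 32, i, g, m) == crc_reference(msg, g, m)
    print(hex(crc_blockwise(msg, 32, 8, g, m)))
```

Each loop iteration is one multiply-accumulate of the kind the GFU provides, so a wider block size i reduces the number of iterations needed for a given message length.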

[0077] The above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those skilled in the art. The scope of the invention should therefore be determined by the appended claims, along with the full scope of equivalents to which such claims are entitled.

[0078] It is to be understood that the above-description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reviewing the above-description. The scope of the subject matter should, therefore, be determined with reference to the following claims, along with the full scope of equivalents to which such claims are entitled.

[0079] As shown herein, the present invention can be implemented in a number of different embodiments, including various methods, a circuit, an I/O device, a system, and an article comprising a machine-accessible medium having associated instructions.

[0080] Other embodiments will be readily apparent to those of ordinary skill in the art. The elements, algorithms, and sequence of operations can all be varied to suit particular requirements.

[0081] FIGS. 1-4 are merely representational and are not drawn to scale. Certain portions thereof may be exaggerated, while others may be minimized. FIGS. 1-4 illustrate various embodiments of the invention that can be understood and appropriately carried out by those of ordinary skill in the art.

[0082] It is emphasized that the Abstract is provided to comply with 37 C.F.R. § 1.72(b) requiring an Abstract that will allow the reader to quickly ascertain the nature and gist of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims.

[0083] In the foregoing detailed description of embodiments of the invention, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the invention require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby incorporated into the detailed description of embodiments of the invention, with each claim standing on its own as a separate embodiment.

[0084] It is understood that the above description is intended to be illustrative, and not restrictive. It is intended to cover all alternatives, modifications and equivalents as may be included within the spirit and scope of the invention as defined in the appended claims. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. The scope of the invention should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. In the appended claims, the terms "including" and "in which" are used as the plain-English equivalents of the respective terms "comprising" and "wherein," respectively.

Claims

What is claimed is:

1. A Galois Field arithmetic unit (GFU) comprising: a pre-shift stage that receives first, second, and third operands (A), (B), and (P), respectively, wherein each operand has m bits, wherein the pre-shift stage to left justify the operands A and P to a nearest byte boundary and to divide the operand B and the pre-shifted operand P into sub-words and select sub-words based on a field size of m; a Galois field (GF) multiplier array coupled to the pre-shift stage to receive the sub-words associated with the operands A, B, and P and to perform a GF multiplication on a sub-word-parallel basis and output a GF multiplied value of (A x B); and a post-shift-add stage coupled to the GF multiplier array to receive a fourth operand (C), wherein the operand C has m bits, wherein the post-shift-add stage to divide the operand C into sub-words, wherein the post-shift-add stage to receive the GF multiplied value from the GF multiplier array and to right-justify the GF multiplied value, wherein the post-shift-add stage to add the right-justified GF multiplied value to the sub-words associated with the operand C to output a multiply-accumulate value of ((A x B) + C) in GF, which can be used to compute the CRC and perform a Reed-Solomon encoding/decoding.

2. The GFU of claim 1, wherein the m bits comprise bits in the range of about 8 to 40 bits.

3. The GFU of claim 1, wherein the GF multiplier array is responsive to the received sub-words associated with the A, B, and P operands, wherein the GF multiplier array has a plurality of outputs for providing the GF transformation value associated with the sub-words of each operand, wherein the GF multiplier array has a plurality of sub-cell matrices arranged in one or more rows and columns, wherein each sub-cell array comprises 8 GF cells arranged in a row, wherein each GF cell has a first and a second AND logic circuits each having first and second inputs and an output, a first and a second XOR logic circuits each having a first and a second inputs and an output, wherein the first input of the first AND logic circuit to couple to one of the m bits associated with the pre-shifted operand P and the second input of the first AND logic circuit to couple to one of the m bits associated with a new polynomial u, wherein the first input of the second AND logic circuit to couple to one of the m bits associated with a first intermediate polynomial v and the second input of the second AND logic circuit is to couple to one of the m bits associated with the operand B, wherein the first input of the first XOR logic circuit is connected to the output of the first AND logic circuit and the second input of the first XOR logic circuit is connected to couple to one of the m bits associated with v logically ANDed with another new polynomial l, wherein the first input of the second XOR logic circuit is connected to the output of the second AND logic circuit and the second input of the second XOR logic circuit is to couple to receive one of the m bits associated with a second intermediate polynomial w, and wherein each sub-cell array has a MUX and a sub-cell array AND logic circuit, wherein the MUX has one or more inputs and an output, wherein the sub-cell AND logic circuit has first and second inputs and an output, wherein the one or more inputs of the MUX is to couple to receive one or more m bits associated with v and the output of the MUX is connected to each second input of the first AND gate in the sub-cell array, wherein the first input of the sub-cell AND logic circuit is connected to the one or more inputs of the MUX and the second input of the sub-cell AND logic circuit is to couple to receive one of the m bits associated with l.

4. The GFU of claim 3, wherein each of the first and the second XOR logic circuits of a sub-cell array in the GF multiplier array has its output connected to the first input of the second AND logic circuit and the second input of the second XOR logic circuit, respectively, of a next successive sub-cell array except for the first and the second XOR logic circuits of a last sub-cell array which are connected to an output of the GF multiplier array.

5. The GFU of claim 4, wherein each of the first inputs of the first and second AND logic circuits of a sub-cell array is to couple to receive a bit in the operand P and an output of the first XOR logic circuit of a substantially previous sub-cell array in the GF multiplier array, respectively, except for the first and second AND logic circuits of a first sub-cell array which are connected to an input of the GF multiplier array.

6. A GF multiplier array that is responsive to the received sub-words associated with the A, B, and P operands, wherein the GF multiplier array comprising: a plurality of outputs for providing the GF transformation value associated with the sub-words of each operand, wherein the GF multiplier array has a plurality of sub-cell matrices arranged in one or more rows and columns, wherein each sub-cell array comprises 8 GF cells arranged in a row, wherein each GF cell has a first and a second AND logic circuits each having first and second inputs and an output, a first and a second XOR logic circuits each having a first and a second inputs and an output, wherein the first input of the first AND logic circuit to couple to one of the m bits associated with the pre-shifted operand P and the second input of the first AND logic circuit to couple to one of the m bits associated with a new polynomial u, wherein the first input of the second AND logic circuit to couple to one of the m bits associated with a first intermediate polynomial v and the second input of the second AND logic circuit is to couple to one of the m bits associated with the operand B, wherein the first input of the first XOR logic circuit is connected to the output of the first AND logic circuit and the second input of the first XOR logic circuit is connected to couple to one of the m bits associated with v logically ANDed with another new polynomial l, wherein the first input of the second XOR logic circuit is connected to the output of the second AND logic circuit and the second input of the second XOR logic circuit is to couple to receive one of the m bits associated with a second intermediate polynomial w, and wherein each sub-cell array has a MUX and a sub-cell array AND logic circuit, wherein the MUX has one or more inputs and an output, wherein the sub-cell AND logic circuit has first and second inputs and an output, wherein the one or more inputs of the MUX is to couple to receive one or more m bits associated with v and the output of the MUX is connected to each second input of the first AND gate in the sub-cell array, wherein the first input of the sub-cell AND logic circuit is connected to the one or more inputs of the MUX and the second input of the sub-cell AND logic circuit is to couple to receive one of the m bits associated with l.

7. The GF multiplier array of claim 6, wherein each of the first and the second XOR logic circuits of a sub-cell array in the GF multiplier array has its output connected to the first input of the second AND logic circuit and the second input of the second XOR logic circuit, respectively, of a next successive sub-cell array except for the first and the second XOR logic circuits of a last sub-cell array which are connected to an output of the GF multiplier array.

8. The GF multiplier array of claim 6, wherein each of the first inputs of the first and second AND logic circuits of a sub-cell array is to couple to receive a bit in the operand P and an output of the first XOR logic circuit of a substantially previous sub-cell array in the GF multiplier array, respectively, except for the first and second AND logic circuits of a first sub-cell array which are connected to an input of the GF multiplier array.

9. The GF multiplier array of claim 7, wherein the input operands to each sub-cell array in the GF multiplier array include state inputs fed back from the state conditions of the GF field linear outputs of the substantially previous sub-cell array.

10. A method of performing a cyclic redundancy check (CRC) comprising: receiving a message of length n bits; partitioning the n bits into one or more blocks, wherein each block has i input bits such that n=k*i and i is less than or equal to m, wherein m is the degree of the generator polynomial used to compute the CRC; and computing a CRC value for the received message of n bits using the one or more blocks.

11. The method of claim 10, wherein computing the CRC value comprises: initializing the value of k and an intermediate CRC value to a 0 value; multiplying a current block of i input bits with a GF coefficient to obtain a current multiplied CRC value, wherein the GF coefficient is a power of a primitive element α in a finite field of GF(2^m); adding the current multiplied CRC value to a previously obtained intermediate CRC value associated with a previous block of i input bits to obtain a new intermediate CRC value; incrementing the value of k by a predetermined value; determining whether the value of k = (n/i); if not, repeating the above steps of multiplying, adding, incrementing and determining; and if so, outputting the new intermediate CRC value as the CRC value.

12. The method of claim 11, wherein the CRC value is computed using the equation,

CRC[A(x)] = Σ_{j=0}^{k-1} W_j(x) · α^{m + ji}

wherein A(x) is the input message of n bits, m is the degree of the generator polynomial, W_j(x) is a polynomial of degree (i - 1) that is expanded to degree (m - 1) with zeros on most significant bits that represents blocks of i input bits of the input message A(x), α is the primitive element of the field of GF(2^m) having the generator polynomial used in the CRC computation, and (k = n/i).