WO2006120691A1 - Galois field arithmetic unit for error detection and correction in processors - Google Patents

Galois field arithmetic unit for error detection and correction in processors

Info

Publication number
WO2006120691A1
Authority
WO
WIPO (PCT)
Prior art keywords
sub
input
logic circuit
bits
array
Prior art date
Application number
PCT/IN2005/000150
Other languages
French (fr)
Inventor
Sourav Roy
Original Assignee
Analog Devices Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Analog Devices Inc.
Priority to PCT/IN2005/000150
Publication of WO2006120691A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/60 Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers
    • G06F7/72 Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers using residue arithmetic
    • G06F7/724 Finite field arithmetic

Definitions

  • This invention relates to error control coding in electronic and communication systems and more specifically to a method and apparatus for a Galois field arithmetic unit (GFU) for error detection and correction in such systems.
  • GFU Galois field arithmetic unit
  • communication systems include a plurality of communication devices (e.g., modems, cable modems, personal computers, laptops, cellular telephones, radios, telephones, facsimile machines, and so on) that communicate directly (i.e., point- to-point) or indirectly via communication system infrastructure (e.g., wire line channels, wireless channels, bridges, switches, routers, gateways, servers, and so on).
  • a communication system may include one or more local area networks and/or one or more wide area networks to support at least one of the Internet, cable services (e.g., modem functionality and television), wireless communications systems (e.g., radio, cellular telephones), satellite services, wire line telephone services, digital television, and so on.
  • information e.g., voice, audio, video, text, data, and so on
  • the transmitting communication device prepares the information for transmission to the other device and provides the prepared information to the infrastructure for direct or indirect routing to the receiving communication device.
  • the receiving communication device traverses the processing steps used by the transmitting communication device to prepare the information for transmission to recapture the original information.
  • transmission of information between communication devices is not performed under an ideal environment where the received information exactly matches the transmitted information.
  • the infrastructure can introduce errors, which can result in distorting the transmitted information such that the received information does not exactly match the transmitted information.
  • the transmitting communication device includes an encoder, which adds redundancy to the original data to make the original data more unique, and the receiving communication device includes a corresponding decoder, which uses the redundancy information to recover the original data from the received data that includes transmission errors.
  • CRC cyclic redundancy checking
  • Various forms of CRC are employed in the communication and consumer electronics arena. For example, a 16 bit CRC is employed in MPEG audio standards, whereas a 32 bit CRC is employed in Ethernet protocols.
  • CRC involves generating redundancy bits by partitioning the bit stream of the original data into blocks of data. The blocks of data are processed sequentially, with the data from each block being divided by a polynomial. The remainder from the division process becomes the redundancy bits, which are appended to, and transmitted with, the block of data from which they were generated.
  • the decoder, upon receiving a block of data, divides the block of data and the appended redundancy bits by the same polynomial. If the remainder of this division is zero, there are no errors in the received block of data. If, however, there is a remainder, an error exists. For CRC, when an error exists in the block of data, the decoder typically requests retransmission of the block of data.
  • MAC multiply and accumulate
  • FEC forward error correction
  • the FEC involves an encoder generating error correction data as a function of the data to be sent and then transmitting the error correction data along with the data.
  • a decoder within the receiving communication device utilizes the error correction data to identify any errors in the original data that may have occurred during transmission.
  • a popular FEC algorithm is called Reed Solomon (RS) encoding and decoding.
  • RS partitions a data stream into sequential blocks of data and then divides a block of data by a polynomial to obtain parity, or check data.
  • RS operates on a byte stream rather than a bit stream, so it creates check bytes, which are appended to each block of data.
  • the decoding process at the receiver is considerably more complex than that of the CRC algorithm.
  • a set of syndromes is calculated. If the calculated syndromes have a zero value, the received block of data is then deemed to have no errors. If one or more of the calculated syndromes are not zero, then the existence of one or more errors is indicated. The non-zero values of the syndrome are then used to determine the location of the errors and, from there, correct values of data can be determined to correct the errors.
  • the syndromes are computed based on Horner's Rule, using GF MAC operations. Finding error locations in the code word and their corresponding magnitudes is achieved by computing the error locator and evaluator polynomials by the Euclidean or Berlekamp-Massey algorithm. The roots of the error locator are calculated using the Chien Search method, which employs constant GF multiplications. Finally, the error values are found using the Forney algorithm. This step requires a GF inversion operation, which is generally performed with a look-up table.
  • RS codes require an (m bit x m bit) multiplication. Generally, the value of m that is used is not more than eight bits for RS codes.
  • CRC computation requires using larger values for m (for example, in the neighborhood of about 32 or more). Therefore, to use large values of m in CRC computation the MAC architecture can become significantly large, which can result in requiring a larger silicon area for the processors. Further, using such large silicon area can significantly lower the performance of the processors. Furthermore, the current MAC architectures can either perform error detection or data correction but not both.
  • a Galois field arithmetic unit to perform multiply-accumulate operation to calculate CRC as well as Reed-Solomon encoding/decoding.
  • the GFU of the present invention uses sub-word-parallelism to enhance the system performance.
  • FIG. 1 is a block diagram of a digital signal processor according to an embodiment of the present invention.
  • FIG. 2 illustrates a flowchart of an example embodiment of a method for calculating CRC to be implemented using the GFU shown in FIG 2.
  • FIG. 3 is a schematic diagram of a GF multiplier according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a sub-cell array of the GF array, shown in FIG. 3, according to an embodiment of the present invention.
  • FIG. 1 illustrates an example block diagram of a digital signal processor (DSP) 40.
  • the DSP 40 shown in FIG. 1 is used to perform an encoding and decoding operation for the Reed-Solomon (RS) codes.
  • the DSP 40 has a processing circuit 42 and a Galois field multiplier unit (GFU) 44.
  • the processing circuit 42 has a processing module 46, a controller 48, and an input/output port 50.
  • the input/output port 50 receives input data from an input terminal 52 of the DSP 40.
  • the controller 48 transmits the input data to the processing module 46.
  • the processing module 46 is used to perform the GF addition on the input data.
  • the controller 48 transmits the input data to the GFU 44 via the input/output port 50. After the GFU 44 finishes processing the input data, the input data will be transmitted back to the processing module 46 for the following operations. In the end, the calculation result is outputted to an output terminal 54 of the DSP 40 via the input/output port 50.
  • the GF addition is equivalent to an XOR logic operation.
  • the algorithm proposed in this example embodiment enables parallel computation of CRC-m (i.e. degree of the primitive polynomial is m) using chunks of i message bits at a time, where i is less than or equal to m. Processing less than m message bits in parallel, can significantly reduce the required processor core area. If multiple MAC units are used in the processor, the silicon area saving can also be significant. Moreover, from a timing point of view, a true single-cycle MAC can be easily accomplished to eliminate data dependency stalls in the processor pipeline and hence can significantly improve system performance.
  • $A(x) = (a_{ki-1} x^{i-1} + \dots + a_{(k-1)i+1} x + a_{(k-1)i}) x^{(k-1)i} + \dots$
  • Equation (2) is further reduced using the following two polynomials $F_1(x)$ and $F_2(x)$ represented in the quotient-remainder form when divided by P(x).
  • the degree of each of the message blocks, Wj(x) in equation (2) is (i - 1) which is less than the degree of P(x).
  • $W_j(x) \bmod P(x) = W_j(x), \quad \forall j \in \{0, 1, \dots, k-1\}$
  • $W_j(x) = 0 \cdot x^{m-1} + \dots + 0 \cdot x^{i} + a_{(j+1)i-1} x^{i-1} + \dots + a_{ji+1} x + a_{ji}$.
  • the value of k and the intermediate CRC value are initialized to 0.
  • a current block of i input bits is multiplied with a GF coefficient to obtain a current multiplied CRC value.
  • the GF coefficient is a power of a primitive element $\alpha$ in the finite field $GF(2^m)$.
  • the current multiplied CRC value is added to a previously obtained intermediate CRC value associated with a block of i inputs to obtain a new intermediate CRC value.
  • the value of k is incremented by a predetermined value. In some embodiments, the value of k is incremented by 1.
  • CRC can be computed in parallel using chunks of i input bits of the message at a time.
  • the i-bit input word is multiplied with a coefficient, which is a power of the primitive element $\alpha$ in $GF(2^m)$, and the product is then added to the previously accumulated result; the multiplication and addition operations are repeated for k iterations to get the final CRC value, as described above with reference to FIG. 2.
  • the coefficients are constant for a particular field dimension m and a primitive polynomial P(x), and are computed prior to the MAC operations.
  • the multiplication and addition are performed in $GF(2^m)$.
  • one of the operands $W_j(x)$ can be treated as an i-bit number in $GF(2^m)$, since the higher bit positions are all zeros.
  • parallel CRC can be calculated with a (m bit x i bit) standard basis MAC structure in $GF(2^m)$, where $i \le m$.
  • a (m bit x m bit) MAC structure for CRC can be expensive in terms of silicon area and speed, particularly when multiple such compute blocks are used in a general purpose processor, such as a DSP.
  • i equal to 8 facilitates byte-wise parallel CRC computation.
  • FIG. 3 illustrates an example GF multiplier 300 including a MAC architecture implementing the parallel CRC computation scheme described above.
  • the GF multiplier 300 includes a pre-shift stage 310, a GF multiplier array 320, and a post-shift-add stage 330.
  • the GF multiplier array 320 includes a plurality of sub-cell matrices in the sub-cell array 340 that are arranged in one or more rows and columns.
  • Reed Solomon (RS) codes are constructed and decoded through the use of GF arithmetic. Encoding is done through polynomial division, which employs a linear feedback shift register (LFSR) structure as in the serial CRC computation, but operates on words of m bits in $GF(2^m)$, instead of individual bits.
  • LFSR linear feedback shift register
  • the above-described RS encoder can be easily implemented in software with the aid of a GF MAC unit.
  • the decoding of RS codes consists of the following two steps:
  • Syndrome evaluation: a non-zero syndrome signifies an error in the received code word.
  • the syndromes are calculated based on Horner's Rule, using GF MAC operations.
  • Y(x) denotes the product of A(x) and B(x), where $A, B, Y \in GF(2^m)$.
  • $P(x) = x^m + p_{m-1} x^{m-1} + \dots + p_1 x + p_0$, which denotes the primitive polynomial of the field, with $\alpha$ being its root.
  • A(x) and B(x) can be represented as polynomials in $\alpha$ as follows:
  • $a_i, b_i \in GF(2) = \{0, 1\}$.
  • There are generally two types of array multipliers, depending on the order in which the multiplier bits are processed, viz., least significant bit (LSB)-first and most significant bit (MSB)-first multipliers.
  • the MSB-first multiplier has a longer critical path of m - 1 and requires more XOR gates. Hence the LSB-first multiplier is superior as it reduces the computation delay considerably, without adding any extra hardware.
  • parallel irregular multipliers which perform the polynomial multiplication and the degree reduction separately. The silicon area and the delay of this multiplier are similar to that of the LSB-first multiplier, though it can potentially provide lower power dissipation.
  • the GF multiplier array 320 shown in FIG. 3 is based on the LSB-first array multiplier to exploit the regularity of the array structure. In the LSB-first multiplier, multiplication starts with the least significant bit of the multiplier B as outlined below.
  • the critical path in a $GF(2^m)$ MAC unit comprises m XOR + m AND gates.
  • the path starts from $A_{m-1}$ and ends at $Y_{m-1}$.
  • the AND-XOR combination may be replaced by NAND-XNOR for higher speed.
  • a basic cell consisting of a combination of a pair of NAND-XNOR logic circuits that is repeated $m^2$ times in a two-dimensional array structure to build the GF multiplier.
  • the GFU described above is a multiplier which is programmable with respect to both primitive polynomial as well as field dimension.
  • RS codes require (m bit x m bit) multiplication where $1 \le m \le 8$.
  • CRC requires up to 32 bit multipliers.
  • CRC can be achieved by (m bit x i bit) multiplication where $i \le m$.
  • an array of (32 bit x 8 bit) can support both applications.
  • the register size of a processor, such as a DSP, is generally 32 bits.
  • the architecture of the DSP needs to support packed arithmetic or sub-word-parallelism (SWP), so that a 32 bit register can be accessed as quad 8 bit fields.
  • SWP sub-word-parallelism
  • the MAC shown in FIG. 3 supports the following modes in $GF(2^m)$: 1. quad ((m bit x m bit) + m bit), for $1 < m \le 8$
  • the first mode is suitable for RS coding/decoding, where four parallel MAC operations can be performed in a single cycle.
  • the last two are suitable for CRC computation by the algorithm described earlier, using chunks of 8 message bits or less at a time.
  • the MAC unit shown in FIG. 3 includes three sub-units, namely the pre-shift stage 310, the GF multiplier array 320, and the post-shift-add stage 330. The addition is performed in the last stage to reduce silicon area; otherwise further pre-shift of the summand may be required.
  • the multiplier structure automatically configures itself based on the field dimension m.
  • the input operands A(x), B(x), C(x), P(x) and the output operand Y(x) are all stored in 32 bit registers.
  • the GF multiplier array is of the form (32 bit x 8 bit).
  • the 32 bit multiplier A(x) is directly fed to the multiply unit. Since B(x) is 32 bits, but the GF multiplier array only handles 8 bits, appropriate bytes are chosen from B(x) for the multiplication.
  • the data packing in the various modes of multiplication, i.e. quad, dual and single multiplication, are shown in the table below.
  • P(x) is stored in a register occupying m bits.
  • the coefficient for x m which is always unity, is not stored.
  • the following illustrates the automatic configuration of the (32 bit x 8 bit) GF multiplier array.
  • $B'(x) = S_3 [B_0(x) x^{24} + B_0(x) x^{16} + B_0(x) x^{8} + B_0(x)]$
  • $P'(x) = S_0 [P'_0(x) x^{24} + P'_0(x) x^{16} + P'_0(x) x^{8} + P'_0(x)] + S_1 [P_1(x) x^{16} + P_0(x) x^{16} + P_1(x) + P_0(x)] + S_2 P'(x) + S_3 P'(x)$.
  • the outputs of the pre-shift sub-block are A'(x), B'(x) and P'(x). These are then passed onto the GF multiplier array of size (32 bit x 8 bit).
  • the GF multiplier array 320 multiplies the inputs A'(x) and B'(x) according to the field parameters m and P(x).
  • the intermediate polynomials V(x) and W(x) described above will have a degree of 31.
  • the computation stage or iteration number is denoted by k, ranging from about 1 to 8.
  • the initial values for the inputs to the GF multiplier array are given by,
  • u (k). ⁇ 3V (k) 31 + ⁇ 2V (k) 23 + ⁇ l jk) is + ⁇ oV (k) ?i 0 ⁇ i ⁇ 7
  • equation (7) remains unaltered except for the change due to data packing in operand B, which is one of the inputs to the GF multiplier 300.
  • Wi W V .(k-D h+7 + w fi-i) t Vi e ⁇ 8> J 5 J (9b)
  • the above equations describe the GF multiplier array 320 shown in FIG. 3, which can perform various modes of packed arithmetic multiplication depending on the field size.
  • a GF multiplier array of (32 bit x 8 bit) which is configured as either four independent (8 bit x 8 bit) arrays arranged side-by-side, as two independent (16 bit x 8 bit) arrays, or as a single (32 bit x 8 bit) array. This configuration is done with the new polynomials u(x) and l(x).
  • although the above-described GF multiplier array is of size (32 bit x 8 bit), the above-described technique can be extended to any different GF multiplier array size, such as (40 bit x 8 bit) or (32 bit x 16 bit).
  • the last sub-block in the GF arithmetic unit 300 shown in FIG. 3 is the post-shift-add stage 330.
  • the shift is performed to right shift the product back to the right-justified packed format.
  • the summand C(x) is added to give the final MAC result.
  • if area is not a concern, the extra latency due to the XOR of C(x) can be avoided by pre-shifting C(x) and adding it in the GF multiplier array as $w^{(0)}$.
  • FIG. 4 shows an example implementation of the sub-cell matrices in the sub-cell array 340 used in the GF multiplier array 320 using the AND and XOR logic circuits.
  • FIG. 4 shows two neighboring sub-cells 410 and 420 arranged in a row. As shown in FIG. 4, each of the sub-cells includes 8 cells 430.
  • the MAC architecture shown in FIGS. 3 and 4 can be pipelined.
  • the summand C(x) should preferably be passed through the pre-shifter and the GF multiplier array as $w^{(0)}$.
  • the various bits of B'(x) should be delayed appropriately to prevent previous data erasure. This depends on the level of pipelining.
  • the pre-shift stage 310 shown in FIG. 3 receives first, second, and third operands (A), (B), and (P) 350, 360, and 370, respectively. Each of the received operands has m bits.
  • the pre-shift stage 310 left-justifies the operands A and P to the nearest byte boundary. It also divides the operand B and the pre-shifted operand P into sub-words and selects the appropriate sub-words depending on the field size m.
  • the GF multiplier array 320 receives the sub-words associated with each operand, and performs GF multiplication on a sub-word-parallel basis and outputs the multiplied value (A x B) in GF (i.e., outputs a GF multiplied value).
  • the post-shift-add stage 330 receives a fourth operand (C) 380, which has m bits.
  • the post-shift-add stage 330 divides the fourth operand 380 into sub-words. Further, the post-shift-add stage 330 receives the GF multiplied value from the GF multiplier array 320 and right-justifies the GF multiplied value.
  • the post-shift-add stage 330 adds the right-justified GF multiplied value to the sub-words associated with the operand C 380 and outputs the multiply-accumulate value of ((A x B) + C) in the GF.
  • the m bits can be in the range of about 8 bits to 40 bits.
  • each of the sub-cell matrices in the sub-cell array 340 in the GF multiplier array 320 includes 8 GF cells 430 arranged in a row.
  • each GF cell 430 includes a first and a second AND logic circuit 440 and 445.
  • The first AND logic circuit 440 has first and second inputs 442 and 443 and an output 444; the second AND logic circuit 445 has first and second inputs 447 and 448 and an output 449.
  • each GF cell 430 includes a first and a second XOR logic circuit 450 and 455.
  • The first XOR logic circuit 450 has first and second inputs 452 and 453 (same as 447) and an output 454; the second XOR logic circuit 455 has first and second inputs 457 and 458 and an output 459.
  • the outputs 454 and 459 of the first and second XOR logic circuits 450 and 455 form the bits of the intermediate polynomials v and w as described in the above outlined equations (8) and (9).
  • the first input 442 of the first AND logic circuit 440 is connected to receive one of the m bits associated with the pre-shifted operand P 370 (shown in FIG.
  • the second input 443 of the first AND logic circuit 440 is connected to receive one of the bits associated with the new polynomial u as described in the above equation (8).
  • the first input 447 of the second AND logic circuit 445 is connected to receive one of the bits associated with the intermediate polynomial v and the second input 448 of the second AND logic circuit 445 is connected to receive one of the bits associated with the second operand B 360 (shown in FIG. 3).
  • the first input 452 of the first XOR logic circuit 450 is connected to the output 444 of the first AND logic circuit 440 and the second input 453 of the first XOR logic circuit 450 is connected to one of the bits associated with the intermediate polynomial v which is modified logically by a sub-cell matrix AND logic circuit 480 with the new polynomial l, as described in the equation (8).
  • the first input 457 of the second XOR logic circuit 455 is connected to the output 449 of the second AND logic circuit 445 and the second input 458 of the second XOR logic circuit 455 is connected to one of the bits associated with the intermediate polynomial w as described in the equation (9).
  • each sub-cell array in the sub-cell array 340 further includes a MUX 470 and the sub-cell matrix AND logic circuit 480.
  • the MUX 470 has one or more inputs 472 and an output 474.
  • the output 474 of the MUX 470 represents 8 bits of the new polynomial u, as described in the equation (8).
  • the one or more inputs 472 of the MUX 470 in each sub cell array 340 (shown in FIG. 3) vary from 4 to 1 depending on the position of the sub-cell 410 (the right most sub-cell 410 has 4 inputs). Further as shown in FIG.
  • the sub-cell matrix AND logic circuit 480 has first and second inputs 482 and 484 and an output 486. As shown in FIG. 4, the one or more inputs 472 of the MUX 470 are connected to one bit associated with the intermediate polynomial v and the output 474 of the MUX 470 is connected to each second input 443 of the first AND logic circuit 440 in the sub-cell array 340. Furthermore, the second input 484 of the sub-cell matrix AND logic circuit 480 is connected to the one or more inputs 472 of the MUX 470 and the first input 482 of the sub-cell matrix AND logic circuit 480 is connected to one of the bits associated with the new polynomial l. As shown in FIG. 4, the MUX 470 and the sub-cell AND logic circuit 480 are included at byte-boundaries (after every 8th GF cell) of the sub-cell array 340.
  • the polynomials v and w represent each stage or row of computation of the product of operands A and B, i.e., (A x B).
  • the initial value of polynomial v is the input to operand A.
  • the final value of the polynomial w is a computed product (A x B).
  • the new polynomial u has coefficients consisting of byte-boundary coefficients of the polynomial v, i.e., a combination of the coefficients $v_{31}$, $v_{23}$, $v_{15}$, $v_{7}$. This combination of byte-boundary coefficients of the polynomial v in the polynomial u is determined by the field size m.
  • the new polynomial l signifies a connection between several sub-cell matrices of 8 GF cells in a row of the sub-cell array 340.
  • the polynomials have unity coefficients in all positions in the sub-cell array 340 except at the byte boundary, which depends on the field size m.
  • the new polynomials u and l are introduced to facilitate sub-word parallelism in the GF multiplier array 320.
  • each of the first and the second XOR logic circuits 450 and 455 associated with a sub-cell array 340 in the GF multiplier array 320 has its output 454 and 459 connected to the first input 447 of the second AND logic circuit 445 and the second input 458 of the second XOR logic circuit 455, respectively, of a next successive sub-cell array, except that the first and the second XOR logic circuits of a last sub-cell array are connected to an output of the GF multiplier array.
  • the first inputs 442 and 447 of the first and second AND logic circuits 440 and 445, respectively, of a sub-cell array 340 are connected to receive a bit in the pre-shifted third operand P 370 and output 454 of the first XOR logic circuits 450, respectively, of a substantially previous sub-cell array in the GF multiplier array, except for the first and second AND logic circuits of a first sub-cell array which are connected to an input of the GF multiplier array.
  • the above-described MAC architecture including AND-XOR logic circuits in each of the sub-cell matrices in the sub-cell array 340 in the GF multiplier array 320 can be built using NAND-XNOR logic circuits, which can provide a higher system performance.
  • Each sub-cell array 340 in the GF multiplier array 320 consisting of the NAND-XNOR logic circuits is repeated $m^2$ times in a two-dimensional array structure to obtain the above-described MAC architecture.
  • the above-described GFU was implemented in RTL using a Verilog HDL.
  • the RTL description was synthesized using a standard cell library, with proper wire load models and timing constraints.
  • the paths from inputs m and P(x) were treated as multicycle paths of two cycles. This is because the configuration registers are not meant to be changed on-the-fly along with the inputs, but are set before the MAC operations begin.
  • a commercial CAD tool was used to place-and-route the MAC unit.
  • the delay of the entire MAC unit was found to be about 1.5 ns under typical conditions (i.e., typical process, 1.2V power supply, 125°C temperature) in 0.13 µm technology.
  • the total area required was found to be about 0.05 mm².
  • the system performance can further considerably improve the speed and reduce silicon area of the above-described MAC unit.
  • although the above embodiments describe performing error detection and correction techniques with reference to CRC and RS algorithms, the present invention is not limited to such. Thus, other embodiments may employ other types of forward error correction algorithms. As one of average skill in the art will appreciate, other embodiments may be derived from the teachings of the above described techniques without departing from the scope of the claims.
  • the above-described technique uses a sub-word parallel architecture to improve system performance when encoding/decoding using CRC and RS algorithms. This process uses a fast parallel CRC computation algorithm to enhance system performance. In addition, the above-described technique can be used to perform both error detection and data correction.
  • the present invention can be implemented in a number of different embodiments, including various methods, a circuit, an I/O device, a system, and an article comprising a machine-accessible medium having associated instructions.
  • FIGS. 1-4 are merely representational and are not drawn to scale. Certain portions thereof may be exaggerated, while others may be minimized. FIGS. 1-4 illustrate various embodiments of the invention that can be understood and appropriately carried out by those of ordinary skill in the art.
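The LSB-first multiply-accumulate and the Horner's Rule syndrome evaluation described in the definitions above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the circuit of FIGS. 3 and 4: the field GF(2^8), the polynomial 0x11D (x^8 + x^4 + x^3 + x^2 + 1), the choice of roots α^1 ... α^count, and all function names are hypothetical. The polynomial is passed with its x^m term included, which differs from the register format in which that unity coefficient is not stored.

```python
M = 8          # assumed field dimension m
POLY = 0x11D   # assumed field polynomial, x^m term included


def gf_mac(a: int, b: int, c: int = 0, poly: int = POLY, m: int = M) -> int:
    """Return (a * b) + c in GF(2^m), consuming the bits of b LSB first."""
    v = a   # row polynomial v, initialised with operand A
    w = c   # accumulator w, initialised with the summand C
    for k in range(m):
        if (b >> k) & 1:
            w ^= v            # AND with bit b_k, then XOR into w
        v <<= 1               # multiply v by alpha
        if (v >> m) & 1:
            v ^= poly         # reduce modulo P(x)
    return w


def alpha_power(exponent: int) -> int:
    """alpha ** exponent, where alpha is the field element x (value 2)."""
    result = 1
    for _ in range(exponent):
        result = gf_mac(result, 2)
    return result


def syndromes(received: list, count: int) -> list:
    """Evaluate the received word at alpha^1 .. alpha^count by Horner's Rule.

    `received` lists the symbols with the highest-degree coefficient first.
    """
    values = []
    for j in range(1, count + 1):
        root = alpha_power(j)
        acc = 0
        for symbol in received:
            acc = gf_mac(acc, root, symbol)   # one MAC per received symbol
        values.append(acc)
    return values


print(syndromes([0, 0, 0, 0, 0], 4))   # an all-zero word has zero syndromes
```

Each Horner step is exactly one ((A x B) + C) operation, which is why a single-cycle GF MAC maps naturally onto syndrome computation.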

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Error Detection And Correction (AREA)
  • Detection And Correction Of Errors (AREA)

Abstract

A GFU that performs multiply-accumulate operation for error detection and correction in processors by using sub-word-parallelism (SWP) to enhance system performance. The GFU performs error detection through parallel computation of Cyclic Redundancy Checks (CRC). The CRC for a message is computed using i bits at a time, wherein i is less than or equal to a degree of the generator polynomial. The GFU also performs error correction employing Reed-Solomon codes.
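The chunk-wise CRC computation summarized in the abstract can be sketched as a sequence of multiply-accumulate steps. The following is a rough software illustration only: it assumes the CRC is the remainder of the message polynomial multiplied by x^m, a zero initial value, and blocks numbered from the least significant end of the message; the function names and the CRC-16 polynomial 0x11021 are hypothetical choices for the example, not details taken from the application.

```python
def gf_mul(a: int, b: int, poly: int, m: int) -> int:
    """Multiply a(x) by b(x) modulo poly(x); poly includes the x^m term."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly
    return product


def crc_coefficients(k: int, i: int, poly: int, m: int) -> list:
    """Precompute the k constant coefficients x^(j*i + m) mod P(x)."""
    step = gf_mul(1 << (i - 1), 2, poly, m)   # x^i mod P(x)
    coeff = poly ^ (1 << m)                   # x^m mod P(x)
    table = []
    for _ in range(k):
        table.append(coeff)
        coeff = gf_mul(coeff, step, poly, m)
    return table


def crc_chunked(message: int, nbits: int, i: int, poly: int, m: int) -> int:
    """CRC of an nbits-long message, consuming i <= m message bits per step."""
    assert 1 <= i <= m
    k = -(-nbits // i)                        # number of i-bit blocks
    coeffs = crc_coefficients(k, i, poly, m)
    crc = 0
    for j in range(k):
        block = (message >> (j * i)) & ((1 << i) - 1)   # i-bit block W_j
        crc ^= gf_mul(coeffs[j], block, poly, m)        # multiply, then accumulate
    return crc


message = int.from_bytes(b"123456789", "big")
for width in (1, 4, 8, 16):
    print(width, hex(crc_chunked(message, 72, width, 0x11021, 16)))
```

Every chunk width produces the same CRC value; a smaller i only changes how many (m bit x i bit) multiply-accumulate steps are needed.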

Description

GALOIS FIELD ARITHMETIC UNIT FOR ERROR DETECTION AND CORRECTION IN PROCESSORS
Field of the Invention
[0001] This invention relates to error control coding in electronic and communication systems and more specifically to a method and apparatus for a Galois field arithmetic unit (GFU) for error detection and correction in such systems.
Background of the Invention
[0002] As is known, communication systems include a plurality of communication devices (e.g., modems, cable modems, personal computers, laptops, cellular telephones, radios, telephones, facsimile machines, and so on) that communicate directly (i.e., point- to-point) or indirectly via communication system infrastructure (e.g., wire line channels, wireless channels, bridges, switches, routers, gateways, servers, and so on). As is also well known, a communication system may include one or more local area networks and/or one or more wide area networks to support at least one of the Internet, cable services (e.g., modem functionality and television), wireless communications systems (e.g., radio, cellular telephones), satellite services, wire line telephone services, digital television, and so on.
[0003] In any type of communication system, information (e.g., voice, audio, video, text, data, and so on) is transmitted from one communication device to another via the infrastructure. Accordingly, the transmitting communication device prepares the information for transmission to the other device and provides the prepared information to the infrastructure for direct or indirect routing to the receiving communication device. Once received, the received communication device traverses the processing steps used by the transmitting communication device to prepare the information for transmission to recapture the original information. [0004] As is further known, transmission of information between communication devices is not performed under an ideal environment where the received information exactly matches the transmitted information. In practice, the infrastructure can introduce errors, which can result in distorting the transmitted information such that the received information does not exactly match the transmitted information. To compensate for the error introduced by the infrastructure, the transmitting communication device includes an encoder, which adds redundancy to the original data to make the original data more unique, and the receiving communication device includes a corresponding decoder, which uses the redundancy information to recover the original data from the received data that includes transmission errors.
[0005] In general, the encoder and decoder employ an error detection and correction technique to reduce the adverse effects of transmission errors. One particular type of error detection technique is called cyclic redundancy checking (CRC). Various forms of CRC are employed in the communication and consumer electronics arena. For example, a 16 bit CRC is employed in MPEG audio standards, whereas a 32 bit CRC is employed in Ethernet protocols. CRC involves generating redundancy bits by partitioning the bit stream of the original data into blocks of data. The blocks of data are processed sequentially, with the data from each block being divided by a polynomial. The remainder from the division process becomes the redundancy bits, which are appended to, and transmitted with, the block of data from which they were generated. The decoder, upon receiving a block of data, divides the block of data and the appended redundancy bits by the same polynomial. If the remainder of this division is zero, there are no errors in the received block of data. If, however, there is a remainder, an error exists. For CRC, when an error exists in the block of data, the decoder typically requests retransmission of the block of data.
[0006] Though such serial computation of CRC with linear feedback shift registers is used in hardwired circuits, parallel computation is much more efficient, especially in software implementations. Currently, there are various approaches to performing such CRC computations in parallel. One such technique proposes an empirical approach to byte-wise parallel CRC calculation, which uses LFSR contents after every eight shifts. Another such technique uses parallel CRC encoders based on digital system theory and Z-transforms. Generally, such techniques require application specific circuits (ASICs), which can be expensive and can consume large silicon area. Yet another technique employs GF arithmetic to compute parallel CRC. In this technique, the number of bits processed in parallel is equal to m, which is the degree of a generator polynomial. Using this technique for a large value of m (for example, the value of m being 32 or higher), requires an (m bit x m bit) MAC (multiply and accumulate) architecture. To accommodate such a large MAC architecture the ASIC can require a large silicon area and can consume significant amount of processor time. This is generally not desirable for use in processors, especially in digital signal processors (DSPs).
[0007] As is known, there are a number of popular error correction techniques. One such technique, that is widely used, is generally known as forward error correction (FEC). The FEC involves an encoder generating error correction data as a function of the data to be sent and then transmitting the error correction data along with the data. A decoder within the receiving communication device utilizes the error correction data to identify any errors in the original data that may have occurred during transmission. A popular FEC algorithm is called Reed Solomon (RS) encoding and decoding. Like CRC, RS partitions a data stream into sequential blocks of data and then divides a block of data by a polynomial to obtain parity, or check data. However, RS operates on a byte stream rather than a bit stream, so it creates check bytes, which are appended to each block of data. The decoding process at the receiver is considerably more complex than that of the CRC algorithm. First, a set of syndromes is calculated. If the calculated syndromes have a zero value, the received block of data is then deemed to have no errors. If one or more of the calculated syndromes are not zero, then the existence of one or more errors is indicated. The non-zero values of the syndrome are then used to determine the location of the errors and, from there, correct values of data can be determined to correct the errors.
[0008] Generally, the syndromes are computed based on Horner's Rule, using GF MAC operations. Finding error locations in the code word and their corresponding magnitudes is achieved by computing the error locator and evaluator polynomials by the Euclidean or Berlekamp-Massey algorithm. The roots of the error locator polynomial are calculated using the Chien Search method, which employs constant GF multiplications. Finally, the error values are found using the Forney algorithm. This step requires a GF inversion operation, which is generally performed with a look-up table. RS codes require an (m bit x m bit) multiplication. Generally, the value of m that is used is not more than eight bits for RS codes.
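As an illustration of syndrome evaluation by Horner's Rule with GF MAC operations, the following is a minimal Python sketch rather than the circuit described in this document. It assumes GF(2^8) with the field polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D), the primitive element α = 2, and syndromes evaluated at α^1, α^2, and so on; these choices, and the names `gf_mul` and `syndromes`, are assumptions made for the example, since RS conventions vary between standards.

```python
FIELD_POLY = 0x11D  # assumed field polynomial: x^8 + x^4 + x^3 + x^2 + 1
M = 8               # assumed field dimension


def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^8) by shift-and-add, reducing as we go."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # GF addition is a bitwise XOR
        a <<= 1
        if a >> M:               # degree reached 8: subtract the field polynomial
            a ^= FIELD_POLY
        b >>= 1
    return result


def syndromes(received: list[int], count: int) -> list[int]:
    """Evaluate the received polynomial at alpha^1 .. alpha^count.

    `received` lists the symbols with the highest-degree coefficient first.
    """
    out = []
    root = 1
    for _ in range(count):
        root = gf_mul(root, 2)               # next power of alpha
        acc = 0
        for symbol in received:              # Horner step: acc = acc * root + symbol
            acc = gf_mul(acc, root) ^ symbol
        out.append(acc)
    return out
```

For instance, the word `[1, 6, 8]` is the polynomial (x + α)(x + α^2) under these assumptions, so both of its syndromes are zero, while changing the last symbol to 9 makes them non-zero.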
[0009] Though Reed Solomon encoding/decoding requires 8 bit GF multiplications, CRC computation requires using larger values for m (for example, in the neighborhood of about 32 or more). Therefore, to use large values of m in CRC computation the MAC architecture can become significantly large, which can result in requiring a larger silicon area for the processors. Further, using such large silicon area can significantly lower the performance of the processors. Furthermore, the current MAC architectures can either perform error detection or data correction but not both.
Summary of the Invention
A Galois field arithmetic unit (GFU) is provided to perform multiply-accumulate operations for calculating CRC as well as for Reed-Solomon encoding/decoding. The GFU of the present invention uses sub-word-parallelism to enhance the system performance. According to an aspect of the present invention, there is provided a method for performing a cyclic redundancy check (CRC), the method including the steps of receiving a message of length n bits, partitioning the n bits into one or more blocks, wherein each block has i input bits such that n=k*i and i is less than or equal to m, wherein m is the degree of the generator polynomial used to compute the CRC, and computing a CRC value for the received message of n bits using the one or more blocks.
Brief description of the Drawing
[0010] FIG. 1 is a block diagram of a digital signal processor according to an embodiment of the present invention.
[0011] FIG. 2 illustrates a flowchart of an example embodiment of a method for calculating CRC to be implemented using the GFU shown in FIG. 2.
[0012] FIG. 3 is a schematic diagram of a GF multiplier according to an embodiment of the present invention.
[0013] FIG. 4 is a schematic diagram of a sub-cell array of the GF array, shown in FIG. 3, according to an embodiment of the present invention.
Description of Preferred Embodiments
[0014] In the following detailed description of the preferred embodiments, reference is made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments in which the invention may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.
[0015] The leading digit(s) of reference numbers appearing in the Figures generally corresponds to the Figure number in which that component is first introduced, such that the same reference number is used throughout to refer to an identical component which appears in multiple Figures. The same reference number or label may refer to signals and connections, and the actual meaning will be clear from its use in the context of the description.
[0016] FIG. 1 illustrates an example block diagram of a digital signal processor (DSP) 40. The DSP 40 shown in FIG. 1 is used to perform an encoding and decoding operation for the Reed-Solomon (RS) codes. As shown in FIG. 1, the DSP 40 has a processing circuit 42 and a Galois field multiplier unit (GFU) 44. Further as shown in FIG. 1, the processing circuit 42 has a processing module 46, a controller 48, and an input/output port 50. The input/output port 50 receives input data from an input terminal 52 of the DSP 40. The controller 48 transmits the input data to the processing module 46. The processing module 46 is used to perform the GF addition on the input data. However, when GF multiplication is required on the input data, the controller 48 transmits the input data to the GFU 44 via the input/output port 50. After the GFU 44 finishes processing the input data, the input data will be transmitted back to the processing module 46 for the following operations. In the end, the calculation result is outputted to an output terminal 54 of the DSP 40 via the input/output port 50. The GF addition is equivalent to an XOR logic operation.
[0017] Referring now to FIG. 2, there is illustrated an example method 200 of performing a CRC. At step 210, this example method 200 begins by receiving a message of n bits.
[0018] The algorithm proposed in this example embodiment enables parallel computation of CRC-m (i.e., the degree of the primitive polynomial is m) using chunks of i message bits at a time, where i is less than or equal to m. Processing fewer than m message bits in parallel can significantly reduce the required processor core area. If multiple MAC units are used in the processor, the silicon area saving can also be significant. Moreover, from a timing point of view, a true single-cycle MAC can be easily accomplished to eliminate data dependency stalls in the processor pipeline and hence can significantly improve system performance.
[0019] The following illustrates the serial computation of CRC for a received message A(x) of length n bits, which is denoted by the polynomial in x, $A(x) = a_{n-1}x^{n-1} + a_{n-2}x^{n-2} + \dots + a_1 x + a_0$, wherein $a_i \in \{0, 1\}$.
[0020] The generator polynomial of degree m can be denoted as $P(x) = x^m + p_{m-1}x^{m-1} + \dots + p_1 x + p_0$, where $p_i \in \{0, 1\}$.
[0021] Then the serial computation of Cyclic Redundancy Check (CRC) is computed using the equation,
$\mathrm{CRC}[A(x)] = [A(x)\,x^m] \bmod P(x)$ (1)
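Equation (1) can be checked with a short Python sketch. Polynomials over GF(2) are held as integers, with bit j standing for the coefficient of x^j. The helper names `poly_mod` and `crc_serial` are assumptions chosen for the example, and no initial value or final XOR (as used by some CRC standards) is applied.

```python
def poly_mod(value: int, poly: int) -> int:
    """Remainder of value divided by poly, both read as polynomials over GF(2)."""
    deg = poly.bit_length() - 1
    while value.bit_length() > deg:          # degree of value is still >= deg
        # Cancel the leading term by XORing in a shifted copy of poly.
        value ^= poly << (value.bit_length() - 1 - deg)
    return value


def crc_serial(message: int, poly: int) -> int:
    """CRC[A(x)] = [A(x) * x^m] mod P(x), where m is the degree of P(x)."""
    m = poly.bit_length() - 1
    return poly_mod(message << m, poly)
```

With the small generator x^3 + x + 1 (`0b1011`) and the message `0b1101`, the remainder is `0b001`, and the message with this remainder appended divides evenly by the generator.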
[0022] At step 220, the received message A(x) including n bits is first divided into blocks of i bits, where i is less than or equal to m, such that the length n of the message is a multiple of i. Otherwise, the necessary number of zeros is inserted before the most significant bit (MSB) of the message A(x) to make its length a multiple of i, wherein n = k*i and k is an integer. Then A(x) is expressed as follows:
$A(x) = (a_{ki-1}x^{i-1} + \dots + a_{(k-1)i+1}x + a_{(k-1)i})\,x^{(k-1)i}$
$+ \dots$
$+ (a_{2i-1}x^{i-1} + \dots + a_{i+1}x + a_i)\,x^{i}$
$+ (a_{i-1}x^{i-1} + \dots + a_1 x + a_0)$
$= W_{k-1}(x)\,x^{(k-1)i} + \dots + W_1(x)\,x^{i} + W_0(x)$,
[0023] Where each block of i message bits, denoted by $W_j(x)$, is a polynomial of degree (i - 1), for all j. Applying equation (1), the CRC of A(x) is given by,
$\mathrm{CRC}[A(x)] = [W_{k-1}(x)\,x^{(k-1)i+m}] \bmod P(x) + \dots + [W_1(x)\,x^{i+m}] \bmod P(x) + [W_0(x)\,x^{m}] \bmod P(x)$ (2)
[0024] Equation (2) is further reduced using the following two polynomials $F_1(x)$ and $F_2(x)$, represented in the quotient-remainder form when divided by P(x).
$F_1(x) = Q_1(x)P(x) + R_1(x)$
$F_2(x) = Q_2(x)P(x) + R_2(x)$
$[F_1(x)F_2(x)] \bmod P(x) = [R_1(x)R_2(x)] \bmod P(x)$.
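The identity above follows directly from the quotient-remainder forms. The short derivation below spells out the step, writing each polynomial as a quotient times P(x) plus a remainder:

$$
\begin{aligned}
F_1(x)F_2(x) &= [Q_1(x)P(x) + R_1(x)]\,[Q_2(x)P(x) + R_2(x)] \\
&= [Q_1(x)Q_2(x)P(x) + Q_1(x)R_2(x) + Q_2(x)R_1(x)]\,P(x) + R_1(x)R_2(x)
\end{aligned}
$$

The first term is a multiple of P(x), so it vanishes modulo P(x) and only $[R_1(x)R_2(x)] \bmod P(x)$ remains.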
[0025] Further, the degree of each of the message blocks, Wj(x) in equation (2) is (i - 1) which is less than the degree of P(x).
$W_j(x) \bmod P(x) = W_j(x), \quad \forall j \in \{0, 1, \dots, k-1\}$
[0026] Using the above equations we can express each product term in equation (2) as,
$[W_j(x)\,x^{ji+m}] \bmod P(x) = [W_j(x)\,[x^{ji+m} \bmod P(x)]] \bmod P(x)$ (3)
[0027] Consider an extension of the GF of m bits, denoted as GF($2^m$), whose primitive generator polynomial is P(x), wherein α is the root of P(x), i.e., the primitive element of the field. Then $x^{ji+m} \bmod P(x)$ is equivalent to $\alpha^{ji+m}$ when a canonical or standard basis is used to represent the field. Again, since $W_j(x)$ is a polynomial of degree (i - 1), we expand it to degree (m - 1) by filling the MSB positions with zeros.
$W_j(x) = 0x^{m-1} + \dots + 0x^{i} + a_{ji+i-1}x^{i-1} + \dots + a_{ji+1}x + a_{ji}$.
[0028] Using the above relations in equations (2) and (3), the CRC of A(x) can be computed as a multiply-accumulate in GF($2^m$) as follows:
$\mathrm{CRC}[A(x)] = (W_{k-1}(x)\,\alpha^{(k-1)i+m}) + \dots + (W_1(x)\,\alpha^{m+i}) + (W_0(x)\,\alpha^{m})$
[0029] Therefore the value of CRC for the received message A(x) can be computed using the equation,
$\mathrm{CRC}[A(x)] = \sum_{j=0}^{k-1} W_j(x)\,\alpha^{m+ji}$ (4)
[0030] At step 230, the value of k and the intermediate CRC value are initialized to 0. At step 240, a current block of i input bits is multiplied with a GF coefficient to obtain a current multiplied CRC value. The GF coefficient is a power of a primitive element α in a finite field GF($2^m$). At step 250, the current multiplied CRC value is added to a previously obtained intermediate CRC value associated with a block of i inputs to obtain a new intermediate CRC value. At step 260, the value of k is incremented by a predetermined value. In some embodiments, the value of k is incremented by 1. At step 270, the method 200 determines whether (k = n/i). If k is not equal to n/i, the method 200 goes to step 240 and repeats steps 240-270. If (k = n/i), the method 200 goes to step 280 and outputs the new intermediate CRC value as the CRC value.
[0031] The summation and product symbols in equation (4) represent GF operations. In these embodiments, CRC can be computed in parallel using chunks of i input bits of the message at a time. The i bit input word is multiplied with a coefficient, which is a power of the primitive element α in GF($2^m$), and the product is then added to the previously accumulated result. The multiplication and addition operations are repeated for k iterations to get the final CRC value, as described above with reference to FIG. 2.
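The multiply-accumulate loop of equation (4) can be sketched in Python as follows. This is an illustrative software model under stated assumptions, not the hardware MAC: polynomials are held as integers (bit j is the coefficient of x^j), the names `poly_mod`, `gf_mac`, `crc_parallel` and `crc_serial` are hypothetical, and the coefficients α^(m+ji) are generated on the fly by repeated multiplication with α^i instead of being read from a precomputed table.

```python
def poly_mod(value: int, poly: int) -> int:
    """Remainder of value divided by poly, both read as polynomials over GF(2)."""
    deg = poly.bit_length() - 1
    while value.bit_length() > deg:
        value ^= poly << (value.bit_length() - 1 - deg)
    return value


def gf_mac(w: int, coeff: int, acc: int, poly: int) -> int:
    """Return w * coeff + acc in the field defined by poly."""
    prod = 0
    while w:
        if w & 1:
            prod ^= coeff        # carry-less partial product
        coeff <<= 1
        w >>= 1
    return poly_mod(prod, poly) ^ acc


def crc_parallel(message: int, n: int, poly: int, i: int) -> int:
    """CRC of an n bit message, consuming i bits per multiply-accumulate."""
    m = poly.bit_length() - 1
    if not 1 <= i <= m:
        raise ValueError("block size i must satisfy 1 <= i <= m")
    k = -(-n // i)                       # number of blocks; zeros pad the MSB side
    coeff = poly_mod(1 << m, poly)       # alpha^m
    step = poly_mod(1 << i, poly)        # alpha^i
    crc = 0
    for j in range(k):
        w_j = (message >> (j * i)) & ((1 << i) - 1)   # block W_j
        crc = gf_mac(w_j, coeff, crc, poly)           # crc += W_j * alpha^(m + j*i)
        coeff = gf_mac(coeff, step, 0, poly)          # advance to the next coefficient
    return crc


def crc_serial(message: int, poly: int) -> int:
    """Reference value from equation (1): [A(x) * x^m] mod P(x)."""
    m = poly.bit_length() - 1
    return poly_mod(message << m, poly)
```

A quick consistency check is to compare `crc_parallel` against `crc_serial` for the same message and generator, for example with the degree-32 generator `0x104C11DB7` and i = 8.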
[0032] The coefficients are constant for a particular field dimension m and a primitive polynomial P(x), and are computed prior to the MAC operations. The total number of coefficients that need to be stored in the memory of a processor is given by n/i = k. But if the message length is large, the field elements wrap around, the maximum number of coefficients being $2^m$/i = k'. Hence the number of coefficients is given by min(k, k'). The important point to note here is that even though the multiplication and addition are performed in GF($2^m$), one of the operands, $W_j(x)$, can be treated as an i bit number in GF($2^m$), since the higher bit positions are all zeros. Hence, unlike in RS CODECs, parallel CRC can be calculated with an (m bit x i bit) standard basis MAC structure in GF($2^m$), where i < m. For a large m like 32, an (m bit x m bit) MAC structure for CRC can be expensive in terms of silicon area and speed, particularly when multiple such compute blocks are used in a general purpose processor, such as a DSP. Using i equal to 8 facilitates byte-wise parallel CRC computation.
[0033] FIG. 3 illustrates an example GF multiplier 300 including a MAC architecture implementing the parallel CRC computation scheme described above. As shown in FIG. 3, the GF multiplier 300 includes a pre-shift stage 310, a GF multiplier array 320, and a post-shift-add stage 330. Also as shown in FIG. 3, the GF multiplier array 320 includes a plurality of sub-cell matrices in the sub-cell array 340 that are arranged in one or more rows and columns.
[0034] Reed Solomon (RS) codes are constructed and decoded through the use of GF arithmetic. Encoding is done through polynomial division, which employs a linear feedback shift register (LFSR) structure like in the serial CRC computation, but operates on words of m bits in GF($2^m$), instead of individual bits. The above-described RS encoder can be easily implemented in software with the aid of a GF MAC unit. Generally, the decoding of RS codes consists of the following two steps:
1. Syndrome Evaluation - A non-zero syndrome signifies an error in the received code word. The syndromes are calculated based on Horner's Rule, using GF MAC operations.
2. Finding error locations in the code word and their corresponding magnitudes - This is achieved by computing the error locator and evaluator polynomials by the Euclidean or Berlekamp-Massey algorithm. The roots of the error locator polynomial are calculated using the Chien Search method, which employs constant GF multiplications. Finally the error values are found using the Forney algorithm. This step requires a GF inversion operation, which is generally performed with a look-up table. Hardware support for inversion is expensive, and is generally not required as it does not affect the overall performance of the CODEC. Unlike CRC, RS CODECs must use an (m bit x m bit) MAC unit. This is generally fine from a hardware point of view, because the field dimension m for almost all of the practical RS codes in error-control coding is not greater than 8.
[0035] The following illustrates various multiplier architectures generally used for CRC computation. The finite field addition is performed using a bitwise XOR of the two operands. As described above, parallel CRC computation requires standard basis multipliers.
[0036] Assume that Y(x) denotes the product of A(x) and B(x), where A, B, Y ∈ GF($2^m$). Further assume that $P(x) = x^m + p_{m-1}x^{m-1} + \dots + p_1 x + p_0$ denotes the primitive polynomial of the field, with α being its root. Hence A(x) and B(x) can be represented as polynomials in α as follows:
$A = a_{m-1}\alpha^{m-1} + a_{m-2}\alpha^{m-2} + \dots + a_1\alpha + a_0$
$B = b_{m-1}\alpha^{m-1} + b_{m-2}\alpha^{m-2} + \dots + b_1\alpha + b_0$
[0037] Wherein $a_i, b_i \in$ GF(2) = {0, 1}. There are generally two types of array multipliers, depending on the order in which the multiplier bits are processed, viz., least significant bit (LSB)-first and MSB-first multipliers. The MSB-first multiplier has a longer critical path of m - 1 and requires more XOR gates. Hence the LSB-first multiplier is superior as it reduces the computation delay considerably, without adding any extra hardware. Apart from array multipliers, there are parallel irregular multipliers, which perform the polynomial multiplication and the degree reduction separately. The silicon area and the delay of this multiplier are similar to those of the LSB-first multiplier, though it can potentially provide lower power dissipation.
[0038] The GF multiplier array 320 shown in FIG. 3 is based on the LSB-first array multiplier to exploit the regularity of the array structure. In the LSB-first multiplier, multiplication starts with the least significant bit of the multiplier B as outlined below.
$Y(x) = A(x)B(x) \bmod P(x)$
$Y(x) = b_0 A + b_1[A\alpha \bmod P(x)] + b_2[A\alpha^2 \bmod P(x)] + \dots + b_{m-1}[A\alpha^{m-1} \bmod P(x)]$
[0039] Two intermediate polynomials, V(x) and W(x), of degree (m - 1) are introduced to describe the basic computation steps involved in each iteration. In the k-th iteration, for 1 ≤ k ≤ m, the following computations are performed in parallel:
$V^{(k)} = [V^{(k-1)}]\,\alpha \bmod P(x)$
$W^{(k)} = V^{(k-1)}\,b_{k-1} + W^{(k-1)}$
[0040] Wherein $W^{(0)} = 0$ and $V^{(0)} = A$. Since α is a root of P(x), P(α) = 0.
Therefore, $\alpha^m = p_{m-1}\alpha^{m-1} + \dots + p_1\alpha + p_0$ (5)
Again,
$V^{(1)} = A\alpha \bmod P(x) = (a_{m-1}\alpha^m + \dots + a_1\alpha^2 + a_0\alpha) \bmod P(x)$ (5a)
[0041] Using the equation (5) in (5a), the relation for each coefficient of the polynomial V(x) in the first iteration can be deduced as,
$v^{(1)}_i = a_{i-1} + a_{m-1}p_i, \quad \forall i \in \{0, 1, \dots, m-1\}$
[0042] In the above equation, $a_{-1} = 0$. The following outlines the basic computation steps performed in each iteration, in the LSB-first multiplier.
[0043] For all $i \in \{0, 1, \dots, m-1\}$,
$v^{(k)}_i = v^{(k-1)}_{i-1} + v^{(k-1)}_{m-1}\,p_i$ (6)
$w^{(k)}_i = v^{(k-1)}_i\,b_{k-1} + w^{(k-1)}_i$ (7)
[0044] The above computation is repeated for m iterations to compute the final product given by $W^{(m)}$. In these embodiments, computing $V^{(m)}$ is not necessary. The multiplier can be converted to a multiply-accumulate Y = A B + C, by making $W^{(0)} = C$.
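The LSB-first iteration described above can be modelled in a few lines of Python. This is a word-level sketch under assumptions, not a gate-level description: the function name `gf_mac_lsb_first` and the argument `p_low` (the coefficients p_{m-1} ... p_0 of the field polynomial, with the x^m term left implicit) are hypothetical names chosen for the example.

```python
def gf_mac_lsb_first(a: int, b: int, c: int, p_low: int, m: int) -> int:
    """Return A * B + C in GF(2^m) using the LSB-first recurrence.

    p_low holds p_{m-1} .. p_0; the x^m coefficient is implicit.
    """
    mask = (1 << m) - 1
    v = a & mask                     # V(0) = A
    w = c & mask                     # W(0) = C turns the multiplier into a MAC
    for k in range(1, m + 1):
        if (b >> (k - 1)) & 1:       # W(k) = V(k-1) * b_{k-1} + W(k-1)
            w ^= v
        msb = (v >> (m - 1)) & 1     # coefficient v_{m-1} of V(k-1)
        v = (v << 1) & mask          # shift: v_i takes the old v_{i-1}
        if msb:
            v ^= p_low               # add v_{m-1} * p_i, i.e. reduce by P(x)
    return w
```

For example, in GF(2^3) with P(x) = x^3 + x + 1 (`p_low = 0b011`), the product of x + 1 and x^2 + x + 1 is x, so `gf_mac_lsb_first(3, 7, 0, 0b011, 3)` yields 2, and passing `c = 1` yields 3.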
[0045] The critical path in a GF($2^m$) MAC unit comprises m XOR + m AND gates. The path starts from $A_{m-1}$ and ends at $Y_{m-1}$. The AND-XOR combination may be replaced by NAND-XNOR for higher speed. A basic cell consisting of a pair of NAND-XNOR logic circuits is repeated $m^2$ times in a two-dimensional array structure to build the GF multiplier.
[0046] For a processor, such as a DSP, it is not enough to build a multiplier with a programmable primitive polynomial; it should also be programmable with respect to the field dimension m. With appropriate pre-shift and post-shift we can extend the multiplier architecture to perform multiplications over GF($2^{m'}$) where m' < m. The output Y(x) of degree (m' - 1) cannot be computed directly with the GF($2^m$) multiplier. But when it is extended to degree (m - 1) as $Y(x)x^{m-m'}$ it can be calculated as follows:
$Y(x) = A(x)B(x) \bmod P(x)$
i.e., $A(x)B(x) = Q(x)P(x) + Y(x)$
[0047] Multiplying both sides of the above equation by $x^{m-m'}$ yields the following equation,
$[A(x)x^{m-m'}]B(x) = Q(x)[P(x)x^{m-m'}] + Y(x)x^{m-m'}$
[0048] The above relation shows that if one of the input operands and the primitive polynomial are left-justified by shifting by m - m' bit positions, then the product in GF($2^{m'}$) also appears in left-justified format. The product is then right-shifted back by m - m' bit positions. For a MAC operation, the summand C(x) also needs to be left-shifted like A(x) and P(x). But if silicon area is a concern, then C(x) can be added to the product finally after the right-shift operation. However, this can increase the critical path of the MAC structure by a further XOR gate delay.
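The left-justification argument can be tried out with a small Python model. It is a sketch under assumptions: `gf_mul_array` stands in for a fixed m bit LSB-first multiplier, `gf_mul_small_field` wraps it with the pre-shift and post-shift, and both names, as well as the default array width of 8 bits, are hypothetical.

```python
def gf_mul_array(a: int, b: int, p_low: int, m: int) -> int:
    """Fixed-width LSB-first GF multiplier; p_low omits the top coefficient."""
    mask = (1 << m) - 1
    v, w = a & mask, 0
    for k in range(m):
        if (b >> k) & 1:
            w ^= v
        msb = (v >> (m - 1)) & 1
        v = (v << 1) & mask
        if msb:
            v ^= p_low
    return w


def gf_mul_small_field(a: int, b: int, p_low: int, m_small: int, m: int = 8) -> int:
    """Multiply in a field of dimension m_small using an m bit wide array."""
    shift = m - m_small
    # Pre-shift: left-justify one operand and the field polynomial.
    product_left = gf_mul_array(a << shift, b, p_low << shift, m)
    # Post-shift: bring the left-justified product back to right-justified form.
    return product_left >> shift
```

As a check, running every pair of elements of the 3 bit field defined by x^3 + x + 1 through the 8 bit array gives the same products as a native 3 bit multiplier.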
[0049] The GFU described above is a multiplier which is programmable with respect to both the primitive polynomial and the field dimension. As seen earlier, RS codes require (m bit x m bit) multiplication where 1 ≤ m ≤ 8. But CRC requires up to 32 bit multipliers. However, as shown earlier, CRC can be achieved by (m bit x i bit) multiplication where i < m. Hence an array of (32 bit x 8 bit) can support both applications. The register size of a processor, such as a DSP, is generally 32 bits. The architecture of the DSP needs to support packed arithmetic or sub-word-parallelism (SWP), so that a 32 bit register can be accessed as quad 8 bit fields. To improve the performance of the MAC in a SWP architecture, the MAC shown in FIG. 3 supports the following modes in GF($2^m$):
1. quad ((m bit x m bit) + m bit), for 1 ≤ m ≤ 8
2. dual ((m bit x i bit) + m bit), for 8 < m ≤ 16, 1 ≤ i ≤ 8
3. single ((m bit x i bit) + m bit), for 16 < m ≤ 32, 1 ≤ i ≤ 8.
[0050] The first mode is suitable for RS coding/decoding, where four parallel MAC operations can be performed in a single cycle. The last two are suitable for CRC computation by the algorithm described earlier, using chunks of 8 message bits or less at a time. The MAC unit shown in FIG. 3 includes three sub-units, namely the pre-shift stage 310, the GF multiplier array 320, and the post-shift-add stage 330. The addition is performed in the last stage to reduce silicon area; otherwise a further pre-shift of the summand may be required.
[0051] The multiplier structure automatically configures itself based on the field dimension m. The input operands A(x), B(x), C(x), P(x) and the output operand Y(x) are all stored in 32 bit registers. The GF multiplier array is of the form (32 bit x 8 bit). The 32 bit multiplier A(x) is directly fed to the multiply unit. Since B(x) is 32 bits, but the GF multiplier array only handles 8 bits, appropriate bytes are chosen from B(x) for the multiplication. The data packing in the various modes of multiplication, i.e. quad, dual and single multiplication, is shown in the table below. P(x) is stored in a register occupying m bits. The coefficient for $x^m$, which is always unity, is not stored. The following illustrates the automatic configuration of the (32 bit x 8 bit) GF multiplier array.
[Table (Figures imgf000014_0001 to imgf000014_0003): data packing of the 32 bit operand registers. Three rows, one per mode, with the first operand packed as (A3, A2, A1, A0), (A1, A0), and (A0).]
[0052] The pre-shift stage 310 left-justifies A(x) and P(x) to the nearest byte boundary, to generate A'(x) and P'(x), respectively. For example, if m = 14, they are left-shifted 2 bit positions, to align with the 16 bit boundary.
$A'(x) = A(x)\,x^{(32-m) \bmod 8}$; $P'(x) = P(x)\,x^{(32-m) \bmod 8}$
[0053] Further, the contents of B(x) and P'(x) are chosen appropriately as shown in the above table, depending on the value of m. To achieve the above, the operand B(x) is divided into groups of 8 bits as follows:
$B(x) = (b_{31}x^{31} + \dots + b_{24}x^{24}) + (b_{23}x^{23} + \dots + b_{16}x^{16}) + (b_{15}x^{15} + \dots + b_8 x^8) + (b_7 x^7 + \dots + b_0)$
$= B_3(x) + B_2(x) + B_1(x) + B_0(x)$
[0054] Similarly,
$P'(x) = P'_3(x) + P'_2(x) + P'_1(x) + P'_0(x)$
[0055] The four control variables depending on m are defined as follows.
$\delta_0 = 1$, if 1 ≤ m ≤ 8; $= 0$, otherwise;
$\delta_1 = 1$, if 8 < m ≤ 16; $= 0$, otherwise;
$\delta_2 = 1$, if 16 < m ≤ 24; $= 0$, otherwise;
$\delta_3 = 1$, if 24 < m ≤ 32; $= 0$, otherwise;
[0056] Then the contents of B'(x) and P'(x) are chosen as follows:
$B'(x) = \delta_3[B_0(x)x^{24} + B_0(x)x^{16} + B_0(x)x^{8} + B_0(x)]$
$+ \delta_2[B_0(x)x^{16} + B_0(x)x^{8} + B_0(x)]$
$+ \delta_1[B_2(x)x^{8} + B_2(x) + B_0(x)x^{8} + B_0(x)]$
$+ \delta_0 B(x)$.
$P'(x) = \delta_0[P'_0(x)x^{24} + P'_0(x)x^{16} + P'_0(x)x^{8} + P'_0(x)] + \delta_1[P'_1(x)x^{16} + P'_0(x)x^{16} + P'_1(x) + P'_0(x)] + \delta_2 P'(x) + \delta_3 P'(x)$.
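The selection of B'(x) and P'(x) can be sketched in Python as below. The reading of the control variables as δ0 (1 ≤ m ≤ 8), δ1 (8 < m ≤ 16), δ2 (16 < m ≤ 24) and δ3 (24 < m ≤ 32) is an interpretation of the definitions above, and the function names `control_vars`, `select_b` and `select_p` are hypothetical. Note that $B_2(x)$ already sits at bit positions 16 to 23, so the code extracts the byte values and places the copies explicitly.

```python
def control_vars(m: int) -> tuple[int, int, int, int]:
    """Return (d0, d1, d2, d3); exactly one is 1 for 1 <= m <= 32."""
    if not 1 <= m <= 32:
        raise ValueError("field dimension m must be in 1..32")
    return (int(m <= 8), int(8 < m <= 16), int(16 < m <= 24), int(24 < m))


def select_b(b: int, m: int) -> int:
    """Replicate the multiplier byte(s) of B so every byte column sees its operand."""
    d0, d1, d2, d3 = control_vars(m)
    b0 = b & 0xFF                 # byte value of B_0
    b2 = (b >> 16) & 0xFF         # byte value of B_2
    if d3:
        return b0 * 0x01010101    # single 32 bit field: B_0 in all four bytes
    if d2:
        return b0 * 0x00010101    # 24 bit field: B_0 in the lower three bytes
    if d1:
        return b2 * 0x01010000 | b0 * 0x0101   # dual: B_2 twice, B_0 twice
    return b & 0xFFFFFFFF         # quad: each byte is its own operand


def select_p(p_shifted: int, m: int) -> int:
    """Replicate the pre-shifted polynomial P' once per independent field."""
    d0, d1, _, _ = control_vars(m)
    if d0:
        return (p_shifted & 0xFF) * 0x01010101
    if d1:
        return (p_shifted & 0xFFFF) * 0x00010001
    return p_shifted & 0xFFFFFFFF
```

For example, `select_b(0x00CD00AB, 12)` gives `0xCDCDABAB`, matching the dual mode term of B'(x), and `select_p(0x1D, 8)` gives `0x1D1D1D1D` for quad mode.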
[0057] In this embodiment, the outputs of the pre-shift sub-block are A'(x), B'(x) and P'(x). These are then passed onto the GF multiplier array of size (32 bit x 8 bit). The GF multiplier array 320 multiplies the inputs A'(x) and B'(x) according to the field parameters m and P(x). The intermediate polynomials V(x) and W(x) described above will have a degree of 31. The computation stage or iteration number is denoted by k, ranging from about 1 to 8. The initial values for the inputs to the GF multiplier array are given by,
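[0058] The above outlined equation (6) is modified by introducing two new polynomials u(x) and l(x). Then, for all bit positions $i \in \{0, 1, \dots, 31\}$, and for all iterations 1 ≤ k ≤ 8,
Figure imgf000016_0001
wherein,
$u^{(k)}_i = \delta_3 v^{(k)}_{31} + \delta_2 v^{(k)}_{23} + \delta_1 v^{(k)}_{15} + \delta_0 v^{(k)}_{7}, \quad 0 \le i \le 7$
$= \delta_3 v^{(k)}_{31} + \delta_2 v^{(k)}_{23} + (\delta_1 + \delta_0) v^{(k)}_{15}, \quad 8 \le i \le 15$
$= (\delta_3 + \delta_1) v^{(k)}_{31} + (\delta_2 + \delta_0) v^{(k)}_{23}, \quad 16 \le i \le 23$
$= v^{(k)}_{31}, \quad 24 \le i \le 31$.
Figure imgf000016_0002 (definition of $l_i$, ending with the case i = 8)
$= \delta_3 + \delta_2, \quad i = 16$
$= \delta_3 + \delta_1, \quad i = 24$
$= 1$, otherwise.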
[0059] As before, $v_{-1} = 0$. Also, equation (7) remains unaltered except for the change due to data packing in operand B, which is one of the inputs to the GF multiplier 300.
[0060] For iteration k ranging from 1 to 8,
$w^{(k)}_i = v^{(k-1)}_i\,b_{k-1} + w^{(k-1)}_i, \quad \forall i \in \{0, \dots, 7\}$ (9a)
$w^{(k)}_i = v^{(k-1)}_i\,b_{k+7} + w^{(k-1)}_i, \quad \forall i \in \{8, \dots, 15\}$ (9b)
$w^{(k)}_i = v^{(k-1)}_i\,b_{k+15} + w^{(k-1)}_i, \quad \forall i \in \{16, \dots, 23\}$ (9c)
$w^{(k)}_i = v^{(k-1)}_i\,b_{k+23} + w^{(k-1)}_i, \quad \forall i \in \{24, \dots, 31\}$ (9d)
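To make the packed recurrences concrete, the following Python sketch models a (32 bit x 8 bit) array bit by bit. It rests on several assumptions that are not legible in the text: the form of equation (8) is taken to be v_i(k) = l_i v_{i-1}(k-1) + u_i(k-1) p'_i, which is inferred from the AND/XOR wiring described for FIG. 4; the value of l_i at i = 8 is assumed to be δ3 + δ2 + δ1; and the initial values are assumed to be V(0) = A' and W(0) = 0. The function name `packed_gf_multiply` is hypothetical, and the operands are expected to be already pre-shifted and packed.

```python
def packed_gf_multiply(a_p: int, b_p: int, p_p: int, m: int) -> int:
    """Bit-level model of a (32 bit x 8 bit) packed GF multiplier array.

    a_p, b_p, p_p are the pre-shifted, packed operands A'(x), B'(x), P'(x).
    """
    d0, d1 = int(1 <= m <= 8), int(8 < m <= 16)
    d2, d3 = int(16 < m <= 24), int(24 < m <= 32)

    # l_i is 1 everywhere except where a byte boundary separates two fields.
    l = [1] * 32
    l[0] = 0                      # v_{-1} = 0
    l[8] = d3 | d2 | d1           # assumed: this case is not legible in the text
    l[16] = d3 | d2
    l[24] = d3 | d1

    v = [(a_p >> i) & 1 for i in range(32)]     # assumed V(0) = A'
    w = [0] * 32                                # assumed W(0) = 0
    for k in range(1, 9):
        # Equations (9a)-(9d): byte lane i // 8 uses bit k - 1 of its own byte of B'.
        w = [w[i] ^ (v[i] & (b_p >> (k - 1 + 8 * (i // 8))) & 1) for i in range(32)]
        # u_i is the top bit of the field that bit i belongs to (one value per lane).
        u = [
            (d3 & v[31]) ^ (d2 & v[23]) ^ (d1 & v[15]) ^ (d0 & v[7]),
            (d3 & v[31]) ^ (d2 & v[23]) ^ ((d1 | d0) & v[15]),
            ((d3 | d1) & v[31]) ^ ((d2 | d0) & v[23]),
            v[31],
        ]
        # Assumed form of equation (8): shift within a field, then reduce by P'.
        v = [(l[i] & v[i - 1]) ^ (u[i // 8] & (p_p >> i) & 1) for i in range(32)]
    return sum(bit << i for i, bit in enumerate(w))
```

Under these assumptions, quad mode with the 8 bit polynomial `0x1D` replicated in every byte multiplies four byte pairs independently, and single mode with the 32 bit polynomial `0x04C11DB7` multiplies one 32 bit value by one byte replicated across B'.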
[0061 ] The product
Figure imgf000016_0003
Thus the above equations describe the GF multiplier array 320 shown in FIG. 3, which can perform various modes of packed arithmetic multiplication depending on the field size. For example, it can be envisioned as a GF multiplier array of (32 bit x 8 bit), which is configured as either four independent (8 bit x 8 bit) arrays arranged side-by-side, as two independent (16 bit x 8 bit) arrays, or as a single (32 bit x 8 bit) array. This configuration is done with the new polynomials u(x) and l(x). Although the above described GF multiplier array is of size (32 bit x 8 bit), the above-described technique can be extended for any different GF multiplier array size, such as (40 bit x 8 bit) or (32 bit x 16 bit).
[0062] The last sub-block in the GF arithmetic unit 300 shown in FIG. 3 is the post-shift-add stage 330. The shift is performed to right shift the product back to the right-justified packed format. After the right shift, the summand C(x) is added to give the final MAC result. As noted earlier, if area is not a concern the extra latency due to the XOR of C(x) can be avoided by pre-shifting C(x) and adding it in the GF multiplier array as $W^{(0)}$.
$Y(x) = Y'(x)\,x^{(m-32) \bmod 8} + C(x)$
[0063] FIG. 4 shows an example implementation of the sub-cell matrices in the sub-cell array 340 used in the GF multiplier array 320 using the AND and XOR logic circuits. FIG. 4 shows two neighboring sub-cells 410 and 420 arranged in a row. As shown in FIG. 4, each of the sub-cells includes 8 cells 430. The new variables u and l translate to logic gates at the byte boundaries of the GF multiplier array. The MAC architecture shown in FIGS. 3 and 4 can be pipelined. The summand C(x) should preferably be passed through the pre-shifter and the GF multiplier array as $W^{(0)}$. Also, the various bits of B'(x) should be delayed appropriately to prevent previous data erasure. This depends on the level of pipelining.
[0064] In some embodiments, the pre-shift stage 310 shown in FIG. 3 receives first, second, and third operands (A), (B), and (P) 350, 360, and 370, respectively. Each of the received operands has m bits. The pre-shift stage 310 left-justifies the operands A and P to a nearest byte boundary. It also divides the operand B and the pre-shifted operand P into sub-words and selects the appropriate sub-words depending on the field size m.
[0065] The GF multiplier array 320 receives the sub-words associated with each operand, performs GF multiplication on a sub-word-parallel basis, and outputs the multiplied value (A x B) in GF (i.e., outputs a GF multiplied value). The post-shift-add stage 330 receives a fourth operand (C) 380, which has m bits. The post-shift-add stage 330 divides the fourth operand 380 into sub-words. Further, the post-shift-add stage 330 receives the GF multiplied value from the GF multiplier array 320 and right-justifies the GF multiplied value. Furthermore, the post-shift-add stage 330 adds the right-justified GF multiplied value to the sub-words associated with the operand C 380 and outputs the multiply-accumulate value of ((A x B) + C) in the GF. In some embodiments, the m bits can be in the range of about 8 bits to 40 bits.
[0066] In these embodiments, each of the sub-cell matrices in the sub-cell array 340 in the GF multiplier array 320, as shown in FIGS. 3 and 4, includes 8 GF cells 430 arranged in a row. As shown in FIG. 4, each GF cell 430 includes a first and a second AND logic circuit 440 and 445. Each of the first and the second AND logic circuits 440 and 445 has first and second inputs and an output 442 and 443, 447 and 448, and 444 and 449, respectively.
[0067] Also as shown in FIG. 4, each GF cell 430 includes a first and a second XOR logic circuit 450 and 455. Each of the first and the second XOR logic circuits 450 and 455 has a first and a second inputs and an output 452 and 453 (same as 447), 457 and 458, and 454 and 459, respectively. The outputs of the first and second XOR logic circuits 454 and 459 form the bits of the intermediate polynomials v and w as described in the above outlined equations (8) and (9). Further as shown in FIG. 4, the first input 442 of the first AND logic circuit 440 is connected to receive one of the m bits associated with the pre-shifted operand P 370 (shown in FIG. 3) and the second input 443 of the first AND logic circuit 440 is connected to receive one of the bits associated with the new polynomial u as described in the above equation (8). Furthermore as shown in FIG. 4, the first input 447 of the second AND logic circuit 445 is connected to receive one of the bits associated with the intermediate polynomial v and the second input 448 of the second AND logic circuit 445 is connected to receive one of the bits associated with the second operand B 360 (shown in FIG. 3).
[0068] Furthermore as shown in FIG. 4, the first input 452 of the first XOR logic circuit 450 is connected to the output 444 of the first AND logic circuit 440 and the second input 453 of the first XOR logic circuit 450 is connected to one of the bits associated with the intermediate polynomial v which is modified logically by a sub-cell matrix AND logic circuit 480 with the new polynomial l, as described in the equation (8). Further, the first input 457 of the second XOR logic circuit 455 is connected to the output 449 of the second AND logic circuit 445 and the second input 458 of the second XOR logic circuit 455 is connected to one of the bits associated with the intermediate polynomial w as described in the equation (9).
[0069] Moreover as shown in FIG. 4, each sub-cell array in the sub-cell array 340 further includes a MUX 470 and the sub-cell matrix AND logic circuit 480. As shown in FIG. 4, the MUX 470 has one or more inputs 472 and an output 474. The output 474 of the MUX 470 represents 8 bits of the new polynomial u, as described in the equation (8). The one or more inputs 472 of the MUX 470 in each sub-cell array 340 (shown in FIG. 3) vary from 4 to 1 depending on the position of the sub-cell 410 (the right most sub-cell 410 has 4 inputs). Further as shown in FIG. 4, the sub-cell matrix AND logic circuit 480 has first and second inputs 482 and 484 and an output 486. As shown in FIG. 4, the one or more inputs 472 of the MUX 470 is connected to one bit associated with the intermediate polynomial v and the output 474 of the MUX 470 is connected to each second input 443 of the first AND logic circuit 440 in the sub-cell array 340. Furthermore, the second input 484 of the sub-cell matrix AND logic circuit 480 is connected to the one or more inputs 472 of the MUX 470 and the first input 482 of the sub-cell matrix AND logic circuit 480 is connected to one of the bits associated with the new polynomial l. As shown in FIG. 4, the MUX 470 and the sub-cell AND logic circuit 480 are included at byte-boundaries (after every 8th GF cell) of the sub-cell array 340.
[0070] In these embodiments, the polynomials v and w represent each stage or row of computation of the product of operands A and B, i.e., (A x B). The initial value of the polynomial v is the input operand A. The final value of the polynomial w is the computed product (A x B). The new polynomial u has coefficients consisting of byte-boundary coefficients of the polynomial v, i.e., a combination of the coefficients v31, v23, v15, and v7. This combination of byte-boundary coefficients of the polynomial v in the polynomial u is determined by the field size m. The new polynomial l signifies a connection between several sub-cell matrices of 8 GF cells in a row of the sub-cell array 340. The polynomial l has unity coefficients in all positions in the sub-cell array 340 except at the byte boundary, which depends on the field size m. The new polynomials u and l are introduced to facilitate sub-word parallelism in the GF multiplier array 320.
[0071] As shown in FIGS. 3 and 4, each of the first and the second XOR logic circuits 450 and 455 associated with a sub-cell array 340 in the GF multiplier array 320 has its output 454 and 459 connected to the first input 447 of the second AND logic circuit 445 and the second input 458 of the second XOR logic circuit 455, respectively, of a next successive sub-cell array, except that the first and the second XOR logic circuits of a last sub-cell array are connected to an output of the GF multiplier array.
[0072] Further, as shown in FIGS. 3 and 4, the first inputs 442 and 447 of the first and second AND logic circuits 440 and 445, respectively, of a sub-cell array 340 are connected to receive a bit in the pre-shifted third operand P 370 and the output 454 of the first XOR logic circuit 450, respectively, of a substantially previous sub-cell array in the GF multiplier array, except for the first and second AND logic circuits of a first sub-cell array, which are connected to an input of the GF multiplier array.
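The roles of the polynomials v, w, u, and l in sub-word-parallel operation can be illustrated at word level. The sketch below is a simplified software model under stated assumptions, not the described hardware: it fixes the field size at 8 bits, packs four independent sub-words into a 32-bit word, and gives every sub-word its own B and P byte. The names `spread`, `gf_mul_packed`, and `gf_mul_scalar` are hypothetical. Here l appears as a mask that is zero at the lowest bit of each sub-word, so no bit shifts in from the neighboring sub-word, and u is the top bit of each sub-word of v.

```python
LANE_LSB = 0x01010101   # bit 0 of each 8-bit sub-word in a 32-bit word
WORD_MASK = 0xFFFFFFFF


def spread(lane_bits):
    """Expand a 0/1 flag in bit 0 of each byte into 0x00/0xFF for that byte."""
    return (lane_bits & LANE_LSB) * 0xFF


def gf_mul_packed(a, b, p):
    """Multiply four GF(2^8) elements at once; a, b, p are packed 32-bit words.

    Each byte of p holds the low 8 bits of that sub-word's field polynomial.
    """
    v = a & WORD_MASK                    # v starts as operand A
    w = 0                                # w accumulates the product
    l_mask = WORD_MASK & ~LANE_LSB       # l: zero at each sub-word's lowest bit
    for j in range(8):
        w ^= v & spread(b >> j)          # add v where bit j of B is set
        u = spread(v >> 7)               # u: top bit of each sub-word of v
        v = ((v << 1) & l_mask) ^ (u & p)    # shift, cut at boundary, reduce
    return w


def gf_mul_scalar(a, b, p_low, m=8):
    """Reference shift-and-add multiply in a single field, for comparison."""
    v, w = a, 0
    for j in range(m):
        if (b >> j) & 1:
            w ^= v
        top = (v >> (m - 1)) & 1
        v = (v << 1) & ((1 << m) - 1)
        if top:
            v ^= p_low
    return w
```

As a usage example, `gf_mul_packed(0x57570201, 0x831380AB, 0x1B1B1B1B)` multiplies the byte pairs (0x57, 0x83), (0x57, 0x13), (0x02, 0x80), and (0x01, 0xAB) in one pass, and each result byte matches `gf_mul_scalar` applied to the same pair with `p_low = 0x1B`.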
[0073] It can be envisioned that the above-described MAC architecture including AND-XOR logic circuits in each of the sub-cell matrices in the sub-cell array 340 in the GF multiplier array 320 can be built using NAND-XNOR logic circuits, which can provide a higher system performance. Each sub-cell array 340 in the GF multiplier array 320 consisting of the NAND-XNOR logic circuits is repeated m² times in a two-dimensional array structure to obtain the above-described MAC architecture.
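One way to see why such a substitution can preserve the logic function is the Boolean identity (a AND b) XOR c = (a NAND b) XNOR c: inverting one input of an XNOR cancels the inversion at its output. The exhaustive check below is a sketch of that identity only. The actual gate mapping of a NAND-XNOR embodiment is not detailed in this description, and the function names are hypothetical.

```python
from itertools import product


def and_xor(a, b, c):
    """AND-XOR form, as used in the GF cell description."""
    return (a & b) ^ c


def nand_xnor(a, b, c):
    """The same function built from a NAND followed by an XNOR."""
    nand = 1 - (a & b)
    return 1 - (nand ^ c)


def forms_agree():
    """Compare both forms over all eight input combinations."""
    return all(and_xor(a, b, c) == nand_xnor(a, b, c)
               for a, b, c in product((0, 1), repeat=3))
```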
[0074] The above-described GFU was implemented in RTL using Verilog HDL. The RTL description was synthesized using a standard cell library, with proper wire load models and timing constraints. The paths from inputs m and P(x) were treated as multicycle paths of two cycles. This is because the configuration registers are not meant to be changed on-the-fly along with the inputs, but are set before the MAC operations begin. A commercial CAD tool was used to place-and-route the MAC unit. The delay of the entire MAC unit was found to be about 1.5 ns under typical conditions (i.e., typical process, 1.2V power supply, 125°C temperature) in 0.13μm technology. The total area required was found to be about 0.05 mm². Using custom designs for the shifters and the array can further improve the speed and reduce the silicon area of the above-described MAC unit.
[0075] Although the above embodiments describe performing error detection and correction techniques with reference to CRC and RS algorithms, the present invention is not limited to such. Thus, other embodiments may employ other types of forward error correction algorithms. As one of average skill in the art will appreciate, other embodiments may be derived from the teachings of the above-described techniques without departing from the scope of the claims.
[0076] The above-described technique uses a sub-word parallel architecture to improve system performance when encoding/decoding using CRC and RS algorithms. This process uses a fast parallel CRC computation algorithm to enhance system performance. In addition, the above-described technique can be used to perform both error detection and data correction.
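The block-wise CRC computation referred to here, and recited in claims 10 to 12, expresses the CRC as a sum over blocks of i input bits, each multiplied by a power of the primitive element α. A minimal Python sketch under stated assumptions follows. The function names are hypothetical, α is taken to be the element x (integer 2), block j = 0 is taken to be the least significant block of the message, and the result is checked against a conventional bit-serial CRC with zero initial value and no bit reflection. It models the arithmetic only, not the GFU hardware.

```python
def gf_mul(a, b, p_low, m):
    """Shift-and-add multiply modulo the degree-m polynomial with low bits p_low."""
    mask = (1 << m) - 1
    v, w = a & mask, 0
    for j in range(m):
        if (b >> j) & 1:
            w ^= v
        top = (v >> (m - 1)) & 1
        v = (v << 1) & mask
        if top:
            v ^= p_low
    return w


def gf_pow_alpha(e, p_low, m):
    """Return alpha**e, where alpha is assumed to be the element x (integer 2)."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, 2, p_low, m)
    return result


def crc_blockwise(msg, n, i, p_low, m):
    """CRC of an n-bit message as the sum of W_j * alpha**(m + j*i)."""
    assert n % i == 0 and i <= m
    k = n // i
    coeff = gf_pow_alpha(m, p_low, m)      # alpha**m for block j = 0
    step = gf_pow_alpha(i, p_low, m)       # alpha**i between successive blocks
    crc = 0
    for j in range(k):
        w_j = (msg >> (j * i)) & ((1 << i) - 1)   # block j, lowest bits first
        crc ^= gf_mul(w_j, coeff, p_low, m)       # GF addition is XOR
        coeff = gf_mul(coeff, step, p_low, m)
    return crc


def crc_bitwise(msg, n, p_low, m):
    """Reference: remainder of A(x) * x**m, computed one bit at a time."""
    mask = (1 << m) - 1
    reg = 0
    for pos in range(n - 1, -1, -1):
        feedback = ((reg >> (m - 1)) & 1) ^ ((msg >> pos) & 1)
        reg = (reg << 1) & mask
        if feedback:
            reg ^= p_low
    return reg
```

For example, with m = 8 and `p_low = 0x1D` (x^8 + x^4 + x^3 + x^2 + 1, chosen here only for illustration), `crc_blockwise(msg, 32, 8, 0x1D, 8)` and `crc_blockwise(msg, 32, 4, 0x1D, 8)` both agree with `crc_bitwise(msg, 32, 0x1D, 8)`. Each block contributes one GF multiplication and one XOR, which is the multiply-accumulate pattern the GFU provides.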
[0077] The above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those skilled in the art. The scope of the invention should therefore be determined by the appended claims, along with the full scope of equivalents to which such claims are entitled.
[0078] It is to be understood that the above-description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reviewing the above-description. The scope of the subject matter should, therefore, be determined with reference to the following claims, along with the full scope of equivalents to which such claims are entitled.
[0079] As shown herein, the present invention can be implemented in a number of different embodiments, including various methods, a circuit, an I/O device, a system, and an article comprising a machine-accessible medium having associated instructions.
[0080] Other embodiments will be readily apparent to those of ordinary skill in the art. The elements, algorithms, and sequence of operations can all be varied to suit particular requirements.
[0081] FIGS. 1-4 are merely representational and are not drawn to scale. Certain portions thereof may be exaggerated, while others may be minimized. FIGS. 1-4 illustrate various embodiments of the invention that can be understood and appropriately carried out by those of ordinary skill in the art.
[0082] It is emphasized that the Abstract is provided to comply with 37 C.F.R. § 1.72(b) requiring an Abstract that will allow the reader to quickly ascertain the nature and gist of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims.
[0083] In the foregoing detailed description of embodiments of the invention, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the invention require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby incorporated into the detailed description of embodiments of the invention, with each claim standing on its own as a separate embodiment.
[0084] It is understood that the above description is intended to be illustrative, and not restrictive. It is intended to cover all alternatives, modifications and equivalents as may be included within the spirit and scope of the invention as defined in the appended claims. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. The scope of the invention should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. In the appended claims, the terms "including" and "in which" are used as the plain-English equivalents of the respective terms "comprising" and "wherein," respectively.

Claims

What is claimed is:
1. A Galois Field arithmetic unit (GFU) comprising: a pre-shift stage that receives first, second, and third operands (A), (B), and (P), respectively, wherein each operand has m bits, wherein the pre-shift stage to left justify the operands A and P to a nearest byte boundary and to divide the operand B and the pre-shifted operand P into sub-words and select sub-words based on a field size of m; a Galois field (GF) multiplier array coupled to the pre-shift stage to receive the sub-words associated with the operands A, B, and P and to perform a GF multiplication on a sub-word-parallel basis and output a GF multiplied value of (A x B); and a post-shift-add stage coupled to the GF multiplier array to receive a fourth operand (C), wherein the operand C has m bits, wherein the post-shift-add stage to divide the operand C into sub-words, wherein the post-shift-add stage to receive the GF multiplied value from the GF multiplier array and to right-justify the GF multiplied value, wherein the post-shift-add stage to add the right-justified GF multiplied value to the sub-words associated with the operand C to output a multiply-accumulate value of ((A x B) + C) in GF, which can be used to compute the CRC and perform a Reed-Solomon encoding/decoding.
2. The GFU of claim 1, wherein the m bits comprise bits in the range of about 8 to 40 bits.
3. The GFU of claim 1, wherein the GF multiplier array is responsive to the received sub-words associated with the A, B, and P operands, wherein the GF multiplier array has a plurality of outputs for providing the GF transformation value associated with the sub-words of each operand, wherein the GF multiplier array has a plurality of sub-cell matrices arranged in one or more rows and columns, wherein each sub-cell array comprises 8 GF cells arranged in a row, wherein each GF cell has a first and a second AND logic circuits each having first and second inputs and an output, a first and a second XOR logic circuits each having a first and a second inputs and an output, wherein the first input of the first AND logic circuit to couple to one of the m bits associated with the pre-shifted operand P and the second input of the first AND logic circuit to couple to one of the m bits associated with a new polynomial u, wherein the first input of the second AND logic circuit to couple to one of the m bits associated with a first intermediate polynomial v and the second input of the second AND logic circuit is to couple to one of the m bits associated with the operand B, wherein the first input of the first XOR logic circuit is connected to the output of the first AND logic circuit and the second input of the first XOR logic circuit is connected to couple to one of the m bits associated with v logically ANDed with another new polynomial l, wherein the first input of the second XOR logic circuit is connected to the output of the second AND logic circuit and the second input of the second XOR logic circuit is to couple to receive one of the m bits associated with a second intermediate polynomial w, and wherein each sub-cell array has a MUX and a sub-cell array AND logic circuit, wherein the MUX has one or more inputs and an output, wherein the sub-cell AND logic circuit has first and second inputs and an output, wherein the one or more inputs of the MUX is to couple to receive one or more of the m bits associated with v and the output of the MUX is connected to each second input of the first AND gate in the sub-cell array, wherein the first input of the sub-cell AND logic circuit is connected to the one or more inputs of the MUX and the second input of the sub-cell AND logic circuit is to couple to receive one of the m bits associated with l.
4. The GFU of claim 3, wherein each of the first and the second XOR logic circuits of a sub-cell array in the GF multiplier array has its output connected to the first input of the second AND logic circuit and the second input of the second XOR logic circuit, respectively, of a next successive sub-cell array except for the first and the second XOR logic circuits of a last sub-cell array which are connected to an output of the GF multiplier array.
5. The GFU of claim 4, wherein each of the first inputs of the first and second AND logic circuits of a sub-cell array is to couple to receive a bit in the operand P and of the first XOR logic circuit of a substantially previous sub-cell array in the GF multiplier array, respectively, except for the first and second AND logic circuits of a first sub-cell array which are connected to an input of the GF multiplier array.
6. A GF multiplier array that is responsive to the received sub-words associated with the A, B, and P operands, wherein the GF multiplier array comprising: a plurality of outputs for providing the GF transformation value associated with the sub-words of each operand, wherein the GF multiplier array has a plurality of sub-cell matrices arranged in one or more rows and columns, wherein each sub-cell array comprises 8 GF cells arranged in a row, wherein each GF cell has a first and a second AND logic circuits each having first and second inputs and an output, a first and a second XOR logic circuits each having a first and a second inputs and an output, wherein the first input of the first AND logic circuit to couple to one of the m bits associated with the pre-shifted operand P and the second input of the first AND logic circuit to couple to one of the m bits associated with a new polynomial u, wherein the first input of the second AND logic circuit to couple to one of the m bits associated with a first intermediate polynomial v and the second input of the second AND logic circuit is to couple to one of the m bits associated with the operand B, wherein the first input of the first XOR logic circuit is connected to the output of the first AND logic circuit and the second input of the first XOR logic circuit is connected to couple to one of the m bits associated with v logically ANDed with another new polynomial l, wherein the first input of the second XOR logic circuit is connected to the output of the second AND logic circuit and the second input of the second XOR logic circuit is to couple to receive one of the m bits associated with a second intermediate polynomial w, and wherein each sub-cell array has a MUX and a sub-cell array AND logic circuit, wherein the MUX has one or more inputs and an output, wherein the sub-cell AND logic circuit has first and second inputs and an output, wherein the one or more inputs of the MUX is to couple to receive one or more of the m bits associated with v and the output of the MUX is connected to each second input of the first AND gate in the sub-cell array, wherein the first input of the sub-cell AND logic circuit is connected to the one or more inputs of the MUX and the second input of the sub-cell AND logic circuit is to couple to receive one of the m bits associated with l.
7. The GF multiplier array of claim 6, wherein each of the first and the second XOR logic circuits of a sub-cell array in the GF multiplier array has its output connected to the first input of the second AND logic circuit and the second input of the second XOR logic circuit, respectively, of a next successive sub-cell array except for the first and the second XOR logic circuits of a last sub-cell array which are connected to an output of the GF multiplier array.
8. The GF multiplier array of claim 6, wherein each of the first inputs of the first and second AND logic circuits of a sub-cell array is to couple to receive a bit in the operand P and of the first XOR logic circuit of a substantially previous sub-cell array in the GF multiplier array, respectively, except for the first and second AND logic circuits of a first sub-cell array which are connected to an input of the GF multiplier array.
9. The GF multiplier array of claim 7, wherein the input operands to each sub-cell array in the GF multiplier array include state inputs fed back from the state conditions of the GF field linear outputs of the substantially previous sub-cell array.
10. A method of performing a cyclic redundancy check (CRC) comprising: receiving a message of length n bits; partitioning the n bits into one or more blocks, wherein each block has i input bits such that n=k*i and i is less than or equal to m, wherein m is the degree of the generator polynomial used to compute the CRC; and computing a CRC value for the received message of n bits using the one or more blocks.
11. The method of claim 10, wherein computing the CRC value comprises: initializing the value of k and an intermediate CRC value to a 0 value; multiplying a current block of i input bits with a GF coefficient to obtain a current multiplied CRC value, wherein the GF coefficient is a power of a primitive element α in a finite field of GF(2^m); adding the current multiplied CRC value to a previously obtained intermediate CRC value associated with a previous block of i input bits to obtain a new intermediate CRC value; incrementing the value of k by a predetermined value; determining whether the value of k = (n/i); if not, repeating the above steps of multiplying, adding, incrementing and determining; and if so, outputting the new intermediate CRC value as the CRC value.
12. The method of claim 11, wherein the CRC value is computed using the equation,
$$\mathrm{CRC}[A(x)] = \sum_{j=0}^{k-1} W_j(x)\,\alpha^{m+ji}$$
wherein A(x) is the input message of n bits, m is the degree of the generator polynomial, W_j(x) is a polynomial of degree (i - 1), expanded to degree (m - 1) with zeros in the most significant bits, that represents a block of i input bits of the input message A(x), α is the primitive element of the field GF(2^m) having the generator polynomial used in the CRC computation, and k = n/i.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IN2005/000150 WO2006120691A1 (en) 2005-05-06 2005-05-06 Galois field arithmetic unit for error detection and correction in processors

Publications (1)

Publication Number Publication Date
WO2006120691A1 true WO2006120691A1 (en) 2006-11-16

Family

ID=35445749

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2005/000150 WO2006120691A1 (en) 2005-05-06 2005-05-06 Galois field arithmetic unit for error detection and correction in processors

Country Status (1)

Country Link
WO (1) WO2006120691A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4251875A (en) * 1979-02-12 1981-02-17 Sperry Corporation Sequential Galois multiplication in GF(2n) with GF(2m) Galois multiplication gates
US5046037A (en) * 1988-03-17 1991-09-03 Thomson-Csf Multiplier-adder in the Galois fields, and its use in a digital signal processing processor
WO2003048921A1 (en) * 2001-11-30 2003-06-12 Analog Devices, Inc. Galois field multiply/multiply-add multiply accumulate

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11362678B2 (en) 2011-12-30 2022-06-14 Streamscale, Inc. Accelerated erasure coding system and method
US11500723B2 (en) 2011-12-30 2022-11-15 Streamscale, Inc. Using parity data for concurrent data authentication, correction, compression, and encryption
US11736125B2 (en) 2011-12-30 2023-08-22 Streamscale, Inc. Accelerated erasure coding system and method
CN103929208A (en) * 2014-03-27 2014-07-16 北京大学 Device for calculating adjoint polynomial in RS encoder
CN114063973A (en) * 2022-01-14 2022-02-18 苏州浪潮智能科技有限公司 Galois field multiplier and erasure coding and decoding system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

NENP Non-entry into the national phase

Ref country code: RU

WWW Wipo information: withdrawn in national office

Country of ref document: RU

122 Ep: pct application non-entry in european phase

Ref document number: 05747130

Country of ref document: EP

Kind code of ref document: A1