KR102447445B1

KR102447445B1 - Calculation device for efficient parallel processing of matrix operation and memory device including the same

Info

Publication number: KR102447445B1
Application number: KR1020210030067A
Authority: KR
Inventors: 공재섭
Original assignee: 공재섭
Priority date: 2021-03-08
Filing date: 2021-03-08
Publication date: 2022-09-26
Also published as: KR20220126018A

Abstract

A computing device includes one or more operator groups, each including a plurality of accumulators. Each of the plurality of accumulators includes a first input terminal for receiving input vector components and a second input terminal for receiving elements of a matrix, and performs an accumulation operation on the input vector components and the matrix elements to generate one of the output vector components. Each of the plurality of accumulators includes a first operator, a second operator, and an accumulation register. The first operator performs a first operation on the input vector components and the matrix elements to generate a result of the first operation. The second operator performs a second operation on the result of the first operation and an accumulation result to generate a result of the second operation. The accumulation register accumulates the result of the second operation to generate the accumulation result, and finally generates one of the output vector components. The first input terminals of the plurality of accumulators included in one operator group are connected in common.

Description

An arithmetic device for efficient parallel processing of matrix operations and a memory device including the same

The present invention relates to semiconductor integrated circuits, and more particularly to a computing device for efficient parallel processing of matrix operations and a memory device including the computing device.

FIG. 1A is a diagram illustrating an example of performing a matrix operation according to the prior art.

Referring to FIG. 1A, an n-dimensional output vector with components $y_1, y_2, \ldots, y_n$ is obtained by multiplying an $n \times m$ matrix (n and m each being a natural number equal to or greater than 2) with elements $a_{11}, a_{12}, \ldots, a_{1m}, a_{21}, a_{22}, \ldots, a_{2m}, \ldots, a_{n1}, a_{n2}, \ldots, a_{nm}$ by an m-dimensional input vector with components $x_1, x_2, \ldots, x_m$. Each component of the output vector is defined as the dot product of the input vector and the corresponding column vector or row vector of the matrix. The dot product of two vectors may be obtained by multiplying the corresponding components of the two vectors in order and then summing all the products. For example, $y_1 = a_{11} x_1 + a_{12} x_2 + \cdots + a_{1m} x_m$.
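The row-wise definition above can be sketched in a few lines of Python. This is only an illustration: the function name `matvec_by_rows` and the nested-list representation of the matrix are assumptions made for the example and do not come from the patent.

```python
def matvec_by_rows(a, x):
    """Compute y = A x, producing one output component per row dot product."""
    y = []
    for row in a:                      # one row of A per output component
        acc = 0
        for a_ij, x_j in zip(row, x):  # multiply matching components, then sum
            acc += a_ij * x_j
        y.append(acc)
    return y


A = [[1, 2, 3],
     [4, 5, 6]]
x = [1, 0, 2]
y = matvec_by_rows(A, x)  # y[0] = 1*1 + 2*0 + 3*2
```

Every output component rereads the whole input vector in this formulation, which is the access pattern the prior-art device of FIG. 1B inherits.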

FIG. 1B is a block diagram illustrating a computing device according to the prior art that performs the matrix operation of FIG. 1A.

Referring to FIG. 1B, the computing device 10 includes a plurality of multiplier-and-accumulators (MACs) 201, 202, ..., 20k (k being a natural number equal to or greater than 2). Each of the MACs 201 to 20k includes one of a plurality of multipliers 211, 212, ..., 21k and one of a plurality of accumulators 231, 232, ..., 23k.

Each of the multipliers 211 to 21k receives one of a plurality of first inputs IN11, IN12, ..., IN1k and one of a plurality of second inputs IN21, IN22, ..., IN2k, multiplies the two inputs, and outputs the product.

Each of the accumulators 231 to 23k is connected to one of the multipliers 211 to 21k and includes one of a plurality of adders 251, 252, ..., 25k and one of a plurality of accumulation registers 271, 272, ..., 27k. Each accumulator cumulatively adds the output of the corresponding multiplier and then generates one of a plurality of outputs OUT1, OUT2, ..., OUTk.

When the dimensions of the matrix and the vector grow, the dot product may be computed in parts using the plurality of MACs 201 to 20k for parallel processing, as shown in FIG. 1B, and the partial values may then be summed to obtain the full dot product. However, when the conventional computing device 10 of FIG. 1B includes k MACs 201 to 20k, it receives k matrix elements and k vector components as inputs, and to do so it receives the vector components redundantly. As a result, memory and bus bandwidth usage increases.
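The redundancy described above can be made concrete with a small counting model. The sketch below is an assumption-laden illustration, not the circuit of FIG. 1B: the names `dot_with_k_macs` and `matvec_prior_art` are hypothetical, and "one fetch per vector component per multiplication" is a simplified cost model chosen for the example.

```python
def dot_with_k_macs(row, x, k):
    """Split one dot product across k MACs and sum the partial results.

    Returns the dot product and the number of vector components fetched.
    """
    partial = [0] * k                  # one accumulation register per MAC
    vector_fetches = 0
    for start in range(0, len(row), k):
        for mac in range(k):
            j = start + mac
            if j < len(row):
                partial[mac] += row[j] * x[j]  # each MAC needs its own x[j]
                vector_fetches += 1
    return sum(partial), vector_fetches


def matvec_prior_art(a, x, k):
    """Compute y = A x row by row, counting vector-component fetches."""
    y, fetches = [], 0
    for row in a:
        value, count = dot_with_k_macs(row, x, k)
        y.append(value)
        fetches += count
    return y, fetches


A = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 1, 2, 2]
y, fetches = matvec_prior_art(A, x, k=2)
```

Under this model the vector components are fetched once per matrix element (n*m times in total), because every row repeats the fetches of the previous row.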

One object of the present invention is to provide a computing device that reduces memory and bus bandwidth usage when performing matrix operations.

Another object of the present invention is to provide a memory device that includes the computing device and thereby improves performance and reduces power consumption when performing matrix operations.

To achieve the above object, a computing device according to embodiments of the present invention includes one or more operator groups, each including a plurality of accumulators. Each of the plurality of accumulators includes a first input terminal for receiving input vector components and a second input terminal for receiving elements of a matrix, and performs an accumulation operation on the input vector components and the matrix elements to generate one of the output vector components. Each of the plurality of accumulators includes a first operator, a second operator, and an accumulation register. The first operator performs a first operation on the input vector components received from the first input terminal and the matrix elements received from the second input terminal to generate a result of the first operation. The second operator performs a second operation on the result of the first operation output from the first operator and an accumulation result provided from the accumulation register to generate a result of the second operation. The accumulation register accumulates the result of the second operation output from the second operator to generate the accumulation result, and finally generates one of the output vector components. The first input terminals of the plurality of accumulators included in one operator group are connected in common.
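As a behavioral illustration of the accumulator structure just described (first operator, second operator, accumulation register), a minimal Python sketch follows. The class name `Accumulator`, its `step` method, and the use of callables for the two operators are assumptions made for the example; the patent describes hardware, not this interface.

```python
class Accumulator:
    """Behavioral model: first operator -> second operator -> accumulation register."""

    def __init__(self, first_op, second_op, initial=0):
        self.first_op = first_op      # e.g. multiplication
        self.second_op = second_op    # e.g. addition
        self.register = initial       # accumulation register

    def step(self, vector_component, matrix_element):
        # First operation on the two inputs.
        first_result = self.first_op(vector_component, matrix_element)
        # Second operation combines it with the stored accumulation result.
        self.register = self.second_op(first_result, self.register)
        return self.register


# Configured as a MAC: multiply, then add to the register.
mac = Accumulator(lambda x, a: x * a, lambda p, acc: p + acc)
for x_j, a_1j in zip([1, 0, 2], [1, 2, 3]):
    mac.step(x_j, a_1j)
result = mac.register
```

Passing the operators as parameters mirrors the way later embodiments swap the multiplier or adder for cheaper circuits while keeping the same accumulate loop.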

In an embodiment, a plurality of the operator groups may correspond one-to-one to a plurality of input vectors to receive the input vector components, and may correspond one-to-one to a plurality of output vectors to generate the output vector components. The computing device may include a plurality of crossover operator groups, each including a plurality of accumulators that belong to different operator groups. The second input terminals of the plurality of accumulators included in one crossover operator group may be connected in common.
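One way to picture the two kinds of sharing in this embodiment is the following Python sketch. It is an interpretation under stated assumptions: the function name `matmul_grouped`, the indexing `acc[g][i]`, and the loop order are all hypothetical; the patent only states which input terminals are connected in common.

```python
def matmul_grouped(a, xs):
    """Multiply matrix a (n x m) by several input vectors at once.

    xs holds one input vector per operator group.
    acc[g][i] models the accumulation register of accumulator i in group g.
    """
    n, m = len(a), len(a[0])
    acc = [[0] * n for _ in xs]
    for j in range(m):
        column = [a[i][j] for i in range(n)]  # a_ij is shared by crossover group i
        for g, x in enumerate(xs):
            x_j = x[j]                        # x_j is shared by every accumulator in group g
            for i in range(n):
                acc[g][i] += column[i] * x_j
    return acc


A = [[1, 2],
     [3, 4]]
ys = matmul_grouped(A, [[1, 0], [0, 1], [1, 1]])
```

In this reading each matrix element is read once per step and reused by all groups, while each vector component is read once per group and reused by all accumulators in that group.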

In an embodiment, the first operator may be a multiplier, the first operation may be multiplication, the second operator may be an adder, the second operation may be addition, and each of the plurality of accumulators may be a multiplier-and-accumulator (MAC).

In an embodiment, each of the plurality of accumulators may further include at least one shadow register that temporarily stores the accumulation result generated by the accumulation register.

In an embodiment, each of the plurality of accumulators may further include a multiplexer that is connected to an auxiliary input terminal receiving an auxiliary input and to the first input terminal receiving the input vector components, and that selects, based on a selection signal, one of the auxiliary input and the input vector components and provides it to the first operator.
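The multiplexer in front of the first operator can be sketched as below. The function names `mux` and `mac_step`, and the polarity of the selection signal (1 selects the auxiliary input), are assumptions for illustration; the patent does not specify the encoding of the selection signal in this passage.

```python
def mux(select, auxiliary_input, vector_component):
    """Return the auxiliary input when select is 1, otherwise the vector component."""
    return auxiliary_input if select else vector_component


def mac_step(register, select, auxiliary_input, vector_component, matrix_element):
    """One MAC step whose first input is chosen by the multiplexer."""
    first_input = mux(select, auxiliary_input, vector_component)
    return register + first_input * matrix_element


shared = mac_step(10, 0, 99, 2, 3)    # uses the shared vector component 2
private = mac_step(10, 1, 99, 2, 3)   # uses the auxiliary input 99
```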

In an embodiment, a first accumulator among the plurality of accumulators may further include a multiplexer that is connected to an auxiliary input terminal receiving an auxiliary input and to the first input terminal receiving the input vector components, and that selects, based on a selection signal, one of the auxiliary input and the input vector components and provides it to the first operator. At least one accumulator disposed adjacent to the first accumulator among the plurality of accumulators may share the multiplexer with the first accumulator.

In an embodiment, one of the two inputs of each of the plurality of accumulators may be a binary value of 0 or 1, and the first accumulator may be implemented as a gating circuit that, depending on the binary value, either outputs 0 or passes the other input through unchanged.
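The gating idea can be illustrated in software: when one operand is 0 or 1, the product is either 0 or the other operand, so no multiplication is needed. The function names `gate` and `binary_dot` are hypothetical, and the sketch models behavior only, not the circuit.

```python
def gate(binary_input, other_input):
    """Stand in for multiplication by 0 or 1: output 0 or pass the other input through."""
    if binary_input not in (0, 1):
        raise ValueError("binary_input must be 0 or 1")
    return other_input if binary_input else 0


def binary_dot(bits, values):
    """Accumulate gated values; equivalent to a dot product with a 0/1 vector."""
    acc = 0
    for b, v in zip(bits, values):
        acc += gate(b, v)
    return acc


total = binary_dot([1, 0, 1, 1], [5, 7, -2, 4])
```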

In an embodiment, one of the two inputs of each of the plurality of accumulators may be a base-n integer value (n being a natural number equal to or greater than 2) and the other may be a power of n with an integer exponent, and the first accumulator may be implemented as a circuit that shifts the base-n integer input by the integer exponent.
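For the common case n = 2, multiplying an integer by a power of two is a left shift. The sketch below assumes n = 2 and a non-negative exponent; the function name `shift_multiply` is hypothetical, and the handling of negative exponents is deliberately left out because the passage does not describe it.

```python
def shift_multiply(value, exponent):
    """Multiply an integer by 2**exponent with a shift (assumes exponent >= 0)."""
    if exponent < 0:
        raise ValueError("this sketch only covers non-negative exponents")
    return value << exponent


product = shift_multiply(5, 3)  # same as 5 * 2**3
```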

In an embodiment, one of the two inputs of each of the plurality of accumulators may be a base-n floating-point value (n being a natural number equal to or greater than 2) and the other may be a power of n with an integer exponent, and the first accumulator may be implemented as a circuit that passes the mantissa of the base-n floating-point input through unchanged and adds the exponents of the two inputs.
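The same idea for floating-point values can be sketched with Python's standard `math.frexp` and `math.ldexp`, which split a float into mantissa and base-2 exponent and recombine them. The sketch assumes n = 2; the function name `scale_by_power_of_two` is hypothetical.

```python
import math


def scale_by_power_of_two(value, exponent):
    """Multiply value by 2**exponent by adding exponents; the mantissa is untouched."""
    mantissa, value_exponent = math.frexp(value)  # value == mantissa * 2**value_exponent
    return math.ldexp(mantissa, value_exponent + exponent)


scaled = scale_by_power_of_two(0.75, 3)  # 0.75 * 2**3
```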

In an embodiment, one of the two inputs of each of the plurality of accumulators may be fixed to 1 so that the first operator is bypassed, and the second operator may be implemented as a comparator that compares the value passed through the first operator with the value of the accumulation register and outputs the larger of the two.
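With the first operator bypassed and the second operator acting as a comparator, the accumulator computes a running maximum. A minimal sketch, assuming a register initialized to negative infinity (an assumption; the passage does not state the initial value) and using the hypothetical name `running_max`:

```python
def running_max(values, initial=float("-inf")):
    """Accumulate with a comparator in place of the adder."""
    register = initial                  # accumulation register
    for v in values:
        passed = v * 1                  # other input fixed to 1: value passes through
        register = passed if passed > register else register
    return register


largest = running_max([3, 9, -1, 4])
```

The same loop structure as a MAC is reused; only the second operation changes from addition to "keep the larger value".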

In an embodiment, the matrix may be a convolution matrix, and the input vector and the output vector may be obtained by converting an input matrix and an output matrix, respectively, into column vectors.
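To show how a convolution can be expressed as a matrix-vector product, the sketch below builds a convolution matrix for a small filter and applies it to a flattened input. Several choices are assumptions not taken from this passage: the filter slides without flipping (as is common in neural networks), there is no padding and the stride is 1, the input is flattened in row-major order, and the names `convolution_matrix` and `flatten` are hypothetical.

```python
def convolution_matrix(kernel, height, width):
    """Build a matrix C such that C @ flatten(image) equals the sliding-window result."""
    r, s = len(kernel), len(kernel[0])
    out_h, out_w = height - r + 1, width - s + 1
    rows = []
    for oy in range(out_h):
        for ox in range(out_w):
            row = [0] * (height * width)   # one row per output element
            for ky in range(r):
                for kx in range(s):
                    row[(oy + ky) * width + (ox + kx)] = kernel[ky][kx]
            rows.append(row)
    return rows


def flatten(matrix):
    """Convert a matrix to a vector in row-major order."""
    return [v for row in matrix for v in row]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
conv = convolution_matrix(kernel, 3, 3)
x = flatten(image)
y = [sum(c * v for c, v in zip(row, x)) for row in conv]
```

Once the convolution is in this form, it is an ordinary matrix-vector product and can be fed to the accumulators like any other matrix operation.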

In an embodiment, the matrix may be an extended convolution matrix corresponding to a plurality of filter matrices, and the output vector may be an extended output vector corresponding to the plurality of filter matrices.

In an embodiment, the matrix may be a convolution matrix or a packed convolution matrix, and the plurality of input vectors and the plurality of output vectors may be obtained by converting a plurality of input matrices and a plurality of output matrices, respectively, into vectors.

In an embodiment, the matrix may be a partitioned convolution matrix, and the plurality of input vectors and the plurality of output vectors may be obtained by converting a plurality of partitioned input matrices and a plurality of partitioned output matrices, respectively, into vectors.

To achieve the other object, a memory device according to embodiments of the present invention includes a memory cell array, a row decoder, a column decoder, a gating circuit, an input/output data driving circuit, an input/output buffer, and an operation circuit. The memory cell array includes a plurality of memory cells arranged in a plurality of rows and a plurality of columns, and stores elements of a matrix or matrix generation information for generating the elements of the matrix. The row decoder generates, based on a row address, a row selection signal for selecting a target row from among the plurality of rows. The column decoder generates, based on a column address, a column selection signal for selecting a target column from among the columns included in the target row. The gating circuit selects the target column based on the column selection signal. The input/output data driving circuit writes input data to the target column through the gating circuit, or outputs data stored in the target column as output data. The input/output buffer is connected to the input/output data driving circuit and stores input vector components and output vector components. The operation circuit is connected to the gating circuit and the input/output buffer and includes one or more operator groups, each including a plurality of accumulators. Using the plurality of accumulators, the operation circuit operates, in an input-vector-based scheme, on the input vector components provided from the input/output buffer and on either the matrix elements provided through the gating circuit or the matrix elements generated with reference to the matrix generation information provided through the gating circuit, thereby obtaining the output vector components, and provides the output vector components to the input/output buffer. Each of the plurality of accumulators includes a first input terminal for receiving the input vector components and a second input terminal for receiving the matrix elements, and performs an accumulation operation on the input vector components and the matrix elements to generate one of the output vector components. Each of the plurality of accumulators includes a first operator, a second operator, and an accumulation register. The first operator performs a first operation on the input vector components received from the first input terminal and the matrix elements received from the second input terminal to generate a result of the first operation. The second operator performs a second operation on the result of the first operation output from the first operator and an accumulation result provided from the accumulation register to generate a result of the second operation. The accumulation register accumulates the result of the second operation output from the second operator to generate the accumulation result, and finally generates one of the output vector components. The first input terminals of the plurality of accumulators included in one operator group are connected in common.

In an embodiment, the memory device may further include a line decoder that generates, based on the column address, a line selection signal for selecting a target line that includes the target column from among a plurality of lines, each of which is included in the target row and includes two or more columns. The gating circuit may include a first gating circuit that selects the target line based on the line selection signal and a second gating circuit that selects the target column based on the column selection signal. The input/output data driving circuit may write the input data to the target column through the first and second gating circuits, or may output data stored in the target column as the output data. The operation circuit may be connected to the first gating circuit.

In an embodiment, the column decoder may be a multi-column decoder that generates, based on the column address and column selection information, a multi-column selection signal for selecting a plurality of target columns at once from among the columns included in the target row. The gating circuit may select the plurality of target columns at once based on the multi-column selection signal. The input/output data driving circuit may write the input data to the plurality of target columns at once through the gating circuit, or may output data stored in the plurality of target columns at once as the output data. The column addresses corresponding to the plurality of target columns included in the target row need not be consecutive.

In the computing device and the memory device according to the embodiments of the present invention described above, when input vector components and matrix elements are received and accumulated, an input-vector-based scheme is applied and the first input terminals of a plurality of accumulators (for example, a plurality of MACs) are connected in common. This reduces the number of times the input vector components are supplied as inputs to the accumulators. Compared with the conventional computing device, memory and bus bandwidth usage therefore decreases, which is advantageous in terms of performance and power consumption.

FIG. 1A is a diagram illustrating an example of performing a matrix operation according to the prior art.
FIG. 1B is a block diagram illustrating a computing device according to the prior art that performs the matrix operation of FIG. 1A.
FIG. 2A is a diagram illustrating an example of performing a matrix operation according to embodiments of the present invention.
FIG. 2B is a block diagram illustrating a computing device according to embodiments of the present invention that performs the matrix operation of FIG. 2A.
FIG. 3A is a block diagram illustrating a computing device according to embodiments of the present invention.
FIG. 3B is a block diagram illustrating another example of an accumulator included in the computing device of FIG. 3A.
FIG. 3C is a block diagram illustrating an example in which the computing device of FIG. 3A is extended to process a plurality of input vectors in parallel, thereby shortening the operation time.
FIGS. 4A and 4B are block diagrams illustrating still other examples of an accumulator included in the computing device of FIG. 3A.
FIGS. 5A, 5B, 5C, 5D, and 5E are diagrams for explaining examples of performing a convolution of two matrices, and operations derived from it, using computing devices according to embodiments of the present invention.
FIGS. 6A, 6B, and 6C are diagrams for explaining another example of performing a convolution of two matrices using a computing device according to embodiments of the present invention.
FIGS. 7, 8, and 9 are block diagrams illustrating memory devices that include a computing device according to embodiments of the present invention.

For the embodiments of the present invention disclosed herein, specific structural and functional descriptions are given only for the purpose of describing those embodiments. The embodiments of the present invention may be implemented in various forms and should not be construed as being limited to the embodiments described herein.

The present invention may be modified in various ways and may take various forms, so specific embodiments are illustrated in the drawings and described in detail in the text. This is not intended to limit the present invention to the specific forms disclosed, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly, a second component may be referred to as a first component.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may be present. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no intervening components are present. Other expressions describing the relationship between components, such as "between" and "directly between" or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In the present application, terms such as "comprise" or "have" are intended to indicate that the described features, numbers, steps, operations, components, parts, or combinations thereof exist, and should be understood not to preclude the existence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in the present application.

In the present invention, embodiments relating to matrix operations are described with vectors expressed as column vectors for convenience, but the representation of vectors is not limited to column vectors. Given that a row vector is the transpose of a column vector, an embodiment of the corresponding matrix operation can be readily constructed and described even when vectors are expressed as row vectors.

Hereinafter, preferred embodiments of the present invention will be described in more detail with reference to the accompanying drawings. The same reference numerals are used for the same components in the drawings, and repeated descriptions of the same components are omitted.

FIG. 2A is a diagram illustrating an example of performing a matrix operation according to embodiments of the present invention.

Referring to FIG. 2A, similarly to what is described above with reference to FIG. 1A, an n-dimensional output vector with components $y_1, y_2, \ldots, y_n$ is obtained by multiplying an $n \times m$ matrix (n and m each being a natural number equal to or greater than 2) with elements $a_{11}, a_{12}, \ldots, a_{1m}, a_{21}, a_{22}, \ldots, a_{2m}, \ldots, a_{n1}, a_{n2}, \ldots, a_{nm}$ by an m-dimensional input vector with components $x_1, x_2, \ldots, x_m$. Unlike in FIG. 1A, however, in FIG. 2A the value that each individual component of the input vector contributes to each component of the output vector is calculated first, and these contributions are then accumulated to obtain each component of the output vector. When the input-vector-based scheme shown in FIG. 2A is applied, one of the input terminals of each of a plurality of MACs can be tied together in common, as described below with reference to FIG. 2B.
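The input-vector-based ordering can be sketched as follows. The function name `matvec_by_input_component` and the fetch counter are assumptions added for illustration; the counter uses the simplified model that one vector component is read once and then shared by all accumulators.

```python
def matvec_by_input_component(a, x):
    """Compute y = A x by accumulating each x_j's contribution to every y_i."""
    n = len(a)
    y = [0] * n                     # one accumulation register per output component
    vector_fetches = 0
    for j, x_j in enumerate(x):     # each input component is fetched once
        vector_fetches += 1
        for i in range(n):          # and contributes a_ij * x_j to every y_i
            y[i] += a[i][j] * x_j
    return y, vector_fetches


A = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 1, 2, 2]
y, fetches = matvec_by_input_component(A, x)
```

The result matches the row-wise dot-product formulation, but under this counting model the vector is read m times in total rather than once per matrix element.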

FIG. 2B is a block diagram illustrating a computing device according to embodiments of the present invention that performs the matrix operation of FIG. 2A.

Referring to FIG. 2B, the computing device 100 includes a plurality of multiplier-and-accumulators (MACs) 2001, 2002, ..., 200k (k being a natural number of 2 or more). Each of the MACs 2001 to 200k includes one of a plurality of multipliers 2101, 2102, ..., 210k and one of a plurality of accumulators 2301, 2302, ..., 230k.

Each of the multipliers 2101 to 210k receives the first input IN1 and one of a plurality of second inputs IN21, IN22, ..., IN2k, multiplies the two inputs, and outputs the product. For example, the first input IN1 may be the components (x_1, x_2, ..., x_m) of the input vector of FIG. 2A, and the second inputs IN21 to IN2k may be elements arranged in the same column among the elements (a_11, a_12, ..., a_1m, a_21, a_22, ..., a_2m, ..., a_n1, a_n2, ..., a_nm) of the matrix of FIG. 2A.

Each of the accumulators 2301 to 230k is connected to one of the multipliers 2101 to 210k and accumulates the output of that multiplier to generate one of a plurality of outputs OUT1, OUT2, ..., OUTk.

Each of the accumulators 2301 to 230k includes one of a plurality of adders 2501, 2502, ..., 250k and one of a plurality of accumulation registers 2701, 2702, ..., 270k. Each of the adders 2501 to 250k adds the output of one of the multipliers 2101 to 210k to the output of one of the accumulation registers 2701 to 270k and outputs the sum. Each of the accumulation registers 2701 to 270k stores the output of one of the adders 2501 to 250k and finally generates one of the outputs OUT1 to OUTk.

In one embodiment, the first input IN1 and the second input IN21 of the MAC 2001 sequentially receive the components (x_1, x_2, ..., x_m) of the input vector and the elements (a_11, a_12, ..., a_1m) of the matrix, respectively, and the output OUT1 of the MAC 2001 may correspond to the component y_1 of the output vector, that is, a_11*x_1 + a_12*x_2 + ... + a_1m*x_m. For example, in the first operation, x_1 and a_11 are applied to the first input IN1 and the second input IN21 of the MAC 2001, respectively, and multiplication, addition, and accumulation are performed on them so that a_11*x_1 is stored in the accumulation register 2701. Next, x_2 and a_12 are applied to the first input IN1 and the second input IN21, respectively, and the same multiplication, addition, and accumulation are performed so that a_11*x_1 + a_12*x_2 is stored in the accumulation register 2701. In this manner, a_11*x_1 + a_12*x_2 + ... + a_1m*x_m is finally stored in, and output from, the accumulation register 2701. Similarly, the output OUT2 of the MAC 2002 may correspond to the component y_2 of the output vector, that is, a_21*x_1 + a_22*x_2 + ... + a_2m*x_m.

When the input-vector-based scheme shown in FIG. 2A is applied and the MACs 2001 to 200k are used for parallel processing as shown in FIG. 2B, the input vector component applied to all of the MACs 2001 to 200k is the same, so one of the input terminals of the MACs 2001 to 200k can be tied together in common. For example, the first input terminals of the MACs 2001 to 200k, which receive the components (x_1, x_2, ..., x_m) of the input vector as the first input IN1, may be connected in common and used as a single input terminal.
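The behavior of MACs with a shared first input can be modeled in software roughly as follows. This is a minimal sketch for illustration only; the names `Mac` and `run_shared_input` are hypothetical and do not appear in the specification, and the hardware timing is reduced to a simple loop.

```python
class Mac:
    """Hypothetical software model of one multiplier-and-accumulator."""

    def __init__(self):
        self.acc = 0  # models the accumulation register, assumed cleared to 0

    def step(self, in1, in2):
        # multiplier output is added to the register value and stored back
        self.acc = self.acc + in1 * in2
        return self.acc


def run_shared_input(matrix, x):
    """Drive one MAC per matrix row; every MAC sees the same x[j] at step j."""
    macs = [Mac() for _ in matrix]
    for j, xj in enumerate(x):            # xj models the shared first input IN1
        for mac, row in zip(macs, matrix):
            mac.step(xj, row[j])          # row[j] models that MAC's second input
    return [mac.acc for mac in macs]


print(run_shared_input([[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```

At each step only one input vector component plus one matrix element per MAC has to be supplied, which is the point of tying the first input terminals together.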

In the computing device 100 according to embodiments of the present invention, connecting the first input terminals of the MACs 2001 to 200k in common as described above reduces the number of times an input vector component has to be applied to the inputs of the MACs 2001 to 200k. Specifically, when the computing device 100 of FIG. 2B includes k MACs 2001 to 200k, it may receive (k+1) inputs. Accordingly, compared with the conventional computing device 10 of FIG. 1B, memory and bus bandwidth usage is reduced, which is advantageous over the conventional scheme in terms of performance and power consumption.

FIG. 3A is a block diagram illustrating a computing device according to embodiments of the present invention.

Referring to FIG. 3A, the computing device 300 includes a plurality of accumulation operators 4001, 4002, ..., 400k (k being a natural number of 2 or more). Each of the accumulation operators 4001 to 400k includes one of a plurality of first operators 4101, 4102, ..., 410k and one of a plurality of accumulators 4301, 4302, ..., 430k.

Each of the first operators 4101 to 410k receives the first input IN1 and one of a plurality of second inputs IN21, IN22, ..., IN2k, performs a first operation on the two inputs, and outputs the result. For example, the first input IN1 may be the components (x_1, x_2, ..., x_m) of the input vector of FIG. 2A, and the second inputs IN21 to IN2k may be elements arranged in the same column among the elements (a_11, a_12, ..., a_1m, a_21, a_22, ..., a_2m, ..., a_n1, a_n2, ..., a_nm) of the matrix of FIG. 2A.

Each of the accumulators 4301 to 430k is connected to one of the first operators 4101 to 410k and accumulates the output of that first operator to generate one of a plurality of outputs OUT1, OUT2, ..., OUTk.

Each of the accumulators 4301 to 430k includes one of a plurality of second operators 4501, 4502, ..., 450k and one of a plurality of accumulation registers 4701, 4702, ..., 470k. Each of the second operators 4501 to 450k performs a second operation on the output of one of the first operators 4101 to 410k and the output of one of the accumulation registers 4701 to 470k, and outputs the result. Each of the accumulation registers 4701 to 470k stores the output of one of the second operators 4501 to 450k and finally generates one of the outputs OUT1 to OUTk.

When the input-vector-based scheme shown in FIG. 2A is applied and the accumulation operators 4001 to 400k are used for parallel processing as shown in FIG. 3A, the input vector component applied to all of the accumulation operators 4001 to 400k is the same, so one of the input terminals of the accumulation operators 4001 to 400k can be tied together in common. For example, the first input terminals of the accumulation operators 4001 to 400k, which receive the components (x_1, x_2, ..., x_m) of the input vector as the first input IN1, may be connected in common and used as a single input terminal.

In the computing device 300 according to embodiments of the present invention, connecting the first input terminals of the accumulation operators 4001 to 400k in common as described above reduces the number of times an input vector component has to be applied to the inputs of the accumulation operators 4001 to 400k. Specifically, when the computing device 300 of FIG. 3A includes k accumulation operators 4001 to 400k, it may receive (k+1) inputs. Accordingly, memory and bus bandwidth usage is reduced, which is advantageous in terms of performance and power consumption.

The computing device 300 of FIG. 3A represents a generalization of the computing device 100 of FIG. 2B in which the multipliers 2101 to 210k and the adders 2501 to 250k are generalized to the first operators 4101 to 410k and the second operators 4501 to 450k, respectively.

The accumulation operators 4001, 4002, ..., 400k of FIG. 3A may be implemented as digital circuits or analog circuits. For example, the accumulators 4301, 4302, ..., 430k may be implemented as analog integrator circuits.

In one embodiment, when each of the first operators 4101 to 410k of FIG. 3A is implemented as a multiplier and each of the second operators 4501 to 450k is implemented as an adder, the computing device 300 of FIG. 3A can perform an ordinary multiplication of a matrix by an input vector, as shown in FIG. 2B.
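The generalization from a MAC to a unit with a pluggable first and second operation can be sketched as below. The class name `AccumulationOperator`, the helper `run_group`, and the choice of initial register value are assumptions made for this illustration; the specification does not prescribe a software interface.

```python
import operator


class AccumulationOperator:
    """Hypothetical model: a first operator, a second operator, and a register."""

    def __init__(self, first_op, second_op, initial):
        self.first_op = first_op      # e.g. multiplication
        self.second_op = second_op    # e.g. addition
        self.reg = initial            # accumulation register (initial value assumed)

    def step(self, in1, in2):
        # second operation combines the first operation's result with the register
        self.reg = self.second_op(self.first_op(in1, in2), self.reg)
        return self.reg


def run_group(units, matrix, x):
    """Feed x[j] to every unit (shared first input) and row[j] to each unit."""
    for j, xj in enumerate(x):
        for unit, row in zip(units, matrix):
            unit.step(xj, row[j])
    return [unit.reg for unit in units]


# Multiplier + adder reproduces the ordinary matrix-vector product.
units = [AccumulationOperator(operator.mul, operator.add, 0) for _ in range(2)]
print(run_group(units, [[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```

Swapping `second_op` for another binary operation changes what the register ends up holding without changing how the inputs are distributed.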

In one embodiment, when one of the two inputs of each of the accumulation operators 4001 to 400k of FIG. 3A is a binary value of 0 or 1, each of the first operators 4101 to 410k of FIG. 3A may be implemented as a gating circuit that switches on or off according to the binary value, thereby performing the multiplication. For example, when one input (for example, the first input IN1) is 0, each of the first operators 4101 to 410k may output 0, and when that input is 1, each of the first operators 4101 to 410k may output the other input (for example, one of the second inputs IN21 to IN2k) unchanged.
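A gating first operator of this kind behaves like the following sketch. The function name `gate` is hypothetical, and rejecting non-binary values is an assumption added here for safety.

```python
def gate(bit, value):
    """Multiply by a 0/1 input by passing or blocking the other input."""
    if bit not in (0, 1):
        raise ValueError("gating input must be 0 or 1")
    # on: pass the other input through unchanged; off: output 0
    return value if bit == 1 else 0


# The gate agrees with ordinary multiplication for every binary input.
for b in (0, 1):
    for v in (-3, 0, 7):
        assert gate(b, v) == b * v
```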

In one embodiment, when one of the two inputs of each of the accumulation operators 4001 to 400k of FIG. 3A is a binary integer value (that is, an integer expressed in binary) and the other is a power of 2 with an integer exponent, each of the first operators 4101 to 410k of FIG. 3A may be implemented as a circuit that outputs the integer input shifted by the integer exponent, thereby performing the first operation (for example, multiplication). For example, when one input (for example, the first input IN1) is 10 (that is, binary 1010) and the other input (for example, one of the second inputs IN21 to IN2k) is 2^2, each of the first operators 4101 to 410k may output 101000, which is 1010 shifted by 2. In this case, if only the integer exponent is received as the input instead of the power of 2 itself, the data size becomes smaller, which reduces the required memory size and saves memory and bus bandwidth.
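The shift-based first operator can be illustrated with a short sketch. The function name `shift_multiply` is hypothetical, and the sketch assumes a non-negative exponent, since the text does not say how negative exponents would be handled.

```python
def shift_multiply(value, exponent):
    """Multiply an integer by 2**exponent using a left shift.

    Only the exponent is passed in, mirroring the case where the exponent
    is received instead of the full power-of-2 value.
    """
    if exponent < 0:
        raise ValueError("this sketch assumes a non-negative exponent")
    return value << exponent


result = shift_multiply(0b1010, 2)
print(bin(result))  # 0b101000
```

Binary 1010 is decimal 10, and shifting it left by 2 gives binary 101000, which is decimal 40 and equals 10 * 2^2.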

The above embodiment can be extended to the case where the input data is expressed in base n (n being a natural number of 2 or more). That is, when one of the two inputs of each of the accumulation operators 4001 to 400k of FIG. 3A is a base-n integer value (that is, an integer expressed in base n) and the other is a power of n with an integer exponent, each of the first operators 4101 to 410k of FIG. 3A may be implemented as a circuit that outputs the base-n integer input shifted by the integer exponent, thereby performing the first operation (for example, multiplication).

In one embodiment, when one of the two inputs of each of the accumulation operators 4001 to 400k of FIG. 3A is a binary floating-point value (that is, a floating-point value with base 2 whose exponent and mantissa are expressed in binary) and the other is a power of 2 with an integer exponent, each of the first operators 4101 to 410k of FIG. 3A may be implemented as a circuit that outputs a binary floating-point value in which the two exponents are added and the mantissa of the binary floating-point input is kept unchanged, thereby performing the first operation (for example, multiplication). In this case, if only the integer exponent is received as the input instead of the power of 2 itself, the data size becomes smaller, which reduces the required memory size and saves memory and bus bandwidth.
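The exponent-addition idea can be demonstrated in software with Python's standard `math.frexp` and `math.ldexp`, which split a float into mantissa and exponent and recombine them. This is only an illustration of the arithmetic, assuming IEEE-style binary floats and ignoring overflow, underflow, and rounding behavior that a real circuit would have to define; the function name `fp_times_pow2` is hypothetical.

```python
import math


def fp_times_pow2(value, exponent):
    """Multiply a binary float by 2**exponent by adding exponents only."""
    mantissa, exp = math.frexp(value)           # value == mantissa * 2**exp
    return math.ldexp(mantissa, exp + exponent)  # mantissa is left untouched


product = fp_times_pow2(0.75, 3)
print(product)  # 6.0

# The mantissa of the result is the same as the mantissa of the input.
assert math.frexp(product)[0] == math.frexp(0.75)[0]
```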

The above embodiment can be extended to the case where the input data is expressed in base n (n being a natural number of 2 or more). That is, when one of the two inputs of each of the accumulation operators 4001 to 400k of FIG. 3A is a base-n floating-point value (that is, a floating-point value with base n whose exponent and mantissa are expressed in base n) and the other is a power of n with an integer exponent, each of the first operators 4101 to 410k of FIG. 3A may be implemented as a circuit that outputs a base-n floating-point value in which the two exponents are added and the mantissa of the base-n floating-point input is kept unchanged, thereby performing the first operation (for example, multiplication).

In one embodiment, one of the two inputs of each of the first operators 4101 to 410k of FIG. 3A (for example, the first input IN1, which receives the components of the input vector) is fixed to 1 so that the first operators 4101 to 410k are bypassed, and each of the second operators 4501 to 450k is implemented as a comparator that compares the value bypassed by the corresponding first operator with the value of the accumulation register and outputs the larger of the two. In this case, each of the accumulation operators 4001 to 400k can perform a max pooling operation that outputs the maximum of the matrix elements supplied on the other of the two inputs (for example, one of the second inputs IN21 to IN2k). Because the comparator consists of a subtractor and a multiplexer, it can be implemented by adding a small amount of circuitry to an adder.
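The max pooling configuration can be sketched as follows. The function name `max_pool_accumulate` is hypothetical, and initializing the register to negative infinity is an assumption of this sketch; the text does not state the initial register value.

```python
def max_pool_accumulate(values):
    """Model one accumulation operator configured for max pooling."""
    reg = float("-inf")            # assumed initial value of the accumulation register
    for v in values:
        bypassed = 1 * v           # first input fixed to 1, so the first operator passes v
        diff = bypassed - reg      # subtractor part of the comparator
        reg = bypassed if diff > 0 else reg  # multiplexer picks the larger value
    return reg


print(max_pool_accumulate([3, -1, 7, 2]))  # 7
```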

FIG. 3B is a block diagram illustrating another example of an accumulation operator included in the computing device of FIG. 3A.

Referring to FIG. 3B, the accumulation operator 402 includes a first operator 412 and an accumulator 432, and the accumulator 432 includes a second operator 452, an accumulation register 472a, and at least one shadow register 472b.

Except that it further includes the shadow register 472b, the accumulation operator 402 of FIG. 3B may be substantially the same as the accumulation operators 4001 to 400k of FIG. 3A. The first operator 412, the accumulator 432, the second operator 452, the accumulation register 472a, the first input IN1, the second input IN2, and the output OUT of FIG. 3B may correspond to the first operators 4101 to 410k, the accumulators 4301 to 430k, the second operators 4501 to 450k, the accumulation registers 4701 to 470k, the first input IN1, the second inputs IN21 to IN2k, and the outputs OUT1 to OUTk of FIG. 3A, respectively.

The shadow register 472b may temporarily store the value of the accumulation register 472a (for example, the accumulation result). In addition, the value stored in the shadow register 472b may be written back into the accumulation register 472a. For convenience, only one shadow register 472b is shown in FIG. 3B, but the number of shadow registers 472b may vary depending on the embodiment.
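One way to picture the shadow register is as a save and restore slot for a partial accumulation. The sketch below is illustrative only: the class and method names are hypothetical, a single shadow register is modeled, and the usage scenario (pausing one accumulation to run another) is an assumed example rather than one stated in the text.

```python
class AccumulatorWithShadow:
    """Hypothetical model of a multiply-accumulate unit with one shadow register."""

    def __init__(self):
        self.reg = 0      # accumulation register
        self.shadow = 0   # shadow register

    def step(self, in1, in2):
        self.reg += in1 * in2

    def save(self):
        self.shadow = self.reg   # copy the accumulation result into the shadow register

    def restore(self):
        self.reg = self.shadow   # write the saved value back into the accumulation register

    def clear(self):
        self.reg = 0


acc = AccumulatorWithShadow()
acc.step(2, 3)       # partial result 6
acc.save()           # park it
acc.clear()
acc.step(1, 1)       # unrelated accumulation
acc.restore()        # resume the parked accumulation
acc.step(4, 5)
print(acc.reg)       # 26
```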

FIG. 3C is a block diagram illustrating an example in which the computing device 300 of FIG. 3A is extended to process a plurality of input vectors in parallel, thereby shortening the computation time.

Referring to FIG. 3C, the computing device 500 includes a plurality of accumulation operators 5011, 5021, ..., 50kq (k and q each being a natural number of 2 or more). Each accumulation operator receives one of a plurality of first inputs IN11, IN12, ..., IN1q and one of a plurality of second inputs IN21, IN22, ..., IN2k, generates one of a plurality of outputs OUT11, OUT21, ..., OUTkq, and belongs to one of the operator groups 5001, 5002, ..., 500q and to one of the cross operator groups 5010, 5020, ..., 50k0. The accumulation operators belonging to one operator group have their first input terminals connected together and therefore receive a first input in common. The accumulation operators belonging to one cross operator group belong to different operator groups and have their second input terminals connected together, and therefore receive a second input in common. Each of the operator groups corresponds to the computing device 300 of FIG. 3A.

When one matrix and a plurality of input vectors are given, the computing device 300 of FIG. 3A can be used repeatedly to perform the matrix operation on the matrix and each input vector one after another and obtain the corresponding output vectors sequentially. However, if several computing devices 300, or operator groups 5001, 5002, ..., 500q, are used as in FIG. 3C so that they correspond one-to-one to the input vectors and the matrix operations are performed in parallel, the time required for the matrix operation can be shortened. Because the same matrix is used in every one of these operations, the matrix elements applied to each operator group may be identical. In particular, the second inputs of the accumulation operators belonging to each cross operator group may be the same, so the second input terminals of the accumulation operators belonging to one cross operator group can be connected together to receive the second input in common.

If an input matrix is formed with each of the input vectors as a column vector, the above matrix operation is essentially the same as a matrix operation on the matrix and the input matrix. Accordingly, a matrix-by-matrix operation can be performed using the computing device 300 of FIG. 3A or the computing device 500 of FIG. 3C.
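The sharing pattern of operator groups and cross operator groups can be sketched in software as below. The function name `matmul_operator_groups` and the list-of-lists layout are assumptions for this illustration; the loops stand in for hardware units that would operate in parallel.

```python
def matmul_operator_groups(matrix, input_vectors):
    """Multiply one matrix by several input vectors, one operator group per vector.

    out[g][i] models the accumulation register of the unit that belongs to
    operator group g (shares input vector g) and cross operator group i
    (shares matrix row i).
    """
    n, m = len(matrix), len(matrix[0])
    out = [[0] * n for _ in input_vectors]
    for j in range(m):                            # one step per input component
        for g, x in enumerate(input_vectors):     # x[j] is shared inside group g
            for i in range(n):                    # matrix[i][j] is shared across groups
                out[g][i] += matrix[i][j] * x[j]
    return out


A = [[1, 2], [3, 4]]
X = [[5, 6], [1, 0]]          # two input vectors, i.e. two columns of an input matrix
print(matmul_operator_groups(A, X))  # [[17, 39], [1, 3]]
```

Each inner list of the result is one output vector, so stacking them as columns gives the product of the matrix and the input matrix.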

FIGS. 4A and 4B are block diagrams illustrating further examples of accumulation operators included in the computing device of FIG. 3A.

Referring to FIG. 4A, the accumulation operator 404 includes a first operator 414, an accumulator 434, and a multiplexer 494, and the accumulator 434 includes a second operator 454 and an accumulation register 474.

Except that it further includes the multiplexer 494, the accumulation operator 404 of FIG. 4A may be substantially the same as the accumulation operators 4001 to 400k of FIG. 3A. The first operator 414, the accumulator 434, the second operator 454, the accumulation register 474, the first input IN1, the second input IN2, and the output OUT of FIG. 4A may correspond to the first operators 4101 to 410k, the accumulators 4301 to 430k, the second operators 4501 to 450k, the accumulation registers 4701 to 470k, the first input IN1, the second inputs IN21 to IN2k, and the outputs OUT1 to OUTk of FIG. 3A, respectively.

The multiplexer 494 is connected to an auxiliary input terminal that receives an auxiliary input AIN1 and to the first input terminal that receives the first input IN1, and may select one of the auxiliary input AIN1 and the first input IN1 based on a selection signal SS1 and provide it to the first operator 414. The multiplexer 494 may also provide the selected one of the auxiliary input AIN1 and the first input IN1 as an auxiliary output AOUT1. The auxiliary output AOUT1 may be used as the auxiliary input of a multiplexer included in an adjacent accumulation operator.

In other words, the accumulation operator 404 of FIG. 4A adds the multiplexer 494 to the first input terminal of the first operator 414 so that the auxiliary input AIN1 can be selected, and outputs the output of the multiplexer 494 as the auxiliary output AOUT1. When a plurality of such accumulation operators 404 are used for parallel processing, each one can receive the auxiliary output of an adjacent accumulation operator as its auxiliary input through the auxiliary input terminal. The input terminals of as many accumulation operators as needed can therefore be tied together in common, which increases the freedom in allocating accumulation operators to the multiplication of a matrix by an input vector.
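The chaining of auxiliary outputs to auxiliary inputs can be sketched as follows. The function names `mux` and `chain_first_inputs` are hypothetical, and the polarity of the selection signal (1 selects the auxiliary input) is an assumption; the text does not define it.

```python
def mux(select, in1, aux_in):
    """Select the auxiliary input when select is 1, otherwise the unit's own IN1."""
    return aux_in if select else in1


def chain_first_inputs(own_inputs, selects, aux_in=None):
    """Return the operand each first operator actually receives.

    Unit i takes either its own first input or the auxiliary output of
    unit i-1, and forwards whatever it selected as its own auxiliary output.
    """
    operands = []
    prev_aux_out = aux_in
    for in1, sel in zip(own_inputs, selects):
        prev_aux_out = mux(sel, in1, prev_aux_out)
        operands.append(prev_aux_out)
    return operands


# Units 0-2 end up sharing operand 10; unit 3 starts a new group with 40.
print(chain_first_inputs([10, 20, 30, 40], [0, 1, 1, 0]))  # [10, 10, 10, 40]
```

Changing the selection signals regroups the units without rewiring, which is the allocation freedom the multiplexers provide.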

Referring to FIG. 4B, each of a plurality of accumulation operators 4041, 4042, ..., 404s (s being a natural number of 2 or more) includes one of a plurality of first operators 4141, 4142, ..., 414s and one of a plurality of accumulators 4341, 4342, ..., 434s. Each of the accumulators 4341 to 434s includes one of a plurality of second operators 4541, 4542, ..., 454s and one of a plurality of accumulation registers 4741, 4742, ..., 474s. The accumulation operator 4041 may further include a multiplexer 494.

Except that the accumulation operator 4041 further includes the multiplexer 494, the accumulation operators 4041 to 404s of FIG. 4B may be substantially the same as the accumulation operators 4001 to 400k of FIG. 3A. The first operators 4141 to 414s, the accumulators 4341 to 434s, the second operators 4541 to 454s, the accumulation registers 4741 to 474s, the first input IN1, the second inputs IN21, IN22, ..., IN2s, and the outputs OUT1, OUT2, ..., OUTs of FIG. 4B may correspond to the first operators 4101 to 410k, the accumulators 4301 to 430k, the second operators 4501 to 450k, the accumulation registers 4701 to 470k, the first input IN1, the second inputs IN21 to IN2k, and the outputs OUT1 to OUTk of FIG. 3A, respectively.

The multiplexer 494 of FIG. 4B may be substantially the same as the multiplexer 494 of FIG. 4A. The multiplexer 494 may select one of the auxiliary input AIN1 and the first input IN1 based on the selection signal SS1 and provide it as the auxiliary output AOUT1. The accumulation operators 4041 to 404s may share the multiplexer 494. By receiving the auxiliary output AOUT1 provided by the multiplexer 494, the first operators 4141 to 414s can selectively receive either the first input IN1 or the auxiliary input AIN1.

In other words, in FIG. 4B the multiplexer 494 is added to the first input terminal that the accumulation operators 4041 to 404s share (that is, to which they are connected in common), so that the auxiliary input AIN1 can be selected and the output of the multiplexer 494 can be output as the auxiliary output AOUT1. Compared with FIG. 4A, the freedom in allocating accumulation operators is lower, but the number of multiplexers is reduced, which may be advantageous in terms of area.

Although not shown, the (s+1)-th accumulator following the s-th accumulator 404s may include a multiplexer, similarly to the accumulator 4041, and the (t-s) accumulators from the (s+1)-th accumulator to the t-th accumulator (where t is a natural number greater than s) may share the multiplexer included in the (s+1)-th accumulator. Depending on the embodiment, t may or may not be equal to 2*s.

FIGS. 5A, 5B, 5C, 5D, and 5E are diagrams for describing examples of performing a convolution of two matrices, and operations derived from it, using computing devices according to embodiments of the present invention.

Referring to FIG. 5A, convolving a 7*8 input matrix X with a 3*3 filter matrix W, with no padding and a stride of 1, yields a 5*6 output matrix Y. The components of the output matrix Y are computed one after another during the convolution, and in the process each element of the input matrix X may be referenced multiple times.
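The sizes in this example can be checked with a short sketch. The output-size formula is the standard one for 2-D convolution and is not quoted from the patent; the function names are hypothetical.

```python
def conv_output_shape(in_rows, in_cols, k_rows, k_cols, padding=0, stride=1):
    """Output shape of a 2-D convolution (standard formula, assumed here)."""
    out_rows = (in_rows + 2 * padding - k_rows) // stride + 1
    out_cols = (in_cols + 2 * padding - k_cols) // stride + 1
    return out_rows, out_cols


def reference_counts(in_rows, in_cols, k_rows, k_cols):
    """How often each element of X is read with no padding and stride 1."""
    out_rows, out_cols = conv_output_shape(in_rows, in_cols, k_rows, k_cols)
    counts = [[0] * in_cols for _ in range(in_rows)]
    for r in range(out_rows):
        for c in range(out_cols):
            for dr in range(k_rows):
                for dc in range(k_cols):
                    counts[r + dr][c + dc] += 1
    return counts


shape = conv_output_shape(7, 8, 3, 3)      # the 7*8 input and 3*3 filter of FIG. 5A
counts = reference_counts(7, 8, 3, 3)      # corner elements are read once, interior ones up to 9 times
```

Under these assumptions an interior element of X is read nine times by a direct convolution. This repeated reading is what the input-vector-based method is meant to avoid.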

To express the convolution of two matrices as a multiplication of a matrix and a vector, the input matrix X and the output matrix Y may be reshaped into column vectors, as in FIG. 5B, to obtain an input vector x and an output vector y, respectively. Here, x_i and y_i are the input subvector and the output subvector whose components are the elements of the i-th row of the input matrix X and of the output matrix Y, respectively.

Each component of the output vector y to be obtained by the convolution is a linear combination of the components of the input vector x, so there is a linear transformation that maps the input vector x to the output vector y, and it can be expressed as the convolution matrix A of FIG. 5C. The output vector y is obtained by multiplying the convolution matrix A by the input vector x, and reshaping it back into a matrix gives the output matrix Y. The convolution of the input matrix X with the filter matrix W can therefore be replaced by the multiplication of the convolution matrix A and the input vector x, and performed using the computing device 300 of FIG. 3A. In this case, applying the input-vector-based method allows each component of the input vector x to be referenced only once.
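A small sketch can make the equivalence concrete. It assumes row-major flattening (rows of X stacked into x, as with the subvectors x_i) and the convolution without kernel flipping that is common in neural networks. The helper names and test data are hypothetical, and the dense matrix is built only for checking, not as a model of the hardware.

```python
def conv2d(X, W):
    """Direct 2-D convolution, no padding, stride 1 (no kernel flip; assumption)."""
    kr, kc = len(W), len(W[0])
    return [[sum(W[a][b] * X[r + a][c + b] for a in range(kr) for b in range(kc))
             for c in range(len(X[0]) - kc + 1)]
            for r in range(len(X) - kr + 1)]


def conv_matrix(W, in_rows, in_cols):
    """Dense convolution matrix A with A * vec(X) == vec(conv2d(X, W))."""
    kr, kc = len(W), len(W[0])
    out_rows, out_cols = in_rows - kr + 1, in_cols - kc + 1
    A = [[0] * (in_rows * in_cols) for _ in range(out_rows * out_cols)]
    for r in range(out_rows):
        for c in range(out_cols):
            for a in range(kr):
                for b in range(kc):
                    A[r * out_cols + c][(r + a) * in_cols + (c + b)] = W[a][b]
    return A


def matvec_input_based(A, x):
    """y = A x with the outer loop over x, so each component of x is visited once."""
    y = [0] * len(A)
    for j, xj in enumerate(x):
        for i in range(len(A)):
            y[i] += A[i][j] * xj
    return y


X = [[(3 * r + 5 * c) % 7 for c in range(8)] for r in range(7)]  # arbitrary 7*8 input
W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]                            # arbitrary 3*3 filter
A = conv_matrix(W, 7, 8)                                         # 30*56
y = matvec_input_based(A, [v for row in X for v in row])
Y = [y[i * 6:(i + 1) * 6] for i in range(5)]                     # reshape back to 5*6
```

In this sketch only 270 of the 1680 entries of A are nonzero.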

If the input vector and the output vector are instead obtained by reshaping the input matrix and the output matrix into row vectors, the convolution matrix can be expressed as the transpose of the matrix A.

Repeatedly convolving a plurality of input matrices of the same size with one filter matrix can be replaced, as described above, by repeatedly multiplying a plurality of input vectors by one convolution matrix, and the plurality of output matrices or output vectors can be obtained in parallel using the computing device 500 of FIG. 3C.

Referring to FIG. 5D, the output matrix Y may be divided vertically into a plurality of divided output matrices Y₁ and Y₂, and a plurality of divided input matrices X₁ and X₂ may be formed from the input matrix X and each convolved with the filter matrix W to obtain Y₁ and Y₂. Because overlapping regions arise when the input matrix is divided, the total number of elements of the divided input matrices X₁ and X₂ may exceed the number of elements of the input matrix X. On the other hand, the divided convolution matrix A_d that is applied when the input matrix X is processed as X₁ and X₂ in this manner may have fewer than half as many elements as the convolution matrix A.
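The element counts can be illustrated with the 7*8 example. The sketch assumes that "divided vertically" means a vertical cut into left and right halves of equal width; FIG. 5D is not reproduced here, so that reading and all names below are assumptions.

```python
def conv2d(X, W):
    """Direct 2-D convolution, no padding, stride 1 (no kernel flip; assumption)."""
    kr, kc = len(W), len(W[0])
    return [[sum(W[a][b] * X[r + a][c + b] for a in range(kr) for b in range(kc))
             for c in range(len(X[0]) - kc + 1)]
            for r in range(len(X) - kr + 1)]


def split_input_by_columns(X, W, parts):
    """Cut X into overlapping column blocks, one per equal-width block of Y.
    Assumes the output width is divisible by `parts`."""
    kc = len(W[0])
    out_cols = len(X[0]) - kc + 1
    width = out_cols // parts
    return [[row[p * width: p * width + width + kc - 1] for row in X]
            for p in range(parts)]


X = [[(3 * r + 5 * c) % 7 for c in range(8)] for r in range(7)]
W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
Y = conv2d(X, W)
X1, X2 = split_input_by_columns(X, W, 2)   # each 7*5; columns 3 and 4 of X appear in both
Y1, Y2 = conv2d(X1, W), conv2d(X2, W)      # each 5*3
elems_split_inputs = sum(len(r) for r in X1) + sum(len(r) for r in X2)
elems_A = (5 * 6) * (7 * 8)                # full convolution matrix: 30*56
elems_A_d = (5 * 3) * (7 * 5)              # divided convolution matrix: 15*35
```

With this split, the two divided inputs hold 70 elements against 56 in X, while the divided convolution matrix holds 525 elements against 1680 in A.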

Likewise, one output matrix may be divided into blocks of equal size to form a plurality of divided output matrices and the corresponding plurality of divided input matrices, and these may be converted into vectors to form a plurality of divided output vectors and the corresponding plurality of divided input vectors. Because the divided input matrices have the same size, the divided convolution matrix by which the divided input vectors are multiplied may be identical for all of them, and the divided output vectors can therefore be obtained in parallel using the computing device 500 of FIG. 3C.

Dividing the input matrix into the plurality of divided input matrices produces mutually overlapping regions, and a drawback is that the computing device 500 of FIG. 3C receives the elements in these regions more than once. In return, the computing device 500 of FIG. 3C receives fewer elements of the divided convolution matrix, and the plurality of accumulators belonging to one cross-operator group of FIG. 3C receive the elements of the divided convolution matrix in common.

Referring to FIG. 5E, when a filter matrix V other than the filter matrix W is given and is to be convolved with the input matrix X, a convolution matrix A_v corresponding to the filter matrix V may be appended to the convolution matrix A to form an extended convolution matrix A_e, which is multiplied by the input vector x to obtain an extended output vector y_e.
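As a rough illustration, the sketch below stacks the rows of A_v beneath those of A. That stacking order is an assumption about how the matrices are combined in FIG. 5E, and the helper names and the second filter V are hypothetical.

```python
def conv2d(X, W):
    """Direct 2-D convolution, no padding, stride 1 (no kernel flip; assumption)."""
    kr, kc = len(W), len(W[0])
    return [[sum(W[a][b] * X[r + a][c + b] for a in range(kr) for b in range(kc))
             for c in range(len(X[0]) - kc + 1)]
            for r in range(len(X) - kr + 1)]


def conv_matrix(W, in_rows, in_cols):
    """Dense convolution matrix for row-major flattened input and output."""
    kr, kc = len(W), len(W[0])
    out_rows, out_cols = in_rows - kr + 1, in_cols - kc + 1
    A = [[0] * (in_rows * in_cols) for _ in range(out_rows * out_cols)]
    for r in range(out_rows):
        for c in range(out_cols):
            for a in range(kr):
                for b in range(kc):
                    A[r * out_cols + c][(r + a) * in_cols + (c + b)] = W[a][b]
    return A


X = [[(3 * r + 5 * c) % 7 for c in range(8)] for r in range(7)]
W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
V = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]            # a second, arbitrary filter
A_e = conv_matrix(W, 7, 8) + conv_matrix(V, 7, 8)   # rows of A_v below rows of A: 60*56
x = [v for row in X for v in row]
y_e = [0] * len(A_e)
for j, xj in enumerate(x):                          # each input component is read once
    for i in range(len(A_e)):
        y_e[i] += A_e[i][j] * xj
```

The first 30 components of y_e are the flattened result for W and the last 30 are the flattened result for V, so both filters are served by one pass over x.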

Likewise, when a plurality of filter matrices are each convolved with one input matrix, an extended convolution matrix may be formed from the individual convolution matrices, and the extended output vector, or the plurality of output vectors corresponding to the plurality of filter matrices, can be obtained in parallel using the computing device 300 of FIG. 3A. An advantage in this case is that the computing device 300 of FIG. 3A receives the elements of the input matrix only once.

FIGS. 6A, 6B, and 6C are diagrams for describing another example of performing a convolution of two matrices using a computing device according to embodiments of the present invention.

The convolution matrix A of FIG. 5C is a sparse matrix with few nonzero elements, so storing it in memory or multiplying it by a vector may waste considerable time and resources. Its structure shows that the submatrices that are not zero matrices are arranged regularly near the main diagonal, and that these submatrices are band matrices. These structural properties of the convolution matrix A are preserved even when the size of the filter matrix, the padding, or the stride changes. FIGS. 6A, 6B, and 6C use them to illustrate methods of performing the convolution efficiently by packing the convolution matrix and the output vector.

Referring to FIG. 6A, a packed convolution matrix A_p can be obtained by resetting the accumulation registers once the accumulators have finished computing one output subvector and then assigning the same accumulators to the computation of the next output subvector. Here, packed output subvector notation such as y₁|y₄ means that the output subvectors y₁ and y₄ are obtained by sharing the same accumulators through time division. That is, the output subvector y₁ is obtained first, and the output subvector y₄ is then obtained using the same accumulators.
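The time-division idea can be sketched in software for the 7*8 example. The sketch assumes that an output row depends on only three consecutive input rows, so that three banks of accumulators suffice for five output rows. The bank assignment by row index modulo the filter height and the function names are illustrative choices, not the layout of FIG. 6A.

```python
def conv2d(X, W):
    """Direct 2-D convolution, no padding, stride 1 (reference for checking)."""
    kr, kc = len(W), len(W[0])
    return [[sum(W[a][b] * X[r + a][c + b] for a in range(kr) for b in range(kc))
             for c in range(len(X[0]) - kc + 1)]
            for r in range(len(X) - kr + 1)]


def packed_row_stream_conv(X, W):
    """Stream the rows of X once, reusing k_rows banks of accumulators."""
    k_rows, k_cols = len(W), len(W[0])
    out_rows = len(X) - k_rows + 1
    out_cols = len(X[0]) - k_cols + 1
    banks = [[0] * out_cols for _ in range(k_rows)]   # 3*6 = 18 accumulators, not 30
    Y = [None] * out_rows
    for i, x_row in enumerate(X):
        # output rows r that use input row i satisfy r <= i <= r + k_rows - 1
        for r in range(max(0, i - k_rows + 1), min(out_rows, i + 1)):
            bank = banks[r % k_rows]                  # y_1 and y_4 land in the same bank
            dr = i - r
            for c in range(out_cols):
                for dc in range(k_cols):
                    bank[c] += W[dr][dc] * x_row[c + dc]
            if dr == k_rows - 1:                      # output subvector complete
                Y[r] = bank[:]
                for c in range(out_cols):
                    bank[c] = 0                       # reset the accumulation registers
    return Y


X = [[(3 * r + 5 * c) % 7 for c in range(8)] for r in range(7)]
W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
Y_packed = packed_row_stream_conv(X, W)
```

The first bank produces y₁ and, after its reset, y₄, which corresponds to the y₁|y₄ pairing described above.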

Referring to FIG. 6B, a packed convolution matrix A_p can be obtained by temporarily saving the accumulation register value once an accumulator has finished the computation of an output vector component for one row of the input matrix and then assigning that accumulator to the computation of the next output vector component. Before the computation of the next output vector component begins, the accumulation register may be reset or loaded with a previously saved value. Here, packed output vector component notation such as y_i1|y_i4 means that the components y_i1 and y_i4 of the output vector are obtained by sharing the same accumulator through time division. That is, for one row of the input matrix, the computation for the output vector component y_i1 is performed first, and the computation for the output vector component y_i4 is then performed using the same accumulator.

Referring to FIG. 6C, a packed convolution matrix A_p can be obtained by using both the convolution matrix packing method described with reference to FIG. 6A and the one described with reference to FIG. 6B.

The convolution matrix and the packed convolution matrices have a clear structure, so each column can easily be constructed from the elements of the filter matrix; in particular, the previous or the next column can be obtained by suitably shifting a given column. A circuit that generates each column of the convolution matrix or the packed convolution matrix from the elements of the filter matrix can therefore be implemented efficiently.
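A software analogue of such a column generator might look as follows. It covers only the unpacked convolution matrix, assumes row-major flattening with no padding and a stride of 1, and uses hypothetical names. The shift relation is checked for interior columns only, since columns near the border of the input are truncated versions of the same pattern.

```python
def conv_matrix_column(W, in_cols, out_rows, out_cols, j):
    """Dense column j of the convolution matrix, built from W alone."""
    p, q = divmod(j, in_cols)              # input element (p, q) that column j multiplies
    col = [0] * (out_rows * out_cols)
    for dr, w_row in enumerate(W):
        for dc, w in enumerate(w_row):
            r, c = p - dr, q - dc          # output element that uses X[p][q] with W[dr][dc]
            if 0 <= r < out_rows and 0 <= c < out_cols:
                col[r * out_cols + c] = w
    return col


def shift_down(col, n):
    """Shift a column down by n positions, filling with zeros at the top."""
    return [0] * n + col[:len(col) - n]


W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
col_a = conv_matrix_column(W, 8, 5, 6, 2 * 8 + 2)   # column for input element (2, 2)
col_b = conv_matrix_column(W, 8, 5, 6, 2 * 8 + 3)   # next element in the same input row
col_c = conv_matrix_column(W, 8, 5, 6, 3 * 8 + 2)   # same position, next input row
```

For these interior columns, moving one element to the right in X shifts the column down by one position, and moving one row down in X shifts it down by the output width of 6.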

FIGS. 7, 8, and 9 are block diagrams illustrating memory devices that include a computing device according to embodiments of the present invention in order to perform matrix operations efficiently.

Referring to FIG. 7, the memory device 1100 includes a memory cell array 1110, a row decoder 1120, a column decoder 1130, a gating circuit 1140, an input/output data driving circuit 1150, an operation circuit 1000, and an input/output buffer 1010.

The memory cell array 1110 includes a plurality of memory cells arranged to form a plurality of rows and a plurality of columns. For example, one row of the memory cell array 1110 constitutes one page. The memory cell array 1110 stores the elements of the matrix to be operated on, or matrix generation information for generating those elements. The memory cell array 1110 may also store the components of the input vector to be operated on and the components of the output vector obtained as the result, as well as temporary values of output vector components obtained as intermediate results. When the elements of the matrix are stored, the elements that are operated on with one component of the input vector are placed in the same row of memory cells as far as possible, so that the layout suits the input-vector-based method; the matrix may be transposed before storage if this requires it.
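The storage rule can be illustrated with a toy model in which each memory row (page) holds one column of the matrix, that is, the matrix is stored transposed. The page model, the function names, and the use of multiplication and addition are assumptions for illustration only.

```python
def to_pages(A):
    """Store A transposed: page j holds column j of A, i.e. every element
    that is multiplied by the input component x[j]."""
    return [list(col) for col in zip(*A)]


def matvec_from_pages(pages, x):
    """Input-vector-based product: one page read per input component."""
    y = [0] * len(pages[0])
    page_reads = 0
    for page, xj in zip(pages, x):
        page_reads += 1                 # a single row activation supplies all operands
        for i, a in enumerate(page):
            y[i] += a * xj              # accumulator i adds its contribution
    return y, page_reads


A = [[1, 2], [3, 4], [5, 6]]
result = matvec_from_pages(to_pages(A), [10, 1])
```

In this model the number of page reads equals the number of input components, regardless of how many output components are computed.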

Based on a row address RADDR, the row decoder 1120 generates a row selection signal RSEL for selecting, from among the plurality of rows, a target row that includes a target memory cell.

Based on a column address CADDR, the column decoder 1130 generates a column selection signal CSEL for selecting, from among the columns included in the target row, a target column that includes the target memory cell.

The gating circuit 1140 is connected to the target row based on the row selection signal RSEL. The gating circuit 1140 also selects the target column included in the target row based on the column selection signal CSEL and connects it to the input/output data driving circuit 1150 or the operation circuit 1000. For example, the gating circuit 1140 may further include a sense amplifier for sensing the memory cells and processing signals reliably.

The input/output data driving circuit 1150 writes input data DIN to the target column or outputs the data stored in the target column as output data DOUT. A data mask signal DMS is used to prevent unwanted input data from being written to the target column. For example, the input/output data driving circuit 1150 may include an input data driving circuit and an output data driving circuit.

The target column may be connected to the input/output data driving circuit 1150 through the gating circuit 1140 based on the column selection signal CSEL. In a data write operation, the input data DIN received by the input data driving circuit may be provided to the gating circuit 1140 in units of column data CDIN and stored in the target column. The data mask signal DMS can be used to prevent unwanted data from being stored in the memory cell array 1110. In a data read operation, the page data (or row data) RD stored in the target row is provided to the gating circuit 1140, and the data of the target column included in the target row is provided to the output data driving circuit in units of column data CDOUT and may be output as the output data DOUT.

The input/output buffer 1010 is connected to the input/output data driving circuit 1150 and stores the components of the input vector and the output vector. For example, the input/output buffer 1010 may include an input buffer that stores the components of the input vector and an output buffer that stores the components of the output vector. Components of an input vector received from outside the memory device may be written to the input/output buffer 1010 through the input/output data driving circuit 1150, and components of an input vector stored in the memory cell array 1110 may be written to the input/output buffer 1010 through the gating circuit 1140 and the input/output data driving circuit 1150. The output vector components stored in the input/output buffer 1010 may be read out of the memory device through the input/output data driving circuit 1150, or written to the memory cell array 1110 through the input/output data driving circuit 1150 and the gating circuit 1140.

The operation circuit 1000 is connected to the gating circuit 1140 and the input/output buffer 1010 and includes a plurality of accumulators. It may also include a matrix element generator that receives, through the gating circuit 1140, matrix generation information ASINF for generating the elements of a matrix and generates the matrix elements AS. For example, the matrix element generator may receive, as the matrix generation information ASINF, the elements of a filter matrix used in a matrix convolution and generate the elements of a convolution matrix or a packed convolution matrix. The accumulators operate, in the input-vector-based method, on the input vector components XS provided from the input/output buffer 1010 and the matrix elements AS provided through the gating circuit 1140 or the matrix element generator, obtain the output vector components YS, and provide them to the input/output buffer 1010. If the data format of the input and output vector components stored in the input/output buffer 1010, or of the matrix elements provided through the gating circuit 1140, differs from the input/output data format of the accumulators, the necessary type conversion may be performed.

Each of the plurality of accumulators includes a first input terminal that receives the components of the input vector and a second input terminal that receives the elements of the matrix, and performs an accumulation operation on the components of the input vector and the elements of the matrix to generate one of the components of the output vector. Some or all of the plurality of accumulators may form an operator group in which the first input terminals are connected together so that the components of the input vector are received in common, and each of the accumulators and the operator group may be implemented as described above with reference to FIGS. 2A, 2B, 3A, 3B, 4A, and 4B. In addition, there may be a plurality of operator groups, each of which may receive the components of a different input vector, and accumulators belonging to different operator groups may form a cross-operator group in which the second input terminals are connected together so that the elements of the matrix are received in common; this may be implemented as described above with reference to FIG. 3C.
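The two kinds of sharing can be sketched together in a few lines. In this sketch each operator group handles one input vector and accumulators at the same position across groups form a cross-operator group. The function name, the data layout, and the multiply-and-add operations are assumptions and do not reproduce FIG. 3C.

```python
def cross_group_matvec(A, xs):
    """Multiply one matrix A by several input vectors at once.

    xs -- list of input vectors, one per operator group
    Returns acc, where acc[g][i] is the value held by accumulator i of group g.
    """
    n_out = len(A)
    acc = [[0] * n_out for _ in xs]
    for j in range(len(A[0])):
        column = [A[i][j] for i in range(n_out)]   # fetched once, shared across groups
        for g, x in enumerate(xs):
            xj = x[j]                              # shared by every accumulator in group g
            for i in range(n_out):
                acc[g][i] += column[i] * xj
    return acc


outputs = cross_group_matvec([[1, 2], [3, 4]], [[1, 0], [0, 1], [1, 1]])
```

Each matrix element and each input component is fetched once per step in this model, which is the duplication-avoiding property the text attributes to the common connections.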

The accumulation registers or shadow registers of the plurality of accumulators may also serve as the output buffer that stores the output vector components YS. In that case, the output buffer of the input/output buffer 1010 may be omitted.

As described above, connecting the first input terminals or the second input terminals of the plurality of accumulators in common avoids applying the components of the input vector or the elements of the matrix more than once, which reduces memory and bus bandwidth usage and is advantageous in terms of performance and power consumption. Furthermore, the operation circuit 1000 receives the matrix elements AS stored in the memory cell array 1110 through the gating circuit 1140, receives the input vector components XS stored in the memory cell array 1110 through the input/output buffer 1010, the input/output data driving circuit 1150, and the gating circuit 1140, and writes the output vector components YS to the memory cell array 1110 along the same path. The paths of the data involved in the matrix operation can thus be confined to the inside of the memory device, which likewise reduces memory and bus bandwidth usage and is advantageous in terms of performance and power consumption.

For implementation reasons, the matrix elements AS or the matrix generation information ASINF that would be provided to the operation circuit 1000 through the gating circuit 1140 may instead be provided to the operation circuit 1000 by way of the input/output data driving circuit 1150.

For implementation reasons, the matrix generation information may be stored in a separate memory device, in which case it may be provided to the operation circuit 1000 by way of the input/output data driving circuit 1150.

For implementation reasons, the operation circuit 1000 and the input/output buffer 1010 may be implemented on a separate chip. In that case, suitable input/output data driving circuits may be inserted between the gating circuit 1140 and the operation circuit 1000 and between the input/output data driving circuit 1150 and the input/output buffer 1010 to connect the two chips.

In an artificial neural network circuit, the outputs of a layer are obtained by applying an activation function after a matrix operation that multiplies the inputs by weights. To perform this efficiently, an activation function circuit may be inserted between the operation circuit 1000 and the input/output buffer 1010.

Although not shown, the row address RADDR, the column address CADDR, and the data mask signal DMS may be provided from an external memory controller. The memory device 1100 and the memory controller may constitute a memory system.

In summary, the memory device 1100 of FIG. 7 can perform matrix operations efficiently by incorporating the accumulators according to embodiments of the present invention. The elements of the matrix are stored in the memory cell array 1110, and this data is transferred to the operation circuit 1000 through the gating circuit 1140 (and the sense amplifier). The components of the input vector are stored in the input buffer and then transferred to the operation circuit 1000. The operation circuit 1000 includes a plurality of accumulators whose first input terminals are tied together. Each component of the input vector received from the input buffer is applied to this common terminal of the accumulators, and the elements of the matrix are applied to the second input terminals. For each component of the input vector, the value it contributes to each component of the output vector is computed; these values are accumulated to obtain each component of the output vector, which is then stored in the output buffer.

Referring to FIG. 8, the memory device 1200 includes a memory cell array 1210, a row decoder 1220, a line decoder 1230, a column decoder 1240, a first gating circuit 1250, a second gating circuit 1260, an input/output data driving circuit 1270, an operation circuit 1000, and an input/output buffer 1010.

The memory device 1200 of FIG. 8 may have a structure similar to that of the memory device 1100 of FIG. 7, except that the gating circuit is formed in two stages and a line decoder 1230 is accordingly added. When the gating circuit is configured in two stages as shown in FIG. 8, the plurality of columns selected by the first gating circuit 1250 is defined as one line.

The memory cell array 1210 and the row decoder 1220 may be substantially the same as the memory cell array 1110 and the row decoder 1120 of FIG. 7, respectively.

Based on the column address CADDR, the line decoder 1230 generates a line selection signal LSEL for selecting, from among a plurality of lines included in the target row, a target line that includes the target memory cell. Each of the plurality of lines may be defined so as to be included in one row and to include two or more columns.

Based on the column address CADDR, the column decoder 1240 generates a column selection signal CSEL for selecting a target column that is included in the target row and in the target line and that includes the target memory cell.

The first gating circuit 1250 is connected to the target row based on the row selection signal RSEL and selects the target line based on the line selection signal LSEL. The second gating circuit 1260 selects the target column included in the target line based on the column selection signal CSEL. For example, the first gating circuit 1250 may further include a sense amplifier for sensing the memory cells and processing signals reliably.

The input/output data driving circuit 1270 writes input data DIN to the target column or outputs the data stored in the target column as output data DOUT. The data mask signal DMS is used to prevent unwanted input data from being written to the target column. For example, the input/output data driving circuit 1270 may include an input data driving circuit and an output data driving circuit.

The target column may be connected to the input/output data driving circuit 1270 through the gating circuits 1250 and 1260 based on the line selection signal LSEL and the column selection signal CSEL. In a data write operation, the input data DIN may be provided to the gating circuits 1250 and 1260 in units of column data CDIN and line data LDIN and stored in the target column. In a data read operation, the page data RD stored in the target row is provided to the first gating circuit 1250, and the data of the target line and the target column included in the target row is provided to the second gating circuit 1260 and the output data driving circuit in units of line data LDOUT and column data CDOUT, respectively, and may be output as the output data DOUT.

The input/output buffer 1010 may be substantially the same as the input/output buffer 1010 of FIG. 7. The operation circuit 1000 may be substantially the same as the operation circuit 1000 of FIG. 7, except that it is connected to the first gating circuit 1250.

In FIG. 8, the gating circuit is configured in two stages so that the data supplied to the operation circuit 1000 is larger than the column size, which allows more accumulators to be driven at a time.
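The two-stage selection described for FIG. 8 can be sketched as follows. This is only an illustrative model, not the circuit of the embodiment: the function names `first_gating` and `second_gating`, the list representation of the page data, and the sizes used are all assumptions made for the example.

```python
def first_gating(page, line_index, line_size):
    """Model of the first gating stage: pick the target line out of the page data.

    `page` is a hypothetical list standing in for the page data RD.
    """
    start = line_index * line_size
    return page[start:start + line_size]


def second_gating(line, column_index, column_size):
    """Model of the second gating stage: pick the target column out of the target line."""
    start = column_index * column_size
    return line[start:start + column_size]


# Hypothetical sizes: a 32-unit page, 16-unit lines, 4-unit columns.
page = list(range(32))
line = first_gating(page, line_index=1, line_size=16)        # goes to the operation circuit
column = second_gating(line, column_index=2, column_size=4)  # goes to the I/O driving circuit
```

In this model the operation circuit taps the output of the first stage, so it sees a whole line (16 units here) while the input/output path sees a single column (4 units).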

Referring to FIG. 9, the memory device 1300 includes a memory cell array 1310, a row decoder 1320, a multi-column decoder 1330, a gating circuit 1340, an input/output data driving circuit 1350, a type conversion and operation circuit 1000, and an input/output buffer 1010.

The memory device 1300 of FIG. 9 may have a structure similar to that of the memory device 1100 of FIG. 7, except that the column decoder 1130 is replaced with the multi-column decoder 1330.

The memory cell array 1310 and the row decoder 1320 may be substantially the same as the memory cell array 1110 and the row decoder 1120 of FIG. 7, respectively.

Based on the column address CADDR and the column selection information CSINF, the multi-column decoder 1330 generates a multi-column selection signal MCSEL for selecting, at one time, a plurality of target columns TC (hatched portions) that include the target memory cells from among the columns included in the target row TR. The column selection information CSINF places no constraint that the addresses of the target columns TC be contiguous. In other words, the target columns TC (that is, the addresses of the target columns TC) may be non-contiguous.
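As a rough illustration of the multi-column selection, the sketch below builds a per-column selection mask from a base column address and a list of offsets. The encoding of CSINF as a list of offsets relative to CADDR is an assumption made only for this example, as are the function name `multi_column_select` and the one-flag-per-column form of MCSEL.

```python
def multi_column_select(caddr, csinf_offsets, num_columns):
    """Return a hypothetical MCSEL mask with a 1 for every target column.

    caddr         -- base column address (stands in for CADDR)
    csinf_offsets -- assumed encoding of CSINF: offsets from caddr, need not be contiguous
    num_columns   -- number of columns in the target row
    """
    mcsel = [0] * num_columns
    for offset in csinf_offsets:
        col = caddr + offset
        if not 0 <= col < num_columns:
            raise ValueError(f"column {col} is outside the target row")
        mcsel[col] = 1
    return mcsel


# Columns 2, 5 and 7 are selected in a single access even though they are not adjacent.
mcsel = multi_column_select(caddr=2, csinf_offsets=[0, 3, 5], num_columns=8)
```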

Depending on the embodiment, the column selection information CSINF may be set based on a predefined column selection table, based on a column selection parameter, or based on a column selection list or a column selection information list.

The gating circuit 1340 is connected to the target row TR based on the row selection signal RSEL. In addition, the gating circuit 1340 selects the plurality of target columns TC included in the target row TR based on the multi-column selection signal MCSEL and connects them to the input/output data driving circuit 1350 at one time. For example, the gating circuit 1340 may further include a sense amplifier for sensing the memory cells and stably processing signals.

The input/output data driving circuit 1350 writes input data DIN into the plurality of target columns TC at one time, or outputs data stored in the plurality of target columns TC as output data DOUT at one time. During a write, the multi-column selection signal MCSEL and the data mask signal DMS are used to prevent unwanted input data from being written into the plurality of target columns TC. For example, the input/output data driving circuit 1350 may include an input data driving circuit and an output data driving circuit.
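The masked multi-column write can be modeled in a few lines. The sketch assumes, purely for illustration, that MCSEL and DMS are per-column flags and that a DMS flag of 1 blocks the write to that column; the text does not specify these encodings, and `masked_write` is a hypothetical name.

```python
def masked_write(row, mcsel, dms, din):
    """Write one input unit per selected column, skipping columns masked by dms.

    row   -- current contents of the target row, one value per column
    mcsel -- assumed per-column selection flags (1 = target column)
    dms   -- assumed per-column mask flags (1 = do not write)
    din   -- input values, one per selected column, in column order
    """
    if len(din) != sum(mcsel):
        raise ValueError("need exactly one input value per selected column")
    result = list(row)
    values = iter(din)
    for col, selected in enumerate(mcsel):
        if not selected:
            continue
        value = next(values)
        if not dms[col]:
            result[col] = value
    return result


row = [0] * 8
mcsel = [0, 0, 1, 0, 0, 1, 0, 1]   # target columns 2, 5, 7
dms = [0, 0, 0, 0, 0, 1, 0, 0]     # column 5 is masked
new_row = masked_write(row, mcsel, dms, din=[11, 22, 33])
```

Here columns 2 and 7 receive their values in one access, while column 5 keeps its old contents because it is masked.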

The plurality of target columns TC may be connected to the input/output data driving circuit 1350 through the gating circuit 1340 based on the multi-column selection signal MCSEL. In a data write operation, the input data DIN may be provided to the gating circuit 1340 in units of column data CDIN and stored in the plurality of target columns TC. In a data read operation, the page data (or row data) RD stored in the target row TR may be provided to the gating circuit 1340, the data of the plurality of target columns TC included in the target row TR may be provided to the output data driving circuit in units of column data CDOUT, and the data may be output as the output data DOUT.

In an embodiment, the size of the input/output data DIN or DOUT may be smaller than or equal to the size of the page data RD. For example, the size of the page data RD may be P bits (P is a natural number greater than or equal to 2), and the size of the input/output data DIN or DOUT may be K bits (K is a natural number less than or equal to P).

In an embodiment, the size of the column data CDIN or CDOUT may be smaller than the size of the input/output data DIN or DOUT. For example, the size of the column data CDIN or CDOUT may be C bits (C is a natural number less than K). In this case, a plurality of columns (for example, K/C columns) may be selected to store one unit of input data DIN or to provide one unit of output data DOUT. Accordingly, the input/output data DIN or DOUT is illustrated with a single arrow and the column data CDIN or CDOUT is illustrated with a plurality of arrows.
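The relationship between P, K, and C can be checked with a small calculation. The concrete bit widths below are hypothetical, and the requirement that K be an exact multiple of C is an assumption added so that K/C is a whole number of columns.

```python
def columns_per_access(p_bits, k_bits, c_bits):
    """Return K/C, the number of columns selected for one unit of DIN or DOUT.

    Enforces the size constraints stated in the text: C < K <= P.
    """
    if not (0 < c_bits < k_bits <= p_bits):
        raise ValueError("sizes must satisfy 0 < C < K <= P")
    if k_bits % c_bits != 0:
        # Assumption for this sketch: K is an exact multiple of C.
        raise ValueError("K must be a multiple of C")
    return k_bits // c_bits


# Hypothetical sizes: an 8192-bit page, 256-bit I/O data, 32-bit columns.
n_cols = columns_per_access(p_bits=8192, k_bits=256, c_bits=32)  # 8 columns per access
```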

The input/output buffer 1010 may be substantially the same as the input/output buffer 1010 of FIG. 7. The operation circuit 1000 may be substantially the same as the operation circuit 1000 of FIG. 7.

In FIG. 9, the multi-column decoder 1330 increases the degree of freedom in selecting the data supplied to the operation circuit 1000. In addition, because non-contiguous memory cells can be accessed at one time, the performance and power efficiency of the memory device 1300 may be improved.

The present invention can be applied to various devices and systems that include a computing device and a memory device performing matrix operations. Accordingly, the present invention may be usefully employed in various electronic devices such as mobile phones, smartphones, PDAs, PMPs, digital cameras, camcorders, PCs, server computers, workstations, notebooks, digital TVs, set-top boxes, music players, portable game consoles, navigation devices, wearable devices, IoT devices, VR devices, and AR devices.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the present invention may be modified and changed in various ways without departing from the spirit and scope of the present invention as set forth in the claims below.

Claims

A computing device comprising one or more operator groups each including a plurality of accumulators, each of the plurality of accumulators including a first input terminal for receiving input vector components and a second input terminal for receiving elements of a matrix, and performing an accumulation operation on the input vector components and the elements of the matrix to generate one of output vector components,
wherein each of the plurality of accumulators comprises:
a first operator for generating a result of the first operation by performing a first operation on the input vector components received from the first input terminal and elements of the matrix received from the second input terminal;
a second operator for generating a result of the second operation by performing a second operation on the result of the first operation output from the first operator and the accumulation result provided from the accumulation register; and
the accumulation register, which generates the accumulation result by accumulating the results of the second operation output from the second operator and finally generates one of the output vector components,
wherein, when one of the two inputs of each of the plurality of accumulators is an n-base floating-point value (n is a natural number greater than or equal to 2) and the other is a power of n having an integer exponent, the first operator is implemented as a circuit that bypasses the mantissa of a binary floating-point value and adds the exponent of the n-base floating-point value and the exponent of the power of n, and
wherein the first input terminals of the plurality of accumulators included in one operator group are connected in common.

The computing device of claim 1,
wherein a plurality of the operator groups receive input vector components in one-to-one correspondence with a plurality of input vectors and generate output vector components in one-to-one correspondence with a plurality of output vectors,
the computing device comprises a plurality of cross operator groups each including a plurality of accumulators belonging to different operator groups, and
the second input terminals of the plurality of accumulators included in one cross operator group are connected in common.

The computing device of claim 1, wherein each of the plurality of accumulators further comprises:
at least one shadow register for temporarily storing the accumulation result generated by the accumulation register.

The computing device of claim 1, wherein each of the plurality of accumulators further comprises:
a multiplexer that is connected to an auxiliary input terminal for receiving an auxiliary input and to the first input terminal for receiving the input vector components, and that selects one of the auxiliary input and the input vector components based on a selection signal and provides the selected one to the first operator.

The computing device of claim 1, wherein a first accumulator among the plurality of accumulators further comprises:
a multiplexer that is connected to an auxiliary input terminal for receiving an auxiliary input and to the first input terminal for receiving the input vector components, and that selects one of the auxiliary input and the input vector components based on a selection signal and provides the selected one to the first operator,
wherein at least one accumulator disposed adjacent to the first accumulator among the plurality of accumulators shares the multiplexer with the first accumulator.

The computing device of claim 1,
wherein, when one of the two inputs of each of the plurality of accumulators is an n-base integer value (n is a natural number greater than or equal to 2) and the other is a power of n having an integer exponent,
the first operator is implemented as a circuit for shifting the n-base integer value by the integer exponent.

The computing device of claim 1,
wherein the first operator is bypassed by fixing one of the two inputs of each of the plurality of accumulators to 1, and the second operator is implemented as a comparator that compares the value bypassed by the first operator with the value of the accumulation register and outputs the larger of the two.

The computing device of claim 1,
wherein the matrix is a convolution matrix or a packed convolution matrix, and the input vector and the output vector are obtained by converting an input matrix and an output matrix into vectors, respectively.

The computing device of claim 1,
wherein the matrix is an extended convolution matrix corresponding to a plurality of filter matrices, and the output vector is an extended output vector corresponding to the plurality of filter matrices.

The computing device of claim 2,
wherein the matrix is a convolution matrix or a packed convolution matrix, and the plurality of input vectors and the plurality of output vectors are obtained by converting a plurality of input matrices and a plurality of output matrices into vectors, respectively.

The computing device of claim 2,
wherein the matrix is a divided convolution matrix, and the plurality of input vectors and the plurality of output vectors are obtained by converting a plurality of divided input matrices and a plurality of divided output matrices into vectors, respectively.

A memory device comprising:
a memory cell array comprising a plurality of memory cells arranged to form a plurality of rows and a plurality of columns, the memory cell array storing elements of a matrix or matrix generation information for generating the elements of the matrix;
a row decoder for generating a row selection signal for selecting a target row from among the plurality of rows based on a row address;
a column decoder for generating a column selection signal for selecting a target column from among columns included in the target row, based on a column address;
a gating circuit for selecting the target column based on the column selection signal;
an input/output data driving circuit for writing input data into the target column or outputting data stored in the target column as output data through the gating circuit;
an input/output buffer connected to the input/output data driving circuit and configured to store input vector components and output vector components; and
an arithmetic circuit that is connected to the gating circuit and the input/output buffer and includes one or more operator groups each including a plurality of accumulators, wherein the plurality of accumulators operate, in an input vector reference method, on the input vector components provided from the input/output buffer and either the elements of the matrix provided through the gating circuit or the elements of the matrix generated with reference to the matrix generation information provided through the gating circuit, to obtain the output vector components, and the arithmetic circuit provides the output vector components to the input/output buffer,
wherein each of the plurality of accumulators includes a first input terminal for receiving the input vector components and a second input terminal for receiving the elements of the matrix, and performs an accumulation operation on the input vector components and the elements of the matrix to generate one of the output vector components,
wherein each of the plurality of accumulators comprises:
a first operator for generating a result of the first operation by performing a first operation on the input vector components received from the first input terminal and elements of the matrix received from the second input terminal;
a second operator for generating a result of the second operation by performing a second operation on the result of the first operation output from the first operator and the accumulation result provided from the accumulation register; and
the accumulation register, which generates the accumulation result by accumulating the results of the second operation output from the second operator and finally generates one of the output vector components,
wherein, when one of the two inputs of each of the plurality of accumulators is an n-base floating-point value (n is a natural number greater than or equal to 2) and the other is a power of n having an integer exponent, the first operator is implemented as a circuit that bypasses the mantissa of a binary floating-point value and adds the exponent of the n-base floating-point value and the exponent of the power of n, and
wherein the first input terminals of the plurality of accumulators included in one operator group are connected in common.

The memory device of claim 12,
wherein a plurality of the operator groups receive input vector components in one-to-one correspondence with a plurality of input vectors and generate output vector components in one-to-one correspondence with a plurality of output vectors,
the arithmetic circuit comprises a plurality of cross operator groups each including a plurality of accumulators belonging to different operator groups, and
the second input terminals of the plurality of accumulators included in one cross operator group are connected in common.

The memory device of claim 12, wherein each of the plurality of accumulators further comprises:
at least one shadow register for temporarily storing the accumulation result generated by the accumulation register.

The memory device of claim 12, wherein each of the plurality of accumulators further comprises:
a multiplexer that is connected to an auxiliary input terminal for receiving an auxiliary input and to the first input terminal for receiving the input vector components, and that selects one of the auxiliary input and the input vector components based on a selection signal and provides the selected one to the first operator.

The memory device of claim 12, wherein a first accumulator among the plurality of accumulators further comprises:
a multiplexer that is connected to an auxiliary input terminal for receiving an auxiliary input and to the first input terminal for receiving the input vector components, and that selects one of the auxiliary input and the input vector components based on a selection signal and provides the selected one to the first operator,
wherein at least one accumulator disposed adjacent to the first accumulator among the plurality of accumulators shares the multiplexer with the first accumulator.

The memory device of claim 12,
wherein, when one of the two inputs of each of the plurality of accumulators is an n-base integer value (n is a natural number greater than or equal to 2) and the other is a power of n having an integer exponent,
the first operator is implemented as a circuit for shifting the n-base integer value by the integer exponent.

The memory device of claim 12,
wherein the first operator is bypassed by fixing one of the two inputs of each of the plurality of accumulators to 1, and the second operator is implemented as a comparator that compares the value bypassed by the first operator with the value of the accumulation register and outputs the larger of the two values.

The memory device of claim 12,
wherein the matrix is a convolution matrix or a packed convolution matrix, and the input vector and the output vector are obtained by converting an input matrix and an output matrix into vectors, respectively.

The memory device of claim 12,
wherein the matrix is an extended convolution matrix corresponding to a plurality of filter matrices, and the output vector is an extended output vector corresponding to the plurality of filter matrices.

The memory device of claim 13,
wherein the matrix is a convolution matrix or a packed convolution matrix, and the plurality of input vectors and the plurality of output vectors are obtained by converting a plurality of input matrices and a plurality of output matrices into vectors, respectively.

The memory device of claim 13,
wherein the matrix is a partitioned convolution matrix, and the plurality of input vectors and the plurality of output vectors are obtained by converting a plurality of partitioned input matrices and a plurality of partitioned output matrices into vectors, respectively.

The memory device of claim 12, further comprising:
a line decoder for generating, based on the column address, a line selection signal for selecting a target line including the target column from among a plurality of lines that are included in the target row and that each include two or more columns,
wherein the gating circuit includes a first gating circuit that selects the target line based on the line selection signal and a second gating circuit that selects the target column based on the column selection signal,
the input/output data driving circuit writes the input data to the target column or outputs data stored in the target column as the output data through the first and second gating circuits, and
the arithmetic circuit is connected to the first gating circuit.

The memory device of claim 12, wherein:
the column decoder is a multi-column decoder that generates, based on the column address and column selection information, a multi-column selection signal for selecting a plurality of target columns at one time from among the columns included in the target row;
the gating circuit selects the plurality of target columns at one time based on the multi-column selection signal;
the input/output data driving circuit, through the gating circuit, writes the input data into the plurality of target columns at one time or outputs data stored in the plurality of target columns as the output data at one time; and
the column addresses corresponding to the plurality of target columns included in the target row are not contiguous.

delete