KR102371451B1

KR102371451B1 - Parallel multiplication apparatus using multi-port memory

Info

Publication number: KR102371451B1
Application number: KR1020210068613A
Authority: KR
Inventors: 남병규; 조수환; 도현구; 최성림
Original assignee: 충남대학교 산학협력단
Priority date: 2021-05-27
Filing date: 2021-05-27
Publication date: 2022-03-07

Abstract

The present invention relates to a multiplier capable of parallel operation processing, comprising: a single memory that stores multiplication results for a parameter and a plurality of input values and has multiple connections to a multiport; and the multiport, which reads the multiplication results stored in the single memory. The plurality of single ports constituting the multiport are each connected to the single memory, so that a plurality of multiplication operations can be performed in a low-area structure. The present invention therefore has the advantage of reducing circuit area.

Description

Parallel multiplication device using multiport memory {PARALLEL MULTIPLICATION APPARATUS USING MULTI-PORT MEMORY}

The present invention relates to a method for reducing the area of the multiplier unit in applications that require an area-efficient design, such as a convolutional neural network (CNN). More specifically, it relates to a parallel multiplication apparatus using a multi-port memory, characterized in that area efficiency is increased by replacing multipliers with read ports when a large number of multiplications are processed in parallel.

Among deep learning algorithms, the CNN is a machine learning algorithm that has attracted attention for its state-of-the-art accuracy, and it has recently surpassed human recognition accuracy by performing billions of multiply-accumulate (MAC) operations. However, because a large number of MAC units occupies a large area, an area-efficient MAC unit design is required to integrate them on a single chip. In particular, since the multiplier accounts for most of the area of a MAC unit, research on reducing multiplier area is important.

Meanwhile, in a CNN, one weight is shared by many activation values during a convolution operation, so weights are reused frequently. Conventional memory-based multipliers exploit this by storing the multiplication results for a plurality of frequently reused input values in a lookup table (LUT) or a single memory and reusing the stored results, which lowers the power consumption and area of the multiplier. However, the conventional memory-based multiplier technique requires as many memory blocks as there are multipliers in the MAC unit, which limits how far the overall circuit area can be reduced. This is especially problematic in applications such as CNNs that require billions of MAC operations.

In contrast, the present applicant studied storing the operation results for frequently reused data, such as CNN weights, in one single memory and reading multiple operation results from that same single memory even when a plurality of multiplication operations is performed. The study started from the observation that a read port (single port or multiport) reads a multiplication result stored in the memory. The multiplier is therefore replaced with a read port, so that many multiplications can be processed in parallel simply by adding read ports. As a result, the same multiplication operation is not repeated many times, and because reading the multiplication results uses only the one single memory described above, with no additional memory, it was confirmed that a plurality of multiplication operations can be performed more area-efficiently than in the prior patents.

Korean Patent Publication No. 10-2021-0002495
Korean Patent Publication No. 10-2019-0055608

The present invention aims to make it possible to integrate a large number of MAC units on a single chip by designing a plurality of multipliers in an area-efficient manner.

The problems to be solved by the present invention are not limited to those mentioned above. Other objects and advantages of the present invention will be understood more clearly from the following description.

To achieve the above object, the present invention provides a multiplier capable of parallel operation processing, comprising: a single memory that stores multiplication results for a parameter and a plurality of input values and has multiple connections to a multiport; and the multiport, which reads the multiplication results stored in the single memory, wherein a plurality of single ports constituting the multiport are each connected to the single memory, so that a plurality of multiplication operations can be performed in a low-area structure.

Preferably, the multiplier further comprises a single control unit that updates the multiplication results stored in the single memory and transmits the updated multiplication results to the single memory, wherein the single control unit is provided in a number corresponding to the number of single memories and may therefore be singular in number.

According to the present invention, in a multiplier capable of performing a plurality of multiplication operations, the multiplication results for frequently reused data are stored in a single memory and the multipliers are replaced with read ports. Many multiplications can then be processed in parallel simply by adding read ports, which has the advantage of reducing circuit area.

In addition, because the same multiplication operation is not carried out repeatedly, power consumption can be reduced, and because the multiplication results of the single memory are read simultaneously and in parallel through the multiport, operations can be processed efficiently.

FIG. 1 is a structural diagram of a convolution operation that shares weights in a convolutional neural network.
FIG. 2 is a schematic diagram of a multiplier according to an embodiment of the present invention.
FIG. 3 is a graph showing the area per multiplication when a single memory is shared.
FIG. 4 illustrates the process of updating the data of the single memory through the single control unit and the process of reading multiplication results through the multiport, according to an embodiment of the present invention.
FIG. 5 shows the memory array of the single memory (10), configured as a 20×16 matrix of 32-port bit cells, according to an embodiment of the present invention.
FIG. 6 is a schematic diagram of the 32-port bit cell of FIG. 5.

Hereinafter, the present invention is described in detail with reference to the accompanying drawings. However, the present invention is not restricted or limited by the exemplary embodiments. The same reference numerals in the drawings denote members that perform substantially the same function.

The objects and effects of the present invention will be naturally understood or become clearer from the following description, and they are not limited by that description alone. In addition, where a detailed description of known technology related to the present invention could unnecessarily obscure the gist of the invention, that detailed description is omitted.

FIG. 1 is a structural diagram of a convolution operation that shares weights in a convolutional neural network. As shown in FIG. 1, a single weight parameter of the filter (W₀ to W₈ in FIG. 1) may be shared by several activation values when the convolution operation is performed. The present invention therefore reuses the weight parameters during the convolution operation to obtain a design that is efficient in both computation and area.
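The weight sharing described for FIG. 1 can be illustrated with a minimal Python sketch. It assumes a 3×3 filter (nine weights, matching W₀ to W₈) sliding over a 5×5 input with stride 1 and no padding. The input size, the stride, and the function name `weight_reuse_counts` are assumptions made for illustration and do not come from the patent.

```python
def weight_reuse_counts(in_h, in_w, k=3):
    """Count how many multiplications each filter weight takes part in
    for a stride-1, no-padding convolution (illustrative assumption)."""
    out_h, out_w = in_h - k + 1, in_w - k + 1
    counts = [0] * (k * k)
    for oy in range(out_h):
        for ox in range(out_w):
            for ky in range(k):
                for kx in range(k):
                    # Weight W[ky*k + kx] multiplies activation (oy+ky, ox+kx).
                    counts[ky * k + kx] += 1
    return counts


# Each of the nine weights is multiplied once per output position.
counts = weight_reuse_counts(5, 5)
```

Under these assumptions every weight is used once per output position, so each weight is multiplied by several different activations. This is the reuse that a table of precomputed products can exploit.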

FIG. 2 is a schematic diagram of a multiplier (1) according to an embodiment of the present invention. The multiplier (1) can process a plurality of multiplication operations in parallel and may include a single memory (10), a single control unit (30), and a multiport (50).

The single memory (10) may store multiplication results for a parameter and a plurality of input values. The parameter is highly reusable data and may be a weight or an activation value. In the following, the parameter is assumed to be a weight. As described above with reference to FIG. 1, the weight is shared when the multiplier (1) performs operations and is therefore used repeatedly. In addition, to reduce the inefficiency of repeating the same operation for the same input/output values and weight, the single memory (10) may store a plurality of multiplication results.
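A minimal Python sketch of the stored-product idea follows. It assumes unsigned 4-bit input values, so the table has 16 entries. The bit width, the example weight, and the helper name `build_product_table` are hypothetical choices for illustration, not values stated in this paragraph.

```python
def build_product_table(weight, input_bits=4):
    """Precompute weight * x for every possible unsigned input x.
    Hypothetical helper; the 4-bit input width is an assumption."""
    return [weight * x for x in range(1 << input_bits)]


table = build_product_table(weight=13)

# Each "multiplication" is now a lookup indexed by the input value.
activations = [3, 7, 3, 15, 0]
products = [table[a] for a in activations]
```

Repeated inputs, such as the value 3 above, return the stored product again without recomputing it.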

The single memory (10) may have multiple connections to the multiport (50). A memory with a single port cannot perform several accesses at the same time, whereas the single memory (10) according to an embodiment of the present invention, being connected to the multiport (50) through multiple connections, can supply a plurality of multiplication results at once, which increases efficiency. In addition, even when a plurality of multiplication operations is processed, only one single memory (10) is used, regardless of the number of operations or the number of ports constituting the multiport (50).

The single control unit (30) may update the multiplication results stored in the single memory (10). Because the single memory (10) stores only the multiplication results of previously performed operations, the multiplication results for a new input value or a new weight may be updated through the single control unit (30). The multiplication results updated by the single control unit (30) may be transmitted back to the single memory (10) and stored there. The process of updating the multiplication results of the single memory (10) through the single control unit (30) is described below with reference to FIG. 4.

The single control unit (30) may be provided in the multiplier (1) in a number corresponding to the number of single memories (10). Note that because the present invention performs the multiplication operations with only one single memory (10) rather than several, only one single control unit (30) needs to be provided.

The multiport (50) may consist of a plurality of individual single ports, and all of the single ports included in the multiport (50) may be connected to the single memory (10). Connected in this way, the multiport (50) may read the multiplication results stored in the single memory (10) and output them from the single memory (10).

The number of single ports constituting the multiport (50) is not limited, and single ports may be added to the multiport (50) to read more multiplication results from the single memory (10) at the same time. In addition, the single ports of the multiport (50) may be arranged in parallel with one another, which allows the multiplier (1) to be integrated in a low-area structure while performing a plurality of multiplication operations.
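The relationship between one shared memory and several read ports can be sketched as a small behavioral model in Python. The class name `MultiPortProductMemory`, the 4-bit input width, and the default of eight ports are assumptions for illustration. The model describes behavior only and says nothing about circuit structure.

```python
class MultiPortProductMemory:
    """Behavioral sketch: one product table, several independent read ports.
    Class name, input width, and port count are illustrative assumptions."""

    def __init__(self, weight, input_bits=4, num_ports=8):
        # One shared table of weight * x, regardless of the number of ports.
        self.table = [weight * x for x in range(1 << input_bits)]
        self.num_ports = num_ports

    def read_cycle(self, addresses):
        """Serve up to num_ports reads in the same cycle, one per port."""
        if len(addresses) > self.num_ports:
            raise ValueError("more reads than read ports in one cycle")
        return [self.table[a] for a in addresses]


mem = MultiPortProductMemory(weight=5, num_ports=8)
same_cycle_products = mem.read_cycle([0, 1, 2, 3, 4, 5, 6, 7])
```

In this model, raising `num_ports` increases the number of products delivered per cycle while `table` stays a single object, which mirrors adding read ports to one memory rather than adding memories.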

FIG. 3 is a graph showing the area per multiplication when the single memory (10) is shared. Because the multiport (50) is connected to the single memory (10), the number of multiplications that share the single memory (10) becomes a design question. Referring to FIG. 3, the effective area per multiplication (y-axis of FIG. 3) decreases sharply as the number of multiplications sharing the single memory (10) (x-axis of FIG. 3) increases. However, the benefit diminishes at around eight multiplications, so in this embodiment the number of multiplications sharing the single memory (10) is set to eight to maximize area efficiency.
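One way to see why the per-multiplication area falls quickly and then flattens is a simplified first-order model. It is assumed here for illustration only and is not taken from FIG. 3. The symbols A_mem (area shared by all multiplications), A_port (area added per read port), and N (number of multiplications sharing the memory) are hypothetical, and the model ignores the fact that a real multi-port bit cell may grow faster than linearly with the port count.

```latex
% Effective area per multiplication under the assumed linear model
A_{\mathrm{eff}}(N) = \frac{A_{\mathrm{mem}} + N \, A_{\mathrm{port}}}{N}
                    = \frac{A_{\mathrm{mem}}}{N} + A_{\mathrm{port}}

% Saving obtained by sharing among one more multiplication
A_{\mathrm{eff}}(N) - A_{\mathrm{eff}}(N+1) = \frac{A_{\mathrm{mem}}}{N (N+1)}
```

Under this assumption the saving from each additional sharer shrinks roughly as 1/N², which is consistent with a curve that drops steeply at first and then levels off.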

FIG. 4 shows the process of updating the data of the single memory (10) through the single control unit (30) and the process of reading multiplication results through the multiport (50), according to an embodiment of the present invention. Referring to FIG. 4, the multiplication results are updated sequentially, one per clock cycle, during the update phase. Because the update time is small compared with the overall operating cycle, the waiting time caused by updates from the single control unit (30) may have almost no effect on the overall operating cycle. In addition, because the plurality of multipliers shares one memory, only one single control unit (30) is needed to update the multiplication results, which makes the design more area-efficient.
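The two phases described for FIG. 4 can be sketched as a cycle-counting model in Python. The sketch assumes a 16-entry table (4-bit inputs), eight reads per cycle, and a control unit that produces each entry by adding the weight to a running sum. The accumulation scheme, the sizes, and the function name `run_schedule` are all assumptions for illustration, since the text does not specify how the control unit computes the updated results.

```python
def run_schedule(weight, activations, input_bits=4, num_ports=8):
    """Model an update phase followed by a parallel read phase.
    Returns (products, update_cycles, read_cycles)."""
    entries = 1 << input_bits
    table = [0] * entries

    # Update phase: one table entry is written per clock cycle.
    acc = 0
    update_cycles = 0
    for x in range(entries):
        table[x] = acc      # acc equals weight * x at this point
        acc += weight       # assumed accumulate-style update
        update_cycles += 1

    # Read phase: up to num_ports lookups are served per clock cycle.
    results = []
    read_cycles = 0
    for i in range(0, len(activations), num_ports):
        batch = activations[i:i + num_ports]
        results.extend(table[a] for a in batch)
        read_cycles += 1
    return results, update_cycles, read_cycles


activations = [a % 16 for a in range(64)]
results, update_cycles, read_cycles = run_schedule(7, activations)
```

With these assumed sizes the update phase takes 16 cycles once per weight, and the read phase takes one cycle per eight lookups. The more activations share a weight, the smaller the update phase becomes relative to the whole.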

FIG. 5 shows the memory array (100) of the single memory (10), configured as a 20×16 matrix of 32-port bit cells, according to an embodiment of the present invention. The memory array (100) shown in FIG. 5 has 16 rows with 32 read word lines (RWL) each and 20 columns with 32 read bit lines (RBL) each. FIG. 6 is a schematic diagram of the 32-port bit cell of FIG. 5. Referring to FIG. 6, the 32-port bit cell consists mainly of a data storage element, a write port, and 32 read ports that drive the 32 RBLs. The read ports in FIGS. 5 and 6 correspond to the multiport (50) of the present invention. A two-stage FO4 (fan-out-of-4) buffering structure may be included to drive the 32 read ports, which present a large capacitance. Referring to FIG. 6, the read ports according to an embodiment of the present invention may be bundled in units of four and divided into a total of eight groups.
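The array dimensions given for FIGS. 5 and 6 can be checked with a short Python calculation. The row, column, and port counts come from the description. Grouping consecutive port indices together is an assumption, since the text only states that the ports are bundled four at a time into eight groups.

```python
ROWS = 16             # rows of bit cells (from the description)
COLS = 20             # columns of bit cells (from the description)
READ_PORTS = 32       # read ports per bit cell (from the description)
PORTS_PER_GROUP = 4   # ports bundled per group (from the description)

bitcells = ROWS * COLS                 # total 32-port bit cells
read_wordlines = ROWS * READ_PORTS     # 32 RWLs for each of 16 rows
read_bitlines = COLS * READ_PORTS      # 32 RBLs for each of 20 columns

# Assumed grouping: consecutive port indices share a group.
port_groups = [
    list(range(start, start + PORTS_PER_GROUP))
    for start in range(0, READ_PORTS, PORTS_PER_GROUP)
]
```

The calculation gives 320 bit cells, 512 read word lines, 640 read bit lines, and eight port groups, which matches the eight groups stated in the description.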

Although the present invention has been described in detail above through representative embodiments, those of ordinary skill in the art to which the present invention pertains will understand that various modifications can be made to the embodiments described above without departing from the scope of the present invention. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be determined by the appended claims and by all changes or modifications derived from the claims and their equivalents.

1: multiplier
10: single memory
100: memory array
30: single control unit
50: multiport

Claims

A multiplier capable of parallel operation processing, comprising:
a single memory that stores multiplication results for a parameter and a plurality of input values and has multiple connections to a multiport; and
the multiport, which reads the multiplication results stored in the single memory,
wherein a plurality of single ports constituting the multiport are each connected to the single memory, so that a plurality of multiplication operations can be performed in a low-area structure.

The multiplier of claim 1,
further comprising a single control unit that updates the multiplication results stored in the single memory and transmits the updated multiplication results to the single memory,
wherein the single control unit is provided in a number corresponding to the number of single memories and is therefore singular in number.