KR101967857B1 - Processing in memory device with multiple cache and memory accessing method thereof

Info

Publication number: KR101967857B1
Application number: KR1020170116646A
Authority: KR
Inventors: 김동순; 김병수; 장영종; 김영규
Original assignee: 전자부품연구원
Priority date: 2017-09-12
Filing date: 2017-09-12
Publication date: 2019-08-19
Also published as: KR20190029270A

Abstract

An intelligent semiconductor device is disclosed. The device includes a computing device that performs computational tasks, multiple cache memories that form an upper layer of a memory hierarchy, and a memory bank that forms a lower layer of the memory hierarchy. The computing device accesses the multiple cache memories selectively or simultaneously through a plurality of independent channels.

Description

PROCESSING IN MEMORY DEVICE WITH MULTIPLE CACHE AND MEMORY ACCESSING METHOD THEREOF

The present invention relates to an intelligent semiconductor device and, more particularly, to the memory access technique of a computing device inside an intelligent semiconductor device.

In general, an intelligent semiconductor (Processing In Memory; PIM) is a next-generation semiconductor that adds a processor function capable of computation to DRAM. Conventionally, the processor and memory functions were completely separated, and bottlenecks frequently occurred as information moved between the two. With PIM, the main processor is no longer overloaded by a concentration of computational work, and the information bottleneck between processor and memory is removed, which increases processing speed.

The instructions processed by an intelligent semiconductor are atomic operations that include one or more memory accesses and computations, and using several small memories rather than a single large memory yields a greater benefit in terms of system performance.

Accordingly, most recent intelligent semiconductors are evolving toward multiprocessor structures based on Single Instruction Multiple Data (SIMD) processing, and toward systems with a hierarchical memory structure in which SIMD-based intelligent semiconductors are connected by a network.

Intelligent semiconductors use internal or external cache memory to reduce the speed gap between the DRAM and the computing device. However, the cache management policies of conventional computer systems have evolved with a focus on improving the hit ratio by exploiting data locality, and caches with such policies and structures are limited in how far they can maximize the performance and power efficiency of an intelligent semiconductor.

An object of the present invention, which addresses the above problem, is to provide an intelligent semiconductor that can maximize performance and power efficiency through a new cache memory management policy and hardware structure.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the description below.

An intelligent semiconductor device according to one aspect of the present invention for achieving the above object is an intelligent semiconductor device having both a processor function capable of performing computational tasks and a memory function capable of storing data, and includes: a computing device that performs the computational tasks; multiple cache memories forming an upper layer in a memory hierarchy; and a memory bank forming a lower layer in the memory hierarchy. The computing device accesses the multiple cache memories selectively or simultaneously through a plurality of independent channels.

A memory access method in an intelligent semiconductor device according to another aspect of the present invention includes: receiving, by a computing device, an instruction packet; and selectively or simultaneously accessing, by the computing device, multiple cache memories forming an upper layer in the memory hierarchy, based on a lower field of an address value in the instruction packet.

According to the present invention, changing the cache memory in an intelligent semiconductor to a multiple cache memory structure improves the performance of the intelligent semiconductor while reducing its power consumption. It also solves the problem of having to access memory sequentially in order to process a single instruction.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a block diagram schematically illustrating the internal configuration of an intelligent semiconductor device according to an embodiment of the present invention.
FIG. 2 is an example of the data structure of an instruction packet input to the intelligent semiconductor device shown in FIG. 1.
FIG. 3 is a block diagram schematically illustrating the internal configuration of the intelligent semiconductor module shown in FIG. 1.
FIG. 4 is a diagram schematically showing how the computing device shown in FIG. 3 accesses the multiple cache memories using a multiplexer.
FIG. 5 is a graph of experimental results comparing the performance and power consumption of a conventional intelligent semiconductor device and the intelligent semiconductor device of the present invention.
FIG. 6 is a flowchart illustrating a memory access method in an intelligent semiconductor device according to an embodiment of the present invention.

The above and other objects, advantages, and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The following embodiments are provided only to readily convey the object, configuration, and effects of the invention to those of ordinary skill in the art to which the invention pertains, and the scope of the present invention is defined by the claims. The terminology used herein is for describing the embodiments and is not intended to limit the invention. In this specification, the singular also includes the plural unless specifically stated otherwise. As used herein, "comprises" and/or "comprising" do not exclude the presence or addition of one or more other components, steps, operations, and/or elements besides those stated.

FIG. 1 is a block diagram schematically illustrating the internal configuration of an intelligent semiconductor device according to an embodiment of the present invention.

Referring to FIG. 1, the intelligent semiconductor device 100 according to an embodiment of the present invention is configured to include multiple cache memories (multiple cache) designed between the computing device and the memory bank provided inside it, in order to maximize performance and power efficiency.

To this end, the intelligent semiconductor device 100 according to an embodiment of the present invention may be configured to include a router 110 and a plurality of intelligent semiconductor modules 120 to 150.

The router 110 may be a kind of communication interface that forwards a memory instruction packet 10 (hereinafter, instruction packet) input from an external unit (not shown) to the plurality of intelligent semiconductor modules 120 to 150. The router 110 may also act as an arbiter for exchanging global data between the external unit and the plurality of intelligent semiconductor modules 120 to 150.

Each of the plurality of intelligent semiconductor modules 120 to 150 is configured to include multiple cache memories (multiple cache) instead of the conventional single data buffer, in order to reduce the speed gap between the computing device and the memory bank provided inside it.

As described in detail below, the present invention proposes a hardware structure in which each of the plurality of intelligent semiconductor modules 120 to 150 in the intelligent semiconductor device 100 includes multiple cache memories.

The cache memory of a general computer system has evolved toward improving the cache hit ratio by exploiting data locality. However, the instruction packets 10 processed by the intelligent semiconductor device 100 mostly consist of atomic operations that include memory accesses, and for the intelligent semiconductor device 100 to process one instruction, repeated memory accesses are unavoidable in order to obtain the data needed to process an instruction packet 10 such as the one shown in FIG. 2.

Therefore, using the multiple cache memory structure disclosed in the present invention, rather than the single data buffer structure used in the memory hierarchy of conventional computer systems, yields a very large benefit in terms of system performance and power consumption.

For reference, the data structure of the instruction packet 10 used in the embodiment of the present invention may consist of four fields (F1, F2, F3, and F4), as shown in FIG. 2. The first field F1 holds an opcode value indicating the type of operation (OP), the second field F2 holds the second source address value, the third field F3 holds the first source address value, and the fourth field F4 holds the destination address value. Hereinafter, the second to fourth fields are referred to as address fields.

Each of the address fields (F2, F3, and F4) may in turn consist of four fields (F5, F6, F7, and F8), as shown in FIG. 2.

The top field F5, which contains the most significant bit (MSB), holds the tag value, and the lowest field F8, which contains the least significant bit (LSB), holds the offset value.

The upper field F6, adjacent to the top field F5, holds the index value, and the lower field F7, adjacent to the lowest field F8, holds the cache number value, which identifies each cache memory in the multiple cache memory structure described below.

The combination of the tag value and the index value is used as the address value of the corresponding main memory, that is, the memory bank.

The offset value is determined by the size of a cache block, which is the unit of data transfer between the main memory (memory bank) and the cache memory. The index indicates the number of the cache slot in which the cache block is stored, and the tag is compared with the tag value stored in the cache slot to determine whether the cache block stored there is valid. The tag, offset, and index values are technical terms frequently encountered in computer engineering on caches in a memory hierarchy, so further detailed description of them is omitted.

Note, however, that the cache number value recorded in the lower field F7 is used as the selection signal of the multiplexer (122-1 in FIG. 3) provided inside the computing device (122 in FIG. 3) described below, so that the computing device can access the multiple cache memories selectively or simultaneously.
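The address layout described above (tag F5, index F6, cache number F7, offset F8, from MSB to LSB) can be sketched as a small decoder. This is only an illustration: the field widths below are assumptions chosen for the example (the specification does not give bit widths), and the names `AddressFields` and `split_address` are hypothetical.

```python
from typing import NamedTuple

# Field widths are illustrative assumptions; the specification does not fix them.
OFFSET_BITS = 6      # F8: byte offset inside an assumed 64-byte cache block
CACHE_NUM_BITS = 2   # F7: cache number, enough to select one of 4 cache memories
INDEX_BITS = 8       # F6: cache slot number inside the selected cache memory


class AddressFields(NamedTuple):
    tag: int           # F5 (top field, contains the MSB)
    index: int         # F6 (upper field)
    cache_number: int  # F7 (lower field, used as the multiplexer select signal)
    offset: int        # F8 (lowest field, contains the LSB)


def split_address(addr: int) -> AddressFields:
    """Split one address value (F2, F3, or F4) into its four sub-fields."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    addr >>= OFFSET_BITS
    cache_number = addr & ((1 << CACHE_NUM_BITS) - 1)
    addr >>= CACHE_NUM_BITS
    index = addr & ((1 << INDEX_BITS) - 1)
    tag = addr >> INDEX_BITS
    return AddressFields(tag, index, cache_number, offset)


# Build an address from known sub-fields and decode it again.
example = (1 << 16) | (35 << 8) | (1 << 6) | 5
print(split_address(example))
```

With these assumed widths, the cache number occupies the two bits directly above the offset, so it changes every time the address advances by one cache block.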

FIG. 3 is a block diagram schematically illustrating the internal configuration of one of the intelligent semiconductor modules shown in FIG. 1.

Referring to FIG. 3, the intelligent semiconductor module 120 is configured to include a multiple cache memory structure. Specifically, the intelligent semiconductor module 120 includes a computing device 122 and a memory bank 126, and further includes multiple cache memories 124 designed between the computing device 122 and the memory bank 126.

The multiple cache memories 124 may be configured to include n physically separated (or physically independent) cache memories (#0, #1, … #n-1), where n is a natural number of 2 or more.

Relative to the computing device 122, the multiple cache memories 124 form the upper layer in the memory hierarchy, and the memory bank 126 forms the lower layer.

The computing device 122 performs the computational work needed to process the instruction packet 10 input from the external unit. In a broad sense it may be called a 'processor', and in a narrow sense an 'arithmetic unit' that performs specific operations.

Although not shown in the drawings, the computing device 122 may be configured to include an accumulator, an adder, a data register, a comparator, a status register, and the like in order to perform specific operations.

In addition, the computing device 122 may be configured to include a multiplexer 122-1 so that it can access the multiple cache memories 124 selectively or simultaneously.

The computing device 122 may be configured to access the multiple cache memories 124 selectively or simultaneously using n independent channels 123 (n physically separated channels).

It is the multiplexer 122-1 provided in the computing device 122 that allows the computing device 122 to access the multiple cache memories 124 selectively or simultaneously. For example, the multiplexer 122-1 may be configured as the output stage of the computing device 122 and designed with n outputs. By connecting the n outputs of the multiplexer 122-1 to the n independent channels 123, respectively, the computing device 122 and the cache memories (#0, #1, … #n-1) that make up the multiple cache memories 124 are connected by the n independent channels 123.

Although the drawing shows one multiplexer 122-1, two or more multiplexers may be used. Likewise, although the drawing shows the multiplexer 122-1 inside the computing device 122, it may be provided outside it depending on the design. When the multiplexer is external, its input is connected to the output stage of the computing device.

Because the computing device 122 accesses the cache memories (#0, #1, … #n-1) over the n independent channels 123 connected to its internal multiplexer 122-1, it can access them selectively or simultaneously according to the operation of the multiplexer 122-1.

Specifically, when the data of the instruction currently being executed is distributed across the cache memories (#0, #1, … #n-1), the computing device 122 can perform read/write operations (accesses) selectively or simultaneously. This means that the computing device 122 can interleave some or all of the cache memories (#0, #1, … #n-1).

Conversely, when the data of the instruction currently being executed is not distributed across the cache memories (#0, #1, … #n-1), the computing device 122 performs the read/write operations (accesses) sequentially.
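The difference between simultaneous and sequential access can be illustrated with a simplified scheduling model. The sketch assumes that each independent channel can serve one access per round, which is an assumption of this example rather than a statement from the specification; the function name `schedule_accesses` is hypothetical.

```python
def schedule_accesses(cache_numbers: list[int]) -> list[list[int]]:
    """Group operand positions into rounds of accesses.

    cache_numbers[i] is the cache number (field F7) of operand i.
    Operands that live in different cache memories share a round
    (simultaneous access); operands that collide on the same cache
    memory are pushed to later rounds (sequential access).
    """
    rounds: list[list[int]] = []
    for operand, cache in enumerate(cache_numbers):
        for rnd in rounds:
            # A round can take this operand only if its cache is still free.
            if all(cache_numbers[other] != cache for other in rnd):
                rnd.append(operand)
                break
        else:
            rounds.append([operand])
    return rounds


# Operands spread over three different caches: one simultaneous round.
print(schedule_accesses([0, 1, 2]))
# All operands in cache #1: three sequential rounds.
print(schedule_accesses([1, 1, 1]))
```

In this model, an instruction whose two sources and destination fall in three different cache memories needs one round, while an instruction whose operands all map to the same cache memory needs three.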

The cache memories (#0, #1, … #n-1) and the memory bank 126 share one memory bus 125. The memory bank 126 may be, for example, a DRAM bank.

FIG. 4 is a diagram schematically showing how the computing device shown in FIG. 3 accesses the multiple cache memories using a multiplexer.

FIG. 4 assumes the case where the cache memory is divided into four. Each cache memory (#0, #1, #2, and #3) can be used as an n-way set associative cache depending on the characteristics of the system to which it is applied, and the address mapping scheme may vary with the number of cache memories and the number of ways in each cache memory.

In the memory access scheme of FIG. 4, the cache number value in the lower field F7 of each address value (the values recorded in F2, F3, and F4) of the instruction packet 10 shown in FIG. 2 is used as the selection signal 40 of the multiplexer 122-1, with which the computing device (122 in FIG. 3) selects among the four cache memories (#0, #1, #2, and #3). The tag value and the index value, which occupy the top field F5 and the upper field F6 of each address value, respectively, are used as the input to the multiplexer 122-1.

The cache number value in the lower field F7 is used as the multiplexer's selection signal because, to distribute data across the cache memories (#0, #1, #2, and #3), it is preferable to select on the lower field F7, whose value changes more frequently than that of the top field F5 or the upper field F6. This choice, too, may vary with the characteristics of the system.
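The reasoning about frequently changing low-order bits can be checked with a small experiment. The sketch below compares two hypothetical select functions over a run of consecutive cache blocks; the 32-bit address width, 64-byte block size, and four caches are assumptions of this example.

```python
from collections import Counter

# Illustrative assumptions, not values from the specification.
ADDR_BITS = 32
OFFSET_BITS = 6      # 64-byte cache blocks
CACHE_NUM_BITS = 2   # 4 cache memories, as in FIG. 4


def select_low(addr: int) -> int:
    """Select signal taken from the field just above the offset (like F7)."""
    return (addr >> OFFSET_BITS) & ((1 << CACHE_NUM_BITS) - 1)


def select_high(addr: int) -> int:
    """Alternative select signal taken from the top bits of the address."""
    return (addr >> (ADDR_BITS - CACHE_NUM_BITS)) & ((1 << CACHE_NUM_BITS) - 1)


# Eight consecutive cache blocks starting at an arbitrary base address.
blocks = [0x1000 + i * 64 for i in range(8)]
low_spread = Counter(select_low(a) for a in blocks)
high_spread = Counter(select_high(a) for a in blocks)

print(sorted(low_spread.items()))   # [(0, 2), (1, 2), (2, 2), (3, 2)]
print(sorted(high_spread.items()))  # [(0, 8)]
```

Selecting on the low field spreads neighbouring blocks evenly over the four caches, whereas selecting on the top bits sends the whole run to a single cache, which would force sequential access.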

As described above, the multiple cache memories designed in the intelligent semiconductor device 100 of the present invention can read the data needed to process one instruction simultaneously. Because the cache is split into several cache memories, each one can be made smaller, which greatly reduces the power consumed per data access.

In addition, as described above, the multiple cache memory structure designed in the intelligent semiconductor device 100 of the present invention is very effective in improving system performance and saving energy, even though it uses a simple hardware structure.

FIG. 5 is a bar graph of simulation results comparing the data buffer access time and the power consumption of an intelligent semiconductor that uses a single conventional data buffer (Existing PIM) and an intelligent semiconductor that uses multiple cache memories according to the present invention (Proposed PIM). The memory parameters were computed with CACTI 5.3, and the workloads used in the experiment were SPEC CPU2006 and SPEC OMP2012.

As shown in FIG. 5, the simulation results confirm that the intelligent semiconductor designed with multiple cache memories not only maintains the existing performance but exceeds it.

FIG. 6 is a flowchart illustrating a memory access method in an intelligent semiconductor device according to an embodiment of the present invention. Descriptions that overlap with those given with reference to FIGS. 1 to 5 are omitted or given briefly.

Referring to FIG. 6, first, in step S610, the computing device 122 provided inside the intelligent semiconductor device 100 receives a memory instruction packet 10 (hereinafter, instruction packet) from outside. As shown in FIG. 2, the instruction packet 10 is broadly divided into a field F1 that holds the opcode value and fields (F2, F3, F4) that hold address values. Each of the fields (F2, F3, F4) may in turn be divided into a field F5 that holds the tag value, a field F6 that holds the index value, a field F7 that holds the cache number value, and a field F8 that holds the offset value.

Next, in step S620, the computing device 122 selectively or simultaneously accesses the multiple cache memories forming the upper layer in the memory hierarchy, based on the lower field of the address value in the instruction packet (the lower field within a field that holds an address value). In other words, the computing device 122 interleaves some or all of the cache memories that make up the multiple cache memories, based on the lower field of the address value.
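Steps S610 and S620 can be tied together in a minimal packet-level sketch. The class `InstructionPacket`, the helper names, and the bit widths are hypothetical and chosen only for illustration; the sketch models which cache memory each operand of a received packet would be routed to, not the hardware itself.

```python
from dataclasses import dataclass

# Illustrative assumptions, not values from the specification.
OFFSET_BITS = 6      # width of the offset field F8
CACHE_NUM_BITS = 2   # width of the cache number field F7 (4 cache memories)


@dataclass(frozen=True)
class InstructionPacket:
    opcode: int      # F1
    second_src: int  # F2: second source address
    first_src: int   # F3: first source address
    dest: int        # F4: destination address


def cache_number(addr: int) -> int:
    """Extract the lower field F7 that drives the multiplexer select signal."""
    return (addr >> OFFSET_BITS) & ((1 << CACHE_NUM_BITS) - 1)


def access_plan(packet: InstructionPacket) -> dict[int, list[str]]:
    """Step S620 in miniature: map each cache number to the operands it serves."""
    plan: dict[int, list[str]] = {}
    for name in ("first_src", "second_src", "dest"):
        plan.setdefault(cache_number(getattr(packet, name)), []).append(name)
    return plan


# Step S610: a packet arrives; its three operands sit in consecutive blocks.
packet = InstructionPacket(opcode=1, second_src=0x1040, first_src=0x1000, dest=0x1080)
print(access_plan(packet))
```

When every cache number in the plan serves a single operand, as in this example, all three accesses can proceed over separate channels at once; a cache number that serves several operands indicates accesses that must be serialized.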

The present invention has been described above with a focus on embodiments, but these are merely examples and do not limit the invention. Those of ordinary skill in the art to which the invention pertains will recognize that various modifications and applications not illustrated above are possible without departing from the essential characteristics of the invention. For example, each component specifically shown in the embodiments may be modified and implemented. Differences relating to such modifications and applications should be construed as falling within the scope of the present invention as defined in the appended claims.

Claims

An intelligent semiconductor device having both a processor function capable of performing computational tasks and a memory function capable of storing data, the device comprising:
one computing device that performs the computational tasks;
multiple cache memories connected to the one computing device through a plurality of independent channels and forming an upper layer in a memory hierarchy; and
a memory bank forming a lower layer in the memory hierarchy,
wherein the one computing device includes a multiplexer connected to the plurality of independent channels, and
wherein the multiplexer selectively or simultaneously accesses the multiple cache memories through the plurality of independent channels, using as a selection signal a cache number that identifies the multiple cache memories in an instruction packet input to the computing device from outside.

The intelligent semiconductor device of claim 1, wherein the multiple cache memories comprise a plurality of physically separated cache memories.

(Deleted)

The intelligent semiconductor device of claim 1, wherein the cache number is recorded in a field adjacent to the lowest field in the instruction packet.

The intelligent semiconductor device of claim 1, wherein the multiple cache memories and the memory bank are connected by a common bus.

The intelligent semiconductor device of claim 1, wherein the one computing device interleaves the multiple cache memories through the plurality of independent channels.

A memory access method in an intelligent semiconductor device having one computing device capable of computational tasks and a memory forming a lower layer in a memory hierarchy, the method comprising:
receiving, by the one computing device, an instruction packet; and
selectively or simultaneously accessing, by the one computing device, multiple cache memories forming an upper layer in the memory hierarchy, based on a cache number that identifies the multiple cache memories in the instruction packet,
wherein the selectively or simultaneously accessing comprises a multiplexer included in the one computing device selectively or simultaneously accessing the multiple cache memories through a plurality of independent channels, using the cache number as a selection signal.

(Deleted)