KR100936601B1

KR100936601B1 - Multi-processor system

Info

Publication number: KR100936601B1
Application number: KR1020080044831A
Authority: KR
Inventors: 채수익; 박상규
Original assignee: 재단법인서울대학교산학협력재단
Priority date: 2008-05-15
Filing date: 2008-05-15
Publication date: 2010-01-13
Also published as: KR20090119032A

Abstract

하드웨어 운영체제 기반 및/또는 공유형 캐시 기반의 멀티 프로세서 시스템이 개시된다. 복수의 프로세서 코어를 포함하는 멀티 프로세서 시스템에 있어서, 상기 복수의 프로세서 코어와 연결되어 상기 복수의 프로세서 코어로부터의 운영체제 서비스 요청에 대해서 인터럽트 핸들링, 작업 동기화, 작업 스케쥴링, 컨텍스트 교환 및 작업 재개 동작을 하드웨어적으로 수행하는 하드웨어 운영체제; 상기 복수의 프로세서 코어를 지원하는 명령어들이 미리 저장되는 공유형 명령어 캐시; 및 상기 복수의 프로세서 코어를 지원하는 데이터들이 미리 저장되는 공유형 데이터 캐시를 포함하는 멀티 프로세서 시스템이 제공된다. 기존 방법에 비하여 다중 프로세서 코어를 운용할 때 발생하는 성능 지연 및 오버헤드가 적고, 경량화된 프로세서를 사용하고 면적 면에서 효율적이어서 모바일용 멀티미디어 시스템에 최적으로 적용될 수 있다.A hardware operating system based and / or shared cache based multiprocessor system is disclosed. A multiprocessor system comprising a plurality of processor cores, the plurality of processor cores coupled to the plurality of processor cores for interrupt handling, task synchronization, task scheduling, context switching and task resume operations for operating system service requests from the plurality of processor cores. Hardware operating system; A shared instruction cache in which instructions supporting the plurality of processor cores are stored in advance; And a shared data cache in which data supporting the plurality of processor cores is stored in advance. Compared to the conventional method, it has less performance delay and overhead when operating a multi-processor core, uses a lighter processor, and is effective in area, and thus can be optimally applied to a mobile multimedia system.

프로세서, 멀티 프로세서, 캐시, 하드웨어, 멀티 쓰레딩 Processor, Multiprocessor, Cache, Hardware, Multi Threading

Description

Multi-processor system

본 발명은 멀티 프로세서 시스템(Multi-processor system)에 관한 것으로, 보다 상세하게는 하드웨어 운영체제 기반 및/또는 공유형 캐시 기반의 멀티 프로세서 시스템에 관한 것이다. The present invention relates to a multi-processor system, and more particularly, to a hardware operating system-based and / or shared cache-based multi-processor system.

최근 휴대폰, PDA, 전자 사전 등의 디지털 정보화 기기 뿐만 아니라 TV, 냉장고 등의 백색 가전도 인공 지능화 및 정보화가 빠르게 이루어지고 있다. 이에 따라, 전자 산업 전반에서 소프트웨어의 비중이 급격히 증가하고 있으며, 이에 따라 소프트웨어를 실행하기 위한 프로세서에 대한 수요도 증가하고 있다. 디지털 정보화 기기에 내장되는 프로세서를 내장형 프로세서(Embedded Processor)라고 부르고, 내장형 프로세서에서 실행되는 소프트웨어를 내장형 소프트웨어(Embedded Software)라고 부른다.Recently, as well as digital information devices such as mobile phones, PDAs, electronic dictionaries, white appliances such as TVs, refrigerators, artificial intelligence and informatization have been rapidly made. Accordingly, the proportion of software in the electronics industry is rapidly increasing, and accordingly, the demand for a processor for executing software is also increasing. The processor embedded in a digital information device is called an embedded processor, and the software running on the embedded processor is called an embedded software.

일반적으로 디지털 정보화 기기들은 다양한 외부 자극에 빠르게 반응하여, 실시간 작업들(Real-time Task)을 동시에 실행해야 하기 때문에, 내장형 소프트웨어는 다수의 쓰레드로 구성된 멀티 쓰레드 프로그램(Multi-threaded Program)으로 구현된다.In general, because digital information devices respond quickly to various external stimuli and execute real-time tasks simultaneously, embedded software is implemented as a multi-threaded program consisting of multiple threads. .

일반적으로 내장형 프로세서가 멀티 쓰레드 프로그램을 수행하는 과정은 인터럽트 핸들링(Interrupt Handling), 작업 동기화(Job/Task Synchronization), 작업 스케쥴링(Task Scheduling), 컨텍스트 교환(Context Switching), 작업 재개(Task Resume)의 다섯 가지 세부 단계로 구성된다. 각 세부 단계를 좀 더 상세하게 설명하면 다음과 같다. In general, the process by which the embedded processor executes a multi-threaded program includes interrupt handling, job / task synchronization, task scheduling, context switching, and task resume. It consists of five detailed steps. Each detailed step is described in more detail as follows.

(1) 인터럽트 핸들링 단계에서 타이머 등의 외부 자극(인터럽트)이 프로세서에 가해지면, 프로세서는 외부 자극을 발생시킨 하드웨어 장치가 무엇인지 알아내고, 이에 대응되는 인터럽트 핸들러 루틴을 호출한다. (2) 작업 동기화는 인터럽트 핸들러 루틴이 하드웨어 장치가 요청한 내용에 따라 적절한 작업을 수행한 후, 세마포어, 배리어 등 동기화 관련 변수 (variable)들의 값을 변경함으로써 중지되었던 응용 작업을 깨워 대기 큐(Ready Queue)에 삽입하는 것을 말한다. (3) 작업 스케쥴링은 대기 큐에 저장된 작업 중에서 가장 우선 순위가 높은 작업을 선택하는 것을 말한다. (4) 컨텍스트 교환은 내장형 프로세서에서 수행중인 작업에 관한 각종 정보들을 지정된 메모리 영역에 대피시키고, 작업 스케쥴링 단계에서 선택된 작업에 관한 정보들을 특정 메모리 영역으로부터 읽어와 내장형 프로세서의 내부 레지스터들에 설정하는 것을 말한다. (5) 작업 재개는 상기 과정을 통하여 프로세서에 설정된 작업의 수행을 개시하는 것을 말한다.(1) When an external stimulus (interrupt) such as a timer is applied to the processor in the interrupt handling step, the processor finds out which hardware device caused the external stimulus and calls the corresponding interrupt handler routine. (2) Task Synchronization is a waiting queue where the interrupt handler routine wakes up an application task that has been suspended by changing the value of synchronization related variables such as semaphores and barriers after performing appropriate tasks according to the request of the hardware device. ) To insert. (3) Job scheduling means selecting the highest priority job among the jobs stored in the waiting queue. (4) Context exchange is to evacuate various information about the task being executed in the embedded processor to the designated memory area, and to read the information about the task selected in the task scheduling step from the specific memory area and to set the internal registers of the embedded processor. Say. (5) Resuming a task means starting the execution of a task set in a processor through the above process.

일반적인 시스템에서 상기 과정들은 소프트웨어적으로 수행되는데, 상기 과정들을 담당하는 소프트웨어를 운영체제라 한다. 상기 과정들을 소프트웨어적으로 수행하기 위해서는 프로세서의 연산 자원(Computing Power)을 소비해야 한다. 운영 체제 수행을 위한 연산 자원의 양을 멀티 쓰레딩 오버헤드라 한다. 앞에서 설명한 다섯 단계를 수행하기 위해 필요한 연산 자원은 운영체제의 알고리즘과 복잡도에 따라 다르지만, 수 백 싸이클에서 수 천 싸이클 정도이다. 이 정도의 오버헤드는 쓰레드 간 전환이 수 만 싸이클 이상의 단위로 이루어지는 소프트웨어의 경우에는 크게 문제되지 않는다. 예를 들어, 개인용 PC에서 많이 사용하는 윈도즈나 리눅스의 경우 멀티 쓰레딩 오버헤드는 전체 연산 자원의 5%를 넘지 않는다. In a typical system, the above processes are performed in software. The software in charge of the above processes is called an operating system. In order to perform the above processes in software, the computing power of the processor must be consumed. The amount of computational resources to run the operating system is called multithreading overhead. The computational resources required to perform the five steps described above depend on the algorithm and complexity of the operating system, but they range from hundreds to thousands of cycles. This amount of overhead isn't much of a problem for software where the transition between threads is in the tens of thousands or more cycles. For example, on Windows or Linux, which are popular on personal PCs, the multithreading overhead is no more than 5% of the total computational resources.

한편, 최근들어 개발되는 모바일 멀티미디어 시스템들은 여러 개의 프로세서들이 협업하여 하나의 응용을 실행하는 멀티 프로세서 시스템 구조를 채택하고 있고, 영상. 음성, 3D 그래픽스 등의 가속을 위한 하드웨어 가속기를 내장하고 있다. 이러한 응용에서는 성능을 극대화하기 위하여 쓰레드 전환이 매우 빈번하게 이루어지고 있다. 예를 들어, 하드웨어 가속기와 소프트웨어가 협업하는 동영상 코덱의 경우, 약 1천 ~ 5천 싸이클마다 쓰레드 전환이 발생하는데, 이런 경우 멀티 쓰레딩 오버헤드의 비중이 매우 커지게 된다. 예를 들어, 앞에서 설명한 다섯 단계를 수행하기 위해 필요한 연산 자원이 약 1천 싸이클이라고 할 때 쓰레드 전환이 3천 싸이클마다 발생한다면, 멀티 쓰레딩 오버헤드는 약 25%이다. 이는 한 시스템이 4개의 내장형 프로세서를 가지고 있는 경우를 가정할 때, 내장형 프로세서 중 하나는 멀티 쓰레딩 오버헤드를 위해 허비되는 것과 같다. 이 정도의 멀티 쓰레딩 오버헤드를 감수하는 것은 비실용적이므로, 일반적으로는 쓰레드 전환이 드물게 발생하도록 내장형 소프트웨어를 작성한다. 하지만, 이 경우 외부 자극이나 하드웨어 가속기의 요청에 기민하게 대응할 수 없기 때문에 성능 면에서 한계가 있다. Meanwhile, recently developed mobile multimedia systems adopt a multiprocessor system structure in which several processors collaborate to execute an application. It has a built-in hardware accelerator for accelerating voice and 3D graphics. In these applications, thread switching is very frequent to maximize performance. For example, a video codec with hardware accelerators and software collaboratively causes thread switching every 1,000 to 5,000 cycles, which adds a significant amount of multithreading overhead. For example, if the computational resources required to perform the five steps described above are about 1,000 cycles, and the thread transition occurs every 3,000 cycles, then the multithreading overhead is about 25%. This assumes that a system has four embedded processors, so one of the embedded processors is wasted for multithreading overhead. It's impractical to take this much multithreading overhead, so you usually write embedded software to make thread switching rare. However, in this case, there is a limit in performance because it cannot respond to the request of an external stimulus or a hardware accelerator as agile.

멀티 쓰레딩 오버헤드의 근본적인 원인은 상술한 인터럽트 핸들링, 작업 동기화, 작업 스케쥴링, 컨텍스트 교환, 작업 재개의 과정들이 소프트웨어적으로 수행되기 때문이다. 멀티 쓰레딩을 위한 다섯 가지 동작들을 하드웨어적으로 수행한다면, 멀티 쓰레딩 오버헤드를 최소화할 수 있다. 사실, 소프트웨어 작업의 일부를 하드웨어 가속기로 구현함으로써 성능을 개선한다는 것은 보편적인 전략으로, 멀티 쓰레딩 오버헤드를 줄이기 위하여 많은 프로세서들이 발표되었다. 하지만, 지금까지 발표된 프로세서 구조들은 단 하나의 프로세서 코어만을 지원하는 솔루션이거나 상술한 다섯 가지 단계 중에서 일부만을 하드웨어로 구현한 솔루션이다. 기존의 방법들의 장단점은 다음과 같다. The root cause of the multithreading overhead is that the interrupt handling, task synchronization, task scheduling, context exchange, and task resumption processes described above are performed in software. If the five operations for multithreading are performed in hardware, the multithreading overhead can be minimized. In fact, improving performance by implementing parts of software tasks as hardware accelerators is a common strategy, and many processors have been announced to reduce multithreading overhead. However, the processor architectures published so far are solutions that support only one processor core or hardware implementation of some of the above five steps. The advantages and disadvantages of the existing methods are as follows.

한 쓰레드의 임의의 순간의 상태(State)를 규정할 수 있는 정보의 집합을 해당 쓰레드의 컨텍스트라고 한다. 일반적인 RISC(Reduced Instruction Set Computer) 구조의 프로세서는 내부적으로 단 하나의 컨텍스트만을 저장할 수 있는 레지스터 세트를 가지고 있다. 이에 반하여, MIPS(Microprocessor without Interlocked Pipeline Stages) 32K, SPARC(Scalable Processor Architecture) V9 등의 프로세서들은 복수의 컨텍스트를 저장할 수 있도록 여분의 레지스터 세트를 가지고 있기 때문에, 매 싸이클마다 서로 다른 쓰레드를 실행할 수 있다. The set of information that can define a state at any moment in a thread is called the thread's context. A typical RISC (Reduced Instruction Set Computer) processor has a register set that can store only one context internally. In contrast, processors such as Microprocessors without Interlocked Pipeline Stages (MIPS) 32K and Scalable Processor Architecture (SPARC) V9 have an extra set of registers to store multiple contexts, allowing different threads to run on different cycles. .

이러한 멀티 컨텍스트 멀티 쓰레딩 프로세서(Multiple Context Multithreading Processor)는 컨텍스트 교환과 쓰레드의 선택이 하드웨어적으로 이루어지기 때문에 작업 스케쥴링 오버헤드와 컨텍스트 교환 오버헤드를 줄일 수 있다. 하지만, 일반적으로 멀티 컨텍스트 멀티 쓰레딩 프로세서들은 4개 전후의 컨텍 스트 세트를 제공하기 때문에 전체 작업의 개수가 하드웨어적으로 지원하는 컨텍스트 세트의 개수보다 큰 경우에는 소프트웨어 운영체제의 개입이 필요하므로 멀티 쓰레딩 오버헤드가 증가하게 된다. 또한, 시중에 출시된 멀티 컨텍스트 멀티 쓰레딩 프로세서들은 인터럽트 핸들링, 작업 동기화 등에 관해서는 고려하지 않기 때문에, 외부 자극 및 하드웨어 가속기로부터의 요청이 빈번한 시스템에서는 멀티 쓰레딩 오버헤드가 여전히 상당하다. 한편, 멀티 컨텍스트 멀티 쓰레딩 프로세서는 컨텍스트는 복수개를 가지고 있지만 정작 연산을 수행할 데이터패스는 단 하나만을 가지고 있으므로, 멀티 쓰레딩 오버헤드를 줄이는 것 이외에는 성능 효과를 기대할 수가 없다. 더구나, 경량 프로세서의 경우 데이터패스가 차지하는 면적이 대략 20K ~ 30K 게이트 정도임에 반하여, 하나의 컨텍스트를 위한 레지스터 세트가 차지하는 면적은 대략 10K 게이트 정도이고 4개의 컨텍스트를 위한 면적은 대략 40K 게이트 정도로 공유할 자원인 데이터 패스보다 공유할 수 없는 자원인 레지스터 세트가 훨씬 크기 때문에 면적면에서 비효율적이다. 예를 들어, 4개의 컨텍스트를 지원하는 멀티 컨텍스트 멀티 쓰레딩 프로세서의 면적은 약 60K 게이트임에 반하여, 하나의 컨텍스트를 지원하는 RISC 프로세서의 면적은 약 30K 게이트에 불과하다. 즉, 같은 면적을 사용하는 경우에는, 두 개의 프로세서를 병렬로 사용할 수 있는 RISC 프로세서가 오히려 성능 면에서 유리하다. The multiple context multithreading processor can reduce the task scheduling overhead and the context exchange overhead because the context exchange and the thread selection is made in hardware. In general, however, multi-context multi-threading processors provide four or more context sets, so if the total number of tasks is greater than the number of context sets supported by hardware, the software operating system requires intervention. Will increase. In addition, commercially available multi-context multithreading processors do not take into account interrupt handling, task synchronization, etc., so multithreading overhead is still significant in systems with frequent requests from external stimuli and hardware accelerators. On the other hand, a multi-context multi-threading processor has a plurality of contexts but only one datapath to perform an operation, and thus the performance effect can not be expected other than reducing the multi-threading overhead. Moreover, for lightweight processors, the datapath occupies about 20K to 30K gates, whereas the register set for one context occupies about 10K gates and the area for four contexts is about 40K gates. This is inefficient in terms of area because the register set, which is a resource that cannot be shared, is much larger than the data path, which is a resource to do. For example, an area of a multi-context multithreading processor supporting four contexts is about 60K gates, whereas an RISC processor supporting one context is only about 30K gates. That is, when using the same area, RISC processor that can use two processors in parallel is rather advantageous in terms of performance.

동시적 멀티 쓰레딩(Simultaneous Multithreading; 이하 SMT라 칭함) 프로세서는 복수의 컨텍스트를 지원함과 동시에 복수의 데이터 패스를 지원하는 구조의 프로세서이다. SMT 프로세서는 멀티 쓰레딩 오버헤드를 줄이는 것 뿐만 아니라, 하 나의 쓰레드의 복수의 명령어를 동시에 수행하거나, 여러 쓰레드의 명령어들을 동시에 수행함으로써 성능을 대폭 향상시킬 수 있는 구조이다. 일반인에게 널리 알려져 있는 인텔 펜티엄4 프로세서에 적용된 하이퍼 쓰레딩(Hyper-threading) 기술이 바로 SMT이다. SMT 프로세서의 단점은 면적이 지나치게 크다는 점이다. 일반적으로 SMT 프로세서는 내부적으로 대용량의 명령어 큐를 가지고 있고, 레지스터 재명명(Register re-naming)과 예약 테이블(Reservation Table) 등의 복잡한 회로를 가지고 있다. 3개의 명령어를 동시에 수행할 수 있는 SMT 프로세서의 경우, 데이터 패스의 면적은 약 70K 게이트 수준임에 반하여, 이를 운용하기 위하여 부가적으로 필요한 회로의 면적은 120K 게이트를 넘는다. 한편, 멀티 컨텍스트 멀티 쓰레딩 프로세서와 마찬가지로 SMT 프로세서도 인터럽트 핸들링과 작업 동기화에 관한 지원이 없고, 시스템 내의 작업의 개수가 하드웨어 컨텍스트의 개수를 초과하는 경우 소프트웨어 운영체제의 개입이 필요하기 때문에, 역시 멀티 쓰레딩 오버헤드를 줄이는데 있어 한계가 있다.Simultaneous Multithreading (hereinafter referred to as SMT) processor is a processor of a structure that supports a plurality of data paths while supporting a plurality of contexts. In addition to reducing multi-threading overhead, the SMT processor can significantly improve performance by executing multiple threads of instructions simultaneously or simultaneously executing instructions of multiple threads. SMT is a hyper-threading technology applied to the Intel Pentium 4 processor that is widely known to the public. The disadvantage of SMT processors is that they are too large. In general, SMT processors have a large instruction queue internally and have complex circuits such as register renaming and a reservation table. In the case of an SMT processor that can execute three instructions simultaneously, the area of the data path is about 70K gates, whereas the area of additional circuitry required to operate it exceeds 120K gates. On the other hand, like the multi-context multithreading processor, the SMT processor also has no support for interrupt handling and task synchronization, and since the number of tasks in the system exceeds the number of hardware contexts, intervention of the software operating system is required. There is a limit to reducing the head.

매우 많은 작업을 효과적으로 지원하기 위한 방법으로는 컨텍스트 교환을 하드웨어적으로 수행하는 것과 작업 스케쥴링을 하드웨어적으로 수행하는 것이 있다. There are two ways to effectively support a large number of tasks: hardware contextual exchange and task scheduling hardware.

먼저, 컨텍스트 교환에 관하여 살펴보겠다. 컨텍스트 교환을 하드웨어적으로 수행하는 프로세서로는 인텔 사의 펜티엄 프로세서가 가장 대표적이다. 펜티엄 프로세서에서 각 쓰레드는 TSS(Task State Segment)라는 메모리 영역을 할당받아 가지고 있는데, 이 쓰레드의 컨텍스트 정보들이 이 TSS에 저장되며, 특정 TSS를 가 리키는 ID를 인수로 받는 특별한 호출(call) 명령어를 이용해 컨텍스트를 교환할 수 있다. 인터럽트가 발생한 경우에는, 프로세서 레지스터 값들이 현재 실행중인 쓰레드의 TSS에 자동으로 대피하고, 인터럽트를 발생한 하드웨어에 대응되는 핸들러 루틴의 컨텍스트를 자동적으로 로드하고 실행한다. 펜티엄 프로세서는 컨텍스트 교환과 인터럽트 핸들링을 하드웨어적으로 처리함으로써 멀티 쓰레딩 오버헤드를 감소시켰으나, 작업 스케쥴링과 작업 동기화는 여전히 소프트웨어적으로 수행해야 한다는 한계가 있다. 더욱이, 작업 스케쥴링을 통해 재개할 쓰레드가 결정되었으나, 이 쓰레드의 컨텍스트가 외부 메모리에 대피되어 있는 경우, 외부 메모리로부터 컨텍스트를 읽어오는 동안 프로세서는 수행이 정지된다는 문제점이 있다. 예를 들어, 컨텍스트가 32개의 워드로 구성되어 있고, 프로세서는 800 MHz로, 외부 메모리는 200 MHz로 동작하는 경우, 최소한 128 (프로세서) 싸이클 동안 프로세서는 수행이 중단된다. First, let's look at context exchange. Intel's Pentium processor is the most representative processor that performs the context exchange. In the Pentium processor, each thread has an allocated memory area called Task State Segment (TSS), where the context information of this thread is stored in this TSS, and a special call instruction that takes an ID indicating a specific TSS. You can exchange contexts with In the event of an interrupt, the processor register values are automatically evacuated to the TSS of the currently running thread, and automatically loads and executes the context of the handler routine corresponding to the hardware that caused the interrupt. The Pentium processor reduces the multithreading overhead by hardware handling context exchange and interrupt handling, but has the limitation that task scheduling and task synchronization still need to be done in software. Moreover, although a thread to be resumed has been determined through task scheduling, when the context of the thread is evacuated to the external memory, the processor stops executing while reading the context from the external memory. For example, if the context consists of 32 words, the processor is operating at 800 MHz, and the external memory is operating at 200 MHz, the processor stops performing for at least 128 (processor) cycles.

작업 스케쥴링 방식은 운영체제의 구현에 따라 혹은 작업의 동적 특성에 따라 적응적으로 선택해야 하기 때문에, 개인용 컴퓨터와 같이 다양한 응용 소프트웨어를 수행하는 시스템에서는 소프트웨어적으로 처리하는 것이 필수일 수 있으나, 모바일 멀티미디어 핸드폰이나 PDA와 같은 내장형 시스템에서는 특정 스케쥴링 방식을 하드웨어적으로 구현하더라도 큰 문제가 없다. 하지만, 일부 발표된 작업 스케쥴링 방식을 하드웨어적으로 수행하는 하드웨어 가속기들은 운영체제의 핵심 기능들은 여전히 소프트웨어적으로 구현되고, 극히 일부 기능만 하드웨어 가속기를 통해 수행하기 때문에, 멀티 쓰레딩 오버헤드 감소 효과가 크지 않고, 프로세서와 의 통신 및 컨텍스트 교환을 위해 많은 시간이 소모되어 스케쥴링을 하드웨어적으로 수행하는 효과를 반감시킨다. 더욱이, 일부는 너무 많은 면적을 차지하기 때문에 실용성이 없다. Since the task scheduling method must be adaptively selected according to the implementation of the operating system or the dynamic characteristics of the task, it may be necessary to process the software in a system that performs various application software such as a personal computer. In a built-in system such as a PDA or a PDA, even if a specific scheduling scheme is implemented in hardware, there is no big problem. However, hardware accelerators that perform some of the announced task scheduling schemes in hardware are not implemented as a result of reducing multithreading overhead because the core functions of the operating system are still implemented in software and only a few functions are performed through hardware accelerators. For this reason, a lot of time is spent for communicating with the processor and for context exchange, thereby halving the effect of performing the scheduling in hardware. Moreover, some are not practical because they take up too much area.

최근의 시장 요구에 따라 소프트웨어의 복잡도는 지속적으로 증가하고 있으므로, 더 이상 하나의 내장형 프로세서로는 주어진 성능을 만족할 수 없다. 이 때문에, 하나의 시스템에 점점 더 많은 프로세서들이 내장되는 추세이다. 멀티 프로세서 상에서 소프트웨어를 수행할 때에는 각 프로세서마다 독립적인 소프트웨어를 수행하는 경우도 있지만, 이보다는 소프트웨어를 복수개의 작업으로 나누어 구현하고, 각 쓰레드들을 다수의 프로세서 상에 분산시켜 병렬로 수행시킬 수 있도록 멀티 쓰레드 프로그램으로 구현하는 것이 일반적이다. As the complexity of software continues to grow in response to recent market demands, a single embedded processor can no longer meet its performance. Because of this, more and more processors are embedded in one system. When running software on multiple processors, independent software may be executed for each processor, but rather, the software may be divided into a plurality of tasks, and each thread may be distributed over multiple processors and executed in parallel. Typically implemented as a threaded program.

멀티 쓰레드 프로그램을 멀티 프로세서 시스템 상에서 수행하는 경우, 앞에서 설명한 멀티 쓰레딩 오버헤드 이외에 작업 이전 오버헤드(Task Migration Overhead)가 추가로 발생한다. 작업 이전 오버헤드란, 임의의 한 프로세서 상에서 수행되던 쓰레드를 중단한 후, 이 쓰레드를 다른 프로세서에서 재개할 때 추가로 발생하는 시간 지연을 말한다. 이를 구체적으로 설명하면 다음과 같다. When a multithreaded program is executed on a multiprocessor system, a task migration overhead occurs in addition to the multithreading overhead described above. Pre-operation overhead is an additional time delay incurred when a thread that was running on one processor is suspended and then resumed on another processor. This will be described in detail as follows.

고성능 프로세서들은 외부 메모리의 통신 지연(Latency)으로 인한 성능 감소를 방지하기 위하여 명령어 캐시와 데이터 캐시를 가지고 있다. 명령어 혹은 데이터를 처음 읽는 경우에는, 어쩔 수 없이 외부 메모리로부터 정보를 가져오기 위하여 긴 시간을 기다려야 하지만, 그 후에 이 명령어 혹은 데이터가 다시 읽히는 경우에는 캐시가 직접 공급함으로써 프로세서의 성능을 개선할 수 있다. 데이터를 저장할 때에는 일단 데이터 캐시 상에만 저장하고, 외부 메모리에는 추후에 반영함으로써 프로세서의 성능을 개선할 수 있다. 한 프로세서에서 수행되다 중단되었던 작업을 다른 프로세서에서 수행하려는 경우, 이 작업을 수행하며 변경된 데이터들은 이전 프로세서의 캐시에는 저장되어 있으나 아직 외부 메모리에는 반영되지 않았을 수 있고, 이 경우, 새 프로세서에서는 올바른 데이터를 참조할 수가 없다. 따라서, 작업을 이전할 때에는 데이터의 일관성(Coherency)를 위하여 기존 프로세서의 데이터 캐시 상에만 변경된 데이터들을 외부 메모리에 강제로 반영시키는 데이터 캐시 플러시(Data Cache Flush)를 수행해야 하는데, 이 기간에는 프로세서가 다른 작업을 수행할 수 없기 때문에 장시간의 시간 지연이 불가피하다. 더욱이, 데이터 캐시는 데이터를 변경한 쓰레드가 무엇인지 알 수가 없으므로 부득이하게 모든 변경된 데이터를 외부 메모리에 반영해야 하므로, 시간 지연은 매우 심각하다고 할 수 있다. 한편, 명령어 캐시에 대해서도 이전 프로세서의 캐시에 해당 명령어가 이미 있음에도 불구하고 새 프로세서를 위해 외부 메모리로부터 명령어들을 다시 읽어 와야 하고, 시간 지연이 발생한다. 작업 이전 오버헤드는 위와 같이 작업을 이전할 때 발생하는 부가적인 시간 지연들을 말한다. High performance processors have an instruction cache and a data cache to prevent performance degradation due to communication latency in external memory. The first time you read an instruction or data, you have to wait a long time to get information from the external memory, but if the instruction or data is read again later, the cache can be supplied directly to improve processor performance. . When storing data, it is possible to improve the performance of the processor by storing it only in the data cache and later reflecting it to external memory. If you want to perform a task that was performed on one processor and was interrupted by another processor, the changed data may be stored in the cache of the previous processor, but not yet reflected in external memory, in which case the correct data on the new processor. Cannot be referenced. Therefore, when transferring work, the data cache flush should be performed to force the changed data to the external memory only in the data cache of the existing processor for the coherency of data. Long delays are inevitable because no other work can be done. Moreover, the data cache cannot know what thread changed the data, so it is inevitable that all the changed data must be reflected in the external memory, so the time delay is very serious. On the other hand, even for the instruction cache, even though the instruction already exists in the cache of the previous processor, the instruction must be read back from the external memory for a new processor, and a time delay occurs. Work transfer overhead refers to the additional time delays that occur when transferring work as above.

일부 시스템에서는 메모리의 용량은 작지만 고속으로 동작할 수 있는 캐시인 L1 캐시들은 각 프로세서마다 배치된다. L1 캐시와 외부 메모리 사이에서 프로세서와는 다소 거리를 두고 떨어져 배치되어 있으나 대용량의 캐시 메모리를 사용하는 캐시인 L2 캐시는 다수의 프로세서가 공유하는 계층적 구조를 채택함으로써 작업 이전 오버헤드를 줄이고 캐시 메모리의 면적 효율을 개선하고 있다. 하지만, L1 캐시들은 여전히 각 프로세서마다 분산 배치되어 있으므로 작업 이전 오버헤드를 완벽히 제거할 수가 없는 문제점이 있다. In some systems, L1 caches, which are small in memory but capable of operating at high speed, are placed on each processor. The L2 cache, a cache that uses a large amount of cache memory but is located somewhat apart from the processor between the L1 cache and external memory, employs a hierarchical structure shared by multiple processors, reducing pre-operation overhead and reducing cache memory. Its area efficiency is improving. However, there is a problem that L1 caches are still distributed for each processor and thus the overhead before operation cannot be completely eliminated.

한편, 각 프로세서마다 고유하게 명령어 캐시와 데이터 캐시를 가지고 있는 경우, 캐시 메모리들이 잘게 나누어져 분산된다. 특정 프로세서에 대한 작업의 할당이 동적으로 이루어지는 시스템의 경우, 모든 프로세서 별로 균등한 크기의 캐시 메모리를 할당하는 것이 합리적으로 보이지만, 작업별로 필요로 하는 캐시 메모리의 크기가 다르다는 점을 고려하면 비효율을 야기한다. 예를 들어, 4KB의 데이터 캐시를 가진 프로세서 P_A와 프로세서 P_B에 각각 2KB와 6KB의 메모리를 필요로 하는 작업 T_A와 T_B가 수행중이라고 가정해보자. 두 작업을 위해 필요한 메모리는 총 8KB이고, 데이터 캐시 메모리도 총 8KB으로 같지만, P_A의 경우에는 할당된 캐시 메모리의 반만을 활용하고, P_B의 경우에는 빈번하게 캐시 미스(Miss)가 발생하여, 시간 지연이 발생한다. 이는 캐시 메모리가 분할되어 분산되어 있기 때문에 발생하는 것으로, 이러한 효과를 캐시 메모리 단편화 효과(Cache Memory Fragmentation Effect)라 한다. On the other hand, if each processor has its own instruction cache and data cache, the cache memories are finely divided and distributed. In a system with dynamically allocating tasks for a particular processor, it seems reasonable to allocate an equal amount of cache memory for all processors, but it is inefficient given the different cache memory requirements for each task. do. For example, assume that tasks T _A and T _B are running that require 2 KB and 6 KB of memory on processor P _A and processor P _B with 4 KB of data cache, respectively. The memory required for both operations is 8KB in total, and the data cache memory is 8KB in total, but P _A utilizes only half of the allocated cache memory, and P _B frequently causes cache misses. , A time delay occurs. This occurs because cache memory is divided and distributed, and this effect is called a cache memory fragmentation effect.

각 프로세서마다 고유한 명령어 캐시와 데이터 캐시를 부여하는 기존 방식은 작업 이전 오버헤드와 캐시 메모리 단편화 현상으로 인하여 높은 성능을 기대하기가 어렵다. 이에 대한 대안은 하나의 명령어 캐시와 하나의 데이터 캐시를 여러 개의 프로세서가 공유하는 것이나, 이 경우 복수개의 프로세서를 동시에 지원할 수 있는 독창적인 캐시 구조가 필요하다. Existing methods that give each processor its own instruction and data caches are difficult to expect high performance due to pre-operation overhead and cache memory fragmentation. An alternative is to share one instruction cache and one data cache among multiple processors, but in this case, a unique cache structure is needed to support multiple processors simultaneously.

따라서, 본 발명은 인터럽트 핸들링, 작업 동기화, 작업 스케쥴링, 컨텍스트 교환 등의 과정을 하드웨어적으로 구현한 하드웨어 운영체제를 통하여 통합적으로 수행함으로써, 상기 과정에 수반되는 오버헤드를 최소화할 수 있는 멀티 프로세서 시스템을 제공한다. Accordingly, the present invention provides a multi-processor system capable of minimizing the overhead associated with the process by integrating the hardware through a hardware operating system that implements interrupt handling, task synchronization, task scheduling, context exchange, and the like. to provide.

또한, 본 발명은 운영체제가 하드웨어적으로 구현되므로, 프로세서가 소프트웨어 운영체제를 지원하기 위한 기능을 제공할 필요가 없어 보다 경량화된 프로세서를 사용할 수 있는 멀티 프로세서 시스템을 제공한다. In addition, the present invention provides a multi-processor system that can use a lighter processor because the operating system is implemented in hardware, the processor does not need to provide a function for supporting a software operating system.

또한, 본 발명은 복수의 프로세서 코어가 하나의 명령어 캐시와 데이터 캐시를 L1 캐시로서 공유할 수 있는 공유형 명령어 캐시 및 공유형 데이터 캐시를 도입함으로써 작업 이전 오버헤드와 캐시 메모리 단편화 현상을 제거함으로써 멀티 프로세서 시스템의 성능을 대폭 개선한 효과가 있는 멀티 프로세서 시스템을 제공한다. In addition, the present invention eliminates overhead and cache memory fragmentation before operation by introducing a shared instruction cache and a shared data cache, in which a plurality of processor cores can share one instruction cache and a data cache as an L1 cache. The present invention provides a multiprocessor system having an effect of significantly improving the performance of a processor system.

또한, 본 발명은 공유형 캐시를 통해 좀 더 면적 효율이 좋은 대용량 온 칩 SRAM을 복수의 프로세서 코어가 공유할 수 있고, 각 프로세서 코어마다 독점적 캐시들이 분산된 경우와 비교할 때 동일한 성능을 내기 위해 보다 적은 캐시 메모리를 사용할 수 있으므로 면적 면에서도 효율적인 멀티 프로세서 시스템을 제공한다. In addition, the present invention allows a plurality of processor cores to share a more efficient large area on-chip SRAM through a shared cache, and to achieve the same performance as compared to the case where the proprietary caches are distributed to each processor core Less cache memory is available, providing an area-efficient multiprocessor system.

또한, 본 발명은 경량화된 프로세서를 사용하고 면적 면에서 효율적이어서 모바일용 멀티미디어 시스템에 최적으로 적용될 수 있는 멀티 프로세서 시스템을 제공한다.In addition, the present invention provides a multi-processor system that can be optimally applied to a mobile multimedia system using a lightweight processor and efficient in area.

본 발명의 일 측면에 따르면, 복수의 프로세서 코어; 및 상기 복수의 프로세서 코어와 운영체제 버스를 통해 연결되며, 상기 복수의 프로세서 코어로부터의 운영체제 서비스 요청에 대해서 인터럽트 핸들링, 작업 동기화, 작업 스케쥴링, 컨텍스트 교환 및 작업 재개 동작을 하드웨어적으로 수행하는 하드웨어 운영체제를 포함하는 멀티 프로세서 시스템이 제공된다.According to one aspect of the invention, a plurality of processor cores; And a hardware operating system connected to the plurality of processor cores through an operating system bus, the hardware operating system performing hardware for interrupt handling, task synchronization, task scheduling, context exchanging, and task resume operations for operating system service requests from the plurality of processor cores. A multiprocessor system is provided that includes.

상기 하드웨어 운영체제는, 작업 정보를 저장하는 쓰레드 제어 메모리; 대기 큐에 등록된 작업 중 가장 높은 우선순위의 작업을 선택하고 세마포어 접근을 제어하는 쓰레드 관리자; 생성된 쓰레드들의 컨텍스트가 저장되는 컨텍스트 메모리; 수행을 재개할 쓰레드의 컨텍스트 또는 수행이 중단된 쓰레드의 컨텍스트가 저장되는 컨텍스트 버퍼; 상기 쓰레드 관리자에 의해 선택된 쓰레드에 대한 컨텍스트를 상기 컨텍스트 메모리로부터 읽어와 상기 컨텍스트 버퍼에 저장하거나 수행이 중단되어 상기 컨텍스트 버퍼에 저장된 쓰레드의 컨텍스트를 상기 컨텍스트 메모리에 저장하는 컨텍스트 관리자; 및 상기 운영체제 버스에 연결되며, 상기 프로세서 코어로부터의 상기 운영체제 서비스 요청에 따라 상기 쓰레드 관리자 및 상기 컨텍스트 관리자를 제어하는 주 제어기를 포함할 수 있다.The hardware operating system includes a thread control memory for storing work information; A thread manager that selects the highest priority job among the jobs registered in the waiting queue and controls semaphore access; A context memory in which contexts of created threads are stored; A context buffer in which a context of a thread to resume execution or a context of a thread in which execution is stopped is stored; A context manager that reads a context for a thread selected by the thread manager from the context memory and stores the context in the context buffer or stops execution and stores the context of a thread stored in the context buffer in the context memory; And a main controller coupled to the operating system bus and controlling the thread manager and the context manager according to the operating system service request from the processor core.

여기서 상기 쓰레드 제어 메모리는, 상기 작업이 대기 상태인 경우 제1 값을 가지고 상기 작업이 중단된 경우 제2 값을 가지는 대기 큐(Ready Queue)와, 상기 작업의 우선순위를 포함하는 작업 스케쥴링을 위한 정보와, 각 세마포어 별 세 마포어 상태 플래그와 세마포어 값을 포함하는 세마포어 접근 제어를 위한 정보와, 상기 세마포어에 대기하고 있는 작업들이 등록되는 중단 큐(Wait Queue)를 포함할 수 있다. The thread control memory may include a ready queue having a first value when the task is in a standby state and a second value when the task is suspended, and a priority of the task. It may include information, semaphore access control information including semaphore status flags and semaphore values for each semaphore, and a wait queue in which jobs waiting in the semaphore are registered.

상기 대기 큐는 하나의 비트를 매핑하는 비트 벡터일 수 있다.The wait queue may be a bit vector that maps one bit.

상기 쓰레드 제어 메모리는, 4개의 필드로 구성된 쓰레드 디스크립터(Thread Descriptor)와 세마포어 디스크립터(Semaphore Descriptor)를 포함할 수 있다. The thread control memory may include a thread descriptor and a semaphore descriptor composed of four fields.

상기 작업 스케쥴링을 위한 정보는 상기 쓰레드 디스크립터의 필드 중 상기 우선순위의 상위 비트 및 하위 비트를 나타내는 2개의 필드에 저장되고, 상기 세마포어 접근 제어를 위한 정보는 상기 세마포어 디스크립터의 필드 중 상기 세마포어 디스크립터의 사용 여부를 나타내는 플래그, 상기 세마포어의 종류를 나타내는 플래그 및 상기 세마포어 값이 저장되는 3개의 필드에 각각 저장될 수 있다. The information for scheduling the job is stored in two fields indicating the upper and lower bits of the priority among the fields of the thread descriptor, and the information for the semaphore access control uses the semaphore descriptor among the fields of the semaphore descriptor. A flag indicating whether or not, a flag indicating the type of the semaphore, and the semaphore value may be stored in three fields.

상기 중단 큐는 상기 세마포어의 디스크립터 중 나머지 1개의 필드와, 상기 쓰레드 디스크립터 중 나머지 2개의 필드를 이용하여 구현될 수 있다. The suspend queue may be implemented using the remaining one field of the semaphore descriptor and the remaining two fields of the thread descriptor.

상기 세마포어 디스크립터의 나머지 1개의 필드가 음수인 경우 한 개 이상의 작업이 상기 중단 큐에 등록되어 있음을 나타내며, 상기 쓰레드 디스크립터의 나머지 2개의 필드 중 제1 필드에는 첫 번째 작업의 ID가 저장되고, 제2 필드가 제1 값인 경우 타 작업이 추가적으로 상기 중단 큐에 등록되어 있음을 나타내고, 후속 작업의 ID가 상기 제1 필드에 저장되어 있고, 상기 제2 필드가 제2 값인 경우 상기 작업이 상기 중단 큐에 등록된 마지막 작업임을 나타낼 수 있다.If the remaining one field of the semaphore descriptor is negative, it indicates that at least one job is registered in the suspension queue, and the ID of the first job is stored in the first field of the remaining two fields of the thread descriptor. When the second field is the first value, it indicates that another job is additionally registered in the suspension queue, and when the ID of a subsequent job is stored in the first field and the second field is the second value, the job is the suspension queue. It can indicate that it is the last job registered in.

상기 쓰레드 제어 메모리는 하나의 싱글 포트(Single-Port) SRAM일 수 있다.The thread control memory may be one single-port SRAM.

본 발명의 다른 측면에 따르면, 복수의 프로세서 코어를 포함하는 멀티 프로세서 시스템에 있어서, 상기 복수의 프로세서 코어와 연결되어 상기 복수의 프로세서 코어로부터의 운영체제 서비스 요청에 대해서 인터럽트 핸들링, 작업 동기화, 작업 스케쥴링, 컨텍스트 교환 및 작업 재개 동작을 수행하는 운영체제; 상기 복수의 프로세서 코어를 지원하는 명령어들이 미리 저장되는 공유형 명령어 캐시; 및 상기 복수의 프로세서 코어를 지원하는 데이터들이 미리 저장되는 공유형 데이터 캐시를 포함하는 멀티 프로세서 시스템이 제공된다.According to another aspect of the present invention, in a multi-processor system including a plurality of processor cores, connected to the plurality of processor cores interrupt handling, task synchronization, task scheduling, for operating system service requests from the plurality of processor cores, An operating system for performing context exchange and resume operation operations; A shared instruction cache in which instructions supporting the plurality of processor cores are stored in advance; And a shared data cache in which data supporting the plurality of processor cores is stored in advance.

상기 공유형 명령어 캐시는, 내부 통신 네트워크 및 복수의 온 칩 메모리를 포함하되, 하나의 명령어 블록을 구성하는 워드들이 서로 다른 온 칩 메모리에 분산 저장되고, 상기 분산 저장된 워드들을 동시에 읽을 수 있는 인터리브 명령어 캐시 메모리(Interleaved Instruction Cache Memory); 상기 인터리브 명령어 캐시 메모리에 저장된 명령어 블록의 일부분을 태그로 저장하고 있는 태그 메모리; 상기 운영체제 서비스 요청에 따른 명령어의 일부분과 상기 태그 메모리에 저장된 태그를 비교하여 상기 요청된 명령어가 상기 인터리브 명령어 캐시 메모리에 존재하는지를 확인하는 태그 매칭(Tag Matching)을 수행하고, 상기 태그 매칭 결과 상기 요청된 명령어가 상기 인터리브 명령어 캐시 메모리에 존재하지 않는 경우 상기 명령어 블록 요청 신호를 출력하는 캐시 제어기; 및 상기 명령어 블록 요청 신호에 따라 시스템 버스를 통해 외부 메모리로부터 명령어 블록을 읽어와 상기 인터리브 명 령어 캐시 메모리에 저장하는 블록-필 제어기를 포함할 수 있다. The shared instruction cache may include an internal communication network and a plurality of on-chip memories, wherein words constituting one instruction block are distributed and stored in different on-chip memories, and the interleaved instructions may read the distributed stored words simultaneously. Cache memory (Interleaved Instruction Cache Memory); A tag memory for storing a part of an instruction block stored in the interleaved instruction cache memory as a tag; Comparing a part of the command according to the operating system service request with a tag stored in the tag memory, and performing tag matching to determine whether the requested command exists in the interleaved command cache memory, and performing the tag matching result. A cache controller configured to output the instruction block request signal when a specified instruction does not exist in the interleaved instruction cache memory; And a block-fill controller configured to read an instruction block from an external memory through a system bus according to the instruction block request signal and store the instruction block in the interleaved instruction cache memory.

상기 인터리브 명령어 캐시 메모리는 서로 다른 경로(way)에 대해서 상기 복수의 온 칩 메모리에 분산 저장되는 순서가 다를 수 있다. The interleaved instruction cache memory may have a different order in which the interleaved instruction cache memory is distributed and stored in the plurality of on-chip memories with respect to different paths.

상기 공유형 명령어 캐시는 명령어 블록에 처음 접근하려는 경우에 상기 태그 매칭을 수행하고, 상기 명령어 블록에 속한 명령어들을 순차적으로 읽어오는 경우에는 상기 태그 매칭을 건너뛰는 태그 매칭 스키핑(Tag Matching Skipping) 방식을 이용할 수 있다. The shared instruction cache performs a tag matching when first accessing an instruction block, and when a command belonging to the instruction block is sequentially read, a tag matching skipping scheme that skips the tag matching. It is available.

상기 공유형 명령어 캐시는 상기 복수의 프로세서 코어 중 하나가 분기 명령을 수행한 경우 상기 프로세서 코어가 순차적인 주소를 가진 명령어를 요청했으나 상기 명령어가 타 명령어 블록에 저장된 경우 상기 태그 매칭을 수행할 수 있다. The shared instruction cache may perform the tag matching when the processor core requests an instruction having a sequential address when one of the plurality of processor cores performs a branch instruction, but the instruction is stored in another instruction block. .

상기 공유형 명령어 캐시는, 상기 복수의 프로세서 코어마다 할당되고, 상기 할당된 프로세서 코어가 명령어를 요청하기 전에 미리 상기 명령어를 상기 인터리브 명령어 캐시로부터 읽어와 저장하고 있는 명령어 버퍼를 더 포함할 수 있다. The shared instruction cache may further include an instruction buffer that is allocated to each of the plurality of processor cores and reads and stores the instruction from the interleaved instruction cache before the allocated processor core requests the instruction.

상기 명령어 버퍼는 상기 할당된 프로세서 코어가 분기 명령을 수행하였는지 여부에 대한 분기 상태 신호(Branch Status Signal)와 상기 명령어 버퍼에 저장된 명령어들의 개수에 대한 프리-펫치 압력 신호(Pre-fetch Pressure Signal)를 상기 캐시 제어기에 전달하고, 상기 캐시 제어기는 상기 분기 상태 신호와 상기 프리-펫치 압력 신호들을 이용해 읽고자 하는 명령어를 결정한 후 상기 내부 통신 네트워크를 통해 상기 복수의 온 칩 메모리로 온 칩 메모리 선택 신호를 출력하여 상기 명령어가 상기 명령어 버퍼에 삽입되도록 할 수 있다. The instruction buffer includes a branch status signal indicating whether the allocated processor core has performed a branch instruction and a pre-fetch pressure signal indicating the number of instructions stored in the instruction buffer. The cache controller transmits an on-chip memory selection signal to the plurality of on-chip memories through the internal communication network after determining a command to be read using the branch status signal and the pre-fetch pressure signals. And output the command to be inserted into the command buffer.

상기 공유형 데이터 캐시는 히스토리 기반 계층적 태그 매칭 방식이 적용될 수 있다. The shared data cache may be a history based hierarchical tag matching scheme.

상기 공유형 데이터 캐시는, 내부 통신 네트워크 및 복수의 온 칩 메모리를 포함하되, 하나의 데이터 블록을 구성하는 워드들이 서로 다른 온 칩 메모리에 분산 저장되고, 상기 분산 저장된 워드들을 동시에 읽을 수 있는 인터리브 데이터 캐시 메모리(Interleaved Data Cache Memory); 상기 인터리브 데이터 캐시 메모리에 저장된 데이터 블록의 일부분을 태그로 저장하고 있는 태그 메모리; 상기 운영체제 서비스 요청에 따른 데이터의 일부분과 상기 태그 메모리에 저장된 태그를 비교하여 상기 요청된 데이터가 상기 인터리브 데이터 캐시 메모리에 존재하는지를 확인하는 태그 매칭을 수행하고, 상기 태그 매칭 결과 상기 요청된 데이터가 상기 인터리브 데이터 캐시 메모리에 존재하지 않는 경우 상기 데이터 블록 요청 신호를 출력하는 캐시 제어기; 및 상기 데이터 블록 요청 신호에 따라 시스템 버스를 통해 외부 메모리로부터 데이터 블록을 읽어와 상기 인터리브 데이터 캐시 메모리에 저장하는 블록-필 제어기를 포함할 수 있다. The shared data cache includes an internal communication network and a plurality of on-chip memories, wherein words constituting one data block are distributed and stored in different on-chip memories, and interleaved data capable of simultaneously reading the distributed and stored words. Cache memory (Interleaved Data Cache Memory); A tag memory for storing a part of a data block stored in the interleaved data cache memory as a tag; The tag matching is performed to compare whether the requested data exists in the interleaved data cache memory by comparing a portion of data according to the operating system service request with a tag stored in the tag memory, and the requested data is determined by the tag matching result. A cache controller for outputting the data block request signal when not in an interleaved data cache memory; And a block-fill controller configured to read a data block from an external memory through a system bus according to the data block request signal and store the data block in the interleaved data cache memory.

상기 공유형 데이터 캐시는, 소정 시간 내에 접근하였던 온 칩 메모리의 주소들의 태그를 저장하는 히스토리 버퍼; 및 상기 히스토리 버퍼에 저장된 태그와 요청된 주소를 비교하여 상기 요청된 데이터가 상기 인터리브 데이터 캐시 메모리에 존재하는지를 판별하는 히스토리 테스터를 더 포함할 수 있다. The shared data cache may include a history buffer that stores tags of addresses of an on-chip memory that have been accessed within a predetermined time; And a history tester comparing the tag stored in the history buffer with a requested address to determine whether the requested data exists in the interleaved data cache memory.

상기 캐시 제어기는 상기 히스토리 테스터에 의해 판별이 끝나지 않는 경우 상기 태그 메모리로부터 상기 태그를 읽어와 상기 태그 매칭을 수행할 수 있다. The cache controller may read the tag from the tag memory and perform the tag matching when the determination is not completed by the history tester.

상기 히스토리 버퍼는 스택 메모리 접근과 힙 메모리 접근을 구별하여 상기 태그를 저장할 수 있다. The history buffer may store the tag by distinguishing a stack memory access from a heap memory access.

본 발명의 또 다른 측면에 따르면, 복수의 프로세서 코어를 포함하는 멀티 프로세서 시스템에 있어서, 상기 복수의 프로세서 코어와 연결되어 상기 복수의 프로세서 코어로부터의 운영체제 서비스 요청에 대해서 인터럽트 핸들링, 작업 동기화, 작업 스케쥴링, 컨텍스트 교환 및 작업 재개 동작을 하드웨어적으로 수행하는 하드웨어 운영체제; 상기 복수의 프로세서 코어를 지원하는 명령어들이 미리 저장되는 공유형 명령어 캐시; 및 상기 복수의 프로세서 코어를 지원하는 데이터들이 미리 저장되는 공유형 데이터 캐시를 포함하는 멀티 프로세서 시스템이 제공된다.According to another aspect of the invention, in a multi-processor system comprising a plurality of processor cores, connected to the plurality of processor cores interrupt handling, task synchronization, task scheduling for operating system service requests from the plurality of processor cores A hardware operating system for performing context exchange and resume operation operations in hardware; A shared instruction cache in which instructions supporting the plurality of processor cores are stored in advance; And a shared data cache in which data supporting the plurality of processor cores is stored in advance.

전술한 것 외의 다른 측면, 특징, 이점이 이하의 도면, 특허청구범위 및 발명의 상세한 설명으로부터 명확해질 것이다.Other aspects, features, and advantages other than those described above will become apparent from the following drawings, claims, and detailed description of the invention.

본 발명에 따른 멀티 프로세서 시스템은 인터럽트 핸들링, 작업 동기화, 작업 스케쥴링, 컨텍스트 교환 등의 과정을 하드웨어적으로 구현한 하드웨어 운영체제를 통하여 통합적으로 수행함으로써, 상기 과정에 수반되는 오버헤드를 최소화한 효과가 있다.The multiprocessor system according to the present invention is integrated with a hardware operating system that implements a process such as interrupt handling, task synchronization, task scheduling, context exchange, and the like, thereby minimizing the overhead associated with the process. .

또한, 운영체제가 하드웨어적으로 구현되므로, 프로세서가 소프트웨어 운영체제를 지원하기 위한 기능을 제공할 필요가 없어 보다 경량화된 프로세서를 사용할 수 있다. In addition, since the operating system is implemented in hardware, the processor does not need to provide a function for supporting a software operating system, so that a lighter processor can be used.

또한, 복수의 프로세서 코어가 하나의 명령어 캐시와 데이터 캐시를 공유할 수 있는 공유형 명령어 캐시 및 공유형 데이터 캐시를 도입함으로써 작업 이전 오버헤드와 캐시 메모리 단편화 현상을 제거함으로써 멀티 프로세서 시스템의 성능을 대폭 개선한 효과가 있다. In addition, the introduction of shared instruction caches and shared data caches, where multiple processor cores can share a single instruction cache and data cache, significantly reduces the performance of multiprocessor systems by eliminating pre-operation overhead and cache memory fragmentation. It has an improved effect.

또한, 공유형 캐시를 통해 좀 더 면적 효율이 좋은 대용량 온 칩 SRAM을 복수의 프로세서 코어가 공유할 수 있고, 각 프로세서 코어마다 독점적 캐시들이 분산된 경우와 비교할 때 동일한 성능을 내기 위해 보다 적은 캐시 메모리를 사용할 수 있으므로 면적 면에서도 효율적이다.In addition, shared caches allow more efficient, large area-on-chip SRAMs to be shared by multiple processor cores, with less cache memory for the same performance as compared to the case where proprietary caches are distributed across each processor core. It is also effective in terms of area since it can be used.

또한, 경량화된 프로세서를 사용하고 면적 면에서 효율적이어서 모바일용 멀티미디어 시스템에 최적으로 적용될 수 있다. In addition, it uses a lightweight processor and is efficient in terms of area, so it can be optimally applied to a mobile multimedia system.

본 발명은 다양한 변환을 가할 수 있고 여러 가지 실시예를 가질 수 있는 바, 특정 실시예들을 도면에 예시하고 상세한 설명에 상세하게 설명하고자 한다. 그러나, 이는 본 발명을 특정한 실시 형태에 대해 한정하려는 것이 아니며, 본 발명의 사상 및 기술 범위에 포함되는 모든 변환, 균등물 내지 대체물을 포함하는 것으로 이해되어야 한다. 본 발명을 설명함에 있어서 관련된 공지 기술에 대한 구체적인 설명이 본 발명의 요지를 흐릴 수 있다고 판단되는 경우 그 상세한 설명을 생략한다.As the invention allows for various changes and numerous embodiments, particular embodiments will be illustrated in the drawings and described in detail in the written description. However, this is not intended to limit the present invention to specific embodiments, it should be understood to include all transformations, equivalents, and substitutes included in the spirit and scope of the present invention. In the following description of the present invention, if it is determined that the detailed description of the related known technology may obscure the gist of the present invention, the detailed description thereof will be omitted.

제1, 제2 등의 용어는 다양한 구성요소들을 설명하는데 사용될 수 있지만, 상기 구성요소들은 상기 용어들에 의해 한정되어서는 안 된다. 상기 용어들은 하나 의 구성요소를 다른 구성요소로부터 구별하는 목적으로만 사용된다. Terms such as first and second may be used to describe various components, but the components should not be limited by the terms. The terms are used only for the purpose of distinguishing one component from another.

본 출원에서 사용한 용어는 단지 특정한 실시예를 설명하기 위해 사용된 것으로, 본 발명을 한정하려는 의도가 아니다. 단수의 표현은 문맥상 명백하게 다르게 뜻하지 않는 한, 복수의 표현을 포함한다. 본 출원에서, "포함하다" 또는 "가지다" 등의 용어는 명세서상에 기재된 특징, 숫자, 단계, 동작, 구성요소, 부품 또는 이들을 조합한 것이 존재함을 지정하려는 것이지, 하나 또는 그 이상의 다른 특징들이나 숫자, 단계, 동작, 구성요소, 부품 또는 이들을 조합한 것들의 존재 또는 부가 가능성을 미리 배제하지 않는 것으로 이해되어야 한다.The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting of the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, the terms "comprise" or "have" are intended to indicate that there is a feature, number, step, operation, component, part, or combination thereof described in the specification, and one or more other features. It is to be understood that the present invention does not exclude the possibility of the presence or the addition of numbers, steps, operations, components, components, or a combination thereof.

이하, 본 발명의 실시예를 첨부한 도면들을 참조하여 상세히 설명하기로 한다. Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings.

본 발명에서는 프로세서 코어에서 수행되는 작업(task)은 쓰레드(thread) 또는 프로세스(process)일 수 있으며, 이하에서는 발명의 이해와 설명의 편의를 위해 프로세서 코어에서 수행되는 작업이 쓰레드인 것을 가정하여 설명하기로 한다. In the present invention, a task performed in the processor core may be a thread or a process. Hereinafter, the task performed in the processor core will be described as a thread for the convenience of understanding and explanation of the present invention. Let's do it.

도 1은 본 발명의 일 실시예에 따른 멀티 프로세서 시스템의 구조를 나타낸 개략도이다. 멀티 프로세서 시스템(100), 복수의 프로세서 코어(120a, 120b, 120c, 120d, 이하 120으로 통칭함), 하드웨어 운영체제(Hardware Operating System)(110), 공유형 명령어 캐시(130), 공유형 데이터 캐시(140), 코프로세서(10), SDRAM(20)이 도시되어 있다. 1 is a schematic diagram illustrating a structure of a multiprocessor system according to an exemplary embodiment of the present invention. Multi-processor system 100, a plurality of processor cores (120a, 120b, 120c, 120d, collectively referred to as 120), hardware operating system (110), shared instruction cache 130, shared data cache 140, coprocessor 10, SDRAM 20 are shown.

일반적인 프로세서 코어는 외부 디바이스로부터의 서비스 요청이나 이벤트 발생을 통지받기 위해 인터럽트 방식을 이용한다. 여기서, 인터럽트 방식은 외부의 이벤트가 있는 경우 프로세서 코어가 현재 수행중인 작업을 일시 중단하고, 인터럽트 핸들러 루틴으로 강제 분기하는 것을 말한다. 외부의 이벤트 이외에도 내부적인 오류, 소프트웨어 인터럽트, 디버거 트랩, 페이지 폴트 등의 서비스들도 동일한 방식으로 처리될 수 있으며, 이를 예외 처리(Exception Handling)라 한다. 인터럽트 방식에 따른 인터럽트 핸들링을 위해서는 프로세서 코어가 내부적으로 이를 지원하기 위한 제어 로직은 물론 인터럽트 모드를 위한 전용 레지스터 세트를 가지고 있어야 한다. A typical processor core uses an interrupt scheme to be notified of service requests or event occurrences from external devices. Here, the interrupt method means that the processor core suspends the work currently being executed and forcibly branches to the interrupt handler routine when there is an external event. In addition to external events, services such as internal errors, software interrupts, debugger traps, and page faults can be handled in the same way, which is called exception handling. Interrupt handling based on the interrupt scheme requires the processor core to have a dedicated set of registers for the interrupt mode as well as the control logic to support it internally.

본 발명의 일 실시예에서는, 이러한 인터럽트 핸들링을 프로세서 코어(120)가 아닌 하드웨어 운영체제(110)가 직접 담당한다. 따라서, 프로세서 코어(120)는 인터럽트를 지원하기 위한 로직 및 레지스터 세트를 가지고 있을 필요가 없으므로, 프로세서 코어 자체의 면적을 줄이는 것이 가능하다. 프로세서 코어(120)는 인터럽트 핸들링을 위한 레지스터가 제거되어 있어 일반적인 프로세서 코어에 비하여 좁은 면적을 차지한다. In one embodiment of the present invention, such interrupt handling is directly handled by the hardware operating system 110 rather than the processor core 120. Thus, since the processor core 120 does not need to have a set of logic and registers to support interrupts, it is possible to reduce the area of the processor core itself. The processor core 120 has a smaller area than the general processor core because the register for interrupt handling is removed.

예를 들어, 영국 ARM사의 ARM9 프로세서의 경우 데이터 패스의 면적이 약 15K 게이트 정도이고, 기본 컨텍스트를 위한 면적이 약 10K 게이트 정도이며, 시스템 서비스를 위해 부가적으로 제공하는 레지스터들의 면적이 약 15K 게이트 정도로, 도합 40K 게이트 정도의 면적을 차지한다. 하지만, 본 발명의 일 실시예에 따르면, 시스템 서비스는 하드웨어 운영체제(110)에 의하여 수행되기 때문에, 프로세서 코어(120)마다 별도의 지원이 필요하지 않아, 30K 게이트 수준의 경량화된 프로세서 코어(120)를 사용할 수 있다. 도 1에 도시된 것과 같이 4개의 프로세서 코 어(120)를 통합한 경우 약 40K 게이트 정도를 절감할 수 있다.For example, the ARM9 processor from the UK ARM has a data path area of about 15K gates, an area of about 10K gates for the basic context, and an additional register area of about 15K gates for system services. As such, it occupies an area of about 40K gates in total. However, according to the exemplary embodiment of the present invention, since the system service is performed by the hardware operating system 110, the processor core 120 does not need separate support, and thus, the lightweight processor core 120 having a 30K gate level is provided. Can be used. As shown in FIG. 1, when four processor cores 120 are integrated, about 40K gates can be reduced.

여기서, 멀티 프로세서 시스템(100)에는 하드웨어 운영체제(110)를 위한 별도의 면적이 요구된다. 하지만, 이는 추후 설명할 쓰레드 제어 메모리를 이용한 작업 스케쥴링 방법과 작업 동기화 방법을 적용함으로써, 멀티 프로세서 시스템(100)에서 필요로 하는 하드웨어 운영체제(110)의 면적은 40K 게이트를 넘지 않는다. 즉, 하드웨어 운영체제(110)를 위한 면적 증가분와 경량 운영체제를 활용함으로써 얻어지는 프로세서 코어(120)의 면적 감소분이 상쇄되므로, 큰 면적 증가 없이 멀티 쓰레딩 오버헤드를 최소화함으로써 성능을 높일 수 있다. 멀티 쓰레딩 오버헤드에는 작업 스케쥴링 오버헤드, 컨텍스트 교환 오버헤드, 인터럽트 핸들링 오버헤드, 작업 동기화 오버헤드가 포함된다. Here, the multiprocessor system 100 requires a separate area for the hardware operating system 110. However, this applies to the job scheduling method and the job synchronization method using the thread control memory to be described later, the area of the hardware operating system 110 required in the multi-processor system 100 does not exceed 40K gate. That is, since the area increase for the hardware operating system 110 and the area decrease of the processor core 120 obtained by utilizing the light weight operating system are offset, the performance may be improved by minimizing the multi-threading overhead without increasing the area. Multithreading overhead includes task scheduling overhead, context switching overhead, interrupt handling overhead, and task synchronization overhead.

프로세서 코어(120)는 운영체제 버스(Operating System Bus)를 통하여 하드웨어 운영체제(110)와 연결된다. 프로세서 코어(120)는 운영체제 버스를 통하여 운영체제 서비스 요청, 즉 쓰레드의 생성 및 파괴, 세마포어의 생성, 값 변경 및 파괴에 관한 요청을 하드웨어 운영체제(110)에 전달한다. 프로세서 코어(120)는 코프로세서 버스를 통해 코프로세서(10)에 연결되어 데이터를 주고 받을 수 있다. 또한, 프로세서 코어(120)는 하드웨어 운영체제(110)와 컨텍스트 버스(Context Bus)로도 연결되어 있다. 하드웨어 운영체제(110)는 컨텍스트 버스를 통하여 중단된 쓰레드와 새로 재개할 쓰레드들의 컨텍스트를 교환한다. The processor core 120 is connected to the hardware operating system 110 through an operating system bus. The processor core 120 transmits an operating system service request to the hardware operating system 110 through the operating system bus, that is, a request for creating and destroying a thread, creating a semaphore, changing a value, and destroying the semaphore. The processor core 120 may be connected to the coprocessor 10 through a coprocessor bus to exchange data. In addition, the processor core 120 is also connected to the hardware operating system 110 and the context bus (Context Bus). The hardware operating system 110 exchanges the context of suspended threads and newly resumed threads through the context bus.

본 발명의 다른 실시예에서 멀티 프로세서 시스템(100) 내에는 공유형 명령어 캐시(130) 및 공유형 데이터 캐시(140)가 구비되어 있고, 공유형 명령어 캐 시(130) 및 공유형 데이터 캐시(140)는 공유 메모리 버스를 통해 SDRAM(20)에 접근할 수 있다.In another embodiment of the present invention, the multiprocessor system 100 includes a shared instruction cache 130 and a shared data cache 140, and the shared instruction cache 130 and the shared data cache 140. Can access the SDRAM 20 via a shared memory bus.

캐시는 캐시 컨트롤러와 캐시 메모리로 구성된다. 캐시 메모리는 온 칩 SRAM 셀(cell)을 이용하여 구현될 수 있다. SRAM 셀은 용량이 클수록 밀도(density)가 높아진다. 예를 들어, 0.18um 공정에서 2KB의 SRAM 셀은 1 비트(bit)당 약 10 um² 정도의 면적을 차지함에 반하여, 8KB의 SRAM 셀은 1 비트당 6 um² 정도의 면적을 차지한다. 만일 4개의 프로세서 코어가 각각 2KB 캐시를 가지고 있는 경우, 캐시 메모리가 차지하는 면적은 약 80,000 um² 이지만, 하나의 8KB 공유형 캐시를 사용하는 경우에는 48,000 um²의 면적만을 차지한다. 즉, 캐시 메모리가 차지하는 면적이 대략 반이 된다. 일반적으로 캐시 메모리는 전체 시스템의 면적의 50 ~ 75%를 차지하는 만큼 공유형 캐시를 활용함에 따라 얻어지는 면적 감소의 효과가 매우 크다. The cache consists of a cache controller and cache memory. Cache memory may be implemented using on-chip SRAM cells. The larger the capacity of an SRAM cell, the higher the density. For example, in a 0.18um process, a 2KB SRAM cell occupies an area of about 10 um ² per bit, whereas an 8KB SRAM cell occupies an area of 6 um ² per bit. If each of the four processor cores has a 2KB cache, the cache memory occupies about 80,000 um ^2, but if one 8KB shared cache occupies only 48,000 um ² . That is, the area occupied by the cache memory is approximately half. In general, cache memory occupies 50 to 75% of the total system area, and the area reduction obtained by utilizing the shared cache is very large.

도 1에서는 컨텍스트 버스, 운영체제 버스, 코프로세서 버스, 공유 메모리 버스를 구분하여 도시하였으나, 이는 어디까지나 기능상의 구분으로 실제로는 다양한 구조의 물리적 버스로 구현할 수 있다. 일 실시예로, 공유 메모리 버스와 코프로세서 버스를 하나의 물리적 버스로 통합하고, 컨텍스트 버스와 운영체제 버스를 하나의 물리적 버스로 통합하는 구현도 가능하다. 또 다른 실시예로, 운영체제 버스, 코프로세서 버스를 하나의 물리적 버스로 통합하고 공유 메모리 버스와 컨텍스 트 버스는 별개의 물리적 버스로 구현하는 것도 가능하다. In FIG. 1, the context bus, the operating system bus, the coprocessor bus, and the shared memory bus are shown separately, but they can be implemented as physical buses having various structures. In one embodiment, it is also possible to integrate the shared memory bus and the coprocessor bus into one physical bus, and the context bus and operating system bus into a single physical bus. In another embodiment, the operating system bus and the coprocessor bus may be integrated into one physical bus, and the shared memory bus and the context bus may be implemented as separate physical buses.

본 발명의 일 실시예에 따른 멀티 프로세서 시스템은 내부에 구비된 하드웨어 운영체제(110)를 통해 멀티 쓰레딩 오버헤드를 줄일 수 있다. 이에 대해서 도 2 내지 도 5를 참조하여 상세히 후술하기로 한다. The multi-processor system according to an embodiment of the present invention can reduce the multi-threading overhead through the hardware operating system 110 provided therein. This will be described later in detail with reference to FIGS. 2 to 5.

또한, 본 발명의 다른 실시예에 따른 멀티 프로세서 시스템은 내부에 구비된 공유형 명령어 캐시(130) 및 공유형 데이터 캐시(140)를 통해 작업 이전 오버헤드를 줄이고 전체 면적을 줄일 수 있다. 이에 대해서 도 6 내지 도 10을 참조하여 상세히 후술하기로 한다. In addition, the multi-processor system according to another embodiment of the present invention can reduce the overhead before operation and reduce the total area through the shared instruction cache 130 and the shared data cache 140 provided therein. This will be described later in detail with reference to FIGS. 6 to 10.

또한, 본 발명의 또 다른 실시예에 따른 멀티 프로세서 시스템은 하드웨어 운영체제(110)와, 공유형 명령어 캐시(130) 및 공유형 데이터 캐시(140)를 포함하여 멀티 쓰레딩 오버헤드 및 작업 이전 오버헤드를 줄이고 전체 면적을 줄일 수 있다. In addition, a multiprocessor system according to another embodiment of the present invention includes a hardware operating system 110, a shared instruction cache 130 and a shared data cache 140 to reduce the multi-threading overhead and pre-work overhead Reduce the total area.

이하에서는 우선 일 실시예에 따라 하드웨어 운영체제(110)를 통해 멀티 쓰레딩 오버헤드를 줄이는 멀티 프로세서 시스템에 대해서 설명하기로 한다. Hereinafter, a multiprocessor system that reduces multithreading overhead through the hardware operating system 110 according to an embodiment will be described.

도 2는 하드웨어 운영체제의 구조를 설명하는 상세도이며, 도 3은 쓰레드 제어 메모리의 구조를 설명하는 상세도이다. 2 is a detailed diagram illustrating a structure of a hardware operating system, and FIG. 3 is a detailed diagram illustrating a structure of a thread control memory.

하드웨어 운영체제(110)는 주 제어기(210), 쓰레드 관리자(230), 컨텍스트 관리자(250), 컨텍스트 버퍼(260), 쓰레드 제어 메모리(220), 컨텍스트 메모리(240)를 포함한다. The hardware operating system 110 includes a main controller 210, a thread manager 230, a context manager 250, a context buffer 260, a thread control memory 220, and a context memory 240.

프로세서 코어(120)는 데이터패스(126), 컨텍스트 제어기(124), 레지스터 파일(122)을 포함한다. 데이터패스(126)는 운영체제 버스를 통해 하드웨어 운영체제(110)의 주 제어기(210)와 직접 연결된다. 컨텍스트 제어기(124)는 컨텍스트 버스를 통해 하드웨어 운영체제(110)의 컨텍스트 관리자(250)와 직접 연결되며, 컨텍스트 제어기(124)와 컨텍스트 관리자(250)는 프로세서 코어(120)에서 수행 중이거나 수행이 중단된 작업의 컨텍스트를 주고 받는다. 레지스터 파일(122)은 프로세서 코어(120)에서 수행되는 작업의 컨텍스트를 임시로 저장한다. Processor core 120 includes a datapath 126, a context controller 124, and a register file 122. The datapath 126 is directly connected to the main controller 210 of the hardware operating system 110 through an operating system bus. The context controller 124 is directly connected to the context manager 250 of the hardware operating system 110 through the context bus, and the context controller 124 and the context manager 250 are running or are stopped in the processor core 120. Send and receive the context of a completed task. The register file 122 temporarily stores the context of work performed in the processor core 120.

하드웨어 운영체제(110)는 인터럽트 핸들링, 작업 동기화, 작업 스케쥴링, 컨텍스트 교환, 작업 재개의 다섯 가지 세부 단계로 구성되는 멀티 쓰레드 프로그램 수행 과정을 담당한다. 각 세부 단계에 대해서는 앞서 상세히 설명하였는 바 상세한 설명은 생략하기로 한다. The hardware operating system 110 is responsible for a multi-threaded program execution process consisting of five sub-steps, interrupt handling, task synchronization, task scheduling, context exchange, and task resume. Each detailed step has been described in detail above, so a detailed description thereof will be omitted.

주 제어기(210)는 운영체제 버스에 연결되어 프로세서 코어(120)가 제기한 운영체제 서비스 요청을 받고, 이에 따라 내부의 쓰레드 관리자(230), 컨텍스트 관리자(250) 등의 다른 구성요소들을 제어한다. The main controller 210 is connected to the operating system bus to receive operating system service requests issued by the processor core 120, and thus control other components such as the internal thread manager 230 and the context manager 250.

쓰레드 제어 메모리(220)는 작업 정보인 쓰레드의 작업 상태, 우선순위, 중단 큐(Wait Queue), 세마포어 상태 및 이들의 결합 중 하나를 저장한다. The thread control memory 220 stores one of a thread's work information, a work status, a priority, a wait queue, a semaphore status, and a combination thereof.

쓰레드 관리자(230)는 대기 큐(Ready Queue)에 등록된 작업들 중에서 가장 높은 우선순위의 작업을 선택하고, 세마포어 접근을 제어한다. The thread manager 230 selects the highest priority task among the tasks registered in the waiting queue and controls semaphore access.

컨텍스트 메모리(240)는 생성된 작업들의 컨텍스트가 저장되어 있다. The context memory 240 stores the context of the generated jobs.

컨텍스트 버퍼(260)는 수행을 재개할 작업의 컨텍스트를 임시로 저장하거나, 수행이 중단된 작업의 컨텍스트가 임시로 저장된다. 컨텍스트 버퍼(260)는 온 칩 메모리로 이루어질 수 있다. The context buffer 260 temporarily stores a context of a job to resume execution, or temporarily stores a context of a job where execution is stopped. The context buffer 260 may be made of on chip memory.

컨텍스트 관리자(250)는 쓰레드 관리자(230)에 의해 선택된 작업에 대한 컨텍스트를 컨텍스트 메모리(240)로부터 읽어와 컨텍스트 버퍼(260)에 저장하거나, 역으로 수행이 중단된 작업에 상응하여 컨텍스트 버퍼(260)에 저장된 컨텍스트를 컨텍스트 메모리(240)에 저장한다. The context manager 250 reads the context for the task selected by the thread manager 230 from the context memory 240 and stores the context in the context buffer 260, or conversely, the context buffer 260 corresponding to the task which has been stopped. ) Is stored in the context memory 240.

컨텍스트 관리자(250)는 재개할 작업의 컨텍스트를 컨텍스트 메모리(240)로부터 읽어와서 컨텍스트 버퍼(260)에 저장한다. 그리고 임의의 한 프로세서 코어(120)에서 수행되는 작업이 중단되거나 이 작업의 우선순위보다 컨텍스트 버퍼(260)에 준비된 작업의 우선순위가 높은 경우 컨텍스트 버스를 통해 컨텍스트를 교환한다. 이 과정은 소프트웨어의 개입 없이 하드웨어적으로 수행되므로 매우 빠르게 작업을 전환할 수 있다. The context manager 250 reads the context of the job to be resumed from the context memory 240 and stores it in the context buffer 260. When a task performed by any one processor core 120 is interrupted or when a priority of a task prepared in the context buffer 260 is higher than a priority of the task, contexts are exchanged through the context bus. This process is done in hardware without software intervention, so you can switch jobs very quickly.

하드웨어적 수행을 위해서 하드웨어 운영체제(110)는 쓰레드들의 컨텍스트를 저장하는 메모리, 작업을 스케쥴링할 때 필요한 정보를 저장하는 메모리, 세마포어 접근을 제어하기 위한 정보를 저장하는 메모리, 중단 큐를 구현하기 위한 메모리, 대기 큐를 구현하기 위한 메모리를 필요로 한다. In order to perform the hardware, the hardware operating system 110 stores a memory for storing contexts of threads, a memory for storing information required for scheduling a task, a memory for storing information for controlling semaphore access, and a memory for implementing a hang queue. This requires memory to implement the wait queue.

우선, 컨텍스트를 저장하는 메모리(컨텍스트 메모리(240)에 해당함)는 용량이 매우 크기 때문에 이들 모두를 온 칩 메모리로 구현하는 것이 불가능한 바 외부의 메모리를 이용하여 구현한다. 반면에, 나머지 메모리들은 용량이 그리 크지 않기 때문에 온 칩 메모리로 구현하는 것이 가능하다. 본 발명의 실시예에서는 컨텍스트 메모리(240)를 제외한 나머지 메모리들을 다음과 같이 구현한다. First, since the memory for storing the context (corresponding to the context memory 240) is very large, it is impossible to implement all of them as on-chip memory. On the other hand, since the remaining memories are not very large, it is possible to implement on-chip memory. In the embodiment of the present invention, the remaining memories except the context memory 240 are implemented as follows.

우선 대기 큐는 각 쓰레드마다 하나의 비트를 매핑하는 비트 벡터(bit vector)로 구현한다. 예를 들어, 한 쓰레드가 대기 상태인 경우 이 쓰레드에 대응되는 비트는 제1 값(예를 들어, 1)을 가지고 이 쓰레드가 중단된 경우에는 제2 값(예를 들어, 0)을 가질 수 있다. 대기 큐를 비트 벡터로 구현하는 경우, 각 쓰레드 별로 1개의 레지스터(6개의 NAND 게이트의 면적에 해당함)만으로 표현할 수 있다. 일반적으로 내장형 시스템에서 쓰레드의 개수는 128개를 넘지 않으므로, 대기 큐의 면적은 1K 게이트를 넘지 않는다. First, the wait queue is implemented as a bit vector that maps one bit to each thread. For example, if a thread is in a waiting state, the bit corresponding to that thread may have a first value (for example, 1) and, if this thread is suspended, it may have a second value (for example, 0). have. When a wait queue is implemented as a bit vector, each thread may be represented by only one register (corresponding to an area of six NAND gates) for each thread. In a typical embedded system, the number of threads does not exceed 128, so the area of the wait queue does not exceed 1K gates.

작업 스케쥴링을 위해서는 각 쓰레드 별로 우선순위를 알아야 한다. 여기서, 작업 스케쥴링을 위한 정보는 각 쓰레드의 우선순위이다. 대기 큐에 등록된(즉, 해당 비트가 1로 설정되어 있는) 쓰레드 별로 우선순위 값을 읽고, 순차적으로 우선순위를 비교해나가며 우선순위 값이 가장 큰 쓰레드를 선택한다. 이러한 스케쥴링 방식에 따를 때, 대기 큐에 등록된 쓰레드의 개수가 k인 경우 스케쥴링은 k 클럭 싸이클이 필요하다. 만일 대기 큐에 등재된 쓰레드가 적은 경우, 작업 스케쥴링은 그리 오래 걸리지 않는다. 특히 하드웨어 운영체제 기반의 멀티 프로세서 시스템에서 작업 스케쥴링은 프로세서 코어(120)와는 병렬로 수행되기 때문에 시스템의 성능을 저하시키지는 않는다. 반면, 대기 큐에 등재된 쓰레드가 많은 경우, 작업 스케쥴링 자체는 장시간이 소요될 수 있다. 하지만, 대기 큐에 쓰레드가 많다는 것은 이미 대기중인 다른 쓰레드가 컨텍스트 버퍼(260)에 프리펫치되어 있을 가능성이 매우 높기 때문에 스케쥴링 시간은 성능 면에서 그리 문제되지 않는다. 이러한 스케쥴링 방식의 장점은 하드웨어로 구현할 시 우선순위를 비교하는 비교기가 단 하나만 필요하고, 쓰레드의 우선순위 정보를 순차적 접근이 가능한 온 칩 메모리(SRAM)에 저장할 수 있어 면적 면에서 유리하다는 점이다. Job scheduling requires knowing the priority of each thread. Here, the information for job scheduling is the priority of each thread. The priority value is read for each thread registered in the waiting queue (that is, the corresponding bit is set to 1), the priority is sequentially compared, and the thread having the highest priority value is selected. According to this scheduling method, if the number of threads registered in the waiting queue is k, scheduling requires k clock cycles. If there are few threads in the wait queue, job scheduling does not take long. In particular, in a hardware processor-based multiprocessor system, job scheduling is performed in parallel with the processor core 120 and thus does not deteriorate the performance of the system. On the other hand, if there are many threads in the waiting queue, job scheduling itself can take a long time. However, scheduling time is not a problem in terms of performance because the fact that there are many threads in the wait queue is very likely that other threads already waiting are prefetched in the context buffer 260. The advantage of this scheduling method is that it requires only one comparator to compare priorities when implemented in hardware, and it is advantageous in terms of area because the priority information of threads can be stored in sequential on-chip memory (SRAM).

세마포어 접근을 제어하기 위해서는 각 세마포어 별로 세마포어의 상태를 나타내는 세마포어 상태 플래그와 세마포어 값이 필요하고, 세마포어에 대기하고 있는 쓰레드들이 등록되는 중단 큐가 필요하다. 여기서, 세마포어 접근 제어를 위한 정보는 세마포어 상태 플래그 및 세마포어 값이다. In order to control semaphore access, each semaphore requires a semaphore state flag indicating the semaphore's state, a semaphore value, and a suspend queue that registers threads waiting on the semaphore. Here, information for semaphore access control is a semaphore state flag and a semaphore value.

세마포어에 관한 대표적인 서비스는 세마포어 대기(Semaphore wait)와 세마포어 포스트(Semaphore post)이다. Representative services for semaphores are semaphore wait and semaphore posts.

임의의 한 쓰레드가 어떤 세마포어에 세마포어 대기를 요청한 경우에는 다음과 같이 동작한다. 하드웨어 운영체제(110)는 세마포어 값을 읽고, 이 값이 양수이면 1 감소시키고 쓰레드의 수행을 즉시 재개시킨다. 이 값이 0 또는 음수이면 그 즉시 이 쓰레드를 중단시킨 후, 이 세마포어의 중단 큐에 삽입한 후, 해당 프로세서에는 다른 쓰레드를 할당(assign)하여 수행시킨다. When a thread requests a semaphore wait for a semaphore, it works as follows: The hardware operating system 110 reads the semaphore value, decreases it by one if this value is positive and immediately resumes execution of the thread. If this value is zero or negative, it immediately suspends this thread, inserts it into the semaphore's suspend queue, and assigns another thread to the processor.

한 쓰레드가 임의의 한 세마포어에 세마포어 포스트를 요청한 경우에는 다음과 같이 동작한다. 하드웨어 운영체제(110)는 세마포어 값을 읽고, 이 값이 0 또는 양수이면 세마포어 값을 요청된 양만큼 증가시킨다. 이 값이 음수이면 요청된 값만큼의 쓰레드를 중단 큐로부터 인출하여 대기 큐에 삽입하고, 작업 스케쥴링을 통해 여러 프로세서 코어(120)에 분산시켜 수행한다. When a thread requests a semaphore post for any one semaphore, it works as follows: The hardware operating system 110 reads the semaphore value, and if this value is zero or positive, increases the semaphore value by the requested amount. If this value is negative, the required number of threads are withdrawn from the suspend queue, inserted into the wait queue, and distributed to various processor cores 120 through job scheduling.

본 발명의 일 실시예에 따른 하드웨어 운영체제(110)는 세마포어 포스트, 세마포어 값 갱신, 중단 큐로부터 쓰레드를 인출, 쓰레드를 대기 큐에 삽입, 작업 스케쥴링, 컨텍스트 교환의 과정들이 일괄적으로 하드웨어적으로 빠르게 수행된다. Hardware operating system 110 according to an embodiment of the present invention is the process of hardware semaphore post, semaphore value update, withdraw thread from the suspend queue, insert the thread into the wait queue, job scheduling, context exchange collectively hardware fast Is performed.

여기서, 작업 스케쥴링을 위한 정보, 세마포어 접근 제어를 위한 정보, 각 세마포어 별 중단 큐는 쓰레드 제어 메모리(220)로 구현된다. 쓰레드 제어 메모리(220)는 하나의 온 칩 SRAM으로 구현될 수 있다. 도 3을 참조하면, 쓰레드 제어 메모리(220)의 구조가 도시되어 있다. 쓰레드 제어 메모리(220)에 저장된 작업 스케쥴링을 위한 정보를 쓰레드 디스크립터(Thread Descriptor)라고 부르고, 세마포어 접근 제어를 위한 정보를 세마포어 디스크립터(Semaphore Descriptor)라고 부른다. Here, information for job scheduling, information for semaphore access control, and a suspension queue for each semaphore are implemented as a thread control memory 220. The thread control memory 220 may be implemented as one on-chip SRAM. Referring to FIG. 3, the structure of the thread control memory 220 is shown. Information for job scheduling stored in the thread control memory 220 is called a thread descriptor, and information for semaphore access control is called a semaphore descriptor.

쓰레드 디스크립터(222)는 PRIO, WCNT, N, NEXT의 네 개의 필드로 구성된다. PRIO는 우선순위 값의 상위 비트이고 WCNT는 우선순위 값의 하위 비트이다. WCNT의 비트 수를 NWCNT라고 할 때, 쓰레드의 우선순위 값은 ((PRIO<<NWCNT) | WCNT)이다. 우선순위를 나타내는 정보 중에서 PRIO는 소프트웨어에 의하여 쓰레드에 부여된 우선순위 값이며 쓰레드 관리자(230)에 의해 변하지 않고, WCNT는 공평한 스케쥴링을 위하여 쓰레드 관리자(230)가 값을 변경한다. 예를 들어, 가장 높은 우선순위 값이 p이고, PRIO 값이 p인 쓰레드가 n개 있다면, 이 쓰레드들의 WCNT 값은 다음과 같이 갱신된다. 쓰레드 관리자(230)에 의하여 선택된 쓰레드의 WCNT 값은 0으로 설정하고, 나머지 쓰레드들의 WCNT 값은 1 증가시킨다. N과 NEXT 필드는 중단 큐(226)를 구현하기 위한 것으로 추후 설명하기로 한다. The thread descriptor 222 consists of four fields: PRIO, WCNT, N, and NEXT. PRIO is the upper bit of the priority value and WCNT is the lower bit of the priority value. When the number of bits in WCNT is NWCNT, the priority value of a thread is ((PRIO << NWCNT) | WCNT). Among the information representing the priority, PRIO is a priority value assigned to a thread by software and is not changed by the thread manager 230, and the WCNT is changed by the thread manager 230 for fair scheduling. For example, if there are n threads with the highest priority value p and a PRIO value p, their WCNT values are updated as follows: The WCNT value of the thread selected by the thread manager 230 is set to 0, and the WCNT value of the remaining threads is increased by one. The N and NEXT fields are for implementing the suspension queue 226, which will be described later.

세마포어 디스크립터(224)는 E, B, VALUE, NEXT의 네 개의 필드로 구성된다. E는 해당 세마포어 디스크립터가 이미 사용중인 경우 1의 값으로, 그렇지 않은 경우 0의 값으로 설정하는 플래그이다. B는 해당 세마포어가 바이너리 세마포어인 경우 1의 값으로, 카운팅 세마포어인 경우 0의 값으로 설정되는 플래그이다. VALUE는 세마포어 값이 저장되는 필드이다. E, B 및 VALUE의 세 필드에 저장되는 값을 이용해 세마포어 접근을 제어하는 방법(세마포어 대기, 세마포어 포스트 등)은 앞서 설명하였는바 생략하기로 한다. The semaphore descriptor 224 consists of four fields: E, B, VALUE, and NEXT. E is a flag that is set to a value of 1 if the semaphore descriptor is already in use and to a value of 0 otherwise. B is a flag set to a value of 1 when the semaphore is a binary semaphore and a value of 0 when a counting semaphore is used. VALUE is a field that stores semaphore values. The method of controlling semaphore access (semaphore wait, semaphore post, etc.) by using values stored in three fields of E, B, and VALUE will be omitted.

중단 큐(226)는 세마포어 디스크립터(224)의 NEXT 필드와 쓰레드 디스크립터(222)의 N, NEXT 필드를 이용해 구현된다. 만일 세마포어 디스크립터(224)의 VALUE가 음수인 경우 한 개 이상의 쓰레드가 중단 큐에 등록되어 있다는 의미로, 첫 번째 쓰레드의 ID가 세마포어 디스크립터(224)의 NEXT 필드에 저장된다. 쓰레드 디스크립터(222)의 N 필드가 1인 경우 다른 쓰레드가 추가적으로 중단 큐에 등록되어 있다는 의미로, 후속 쓰레드의 ID가 쓰레드 디스크립터(222)의 NEXT 필드에 저장된다. 만일 N 필드가 0인 경우에는, 이 쓰레드가 중단 큐에 등록된 마지막 쓰레드라는 의미이다. Suspend queue 226 is implemented using the NEXT field of semaphore descriptor 224 and the N, NEXT fields of thread descriptor 222. If the VALUE of the semaphore descriptor 224 is negative, it means that one or more threads are registered in the suspension queue, and the ID of the first thread is stored in the NEXT field of the semaphore descriptor 224. If the N field of the thread descriptor 222 is 1, it means that another thread is additionally registered in the suspension queue, and the ID of the subsequent thread is stored in the NEXT field of the thread descriptor 222. If the N field is 0, this means that this thread is the last thread registered in the suspend queue.

쓰레드 디스크립터(222)와 세마포어 디스크립터(224)를 상술한 것과 같이 표현함으로써, 작업 스케쥴링을 위한 정보, 세마포어 접근 제어를 위한 정보, 중단 큐가 단 하나의 싱글 포트(Single-Port) SRAM을 이용해 구현될 수 있고, 이로써 면적을 크게 줄일 수 있다. By expressing the thread descriptor 222 and the semaphore descriptor 224 as described above, information for job scheduling, information for semaphore access control, and a suspension queue can be implemented using only one single-port SRAM. This can greatly reduce the area.

코프로세서 버스에는 하나 이상의 코프로세서(10)가 연결되어 프로세서 코어(120)의 수행을 도울 수 있다. 코프로세서(10)는 하드웨어 운영체제(110)에 락(Lock), 릴리즈(Release), 프로세서 코어 ID의 세 가지 신호를 전달한다. 프로세 서 코어 ID 신호는 현재 코프로세서(10)에 접근한 프로세서 코어(120)의 번호를 의미하고, 락 신호는 코프로세서(10)에 요청된 작업이 아직 완료되지 않았음을 의미하며, 릴리즈 신호는 요청된 작업이 완료되었음을 의미한다. 하드웨어 운영체제(110)는 락 신호가 인가된 경우 프로세서 코어 ID 신호가 가리키는 프로세서 코어(120)가 수행하고 있는 작업을 중단시키고, 컨텍스트 버퍼(260)에 대기 중인 작업으로 컨텍스트 교환을 수행한다. 이후, 릴리즈 신호가 인가되면, 중지되었던 작업을 컨텍스트 버퍼(260)에 다시 준비하고 현재 작업이 없는 프로세서 코어 혹은 재개할 작업보다 우선순위가 낮은 작업이 수행되고 있는 프로세서 코어를 대상으로 컨텍스트 교환을 수행한다. 위와 같은 기능을 통하여 코프로세서(10)로부터의 인터럽트 요청, 작업의 중단, 그리고 작업의 재개의 과정을 하드웨어적으로 수행하여 성능을 개선할 수 있다.One or more coprocessors 10 may be connected to the coprocessor bus to assist in the performance of the processor core 120. The coprocessor 10 delivers three signals to the hardware operating system 110: lock, release, and processor core ID. The processor core ID signal means the number of processor cores 120 that are currently accessing the coprocessor 10, and the lock signal means that the work requested for the coprocessor 10 has not been completed yet. The signal means that the requested operation has been completed. When the lock signal is applied, the hardware operating system 110 stops the work performed by the processor core 120 indicated by the processor core ID signal, and performs a context exchange with the work waiting in the context buffer 260. Subsequently, when a release signal is applied, the stopped task is prepared again in the context buffer 260, and the context exchange is performed for a processor core in which no current task is performed or a processor core having a lower priority than the task to be resumed. do. Through the above functions, the interrupt request from the coprocessor 10, the interruption of the operation, and the resumption of the operation may be performed in hardware to improve performance.

도 4는 소프트웨어 운영체제에서의 멀티 쓰레드 프로그램 수행 과정 흐름도이고, 도 5는 본 발명의 일 실시예에 따른 하드웨어 운영체제에서의 멀티 쓰레드 프로그램 수행 과정 흐름도이다. 4 is a flowchart of a multi-threaded program execution process in a software operating system, and FIG. 5 is a flowchart of a multi-threaded program execution process in a hardware operating system according to an embodiment of the present invention.

도 4를 참조하면, 소프트웨어 운영체제에서의 멀티 쓰레드 프로그램 수행은 단계 S410부터 S470까지의 7단계로 구분된다. Referring to FIG. 4, the multi-threaded program execution in the software operating system is divided into seven steps from S410 to S470.

응용 작업 수행 중(단계 S400) 일단 인터럽트가 발생(단계 S410)하면 수행 중이던 응용 작업을 중단하고 단계 S420부터 단계 S470까지의 각 과정을 순차적으로 수행한 후 응용 작업을 재개한다(단계 S480). During the execution of the application task (step S400) Once an interrupt occurs (step S410), the application task that was being performed is interrupted, and each application from step S420 to step S470 is sequentially performed, and the application task is resumed (step S480).

인터럽트가 발생(단계 S410)하면, 인터럽트 핸들러를 수행한다(단계 S420). 그리고 작업 동기화를 수행하고 대기 큐에 삽입한다(단계 S430). 작업 스케쥴링을 수행(단계 S440)한 후 현재 쓰레드의 컨텍스트를 외부 메모리에 대피(단계 S450)하거나 재개할 쓰레드의 컨텍스트를 외부 메모리로부터 읽어온다(단계 S460). 단계 S450 및 단계 S460는 컨텍스트 교환 과정에 해당하며, 소프트웨어 운영체제의 경우 이 과정 중에 많은 클럭 싸이클을 소모하게 된다. When an interrupt occurs (step S410), an interrupt handler is performed (step S420). Then, work synchronization is performed and inserted into a waiting queue (step S430). After performing job scheduling (step S440), the context of the current thread is evacuated to external memory (step S450) or the context of the thread to be resumed is read from the external memory (step S460). Steps S450 and S460 correspond to a context exchange process, and the software operating system consumes many clock cycles during this process.

도 5를 참조하면, 하드웨어 운영체제에서의 멀티 쓰레드 프로그램 수행은 단계 S510부터 S570까지의 7단계로 구분된다. 하드웨어 운영체제에서의 각 단계는 이에 대응되는 소프트웨어 운영체제에서의 단계에 비해 매우 짧은 시간 내에 완수될 수 있다. Referring to FIG. 5, execution of a multi-threaded program in a hardware operating system is divided into seven steps from S510 to S570. Each step in the hardware operating system can be completed in a very short time compared to the step in the corresponding software operating system.

하드웨어 운영체제에서의 각 단계 중 컨텍스트 교환 직전까지의 단계 S510 내지 S550은 프로세서 코어와 무관하게 수행된다. 쓰레드 관리자에 의해 선택되어 재개할 쓰레드의 컨텍스트를 외부 메모리로부터 모두 읽어 들여 컨텍스트 버퍼에 저장 완료한 후 단계 S560에서 컨텍스트 교환을 개시한다. 현재 수행 중이던 쓰레드의 컨텍스트도 일단 컨텍스트 버퍼에 대피된 후 나중에 하드웨어 운영체제에 의해 외부 메모리에 저장되기 때문에, 프로세서 코어 관점에서 멀티 쓰레딩 오버헤드는 극히 적다. Steps S510 to S550 up to immediately before the context exchange among the steps in the hardware operating system are performed regardless of the processor core. After all the contexts of the thread selected and resumed by the thread manager are read out from the external memory and stored in the context buffer, the context exchange is started in step S560. From the perspective of the processor core, the multithreading overhead is extremely small, since the context of the thread that is currently running is also evacuated to the context buffer and later stored in external memory by the hardware operating system.

이와 같이 하드웨어 운영체제 기반의 멀티 프로세서 시스템에서 각 프로세서 코어들은 좀 더 많은 연산 자원을 응용 작업에 할애할 수 있으므로, 결과적으로 더욱 우수한 성능을 보일 수 있다.As described above, each processor core in a multiprocessor system based on a hardware operating system can dedicate more computational resources to an application, resulting in better performance.

이하에서는 다른 실시예에 따라 공유형 명령어 캐시(130) 및 공유형 데이터 캐시(140)를 통해 작업 이전 오버헤드를 줄이고 전체 면적을 줄이는 멀티 프로세서 시스템에 대해서 설명하기로 한다. Hereinafter, a multiprocessor system that reduces overhead before operation and reduces the total area through the shared instruction cache 130 and the shared data cache 140 according to another embodiment will be described.

프로세서 코어를 원활하게 구동하기 위해서는 앞으로 사용될 확률이 높은 명령어들을 미리 온 칩 메모리에 저장해 둔 명령어 캐시가 필요하다. 하나의 프로세서 코어만을 지원할 수 있는 일반적인 명령어 캐시는 순차적인 명령어들로 구성된 메모리 블록을 온 칩 메모리에 저장하고, 해당 메모리 블록이 캐시에 존재함을 확인할 수 있도록 명령어 주소의 일부분을 태그(Tag)로서 별도의 메모리에 저장한다. 이와 같이 명령어 블록을 저장하는 온 칩 메모리를 캐시 메모리라 하고, 태그를 저장하는 온 칩 메모리를 태그 메모리라 한다. 일반적인 명령어 캐시가 명령어를 프로세서에 제공하는 과정은 (a) 프로세서 코어가 요청한 주소의 명령어의 일부분과 태그 메모리에 저장된 태그를 비교하여, 요청된 명령어가 캐시 메모리에 존재하는지를 확인하는 태그 매칭(Tag Matching) 단계와, (b) 요청된 명령어가 캐시 메모리에 존재하지 않는 경우 외부 메모리로부터 명령어 블록을 읽어와 캐시 메모리에 저장하는 명령어 블록 적재(Instruction Blocking Loading) 단계, (c) 태그 매칭을 통해 캐시 메모리에 존재함이 확인된 경우 혹은 명령어 블록 적재가 끝난 경우, 명령어를 캐시 메모리로부터 읽어오는 명령어 읽기(Instruction Reading) 단계로 구성된다. To run the processor core smoothly, you need an instruction cache that stores instructions that are likely to be used in the on-chip memory. A typical instruction cache that can support only one processor core stores a block of memory consisting of sequential instructions in on-chip memory and uses a portion of the instruction address as a tag to ensure that the memory block exists in the cache. Store in a separate memory. In this way, the on-chip memory storing the instruction block is called a cache memory, and the on-chip memory storing the tag is called a tag memory. The process of providing an instruction to a processor by a general instruction cache includes: (a) Tag Matching, which checks whether the requested instruction is in cache memory by comparing a portion of the instruction at the address requested by the processor core to the tag stored in the tag memory. Step (b) Instruction Blocking Loading, which reads the instruction block from the external memory and stores it in the cache memory if the requested instruction does not exist in the cache memory, and (c) the cache memory through tag matching. If it is confirmed that there exists at or the instruction block is finished loading, it consists of Instruction Reading step that reads instruction from cache memory.

복수의 프로세서 코어(120)에 명령어를 동시에 지원할 때에는 복수의 프로세서 코어(120)가 캐시 메모리에 접근하려고 하는 캐시 메모리 충돌(Cache Memory Conflict) 및/또는 동시에 태그를 매칭하려는 태그 매칭 충돌(Tag Matching Conflict)이 발생한다. 이러한 충돌을 해소하기 위하여 본 발명의 다른 실시예에 따른 멀티 프로세서 시스템(100)은 공유형 명령어 캐시(130)를 포함한다. When concurrently supporting instructions for a plurality of processor cores 120, a Cache Memory Conflict for accessing cache memory by a plurality of processor cores 120 and / or a Tag Matching Conflict for matching tags simultaneously ) Occurs. In order to resolve such conflicts, the multiprocessor system 100 according to another embodiment of the present invention includes a shared instruction cache 130.

도 6은 공유형 명령어 캐시의 구조를 설명하는 상세도이며, 도 7은 메인 메모리의 데이터 배치 방식을 나타낸 도면이고, 도 8은 기존 캐시 메모리의 데이터 배치 방식을 나타낸 도면이며, 도 9는 본 발명의 일 실시예에 따른 인터리브 명령어 캐시 메모리의 데이터 배치 방식을 나타낸 도면이고, 도 10은 공유형 데이터 캐시의 구조를 설명하는 상세도이다. 6 is a detailed view illustrating a structure of a shared instruction cache, FIG. 7 is a diagram illustrating a data arrangement method of a main memory, FIG. 8 is a diagram illustrating a data arrangement method of an existing cache memory, and FIG. 9 is an embodiment of the present invention. FIG. 10 is a diagram illustrating a data arrangement method of an interleaved instruction cache memory, and FIG. 10 is a detailed diagram illustrating a structure of a shared data cache.

도 6을 참조하면, 복수의 프로세서 코어(120a, 120b, 120c, 120d, 이하 120이라 통칭함), 공유형 명령어 캐시(130), 명령어 버퍼(610), 인터리브 명령어 캐시 메모리(Interleaved Instruction Cache Memory)(620), 태그 메모리(630a, 630b, 이하 630이라 통칭함), 캐시 제어기(640), 블록-필 제어기(650)가 도시되어 있다. Referring to FIG. 6, a plurality of processor cores 120a, 120b, 120c, and 120d (hereinafter, collectively referred to as 120), a shared instruction cache 130, an instruction buffer 610, and an interleaved instruction cache memory 620, tag memories 630a, 630b (hereinafter collectively referred to as 630), cache controller 640, and block-fill controller 650 are shown.

복수의 프로세서 코어(120)에 명령어를 지원하기 위해서는 태그 매칭과 명령어 읽기가 각 프로세서 코어(120)마다 동시에 수행되어야 한다. 이러한 동시적인 명령어 읽기를 지원하기 위해서 인터리브 명령어 캐시 메모리(620)를 이용한다. 인터리브 명령어 캐시 메모리(620)는 복수의 온 칩 메모리(622a, 622b, 622c, 622d, 이하 622라 통칭함)를 이용해 캐시 메모리를 구성하되, 하나의 명령어 블록을 구성하는 명령어들이 서로 다른 온 칩 메모리에 분산되도록 저장한다. In order to support instructions for the plurality of processor cores 120, tag matching and instruction reading must be simultaneously performed for each processor core 120. Interleaved instruction cache memory 620 is used to support such simultaneous instruction reading. The interleaved instruction cache memory 620 constitutes a cache memory using a plurality of on-chip memories 622a, 622b, 622c, 622d, and 622, and 622, and the instructions constituting one instruction block are different on-chip memory. To be distributed to

인터리브 명령어 캐시 메모리(620)는 내부 통신 네트워크(624)와, 하나 이상의 온 칩 메모리(622)를 포함한다. 내부 통신 네트워크(624)는 캐시 제어기(640)로부터의 온 칩 메모리 선택 신호에 따라 선택된 온 칩 메모리(622)에 명령어를 저 장하거나 선택된 온 칩 메모리(622)에 저장된 명령어를 프로세서 코어(120)로 보낸다.Interleaved instruction cache memory 620 includes an internal communications network 624 and one or more on-chip memories 622. The internal communication network 624 may store instructions in the selected on-chip memory 622 or store instructions stored in the selected on-chip memory 622 according to the on-chip memory selection signal from the cache controller 640. Send to.

이 구조는 기존의 세트-어소시에이티브 캐시(Set-associative cache)의 메모리 구조와 유사해 보이지만, 기존의 메모리 구조에서는 하나의 명령어 블록이 하나의 온 칩 메모리에 저장된다는 점에서 다르다. 복수의 프로세서 코어(120)가 요청한 명령어들이 서로 다른 온 칩 메모리에 저장되어 있는 경우, 서로 다른 주소의 명령어를 동시에 읽어서 해당 프로세서 코어에 제공할 수 있다. 그러나 만일 두 개의 프로세서 PA와 PB가 서로 다른 메모리를 참조하는데, 우연히 이들이 같은 온 칩 메모리에 저장되어 있다면, 하나의 프로세서는 수행이 중단되어야 한다. This structure looks similar to the memory structure of a conventional set-associative cache, but differs in that one instruction block is stored in one on-chip memory. When instructions requested by the plurality of processor cores 120 are stored in different on-chip memories, instructions of different addresses may be simultaneously read and provided to the corresponding processor cores. However, if two processor PAs and PBs refer to different memories, and if they are inadvertently stored in the same on-chip memory, one processor should stop performing.

이 때, 명령어들은 순차적으로 읽힐 가능성이 매우 높기 때문에, 기존의 세트-어소시에이티브 캐시 메모리 방식에서는 한 프로세서가 명령어 블록을 모두 참조하고 다음 명령어 블록으로 넘어가거나, 분기 명령을 통해 다른 명령어 블록을 참조하게 될 때까지 계속 충돌이 발생하고 성능을 지연시킨다. 이에 반하여, 인터리브 명령어 캐시 메모리 구조에서는 최초 한 번만 시간 지연이 발생하고, 그 다음부터는 온 칩 메모리 접근 충돌이 자연 해소된다. 이에 관한 설명을 도 8을 통해 좀 더 상세하게 설명하면 다음과 같다. In this case, since instructions are very likely to be read sequentially, in a conventional set-associative cache memory scheme, one processor refers to all the instruction blocks and moves on to the next instruction block, or the branch instruction refers to another instruction block. It keeps crashing and slowing performance until you do. In contrast, in the interleaved instruction cache memory structure, a time delay occurs only once, and then on-chip memory access conflicts are naturally resolved. A detailed description thereof will be given below with reference to FIG. 8.

도 8을 참조하면, 기존의 세트-어소시에이티브 캐시(Set-associative cache)가 도시되어 있다. 기존의 세트-어소시에이티브 캐시는 동시적인 데이터 읽기/쓰기(Read/Write)가 가능하도록 여러 개의 메모리 셀들(810, 820, 830, 840)을 병렬 연결하여 사용한다. 여기서, 각 메모리 셀(810, 820, 830, 840)은 SRAM으로 구현된다. Referring to FIG. 8, an existing Set-associative cache is shown. The existing set-associative cache uses a plurality of memory cells 810, 820, 830, and 840 in parallel to simultaneously read / write data. Here, each of the memory cells 810, 820, 830, and 840 is implemented with SRAM.

하지만, 도 9를 참조하면, 본 발명의 실시예에 따른 인터리브 명령어 캐시 메모리는 도 8에 도시된 세트-어소시에이티브 캐시와 같이 여러 개의 메모리 셀들(910, 920, 930, 940)이 병렬 연결되어 있다. 여기서, 각 메모리 셀(910, 920, 930, 940)은 SRAM으로 구현된다. 하지만, 복수의 프로세서 코어(120)의 동시적인 접근으로 인한 충돌을 최소화하기 위하여 각 메모리 셀(910, 920, 930, 940)에 데이터를 배치하는 방식이 다르다. However, referring to FIG. 9, in the interleaved instruction cache memory according to an exemplary embodiment of the present invention, several memory cells 910, 920, 930, and 940 are connected in parallel, such as the set-associative cache illustrated in FIG. 8. have. Here, each of the memory cells 910, 920, 930, and 940 is implemented with SRAM. However, in order to minimize the collision due to the simultaneous access of the plurality of processor cores 120, data is arranged in each of the memory cells 910, 920, 930, and 940.

메인 메모리(700)의 데이터들은 N개의 워드로 구성된 블록단위로 캐시 메모리에 저장된다. 도 7 내지 도 9에서는 하나의 블록이 4개의 워드로 구성된다고 가정한다. 또한, 메인 메모리(700)의 A, B, C, D 블록들은 각각 Way0, Way1, Way2, Way3에 캐싱(caching)되어 있다고 가정한다. The data of the main memory 700 are stored in the cache memory in blocks of N words. 7 to 9 assume that one block is composed of four words. In addition, it is assumed that the A, B, C, and D blocks of the main memory 700 are cached in Way0, Way1, Way2, and Way3, respectively.

일반적인 세트-어소시에이티브 캐시 구조에서는 각 경로(way) 별로 하나의 SRAM을 할당한다. 이는 SRAM의 타이밍 특성 때문이다. 태그 메모리와 캐시 메모리 구현에 이용되는 동기식 SRAM(Synchronous SRAM)은 주소(address)를 인가하면, 다음 클록 싸이클에 데이터가 출력된다. 따라서, 태그 매칭(Tag matching)은 두 싸이클에 걸쳐 이루어진다. 첫 번째 사이클에는 모든 태그 메모리에 주소를 인가하고, 다음 싸이클에 태그 메모리로부터 태그 정보들을 받아 이 중에서 히트(hit)인 경로를 찾아낸다. 만일, 태그 매칭이 완료되어 데이터가 저장되어 있는 경로를 알아낸 다음 캐시 메모리에 주소를 인가하고, 그 다음 싸이클에 데이터를 읽으면 총 세 클록 싸이클이 소요되며, 성능이 크게 떨어진다. 따라서, 대부분의 캐시들은 태그 매 칭의 첫 번째 싸이클에 모든 캐시 메모리에도 주소를 인가하고, 그 다음 싸이클에 데이터가 저장된 경로를 알아내면, 해당 캐시 메모리가 출력한 데이터를 프로세서 코어에 제공하도록 한다. 즉, 모든 경로들이 동시에 읽혀야 하므로, 각 경로마다 하나의 SRAM을 할당한다(도 8 참조).In a typical set-associative cache structure, one SRAM is allocated to each path. This is due to the timing characteristics of the SRAM. Synchronous SRAM (SRAM) used in tag memory and cache memory implementations, when an address is applied, data is output to the next clock cycle. Thus, tag matching occurs over two cycles. In the first cycle, all tag memories are addressed, and the next cycle receives tag information from the tag memory and finds a hit path among them. If the tag matching is completed, the path where the data is stored is found, the address is applied to the cache memory, and the data is read in the next cycle, which takes a total of three clock cycles. Therefore, most caches also address all cache memory in the first cycle of tag matching, and then find out the path where the data is stored in the next cycle, and then provide the processor core with the data output from that cache memory. That is, since all paths must be read at the same time, one SRAM is allocated to each path (see FIG. 8).

하지만, 이러한 방식은 복수의 프로세서 코어를 지원하는 공유형 캐시에는 적합하지 않다. 일반적으로 캐시, 특히 명령어 캐시는 순차적으로 데이터를 요청하는 비율이 높다. 예를 들어, 한 프로세서 코어가 Way0에 임의의 순간에 접근하였다면, 그 다음 싸이클에도 Way0로부터 그 데이터를 읽어올 가능성이 매우 높다. 제1 프로세서 코어와 제2 프로세서 코어가 동시에 Way0에 접근한다고 가정하면, 하나의 프로세서 코어는 데이터를 받지 못해 지연(stall)되어야 한다. 하지만, 그 다음 싸이클에도 제1 프로세서 코어와 제2 프로세서 코어는 계속 충돌이 나게 되고, 심각한 성능 저하를 유발하는 문제점이 있다. However, this approach is not suitable for shared caches that support multiple processor cores. In general, caches, especially instruction caches, have a high rate of requesting data sequentially. For example, if a processor core approaches Way0 at any moment, it is very likely that the next cycle will read that data from Way0. Assuming that the first processor core and the second processor core approach Way0 at the same time, one processor core must not stall because it does not receive data. However, in the next cycle, the first processor core and the second processor core continue to collide with each other, causing a serious performance degradation.

이를 해소하기 위하여, 본 발명의 일 실시예에서는 도 9에 도시된 것과 같이 데이터를 배치한다. 우선, 블록을 구성하는 각 워드들은 서로 다른 메모리 셀에 배치한다. 그리고 캐시 메모리가 4개의 메모리 셀(910, 920, 930. 940), 즉 M0, M1, M2, M3로 구성되어 있고, 네 개의 경로, 즉 Way0, Way1, Way2, Way3가 존재한다면, Way0를 구성하는 워드들은 순서대로 M0, M1, M2, M3에, Way1를 구성하는 워드들은 M1, M2, M3, M0에, Way2를 구성하는 워드들은 순서대로 M2, M3, M0, M1에, Way3를 구성하는 워드들은 M3, M0, M1, M2에 배치한다. 이러한 배치에 의하면, 첫 번째 싸이클에는 제1 프로세서 코어와 제2 프로세서 코어가 충돌나서 한 프로세서 코어가 지연되더라도, 그 다음 싸이클에는 충돌이 해소된다. To solve this, one embodiment of the present invention arranges data as shown in FIG. First, each word constituting the block is placed in different memory cells. And if the cache memory is composed of four memory cells (910, 920, 930, 940), that is, M0, M1, M2, M3, and four paths, that is, Way0, Way1, Way2, Way3, Way0 is configured. Words are M0, M1, M2, M3 in order, Words constituting Way1 are M1, M2, M3, M0, Words constituting Way2 are M2, M3, M0, M1, Way3 Words are placed in M3, M0, M1, M2. With this arrangement, even if one processor core is delayed because the first processor core and the second processor core collide in the first cycle, the collision is resolved in the next cycle.

또한, 공유형 명령어 캐시(130)는 동시적인 태그 매칭을 지원해야 한다. 그러나, 태그 메모리(630)의 경우에는 상술한 인터리브 구조로 구현할 수가 없다. 그리고 동시적인 태그 매칭을 지원하기 위하여 복수의 태그 메모리 사본을 관리하는 것은 면적 면에서 비효율적이다. In addition, the shared instruction cache 130 must support simultaneous tag matching. However, the tag memory 630 may not be implemented in the interleaved structure described above. And managing multiple tag memory copies to support simultaneous tag matching is inefficient in area.

명령어의 경우 순차적으로 접근되는 경향이 매우 강하다. 따라서, 공유형 명령어 캐시(130)는 한 명령어 블록에 처음 접근하려는 경우에만 태그 매칭을 수행하고, 해당 명령어 블록에 속한 명령어들을 순차적으로 읽어오는 경우에는 태그 매칭을 건너뛰는 태그 매칭 스키핑(Tag Matching Skipping) 방식을 이용한다. 태그 매칭 스키핑 방식에 의할 경우 태그 매칭의 빈도를 줄일 수 있는 효과가 있다. Instructions tend to be accessed sequentially. Therefore, the shared instruction cache 130 performs tag matching only when first accessing an instruction block, and tag matching skipping skips tag matching when sequentially reading instructions belonging to the instruction block. ) Method. The tag matching skipping method can reduce the frequency of tag matching.

공유형 명령어 캐시(130)는 프로세서 코어(120)가 분기 명령을 수행한 경우, 프로세서 코어(120)가 순차적인 주소를 가진 명령어를 요청했으나 해당 명령어가 다른 명령어 블록에 저장되어 있는 경우에만 태그 매칭을 수행한다. When the processor core 120 executes a branch instruction, the shared instruction cache 130 matches a tag only when the processor core 120 requests an instruction having a sequential address but the instruction is stored in another instruction block. Do this.

또한, 공유형 명령어 캐시(130)는 각 프로세서 코어 별로 명령어 버퍼(610)를 더 포함할 수 있다. 인터리브 명령어 캐시 메모리(620)를 이용하는 경우 복수의 프로세서 코어가 동일한 온 칩 메모리에 접근한다면 단 하나의 프로세서 코어에만 명령어를 공급할 수 있고 나머지 프로세서 코어들은 명령어를 공급받지 못해 수행이 중단되므로 성능이 저하된다. 또한, 태그 매칭 스키핑 방식은 서로 다른 프로세서 코어에서 분기 명령을 동시에 수행하는 경우에는, 단 하나의 프로세서 코어에 대해서만 태그 매칭을 수행할 수 있고, 나머지 프로세서 코어들은 태그 매칭을 하 지 못하여 수행이 중단되므로 성능이 저하된다. 명령어 버퍼(610)는 이러한 문제점들을 해결한다. In addition, the shared instruction cache 130 may further include an instruction buffer 610 for each processor core. In the case of using the interleaved instruction cache memory 620, if a plurality of processor cores access the same on-chip memory, only one processor core can supply an instruction, and the remaining processor cores are not supplied with the instruction, and thus performance is degraded. . In addition, when the tag matching skipping method simultaneously executes branch instructions in different processor cores, tag matching may be performed on only one processor core, and the remaining processor cores may not perform tag matching, and thus execution may be stopped. Performance is degraded. The instruction buffer 610 solves these problems.

공유형 명령어 캐시(130)는 각 프로세서 코어(120)가 명령어를 요청하기 전에 미리 명령어를 인터리브 명령어 캐시 메모리(620)로부터 읽어와 명령어 버퍼(610)에 저장하고 있다. 그리고 프로세서 코어(120)가 명령어를 요청하되 해당 명령어가 명령어 버퍼(610)에 저장되어 있는 경우, 명령어 버퍼(610)에서 제공한다. 명령어 버퍼(610)를 내장한 경우와 그렇지 않은 경우를 비교할 때 캐시 메모리의 충돌과 태그 매칭 충돌은 같은 횟수로 발생하지만, 명령어 버퍼(610)를 내장한 경우에는 충돌이 발생하더라도 명령어 버퍼(610)에서 프로세서 코어(120)로 명령어를 제공할 수 있으므로 성능이 개선되는 효과가 있다. The shared instruction cache 130 reads instructions from the interleaved instruction cache memory 620 in advance and stores them in the instruction buffer 610 before each processor core 120 requests the instructions. When the processor core 120 requests an instruction but the corresponding instruction is stored in the instruction buffer 610, the processor core 120 provides the instruction. When comparing the case where the instruction buffer 610 is embedded with the case where the instruction buffer 610 is not included, the collision of the cache memory and the tag matching collision occur the same number of times. However, when the instruction buffer 610 is built, the instruction buffer 610 even though the collision occurs. In order to provide instructions to the processor core 120, the performance is improved.

명령어 버퍼(610)는 분기 분별기(Branch Checker)(612)를 포함할 수 있다. 분기 분별기(612)를 통하여 프로세서 코어(120)가 분기 명령을 수행하였는지를 알아낸다. 만일 프로세서 코어(120)가 분기 명령을 수행한 경우, 명령어 버퍼(610)를 비우고 캐시 제어기(640)에 분기 상태 신호(Branch Status Signal)을 통해 알린다. 명령어 버퍼(610)를 늘리는 경우 좀 더 많은 명령어들이 미리 펫치될 수 있으므로 성능이 개선된다. 하지만 이를 위해서는 온 칩 메모리(622)와 명령어 버퍼(610) 간 대역폭(Bandwidth)을 명령어 버퍼(610)와 프로세서 코어(120) 간 대역폭보다 증가시켜야 한다. 도 6에는 프로세서 코어(120)와 명령어 버퍼(610) 간에는 32-bit 회선으로 연결되고 명령어 버퍼(610)와 온 칩 메모리(622) 간에는 64-bit 회선으로 연결된 경우를 도시하고 있으나, 이는 일 실시예에 불과하며 다른 구현도 가능하 다. The command buffer 610 may include a branch checker 612. The branch classifier 612 finds out whether the processor core 120 has executed a branch instruction. If the processor core 120 performs the branch instruction, the instruction buffer 610 is cleared and the cache controller 640 is notified through the branch status signal. Increasing the instruction buffer 610 may improve performance since more instructions may be prefetched. However, to this end, the bandwidth between the on-chip memory 622 and the instruction buffer 610 must be increased than the bandwidth between the instruction buffer 610 and the processor core 120. FIG. 6 illustrates a case in which the processor core 120 and the instruction buffer 610 are connected by a 32-bit line and the instruction buffer 610 and the on-chip memory 622 are connected by a 64-bit line. This is only an example and other implementations are possible.

캐시 제어기(640)는 캐시 메모리 충돌과 태그 매칭 충돌이 발생한 경우, 명령어 펫치 요청 중에서 명령어 버퍼(610)에 저장된 명령어의 개수가 적은 요청들을 우선적으로 지원한다. 이를 위하여 명령어 버퍼(610)들은 저장된 명령어들의 개수를 적절하게 인코딩하여 프리-펫치 압력 신호(Pre-fetch Pressure Signal)를 통해 캐시 제어기(640)에 전달한다. 캐시 제어기(640)는 분기 상태 신호와 프리-펫치 압력 신호들을 이용해 어떤 프로세서 코어(120)를 위해 명령어를 읽을 것인지를 결정한 후, 각 온 칩 메모리(622) 별로 온 칩 메모리 선택 신호를 보낸다. 이에 따라 각 온 칩 메모리(622)로부터 적절한 명령어가 읽혀 해당 명령어 버퍼(610)에 삽입된다. 캐시 제어기(640)는 태그 매칭도 수행하는데 동시에 여러 개의 주소에 대한 태그 매칭이 필요한 경우 명령어 버퍼(610)에 명령어가 전혀 없는 순차적인 명령어 요청에 대한 태그 매칭을 가장 먼저 수행하고, 분기에 의한 명령어 요청에 대한 태그 매칭을 그 다음으로 수행하며, 그 외의 요청에 대한 태그 매칭을 맨 나중에 수행한다. When a cache memory conflict and a tag matching conflict occur, the cache controller 640 preferentially supports requests having a small number of instructions stored in the instruction buffer 610 among instruction fetch requests. To this end, the instruction buffers 610 properly encode the number of stored instructions and deliver them to the cache controller 640 via a pre-fetch pressure signal. The cache controller 640 uses the branch status signal and the pre-fetch pressure signals to determine which processor core 120 to read an instruction from, and then sends an on-chip memory selection signal for each on-chip memory 622. Accordingly, an appropriate command is read from each on-chip memory 622 and inserted into the corresponding command buffer 610. The cache controller 640 also performs tag matching, and when tag matching for multiple addresses is required, tag matching is performed first for a sequential instruction request having no instructions in the instruction buffer 610 at all. Tag matching for the request is then performed, and tag matching for other requests is performed last.

캐시 제어기(640)는 태그 매칭 결과 요청된 명령어가 인터리브 명령어 캐시 메모리(620)에 존재하지 않음을 확인한 경우, 명령어 블록 요청 신호를 통해 블록-필 제어기(650)에 이 사실을 통지한다. 블록-필 제어기(650)는 시스템 버스를 통해 외부 메모리로부터 명령어 블록을 읽어와 인터리브 명령어 캐시 메모리(620)에 저장한다.If the cache controller 640 determines that the requested instruction does not exist in the interleaved instruction cache memory 620 as a result of the tag matching, it notifies the block-fill controller 650 via the instruction block request signal. The block-fill controller 650 reads the instruction block from the external memory via the system bus and stores it in the interleaved instruction cache memory 620.

프로세서 코어(120)의 원활한 구동을 위해서는 명령어 캐시와 함께 앞으로 읽거나 쓸 확률이 높은 데이터들을 온 칩 메모리에 저장해 둔 데이터 캐시가 필수적이다. 공유형 데이터 캐시(140)는 복수의 프로세서 코어(120)의 데이터 읽기 또는 쓰기 요청을 동시에 지원할 수 있는 데이터 캐시이다. 도 10을 참조하면, 공유형 데이터 캐시(140), 복수의 프로세서 코어(120), 인터리브 데이터 캐시 메모리(1020), 캐시 제어기(1040), 블록-필 제어기(1050), 히스토리 버퍼(1010), 히스토리 테스터(1015), 태그 메모리(1030a, 1030b, 1030c, 1030d, 이하 1030이라 통칭함)가 도시되어 있다. In order to smoothly operate the processor core 120, a data cache having stored therein an on-chip memory having high probability of reading or writing the data together with the instruction cache is essential. The shared data cache 140 is a data cache capable of simultaneously supporting data read or write requests of the plurality of processor cores 120. Referring to FIG. 10, a shared data cache 140, a plurality of processor cores 120, an interleaved data cache memory 1020, a cache controller 1040, a block-fill controller 1050, a history buffer 1010, History tester 1015 and tag memories 1030a, 1030b, 1030c, 1030d, hereafter referred to as 1030, are shown.

데이터 접근은 크게 두 가지 유형으로 분류할 수 있다. 한 가지는 스택 메모리(Stack Memory) 접근이고 다른 하나는 힙 메모리(Heap Memory) 접근이다. Data access can be divided into two types. One is stack memory access and the other is heap memory access.

스택 메모리에는 지역 변수, 프로시저(혹은 함수) 호출시 서브-프로그램에 전달하는 인수들, 그리고 컴파일러가 레지스터 값을 대피시키기 위한 임시 메모리로 사용된다. 스택 메모리는 각 프로세서 코어(혹은 프로세서 코어 상에서 수행되는 작업)마다 고유한 영역이 지정된다. 일반적으로 약 50 ~ 70%의 데이터 접근이 스택 메모리 접근이다. Stack memory is used for local variables, arguments passed to sub-programs when calling procedures (or functions), and temporary memory for the compiler to evacuate register values. Stack memory has its own area for each processor core (or task performed on the processor core). Typically, about 50-70% of data accesses are stack memory accesses.

힙 메모리는 소프트웨어에 의하여 동적으로 할당되는 변수 혹은 배열 변수가 배치되는 영역이다. 한 소프트웨어 프로그램(혹은 한 작업)은 매우 많은 동적 변수들을 운용하지만, 짧은 순간에는 소수의 동적 변수들이 자주 이용된다. Heap memory is the area where variables or array variables that are dynamically allocated by software are placed. A software program (or a task) operates very many dynamic variables, but in a short time a few dynamic variables are often used.

즉, 스택 메모리의 경우 프로세서 코어의 스택 포인터 전후의 주소로 접근되는 빈도가 매우 높고, 힙 메모리의 경우에도 최근에 접근되었던 동적 변수가 다시 접근될 확률이 높다. 이를 고려하여 본 발명의 일 실시예에 따른 공유형 데이터 캐시(140)는 히스토리 기반 계층적 태그 매칭 방식를 적용한다. That is, in the case of the stack memory, the frequency of accessing the address before and after the stack pointer of the processor core is very high, and in the case of the heap memory, the recently accessed dynamic variable is more likely to be accessed again. In consideration of this, the shared data cache 140 according to an embodiment of the present invention applies a history-based hierarchical tag matching scheme.

히스토리 기반 계층적 태그 매칭 방식은 다음과 같다. History-based hierarchical tag matching is as follows.

첫 번째 단계로, 최근에 접근되었던 스택 메모리와 힙 메모리의 태그를 히스토리 버퍼(1010)에 저장한다. 두 번째 단계로, 새로운 데이터 접근 요청이 있는 경우 히스토리 테스터(1015)는 일단 히스토리 버퍼(1010)에 저장된 태그와 요청된 주소를 비교하여 해당 데이터가 인터리브 데이터 캐시 메모리(1020)에 존재하는지를 확인한다. 세 번째 단계로, 만일 두 번째 단계에서 아직 판별이 끝나지 않은 경우 태그 메모리(1030)에서 태그를 읽어와 태그 매칭을 시도한다. 히스토리 버퍼(1010)는 각 프로세서 코어(120)마다 하나씩 존재하고, 상당수의 데이터 접근 요청이 두 번째 단계에서 캐시-히트임이 밝혀지기 때문에 태그 매칭 충돌이 현저하게 감소하는 효과가 있다. In the first step, the recently accessed tags of stack memory and heap memory are stored in the history buffer 1010. In a second step, when there is a new data access request, the history tester 1015 first compares the tag stored in the history buffer 1010 with the requested address and checks whether the data exists in the interleaved data cache memory 1020. In a third step, if the determination is not yet completed in the second step, the tag memory 1030 reads the tag and attempts tag matching. There is one history buffer 1010 for each processor core 120, and the tag matching collision is significantly reduced because a large number of data access requests are found to be cache-hit in the second stage.

히스토리 버퍼(1010)는 최근에 접근된 주소들의 태그가 다수 저장되어 있다. 다른 실시예에서는 응용에 따라 스택 메모리 접근의 빈도와 힙 메모리 접근의 빈도가 다르기 때문에, 히스토리 버퍼(1010)는 스택 메모리 접근과 힙 메모리 접근을 구별하여, 각각 다른 개수의 태그를 저장할 수도 있다. 데이터 접근 중에서 스택 메모리 접근과 힙 메모리 접근을 판별하는 방법 중 하나는 현재의 스택 포인터로부터 일정 범위 내의 데이터 접근은 스택 메모리 접근으로 간주하고, 그 이외의 것은 힙 메모리 접근으로 간주하면 면적 면에서 큰 오버헤드 없이 구현할 수 있다. The history buffer 1010 stores a plurality of tags of recently accessed addresses. In another embodiment, since the stack memory access and the heap memory access differ according to the application, the history buffer 1010 may distinguish the stack memory access from the heap memory access and store different numbers of tags. One of the methods of determining stack memory access and heap memory access among data accesses is that the data access within a range from the current stack pointer is regarded as stack memory access, and the other is considered as heap memory access. It can be implemented without a head.

공유형 데이터 캐시(140)의 다른 구조는 공유형 명령어 캐시(130)와 기능이 유사한 바 상세한 설명은 생략한다. Other structures of the shared data cache 140 are similar in function to the shared instruction cache 130, and thus detailed description thereof will be omitted.

한편, 공유형 명령어 캐시(130)의 구조가 도 6에 도시되어 있다. 2-경로 방식 세트 어소시에이티브 캐시의 예를 나타내고 있어 2개의 태그 메모리가 캐시 제어기(1040)에 연결되어 있다. 이는 하나의 실시예에 불과하며, 만일 공유형 명령어 캐시(130)가 4-경로 방식 세트 어소시에이티브 캐시인 경우에는 도 10에 도시된 공유형 데이터 캐시(140)와 같이 4개의 태그 메모리가 캐시 제어기(1040)에 연결될 수 있다. Meanwhile, the structure of the shared instruction cache 130 is illustrated in FIG. 6. An example of a two-path set associative cache is shown where two tag memories are coupled to cache controller 1040. This is just one embodiment, and if the shared instruction cache 130 is a four-path set associative cache, four tag memories are cached, such as the shared data cache 140 shown in FIG. May be connected to the controller 1040.

또한, 본 발명의 또 다른 실시예에 따르면, 멀티 프로세서 시스템은 하드웨어 운영체제, 공유형 명령어 캐시와 공유형 데이터 캐시를 모두 포함한다. 하드웨어 운영체제를 통해 멀티 쓰레딩 오버헤드를 줄이고, 공유형 명령어 캐시와 공유형 데이터 캐시를 통해 작업 이전 오버헤드와 시스템 면적을 줄일 수 있다. Further, according to another embodiment of the present invention, a multiprocessor system includes both a hardware operating system, a shared instruction cache, and a shared data cache. The hardware operating system reduces multithreading overhead, while shared instruction caches and shared data caches reduce pre-operation overhead and system area.

상기에서는 본 발명의 바람직한 실시예를 참조하여 설명하였지만, 해당 기술 분야에서 통상의 지식을 가진 자라면 하기의 특허 청구의 범위에 기재된 본 발명의 사상 및 영역으로부터 벗어나지 않는 범위 내에서 본 발명을 다양하게 수정 및 변경시킬 수 있음을 이해할 수 있을 것이다.Although the above has been described with reference to a preferred embodiment of the present invention, those skilled in the art to which the present invention pertains without departing from the spirit and scope of the present invention as set forth in the claims below It will be appreciated that modifications and variations can be made.

도 1은 본 발명의 일 실시예에 따른 멀티 프로세서 시스템의 구조를 나타낸 개략도.1 is a schematic diagram illustrating a structure of a multiprocessor system according to an embodiment of the present invention.

도 2는 하드웨어 운영체제의 구조를 설명하는 상세도.2 is a detailed diagram illustrating the structure of a hardware operating system.

도 3은 쓰레드 제어 메모리의 구조를 설명하는 상세도. 3 is a detailed diagram illustrating a structure of a thread control memory.

도 4는 소프트웨어 운영체제에서의 멀티 쓰레드 프로그램 수행 과정 흐름도.4 is a flowchart of a multi-threaded program execution process in a software operating system.

도 5는 본 발명의 일 실시예에 따른 하드웨어 운영체제에서의 멀티 쓰레드 프로그램 수행 과정 흐름도. 5 is a flowchart illustrating a multi-threaded program execution process in a hardware operating system according to an embodiment of the present invention.

도 6은 공유형 명령어 캐시의 구조를 설명하는 상세도.6 is a detailed diagram illustrating the structure of a shared instruction cache.

도 7은 메인 메모리의 데이터 배치 방식을 나타낸 도면.7 is a diagram illustrating a data arrangement method of a main memory.

도 8은 기존 캐시 메모리의 데이터 배치 방식을 나타낸 도면.8 is a diagram illustrating a data arrangement method of an existing cache memory.

도 9는 본 발명의 일 실시예에 따른 인터리브 명령어 캐시 메모리의 데이터 배치 방식을 나타낸 도면.9 is a diagram illustrating a data placement method of an interleaved instruction cache memory according to an embodiment of the present invention.

도 10은 공유형 데이터 캐시의 구조를 설명하는 상세도.10 is a detailed diagram illustrating the structure of a shared data cache.

<도면의 주요부분에 대한 부호의 설명><Description of the symbols for the main parts of the drawings>

10 : 코프로세서 100 : 멀티 프로세서 시스템 10: coprocessor 100: multiprocessor system

110 : 하드웨어 운영체제 120 : 프로세서 코어 110: hardware operating system 120: processor core

122 : 레지스터 파일 124 : 컨텍스트 제어기 122: register file 124: context controller

130 : 공유형 명령어 캐시 140 : 공유형 데이터 캐시 130: shared instruction cache 140: shared data cache

210 : 주 제어기 220 : 쓰레드 제어 메모리 210: main controller 220: thread control memory

230 : 쓰레드 관리자 240 : 컨텍스트 메모리 230: thread manager 240: context memory

250 : 컨텍스트 관리자 260 : 컨텍스트 버퍼 250: context manager 260: context buffer

610 : 명령어 버퍼 620 : 인터리브 명령어 캐시 메모리 610: instruction buffer 620: interleaved instruction cache memory

630 : 태그 메모리 640 : 캐시 제어기 630: tag memory 640: cache controller

650 : 블록-필 제어기 1010 : 히스토리 버퍼 650 block-fill controller 1010 history buffer

1020 : 인터리브 데이터 캐시 메모리 1030 : 태그 메모리 1020: interleaved data cache memory 1030: tag memory

1040 : 캐시 제어기 1050 : 블록-필 제어기1040 cache controller 1050 block-fill controller

Claims

delete

A plurality of processor cores; And

A hardware operating system connected to the plurality of processor cores through an operating system bus, the hardware operating system performing hardware for interrupt handling, task synchronization, task scheduling, context exchanging, and task resume operations for operating system service requests from the plurality of processor cores. and,

The hardware operating system,

A thread control memory for storing job information which is one of a job status, priority, wait queue, semaphore status, and a combination thereof;

A thread manager that selects the highest priority job among the jobs registered in the waiting queue and controls semaphore access;

A context memory in which a context of a generated job is stored;

A context buffer in which a context of a task to resume execution or a context of a task in which execution is stopped is stored;

A context manager that reads a context for a job selected by the thread manager from the context memory and stores the context in the context buffer or stops performing and stores the context of the job stored in the context buffer in the context memory; And

And a main controller coupled to the operating system bus and controlling the thread manager and the context manager in response to the operating system service request from the processor core.

The method of claim 2,

The thread control memory may include a ready queue having a first value when the job is in a wait state and a second value when the job is suspended, and information for job scheduling including a priority of the job. And semaphore access control information including semaphore status flags and semaphore values for each semaphore, and a wait queue for registering jobs waiting on the semaphore.

The method of claim 3,

The thread control memory includes a thread descriptor and a semaphore descriptor,

And obtaining the information for the job scheduling, the information for the semaphore access control, and the suspension queue from the field value of the thread descriptor and the field value of the semaphore descriptor.

The method of claim 4, wherein

Information for scheduling the job is stored in two fields indicating the upper and lower bits of the priority of the field of the thread descriptor,

The information for the semaphore access control is stored in each of the fields of the semaphore descriptor, the flag indicating whether the semaphore descriptor is used, the flag indicating the type of the semaphore, and the three fields in which the semaphore value is stored. Processor system.

The method of claim 5,

And the suspend queue is implemented using the remaining one field of the semaphore descriptor and the remaining two fields of the thread descriptor.

The method of claim 6,

If the remaining one field of the semaphore descriptor is negative, it indicates that at least one job is registered in the suspension queue, and the ID of the first job is stored in the first field of the remaining two fields of the thread descriptor.

When the second field is the first value, it indicates that another job is additionally registered in the suspension queue, and an ID of a subsequent job is stored in the first field.

If the second field is the second value, it indicates that the job is the last job registered in the suspension queue.

delete

In a multiprocessor system comprising a plurality of processor cores,

An operating system connected to the plurality of processor cores to perform interrupt handling, task synchronization, task scheduling, context switching, and resume operation operations on operating system service requests from the plurality of processor cores;

A shared instruction cache in which instructions supporting the plurality of processor cores are stored in advance; And

And a shared data cache in which data supporting the plurality of processor cores are stored in advance, wherein the shared instruction cache includes:

An interleaved instruction cache memory including an internal communication network and a plurality of on-chip memories, wherein words constituting one instruction block are distributed and stored in different on-chip memories, and the distributed words can be read simultaneously. );

A tag memory for storing a part of an instruction block stored in the interleaved instruction cache memory as a tag;

Comparing a part of the command according to the operating system service request with a tag stored in the tag memory, and performing tag matching to determine whether the requested command exists in the interleaved command cache memory, and performing the tag matching result. A cache controller configured to output the instruction block request signal when a specified instruction does not exist in the interleaved instruction cache memory; And

And a block-fill controller configured to read an instruction block from an external memory through a system bus according to the instruction block request signal and store the instruction block in the interleaved instruction cache memory.

The method of claim 9,

The interleaved instruction cache memory may have a different order in which the interleaved instruction cache memory is distributed and stored in the plurality of on-chip memories with respect to different paths.

The method of claim 9,

The shared instruction cache performs a tag matching when first accessing an instruction block, and when a command belonging to the instruction block is sequentially read, a tag matching skipping scheme that skips the tag matching. And a multiprocessor system.

The method of claim 11,

The shared instruction cache is configured to perform the tag matching when one of the plurality of processor cores executes a branch instruction, when the processor core requests an instruction having a sequential address but the instruction is stored in another instruction block. A multiprocessor system characterized by the above.

The method of claim 9,

The shared instruction cache may further include an instruction buffer that is allocated to each of the plurality of processor cores and that reads and stores the instruction from the interleaved instruction cache before the allocated processor core requests the instruction. Multiprocessor system.

The method of claim 13,

The instruction buffer includes a branch status signal indicating whether the allocated processor core has performed a branch instruction and a pre-fetch pressure signal indicating the number of instructions stored in the instruction buffer. Pass to the cache controller,

The cache controller determines an instruction to be read using the branch state signal and the pre-fetch pressure signals, and then outputs an on-chip memory selection signal to the plurality of on-chip memories through the internal communication network so that the instruction is executed. Multiprocessor system characterized in that it is inserted into the buffer.

The method of claim 9,

The shared data cache,

An interleaved data cache memory including an internal communication network and a plurality of on-chip memories, wherein words constituting one data block are distributed and stored in different on-chip memories, and the distributed stored words can be read simultaneously. );

A tag memory for storing a part of a data block stored in the interleaved data cache memory as a tag;

The tag matching is performed to compare whether the requested data exists in the interleaved data cache memory by comparing a portion of data according to the operating system service request with a tag stored in the tag memory, and the requested data is determined by the tag matching result. A cache controller for outputting the data block request signal when not in an interleaved data cache memory; And

And a block-fill controller for reading a data block from an external memory through a system bus according to the data block request signal and storing the data block in the interleaved data cache memory.

The method of claim 15,

The shared data cache,

A history buffer for storing tags of addresses of on-chip memory that have been accessed within a predetermined time; And

And a history tester comparing the tag stored in the history buffer with a requested address to determine whether the requested data exists in the interleaved data cache memory.

The method of claim 16,

And if the determination is not completed by the history tester, the cache controller reads the tag from the tag memory and performs the tag matching.

The method of claim 16,

And the history buffer stores the tag by distinguishing a stack memory access from a heap memory access.

delete