KR101373699B1

KR101373699B1 - Multi core multi thread system and multi core process management method under mobile circumstance

Info

Publication number: KR101373699B1
Application number: KR1020120004968A
Authority: KR
Inventors: 이광엽; 정형기
Original assignee: 서경대학교 산학협력단
Priority date: 2012-01-16
Filing date: 2012-01-16
Publication date: 2014-03-14
Also published as: KR20130084160A

Abstract

The present invention relates to a process management method for a multi-core multi-thread system that has a plurality of cores, including a first core and a second core, and executes a plurality of threads within each core. The method comprises a first step in which execution of all threads in the first core ends, a second step in which execution of all threads in the second core ends, and a third step in which, after the second step, a synchronized-termination (SYNC) instruction is issued to the plurality of threads in the first core and the second core.

Description

MULTI CORE MULTI THREAD SYSTEM AND MULTI CORE PROCESS MANAGEMENT METHOD UNDER MOBILE CIRCUMSTANCE

The present invention relates to a branching method for single-process parallelization in a multi-core multi-thread system for mobile environments. More specifically, it relates to a branching method, suited to mobile environments, that manages multi-threaded execution using only branch instructions within a single process, without a separate host processor for thread management.

Recently, as multimedia performance, which demands multitasking and large numbers of high-speed operations, has grown in importance in the computer field, multi-core processors with a plurality of cores in a single processor have been developed. A multi-core processor can improve processing performance because its cores share the work. In addition, compared with adding several separate processors, the parts other than the cores can be shared, which lowers manufacturing cost and allows a smaller size.

IBM's Cell processor and nVidia's CUDA manage multiple cores and threads through a PPE (Power Processing Element) and a thread manager, respectively. Such designs require, in addition to the processors that do the actual work, another host processor to manage them. Besides the program that runs on the processors, a program must also be written for the separate host processor that manages them. From a hardware standpoint this wastes area and power, and from a software standpoint it requires software for the separate host processor, which adds cost. In a personal computer (PC) environment, where power consumption and chip area are only loosely constrained, this design is a good way to maximize performance when managing multiple cores and threads, but it is burdensome in a mobile environment with limited resources. To eliminate the separate host processor, each core must also take on the role of the host processor, and for that, thread management must be carried out by instructions.

IBM's Cell processor is broadly divided into a PPE unit and SPE units; the PPE can be thought of as a PowerPC CPU. The SPE (Synergistic Processing Element) is the key core that drives multiple threads. Each SPE has internal local memory in which a certain amount of data can be accumulated for computation, and it communicates with the PPE over an interrupt line. When an SPE finishes its work, the PPE receives an interrupt and issues the next task, which gives a clear and flexible structure. However, the SPE is not autonomous: it acts as a slave processor that can compute only under the control of the PPE.

nVidia's CUDA processor is a GP-GPU (General Purpose computing on Graphics Processing Unit): it was designed for graphics cards but is used in many fields. In CUDA, a separate thread manager manages a large number of thread processors. Because many threads are executed simultaneously by a single instruction, CUDA is inefficient when a program contains constructs such as conditional branches, since threads on the path not being executed are stalled. This structure lowers the performance of each individual core, but by greatly increasing the number of threads that can be processed at once it can improve performance substantially when running simple, fixed functions.

Like the IBM Cell, a CUDA processor also requires two separate programs, one for the threads and one for the thread manager. The CUDA SDK abstracts this to make programming easier, but the actual operation of CUDA is not visible, the architecture is proprietary so that third-party vendors cannot participate, and designs must follow the methods the CUDA SDK provides. These constraints can end up limiting hardware performance.

The IBM Cell and nVidia CUDA architectures each have advantages and disadvantages, but they share a structure in which each core is run by an external core manager. Because a separate manager assigns work to each thread, this structure extends easily to multiple cores, and in a PC environment, where power consumption and chip area are only loosely constrained, its performance-oriented design delivers high performance. In a mobile environment, however, it is a burden in terms of power consumption, heat, and chip area.

Applying the same approach to a GP-GPU in a mobile environment would require a separate host processor and its processing program, so the approach is not suitable there. A mobile GP-GPU therefore needs a processing scheme that removes the host processor and replaces its task management with new core and thread management instructions.

The present invention was conceived to meet this need. Its object is to provide a multi-thread management method for a multi-core multi-thread structure that uses the various branch instructions of a single-process program, without a separate thread management module.

To achieve this object, the invention provides a multi-thread system having a plurality of cores and executing a plurality of threads within each core, characterized by comprising a thread flag register, which indicates whether each thread in a core is running, and a core flag register, which indicates for each core whether at least one thread is running. The multi-thread structure is realized through dedicated branch instructions.

When threads must be synchronized across cores, the invention uses the SYNC instruction to terminate every thread in each core except the last one and then compares only the one unterminated thread per core. This lowers hardware complexity and simplifies implementation, and it allows cores to be managed by instructions within a single program, without a separate core management program. With this instruction set, a multi-core multi-thread program can be written as if it were a single-threaded program, and each core can manage its own threads efficiently without an external multi-core or multi-thread manager. Since neither such a manager nor a separate program for it is needed, power consumption, chip size, and the cost of developing a separate thread management program can all be reduced. In addition, the design is a scalable multi-core whose core count can be increased or decreased up to a maximum of 16. Programs can be written independently of the number of cores, so no code changes are needed when the core count changes.

A global scratch counter is also provided for fast work allocation. Even when requests arrive from several cores simultaneously, the global scratch counter returns a different value to each core without any critical section, so each thread can allocate its work quickly by referring to the counter value.

FIG. 1 is an instruction flow diagram illustrating the all branch instruction used in the present invention.
FIG. 2 is an instruction flow diagram illustrating the seq branch instruction used in the present invention.
FIG. 3 is a multi-core multi-process instruction processing flow diagram of an embodiment according to the present invention.
FIG. 4 is an explanatory diagram showing, over time, the flag register states of an embodiment according to the present invention that enable the thread operations of FIG. 3.

The present invention extends a single-core multi-thread management technique to a GP-GPU in a multi-core environment. The role of the host processor is replaced by instructions, giving an efficient structure in which each core actively allocates work to its threads through a scratch counter, without an external multi-core manager.

However, a structure in which each thread allocates its own work through a scratch counter, without a separate external thread manager, faces several problems when extended to multiple cores. The cores must monitor one another's internal threads, and several cores may need to access the scratch counter at the same time. A critical section could resolve the latter, but it forces every thread other than the one currently accessing the counter to wait, which can degrade performance.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

In the present invention, the cores provide the instructions all, seq, and sync in addition to ordinary branch instructions. The all branch instruction activates the issuing thread together with the other empty threads in the same core on the same designated program, so that the program runs in multi-threaded fashion. FIG. 1 is an instruction flow diagram illustrating the all branch instruction. One core contains threads #0 through #4; when the all branch instruction is executed, the empty threads #0, #1, #2, and #3 run process B in multi-threaded fashion.
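The behaviour of the all branch can be pictured with a small software model. This is only a sketch under stated assumptions: the patent describes a hardware instruction, and the names `Core`, `slots`, and `branch_all` are hypothetical. The starting state (only thread #4 active) and the choice that the issuing thread also moves to the target program are assumptions made for the illustration.

```python
from typing import List, Optional


class Core:
    """Toy model of one core: each slot holds the program a thread runs, or None if empty."""

    def __init__(self, n_threads: int) -> None:
        self.slots: List[Optional[str]] = [None] * n_threads

    def branch_all(self, caller: int, target: str) -> List[int]:
        """Model of the `all` branch issued by thread `caller`.

        Every empty slot is activated on `target`, and the caller branches
        there as well (assumption). Returns the indices of the threads that
        were newly activated.
        """
        activated = [i for i, prog in enumerate(self.slots) if prog is None]
        for i in activated:
            self.slots[i] = target
        self.slots[caller] = target
        return activated


core = Core(5)
core.slots[4] = "A"  # assumed starting state: only thread #4 is active
woken = core.branch_all(caller=4, target="B")
print(woken)       # indices of the previously empty threads
print(core.slots)  # program assigned to each thread after the branch
```

In this model a single call fans the work out to every idle thread, which is the role a host-side thread manager would otherwise play.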

The seq branch instruction terminates the issuing thread, marking it as empty, if any other thread in the same core is still running; if every other thread is already empty, the issuing thread branches to the designated program. FIG. 2 is an instruction flow diagram illustrating the seq branch instruction. One core contains threads #0, #1, #2, and #3, which finish in the order #0, #3, #2, #1. Threads #0, #3, and #2 terminate and become idle. When thread #1, the last to finish, reaches seq, it is the only thread remaining in the core and is therefore ready to branch to the designated program.
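The seq rule (terminate unless you are the last active thread, in which case branch) can be sketched as a single function. The function name `branch_seq`, the list-of-slots representation, and the label `"NEXT"` for the designated program are hypothetical; the arrival order is the one given for FIG. 2.

```python
from typing import List, Optional


def branch_seq(slots: List[Optional[str]], caller: int, target: str) -> bool:
    """Model of `seq` issued by thread `caller`.

    If any other thread in the core is still active, the caller terminates
    and its slot becomes empty. If the caller is the only active thread, it
    branches to `target`. Returns True when the branch is taken.
    """
    others_active = any(p is not None for i, p in enumerate(slots) if i != caller)
    if others_active:
        slots[caller] = None
        return False
    slots[caller] = target
    return True


slots: List[Optional[str]] = ["A", "A", "A", "A"]
# Threads reach `seq` in the order described for FIG. 2: #0, #3, #2, then #1.
results = [branch_seq(slots, t, "NEXT") for t in (0, 3, 2, 1)]
print(results)
print(slots)
```

Only the final arrival takes the branch, so the core collapses back to a single thread without any thread having to know in advance which one will be last.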

The sync branch instruction branches in the same way as seq, but it executes only after waiting for every other core to reach the single-thread state as well.

FIG. 3 is a multi-core multi-process instruction processing flow diagram of an embodiment according to the present invention. The flow shown in FIG. 3 needs no separate control unit and no critical section, and its simple structure allows fast transitions between tasks. The SYNC instruction was added to manage the threads of multiple cores; it synchronizes all cores and terminates threads.

Because synchronization takes place after the SYNC instruction has terminated all but one thread in each core, only one thread per core has to be compared, so complexity does not grow significantly. Whenever synchronization is required, or the program must run single-threaded, this structure guarantees both the synchronization and the point at which the threads have terminated.

FIG. 3 shows the instruction flow for two cores, core #0 and core #1, each running three threads (thread #0, thread #1, and thread #2). All threads of both cores are executing process 'A'. Assume that in core #1 the threads finish in the order thread #0, thread #2, thread #1, and that afterwards the threads of core #0 finish in the order thread #0, thread #2, thread #1. The SYNC instruction makes a core wait until the last thread of the other core has finished and then synchronizes them. Between t0 and t1, shown on the left of FIG. 3, every thread in core #1 finishes, but the threads of core #0 are still working. Thread #1 of core #1 therefore does not complete the SYNC instruction that signals thread termination; it waits, and once all threads of core #0 have finished, it signals termination through SYNC. At time t4, the ALL instruction is then executed and the work of process 'C' is divided among the threads.

As FIG. 3 shows, while the threads of all cores are each doing their work, a thread that reaches the synchronized-termination (SYNC) instruction is terminated, except for the marked last thread of each core, which does not terminate but waits until every core has reached the SYNC instruction. Once the threads of all cores have reached SYNC, the waiting threads leave the wait state and the program continues. At that point sequential work proceeds to set up the next task, and an all-branch (ALL) instruction then lets every thread start the new process at the same time. Threads finish their work at different times, but SYNC guarantees that all other threads doing the same work have terminated. This makes it possible to synchronize the cores and afterwards either move on to a new task through the ALL instruction or terminate.
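The SYNC behaviour can be traced with a small event-driven model. It is a sketch, not the hardware design: the class `SyncModel`, its method names, and the three return labels are hypothetical, and the arrival order is the one assumed in the FIG. 3 description (core #1 finishes first, then core #0).

```python
from typing import Dict, List


class SyncModel:
    """Toy event-driven model of SYNC across cores."""

    def __init__(self, n_cores: int, n_threads: int) -> None:
        # active[c][t] is True while thread t of core c has not reached SYNC.
        self.active: List[List[bool]] = [[True] * n_threads for _ in range(n_cores)]
        self.waiting: Dict[int, int] = {}  # core -> its last thread, parked at SYNC

    def reach_sync(self, core: int, thread: int) -> str:
        """Thread `thread` of core `core` arrives at the SYNC instruction."""
        self.active[core][thread] = False
        if any(self.active[core]):
            return "terminated"  # other threads of this core are still running
        self.waiting[core] = thread  # last thread of the core: park instead of terminating
        if len(self.waiting) == len(self.active):
            return "released"  # every core is down to one parked thread
        return "waiting"

    def survivors(self) -> Dict[int, int]:
        """Per core, the single thread that continues after the barrier."""
        return dict(self.waiting)


m = SyncModel(n_cores=2, n_threads=3)
# Core #1 finishes first (#0, #2, #1), then core #0 (#0, #2, #1).
order = [(1, 0), (1, 2), (1, 1), (0, 0), (0, 2), (0, 1)]
events = [m.reach_sync(c, t) for c, t in order]
print(events)
print(m.survivors())
```

The release decision depends only on one parked thread per core, which mirrors the text's point that comparing a single thread per core keeps the synchronization logic simple.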

In this structure, core and thread management is carried out by instructions inside a single program, so the advantage of not needing a separate program to manage each core and thread is preserved in a multi-core environment, and multiple cores and threads are managed efficiently without an external manager.

FIG. 4 is an explanatory diagram showing, over time, the flag register states of an embodiment according to the present invention that enable the thread operations of FIG. 3. Each core has a thread flag register that indicates whether each thread in that core is running, and the multi-core multi-thread system has a core flag register that indicates whether each core has any running thread. The core flag register is implemented in the system as a separate circuit outside the cores.

FIG. 4 is now described in detail. Core #0 has a thread flag register (10), whose second through fourth bits hold flags (11, 13, 15) indicating the states of thread #0, thread #1, and thread #2 of core #0, respectively. Similarly, core #1 has a thread flag register (20), whose second through fourth bits hold flags (21, 23, 25) indicating the states of thread #0, thread #1, and thread #2 of core #1. The core flag register (30) uses its second and third bits to indicate whether core #0 and core #1, respectively, have any running thread.

At time t0, all threads of core #0 are working, so the second through fourth bits of its thread flag register (10) read "111". In core #1, threads #0 and #2 have finished and thread #1 is still running, so the second through fourth bits of its thread flag register (20) read "010". The second and third bits of the core flag register (30) hold "11", indicating that threads are running in both core #0 and core #1.

At time t1, thread #0 of core #0 has finished while threads #1 and #2 are still running, so the second through fourth bits of the thread flag register (10) read "011". All threads of core #1 have finished, so the second through fourth bits of its thread flag register (20) read "000". The second and third bits of the core flag register (30) hold "10", showing that a thread is running in core #0 but none is running in core #1.

At time t2, threads #0 and #2 of core #0 have finished while thread #1 is still running, so the second through fourth bits of the thread flag register (10) read "010". All threads of core #1 have finished, so the second through fourth bits of its thread flag register (20) read "000". The second and third bits of the core flag register (30) hold "10", showing that a thread is running in core #0 but none is running in core #1.

At time t3, all threads of core #0 have finished, so the second through fourth bits of the thread flag register (10) read "000". All threads of core #1 have finished as well, so the second through fourth bits of its thread flag register (20) read "000". The second and third bits of the core flag register (30) hold "00", showing that no thread is running in either core #0 or core #1.
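The t0 to t3 walk-through can be reproduced by deriving each register value from the running state of the threads. This is an illustrative sketch: the helper names `thread_flags` and `core_flags` are hypothetical, only the flag bits discussed in the text are modelled, and the bit order (thread #0 or core #0 leftmost) is an assumption chosen to match the quoted values.

```python
from typing import Dict, List


def thread_flags(running: List[bool]) -> str:
    """Bit string for one core: '1' per running thread, thread #0 first."""
    return "".join("1" if r else "0" for r in running)


def core_flags(cores: List[List[bool]]) -> str:
    """Bit string with one bit per core: '1' if any thread in that core is running."""
    return "".join("1" if any(c) else "0" for c in cores)


# Running state of (thread #0, #1, #2) for core #0 and core #1 at each time point.
timeline: Dict[str, List[List[bool]]] = {
    "t0": [[True, True, True], [False, True, False]],
    "t1": [[False, True, True], [False, False, False]],
    "t2": [[False, True, False], [False, False, False]],
    "t3": [[False, False, False], [False, False, False]],
}

table = {
    t: (thread_flags(c0), thread_flags(c1), core_flags([c0, c1]))
    for t, (c0, c1) in timeline.items()
}
for t, row in table.items():
    print(t, *row)  # time, core #0 thread flags, core #1 thread flags, core flags
```

Each core flag bit is simply the OR of that core's thread flags, so a core with no running thread contributes a 0 to the core flag register.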

Work allocation with the earlier scratch counter is also problematic in such a multi-core multi-thread system, because several cores must access the scratch counter at the same time and the conventional scratch counter cannot simply be applied to multiple cores. A modified scheme, the global scratch counter, is used to solve this.

The global scratch counter supports four operations: Get (read), Inc (read, then increment by 1), Set (component write), and Limit (write the limit). When several cores simultaneously issue Inc requests, each reading the counter and incrementing it, the global scratch counter computes two quantities per core, the current counter value and the amount to be added for that core, and hands a different value to each core within a single cycle. Each core can then use the distinct value it obtained to allocate its work efficiently.

With the global scratch counter, each core obtains a different value in a single cycle without a critical section, which makes possible a structure that allocates work quickly and without waiting.
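A minimal software model of the global scratch counter is sketched below, assuming that all Inc requests arriving in the same cycle are served together and that each requester's offset is its rank among the requesters. The class and method names are hypothetical. The text does not say how offsets are assigned or what effect the limit has, so the lowest-core-first ordering is an assumption, Set is taken to write the counter value, and the limit is only recorded.

```python
from typing import Dict, Iterable, Optional


class GlobalScratchCounter:
    """Toy model: all Inc requests arriving in one cycle are served together."""

    def __init__(self) -> None:
        self.value = 0
        self.limit: Optional[int] = None

    def get(self) -> int:
        """Get: read the counter without changing it."""
        return self.value

    def set(self, value: int) -> None:
        """Set: write the counter (assumed meaning of the Set operation)."""
        self.value = value

    def set_limit(self, limit: int) -> None:
        """Limit: write the limit. Its effect is not specified, so it is only stored."""
        self.limit = limit

    def inc_cycle(self, requesting_cores: Iterable[int]) -> Dict[int, int]:
        """Serve simultaneous Inc requests in a single modelled cycle.

        Each requester receives the current value plus its offset, where the
        offset is its rank among the requesters (lowest core id first, an
        assumption). The counter then advances by the number of requesters.
        """
        cores = sorted(requesting_cores)
        granted = {core: self.value + offset for offset, core in enumerate(cores)}
        self.value += len(cores)
        return granted


gsc = GlobalScratchCounter()
gsc.set(0)
first = gsc.inc_cycle([0, 1, 2, 3])  # four cores ask in the same cycle
second = gsc.inc_cycle([1, 3])       # only two cores ask in the next cycle
print(first, second, gsc.get())
```

Because every requester in a cycle receives a distinct value, each value can serve directly as a work-item index, and no core has to wait for another to leave a critical section.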

In the present invention, a multi-core multi-thread system means a semiconductor chip (for example, a processor) that has a plurality of cores, each executing a plurality of threads, or an electronic device that includes such a chip.

Although preferred embodiments of the present invention have been described and illustrated using specific terms, those terms serve only to explain the invention clearly. It is evident that various modifications and changes may be made to the embodiments without departing from the technical spirit and scope of the following claims.

10, 20: thread flag register
11, 13, 15, 21, 23, 25: flags indicating whether each thread is running
30: core flag register
31, 33: flags indicating whether each core has a running thread

Claims

delete

A process management method for a multi-core multi-thread system that has a plurality of cores, including a first core and a second core, and executes a plurality of threads within each core, the method comprising:
A first step in which execution of all threads in the first core ends;
A second step in which execution of all threads in the second core ends; and
A third step in which, after the second step, a SYNC instruction is issued to the plurality of threads in the first core and the second core,
Wherein, during the period between the first step and the third step, the threads of the first core remain in a waiting state without performing a process,
And the method further comprises a fourth step in which, after the third step, the first core and the second core leave the waiting state and simultaneously process a new process.

The method of claim 3, wherein
As a step performed between the first step and the third step,
The method further comprises a fifth step in which, when a plurality of cores simultaneously issue increment-by-one requests, the current counter value and the value to be added for each core are computed per core, so that a different counter value is returned to each core within one cycle.