KR101736041B1

KR101736041B1 - Method for Speeded up image processing using distributed processing of embedded CPU and GPU

Info

Publication number: KR101736041B1
Application number: KR1020150005655A
Authority: KR
Inventors: 최해철; 이강운; 백아람
Original assignee: 한밭대학교 산학협력단
Priority date: 2015-01-13
Filing date: 2015-01-13
Publication date: 2017-05-17
Also published as: KR20160087127A

Abstract

The present invention relates to a method for accelerating image processing using embedded CPU-GPU distributed processing. More specifically, for image processing in embedded and mobile environments, a plurality of buffers is provided between the CPU and the GPU, and data is written to and read from the buffers in alternation so that no frame is lost or processed twice. Distributed processing across the CPU and GPU, together with a multi-algorithm structure on the GPU, further improves the efficiency of image processing.

Description

Method for accelerating image processing using embedded CPU-GPU distributed processing

The present invention relates to accelerating image processing in embedded and mobile systems. More specifically, in contrast to the image processing conventionally used in embedded and mobile systems, it relates to a method of accelerating image processing through distributed processing on a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit), a plurality of buffers located between the CPU and the GPU, and a multi-algorithm structure.

With advances in semiconductor technology and process scaling, CPU performance has risen steadily since the CPU first appeared. The way to raise CPU performance was to increase the number of transistors in the CPU. This approach ran into a limit, however, as CPU clock speeds approached 4 GHz. As the number of transistors integrated in a CPU grows, the power the CPU consumes grows in proportion, and heat generation rises with power consumption. Unlike a conductor, the semiconductor from which the CPU core is made loses resistance as it heats up, so leakage current appears, overcurrent flows, and the core can burn out. To overcome this limit, the multi-core processor, which places several processors of identical performance inside one CPU, was introduced. The multi-core concept applies to GPUs as well as CPUs. The GPU, which first appeared in the 1980s to accelerate computer graphics, has also improved continuously and now surpasses the CPU in raw arithmetic throughput. GPUs eventually hit the same performance wall as CPUs, and GPUs with multiple independent cores appeared.

Various ways of exploiting multi-core CPUs and GPUs have been proposed. Representative examples are parallel processing and GPGPU (General-Purpose Graphics Processing Unit). Parallel processing divides a task into several parts, assigns them to the independent processors in the CPU and GPU, and processes them simultaneously. The technique has long been used for high-performance computing and was once associated with supercomputers, but with growing concern over heat and power consumption and the arrival of multi-core processors, it has gained attention as a powerful paradigm in computer architecture. GPGPU means that the GPU performs work normally done by the CPU rather than its original task of graphics processing.

Because a CPU must handle signals from many devices in a computer, much of its hardware is devoted to instruction handling itself, such as out-of-order execution, and relatively little to arithmetic. In particular, cache memory, which acts as a buffer to bridge the speed gap between memory and the CPU, can occupy more than half the area of a high-performance CPU, so a CPU has relatively few arithmetic units compared with a GPU. A GPU, by contrast, was designed to handle computation-heavy image processing, and most of its hardware is devoted to raw arithmetic. This difference produced the gap in computing power between CPUs and GPUs and gave rise to the concept of GPGPU. GPGPU is well suited to large volumes of simple computation such as image processing, whereas its performance degrades in programs with conditions or branches.

Using parallel processing and GPGPU as described above requires not only a programmable CPU and GPU but also a software layer or library that performs the translation. CUDA (Compute Unified Device Architecture) from NVIDIA, OpenCL, and OpenGL are mainly used.

Meanwhile, as portable electronics have advanced and camera-equipped mobile devices have become commonplace, applications that use image processing on embedded and mobile platforms are spreading. Compared with a PC, however, a mobile device has a limited power supply and faces physical constraints such as low computing power, small memory, and heat, which stem from the smaller size of the CPU and GPU it can house. Embedded GPUs also face many structural constraints, such as a bandwidth gap relative to the CPU, a lack of programming interfaces, and few cores and shader units, so conventional methods are difficult to apply. To address this, research continues on bringing the GPGPU techniques used on PCs to embedded and mobile platforms, and graphics processing software usable on those platforms, such as OpenGL ES 2.0, is being developed.

As described above, applications that use image processing on embedded and mobile platforms are spreading, and efforts to use parallel processing and GPGPU in embedded and mobile environments are a representative example. Prior Art 1 ("CPU and GPU Parallel Processing for Mobile Augmented Reality") discloses image processing through parallel processing on a CPU and a GPU in a mobile environment. Compared with a structure that processes one frame at a time in sequence, such conventional parallel image processing reduces total execution time because neither processor waits for the other. However, because the CPU and GPU work at different speeds, frames are dropped and some frames are processed more than once.

Another drawback of the prior art is its single-algorithm processing structure. Image processing is usually completed through several algorithmic stages, but a single-algorithm structure processes only one algorithm before outputting to the screen. When several algorithms are processed with such a structure, data transfer and conversion between the CPU and the GPU occur once per algorithm, as shown in FIG. 2. A PC has higher bandwidth than an embedded or mobile environment and has various mechanisms to compensate for the speed difference between the CPU and the GPU. Mobile and embedded processors, by contrast, have low bandwidth, and the mechanisms that compensate for the CPU-GPU speed difference are themselves performance-constrained, so this structure is inefficient.

(Prior Art 1) "CPU and GPU Parallel Processing for Mobile Augmented Reality", Image and Signal Processing (CISP), 6th International Congress, pp. 133-137, Dec. 2013

The present invention seeks to improve on the prior art described above. Its object is to increase the efficiency of image processing in embedded and mobile environments by providing a plurality of buffers between the CPU and the GPU, implementing a multi-algorithm processing structure on the GPU, and distributing processing between the CPU and the GPU.

A method of processing image data based on CPU-GPU parallel processing in an embedded and mobile environment includes: a decoding step (S100) of decoding image data obtained from a camera or stored in a memory; a data conversion step (S200) of converting the image data decoded in the decoding step (S100) into a texture type to be used by the GPU, and storing the converted texture-type data alternately and sequentially in a plurality of buffers (10) located between the CPU and the GPU; an algorithm step (S300) of retrieving the data stored in the buffers (10) in the data conversion step (S200) alternately and sequentially and applying a predetermined algorithm; and an output step (S400) of outputting the data that has passed through the algorithm step (S300). The decoding step (S100) and the data conversion step (S200) are performed by the CPU, and the algorithm step (S300) and the output step (S400) are performed by the GPU.

The buffer (10) available to the data conversion step (S200) is a buffer (10) other than the one being used in the algorithm step (S300).
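The buffer availability rule can be sketched as a small helper. This is only an illustration under stated assumptions, not the patented implementation: the function name `writable_buffers` and the representation of buffers as integer indices are hypothetical and do not appear in the specification.

```python
def writable_buffers(num_buffers, in_use_by_gpu):
    """Return the indices of buffers the CPU may write to in step S200.

    in_use_by_gpu is the index of the buffer the GPU is currently reading
    in step S300, or None if the GPU is idle (hypothetical representation).
    """
    return [i for i in range(num_buffers) if i != in_use_by_gpu]


# With two buffers and the GPU reading buffer 0, only buffer 1 is writable.
print(writable_buffers(2, 0))     # [1]
print(writable_buffers(2, None))  # [0, 1]
```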

One or more algorithms are applied, using a frame buffer object (Frame Buffer Object), to the data received from the buffer (10) in the algorithm step (S300).

The decoding step (S100) and the data conversion step (S200) are modularized and processed in parallel.

A computer-readable recording medium is provided that stores a program for implementing the method for accelerating image processing using embedded CPU-GPU distributed processing of the present invention.

In addition, a program stored in a computer-readable recording medium is provided for implementing the method for accelerating image processing using embedded CPU-GPU distributed processing.

The method for accelerating image processing using embedded CPU-GPU distributed processing according to the present invention stores data to and reads data from a plurality of buffers located between the CPU and the GPU in alternation in embedded and mobile environments, which eliminates data loss and duplicated work.

In addition, the multi-algorithm processing structure on the GPU of an embedded or mobile device, together with distributed CPU-GPU processing, improves the efficiency of image processing.

FIG. 1 is a flowchart according to an embodiment of the present invention.
FIG. 2 shows a conventional single-algorithm processing structure.
FIG. 3 shows a multi-algorithm processing structure according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Before that, the terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings. On the principle that an inventor may appropriately define the concepts of terms in order to describe his or her own invention in the best way, they should be construed with meanings and concepts consistent with the technical idea of the present invention.

Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiment of the present invention and do not represent all of its technical ideas. It should be understood that various equivalents and modifications capable of replacing them may exist at the time of filing this application.

FIG. 1 is a diagram briefly illustrating a method for accelerating image processing using embedded CPU-GPU distributed processing according to an embodiment of the present invention. The method according to this embodiment is described in detail below with reference to FIG. 1.

As shown in FIG. 1, the method for accelerating image processing using embedded CPU-GPU distributed processing according to an embodiment of the present invention processes image data based on CPU-GPU parallel processing in an embedded and mobile environment and includes a decoding step (S100), a data conversion step (S200), an algorithm step (S300), and an output step (S400). The decoding step (S100) and the data conversion step (S200) are performed by the CPU, and the algorithm step (S300) and the output step (S400) are performed by the GPU.

The decoding step (S100) decodes image data obtained from a camera or stored in a memory into data the CPU can use. An image received from a camera is an analog video signal and must be converted into a digital signal that a computer can handle. Converting this analog video signal into a digital code distinguished by the presence or absence of unit pulses is called encoding, and the reverse operation is called decoding. Encoding covers not only the conversion of an analog signal into a digital signal but also the compression of a large volume of analog signal data for storage in memory or on a hard disk. The decoding performed in the decoding step (S100) converts such encoded, stored data into data on which image processing can be performed.

The data decoded in the decoding step (S100) is in a data type used by the CPU. To be used on the GPU, it must be converted into a texture, the data type the GPU uses. For this reason, the data conversion step (S200) converts the image data decoded in the decoding step (S100) into a texture type to be used by the GPU and stores the converted texture-type data alternately and sequentially in two buffers (10) located between the CPU and the GPU.
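As a rough illustration of the data conversion step, the sketch below repacks a decoded frame into a texture-like layout and writes successive frames to the buffers in round-robin order. The specification does not state the texture format, so the RGB-to-RGBA repacking is an assumption, and the names `rgb_to_rgba_texture` and `store_alternately` are hypothetical.

```python
def rgb_to_rgba_texture(rgb, alpha=255):
    """Repack tightly packed RGB bytes as RGBA bytes (assumed texture layout)."""
    if len(rgb) % 3 != 0:
        raise ValueError("RGB data length must be a multiple of 3")
    out = bytearray()
    for i in range(0, len(rgb), 3):
        out += rgb[i:i + 3]
        out.append(alpha)
    return bytes(out)


def store_alternately(frames, num_buffers=2):
    """Write converted frames into the buffers in round-robin order.

    Returns a list of (buffer_index, texture_bytes) in write order. A real
    system would also wait until the target buffer is free; that
    synchronization is omitted here.
    """
    writes = []
    for n, frame in enumerate(frames):
        writes.append((n % num_buffers, rgb_to_rgba_texture(frame)))
    return writes


frames = [bytes([10, 20, 30]), bytes([40, 50, 60]), bytes([70, 80, 90])]
for index, texture in store_alternately(frames):
    print(index, list(texture))
```

With two buffers, the three one-pixel frames land in buffers 0, 1, and 0 in that order.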

The decoding step (S100) and the data conversion step (S200) are modularized and processed in parallel in order to reduce the time spent converting the image data type. This modularized parallel processing is possible on a multi-core CPU that has several independent cores. Here, parallel processing does not mean running the CPU and the GPU in parallel; it means assigning work to the individual independent cores within the CPU.
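The idea of wrapping decoding and conversion into one module and running it on several CPU workers can be sketched as follows. This is a simplified illustration: `decode` and `to_texture` are hypothetical stand-ins for the real steps, and a thread pool is used only for brevity. In CPython, threads do not execute Python bytecode on several cores at once, so a production system would more likely use native threads or processes.

```python
from concurrent.futures import ThreadPoolExecutor


def decode(encoded):
    """Stand-in for step S100: here, 'decoding' just reverses the bytes."""
    return encoded[::-1]


def to_texture(decoded):
    """Stand-in for step S200: tag the frame as a texture (hypothetical)."""
    return ("texture", decoded)


def decode_and_convert(encoded):
    """One module: S100 followed by S200 for a single frame."""
    return to_texture(decode(encoded))


def process_frames(encoded_frames, workers=2):
    """Run the decode-and-convert module on several workers.

    executor.map returns results in input order, so frame order is kept.
    """
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(decode_and_convert, encoded_frames))


print(process_frames([b"abc", b"def"]))
# [('texture', b'cba'), ('texture', b'fed')]
```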

The algorithm step (S300) retrieves the data stored in the two buffers (10) alternately and sequentially and applies a predetermined algorithm. One or more algorithms are applied using a frame buffer object (Frame Buffer Object).

FIG. 2 shows the conventional single-algorithm processing structure. With a single-algorithm structure such as that of FIG. 2, image processing that consists of several algorithms requires as many data transfers between the CPU and the GPU as there are algorithms, and because the data moves between the CPU and the GPU, it must also be converted each time into the data type used by the receiving processor. The conversion time and the transfer time are therefore added to the total working time once per algorithm, which is inefficient.
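The overhead argument can be made concrete with a simple cost model. This is a simplified sketch: it assumes fixed per-frame conversion and transfer costs, and all numbers are hypothetical values chosen for illustration, not measurements from the patent. The second function models the alternative in which conversion and transfer are paid once for a whole chain of algorithms, as in the multi-algorithm structure of FIG. 3.

```python
def single_algorithm_time(algorithm_times, convert, transfer):
    """Each algorithm pays its own conversion and transfer (FIG. 2)."""
    return sum(t + convert + transfer for t in algorithm_times)


def multi_algorithm_time(algorithm_times, convert, transfer):
    """Conversion and transfer are paid once for the whole chain (FIG. 3)."""
    return convert + transfer + sum(algorithm_times)


# Hypothetical per-frame costs in milliseconds, chosen only for illustration.
algorithms = [4, 6]        # two algorithms applied to each frame
convert, transfer = 3, 2

print(single_algorithm_time(algorithms, convert, transfer))  # 20
print(multi_algorithm_time(algorithms, convert, transfer))   # 15
```

Under this model the saving grows with the number of algorithms, since each extra algorithm adds one more conversion and transfer only in the single-algorithm case.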

FIG. 3 shows a multi-algorithm processing structure that uses a frame buffer object to overcome this. Once the CPU hands data to the GPU, one or more predetermined algorithms are applied within the GPU before output. Specifically, a frame buffer object of the same size and area as the image sent to the GPU's frame buffer is created in texture memory, and the image is copied into it. The image sent to the frame buffer is consumed by an off-screen rendering (Off-Screen Rendering) command rather than being output to the screen. The data stored in the frame buffer object is processed by the predetermined algorithms to perform the work the user wants, and the result is output through the frame buffer.
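The multi-pass idea can be imitated on the CPU to show the data flow. The sketch below is not OpenGL code: plain Python lists stand in for off-screen render targets, per-pixel functions stand in for shaders, and the two-target "ping-pong" arrangement is a common technique assumed here rather than something the specification states. The names `run_passes`, `invert`, and `threshold` are hypothetical.

```python
def run_passes(texture, passes):
    """Apply each pass in order, keeping intermediate results 'on the GPU'.

    Two off-screen targets are used in ping-pong fashion: each pass reads
    one target and writes the other, standing in for framebuffer objects.
    """
    targets = [list(texture), [0] * len(texture)]
    src = 0
    for shader in passes:
        dst = 1 - src
        for i, value in enumerate(targets[src]):
            targets[dst][i] = shader(value)
        src = dst
    return targets[src]  # handed to the on-screen framebuffer once


def invert(v):
    return 255 - v


def threshold(v):
    return 255 if v >= 128 else 0


print(run_passes([0, 100, 200], [invert, threshold]))  # [255, 255, 0]
```

The point of the structure is that the intermediate result of `invert` never travels back to the CPU before `threshold` runs.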

The buffer (10) used in the data conversion step (S200) and the algorithm step (S300) is a region of memory that temporarily holds data while it is transferred from one place to another. The buffer (10) is used to prevent frames from being dropped because of the difference in working speed between the CPU and the GPU during parallel processing, which improves the accuracy of the image, and to prevent frames from being processed twice, which shortens working time.

In the conventional sequential processing structure, the CPU converts the data of the first frame and hands it to the GPU, the GPU performs the specified image processing and outputs the result, and only then does the CPU convert the data type of the second frame. This structure has the advantage that no frame is lost, but the total working time increases because one processor waits while the other works. To overcome this, a parallel processing structure came into use: as soon as the CPU finishes the first frame and sends the data to the GPU, it starts on the second frame without waiting, while the GPU starts working on the first-frame data the CPU has just finished. Because neither processor waits, the total working time decreases, but the difference in working speed causes frames to be dropped or processed twice.

Placing the buffer (10) between the CPU and the GPU compensates for the difference in their working speeds. This prevents dropped frames, which improves the accuracy of image processing, and avoids duplicate processing of frames, which prevents unnecessary GPU use. Two buffers (10) are used in consideration of the limited CPU capacity in embedded and mobile environments. In the data conversion step (S200), the CPU stores data in the buffers (10) alternately and sequentially, and in the algorithm step (S300), the GPU retrieves data from the buffers (10) sequentially. While the CPU and the GPU alternately store to and read from the two buffers (10), the CPU cannot use the buffer (10) that the GPU is currently using. If the CPU were to use a buffer (10) while the GPU is still reading from it, the CPU would delete the existing data and write new data, and data could be lost.
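The double-buffer hand-off described above can be sketched with two threads standing in for the CPU and the GPU. The specification does not say how the two sides are synchronized, so the per-buffer semaphores below are an assumption, and `run_pipeline` is a hypothetical name. The sketch shows that every frame is consumed exactly once and that the writer never touches a buffer the reader still holds.

```python
import threading


def run_pipeline(frames, num_buffers=2):
    """Pass every frame from a 'CPU' thread to a 'GPU' thread exactly once."""
    buffers = [None] * num_buffers
    free = [threading.Semaphore(1) for _ in range(num_buffers)]  # writable
    full = [threading.Semaphore(0) for _ in range(num_buffers)]  # readable
    processed = []

    def cpu():
        for n, frame in enumerate(frames):
            i = n % num_buffers
            free[i].acquire()        # wait until the GPU has released buffer i
            buffers[i] = frame       # S200: store converted data
            full[i].release()

    def gpu():
        for n in range(len(frames)):
            i = n % num_buffers
            full[i].acquire()        # wait until buffer i holds a new frame
            processed.append(buffers[i])  # S300: read and process
            free[i].release()

    threads = [threading.Thread(target=cpu), threading.Thread(target=gpu)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed


result = run_pipeline(list(range(10)))
print(result == list(range(10)))  # True
```

Because both sides visit the buffers in the same round-robin order, the output preserves frame order regardless of which thread runs faster.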

The output step (S400) outputs the data processed in the algorithm step (S300) to the screen.

Table 1 below compares the results of the proposed method with those of the conventional sequential and parallel processing structures. The experiment was run on an Android mobile device with an ARM v7-based dual-core 1.3 GHz CPU, a triple-core SGX 543MP3 GPU, and 1 GB of memory, and the implementation used OpenGL ES 2.0. A 500-frame video was processed with a grayscale (Grayscale) algorithm and a Sobel edge detection (Sobel Edge detector) algorithm applied together, and the working times were compared.
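For readers unfamiliar with the two algorithms used in the experiment, the sketch below shows a plain CPU version of grayscale conversion followed by Sobel edge detection. It is not the OpenGL ES shader code used in the experiment, which the specification does not reproduce. The luma weights (0.299, 0.587, 0.114), the |Gx| + |Gy| magnitude approximation, and the zeroed border are common choices assumed here.

```python
def grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of integer luma values."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for r, g, b in row]
            for row in rgb_image]


def sobel(gray):
    """Return |Gx| + |Gy| for interior pixels; border pixels are left at 0."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            out[y][x] = min(255, abs(gx) + abs(gy))
    return out


# A 3x4 test image: black on the left half, white on the right half.
black, white = (0, 0, 0), (255, 255, 255)
image = [[black, black, white, white] for _ in range(3)]
edges = sobel(grayscale(image))
print(edges[1])  # [0, 255, 255, 0]
```

The vertical black-to-white boundary produces a strong response in the two interior columns, while the untouched border columns stay at 0.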

Table 1 shows that the sequential processing structure is slower than the other approaches but loses no frames. The parallel processing structure is faster than the sequential structure, but at 480P and 720P it processes fewer than 500 frames, so data is lost, and at 1080P it processes some frames twice, which adds unnecessary working time. The method proposed in the present invention combines the advantages of sequential and parallel processing: at 480P and 720P it is slower than the parallel structure but loses no data, and at 1080P it is actually faster than the parallel structure. These results indicate that, as the workload grows, the proposed method becomes faster than both the sequential and the parallel structures while also minimizing data loss.

Those skilled in the art will readily understand that the above method for accelerating image processing using embedded CPU-GPU distributed processing may be provided on a computer-readable recording medium by tangibly embodying a program of instructions that implements it. In other words, the method may be implemented as program instructions executable by various computer means and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, flash memory, and USB memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

The present invention is not limited to the embodiments described above. Its range of application is broad, and various modifications can of course be made without departing from the gist of the invention as claimed in the claims.

10: buffer
S100: decoding step
S200: data conversion step
S300: algorithm step
S400: output step

Claims

A method of processing image data based on CPU-GPU parallel processing in an embedded and mobile environment, the method comprising:
a decoding step (S100) of decoding image data obtained from a camera or stored in a memory;
a data conversion step (S200) of converting the image data decoded in the decoding step (S100) into a texture type to be used by the GPU, and sequentially storing the converted texture-type data in a plurality of buffers (10) arranged in parallel with each other between the CPU and the GPU;
an algorithm step (S300) of sequentially retrieving the data stored in the buffers (10) in the data conversion step (S200) and applying a predetermined algorithm; and
an output step (S400) of outputting the data that has passed through the algorithm step (S300),
wherein the decoding step (S100) and the data conversion step (S200) are performed by the CPU, and the algorithm step (S300) and the output step (S400) are performed by the GPU, and
wherein the buffer in which the image data is stored in the data conversion step (S200) is a buffer other than the one being used in the algorithm step (S300).

(Deleted)

The method according to claim 1,
wherein one or more algorithms are applied, using a frame buffer object (Frame Buffer Object), to the data received from the buffer (10) in the algorithm step (S300).

The method according to claim 1,
wherein the decoding step (S100) and the data conversion step (S200) are modularized and processed in parallel.

(Deleted)

A program stored in a computer-readable recording medium for implementing the method for accelerating image processing using embedded CPU-GPU distributed processing according to any one of claims 1, 3 and 4.