KR101722643B1

KR101722643B1 - Method for managing RDD, apparatus for managing RDD and storage medium for storing program managing RDD

Info

Publication number: KR101722643B1
Application number: KR1020160092467A
Authority: KR
Inventors: 김장원; 정도헌; 정창후; 김영민; 김태홍; 조민수
Original assignee: 한국과학기술정보연구원
Priority date: 2016-07-21
Filing date: 2016-07-21
Publication date: 2017-04-05

Abstract

The present invention discloses a method, an apparatus, and a storage medium storing a program for efficiently managing a resilient distributed dataset (RDD). An RDD management method according to an embodiment of the present invention is a method for managing an RDD used for data collection, analysis, and processing, and comprises the steps of: extracting metadata from a lineage for the RDD; storing the extracted metadata; and updating the stored metadata when an operation related to the lineage for the RDD is performed, wherein the stored metadata includes RDD object information and RDD type information.

Description

[0001] The present invention relates to a method for managing a Resilient Distributed Dataset (RDD), an apparatus for managing an RDD, and a storage medium storing a program for managing an RDD, and more particularly to a method, an apparatus, and a storage medium for managing RDDs using metadata related to a raw RDD and to derived RDDs derived from the raw RDD.

With growing interest in big data, techniques for analyzing data in real time are being actively researched. Apache Spark is one example of a widely used real-time data analysis platform. Such a platform uses a dataset called a Resilient Distributed Dataset (RDD) to process, analyze, and store data streamed in real time. An RDD holds its data in an immutable form. Therefore, when the data contained in an existing RDD is processed, a new RDD containing the processed data is created. In other words, data processing additionally creates derived RDDs that derive from existing RDDs.

When streaming data is processed, raw RDDs are created continuously and derived RDDs are created from each raw RDD, so the number of RDDs grows rapidly. As a result, the space for storing the information contained in the RDDs runs short, and searching for desired information incurs considerable delay. In addition, if the created RDDs are deleted, they must be created again for data analysis.

The present invention was conceived in recognition of the above problems, and its object is to provide a method, an apparatus, and a storage medium storing a program for efficiently managing RDDs.

Specifically, the present invention provides a method, an apparatus, and a storage medium storing a program for managing RDDs using metadata related to raw RDDs and derived RDDs, in order to manage the RDDs efficiently.

To achieve the above object, a Resilient Distributed Dataset (RDD) management method according to an embodiment of the present invention is a method for managing an RDD used for data collection, analysis, and processing, and may include: extracting metadata from a lineage for the RDD; storing the metadata extracted in the extracting step; and updating the stored metadata when an operation related to the lineage for the RDD is performed, wherein the stored metadata includes RDD object information and RDD type information.

The updating step may be a step of updating status information indicating the life cycle of the RDDs created as a result of performing the operation related to the lineage for the RDD.

The method may further include: receiving information indicating whether at least one of the RDDs created as a result of performing the operation is to be stored by caching in memory or stored by persisting; storing the at least one RDD according to the indicating information; and further updating the stored metadata based on the indicating information and on storage location information of the stored at least one RDD.

The stored metadata may further include priority information and RDD storage location information. The priority information may be set depending on the RDD type information, and the RDD storage location information may indicate the location where the RDDs are stored.

The stored metadata may further include status information indicating the life cycle of the RDDs created as a result of performing the operation related to the lineage for the RDD, and the method may further include: detecting garbage collection for the RDDs created as a result of performing the operation related to the lineage for the RDD; further updating the stored metadata by changing the status information associated with those of the created RDDs that have been selected as targets of the garbage collection; and further updating the stored metadata by changing the status information and the RDD storage location information for each RDD, among the RDDs selected as targets of the garbage collection, whose priority information indicates a priority higher than a preset priority, and storing each such RDD based on the changed RDD storage location information.

The garbage collection detected in the detecting step may be performed after the step of storing, based on the changed RDD storage location information, the RDDs whose priority information indicates a priority higher than the preset priority.

The RDD type information may indicate any one of a final RDD, a heavy RDD, a general RDD, and a raw RDD.

The priority information may indicate priority in descending order of final RDD, heavy RDD, raw RDD, and general RDD.
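The ordering of RDD types by priority can be sketched as follows. This is a minimal illustration only: the names `RddType`, `priority_of`, and `outranks`, and the numeric encoding in which a larger value means a higher priority, are assumptions made for the example and do not appear in the specification.

```python
from enum import IntEnum


class RddType(IntEnum):
    """RDD types; a larger value means a higher priority (assumed encoding)."""
    GENERAL = 1
    RAW = 2
    HEAVY = 3
    FINAL = 4


def priority_of(rdd_type: RddType) -> int:
    """Derive the priority information from the RDD type information."""
    return int(rdd_type)


def outranks(rdd_type: RddType, threshold: RddType) -> bool:
    """Return True if rdd_type has a strictly higher priority than threshold."""
    return priority_of(rdd_type) > priority_of(threshold)


# List the types from highest to lowest priority.
ordered = sorted(RddType, key=priority_of, reverse=True)
print([t.name for t in ordered])
```

With this encoding, a preset priority can be expressed as one of the types, and `outranks` decides whether a given RDD exceeds it.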

A Resilient Distributed Dataset (RDD) management apparatus according to another embodiment of the present invention is an apparatus for managing an RDD used for data collection, analysis, and processing, and may include: a DAG scanning agent that scans a Directed Acyclic Graph (DAG) containing a lineage for the RDD and extracts metadata from the lineage for the RDD; and a metadata manager that stores the metadata extracted from the lineage for the RDD and updates the stored metadata when an operation related to the lineage for the RDD is performed, wherein the stored metadata includes RDD object information and RDD type information.

The metadata manager may update status information indicating the life cycle of the RDDs created as a result of performing the operation related to the lineage for the RDD.

The apparatus may further include an RDD storage manager that receives information indicating whether at least one of the RDDs created as a result of performing the operation is to be stored by caching in memory or stored by persisting, and that stores the at least one RDD according to the indicating information. The metadata manager may further update the stored metadata based on the indicating information and on storage location information of the stored at least one RDD.

The stored metadata may further include status information indicating the life cycle of the RDDs created as a result of performing the operation related to the lineage for the RDD, and the apparatus may further include an RDD lifetime monitoring agent that detects garbage collection for the RDDs created as a result of performing the operation related to the lineage for the RDD. The metadata manager may further update the stored metadata by changing the status information associated with those of the created RDDs that have been selected as targets of the garbage collection, and may further update the stored metadata by changing the status information and the RDD storage location information for each RDD, among the RDDs selected as targets of the garbage collection, whose priority information indicates a priority higher than a preset priority. The RDD storage manager may store each such RDD based on the changed RDD storage location information.

A storage medium storing a program for managing a Resilient Distributed Dataset (RDD) according to yet another embodiment of the present invention is a storage medium storing a program for managing an RDD used for data collection, analysis, and processing, wherein the program extracts metadata from a lineage for the RDD, stores the extracted metadata, and updates the stored metadata when an operation related to the lineage for the RDD is performed, and wherein the stored metadata includes RDD object information and RDD type information.

According to the present invention, a method, an apparatus, and a storage medium storing a program that can efficiently store and manage RDDs can be provided.

According to an embodiment of the present invention, the metadata related to an RDD is updated after an RDD operation, so that accurate information about the RDDs actually created can be managed.

According to another embodiment of the present invention, RDDs stored by caching and RDDs stored by persisting can be managed separately.

According to yet another embodiment of the present invention, RDDs of high importance are stored and managed in a separate storage location before garbage collection is performed on the RDDs, which prevents those RDDs from being lost when they are deleted from memory.
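The handling of a detected garbage collection can be sketched as below. This is a hedged illustration, not the patented implementation: `RddMeta`, `handle_gc`, the `save` callback, the string status codes, and the `"disk"` location are all hypothetical names chosen for the example. The priority values follow the order final, heavy, raw, general stated in this description.

```python
from dataclasses import dataclass

# Assumed numeric encoding: a larger value means a higher priority.
PRIORITY = {"final": 4, "heavy": 3, "raw": 2, "general": 1}


@dataclass
class RddMeta:
    rdd_id: str
    rdd_type: str            # "final", "heavy", "raw", or "general"
    status: str = "C"        # life-cycle status code, e.g. "C" or "GC"
    location: str = "memory"


def handle_gc(metadata, gc_targets, threshold, save, rescue_location="disk"):
    """Update metadata for GC targets and save high-priority RDDs first.

    Intended to run before the detected garbage collection is carried out.
    Returns the ids of the RDDs that were saved to rescue_location.
    """
    rescued = []
    for rdd_id in gc_targets:
        meta = metadata[rdd_id]
        meta.status = "GC"                      # status change for every target
        if PRIORITY[meta.rdd_type] > threshold:
            meta.location = rescue_location     # record the new storage location
            save(rdd_id, rescue_location)       # store the RDD at that location
            rescued.append(rdd_id)
    return rescued


# Usage: only RDDs ranked above "raw" are saved before collection.
saved = []
metadata = {
    "r1": RddMeta("r1", "general"),
    "r4": RddMeta("r4", "final"),
}
rescued = handle_gc(metadata, ["r1", "r4"], PRIORITY["raw"],
                    save=lambda rdd_id, loc: saved.append((rdd_id, loc)))
print(rescued, saved)
```

Passing the storage action in as a callback keeps the metadata bookkeeping separate from the actual storage step, which mirrors the split between a metadata manager and a storage manager.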

FIG. 1 is a diagram illustrating a real-time data processing process according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an embodiment of a process in which RDDs are created.
FIG. 3 is a diagram illustrating a real-time data processing process according to another embodiment of the present invention.
FIG. 4 is a diagram illustrating RDD metadata according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating a metadata extraction process according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating an initial RDD creation process according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating a selective RDD storage process according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating a garbage collection process according to an embodiment of the present invention.
FIG. 9 is a flowchart illustrating an RDD management method according to an embodiment of the present invention.

The terms used in the present invention have been selected, as far as possible, from general terms currently in wide use, with their function in the present invention taken into account, but they may vary depending on the intent of those skilled in the art, on convention, or on the emergence of new technology. In certain cases a term has been chosen arbitrarily by the applicant, and in such cases its meaning is set out in the relevant part of the description. The terms used in the present invention should therefore be interpreted on the basis of their substantive meaning and the content of this specification as a whole, not merely on the basis of their names. Embodiments of the present invention are disclosed below with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating a real-time data processing process according to an embodiment of the present invention.

Referring to FIG. 1, a real-time data processing system according to an embodiment of the present invention may include an information collection unit (H1100), a data generation unit (H1200), a data processing unit (H1300), and a storage unit (H1400). Although the embodiment of FIG. 1 shows a real-time data processing process, the data processing system of the present invention may process non-real-time data as well as real-time data.

The information collection unit (H1100) may collect various kinds of information. It may collect data generated by sensors such as IoT sensors, infrared sensors, heat sensors, and CCTV. It may also collect information from databases containing real-time or non-real-time information, such as legacy systems or social networking services (SNS). The information collection unit (H1100) may output the collected information to the data generation unit (H1200) described below.

The data generation unit (H1200) may receive information from the information collection unit (H1100) and generate data. The data generated by the data generation unit (H1200) may be a real-time data stream or non-real-time data. The data generation unit (H1200) may output the generated data to the data processing unit (H1300). According to one embodiment, the data generation unit (H1200) may include at least one of TCP, ZeroMQ, Kafka, and Twitter.

The data processing unit (H1300) may receive data from the data generation unit (H1200), in real time or in non-real time. The data processing unit (H1300) may create a Resilient Distributed Dataset (RDD) based on the received data, and may process data on a general-purpose distributed platform. According to one embodiment, the data processing unit (H1300) may be part of the Apache Spark platform.

The data processing unit (H1300) may create RDDs in real time or in non-real time. An RDD created first from the received data may be referred to as a raw RDD.

An RDD is a dataset that holds its data in an immutable form. In other words, unlike conventional datasets, part of the data inside an RDD cannot be modified. Because of this property, additional RDDs are created whenever data associated with an RDD is manipulated or processed. That is, data processing creates an RDD derived from an existing RDD, and such an RDD may be referred to as a derived RDD. The term covers both RDDs derived from a raw RDD and RDDs derived from an existing derived RDD.

Because RDDs are derived during data processing and may also be created from a data stream, the number of RDDs can increase exponentially. The growing number of RDDs leads to a shortage of storage space and can slow down searches for desired information. Selective management of raw RDDs, derived RDDs, and frequently used RDDs is therefore needed.

A created RDD has a predetermined life cycle unless it is stored semi-permanently. An RDD whose life has ended can be removed from memory by an operation called garbage collection. That is, the data processing unit (H1300) may perform garbage collection to remove RDDs whose life has ended from memory.

The life cycle of an RDD can be divided into three states: initial creation (initial, I), caching (C), and destruction (garbage collection, GC). Initial creation means the state in which the RDD has been created, caching means the state in which the created RDD is cached in memory, and destruction may mean the state in which the RDD has been deleted from memory and stored on a hard disk. If the case in which a created RDD is stored semi-permanently (persist, P) is also included, the life cycle of an RDD can be divided into four states: persist (P), initial creation (I), caching (C), and destruction (GC).
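The four life-cycle states can be modeled as a small enumeration. The sketch below is illustrative only: the names `Status`, `ALLOWED`, and `advance` are hypothetical, and the permitted transitions are an assumption made for the example, since this description names the states but does not list which transitions are allowed.

```python
from enum import Enum


class Status(Enum):
    INITIAL = "I"              # the RDD has been created
    CACHED = "C"               # the RDD is cached in memory
    PERSISTED = "P"            # the RDD is stored semi-permanently
    GARBAGE_COLLECTED = "GC"   # the RDD has been removed from memory


# Assumed transition table (not taken from the specification).
ALLOWED = {
    Status.INITIAL: {Status.CACHED, Status.PERSISTED, Status.GARBAGE_COLLECTED},
    Status.CACHED: {Status.GARBAGE_COLLECTED},
    Status.PERSISTED: set(),
    Status.GARBAGE_COLLECTED: set(),
}


def advance(current: Status, new: Status) -> Status:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {new.value}")
    return new


status = advance(Status.INITIAL, Status.CACHED)
status = advance(status, Status.GARBAGE_COLLECTED)
print(status.value)
```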

A derived RDD is created by an operation on an RDD. The operations on an RDD are transformations and actions.

Table 1 shows the operations on an RDD.

Derived RDDs are created by transformations and actions. More specifically, when an action is performed, RDDs are created by referring to the lineage for the RDD, which is described below. As many derived RDDs are created as the number of times transformations associated with the action were called, and finally the action is performed and the final derived RDD (last RDD) is created (lazy execution). The action is carried out by workers residing on the individual nodes of the distributed platform.
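Lazy execution over immutable datasets can be illustrated with a toy class in plain Python. This is a sketch under stated assumptions and is not Spark's API: `LazyDataset` and its methods are hypothetical names, and `collect` additionally returns how many intermediate datasets it materialized so that the deferred work is visible.

```python
class LazyDataset:
    """Toy immutable dataset that records transformations until an action runs."""

    def __init__(self, source, ops=()):
        self._source = list(source)
        self._ops = tuple(ops)    # recorded transformations, in call order

    def map(self, fn):
        # A transformation returns a new dataset; nothing is computed yet.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def collect(self):
        """Action: replay the recorded transformations and return the result."""
        data = self._source
        materialized = 0
        for kind, fn in self._ops:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
            materialized += 1     # one derived dataset per transformation
        return data, materialized


raw = LazyDataset(range(5))
derived = raw.map(lambda x: x * 2).filter(lambda x: x > 2)
result, count = derived.collect()
print(result, count)
```

Each call to `map` or `filter` leaves the original dataset untouched, and the number of intermediate datasets equals the number of transformations recorded before the action.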

Before the operations are performed, a Directed Acyclic Graph (DAG) containing the lineage for the RDD may be generated. The lineage for the RDD contains data and metadata about the raw RDD and about the derived RDDs that the transformations and actions will create. Metadata and data about the raw RDD and the derived RDDs can therefore be obtained by scanning the DAG. More specifically, the lineage for the RDD may include data about the derived RDDs that are created when the transformation and action operations are performed, data about the raw RDD, and metadata indicating the relationships among the derived RDDs and the relationship between the raw RDD and the derived RDDs.

The main operations of the data processing unit (H1300) may include the following four operations.

1) Generating the lineage for an RDD

2) Creating RDDs by performing the actual operations according to the lineage for the RDD

3) Deciding whether to store an initially created RDD by persisting or by caching

4) Removing stored RDDs from memory through garbage collection

FIG. 2 is a diagram illustrating an embodiment of a process in which RDDs are created.

Referring to FIG. 2, a DAG is first generated before an action is performed (S21).

Next, the DAG scheduler creates an execution plan according to the generated DAG. At this point, nodes are grouped and stages are created (S22).

Next, the task scheduler creates tasks, taking narrow dependencies into account (S23). The task scheduler launches the tasks through the cluster manager.

The workers residing on the individual nodes then execute the tasks and create derived RDDs from the raw RDD (RR). The derived RDDs are created according to the lineage for the RDD. The RDD that results from performing the action is referred to as a heavy RDD (HR), and the RDD from which this heavy RDD is created may be referred to as the final RDD (last RDD, LR). The derived RDDs other than the final RDD and the heavy RDD may be referred to as general RDDs (GR). The tasks of a general RDD have a narrow dependency in relation to other RDDs. By contrast, the tasks of the final RDD and the heavy RDD have a wide dependency on each other.

The embodiment of the RDD creation process shown in FIG. 2 can also be regarded as representing the lineage for the RDD. FIG. 2 shows the process in which RDDs are created when an action is performed. However, because the lineage for the RDD contains metadata indicating the relationships among the derived RDDs that arise when an action operation is performed, what is shown in FIG. 2, and S23 in particular, can be regarded as a visual representation of the lineage for the RDD.

Referring to FIG. 2, the lineage for the RDD shows a raw RDD (RR) and derived RDDs, and the derived RDDs can be classified, as described above, into a final RDD (last RDD, LR), a heavy RDD (HR), and general RDDs (GR). Since RDDs are divided into raw RDDs and derived RDDs, an RDD can be classified as a raw RDD, a final RDD, a heavy RDD, or a general RDD. These four categories may be referred to as the type information of an RDD. In the example shown in the drawing, RDD5 is a heavy RDD, RDD4 is the final RDD, RDD2 and RDD3 are general RDDs, and RDD1 is the raw RDD.
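One way to assign these types from a lineage can be sketched as follows. The function `classify` and its rules are assumptions made for illustration: an RDD with no parent is treated as raw, a derived RDD with no child is treated as the heavy RDD produced by the action, the RDD feeding a heavy RDD is the final RDD, and everything else is general. The linear chain RDD1 to RDD5 is likewise an assumed shape for the example in FIG. 2.

```python
def classify(parents):
    """Assign an RDD type to every RDD in a lineage.

    parents maps each RDD id to the list of RDD ids it was derived from;
    every id that appears as a parent must also be a key of the mapping.
    """
    children = {rdd: [] for rdd in parents}
    for rdd, ps in parents.items():
        for p in ps:
            children[p].append(rdd)

    # A derived RDD with no children is taken to be the action's result.
    heavy = {r for r in parents if parents[r] and not children[r]}

    types = {}
    for rdd in parents:
        if not parents[rdd]:
            types[rdd] = "raw"
        elif rdd in heavy:
            types[rdd] = "heavy"
        elif any(c in heavy for c in children[rdd]):
            types[rdd] = "final"
        else:
            types[rdd] = "general"
    return types


lineage = {
    "RDD1": [],
    "RDD2": ["RDD1"],
    "RDD3": ["RDD2"],
    "RDD4": ["RDD3"],
    "RDD5": ["RDD4"],
}
print(classify(lineage))
```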

The storage unit (H1400) may store the RDDs received from the data processing unit. The storage unit (H1400) may cache the RDDs in memory or store them semi-permanently (persisting), and may also store RDDs removed from memory on a hard disk. According to one embodiment, the storage process of the storage unit (H1400) may be performed under the control of the data processing unit. For example, the data processing unit may generate a control command related to RDD storage and output it to the storage unit (H1400).

FIG. 3 is a diagram illustrating a real-time data processing process according to another embodiment of the present invention.

Referring to FIG. 3, a real-time data processing system according to another embodiment of the present invention includes an information collection unit (H3100), a data generation unit (H3200), a data processing unit (H3300), and a storage unit (H3400), and may further include an RDD management unit (H3500). Although the embodiment of FIG. 3 shows a real-time data processing process, the data processing system of the present invention may process non-real-time data as well as real-time data.

The information collection unit (H3100) may collect various kinds of information. It may collect data generated by sensors such as IoT sensors, infrared sensors, heat sensors, and CCTV. It may also collect information from databases containing real-time or non-real-time information, such as legacy systems or social networking services (SNS). This is as described above.

The data generation unit (H3200) may receive information from the information collection unit (H3100) and generate data. The data generated by the data generation unit (H3200) may be a real-time data stream or non-real-time data. The data generation unit (H3200) may output the generated data to the data processing unit (H3300). This is as described above.

The data processing unit (H3300) may receive data from the data generation unit (H3200), in real time or in non-real time. The data processing unit (H3300) may create a Resilient Distributed Dataset (RDD) based on the received data. The description of RDDs given for the real-time data processing process according to the first embodiment applies here unchanged, so it is not repeated.

The RDD management unit (H3500) can monitor and manage raw RDDs and derived RDDs, and can generate, store, and manage metadata about the RDDs.

According to one embodiment, the RDD management unit (H3500) may include a DAG scanning agent (H3510), an RDD lifetime monitoring agent (H3520), a metadata manager (H3530), and an RDD storage manager (H3540).

DAG Scanning Agent

The DAG scanning agent (H3510) can scan for DAGs, that is, it can determine whether a DAG has been generated. According to one embodiment, the DAG scanning agent (H3510) may monitor tasks to determine whether a DAG has been generated. When it finds a DAG, the DAG scanning agent (H3510) can read the lineage of the RDDs from the DAG and extract metadata from that lineage. The extracted metadata may include object information and RDD type information for each RDD. The object information may include identification information of the RDD and identification information of the DAG that contains the RDD. The RDD type information may indicate whether the RDD is a raw RDD or a derived RDD and, for a derived RDD, whether it is a last RDD (the RDD associated with an action), a heavy RDD (generated from the last RDD by an action operation), or a general RDD (any derived RDD other than a last RDD or a heavy RDD). The RDD type information can also be used to set the priority of the RDD.

According to one embodiment, the DAG scanning agent (H3510) can assign priorities to RDDs in the following order according to the RDD type information.

Priority 1: last RDD

Priority 2: heavy RDD

Priority 3: raw RDD

Priority 4: general RDD
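The priority order listed above can be sketched as a simple lookup table. This is only an illustration: the names `RddType`, `PRIORITY_BY_TYPE`, and `priority_of` are hypothetical and do not appear in the specification, and the two-letter type codes are the abbreviations used in the description of the metadata manager.

```python
from enum import Enum


class RddType(Enum):
    """RDD types named in the description (codes RR, LR, HR, GR)."""
    RAW = "RR"
    LAST = "LR"
    HEAVY = "HR"
    GENERAL = "GR"


# Priority order of this embodiment; a lower number means a higher priority.
PRIORITY_BY_TYPE = {
    RddType.LAST: 1,
    RddType.HEAVY: 2,
    RddType.RAW: 3,
    RddType.GENERAL: 4,
}


def priority_of(rdd_type: RddType) -> int:
    """Return the priority assigned to an RDD type (hypothetical helper)."""
    return PRIORITY_BY_TYPE[rdd_type]
```

Because the specification notes that other orderings are possible, keeping the mapping in one table makes it easy to swap in a different order.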

The assigned RDD priority can form part of the metadata for the RDD. That is, the metadata extracted from the DAG and the RDD priority together constitute the metadata for the RDD, which the DAG scanning agent (H3510) can output to the metadata manager (H3530) described later. The metadata manager (H3530) can then receive the metadata for the RDD and store it in storage.

In addition, the DAG scanning agent (H3510) can output the RDD priority to the RDD lifetime monitoring agent (H3520) described later. After receiving the RDD priority, the RDD lifetime monitoring agent (H3520) can begin monitoring the RDD.

RDD Lifetime Monitoring Agent

The RDD lifetime monitoring agent (H3520) can monitor the life cycle of each RDD.

According to one embodiment, the RDD lifetime monitoring agent (H3520) can begin monitoring RDDs after it receives the RDD priorities from the DAG scanning agent (H3510). Receiving an RDD priority can mean that operations on the RDD will soon actually be performed according to the DAG lineage, which is why monitoring starts at that point.

The RDD lifetime monitoring agent (H3520) monitors whether an operation (transformation or action) is performed on an RDD. When an operation is actually performed, it can output the life cycle information, storage location information, and the like for the RDD to the metadata manager (H3530) described later. The metadata manager (H3530) can receive this information and update the metadata accordingly. Because the RDD is actually created by the operation, the life cycle information output to the metadata manager (H3530) indicates the initial creation state (initial, I). The storage location information may include specific location information such as the memory address at which the RDD is to be stored.

In addition, the RDD lifetime monitoring agent (H3520) can monitor whether garbage collection is performed. When the data processing unit (H3300) performs garbage collection, the RDD lifetime monitoring agent (H3520) detects this and can output RDD life cycle information to the metadata manager (H3530) described later. The output information may include life cycle information for the RDDs selected as targets of garbage collection, and that life cycle information may indicate the garbage collection (GC) state. The metadata manager (H3530) can receive the RDD life cycle information and update the metadata for the RDDs. The metadata manager (H3530) can also use the priority information recorded in the metadata to change the life cycle information and storage location information for some of the RDDs selected as targets of garbage collection. Specifically, it can change the life cycle information and storage location information when the priority recorded in the metadata is equal to or higher than a preset priority. For example, if the preset priority is 2, the metadata manager (H3530) can change the life cycle information and storage location information for an RDD whose type is last RDD (priority 1) or heavy RDD (priority 2). As a result, that RDD may not be removed by garbage collection. The metadata manager (H3530) can output the changed information to the RDD storage manager (H3540) described later, and the RDD storage manager (H3540) can receive the changed information and store the RDD at another location. As a result, some of the RDDs selected as targets of garbage collection are stored elsewhere in advance and may not be deleted by garbage collection.
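The priority-based rescue described above might look roughly like the following sketch. The names `RddMeta`, `RESCUE_THRESHOLD`, and `rescue_before_gc` are hypothetical. The specification does not state which values the changed fields take, so the sketch assumes that the status stays GC and that the location of a rescued RDD becomes `"disk"`.

```python
from dataclasses import dataclass


@dataclass
class RddMeta:
    dag_id: int
    rdd_id: int
    priority: int   # 1 is the highest priority
    status: str     # "I", "P", "C" or "GC"
    location: str   # "memory" or "disk" (coarse, assumed values)


RESCUE_THRESHOLD = 2  # assumed preset priority, matching the example in the text


def rescue_before_gc(gc_candidates, threshold=RESCUE_THRESHOLD):
    """Mark every candidate as GC and relocate those at or above the threshold.

    Returns the rescued records so that a storage manager could copy the
    corresponding RDDs to their new location before they are collected.
    """
    rescued = []
    for meta in gc_candidates:
        meta.status = "GC"
        # "Equal to or higher than" priority 2 means priority number 1 or 2.
        if meta.priority <= threshold:
            meta.location = "disk"  # assumption: rescued RDDs move to disk
            rescued.append(meta)
    return rescued
```

A caller would pass the list of GC candidates reported by the monitoring agent and hand the returned records to the component that actually moves the data.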

Metadata Manager

The metadata manager (H3530) can receive metadata for RDDs and store, manage, and update it. The metadata may include the DAG ID (identifier), RDD ID, RDD type, priority, status, and storage location of an RDD. The DAG ID identifies each DAG, the RDD ID identifies each RDD, and the RDD type is one of the four RDD categories described above: raw RDD (RR), last RDD (LR), heavy RDD (HR), or general RDD (GR). The priority is the one set by the DAG scanning agent (H3510) and can be determined by the RDD type. In the embodiment described above, the last RDD has priority 1, the heavy RDD priority 2, the raw RDD priority 3, and the general RDD priority 4. This is only one embodiment, however, and the priorities may be assigned in a different order. The status indicates the state of the RDD in its life cycle. As described above, the status may be one of persist (P), initial (I), caching (C), and garbage collection (GC). The storage location indicates where the RDD is stored and, depending on ease of access, may be either memory or hard disk. If the life cycle state is persist, initial, or caching, the storage location of the RDD may be memory; if the life cycle state is garbage collection, the storage location may be hard disk.
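As an illustration of the metadata fields and of the status-to-location rule just described, a record could be modeled as below. The class name `RddMetadata` and the coarse `"memory"`/`"disk"` values are assumptions for the sketch; the specification also allows the location to carry a concrete address, which is omitted here.

```python
from dataclasses import dataclass

VALID_STATUSES = {"P", "I", "C", "GC"}  # persist, initial, caching, garbage collection


@dataclass
class RddMetadata:
    dag_id: int
    rdd_id: int
    rdd_type: str   # "RR", "LR", "HR" or "GR"
    priority: int
    status: str

    def __post_init__(self):
        if self.status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")

    @property
    def location(self) -> str:
        """Coarse location implied by the status (simplifying assumption)."""
        return "disk" if self.status == "GC" else "memory"
```

Deriving the location from the status keeps the two fields from drifting apart, at the cost of not representing cases where an implementation chooses a location independently of the status.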

The metadata manager (H3530) can receive the metadata for an RDD from the DAG scanning agent (H3510) described above, and can store and update it.

The metadata manager (H3530) can also receive storage information from the RDD storage manager (H3540) described later and update the metadata. The storage information may include information indicating whether the created RDD is to be stored as persisted or as cached, and information related to the storage location. The information related to the storage location may include information indicating whether the RDD is to be stored in memory or on hard disk, and information indicating the specific address.

The metadata manager (H3530) can also receive, from the RDD lifetime monitoring agent (H3520), information on the RDDs that are targets of garbage collection, and can update the status and other fields of those RDDs accordingly.

The metadata manager (H3530) may store the metadata in the storage unit (H3400) described later, or in a separate database. The stored metadata may be a processed form of the metadata extracted from the lineage of the RDD. The metadata extracted from the lineage can be processed by the DAG scanning agent (H3510) or by the metadata manager (H3530) described above.

The metadata manager (H3530) can output metadata to the RDD storage manager (H3540) described later, and can receive from the RDD storage manager (H3540) information indicating whether an RDD is stored as persisted or as cached, together with information on where the RDD is stored.

FIG. 4 is a diagram illustrating RDD metadata according to an embodiment of the present invention.

Referring to FIG. 4, the RDD metadata includes DAG_ID, RDD_ID, RDD_Type, Priority, Status, and Location, which are shown in the first row. DAG_ID in the first column is the DAG ID described above, RDD_ID in the second column is the RDD ID, RDD_Type in the third column is the RDD type, Priority in the fourth column is the priority, Status in the fifth column is the status, and Location in the sixth column is the storage location.

The metadata in the second row indicates that the RDD has DAG ID 1 and RDD ID 1, that its type is general RDD (GR) and its priority is therefore 4, and that its status is initial (I) and its storage location is therefore memory.

The metadata in the third row indicates that the RDD has DAG ID 1 and RDD ID 2, that its type is last RDD (LR) and its priority is therefore 1, and that its status is initial (I) and its storage location is therefore memory.

The metadata in the fourth row indicates that the RDD has DAG ID 1 and RDD ID 3, that its type is heavy RDD (HR) and its priority is therefore 2, and that its status is initial (I) and its storage location is therefore memory.

The metadata in the fifth row indicates that the RDD has DAG ID 1 and RDD ID 4, that its type is raw RDD (RR) and its priority is therefore 3, and that its status is initial (I) and its storage location is therefore memory.

The metadata in the sixth row indicates that the RDD has DAG ID 2 and RDD ID 1, that its type is general RDD (GR) and its priority is therefore 4, and that its status is garbage collection (GC), so the RDD may be stored in a hard-disk-based file system, a big data ecosystem-based store, or the like.

The metadata in the seventh row indicates that the RDD has DAG ID 2 and RDD ID 2, that its type is general RDD (GR) and its priority is therefore 4, and that its status is caching (C) and its storage location is therefore memory.

The metadata in the eighth row indicates that the RDD has DAG ID 2 and RDD ID 3, that its type is heavy RDD (HR) and its priority is therefore 2, and that its status is caching (C) and its storage location is therefore memory.

The metadata in the ninth row indicates that the RDD has DAG ID 2 and RDD ID 4, that its type is raw RDD (RR) and its priority is therefore 3, and that its status is persist (P) and its storage location is therefore memory.

The metadata in the tenth row indicates that the RDD has DAG ID 2 and RDD ID 5, that its type is last RDD (LR) and its priority is therefore 1, and that its status is caching (C) and its storage location is therefore memory.
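For reference, the nine data rows of FIG. 4 as described above can be written out as plain records. The tuple layout and the name `FIG4_ROWS` are illustrative only, and `"disk"` is used as shorthand for the hard-disk-based or big data ecosystem-based storage mentioned for the sixth row.

```python
# Columns: DAG_ID, RDD_ID, RDD_Type, Priority, Status, Location
FIG4_ROWS = [
    (1, 1, "GR", 4, "I", "memory"),
    (1, 2, "LR", 1, "I", "memory"),
    (1, 3, "HR", 2, "I", "memory"),
    (1, 4, "RR", 3, "I", "memory"),
    (2, 1, "GR", 4, "GC", "disk"),
    (2, 2, "GR", 4, "C", "memory"),
    (2, 3, "HR", 2, "C", "memory"),
    (2, 4, "RR", 3, "P", "memory"),
    (2, 5, "LR", 1, "C", "memory"),
]

# Rows ordered from highest priority (1) to lowest (4); sorted() is stable,
# so rows with equal priority keep their order from the figure.
rows_by_priority = sorted(FIG4_ROWS, key=lambda row: row[3])

# Rows whose RDD is currently held in memory.
rows_in_memory = [row for row in FIG4_ROWS if row[5] == "memory"]
```

Note that the RDD ID restarts at 1 within each DAG, so a row is identified only by the pair of DAG ID and RDD ID.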

Meanwhile, the metadata manager (H3530) can store and update the metadata for RDDs received from the DAG scanning agent (H3510). As one example, the metadata manager (H3530) may receive the metadata extracted from the DAG and the priority information from the DAG scanning agent (H3510) and store them. As another example, when an operation (transformation or action) is actually performed and an RDD is created, the metadata manager (H3530) can receive the life cycle information for the created RDD and update the metadata. As another example, before garbage collection is performed, the metadata manager (H3530) can receive RDD life cycle information and update the metadata for the RDD. As another example, before garbage collection is performed, the metadata manager (H3530) can use the priority information recorded in the metadata to change and update the life cycle information of RDDs selected as targets of garbage collection. As yet another example, the metadata manager (H3530) can receive, from the RDD storage manager (H3540) described later, information on whether an initially created RDD will have the persist life cycle or the caching life cycle, and update the metadata accordingly.

RDD Storage Manager

The RDD storage manager (H3540) can manage the RDDs stored in the storage unit (H3400).

The RDD storage manager (H3540) can receive at least part of the metadata from the metadata manager (H3530). As one example, it can receive information identifying an RDD and information on the storage location of the RDD. The information identifying an RDD may include the DAG ID and RDD ID described above; that is, an RDD can be identified by the combination of its DAG ID and RDD ID. The RDD storage manager (H3540) can use the identifying information and the storage location information to manage the location information of the created RDDs.
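Identifying an RDD by the combination of DAG ID and RDD ID amounts to using a composite key. The following sketch shows one way a location table could be keyed that way; `LocationRegistry` and its methods are hypothetical names, not part of the specification, and the location strings are made-up examples.

```python
class LocationRegistry:
    """Hypothetical location table keyed by (DAG ID, RDD ID)."""

    def __init__(self):
        self._locations = {}

    def put(self, dag_id: int, rdd_id: int, location: str) -> None:
        # The tuple is the composite key; RDD IDs alone may repeat across DAGs.
        self._locations[(dag_id, rdd_id)] = location

    def get(self, dag_id: int, rdd_id: int) -> str:
        return self._locations[(dag_id, rdd_id)]


registry = LocationRegistry()
registry.put(1, 1, "memory")
registry.put(2, 1, "disk")  # same RDD ID, different DAG: a distinct entry
```

With a key of RDD ID alone, the second `put` would have overwritten the first entry.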

The RDD storage manager (H3540) can also receive information indicating whether an initially created RDD is to be stored and managed as persisted or as cached. According to one embodiment, it receives this information from the data processing unit (H3300), and it can additionally receive information on the specific storage location from the data processing unit (H3300). After receiving the information, the RDD storage manager (H3540) can update the storage location information accordingly. It can then output the life cycle information of the RDD (in particular, persist or caching) and the updated storage location information to the metadata manager (H3530). The metadata manager (H3530) can receive this information from the RDD storage manager (H3540) and update the metadata for the RDD, including the life cycle information and storage location information.

For example, suppose the data processing unit (H3300) decides to store an initially created RDD #i as persisted. The RDD storage manager (H3540) can then receive, from the data processing unit (H3300), information indicating that RDD #i is to be stored and managed as persisted, together with information on the specific storage location. Having received the information that RDD #i is stored in persisted form at the given location, the RDD storage manager (H3540) can update the storage location information. It can then output the life cycle information of RDD #i and the updated storage location information to the metadata manager (H3530). The metadata manager (H3530) can receive this information and update the metadata for RDD #i. Specifically, the metadata manager (H3530) can update the life cycle information for RDD #i to persist and update the specific storage location information.
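The RDD #i example above can be traced with a small sketch of the two components. All class and method names here (`MetadataManager.update`, `RddStorageManager.on_storage_decision`) and the sample location string are hypothetical; the specification describes only the information exchanged, not an interface.

```python
class MetadataManager:
    """Keeps one metadata record per (DAG ID, RDD ID) key."""

    def __init__(self):
        self.records = {}

    def update(self, key, status, location):
        record = self.records.setdefault(key, {})
        record["status"] = status
        record["location"] = location


class RddStorageManager:
    """Tracks RDD locations and forwards changes to the metadata manager."""

    def __init__(self, metadata_manager):
        self.metadata_manager = metadata_manager
        self.locations = {}

    def on_storage_decision(self, key, mode, location):
        # mode is "P" (persist) or "C" (caching), as decided by the
        # data processing unit in the description.
        if mode not in ("P", "C"):
            raise ValueError(f"unsupported storage mode: {mode!r}")
        self.locations[key] = location  # update location info first
        self.metadata_manager.update(key, status=mode, location=location)


metadata_manager = MetadataManager()
storage_manager = RddStorageManager(metadata_manager)
# RDD #i, here assumed to be RDD 4 of DAG 2, is to be persisted in memory.
storage_manager.on_storage_decision((2, 4), "P", "memory:0x1000")
```

After the call, the metadata record for the RDD carries status `"P"` and the reported location, mirroring the final step of the example.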

In addition, before garbage collection is performed, the RDD storage manager (H3540) can store some of the RDDs selected as targets of garbage collection at another storage location. It receives the changed information from the metadata manager (H3530) and, based on it, can store the RDD at another location. Here, the metadata manager (H3530) can use the priority information recorded in the metadata to change the life cycle information and storage location information for some of the RDDs selected as targets of garbage collection. This is as described above.

The storage unit (H3400) can store the RDDs received from the data processing unit (H3300). The storage unit (H3400) can cache the RDDs in memory or store them semi-permanently (persist), and can optionally store them on hard disk. According to one embodiment, the storage process of the storage unit (H3400) may be performed under the control of the data processing unit (H3300). For example, the data processing unit (H3300) may generate a control command related to RDD storage and output it to the storage unit (H3400). The storage unit (H3400) may also cache RDDs in memory or store them semi-permanently according to a control command from the RDD storage manager (H3540). As one example, according to a control command from the RDD storage manager (H3540), the storage unit (H3400) can additionally store some of the RDDs selected for garbage collection at a separate storage location, or move them to a separate storage location.

Hereinafter, a series of processes for extracting metadata from an RDD lineage, initially generating an RDD, selectively storing an RDD, and garbage-collecting an RDD will be described with reference to the drawings. In the following drawings, the Spark application and the DAG scheduler may be components of the data processing unit H3300 described above. The DAG scanning agent H3510, the RDD lifetime monitoring agent H3520, the metadata manager H3530, and the RDD storage manager H3540 may be included in the RDD management unit H3500, and the foregoing description applies to them as it stands.

FIG. 5 is a flowchart illustrating a metadata extraction process according to an embodiment of the present invention.

First, the Spark application creates an execution plan for data processing and outputs it to the DAG scheduler. The DAG scheduler receives the execution plan and generates a DAG according to it. The DAG scheduler also outputs a standby command to the RDD lifetime monitoring agent H3520, which receives the command and remains in a standby state.

The DAG scanning agent H3510 scans for the DAG. When the DAG scheduler generates a DAG according to the schedule, the DAG scanning agent H3510 can detect it. Having found the DAG, the DAG scanning agent H3510 extracts operation-related metadata from the lineage of the RDDs included in the DAG and assigns a priority to each RDD according to its type. The DAG scanning agent H3510 outputs the priority information to the RDD lifetime monitoring agent H3520. The DAG scanning agent H3510 also outputs the metadata for the RDD to the RDD lifetime monitoring agent H3520. Here, the metadata for the RDD includes the priority information and the metadata extracted from the lineage for the RDD. After receiving the priority information, the RDD lifetime monitoring agent H3520 monitors whether the RDD is actually generated according to its lineage. The metadata manager H3530 stores the metadata for the RDD received from the DAG scanning agent H3510.
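The scanning step can be illustrated with a minimal sketch. It assumes the lineage is available as a list of nodes that already carry an RDD type; the node layout, the `scan_dag` function, and the numeric priority values are hypothetical, and only the relative ordering (final, heavy, raw, general) comes from the description of the priority information elsewhere in this document.

```python
from dataclasses import dataclass

# Relative ordering follows the description: final > heavy > raw > general.
# The numeric values are illustrative assumptions, not part of the patent.
PRIORITY_BY_TYPE = {"final": 4, "heavy": 3, "raw": 2, "general": 1}


@dataclass
class RddMetadata:
    dag_id: str
    rdd_id: str
    rdd_type: str
    operation: str
    parents: tuple
    priority: int


def scan_dag(dag_id, lineage):
    """Emit one metadata record per RDD found in a lineage description.

    `lineage` is a hypothetical structure: a list of dicts with the keys
    "rdd_id", "type", "operation", and "parents".
    """
    records = []
    for node in lineage:
        records.append(
            RddMetadata(
                dag_id=dag_id,
                rdd_id=node["rdd_id"],
                rdd_type=node["type"],
                operation=node["operation"],
                parents=tuple(node["parents"]),
                # Priority is derived purely from the RDD type.
                priority=PRIORITY_BY_TYPE[node["type"]],
            )
        )
    return records


lineage = [
    {"rdd_id": "rdd0", "type": "raw", "operation": "textFile", "parents": []},
    {"rdd_id": "rdd1", "type": "general", "operation": "map", "parents": ["rdd0"]},
    {"rdd_id": "rdd2", "type": "final", "operation": "reduceByKey", "parents": ["rdd1"]},
]
records = scan_dag("dag-1", lineage)
```

In this sketch the records would then be handed to whatever component stores metadata; how RDD types are determined in the first place is outside its scope.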

FIG. 6 is a flowchart illustrating an initial RDD generation process according to an embodiment of the present invention.

This figure shows the process by which an RDD is actually generated based on its lineage. First, the DAG scheduler outputs an operation execution command to the Spark application according to the schedule. The Spark application receives the command from the DAG scheduler and performs the operation. That is, the Spark application performs at least one transformation related to an action and finally performs the action, thereby initially generating the actual RDD. The RDD lifetime monitoring agent H3520, which is monitoring whether the RDD is generated, detects that the actual RDD has been generated and outputs information on the life cycle of the RDDs to the metadata manager H3530. The metadata manager H3530 may then update the life cycle of the generated RDD. Because the RDD has only just been generated, the metadata manager H3530 may update its life cycle to the initial generation state (initial, I).
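The notification path from the monitoring agent to the metadata manager might look like the following sketch. The class and method names are hypothetical, and the way a real agent would observe RDD creation inside Spark is not specified here; the sketch simply assumes some caller invokes `on_rdd_created`.

```python
class MetadataManager:
    """Hypothetical holder of per-RDD life cycle states."""

    def __init__(self):
        self.lifecycle = {}

    def update_lifecycle(self, rdd_id, state):
        self.lifecycle[rdd_id] = state


class LifetimeMonitoringAgent:
    """Waits for RDDs announced by the lineage and reports their creation."""

    def __init__(self, manager, expected_rdd_ids):
        self.manager = manager
        # RDDs that the lineage says will be generated but are not yet seen.
        self.pending = set(expected_rdd_ids)

    def on_rdd_created(self, rdd_id):
        # Ignore RDDs that were not announced by the scanned lineage.
        if rdd_id not in self.pending:
            return False
        self.pending.discard(rdd_id)
        # A freshly generated RDD enters the initial state "I".
        self.manager.update_lifecycle(rdd_id, "I")
        return True


manager = MetadataManager()
agent = LifetimeMonitoringAgent(manager, ["rdd1", "rdd2"])
handled = agent.on_rdd_created("rdd2")
ignored = agent.on_rdd_created("rdd9")
```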

FIG. 7 is a flowchart illustrating a selective RDD storage process according to an embodiment of the present invention.

This figure shows the process of deciding whether an initially generated RDD is to be persisted or cached. First, the Spark application decides whether to persist or cache the initially generated RDD. The RDD storage manager H3540 receives information indicating whether the initially generated RDD is to be persisted or cached, and may additionally receive information on the specific storage location where the RDD is to be stored. Based on the received information, the RDD storage manager H3540 caches or persists the RDD at the specified storage location. The RDD storage manager H3540 then outputs the life cycle information of the RDD and the information on its specific storage location to the metadata manager H3530. Here, the life cycle information may correspond to persist or cache. The metadata manager H3530 may receive this information from the RDD storage manager H3540 and update the metadata.
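As a rough illustration of this flow, the sketch below stores an RDD according to a cache-or-persist directive and then reports the resulting life cycle and location to a metadata dictionary. The class name, the two in-process dictionaries standing in for memory and semi-permanent storage, and the default location strings are all assumptions made for the example.

```python
class RddStorageManager:
    """Hypothetical storage manager that records where each RDD was placed."""

    def __init__(self, metadata):
        self.metadata = metadata  # rdd_id -> metadata record (dict)
        self.memory = {}          # stand-in for the in-memory cache
        self.persistent = {}      # stand-in for semi-permanent storage

    def store(self, rdd_id, data, mode, location=None):
        if mode == "cache":
            target, default_location = self.memory, "memory"
        elif mode == "persist":
            target, default_location = self.persistent, "persistent-store"
        else:
            raise ValueError(f"unknown storage mode: {mode!r}")
        target[rdd_id] = data
        # Report life cycle and storage location back to the metadata.
        record = self.metadata.setdefault(rdd_id, {})
        record.update(lifecycle=mode, location=location or default_location)


metadata = {"rdd1": {"lifecycle": "I"}, "rdd2": {"lifecycle": "I"}}
storage = RddStorageManager(metadata)
storage.store("rdd1", [1, 2, 3], mode="cache")
storage.store("rdd2", [4, 5, 6], mode="persist", location="/data/rdd2")
```

Note that in Apache Spark's own RDD API, `cache()` is shorthand for `persist()` with the default memory-only storage level; the sketch follows this document's usage, which treats caching and persisting as two distinct choices.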

FIG. 8 is a flowchart illustrating a garbage collection process according to an embodiment of the present invention.

This figure shows the process by which garbage collection is performed. First, the DAG scheduler outputs a garbage collection execution command to the Spark application according to the schedule. The Spark application receives the command from the DAG scheduler and then performs the operations for garbage collection. The RDD lifetime monitoring agent H3520 monitors whether garbage collection is being performed and, when garbage collection is detected, outputs the life cycle information of the RDDs selected as garbage collection targets to the metadata manager H3530. The metadata manager H3530 receives the RDD life cycle information and updates the metadata for the RDDs. At this point, the metadata manager H3530 may update the life cycle information of the RDDs selected as garbage collection targets to garbage collection (GC). In addition, the metadata manager H3530 may use the priority information already stored in the metadata to further update the life cycle information and storage location information of some of the RDDs selected as garbage collection targets. The metadata manager H3530 outputs the updated information to the RDD storage manager H3540. The RDD storage manager H3540 receives the updated information and stores the RDDs based on it.

In one embodiment, the metadata manager H3530 may change the life cycle information and the storage location information when the priority stored in the metadata is equal to or higher than a preset priority. That is, when a high-priority RDD is selected as a garbage collection target, the metadata manager H3530 may change its life cycle information and storage location information so that the RDD is not removed. The metadata manager H3530 may also output the changed life cycle information and storage location information to the RDD storage manager H3540. Based on the changed information, the RDD storage manager H3540 may additionally store the high-priority RDD in another storage location. According to this embodiment, a high-priority RDD is stored in a separate storage location and is therefore not deleted even when garbage collection is performed. A separately stored RDD can be identified through object identification information such as the combination of a DAG ID and an RDD ID. Thus, according to this embodiment, important RDDs can be retained in memory, and accessibility to important RDDs can be improved.
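The selection rule in this embodiment can be sketched as a single pass over the garbage collection targets. The function name, the record layout, the threshold value, and the choice of "persist" as the new life cycle state for a rescued RDD are assumptions for illustration; the document only says that the life cycle and storage location information are changed.

```python
def rescue_before_gc(metadata, gc_targets, threshold, rescue_location):
    """Mark GC targets and redirect high-priority ones to another location.

    `metadata` maps (dag_id, rdd_id) keys to dicts with "priority",
    "lifecycle", and "location" fields (a hypothetical layout).
    Returns the keys of the RDDs that should be copied before GC runs.
    """
    rescued = []
    for key in gc_targets:
        record = metadata[key]
        if record["priority"] >= threshold:
            # High-priority RDD: change its life cycle and storage location
            # so that a copy survives the collection.
            record["lifecycle"] = "persist"  # assumed post-rescue state
            record["location"] = rescue_location
            rescued.append(key)
        else:
            # Everything else is simply marked as garbage-collected.
            record["lifecycle"] = "GC"
    return rescued


metadata = {
    ("dag-1", "rdd1"): {"priority": 1, "lifecycle": "cache", "location": "memory"},
    ("dag-1", "rdd2"): {"priority": 4, "lifecycle": "cache", "location": "memory"},
}
rescued = rescue_before_gc(
    metadata,
    gc_targets=[("dag-1", "rdd1"), ("dag-1", "rdd2")],
    threshold=3,
    rescue_location="/backup/rdd",
)
```

Keying the records by the (DAG ID, RDD ID) pair mirrors the object identification information mentioned above, so a rescued RDD can be looked up again after the collection.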

FIG. 9 is a flowchart illustrating an RDD management method according to an embodiment of the present invention.

Each step of the RDD management method according to an embodiment of the present invention may be performed by the component(s) of the RDD management apparatus described above.

The method first extracts metadata from the lineage for the RDD (s9100).

The method then stores the extracted metadata (s9200). Next, when an operation related to the lineage for the RDD is performed, the method updates the stored metadata (s9300). Here, the stored metadata includes RDD object information and RDD type information. The RDD object information may include a DAG identifier and an RDD identifier, and the RDD type information may indicate any one of a final RDD, a heavy RDD, a general RDD, and a raw RDD.
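The store and update steps can be pictured with a small in-memory store. This is only a sketch under assumptions: the `MetadataStore` class, its method names, and the `status` field are hypothetical, while the (DAG identifier, RDD identifier) key and the four RDD types are taken from the description.

```python
from enum import Enum


class RddType(Enum):
    FINAL = "final"
    HEAVY = "heavy"
    GENERAL = "general"
    RAW = "raw"


class MetadataStore:
    """Hypothetical in-memory store keyed by (DAG identifier, RDD identifier)."""

    def __init__(self):
        self._records = {}

    def store(self, dag_id, rdd_id, rdd_type):
        # Corresponds to s9200: keep object information and type information.
        # RddType(...) raises ValueError for a type outside the four above.
        self._records[(dag_id, rdd_id)] = {"type": RddType(rdd_type), "status": None}

    def update(self, dag_id, rdd_id, **changes):
        # Corresponds to s9300: merge new fields into an existing record.
        self._records[(dag_id, rdd_id)].update(changes)

    def get(self, dag_id, rdd_id):
        return self._records[(dag_id, rdd_id)]


store = MetadataStore()
store.store("dag-1", "rdd2", "final")
store.update("dag-1", "rdd2", status="I")
```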

The updating step may correspond to updating status information that indicates the life cycle of the RDDs generated as a result of performing an operation related to the lineage for the RDD. That is, when an operation on an RDD is actually performed and an RDD is generated, the status information may be updated to indicate that the life cycle of the RDD is initial generation.

Optionally, the method may further perform the following steps. First, the method receives information indicating whether at least one of the RDDs generated as a result of performing the operation is to be cached in memory or persisted. Then, the method stores the at least one RDD according to the indicating information. Next, the method further updates the stored metadata based on the indicating information and the storage location information of the stored RDD. That is, the method may store RDDs in the initial generation state (initial, I) separately according to whether they are to be cached or persisted, and may manage the metadata for the separately stored RDDs.

Optionally, the stored metadata may further include priority information and RDD storage location information, where the priority information is set depending on the RDD type information and the RDD storage location information indicates where the RDDs are stored. Here, the priority may decrease in the order of final RDD, heavy RDD, raw RDD, and general RDD.

Optionally, the method may further perform the following steps. First, the method detects garbage collection of the RDDs generated as a result of performing an operation related to the lineage for the RDD. Then, the method further updates the stored metadata by changing the status information of those generated RDDs that have been selected as garbage collection targets. Next, among the RDDs selected as garbage collection targets, the method changes the status information and RDD storage location information of each RDD whose priority is higher than a preset priority, further updating the stored metadata, and stores each such RDD based on the changed RDD storage location information. After that, the detected garbage collection may be performed. In other words, the method can separately store and manage the more important, that is, higher-priority, RDDs among those selected as garbage collection targets.

Each step of the RDD management method according to an embodiment of the present invention described above may be implemented in software, and the resulting program may be stored in a storage medium.

Each component constituting a module, unit, or apparatus may be a processor that executes a sequence of procedures stored in a memory (or storage unit). Each step described in the above embodiments may be performed by hardware or processors. Each unit described in the above embodiments may operate as hardware or as a processor. The methods presented by the present invention may also be executed as code. This code can be written to a processor-readable storage medium and can therefore be read by a processor provided by an apparatus.

Although the drawings have been described separately for convenience of explanation, the embodiments described in the respective drawings may be combined to implement a new embodiment. Designing a computer-readable recording medium on which a program for executing the previously described embodiments is recorded, according to the needs of a person skilled in the art, also falls within the scope of the present invention.

The apparatus and method according to the present invention are not limited in application to the configurations and methods of the embodiments described above; rather, all or part of each embodiment may be selectively combined so that various modifications can be made.

Meanwhile, the method proposed by the present invention can be implemented as processor-readable code on a processor-readable recording medium provided in a network device. The processor-readable recording medium includes all kinds of recording devices in which data readable by a processor is stored. Examples of processor-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The processor-readable recording medium may also be distributed over computer systems connected by a network, so that processor-readable code is stored and executed in a distributed manner.

Furthermore, although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described. Those of ordinary skill in the art to which the invention pertains can of course make various modifications without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical idea or outlook of the present invention.

In this specification, both a product invention and a method invention are described, and the descriptions of the two may be applied to supplement each other as needed.

It will be understood by those skilled in the art that various changes and modifications can be made to the present invention without departing from its spirit or scope. Accordingly, the present invention is intended to cover such changes and modifications provided they come within the scope of the appended claims and their equivalents.

Both apparatus and method inventions are described in this specification, and the descriptions of the apparatus and method inventions may be applied to complement each other.

H1100, H3100: Information collection unit
H1200, H3200: Data generation unit
H1300, H3300: Data processing unit
H1400, H3400: Storage unit
H3500: RDD management unit
H3510: DAG scanning agent
H3520: RDD lifetime monitoring agent
H3530: Metadata manager
H3540: RDD storage manager

Claims

1. A method of managing a Resilient Distributed Dataset (RDD) for data collection, analysis, and processing, the method comprising:
extracting, by a processor, metadata from a lineage for the RDD;
storing the metadata extracted in the extracting step; and
updating, by the processor, the stored metadata when an operation related to the lineage for the RDD is performed,
wherein the stored metadata includes RDD object information and RDD type information.

2. The method of claim 1,
wherein the updating comprises updating status information indicating a life cycle of RDDs generated as a result of performing an operation related to the lineage for the RDD.

3. The method of claim 2, further comprising:
receiving, by the processor, information indicating whether at least one RDD among the RDDs generated as a result of performing the operation is to be cached in memory or persisted;
storing, by the processor, the at least one RDD in accordance with the indicating information; and
further updating, by the processor, the stored metadata based on the indicating information and storage location information of the stored at least one RDD.

4. The method of claim 1,
wherein the stored metadata further includes priority information and RDD storage location information, and
wherein the priority information is set depending on the RDD type information, and the RDD storage location information indicates the location where RDDs are stored.

5. The method of claim 4,
wherein the stored metadata further includes status information indicating a life cycle of RDDs generated as a result of performing an operation related to the lineage for the RDD,
the method further comprising:
detecting, by the processor, garbage collection of the RDDs generated as a result of performing the operation related to the lineage for the RDD;
further updating, by the processor, the stored metadata by changing the status information associated with RDDs selected as garbage collection targets among the generated RDDs; and
further updating, by the processor, the stored metadata by changing the status information and the RDD storage location information of an RDD whose priority is higher than a predetermined priority among the RDDs selected as garbage collection targets, and storing the RDD whose priority information precedes the predetermined priority based on the changed RDD storage location information.

6. The method of claim 5,
wherein the garbage collection detected in the detecting step is performed after the RDD whose priority information precedes the predetermined priority has been stored based on the changed RDD storage location information.

7. The method of claim 1,
wherein the RDD type information indicates any one of a final RDD, a heavy RDD, a general RDD, and a raw RDD.

8. The method of claim 6,
wherein the priority information indicates a priority that decreases in the order of a final RDD, a heavy RDD, a raw RDD, and a general RDD.

9. An apparatus for managing a Resilient Distributed Dataset (RDD) for data collection, analysis, and processing, the apparatus comprising:
a DAG scanning agent that scans a Directed Acyclic Graph (DAG) including a lineage for the RDD and extracts metadata from the lineage for the RDD; and
a metadata manager that stores the metadata extracted from the lineage for the RDD and updates the stored metadata when an operation related to the lineage for the RDD is performed,
wherein the stored metadata includes RDD object information and RDD type information.

10. The apparatus of claim 9,
wherein the metadata manager updates status information indicating a life cycle of RDDs generated as a result of performing an operation related to the lineage for the RDD.

11. The apparatus of claim 10, further comprising:
an RDD storage manager that receives information indicating whether at least one RDD among the RDDs generated as a result of performing the operation is to be cached in memory or persisted, and stores the at least one RDD in accordance with the indicating information,
wherein the metadata manager further updates the stored metadata based on the indicating information and storage location information of the stored at least one RDD.

12. The apparatus of claim 11,
wherein the stored metadata further includes priority information and RDD storage location information, and
wherein the priority information is set depending on the RDD type information, and the RDD storage location information indicates the location where RDDs are stored.

13. The apparatus of claim 12,
wherein the stored metadata further includes status information indicating a life cycle of RDDs generated as a result of performing an operation related to the lineage for the RDD,
the apparatus further comprising:
an RDD lifetime monitoring agent that detects garbage collection of the RDDs generated as a result of performing the operation related to the lineage for the RDD,
wherein the metadata manager further updates the stored metadata by changing the status information associated with RDDs selected as garbage collection targets among the generated RDDs,
wherein the metadata manager further updates the stored metadata by changing the status information and the RDD storage location information of an RDD whose priority is higher than a predetermined priority among the RDDs selected as garbage collection targets, and
wherein the RDD storage manager stores the RDD whose priority information precedes the predetermined priority based on the changed RDD storage location information.

14. A storage medium storing a computer-readable program for managing a Resilient Distributed Dataset (RDD) for data collection, analysis, and processing,
wherein the program extracts metadata from a lineage for the RDD, stores the extracted metadata, and updates the stored metadata when an operation related to the lineage for the RDD is performed, and wherein the stored metadata includes RDD object information and RDD type information.