KR20200084441A

KR20200084441A - Automatic build apparatus and method of application for generating training data set of machine learning

Info

Publication number: KR20200084441A
Application number: KR1020180169672A
Authority: KR
Inventors: 조성제; 이명건; 정재민
Original assignee: 단국대학교 산학협력단
Priority date: 2018-12-26
Filing date: 2018-12-26
Publication date: 2020-07-13
Also published as: KR102167767B1

Abstract

The present invention relates to an automated build apparatus that automatically generates a dataset of sample applications for training machine learning that performs malware detection, classification, and the like, and that automatically resolves problems arising in the build process during dataset generation. The automated build apparatus comprises: a storage unit that stores a plurality of open source data for machine learning training; a setting unit that performs system setting, including identifying the build system type of the open source data stored in the storage unit, setting suitable environment or option information, and applying required libraries; and a building unit that builds the open source data with the environment or options set by the setting unit to generate an application dataset.

Description

Automated application build apparatus and method for generating machine learning training datasets {AUTOMATIC BUILD APPARATUS AND METHOD OF APPLICATION FOR GENERATING TRAINING DATA SET OF MACHINE LEARNING}

The present invention relates to an automated application build apparatus and method for generating machine learning training datasets, and more specifically to an automated build apparatus and method that automatically generate a dataset of sample applications for training machine learning that performs malware detection, classification, and the like, and that automatically resolve problems arising in the build process during dataset generation.

With the advancement of computing devices, automation systems based on machine learning are emerging as a next-generation technology.

Meanwhile, as many kinds of malware, including malware that applies concealment techniques, have appeared, efforts to use machine learning for malware detection and classification continue. Types of malware include viruses, worms, Trojan horses, backdoors, logic bombs, bots, adware, spyware, and ransomware.

For machine learning to function properly, training must be carried out beforehand, and training requires a vast training dataset. The training dataset is so important that it can be regarded as the factor that determines machine learning performance.

Machine learning for malware detection or classification requires both applications that are not infected with malware and applications that are infected.

Proper training and testing are possible only if the administrator can identify which applications in the training dataset are infected with malware. Applications obtained online, for example from an app store, may include ones infected with malware, so the conventional division of applications into infected and uninfected ones may not be fully reliable. If an infected application is learned as an uninfected one, the detection rate of the machine learning for the related malware may remain degraded.

Open source applications, whose source can be reviewed, are therefore preferable for the training dataset. Open source projects, however, are written for a variety of build systems, and the detailed options and the libraries called differ from application to application. When open source is built by hand, dozens of build errors may have to be resolved before a single executable application is obtained.

Implementing a reliable machine learning algorithm, however, requires thousands to millions of training samples. Manually generating a training dataset for machine learning that performs malware detection and classification is therefore close to impossible in practice.

Korean Registered Patent Publication No. 10-1880628

The present invention has been proposed to solve the conventional problems described above. An object of the present invention is to provide an automated build apparatus and method that automatically generate a dataset of sample applications for training machine learning that performs malware detection, classification, and the like, and that automatically resolve problems arising in the build process during dataset generation.

To achieve the above object, an automated application build apparatus for generating machine learning training datasets according to the technical idea of the present invention comprises: a storage unit in which a plurality of open source data for machine learning training is stored; a setting unit that performs system setting, including identifying the build system type of the open source data stored in the storage unit, setting suitable environment or option information, and applying required libraries; and a building unit that builds the open source data with the environment or options set by the setting unit to generate an application dataset.

The apparatus may further comprise a concealment unit that conceals the open source data by applying an obfuscation technique or a packing technique.

The apparatus may further comprise a signature unit that adds a digital signature when the building unit builds the open source data.

The apparatus may further comprise a malware insertion unit that inserts, into the open source data, training malware code required for machine learning training.

The setting unit may comprise: a build analysis unit that identifies the build system type of the open source data and generates missing files required for the build; an environment adjustment unit that sets environment or option information suitable for the open source data; a library configuration unit that applies the libraries called by the open source data; and a correction unit that corrects syntax errors in the open source data and brings its syntax up to date.

The machine learning may be an algorithm for malware detection or classification of Android applications, the open source data may be open source of Android applications, and the build analysis unit may identify which of Eclipse, Android Studio, and Gradle the build system of the open source data corresponds to.

The library configuration unit may include a library database for searching for the libraries required by the open source data.

When the open source data contains connection information for an Android server that is no longer valid, the library configuration unit may update the connection information to that of the corresponding current Android server.

The correction unit may perform syntax error correction and syntax updating on build-related files including local.properties, build.xml, and gradle-wrapper.properties.

To achieve the above object, an automated application build method for generating machine learning training datasets according to the technical idea of the present invention comprises: (A) storing, by a storage unit, a plurality of open source data for machine learning training; (B) performing, by a setting unit, system setting that includes identifying the build system type of the open source data stored in the storage unit, setting suitable environment or option information, and applying required libraries; and (C) building, by a building unit, the open source data with the environment or options set by the setting unit to generate an application dataset.

The method may further comprise, before step (C), concealing the open source data by a concealment unit applying an obfuscation technique or a packing technique.

The method may further comprise, before step (C), adding a digital signature to the open source data by a signature unit.

The method may further comprise, before step (C), inserting, by a malware insertion unit, training malware code required for machine learning training into the open source data.

The automated application build apparatus and method for generating machine learning training datasets according to the present invention provide the following effects.

First, a dataset of applications for training machine learning that performs malware detection, classification, and the like is generated automatically, so sufficient training of the machine learning can be supported.

Second, the environment setup and library application required to build open source are carried out automatically, and erroneous code is corrected automatically, so an application dataset can be generated quickly.

Third, open source can be concealed in various ways and concealed malware can be generated efficiently, so training of machine learning that detects concealed malware can also be supported.

Fourth, because open source can be built with training malware code inserted, it is easy to prepare a dataset with a sufficient number of malware-infected applications.

FIG. 1 is a configuration diagram of an automated application build apparatus for generating machine learning training datasets according to an embodiment of the present invention.
FIG. 2 is a flowchart of an automated application build method for generating machine learning training datasets according to an embodiment of the present invention.

An automated application build apparatus and method for generating machine learning training datasets according to embodiments of the present invention are described in detail below with reference to the accompanying drawings. The present invention may be modified in various ways and may take various forms; specific embodiments are illustrated in the drawings and described in detail in the text. This is not intended to limit the present invention to a particular disclosed form, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope. Similar reference numerals are used for similar components throughout the drawings.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention pertains. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in the present application.

The present invention can be used without restriction on various platforms, such as Windows, Android, Linux, and iOS environments. The Android environment is used as an example to describe the invention, but this does not limit the application environment of the invention to Android.

The automated application build apparatus 100 for generating machine learning training datasets according to an embodiment of the present invention may be installed and operated on a single computer device, or may be installed in a distributed manner on a plurality of computer devices that are connected to one another by wire or wirelessly and operate together.

Referring to FIG. 1, the automated application build apparatus 100 for generating machine learning training datasets according to an embodiment of the present invention comprises: a storage unit 110 in which a plurality of open source data for machine learning training is stored; a setting unit 140 that performs system setting, including identifying the build system type of the open source data stored in the storage unit 110, setting suitable environment and option information, and applying required libraries; and a building unit 180 that builds the open source data with the environment or options set by the setting unit 140 to generate an application dataset.

The open source data includes applications whose source code has been made public.

The storage unit 110 may be operated by a database management application, or may be managed by creating a directory for each open source project.

The list generation unit 120 compiles a list of the open source data stored in the storage unit 110, recording for each item its unique name, its app store category, the functions it includes, the APIs it uses, the permissions it declares, and so on.
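As an illustration of the kind of record such a list might hold, the following minimal sketch reads the declared permissions from an Android manifest. The `CatalogEntry` class, the `read_permissions` helper, and the sample manifest are hypothetical and are not taken from the patent; only the `uses-permission` element and the Android XML namespace are standard.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Standard XML namespace used for android:* attributes in AndroidManifest.xml.
ANDROID_NS = "http://schemas.android.com/apk/res/android"


@dataclass
class CatalogEntry:
    """Hypothetical list record for one open source application."""
    name: str
    category: str
    permissions: list[str] = field(default_factory=list)


def read_permissions(manifest_xml: str) -> list[str]:
    """Return the sorted android:name values of all <uses-permission> elements."""
    root = ET.fromstring(manifest_xml)
    key = f"{{{ANDROID_NS}}}name"
    return sorted(
        elem.attrib[key]
        for elem in root.iter("uses-permission")
        if key in elem.attrib
    )


# Made-up manifest used only for this example.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="org.example.notes">
  <uses-permission android:name="android.permission.INTERNET"/>
  <uses-permission android:name="android.permission.CAMERA"/>
  <application android:label="Notes"/>
</manifest>"""

entry = CatalogEntry(
    name="org.example.notes",
    category="Writing",
    permissions=read_permissions(SAMPLE_MANIFEST),
)
```

Functions and APIs used would need source-level analysis and are left out of this sketch.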

The description below assumes that the open source data of this embodiment is open source of Android applications and that the machine learning is an algorithm for malware detection or classification of Android applications.

The storage unit 110 collects and stores open source of Android applications published online or offline, for example on F-Droid.

The setting unit 140 may comprise: a build analysis unit 142 that identifies the build system type of the open source data and generates missing files required for the build; an environment adjustment unit 143 that sets environment or option information suitable for the open source data; a library configuration unit 145 that applies the libraries called by the open source data; and a correction unit 144 that corrects syntax errors in the open source data and brings its syntax up to date.

When the open source data is an Android application, the build analysis unit 142 identifies which of Eclipse, Android Studio, and Gradle the build system corresponds to.

The build analysis unit 142 also automatically generates missing files required for the build, such as build.xml and local.properties.
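For local.properties, generating the missing file can be as small as writing the `sdk.dir` key that Android builds read to locate the SDK. The sketch below assumes a hypothetical helper name and an example SDK path; it leaves an existing file untouched.

```python
import tempfile
from pathlib import Path


def ensure_local_properties(project_dir: Path, sdk_dir: Path) -> bool:
    """Create local.properties with sdk.dir if it is missing.

    Returns True when the file was created and False when it already existed.
    """
    target = project_dir / "local.properties"
    if target.exists():
        return False
    # Forward slashes avoid backslash escaping issues in .properties files.
    target.write_text(f"sdk.dir={sdk_dir.as_posix()}\n", encoding="utf-8")
    return True


# Example run in a throwaway directory; the SDK path is an assumption.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    created_first = ensure_local_properties(project, Path("/opt/android-sdk"))
    created_again = ensure_local_properties(project, Path("/opt/android-sdk"))
    content = (project / "local.properties").read_text(encoding="utf-8")
```

Regenerating build.xml for Ant-based projects is more involved and is not covered here.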

The environment adjustment unit 143 sets environment or option information suitable for the open source data. The environment or options include the Android operating system version (Android 4.1~4.3 Jelly Bean, Android 4.4 KitKat, Android 5.0~5.1 Lollipop, Android 6.0 Marshmallow, Android 7 Nougat, Android 8 Oreo, Android 9 Pie), compiler version information, the obfuscation or packing tools to be applied, and compile optimization or obfuscation options. Starting with Android 6.0 Marshmallow, for example, the runtime permission model applies, so an application is built differently from earlier versions.
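The version dependence mentioned above can be expressed as a small lookup. The sketch below maps the listed Android releases to their API levels and checks whether the runtime permission model (introduced at API level 23) applies. The table layout and function name are illustrative assumptions.

```python
# Android release -> API level for the versions listed in the text.
API_LEVELS = {
    "4.1": 16, "4.2": 17, "4.3": 18,   # Jelly Bean
    "4.4": 19,                          # KitKat
    "5.0": 21, "5.1": 22,               # Lollipop
    "6.0": 23,                          # Marshmallow
    "7.0": 24, "7.1": 25,               # Nougat
    "8.0": 26, "8.1": 27,               # Oreo
    "9": 28,                            # Pie
}

# Runtime permissions were introduced with Android 6.0 (API level 23).
RUNTIME_PERMISSIONS_SINCE = 23


def uses_runtime_permissions(target_version: str) -> bool:
    """Return True if the given target release uses the runtime permission model."""
    return API_LEVELS[target_version] >= RUNTIME_PERMISSIONS_SINCE


marshmallow = uses_runtime_permissions("6.0")
lollipop = uses_runtime_permissions("5.1")
```

An environment adjustment step could use such a check to decide which build variant of an application to produce.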

The library configuration unit 145 applies the libraries required by the open source data. Examples of such libraries include Facebook, Gson, Flurry Analytics, Bolts, Crashlytics, OKHttp, Nine Old Androids, Picasso, and Retrofit.

When the build analysis unit 142 determines that the build system of the open source data is Eclipse or Android Studio, the required libraries are automatically downloaded from the Google library and installed.

When the build system is Gradle, the library configuration unit 145 automatically downloads and installs any library included in the Google library. If a library is a user library not included in the Google library and the open source data declares no repositories, the library configuration unit 145 searches its built-in library database for that library. To this end, the library configuration unit 145 includes a library database for searching for the libraries required by the open source data. Various known user libraries are stored in the library database in advance.
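The lookup order described for Gradle projects can be sketched as a simple decision function. Everything named here is an assumption for illustration: the Google group prefixes, the contents of `LIBRARY_DB`, the returned labels, and the choice to defer to the project's own repositories when they are declared (which the text implies but does not state).

```python
# Maven group prefixes treated as "Google library" in this sketch (assumed).
GOOGLE_PREFIXES = ("com.android.support", "com.google.android.gms")

# Hypothetical built-in library database: coordinate -> stored artifact.
LIBRARY_DB = {
    "com.squareup.picasso:picasso": "libs/picasso.jar",
    "com.google.code.gson:gson": "libs/gson.jar",
}


def resolve_library(coordinate: str, project_has_repositories: bool) -> str:
    """Decide where a required library should come from."""
    if coordinate.startswith(GOOGLE_PREFIXES):
        return "download-from-google"
    if project_has_repositories:
        # Assumed: the project's declared repositories can supply the library.
        return "use-project-repositories"
    if coordinate in LIBRARY_DB:
        return f"library-db:{LIBRARY_DB[coordinate]}"
    return "unresolved"


google = resolve_library("com.android.support:appcompat-v7", False)
from_db = resolve_library("com.squareup.picasso:picasso", False)
missing = resolve_library("org.example:unknown-lib", False)
```

An unresolved result marks a project that still needs manual attention or a database update.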

Old open source data may also contain connection information for an Android server that is no longer valid. When the open source data contains such information, the library configuration unit 145 updates the connection information to that of the corresponding current Android server.

The correction unit 144 corrects syntax errors in the open source data and brings its syntax up to date. In the Android environment, the correction unit 144 performs syntax error correction and syntax updating on build-related files including local.properties, build.xml, and gradle-wrapper.properties.

The correction unit 144 automatically fixes errors and problems that may occur during the build.

For example, it checks the directory structure and file layout of the Android project. If the repositories declared in the open source data are an old version, it changes them to the latest version and fixes the related errors as well. When the build system is Gradle, it corrects source code that may be partly wrong. If the Gradle version in use is too old for the build to succeed, it changes it to the latest Gradle version.
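One concrete form of the Gradle version correction is rewriting the `distributionUrl` line in gradle-wrapper.properties. The sketch below is an assumption about how this could be done; the function name, the sample file content, and the target version 4.10.1 are chosen only for illustration.

```python
import re

# Matches the version inside a wrapper distribution URL such as
# .../gradle-2.14.1-all.zip, keeping the text before and after it.
_WRAPPER_URL = re.compile(
    r"^(distributionUrl=.*/gradle-)\d[\d.]*(-(?:bin|all)\.zip)",
    re.MULTILINE,
)


def update_wrapper_version(properties_text: str, new_version: str) -> str:
    """Return the properties text with the Gradle distribution version replaced."""
    return _WRAPPER_URL.sub(
        lambda m: f"{m.group(1)}{new_version}{m.group(2)}",
        properties_text,
    )


# Sample file content (made up); the backslash before ':' is normal
# .properties escaping.
OLD_PROPERTIES = "\n".join([
    "distributionBase=GRADLE_USER_HOME",
    r"distributionUrl=https\://services.gradle.org/distributions/gradle-2.14.1-all.zip",
    "",
])

new_properties = update_wrapper_version(OLD_PROPERTIES, "4.10.1")
```

A callable is used as the replacement so that the backslash in the captured URL is copied literally instead of being read as a regex escape.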

The building unit 180 builds the open source data with the type of build system identified by the build analysis unit 142. To this end, the building unit 180 may rearrange the build structure of the open source data to match the build system.

This embodiment further includes a concealment unit 150 that conceals the open source data by applying a code obfuscation technique or a packing technique.

Code obfuscation is a technique that makes source code or executable code difficult to interpret. Depending on its target, obfuscation is broadly divided into source code obfuscation and binary obfuscation. Source code obfuscation transforms source code into a form that is hard to read, and binary obfuscation transforms the binary produced by compilation so that it is hard to analyze by reverse engineering. Obfuscation techniques include layout obfuscation, control obfuscation (also called control flow obfuscation), data obfuscation, and preventive transformations.

Packing techniques include transformations such as encryption, encoding, and compression. Packing includes operations that encrypt, encode, or compress the source code to make it difficult to analyze, and operations that artificially insert unnecessary code. With encryption, the application is encrypted with a specific key for distribution or storage and must be decrypted with that key when needed before it can run normally. Powerful packing tools apply packing tens to hundreds of times and also embed anti-debugging techniques within the code, making analysis difficult.

When malware is included in an application in a concealed state, obfuscated or packed, machine learning trained on an unconcealed application dataset may fail to detect or classify that malware. When the concealment unit 150 obfuscates or packs the open source data, however, application datasets concealed in a variety of ways can be prepared from a single item of open source data. Machine learning trained on an obfuscated or packed application dataset achieves better malware detection and classification performance on concealed applications.

This embodiment further includes a signature unit 160 that adds a digital signature when the building unit 180 builds the open source data.

This embodiment may further include a malware insertion unit 170 that inserts, into the open source data, training malware code required for machine learning training. The malware insertion unit 170 holds the source code of previously published malware.

Training machine learning for malware detection and classification also requires a sufficiently large dataset of malware-infected applications. Collecting malware-infected applications or their open source data, however, is harder than collecting uninfected applications or open source.

The malware insertion unit 170 can therefore deliberately generate malware-infected applications by additionally inserting the source code of previously published malware into a plurality of open source data. Malware exists in types such as viruses, worms, Trojan horses, backdoors, logic bombs, bots, adware, spyware, and ransomware, and each type has tens to thousands of instances. Inserting the source code of these various malware into a single open source project yields at least a thousand malware-infected applications. Because the storage unit 110 of this embodiment stores a plurality of open source data, applying each kind of malware source code to each item of open source data makes it possible to prepare a dataset of hundreds of millions of malware-infected applications.
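The size estimate above is a simple product: each open source project can be paired with each malware sample. The sketch below only counts and enumerates label pairs; the project names, sample names, and the figures 100,000 and 1,000 are illustrative assumptions, not numbers given in the patent.

```python
from itertools import product


def infected_variant_count(num_projects: int, num_malware_samples: int) -> int:
    """Number of (project, malware sample) combinations."""
    return num_projects * num_malware_samples


# Placeholder identifiers only; no code is attached to these names.
projects = ["app_a", "app_b", "app_c"]
samples = ["sample_1", "sample_2"]
pairs = list(product(projects, samples))

# Assumed scale: 100,000 projects and 1,000 samples give 10**8 combinations.
large_scale = infected_variant_count(100_000, 1_000)
```

Under these assumed figures the product reaches one hundred million, which matches the order of magnitude stated in the text.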

The malware insertion unit 170 inserts the malware source code at a location in the open source data that corresponds to an execution region.

The malware insertion unit 170 labels each item of open source data into which malware source code has been deliberately inserted with the malware that was applied. The label may include information such as the name and type of the malware. The administrator can thereby select only the malware-infected applications from among the applications built in this embodiment, and can also select the kinds of malware needed.
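The labelling and selection described above can be sketched as plain metadata records. The `SampleLabel` class, its field names, and the example values are hypothetical; the patent only states that a label may carry the malware name and type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SampleLabel:
    """Hypothetical label attached to one built application."""
    app_name: str
    malware_name: Optional[str] = None   # None means no malware was inserted
    malware_type: Optional[str] = None


def select_infected(
    labels: list[SampleLabel], malware_type: Optional[str] = None
) -> list[SampleLabel]:
    """Return infected samples, optionally restricted to one malware type."""
    return [
        label
        for label in labels
        if label.malware_name is not None
        and (malware_type is None or label.malware_type == malware_type)
    ]


# Made-up labels for illustration.
labels = [
    SampleLabel("app_a"),
    SampleLabel("app_b", "sample_1", "spyware"),
    SampleLabel("app_c", "sample_2", "ransomware"),
]

infected = select_infected(labels)
spyware_only = select_infected(labels, "spyware")
```

Keeping the label separate from the built artifact lets the same records serve as ground truth during training and testing.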

특히, 멀웨어삽입부(170)가 오픈소스 데이터에 멀웨어 코드를 삽입하고, 은폐부(150)가 해당 오픈소스 데이터를 은폐화 하면 무한대에 가까운 유형의 애플리케이션 데이터셋을 생성하는 것이 가능하게 된다.In particular, when the malware insertion unit 170 inserts malware code into the open source data, and the hiding unit 150 conceals the corresponding open source data, it becomes possible to generate an application dataset of a type close to infinity.

이어서, 본 발명의 실시예에 따른 머신러닝의 학습 데이터셋 생성을 위한 애플리케이션 자동화 빌드 방법을 설명한다.Next, an application automation build method for generating a learning dataset of machine learning according to an embodiment of the present invention will be described.

도 2를 참조하면, 이 실시예는 저장부(110)가 머신러닝 학습을 위한 다수의 오픈소스 데이터를 저장하는 단계(S110); 설정부(140)가 상기 저장부(110)에 저장된 오픈소스 데이터의 빌드시스템 종류 식별, 적합한 환경 또는 ·옵션 정보 설정, 필요 라이브러리 적용이 포함된 시스템설정을 실시하는 단계(S120); 및 구축부(180)가 오픈소스 데이터를 설정부(140)가 시스템설정한 환경 또는 옵션으로 빌드하여 애플리케이션 데이터셋을 생성하는 단계(S160)를 포함한다.Referring to FIG. 2, this embodiment includes the step of storing, in the storage unit 110, a plurality of open source data for machine learning (S110); The setting unit 140 identifies the build system type of the open source data stored in the storage unit 110, sets a suitable environment or option information, and sets a system including application of necessary libraries (S120); And a step in which the building unit 180 builds the open source data in an environment or option set by the setting unit 140 in a system setting to generate an application data set (S160).

또한 이 실시예는 S160 단계 전, 멀웨어삽입부(170)가 오픈소스 데이터에 머신러닝 학습에 필요한 학습용 멀웨어 코드를 삽입하는 단계(S130)를 더 포함한다.In addition, this embodiment further includes a step S130 of inserting the malware for learning necessary for learning machine learning into the open source data by the malware inserter 170 before step S160.

This embodiment further includes, before step S160, a step (S140) in which the concealment unit 150 conceals the open source data by applying an obfuscation technique or a packing technique. Step S140 is preferably performed after step S130 so that the malware code is also concealed.

This embodiment further includes, before step S160, a step (S150) in which the signature unit 160 adds an electronic signature to the open source data.

After step S160, the output unit 190 provides the generated application dataset to the training device 200, which trains the machine learning included in the malware detection and classification device 300, or stores the dataset in a separate repository for later use in a suitable environment (S170).
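The ordering of steps S110 to S170 can be summarized, purely as an illustrative sketch, by a small planner that emits the step codes in the order they would run. The function names `plan_steps` and `run` and the option names are hypothetical. The sketch assumes that the optional steps follow their numerical order (S130, then S140, then S150); the description above states only that S140 preferably follows S130 and that all three precede S160.

```python
from typing import Dict, List


def plan_steps(insert_malware: bool = False,
               conceal: bool = False,
               sign: bool = False) -> List[str]:
    """Return the step codes in execution order for the chosen options."""
    steps = ["S110", "S120"]      # store open source data, then system setting
    if insert_malware:
        steps.append("S130")      # insert training malware code
    if conceal:
        steps.append("S140")      # obfuscate or pack (after S130 so malware is concealed too)
    if sign:
        steps.append("S150")      # add electronic signature
    steps += ["S160", "S170"]     # build, then provide or store the dataset
    return steps


def run(project_name: str, **options: bool) -> Dict[str, object]:
    """Walk the plan and record each step; real work is replaced by a placeholder."""
    state: Dict[str, object] = {"name": project_name, "applied": []}
    for code in plan_steps(**options):
        state["applied"].append(code)  # placeholder for the unit that performs the step
    return state


result = run("sample-project", insert_malware=True, conceal=True, sign=True)
```

Separating the plan from its execution makes the ordering constraint easy to inspect before any build is attempted.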

Although preferred embodiments of the present invention have been described above, various changes, modifications, and equivalents may be used with the present invention. It is clear that the present invention can be applied in the same manner by appropriately modifying the above embodiments. Therefore, the above description does not limit the scope of the present invention, which is defined by the limits of the following claims.

100: application automation build device
110: storage unit 120: list generation unit
140: setting unit 142: build analysis unit
143: environment adjustment unit 144: correction unit
145: library configuration unit 150: concealment unit
160: signature unit 170: malware insertion unit
180: building unit 190: output unit
200: machine learning training device 300: malware detection and classification device

Claims

An application automation build device for generating a training dataset for machine learning, comprising:
a storage unit that stores a plurality of open source data for machine learning training;
a setting unit that performs system setting, including identifying a build system type of the open source data stored in the storage unit, setting suitable environment or option information, and applying required libraries; and
a building unit that builds the open source data with the environment or options set by the setting unit to generate an application dataset.

The application automation build device of claim 1,
further comprising a concealment unit that conceals the open source data by applying an obfuscation technique or a packing technique.

The application automation build device of claim 1,
further comprising a signature unit that adds an electronic signature when the building unit builds the open source data.

The application automation build device of claim 1,
further comprising a malware insertion unit that inserts, into the open source data, training malware code needed for machine learning training.

The application automation build device of claim 1, wherein the setting unit comprises:
a build analysis unit that identifies the build system type of the open source data and generates any missing files required for the build;
an environment adjustment unit that sets environment or option information suitable for the open source data;
a library configuration unit that applies the libraries called by the open source data; and
a correction unit that corrects grammar errors in, and updates the grammar of, the open source data.

The application automation build device of claim 5, wherein:
the machine learning is an algorithm for malware detection or classification of Android applications,
the open source data is open source code of Android applications, and
the build analysis unit identifies whether the build system of the open source data is Eclipse, Android Studio, or Gradle.

The application automation build device of claim 6,
wherein the library configuration unit includes a library database for searching for the libraries required by the open source data.

The application automation build device of claim 6,
wherein, when the open source data includes connection information for an invalid Android server, the library configuration unit updates the connection information to the latest Android server information.

The application automation build device of claim 6,
wherein the correction unit corrects grammar errors in, and updates the grammar of, build-related files including local.properties, build.xml, and gradle-wrapper.properties.

An application automation build method for generating a training dataset for machine learning, comprising the steps of:
(A) storing, by a storage unit, a plurality of open source data for machine learning training;
(B) performing, by a setting unit, system setting including identifying a build system type of the open source data stored in the storage unit, setting suitable environment or option information, and applying required libraries; and
(C) generating, by a building unit, an application dataset by building the open source data with the environment or options set by the setting unit.

The method of claim 10, further comprising, before step (C),
a step of concealing the open source data by applying an obfuscation technique or a packing technique.

The method of claim 10, further comprising, before step (C),
a step in which a signature unit adds an electronic signature to the open source data.

The method of claim 10, further comprising, before step (C),
a step in which a malware insertion unit inserts, into the open source data, training malware code needed for machine learning training.