KR101842267B1

KR101842267B1 - Apparatus for categorizing unknown softwares using mld and method thereof

Info

Publication number: KR101842267B1
Application number: KR1020160103835A
Authority: KR
Inventors: 조대희; 조성제
Original assignee: 단국대학교 산학협력단
Priority date: 2016-08-16
Filing date: 2016-08-16
Publication date: 2018-03-27
Also published as: KR20180019455A

Abstract

According to the present invention, in classifying software, an MLD value based on the APIs contained in the executable file of each software program is computed for a plurality of software programs that have not been pirated or plagiarized; a software filtering database in which the plurality of software programs are classified on the basis of the MLD values is generated; and the MLD value computed from the APIs of software to be classified is then compared against the filtering database, so that the category of the software to be classified is determined more accurately, thereby reducing the overhead of software filtering for detecting software piracy or plagiarism.

Description

APPARATUS FOR CATEGORIZING UNKNOWN SOFTWARES USING MLD AND METHOD THEREOF

The present invention relates to software classification and, more particularly, to an apparatus for generating MLD for software classification, a software classification apparatus using MLD, and methods thereof, in which an MLD (Machine Learning Data) value based on the APIs (Application Program Interfaces) contained in the executable file of each software program is computed for a plurality of software programs that have not been pirated or plagiarized, a software filtering database in which the plurality of software programs are classified on the basis of the MLD values is generated, and the MLD value computed from the APIs of software to be classified is then compared against the filtering database so that the category of the software to be classified is determined more accurately, thereby reducing the overhead of software filtering for detecting software piracy or plagiarism.

Today, with the development of the Internet and communication networks, the distribution and circulation of software have grown considerably, and the recent smartphone boom has created an environment in which many users can easily download and use software through application markets. This change has been a decisive factor in shifting the basis for product selection from hardware to software.

However, the development of communications such as the Internet has also led to a sharp increase in software piracy and plagiarism, which is hindering the growth of the software industry. For example, pirated programs and forged or tampered programs are distributed through various channels such as web hard services, torrents, black markets, blogs, and online cafes. Illegal software distributed in this way, including pirated programs, forged or tampered programs, and hacked mobile apps, has adverse effects such as falling sales for software developers and copyright holders and reduced software technology development, workforce training, and investment.

Here, software piracy means copying particular software as-is and distributing or using it, while software plagiarism/theft may mean appropriating all or part of the code of software through methods such as reverse engineering and using it.

Meanwhile, software filtering techniques are being introduced to detect and block such illegal software.

A software filtering technique is a method in which, when a suspicious program is uploaded or downloaded, the suspicious program is compared with an existing database (DB) and checked to see whether it is one of the programs registered in that database, thereby determining whether it is illegal software. However, because such conventional software filtering techniques simply compare the suspicious program one-to-one with every program registered in the database, filtering time increases.

Accordingly, techniques for classifying software have recently been proposed to reduce the overhead of software filtering, such as the increased filtering time caused by simple one-to-one comparison of a suspicious program with the programs registered in the database.

However, the software classification techniques proposed to date do not provide a clear criterion for assigning software to a particular category, so the reliability of software filtering based on software classification has not been secured.

(Patent Literature)

Korean Registered Patent No. 10-1604891 (registered March 14, 2016)

Accordingly, the present invention aims to provide an apparatus for generating MLD for software classification, a software classification apparatus using MLD, and methods thereof, in which an MLD value based on the APIs contained in the executable file of each software program is computed for a plurality of software programs that have not been pirated or plagiarized, a software filtering database in which the plurality of software programs are classified on the basis of the MLD values is generated, and the MLD value computed from the APIs of software to be classified is then compared against the filtering database so that the category of the software to be classified is determined more accurately, thereby reducing the overhead of software filtering for detecting software piracy or plagiarism.

The present invention described above is an apparatus for generating MLD for software classification, comprising: an input unit that receives a plurality of software programs whose categories have been classified in advance; an extraction unit that extracts API information for each of the plurality of software programs; and an MLD generation unit that uses the API information to generate, for each category, a first MLD value based on the reference frequency of each API included in the API information.

Further, the API information includes the file names of a plurality of APIs contained in the executable file constituting each software program, the reference frequency of each API, and the number of software programs that use each API.

Further, for the plurality of APIs included in each category, the MLD generation unit computes the first MLD value for each API by dividing the reference frequency of each API within the category by the number of software programs that use that API.

Further, when the frequency with which a specific API among the plurality of APIs is referenced by some of the software programs among all the software programs exceeds a predetermined threshold, the MLD generation unit excludes that specific API from the API information.

The present invention also provides a software classification apparatus comprising: an MLD DB that stores the per-API first MLD values for a plurality of category-classified software programs, generated by the generation apparatus according to any one of claims 1 to 4; a feature information calculation unit that, when software to be classified is recognized, extracts information on the file names of a plurality of APIs contained in the executable file constituting the software and on the reference frequency of each API; an MLD calculation unit that computes a second MLD value for each API of the software to be classified using the information on the file names of the plurality of APIs and the reference frequency of each API; and a category determination unit that compares the second MLD value computed by the MLD calculation unit with the per-category first MLD values stored in the MLD DB to determine the category to which the software to be classified belongs.

Further, the category determination unit compares the second MLD value of the software to be classified with the per-category first MLD values in the MLD DB and determines the category with the highest similarity between the first MLD value and the second MLD value as the category of the software to be classified.

Further, the feature information calculation unit extracts the file name of each API of the software to be classified and the reference frequency of each API from the header and sections of the software to be classified.

Further, the executable file is at least one of a Microsoft Windows EXE file, a Java bytecode file, or a Linux a.out file.

The present invention also provides a method of generating MLD for software classification, comprising: receiving a plurality of software programs whose categories have been classified in advance; extracting API information for each of the plurality of software programs; and using the API information to generate, for each category, a first MLD value based on the reference frequency of each API included in the API information.

Further, generating the first MLD value comprises: extracting, for the plurality of software programs, the file names of the plurality of APIs contained in each software program; computing the reference frequency of each API and the number of software programs that use each API; and computing the first MLD value for each API by dividing the reference frequency by the number of software programs.

The present invention also provides a software classification method comprising: generating an MLD DB that stores the per-API first MLD values for a plurality of category-classified software programs, generated by the generation apparatus according to any one of claims 1 to 4; when software to be classified is recognized, extracting feature information on the file names of a plurality of APIs contained in the executable file constituting the software to be classified and on the reference frequency of each API; computing a second MLD value for each API of the software to be classified using the information on the file names of the plurality of APIs and the reference frequency of each API; and comparing the second MLD value with the per-category first MLD values stored in the MLD DB to determine the category to which the software to be classified belongs.

Further, determining the category comprises: comparing the second MLD value of the software to be classified with the per-category first MLD values in the MLD DB; and determining the category with the highest similarity between the first MLD value and the second MLD value as the category of the software to be classified.

According to the present invention, in classifying software, an MLD value based on the APIs contained in the executable file of each software program is computed for a plurality of software programs that have not been pirated or plagiarized, a software filtering database in which the plurality of software programs are classified on the basis of the MLD values is generated, and the MLD value computed from the APIs of software to be classified is then compared against the filtering database so that the category of the software to be classified is determined more accurately, which has the advantage of reducing the overhead of software filtering for detecting software piracy or plagiarism.

FIG. 1 is a detailed block diagram of an apparatus for generating MLD for software classification according to an embodiment of the present invention.
FIG. 2 is an operation control flowchart for computing per-software MLD values in the apparatus for generating MLD for software classification according to an embodiment of the present invention.
FIG. 3 is an example of MLD values computed for APIs according to an embodiment of the present invention.
FIG. 4 is an example of MLD-based category classification according to an embodiment of the present invention.
FIG. 5 is a conceptual diagram for explaining an environment to which a software classification apparatus according to an embodiment of the present invention is applied.
FIG. 6 is a detailed block diagram of a software classification apparatus according to an embodiment of the present invention.
FIG. 7 is an operation control flowchart for classifying the category of software to be classified in the software classification apparatus according to an embodiment of the present invention.

Hereinafter, the operating principle of the present invention will be described in detail with reference to the accompanying drawings. In the following description, detailed explanations of well-known functions or configurations are omitted where they would unnecessarily obscure the subject matter of the present invention. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or practice of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

FIG. 1 shows a detailed block configuration of an apparatus for generating MLD for software classification according to an embodiment of the present invention, which may include an input unit 102, a feature information extraction unit 104, an MLD generation unit 106, and the like.

Hereinafter, the operation of each component of the apparatus 100 for generating MLD for software classification according to an embodiment of the present invention will be described in detail with reference to FIG. 1.

First, the input unit 102 receives a plurality of software programs whose categories have been classified in advance. This software serves as machine learning training software for software classification and is preferably, though not necessarily, software that has not been pirated or plagiarized. The software is classified in advance into a plurality of categories and may belong to any one of categories such as Audio Player, Browser, CD Writer, FTP, Image Viewer, Messenger, Text Editor, Video Player, and Zip.

The feature information extraction unit 104 extracts feature information for each of the pre-categorized software programs received from the input unit 102. The feature information may be, for example, API information contained in the executable file constituting the software, and this API information may include the file names of the plurality of APIs contained in the executable file constituting each software program, the reference frequency of each API, and the number of software programs that use each API. Here, the reference frequency may mean the number of times each API is called, but is not limited thereto.
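The three pieces of API information named above can be sketched as follows. This is a minimal illustration that assumes each program has already been reduced to a list of referenced API names; parsing the executable itself is not shown, and the names `SoftwareSample` and `api_info` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SoftwareSample:
    name: str
    category: str
    api_refs: list  # API names referenced by the executable, one entry per reference


def api_info(samples):
    """Return (reference frequency per API, number of programs using each API)."""
    ref_freq = Counter()
    user_count = Counter()
    for sample in samples:
        per_program = Counter(sample.api_refs)
        ref_freq.update(per_program)            # add this program's reference counts
        user_count.update(per_program.keys())   # count this program once per API it uses
    return ref_freq, user_count


# Hypothetical training samples for illustration only.
samples = [
    SoftwareSample("player_a", "Audio Player",
                   ["waveOutOpen", "waveOutWrite", "waveOutWrite"]),
    SoftwareSample("player_b", "Audio Player", ["waveOutOpen", "CreateFileW"]),
]
ref_freq, user_count = api_info(samples)
```

Keeping the two counters separate mirrors the text, which treats the reference frequency and the number of programs using an API as distinct quantities.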

The MLD (Machine Learning Data) generation unit 106 may generate an MLD value for each API of each software program using the per-software API information extracted by the feature information extraction unit 104. Here, the MLD value may mean information that allows the software to be classified, on the basis of its APIs, according to what kind of software it is. That is, the MLD value computed for each API may be a value indicating that the software can be classified as belonging to, for example, Audio Player, Browser, CD Writer, FTP, Image Viewer, Messenger, Text Editor, Video Player, or the like.

In computing the MLD values, the MLD generation unit 106 may compute the reference frequency of each API for each of the software programs and the number of other software programs that use each API, and may generate the MLD value for each API by dividing the reference frequency of each API by the number of software programs, as in [Equation 1] below.
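Equation 1 is not reproduced in this text. Read directly from the verbal description above, it amounts to a single ratio. The symbols below are notation assumed here for illustration only, and whether the software count is taken within one category or across all categories is not stated at this point.

```latex
\mathrm{MLD}(a) = \frac{f(a)}{n(a)}
```

Here $f(a)$ denotes the reference frequency of API $a$ and $n(a)$ the number of software programs that use $a$. For instance, an API referenced 8 times in total by 2 programs would have an MLD value of $8/2 = 4$.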

Further, when software is to be categorized using API-based MLD values as above, it is preferable, in order to improve the accuracy of the MLD, to use APIs that are used by roughly 40% or more of the software within a category. It is also preferable to use APIs contained in DLLs that are used by a relatively large number of programs, such as KERNEL32 and USER32 provided by Windows. When MLD is computed from API reference frequencies, a single attribute receives a large number of observed values; for software in the same category, observed values that are similar or identical may help improve the accuracy of the MLD values, although this is not a limitation. Meanwhile, APIs referenced by at least a preset percentage of all the software, such as 80% or 90% or more, are preferably excluded from the feature information, because an API referenced by 80% or 90% or more of all the software used for machine learning cannot be regarded as key feature information that distinguishes one category from another. In other words, an API referenced by most of the software contributes little to software classification and may preferably be excluded.
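The two percentage rules above can be combined into one filtering pass. The sketch below is an assumption-laden illustration: the function name `candidate_apis`, the default thresholds of 0.4 and 0.8, and the input layout are all chosen here and are not specified by the text beyond the approximate percentages.

```python
from collections import Counter, defaultdict


def candidate_apis(programs, min_category_share=0.4, max_global_share=0.8):
    """Return {category: set of APIs kept as candidate features}.

    programs: list of (category, set_of_api_names) pairs.
    """
    total = len(programs)
    global_users = Counter()               # programs using each API, all categories
    category_users = defaultdict(Counter)  # programs using each API, per category
    category_sizes = Counter()
    for category, apis in programs:
        category_sizes[category] += 1
        global_users.update(apis)
        category_users[category].update(apis)

    kept = {}
    for category, users in category_users.items():
        size = category_sizes[category]
        kept[category] = {
            api for api, n in users.items()
            if n / size >= min_category_share                  # common inside the category
            and global_users[api] / total < max_global_share   # not near-universal
        }
    return kept


# Hypothetical data: CreateFileW is used by every program and is dropped.
programs = [
    ("Browser", {"InternetOpenW", "CreateFileW"}),
    ("Browser", {"InternetOpenW", "CreateFileW"}),
    ("Zip", {"CreateFileW", "SetFileTime"}),
    ("Zip", {"CreateFileW"}),
]
kept = candidate_apis(programs)
```

In this example `CreateFileW` is referenced by all four programs, so it carries no information about the category and is removed, while the category-specific APIs survive.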

Further, the MLD generation unit 106 may select N APIs in each category in descending order of MLD value and, for the N APIs selected in the category to which each software program belongs, record the API reference frequency of each software program to generate a database.
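Selecting the N APIs with the largest MLD values per category might look like the following sketch. The function name `top_n_apis` and the sample values are hypothetical; the input is assumed to be a mapping from category to per-API MLD values.

```python
import heapq


def top_n_apis(mld_by_category, n):
    """Return {category: list of the n API names with the largest MLD values}."""
    return {
        category: heapq.nlargest(n, mld, key=mld.get)
        for category, mld in mld_by_category.items()
    }


# Hypothetical MLD values for illustration only.
mld_by_category = {
    "FTP": {"FtpGetFileW": 3.0, "connect": 1.5, "send": 2.0},
    "Zip": {"SetFileTime": 1.0, "ReadFile": 4.0},
}
selected = top_n_apis(mld_by_category, 2)
```

`heapq.nlargest` returns the names already ordered from largest to smallest MLD value, which matches the descending selection order described above.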

For example, if the set of software used for machine learning is divided into nine categories with 55 software programs in each category, the total number of software programs is 495. N may be variable and may be 300, 500, 700, 900, 1000, and so on; if N is unbounded, every API referenced at least once may become feature information. If N is 500, the 500 APIs with the largest MLD values in each category may be selected as feature information, and if none of the APIs selected in the different categories overlap, the total number of APIs selected as feature information is 500*9=4500. If N is 500 and APIs selected in one category overlap with those selected in other categories, the total number of APIs selected as feature information may be fewer than 4500. Letting M denote the total number of APIs selected as feature information, the relation N≤M holds and M≤4500, in which case the MLD values for all the software can be expressed as a matrix with 495 rows and M columns, where each row represents a software program and each column an API. The MLD values expressed as this matrix can be used as a database for classifying software, and this database is provided to the software classification apparatus described below, where it can serve as reference information for classifying software newly input to that apparatus.
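The row-per-program, column-per-API matrix can be sketched as below. This assumes, for illustration, that every program's reference count is recorded for all M columns (the text only states that counts are recorded for the APIs selected in the program's own category); `build_matrix` and the sample data are hypothetical.

```python
from collections import Counter


def build_matrix(programs, selected):
    """Build the matrix of API reference counts (rows: programs, columns: APIs).

    programs: list of (name, Counter mapping API name -> reference count)
    selected: {category: list of API names chosen for that category}
    """
    # Columns are the union of the per-category selections, so APIs chosen
    # in several categories collapse and M can be smaller than N * categories.
    columns = sorted({api for apis in selected.values() for api in apis})
    matrix = [[refs[api] for api in columns] for _, refs in programs]
    return columns, matrix


# Hypothetical example with N = 2 and two categories; "send" is shared.
selected = {"FTP": ["FtpGetFileW", "send"], "Messenger": ["send", "recv"]}
programs = [
    ("ftp_client", Counter({"FtpGetFileW": 4, "send": 2})),
    ("chat_app", Counter({"send": 7, "recv": 5, "CreateFileW": 1})),
]
columns, matrix = build_matrix(programs, selected)
```

With N = 2 and two categories, the shared API yields M = 3 columns, consistent with the bound N ≤ M ≤ N × (number of categories) described above.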

FIG. 2 illustrates MLD values computed for each API, and FIG. 3 illustrates MLD-based category classification data that allows each software program to be classified into a category using the MLD values computed for each API.

Here, the MLD DB 200 may store information on the MLD value of each API for each software category, as in FIG. 3. Each row in FIG. 3 may represent one program.

FIG. 4 shows an operation control flow for computing MLD values for a plurality of pre-categorized machine learning training software programs in the apparatus for generating MLD for software classification according to an embodiment of the present invention. Hereinafter, an embodiment of the present invention will be described in detail with reference to FIGS. 1 to 4.

First, the apparatus 100 for generating MLD for software classification receives, as machine learning training software, a plurality of categorized software programs that have not been pirated or plagiarized (S400).

Next, the apparatus 100 for generating MLD for software classification extracts the API information contained in the executable file of each of the received software programs (S402). The API information may be, for example, the file names of the APIs, the reference frequency of each API, and the number of software programs that use each API, but is not limited thereto.

Next, for the plurality of APIs of the software programs in each category, the apparatus 100 for generating MLD for software classification uses the reference frequency with which each API is referenced by other software and the number of software programs that use each API to generate an MLD value for each API by dividing the reference frequency by the number of software programs, as in [Equation 1] above (S404).
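Step S404 can be sketched in a few lines. The sketch assumes that both the reference frequency and the number of programs are counted within each category, which is one reading of the description rather than a confirmed detail; `compute_mld` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def compute_mld(programs):
    """Return {category: {api: MLD value}}.

    programs: list of (category, Counter mapping API name -> reference count).
    """
    ref_freq = defaultdict(Counter)  # total references per API, per category
    users = defaultdict(Counter)     # programs using each API, per category
    for category, refs in programs:
        ref_freq[category].update(refs)
        users[category].update(refs.keys())
    return {
        category: {
            api: ref_freq[category][api] / users[category][api]
            for api in ref_freq[category]
        }
        for category in ref_freq
    }


# Hypothetical training data for illustration only.
programs = [
    ("Audio Player", Counter({"waveOutWrite": 6, "waveOutOpen": 1})),
    ("Audio Player", Counter({"waveOutWrite": 2})),
    ("Zip", Counter({"ReadFile": 3})),
]
mld = compute_mld(programs)
```

Here `waveOutWrite` is referenced 8 times by 2 programs in its category, giving an MLD value of 4.0.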

In generating the MLD values, the apparatus 100 for generating MLD for software classification may select, for each category, the top N APIs with the largest MLD values (S406). The apparatus 100 may then record, for the N APIs selected in the category to which each software program belongs, the number of API references for each software program (S408). These API reference counts may serve as the MLD values representing the category characteristics of each software program.

Next, the apparatus 100 for generating MLD for software classification stores the MLD values generated as above for each API of all the software in each category in the MLD DB 200 as database information for software classification (S410).
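The storage step S410 could be realized in many ways. As one hedged possibility, the sketch below writes per-category MLD values into an in-memory SQLite table; the table name `mld`, its three-column schema, and the function `store_mld` are assumptions made here and are not described in the source.

```python
import sqlite3


def store_mld(conn, mld_by_category):
    """Persist {category: {api: MLD value}} into a simple three-column table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mld ("
        "category TEXT NOT NULL, api TEXT NOT NULL, value REAL NOT NULL, "
        "PRIMARY KEY (category, api))"
    )
    rows = [
        (category, api, value)
        for category, apis in mld_by_category.items()
        for api, value in apis.items()
    ]
    conn.executemany("INSERT OR REPLACE INTO mld VALUES (?, ?, ?)", rows)
    conn.commit()


# Hypothetical values for illustration only.
conn = sqlite3.connect(":memory:")
store_mld(conn, {"Zip": {"ReadFile": 3.0}, "FTP": {"send": 2.0, "FtpGetFileW": 3.0}})
stored = conn.execute(
    "SELECT api, value FROM mld WHERE category = ? ORDER BY api", ("FTP",)
).fetchall()
```

A classification apparatus could then load one category at a time with a query like the one shown.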

The MLD DB 200 is provided to the software classification apparatus described below and can serve as reference information for classifying software newly input to that apparatus.

FIG. 5 is a conceptual diagram for explaining an environment to which a software classification apparatus according to an embodiment of the present invention is applied.

Referring to FIG. 5, various kinds of software (20, 30, 40, ...) may be uploaded to a web hard 10 by multiple users. Here, the web hard 10 refers to a kind of cloud storage space to which multiple users can upload software or content and, regardless of the name "web hard", may be a concept covering various online storage spaces.

The software classification apparatus 500 according to an embodiment of the present invention can classify the category of each of the various target software programs (20, 30, 40, ...) uploaded to the web hard 10 at the upload stage.

In classifying the category of software as above, the software classification apparatus 500 may classify the category of the software to be classified using an MLD (Machine Learning Data) DB 200, which stores MLD values computed in advance for a plurality of categorized software programs using feature information such as the API information contained in their executable files.

That is, when software is uploaded from a user computer terminal to a cloud storage space such as the web hard 10, the software classification apparatus 500 may extract software feature information from the uploaded software and compare the extracted feature information with the MLD DB 200 to classify the software into a specific category.

The operation by which the software classification apparatus 500 classifies software into a specific category using the feature information of the classification target software is described in detail below with reference to FIG. 6, which shows the detailed blocks of the software classification apparatus and the operation of each block.

The feature information may be, for example, API (Application Program Interface) related information. When the category of software has been classified by the software classification apparatus 500, the categorized software may be compared, in a software piracy detection apparatus (not shown) that can interwork with the software classification apparatus 500, with legitimate software registered in advance for each category, so that it can be detected whether the program is an illegal copy or has been forged or tampered with.

Although the above description covers the classification of software uploaded from a user terminal such as a personal PC to a cloud storage space such as the web hard 10, the same software classification process may also be applied to software downloaded from the web hard 10.

In addition, the software classification apparatus 500 according to an embodiment of the present invention may be used not only to classify software uploaded to or downloaded from the web hard 10 into more accurate categories so as to facilitate detection of illegal copying or plagiarism, but also in a software classification process that facilitates such detection offline. For example, when a storage device such as a hard disk mounted in a computer under inspection is searched to detect the presence of illegally copied or plagiarized software, the software classification apparatus 500 may also be used to classify the software so that the target software can be examined more quickly.

FIG. 6 shows a detailed circuit configuration of a software classification apparatus that uses API feature information of an executable file according to an embodiment of the present invention. The apparatus may include a feature information extraction unit 510, an MLD calculation unit 520, a category determination unit 530, an MLD DB 200, and the like.

Hereinafter, the operation of each component of the software classification apparatus 500 is described in more detail with reference to FIG. 6.

First, when classification target software is recognized, the feature information extraction unit 510 may extract, as the feature information of the classification target software, partial information containing the core feature information of the executable file constituting that software. The core feature information of the executable file may be, for example, strings included in the executable file, file names of APIs (Application Programming Interfaces), and reference frequencies of the APIs.

That is, when classification target software is recognized, the feature information extraction unit 510 may extract, as the feature information of the classification target software, the file names of a plurality of APIs included in the executable file constituting that software and information on the reference frequency of each API, but the present invention is not limited thereto.

The executable file may take various forms depending on the operating platform of the target software. For example, the executable file may be an EXE file in the PE file format when the operating system (OS) of the operating platform is Microsoft Windows, a byte file when the runtime environment is Java, or an a.out file when the operating system is Linux. The MS Windows environment is described in detail here as a concrete example, but the description is also applicable to other runtime environments and operating systems.

The MS Windows PE file format may include references to dynamic libraries for linking, tables for API export and import, resource management data, thread local storage (TLS) data, and the like. A PE file also contains a large number of headers and sections that allow the dynamic linker to map the file into memory.

That is, when an executable file is loaded, the Windows loader loads all DLLs used by the application and maps them into the process address space, and it does so by means of the IAT (Import Address Table) of the executable file, to which the loader refers. Although reference to the IAT is described here as an example, such API information can generally be extracted from the headers or sections of the executable file.

Here, the feature information extraction unit 510 may identify the file names of the APIs of the classification target software by referring to the IAT of the executable file, and may determine the reference frequency of each API by comparing the addresses stored in the IAT with the information in the .text section, but the present invention is not limited thereto.
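The comparison of IAT addresses with the .text section can be illustrated with a minimal sketch. The description does not specify how the comparison is carried out, so the following makes several assumptions: the code is 32-bit x86, API calls appear as indirect calls encoded as the bytes FF 15 followed by a 4-byte little-endian IAT slot address, and the IAT has already been parsed into a mapping from slot address to API name. The function name `count_api_references` and all sample addresses are hypothetical; a practical tool would parse the PE headers and disassemble the code rather than scan raw bytes.

```python
import struct


def count_api_references(text_bytes, iat):
    """Count indirect calls through each IAT slot in a .text byte string.

    iat maps an IAT slot address (int) to an API name (str).
    Assumption: calls are encoded as FF 15 <abs32> (32-bit x86).
    """
    counts = {name: 0 for name in iat.values()}
    i = 0
    while i + 6 <= len(text_bytes):
        if text_bytes[i] == 0xFF and text_bytes[i + 1] == 0x15:
            (slot,) = struct.unpack_from("<I", text_bytes, i + 2)
            if slot in iat:
                counts[iat[slot]] += 1
                i += 6  # skip the whole 6-byte instruction
                continue
        i += 1
    return counts


def call_via(slot):
    """Build the bytes of 'call dword ptr [slot]' for the synthetic example."""
    return b"\xff\x15" + struct.pack("<I", slot)


# Synthetic IAT and .text contents, for illustration only.
sample_iat = {0x00402000: "CreateFileW", 0x00402004: "ReadFile", 0x00402008: "CloseHandle"}
sample_text = (b"\x90" + call_via(0x00402000) + b"\x55"
               + call_via(0x00402004) + call_via(0x00402004) + b"\xc3")
sample_counts = count_api_references(sample_text, sample_iat)
```

APIs that are imported but never called keep a count of zero, so every imported API still appears in the result.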

The classification target software described above may be software uploaded from a user terminal such as a personal PC to a cloud storage space such as the web hard 10, software downloaded from the web hard 10 or the like, or software stored in a storage device such as a hard disk mounted in a computer under inspection, but is not limited thereto. In addition, the classification target software may mean software for which it has not yet been confirmed whether it is an illegal copy or has been forged or tampered with, but is not limited thereto.

The MLD (Machine Learning Data) calculation unit 520 may calculate an MLD value for each API by using the file names of the plurality of APIs included in the executable file of the classification target software and information on the reference frequency of each API.

Here, the MLD value may mean information that makes it possible to classify, based on APIs, what kind of software a given software is. That is, the MLD values calculated for the respective APIs may be values indicating that the software can be classified as belonging to a category such as, for example, Audio Player, Browser, CD Writer, FTP, Image Viewer, Messenger, Text Editor, or Video Player.

In calculating the MLD value, the MLD calculation unit 520 may calculate the MLD value for each API by recording, for each of M APIs selected as feature information of the classification target software, the reference frequency with which the software references that API.
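As a rough illustration of recording reference frequencies for M selected APIs, the sketch below builds a fixed-length vector with one entry per selected API. The description does not give a formula for the MLD value, so treating it as the raw reference count per API is an assumption, and the function name `build_mld_vector` and the sample API names are hypothetical.

```python
def build_mld_vector(selected_apis, reference_counts):
    """Return one reference-frequency entry per selected API, in a fixed order.

    selected_apis: list of the M API names chosen as features.
    reference_counts: dict mapping API name -> reference frequency in the software.
    APIs the software never references get 0.
    """
    return [reference_counts.get(api, 0) for api in selected_apis]


# Illustrative values only: M = 4 selected APIs.
selected = ["waveOutWrite", "InternetOpenW", "CreateFileW", "ReadFile"]
counts = {"CreateFileW": 3, "ReadFile": 7, "waveOutWrite": 12}
mld_vector = build_mld_vector(selected, counts)
```

Keeping the API order fixed across all software means that vectors computed for different programs can be compared entry by entry.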

The category determination unit 530 determines the category to which the classification target software belongs by comparing the MLD value of the classification target software calculated by the MLD calculation unit 520 with the per-category MLD values stored in the MLD DB 200.

That is, the category determination unit 530 compares the MLD value calculated for the classification target software with the MLD values calculated and stored in advance in the MLD DB 200 for the plurality of software items of each category, and determines the category with the highest similarity as the category of the classification target software.
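The selection of the most similar category can be sketched as follows. The description does not name a similarity measure, so cosine similarity is used here purely as an assumption, as is the layout of the MLD DB as a mapping from category name to a non-empty list of stored vectors. The names `cosine_similarity` and `determine_category` and all sample values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def determine_category(target_vector, mld_db):
    """Return (category, score) for the category whose stored vectors are most similar.

    mld_db: dict mapping category name -> non-empty list of stored MLD vectors.
    A category's score is the best similarity over its stored vectors (an assumption).
    """
    best_category, best_score = None, -1.0
    for category, vectors in mld_db.items():
        score = max(cosine_similarity(target_vector, v) for v in vectors)
        if score > best_score:
            best_category, best_score = category, score
    return best_category, best_score


# Illustrative MLD DB with two categories and two stored software items each.
sample_db = {
    "Audio Player": [[9, 1, 0, 0], [7, 2, 0, 1]],
    "Browser": [[0, 1, 8, 6], [1, 0, 9, 5]],
}
category, score = determine_category([8, 1, 1, 0], sample_db)
```

Scoring a category by its best-matching stored item is only one option; averaging over the stored items or training a classifier on the stored vectors would fit the description equally well.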

FIG. 7 shows an operation control flow in which the software classification apparatus according to an embodiment of the present invention classifies the category of classification target software. Hereinafter, the embodiment of the present invention is described in detail with reference to FIGS. 5 to 7.

First, when classification target software is recognized (S700), the software classification apparatus 500 extracts, as the feature information of the classification target software, partial information containing the core feature information of the executable file constituting that software (S702).

The core feature information of the executable file may be, for example, strings included in the executable file, file names of APIs (Application Programming Interfaces), and reference frequencies of the APIs.

That is, when classification target software is recognized, the software classification apparatus 500 may extract, as the feature information of the classification target software, the file names of a plurality of APIs included in the executable file constituting that software and information on the reference frequency of each API, but the present invention is not limited thereto.

In extracting the file names of the plurality of APIs included in the executable file constituting the classification target software and the information on the reference frequency of each API, the software classification apparatus 500 may identify the file names of the APIs by referring to the IAT of the executable file, and may determine the reference frequency of each API by comparing the addresses stored in the IAT with the information in the .text section, but the present invention is not limited thereto.

Next, the software classification apparatus 500 calculates an MLD value for each API by using the file names of the plurality of APIs included in the executable file of the classification target software and the information on the reference frequency of each API (S704).

Here, the MLD value may mean information that makes it possible to classify, based on APIs, what kind of software a given software is. That is, the MLD values calculated for the respective APIs may be values that, when referred to, indicate that the software can be classified as belonging to a category such as, for example, Audio Player, Browser, CD Writer, FTP, Image Viewer, Messenger, Text Editor, or Video Player.

In calculating the MLD value, the software classification apparatus 500 may calculate the MLD value for each API by recording, for each of M APIs selected as feature information of the classification target software, the reference frequency with which the software references that API.

Next, the software classification apparatus 500 determines the category to which the classification target software belongs by comparing the MLD value calculated as above with the information stored in the MLD DB 200 (S706).

That is, the software classification apparatus 500 compares the MLD value calculated for the classification target software with the MLD values calculated and stored in advance in the MLD DB 200 for the plurality of software items of each category, and determines the category with the highest similarity as the category of the classification target software.

Accordingly, once the category to which the classification target software belongs has been determined as above, the classification target software only needs to be compared one-to-one with the software included in that category, so the overhead of the software filtering operation for detecting illegal copying or plagiarism of software can be reduced.
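The reduction in one-to-one comparisons can be made concrete with a small sketch. The registry layout, the function name `comparison_candidates`, and the sample product names below are hypothetical; the point is only that the number of candidates drops from every registered software item to those in the determined category.

```python
def comparison_candidates(registry, category=None):
    """List the registered software that must be compared one-to-one.

    registry: dict mapping category name -> list of registered software names.
    With no category, every registered item is a candidate.
    """
    if category is None:
        return [name for items in registry.values() for name in items]
    return list(registry.get(category, []))


# Illustrative registry of legitimate software grouped by category.
sample_registry = {
    "Audio Player": ["PlayerA", "PlayerB", "PlayerC"],
    "Browser": ["BrowserA", "BrowserB"],
    "Text Editor": ["EditorA"],
}
without_classification = len(comparison_candidates(sample_registry))
with_classification = len(comparison_candidates(sample_registry, "Browser"))
```

In this toy registry the comparison count falls from 6 to 2 once the target is known to be a browser; the saving grows with the number of categories and registered items.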

As described above, according to the present invention, in classifying software, an MLD value based on the APIs included in the executable file of each software is calculated for a plurality of software items that have not been illegally copied or plagiarized, and a software filtering database in which the plurality of software items are classified on the basis of the MLD values is generated. The MLD value calculated from the APIs of classification target software is then compared with the filtering database so that the category of the classification target software is determined more accurately, which reduces the overhead of the software filtering operation for detecting illegal copying or plagiarism of software.

Combinations of the steps of each flowchart attached to the present invention may be performed by computer program instructions. Since these computer program instructions may be loaded onto a processor of a general purpose computer, a special purpose computer, or other programmable data processing equipment, the instructions executed by the processor of the computer or other programmable data processing equipment create means for performing the functions described in each step of the flowchart. Since these computer program instructions may also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, the instructions stored in the computer-usable or computer-readable memory can produce an article of manufacture containing instruction means that perform the functions described in each step of the flowchart. Since the computer program instructions may also be loaded onto a computer or other programmable data processing equipment, a series of operational steps may be performed on the computer or other programmable data processing equipment to create a computer-executed process, so that the instructions operating the computer or other programmable data processing equipment provide steps for executing the functions described in each step of the flowchart.

In addition, each step may represent a module, segment, or portion of code that includes one or more executable instructions for executing the specified logical function(s). It should also be noted that in some alternative embodiments the functions mentioned in the steps may occur out of order. For example, two steps shown in succession may in fact be performed substantially concurrently, or the steps may sometimes be performed in reverse order depending on the corresponding function.

Although specific embodiments have been described in the above description of the present invention, various modifications may be made without departing from the scope of the present invention. Accordingly, the scope of the invention should be determined not by the described embodiments but by the claims.

102: input unit 104: feature information extraction unit
106: MLD generation unit 200: MLD DB
500: software classification apparatus 510: feature information extraction unit
520: MLD calculation unit 530: category determination unit

Claims

delete

A software classification apparatus comprising:
an MLD DB storing a first MLD value for each API for a plurality of software items classified into categories;
a feature information extraction unit that, when classification target software is recognized, extracts information on the file names of a plurality of APIs included in an executable file constituting the classification target software and on the reference frequency of each API;
an MLD calculation unit that calculates a second MLD value for each API of the classification target software by using the information on the file names of the plurality of APIs and the reference frequency of each API; and
a category determination unit that determines a category to which the classification target software belongs by comparing the second MLD value calculated by the MLD calculation unit with the first MLD value for each category stored in the MLD DB.

The software classification apparatus of claim 5,
wherein the category determination unit
compares the second MLD value of the classification target software with the first MLD value for each category in the MLD DB, and determines the category having the highest similarity between the first MLD value and the second MLD value as the category of the classification target software.

The software classification apparatus of claim 5,
wherein the feature information extraction unit
extracts the file name of each API of the classification target software and the reference frequency of each API from the header and sections of the classification target software.

The software classification apparatus of claim 5,
wherein the executable file is
a Microsoft Windows EXE file, a Java byte file, or a Linux a.out file.

delete

A software classification method comprising:
generating an MLD DB storing a first MLD value for each API for a plurality of software items classified into categories;
extracting, when classification target software is recognized, feature information on the file names of a plurality of APIs included in an executable file constituting the classification target software and on the reference frequency of each API;
calculating a second MLD value for each API of the classification target software by using the information on the file names of the plurality of APIs and the reference frequency of each API; and
determining a category to which the classification target software belongs by comparing the second MLD value with the first MLD value for each category stored in the MLD DB.

The software classification method of claim 11,
wherein the determining of the category comprises:
comparing the second MLD value of the classification target software with the first MLD value for each category in the MLD DB; and
determining the category having the highest similarity between the first MLD value and the second MLD value as the category of the classification target software.