KR101803889B1

KR101803889B1 - Method and apparatus for detecting malicious application based on risk

Info

Publication number: KR101803889B1
Application number: KR1020170008687A
Authority: KR
Inventors: 오성택; 고웅; 김미주; 최은영; 이태진
Original assignee: 한국인터넷진흥원
Priority date: 2017-01-18
Filing date: 2017-01-18
Publication date: 2017-12-04

Abstract

According to an embodiment of the present invention, a method for detecting a malicious application comprises the following steps. A malicious application detection apparatus extracts the source code of an application. The apparatus generates n-dimensional coordinates according to whether each of a first through an n-th feature is included in the source code. The apparatus calculates the risk of the application as the value obtained by multiplying each of the n-dimensional coordinates by its weight and summing the products. The apparatus then determines whether the application is malicious by using the risk.

Description

Method and apparatus for detecting a malicious application based on risk

The present invention relates to a method and an apparatus for detecting malicious applications based on risk. More specifically, it relates to a method of detecting whether a new mobile application is malicious according to the degree to which it uses APIs (Application Programming Interfaces) and the like that are mainly used in malicious mobile applications, and to an apparatus that performs the method.

Apple's iPhone, launched in Korea in late 2009, and the Android-based smartphones that followed, typified by Samsung's Galaxy, brought about revolutionary change. The emergence of these smart devices and of the various mobile applications built on them has changed patterns of daily life.

However, as mobile applications have become active and widespread, cases of their use as tools for malicious attacks have increased. The problem is relatively less severe for the iPhone's iOS operating system because of Apple's closed market policy, but for Google's Android operating system a variety of malicious applications are being distributed because of Google's open market policy.

In particular, the distribution of malicious Android applications is dominated by repackaging, in which malicious code is inserted into an existing mobile application and the result is redistributed. Because the lifetime of a mobile application is very short and numerous new applications appear every day, repackaging is used as an easy way to create and distribute malicious applications.

In other words, mobile applications have a short life cycle because they are easy to install and delete, and malicious mobile applications likewise have a short life cycle. As a result, rather than writing new malicious code from scratch, attackers mainly plant malicious code in a mobile application that is already distributed legitimately and redistribute it.

Many variant malicious applications appear through repackaging in this way, but analyzing and responding to all of them is not efficient. In a repackaged malicious application, all code other than the inserted malicious code is the source code of the original application, so the code differs from one application to the next, and analyzing all of it is not cost-effective.

Therefore, a method is needed that can quickly and simply determine whether a new application is malicious, based on a risk calculated by quantifying how dangerous the APIs mainly used in repackaged malicious mobile applications are.

KR 10-2004-0056998 A, "Apparatus and method for detecting malicious executable code through risk assessment" (2004.07.01)

The technical problem to be solved by the present invention is to provide a method for detecting a malicious mobile application based on risk and an apparatus for performing the method.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the following description.

According to one aspect of the present invention for solving the above technical problem, a malicious application detection method may include: extracting, by a malicious application detection apparatus, the source code of an application; generating, by the malicious application detection apparatus, n-dimensional coordinates according to whether each of a first feature through an n-th feature is included in the source code; calculating, by the malicious application detection apparatus, the risk of the application as the value obtained by multiplying each of the n-dimensional coordinates by a weight for that coordinate and summing the products; and determining, by the malicious application detection apparatus, whether the application is malicious by using the risk.

In one embodiment, extracting the source code may include decompiling or disassembling the application to extract the source code.

In another embodiment, the first through n-th features are APIs or system commands selected in advance through a learning process performed on normal applications and malicious applications.

In yet another embodiment, the weight for each coordinate is a value determined in advance by applying a genetic algorithm to normal applications and malicious applications.

In yet another embodiment, calculating the risk may include scaling the value obtained by multiplying each of the n-dimensional coordinates by its weight and summing the products so that it falls within the range from 0 to 100.

In yet another embodiment, determining whether the application is malicious may include comparing the risk with a preset threshold.
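
The claimed steps (0/1 coordinates, weighted sum, scaling to 0 through 100, threshold comparison) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the example weights, the threshold of 70, and the min-max scaling rule are not taken from the patent, which does not specify how the scaling is performed.

```python
def risk_score(coords, weights):
    """Weighted sum of 0/1 coordinates (one coordinate per feature)."""
    if len(coords) != len(weights):
        raise ValueError("coords and weights must have the same length")
    return sum(c * w for c, w in zip(coords, weights))


def scale_0_100(raw, weights):
    """Min-max scale a raw score into [0, 100] (assumed scaling rule).

    With 0/1 coordinates, the smallest possible sum is the sum of the
    negative weights and the largest is the sum of the positive weights.
    """
    lo = sum(w for w in weights if w < 0)
    hi = sum(w for w in weights if w > 0)
    if hi == lo:  # all weights are zero; nothing to scale
        return 0.0
    return 100.0 * (raw - lo) / (hi - lo)


def is_malicious(coords, weights, threshold):
    """Judge malicious when the scaled risk is at or above the threshold."""
    return scale_0_100(risk_score(coords, weights), weights) >= threshold


# Hypothetical weights for four features and a hypothetical threshold.
WEIGHTS = [3.0, 1.0, -1.0, 2.0]
THRESHOLD = 70.0

suspicious = [1, 0, 0, 1]   # features 1 and 4 present
harmless = [0, 0, 1, 0]     # only feature 3 present

print(is_malicious(suspicious, WEIGHTS, THRESHOLD))
print(is_malicious(harmless, WEIGHTS, THRESHOLD))
```

The comparison uses "at or above" to match the later description of steps S1700 and S1800, where an application whose risk is equal to or greater than the threshold is judged malicious.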

According to another aspect of the present invention for solving the above technical problem, a malicious application detection apparatus includes: a network interface; one or more processors; a memory into which a computer program executed by the processors is loaded; and a storage that stores an application, wherein the computer program may include: an operation of extracting the source code of the application; an operation of generating n-dimensional coordinates according to whether each of a first feature through an n-th feature is included in the source code; an operation of calculating the risk of the application as the value obtained by multiplying each of the n-dimensional coordinates by a weight for that coordinate and summing the products; and an operation of determining whether the application is malicious by using the risk.

With the malicious application detection method of the present invention, it is possible to respond quickly to the distribution of diverse variant malicious applications by repackaging and to determine whether a new application is malicious. This protects the personal information of application users and minimizes the damage caused by malicious applications.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a flowchart of a risk-based malicious application detection method according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining a risk-based malicious application detection method according to an embodiment of the present invention.
FIGS. 3 to 5 are diagrams for explaining a process of selecting features that can be used in an embodiment of the present invention.
FIGS. 6A to 6E are diagrams for explaining a genetic algorithm that can be used in an embodiment of the present invention.
FIG. 7 is a hardware block diagram of a risk-based malicious application detection apparatus according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the manner of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in many different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to those of ordinary skill in the art to which the invention belongs; the present invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Unless defined otherwise, all terms used in this specification (including technical and scientific terms) have the meaning commonly understood by those of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are expressly and specifically defined. The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural form unless the text specifically states otherwise.

As used in the specification, "comprises" and/or "comprising" do not exclude the presence or addition of one or more other components, steps, operations, and/or elements besides the components, steps, operations, and/or elements mentioned.

Hereinafter, the present invention will be described in more detail with reference to the accompanying drawings.

FIG. 1 is a flowchart of a risk-based malicious application detection method according to an embodiment of the present invention.

Referring to FIG. 1, applications are first collected (S1100). The applications collected here are new applications to be analyzed. Excluding applications that have already been collected and judged malicious or benign, applications newly distributed through the official market, private markets, online channels, and the like are collected.

Next, the certificate of the collected application is extracted (S1200). An Android application must be signed when it is built into an apk and distributed, and the signing must be done with a certificate. More detailed information on application signing is available on Google's developer site at https://developer.android.com/studio/publish/app-signing.html.

A blacklist or a whitelist can be applied to the extracted certificate. Because the certificate used to sign an Android application authenticates the party that created and distributed the application, it is used to perform a first check.

FIG. 1 illustrates the case where a whitelist is applied. If the certificate extracted from the new application is already registered in the whitelist (S1300), the new application was distributed by an authenticated party, so it is determined to be a normal application (S1900) and the process of detecting whether the new application is malicious ends.

If the certificate extracted from the new application is not registered in the whitelist, the new application is treated as a suspicious application that may be malicious, and the process of detecting whether it is malicious continues.

Although not shown in FIG. 1, the blacklist case is similar. For example, if the certificate extracted from the new application is already registered in the blacklist, the new application was distributed by a party that has previously distributed malicious applications, so it is determined to be a malicious application (S1800) and the process of detecting whether the new application is malicious ends.

If the certificate extracted from the new application is not registered in the blacklist, the new application is treated as a suspicious application that may be malicious, and the process of detecting whether it is malicious continues.

If the certificate extracted from the new application is in neither the blacklist nor the whitelist, it has not yet been determined whether the application is malicious or normal, so the application remains suspicious and further analysis is required. For further analysis, the source code must be extracted from the new application.
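
The certificate-based first check described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: identifying a certificate by the SHA-256 digest of its raw bytes, the function names, and checking the blacklist before the whitelist are all assumptions made for the example.

```python
import hashlib


def fingerprint(cert_bytes):
    """Identify a signing certificate by the SHA-256 digest of its bytes."""
    return hashlib.sha256(cert_bytes).hexdigest()


def first_check(cert_bytes, whitelist, blacklist):
    """Return 'malicious', 'benign', or 'suspicious' for a signing certificate.

    whitelist and blacklist are sets of certificate fingerprints.
    """
    fp = fingerprint(cert_bytes)
    if fp in blacklist:       # signer previously distributed malicious apps
        return "malicious"
    if fp in whitelist:       # signer is an authenticated distributor
        return "benign"
    return "suspicious"       # unknown signer: further analysis is needed


# Placeholder byte strings stand in for real certificate data.
whitelist = {fingerprint(b"trusted-cert")}
blacklist = {fingerprint(b"known-bad-cert")}

print(first_check(b"trusted-cert", whitelist, blacklist))
print(first_check(b"known-bad-cert", whitelist, blacklist))
print(first_check(b"never-seen-cert", whitelist, blacklist))
```

Only the "suspicious" outcome proceeds to source code extraction; the other two outcomes end the detection process for that application.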

Next, the source code of the suspicious new application is extracted. The source code of an apk file, the installation file of an Android application, can be extracted through decompilation and disassembly. In the JAVA language, source code is converted into binary form so that it can run on the JVM. Applying a tool such as JAD (JAVA Decompiler) to a binary with the class extension yields source code with the original java extension.

After the source code is extracted, its similarity to the source code of applications previously judged malicious is calculated (S1400). Malicious applications distributed by repackaging are usually spread by inserting malicious code, so most methods are reused unchanged.

Therefore, to make the comparison fast, unnecessary libraries are excluded, the method names are sorted, and then either a binary or a hash value is extracted for each method. The same per-method binaries or hash values are extracted from the malicious application, the two sets are compared, and the similarity is calculated.

The similarity calculated from the methods of the new application and the methods of an application previously judged malicious is compared with a preset threshold (S1500). If the similarity is equal to or greater than the threshold, the new application is determined to be a malicious application (S1800) and the process of detecting whether the new application is malicious ends.

If the similarity between the new application and the applications previously judged malicious is smaller than the preset threshold, the new application is still regarded as suspicious, and the process of detecting whether it is malicious continues.
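
The method-based comparison of the second check can be sketched as follows. The patent does not state which similarity measure is used; the Jaccard index over per-method hashes, the library prefixes, the threshold of 0.7, and all names below are assumptions chosen for illustration.

```python
import hashlib

# Hypothetical package prefixes treated as "unnecessary libraries".
LIBRARY_PREFIXES = ("android.support.", "com.google.")


def method_hashes(methods):
    """Map {method name: method bytes} to a set of SHA-256 hashes.

    Library methods are skipped. Names are visited in sorted order as in
    the description; the resulting set does not depend on that order.
    """
    hashes = set()
    for name in sorted(methods):
        if name.startswith(LIBRARY_PREFIXES):
            continue
        hashes.add(hashlib.sha256(methods[name]).hexdigest())
    return hashes


def similarity(a, b):
    """Jaccard index of two hash sets (assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


known_malicious = {
    "com.game.Main.onCreate": b"\x01\x02",
    "com.game.Board.draw": b"\x03\x04",
    "com.game.Score.save": b"\x05\x06",
    "com.game.Hidden.steal": b"\x07\x08",
    "android.support.v4.Util.noop": b"\x00",
}
new_app = dict(known_malicious)
new_app["com.game.Ads.show"] = b"\x09\x0a"   # one extra method in the variant

SIM_THRESHOLD = 0.7  # hypothetical preset threshold
score = similarity(method_hashes(new_app), method_hashes(known_malicious))
print(score, score >= SIM_THRESHOLD)
```

Hashing each method separately means a repackaged variant that reuses most methods unchanged still shares most hashes with the known sample, even when a few methods are added or altered.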

At this point, various methods can be used to determine whether the new application is malicious. For example, the apk file of the suspicious new application can be loaded in an Android emulator, the application can be run and its behavior monitored, and whether it is malicious can be determined from the result. This kind of analysis is commonly called dynamic analysis.

Alternatively, the APIs and the like used in the source code extracted from the suspicious new application can be collected, and the new application can be judged malicious if they resemble the pattern of APIs mainly used in malicious applications. This kind of analysis is commonly called static analysis.

Dynamic analysis is advantageous when it is difficult to extract the source code of the application under analysis. For Android applications, however, source code is easy to extract through decompilation and the like, so static analysis is more suitable.

Dynamic analysis also requires considerable time and load to run the emulator and monitor behavior. With countless applications released every day, dynamic analysis is therefore an inefficient way to detect variant malicious applications.

For these reasons, a static analysis method that determines maliciousness by analyzing the source code is applied. To this end, APIs and the like are extracted from the source code of the new application and compared with the APIs and the like extracted from the source code of malicious applications to calculate the risk (S1600).

The risk calculated from the APIs of the new application and the APIs of applications previously judged malicious is compared with a preset threshold (S1700). If the risk is equal to or greater than the threshold, the new application is determined to be a malicious application (S1800) and the process of detecting whether the new application is malicious ends.

If the risk of the new application is smaller than the preset threshold, the new application is determined to be a normal application (S1900), and the process of detecting whether the new application is malicious may continue. Through this process, it can finally be determined whether the new application is malicious.

To summarize the process of FIG. 1, whether a new application is malicious is finally determined through a first check using the blacklist and the whitelist, a second check using the similarity to applications previously judged malicious, and a third check based on a risk calculated from the APIs used in applications previously judged malicious.
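
The three checks summarized above form a cascade in which each later check runs only when the earlier one is inconclusive. A rough sketch follows, with each stage passed in as a function; the stage interfaces, names, and thresholds are hypothetical and stand in for the steps of FIG. 1.

```python
def detect(app, cert_check, similarity_of, risk_of, sim_threshold, risk_threshold):
    """Run the three checks in order and return 'malicious' or 'benign'.

    cert_check(app)    -> 'malicious', 'benign', or None when the signer is unknown
    similarity_of(app) -> similarity to known malicious applications
    risk_of(app)       -> risk computed from the weighted features
    """
    verdict = cert_check(app)                 # first check (S1300)
    if verdict is not None:
        return verdict
    if similarity_of(app) >= sim_threshold:   # second check (S1400, S1500)
        return "malicious"
    if risk_of(app) >= risk_threshold:        # third check (S1600, S1700)
        return "malicious"
    return "benign"


# Toy stages: an unknown signer, low similarity, high risk.
result = detect(
    app="example.apk",
    cert_check=lambda app: None,
    similarity_of=lambda app: 0.2,
    risk_of=lambda app: 0.9,
    sim_threshold=0.7,
    risk_threshold=0.5,
)
print(result)
```

Ordering the stages from cheapest to most expensive lets most applications be resolved by a list lookup or a hash comparison before any feature weighting is needed.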

FIG. 2 is a diagram for explaining a risk-based malicious application detection method according to an embodiment of the present invention.

FIG. 2 shows in detail the process of calculating the risk for the third check described with reference to FIG. 1. In the present invention, a genetic algorithm in particular is used to calculate the risk. This requires a learning process performed in advance on applications already judged malicious. The model generated in the learning process is then used to detect in real time whether a new application is malicious.

In FIG. 2, the batch processing area on the left represents the learning process, and the real-time processing area on the right represents the process of determining whether a new application is malicious. Looking first at the batch processing area, the features of malicious behavior are derived from malicious applications.

Features of malicious behavior include, for example, attempts to access the device's contact information, attempts to access the device's SMS, attempts to access the device's location information, and attempts to execute system commands of the Android operating system. Such behaviors are of course not always linked to malicious activity, so a learning process that uses them is required.

Learning the features of malicious behavior means learning with a genetic algorithm. A genetic algorithm is a computational model based on the evolutionary processes of the natural world; it is a global optimization technique developed by John Holland in 1975 and one of the techniques for solving optimization problems. It is a representative technique of evolutionary computation, which imitates biological evolution. It borrows heavily from the actual process of evolution and includes operations such as mutation and crossover. Terms such as generation and population are also used in the problem-solving process.

The genetic algorithm is grounded in the genetics of living organisms in nature. It is a parallel, global search algorithm whose basic concept is Darwin's theory of survival of the fittest. A genetic algorithm represents the possible solutions to the problem at hand in a fixed data structure and then gradually transforms them to produce better and better solutions. Here, the data structure representing the solutions can be described as genes, and the process of producing better solutions by transforming them can be described as evolution.

Put differently, a genetic algorithm is a search algorithm that simulates evolution in order to find a solution x that optimizes some unknown function Y = f(x). It is closer to an approach to problem solving than to an algorithm for one specific problem, and it can be applied to any problem that can be expressed in a form a genetic algorithm can use.

In general, when a problem is too complex to compute exactly, a genetic algorithm can be used to obtain an answer close to the optimum even if the true optimum cannot be found. In this case it may not perform as well as an algorithm optimized for that problem, but in most cases it yields a solution of acceptable quality.

Various theories and techniques that imitate biological evolution, that is, natural selection and the laws of heredity, have been actively studied in recent years, including evolutionary strategies and genetic programming. Among them, the genetic algorithm is the most basic and representative, and it is widely applied in the natural sciences, engineering, and the humanities and social sciences to solve complex problems that are nonlinear or computationally infeasible.

The genetic algorithm is used to evaluate the fitness of the weights assigned to malicious behaviors. This makes it possible to determine, for each behavior, a weight that better separates actual malicious applications from ordinary applications. For example, 35 behaviors are defined in advance, and for each behavior a 1 or a 0 is recorded according to whether that behavior is present in the source code. This yields a 35-dimensional feature vector for each application.
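
Building the 0/1 feature vector can be sketched as follows. The patent's actual list of 35 behaviors is not given in this passage, so the four Android API identifiers below are illustrative stand-ins for the contact, SMS, location, and system command behaviors mentioned earlier, and plain substring matching is an assumed simplification.

```python
# Illustrative stand-ins for the predefined behaviors (not the patent's list).
FEATURES = [
    "ContactsContract",            # access to contact information
    "sendTextMessage",             # access to SMS
    "getLastKnownLocation",        # access to location information
    "Runtime.getRuntime().exec",   # execution of system commands
]


def feature_vector(source, features=FEATURES):
    """Record 1 if the feature string occurs in the source code, else 0."""
    return [1 if f in source else 0 for f in features]


sample_source = '''
SmsManager.getDefault().sendTextMessage(number, null, text, null, null);
Runtime.getRuntime().exec("su");
'''

print(feature_vector(sample_source))  # [0, 1, 0, 1]
```

With the full feature list the same function would return a vector with one entry per predefined behavior, 35 in the example given above.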

Applying a weight to each coordinate of the 35-dimensional vector obtained in this way and summing the results yields a single value. This value is referred to as the risk, and the weights used to calculate it can be determined through learning with a genetic algorithm.

For ease of comparison, assume that the risk is scaled to take values from -1 to 1. A risk of -1 can be judged negative, meaning the application is normal. Conversely, a risk of 1 can be judged positive, meaning the application is malicious.

The genetic algorithm is applied to training data consisting of applications previously judged malicious or normal, and the optimal weights are determined by repeating operations such as replacement, selection, crossover, and mutation. More details on genetic algorithms can be found at https://en.wikipedia.org/wiki/Genetic_algorithm.

Returning to FIG. 2, the genetic algorithm generates a weight for each feature of malicious behavior and a threshold for judging an application malicious. As explained above, each feature is assigned a value of 0 or 1 according to whether the corresponding API appears in the source code, and the value obtained by multiplying each of the 35 features by its weight and summing the products is called the risk.

When the risk is scaled to the range -1 to 1 for ease of judgment, a specific value within that range is chosen as the threshold for judging an application malicious. For example, an application is judged malicious if the risk obtained by multiplying the feature coordinates by their weights and summing the products is greater than 0.5.

For this to work, the weights must be chosen so that the weighted risk separates normal applications from malicious ones well. That is, the weights should be set so that applications previously judged malicious score close to 1 and applications previously judged normal score close to -1.

Once the weights and the threshold for judging maliciousness have been determined in this batch processing area, they can be used to judge whether an actual new application is malicious. The weights and the threshold are the result of learning and can be stored in the mobile malicious application database.

The weights and threshold stored in the mobile malicious application database are later supplied to the real-time processing area, where new applications are judged. In the real-time processing area of FIG. 2, the apk file of a new application is collected first.

Next, although not shown in FIG. 2, a first inspection using a blacklist and a whitelist is performed on the application. If the application is still suspicious after the first inspection, a second inspection judges maliciousness from its method-based similarity to malicious applications. If it is still suspicious after the second inspection, a third inspection computes the risk.
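The three inspections form a cascade in which each later stage runs only when the earlier one is inconclusive. The sketch below is one hedged way to express that control flow; the function name `inspect`, the use of callables for the similarity and risk stages, and the numeric thresholds are all hypothetical and not taken from the patent.

```python
def inspect(app_id, blacklist, whitelist, similarity_fn, risk_fn,
            sim_threshold=0.8, risk_threshold=0.5):
    """Return (verdict, stage) for one application.

    similarity_fn and risk_fn are called lazily, so the costlier
    stages run only when the earlier ones do not settle the verdict.
    Both thresholds are illustrative assumptions.
    """
    # First inspection: blacklist and whitelist lookup.
    if app_id in blacklist:
        return "malicious", 1
    if app_id in whitelist:
        return "normal", 1
    # Second inspection: method-based similarity to known malicious apps.
    if similarity_fn(app_id) >= sim_threshold:
        return "malicious", 2
    # Third inspection: risk score compared with its threshold.
    verdict = "malicious" if risk_fn(app_id) >= risk_threshold else "normal"
    return verdict, 3

# Toy lookup tables standing in for the real similarity and risk stages.
blacklist, whitelist = {"bad"}, {"good"}
sims = {"variant": 0.95, "unknown": 0.1, "benign": 0.0}
risks = {"unknown": 0.7, "benign": -0.4}

results = {name: inspect(name, blacklist, whitelist, sims.get, risks.get)
           for name in ["bad", "good", "variant", "unknown", "benign"]}
```

Passing the later stages as callables keeps the expensive similarity and risk computations from running for applications already settled by the lists.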

To compute the risk, a 35-dimensional feature vector with values of 0 and 1 is derived according to whether specific APIs are included in the source code of the new application, each coordinate is multiplied by its weight and the products are summed, and the new application is judged malicious or not according to whether the risk is at or above the threshold.
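As a small worked illustration of this computation, the sketch below builds a 0/1 feature vector, takes the weighted sum, and compares it with a threshold. The four feature names are borrowed from Table 1, but the weights, the threshold, and the plain substring matching are assumptions made for the example; a real implementation would inspect decompiled code more carefully.

```python
# Feature names taken from Table 1; weights and threshold are hypothetical.
FEATURES = ["sendTextMessage", "getMessageBody", "lockNow", "getLatitude"]
WEIGHTS = [1.5, 2.0, 1.0, -0.5]
THRESHOLD = 2.0

def feature_vector(source, features=FEATURES):
    """1 if the feature name occurs in the source text, else 0.

    Plain substring matching is a simplification for illustration.
    """
    return [1 if name in source else 0 for name in features]

def risk_score(vector, weights=WEIGHTS):
    """Sum of each coordinate multiplied by its weight."""
    return sum(x * w for x, w in zip(vector, weights))

def is_malicious(source):
    """Judge malicious when the risk is at or above the threshold."""
    return risk_score(feature_vector(source)) >= THRESHOLD

suspicious = "sms.sendTextMessage(dest, null, body, null, null); dpm.lockNow();"
harmless = "double lat = location.getLatitude();"
```

Here `suspicious` yields the vector [1, 0, 1, 0] and a risk of 2.5, while `harmless` yields [0, 0, 0, 1] and a risk of -0.5.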

In describing FIG. 2, the feature vector was given 35 dimensions, but this is only to aid understanding and is not intended to limit the invention; the feature vector may have other dimensions. Table 1 below lists APIs that may be used as coordinates of the feature vector.

| API | Analysis | Role | Maliciousness |
|-----|----------|------|---------------|
| sendTextMessage | Spreading smishing messages / sending harvested information | Send message | High |
| getMessageBody | Text message theft | Get message body | Low |
| getOriginatingAddress | Sender number theft | Get sender number | High |
| createFromPDU | Malicious message creation / raw message theft | Raw PDU conversion | High |
| abortBroadcast | Forced termination of the message-received broadcast | Terminate ordered broadcast | High |
| setRingerMode | Hiding text message reception | Switch ringer mode | High |
| setComponentEnabledSetting | Hiding the smishing app icon | Switch package component state | High |
| Android.app.action.ADD_DEVICE_ADMIN | Restricting / obstructing removal of the smishing app | Register administrator mode (system app) | High |
| ContactsContract$Contacts | User address book theft | Address book management provider | Low |
| lockNow | Hiding current state / forced lock | Switch to lock screen | High |
| wipeData | Deleting smishing app evidence | Delete user data | High |
| getLatitude | User location theft | Latitude | Low |
| getLongitude | User location theft | Longitude | Low |
| getLastKnownLocation | User location theft | Last stored user location | Low |
| getAccounts | User account information theft | Provide all account information registered on the device | Low |

Table 1 lists the Android common APIs that are mainly used by malicious applications, together with their degree of risk. The feature vector can be extracted according to whether such APIs are present in the source code of the new application. In addition to these Android common APIs, external commands such as those in Table 2 can also be used to extract the feature vector.

| Command | Analysis | Role | Maliciousness |
|---------|----------|------|---------------|
| killProcess | Force-terminate a specific application | Not applicable | Low |
| getAsciiBytes | String conversion | Not applicable | Low |
| shell | Execute shell command | Not applicable | High |
| copyClassDex | Copy external executable binary file | Not applicable | Low |
| copyFile | Copy tool executable | Not applicable | High |
| copyLib | Copy obfuscation / malicious library | Not applicable | High |
| Chmod | Grant execute permission to malicious file | File permission change tool | Low |

As Table 2 shows, a new application whose source code contains external commands in addition to Android common APIs may also be likely to be malicious. A feature vector can therefore be generated by checking whether these APIs or commands are included in the source code.

FIGS. 3 to 5 are diagrams for explaining the process of selecting features that can be used in an embodiment of the present invention.

As described with Tables 1 and 2, features are analyzed in terms of APIs and commands. Which APIs and commands to select therefore becomes important. For example, an Android common API that is used about equally by malicious and normal applications need not be used in computing the feature vector.

It is therefore important to decide which APIs and commands the feature analysis is based on. This requires analyzing the usage patterns of APIs and commands in applications previously judged malicious and in applications previously judged normal. These steps are performed in the batch processing area described with FIG. 2.

For example, Table 3 summarizes the characteristic behaviors commonly used across 1138 malicious applications and 1009 normal applications.

| Type | Behavior | Normal | Malicious |
|------|----------|--------|-----------|
| Device Information | getDeviceID | 736 | 867 |
| Device Information | getSubscriberID | 54 | 632 |
| Device Information | getNetworkOperatorName | 382 | 628 |
| Device Information | getNetworkOperator | 692 | 672 |
| Device Information | getNetworkCountryISO | 264 | 323 |
| Device Information | getSimSerialNumber | 140 | 730 |
| Device Information | getLine1Number | 292 | 1052 |
| Device Information | getAccounts | 290 | 7 |
| Device Information | getSimCountryISO | 182 | 283 |
| Device Information | getSimOperator | 285 | 367 |
| Device Information | getSimState | 151 | 296 |
| SMS/MMS | sendTextMessage | 46 | 505 |
| SMS/MMS | sendDataMessage | 1 | 0 |
| SMS/MMS | getMessageBody | 54 | 841 |
| SMS/MMS | getOriginatingAddress | 32 | 790 |
| SMS/MMS | getTimestampMillis | 345 | 350 |
| SMS/MMS | createFromPDU | 85 | 897 |
| SMS/MMS | abortBroadcast | 35 | 886 |
| SMS/MMS | getDisplayMessageBody | 48 | 58 |
| SMS/MMS | getDisplayOriginatingAddress | 23 | 107 |
| Location Information | getLatitude | 776 | 76 |
| Location Information | getLongitude | 776 | 76 |
| Location Information | getLastKnownLocation | 414 | 46 |
| Contact | ContactsContract$Contacts | 248 | 563 |
| Transmission Through | setEntity | 766 | 745 |
| Transmission Through | isConnected | 841 | 524 |
| Administrator | Android.app.action.ADD_DEVICE_ADMIN | 7 | 800 |
| Telephony | setRingerMode | 25 | 774 |
| Telephony | onCallStateChanged | 108 | 282 |
| Telephony | setResultData | 21 | 90 |
| Record | setOutPutFile | 58 | 79 |
| Icon Hiding | setComponentEnabledSetting | 76 | 920 |
| XML Parsing | getSharedPreferences | 983 | 1019 |
| XML Parsing | putString | 935 | 996 |
| Screen Lock | lockNow | 5 | 152 |
| Factory Reset | wipeData | 4 | 13 |
| Process Exit | killProcess | 268 | 6 |
| Certificates | NPKI | 20 | 496 |

Table 3 shows how often various types of APIs and commands appear in normal and malicious applications. If a particular API appears only in malicious applications, then the more such APIs a new application contains, the higher its risk; that is, it is more likely to be a malicious application.

Conversely, if a particular API appears mainly in normal applications, then the more such APIs a new application contains, the lower its risk; that is, it is more likely to be a normal application. FIGS. 3 to 5 plot how often specific APIs appear in malicious applications and in normal applications.
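One hedged way to turn the counts in Table 3 into a selection rule is to compare the share of malicious applications that use an API with the share of normal applications that use it, and keep the API only when the two shares differ clearly. The patent does not state an exact criterion, so the gap measure and the 0.2 cutoff below are assumptions; the counts and the totals (1009 normal, 1138 malicious) are taken from Table 3.

```python
N_NORMAL, N_MALICIOUS = 1009, 1138  # application totals from the text

# (normal count, malicious count) for a few rows of Table 3.
COUNTS = {
    "getSubscriberID": (54, 632),
    "getNetworkOperator": (692, 672),
    "getLine1Number": (292, 1052),
    "getTimestampMillis": (345, 350),
    "killProcess": (268, 6),
}

def rate_gap(normal_count, malicious_count):
    """Malicious usage rate minus normal usage rate (range -1 to 1)."""
    return malicious_count / N_MALICIOUS - normal_count / N_NORMAL

def select_features(counts, min_gap=0.2):
    """Keep features whose usage rates differ by at least min_gap.

    min_gap is a hypothetical cutoff, not a value from the patent.
    """
    return sorted(name for name, (n, m) in counts.items()
                  if abs(rate_gap(n, m)) >= min_gap)

selected = select_features(COUNTS)
```

With these rows, getSubscriberID, getLine1Number, and killProcess pass the cutoff, while getNetworkOperator and getTimestampMillis, which both groups use at similar rates, are dropped.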

Referring to FIG. 3, applications produced and distributed by Google were used as the normal applications, and smishing applications that steal users' personal information were used as the malicious applications. The frequency with which each group uses the roughly 11 device-information APIs, from getDeviceID to getSimState, shows which APIs are used mainly by normal applications and which mainly by malicious applications.

Likewise, FIG. 4 visualizes as a graph how often about eight SMS/MMS-related APIs are used in malicious and normal applications, and FIG. 5 shows the usage-frequency graphs for the remaining types in Table 3. Next, consider the case in which feature behaviors are selected in this way and a weight is learned for each of them by the genetic algorithm.

FIGS. 6A to 6E are diagrams for explaining a genetic algorithm that can be used in an embodiment of the present invention.

FIG. 6A shows a screen displaying line 246 through line 252. In line 246, the first value is 1, followed by 40 digits of 0s and 1s. The first value indicates whether the application is malicious or normal; since it is 1, as explained above, the application is malicious.

The following 40 digits indicate, as 0 or 1, whether each of the 40 APIs and commands selected as features is included in the source code of that malicious application. Training data based on actual malicious applications and actual normal applications is prepared in this way through line 252.

In FIG. 6A, line 246 through line 249 are malicious applications and line 250 through line 252 are normal applications. FIG. 6A shows only part of the data from actual malicious and actual normal applications prepared for training; the more training data there is, the more accurately the weights can be determined.
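A training row of this shape, one label followed by the 0/1 feature digits, could be read as sketched below. The exact layout of the rows in FIG. 6A is not given in the text, so whitespace-separated tokens are assumed here, and `parse_row` is a hypothetical helper name.

```python
def parse_row(line, n_features=40):
    """Split one training row into (label, feature list).

    Assumes whitespace-separated tokens: the first is the label
    (1 = malicious, 0 = normal), the rest are n_features 0/1 flags.
    """
    tokens = line.split()
    if len(tokens) != n_features + 1:
        raise ValueError(f"expected {n_features + 1} values, got {len(tokens)}")
    values = [int(t) for t in tokens]
    if any(v not in (0, 1) for v in values):
        raise ValueError("every value must be 0 or 1")
    return values[0], values[1:]

# Short 3-feature rows keep the example readable; real rows carry 40 flags.
label, features = parse_row("1 0 1 1", n_features=3)
```

Validating the row length and the 0/1 range up front keeps malformed rows from silently shifting features against their weights.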

FIG. 6B shows the experimental results of applying the weights in Table 4 below, determined by the genetic algorithm, to the training data of FIG. 6A.

| No. | Feature | Weight |
|-----|---------|--------|
| 1 | getMessageBody | 2.000703 |
| 2 | getOriginatingAddress | -1.2143 |
| 3 | createFromPDU | 1.335572 |
| 4 | abortBroadcast | 1.132763 |
| 5 | setRingerMode | 1.00092 |
| 6 | setComponentEnabledSetting | 0.680461 |
| 7 | Android.app.action.ADD_DEVICE_ADMIN | 0.691742 |
| 8 | ContactsContract%Contacts | 1.383585 |
| 9 | NPKI | -0.37357 |
| 10 | lockNow | 0.95864 |
| 11 | wipeData | 0.588097 |
| 12 | getLatitude | 0.562591 |
| 13 | getLongitude | -0.59751 |
| 14 | getLastKnownLocation | -1.22455 |
| 15 | getAccounts | -1.01545 |
| 16 | killProcess | -1.56521 |
| 17 | getAsciiBytes | -1.83888 |
| 18 | classes.dex | -0.95635 |
| 19 | shell | 0.8809 |
| 20 | copyClassDex | 0.910162 |
| 21 | copyFile | -1.57676 |
| 22 | copyLib | 1.718544 |
| 23 | chmod | 1.030479 |
| 24 | shield | -0.44892 |
| 25 | assets | -0.17721 |
| 26 | classloader | -1.10432 |
| 27 | ... | ... |

With the weights determined by the genetic algorithm applied to the training data, FIG. 6B shows 323 cases that were actually malicious and judged malicious, and 13 cases that were actually malicious but judged normal. Likewise, 5 cases were actually normal but judged malicious, and 331 cases were actually normal and judged normal.
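The four counts above form a confusion matrix, from which the usual rates follow directly. The sketch below derives them; the rates themselves are not stated in the text and are computed here only from the quoted counts, and the function name `rates` is an illustrative choice.

```python
def rates(tp, fn, fp, tn):
    """Derive summary rates from confusion-matrix counts.

    tp: malicious judged malicious, fn: malicious judged normal,
    fp: normal judged malicious,    tn: normal judged normal.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
    }

# Counts quoted from FIG. 6B.
m = rates(tp=323, fn=13, fp=5, tn=331)
```

For these counts, 654 of the 672 samples are classified correctly, which is an accuracy of roughly 0.973.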

FIG. 6C expresses as formulas the process of determining the threshold in the situation of FIG. 6B. Computing the risk with the weights in Table 4 gave a maximum of 18.1958 and a minimum of -10.5051. When the risk is scaled to the range 0 to 100 for ease of judgment, a threshold of 37 turns out to be the most suitable value for deciding between malicious and normal.

That is, the risk is scaled to a value from 0 to 100, and an application is judged malicious if its scaled risk exceeds 37. In this way the learning process determines both the weights for computing the risk and the threshold for judging maliciousness. This learning process can be repeated.
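The formula in FIG. 6C is not reproduced in the text, so the sketch below assumes ordinary min-max scaling, using the quoted minimum (-10.5051), maximum (18.1958), and threshold (37). The function names are illustrative.

```python
RISK_MIN, RISK_MAX = -10.5051, 18.1958  # extremes quoted in the text
THRESHOLD = 37                          # threshold on the 0-100 scale

def scale(raw, lo=RISK_MIN, hi=RISK_MAX):
    """Min-max scale a raw risk into the range 0 to 100 (assumed formula)."""
    return (raw - lo) / (hi - lo) * 100.0

def is_malicious(raw):
    """Judge malicious when the scaled risk exceeds the threshold."""
    return scale(raw) > THRESHOLD

zero_scaled = scale(0.0)
```

Under this assumption a raw risk of 0 maps to about 36.6, just below the threshold of 37, so the cutoff sits slightly above a raw weighted sum of zero.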

FIGS. 6D and 6E show statistical indicators, such as the false positive rate, obtained by computing the risk and setting the threshold under other weights. Once the optimal weights and threshold have been found through such repeated learning and applied to judging actual new applications, variant malicious applications distributed by repackaging can be detected quickly.

FIG. 7 is a hardware configuration diagram of a risk-based malicious application detection apparatus according to an embodiment of the present invention.

Referring to FIG. 7, the risk-based malicious application detection apparatus (10) may include one or more processors (510), a memory (520), a storage (560), and an interface (570). The processor (510), the memory (520), the storage (560), and the interface (570) exchange data via a system bus (550).

The processor (510) executes a computer program loaded into the memory (520), and the memory (520) loads the computer program from the storage (560). The computer program may include a source code extraction operation (521), a risk calculation operation (523), and a maliciousness determination operation (525).

The source code extraction operation (521) collects, through the interface (570), new applications to be judged for maliciousness from official markets, black markets, SNS, private bulletin boards, and the like. The collected applications are stored as apk files (561) in the storage (560) via the system bus (550).

Next, the source code extraction operation (521) extracts source code from the apk file (561) by decompiling and disassembling it. The extracted source code is stored as source code (563) in the storage (560) via the system bus (550).

The risk calculation operation (523) checks whether the source code (563) in the storage (560) contains the APIs or commands that serve as features. For example, if 40 APIs or commands have been selected as features in advance, a 40-dimensional coordinate with values of 0 and 1 is obtained according to whether each of them is present.

The risk is computed by multiplying each coordinate of the resulting feature coordinates, or feature vector, by its weight. The weights are stored in the storage as the malicious code API pattern (567). The set of feature APIs and commands, and the weights applied to the values indicating their presence or absence, are selected and determined in advance through learning. The risk computed from the feature APIs of the new application is then stored as the risk (565) in the storage (560) via the system bus (550).

The maliciousness determination operation (525) judges the collected new application malicious when the risk (565) in the storage (560) is at or above a preset threshold. The threshold can be set through learning on applications previously judged malicious.

Each component in FIG. 7 may be software, or hardware such as an FPGA (Field Programmable Gate Array) or an ASIC (Application-Specific Integrated Circuit). However, the components are not limited to software or hardware; they may be configured to reside in an addressable storage medium or to execute on one or more processors. The functions provided by the components may be implemented by further subdivided components, or several components may be combined into a single component that performs a specific function.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the invention pertains will understand that it may be practiced in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

Claims

A method for detecting a malicious application, the method comprising:
extracting, by a malicious application detection device, source code of a new application;
extracting, by the malicious application detection device, a method from the source code and calculating a similarity by comparing, method by method, a first binary corresponding to the method with a second binary corresponding to a method included in an application classified as malicious;
calculating, by the malicious application detection device, a risk for the new application when the similarity is less than a predetermined threshold value; and
determining, by the malicious application detection device, whether the new application is malicious using the risk,
wherein calculating the risk comprises:
a first step of generating an n-dimensional first learning target feature vector tagged as malicious, based on features included in a first learning target application classified as malicious, and generating an n-dimensional second learning target feature vector tagged as normal, based on features included in a second learning target application classified as normal;
a second step of performing learning on the first learning target feature vector and the second learning target feature vector using a genetic algorithm and updating a weight for each feature through the learning;
repeating the first step and the second step to determine a feature weight for each feature constituting the n-dimensional feature vector;
generating an n-dimensional detection target feature vector based on whether each of first to n-th features is included in the source code of the new application; and
calculating the risk of the new application as a weighted sum of the values of the elements constituting the detection target feature vector and the determined feature weights.

The method according to claim 1,
wherein extracting the source code comprises decompiling or disassembling the new application to extract the source code.

The method according to claim 1,
wherein the first to n-th features are APIs or system commands selected, on the basis of appearance frequency, from among the features included in the first learning target application and the features included in the second learning target application.

(Deleted)

The method according to claim 1,
wherein calculating the risk comprises scaling the value calculated as the weighted sum of the values of the elements constituting the detection target feature vector and the determined feature weights so that it falls within a range of 0 to 100.

The method according to claim 1,
wherein determining whether the new application is malicious comprises comparing the risk with a preset threshold value to determine whether the new application is malicious.

A malicious application detection device, comprising:
a network interface;
one or more processors;
a memory for loading a computer program executed by the processors; and
a storage for storing a new application,
wherein the computer program comprises:
an operation of extracting source code of the new application;
an operation of extracting a method from the source code and calculating a similarity by comparing, method by method, a first binary corresponding to the method with a second binary corresponding to a method included in an application classified as malicious;
an operation of calculating a risk for the new application when the similarity is less than a predetermined threshold value; and
an operation of determining whether the new application is malicious using the risk,
wherein the operation of calculating the risk comprises:
a first operation of generating an n-dimensional first learning target feature vector tagged as malicious, based on features included in a first learning target application classified as malicious, and generating an n-dimensional second learning target feature vector tagged as normal, based on features included in a second learning target application classified as normal;
a second operation of performing learning on the first learning target feature vector and the second learning target feature vector using a genetic algorithm and updating a weight for each feature through the learning;
an operation of repeating the first operation and the second operation to determine a feature weight for each feature constituting the n-dimensional feature vector;
an operation of generating an n-dimensional detection target feature vector based on whether each of first to n-th features is included in the source code of the new application; and
an operation of calculating the risk of the new application as a weighted sum of the values of the elements constituting the detection target feature vector and the determined feature weights.