KR101067650B1

KR101067650B1 - System for constructing whitelist

Info

Publication number: KR101067650B1
Application number: KR1020090107540A
Authority: KR
Inventors: 윤현수; 이형규; 이민수
Original assignee: 한국과학기술원
Priority date: 2009-11-09
Filing date: 2009-11-09
Publication date: 2011-09-29
Also published as: KR20110050939A

Abstract

The present invention relates to a whitelist construction system comprising: a crawling unit that searches sites on the web; a site extraction unit that, for each site found by the crawling unit, determines whether the site is related to a specific field and extracts the related sites; and a whitelist extraction unit that extracts a whitelist from the sites extracted by the site extraction unit, using the domain information of each site.

Crawling, Whitelist, Domain

Description

Whitelist Construction System {SYSTEM FOR CONSTRUCTING WHITELIST}

The present invention relates to a whitelist construction system, and more particularly to a new whitelist construction system called the Anti-Phishing Crawler, which continuously and actively maintains and manages a high-quality whitelist so as to fundamentally address the recently surging phishing problem and also to prevent pharming, a variant of phishing.

As the number of Internet users grows, crimes that exploit the Internet environment are also increasing rapidly. Among them, phishing, which has recently become widespread, is a representative new type of crime that exploits the characteristics of the cyber environment.

Phishing refers to a criminal mechanism that attempts to steal users' personal information and confidential account information. According to a Gartner survey, financial losses from phishing were reported to have reached USD 2.6 billion in 2006 and USD 3.2 billion in 2007, and the losses continue to grow.

At present, the most widely used anti-phishing technique is blacklist-based. A blacklist technique collects and stores the URLs of known phishing sites so that a web browser or similar program can block a site when a user visits a stored phishing site. Most commercial anti-phishing solutions and common web browsers use this technique. However, because most phishing sites are very short-lived, a phishing site cannot be blocked until it has been reported to the list, even if the list is updated periodically. In other words, the technique cannot detect unknown phishing sites. In addition, because the number of phishing sites keeps increasing, maintaining the list is very costly.

The whitelist technique, by contrast, can overcome these drawbacks of the blacklist technique. A whitelist-based anti-phishing technique collects and stores the URLs of safe sites and tells users whether a site is safe when they visit it.

Because a whitelist generally targets safe sites with long lifetimes, it is easier to maintain than a blacklist, and it can detect unknown phishing attacks. However, it is not easy to decide how many sites the whitelist should contain and how it should be built, so despite these advantages the technique is not widely used and very few methods have been proposed.

Among the previously proposed methods, one builds a whitelist from an individual's own sites, exploiting the fact that the number of sites an individual visits frequently is very limited. This proposal presents a concrete way to build a whitelist, but because the whitelist covers only the individual's sites, it cannot judge the safety of any other site. The proposal also plays down the need for periodic updates on the grounds that whitelisted sites do not change often. This is a limitation: because whitelisted sites are treated as safe, they are highly important, and updates are essential even if the sites rarely change.

The present invention has been devised in view of the above problems. An object of the present invention is to build a whitelist efficiently by using a crawler, so as to achieve the automation, broad coverage, and periodic updating that matter most when creating a whitelist for whitelist-based anti-phishing.

Another object of the present invention is to select finance-related sites efficiently by means of a topic crawler that uses appropriate seed pages and queries to crawl finance-related sites, which are the best-known phishing targets.

Another object of the present invention is to judge effectively the safety of sites to be placed on the whitelist, by using a page characteristic and four domain characteristics, which are the features that most clearly distinguish phishing sites from legitimate sites, to determine whether a selected site is safe.

A further object of the present invention is to prevent pharming, a new fraud technique that has recently become widespread, by also storing the IP address of each site.

To achieve these technical objects, the present invention provides a whitelist construction system comprising: a crawling unit that searches sites on the web; a site extraction unit that, for each site found by the crawling unit, determines whether the site is related to a specific field and extracts the related sites; and a whitelist extraction unit that extracts a whitelist from the sites extracted by the site extraction unit, using the domain information of each site.

The present invention also relates to a whitelist construction method comprising:
(a) searching, by the crawling unit, sites on the web;
(b) receiving and storing, by the site extraction unit, the sites found by the crawling unit;
(c) calculating, by the site extraction unit, a similarity to a specific field for each stored site using [Equation 1], and determining whether the similarity value is non-zero;
(d) if the determination in step (c) is that the similarity value is non-zero, determining, by the site extraction unit, that the site is related to the specific field, extracting it, and recording the calculated similarity score and the corresponding link;
(e) determining, by the whitelist extraction unit, for each site extracted by the site extraction unit, whether the domain of the current site matches the domains of its linked sites;
(f) if the determination in step (e) is that the domains match, determining, by the whitelist extraction unit, that the site is not a phishing site and extracting it;
(g) collecting, by the whitelist extraction unit, information on the lifetime, age, popularity, and DNS ranking of the site's domain for each site extracted as not being a phishing site;
(h) converting, by the whitelist extraction unit, the collected information into numeric characteristic values using a neural network technique; and
(i) extracting, by the whitelist extraction unit, on the basis of the numeric values, the sites that have high input values for the domain characteristic items as the whitelist, and storing them together with the URL and IP information of the extracted sites.

According to the present invention as described above, a whitelist, which is difficult to build in whitelist-based anti-phishing, is populated automatically with as many sites as possible by means of a crawler, so that users can periodically check an up-to-date list, which increases reliability.

In addition, according to the present invention, sites for the whitelist are selected by effectively combining several characteristics that distinguish legitimate sites from phishing sites, so that safer sites can be provided.

Furthermore, the present invention prevents not only phishing but also pharming, a new fraud technique.

Specific features and advantages of the present invention will become more apparent from the following detailed description based on the accompanying drawings. Note that detailed descriptions of known functions and configurations related to the present invention are omitted where they would unnecessarily obscure the gist of the invention.

Hereinafter, the present invention is described in detail with reference to the accompanying drawings.

The whitelist construction system and method according to the present invention are described below with reference to FIGS. 1 to 7.

FIG. 1 is an overall configuration diagram conceptually showing the whitelist construction system (S) according to the present invention. As shown, the system comprises a crawling unit (100), a site extraction unit (200), and a whitelist extraction unit (300).

The crawling unit (100) searches sites on the web.

The site extraction unit (200) determines, for each site found by the crawling unit (100), whether the site is related to a specific field, and extracts the related sites. As shown in FIG. 1, it includes a site management module (210) and a similarity calculation module (220). In this embodiment the specific field is set to "finance", but the present invention is not limited thereto.

Specifically, the site management module (210) receives and stores the sites found by the crawling unit (100).

The similarity calculation module (220) calculates a "finance" similarity using [Equation 1] below to determine whether each site stored in the site management module (210) is related to "finance".

[Equation 1]

The cosine similarity sim(q, p) of page p and query q is

$$\mathrm{sim}(q, p) = \frac{V_q \cdot V_p}{\lVert V_q \rVert \, \lVert V_p \rVert}$$

where $V_q$ and $V_p$ are the term-frequency-based vector representations of query q and page p, respectively, $V_q \cdot V_p$ is the inner product of the two vectors, and $\lVert V \rVert$ is the Euclidean norm of a vector $V$.

The similarity calculation module (220) determines whether the similarity value is non-zero. If it is non-zero, the module determines that the site is finance-related and extracts it. If it is zero, the module determines that the site is not finance-related, receives another site from the site management module (210), and calculates its similarity.
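The similarity test of [Equation 1] can be sketched in Python as follows. This is a minimal illustration, not the implementation described in the source: the tokenizer (lowercased alphanumeric runs) and the function names `term_frequencies`, `cosine_similarity`, and `is_related` are assumptions introduced here.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Build a term-frequency vector (term -> count) from raw text.

    The tokenizer is an assumption: lowercased runs of letters and digits.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(query, page_text):
    """Cosine similarity between the term-frequency vectors of query and page."""
    vq = term_frequencies(query)
    vp = term_frequencies(page_text)
    # Inner product over the query terms; missing page terms count as 0.
    dot = sum(count * vp[term] for term, count in vq.items())
    norm_q = math.sqrt(sum(c * c for c in vq.values()))
    norm_p = math.sqrt(sum(c * c for c in vp.values()))
    if norm_q == 0 or norm_p == 0:
        return 0.0  # an empty query or page shares no terms
    return dot / (norm_q * norm_p)


def is_related(query, page_text):
    """A page counts as related when its similarity is non-zero."""
    return cosine_similarity(query, page_text) != 0.0
```

With the query "bank", any page that contains the term at least once passes the non-zero test, so the score is mainly useful for ordering pages rather than for filtering them.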

A site determined to be finance-related is sent back to the site management module (210) with its calculated similarity score and link recorded. The similarity calculation module (220) can thus subsequently receive sites from the site management module (210) in descending order of similarity score and calculate their similarity.

For reference, the site extraction unit (200) according to the present invention uses a technique called a topic crawler. A topic crawler, or topic-focused crawler, is a crawler that attempts to download only web pages related to a predefined topic. Here, the best-first algorithm using the cosine similarity of [Equation 1], which is known to be the most efficient topic crawler algorithm, is used, and crawling is implemented as multi-threaded crawling with 256 concurrent threads so that it proceeds smoothly. FIG. 2 is an example view showing the multi-threaded crawler model of the site extraction unit.
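A best-first crawl of this kind can be sketched with a priority queue. The sketch below is single-threaded for clarity, whereas the source describes 256 concurrent threads. The callables `fetch` and `score`, the `max_pages` limit, and the choice not to follow links from unrelated pages are all assumptions made for illustration.

```python
import heapq


def best_first_crawl(seeds, fetch, score, max_pages=100):
    """Visit URLs in descending order of the score of the page that linked to them.

    fetch(url) -> (text, links) and score(text) -> float are caller-supplied
    hypothetical hooks; they are not defined by the source.
    """
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    related = []  # (url, score) for every page with non-zero similarity

    while frontier and len(related) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        page_score = score(text)
        if page_score == 0:
            continue  # unrelated page: assumed here not to contribute links
        related.append((url, page_score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-page_score, link))
    return related
```

Passing an in-memory dictionary lookup as `fetch` makes the ordering easy to inspect without touching the network.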

The site extraction unit (200) has two important elements.

The first is the query, which corresponds to the topic. The more accurate the query, the more highly the sites related to the topic are ranked for crawling. The present invention uses the query "bank" to crawl effectively not only banks but also sites that handle financial transactions. Although the query is simple, it performs very well because it is the term most frequently used on finance-related sites.

The second is the seed page. The performance of the finance selector depends on how many topic-related links the seed page contains and how comprehensively it covers them.

The whitelist extraction unit (300) extracts a whitelist from the sites extracted by the site extraction unit (200), using the domain information of each site. As shown in FIG. 1, it includes a page information determination module (310), a domain information determination module (320), and a list extraction module (330).

In this embodiment, safety is judged on the basis of whether the domain of the current site matches the domains of its links. Among the links shown on a phishing site, almost none use the domain of the phishing site itself. This is because the purpose of a phishing site is to steal users' personal information and account numbers, so to save cost the operator creates no sites other than those made for phishing. By contrast, on most legitimate sites the site's own domain matches the domains of its links. Therefore, if none of the link domains matches the current domain, the site is very likely to be a phishing site.

Accordingly, the page information determination module (310) determines, for each site extracted by the similarity calculation module (220) of the site extraction unit (200), whether the domain of the current site matches the domains of its linked sites.

If the domains match, the module determines that the site is not a phishing site and extracts it; if they do not match, it determines that the site is a phishing site.
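The domain-matching test can be sketched as below. Two assumptions are made here that the source does not spell out: a page passes when at least one of its links points back to its own domain, and a site's domain is approximated by the last two labels of its host name. The latter is naive (it mishandles suffixes such as co.kr), and a real implementation would likely consult a public suffix list. The function names are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def site_domain(url):
    """Approximate a site's domain as the last two labels of its host name."""
    host = (urlparse(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def passes_page_check(page_url, links):
    """Return True when at least one link points back to the page's own domain."""
    own = site_domain(page_url)
    for link in links:
        absolute = urljoin(page_url, link)  # resolve relative links first
        if site_domain(absolute) == own:
            return True
    return False  # no link matches: treated as a likely phishing site
```

Note that relative links always resolve to the page's own host, so a page with any relative link passes this check; whether that is desirable depends on how links are collected.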

FIG. 3 is an example view showing a phishing site according to the present invention, in which the domain of the current site (a) and the domain of the linked site (b) can be seen not to match. A site whose domain does match the domains of its linked sites is then evaluated again, on the basis of the detailed characteristics of its domain, to determine whether it is a phishing site.

The domain information determination module (320) collects information on the lifetime, age, popularity, and DNS ranking of the domain of each site that the page characteristic determination module (310) has extracted as not being a phishing site, and converts the collected information into numeric characteristic values using a neural network technique.

The domain characteristics fall into four categories.

First, domain lifetime: the time from the domain's creation date to its expiration date.

Phishing sites deceive users over a shorter period than ordinary sites and then disappear, so their domain lifetimes are very short.

Second, domain age: the time from the domain's creation date to the present.

This reflects the short-lived nature of phishing sites: the younger the domain, the more likely the site is a phishing site. FIG. 4 is an example view showing the result of an information lookup used to calculate the lifetime and age of a specific domain with a domain lifetime and age lookup site.

Third, domain popularity: how many sites using the same domain are registered with a general search engine.

A well-known legitimate site typically offers many sites under its domain, so the number of sites using that domain is large. A phishing operator registers only the sites needed to deceive users and does not register many sites under the domain. Search engines also delete phishing sites from their databases for their own safety, so phishing sites are not easy to find in a search engine. FIG. 5 shows the result of a domain popularity lookup for a specific bank using a search engine site.

Fourth, DNS ranking: how often users have requested the domain.

Users look up the domain of a legitimate site often in order to reach the site, whereas the domain of a phishing site is not well known, so users rarely look it up. FIG. 6 is an example view showing the rank of a specific domain on a ranking site.

The domain characteristic is a combination of the four characteristics above. Overall, the longer the domain lifetime, the older the domain, the greater its popularity, and the higher its ranking, the more likely the site is legitimate.

If these four characteristics are judged independently, it is likely that a highly reliable whitelist cannot be built. The present invention therefore judges a site by combining the four characteristics using a neural network technique, which is one of the data mining techniques.

Because the input values of a neural network must lie between 0 and 1, the input values of the domain characteristics, each of which is categorical, need to be converted.

[Table 1] below shows how the information on domain lifetime, age, popularity, and DNS ranking is converted into numeric values in this embodiment.

Characteristic    | 0                  | 0 to 1                              | 1
Domain lifetime   | 2 years or less    | (domain lifetime - 2) / 6           | 8 years or more
Domain age        | 1 year or less     | (domain age - 1) / 3                | 4 years or more
Domain popularity | 0                  | number of domains / 100,000         | 100,000 or more
DNS ranking       | not in the ranking | (10000001 - DNS ranking) / 1,000,000 | 1st place

[Table 1] Conversion of domain characteristic input values to numeric values

[Table 1] shows how the values of the domain characteristics are converted from categorical to numeric form. For domain lifetime, a lifetime of 2 years or less is assigned the lowest input value, 0, and a lifetime of 8 years or more is assigned the highest value, 1; a lifetime between 2 and 8 years is assigned a proportional value. Domain age is handled in a similar way: 0 for an age of 1 year or less, 1 for 4 years or more, and a proportional value in between. For domain popularity, the lowest input value, 0, is assigned when the lookup returns no sites for the domain, and the highest value, 1, when it returns 100,000 sites or more; a popularity in between is assigned a proportional value. Finally, for DNS ranking, a domain that is not in the ranking is assigned the smallest value, 0; otherwise the value decreases as the rank number increases, because a numerically smaller rank must receive a larger value. Overall, the longer the domain lifetime, the older the domain, the greater the popularity, and the higher the ranking, the larger the input value.
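The conversions in [Table 1] can be sketched as four small functions. The function names and the `ranking_size` parameter are introduced here for illustration. One assumption deserves emphasis: the table prints the DNS ranking numerator constant as 10000001, which would map rank 1 to roughly 10 rather than 1, so the sketch assumes a ranking of 1,000,000 entries and a numerator constant of 1,000,001, which maps rank 1 to exactly 1.

```python
def clamp01(value):
    """Clamp a number into the closed interval [0, 1]."""
    return max(0.0, min(1.0, value))


def lifetime_input(years):
    """0 at 2 years or less, 1 at 8 years or more, linear in between."""
    return clamp01((years - 2) / 6)


def age_input(years):
    """0 at 1 year or less, 1 at 4 years or more, linear in between."""
    return clamp01((years - 1) / 3)


def popularity_input(site_count):
    """0 for no indexed sites, 1 at 100,000 sites or more."""
    return clamp01(site_count / 100_000)


def dns_rank_input(rank, ranking_size=1_000_000):
    """rank is None when the domain does not appear in the ranking.

    Assumption: the numerator constant is ranking_size + 1, so rank 1 -> 1.0.
    """
    if rank is None:
        return 0.0
    return clamp01((ranking_size + 1 - rank) / ranking_size)
```

For example, a domain with a 5-year lifetime and a 2.5-year age yields 0.5 for both inputs under these formulas.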

The list storage module (330) extracts and stores the whitelist on the basis of the values produced by the domain information determination module (320). That is, it extracts the sites that have high input values for the domain characteristic items, finally determines that these sites are not phishing sites, extracts them as the whitelist, and stores the URL and IP information of the extracted sites.
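The source states that storing the IP address alongside the URL is what guards against pharming, but it does not describe how the stored address is consulted. One plausible reading, sketched below under that assumption, is to compare the address a host currently resolves to with the address recorded in the whitelist. The `Whitelist` class, its method names, and the three result labels are hypothetical.

```python
from urllib.parse import urlparse


class Whitelist:
    """Maps a host name to the set of IP addresses recorded for it."""

    def __init__(self):
        self._entries = {}

    def add(self, url, ip):
        """Record a whitelisted site together with its IP address."""
        host = (urlparse(url).hostname or "").lower()
        self._entries.setdefault(host, set()).add(ip)

    def check(self, url, resolved_ip):
        """Return 'safe', 'ip-mismatch' (possible pharming), or 'unknown'."""
        host = (urlparse(url).hostname or "").lower()
        if host not in self._entries:
            return "unknown"
        if resolved_ip in self._entries[host]:
            return "safe"
        return "ip-mismatch"
```

A legitimate site may resolve to several addresses, which is why each host maps to a set rather than a single value.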

A construction method using the whitelist construction system (S) described above is explained below with reference to FIG. 7.

FIG. 7 is an overall flowchart of the whitelist construction method according to the present invention. As shown, the crawling unit (100) searches sites on the web (S10), and the site management module (210) of the site extraction unit (200) receives and stores the sites found by the crawling unit (100) (S20).

Next, the similarity calculation module (220) calculates the "finance" similarity of each site stored in the site management module (210) using [Equation 1] above, and determines whether the similarity value is non-zero (S30).

If the determination in step S30 is that the similarity value is non-zero, the similarity calculation module (220) determines that the site is finance-related, extracts it, records the calculated similarity score and the corresponding link, and sends them to the site management module (210) (S40). If the similarity value is zero, the procedure returns to step S10, and the module receives another site from the site management module (210) and calculates its similarity.

Subsequently, the page information determination module (310) of the whitelist extraction unit (300) determines, for each site extracted by the similarity calculation module (220), whether the domain of the current site matches the domains of its linked sites (S50).

If the determination in step S50 is that the domains match, the page information determination module (310) determines that the site is not a phishing site and extracts it (S60). If they do not match, the module determines that the site is a phishing site and the procedure returns to step S10.

Next, the domain information determination module (320) collects information on the lifetime, age, popularity, and DNS ranking of the domain of each site that the page characteristic determination module (310) has extracted as not being a phishing site (S70), and converts the collected information into numeric characteristic values using a neural network technique (S80).

Then, on the basis of the values produced by the domain information determination module (320), the list storage module (330) extracts the sites that have high input values for the domain characteristic items as the whitelist and stores the URL and IP information of the extracted sites (S90). The procedure then returns to step S10 so that it can be repeated.
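The flow of steps S10 to S90 can be summarized as a single filtering pass. In the sketch below every stage is a caller-supplied callable, because the source does not fix how each module is implemented; in particular, the `accept` hook stands in for the neural-network decision, whose architecture and threshold are not given. The page dictionary layout and all parameter names are assumptions.

```python
def build_whitelist(pages, similarity, page_check, domain_inputs, accept):
    """One pass of steps S10 to S90 over already-crawled pages.

    pages: iterable of dicts with "url", "text", "links", and "ip" keys.
    similarity, page_check, domain_inputs, accept: hypothetical hooks
    standing in for modules 220, 310, 320, and 330 respectively.
    """
    whitelist = []
    for page in pages:                                   # S10, S20
        if similarity(page["text"]) == 0:                # S30
            continue                                     # not related to the field
        if not page_check(page["url"], page["links"]):   # S50
            continue                                     # likely phishing site
        inputs = domain_inputs(page["url"])              # S70, S80
        if accept(inputs):                               # S90
            whitelist.append((page["url"], page["ip"]))  # store URL and IP
    return whitelist
```

Keeping the stages as parameters makes it straightforward to swap a stub for a real lookup when testing each step in isolation.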

Although the present invention has been described and illustrated with reference to a preferred embodiment that exemplifies its technical idea, the invention is not limited to the configuration and operation shown and described. Those skilled in the art will appreciate that many changes and modifications can be made to the invention without departing from the scope of the technical idea. Accordingly, all such appropriate changes, modifications, and equivalents should be regarded as falling within the scope of the present invention.

FIG. 1 is an overall configuration diagram conceptually showing the whitelist construction system according to the present invention.

FIG. 2 is an example view showing the multi-threaded crawler model of the site extraction unit.

FIG. 3 is an example view showing a phishing site according to the present invention.

FIG. 4 is an example view showing the result of an information lookup used to calculate the lifetime and age of a specific domain.

FIG. 5 is an example view showing the result of a domain popularity lookup for a specific bank using a search engine site.

FIG. 6 is an example view showing the rank of a specific domain on a ranking site.

FIG. 7 is an overall flowchart of the whitelist construction method according to the present invention.

** Description of reference numerals for the main parts of the drawings **

100: crawling unit

200: site extraction unit

300: whitelist extraction unit

210: site management module

220: similarity calculation module

310: page information determination module

320: domain information determination module

330: list extraction module

Claims

A whitelist construction system comprising:

a crawling unit (100) that searches for sites on the web;

a site extraction unit (200) that, for each site found by the crawling unit (100), determines whether the site is related to a specific field and thereby extracts the related sites; and

a whitelist extraction unit (300) that, for the sites extracted by the site extraction unit (200), determines whether each site is a phishing site using the domain information of that site and, if the site is not a phishing site, extracts a whitelist by converting the collected information into numerical values,

wherein the site extraction unit (200) comprises:

a site management module (210) that receives and stores the sites found by the crawling unit (100); and

a similarity calculation module (220) that calculates, using [Equation 1], a similarity to the specific field in order to determine whether each site stored in the site management module (210) is related to the specific field,

and wherein the similarity calculation module (220)

determines whether the similarity value is non-zero; if it is non-zero, determines that the site is related to the specific field and extracts it; if it is zero, determines that the site is not related to the specific field, then receives a site and performs the similarity calculation; and records the similarity score and the corresponding link of each site determined to be related to the specific field and transmits them to the site management module (210).

[Equation 1]

The cosine similarity sim(q, p) of page p and query q is

$$\mathrm{sim}(q, p) = \frac{V_q \cdot V_p}{\lVert V_q \rVert \, \lVert V_p \rVert}$$

where $V_q$ and $V_p$ are the term-frequency-based vector representations of the query q and the page p, respectively, $V_q \cdot V_p$ is the dot product of the two vectors, and $\lVert V \rVert$ is the Euclidean norm of the vector $V$.
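
The cosine similarity of [Equation 1] and the non-zero test applied by the similarity calculation module (220) can be sketched as follows. This is a minimal illustration, not the patented implementation: whitespace tokenization and the helper names `term_vector`, `cosine_similarity`, and `is_related` are assumptions.

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Term-frequency vector; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def cosine_similarity(vq: Counter, vp: Counter) -> float:
    """Dot product of the two vectors divided by the product of their Euclidean norms."""
    dot = sum(vq[t] * vp[t] for t in vq.keys() & vp.keys())
    norm_q = math.sqrt(sum(c * c for c in vq.values()))
    norm_p = math.sqrt(sum(c * c for c in vp.values()))
    if norm_q == 0 or norm_p == 0:
        return 0.0  # an empty query or page shares no terms with anything
    return dot / (norm_q * norm_p)


def is_related(query: str, page_text: str) -> bool:
    """A site counts as related to the field when the similarity is non-zero."""
    return cosine_similarity(term_vector(query), term_vector(page_text)) != 0
```

Because term frequencies are non-negative, the similarity is zero exactly when the query and the page share no terms, so the non-zero test amounts to checking for at least one common term.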

(Deleted)

The system of claim 1,

wherein the whitelist extraction unit (300) comprises:

a page information determination module (310) that, for the sites extracted by the site extraction unit (200), determines whether the domain of the current site matches the domain of a linked site;

a domain information determination module (320) that, for each extracted site determined by the page characteristic determination module (310) not to be a phishing site, collects information on the lifetime, age, popularity, and DNS ranking of the site's domain and converts the collected information into numerical characteristic values using a neural network technique; and

a list storage module (330) that extracts and stores a whitelist based on the values produced by the domain information determination module (320).

Claim 6 was abandoned when the registration fee was paid.

The system of claim 5,

wherein the page information determination module (310)

determines that the site is not a phishing site if the domain of the current site matches the domain of the linked site, and determines that the site is a phishing site if they do not match.
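
The domain match described here can be sketched as follows. The text does not say how links are gathered or whether every link must match, so this sketch makes assumptions: it compares full hostnames, ignores non-HTTP links, and flags the page as soon as one link points to a different host. The function name `is_phishing_candidate` is hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def is_phishing_candidate(page_url: str, link_urls: list[str]) -> bool:
    """Return True if any HTTP(S) link on the page points to a different host."""
    page_host = (urlsplit(page_url).hostname or "").lower()
    for link in link_urls:
        # Resolve relative links against the page URL before comparing.
        parts = urlsplit(urljoin(page_url, link))
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar links
        if (parts.hostname or "").lower() != page_host:
            return True
    return False
```

Comparing full hostnames treats subdomains as different domains; a real system might compare registrable domains instead, which would need a public-suffix list.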

The system of claim 5,

wherein the list storage module (330)

extracts, as the whitelist, sites whose input value for the domain characteristic items is equal to or greater than a predetermined value, and stores the URL and IP information of the extracted sites.

(Deleted)