KR20100068352A

KR20100068352A - B-tree index vector based web-log restoration method for huge web log mining and web attack detection

Info

Publication number: KR20100068352A
Application number: KR1020100052042A
Authority: KR
Inventors: 이형우
Original assignee: 충남대학교산학협력단; 한신대학교 산학협력단; 이형우
Priority date: 2010-06-01
Filing date: 2010-06-01
Publication date: 2010-06-23
Also published as: KR101005871B1

Abstract

PURPOSE: To provide a B-tree index vector-based web log restoration method for large-scale web log mining and attack detection that parses web log information field by field and indexes it efficiently, thereby improving mining-based attack detection and real-time response for large web logs. CONSTITUTION: A field division module (20) divides the web log delivered by a log load module (10) into fields. A B-tree indexing module (30) builds a B-tree from the divided fields and generates an index list and an indexing log. A query processor (40) searches the B-tree for entries matching the search condition of an external query and extracts a field index set. A log restoration module (60) restores the web log from the extracted indexing log list.

Description

B-Tree Index Vector Based Web-Log Restoration Method For Huge Web Log Mining And Web Attack Detection

The present invention relates to a B-tree index vector-based web log restoration method for large-scale web log mining and attack detection. More specifically, it relates to a method that builds a B-tree by comparing keywords using a B-tree indexing technique and realizes an efficient indexing structure for the information recorded in web logs, so that web logs can be searched and restored at high speed.

Internet usage in Korea is rising steadily, and the number of domestic Internet users now exceeds about 34.43 million. As of June 2007, the Internet usage rate among people aged 6 and over (the share who used the Internet within the previous month) reached 75.5%. The number of users and the usage rate continue to grow, and users' age groups and occupations are becoming more diverse.

As the number of Internet users grows, major Korean portal web sites generate large volumes of web logs, averaging around 50 GB per day. Along with this growth in web services, vulnerabilities in web services are also increasing rapidly, and both attempted and successful web attacks on major domestic portal sites are rising.

With the advent of Web 2.0, web services are expanding further, and attacks on web portal sites that exploit vulnerabilities are surging, including advertisements (ActiveX), phishing, malware distribution, attacks on personal information, service hacking, XSS attacks, SQL injection, and script attacks. In particular, vulnerabilities are being found that allow malicious attacks on portal sites, or illegal leakage of information about web users, through attacks on web services such as SQL injection, parameter injection, and DoS.

To respond actively to web attacks of this kind, the first requirement is a method that detects attack attempts by analyzing the web log information generated by the web site. Companies that use the Internet as a primary business and marketing channel also use web log information as a tool for analyzing consumer tendencies and offering improved services. However, collected web log data is not in a form that can be analyzed directly. To analyze user tendencies, extract statistical data about a web site, and block web attacks in advance, web log preprocessing must be applied, which calls for a high-speed indexing technique for the large volume of web log information generated by portal sites.

However, existing web IDS systems and web mining-based attack detection techniques detect attacks on web servers without first performing high-speed indexing of the large web logs, so the detection rate drops when a web IDS/IPS system is running.

A conventional IDS detects attacks by applying rule data to IP packets. Detecting attacks on a web server, however, requires analyzing the web log data generated by the web server in order to detect illegal access from outside or to provide anomaly detection.

Existing web IDS systems use web attack rules to detect, on the basis of web logs, attacks from outside as well as improper query transmission and abnormal access inside the web system.

However, because existing web IDS systems compare the massively generated web logs against web attack detection rules without any separate preprocessing, they cannot cope efficiently with web attacks carried out in real time.

Existing web IDS systems also compare logs against rules by sequential search, with no preprocessing of the log files. In the worst case, search results become available only after a certain amount of time has passed since the web log was generated.

Therefore, combining web log preprocessing with web log mining and applying both to the system is expected to allow efficient detection of, and response to, attacks on web servers such as the malware-based attacks mentioned above and DoS attacks launched through botnets.

Web mining techniques fall broadly into content mining, usage mining, and structure mining. With the recent rise of Web 2.0 technology, the number of dynamic web sites is growing rapidly, and the amount of web log information to be analyzed is growing with it, so techniques for handling this volume need to be studied. Responding actively to attacks on the web is expected to require a combination of indexing and mining techniques. High-speed indexing of large web log information can provide real-time analysis and make it possible to identify various forms of web attack.

FIG. 1 is a conceptual diagram showing the process of applying mining technology to large-scale web information, in which an attack is detected by analyzing log files inside the system and comparing them with defined rules.

Referring to FIG. 2, conventional techniques lack high-speed processing for large-scale web log information and lack processing for the string information recorded in the logs, so they cannot detect attack information efficiently. Most conventional techniques also remain at the stage of extracting usage information about web users. They use mining only as a means of predicting user behavior from web log information, for example to extract information about which page a user will move to next.

If web log mining is combined with attack detection in web logs, however, a better detection rate than that of existing techniques can be achieved. This in turn requires high-speed indexing of large web logs.

As described above, web mining means automatically discovering the access patterns of web page users through web log analysis. A web server generally produces web logs in a fixed format. Web logs generated by IIS or Apache servers follow the W3C format.

FIG. 3 describes the W3C extended log format of IIS 6.0 used in the prior art. Of the log fields in FIG. 3, only the information in specific fields is extracted and used for web usage mining, taking into account the interrelationships among certain fields. Accordingly, to derive a method for anomaly detection and to extract mining information from large web log information, the relevance of each log field to attack detection was derived as shown in FIG. 4.

As shown in FIG. 4, log field elements must be extracted in order to detect abnormal web behavior from large web log information and to link attack detection with mining techniques. In addition, a web log preprocessing module must be developed beforehand to handle tasks such as cleaning and filtering irrelevant log information, classifying users, classifying the sessions of those users, and extracting navigation paths within the web site on the basis of the classified sessions.

The conventional web log preprocessing method can be divided broadly into five steps, as shown in FIG. 5.

- Step 1 (Cleaning Log) - A preprocessing step that deletes log information not needed for analysis. Multimedia files that make up a web page, such as jpg, jpeg, gif, and swf files, are requested automatically when the user requests the page. They do not reflect the user's intent and are unnecessary for web usage mining, so they are deleted. Cleaning in this way can reduce the log to roughly 1/20 to 1/40 of its original volume.

- Step 2 (User Identification) - Identifies users from the web log information, generally using the c-ip and cs(User-Agent) fields. When a proxy server is in use, however, c-ip and user agent information alone cannot distinguish different users who share the same computer.

- Step 3 (Session Identification) - A session does not indicate whether a user is logged in. A session ID is assigned from the moment the user starts using a web page in a browser. Session identification is therefore the most important preprocessing step for analyzing each user's navigation path and preferences within the web pages.

- Step 4 (Path Completion) - Uses the classified session information to trace the user's navigation path within the web pages, and fills in the path where the user moved by pressing the back or forward button.

- Step 5 (Formatting) - Converts the data into a format suitable for web log analysis. The format depends on the analysis technique used.
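
Step 2 of the list above can be sketched as follows. This is only an illustration under stated assumptions, not the method used in the prior art systems: the function name `identify_users`, the tuple layout of the entries, and the sample values are all hypothetical.

```python
def identify_users(entries):
    """Label each entry with a user id keyed by (c-ip, user agent).

    entries: iterable of (c_ip, user_agent, uri) tuples (hypothetical layout).
    Ids are assigned in order of first appearance.
    """
    user_ids = {}
    labelled = []
    for c_ip, user_agent, uri in entries:
        # setdefault returns the existing id, or stores the next free one.
        uid = user_ids.setdefault((c_ip, user_agent), len(user_ids))
        labelled.append((uid, uri))
    return labelled


entries = [
    ("10.0.0.1", "Mozilla/4.0", "/a"),
    ("10.0.0.2", "Mozilla/4.0", "/b"),
    ("10.0.0.1", "Mozilla/4.0", "/c"),  # same pair as the first entry
]
labelled = identify_users(entries)
```

Two different people behind one proxy address with the same browser would share a single id here, which is the limitation the step description points out.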

Web logs that have passed through these steps serve as useful input for web mining techniques. However, as the modern web environment develops and changes, web log data is becoming more complex and much larger, and the existing preprocessing approach urgently needs improvement.

As the web develops, web logs are growing larger. Microsoft.com generates an average of 200 to 300 GB of web logs per day, and sometimes as much as 1 TB. Preprocessing web logs of this size in the conventional way is inefficient. Step 1 (Cleaning Log) of the conventional preprocessing can be performed by checking the suffix of the log information. For example, all log entries whose file name suffix is gif, jpeg, and so on can be deleted. However, the conventional preprocessing technique performs the Cleaning Log step by sequentially inspecting all of the fields shown in FIG. 3.
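
The suffix check described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the names `is_noise` and `clean_log`, the suffix set, the sample lines, and the position of cs-uri-stem in each line are assumptions.

```python
# Hypothetical suffix set; the text names jpg, jpeg, gif and swf as examples.
NOISE_SUFFIXES = (".jpg", ".jpeg", ".gif", ".swf")


def is_noise(uri_stem):
    """Return True when the requested resource is an embedded media file."""
    return uri_stem.lower().endswith(NOISE_SUFFIXES)


def clean_log(lines, uri_field=4):
    """Keep entries whose cs-uri-stem does not end in a media suffix.

    Only the cs-uri-stem field is inspected. Directive lines and lines
    too short to contain that field are dropped as well.
    """
    kept = []
    for line in lines:
        if line.startswith("#"):  # W3C directive lines such as "#Fields:"
            continue
        fields = line.split()
        if len(fields) > uri_field and not is_noise(fields[uri_field]):
            kept.append(line)
    return kept


sample = [
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status",
    "2010-06-01 10:00:00 10.0.0.1 GET /index.html 200",
    "2010-06-01 10:00:00 10.0.0.1 GET /img/logo.GIF 200",
    "2010-06-01 10:00:01 10.0.0.2 GET /banner.swf 200",
]
cleaned = clean_log(sample)
```

Checking one field per line avoids the sequential inspection of every field that the conventional approach performs.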

Generating new information from the individual fields of a web log requires a per-field classification technique and a search-within-results function for specific string logs. For example, log entries that request a particular resource from the web server at regular intervals can be reported as a web scan attack. This can be detected by extracting the cs-uri-stem field and then searching within those results on the date and time fields. The conventional web log preprocessing technique, however, offers neither field-level search nor search within results, which makes it inefficient for mining that combines multiple log fields.
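
The web scan example above can be sketched as a two-stage search: group by cs-uri-stem, then examine the date and time values within each group. The detection rule used here (at least `min_hits` requests whose gaps differ by no more than `tolerance` seconds) is an assumption made for illustration; the source does not specify thresholds, and the function name and sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def periodic_requests(entries, min_hits=3, tolerance=1.0):
    """Return {uri: mean_gap_seconds} for resources requested at
    near-constant intervals.

    entries: iterable of (date, time, cs_uri_stem) string tuples.
    """
    times_by_uri = defaultdict(list)
    for date, time, uri in entries:
        stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        times_by_uri[uri].append(stamp)

    suspects = {}
    for uri, stamps in times_by_uri.items():
        if len(stamps) < min_hits:
            continue
        stamps.sort()
        gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
        # Regular polling shows up as gaps that barely vary.
        if max(gaps) - min(gaps) <= tolerance:
            suspects[uri] = sum(gaps) / len(gaps)
    return suspects


entries = [
    ("2010-06-01", "10:00:00", "/admin/login.asp"),
    ("2010-06-01", "10:00:30", "/admin/login.asp"),
    ("2010-06-01", "10:01:00", "/admin/login.asp"),
    ("2010-06-01", "10:00:05", "/index.html"),
    ("2010-06-01", "10:07:41", "/index.html"),
    ("2010-06-01", "10:07:50", "/index.html"),
]
result = periodic_requests(entries)
```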

An object of the present invention is to provide a B-tree index vector-based web log restoration method for large-scale web log mining and attack detection, which uses a B-tree index vector structure to search and restore large volumes of web log information at high speed.

According to a preferred embodiment of the present invention for achieving the above object,

there is provided a B-tree index vector-based web log restoration method for large-scale web log mining and attack detection, which restores a web log having a B-tree index vector-based structure in a predetermined format for attack detection by a web IDS system, the method comprising:

a first step of fetching the corresponding log index information from a list;

a second step of fetching strings from the B-tree of each corresponding field using the indices and combining them; and

a third step of restoring the log text.

As described above, the B-tree index vector-based web log restoration method according to the present invention extracts the cs-uri-stem field and indexes duplicated log information when performing the Cleaning Log step, which is more efficient than the conventional approach.

According to the present invention, field-level parsing and efficient indexing are performed on web log information, and attack detection and real-time response performance for large logs are expected to improve on the basis of mining algorithms, making it possible to build an improved web IDS/IPS system.

According to the present invention, the log information is loaded and processed with per-field parsing and a field-level indexing technique that accounts for duplicate strings, which greatly reduces the size of the log information. In addition, organizing the field-level index information in a B-tree structure improves the performance of the existing preprocessing.

Furthermore, because the present invention provides multithreaded high-speed preprocessing and search for large web log information, overall system performance can be improved when it is combined with anomaly detection and mining techniques for web logs.

FIG. 1 is a conceptual diagram showing an example of a conventional large-scale web log preprocessing method.
FIG. 2 is a conceptual diagram showing an example of a conventional web log information-based mining method.
FIG. 3 shows the W3C extended log format of the prior art (IIS 6.0).
FIG. 4 shows the abnormal behavior considerations for each field of FIG. 3.
FIG. 5 is a conceptual diagram showing an example of a conventional web log preprocessing technique.
FIG. 6 is a schematic conceptual diagram of the web log preprocessing technique according to the present invention.
FIG. 7 is a conceptual diagram showing the multithreaded web log indexing process in the web log preprocessing technique according to the present invention.
FIG. 8 is a flowchart showing an example of the B-tree indexing vector-based web log fast search method according to the present invention.
FIG. 9 shows the B-tree structure in the B-tree indexing vector-based web log fast search method according to the present invention.
FIG. 10 is a conceptual diagram showing the web log restoration method in the B-tree indexing vector-based web log fast search method according to the present invention.
FIG. 11 shows an example of the fast search in the B-tree indexing vector-based web log fast search method according to the present invention.
FIG. 12 is a schematic block diagram of the B-tree indexing log processor in the B-tree indexing vector-based web log fast search method according to the present invention.
FIG. 13 is a flowchart of FIG. 12.

Hereinafter, the B-tree index vector-based web log restoration method according to the present invention will be described in detail with reference to the accompanying drawings.

FIG. 6 is a schematic conceptual diagram of the web log preprocessing technique according to the present invention. FIG. 7 is a conceptual diagram showing the multithreaded web log indexing process in that technique. FIG. 8 is a flowchart showing an example of the B-tree indexing vector-based web log fast search method according to the present invention, and FIG. 9 shows the B-tree structure used in that method. FIG. 10 is a conceptual diagram showing the web log restoration method, and FIG. 11 shows an example of the fast search. FIG. 12 is a schematic block diagram of the B-tree indexing log processor, and FIG. 13 is a flowchart of FIG. 12.

Referring to FIG. 6, the B-tree index vector-based web log fast search method according to the present invention performs high-speed preprocessing and indexing of web log information in order to make attack detection in large web logs more efficient. It also proposes a way to link with log indexing-based mining techniques in order to provide real-time attack response. The detailed structure of each module is as follows.

An IIS log is a single, very large body of text. That text can be divided into lines, and each line can be divided into fields, because every line consists of fields and every field carries its own piece of information. Rather than searching the huge text directly for the required information, the method therefore applies a divide-and-conquer approach that splits the problem into parts, and uses multithreading to provide high-speed indexing (see FIG. 7).
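
The line-then-field decomposition can be sketched as below. This is a sketch under assumptions, not the patented implementation: the function names are hypothetical, every line is assumed to have the same number of space-separated fields, and one thread is used per field column. In CPython, threads mainly illustrate the structure here, since CPU-bound work is not truly parallel under the global interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def index_column(values):
    """Assign each distinct string in one field column an integer index."""
    table = {}
    codes = []
    for value in values:
        # New strings receive the next free index; repeats reuse theirs.
        codes.append(table.setdefault(value, len(table)))
    return table, codes


def index_log(lines, workers=4):
    """Split lines into fields, then index every field column in a worker."""
    rows = [ln.split() for ln in lines if ln and not ln.startswith("#")]
    columns = list(zip(*rows))  # assumes equal field counts per row
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(index_column, columns))
    tables = [table for table, _ in results]
    # Transpose the per-column codes back into one index vector per line.
    vectors = [list(vec) for vec in zip(*(codes for _, codes in results))]
    return tables, vectors


lines = [
    "#Fields: c-ip cs-method cs-uri-stem",
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /index.html",
    "10.0.0.1 POST /login.asp",
]
tables, vectors = index_log(lines)
```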

The proposed method indexes the information in each field and builds it into a B-tree optimized for search. To analyze a web log, it must be converted into B-trees at the moment it is loaded, so this work is carried out while the web log is being read.

Duplicate strings are handled automatically while the web log is being loaded. The map module of the STL (Standard Template Library) stores keyword and data pairs and organizes them into a B-tree by comparing keywords.

Large web log information contains time information, sender and receiver IP addresses, the user's connection environment, the requested URI, and so on. A large portal site generates tens of gigabytes of web log information per day, so an efficient data structure is needed to store and manage it efficiently and to apply it to attack detection and web mining.

The present invention therefore applies a B-tree structure to implement an efficient indexing structure for the information recorded in the logs.

The keyword is the field's content, and the data is the index assigned to that content. If, for example, 'Key5' is passed to the map and the B-tree already holds an index for it, no further indexing is performed, and the log information is stored and managed using the index value already assigned.
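
The keyword-to-index behaviour described above can be sketched as follows. This is an illustration under assumptions: the class name `FieldIndex` and its methods are hypothetical, and a Python `dict` (a hash table, not a tree) stands in for the STL map purely to show the reuse of existing indices.

```python
class FieldIndex:
    """Keyword -> index table for one log field, plus the reverse list."""

    def __init__(self):
        self._index_of = {}  # keyword -> index (stand-in for the keyed map)
        self._keywords = []  # index -> keyword

    def get_or_insert(self, keyword):
        """Return the existing index, or assign the next one if new."""
        idx = self._index_of.get(keyword)
        if idx is None:
            idx = len(self._keywords)
            self._index_of[keyword] = idx
            self._keywords.append(keyword)
        return idx

    def keyword(self, idx):
        """Return the string stored under an index."""
        return self._keywords[idx]

    def __len__(self):
        return len(self._keywords)


field = FieldIndex()
ids = [field.get_or_insert(k) for k in ["Key1", "Key5", "Key2", "Key5"]]
```

The second occurrence of 'Key5' receives the index already assigned to the first, so only three distinct strings are stored for four inputs.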

In the present invention, the B-tree indexing vector-based web log preprocessing method shown in FIG. 8 is performed. Converting large web log information into B-tree index vector values reduces the overall storage requirement. Searching the web log information by index value rather than by string also yields results faster. The present invention therefore assigns B-tree-based index values to IIS web log information and thereby improves performance. Because web log information is stored only after this index-based preprocessing, the total preprocessing time and file size can be minimized.

For a web log, the method maintains the index key value list and the B-tree structure shown in FIG. 9. The index key value list stores the unique key values for each web log field and also allows the original web log text to be restored at high speed.

As shown in FIG. 10, each web log entry is indexed and stored in a condensed form. To restore the original web log text, the method reads the entry's log index vector, looks up the index key values stored in the B-trees, and combines the corresponding field values to reconstruct the original web log entry.
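
The restoration described above can be sketched as follows. The function name `restore_line`, the layout of `field_tables` (one index-to-string list per field), and the sample data are assumptions made for illustration. The rebuilt text matches the original only if the original fields were separated by a single space.

```python
def restore_line(index_vector, field_tables, sep=" "):
    """Rebuild one log line from its index vector.

    field_tables[i] is the index -> string list for field i.
    """
    return sep.join(
        field_tables[pos][idx] for pos, idx in enumerate(index_vector)
    )


# Hypothetical per-field tables for the fields c-ip, cs-method, cs-uri-stem.
field_tables = [
    ["10.0.0.1", "10.0.0.2"],
    ["GET", "POST"],
    ["/index.html", "/login.asp"],
]
restored = restore_line([1, 0, 0], field_tables)
```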

Because index-based B-tree information is held for each field, the stored information is retrieved field by field and reassembled in a multithreaded manner, which minimizes the delay incurred during reconstruction.

With this structure, performance can be maximized even when searching large volumes of web log information.

As shown in FIG. 11, when a field's index number is known from the web log index vector, the structure allows the field's content to be accessed directly and at high speed using that index number.

A search normally starts from the content of a field. The content is looked up in the B-tree, which is optimized for search, to obtain its index value. From then on, all work is done with the index number alone: once the index number is known, the value can be referenced by index number without searching the B-tree again.
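
The two-phase search described above can be sketched as follows: one string lookup to obtain the index, then integer comparisons only. The function name, the argument layout, and the sample data are hypothetical, and a `dict` stands in for the per-field tree.

```python
def search(field_table, vectors, field_pos, keyword):
    """Return positions of log entries whose field equals keyword.

    field_table: keyword -> index mapping for the field at field_pos.
    vectors: one index vector per log entry.
    """
    idx = field_table.get(keyword)  # the only string lookup
    if idx is None:
        return []  # keyword never appeared in this field
    # From here on only integers are compared.
    return [n for n, vec in enumerate(vectors) if vec[field_pos] == idx]


method_table = {"GET": 0, "POST": 1}
vectors = [[0, 0, 0], [1, 0, 0], [0, 1, 1]]
get_hits = search(method_table, vectors, 1, "GET")
```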

Referring to FIG. 12, the B-tree-based indexing log processor (100) comprises: a log load module (10) that loads the incoming web log; a field division module (20) that divides the web log delivered by the log load module (10) into fields; a B-tree indexing module (30) that builds B-trees from the fields divided by the field division module (20), creates an index list, and generates an indexing log; a query processor (40) that, for an external query, searches the B-trees for entries matching the search condition, extracts a field index set, and issues a transformed query; an indexing log extraction unit (50) that extracts indexing log entries at the request of the query processor; and a log restoration module (60) that restores and outputs the log from the extracted indexing log list.

Referring to FIG. 13, when a log is loaded (S1), a log line is extracted (S2) and it is determined whether field information exists (S3). If there is no field information, field information is generated (S4). If field information exists at step S3, field division is performed (S5), followed by B-tree indexing (S6). It is then determined whether the keyword is new (S7). If it is new, the keyword is inserted and an index is created (S8), and the index is obtained (S9). When an index is created, the keyword is looked up in the B-tree field (S24) and the index found is passed on (S9). If the keyword is not new, the index is obtained directly (S9). After step S9, an indexing log is generated (S10), and the generated indexing log is stored in the indexing log field (S25).

Meanwhile, when a query is entered (S21), the query processor analyzes it, extracts the indexing log in conjunction with the B-tree and the indexing log field (S23), and restores the log (S26).
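
The load, query, and restore flow of FIGS. 12 and 13 can be tied together in one small sketch. Everything here is an assumption made for illustration: the class name, the method names, the equality-only query, and the use of `dict` objects in place of per-field B-trees. The comments map parts of the code loosely onto the reference numerals and step labels in the description.

```python
class IndexingLogProcessor:
    """Toy counterpart of the indexing log processor (100)."""

    def __init__(self, field_names):
        self.field_names = list(field_names)
        self.index_of = [dict() for _ in self.field_names]  # keyword -> index
        self.keywords = [list() for _ in self.field_names]  # index -> keyword
        self.indexing_log = []  # one index vector per log line

    def load(self, lines):  # roughly modules 10, 20 and 30
        for line in lines:
            fields = line.split()
            # Skip directive lines and lines with an unexpected field count.
            if line.startswith("#") or len(fields) != len(self.field_names):
                continue
            vec = []
            for pos, value in enumerate(fields):
                idx = self.index_of[pos].get(value)
                if idx is None:  # new keyword: insert, create index (S7, S8)
                    idx = len(self.keywords[pos])
                    self.index_of[pos][value] = idx
                    self.keywords[pos].append(value)
                vec.append(idx)
            self.indexing_log.append(vec)  # S10

    def query(self, field, value):  # roughly modules 40 and 50
        pos = self.field_names.index(field)
        idx = self.index_of[pos].get(value)
        if idx is None:
            return []
        return [vec for vec in self.indexing_log if vec[pos] == idx]

    def restore(self, vectors):  # roughly module 60
        return [
            " ".join(self.keywords[p][i] for p, i in enumerate(vec))
            for vec in vectors
        ]


proc = IndexingLogProcessor(["c-ip", "cs-method", "cs-uri-stem"])
proc.load([
    "#Fields: c-ip cs-method cs-uri-stem",
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /index.html",
    "10.0.0.1 POST /login.asp",
])
hits = proc.restore(proc.query("cs-method", "GET"))
```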

10: log load module
20: field division module
30: B-tree indexing module
40: query processor
50: indexing log extraction unit
60: log restoration module
100: B-tree-based indexing log processor

Claims

A method of restoring a web log having a B-tree index vector-based structure in a predetermined format for attack detection by a web IDS system, the method comprising:
a first step of fetching the corresponding log index information from a list;
a second step of fetching strings from the B-tree of each corresponding field using the indices and combining them; and
a third step of restoring the log text;
the method being a B-tree index vector-based web log restoration method for large-scale web log mining and attack detection.