KR101556714B1

KR101556714B1 - Method, system and computer readable recording medium for providing search results

Info

Publication number: KR101556714B1
Application number: KR1020110000240A
Authority: KR
Inventors: 이채현; 심동윤
Original assignee: 네이버 주식회사
Priority date: 2011-01-03
Filing date: 2011-01-03
Publication date: 2015-10-02
Also published as: KR20120090131A

Abstract

The present invention relates to a method and system for providing search results, the method comprising: a source-indication clustering step of grouping, by a predetermined criterion, the web document source indications corresponding to web documents stored in a database; a similar-document clustering step of grouping similar documents within the result of the source-indication clustering; a step of classifying variables using features generated from the source-indication clustering step and the similar-document clustering step, respectively; a step of generating a regular expression using the result of the variable classification step; and a step of collecting web documents using the generated regular expression.

Description

METHOD, SYSTEM AND COMPUTER READABLE RECORDING MEDIUM FOR PROVIDING SEARCH RESULTS

The present invention relates to a method, a system, and a computer-readable recording medium for building a search database by removing unnecessary variables from the source indication of a web document, such as a URL (Uniform Resource Locator). More specifically, it relates to a method, a system, and a computer-readable medium for building a search database by automatically extracting unnecessary variables from clustered URLs and using a regular expression pattern that removes them.

In recent years, as Internet use has become widespread, users have been able to obtain a wide range of information through Internet search. That is, through a terminal device such as a personal computer with Internet access, a user connects to an Internet search site by entering an identifier such as a URL in the address bar of a web browser, and then enters a search phrase to view search results in various fields such as news, knowledge, games, communities, and web documents.

To present properly what users are searching for, Internet search site providers design and build search engines that collect web documents, index the collected documents, and provide search results to users on that basis. Among the components of such an engine, the web crawler, which explores and collects web documents on the Internet in a systematic, automated manner, plays a major role.

One way such a web crawler operates is to start from a list of URLs, usually called seeds, recognize all hyperlinks contained in the seeds to update the URL list, and then recursively visit the updated URL list.
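The seed-based crawl described above can be sketched roughly as follows. This is a minimal illustration, not the crawler of the invention: it uses an iterative breadth-first queue instead of literal recursion, and the names `crawl`, `fetch_links`, `max_pages`, and the in-memory `TOY_WEB` link graph are all assumptions introduced here so the example runs without network access.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], Iterable[str]],
          max_pages: int = 100) -> List[str]:
    """Visit URLs breadth-first starting from the seeds; return the visit order."""
    frontier = deque(seeds)   # URL list still to be visited
    seen = set(frontier)      # every URL ever queued, so none is visited twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):   # hyperlinks found on the fetched page
            if link not in seen:        # update the URL list with new links only
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical link graph standing in for real HTTP fetches.
TOY_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}

order = crawl(["http://example.com/"], lambda u: TOY_WEB.get(u, []))
```

In a crawler of this shape, two URLs that differ only in a session variable count as two distinct entries in `seen`, which is exactly the duplication problem discussed next.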

Meanwhile, when building a web page, a web author may add variables with various values to the URL for the page's functions or its management, without changing the content of the page. If the web crawler collects all such web documents, whose URLs differ only slightly in their variables but whose content is the same, it may collect unnecessary or meaningless web documents in duplicate. This not only wastes the storage space that holds the collected content but also degrades the performance of the web crawler and can increase the load on the search engine.

To address this problem, an administrator has conventionally entered by hand a regular expression pattern that removes such unnecessary variables from URLs, and duplicate web documents among those already stored were deleted. In addition, when the web crawler extracted new URLs to visit from a document, it applied this regular expression pattern before storing each URL, thereby avoiding visits to unnecessary pages. However, because regular expression patterns have to be entered manually, it remains difficult in practice to cover the entire, vast web.

An object of the present invention is to solve the problems of the prior art described above.

One object of the present invention is to use the storage space of a search database efficiently and reduce the load on the search engine by automatically identifying and extracting unnecessary variables in web document source indications and deleting duplicates among the web documents already stored.

Another object of the present invention is to collect web documents more efficiently by automatically identifying unnecessary variables in web document source indications, generating a regular expression that removes them, and applying it to a web crawler.

The characteristic configuration of the present invention for achieving the objects described above and the distinctive effects described below is as follows.

According to one aspect of the present invention, there is provided a method for providing search results corresponding to a user's search query, in which a search result providing system performs:

a source-indication clustering step of grouping, by a predetermined criterion, web document source indications corresponding to web documents stored in a database;

a similar-document clustering step of grouping similar documents within the result of the source-indication clustering;

a step of classifying variables using features generated from the source-indication clustering step and the similar-document clustering step, respectively;

a step of generating a regular expression using the result of the variable classification step; and

a step of collecting web documents using the generated regular expression.

Preferably, the web document source indication includes a URL (Uniform Resource Locator), and the source-indication clustering step may place URLs of the web documents in the same cluster when at least one of the following holds: the URLs are identical up to the path, identical up to the file name, or identical up to the parameter keys.
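As a rough sketch of this kind of URL clustering, the snippet below builds a cluster key from the host, the path, optionally the file name, and optionally the sorted set of parameter keys. The function names `cluster_key` and `cluster_urls`, the `level` argument, and the example URLs are assumptions made for illustration; the text does not specify how the three criteria are combined.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def cluster_key(url: str, level: str = "keys") -> tuple:
    """Build a grouping key at one of three levels: 'path', 'file', or 'keys'."""
    parts = urlsplit(url)
    directory, _, filename = parts.path.rpartition("/")
    key = (parts.netloc, directory + "/")          # identical up to the path
    if level in ("file", "keys"):
        key += (filename,)                         # identical up to the file name
    if level == "keys":
        names = sorted({k for k, _ in parse_qsl(parts.query, keep_blank_values=True)})
        key += (tuple(names),)                     # identical up to the parameter keys
    return key


def cluster_urls(urls, level: str = "keys") -> dict:
    """Group URLs that share the same key; parameter values are ignored."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[cluster_key(url, level)].append(url)
    return dict(clusters)


# Hypothetical URLs: the first two differ only in parameter values.
urls = [
    "http://www.example.com/a/b/c.html?x=1&sid=111",
    "http://www.example.com/a/b/c.html?x=2&sid=222",
    "http://www.example.com/a/b/c.html?x=1",
    "http://www.example.com/a/b/d.html?x=1&sid=111",
]
by_keys = cluster_urls(urls, "keys")
```

With these inputs the strictest level (`"keys"`) yields three clusters, `"file"` yields two, and `"path"` yields one, which shows how the choice of criterion controls the size of the groups handed to the similar-document step.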

Preferably, the similar-document clustering step may be performed on the lists of web document source indications that the source-indication clustering step has grouped into the same cluster.

Preferably, the similar-document clustering step may determine similarity for every pair of web document source indications within the same cluster.

Preferably, the similar-document clustering step may include:

parsing the web documents;

hashing the parsed web documents;

computing a simhash by combining the hashed web document bit by bit; and

grouping web documents whose Hamming distance is at or below a predetermined value as similar documents.
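The four steps above can be sketched as follows. This is a minimal simhash illustration under stated assumptions, not the implementation of the invention: the tag-stripping tokenizer, the use of MD5 truncated to 64 bits as the token hash, the unweighted bit voting, and the default threshold of 3 are all choices made here for the example.

```python
import hashlib
import re
from itertools import combinations

BITS = 64  # assumed fingerprint width


def tokens(html: str):
    """Crude parsing step: drop tags, lowercase, split the text into words."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"\w+", text.lower())


def hash64(token: str) -> int:
    """Hash one token to 64 bits (MD5 serves only as a stable hash here)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def simhash(html: str) -> int:
    """Each token votes +1 or -1 per bit; the sign of each sum gives that bit."""
    votes = [0] * BITS
    for tok in tokens(html):
        h = hash64(tok)
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if votes[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similar_pairs(docs: dict, max_distance: int = 3):
    """Check every pair in one URL cluster; keep pairs within the threshold."""
    sigs = {url: simhash(body) for url, body in docs.items()}
    return [(u, v) for u, v in combinations(sorted(sigs), 2)
            if hamming(sigs[u], sigs[v]) <= max_distance]


# Hypothetical cluster: same text served under two URLs with different markup.
pairs = similar_pairs({"u1": "<p>same text here</p>", "u2": "<b>same text here</b>"})
```

Comparing all pairs is quadratic in the cluster size, which is one reason to run this step only inside the clusters produced by the URL clustering rather than over the whole database.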

Preferably, the features generated from the source-indication clustering step may include at least one of a feature related to parameter keys and a feature related to parameter values.

Preferably, the features generated from the similar-document clustering step may include at least one of: a feature related to the ratio of the number of unique values of the variable in question, when similar-document pairs are formed, to the total source-indication cluster count; a feature related to the ratio of the number of unique documents that take part in forming similar-document pairs to the total source-indication cluster count; a feature related to the number of similar-document pairs relative to the number of unique documents; and a feature related to the absolute value of the similar-document pair count.
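One possible reading of these four features is sketched below. The text does not give exact formulas, so the definitions here are assumptions: "total source-indication cluster count" is taken to be the number of URLs in the URL cluster, and the function name `pair_features` and the dictionary keys are hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl


def pair_features(cluster, pairs, key):
    """Compute four illustrative features for one URL parameter `key`.

    cluster: all URLs in one URL cluster
    pairs:   (url_a, url_b) tuples judged to be similar documents
    """
    total = len(cluster)
    in_pairs = {u for pair in pairs for u in pair}   # unique documents in any pair
    values = {dict(parse_qsl(urlsplit(u).query)).get(key) for u in in_pairs}
    values.discard(None)                             # URLs that lack the parameter
    n_pairs = len(pairs)
    return {
        "unique_value_ratio": len(values) / total if total else 0.0,
        "unique_doc_ratio": len(in_pairs) / total if total else 0.0,
        "pairs_per_unique_doc": n_pairs / len(in_pairs) if in_pairs else 0.0,
        "pair_count": n_pairs,
    }


# Hypothetical cluster: three pages are near-duplicates that differ only in sid.
cluster = [
    "http://www.example.com/c.html?x=1&sid=1",
    "http://www.example.com/c.html?x=1&sid=2",
    "http://www.example.com/c.html?x=1&sid=3",
    "http://www.example.com/c.html?x=2&sid=4",
]
pairs = [(cluster[0], cluster[1]), (cluster[0], cluster[2]), (cluster[1], cluster[2])]
sid_features = pair_features(cluster, pairs, "sid")
x_features = pair_features(cluster, pairs, "x")
```

Under this reading, a parameter such as `sid` that takes many different values among near-duplicate pages gets a high unique-value ratio (0.75 here), while a content-bearing parameter such as `x` gets a low one (0.25), which is the kind of signal a classifier could use.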

Preferably, the step of classifying variables may include classifying each variable contained in the web document source indications grouped into one similar-document cluster into a category and, when the category is either the category of variables for managing a user's access session or the category of variables for checking where the web document was linked from, classifying that variable as an unnecessary variable.

Preferably, the step of generating a regular expression may generate a regular expression pattern that removes the unnecessary variables from a target web document source indication.
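A regular expression of this kind might look like the sketch below, which builds one pattern from a set of parameter keys already classified as unnecessary and uses it to normalize URLs. The helper names `build_strip_pattern` and `normalize` and the keys `sid` and `ref` are hypothetical, and the sketch assumes URLs without a `#fragment` part.

```python
import re


def build_strip_pattern(unneeded_keys):
    """Compile a pattern matching 'key=value' (plus a trailing '&') for the given keys."""
    if not unneeded_keys:
        raise ValueError("at least one key is required")
    keys = "|".join(re.escape(k) for k in sorted(unneeded_keys))
    # The lookbehind requires '?' or '&' right before the key, so 'sid' never
    # matches inside a longer key; the value runs up to the next '&' or '#'.
    return re.compile(rf"(?<=[?&])(?:{keys})=[^&#]*&?")


def normalize(url: str, pattern) -> str:
    """Remove the unneeded parameters, then tidy a dangling '?' or '&'."""
    return pattern.sub("", url).rstrip("?&")


# Hypothetical session and referrer parameters.
pattern = build_strip_pattern({"sid", "ref"})
a = normalize("http://www.example.com/a/b/c.html?x=1&sid=abc&y=2", pattern)
b = normalize("http://www.example.com/a/b/c.html?x=1&sid=zzz&y=2", pattern)
```

Here `a` and `b` come out identical, so a crawler or a deduplication pass that keys documents on the normalized URL would treat the two pages as one.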

According to another aspect of the present invention, there is provided a system for providing search results corresponding to a user's search query, the system comprising a similar document removing unit and a search unit, wherein the similar document removing unit includes:

source-indication clustering means for grouping, by a predetermined criterion, web document source indications corresponding to web documents stored in a database;

similar-document clustering means for grouping similar documents within the result of the source-indication clustering;

variable classification means for classifying variables using features generated from the source-indication clustering means and the similar-document clustering means, respectively; and

regular expression applying means for generating a regular expression using the result from the variable classification means;

and wherein the search unit collects web documents using the generated regular expression.

Preferably, the web document source indication includes a URL (Uniform Resource Locator), and the source-indication clustering means may place URLs of the web documents in the same cluster when at least one of the following holds: the URLs are identical up to the path, identical up to the file name, or identical up to the parameter keys.

Preferably, the similar-document clustering means may perform similar-document detection on the lists of web document source indications that the source-indication clustering means has grouped into the same cluster.

Preferably, the similar-document clustering means may determine similarity for every pair of web document source indications within the same cluster.

Preferably, the similar-document clustering means may:

parse the web documents;

hash the parsed web documents;

compute a simhash by combining the hashed web document bit by bit; and

group web documents whose Hamming distance is at or below a predetermined value as similar documents.

Preferably, the features generated from the source-indication clustering means may include at least one of a feature related to parameter keys and a feature related to parameter values.

Preferably, the features generated from the similar-document clustering means may include at least one of: a feature related to the ratio of the number of unique values of the variable in question, when similar-document pairs are formed, to the total source-indication cluster count; a feature related to the ratio of the number of unique documents that take part in forming similar-document pairs to the total source-indication cluster count; a feature related to the number of similar-document pairs relative to the number of unique documents; and a feature related to the absolute value of the similar-document pair count.

Preferably, the variable classification means may classify each variable contained in the web document source indications grouped into one similar-document cluster into a category and, when the category is either the category of variables for managing a user's access session or the category of variables for checking where the web document was linked from, classify that variable as an unnecessary variable.

Preferably, the regular expression applying means may generate a regular expression pattern that removes the unnecessary variables from a target web document source indication.

As described above, according to the present invention, unnecessary variables in web document source indications are automatically identified and extracted, and duplicates among the web documents already stored are deleted, which raises the utilization efficiency of the search database's storage space and reduces the load on the search engine.

In addition, according to the present invention, a regular expression that automatically identifies unnecessary variables and removes them from web document source indications is generated and applied to a web crawler, so that web documents can be collected more efficiently from then on.

FIG. 1 is a diagram schematically showing the overall configuration of a search result providing system that uses a search database built by collecting web documents after automatically removing unnecessary variables from URLs, according to an embodiment of the present invention.
FIG. 2 is a detailed configuration diagram of a search result providing system according to an embodiment of the present invention.
FIG. 3 is a detailed configuration diagram of the similar document removing unit in the search result providing system according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining the concept of an NDD cluster according to an embodiment of the present invention.
FIG. 5 is a flowchart for explaining a method for providing search results according to an embodiment of the present invention.

The detailed description of the present invention that follows refers to the accompanying drawings, which show, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the invention, although different from one another, are not necessarily mutually exclusive. For example, a particular shape, structure, or characteristic described herein in connection with one embodiment may be implemented in other embodiments without departing from the spirit and scope of the invention. It should also be understood that the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the invention, if properly described, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals refer to the same or similar functions throughout the several aspects.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice the invention.

In the embodiments of the present invention, the term "connected" is used broadly to cover not only the case of being "directly connected" but also the case of being "electrically connected" with another element in between. In addition, when a part is said to "comprise" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

In the embodiments of the present invention, the term "web document" should be interpreted broadly to include every passive or active document format that can be viewed with a web browser program such as Internet Explorer. HTML (HyperText Markup Language) is the format mainly used for web documents, but the term is not limited to it: any document format that can be viewed with a web browser program, including XML (eXtensible Markup Language) and SGML (Standard Generalized Markup Language), is a web document. To view a web document with a web browser program, the address where the web document is located is generally entered as a URL; HTTP (HyperText Transfer Protocol) is widely used as the address scheme, but this is not a limitation.

In this specification, "web document source indication" covers all characters, directives, and the like written in a predetermined form so that the source of a web document can be identified, including the URLs described below.

The term "URL (Uniform Resource Locator)" refers to a notation for specifying the location of files on the servers that provide services on the web; it includes the type of service to be accessed, the location of the server (domain name), and the location of the file. The general syntax of a URL may take the form "protocol://hostname/path/filename?parameters". The path may include multiple path segments, and the parameters may include multiple parameters. For example, in the URL http://www.naver.com/a/b/c.html?x=1&y=2, the protocol is http, the host name is www.naver.com, the path is /a/b/, the file name is c.html, and there are two parameters, x and y, with values 1 and 2, respectively.
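The decomposition in this example can be reproduced with Python's standard `urllib.parse` module. The sketch below is only an illustration of the terminology; the helper name `split_url` and the dictionary keys are assumptions chosen to mirror the wording of the text.

```python
from urllib.parse import urlsplit, parse_qsl


def split_url(url: str) -> dict:
    """Split a URL into the parts named in the text."""
    parts = urlsplit(url)
    # Everything up to the last '/' is the path; the remainder is the file name.
    directory, _, filename = parts.path.rpartition("/")
    return {
        "protocol": parts.scheme,
        "host": parts.netloc,
        "path": directory + "/",
        "filename": filename,
        "parameters": dict(parse_qsl(parts.query)),
    }


info = split_url("http://www.naver.com/a/b/c.html?x=1&y=2")
# info["path"] is "/a/b/", info["filename"] is "c.html",
# and info["parameters"] maps "x" to "1" and "y" to "2".
```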

If a web crawler collecting web documents were to collect every document whose URL differs even slightly, it would redundantly store web documents whose URLs differ but whose page content is almost the same (hereinafter "similar documents"), for example web pages that differ only in a counter or blogs that differ only in a calendar, and most of the meaningless web documents so collected would then have to be deleted again from storage such as a database. The present invention therefore discloses a search system and method that exclude meaningless web documents from collection on the basis of a regular expression that automatically extracts and deletes unnecessary variables from URLs.

Overall system configuration

FIG. 1 is a diagram schematically showing the overall configuration of a search result providing system that uses a search database built by collecting web documents after unnecessary variables in URLs have been automatically identified and removed through a similar document detection method, according to an embodiment of the present invention.

As shown in FIG. 1, in the overall system according to an embodiment of the present invention, the search result providing system 100 may be connected through a communication network 200 to a plurality of user terminal devices 300 and a plurality of web document servers 400.

First, according to an embodiment of the present invention, the search result providing system 100 may receive a search phrase, that is, a query, from a user terminal device 300, perform a search on that basis by referring to a search database (not shown), and transmit the resulting search results to the user terminal device 300. In addition, to prevent similar documents from being collected in duplicate among the web documents gathered from one or more web document servers 400 using the web crawler (see 130 in FIG. 2), the search result providing system 100 may generate a regular expression that automatically deletes unnecessary variables from URLs, and may apply it to the already stored search database to delete similar documents or have the generated regular expression applied when the web crawler (not shown) operates in the future.

According to an embodiment of the present invention, the communication network 200 may be configured regardless of its communication mode, wired or wireless, and may be any of various networks such as a personal area network (PAN), a local area network (LAN), a metropolitan area network (MAN), or a wide area network (WAN).

The user terminal device 300 according to an embodiment of the present invention is an input/output device that includes a function for connecting to the search result providing system 100 through the communication network 200 so that a user can receive search results for a given query. Any digital device that has memory means and a microprocessor providing computing capability can be adopted as the user terminal device 300 according to the present invention: not only a desktop computer but also a notebook computer, a workstation, a palmtop computer, a personal digital assistant (PDA), a web pad, or a mobile communication terminal including a smartphone. Preferably, a web browser in the user terminal device 300 is run and used to connect to the search result providing system 100, enter a query, and receive search results, but this is not a limitation.

The web document server 400 according to an embodiment of the present invention refers collectively to web servers that hold web documents to be collected by the web crawler in the search result providing system 100; it is not limited to a physically specific server or to web documents of a specific content or format. Every web server that the web crawler can reach through the communication network 200 to collect web documents should be regarded as included in the web document server 400.

Search result providing system

FIG. 2 is a detailed configuration diagram of the search result providing system 100 according to an embodiment of the present invention.

Referring to FIG. 2, the search result providing system 100 according to an embodiment of the present invention may include a search unit 110, a similar document removing unit 120, a web crawler 130, and a search database 140.

The search unit 110 searches the search database 140 for information matching the query received from the user terminal device 300. The search results extracted by the search are transmitted to the user terminal device 300.

To keep similar documents, whose URLs differ in part but whose content is almost the same, from being retrieved in duplicate among the web documents collected from the web document servers 400 through the web crawler 130, the similar document removing unit 120 may generate a regular expression that automatically extracts and deletes unnecessary variables from URLs, and may apply the generated regular expression, together with predetermined conditions, to the web documents already stored in the search database 140 so that similar documents stored in duplicate are deleted. The similar document removing unit 120 may also configure the web crawler 130 so that, in future operation, it does not collect similar documents in duplicate, based on the regular expression. The detailed function of each component of the similar document removing unit 120 is described later.

The web crawler 130 explores and collects the web documents stored on the web document servers 400 in a known systematic, automated manner and stores them in the search database 140 or in a separate database. According to an embodiment of the present invention, by using the regular expression provided by the similar document removing unit 120, the web crawler 130 can, unlike the prior art, avoid collecting similar documents in duplicate when exploring and collecting web documents.

The search database 140 may contain various information collected or stored in order to provide search results corresponding to a query, and may also store the web documents collected by the web crawler 130. In addition, when similar documents whose content differs little from web documents already retrieved are stored redundantly in the search database 140, some or all of them may be deleted by the operation of the similar document removal unit 120.

Although FIG. 2 shows only the search database 140, according to an embodiment of the present invention a separate database may be built to store the web documents collected and detected by the web crawler 130, and only the documents remaining after similar documents are deleted may be indexed and stored in the search database 140. Also, although the search unit 110, the similar document removal unit 120, and the web crawler 130 are each shown as separate blocks in the drawing, they may be physically implemented in a single machine, some or each of them may be implemented in physically different machines, or a plurality of physical machines performing the same function may exist in parallel. It will thus be apparent to those of ordinary skill in the art that the present invention is not limited by the physical number or location of the machines or databases in which each component is installed, and that its design may be modified in various ways.

Similar Document Removal Unit

Referring to FIG. 3, the similar document removal unit 120 in the search result providing system 100 according to an embodiment of the present invention is described in more detail. The similar document removal unit 120 may include a URL clustering means 121, an NDD (near duplication detection) clustering means 122, a variable classification means 123, and a regular expression application means 124.

Here, the URL clustering means 121 according to an embodiment of the present invention may gather the URLs of the web documents collected by the web crawler 130, which are stored in the search database 140 or a separate database, into clusters, and analyze the URLs included in the same cluster to find related features.

Specifically, URL clustering may be performed by various methods. As non-limiting examples, URLs that are identical up to the file name may be regarded as the same cluster, URLs that are identical up to the parameter keys may be regarded as the same cluster, or URLs that share the same path may be regarded as the same cluster. Alternatively, URL clusters may be formed by ignoring path components, parameter keys, and/or parameter values whose values change frequently in the given domain.

A to E in Table 1 below are examples of URLs grouped into the same URL cluster.

[Table 1] URLs grouped into the same URL cluster
A | https://www.kokutou.com/products/review.php?PHPSESSID=14d3bd1d0e896
B | https://www.kokutou.com/products/review.php?PHPSESSID=14d9f0ddfk6896
C | https://www.kokutou.com/products/review.php?PHPSESSID=54d0h9c5n78hse
D | https://www.kokutou.com/products/review.php?PHPSESSID=d9h8v7e6d0f9h8
E | https://www.kokutou.com/products/review.php?PHPSESSID=z0d98n7e6d7f98

In other words, according to an embodiment of the present invention, any known URL clustering method may be used; it is only necessary that the information common to the URLs grouped into the same URL cluster be kept as features, so that the variable classification means 123 can use it in a later step.
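As a minimal sketch of one of the clustering criteria mentioned above (same host, same path through the file name, and same set of parameter keys), the grouping could look like the following. The function names `cluster_key` and `cluster_urls` are hypothetical and are not taken from the specification; any other known clustering method would serve equally well.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def cluster_key(url):
    """Hypothetical cluster key: host, path (through the file name), parameter keys."""
    parts = urlsplit(url)
    keys = sorted({k for k, _ in parse_qsl(parts.query, keep_blank_values=True)})
    return (parts.netloc, parts.path, tuple(keys))


def cluster_urls(urls):
    """Group URLs that share the same cluster key."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[cluster_key(url)].append(url)
    return dict(clusters)


# URLs A to E from Table 1.
TABLE_1 = [
    "https://www.kokutou.com/products/review.php?PHPSESSID=14d3bd1d0e896",
    "https://www.kokutou.com/products/review.php?PHPSESSID=14d9f0ddfk6896",
    "https://www.kokutou.com/products/review.php?PHPSESSID=54d0h9c5n78hse",
    "https://www.kokutou.com/products/review.php?PHPSESSID=d9h8v7e6d0f9h8",
    "https://www.kokutou.com/products/review.php?PHPSESSID=z0d98n7e6d7f98",
]

clusters = cluster_urls(TABLE_1)
```

Under this key, all five URLs of Table 1 fall into a single cluster, because only the value of PHPSESSID differs between them.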

Here, the features that can be extracted as a result of URL clustering may include, for example, at least one of features related to parameter keys and features related to parameter values. However, the present invention is not limited thereto, and a wider variety of features may be extracted depending on the URL clustering method.

First, as illustrated in Table 2 below, the features related to parameter keys may include, but are not limited to, a feature for the key of each variable (key), a feature indicating whether a parameter key contains a specific value (keyinclude), a feature for the cgi value when the file name ends with an extension such as cgi or php (cgi), and a feature for the path component immediately preceding the file name that contains cgi or php (lastpath).

[Table 2] Parameter-key-related features and examples
key
▶ key_sid, key_no, key_page
keyinclude
▶ sid, sessid, session, phpsessid, sess, phpid, clientid, accessid, who
▶ redirect, return, url, f_file, fname, from, refer, token, src, login
cgi
▶ cgi_board.cgi / cgi_login.php / cgi_bbs.php, etc.
lastpath
▶ lastpath_zboard / lastpath_bbs / lastpath_calendar, etc.

Next, as illustrated in Table 3 below, the features related to parameter values may include, but are not limited to, at least one of a feature for the type of the parameter values (vtype), a feature for the average length of the parameter values (vlength), and a feature for the deviation of the lengths of the parameter values (vlengthdev).

[Table 3] Parameter-value-related features and examples
vtype
▶ vtype_blank: the variable has no value
▶ vtype_integer: integer
▶ vtype_ip_addr: IP address
▶ vtype_mixed_number_alphabet: a string in which digits and letters are randomly mixed, where the number of digit/letter switches is at least a predetermined count; for example, abc1015 does not qualify, while a1b0c15 does
▶ vtype_too_long_value: a string whose length is at least a predetermined value
vlength
▶ vlength_20: the average length of the parameter values is less than 20
▶ vlength_20_30: at least 20 and less than 30
vlengthdev
▶ vlengthdev_fixed: the deviation of the lengths of the parameter values is less than a predetermined value
▶ vlengthdev_diverse: the deviation is at least the predetermined value

Accordingly, for the URL cluster of A to E illustrated in Table 1, the parameter-key-related features keyinclude, cgi, and lastpath can be generated, with the values PHPSESSID, review.php, and products/, respectively. The parameter-value-related features vtype, vlength, and vlengthdev can also be generated, with the values vtype_mixed_number_alphabet, vlength_20, and vlengthdev_fixed, respectively. How variables are classified using the features extracted from URL clusters and their values is described later.
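The feature values just listed for the Table 1 cluster can be reproduced with a short sketch. Everything numeric here is an assumption: the specification only speaks of "predetermined" values, so `MIN_TRANSITIONS`, `TOO_LONG`, and `MAX_FIXED_DEV` are hypothetical thresholds, and `vtype_other` and `vlength_other` are hypothetical fallback labels that do not appear in Table 3. The sketch also reports lastpath without the trailing slash.

```python
import ipaddress
import statistics
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import parse_qsl, urlsplit

# Tokens from the keyinclude row of Table 2.
KEY_TOKENS = ["sid", "sessid", "session", "phpsessid", "sess", "phpid",
              "clientid", "accessid", "who", "redirect", "return", "url",
              "f_file", "fname", "from", "refer", "token", "src", "login"]
MIN_TRANSITIONS = 3   # hypothetical
TOO_LONG = 64         # hypothetical
MAX_FIXED_DEV = 2.0   # hypothetical


def transitions(value):
    """Count digit/letter switches: 'abc1015' has 1, 'a1b0c15' has 5."""
    kinds = [c.isdigit() for c in value if c.isalnum()]
    return sum(a != b for a, b in zip(kinds, kinds[1:]))


def vtype(value):
    if value == "":
        return "vtype_blank"
    if value.isdigit():
        return "vtype_integer"
    try:
        ipaddress.ip_address(value)
        return "vtype_ip_addr"
    except ValueError:
        pass
    if len(value) >= TOO_LONG:
        return "vtype_too_long_value"
    if transitions(value) >= MIN_TRANSITIONS:
        return "vtype_mixed_number_alphabet"
    return "vtype_other"  # hypothetical fallback, not in Table 3


def cluster_features(urls, key):
    parts = [urlsplit(u) for u in urls]
    values = [dict(parse_qsl(p.query, keep_blank_values=True)).get(key, "")
              for p in parts]
    path = PurePosixPath(parts[0].path)
    lengths = [len(v) for v in values]
    mean = statistics.mean(lengths)
    matches = [t for t in KEY_TOKENS if t in key.lower()]
    return {
        "keyinclude": max(matches, key=len) if matches else None,
        "cgi": path.name if path.suffix in (".cgi", ".php") else None,
        "lastpath": path.parent.name,
        "vtype": Counter(vtype(v) for v in values).most_common(1)[0][0],
        "vlength": ("vlength_20" if mean < 20
                    else "vlength_20_30" if mean < 30 else "vlength_other"),
        "vlengthdev": ("vlengthdev_fixed"
                       if statistics.pstdev(lengths) < MAX_FIXED_DEV
                       else "vlengthdev_diverse"),
    }


BASE = "https://www.kokutou.com/products/review.php?PHPSESSID="
URLS = [BASE + v for v in ("14d3bd1d0e896", "14d9f0ddfk6896", "54d0h9c5n78hse",
                           "d9h8v7e6d0f9h8", "z0d98n7e6d7f98")]
features = cluster_features(URLS, "PHPSESSID")
```

The session values in Table 1 are 13 or 14 characters long, so the mean length is below 20 and the lengths barely vary, which is why the cluster lands in vlength_20 and vlengthdev_fixed.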

Meanwhile, the NDD clustering means 122 according to an embodiment of the present invention may group the URLs of the web documents that have passed through the URL clustering means 121 into NDD clusters according to a predetermined criterion, in order to determine whether the web documents collected by the web crawler 130 are likely to include similar documents whose URLs differ in part but whose contents are unchanged and nearly identical. Through this process, the work of redundantly collecting similar documents and the volume of web documents can be greatly reduced, so that processing cost falls and processing speed improves.

That is, the NDD clustering means 122 takes the list of URLs grouped into the same cluster, as in Table 1 above, performs a similarity determination on every pair of URLs in that cluster, and groups all similar document pairs into the same NDD cluster. For example, for URLs A to E grouped into the same URL cluster as in Table 1, the similarity determination may be performed on all pairs of URLs in the cluster: (A, B), (A, C), (A, D), (A, E), (B, C), (B, D), (B, E), (C, D), (C, E), and (D, E).
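The ten pairs listed above are simply all 2-element combinations of the five URLs; a cluster of n URLs yields n(n-1)/2 pairs. A small sketch of the enumeration, using the labels of Table 1:

```python
from itertools import combinations

labels = ["A", "B", "C", "D", "E"]

# Every unordered pair of URLs within one URL cluster.
pairs = list(combinations(labels, 2))

n = len(labels)
expected_count = n * (n - 1) // 2
```

Because the pair count grows quadratically with cluster size, restricting the comparison to URLs inside one URL cluster keeps the number of similarity checks manageable.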

However, NDD clustering by the NDD clustering means 122 may be performed at once on all URLs that have passed through the URL clustering means 121, or it may be performed on a subset of those URLs to normalize them, after which the result is applied to the remaining URLs on the basis of the normalized URLs. Such changes in the manner or order of operation will be apparent to those of ordinary skill in the art.

Specifically, as a non-limiting method of NDD clustering, large numbers of documents can be processed using hash-based similarity detection (hereinafter "simhash"). Alternatively, the similarity between web documents may be measured by obtaining the shingles of each document and computing the Jaccard coefficient from them. It will be apparent to those skilled in the art that similar documents can be detected quickly among large numbers of web documents using various methods and formulas. For ease of explanation, similarity determination based on the simhash method is described below.
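The shingle-based alternative mentioned above can be sketched as follows. For two shingle sets $S_1$ and $S_2$, the Jaccard coefficient is $|S_1 \cap S_2| / |S_1 \cup S_2|$. The shingle size `k=3` and the whitespace tokenization are assumptions made for illustration; the specification does not fix either.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k=3 is an assumed size)."""
    words = text.split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


same = jaccard(shingles("a b c d"), shingles("a b c d"))
close = jaccard(shingles("a b c d"), shingles("a b c e"))
```

In the second comparison each text has two shingles and they share one, so the union has three elements and the coefficient is 1/3.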

In the simhash method, each web document is parsed and split into words; the resulting text nodes, together with anchor addresses, image source addresses, and embed tag addresses, are hashed, and the simhash is then computed bit by bit. Using the simhashes so computed, web documents whose Hamming distance is at most a predetermined value, for example 3, are regarded as similar documents and are gathered into an NDD cluster. Depending on the embodiment, a Hamming distance threshold other than 3 may also be used. The present invention is therefore not limited to a specific Hamming distance value.
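A minimal sketch of a simhash computation and the Hamming-distance test is given below. It assumes unweighted tokens, a 64-bit fingerprint, and MD5 as the per-token hash; none of these choices comes from the specification, which only states that tokens are hashed and combined bit by bit. The threshold of 3 is the example value from the text.

```python
import hashlib


def simhash(tokens, bits=64):
    """Bitwise majority vote over per-token hashes (assumed: MD5, 64 bits)."""
    weights = [0] * bits
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(tokens_a, tokens_b, max_distance=3):
    return hamming(simhash(tokens_a), simhash(tokens_b)) <= max_distance


doc = ["product", "review", "page", "/img/logo.png", "/products/list.php"]
```

In this sketch the token list would hold the words of the text nodes plus the anchor, image source, and embed addresses of a page. Documents that share most tokens tend to agree in most fingerprint bits, so a small Hamming distance is used as the similarity test.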

As a result of NDD clustering by the simhash method described above, for the URLs in Table 1, for example, the pairs (A, B), (A, C), and (B, C) may be grouped into "NDD cluster 1" and the pair (D, E) into "NDD cluster 2" (see FIG. 4). Features can then be extracted from each of the NDD clusters so formed.
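One way to turn a list of similar pairs into NDD clusters is to treat the pairs as edges and take connected components. The union-find sketch below is an assumption about how the grouping might be done (the specification does not name an algorithm); the input pairs are the ones from the example above.

```python
def group_pairs(pairs):
    """Group documents connected by similar pairs (union-find, assumed approach)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for doc in list(parent):
        groups.setdefault(find(doc), []).append(doc)
    return sorted(sorted(members) for members in groups.values())


similar_pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E")]
ndd_clusters = group_pairs(similar_pairs)
```

Note that connected components are transitive: two documents can end up in the same cluster through a chain of similar pairs even if they are not directly similar to each other. Whether that is desirable depends on the embodiment.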

Specifically, as illustrated in Table 4 below, at least one of the following may be extracted: a feature for the ratio of the number of unique values of the variable found when forming NDD pairs to the total URL cluster count (nddvalue); a feature for the ratio of the number of unique documents participating in NDD pairs to the total URL cluster count (ndddoc); a feature for the number of NDD pairs relative to the number of unique documents (nddcountmultiple); and a feature for the absolute NDD pair count (nddcountabs). The features are not limited to these, and other features related to the similarity of web documents may also be extracted. It will likewise be apparent to those skilled in the art that the ratios determining the value of each feature illustrated in Table 4 are not limiting and may be varied.

[Table 4] NDD cluster features and examples
nddvalue
▶ nddvalue_one: there is exactly one unique value
▶ nddvalue_small / nddvalue_medium / nddvalue_big: there are two or more unique values, classified as small, medium, or big according to the ratio
ndddoc
▶ ndddoc_one / ndddoc_small / ndddoc_medium / ndddoc_big
nddcountmultiple
▶ nddcountmultiple_1: the ratio (= number of unique documents / number of NDD pairs) is at least 1 and less than a predetermined value
▶ nddcountmultiple_10: the ratio is at least 10 and less than a predetermined value
nddcountabs
▶ nddcountabs_10: the absolute value is at least 10 and less than a predetermined value
▶ nddcountabs_100: the absolute value is at least 100 and less than a predetermined value

Accordingly, in the above example, the values of the features nddvalue, ndddoc, and nddcountmultiple extractable from "NDD cluster 1", formed of the pairs (A, B), (A, C), and (B, C), are nddvalue_big, ndddoc_big, and nddcountmultiple_1, respectively, and the values of the same features extractable from "NDD cluster 2", formed of the pair (D, E), are likewise nddvalue_big, ndddoc_big, and nddcountmultiple_1. The feature nddcountabs is not discussed here because the population of Table 1 is small. How variables are classified using the features extracted from the NDD clusters is described later.
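The feature values stated for the two NDD clusters can be reproduced with a small sketch. The bucket boundaries `small=0.1` and `medium=0.3` are hypothetical: the specification leaves the ratios open, and these were chosen only so that the worked example yields the values given in the text. The ndddoc_one label is not modelled because its exact definition is not given, and nddcountmultiple uses the ratio as defined in Table 4 (unique documents divided by NDD pairs).

```python
def ratio_bucket(prefix, ratio, small=0.1, medium=0.3):
    """Map a ratio to small/medium/big (boundaries are hypothetical)."""
    if ratio < small:
        return f"{prefix}_small"
    if ratio < medium:
        return f"{prefix}_medium"
    return f"{prefix}_big"


def ndd_features(pairs, values, cluster_size):
    docs = {d for pair in pairs for d in pair}
    unique_values = {values[d] for d in docs}
    multiple = len(docs) / len(pairs)
    return {
        "nddvalue": ("nddvalue_one" if len(unique_values) == 1
                     else ratio_bucket("nddvalue", len(unique_values) / cluster_size)),
        "ndddoc": ratio_bucket("ndddoc", len(docs) / cluster_size),
        "nddcountmultiple": ("nddcountmultiple_10" if multiple >= 10
                             else "nddcountmultiple_1" if multiple >= 1
                             else None),
    }


# PHPSESSID values of URLs A to E from Table 1.
VALUES = {"A": "14d3bd1d0e896", "B": "14d9f0ddfk6896", "C": "54d0h9c5n78hse",
          "D": "d9h8v7e6d0f9h8", "E": "z0d98n7e6d7f98"}

cluster_1 = ndd_features([("A", "B"), ("A", "C"), ("B", "C")], VALUES, cluster_size=5)
cluster_2 = ndd_features([("D", "E")], VALUES, cluster_size=5)
```

For NDD cluster 1, three unique documents and three unique values out of five URLs give a ratio of 0.6, and three documents over three pairs give a multiple of 1. For NDD cluster 2, the ratios are 0.4 and the multiple is 2.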

Next, the variable classification means 123 according to an embodiment of the present invention may combine the features extracted from each URL cluster and NDD cluster to determine whether a variable included in a URL is unnecessary and, on that basis, determine whether the URLs include the URL of a similar document whose content is identical to an already-collected web document and whose URL differs only in part. The variable classification means 123 according to an embodiment of the present invention may further indicate which category a variable included in a URL belongs to, based on the similarity between web documents and on the features.

Based on various machine learning algorithms trained on the features of an answer set, the variable classification means 123 may combine the features extracted from the URL cluster and the NDD cluster to classify a variable according to its characteristics, and may determine from the assigned category whether the variable is unnecessary in the URL. The variable classification means 123 may be, for example, a classifier using a machine learning algorithm such as a naive Bayesian classifier or Fisher's method. It should be understood, however, that the present invention is not limited to a specific classification technique or classifier.
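To make the idea of a classifier trained on an answer set concrete, here is a deliberately simplified naive Bayes sketch over feature labels. It is not the classifier of the specification: it scores only the features that are present, uses Laplace smoothing with an assumed `alpha=1.0`, and is trained on a toy answer set that is entirely hypothetical and exists only to make the example runnable.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Simplified naive Bayes over sets of feature labels (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, samples):
        for features, label in samples:
            self.class_counts[label] += 1
            for f in features:
                self.feature_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, features):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)  # class prior
            for f in features:
                if f not in self.vocab:
                    continue  # ignore features never seen in training
                p = (self.feature_counts[label][f] + self.alpha) / (n + 2 * self.alpha)
                score += math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy answer set: (feature labels, category).
ANSWER_SET = [
    ({"keyinclude_phpsessid", "vtype_mixed_number_alphabet"}, "sid"),
    ({"keyinclude_sid", "vtype_mixed_number_alphabet"}, "sid"),
    ({"keyinclude_refer", "vtype_too_long_value"}, "referrer"),
    ({"keyinclude_url", "vtype_too_long_value"}, "referrer"),
    ({"key_page", "vtype_integer"}, "bad"),
    ({"key_sort", "vtype_integer"}, "bad"),
]

model = NaiveBayes().fit(ANSWER_SET)
prediction = model.predict({"keyinclude_phpsessid", "vtype_mixed_number_alphabet"})
```

A production classifier would be trained on a much larger labeled set and would likely also model absent features; the sketch only shows how feature labels from the URL and NDD clusters could feed a probabilistic category decision.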

The operation of the variable classification means 123 according to an embodiment of the present invention is as follows. For every variable included in the URLs grouped into one NDD cluster, the variable classification means 123 visits the web document at a sample URL and the web document at the URL from which the specific variable has been removed, and determines whether the two web documents are similar. According to the similarity of the two web documents, the combination of features extracted from the URL cluster and the NDD cluster, and the nature of the variable, each variable included in the URLs grouped into one NDD cluster is classified into a category such as sid, referrer, good, or bad, as shown in Table 5 below. Here, a URL from which a specific variable has been removed may mean a URL from which the specific parameter value has been deleted, or one from which both the parameter key and the parameter value have been deleted. The similarity of the two web documents may be determined using the similarity criterion described above, either as is or in modified form.
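The two ways of removing a variable described above (deleting only the parameter value, or deleting both key and value) can be sketched with the standard `urllib.parse` module. The function name `drop_param` and the extra `page=2` parameter in the sample URL are hypothetical additions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_param(url, key, keep_key=False):
    """Remove a query variable: only its value (keep_key=True) or key and value."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if keep_key:
        pairs = [(k, "" if k == key else v) for k, v in pairs]
    else:
        pairs = [(k, v) for k, v in pairs if k != key]
    return urlunsplit(parts._replace(query=urlencode(pairs)))


sample = "https://www.kokutou.com/products/review.php?PHPSESSID=14d3bd1d0e896&page=2"
without_variable = drop_param(sample, "PHPSESSID")
without_value = drop_param(sample, "PHPSESSID", keep_key=True)
```

The document fetched from `sample` and the one fetched from `without_variable` would then be compared with the similarity test of the embodiment (for example, the simhash criterion) to decide whether the variable affects the content.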

Specifically, the categories may be defined as follows. sid is the case where the two web documents are similar and the variable is a session ID, that is, a variable for managing the user's connection session. referrer is the case where the two web documents are similar and the variable is a referrer, that is, a variable for checking where the web document was linked from. good is the case where the content of the two web documents differs depending on whether the variable is present, that is, they are not similar documents. bad is the case where the content of the web document does not change depending on whether the variable is present, but the variable is of a kind other than sid or referrer. A bad variable may include, for example, a calendar-type variable or an unnecessary keyword, email, title, sort, or page variable. Similar documents might be removed more effectively by extracting variables that share some or all of these characteristics and adding them as a separate similar-document item, or by treating the bad category as separate from good. Depending on the embodiment, however, variables in the bad category may also be classified into the good category, that is, as non-similar documents, to prevent non-similar documents from being inadvertently excluded from web crawling. That is, the variable classification means 123 according to an embodiment of the present invention may classify a variable included in the URLs grouped into one NDD cluster using URLs labeled sid + referrer + (good + bad) as the answer set.

[Table 5] Variable categories
sid | the two documents are similar; the variable is a session ID
referrer | the two documents are similar; the variable is a referrer
good | the two documents differ
bad | the two documents are similar; the variable is neither a session ID nor a referrer

When the similarity between the web document at the sample URL and the web document at the URL from which the specific variable has been removed is at or below a specific value, the variable included in the URLs grouped into one NDD cluster may be assigned to the good category without considering any combination of features. When the similarity is at or above the specific value, the nature of the variable may be determined by combining various features. As one example, if the keyinclude feature among the parameter-key-related features contains phpsessid, the variable is very likely to be a session ID, so a weight may be assigned that makes the sid category more likely. In this way, the variable included in the URLs grouped into one NDD cluster may be assigned to the sid, referrer, or bad category according to the value obtained from a combination of differently weighted features.
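The two-stage decision described above can be sketched as a similarity gate followed by a weighted score per category. The weights, the feature labels used as dictionary keys, and the `threshold=0.9` are all hypothetical; in the embodiment the weights would come from a trained classifier rather than a hand-written table.

```python
# Hypothetical per-category feature weights.
WEIGHTS = {
    "sid": {"keyinclude_phpsessid": 3.0,
            "vtype_mixed_number_alphabet": 1.5,
            "vlengthdev_fixed": 1.0},
    "referrer": {"keyinclude_refer": 3.0,
                 "keyinclude_url": 2.0,
                 "vtype_too_long_value": 1.0},
}


def classify(similarity, features, threshold=0.9):
    """Return 'good', 'sid', 'referrer', or 'bad' for one URL variable."""
    # Stage 1: removing the variable changes the document, so it is meaningful.
    if similarity < threshold:
        return "good"
    # Stage 2: documents are similar; score each category by weighted features.
    scores = {cat: sum(w.get(f, 0.0) for f in features)
              for cat, w in WEIGHTS.items()}
    best = max(scores, key=scores.get)
    # Similar documents with no sid/referrer evidence fall into 'bad'.
    return best if scores[best] > 0 else "bad"


result = classify(0.98, {"keyinclude_phpsessid", "vtype_mixed_number_alphabet"})
```

The fallback to bad mirrors Table 5: the documents are similar, but nothing indicates that the variable is a session ID or a referrer.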

To determine the most accurate weights for the various features and the most accurate combination of features, the variable classification means 123 may, as described above, use a known machine learning algorithm such as Fisher's method; it may classify some or all of the variables included in the URLs of some NDD clusters and perform initial training on that basis. It will also be apparent to those skilled in the art that various methods of improving accuracy may be applied. For example, the variable classification means 123 may be retrained on the results of classifying the variables of some URLs collected periodically or irregularly using the regular expression patterns already in use, or patterns may be generated at a specific interval, their accuracy examined, and patterns of low accuracy deleted.

Accuracy Experiment for Variable Classification According to the Present Invention

To confirm the accuracy of the variable classification means 123 described above, a limited number of samples were verified manually.

The verification covered 258 URLs among the URLs detected as similar documents. The number and ratio of URLs in each category of Table 5, as determined by manual verification, are shown in Table 6 below.

[Table 6] Manual verification results
Category | Count | Ratio
sid | 72 | 27.9%
referrer | 70 | 27.1%
good | 36 | 14.0%
bad | 41 | 15.9%
Other (the exact meaning of the variable is hard to determine) | 39 | 15.1%
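The ratios in Table 6 follow directly from the counts: the five categories sum to 258, and each ratio is the count divided by 258, rounded to one decimal place. A quick check:

```python
# Counts from Table 6.
counts = {"sid": 72, "referrer": 70, "good": 36, "bad": 41, "other": 39}

total = sum(counts.values())
ratios = {name: round(100 * n / total, 1) for name, n in counts.items()}
```

This confirms that the table is internally consistent with the 258 URLs stated in the text.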

The URLs in each category among the 258 detected URLs were randomly split, with 50% used as training data and the remaining 50% as test data. Repeating this experiment 50 times gave an accuracy of 99.8% for the sid category and 99.6% for the referrer category, confirming very high accuracy.

Finally, the regular expression application means 124 according to an embodiment of the present invention may generate a regular expression and/or specific conditions for deleting, from among the web documents stored in the existing search database 140 or a separate database, those similar documents that there is no point in storing redundantly. It may also apply them to the web crawler 130 so that similar documents are not collected in future web document collection.

First, after URL clustering and NDD clustering have been performed, the regular expression application means 124 may generate a regular expression pattern based on the variables that the variable classification means 123 has classified into, for example, the sid and referrer categories. Here, a regular expression pattern means a generalized formula that can be applied to many or all of the URLs in the same URL cluster. Generating the pattern by removing sid or referrer variables is merely an example; collection of similar documents can also be prevented by generating a regular expression pattern that removes other variables, such as an IP address, whose presence or absence does not change the documents. Accordingly, those of ordinary skill in the art can readily generate regular expressions in various manners or forms other than those described in this specification.
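As an illustration of what such a pattern might look like, the sketch below builds a regular expression that strips one classified variable from a query string. The pattern shape and the helper names `strip_param_pattern` and `normalize` are assumptions; the specification does not give a concrete regular expression. URL fragments and percent-encoded keys are not handled.

```python
import re


def strip_param_pattern(key):
    """Regex matching 'key=value' (and a following '&') inside a query string."""
    return re.compile(r"(?<=[?&])" + re.escape(key) + r"=[^&#]*&?")


def normalize(url, patterns):
    """Apply every pattern, then drop a dangling '?' or '&'."""
    for pattern in patterns:
        url = pattern.sub("", url)
    return url.rstrip("?&")


patterns = [strip_param_pattern("PHPSESSID")]
url_a = "https://www.kokutou.com/products/review.php?PHPSESSID=14d3bd1d0e896"
url_b = "https://www.kokutou.com/products/review.php?PHPSESSID=14d9f0ddfk6896"
canonical_a = normalize(url_a, patterns)
canonical_b = normalize(url_b, patterns)
```

URLs A and B of Table 1 normalize to the same string, which is what lets a single pattern cover every URL in the cluster.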

The regular expression application means 124 according to an embodiment of the present invention may further use the regular expression pattern generated above to delete similar documents from among the web documents stored in the existing search database 140 or a separate database, or apply the pattern to the web crawler 130 so that similar documents are not collected in future web document collection.
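A crawler-side use of the generated pattern might look like the following sketch: each candidate URL is reduced to a canonical form, and a URL is fetched only if its canonical form has not been seen. The class name `CrawlFilter`, the hard-coded PHPSESSID pattern, and the `/products/list.php` URL are hypothetical; the same canonical form could equally be used to find redundant entries already stored in a database.

```python
import re

# Assumed pattern for a variable classified as sid (see Table 1).
SID_PATTERN = re.compile(r"(?<=[?&])PHPSESSID=[^&#]*&?")


def canonical(url):
    """Strip the session variable and any dangling '?' or '&'."""
    return SID_PATTERN.sub("", url).rstrip("?&")


class CrawlFilter:
    """Skips URLs whose canonical form was already scheduled (illustrative)."""

    def __init__(self):
        self.seen = set()

    def should_fetch(self, url):
        key = canonical(url)
        if key in self.seen:
            return False
        self.seen.add(key)
        return True


crawl_filter = CrawlFilter()
decisions = [
    crawl_filter.should_fetch("https://www.kokutou.com/products/review.php?PHPSESSID=14d3bd1d0e896"),
    crawl_filter.should_fetch("https://www.kokutou.com/products/review.php?PHPSESSID=14d9f0ddfk6896"),
    crawl_filter.should_fetch("https://www.kokutou.com/products/list.php?PHPSESSID=14d3bd1d0e896"),
]
```

The second URL is skipped because it differs from the first only in its session ID, while the third is fetched because its path differs.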

Next, a method for providing search results according to an embodiment of the present invention is described with reference to FIG. 5.

According to an embodiment of the present invention, a user may transmit a query to the search result providing system 100 using his or her user terminal device 300, and the search result providing system 100 may perform a search on the search database 140 based on the received query and transmit the resulting search results to the user terminal device 300. In addition to this ordinary operation, the search result providing system 100 may generate a regular expression that automatically deletes unnecessary variables from URLs in order to prevent similar documents from being collected redundantly among the web documents gathered by the web crawler 130, and may delete similar documents stored in the search database 140 or elsewhere, or apply the generated regular expression when the web crawler 130 operates. FIG. 5 is a flowchart showing each step of this process.

Referring to FIG. 5, the search result providing system 100 (or the URL clustering means 121 of the similar document removal unit 120 within it) may perform URL clustering on the URLs of the web documents collected by the web crawler 130, for example by grouping URLs into the same cluster when their file names are identical or their paths are identical, and may analyze the URLs included in the same cluster to extract related features (step S510).
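By way of illustration only, the URL clustering of step S510 might be sketched as follows. The grouping key used here (host, path, and the sorted set of parameter keys) is an assumption chosen for the example, as are the function names `cluster_key` and `cluster_urls` and the `example.com` URLs; the embodiment only requires that URLs sharing, for example, the same path or file name fall into the same cluster.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def cluster_key(url: str) -> tuple:
    """Build a grouping key from host, path, and sorted parameter keys (assumed criterion)."""
    parts = urlsplit(url)
    keys = tuple(sorted(k for k, _ in parse_qsl(parts.query, keep_blank_values=True)))
    return (parts.netloc, parts.path, keys)


def cluster_urls(urls):
    """Group URLs that share the same key into one URL cluster."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[cluster_key(url)].append(url)
    return dict(clusters)


# Hypothetical URLs: the first two differ only in parameter values.
urls = [
    "http://example.com/view.php?id=1&sid=abc",
    "http://example.com/view.php?id=2&sid=def",
    "http://example.com/list.php?page=1",
]
clusters = cluster_urls(urls)
```

In this sketch the two `view.php` URLs land in one cluster, so the values observed for each parameter key within that cluster can then be counted and compared as features.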

Thereafter, the NDD clustering means 122 of the similar document removal unit 120 may perform a similarity determination on every pair of URLs in the same cluster, group all similar document pairs into the same NDD cluster, for example by means of simhash, and extract features from each of the resulting NDD clusters (step S520).
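The pairwise similarity determination of step S520 could be sketched roughly as below. This is a minimal simhash under stated assumptions: whitespace tokenization, MD5 as the per-token hash, a 64-bit fingerprint, and a Hamming distance threshold of 3 are all choices made for the example and are not specified by the embodiment. The names `simhash`, `hamming`, and `similar_pairs` are likewise hypothetical.

```python
import hashlib
from itertools import combinations


def simhash(text: str, bits: int = 64) -> int:
    """Compute a simhash fingerprint by bitwise voting over token hashes."""
    weights = [0] * bits
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # use the first 64 bits of the hash
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit of the fingerprint is set when the vote for that bit is positive.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similar_pairs(docs: dict, max_distance: int = 3):
    """Return every pair of URLs whose document fingerprints are within the threshold."""
    fingerprints = {url: simhash(text) for url, text in docs.items()}
    return [
        (u, v)
        for u, v in combinations(fingerprints, 2)
        if hamming(fingerprints[u], fingerprints[v]) <= max_distance
    ]


# Hypothetical parsed document bodies keyed by URL.
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog",
}
pairs = similar_pairs(docs)
```

Comparing every pair is quadratic in the cluster size, which is one reason the comparison is confined to URLs already grouped into the same URL cluster.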

Thereafter, the variable classifying means 123 of the similar document removal unit 120 may classify the category of each variable in a URL on the basis of the features extracted from each URL cluster and NDD cluster and of the similarity between the web document corresponding to a sample URL and the web document corresponding to the URL from which a given variable has been removed, and may determine from that classification whether the variable is unnecessary (step S530). Here, the category classification may be based on weights for various features, obtained through a machine learning algorithm such as a naive Bayesian classifier or Fisher's classifier, and on combinations of those features.
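As a rough illustration of the category classification in step S530, the following sketch trains a very small Bernoulli naive Bayes classifier with add-one smoothing over binary features. The feature names (`many_unique_values`, `removal_keeps_content`, `removal_changes_content`), the category labels (`session`, `content`), and the training samples are all hypothetical; the embodiment does not specify the feature encoding or the training data.

```python
import math
from collections import defaultdict


class TinyNaiveBayes:
    """Bernoulli naive Bayes over sets of binary features, with add-one smoothing."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        self.vocabulary = set()

    def fit(self, samples):
        for features, label in samples:
            self.class_counts[label] += 1
            for f in features:
                self.feature_counts[label][f] += 1
                self.vocabulary.add(f)

    def predict(self, features):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            for f in self.vocabulary:
                p = (self.feature_counts[label][f] + 1) / (count + 2)
                score += math.log(p if f in features else 1 - p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled variables: (observed features, variable category).
training = [
    ({"many_unique_values", "removal_keeps_content"}, "session"),
    ({"many_unique_values", "removal_keeps_content"}, "session"),
    ({"removal_changes_content"}, "content"),
    ({"removal_changes_content", "many_unique_values"}, "content"),
]
model = TinyNaiveBayes()
model.fit(training)
category = model.predict({"many_unique_values", "removal_keeps_content"})
```

A variable predicted to belong to a category such as session management would then be treated as unnecessary, while a variable whose removal changes the document content would be kept.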

Thereafter, when a variable is determined to be unnecessary, the regular expression applying means 124 of the similar document removal unit 120 may generate a regular expression pattern for removing that variable (step S540). The generated regular expression pattern may then be applied to the search database 140 or the like to delete duplicated similar documents from among the web documents already collected, or may be configured so that similar documents are not collected in duplicate during future web crawling, thereby promoting efficient use of storage space and preventing a heavy load from being placed on search.
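One possible shape for the regular expression pattern of step S540 is sketched below. The parameter name `sid`, the helper names `build_removal_pattern` and `normalize`, and the `example.com` URLs are assumptions made for the example; the embodiment does not fix the form of the generated pattern.

```python
import re


def build_removal_pattern(param: str) -> re.Pattern:
    """Pattern matching `param=value` (and a following '&') right after '?' or '&'."""
    return re.compile(r"(?<=[?&])" + re.escape(param) + r"=[^&#]*&?")


def normalize(url: str, pattern: re.Pattern) -> str:
    """Remove the unnecessary variable, then tidy any dangling '?' or '&'."""
    stripped = pattern.sub("", url)
    return re.sub(r"[?&](?=#|$)", "", stripped)


# Hypothetical case: `sid` was classified as a session variable.
pattern = build_removal_pattern("sid")
crawled = [
    "http://example.com/view.php?id=1&sid=abc",
    "http://example.com/view.php?sid=def&id=1",
]
unique_urls = {normalize(u, pattern) for u in crawled}
```

Here both crawled URLs normalize to the same string, so a crawler or database cleanup applying the pattern would keep only one copy of the document.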

Embodiments according to the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

Although the present invention has been described above with reference to specific details such as particular components, and to limited embodiments and drawings, these are provided only to aid a more general understanding of the invention. The present invention is not limited to the above embodiments, and those of ordinary skill in the art to which the invention pertains can make various modifications and variations from this description.

Accordingly, the spirit of the present invention should not be limited to the described embodiments, and not only the claims set out below but also everything that is equal or equivalent to those claims falls within the scope of the spirit of the present invention.

Claims

A method for providing a search result corresponding to a user's search query, the method comprising:
a source indication clustering step of grouping, according to a predetermined criterion, web document source indications corresponding to web documents stored in a database;
a similar document clustering step in which a search result providing system groups similar documents with respect to the result of the source indication clustering;
a step in which the search result providing system classifies variables using features generated from the source indication clustering step and the similar document clustering step, respectively;
a step of generating a regular expression using the result of the variable classifying step; and
a step in which the search result providing system collects web documents using the generated regular expression,
wherein the similar document clustering step is performed on the list of web document source indications grouped into the same cluster by the source indication clustering step.

The method of claim 1, wherein the web document source indication comprises a URL (Uniform Resource Locator), and
wherein the source indication clustering step comprises:
grouping, by the search result providing system, the URLs of the web documents into the same cluster when they satisfy at least one of the following: the URLs are identical up to the path, identical up to the file name, or identical up to the parameter key.

delete

The method according to claim 1,
wherein, in the similar document clustering step, the search result providing system performs a similarity determination on all pairs of web document source indications in the same cluster.

The method of claim 1, wherein the similar document clustering step comprises:
parsing a web document by the search result providing system;
hashing the parsed web document by the search result providing system;
calculating a simhash by operating on the hashed web document bit by bit, by the search result providing system; and
grouping web documents whose Hamming distance is less than or equal to a predetermined value as similar documents.

The method of claim 1, wherein the features generated from the source indication clustering step comprise at least one of a feature related to a parameter key and a feature related to a parameter value.

The method of claim 1, wherein the features generated from the similar document clustering step include a feature related to the ratio of unique values of the variable to the total source indication cluster count when forming similar document pairs, a feature related to the ratio of unique documents that participate in forming similar document pairs, a feature related to the number of unique documents relative to the number of similar document pairs, and a feature related to the absolute value of the similar document pair count.

The method according to claim 1,
wherein the step of classifying the variables comprises classifying a category for each variable included in the web document source indications grouped into one similar document cluster, and classifying a variable as an unnecessary variable when its category corresponds to a variable category for managing a user's access session or a variable category for checking whether a web document is linked.

The method according to claim 8, wherein the step of generating the regular expression generates a regular expression pattern that removes the unnecessary variable from the target web document source indication.

A system for providing a search result corresponding to a user's search query, the system comprising:
a similar document removal unit including:
source indication clustering means for grouping, according to a predetermined criterion, web document source indications corresponding to web documents stored in a database;
similar document clustering means for grouping similar documents with respect to the result of the source indication clustering;
variable classifying means for classifying variables using features generated from the source indication clustering means and the similar document clustering means, respectively; and
regular expression applying means for generating a regular expression using the result from the variable classifying means; and
a search unit for collecting web documents using the generated regular expression,
wherein the similar document clustering means performs similar document detection on the list of web document source indications grouped into the same cluster by the source indication clustering means.

The system of claim 10, wherein the web document source indication comprises a URL (Uniform Resource Locator), and
wherein the source indication clustering means
groups the URLs of the web documents into the same cluster when they satisfy at least one of the following: the URLs are identical up to the path, identical up to the file name, or identical up to the parameter key.

delete

The system of claim 10,
wherein the similar document clustering means performs a similarity determination on all pairs of web document source indications in the same cluster.

The system of claim 10, wherein the similar document clustering means performs:
parsing a web document;
hashing the parsed web document;
calculating a simhash by operating on the hashed web document bit by bit; and
grouping web documents whose Hamming distance is less than or equal to a predetermined value as similar documents.

The system of claim 10, wherein the features generated from the source indication clustering means comprise at least one of a feature related to a parameter key and a feature related to a parameter value.

The system of claim 10, wherein the features generated from the similar document clustering means include a feature related to the ratio of unique values of a variable to the total source indication cluster count when forming similar document pairs, a feature related to the ratio of unique documents that participate in forming similar document pairs, a feature related to the number of unique documents relative to the number of similar document pairs, and a feature related to the absolute value of the similar document pair count.

The system of claim 11,
wherein the variable classifying means classifies a category for each variable included in the web document source indications grouped into one similar document cluster, and classifies a variable as an unnecessary variable when its category corresponds to a variable category for managing a user's access session or a variable category for checking whether a web document is linked.

The search result providing system according to claim 17, wherein the regular expression applying means generates a regular expression pattern that removes the unnecessary variable from the target web document source indication.

A computer-readable recording medium on which is recorded a program for causing a computer to perform each step of the method according to any one of claims 1, 2, and 4 to 9.