KR20100037401A

KR20100037401A - Method and apparatus for managing search database

Info

Publication number: KR20100037401A
Application number: KR1020080096713A
Authority: KR
Inventors: 강춘식; 김송희; 유지영; 조안나
Original assignee: 엔에이치엔(주)
Priority date: 2008-10-01
Filing date: 2008-10-01
Publication date: 2010-04-09
Also published as: KR101074578B1

Abstract

PURPOSE: A method and an apparatus for managing a search database are provided that expand the search database by obtaining various website addresses from website address information or bookmark information received from a user terminal, and by parsing the body of content provided by a CP (content provider). CONSTITUTION: A URI (Uniform Resource Identifier) obtaining unit (120) obtains a URI from a user terminal and/or content, and an information collecting unit (130) collects information on a website by visiting the website corresponding to the obtained URI. An information organizing unit (150) organizes and stores the collected website information. The URI obtaining unit includes a URI parser (122) that extracts a URI from the content, and a validity determination unit (124) that judges the validity of the extracted URI.

Description

Method and Apparatus for Managing Search Database

The present invention relates to a method and apparatus for managing a search database and, more particularly, to a method and apparatus for managing a search database used for website search.

With the development and spread of the Internet, a variety of Internet-based services are now provided, a representative example of which is the search service. In a search service, when a user enters a word or a combination of words as a query, a search engine provides the user with search result documents corresponding to the query (for example, a website or article containing the search query entered by the user, or an image whose file name contains the query).

To provide such a search service, a search engine continuously crawls the web, mechanically collects information on new websites, and builds a database from the collected data.

Recently, as large numbers of websites are created or disappear on the web within short periods, the amount of data to be stored and managed has surged, and it has become difficult to evaluate which of this data is valuable. To address this, a scheme has been proposed in which website registration requests are received directly from users and the information on those websites is stored in a database.

Because website information can be stored in a database in response to a user's registration request, the operator of the search engine can provide a better search service and the user can maximize the advertising effect for his or her website, which allows the operator's commercial interest and the user's benefit to be balanced appropriately.

However, even when website information is stored in a database through such user registration requests, the inherent limitations of this approach mean that some websites still cannot be provided as search results. As a result, the search engine operator cannot provide a more complete search service, and the quality of the search service may be degraded.

The present invention has been made to solve the above problems, and its technical object is to provide a search database management method and apparatus capable of expanding a search database by obtaining website address information either by directly parsing the body of content provided by a content provider or directly from a user terminal.

Another technical object of the present invention is to provide a search database management method and apparatus capable of quickly reflecting information on newly created or changed websites in the search database.

To achieve the above objects, a search database management method according to one aspect of the present invention includes: obtaining a URI (Uniform Resource Identifier) from at least one of the body of predetermined content and a user terminal; collecting information on a website by visiting the website corresponding to the obtained URI; and organizing and storing the collected website information.

In one embodiment, when the URI is obtained from the body of the predetermined content, the URI obtaining step includes: dividing the text contained in the body of the predetermined content into word units (eojeol, that is, space-delimited units); and extracting the URI from a first word unit that contains a string recognized as a URI.

Here, in the URI extracting step, the string recognized as a URI begins with a first string containing "http://" or "www.". Within the string recognized as a URI in the first word unit, either the string running from the first string up to, but not including, a character belonging to a first character group, or the string running from the first string up to English or Korean characters, is extracted as the URI. The first character group may consist of special characters used in the representation of URIs.

In another embodiment, when the URI is obtained from the body of the predetermined content, the URI obtaining step includes: dividing the text contained in the body of the predetermined content into word units; and extracting as the URI, from a second word unit located within a certain distance before or after a first word unit containing a word that refers to a URI, a string that begins with English or Korean characters and contains at least a predetermined number of first special characters. In the extracting step, a string that contains exactly one first special character and ends with that first special character is preferably excluded from the URIs to be extracted.

After the URI obtaining step, the method may further include determining the validity of the URI, so that in the website information collecting step, information is collected on websites corresponding to URIs determined to be valid. Here, a URI may be determined to be invalid if it begins with Korean characters and its Korean and English parts are not joined by the first special character, if it does not contain the first special character, or if it cannot be accessed.

When the URI is obtained from the user terminal, the URI may be obtained in the URI obtaining step from bookmark information stored in a web browser of the user terminal or from a URL (Uniform Resource Locator) entered through the web browser. The bookmark information includes at least one of the title of a bookmarked website and the URL of the bookmarked website, and is obtained from the user terminal each time the bookmark information changes.

In one embodiment, the method further includes, before the website information collecting step, determining whether the URI duplicates a URI already stored in the search database, so that in the website information collecting step, information is collected on websites corresponding to non-duplicate URIs.

In the step of organizing and storing the collected website information, the collected website information may be organized by using it to determine at least one of a title, an information provider, important tags, and a group for the website. Here, the title is determined to be the content of the title tag of the collected website information or the phrase with the highest number of occurrences among the phrases contained in the collected website information. The information provider is determined to be the content corresponding to the copyright holder in the collected website information or the content provider. The important tags are determined to be the keyword tags of the collected website information or the phrases ranked within the top N by number of occurrences among the phrases contained in the collected website information. The group is determined to be, among previously stored groups, a group whose important tags match the determined important tags to a degree equal to or greater than a threshold.

To achieve the above objects, a search database management apparatus according to another aspect of the present invention includes: a URI obtaining unit that obtains a URI from at least one of the body of predetermined content and a user terminal; an information collecting unit that collects information on a website by visiting the website corresponding to the URI obtained by the URI obtaining unit; and an information organizing unit that organizes and stores the website information collected by the information collecting unit.

According to the present invention, the addresses of a wide variety of websites can be obtained by directly parsing the body of content provided by a content provider, or by using website address information entered directly through a web browser installed on a user terminal or bookmark information stored in the web browser, so that the search database can be expanded.

In addition, the present invention can quickly reflect information on newly created or changed websites in the search database, and can therefore provide a better search service.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a schematic block diagram of a search database management apparatus according to an embodiment of the present invention. As shown, the search database management apparatus 100 according to the present invention includes a URI obtaining unit 120, an information collecting unit 130, and an information organizing unit 150.

The URI obtaining unit 120 obtains a URI (Uniform Resource Identifier) from at least one of the body of predetermined content and a user terminal. The case in which the URI obtaining unit 120 obtains a URI from the body of content is described first, followed by the case in which it obtains a URI from a user terminal.

First, for the case in which the URI obtaining unit 120 obtains a URI from the body of content, the URI obtaining unit 120 includes a URI parser 122 and a validity determination unit 124, as shown in FIG. 1.

The URI parser 122 extracts URIs from the body of content by parsing the text contained in the body. Here, content is a concept that covers all web documents distributed on the Internet, such as news articles, blogs, and cafes (online communities). Timely URIs can be obtained from content such as news, and specialized URIs can be obtained from content such as blogs or cafes.

To describe in detail how the URI parser 122 extracts a URI from the body of content: the URI parser 122 first divides the text contained in the body into word units and then extracts a URI from any word unit containing a string recognized as a URI. That is, the present invention does not extract URIs from anchor tags or link tags in which hyperlinks are set within the content, but extracts them directly from the text that makes up the body of the content.

In one embodiment, a string recognized as a URI may be a string beginning with "http://" or a string beginning with "www.". That is, the URI parser 122 extracts a URI from a word unit that contains a string beginning with "http://" or "www.".

In a modified embodiment, even a string that does not begin with "http://" or "www." may be recognized as a URI if it is in a word unit located within a certain distance before or after a word unit containing a word that refers to a URI, begins with English or Korean characters, and contains one or more special characters such as ".". This is because URIs may begin with various strings other than "http://" or "www.", such as "mail.", "blog.", or "cafe.". Here, a word that refers to a URI may be a word such as "홈페이지" (homepage), "사이트" (site), "site", "블로그" (blog), "미니홈피" (mini homepage), "카페" (cafe), "클럽" (club), "URL", or "인터넷 주소" (Internet address).

In this embodiment, among strings that begin with English or Korean characters and contain one or more special characters such as ".", a string that contains exactly one such special character and ends with it may simply mark the end of a sentence rather than a URI, and is therefore preferably excluded from the strings recognized as URIs.
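The cue-word rule of this modified embodiment can be sketched in Python as follows. This is only a minimal illustration under assumptions: the function names `looks_like_uri` and `extract_near_cue`, the default `distance` of two word units, and the whitespace-based split into word units are hypothetical choices that the specification does not fix, and the cue list is a subset of the words listed in the description.

```python
import re
from typing import List

# Subset of the cue words named in the description (single word units only).
CUE_WORDS = ("홈페이지", "사이트", "site", "블로그", "미니홈피", "카페", "클럽", "URL")

# A candidate must begin with an English letter or a Hangul syllable.
_STARTS_WITH_LETTER = re.compile(r"[A-Za-z가-힣]")


def looks_like_uri(word: str, min_dots: int = 1) -> bool:
    """Test one word unit: starts with a letter and contains enough "." characters."""
    if not _STARTS_WITH_LETTER.match(word):
        return False
    dots = word.count(".")
    if dots < min_dots:
        return False
    # A single trailing "." most likely just ends a sentence, so exclude it.
    if dots == 1 and word.endswith("."):
        return False
    return True


def extract_near_cue(body: str, distance: int = 2) -> List[str]:
    """Collect URI-like word units within `distance` units of a cue word.

    `distance` is an assumed parameter; the description only says
    "a certain distance" before or after the cue word.
    """
    words = body.split()  # assumption: word units are whitespace-delimited
    found: List[str] = []
    for i, word in enumerate(words):
        if not any(cue in word for cue in CUE_WORDS):
            continue
        lo, hi = max(0, i - distance), min(len(words), i + distance + 1)
        for j in range(lo, hi):
            if j != i and looks_like_uri(words[j]) and words[j] not in found:
                found.append(words[j])
    return found
```

For instance, `extract_near_cue("자세한 내용은 홈페이지 blog.example.com 에서 확인하세요.")` picks up `blog.example.com` (a made-up host used only for illustration), while the sentence-final word unit is rejected by the single-trailing-dot check.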

In one embodiment, when extracting a URI from a word unit containing a string recognized as a URI, the URI parser 122 may extract as the URI the string running from the start point of the recognized string up to, but not including, a character belonging to a first character group consisting of special characters used in the representation of URIs.

Here, the start point of the string recognized as a URI may be defined as "http://" or "www.", or, for a string that begins with Korean or English characters and contains one or more special characters such as ".", as the Korean or English characters preceding the ".". The first character group may consist of special characters such as "/", "?", "&", "$", and the space character. That is, the URI parser 122 can extract only the valid part of the URI, such as the host name, from the string recognized as a URI. For example, when the string recognized as a URI is "http://news.chosun.com/site/data/html_dir/2008/07/30/2008073001738.html", the string "http://news.chosun.com", which runs from the start point "http://" up to (but not including) the first-group special character "/", can be extracted as the URI.
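The start-marker and first-character-group rule can be illustrated with a short Python sketch. The names `extract_uri` and `extract_uris` and the whitespace split into word units are assumptions made for illustration; only the two start markers and the first-group characters come from the description.

```python
from typing import List, Optional

START_MARKERS = ("http://", "www.")
# First character group as listed in the description: "/", "?", "&", "$", space.
FIRST_GROUP = set("/?&$ ")


def extract_uri(word: str) -> Optional[str]:
    """Return the URI found in one word unit, or None if there is none."""
    hits = [(word.find(marker), marker) for marker in START_MARKERS if marker in word]
    if not hits:
        return None
    start, marker = min(hits)  # earliest start marker in the word unit
    # Begin scanning after the marker so the "/" in "http://" is not a stop.
    end = start + len(marker)
    while end < len(word) and word[end] not in FIRST_GROUP:
        end += 1
    return word[start:end]


def extract_uris(body: str) -> List[str]:
    """Split the body into word units and extract a URI from each one."""
    uris = []
    for word in body.split():  # assumption: word units are whitespace-delimited
        uri = extract_uri(word)
        if uri is not None:
            uris.append(uri)
    return uris
```

Applied to the example in the text, `extract_uri("http://news.chosun.com/site/data/html_dir/2008/07/30/2008073001738.html")` yields `"http://news.chosun.com"`.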

In a modified embodiment, when extracting a URI from a word unit containing a string recognized as a URI, only the string running from the start point of the recognized string up to Korean or English characters may be extracted as the URI. For example, when the string recognized as a URI is "http://news.chosun.com/site/data/html_dir/2008/07/30/2008073001738.html", the URI parser 122 can extract as the URI the string "http://news.chosun.com/site/data/html_dir", which runs from the start point "http://" up to the English string "html_dir".
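The description does not spell out exactly where "up to Korean or English characters" ends. One possible reading that reproduces the example above is sketched below: cut the string at the first digit, then drop trailing characters that are not English or Korean letters. Both that reading and the name `extract_uri_up_to_letters` are assumptions, not the specification's wording.

```python
import re
from typing import Optional

START_MARKERS = ("http://", "www.")
_LETTER = re.compile(r"[A-Za-z가-힣]")


def extract_uri_up_to_letters(word: str) -> Optional[str]:
    """Extract a URI that ends at the last letter run before the first digit."""
    hits = [word.find(marker) for marker in START_MARKERS if marker in word]
    if not hits:
        return None
    candidate = word[min(hits):]
    # Assumed reading, step 1: stop at the first digit.
    digit = re.search(r"\d", candidate)
    if digit:
        candidate = candidate[:digit.start()]
    # Assumed reading, step 2: strip trailing characters that are not letters.
    while candidate and not _LETTER.match(candidate[-1]):
        candidate = candidate[:-1]
    return candidate or None
```

Under this reading a host or path that legitimately contains digits would be truncated, so a real implementation would likely need a more careful boundary rule.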

Next, the validity determination unit 124 determines the validity of the URI extracted by the URI parser 122. In one embodiment, the validity determination unit 124 may determine the following to be invalid: an extracted URI that begins with Korean characters but whose Korean and English parts are not joined by "." in the form "Korean.English"; an extracted URI that does not contain a special character such as "."; and a URI that cannot be accessed.
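The three validity checks can be sketched as follows. The function name `is_valid_uri` and the injectable `is_reachable` callback are assumptions introduced for illustration; the description does not say how accessibility is tested, so the default callback simply treats every URI as reachable.

```python
import re
from typing import Callable

_STARTS_WITH_HANGUL = re.compile(r"[가-힣]")
# "Korean.English" form: Hangul syllables, a ".", then an English letter.
_HANGUL_DOT_ENGLISH = re.compile(r"[가-힣]+\.[A-Za-z]")


def is_valid_uri(uri: str,
                 is_reachable: Callable[[str], bool] = lambda _uri: True) -> bool:
    """Apply the validity rules described for the validity determination unit."""
    # Rule: a URI without "." is invalid.
    if "." not in uri:
        return False
    # Rule: a URI starting with Korean must have the form "Korean.English".
    if _STARTS_WITH_HANGUL.match(uri) and not _HANGUL_DOT_ENGLISH.match(uri):
        return False
    # Rule: an inaccessible URI is invalid (the check itself is supplied by the caller).
    return is_reachable(uri)
```

Passing the reachability check in as a function keeps the syntactic rules testable without network access; a caller could supply, for example, an HTTP probe of its own.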

An example in which the URI obtaining unit 120 described above obtains a URI will be described with reference to FIG. 2. First, the URI obtaining unit 120 divides the text constituting the content body 200 of the content shown in FIG. 2 into word units, and then searches for a word unit containing a string recognized as a URI. In FIG. 2, the content includes a word unit 210 containing a string beginning with "www.", so a URI such as "www.bucheon.go.kr", the string running from the start point "www." up to the English characters, is obtained from the word unit 210.

Next, for the case in which the URI obtaining unit 120 obtains a URI directly from a user terminal, the URI obtaining unit 120 may obtain the URI from bookmark information stored in the web browser of the user terminal or from a URL (Uniform Resource Locator) entered through the web browser.

First, when obtaining a URI from the bookmark information stored in the web browser of the user terminal, the URI obtaining unit 120 may obtain, as the bookmark information, the title of a bookmarked website and the URL of the bookmarked website. The URI obtaining unit 120 may obtain the changed bookmark information each time the bookmark information stored in the web browser changes.

Meanwhile, when obtaining a URI from a URL entered into the web browser of the user terminal, the URI obtaining unit 120 may, in order to obtain URIs that reflect users' usage patterns, obtain a URL entered directly after the web browser is first activated, or a URL entered directly into the web browser after a specific website has been visited. Here, the specific website means, for example, a website that provides a search service; the purpose is to obtain the URL when a user, after using the search service, moves to another website by typing its URL directly. In addition, the URI obtaining unit 120 may obtain the URL entered immediately before the user arrives at the specific website.

Referring again to FIG. 1, the information collecting unit 130 directly visits the website corresponding to the URI obtained by the URI obtaining unit 120 and collects information on that website. In one embodiment, the information collecting unit 130 may collect all resources published on the website as website information. For example, it may collect HTML documents, images, or text contained in the website.

In the embodiment described above, the information collecting unit 130 collects information from the websites corresponding to all URIs obtained by the URI obtaining unit 120. In a modified embodiment, however, it may collect information only for websites corresponding to URIs that, among all URIs obtained by the URI obtaining unit 120, do not duplicate URIs already stored in a search database (not shown). To this end, the search database management apparatus 100 according to an embodiment of the present invention may further include a duplication determination unit 140.

The duplication determination unit 140 determines whether a URI obtained by the URI obtaining unit 120 duplicates a URI already stored in the search database, and provides the non-duplicate URIs to the information collecting unit 130 described above. When a URI identical to the URI obtained by the URI obtaining unit 120 exists in the search database, the duplication determination unit 140 determines that the URI is a duplicate.

In one embodiment, when a URI obtained by the URI obtaining unit 120 and a stored URI differ only in special characters such as "/", the duplication determination unit 140 may determine that the two URIs are identical. In addition, when the URI obtained by the URI obtaining unit 120 is of a page type, the duplication determination unit 140 may determine that the obtained URI and a stored URI are identical if their host names are the same. For example, when the obtained URI is "www.dmlc.co.kr/condo386" and the stored URI is "www.dmlc.co.kr", the two URIs are determined to be identical.

However, if this rule were applied uniformly to all URIs obtained by the URI obtaining unit 120, URIs belonging to cafes, blogs, mini homepages, or clubs, whose host names may all be the same, could be judged identical even though they are in fact different. For example, when the obtained URI is "blog.naver.com/broadseo" and the stored URI is "blog.naver.com/jhoh", the duplication determination unit 140 would determine the two URIs to be identical because their host names are the same, even though they actually differ. It is therefore preferable not to apply this rule when the obtained URI belongs to a cafe, blog, mini homepage, or club.
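The duplicate test, including the exception for shared hosts, might look like the following sketch. The names `split_host`, `is_duplicate`, and `SHARED_HOSTS` are hypothetical, and two simplifications are assumed: the scheme prefix is ignored when comparing, and the host-name rule is applied to any URI on a non-shared host rather than only to "page type" URIs, which the description does not define precisely.

```python
from typing import Tuple

# Hosts on which many distinct sites share one host name (blogs, cafes, ...).
# Only the host from the description's example is listed; extend as needed.
SHARED_HOSTS = {"blog.naver.com"}


def split_host(uri: str) -> Tuple[str, str]:
    """Split a URI into (host, path), ignoring the scheme and outer slashes."""
    rest = uri.split("://", 1)[-1]
    host, _, path = rest.partition("/")
    return host.lower(), path.strip("/")


def is_duplicate(new_uri: str, stored_uri: str) -> bool:
    """Decide whether a newly obtained URI duplicates a stored one."""
    new_host, new_path = split_host(new_uri)
    old_host, old_path = split_host(stored_uri)
    if new_host != old_host:
        return False
    if new_host in SHARED_HOSTS:
        # Exception: on shared hosts the path identifies the site.
        return new_path == old_path
    # Same host name: treated as the same website, so a trailing "/" or a
    # page path does not make the URI new.
    return True
```

With these definitions, `"www.dmlc.co.kr/condo386"` is a duplicate of `"www.dmlc.co.kr"`, while `"blog.naver.com/broadseo"` and `"blog.naver.com/jhoh"` are kept apart.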

Next, the information organizing unit 150 organizes and stores the website information collected by the information collecting unit 130. Specifically, the information organizing unit 150 organizes the collected website information by using it to determine at least one of a title, an information provider, important tags, and a group for the website.

In one embodiment, the information organizing unit 150 may determine, as the title of the website, the content of the title tag of the collected website information or the phrase with the highest number of occurrences among the phrases contained in the collected website information. The information organizing unit 150 may also determine, as the information provider of the website, the content corresponding to the copyright holder in the collected website information or the provider of the content. Here, a phrase contained in the website information may be a word contained in the website information, in particular a word whose part of speech is a noun. Without being limited thereto, a phrase may also be a combination of two words, or a word with a particle or the like attached.

The information organizing unit 150 may also determine, as the important tags of the website, the keyword tags of the collected website information or the phrases ranked within the top N (for example, the top 10) by number of occurrences among the phrases contained in the collected website information. In addition, the information organizing unit 150 may determine, as the group to which the website will belong, a previously stored group whose important tags match the important tags determined by the above process to a degree equal to or greater than a threshold.
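The frequency-based choice of title and important tags, and the threshold-based choice of group, can be sketched as below. Several points are assumptions: the function names, the preference for the title tag and keyword tags when they are present (the description offers them as alternatives), the definition of the degree of match as the share of the site's tags that also appear in the group's tags, and the default threshold of 0.5.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence


def pick_title(phrases: Sequence[str], title_tag: Optional[str] = None) -> Optional[str]:
    """Use the title tag if present, else the most frequent phrase."""
    if title_tag:
        return title_tag
    if not phrases:
        return None
    return Counter(phrases).most_common(1)[0][0]


def important_tags(phrases: Sequence[str], n: int = 10,
                   keyword_tags: Optional[Sequence[str]] = None) -> List[str]:
    """Use the keyword tags if present, else the top-N phrases by frequency."""
    if keyword_tags:
        return list(keyword_tags)[:n]
    return [phrase for phrase, _ in Counter(phrases).most_common(n)]


def match_degree(site_tags: Sequence[str], group_tags: Sequence[str]) -> float:
    """Assumed metric: fraction of the site's tags found among the group's tags."""
    site = set(site_tags)
    if not site:
        return 0.0
    return len(site & set(group_tags)) / len(site)


def pick_group(site_tags: Sequence[str], groups: Dict[str, Sequence[str]],
               threshold: float = 0.5) -> Optional[str]:
    """Return the best-matching stored group at or above the threshold, if any."""
    best_name, best_score = None, 0.0
    for name, tags in groups.items():
        score = match_degree(site_tags, tags)
        if score >= threshold and score > best_score:
            best_name, best_score = name, score
    return best_name
```

A site whose tags match no stored group well enough gets `None` here; how such a site is handled (for example, by creating a new group) is not covered by this sketch.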

When a website is added to a group, the group name or the important tags of the group may be updated. The group name may be determined to be the phrase with the highest number of occurrences among the phrases contained in the website information of the websites in the group, and the important tags of the group may be determined to be the phrases ranked within the top N by number of occurrences among all phrases contained in that website information.
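Recomputing a group's name and important tags after a website joins it amounts to pooling the phrases of every member site and ranking them. The sketch below assumes a hypothetical function `refresh_group` that receives one phrase list per member website.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def refresh_group(site_phrases: Iterable[Iterable[str]], n: int = 10) -> Tuple[str, List[str]]:
    """Return (group name, group important tags) from all member sites' phrases."""
    counts: Counter = Counter()
    for phrases in site_phrases:
        counts.update(phrases)  # pool phrase occurrences across member sites
    ranked = [phrase for phrase, _ in counts.most_common(n)]
    # The most frequent phrase becomes the group name; the top N become its tags.
    name = ranked[0] if ranked else ""
    return name, ranked
```

Calling it with the phrase lists of all current members, including the newly added website, gives the updated name and tags in one pass.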

The information organizing unit 150 may store the title, information provider, important tags, and group determined as described above in the search database together with the URI of the website.

Hereinafter, a search database management method according to the present invention is described with reference to FIG. 3. FIG. 3 is a flowchart showing a search database management method according to an embodiment of the present invention.

First, a URI is obtained from at least one of the body of predetermined content and a user terminal (S300). Here, content is a concept that covers all web documents distributed on the Internet, such as news articles, blogs, and cafes. Timely URIs can be obtained from content such as news, and specialized URIs can be obtained from content such as blogs and cafes.

The process of obtaining a URI from the body of predetermined content is described in more detail below with reference to FIG. 4. First, the text in the body of the content is divided into space-delimited word units (eojeol) (S400), and the word units are then searched for those containing a string recognized as a URI (S410).

In one embodiment, a string recognized as a URI may be a string that starts with "http://" or "www.". Even if a string does not start with "http://" or "www.", it may be recognized as a URI if it appears in a word unit located within a certain distance before or after a word unit containing a word that refers to a URI, starts with an English or Korean character, and contains at least one special character such as ".". Words that refer to a URI may include "홈페이지" (homepage), "사이트" (site), "site", "블로그" (blog), "미니홈피" (mini homepage), "카페" (cafe), "클럽" (club), "URL", and "인터넷 주소" (Internet address). Among strings that start with an English or Korean character and contain at least one special character such as ".", a string that contains exactly one such character and ends with it may simply mark the end of a sentence rather than a URI, so it is preferably excluded from the strings recognized as URIs.
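The recognition rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the window of two word units stands in for the unspecified "certain distance", only "." is treated as the special character, and the hint list is a subset of the words named above (the multi-word "인터넷 주소" is omitted because it cannot occur inside a single word unit).

```python
import re

# Subset of the URI-indicating words listed in the description (assumption).
URI_WORDS = ("홈페이지", "사이트", "site", "블로그", "미니홈피", "카페", "클럽", "URL")

PREFIX_RE = re.compile(r"(?:http://|www\.)\S+")        # rule 1: explicit prefix
DOTTED_RE = re.compile(r"^[A-Za-z가-힣]\S*\.\S*$")      # rule 2: letter first, has "."


def is_sentence_end(token):
    """A token with exactly one '.' that also ends with '.' is treated as a sentence end."""
    return token.count(".") == 1 and token.endswith(".")


def find_candidates(text, window=2):
    """Return strings recognized as URIs, in order of appearance."""
    tokens = text.split()  # S400: split the body into word units
    found = []
    for i, tok in enumerate(tokens):  # S410: search each word unit
        m = PREFIX_RE.search(tok)
        if m:
            found.append(m.group(0))
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        near_hint = any(
            w in tokens[j] for j in range(lo, hi) if j != i for w in URI_WORDS
        )
        if near_hint and DOTTED_RE.match(tok) and not is_sentence_end(tok):
            found.append(tok)
    return found
```

For instance, in the sentence "자세한 내용은 홈페이지 example.co.kr 에서 확인하세요." the word unit `example.co.kr` qualifies because it sits next to "홈페이지", while "확인하세요." is rejected as a sentence ending.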

Next, the URI is extracted from each word unit that contains a string recognized as a URI (S420). In one embodiment, only the valid part of the URI, such as the host name, may be extracted from the recognized string. That is, within the string recognized as a URI, the substring from the starting point, such as "http://" or "www.", up to but not including a character belonging to a first character group, which consists of special characters used in URI notation, is extracted as the URI. In a modified embodiment, only the substring from the starting point up to the Korean or English characters may be extracted as the URI.
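A minimal sketch of this extraction step is shown below. The membership of the first character group is not listed in the specification; the set `/?#` used here is an assumption chosen so that the result is the host-name portion.

```python
import re

FIRST_CHAR_GROUP = "/?#"  # assumed members of the "first character group"


def extract_uri(candidate):
    """Return the part of candidate from 'http://' or 'www.' up to, but not
    including, the first character in FIRST_CHAR_GROUP; None if no start point."""
    m = re.search(r"http://|www\.", candidate)
    if not m:
        return None
    start = m.start()
    # Skip past the slashes of the scheme so they do not end the URI early.
    scan_from = m.end() if m.group(0) == "http://" else m.start()
    end = len(candidate)
    for i in range(scan_from, len(candidate)):
        if candidate[i] in FIRST_CHAR_GROUP:
            end = i
            break
    return candidate[start:end]
```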

Finally, the validity of the URI extracted through the above process is determined (S420). In one embodiment, a URI may be determined to be invalid if it starts with Korean characters but its Korean and English parts are not joined by "." in the form "Korean.English", if it contains no special character such as ".", or if it cannot be accessed.
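The three invalidity conditions can be sketched as a single predicate. To keep the example free of network access, reachability is delegated to an optional caller-supplied function `is_reachable`, which is a hypothetical hook and not something the specification defines.

```python
import re

# Korean characters followed directly by "." and an English letter ("한글.영문" form).
HANGUL_DOT_LATIN = re.compile(r"^[가-힣]+\.[A-Za-z]")


def is_valid_uri(uri, is_reachable=None):
    """Apply the validity rules; is_reachable is an optional callable(uri) -> bool."""
    if "." not in uri:
        return False  # no "." special character at all
    if re.match(r"[가-힣]", uri) and not HANGUL_DOT_LATIN.match(uri):
        return False  # starts with Korean but is not in "Korean.English" form
    if is_reachable is not None and not is_reachable(uri):
        return False  # the site could not be accessed
    return True
```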

Thus, rather than extracting URIs from anchor tags or link tags in which hyperlinks are set, the present invention extracts URIs directly from the text that makes up the body of the content, and can therefore extract a wider variety of URIs from that content.

Referring again to FIG. 3, when a URI is obtained from the user terminal in S300, it is obtained from bookmark information stored in the web browser of the user terminal or from a URL entered through the web browser. The bookmark information includes the title and the URL of each bookmarked website, and may be obtained whenever the bookmark information stored in the web browser changes.

The URLs entered in the web browser of the user terminal may include a URL entered directly after the web browser is first activated, a URL entered directly in the web browser after visiting a specific website, or a URL entered immediately before arriving at a specific website. The specific website may be, for example, a website that provides a search service.

Next, whether the URI obtained in S300 is a duplicate is determined by checking whether it is identical to a URI already stored in the search database (S310). In one embodiment, if the obtained URI and a stored URI differ only in special characters such as "/", the two may be treated as identical; alternatively, if the obtained URI is a page-type URI and its host name is the same as that of a stored URI, the two may be treated as identical. However, the rule that URIs with the same host name are identical is preferably not applied to URIs for cafes, blogs, mini homepages, clubs, and the like.
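The duplicate rules might be sketched as follows. Several points are assumptions made for illustration: "differing only in special characters such as '/'" is narrowed to trailing slashes, a "page-type" URI is taken to be one with a non-empty path, and the set `COMMUNITY_HOSTS` with its example host names is hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical hosts for cafes, blogs, mini homepages, and clubs,
# where each path is a separate site and the host-name rule is skipped.
COMMUNITY_HOSTS = {"cafe.example.com", "blog.example.com"}


def normalize(uri):
    """Return (host, path) with the scheme defaulted and trailing '/' removed."""
    if "://" not in uri:
        uri = "http://" + uri
    parts = urlsplit(uri)
    return (parts.hostname or "").lower(), parts.path.rstrip("/")


def is_duplicate(new_uri, stored_uris):
    """S310: decide whether new_uri duplicates any stored URI."""
    host, path = normalize(new_uri)
    for stored in stored_uris:
        s_host, s_path = normalize(stored)
        if (host, path) == (s_host, s_path):
            return True  # identical apart from trailing "/"
        if path and host == s_host and host not in COMMUNITY_HOSTS:
            return True  # page-type URI on an already stored host
    return False
```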

Because the present invention can also perform the information collection process described below for every obtained URI without performing the duplicate determination, the duplicate determination step is optional.

Next, if the obtained URI is determined not to be a duplicate, information about the website is collected by directly visiting the website corresponding to the URI (S320). In one embodiment, the HTML documents, images, or text included in the website may be collected as the website information.

The website information collected in S320 is then organized (S330). In one embodiment, organizing the collected website information means determining at least one of the title, information provider, important tags, and group of the website from that information.

Here, the title of the website may be determined as the content of the title tag in the collected website information or as the phrase that occurs most frequently among the phrases in that information, and the information provider of the website may be determined as the content corresponding to the copyright notice in the collected website information or as the provider of the content. A phrase in the website information may be a single word, in particular a word whose part of speech is a noun. It is not limited to this, however: a phrase may also be a combination of two words, or a word with a grammatical particle attached.
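As an illustration of the title rule, the sketch below reads the title tag with Python's standard `html.parser` and falls back to the most frequent whitespace-separated word when no title is present. Treating the two options as a primary choice and a fallback is an assumption, and the noun filtering mentioned above is omitted because it would need a morphological analyzer.

```python
from collections import Counter
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the <title> text and the remaining text separately."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        else:
            self.text.append(data)


def decide_title(html):
    """Return the title tag content, else the most frequent word, else None."""
    parser = _TitleParser()
    parser.feed(html)
    title = parser.title.strip()
    if title:
        return title
    words = " ".join(parser.text).split()
    return Counter(words).most_common(1)[0][0] if words else None
```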

The important tags of the website may be determined as the keyword tags of the collected website information or as the phrases whose occurrence count ranks within the top N, for example the top 10, among the phrases in that information. The group of the website may be determined as a group, among previously stored groups, whose important tags match the important tags of the website to a degree at or above a threshold.

When the website is added to a specific group, the group name or the important tags of the group may be updated. The group name may be set to the phrase that occurs most often among the phrases in the information of the websites in the group, and the important tags of the group may be set to the phrases whose occurrence count ranks within the top N among all of those phrases.

Finally, at least one of the title, information provider, important tags, and group determined in S330 is stored in the search database together with the URI of the website (S340).
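One possible shape for the stored record is sketched below using Python's built-in `sqlite3` module. The table name `sites`, its columns, and the choice to serialize tags as JSON are all assumptions; the specification does not describe the schema of the search database.

```python
import json
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) sites table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sites ("
        "uri TEXT PRIMARY KEY, title TEXT, provider TEXT, tags TEXT, grp TEXT)"
    )
    return conn


def store_site(conn, uri, title=None, provider=None, tags=None, group=None):
    """S340: store whichever fields were determined, keyed by the URI."""
    conn.execute(
        "INSERT OR REPLACE INTO sites (uri, title, provider, tags, grp) "
        "VALUES (?, ?, ?, ?, ?)",
        (uri, title, provider, json.dumps(tags or [], ensure_ascii=False), group),
    )
    conn.commit()
```

Every field other than the URI is optional here, reflecting the "at least one of" wording above.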

If the obtained URI is determined in S310 to be a duplicate, the duplicate check is performed on the next obtained URI, and the procedure ends when no URIs remain to be checked.
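The overall flow of FIG. 3 can be summarized as a loop over the obtained URIs. In this sketch the function name `manage_search_db` is hypothetical, and the individual steps are passed in as caller-supplied callables so that the control flow stays independent of how each step is implemented.

```python
def manage_search_db(uris, stored, is_duplicate, collect, organize, store):
    """Run S310 to S340 for each URI obtained in S300.

    stored is a mutable list of URIs already in the search database.
    Returns the list of URIs that were newly added.
    """
    added = []
    for uri in uris:
        if is_duplicate(uri, stored):   # S310: skip duplicates, move to next URI
            continue
        info = collect(uri)             # S320: visit the website and collect
        record = organize(info)         # S330: title, provider, tags, group
        store(uri, record)              # S340: save together with the URI
        stored.append(uri)
        added.append(uri)
    return added                        # loop ends when no URIs remain
```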

The search database management method described above may be implemented as program instructions executable by various computer means and recorded on a computer-readable recording medium. The computer-readable recording medium may contain program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software.

Computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. The recording medium may also be a transmission medium, such as an optical or metal line or a waveguide, that carries a carrier wave transmitting signals specifying program instructions, data structures, and the like.

Program instructions include not only machine code such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

Those skilled in the art to which the present invention pertains will understand that the invention described above may be embodied in other specific forms without changing its technical spirit or essential features.

Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

FIG. 1 is a schematic block diagram of a search database management apparatus according to an embodiment of the present invention.

FIG. 2 is a diagram showing an example in which the URI obtaining unit shown in FIG. 1 obtains a URI from the body of content.

FIG. 3 is a flowchart showing a search database management method according to an embodiment of the present invention.

FIG. 4 is a flowchart showing the detailed procedure for obtaining a URI from the body of content.

<Explanation of reference numerals for the main parts of the drawings>

100: search database management apparatus

120: URI obtaining unit

130: information collecting unit

140: duplicate determination unit

150: information organizing unit

Claims

A search database management method comprising:

obtaining a Uniform Resource Identifier (URI) from at least one of a body of predetermined content and a user terminal;

collecting information of a website by visiting the website corresponding to the obtained URI; and

organizing and storing the collected website information.

The method of claim 1,

wherein, when the URI is obtained from the body of the predetermined content, the URI obtaining step comprises:

dividing the text included in the body of the predetermined content into word units; and

extracting the URI from a first word unit that includes a string recognized as a URI among the word units.

The method of claim 2,

wherein, in the URI extracting step, the string recognized as a URI is a string starting with a first string including "http://" or "www.".

The method of claim 3,

wherein, in the extracting step, within the string recognized as a URI included in the first word unit, a string from the first string up to the character preceding a character included in a first character group, or a string from the first string up to English or Korean characters, is extracted as the URI.

The method of claim 4,

wherein the first character group consists of special characters used in URI notation.

The method of claim 1,

wherein the URI obtaining step comprises extracting, as the URI, a string that starts with an English or Korean character and includes a predetermined number of first special characters, from a second word unit located within a predetermined distance before or after a first word unit that includes a word indicating a URI among the word units.

The method of claim 6,

wherein, in the extracting step, a string that includes exactly one first special character and ends with the first special character is excluded.

The method of claim 1,

further comprising, after the URI obtaining step, determining the validity of the URI,

wherein, in the website information collecting step, information is collected for a website corresponding to a URI determined to be valid.

The method of claim 8,

wherein, in the validity determining step, a URI that starts with Korean characters but in which the Korean and English parts are not joined by a first special character, a URI that does not include the first special character, or a URI that cannot be accessed is determined to be invalid.

The method of claim 1,

wherein, when the URI is obtained from the user terminal, in the URI obtaining step the URI is obtained from bookmark information stored in a web browser of the user terminal or from a Uniform Resource Locator (URL) entered through the web browser.

The method of claim 10,

wherein the bookmark information includes at least one of a title of a bookmarked website and a URL of the bookmarked website.

The method of claim 10,

wherein the bookmark information is obtained from the user terminal whenever the bookmark information changes.

The method of claim 1,

further comprising, before the website information collecting step, determining whether the URI duplicates a URI previously stored in the search database,

wherein, in the website information collecting step, information is collected for a website corresponding to a URI that is not a duplicate.

The method of claim 13,

wherein, in the duplicate determining step, when the URI is a page-type URI and the host name included in the URI is the same as the host name included in the previously stored URI, the URI is determined to be a duplicate.

The method of claim 1,

wherein, in the organizing and storing step, the collected website information is organized by determining at least one of a title, an information provider, an important tag, and a group of the website using the collected website information.

The method of claim 15,

wherein, in the information determining step, the title is determined as the content of the title tag of the collected website information or as the phrase with the highest occurrence count among the phrases included in the collected website information.

The method of claim 15,

wherein, in the information determining step, the information provider is determined as the content corresponding to the copyright notice of the collected website information or as the provider of the content.

The method of claim 15,

wherein, in the information determining step, the important tag is determined as a keyword tag of the collected website information or as the phrases whose occurrence count ranks within the top N among the phrases included in the collected website information.

The method of claim 15,

wherein, in the information determining step, the group is determined as a group, among previously stored groups, having important tags whose degree of matching with the determined important tag is at or above a threshold.

The method of claim 19,

further comprising, when the group of the website is determined, updating the group name of the group using the important tag of the website and the important tag of the group.

A recording medium having recorded thereon a program for performing the method according to any one of claims 1 to 20.

A search database management apparatus comprising:

a URI obtaining unit that obtains a URI from at least one of a body of predetermined content and a user terminal;

an information collecting unit that collects information of a website by visiting the website corresponding to the URI obtained by the URI obtaining unit; and

an information organizing unit that organizes and stores the website information collected by the information collecting unit.

The apparatus of claim 22,

wherein the URI obtaining unit, for obtaining the URI from the body of the predetermined content, comprises:

a URI parser that divides the text included in the body of the predetermined content into word units and extracts the URI from a word unit that includes a string recognized as a URI among the word units; and

a validity determination unit that determines the validity of the extracted URI.

The apparatus of claim 23,

wherein the string recognized as the URI is a string starting with a first string including "http://" or "www.", or a string that starts with an English or Korean character and includes a predetermined number of first special characters.

The apparatus of claim 24,

wherein the URI parser extracts, as the URI, within the string recognized as the URI, a string from the first string up to the character preceding a character included in a first character group consisting of special characters used in URI notation, or a string from the first string up to English or Korean characters.

The apparatus of claim 24,

wherein, among the strings recognized as the URI, a string that includes exactly one first special character and ends with the first special character is excluded from the strings recognized as the URI.

The apparatus of claim 23,

wherein the validity determination unit determines to be invalid a URI that starts with Korean characters but in which the Korean and English parts are not joined by the first special character, a URI that does not include the first special character, or a URI that cannot be accessed.

The apparatus of claim 22,

wherein, when the URI is obtained from the user terminal, the URI obtaining unit obtains the URI from bookmark information stored in a web browser of the user terminal or from a Uniform Resource Locator (URL) entered through the web browser.

The apparatus of claim 22,

further comprising a duplicate determination unit that determines whether the URI duplicates a URI previously stored in the search database,

wherein the information collecting unit collects information of a website corresponding to a URI that is not a duplicate.

The apparatus of claim 22,

wherein the information organizing unit organizes the collected website information by determining at least one of a title, an information provider, an important tag, and a group of the website using the collected website information.

The apparatus of claim 22,

wherein the information organizing unit determines the title as the content of the title tag of the collected website information or as the phrase with the highest occurrence count among the phrases included in the collected website information; determines the information provider as the content corresponding to the copyright notice of the collected website information or as the provider of the content; determines the important tag as a keyword tag of the collected website information or as the phrases whose occurrence count ranks within the top N among the phrases included in the collected website information; and determines the group as a group, among previously stored groups, having important tags whose degree of matching with the determined important tag is at or above a threshold.