KR20110133650A

KR20110133650A - Method and system for searching by proximity of index term

Info

Publication number: KR20110133650A
Application number: KR1020100053138A
Authority: KR
Inventors: 주은석
Original assignee: 엔에이치엔(주)
Priority date: 2010-06-07
Filing date: 2010-06-07
Publication date: 2011-12-14
Also published as: KR101127795B1

Abstract

PURPOSE: A search method and system using the proximity of index terms are provided to ease development by establishing a location policy and a proximity formula. CONSTITUTION: A query index term extraction unit (120) extracts a query index term list corresponding to a query received through an input unit. A document search unit (130) searches for documents using the query index term list. A proximity determination unit (140) determines proximity using index term counts, which are obtained by comparing the unit index term list of each retrieved document with the query index term list. A search result providing unit (160) provides the retrieved documents based on the proximity.

Description

Searching method and search system using proximity of index term {METHOD AND SYSTEM FOR SEARCHING BY PROXIMITY OF INDEX TERM}

The present invention relates to a search method and a search system using the proximity of index terms. More particularly, it relates to a search method and a search system that can improve search accuracy by using a proximity formula reflecting the index term counts detected between a query index term list and the unit index term list of a document to which a unit index term location scheme has been applied.

Recently, as most documents are written on computers and are distributed and obtained over communication networks, techniques for finding documents effectively have become increasingly important.

Moreover, with the spread of the Internet, it has become common for the general public as well as experts to connect to communication networks to provide or obtain information, and the amount of information accessible through the Internet is growing exponentially. As a result, search engines (e.g., naver, altavista, yahoo, infoseek ultra, dejanews, lycos, empas) have established themselves as the most successful applications on the Internet, a vast information repository and information-acquisition infrastructure without precedent in history.

Because the web was not large in its early days, early Internet search engines had no need to build a database for the small amount of material available. Early web search engines such as Yahoo used topic search, which is convenient both to develop and to search when the database is small.

For example, assume that the menu at each level, including the initial menu, has about 10 submenus and that the menu hierarchy supports up to four levels. Represented as a tree, this structure can hold a total of 1,000 (10³) items. Adding one more level raises the capacity to 10,000 (10⁴) items.

However, current Internet search engines hold anywhere from one million to as many as 50 million records, so reaching the desired item by topic search requires passing through many levels. A single wrong choice at any level makes it impossible to find the item among the subtopics without going back up to a higher-level topic.

As the Internet has continued to grow, topic search alone can no longer provide smooth searching. The number of records held by a search engine must grow in step with the rapidly expanding web, but the earlier approach, in which a person manually checks each homepage and adds it as a record, cannot keep pace with that growth. Even if a database were built by manually indexing hundreds of thousands of homepages, users would need a great deal of time and effort to search it through menus. At this point the concept of the web crawler was introduced to the Internet. A web crawler is an automatic traversal program that automates the work of visiting and indexing websites, which was previously done by hand, and stores the results in a database. Most databases built by web crawlers are designed to support keyword search, and from this point Internet search engines began to shift from topic search to keyword search. That is, a user enters a search expression as keywords, i.e., query terms, to find the desired information, and reaches the relevant information through a Boolean query scheme or a vector query scheme that uses the relationships among the entered query terms.

In this prior art, the index term lists of documents already built into the system are searched in consideration of the relationships among the entered query terms (e.g., weights between query terms), and the matching information is provided to the user. One way of reaching the desired information in consideration of these relationships is to analyze the query by morpheme to extract index terms, retrieve the documents corresponding to the extracted index terms, compute proximity by substituting the resulting values into a proximity formula, and provide the documents in descending order of proximity. However, in a search method that assigns locations through morpheme-level analysis, the location information is not intuitive, so proximity is computed inaccurately for some patterns and accurate search results cannot be provided.

In addition, with the conventional location assignment method and proximity formula, errors arise because the case in which index terms within the index term list cross each other is not taken into account.

An object of the present invention is to provide a search method and a search system using the proximity of index terms that establish and apply a location policy and a proximity formula robust to changes in the index term extraction policy, thereby easing development and resolving errors in proximity computation.

Another object of the present invention is to provide a search method and a search system using the proximity of index terms in which, owing to a location assignment scheme faithful to the concept of index term density underlying the proximity formula, location information can be interpreted unambiguously from the output of the index term extractor alone, and in which search accuracy is improved by resolving the proximity computation errors caused by types that are not indexed on their own.

Another object of the present invention is to provide a search method and a search system using the proximity of index terms that improve search accuracy by determining proximity from the index term counts detected between the query index term list corresponding to a query and the unit index term list extracted by applying the unit index term location scheme to a retrieved document, so that search results close to the query are provided.

Another object of the present invention is to provide a search method and a search system using the proximity of index terms that reduce the error caused by index term swapping by computing a penalty value and applying it to the proximity when, within the query index term boundary, there are index terms whose location information differs between part of the query index term list and the index term list of the document.

To achieve the above objects, a search method using the proximity of index terms according to an embodiment of the present invention includes: receiving a query; extracting a query index term list corresponding to the query; searching for documents using the query index term list; and determining proximity by comparing the unit index term list of each retrieved document with the query index term list.

The proximity is preferably determined by the density of matching index terms and trace index terms (trace terms) in the maximum set of index terms that can be extracted within the query index term boundary.

Determining the proximity preferably further includes: detecting the number of index terms in the query index term list (QTC: Query Terms Count); detecting the number of index terms that match between the unit index term list and the query index term list (MTC: Matched Terms Count); detecting the number of index terms in the unit index term list that fall within the query index term boundary (UTC: UTL (Unit Term List) Matched Terms Count); detecting the number of index terms within the query index term boundary that are in the unit index term list but not in the query index term list (TTC: Trace Terms Count); and calculating the proximity using the detected counts.

In the proximity calculation step, the detected counts are preferably substituted into the following formula:

Proximity = (MTC + TTC) / max(QTC, UTC)
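
The formula can be illustrated with a minimal Python sketch. The function name `proximity` and the handling of a zero denominator are assumptions made for illustration and are not specified in the source; the first call uses the counts from the worked example in the description (QTC = 3, MTC = 3, UTC = 3, TTC = 0), and the second uses made-up counts.

```python
def proximity(mtc: int, ttc: int, qtc: int, utc: int) -> float:
    """Proximity = (MTC + TTC) / max(QTC, UTC), as stated in the text."""
    denominator = max(qtc, utc)
    if denominator == 0:
        # Assumption (not in the source): no query terms and an empty
        # boundary are treated as proximity 0.
        return 0.0
    return (mtc + ttc) / denominator


# Counts from the worked example in the description.
print(proximity(mtc=3, ttc=0, qtc=3, utc=3))  # 1.0

# Made-up counts, for illustration only.
print(proximity(mtc=2, ttc=0, qtc=3, utc=4))  # 0.5
```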

The proximity calculation step preferably further includes assigning location information to the query index term list based on the unit index term list of the document.

Determining the proximity preferably further includes correcting the proximity by applying a penalty when, within the query index term boundary, there is an index term whose location information differs between the unit index term list and the query index term list.

The penalty is preferably calculated based on the number of index terms whose location information differs between the unit index term list and the query index term list within the query index term boundary (STC: Swapped Term Count).

After the proximity is determined, the method preferably further includes sorting the retrieved documents based on the determined proximity, where the documents are preferably sorted in descending or ascending order of proximity.
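
The sorting step can be sketched as follows. This is only an illustration: the `ScoredDocument` type, the `doc_id` field, and the sample scores are hypothetical and do not come from the source.

```python
from typing import List, NamedTuple


class ScoredDocument(NamedTuple):
    doc_id: str        # hypothetical document identifier
    proximity: float   # proximity already determined for this document


def sort_by_proximity(docs: List[ScoredDocument],
                      descending: bool = True) -> List[ScoredDocument]:
    """Sort retrieved documents by proximity, descending or ascending."""
    return sorted(docs, key=lambda d: d.proximity, reverse=descending)


# Made-up scores, for illustration only.
results = [
    ScoredDocument("doc-a", 0.5),
    ScoredDocument("doc-b", 1.0),
    ScoredDocument("doc-c", 0.75),
]
print([d.doc_id for d in sort_by_proximity(results)])
# ['doc-b', 'doc-c', 'doc-a']
```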

The unit index term list includes location information, and the location information preferably increases monotonically by a predetermined amount for each index term.

A search system using the proximity of index terms according to an embodiment of the present invention includes: an input unit that receives a query; a query index term extraction unit that extracts a query index term list corresponding to the query; a document search unit that searches for documents using the query index term list; and a proximity determination unit that determines proximity using the index term counts detected by comparing the unit index term list of each retrieved document with the query index term list.

The proximity determination unit preferably further includes: an index term count detection unit that detects the number of index terms in the query index term list (QTC: Query Terms Count), the number of index terms that match between the unit index term list and the query index term list (MTC: Matched Terms Count), the number of index terms in the unit index term list that fall within the query index term boundary (UTC: UTL Matched Terms Count), and the number of index terms within the query index term boundary that are in the unit index term list but not in the query index term list (TTC: Trace Terms Count); and a proximity calculation unit that calculates the proximity using the detected counts.

The search system using the proximity of index terms according to an embodiment of the present invention preferably further includes a location assignment unit that assigns location information to the query index term list based on the unit index term list of the document. When, within the query index term boundary, there are index terms whose location information differs between the unit index term list and the query index term list, the proximity calculation unit preferably detects the number of such index terms (STC: Swapped Term Count) and corrects the proximity by applying a penalty calculated from the STC.

The search system using the proximity of index terms according to an embodiment of the present invention preferably further includes a search result providing unit that sorts the retrieved documents based on the determined proximity and provides them.

The unit index term list includes location information, and the location information preferably increases monotonically by a predetermined amount for each index term.

According to the present invention, a location policy and a proximity formula that are robust to changes in the index term extraction policy are established and applied, which eases development and resolves errors in proximity computation.

In addition, according to the present invention, owing to a location assignment scheme faithful to the concept of index term density underlying the proximity formula, location information can be interpreted unambiguously from the output of the index term extractor alone, and search accuracy can be improved by resolving the proximity computation errors caused by types that are not indexed on their own.

According to the present invention, search accuracy can be improved by determining proximity from the index term counts detected between the query index term list corresponding to a query and the unit index term list extracted by applying the unit index term location scheme to a retrieved document, so that search results close to the query are provided.

In addition, according to the present invention, the error caused by index term swapping is reduced by computing a penalty value and applying it to the proximity when, within the query index term boundary, there are index terms whose location information differs between part of the query index term list and the index term list of the document.

FIG. 1 is an overall configuration diagram of a system for providing accurate search results to a user using the proximity of index terms according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating a search system using the proximity of index terms according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a search method using the proximity of index terms according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating, in comparison with the prior art, the index term counts used to calculate the proximity of index terms according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating the index term counts used to calculate the proximity of index terms when crossing occurs, according to an embodiment of the present invention.

The following detailed description of the invention refers to the accompanying drawings, which show by way of illustration specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the invention, although different from one another, are not necessarily mutually exclusive. For example, a particular shape, structure, or characteristic described here in connection with one embodiment may be implemented in other embodiments without departing from the spirit and scope of the invention. It should also be understood that the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the invention, if properly described, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals refer to the same or similar functions throughout the several aspects.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can easily practice it.

Throughout the specification and claims of the present invention, the term "document" should be construed broadly to include every object of a search performed to provide search results for a query entered by a user. A "document" according to the present invention is not restricted to a particular format, so it is not limited to ordinary text and may include content in various forms such as images, music, and video. Classified by source, a "document" may include general web documents, advertisements, dictionaries, blogs, websites, news, cafes, images, specialized information, books, maps, videos, and the like, but is not limited to the categories listed. "Search results" retrieved from "documents" having such varied sources and formats likewise have varied sources and formats.

Overall system configuration

FIG. 1 schematically shows the overall configuration of a system for providing accurate search results to a user using the proximity of index terms, according to an embodiment of the present invention.

As shown in FIG. 1, in the overall system according to an embodiment of the present invention, a search result providing system 100 that includes a database is connected to a plurality of user terminal devices 300 through a communication network 200.

First, according to an embodiment of the present invention, the search result providing system 100 receives a query from the user terminal device 300, analyzes it to extract a query index term list, searches for documents containing at least one of the index terms in the list, assigns location information to the query index term list based on the location information of the unit index term list of each retrieved document, detects index term counts using the query index term list and the document's unit index term list, calculates proximity from those counts, and provides the search results in order of highest or lowest proximity. The detailed components of the search result providing system 100 and the function of each component are described later.

In addition, according to an embodiment of the present invention, the communication network 200 may be configured regardless of communication mode, whether wired or wireless, and may consist of various networks such as a personal area network (PAN), a local area network (LAN), a metropolitan area network (MAN), and a wide area network (WAN).

Meanwhile, the user terminal device 300 according to an embodiment of the present invention is an input/output device that includes a function for connecting to the search result providing system 100 through the communication network 200 so that the user can be provided with the search results closest to a given keyword. Any digital device equipped with memory and a microprocessor and therefore having computing capability, such as a desktop computer, notebook computer, workstation, palmtop computer, personal digital assistant (PDA), web pad, or mobile communication terminal including a smartphone, may be adopted as the user terminal device 300 according to the present invention. Preferably, a web browser on the user terminal device 300 is run and used to connect to the search result providing system 100, enter a query, and receive search results, but the invention is not necessarily limited to this.

Search result providing system

FIG. 2 shows a search result providing system using the proximity of index terms according to an embodiment of the present invention.

As shown in FIG. 2, the search result providing system 100 using the proximity of index terms may include an input unit 110, a query index term extraction unit 120, a document search unit 130, a proximity determination unit 140, a location assignment unit 150, and a search result providing unit 160. The proximity determination unit 140 may include an index term count detection unit 141 and a proximity calculation unit 142. Although not shown in the drawing, the search result providing system 100 may also include a database storing the search target documents and, for each search target document, a unit index term list including location information.

First, the input unit 110 receives the user's query from the user terminal device 300. The query may consist of a single word, may contain multiple words, or may be entered as a sentence. For example, the input unit 110 may receive the query '정보의 검색' ('search of information').

The query index term extraction unit 120 extracts the query index term list corresponding to the query received through the input unit 110 according to a preset extraction pattern. That is, it can extract the index term list corresponding to the query by referring to a database in which index term lists are stored in advance, and the way the list is extracted may vary with the extraction pattern that has been designated and applied. Examples of extraction patterns include: a pattern that also indexes the form in which a particle is attached to a noun among the morphemes of the query; a pattern that, when a particle is attached to a noun, separates and indexes only the noun; a pattern that splits the query at special characters and indexes each side; a pattern that indexes a combination of digits and letters as is; and a pattern that, when a noun among the morphemes of the query is a compound noun, separates and indexes each component. These patterns are merely examples; existing patterns may be reconfigured, and additional patterns may be created and applied. For example, when the query '정보의 검색' is entered, one extraction pattern may yield a query index term list of three terms, '정보' ('information'), '정보의' ('of information'), and '검색' ('search'), while another may yield a list of two terms, '정보의' and '검색'.
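
How two extraction patterns can yield different query index term lists for the same query can be shown with a toy Python sketch. It is not the extractor described in the source: a real system would use a morphological analyzer, whereas this sketch splits on whitespace and strips a particle from a tiny hard-coded list. `PARTICLES`, `split_particle`, `extract_index_terms`, and the `index_bare_noun` flag are all hypothetical names introduced for illustration.

```python
from typing import List, Tuple

# Hypothetical and far from complete; stands in for real morphological analysis.
PARTICLES = ("의", "을", "를")


def split_particle(token: str) -> Tuple[str, str]:
    """Split a token into (noun, particle); particle is '' if none is found."""
    for particle in PARTICLES:
        if token.endswith(particle) and len(token) > len(particle):
            return token[:-len(particle)], particle
    return token, ""


def extract_index_terms(query: str, index_bare_noun: bool) -> List[str]:
    """Toy extraction: always index the surface token; when index_bare_noun
    is set, also index the noun separated from its particle."""
    terms: List[str] = []
    for token in query.split():
        noun, particle = split_particle(token)
        if particle and index_bare_noun:
            terms.append(noun)
        terms.append(token)
    return terms


print(extract_index_terms("정보의 검색", index_bare_noun=True))
# ['정보', '정보의', '검색']
print(extract_index_terms("정보의 검색", index_bare_noun=False))
# ['정보의', '검색']
```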

The document search unit 130 searches the documents collected in advance by a web crawler for documents that contain at least one of the index terms in the extracted query index term list, i.e., the search target documents. For the documents collected in advance by the web crawler, the location assignment unit 150, described later, may extract and store the index term list corresponding to each document by referring to a database in which index term lists are stored in advance. Location information may be assigned to the document's index term list by the unit index term location scheme, that is, so that it increases monotonically by 1 for each index term in the unit index term list. For example, if extracting index terms from one of the documents collected by the web crawler yields a unit index term list of three terms, '정보', '정보의', and '검색', the assigned location information is as shown in Table 1 below.

[Table 1]
Location information | Unit index term list
<1> | 정보 ('information')
<2> | 정보의 ('of information')
<3> | 검색 ('search')
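The unit index term location scheme shown in the table amounts to numbering the extracted terms from 1 in order. A minimal sketch, assuming the unit index terms are already available as an ordered list (the function name is hypothetical):

```python
def assign_unit_locations(unit_index_terms):
    """Pair each unit index term with a location that starts at 1 and grows by 1."""
    return [(location, term) for location, term in enumerate(unit_index_terms, start=1)]


table_1 = assign_unit_locations(["정보", "정보의", "검색"])
for location, term in table_1:
    print(f"<{location}> {term}")
```

Because the location depends only on the position of the term in the unit index term list, the numbering does not change with the token type that produced the term.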

The proximity determination unit (140) determines the proximity using the index term counts detected by comparing the unit index term list of each search target document retrieved by the document search unit (130) with the query index term list. Proximity is a value (a probability value from 0 to 1) indicating how closely the index terms of the query appear within the search target document, and it can serve as an indicator for improving search accuracy. More specifically, proximity is determined by the density of matching index terms and trace terms within the largest set of index terms that can be extracted inside the query index term boundary. Here, the query index term boundary is the span, found by comparing the unit index term list with the query index term list, from the first position in the unit index term list that matches a query index term to the last such position, inclusive of both ends. A matching index term is a unit index term inside the boundary that matches a query index term, and a trace term is a unit index term inside the boundary that does not match any query index term. A proximity of 1 therefore indicates that the index terms of the query appear in the document fully matched and in the same order. One embodiment of a specific calculation is as follows.

First, the index term count detection unit (141) detects the number of index terms in the query index term list (QTC: Query Terms Count). In the preceding example with the query '정보의 검색', if three index terms are extracted under the applied extraction pattern, QTC is 3. The unit also detects the number of index terms that match between the unit index term list of the search target document and the query index term list (MTC: Matched Terms Count). Continuing the example, if the unit index term list of the search target document is as in Table 1 and the query index term list contains three index terms, MTC is 3. The unit further detects the number of index terms in the unit index term list of the search target document that fall within the query index term boundary (UTC: UTL Matched Terms Count). Continuing the example, the boundary runs from location <1> of the document's unit index term list, which matches the first query index term '정보', to location <3>, which matches the last query index term '검색'. UTC is the number of index terms at locations <1> through <3>, boundary included, and is therefore 3. Finally, the unit detects the number of index terms that, within the query index term boundary, appear in the unit index term list of the search target document but not in the query index term list (TTC: Trace Terms Count). Continuing the example, the two lists coincide within the boundary, so TTC is 0.
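The four counts can be sketched as follows. This is one literal reading of the definitions above, not necessarily the detection unit's actual procedure; it assumes exact string matching between terms, and the function name and the placeholder terms in the second call are hypothetical.

```python
def detect_counts(unit_terms, query_terms):
    """Return (QTC, MTC, UTC, TTC) for one document.

    unit_terms  : the document's unit index terms, in location order.
    query_terms : the query index terms.
    """
    query_set = set(query_terms)
    qtc = len(query_terms)

    # 0-based positions of unit index terms that match any query index term.
    hits = [i for i, term in enumerate(unit_terms) if term in query_set]
    if not hits:
        # No match at all: there is no boundary to count inside.
        return qtc, 0, 0, 0

    # Query index term boundary: first to last matching position, inclusive.
    inside = unit_terms[hits[0] : hits[-1] + 1]
    utc = len(inside)
    mtc = sum(1 for term in inside if term in query_set)
    ttc = sum(1 for term in inside if term not in query_set)
    return qtc, mtc, utc, ttc


# Table 1 document against the three-term query from the running example.
print(detect_counts(["정보", "정보의", "검색"], ["정보", "정보의", "검색"]))

# Hypothetical document with unmatched terms before, inside, and after the boundary.
print(detect_counts(["x", "정보", "y", "검색", "z"], ["정보", "검색"]))
```

The first call reproduces the counts from the running example (QTC 3, MTC 3, UTC 3, TTC 0). In the second, only the unmatched term between the two matches counts as a trace term; terms outside the boundary are ignored.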

The proximity calculation unit (142) may calculate the proximity from the QTC, MTC, UTC, and TTC detected by the index term count detection unit (141) using Equation 1 below.

[Equation 1]
Proximity = (MTC + TTC) / max(QTC, UTC)

Continuing the example, MTC is 3, TTC is 0, QTC is 3, and UTC is 3. Substituting these into the proximity formula gives proximity = (3 + 0) / max(3, 3) = 3/3 = 1, so the index term list of the query appears in the search target document fully matched and in the same order.
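The substitution above can be written as a small function. The formula is the one used in the worked example; the zero-denominator guard and the values in the second call are assumptions added for the sketch.

```python
def proximity(mtc, ttc, qtc, utc):
    """Proximity = (MTC + TTC) / max(QTC, UTC)."""
    denominator = max(qtc, utc)
    if denominator == 0:
        # Assumption: with no query terms and no boundary, report zero proximity.
        return 0.0
    return (mtc + ttc) / denominator


# Running example: MTC = 3, TTC = 0, QTC = 3, UTC = 3.
print(proximity(mtc=3, ttc=0, qtc=3, utc=3))

# Hypothetical case: only two of four query terms are found, adjacent to each other.
print(proximity(mtc=2, ttc=0, qtc=4, utc=2))
```

The first call gives 1.0, as in the text. The second gives 0.5, showing how a query that is only partly matched lowers the value through the max(QTC, UTC) denominator.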

Meanwhile, the index term count detection unit (141) may detect the number of index terms whose location information differs between the document's unit index term list and the query index term list within the query index term boundary (STC: Swapped Terms Count). For example, when the query index term list is A1, B2, C3, D4 and the document's unit index term list is A1, C2, B3, D4, two index terms between the boundary terms A1 and D4 have differing location information, so STC is 2.
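The A, B, C, D example can be checked with a short sketch. It assumes, as in that example, that each list is numbered from 1 in its own order, that every term occurs at most once, and that both lists are already restricted to the query index term boundary; the function name is hypothetical.

```python
def swapped_terms_count(unit_terms, query_terms):
    """Count shared terms whose location differs between the two lists."""
    unit_location = {term: loc for loc, term in enumerate(unit_terms, start=1)}
    query_location = {term: loc for loc, term in enumerate(query_terms, start=1)}
    return sum(
        1
        for term, loc in query_location.items()
        if term in unit_location and unit_location[term] != loc
    )


# Query A1, B2, C3, D4 against document A1, C2, B3, D4.
stc = swapped_terms_count(["A", "C", "B", "D"], ["A", "B", "C", "D"])
print(stc)
```

B and C trade places, so two terms carry different locations and the sketch reports an STC of 2, matching the example.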

Applying the detected STC to the proximity formula to compute a corrected proximity reduces the proximity error that conventionally arises when index terms within the index term list are swapped.

The proximity calculation unit (142) may calculate a proximity that reflects STC using Equations 2 and 3 below.

[Equation 2]
Proximity = ((MTC + TTC) / max(QTC, UTC)) * st_penalty

[Equation 3]
st_penalty = 1 - STC * st_weight

Here, st_weight is a constant supplied as a weight; any value may be substituted to reduce the proximity error.

The location assigning unit (150) assigns location information to an index term list on the basis of the unit index term list, and the location information may increase monotonically for each index term. As noted in the description of the document search unit (130), the location assigning unit (150) may assign locations while index terms are being extracted from documents collected by a web crawler (not shown) or the like, using the unit index term location scheme in which the location increases monotonically by 1 for each unit index term. Alternatively, it may assign locations to the index term list extracted by the query index term extraction unit (120). In that case, each query index term may receive the same location information as the matching index term in the unit index term list of the search target document. Continuing the example, if the index term '정보의' in the document's unit index term list has location <2>, the matching query index term '정보의' may also be assigned location <2>.
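Assigning query locations from the document's unit index term list can be sketched as a lookup. The function name is hypothetical, and two choices are assumptions of the sketch rather than statements from the description: a term that occurs more than once in the document takes the location of its first occurrence, and a query term absent from the document gets `None`.

```python
def assign_query_locations(unit_terms, query_terms):
    """Give each query index term the location of the matching unit index term."""
    unit_location = {}
    for location, term in enumerate(unit_terms, start=1):
        # Assumption: keep the first occurrence when a term repeats.
        unit_location.setdefault(term, location)
    # Assumption: a query term missing from the document maps to None.
    return {term: unit_location.get(term) for term in query_terms}


query_locations = assign_query_locations(["정보", "정보의", "검색"], ["정보의", "검색"])
print(query_locations)
```

For the running example the query term '정보의' receives location 2, the same as in the document, regardless of where it sits in the query's own list.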

Because location information is assigned on the basis of the unit index term list in this way, the units used in the distance calculation of the proximity operation can be unified for stability. It is no longer necessary, when extracting index terms, to consider which morphological analysis token type applies or what the location rule for that type is, and proximity can be computed more accurately by comparing the document's unit index term list with the query index term list. The location information assigned to the query index term list on the basis of the unit index term list can also be checked.

Based on the proximity calculated in this way, the search result providing unit (160) may sort the search target documents and provide the search results to the user through the user terminal device (300).

Search method using the proximity of index terms

FIG. 3 is an operational flowchart illustrating a search method using the proximity of index terms according to an embodiment of the present invention.

The search result providing system (100) using the proximity of index terms may include a database that stores at least one search target document together with, for each such document, a unit index term list including location information. As described below, the system (100) compares the query index term list with the unit index term list of each search target document and can provide search results of improved accuracy according to a proximity determined from the detected index term counts.

Referring to FIG. 3 in detail, in step S101 the input unit (110) of the system receives a query from the user terminal device (300). The query index term extraction unit (120) then extracts, in step S103, the query index term list corresponding to the query. The unit may extract the list according to a preset extraction pattern. Examples of extraction patterns include: a pattern that also indexes the form in which a particle is attached to a noun among the morphemes of the query; a pattern that, when a particle is attached to a noun, separates and indexes only the noun; a pattern that, when a noun among the morphemes of the query is a compound noun, separates and indexes each component; and a pattern that, when another noun can be derived from a morpheme of the query, also indexes that noun.

Next, in step S105, the document search unit (130) searches for search target documents containing at least one index term from the extracted query index term list. The search target documents are already stored in the database, and a unit index term list including location information is stored with each of them. The location information may be assigned by the location assigning unit (150) so that it increases monotonically for each index term.

In step S107, the location assigning unit (150) assigns location information to the query index term list on the basis of the location information in the unit index term list of each search target document retrieved in step S105. Because this approach uses only the result of query index term extraction, location information can be assigned to the query index terms simply and stably even when the indexing scheme changes.

In step S109, the index term count detection unit (141) compares the document's unit index term list with the query index term list and detects the index term counts. Specifically, it detects QTC, the number of index terms in the query index term list; MTC, the number of index terms that match between the document's unit index term list and the query index term list; UTC, the number of index terms in the document's unit index term list that fall within the query index term boundary; and TTC, the number of index terms that, within the boundary, appear in the document's unit index term list but not in the query index term list.

In step S111, the index term count detection unit (141) also determines whether any index term has different location information in the document's unit index term list and the query index term list. If no such index term exists, the proximity calculation unit (142) calculates the proximity by substituting the detected QTC, MTC, UTC, and TTC into Equation 1 (S113). If such an index term exists, the index term count detection unit (141) detects STC, the number of index terms whose location information differs between the two lists (S112). The detected STC is reflected in the proximity calculation, which reduces the proximity error caused by swapping between the document's unit index term list and the query index term list. In this case, the proximity calculation unit (142) substitutes STC into Equation 3 in step S114 to determine the penalty value to be applied to the proximity, and then calculates the proximity by substituting that penalty value and the previously detected QTC, MTC, UTC, and TTC into Equation 2 (S113).
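The branch between the uncorrected and the corrected calculation can be sketched in a few lines. The sketch assumes the counts have already been detected, that QTC is at least 1, and that the penalty has the form used in the FIG. 5 walkthrough (1 - STC * st_weight); the function name and the default weight of 0.1 are illustrative choices.

```python
def proximity_with_swap_correction(qtc, mtc, utc, ttc, stc, st_weight=0.1):
    """Steps S111 to S114 in miniature (assumes qtc >= 1).

    stc == 0 : no differing locations, so the plain proximity is returned.
    stc > 0  : a penalty is derived from STC and applied to the proximity.
    """
    base = (mtc + ttc) / max(qtc, utc)
    if stc == 0:
        return base
    # Penalty form taken from the FIG. 5 walkthrough; st_weight is a tunable constant.
    st_penalty = 1 - stc * st_weight
    return base * st_penalty


# Hypothetical counts: a fully matched four-term query, without and with a swap.
print(proximity_with_swap_correction(qtc=4, mtc=4, utc=4, ttc=0, stc=0))
print(proximity_with_swap_correction(qtc=4, mtc=4, utc=4, ttc=0, stc=2))
```

Keeping the penalty as a separate multiplicative factor means the uncorrected path is unchanged when no swap is detected.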

The search result providing unit (160) then sorts the retrieved search target documents by the calculated proximity and provides them to the user terminal device (300) (S115). The proximity may be a value between 0 and 1, and sorting in descending or ascending order allows the results to be presented in order of closeness to the query.
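The sorting in step S115 can be sketched as below, assuming each retrieved document is represented as a (document id, proximity) pair. The ids and proximity values are made up for illustration.

```python
def rank_by_proximity(scored_documents, descending=True):
    """Sort (document_id, proximity) pairs by proximity."""
    return sorted(scored_documents, key=lambda pair: pair[1], reverse=descending)


# Hypothetical scored documents.
results = rank_by_proximity([("doc-a", 0.8), ("doc-b", 1.0), ("doc-c", 0.5)])
print(results)
```

Descending order puts the document closest to the query first; passing `descending=False` gives the ascending order that the description also allows.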

Performance comparison between the prior art and the present invention

FIGS. 4 and 5 show experimental examples that compare the unit index term location scheme and the proximity formula with the prior art, confirming that a better proximity can be calculated than with the prior art. First, suppose the search target document contains '초경량 비행기 1만여회 비행' ('ultralight aircraft, more than 10,000 flights'). The right side of FIG. 4A shows the index term list extracted and assigned locations by the prior-art morphological analysis token type, and the left side of FIG. 4A shows the index term list under the unit index term location scheme of the present invention, in which the location increases monotonically by 1 per unit index term. If the query "초경량 비행기 1만여회 비행" is input and the extraction patterns include one that does not index other nouns derivable from the morphemes of the query even when such nouns exist, the query index term extraction and location assignment results for the prior art and the present invention are as shown on the right and left sides of FIG. 4B, respectively. Computing the counts needed for the proximity: MTC, the number of index terms that match between the document's index term list and the query index term list, is 4 for both the prior art and the present invention, and QTC, the number of index terms in the query index term list, is also 4 for both. UTC is 6 under the prior art and 8 under the present invention. Finally, under the present invention TTC covers the index terms at locations <2>, <3>, <5>, and <6>, so its value is 4.

Calculating proximity from these values, the prior-art distance proximity is 4/6 = 0.67. That is, although the index term list of the search target document contains every index term of the query, the prior-art calculation fails to reflect this. By contrast, substituting the values into Equation 1 gives, for the present invention, proximity = (4 + 4) / max(4, 8) = 8/8 = 1, so the proximity is calculated accurately.

Next, the correction of the proximity error caused by swapping is described with reference to FIG. 5. FIG. 5(a) shows the unit index term list and location information of the search target document, and FIG. 5(b) shows the index term list of the query. Here QTC is 4 and MTC is also 4. UTC is 4 and TTC is 0, and because a swap has occurred, STC is 2. In the ordinary case the proximity follows Equation 1, giving proximity = (4 + 0) / max(4, 4) = 1, but this value does not reflect the drop in proximity caused by the swap. To correct the error, assuming st_weight is 0.1, Equation 3 yields st_penalty = 1 - 2 * 0.1 = 0.8, and Equation 2 then yields the corrected proximity = 1 * 0.8 = 0.8. Thus, although the query index term list is contained in the index term list of the search target document, the order does not fully match because of the swap, and the proximity comes out as 0.8 rather than 1.

The examples of FIGS. 4 and 5 thus show that the proximity calculation of the present invention improves the accuracy of the proximity compared with the prior art.

Embodiments of the present invention may be implemented as program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

As described above, the present invention has been explained with reference to specific matters such as particular components, limited embodiments, and drawings, but these are provided only to aid a more general understanding of the invention. The invention is not limited to the above embodiments, and those of ordinary skill in the art to which the invention pertains can make various modifications and variations from this description.

Therefore, the spirit of the present invention should not be confined to the described embodiments, and not only the claims below but also everything equal or equivalent to those claims falls within the scope of the spirit of the invention.

100: search result providing system using the proximity of index terms
110: input unit 120: query index term extraction unit
130: document search unit 140: proximity determination unit
141: index term count detection unit 142: proximity calculation unit
150: location assigning unit 160: search result providing unit

Claims

A search method using the proximity of index terms, comprising:
receiving a query;
extracting a query index term list corresponding to the query;
retrieving documents using the query index term list; and
determining a proximity by comparing the unit index term list of each retrieved document with the query index term list.

The method according to claim 1,
wherein the proximity is determined by the density of matching index terms and trace terms within the largest set of index terms that can be extracted inside the query index term boundary.

The method according to claim 1,
wherein determining the proximity comprises:
detecting a query terms count (QTC), the number of index terms in the query index term list;
detecting a matched terms count (MTC), the number of index terms that match between the unit index term list and the query index term list;
detecting a UTL matched terms count (UTC), the number of index terms in the unit index term list that fall within the query index term boundary;
detecting a trace terms count (TTC), the number of index terms that, within the query index term boundary, appear in the unit index term list but not in the query index term list; and
calculating the proximity using the detected counts.

The method according to claim 3,
wherein calculating the proximity substitutes the detected counts into the following equation:
Proximity = (MTC + TTC) / max(QTC, UTC)

The method according to claim 3,
wherein determining the proximity further comprises
assigning location information to the query index term list on the basis of the unit index term list of the document.

The method according to claim 5,
wherein determining the proximity further comprises correcting the proximity by applying a penalty when, within the query index term boundary, an index term has different location information in the unit index term list and the query index term list.

The method according to claim 6,
wherein the penalty is calculated from the number of index terms (STC) whose location information differs between the unit index term list and the query index term list within the query index term boundary.

The method according to claim 1,
further comprising, after determining the proximity,
sorting the retrieved documents on the basis of the determined proximity.

The method according to claim 8,
wherein the documents are sorted in descending or ascending order of proximity.

The method according to claim 1,
wherein the unit index term list includes location information, and the location information increases monotonically by a predetermined amount for each index term.

A search system using the proximity of index terms, comprising:
an input unit that receives a query;
a query index term extraction unit that extracts a query index term list corresponding to the query;
a document search unit that retrieves documents using the query index term list; and
a proximity determination unit that determines a proximity using the index term counts detected by comparing the unit index term list of each retrieved document with the query index term list.

The system according to claim 11,
wherein the proximity determination unit comprises:
an index term count detection unit that detects a query terms count (QTC), the number of index terms in the query index term list; a matched terms count (MTC), the number of index terms that match between the unit index term list and the query index term list; a UTL matched terms count (UTC), the number of index terms in the unit index term list that fall within the query index term boundary; and a trace terms count (TTC), the number of index terms that, within the query index term boundary, appear in the unit index term list but not in the query index term list; and
a proximity calculation unit that calculates the proximity using the detected counts.

The system according to claim 12,
further comprising a location assigning unit that assigns location information to the query index term list on the basis of the unit index term list of the document.

The system according to claim 13,
wherein the proximity calculation unit,
when an index term within the query index term boundary has different location information in the unit index term list and the query index term list, corrects the proximity by applying a penalty calculated from the detected number of such index terms (STC).

The system according to claim 11,
further comprising a search result providing unit that sorts and provides the retrieved documents on the basis of the determined proximity.

The system according to claim 11,
wherein the unit index term list includes location information, and the location information increases monotonically by a predetermined amount for each index term.

A computer-readable recording medium having stored thereon a program for performing the method according to any one of claims 1 to 10.