KR100407081B1

KR100407081B1 - Document retrieval and classification method and apparatus

Info

Publication number: KR100407081B1
Application number: KR10-2000-0049164A
Authority: KR
Inventors: 노구치나오히코; 칸노유지; 사토미츠히로; 이토하야시; 후쿠시게요시오; 이나바미츠아키
Original assignee: 마쯔시다덴기산교 가부시키가이샤
Priority date: 2000-08-24
Filing date: 2000-08-24
Publication date: 2003-11-28
Also published as: KR20020016056A

Abstract

In a document retrieval and classification system, a retrieval operation is performed on a database of documents according to search conditions entered by a user, in order to pick up the intended documents. In response to the retrieved documents, the user is allowed to enter classification standards for a plurality of categories. The classification standards are converted into search conditions. The similarity between each retrieved document and the converted search conditions obtained from the classification standards is calculated. The attribute of each retrieved document with respect to each category is then calculated with reference to the similarity, and each retrieved document is classified into the category for which it has the highest attribute.

Description

Document retrieval and classification method and apparatus

The present invention relates to a document retrieval and classification method for retrieving intended documents from a database storing a set of electronic document data. The present invention also relates to a document retrieval and classification system for carrying out this method.

The present invention is applicable to various kinds of document information stored in databases, such as memory devices installed in personal computers, office computers, word processors, and the like, as well as information storage media loadable into them.

Recent developments in data communications, including e-mail, electronic catalogs, and electronic publications, provide vast amounts of document information that users can access and use. The number of Internet users is also increasing significantly. Accordingly, there is a growing demand for retrieving or collecting intended documents from huge databases, and at the same time a growing demand for classifying the documents that have been picked up.

However, in conventional document retrieval and classification systems, the search conditions and classification standards are typically fixed in advance or set according to the user's preferences. In this respect, conventional systems have been fixed in terms of the viewpoints of their search conditions and classification standards.

It is an object of the present invention to improve flexibility in document retrieval and classification.

Another object of the present invention is to allow users to arbitrarily change the viewpoints of the search conditions and classification standards.

Yet another object of the present invention is to enable users to perform retrieval and classification operations according to their immediate judgment in response to a search result.

A further object of the present invention is to realize automatic classification that assists the intellectual activities of users.

To achieve the above and other related objects, the present invention provides a first document retrieval and classification system that includes input/output means for allowing a user to input search conditions and classification standards. Search means is provided for performing a search operation on a database of documents according to search conditions composed of arbitrary words or arbitrary character strings, and for calculating the similarity between the search conditions and the documents picked up by the search operation. Search result storage means is provided for storing the documents picked up by the search operation. Classification standard conversion means is provided for converting classification standards into search conditions, where each classification standard is expressed as a set of arbitrary words or arbitrary character strings. Search result classification means is provided for classifying the documents picked up by the search operation according to a plurality of classification standards.

Thus, the present invention can provide flexible document retrieval and classification that assists the user's intellectual activities during document retrieval and classification.

According to preferred embodiments of the present invention, the search means responds to search conditions entered by the user through the input/output means and performs a search operation on the database of documents according to those conditions. The search result storage means stores the documents picked up by the search operation. The classification standard conversion means responds to a plurality of classification standards entered by the user through the input/output means and generates converted search conditions from them. The search means calculates the similarity between the converted search conditions and the retrieved documents stored in the search result storage means. The search result classification means then calculates, according to the similarity calculated by the search means, the attribute of each retrieved document with respect to each classification standard, and classifies the documents accordingly.

With this arrangement, users can freely enter search conditions whenever they have words or character strings in mind during the search operation. Users can also freely classify the search results they obtain.

According to a preferred embodiment of the present invention, the input/output means allows the user to input a plurality of classification standards, each composed of a set of arbitrary words or arbitrary character strings, and the classification standard conversion means converts each such set into search conditions.

With this arrangement, users can enter any words or character strings they have in mind as classification standards (i.e., viewpoints of classification). Great flexibility is therefore available when setting the viewpoints of classification.

According to preferred embodiments of the present invention, the document retrieval and classification system further includes keyword retrieval means for extracting keywords from an arbitrary sentence or document. In this case, the keyword retrieval means responds to a plurality of classification standards, each expressed as an arbitrary sentence entered by the user through the input/output means, and extracts keywords from the input sentences. The classification standard conversion means then converts each set of extracted keywords into search conditions.

With this arrangement, users can directly enter any sentence belonging to the intended field as a classification standard. This makes it possible to express complex viewpoints of classification, so the viewpoints can be set flexibly in many respects.

According to preferred embodiments of the present invention, the input/output means allows the user to designate a plurality of documents that serve as the plurality of classification standards. The designated documents are selected from the documents picked up by the search operation. The keyword retrieval means extracts keywords from each designated document, and the classification standard conversion means converts each set of extracted keywords into search conditions.

With this arrangement, after checking the documents picked up by the search operation, users can select the retrieved documents themselves, or parts of them, as expressing the viewpoints of classification. The viewpoints of classification can therefore be set easily.

The present invention also provides a second document retrieval and classification system that includes input/output means for allowing a user to input search conditions. Search means is provided for performing a search operation on a database of documents according to search conditions composed of arbitrary words or arbitrary character strings, and for calculating the similarity between the search conditions and the documents picked up by the search operation. Search result storage means is provided for storing the documents picked up by the search operation. Keyword retrieval means is provided for extracting keywords from the documents picked up by the search operation. Automatic keyword classification means is provided for automatically classifying the extracted keywords into a plurality of clusters. Classification standard conversion means is provided for converting classification standards into search conditions, where each classification standard is the set of keywords classified into one cluster. Search result classification means is provided for classifying the set of documents picked up by the search operation according to the classification standards.

Thus, the present invention can provide an automatic document retrieval and classification system that assists the user's intellectual activities during document retrieval and classification.

According to preferred embodiments of the present invention, the search means responds to search conditions entered by the user through the input/output means and performs a search operation on the database of documents according to those conditions. The search result storage means stores the documents picked up by the search operation. The keyword retrieval means extracts keywords from the retrieved documents. The automatic keyword classification means classifies the extracted keywords into a plurality of clusters. The classification standard conversion means generates converted search conditions from the classification standards, each of which is the set of keywords classified into one cluster. The search means calculates the similarity between the converted search conditions and the retrieved documents stored in the search result storage means. The search result classification means then calculates, with reference to the similarity calculated by the search means, the attribute of each retrieved document with respect to each classification standard, and thereby performs automatic classification.

With this arrangement, the viewpoints for classifying the search result can be extracted automatically, without relying on the user to input classification standards. Users can automatically obtain viewpoints of classification that had not occurred to them. As a result, the document classification task is assisted efficiently.

The present invention also provides a first document retrieval and classification method, which includes the steps of: performing a search operation on a database of documents according to search conditions entered by a user, in order to pick up the intended documents; allowing the user to enter classification standards for a plurality of categories in response to the documents picked up by the search operation; converting the classification standards into search conditions; calculating the similarity between the retrieved documents and the converted search conditions obtained from the classification standards; and calculating, with reference to the similarity, the attribute of each retrieved document with respect to each category, thereby classifying each retrieved document into the category for which it has the highest attribute.

With this method, users can freely enter search conditions whenever they have words in mind during the search operation. Users can also freely classify the search results they obtain. The present invention can thus assist intellectual activity during document retrieval and classification.

According to a preferred embodiment of the present invention, when the user enters a set of arbitrary words or arbitrary character strings as the classification standard of each category, the entered words or character strings are converted into search conditions, and the similarity between the retrieved documents and the converted search conditions is calculated.

With this method, users can enter any words or character strings they have in mind as classification standards (i.e., viewpoints of classification). Great flexibility is therefore available when setting the viewpoints of classification.

According to preferred embodiments of the present invention, when the user enters an arbitrary sentence serving as the classification standard of each category, keywords are extracted from the sentence and converted into search conditions, and the similarity between the retrieved documents and the converted search conditions is calculated.

With this method, users can directly enter any sentence belonging to the intended field as a classification standard. This makes it possible to express complex viewpoints of classification, so the viewpoints can be set flexibly in many respects.

According to preferred embodiments of the present invention, the user designates a plurality of documents from among the documents picked up by the search operation, and each designated document serves as a classification standard. Keywords are then extracted from the designated documents, and each set of extracted keywords is converted into search conditions. The similarity between the retrieved documents and the converted search conditions is then calculated.

With this method, after checking the documents picked up by the search operation, users can select the retrieved documents themselves, or parts of them, as expressing the viewpoints of classification. The viewpoints of classification can therefore be set easily.

The present invention also provides a second document retrieval and classification method, which includes the steps of: performing a search operation on a database of documents according to search conditions entered by a user, in order to pick up the intended documents; extracting keywords from the documents picked up by the search operation; classifying the extracted keywords into a plurality of clusters; calculating the similarity between the retrieved documents and converted search conditions obtained from the extracted keywords; and calculating, with reference to the similarity, the attribute of each retrieved document with respect to each category, thereby classifying each retrieved document into the category for which it has the highest attribute.

With this method, the viewpoints for classifying the search result can be extracted automatically, without relying on the user to input classification standards. Users can automatically obtain viewpoints of classification that had not occurred to them, and no special effort is required. As a result, the document classification task is assisted efficiently.

FIG. 1 is a block diagram showing the schematic structure of a document retrieval and classification system according to a first embodiment of the present invention.

FIG. 2 is a diagram showing a search result obtained by the document retrieval and classification system according to the first embodiment of the present invention.

FIG. 3 is a diagram showing search results based on the classification standards according to the first embodiment of the present invention.

FIG. 4 is a diagram showing an example of the attributes according to the first embodiment of the present invention.

FIG. 5 is a diagram showing a document classification result according to the first embodiment of the present invention.

FIG. 6 is a block diagram showing the schematic structure of a document retrieval and classification system according to a second embodiment of the present invention.

FIG. 7 is a block diagram showing the schematic structure of a document retrieval and classification system according to a third embodiment of the present invention.

* Description of the reference numerals for the main parts of the drawings *

21: input/output unit

22: classification standard conversion unit

23: search unit

24: document storage unit

25: search result storage unit

26: search result classification unit

The above and other objects, features, and advantages of the present invention will become more apparent from the following detailed description, read in conjunction with the accompanying drawings. Preferred embodiments of the present invention are described below with reference to the accompanying drawings. Identical reference numerals denote identical parts throughout the drawings. The embodiments of the present invention are based on documents written in Japanese. The following description therefore includes kanji or katakana followed by the English meaning in parentheses.

First embodiment

FIG. 1 is a functional block diagram showing the schematic arrangement of a document retrieval and classification system that implements the document retrieval and classification method according to the first embodiment of the present invention.

In the document retrieval and classification system shown in FIG. 1, the input/output unit 21 allows users to input search conditions and classification standards, and also outputs search results and classification results. The search unit 23 calculates the similarity between the retrieved documents and the search conditions. The search result storage unit 25 stores search results, such as the retrieved documents. The classification standard conversion unit 22 receives the classification standards supplied from the input/output unit 21 and converts them into search conditions that the search unit 23 can process. The search result classification unit 26 classifies the retrieved documents according to the classification standards, with reference to the similarity calculated by the search unit 23.

The document retrieval and classification processing according to the first embodiment is described in detail below.

First, the user inputs search conditions into the input/output unit 21. For example, the following logical (Boolean) expression (1) can be given as the search conditions.

(米 OR コメ OR 政策) -------- (1)

Here, 米 is the kanji for rice, コメ is the katakana for rice, and 政策 is the kanji for policy.

The search unit 23 searches the documents stored in the document storage unit 24 based on the search conditions. The search unit 23 can perform a search based on search conditions composed of arbitrary words or arbitrary character strings. The search unit 23 can also calculate the similarity between a search result and the search conditions.

A search unit of this kind may include a full-text search unit capable of retrieving all documents in which a specified word occurs, as disclosed in Japanese Patent Laid-Open No. 9-319766.

The similarity between the search conditions and a retrieved document (i.e., a search result) Dj can be expressed by the following equation.

$$S(D_j) = \sum_i f_{ij} \left(1 - \log \frac{d_i}{N}\right)$$

Here, Σ denotes the sum over the variable "i", "fij" represents the frequency (or degree) of occurrence of each word "ti" in the document Dj, "di" represents the number of documents in which the word "ti" appears, and "N" represents the total number of documents searched.

The above equation takes the sum of the similarities for the individual words "ti" included in the search conditions.

This is generally referred to as similarity calculation based on the inner-product measure, with word weighting based on the "TFIDF" method.

Now, assume that in a particular retrieved document Dj the search words of the current search conditions have the following frequencies of occurrence. That is, "fij" is given as follows.

米 (rice, kanji): 3

コメ (rice, katakana): 2

政策 (policy): 1

On the other hand, among all documents stored in the document storage unit 24, the number of documents containing each search word is as follows. That is, "di" is given as follows.

米 (rice, kanji): 5000

コメ (rice, katakana): 1250

政策 (policy): 2500

When N = 10000, the similarity S(Dj) of Dj is calculated as follows (the logarithm is taken to base 2, which yields the values shown).

$$\begin{aligned} S(D_j) &= 3 \times \left(1 - \log_2 \tfrac{5000}{10000}\right) + 2 \times \left(1 - \log_2 \tfrac{1250}{10000}\right) + 1 \times \left(1 - \log_2 \tfrac{2500}{10000}\right) \\ &= 6 + 8 + 3 = 17 \end{aligned}$$
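The worked calculation above can be checked with a short script. This is a minimal sketch, not the patent's implementation: the function name `similarity` and the dictionary layout are illustrative assumptions, and a base-2 logarithm is assumed because it is the base that reproduces the stated terms 6, 8, and 3.

```python
import math


def similarity(term_freqs, doc_freqs, total_docs):
    """Sum f_ij * (1 - log2(d_i / N)) over the search words.

    term_freqs: occurrences of each search word in document Dj (f_ij)
    doc_freqs:  number of documents containing each search word (d_i)
    total_docs: total number of documents (N)
    The base-2 logarithm is an assumption inferred from the worked example.
    """
    score = 0.0
    for term, f_ij in term_freqs.items():
        d_i = doc_freqs[term]
        score += f_ij * (1 - math.log2(d_i / total_docs))
    return score


# Values taken from the example in the text.
term_freqs = {"米": 3, "コメ": 2, "政策": 1}
doc_freqs = {"米": 5000, "コメ": 1250, "政策": 2500}

score = similarity(term_freqs, doc_freqs, total_docs=10000)
print(score)
```

Rare words contribute more per occurrence: コメ appears in only 1250 of the 10000 documents, so each occurrence is weighted 4, whereas 米 appears in half of the documents and is weighted 2.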

FIG. 2 shows the details of the search result, in which the document number of each retrieved document is shown together with its content and similarity. According to FIG. 2, a total of ten documents are picked up by the search conditions described above and are ranked in order of similarity. The similarity of each document is normalized so that the maximum value is 100. The search result is stored in the search result storage unit 25, and the user can view it through the input/output unit 21.
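The normalization and ranking just described might be sketched as follows. The function names and the raw scores are hypothetical and are not the values of FIG. 2; the sketch only assumes that normalization means scaling every score so that the largest becomes 100.

```python
def normalize_scores(raw_scores):
    """Scale similarities so that the largest value becomes 100.

    raw_scores maps a document id to its raw similarity and is assumed
    to be non-empty. If no score is positive, every document gets 0.
    """
    top = max(raw_scores.values())
    if top <= 0:
        return {doc_id: 0.0 for doc_id in raw_scores}
    return {doc_id: 100.0 * s / top for doc_id, s in raw_scores.items()}


def rank(scores):
    """Return (document id, score) pairs, highest similarity first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw similarities for three documents.
raw_scores = {"doc_a": 8.5, "doc_b": 17.0, "doc_c": 4.25}
normalized = normalize_scores(raw_scores)
ranking = rank(normalized)
for doc_id, value in ranking:
    print(doc_id, value)
```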

After viewing the search result, the user can either run a new search or classify the search result just obtained.

When the user wishes to classify the search result shown in FIG. 2, the user can input a plurality of classification standards through the input/output unit 21. In the first embodiment, the classification standards are words representing the viewpoints of classification. For example, the user can input the following words as classification standards through the input/output unit 21.

Classification standard 1: コメ (rice, katakana), 米價 (rice price), 新食糧法 (new Staple Food Control Act)

Classification standard 2: 北朝鮮 (North Korea), 中國 (China), 米朝協議 (U.S.-North Korea talks)

Classification standard 3: 米國 (U.S.), 米軍 (U.S. forces)

The classification standard conversion unit 22 converts the input classification standards into search conditions that the search unit 23 can process.

예를 들면, 분류 표준들로서 입력된 단어들을 적절하게 조합함으로써 논리식을 만들고, 가장 최근의 검색 조건들의 역할을 하는 상기 논리식 1에 결과적인 논리식을 연결하기 위해 AND를 사용하는 것이 바람직하다. 다음은 변환된 검색 조건들의 예들을 나타낸다.For example, it is desirable to use AND to link the resulting logic to Formula 1 above, which makes a logical expression by properly combining the words entered as classification standards, and serves as the most recent search conditions. The following shows examples of converted search conditions.

Search condition 1: (コメ OR 米價) AND (米 OR コメ OR 政策)

Search condition 2: (北朝鮮 OR 中國) AND (米 OR コメ OR 政策)

Search condition 3: (米國 OR 米軍) AND (米 OR コメ OR 政策)

Joining the most recent search conditions (i.e., logical expression 1) to the classification standards with AND is preferable because it narrows the set of documents to be searched.
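The conversion described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `to_search_condition` is hypothetical, the words of one classification standard are simply joined with OR, and logical expression 1 is assumed to be the expression that appears on the right-hand side of search conditions 1 to 3.

```python
def to_search_condition(standard_words, base_condition):
    """OR the words of one classification standard, then AND with the base query.

    Hypothetical helper: the source only says the words are "appropriately
    combined"; plain OR is an assumption made for this sketch.
    """
    if not standard_words:
        raise ValueError("a classification standard needs at least one word")
    ored = " OR ".join(standard_words)
    return f"({ored}) AND ({base_condition})"


# Logical expression 1, as it appears inside search conditions 1 to 3.
base = "米 OR コメ OR 政策"

# Words used in the converted examples (a subset of the entered words).
standards = {
    1: ["コメ", "米價"],
    2: ["北朝鮮", "中國"],
    3: ["米國", "米軍"],
}

conditions = {j: to_search_condition(words, base) for j, words in standards.items()}
```

Each entry of `conditions` is a query string that a Boolean retrieval engine could evaluate against the document database.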

Next, the retrieval unit 23 performs a search according to search conditions 1 to 3 described above and obtains the search results shown in FIG. 3.

As shown in FIG. 3, the search results obtained with search conditions 1 to 3 are subsets of the entire set of retrieved documents shown in FIG. 2. In FIG. 3, the similarity attached to each retrieved document is the similarity value calculated for the corresponding search condition. Now, let S(i,j) denote the similarity of document "i" to search condition (i.e., classification standard) "j".

Next, the search result classification unit 26 calculates the attribute T(i,j) of document "i" with respect to classification standard "j". For example, the following equation is used to calculate T(i,j).

$$T(i,j) = C \cdot S(i,j) + (1 - C) \cdot 100 \cdot \frac{S(i,j)}{\sum_{k} S(i,k)} \tag{2}$$

where $\sum_{k}$ denotes the sum over the variable "k", and C is a constant in the range 0 < C < 1.

Equation (2) is merely one example of how to obtain the attribute T(i,j); the method of calculating T(i,j) is not limited to equation (2).
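As a worked illustration of equation (2), the sketch below computes the attributes of one document from its similarities to every classification. The function name `attribute` and the sample similarity values are hypothetical (they are not taken from FIG. 3), and returning zeros when a document matches no classification is an assumption the source does not address.

```python
def attribute(similarities, c=0.5):
    """Return T(i, j) for every classification j of one document i.

    similarities: list of S(i, k), one value per classification k.
    c: the constant C of equation (2), expected in the range 0 < C < 1.
    """
    total = sum(similarities)
    if total == 0:
        # Assumption: a document matching no classification gets zero everywhere.
        return [0.0 for _ in similarities]
    return [c * s + (1 - c) * 100 * (s / total) for s in similarities]


# Hypothetical similarities of one document to classifications 1 to 3.
example = attribute([80, 0, 20], c=0.5)
```

The first term rewards a high absolute similarity, while the second rewards a high share of the document's total similarity; C balances the two.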

FIG. 4 shows the attribute T(i,j) calculated according to equation (2) for documents 1 to 10 and classifications 1 to 3, with C = 0.5.

The search result classification unit 26 uses the following equation (3) to identify, for each document "i", the classification with the highest attribute T(i,j).

$$c(i) = \max_{j} \{ T(i,j) \} \tag{3}$$

Here, "max" is taken over the variable "j", and c(i) identifies the classification "j" at which T(i,j) is highest.

Finally, the search result classification unit 26 outputs the result that document "i" belongs to classification c(i). This result is displayed or reported to the user through the input/output unit 21.
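The selection step of equation (3) can be sketched as below. The function name `classify`, the document identifiers, and the attribute values are hypothetical; ties are resolved in favor of the classification listed first, which is a choice made for this sketch rather than something the source specifies.

```python
def classify(attributes):
    """Map each document to the classification with the highest attribute.

    attributes: {doc_id: {class_id: T(i, j)}}
    returns:    {doc_id: class_id}
    """
    return {doc: max(scores, key=scores.get) for doc, scores in attributes.items()}


# Hypothetical attribute values for two documents and three classifications.
attributes = {
    "doc_a": {1: 80.0, 2: 0.0, 3: 20.0},
    "doc_b": {1: 10.0, 2: 55.0, 3: 35.0},
}
result = classify(attributes)
```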

FIG. 5 shows an output example of the final classification based on the example shown in FIG. 4.

As described above, the search result shown in FIG. 2 (i.e., the entire set of retrieved documents) can be classified into a plurality of subsets according to classification standards 1 to 3, which are words entered by the user.

In the above example, the kanji "米" is one of the search terms given in logical expression 1. However, the Japanese character "米" has several meanings. Among the retrieved documents shown in FIG. 2, some documents (doc. #6, #10) contain "米" in the sense of "rice", while others (doc. #3, #4, #5, #7, #9) contain "米" in the sense of the U.S. (i.e., the United States). By entering appropriate classification standards, the user can nevertheless classify these documents into different classifications.

In addition, when entering search conditions or classification standards, the user can freely choose search words without paying special attention to compound words such as "新食糧法" and "米朝協議", each of which consists of several single words.

Furthermore, after a classification has been completed according to the given classification standards, the user can designate the subset corresponding to one classification standard as a new target for a further classification (i.e., a fine classification).

As described above, the first embodiment of the present invention provides a flexible document retrieval and classification method and system. In the first embodiment, a retrieval operation is performed on a database of documents according to search conditions entered by the user, to pick up the intended documents. The user is allowed to enter classification standards for a plurality of classifications in response to the documents picked up by the retrieval operation. The classification standards are converted into search conditions. The similarity between the retrieved documents and the converted search conditions is calculated. Then the attribute of each retrieved document with respect to each classification is calculated with reference to the similarity, and each retrieved document is classified into the classification for which its attribute is highest.

According to the first embodiment of the present invention, users can freely enter search conditions whenever words come to mind during the retrieval operation. The user can also freely classify the intended search result. The first embodiment can therefore support intellectual activities during document retrieval and classification.

Also according to the first embodiment, when the user enters a set of arbitrary words or arbitrary character strings as the classification standard of each classification, the entered words or character strings are converted into search conditions, and the similarity between the retrieved documents and the converted search conditions is calculated.

This allows users to enter any words or character strings they have in mind as classification standards (i.e., viewpoints of classification). Great flexibility is therefore available when setting the viewpoints of classification.

Second embodiment

The second embodiment provides a document retrieval and classification system characterized in that the classification standards are selected from sentences representing viewpoints of classification.

FIG. 6 is a functional block diagram showing the schematic arrangement of a document retrieval and classification system according to the second embodiment of the present invention.

In the document retrieval and classification system shown in FIG. 6, the input/output unit 11 allows the user to input search conditions and classification standards, and outputs search results and classification results. The document storage unit 15 stores documents. The retrieval unit 14 calculates the similarity between the retrieved documents and the search conditions. The search result storage unit 16 stores search results, such as the retrieved documents. The keyword detection unit 12 receives a sentence representing viewpoints of classification, entered by the user through the input/output unit 11, and detects keywords in the received sentence. The detected keywords serve as classification standards. The classification standard conversion unit 13 receives the keywords (i.e., classification standards) sent from the keyword detection unit 12 and converts them into search conditions that the retrieval unit 14 can process. The search result classification unit 17 classifies the retrieved documents according to the classification standards, with reference to the similarity calculated by the retrieval unit 14.

The document retrieval and classification processing of the second embodiment is described in detail below.

First, the user inputs search conditions through the input/output unit 11. As in the first embodiment, assume that the search conditions defined by logical expression 1 are entered, so that the search result shown in FIG. 2 is obtained.

If the user wants to classify the search result shown in FIG. 2, the user can input a plurality of classification standards through the input/output unit 11. In the second embodiment, the classification standards are sentences representing viewpoints of classification, reference numbers identifying retrieved documents, or essential portions of the retrieved documents.

For example, the user may input the following sentences as classification standards through the input/output unit 11.

Classification standard 4: コメ市場や政府の米價政策について (Regarding the rice market and the government's rice price policy)

Classification standard 5: 北朝鮮や中國などに對する米國の對應 (Attitude of the U.S. toward North Korea and China)

Classification standard 6: 韓國や日本における米軍問題 (Problems of U.S. forces in Korea and Japan)

In response to the input of these sentences, the document retrieval and classification system of the second embodiment executes the following processing.

The keyword detection unit 12 extracts the words appearing in each sentence by morphological analysis using a dictionary database (not shown), and selects the keywords (i.e., essential words) of each sentence.

To select the keywords of each sentence, it is preferable to check in advance the frequency of occurrence of each word across all documents stored in the document storage unit 15, and then to select the keywords based on a word weighting method such as "TFIDF". Such a word weighting method is described, for example, in Umino, "Word Weighting Principle Based on Frequency-of-occurrence Information", Library and Information Science, No. 26 (1988).
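A TF-IDF style keyword selection of this kind might look like the sketch below. It assumes the sentence has already been split into words (the morphological analysis itself is not shown), uses the common weight TF × log(n/DF), and skips words that appear in no stored document; the function name `select_keywords`, the `top_k` cutoff, and the toy English corpus are all hypothetical.

```python
import math
from collections import Counter


def select_keywords(sentence_words, corpus, top_k=3):
    """Rank the words of one sentence by TF x log(n / DF) and keep the best.

    sentence_words: words extracted from the user's sentence.
    corpus: list of documents, each a list of words.
    """
    n = len(corpus)
    scores = {}
    for word, tf in Counter(sentence_words).items():
        df = sum(1 for doc in corpus if word in doc)
        if df == 0:
            continue  # assumption: ignore words unseen in the stored documents
        scores[word] = tf * math.log(n / df)
    # Highest weight first; alphabetical order breaks ties deterministically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Toy corpus: "of" occurs everywhere, so its weight is log(4/4) = 0.
corpus = [
    ["rice", "market", "of"],
    ["rice", "of"],
    ["policy", "of"],
    ["of", "army"],
]
keywords = select_keywords(["rice", "market", "of", "policy"], corpus, top_k=2)
```

Words that occur in every document receive zero weight, which is why function words drop out of the keyword list.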

As another method of selecting keywords, it is preferable for Japanese documents to take into account differences in character type, such as katakana, hiragana, and kanji, when extracting words (i.e., character strings) from sentences. This is effective for detecting compound words or new words that are not registered in the dictionary.
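One simple way to exploit character types is to cut the sentence wherever the script changes and keep the katakana and kanji runs as keyword candidates. The sketch below is only an assumption about how such a step could work; the function names are hypothetical, and the Unicode block ranges used are the standard Hiragana, Katakana, and CJK Unified Ideographs blocks.

```python
def char_type(ch):
    """Classify one character by its Unicode block."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def split_by_char_type(text):
    """Split text into (type, run) pairs, cutting where the character type changes."""
    runs = []
    for ch in text:
        kind = char_type(ch)
        if runs and runs[-1][0] == kind:
            runs[-1][1] += ch
        else:
            runs.append([kind, ch])
    return [(kind, run) for kind, run in runs]


sentence = "コメ市場や政府の米價政策について"  # classification standard 4
candidates = [run for kind, run in split_by_char_type(sentence)
              if kind in ("katakana", "kanji")]
```

Because a compound such as 米價政策 is a single kanji run, it is kept whole even if no dictionary lists it.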

Of course, it is also preferable to combine the above two methods as appropriate.

In the second embodiment, the keywords of each sentence are selected with reference to the dictionary database. Assume now that the following words are extracted from classification standards 4 to 6 described above.

Classification standard 4': コメ (rice), 市場 (market), 政府 (government), 米價政策 (rice price policy)

Classification standard 5': 北朝鮮 (North Korea), 中國 (China), 米國 (U.S.)

Classification standard 6': 韓國 (Korea), 日本 (Japan), 米軍 (U.S. forces), 問題 (problem)

The classification standard conversion unit 13 then converts classification standards 4' to 6' into search conditions that the retrieval unit 14 can process, in the same manner as the processing performed by the classification standard conversion unit 22 of the first embodiment.

Alternatively, after viewing the search result shown in FIG. 2, the user can designate the document numbers of retrieved documents as classification standards in the following manner.

Classification standard 7: 1, 2

Classification standard 8: 4, 5

Classification standard 9: 9

In response to the input of these reference numbers, the document retrieval and classification system of the second embodiment executes the following processing.

The keyword detection unit 12 reads from the document storage unit 15 the texts of the documents designated by the reference numbers (i.e., the classification standards), and extracts the keywords contained in the designated documents.

The keywords can be extracted in the same manner as in the example described above, in which keywords are extracted from sentences. Alternatively, it is preferable to extract the keywords from each document in advance and to store them in the document storage unit 15 together with the corresponding documents. In this case, the keyword detection unit 12 reads the keywords stored in the document storage unit 15 by referring to the reference numbers of the designated documents (i.e., the classification standards).

Assume now that the following words are extracted from classification standards 7 to 9 described above.

Classification standard 7': コメ (rice), 備蓄 (stock), 食糧 (food), 米價 (rice price), 農協 (agricultural cooperative association), 生産 (production), 農家 (farmer), 稻作 (rice crop), 消費者 (consumer), 米 (rice)

Classification standard 8': 北朝鮮 (North Korea), 會談 (conference), 韓國 (Korea), 協議 (talks), 米 (U.S.), 米韓 (U.S.-Korea), 問題 (problem), 南北 (north-south), 朝鮮半島 (Korean Peninsula), 米軍 (U.S. forces)

Classification standard 9': 沖繩 (Okinawa), 米國 (U.S.), 連邦 (federation), 調査 (investigation), 返歸 (return), 公文書 (official document), 材料 (material), 假處分 (provisional disposition), 地裁 (district court), 決定 (decision)

The classification standard conversion unit 13 then converts classification standards 7' to 9' into search conditions that the retrieval unit 14 can process, in the same manner as the processing performed by the classification standard conversion unit 22 of the first embodiment.

After the classification standards have been converted into search conditions, the document retrieval and classification system of the second embodiment executes the same processing as that disclosed in the first embodiment.

As described above, the search result shown in FIG. 2 (i.e., the entire set of retrieved documents) can be classified into a plurality of subsets according to classification standards 4 to 6 (or 7 to 9), which the user enters as sentences representing viewpoints of classification, as reference numbers identifying retrieved documents, or as essential portions of the retrieved documents. The user can thus perform classification in various ways. For example, the user can flexibly and selectively use complex viewpoints and simple viewpoints when classifying the retrieved documents.

As described above, according to the second embodiment of the present invention, when the user enters an arbitrary sentence serving as the classification standard of each classification, keywords are extracted from the sentence, the set of extracted keywords is converted into search conditions, and the similarity between the retrieved documents and the converted search conditions is calculated.

According to the second embodiment, users can directly enter any sentence belonging to the intended field as a classification standard. This makes it possible to express complex viewpoints of classification, so the viewpoints of classification can be set flexibly in many respects.

Also according to the second embodiment, the user designates a plurality of documents among the documents picked up by the retrieval operation, and the designated documents serve as the classification standard of each classification. Keywords are then extracted from the designated documents, the set of extracted keywords is converted into search conditions, and the similarity between the retrieved documents and the converted search conditions is calculated.

After checking the documents picked up by the retrieval operation, users can select the retrieved documents themselves, or portions of them, to express the viewpoints of classification. The viewpoints of classification can therefore be set easily.

Third embodiment

The third embodiment provides a document retrieval and classification system characterized in that the classification standards are determined automatically and the retrieved documents are classified automatically.

FIG. 7 is a functional block diagram showing the schematic arrangement of a document retrieval and classification system according to the third embodiment of the present invention.

In the document retrieval and classification system shown in FIG. 7, the input/output unit 71 allows users to input search conditions and classification standards, and outputs search results and classification results. The document storage unit 76 stores documents. The retrieval unit 75 calculates the similarity between the retrieved documents and the search conditions. The search result storage unit 77 stores search results, such as the retrieved documents. The keyword detection unit 72 detects keywords in the retrieved documents stored in the search result storage unit 77. The automatic keyword classification unit 73 classifies the set of detected keywords into a plurality of clusters. The classified keywords of each cluster serve as classification standards. The classification standard conversion unit 74 receives the keywords (i.e., classification standards) sent from the automatic keyword classification unit 73 and converts them into search conditions that the retrieval unit 75 can process. The search result classification unit 78 classifies the retrieved documents according to the classification standards, with reference to the similarity calculated by the retrieval unit 75.

The document retrieval and classification processing of the third embodiment is described in detail below.

First, the user inputs search conditions through the input/output unit 71. As in the first embodiment, assume that the search conditions defined by logical expression 1 are entered, so that the search result shown in FIG. 2 is obtained.

The third embodiment differs from the first and second embodiments described above in that the classification standards are determined automatically, without relying on the user to input them.

The automatic classification of the third embodiment is described in detail below. First, the keyword detection unit 72 detects the keywords of each retrieved document stored in the search result storage unit 77. The details of keyword extraction are given in the second embodiment described above. It is also possible to use the keyword extraction method disclosed in Japanese Patent Laid-Open No. 9-176822.

Next, the automatic keyword classification unit 73 classifies the set of detected keywords into a plurality of subsets. The following method is used for automatic keyword classification.

Assume now that the document storage unit 76 stores a total of n documents D1 to Dn, in which a total of m words W1 to Wm appear.

In this case, the following n-dimensional vector Vj can be introduced for each word Wj.

$$V_j = (e_1, e_2, e_3, \ldots, e_n)$$

The following equation (4) gives each vector component ei (i = 1, ..., n).

$$e_i = TF_i(W_j) \times \log\left(\frac{n}{DF(W_j)}\right) \tag{4}$$

Here, TFi(Wj) is the frequency of occurrence of word Wj in document Di, and DF(Wj) is the number of documents in which word Wj appears.

The vector Vj is preferably normalized so that its length is 1.

In this way, vectors V1 to Vm are obtained for the m words, respectively.
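Equation (4) and the subsequent normalization can be sketched as follows. The function name `word_vector` and the toy documents are hypothetical, and returning a zero vector for a word that never occurs (or that occurs in every document, where the logarithm is zero) is an assumption of this sketch.

```python
import math


def word_vector(word, docs):
    """Return the unit-length n-dimensional vector Vj for one word.

    docs: list of documents D1..Dn, each a list of words.
    Component i is TF_i(word) * log(n / DF(word)), as in equation (4).
    """
    n = len(docs)
    df = sum(1 for doc in docs if word in doc)
    if df == 0:
        return [0.0] * n  # assumption: unseen words map to the zero vector
    idf = math.log(n / df)
    vec = [doc.count(word) * idf for doc in docs]
    norm = math.sqrt(sum(e * e for e in vec))
    return [e / norm for e in vec] if norm > 0 else vec


docs = [["a", "a", "b"], ["a", "c"], ["c"]]
v_a = word_vector("a", docs)
```

For the word "a" above, the unnormalized vector is (2·log 1.5, log 1.5, 0), so after normalization the components are proportional to (2, 1, 0).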

Next, a plurality of word groups G1 to Gp are considered. Each word group consists of specific words that appear frequently in documents of a particular field. Each word group can be created manually, or created automatically by using a dictionary or the distribution of word occurrences in a large document collection.

In this case, the following n-dimensional vector VGk can be introduced for each word group Gk.

$$VG_k = (e'_1, e'_2, e'_3, \ldots, e'_n)$$

The following equation (5) gives each vector component e'i (i = 1, ..., n).

$$e'_i = TF_i(G_k) \times \log\left(\frac{n}{DF(G_k)}\right) \tag{5}$$

Here, TFi(Gk) is the frequency of occurrence in document Di of the words belonging to group Gk, and DF(Gk) is the number of documents in which any word belonging to group Gk appears.

The vector VGk is preferably normalized so that its length is 1.

In this way, vectors VG1 to VGp are obtained for the p word groups, respectively.

The similarity between each word Wj and each word group Gk is obtained as the inner product of vector Vj and vector VGk.
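The group vector of equation (5) and the inner-product similarity can be sketched together, treating a single word as a group of one so that the same helper serves both cases. The names `unit_tfidf_vector` and `similarity`, the toy documents, and the example word groups are all hypothetical.

```python
import math


def unit_tfidf_vector(terms, docs):
    """Unit-length TF x log(n / DF) vector for a word group (or a single word).

    TF_i counts occurrences in document i of any word in `terms`;
    DF counts documents containing at least one word of `terms`.
    """
    terms = set(terms)
    n = len(docs)
    tf = [sum(1 for tok in doc if tok in terms) for doc in docs]
    df = sum(1 for count in tf if count > 0)
    if df == 0:
        return [0.0] * n  # assumption: groups with no occurrence map to zero
    idf = math.log(n / df)
    vec = [count * idf for count in tf]
    norm = math.sqrt(sum(e * e for e in vec))
    return [e / norm for e in vec] if norm > 0 else vec


def similarity(v, vg):
    """Inner product of a word vector and a word-group vector."""
    return sum(a * b for a, b in zip(v, vg))


docs = [["gasoline", "engine"], ["engine", "fuel"], ["url", "www"], ["airport"]]
v_gasoline = unit_tfidf_vector(["gasoline"], docs)
vg_engine = unit_tfidf_vector(["gasoline", "fuel"], docs)    # toy "engine" group
vg_internet = unit_tfidf_vector(["url", "www"], docs)        # toy "Internet" group
sim_engine = similarity(v_gasoline, vg_engine)
sim_internet = similarity(v_gasoline, vg_internet)
```

Since both vectors have length 1, the inner product equals their cosine, so a word scores high for a group whose words occur in the same documents.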

With these vectors and the similarity calculation, automatic classification of keywords can be realized easily. For example, assume that there are three word groups (G1, G2, G3) of words frequently used in the following fields.

G1: Internal combustion engines for automobiles

G2: Aviation accidents

G3: Internet

Suppose the retrieval unit 75 retrieves documents related to "engine", and the keyword detection unit 72 then extracts the following keywords.

ガソリン (gasoline), 事故 (accident), WWW, 燃費 (fuel consumption), 檢索 (retrieval), 爆發 (explosion), 空港 (airport), URL

The similarity of each word to the word groups (G1, G2, G3) is calculated as follows.

S(ガソリン (gasoline)) = (0.8, 0.0, 0.2)

S(事故 (accident)) = (0.2, 0.6, 0.3)

S(WWW) = (0.1, 0.2, 0.8)

S(燃費 (fuel consumption)) = (0.7, 0.1, 0.2)

S(檢索 (retrieval)) = (0.0, 0.2, 0.6)

S(爆發 (explosion)) = (0.4, 0.6, 0.1)

S(空港 (airport)) = (0.0, 0.9, 0.2)

S(URL) = (0.1, 0.0, 0.9)

Each keyword is then regarded as belonging to the word group to which its similarity is highest. All of the extracted keywords are thus classified into the word groups (G1, G2, G3) as follows.

G1: ガソリン (gasoline), 燃費 (fuel consumption)

G2: 事故 (accident), 爆發 (explosion), 空港 (airport)

G3: WWW, 檢索 (retrieval), URL
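The grouping above can be reproduced from the listed similarity triples with a few lines of code. The similarity values are those given in the example; the variable names and the tie-breaking rule (first group wins) are choices made for this sketch.

```python
# Similarity of each keyword to (G1, G2, G3), as listed in the example.
S = {
    "ガソリン": (0.8, 0.0, 0.2),
    "事故": (0.2, 0.6, 0.3),
    "WWW": (0.1, 0.2, 0.8),
    "燃費": (0.7, 0.1, 0.2),
    "檢索": (0.0, 0.2, 0.6),
    "爆發": (0.4, 0.6, 0.1),
    "空港": (0.0, 0.9, 0.2),
    "URL": (0.1, 0.0, 0.9),
}
GROUPS = ("G1", "G2", "G3")

# Assign every keyword to the group with which its similarity is highest.
assignment = {name: [] for name in GROUPS}
for word, sims in S.items():
    best = max(range(len(GROUPS)), key=lambda g: sims[g])
    assignment[GROUPS[best]].append(word)
```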

The keyword groups thus obtained by the automatic keyword classification unit 73 are input to the classification standard conversion unit 74.

When the number of word groups G is large (e.g., 100), or when the number of keyword groups serving as classification standards needs to be reduced (e.g., to 2), the automatic keyword classification unit 73 operates as follows.

Step 1: For each word group G, sum the weights of the keywords classified into it, and regard the sum as the score of that word group.

Step 2: Select a predetermined number of groups in descending order of score.

In the above example:

Score of G1: 0.8 + 0.7 = 1.5

Score of G2: 0.6 + 0.6 + 0.9 = 2.1

Score of G3: 0.8 + 0.6 + 0.9 = 2.3

Therefore, when the number of keyword groups needs to be reduced to 2, the automatic keyword classification unit 73 selects groups G2 and G3 on the basis of the score of each word group.
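Steps 1 and 2 can be sketched as follows, using the similarity of each keyword to its assigned group as its weight, which is how the scores in the example are formed. The function names `group_scores` and `select_top_groups` are hypothetical.

```python
# Similarity of each keyword to (G1, G2, G3), as listed in the example.
S = {
    "ガソリン": (0.8, 0.0, 0.2),
    "事故": (0.2, 0.6, 0.3),
    "WWW": (0.1, 0.2, 0.8),
    "燃費": (0.7, 0.1, 0.2),
    "檢索": (0.0, 0.2, 0.6),
    "爆發": (0.4, 0.6, 0.1),
    "空港": (0.0, 0.9, 0.2),
    "URL": (0.1, 0.0, 0.9),
}
GROUPS = ("G1", "G2", "G3")


def group_scores(similarities):
    """Step 1: add each keyword's weight to the score of its best group."""
    scores = {name: 0.0 for name in GROUPS}
    for sims in similarities.values():
        best = max(range(len(GROUPS)), key=lambda g: sims[g])
        scores[GROUPS[best]] += sims[best]
    return scores


def select_top_groups(scores, count):
    """Step 2: keep the `count` groups with the highest scores."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)[:count]


scores = group_scores(S)
selected = select_top_groups(scores, 2)
```

With the example values, `selected` contains G3 and G2, the two groups chosen in the text.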

When the automatic keyword classification unit 73 performs the processing described above, the set of keywords extracted from the retrieved documents is automatically classified into a plurality of groups. In the above example, the following classification standards are obtained.

Classification standard 10: ガソリン (gasoline), 燃費 (fuel consumption)

Classification standard 11: 事故 (accident), 暴發 (explosion), 空港 (airport)

Classification standard 12: WWW, 檢索 (retrieval), URL

Thereafter, the classification standard conversion section 74 converts classification standards 10 to 12 into search conditions that can be processed by the retrieval section 75, in the same manner as the processing performed by the classification standard conversion section 22 of the first embodiment.

After the classification standards have been converted into search conditions, the document retrieval and classification system of the third embodiment executes the same processing as that disclosed in the first embodiment.

As described above, the third embodiment automatically determines the fields of the words that appear frequently in the retrieved documents. The detected fields are treated as classification standards. It therefore becomes possible to classify documents according to the nature of the search result. In other words, the third embodiment provides simplified document classification.

The keyword groups obtained by the automatic keyword classification section 73 may first be presented to the user via the input/output section 71. The user can modify or correct the presented keyword groups. The classification standard conversion section 74 then converts the modified keyword groups (that is, the modified classification standards) into search conditions. This classification processing makes the user aware of classification viewpoints that the user had not considered. As a result, the third embodiment efficiently supports the document classification work.

As described above, the third embodiment of the present invention provides an automatic document retrieval and classification method and system. A retrieval operation is performed on a database of documents according to the search conditions entered by the user, to pick up the intended retrieved documents. Keywords are extracted from the retrieved documents picked up by the retrieval operation. The extracted keywords are classified into a plurality of clusters. The set of extracted keywords belonging to each cluster is converted into search conditions. The similarity between the retrieved documents picked up by the retrieval operation and the converted search conditions obtained from the extracted keywords is calculated. Then, for each category, the attribute of each retrieved document picked up by the retrieval operation is calculated with reference to the similarity, whereby each retrieved document is classified into the category having the highest attribute.
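The final steps of this summary (conversion of keyword clusters into search conditions, similarity calculation, and assignment to the category with the highest attribute) can be sketched as follows. This is an illustration only. The conversion and the similarity measure used here are assumptions, since this passage does not define them: `to_search_condition` simply reuses the keyword set, `similarity` is the fraction of condition keywords found in the document text, and the attribute is taken to be equal to the similarity. All function names and the sample data are hypothetical.

```python
from typing import Dict, Set

# Hypothetical keyword clusters (English glosses of the example keywords).
CLUSTERS: Dict[str, Set[str]] = {
    "standard 10": {"gasoline", "fuel consumption"},
    "standard 11": {"accident", "explosion", "airport"},
    "standard 12": {"WWW", "retrieval", "URL"},
}

# Hypothetical retrieved documents.
DOCS: Dict[str, str] = {
    "d1": "Gasoline prices affect fuel consumption.",
    "d2": "An explosion at the airport was ruled an accident.",
    "d3": "Enter the URL to start retrieval on the WWW.",
}


def to_search_condition(keywords: Set[str]) -> Set[str]:
    """Assumed conversion: the search condition is the lowercased keyword set."""
    return {kw.lower() for kw in keywords}


def similarity(text: str, condition: Set[str]) -> float:
    """Assumed similarity: fraction of condition keywords occurring in the text."""
    if not condition:
        return 0.0
    lowered = text.lower()
    return sum(1 for kw in condition if kw in lowered) / len(condition)


def classify(docs: Dict[str, str], clusters: Dict[str, Set[str]]) -> Dict[str, str]:
    """Assign each document to the category with the highest attribute.

    Assumption: the attribute equals the similarity. On a tie, max()
    returns the first category in iteration order.
    """
    conditions = {name: to_search_condition(kws) for name, kws in clusters.items()}
    result: Dict[str, str] = {}
    for doc_id, text in docs.items():
        attributes = {name: similarity(text, cond) for name, cond in conditions.items()}
        result[doc_id] = max(attributes, key=lambda name: attributes[name])
    return result


if __name__ == "__main__":
    for doc_id, category in classify(DOCS, CLUSTERS).items():
        print(doc_id, "->", category)
```

A real system would use the similarity computed by the retrieval section rather than substring matching. The sketch only shows how the per-category attributes drive the final assignment.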

According to the third embodiment of the present invention, classification viewpoints of the search result can be extracted automatically, without relying on the user to enter classification standards. Users automatically obtain classification viewpoints that they had not thought of. As a result, the document classification work can be supported efficiently.

Various modifications and changes can be made to the present invention without departing from its spirit and scope. The embodiments described are provided for illustration only and are not limiting. The spirit of the invention is defined by the appended claims. Many changes are possible without departing from the spirit and scope of the invention.

Claims

(Deleted)

A document retrieval and classification system comprising:

input/output means for allowing a user to enter search conditions,

retrieval means for performing a retrieval operation on a database of documents according to search conditions composed of arbitrary words or arbitrary character strings, and for calculating the similarity between the retrieved documents picked up by the retrieval operation and the search conditions,

search result storage means for storing the retrieved documents picked up by the retrieval operation,

keyword detecting means for extracting keywords from the retrieved documents picked up by the retrieval operation,

automatic keyword classification means for automatically classifying the extracted keywords into a plurality of clusters,

classification standard conversion means for converting classification standards into search conditions, wherein each of the classification standards is a set of keywords classified into a respective cluster, and

search result classification means for classifying the set of retrieved documents picked up by the retrieval operation according to the classification standards.

The system of claim 6, wherein

the retrieval means is responsive to the search conditions entered by the user via the input/output means, and executes a retrieval operation on a database of documents in accordance with the search conditions entered by the user,

the search result storage means stores the retrieved documents picked up by the retrieval operation of the retrieval means,

the keyword detecting means extracts keywords from the retrieved documents picked up by the retrieval operation,

the automatic keyword classification means automatically classifies the extracted keywords into a plurality of clusters,

the classification standard conversion means generates converted search conditions from the classification standards, each of which is a set of keywords classified into a respective cluster,

the retrieval means calculates a similarity between the retrieved documents picked up by the retrieval operation and stored in the search result storage means and the converted search conditions, and

the search result classification means calculates an attribute of each retrieved document picked up by the retrieval operation for each classification standard with reference to the similarity calculated by the retrieval means, thereby performing document classification.

(Deleted)

A document retrieval and classification method comprising:

performing a retrieval operation on a database of documents according to the search conditions entered by the user to pick up the intended retrieved documents,

extracting keywords from the retrieved documents picked up by the retrieval operation,

classifying the extracted keywords into a plurality of clusters,

converting the set of extracted keywords belonging to each cluster into search conditions,

calculating a similarity between the retrieved documents picked up by the retrieval operation and the converted search conditions obtained from the extracted keywords, and

calculating an attribute of each retrieved document picked up by the retrieval operation for each category with reference to the similarity, thereby classifying each retrieved document into the category having the highest attribute.