KR20080017686A

KR20080017686A - Method for extracting subject and sorting document of searching engine, computer readable record medium on which program for executing method is recorded

Info

Publication number: KR20080017686A
Application number: KR1020060079177A
Authority: KR
Inventors: 홍석후; 최원종; 이상호; 이길재; 장성민; 정영신; 유현애
Original assignee: 에스케이커뮤니케이션즈 주식회사
Priority date: 2006-08-22
Filing date: 2006-08-22
Publication date: 2008-02-27
Also published as: KR101249183B1

Abstract

A method for extracting subjects and sorting documents in a search engine, and a computer-readable recording medium storing a program for executing the method, are provided. The method lets a user reach desired information conveniently and quickly by selecting unstructured, diverse subjects that manual classification does not cover, classifying target documents under each subject, and determining whether a retrieved document suits the subject. For the keywords included in target documents, an association degree representing how often the keywords are selected together is measured. A convergence association degree between the word set related to a given keyword and the word sets related to other keywords is measured. The keyword is selected as a subject when the convergence association degree is higher than a specific value. A naive Bayesian probability is calculated by performing naive Bayesian training on each keyword included in training documents and the target documents. A vector size of each keyword in the training and target documents is calculated. A distance between the vector sizes of each keyword in the training and target documents is calculated. A similarity for each keyword is calculated by multiplying the naive Bayesian probability by the distance. A ranking value is calculated by processing the similarities of the keywords included in the target document.

Description

METHOD FOR EXTRACTING SUBJECT AND SORTING DOCUMENT OF SEARCHING ENGINE, COMPUTER READABLE RECORD MEDIUM ON WHICH PROGRAM FOR EXECUTING METHOD IS RECORDED

FIG. 1 is a block diagram of a search engine according to the present invention;

FIG. 2 shows an embodiment of electronic documents related to specific subjects and the keywords of each electronic document;

FIG. 3 shows the association degrees and relationships among the keywords of Document A, Document B, and Document C;

FIG. 4 is a graph showing the relationship between keyword distance and frequency; and

FIG. 5 is a flowchart of the subject generation process and the document classification process of the search engine of FIG. 1.

* Explanation of reference numerals for the main parts of the drawings *

10: search unit 20: subject generation module

21: keyword storage unit 23: association degree calculation unit

25: convergence association degree calculation unit 30: document classification module

31: naive Bayesian unit 33: keyword vector calculation unit

35: distance calculation unit 37: ranking calculation unit

The present invention relates to a subject generation and document classification method for a search engine, and more particularly to a subject generation and document classification method for a search engine that can select unstructured and diverse subjects and classify classification target documents according to how qualitatively similar they are to a subject.

The commercialization of the Internet has made it possible for many users to search easily and quickly for information on a wide variety of subjects. However, finding the best information a user wants within this vast amount of information still takes time and effort.

Recently, search engines have introduced various search methods so that users can quickly find the information they want. One of these is relevance ranking. A relevance ranking expresses, as a relative score, how strongly a retrieved electronic document relates to the search term specified by the user, compared with other electronic documents in a given network. Relevance ranking methods include evaluating the number of times a given search term appears in an electronic document and the position of the search term within the document, web page structure analysis, and key word enumeration.

A relevance ranking provided in this way indicates only as a probabilistic figure whether the electronic document shown in the search results matches the subject, and that figure is derived from factors such as the frequency of the search term entered by the user. It is therefore difficult to use as a criterion for judging whether an electronic document actually matches the subject the user wants.

Meanwhile, the subjects used to classify searchable documents are generally selected and defined manually by the developers or operators of the search engine, so only a limited number of subjects can be classified. However, electronic documents created by an unspecified number of users of various social groups, occupations, and educational backgrounds inevitably cover a very wide range of subjects. Manual subject classification as currently practiced therefore cannot cover every subject, and electronic documents in areas for which no subject has been defined are assigned to ambiguous subjects. As a result, such a document may be retrieved under an entirely unrelated subject, or may not be retrieved for the users who want it. In other words, ambiguity in classification lowers the reliability of search results.

Accordingly, a method is needed for generating unstructured and wide-ranging subjects so that electronic documents can be classified accurately across diverse subjects. A method is also needed for improving search reliability by classifying electronic documents according to how well they fit a subject and showing, as a ranking, how closely each classified document matches the subject the user wants.

Accordingly, an object of the present invention is to provide a subject generation and document classification method for a search engine that can generate unstructured and wide-ranging subjects.

Another object of the present invention is to provide a subject generation and document classification method for a search engine that can improve search reliability by classifying electronic documents according to how well they fit a subject.

To achieve these objects, the present invention comprises: measuring, for at least one keyword included in classification target documents, an association degree indicating the extent to which the keywords are selected together; measuring a convergence association degree, which is the degree of association between the set of words associated with a given keyword and the set of words associated with another keyword; and selecting the keyword as a subject if the convergence association degree is greater than or equal to a preset value.

The association degree may be calculated by Equation 1.

[Equation 1]

$$R(w_1, w_2) = \frac{1}{2}\left(\frac{LinkCount(w_1, w_2)}{WordCount(w_1)} + \frac{LinkCount(w_1, w_2)}{WordCount(w_2)}\right)$$

Here, LinkCount(w₁, w₂) is the number of times keywords w₁ and w₂ are tagged together, WordCount(w₁) is the total number of times w₁ is used as a keyword, and WordCount(w₂) is the total number of times w₂ is used as a keyword.

The convergence association degree is calculated by Equation 2:

[Equation 2]

Here, if the association degree between w₁ and w_i and the association degree between w₂ and w_i are both greater than or equal to a preset minimum association degree, the value is preferably assigned to PositiveVal(w₁, w₂, w_i); otherwise, it is assigned to NegativeVal(w₁, w₂, w_i).

PositiveVal(w₁, w₂, w_i) and NegativeVal(w₁, w₂, w_i) may be calculated by Equations 3 and 4, respectively.

[Equation 3]

[Equation 4]

The convergence association degree preferably approaches 1 as the keywords become more similar and approaches 0 as they become more different.

The method may include calculating a naive Bayesian probability by performing naive Bayesian learning on each keyword included in a training document, which serves as the basis for classification, and in the classification target documents.

The method may include calculating a vector size of each keyword included in the training document and the classification target document.

Calculating the vector size of each keyword may include calculating a weight for each keyword by multiplying the number of times the keyword appears in the document by its inverse document frequency, and dividing the weight of each keyword by the sum of the weights of all words in the training document or classification target document that contains the keyword.

The method may include calculating a keyword distance, which is the difference between the vector size of each keyword in the training document and the vector size of that keyword in the classification target document.

The keyword distance preferably takes a higher value the more similar the frequency with which the keyword is used in the training document and in the classification target document.

The method may include: calculating a similarity value for each keyword by multiplying the naive Bayesian probability value (Nb) by the distance (DP) for that keyword; deriving a single ranking determination value from the similarity values of the keywords included in the classification target document through a predetermined process; and calculating a ranking of each classification target document using its ranking determination value and classifying the documents by subject.

To achieve the above objects, the present invention also provides a computer-readable recording medium storing a program capable of performing the method.

Hereinafter, the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a search engine according to the present invention.

The search engine can be installed and used on a variety of computing systems, such as server computers, personal computers, handheld devices, laptop devices, and programmable electronic devices.

The search engine includes a search unit 10, a subject generation module 20, and a document classification module 30. It automatically generates subjects, classifies the classification target documents related to each subject, and calculates a ranking.

The search unit 10 retrieves the electronic documents of all web pages accessible from the computing system on which the search engine is installed, and the retrieved documents are provided to the subject generation module 20, where subjects are generated. Each electronic document contains tagging keywords (hereinafter, keywords) that a user attaches to the document in order to classify and store it.

The subject generation module 20 uses the keywords included in the electronic documents to automatically generate subjects that are of interest to many users from electronic documents that cover unstructured and wide-ranging subjects.

The subject generation module 20 includes a keyword storage unit 21 that stores keywords, an association degree calculation unit 23 that calculates the association degree between keywords, and a convergence association degree calculation unit 25 that calculates the degree of association between the sets associated with given keywords.

The keyword storage unit 21 stores each keyword included in the electronic documents together with the number of times the keyword appears in the electronic documents. All keywords stored in the keyword storage unit 21 form the pool of subject candidates.

Because each keyword is entered directly by a user, the keywords may include slang, adult terms, typos, and the like, which are inappropriate as keywords. Because of such inappropriate keywords, it is undesirable to select subjects on the basis of appearance count alone.

The association degree calculation unit 23 determines how closely the keywords included in the electronic documents are related to one another, which prevents inappropriate keywords from being selected as subjects. For example, a document about 'baseball' is tagged with baseball-related terms as keywords. FIG. 2 shows, as one embodiment, electronic documents related to specific subjects and the keywords of each document.

For Document A, 'Park Chan-ho baseball', 'baseball', and 'Park Chan-ho Kim Byung-hun' are entered as keywords; for Document B, 'baseball Lee Jong-beom', 'baseball Japan', and 'Japan Lee Jong-beom'; and for Document C, 'Japan anime', 'Japan manga', and 'anime manga'. Documents A and B share the subject 'baseball', so common keywords are used frequently and the association between the two documents is high. Documents A and C, in contrast, have the different subjects 'baseball' and 'Japanese manga', so they share no keywords and the association degree between their keywords is low. Keywords such as slang, adult terms, and typos do not belong to any particular subject, so their association degree with other keywords is low. Selecting subjects using both the appearance count of each keyword and the association degree between keywords therefore prevents inappropriate terms or keywords from being selected as subjects.

The association degree calculation unit 23 calculates R(w₁, w₂), the association degree between keywords w₁ and w₂, using Equation 1 below.

[Equation 1]

$$R(w_1, w_2) = \frac{1}{2}\left(\frac{LinkCount(w_1, w_2)}{WordCount(w_1)} + \frac{LinkCount(w_1, w_2)}{WordCount(w_2)}\right)$$

Here, LinkCount(w₁, w₂) is the number of times keywords w₁ and w₂ are tagged together, WordCount(w₁) is the total number of times w₁ is used as a keyword, and WordCount(w₂) is the total number of times w₂ is used as a keyword.

The keywords included in the documents of FIG. 2 are {Park Chan-ho, baseball, Kim Byung-hun, Lee Jong-beom, Japan, anime, manga}, and their appearance counts are {2, 4, 1, 2, 4, 2, 2}. To calculate the association degree between the keywords 'Park Chan-ho' and 'baseball': the two are tagged together once, 'Park Chan-ho' is used as a keyword 2 times in total, and 'baseball' is used as a keyword 4 times in total.

Therefore, R(Park Chan-ho, baseball) = (1/2 + 1/4) / 2 = 3/8 = 0.375.
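The worked example can be checked with a short script. This is a minimal sketch, assuming that each tag entry of FIG. 2 is treated as a set of words, that WordCount counts the tag entries containing a word, and that LinkCount counts the entries containing both words. The averaging form of R is inferred from the arithmetic (1/2 + 1/4) / 2, and the names `TAG_ENTRIES`, `word_count`, `link_count`, and `association` are illustrative, not taken from the patent.

```python
# Tag entries from FIG. 2 (romanized); each entry is the set of words in one tag.
TAG_ENTRIES = [
    {"Park Chan-ho", "baseball"}, {"baseball"}, {"Park Chan-ho", "Kim Byung-hun"},     # Document A
    {"baseball", "Lee Jong-beom"}, {"baseball", "Japan"}, {"Japan", "Lee Jong-beom"},  # Document B
    {"Japan", "anime"}, {"Japan", "manga"}, {"anime", "manga"},                        # Document C
]


def word_count(word, entries=TAG_ENTRIES):
    """WordCount(w): number of tag entries in which the word is used."""
    return sum(1 for entry in entries if word in entry)


def link_count(w1, w2, entries=TAG_ENTRIES):
    """LinkCount(w1, w2): number of tag entries containing both words."""
    return sum(1 for entry in entries if w1 in entry and w2 in entry)


def association(w1, w2, entries=TAG_ENTRIES):
    """R(w1, w2): mean of the two co-tagging ratios (assumed form of Equation 1)."""
    c1, c2 = word_count(w1, entries), word_count(w2, entries)
    if c1 == 0 or c2 == 0:
        return 0.0  # a word that never occurs has no association
    links = link_count(w1, w2, entries)
    return (links / c1 + links / c2) / 2


print(association("Park Chan-ho", "baseball"))  # 0.375
```

Under the same assumptions, `word_count` reproduces the appearance counts {2, 4, 1, 2, 4, 2, 2} listed for the seven keywords.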

The association degrees between the keywords of Documents A, B, and C, calculated in the same way, are shown in FIG. 3.

As shown, the keyword 'Japan' has one link each with {Lee Jong-beom, anime, manga, baseball}, and the association degrees between 'Japan' and each linked keyword are similar. If the keywords 'Japan' and 'Lee Jong-beom' belong to similar subjects, the set of keywords linked to 'Japan' and the set linked to 'Lee Jong-beom' will share many of the same keywords. If 'Japan' and 'Lee Jong-beom' deal with different subjects, however, each set will have more links to keywords belonging to other sets. The convergence association degree (Convergence Relation) expresses this degree of association between the keyword sets to which given keywords belong, and it is reflected in the association degree to further refine the association between keywords.

The convergence association degree calculation unit 25 calculates CR(w₁, w₂), the convergence association degree between keywords w₁ and w₂, using Equations 2 to 4 below.

[Equation 2]

Here, if the association degree between w₁ and w_i and the association degree between w₂ and w_i are both greater than or equal to a preset minimum association degree, that is, if w_i meets the minimum association degree with both of the keywords it is linked to, the value is placed on the PositiveVal(w₁, w₂, w_i) side. If the association degree of w_i with w₁ or w₂ is below the minimum, the value is placed on the NegativeVal(w₁, w₂, w_i) side.

For example, suppose the minimum association degree for FIG. 3 is set to 0.3, w₁ is 'baseball', and w₂ is 'manga'. The keyword w_i = 'Japan', which is linked to both 'baseball' and 'manga', has an association degree of at least 0.3 with 'manga' but less than 0.3 with 'baseball', so 'Japan' is counted on the NegativeVal(w₁, w₂, w_i) side.
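The partition step in this example can be sketched as follows. The two association values are computed from the FIG. 2 tag counts with the averaging form implied by the worked example for R: R('Japan', 'manga') = (1/4 + 1/2) / 2 = 0.375 and R('Japan', 'baseball') = (1/4 + 1/4) / 2 = 0.25. The function name `side_for` and the dictionary layout are hypothetical. The sketch only decides which side a linked word w_i falls on; it does not compute PositiveVal, NegativeVal, or CR, because Equations 2 to 4 are not reproduced in this text.

```python
MIN_ASSOCIATION = 0.3  # minimum association degree used in the example

# Association degrees R for the pairs needed here (derived from the FIG. 2 counts).
R = {
    frozenset({"Japan", "manga"}): 0.375,
    frozenset({"Japan", "baseball"}): 0.25,
}


def r(a, b):
    """Look up R(a, b); pairs that are never tagged together have association 0."""
    return R.get(frozenset({a, b}), 0.0)


def side_for(w1, w2, wi, minimum=MIN_ASSOCIATION):
    """Return 'positive' if wi meets the minimum with both w1 and w2, else 'negative'."""
    if r(w1, wi) >= minimum and r(w2, wi) >= minimum:
        return "positive"
    return "negative"


print(side_for("baseball", "manga", "Japan"))  # negative
```

Lowering the threshold to 0.25 would move 'Japan' to the positive side, which shows how sensitive the partition is to the chosen minimum.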

Accordingly, if many of the words w_i linked to keywords w₁ and w₂ reach the minimum association degree, the convergence association degree increases; if many of them fall short of the minimum, it decreases. The convergence association degree therefore approaches 1 as the keywords become more similar and approaches 0 as they become more different.

PositiveVal(w₁, w₂, w_i) and NegativeVal(w₁, w₂, w_i) can be calculated by Equations 3 and 4 below, respectively.

[Equation 3]

[Equation 4]

The convergence association degree allows the association between keywords to be judged more accurately, so keywords whose convergence association degree is greater than or equal to a preset value can be selected as subjects.

The document classification module 30 learns training documents that have a selected subject using the naive Bayesian method, a statistical technique, and classifies documents by calculating, for each classification target document (an electronic document to be classified), a search ranking that indicates its association with the subject. The naive Bayesian method is a learning algorithm that explicitly calculates probabilities for hypotheses; it is a simple statistical technique with high classification performance and is frequently used for document classification.

The document classification module 30 includes a naive Bayesian unit 31 that calculates naive Bayesian probabilities, a keyword vector calculation unit 33 that calculates the vector sizes of keywords, a distance calculation unit 35 that calculates the similarity between documents, and a ranking calculation unit 37 that calculates rankings using the naive Bayesian probabilities and distances.

The naive Bayesian unit 31 calculates naive Bayesian probabilities for words, including the keyword selected as the subject, in preselected training documents. The words for which probabilities are calculated may be words extracted in advance in relation to the subject or all words in the training documents; the choice can be made by the designer or user of the search engine.

The naive Bayesian unit 31 learns each of the keywords and words in the training documents using the naive Bayesian learning method and obtains the probability that, when the keywords and words appear, they correspond to the subject. This value is called the naive Bayesian probability (Nb). Naive Bayesian learning is a well-known technique that those skilled in the art can readily implement, so a detailed description is omitted.
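Since the description leaves the learning step to the reader, one possible reading is sketched below: for each word, estimate the probability of the subject given that the word appears, using Bayes' rule with word likelihoods from labeled training documents. This is an assumption, not the patent's stated procedure; the Laplace smoothing, the document-count prior, and the names `word_posteriors` and `docs_by_subject` are all hypothetical choices made for illustration.

```python
from collections import Counter


def word_posteriors(docs_by_subject, subject, smoothing=1.0):
    """Estimate Nb(word) = P(subject | word) for every word seen in training.

    docs_by_subject maps a subject name to a list of tokenized documents.
    Word likelihoods use Laplace smoothing; priors are document proportions.
    """
    vocab = {w for docs in docs_by_subject.values() for doc in docs for w in doc}
    total_docs = sum(len(docs) for docs in docs_by_subject.values())

    joint = {}  # joint[s][w] = P(w | s) * P(s)
    for s, docs in docs_by_subject.items():
        counts = Counter(w for doc in docs for w in doc)
        n_words = sum(counts.values())
        prior = len(docs) / total_docs
        joint[s] = {
            w: prior * (counts[w] + smoothing) / (n_words + smoothing * len(vocab))
            for w in vocab
        }

    # Normalize over subjects to obtain the posterior for the requested subject.
    return {w: joint[subject][w] / sum(joint[s][w] for s in joint) for w in vocab}


training = {
    "baseball": [["baseball", "pitcher", "Japan"], ["baseball", "league"]],
    "manga": [["manga", "anime", "Japan"]],
}
nb = word_posteriors(training, "baseball")
print(sorted(nb, key=nb.get, reverse=True)[:3])
```

In this toy data, a word that occurs only in baseball documents gets a higher Nb for the subject 'baseball' than a word shared with manga documents, which is the behavior the description relies on.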

The keyword vector calculation unit 33 calculates the vector sizes of the keywords and words using the weights of the learned keywords and words. The weight of a keyword or word is calculated as the product of its in-document appearance count (Term Frequency) and its Inverse Document Frequency. The inverse document frequency serves to discount words that are used repeatedly, such as postpositions and articles: it is calculated so that words appearing in many documents receive lower weight.

The keyword vector calculation unit 33 calculates the vector size of each keyword and word from these weights: it divides the calculated weight of the keyword by the sum of the weights of all words in the document.
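The weighting and normalization described above can be sketched as follows. The text does not give the exact inverse document frequency formula, so the smoothed logarithmic form used here is an assumption, as are the names `vector_sizes`, `doc`, and `corpus`.

```python
import math
from collections import Counter


def vector_sizes(doc, corpus):
    """Return {word: weight / sum of all weights} for one tokenized document.

    weight = term frequency in `doc` * inverse document frequency over `corpus`.
    """
    n_docs = len(corpus)
    term_freq = Counter(doc)
    weights = {}
    for word, freq in term_freq.items():
        doc_freq = sum(1 for d in corpus if word in d)
        # Smoothed IDF (assumed form): words found in many documents get lower weight.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        weights[word] = freq * idf
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


corpus = [["baseball", "the", "pitcher"], ["manga", "the"], ["anime", "the"]]
sizes = vector_sizes(corpus[0], corpus)
print(sizes["the"] < sizes["baseball"])  # True
```

Because each weight is divided by the document's total weight, the vector sizes of one document sum to 1, so values from a training document and a classification target document are on a comparable scale.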

In the same manner, the keyword vector calculation unit 33 calculates vector sizes for the keywords and preselected words in the classification target document.

For the same keyword, the distance calculation unit 35 obtains the keyword distance (DP), which is the difference between the vector size of the keyword in the training document and its vector size in the classification target document, using Equation 5 below.

[Equation 5]

Here, α is the vector size of the keyword in the training document and β is the vector size of the keyword in the classification target document. When the two vector sizes are similar, the keyword distance (DP) approaches α. That is, the keyword distance (DP) takes a higher value the more similar the frequency with which the keyword is used in the training document and the classification target document. If the frequency of use is excessively high or low, however, the keyword distance decreases. Spam-like documents in which a keyword is used with excessive frequency can thus be removed.

FIG. 4 is a graph showing the relationship between keyword distance and frequency.

In this graph, when a keyword is used with excessive frequency the distance value approaches 0, which serves to remove spam-like documents in which a particular word is repeated excessively. If it can be determined that the electronic documents to be searched contain no spam-like documents, the relationship may instead be defined so that the distance value increases with the frequency of keyword use, and the degree to which an electronic document fits the subject can be judged on that basis.

Once the keyword distance has been calculated for each keyword in the classification target document, the ranking calculation unit 37 multiplies the naive Bayesian probability value (Nb) by the distance (DP) for each keyword to obtain a similarity value for that keyword. The similarity values of the keywords in one classification target document are either summed or averaged, and the result is the ranking determination value. Each classification target document thus has a single ranking determination value, and the ranking calculation unit 37 compares the ranking determination values of the documents to calculate the ranking.
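The ranking step can be sketched as follows. Equation 5 is not reproduced in this text, so the keyword distances (DP) are taken as given inputs rather than computed; the function names, the dictionary layout, and the sample numbers are hypothetical.

```python
def ranking_value(nb, dp, average=True):
    """Combine per-keyword similarity values (Nb * DP) into one ranking determination value."""
    similarities = [nb[k] * dp[k] for k in dp if k in nb]
    if not similarities:
        return 0.0
    total = sum(similarities)
    return total / len(similarities) if average else total


def rank_documents(nb, dp_by_doc, average=True):
    """Return document ids ordered from highest to lowest ranking determination value."""
    scores = {doc: ranking_value(nb, dp, average) for doc, dp in dp_by_doc.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Illustrative inputs: Nb per keyword for one subject, DP per keyword per document.
nb = {"baseball": 0.8, "Japan": 0.6}
dp_by_doc = {
    "doc1": {"baseball": 0.5, "Japan": 0.25},
    "doc2": {"baseball": 0.1, "Japan": 0.5},
}
print(rank_documents(nb, dp_by_doc))  # ['doc1', 'doc2']
```

The `average` flag switches between the two aggregation options named in the text (averaging or summing). Averaging keeps documents with many keywords from outranking others purely by length, while summing rewards broader keyword coverage.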

이러한 랭킹은, 학습문서와 분류대상문서간의 유사여부를 판단하는 기준이 되며, 이러한 랭킹을 통해 분류대상문서가 주제와 질적으로 얼마나 일치하는지 여부를 판단할 수 있다. 이에 따라, 임의의 학습문서에 대해 랭킹이 높은 분류대상문서는 해당 학습문서의 주제에 해당되는 것으로 분류되고, 랭킹이 낮은 분류대상문서는 다른 학습문서와의 유사도를 측정하여 다른 주제로 분류할 수 있으므로, 분 류대상문서를 용이하게 분류할 수 있다. Such a ranking serves as a criterion for determining whether the learning document and the classification target document are similar, and through this ranking, it is possible to determine whether the classification target document is qualitatively matched with the subject matter. Accordingly, the high-ranking classification target document is classified as the subject of the corresponding learning document, and the low-ranking classification target document can be classified into another subject by measuring similarity with other learning documents. Therefore, the documents to be classified can be easily classified.

이러한 구성에 의한 검색엔진의 주제 생성과정과 문서 분류과정을 도 5를 참조하여 설명하면 다음과 같다. The subject generation process and document classification process of the search engine by such a configuration will be described with reference to FIG. 5 as follows.

First, the search unit 10 searches for electronic documents and stores the keywords they contain in the keyword storage unit 21 (S505). The association degree calculation unit 23 calculates the association degree between the keywords included in each electronic document (S510), and the convergence association degree calculation unit 25 calculates the convergence association degree by comparing the association degrees between the word set connected to a given keyword and the word set connected to another keyword (S515). Keywords whose convergence association degree is at or above a predetermined level are then selected as subjects (S520).
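
As a rough illustration of steps S505 to S520, the Python sketch below gathers the co-occurrence counts that the claims call LinkCount and WordCount and then applies a threshold to convergence association degrees. The association and convergence formulas themselves are not reproduced in this text, so the convergence values passed to `select_topics` are hypothetical inputs, as are the function names and the sample data.

```python
from collections import Counter
from itertools import combinations

def count_keywords(tagged_docs):
    """Return (word_count, link_count) from a list of keyword lists."""
    word_count = Counter()
    link_count = Counter()
    for keywords in tagged_docs:
        unique = sorted(set(keywords))
        # WordCount: how many documents use the keyword.
        word_count.update(unique)
        # LinkCount: each unordered pair tagged on the same document counts once.
        link_count.update(combinations(unique, 2))
    return word_count, link_count

def select_topics(convergence, threshold):
    """Keep keywords whose convergence association degree meets the threshold."""
    return sorted(k for k, v in convergence.items() if v >= threshold)

# Hypothetical keyword sets extracted from three electronic documents.
tagged_docs = [["soccer", "league", "goal"], ["soccer", "league"], ["recipe", "pasta"]]
word_count, link_count = count_keywords(tagged_docs)

# Hypothetical convergence association degrees in the range 0 to 1.
topics = select_topics({"soccer": 0.8, "recipe": 0.3}, threshold=0.5)
print(word_count["soccer"], link_count[("league", "soccer")], topics)
```

Pairs are stored with their keywords in sorted order so that (w₁, w₂) and (w₂, w₁) share one counter entry.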

Once a subject has been selected, the naive Bayesian unit 31 calculates the naive Bayesian probability, that is, the probability that keywords including the subject appear in the learning document (S525), and the keyword vector calculation unit 33 applies weights to calculate the vector size of each keyword included in the learning document and the classification target document (S530). Then, for each keyword, the keyword distance is calculated from its vector size in the learning document and its vector size in the classification target document (S535).
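
The weighting in step S530 can be sketched in Python along the lines of the claims, which multiply a keyword's occurrence count by its inverse document frequency and divide by the sum of the weights of all words in the document. The logarithmic form of the inverse document frequency, the function name, and the sample corpus are assumptions; the distance calculation of step S535 is not sketched because its exact formula is not given here.

```python
import math
from collections import Counter

def normalized_weights(doc_tokens, corpus):
    """Weight each word by count * IDF, then divide by the document's total weight."""
    n_docs = len(corpus)
    counts = Counter(doc_tokens)
    raw = {}
    for word, tf in counts.items():
        # Document frequency: number of corpus documents containing the word.
        df = sum(1 for d in corpus if word in d)
        # Assumed IDF form: log(N / df); words absent from the corpus get 0.
        idf = math.log(n_docs / df) if df else 0.0
        raw[word] = tf * idf
    total = sum(raw.values())
    # Guard against a zero total, e.g. when every word occurs in every document.
    return {w: (v / total if total else 0.0) for w, v in raw.items()}

# Hypothetical corpus of tokenized documents.
corpus = [["soccer", "league"], ["soccer", "goal"], ["pasta", "recipe"]]
weights = normalized_weights(corpus[0], corpus)
print(weights)
```

In this sample, "league" receives a larger normalized weight than "soccer" because it appears in fewer documents.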

Next, the naive Bayesian probability is multiplied by the keyword distance to obtain a similarity value (S540), and the similarity values of the keywords are processed into a single ranking determination value (S545). The ranking determination values of the classification target documents are compared to calculate each document's ranking for the subject (S550). Repeating the whole process for each subject allows each classification target document to be classified according to subject (S555).

In this way, the search engine automatically selects subjects from electronic documents by using the association degree and the convergence association degree of keywords entered by users. Because atypical and diverse subjects that could not be classified manually can now be selected, users can reach the information they want more easily and quickly.

In addition, by using the keyword selected as the subject to calculate the ranking of each classification target document, which serves as the measure of similarity between the learning document and the classification target document, the classification target documents can be classified according to subject. Since it can thus be determined whether a retrieved classification target document is qualitatively suited to the subject, search reliability improves and users can collect more accurate information.

As described above, according to the present invention, atypical and diverse subjects that could not be classified manually can be selected, so users can access the information they want more easily and quickly. In addition, classification target documents can be classified according to subject, and whether a retrieved electronic document suits the subject can be judged qualitatively, so users can collect more accurate information.

Although specific embodiments have been described in the detailed description of the present invention, they should be regarded as illustrative, and various modifications are of course possible without departing from the technical spirit of the invention. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be determined by the claims below and their equivalents.

Claims

A method of generating subjects and classifying documents in a search engine, the method comprising:

measuring, for at least one keyword included in a classification target document, an association degree indicating the degree to which keywords are selected at the same time;

measuring a convergence association degree, which is the association degree between a word set associated with an arbitrary keyword and a word set associated with another keyword; and

selecting the keyword as a subject when the convergence association degree is equal to or greater than a predetermined level.

The method of claim 1,

wherein the association degree is calculated by Equation 1.

Here, LinkCount(w₁, w₂) is the number of times the keywords w₁ and w₂ were tagged at the same time, WordCount(w₁) is the total number of times the keyword w₁ was used as a keyword, and WordCount(w₂) is the total number of times the keyword w₂ was used as a keyword.

The method of claim 2,

wherein the convergence association degree is calculated by Equation 2;

here,

and,

If both are greater than or equal to the preset minimum association degree, a value is assigned to PositiveVal(w₁, w₂, w_i); otherwise, a value is assigned to NegativeVal(w₁, w₂, w_i).

The method of claim 3, wherein

PositiveVal(w₁, w₂, w_i) and NegativeVal(w₁, w₂, w_i) are generated by Equations 3 and 4 below, respectively.

The method of claim 1,

wherein the convergence association degree approaches 1 as the keywords are more similar and approaches 0 as the keywords are more different.

The method of claim 1,

further comprising calculating a naive Bayesian probability by performing naive Bayesian learning on a learning document serving as a classification standard and on each keyword included in the classification target document.

The method of claim 6,

further comprising calculating a vector size of each keyword included in the learning document and the classification target document.

The method of claim 7, wherein

calculating the vector size of each keyword comprises:

calculating a weight of each keyword by multiplying the number of occurrences of the keyword in the document by the inverse document frequency; and

normalizing the weight of each keyword by dividing it by the sum of the weights of all words in the learning document or classification target document that includes the keyword.

The method of claim 8,

further comprising calculating a keyword distance, which is the difference between the vector size of each keyword in the learning document and the vector size of that keyword in the classification target document.

The method of claim 9,

wherein the distance has a higher value the more frequently a keyword of the learning document is used in the classification target document.

The method of claim 10,

further comprising:

calculating a similarity value for each keyword by multiplying the naive Bayesian probability value (Nb) by the distance (DP) for that keyword;

calculating one ranking determination value by applying a predetermined process to the similarity values of the keywords included in the classification target document; and

calculating a ranking of each classification target document by using its ranking determination value, and classifying the classification target documents according to subject.

Calculating a naive Bayesian probability by performing naive Bayesian learning on a learning document serving as a classification standard and on each keyword included in a classification target document;

calculating a vector size of each keyword included in the learning document and the classification target document;

calculating a keyword distance, which is the difference between the vector size of each keyword in the learning document and the vector size of that keyword in the classification target document;

calculating a similarity value for each keyword of the classification target document by multiplying the naive Bayesian probability value (Nb) by the distance (DP) for that keyword; and

calculating a ranking of each classification target document using the similarity values.

A computer-readable recording medium containing a program capable of performing the method of any one of claims 1 to 12.