KR101479040B1

KR101479040B1 - Method, apparatus, and computer storage medium for automatically adding tags to document

Info

Publication number: KR101479040B1
Application number: KR1020147019605A
Authority: KR
Inventors: 시앙 흐어; 왕예; 펑 지아오
Original assignee: Tencent Technology (Shenzhen) Company Limited
Priority date: 2012-01-05
Filing date: 2012-12-17
Publication date: 2015-01-05
Also published as: EP2801917A4; CN103198057B; CN103198057A; JP2015506515A; US20150019951A1; WO2013102396A1; EP2801917A1; KR20140093762A; US9146915B2

Abstract

Embodiments of the present invention provide a method and apparatus for automatically adding a tag to a document. The method comprises: determining a plurality of candidate tag words; determining a corpus containing a plurality of texts; selecting common words from the corpus as characteristic words; for each characteristic word and each candidate tag word, determining a co-occurrence probability, that is, the probability that the candidate tag word occurs when the characteristic word occurs; extracting characteristic words from the document and, for each extracted characteristic word, calculating a weight of that characteristic word; calculating, within the corpus, for each candidate tag word, the weighted co-occurrence probability of the candidate tag word with all of the characteristic words occurring in the document; and selecting the candidate tag word having the highest weighted co-occurrence probability as the tag word to be added to the document. Embodiments of the present invention make the adding of tags to a document intelligent, and the tags are not limited to keywords that occur in the document.

Description

METHOD, APPARATUS, AND COMPUTER STORAGE MEDIUM FOR AUTOMATICALLY ADDING TAGS TO DOCUMENT

This application claims priority to Chinese Patent Application No. 201210001611.9, entitled "METHOD AND APPARATUS FOR AUTOMATICALLY ADDING TAG TO DOCUMENT" and filed with the State Intellectual Property Office on January 5, 2012, which is incorporated herein by reference in its entirety.

The present invention relates to Internet document technology, and more particularly to a method and apparatus for automatically adding tags to a document.

Tags, which are used to organize content on the Internet, are key words that are highly relevant to a document. Tags briefly describe and classify the contents of a document to make searching and sharing easier.

Currently, there are three main ways to add tags to a document: 1) the manual tag method, in which a specific tag is manually assigned to the document; 2) the keyword tag method, in which an important keyword automatically extracted from the document by analyzing its contents is taken as a tag; and 3) the socialized tag method, in which tags are added to a user's document by the user himself or herself. All three have problems: 1) with the manual tag method, tags cannot be added automatically to a large number of documents; 2) with the keyword tag method, only keywords occurring in the document can be selected as tags, yet not every keyword is suitable as a tag; and 3) the socialized tag method requires each user to add tags to documents on his or her own, so the tags are inconsistent because different users apply different standards.

According to one embodiment of the present invention, a method and apparatus for automatically adding tags to a document are provided, whereby tags that are not limited to keywords in the document can be intelligently added to the document.

The solution according to one embodiment of the present invention is implemented as follows.

A method for automatically adding a tag to a document includes:

determining a plurality of candidate tag words corresponding to the document;

determining a corpus comprising a plurality of texts; selecting commonly used words from the corpus as characteristic words; and determining, for each of the characteristic words and each of the candidate tag words, the probability that the candidate tag word co-occurs with the characteristic word;

extracting characteristic words from the document, and calculating a weight for each of the extracted characteristic words; and

calculating, in the corpus, the weighted probability that each of the candidate tag words co-occurs with all of the characteristic words extracted from the document; and selecting a candidate tag word having a high weighted co-occurrence probability as the tag word to be added to the document.

An apparatus for automatically adding a tag to a document includes:

a candidate tag word determination module configured to determine a plurality of candidate tag words corresponding to the document;

a co-occurrence probability determination module configured to determine a corpus comprising a plurality of texts, to select commonly used words from the corpus as characteristic words, and to determine, for each of the characteristic words and each of the candidate tag words, the probability that the candidate tag word co-occurs with the characteristic word;

a weight calculation module configured to extract characteristic words from the document and to calculate a weight for each of the extracted characteristic words;

a weighted co-occurrence probability calculation module configured to calculate, in the corpus, the weighted probability that each of the candidate tag words co-occurs with all of the characteristic words extracted from the document; and

a tag word addition module configured to select a candidate tag word having a high weighted co-occurrence probability as the tag word to be added to the document.

In the method and apparatus for automatically adding a tag to a document according to the embodiments of the present invention, a tag that is not limited to keywords in the document can be intelligently added to the document by calculating the probability that each characteristic word co-occurs with each candidate tag word in the corpus, converting the co-occurrence probability into a vote from the characteristic word for the candidate tag word, and taking the candidate tag word that receives the most votes as the tag word to be added to the document.

FIG. 1 is a flowchart of a method for automatically adding a tag to a document according to one embodiment of the present invention.
FIG. 2 is a schematic diagram of the structure of an apparatus for automatically adding a tag to a document according to one embodiment of the present invention.

According to one embodiment of the present invention, a method for automatically adding a tag to a document is provided. FIG. 1 is a flowchart of the method, which includes the following steps.

In step 101, a plurality of candidate tag words corresponding to the document are determined.

In this step, the candidate tag words corresponding to the document may be determined by, but not limited to, the following three methods:

1) the manual tag method, in which a specific tag is manually specified for the document;

2) the keyword tag method, in which an important keyword automatically extracted from the document by analyzing its contents is taken as a tag; and

3) the socialized tag method, in which tags are added to a user's document by the user himself or herself.

When the candidate tag words are determined by the manual tag method or the socialized tag method, they are not limited to words that occur in the document.

In step 102, a corpus containing a plurality of texts is determined.

For example, if one million texts are obtained from the Internet, those one million texts are collectively referred to as the corpus.

In step 103, commonly used words are selected from the corpus as characteristic words, and, for each characteristic word and each candidate tag word, the probability that the candidate tag word co-occurs with the characteristic word in the corpus is determined.

In step 104, characteristic words are extracted from the document, and a weight is calculated for each of them.

In step 105, for each candidate tag word, the weighted probability that the candidate tag word co-occurs with all of the characteristic words occurring in the document is calculated in the corpus, and a candidate tag word having a high weighted co-occurrence probability is selected as the tag word to be added to the document.

In step 103, the co-occurrence probability is denoted P(X|Y), where X denotes one of the candidate tag words and Y denotes one of the characteristic words occurring in the corpus. P(X|Y) can be determined in various ways, such as the following.

In the first scheme, P(X|Y) equals the number of times X co-occurs with Y in the same text of the corpus, divided by the number of times Y occurs in the corpus.
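The first scheme can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `cooccurrence_probabilities` and the input layout (each text already segmented into a list of words) are assumptions, and occurrences are counted once per text, which is only one possible reading of "number of times".

```python
from collections import Counter
from itertools import product


def cooccurrence_probabilities(corpus, tag_words, characteristic_words):
    """Estimate P(X|Y) for candidate tag words X and characteristic words Y.

    corpus: list of texts, each a list of already segmented words.
    Assumption: a text counts once for Y no matter how often Y appears in it.
    Pairs that never co-occur are simply absent from the result (probability 0).
    """
    tag_words = set(tag_words)
    characteristic_words = set(characteristic_words)
    y_texts = Counter()   # number of texts containing Y
    xy_texts = Counter()  # number of texts containing both X and Y
    for text in corpus:
        words = set(text)
        xs = words & tag_words
        ys = words & characteristic_words
        y_texts.update(ys)
        for x, y in product(xs, ys):
            xy_texts[(x, y)] += 1
    return {(x, y): n / y_texts[y] for (x, y), n in xy_texts.items()}
```

Returning a sparse dictionary keeps only the pairs that actually co-occur, which is convenient when later steps skip candidates whose probability is zero.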

In the second scheme, P(X|Y) is given by an entropy-based formula (the formula image is not reproduced in this text), where H(X,Y) denotes the joint (combination) entropy of X and Y, I(X,Y) denotes the mutual information of X and Y, H(X) denotes the information entropy of X, and H(Y) denotes the information entropy of Y.
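The entropy quantities named in the second scheme can be estimated from the corpus. The sketch below is only an illustration under a stated assumption: X and Y are modeled as binary events ("the word occurs in a text"), which the source does not specify. The names `entropy` and `entropy_terms` are hypothetical. The function returns H(X), H(Y), H(X,Y), and I(X,Y), using the standard identity I(X,Y) = H(X) + H(Y) - H(X,Y); it does not combine them into P(X|Y).

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def entropy_terms(corpus, x, y):
    """Return (H(X), H(Y), H(X,Y), I(X,Y)) for the events 'x in text', 'y in text'.

    Assumption: corpus is a non-empty list of texts, each a list of words,
    and each word is treated as a binary occurrence indicator per text.
    """
    n = len(corpus)
    joint = {(a, b): 0 for a in (False, True) for b in (False, True)}
    for text in corpus:
        words = set(text)
        joint[(x in words, y in words)] += 1
    p_joint = {key: count / n for key, count in joint.items()}
    # Marginal distributions: [P(present), P(absent)].
    p_x = [p_joint[(True, False)] + p_joint[(True, True)],
           p_joint[(False, False)] + p_joint[(False, True)]]
    p_y = [p_joint[(False, True)] + p_joint[(True, True)],
           p_joint[(False, False)] + p_joint[(True, False)]]
    h_x, h_y = entropy(p_x), entropy(p_y)
    h_xy = entropy(p_joint.values())
    return h_x, h_y, h_xy, h_x + h_y - h_xy
```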

In the third scheme, P(X|Y) is determined by using a lexical database such as WordNet.

In step 104, for each extracted characteristic word, the weight of the characteristic word may be calculated based on the number of times the characteristic word occurs in the document and the number of texts in the corpus in which the characteristic word occurs.

The weight of a characteristic word Y extracted from the document is denoted W_Y and can be calculated as follows: W_Y equals the product of the number of times Y occurs in the document and the number of texts in the corpus in which Y occurs.
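The weight described above is a simple product of two counts. A minimal sketch, assuming the document and the corpus texts are already segmented into word lists; the function name `characteristic_word_weights` is hypothetical.

```python
from collections import Counter


def characteristic_word_weights(document, corpus, characteristic_words):
    """Return W_Y for every characteristic word Y that occurs in the document.

    W_Y = (occurrences of Y in the document)
          * (number of corpus texts in which Y occurs), as the text describes.
    """
    characteristic_words = set(characteristic_words)
    occurrences = Counter(w for w in document if w in characteristic_words)
    text_count = Counter()
    for text in corpus:
        text_count.update(set(text) & characteristic_words)
    return {y: occurrences[y] * text_count[y] for y in occurrences}
```

Note that the second factor grows with the number of texts containing Y, so this follows the description literally rather than the inverse document frequency used in conventional TF-IDF.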

In step 105, the weighted co-occurrence probability is denoted

$$P_X = \sum_{i=1}^{n} W_{Y_i}\, P(X \mid Y_i),$$

where Y_i denotes one of the characteristic words extracted from the document, W_{Y_i} denotes the weight of Y_i, and n denotes the number of characteristic words extracted from the document.
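As a worked illustration, the weighted co-occurrence probability can be read as the weight-scaled sum of the per-word co-occurrence probabilities. That reading is an assumption drawn from the surrounding definitions, and all numbers below are hypothetical.

```latex
% Hypothetical values: n = 2, W_{Y_1} = 6, W_{Y_2} = 2,
% P(X|Y_1) = 2/3, P(X|Y_2) = 1.
\[
P_X = \sum_{i=1}^{n} W_{Y_i}\, P(X \mid Y_i)
    = 6 \cdot \tfrac{2}{3} + 2 \cdot 1
    = 6
\]
```

Because the weights are raw counts, P_X behaves as a vote total rather than a value bounded by 1.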

In step 105, the weighted co-occurrence probability P_X may be calculated only for those candidate tag words that co-occur with at least one characteristic word extracted from the document, rather than for all candidate tag words.
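This pruning can be expressed as a filter over precomputed probabilities. A minimal sketch, assuming the probabilities are stored in a dictionary keyed by (X, Y) pairs; the function name `cooccurring_candidates` and that layout are hypothetical.

```python
def cooccurring_candidates(cooccurrence, document_characteristic_words):
    """Return candidate tag words X with P(X|Y) > 0 for at least one Y in the document.

    cooccurrence: dict mapping (X, Y) -> P(X|Y); missing pairs are treated as 0.
    """
    in_document = set(document_characteristic_words)
    return {x for (x, y), p in cooccurrence.items() if p > 0 and y in in_document}
```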

Specific embodiments are described in more detail below.

First Embodiment

In step 1, a tag word set is prepared.

A plurality of candidate tag words corresponding to the documents are obtained to build the tag word set as needed. For example, if tags need to be added to documents related to movies, the tag word set may include tag words such as movie genres and celebrities.

In step 2, a corpus is prepared.

A large number of relevant texts may be collected from the Internet as a corpus to be used for gathering statistics on co-occurrence relationships between words.

In step 3, characteristic words are extracted from the corpus.

Word segmentation is performed on the texts in the corpus. The term frequency (TF) of each word is then counted. High-frequency words, stop words, and low-frequency words are removed, and the remaining commonly used words are selected as characteristic words.
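The selection of characteristic words in step 3 can be sketched as a frequency filter. This is an illustration only: the source gives no thresholds, so `min_count`, `max_count`, and the stop-word list are hypothetical parameters, and the texts are assumed to be segmented already.

```python
from collections import Counter


def select_characteristic_words(corpus, stop_words, min_count, max_count):
    """Keep words whose corpus term frequency lies in [min_count, max_count].

    corpus: list of texts, each a list of segmented words.
    Words below min_count (low frequency), above max_count (high frequency),
    or listed in stop_words are removed.
    """
    term_frequency = Counter(word for text in corpus for word in text)
    stop_words = set(stop_words)
    return {
        word
        for word, count in term_frequency.items()
        if min_count <= count <= max_count and word not in stop_words
    }
```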

In step 4, the probability P(X|Y) that each characteristic word co-occurs with each candidate tag word is calculated.

P(X|Y) equals the number of times X and Y co-occur in the same text of the corpus, divided by the number of times Y occurs in the corpus.

Here, X denotes one of the candidate tag words, and Y denotes one of the characteristic words.

In step 5, tag words are automatically added to the document, with the following specific steps:

in step I, word segmentation is performed on the document;

in step II, all characteristic words occurring in the document are extracted according to the word segmentation result, and the weight W_Y of each extracted characteristic word Y is calculated as W_Y = TF × IDF, where TF denotes the number of times Y occurs in the document and IDF denotes the number of texts in the corpus in which Y occurs;

in step III, candidate tag words that co-occur with at least one characteristic word (that is, whose co-occurrence probability is not zero) are extracted based on the co-occurrence probabilities calculated in step 4;

in step IV, for each extracted candidate tag word, the weighted co-occurrence probability of that candidate tag word with all of the characteristic words extracted from the document is calculated as

$$P_X = \sum_{i=1}^{n} W_{Y_i}\, P(X \mid Y_i),$$

where Y_i denotes one of the characteristic words extracted from the document, W_{Y_i} denotes the weight of Y_i, and n denotes the number of characteristic words extracted from the document; and

in step V, all of the extracted candidate tag words are ranked in descending order of P_X, and one or more candidate tag words having the highest P_X are selected as the tag words to be added to the document.

In this embodiment, some of the candidate tag words are first extracted in step III, and the weighted co-occurrence probability is then calculated for each of the extracted candidate tag words. This speeds up the computation and saves system resources. According to other embodiments of the present invention, the weighted co-occurrence probability may be calculated for all of the candidate tag words. For a candidate tag word that has no co-occurrence relationship with any of the characteristic words, the calculated weighted co-occurrence probability P_X = 0, and the candidate tag word is ranked at the end of the queue of candidate tag words in step V.
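Steps IV and V can be sketched together as scoring followed by sorting. The sketch assumes that P_X is the weight-scaled sum of the co-occurrence probabilities, which is a reading of the definitions rather than a formula confirmed by the text; `rank_tag_words`, `top_k`, and the dictionary layouts are hypothetical.

```python
def rank_tag_words(weights, cooccurrence, candidates, top_k=1):
    """Score each candidate X and return the top_k (tag word, score) pairs.

    weights: dict mapping characteristic word Y -> W_Y for the document.
    cooccurrence: dict mapping (X, Y) -> P(X|Y); missing pairs count as 0.
    Assumption: P_X = sum over Y of W_Y * P(X|Y).
    """
    scores = {
        x: sum(w * cooccurrence.get((x, y), 0.0) for y, w in weights.items())
        for x in candidates
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]
```

A candidate with no co-occurring characteristic word scores 0 and sorts to the end, matching the behavior described for the variant that scores every candidate.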

In another embodiment of the present invention, the co-occurrence probability P(X|Y) of a characteristic word and a candidate tag word may be calculated in other ways. For example, P(X|Y) may be calculated by an entropy-based formula (the formula image is not reproduced in this text), where H(X,Y) denotes the joint (combination) entropy of X and Y, I(X,Y) denotes the mutual information of X and Y, H(X) denotes the information entropy of X, and H(Y) denotes the information entropy of Y. Alternatively, the relationship between the characteristic word and the candidate tag word is determined by using a lexical database such as WordNet.

According to one embodiment of the present invention, an apparatus for automatically adding a tag to a document is further provided. FIG. 2 is a schematic diagram of the structure of the apparatus, which includes:

a candidate tag word determination module (201) configured to determine a plurality of candidate tag words corresponding to the document;

a co-occurrence probability determination module (202) configured to determine a corpus containing a plurality of texts, to select commonly used words from the corpus as characteristic words, and to determine, for each characteristic word and each candidate tag word, the probability that the candidate tag word co-occurs with the characteristic word in the corpus;

a weight calculation module (203) configured to extract characteristic words from the document and to calculate a weight for each of the characteristic words;

a weighted co-occurrence probability calculation module (204) configured to calculate, in the corpus, the weighted probability that each candidate tag word co-occurs with all of the characteristic words occurring in the document; and

a tag word addition module (205) configured to select a candidate tag word having a high weighted co-occurrence probability as the tag word to be added to the document.
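One way to picture how modules 201 to 205 fit together is a single class whose methods mirror the modules. This is a hypothetical arrangement for illustration, not the structure shown in FIG. 2; the class name `TaggingApparatus`, the per-text counting, and the weight-scaled sum used for scoring are all assumptions.

```python
from collections import Counter


class TaggingApparatus:
    """Hypothetical sketch: each method stands in for one of modules 201-205."""

    def __init__(self, candidate_tags, corpus, characteristic_words):
        self.candidate_tags = set(candidate_tags)           # output of module 201
        self.corpus = [set(text) for text in corpus]        # segmented texts
        self.characteristic_words = set(characteristic_words)
        self.cooccurrence = self._cooccurrence()            # module 202

    def _cooccurrence(self):
        """P(X|Y) = texts containing X and Y / texts containing Y."""
        y_texts, xy_texts = Counter(), Counter()
        for words in self.corpus:
            ys = words & self.characteristic_words
            y_texts.update(ys)
            for x in words & self.candidate_tags:
                for y in ys:
                    xy_texts[(x, y)] += 1
        return {pair: n / y_texts[pair[1]] for pair, n in xy_texts.items()}

    def weights(self, document):                            # module 203
        """W_Y = occurrences in document * number of corpus texts containing Y."""
        counts = Counter(w for w in document if w in self.characteristic_words)
        return {y: c * sum(y in words for words in self.corpus)
                for y, c in counts.items()}

    def scores(self, document):                             # module 204
        """Assumed scoring: sum of W_Y * P(X|Y) over the document's words."""
        w = self.weights(document)
        return {x: sum(wy * self.cooccurrence.get((x, y), 0.0)
                       for y, wy in w.items())
                for x in self.candidate_tags}

    def tag(self, document):                                # module 205
        """Return the candidate tag word with the highest score."""
        s = self.scores(document)
        return max(s, key=s.get)
```

Computing the co-occurrence table once in the constructor reflects that it depends only on the corpus, while weights and scores are recomputed per document.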

In the above apparatus, the co-occurrence probability may be denoted P(X|Y), where X denotes one of the candidate tag words and Y denotes one of the characteristic words occurring in the corpus. The co-occurrence probability determination module (202) may calculate P(X|Y) as follows.

P(X|Y) equals the number of times X and Y co-occur in the same text of the corpus, divided by the number of times Y occurs in the corpus.

Alternatively, P(X|Y) is given by an entropy-based formula (the formula image is not reproduced in this text), where H(X,Y) denotes the joint (combination) entropy of X and Y, and I(X,Y) denotes the mutual information of X and Y.

Alternatively, P(X|Y) is determined by using a lexical database.

In the above apparatus, the weight of a characteristic word Y extracted from the document is denoted W_Y, and the weight calculation module (203) may calculate it as follows: W_Y equals the product of the number of times Y occurs in the document and the number of texts in the corpus in which Y occurs.

In the above apparatus, the weighted co-occurrence probability may be denoted

$$P_X = \sum_{i=1}^{n} W_{Y_i}\, P(X \mid Y_i),$$

where Y_i denotes one of the characteristic words extracted from the document, W_{Y_i} denotes the weight of Y_i, and n denotes the number of characteristic words extracted from the document.

In the above apparatus, the weighted co-occurrence probability calculation module (204) may calculate the weighted co-occurrence probability only for candidate tag words that co-occur with at least one characteristic word extracted from the document.

In summary, in the method and apparatus for automatically adding a tag to a document according to the embodiments of the present invention, a tag that is not limited to keywords occurring in the document can be intelligently added to the document by calculating the probability that each characteristic word co-occurs with each candidate tag word in the corpus, converting the co-occurrence probability into a vote from the characteristic word for the candidate tag word, and taking the candidate tag word that receives the most votes as the tag word to be added to the document. The relevance between the tag word and the document is improved because the selection is based on co-occurrence statistics according to the embodiments of the present invention.

According to one embodiment of the present invention, there is further provided a machine-readable storage medium storing instructions that cause a machine to perform the method for automatically adding a tag to a document as described herein. A system or apparatus may be provided that includes a storage medium storing software program codes implementing the functions of any of the above embodiments, and a computer (or CPU or MPU) in the system or apparatus can read and execute the program codes stored in the storage medium.

In this case, the program codes read from the storage medium can themselves implement the functions of any one of the above embodiments. Therefore, the program codes and the storage medium storing the program codes constitute part of the present invention.

Examples of storage media that provide the program codes include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, and DVD+RW), a magnetic tape, a non-volatile memory, and a ROM. Optionally, the program codes may be downloaded from a server computer over a communication network.

Furthermore, it will be appreciated that the functions of any one of the above embodiments may be implemented not only by executing the program codes read by the computer, but also by having an operating system running on the computer perform some or all of the actual operations based on instructions of the program codes.

Furthermore, it should be understood that the functions of any one of the above embodiments may be implemented by writing the program codes read from the storage medium into a memory provided on an expansion board inserted into the computer, or into a memory provided in an expansion unit connected to the computer, and then having a CPU or the like mounted on the expansion board or expansion unit perform some or all of the actual operations based on instructions of the program codes.

The preferred embodiments of the present invention described above are not intended to limit the scope of the invention. Any modifications, equivalents, and improvements made within the spirit and principles of the present invention fall within the scope of the present invention.

Claims

A method for automatically adding a tag to a document, comprising:
determining a plurality of candidate tag words corresponding to the document;
determining a corpus comprising a plurality of texts; selecting commonly used words from the corpus as characteristic words; and determining, for each of the characteristic words and each of the candidate tag words, a probability that the candidate tag word co-occurs with the characteristic word;
extracting characteristic words from the document and calculating a weight for each of the extracted characteristic words; and
calculating, in the corpus, a weighted probability that each of the candidate tag words co-occurs with all of the characteristic words extracted from the document; and selecting a candidate tag word having a high weighted co-occurrence probability as a tag word to be added to the document,
wherein the weight of a characteristic word Y extracted from the document is denoted W_Y, and W_Y equals the product of the number of times Y occurs in the document and the number of texts in the corpus in which Y occurs.

The method according to claim 1,
wherein the co-occurrence probability is denoted P(X|Y), where X denotes one of the candidate tag words and Y denotes one of the characteristic words occurring in the corpus; and
P(X|Y) is determined as the number of times X and Y co-occur in the same text of the corpus divided by the number of times Y occurs in the corpus.

The method according to claim 1,
wherein the co-occurrence probability is denoted P(X|Y), where X denotes one of the candidate tag words and Y denotes one of the characteristic words occurring in the corpus; and
P(X|Y) is given by a formula (the formula image is not reproduced in this text) in which H(X,Y) denotes the joint (combination) entropy of X and Y and I(X,Y) denotes the mutual information of X and Y.

The method according to claim 1,
wherein the co-occurrence probability is denoted P(X|Y), where X denotes one of the candidate tag words and Y denotes one of the characteristic words occurring in the corpus; and
P(X|Y) is determined by using a lexical database.

(Deleted)

The method according to claim 1,
wherein the weighted co-occurrence probability is denoted

$$P_X = \sum_{i=1}^{n} W_{Y_i}\, P(X \mid Y_i),$$

where Y_i denotes one of the characteristic words extracted from the document, W_{Y_i} denotes the weight of Y_i, and n denotes the number of characteristic words extracted from the document.

The method according to claim 1,
wherein calculating, in the corpus, the weighted probability that each of the candidate tag words co-occurs with all of the characteristic words extracted from the document comprises:
calculating, in the corpus, the weighted co-occurrence probability for each candidate tag word that co-occurs with at least one characteristic word extracted from the document.

An apparatus for automatically adding a tag to a document, comprising:
a candidate tag word determination module configured to determine a plurality of candidate tag words corresponding to the document;
a co-occurrence probability determination module configured to determine a corpus containing a plurality of texts, to select commonly used words from the corpus as characteristic words, and to determine, for each of the characteristic words and each of the candidate tag words, a probability that the candidate tag word co-occurs with the characteristic word;
a weight calculation module configured to extract characteristic words from the document and to calculate a weight for each of the extracted characteristic words;
a weighted co-occurrence probability calculation module configured to calculate, in the corpus, a weighted probability that each of the candidate tag words co-occurs with all of the characteristic words extracted from the document; and
a tag word addition module configured to select a candidate tag word having a high weighted co-occurrence probability as a tag word to be added to the document,
wherein the weight of a characteristic word Y extracted from the document is denoted W_Y, and the weight calculation module is configured to calculate W_Y as the product of the number of times Y occurs in the document and the number of texts in the corpus in which Y occurs.

The apparatus according to claim 8,
wherein the co-occurrence probability is denoted P(X|Y), where X denotes one of the candidate tag words and Y denotes one of the characteristic words occurring in the corpus; and
the co-occurrence probability determination module is configured to calculate P(X|Y) as the number of times X and Y co-occur in the same text of the corpus divided by the number of times Y occurs in the corpus.

The apparatus according to claim 8,
wherein the co-occurrence probability is denoted P(X|Y), where X denotes one of the candidate tag words and Y denotes one of the characteristic words occurring in the corpus; and
the co-occurrence probability determination module is configured to determine P(X|Y) by a formula (the formula image is not reproduced in this text) in which H(X,Y) denotes the joint (combination) entropy of X and Y and I(X,Y) denotes the mutual information of X and Y.

The apparatus according to claim 8,
wherein the co-occurrence probability is denoted P(X|Y), where X denotes one of the candidate tag words and Y denotes one of the characteristic words occurring in the corpus; and
the co-occurrence probability determination module is configured to determine P(X|Y) by using a lexical database.

(Deleted)

The apparatus according to any one of claims 8 to 11,
wherein the weighted co-occurrence probability is denoted

$$P_X = \sum_{i=1}^{n} W_{Y_i}\, P(X \mid Y_i),$$

where Y_i denotes one of the characteristic words extracted from the document, W_{Y_i} denotes the weight of Y_i, and n denotes the number of characteristic words extracted from the document.

The apparatus according to any one of claims 8 to 11,
wherein the weighted co-occurrence probability calculation module is configured to calculate, in the corpus, the weighted co-occurrence probability for each candidate tag word that co-occurs with at least one characteristic word extracted from the document.

A computer storage medium storing a computer program for implementing the method according to any one of claims 1 to 4, 6, and 7.