KR20080026948A

KR20080026948A - Method for related keyword group extraction

Info

Publication number: KR20080026948A
Application number: KR1020060092210A
Authority: KR
Inventors: 이수원; 이성진; 김지인
Original assignee: 숭실대학교산학협력단; 건국대학교 산학협력단
Priority date: 2006-09-22
Filing date: 2006-09-22
Publication date: 2008-03-26

Abstract

A method for extracting a related keyword group is provided that raises the importance of keywords appearing in multiple documents of the same site group, determines how many site groups each related keyword candidate appears in, and analyzes the relationship between a specific keyword and the related keyword candidates. The method comprises the following steps. Sites retrieved by a specific keyword are grouped into a site group (S111, S112). Important keywords are extracted from the site group, and related keyword candidates are generated, using a modified TFIDF (Term Frequency Inverse Document Frequency) that raises the importance of keywords appearing in documents of the same site group and lowers the importance of keywords appearing in many site groups (S113-S118). The relationship between the related keyword candidates and the specific keyword is analyzed using a modified association rule that applies a concentration and a confidence, where the concentration measures how intensively a related keyword candidate appears in the same site group and the confidence measures how widely it appears across the documents of the site groups (S119). The relationship between the related keyword candidates and the specific keyword is also analyzed using cosine similarity, which measures the content similarity among site groups (S120). Related keyword groups are extracted by intersecting the related keyword candidates obtained from the modified association rule with those obtained from the cosine similarity (S121, S122).

Description

{Method for Related Keyword Group Extraction}

FIG. 1 is a flowchart of the method for extracting a related keyword group according to the present invention;

FIG. 2 shows an embodiment of the method for extracting a related keyword group according to the present invention;

FIG. 3 shows another embodiment of the method for extracting a related keyword group according to the present invention;

FIG. 4 shows yet another embodiment of the method for extracting a related keyword group according to the present invention.

The present invention relates to a method for extracting a related keyword group. More particularly, it relates to a method that uses a modified term frequency-inverse document frequency calculation (Term Frequency Inverse Document Frequency, hereinafter 'TFIDF') and a modified association rule to raise the importance of keywords that appear in multiple documents of the same site group and to determine how many site groups a related keyword candidate appears in, so that the association between a specific keyword and its related keyword candidates can be analyzed efficiently and a related keyword group can be extracted.

As Internet use has become part of everyday life, the Internet advertising market has been growing rapidly. Representative Internet advertising methods include banner advertisements, pop-up advertisements, e-mail advertisements, and video advertisements. These show key content likely to attract the user's interest and, when a curious user clicks the advertisement, link the user to the advertiser's site. To show the actual advertisement, these methods must first attract the user's interest, which requires a certain minimum exposure time. Because they are produced to a predetermined size, they also cannot carry much content. The click-through rate of banner advertisements has recently fallen below 0.5%, and pop-up, e-mail, and video advertisements are becoming steadily less effective because users delete them or close the window before even looking at the content. Various efforts are therefore being made to improve the effectiveness of Internet advertising, and the method most widely used recently is keyword advertising through search engines.

Keyword advertising is a method in which, when a user runs a keyword search in a search engine to find desired information, the site of the advertiser who purchased that search keyword is shown as if it were a search result. Most search engines currently operate keyword advertising. The purchase cost of a keyword varies with the number of exposures and the exposure position: the more often users search for a keyword, the higher its price, and the higher the exposure position, the higher the price. This is because the more often a keyword is searched, the more likely the advertisement is to be exposed, and the higher the advertisement is placed, the more likely the user is to visit the site. Consequently, an advertiser who wants to purchase a keyword may be unable to do so because it has already been sold or because the price is too high. There is therefore a need to generate groups of keywords that are similar in meaning or related to a specific keyword and to use them for product recommendation in a keyword shop. At present, most search engine sites provide related keywords for important keywords based on an administrator's judgment. Some provide related keywords derived by association rules from the order of users' keyword searches, but these are not used in keyword shops.

TFIDF, a prior-art technique, is a formula for calculating the importance of each keyword in a text document. It is mainly used in search engines to extract the important keywords that represent each document, in order to calculate the similarity between a user's query and each document or between one document and another.

The basic idea of TFIDF is to treat as important those keywords that appear often in one document but rarely in other documents. That is, the importance v_i of keyword d_i in a particular document T is proportional to tf(d_i), the frequency with which d_i appears in T, and inversely proportional to df(d_i), the frequency with which the keyword appears in other documents. However, because keyword frequency tends to be proportional to document length, keywords in long documents receive inflated importance.

Equation 1 is the commonly used normalized TFIDF formula.

[Equation 1]

T : an arbitrary document
d_i : a keyword appearing in T
v_i : the importance of d_i in T
tf(d_i) : the frequency with which d_i appears in T
tf_max : the largest tf value in T
df(d_i) : the number of documents in which d_i appears
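The formula image for Equation 1 is not reproduced in this text. As a minimal sketch only, assuming one common normalized form, v_i = (tf(d_i) / tf_max) * log(N / df(d_i)), the calculation might look like the Python below. The total document count N, the logarithm, the function name `normalized_tfidf`, and the toy corpus are all assumptions introduced for illustration and are not taken from the patent.

```python
import math
from collections import Counter

def normalized_tfidf(doc_tokens, corpus):
    """Return {keyword: importance} for one document.

    Assumed form (not from the source): v_i = tf(d_i)/tf_max * log(N/df(d_i)).
    corpus is a list of token lists and must include the scored document,
    so that every keyword has df >= 1.
    """
    tf = Counter(doc_tokens)
    if not tf:
        return {}
    tf_max = max(tf.values())          # largest tf value in T
    n_docs = len(corpus)               # N: assumed total number of documents
    scores = {}
    for term, freq in tf.items():
        # df(d_i): number of documents in which the keyword appears
        df = sum(1 for doc in corpus if term in doc)
        scores[term] = (freq / tf_max) * math.log(n_docs / df)
    return scores

# Hypothetical toy corpus of three documents.
corpus = [
    ["loan", "bank", "loan", "rate"],
    ["travel", "hotel", "rate"],
    ["wedding", "hall", "travel"],
]
scores = normalized_tfidf(corpus[0], corpus)
```

In this toy corpus "loan" outranks "rate" because "rate" also appears in a second document, which is the behavior the paragraph above describes.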

However, because conventional TFIDF has no concept of a site group, it treats as important a keyword that appears often in one site and rarely in other sites, and so it lowers the importance of keywords that appear evenly across the same site group. It is therefore unsuitable for extracting the important keywords that represent a site group consisting of multiple documents (sites).

The association rule, a prior-art technique, analyzes transactions, each consisting of a set of items, to identify associations among items; it is used in fields such as market basket analysis, cross-selling strategy, and user access pattern analysis. The method finds relationships among items (itemsets) that frequently occur together across transactions by calculating which items (itemsets) frequently appear in the transactions in which a given item (itemset) appears. That is, if item Y frequently appears in transactions in which item X appears, item Y is regarded as highly associated with item X. Here item X is called the antecedent and item Y the consequent, and both may be subsets of the itemset. Conversely, however, the share of transactions containing item Y in which item X also appears may not be large, so item X may not be highly associated with item Y.

Support, confidence, and lift are used as measures for validating an association rule. In general, a rule is generated only when it satisfies a minimum support and a minimum confidence, and lift is used to evaluate the generated rules.

Equation 2 defines support: the probability that a transaction in the whole set contains both X and Y, that is, a measure of how often X and Y appear together across all transactions.

[Equation 2]

Equation 3 defines confidence: the probability that a transaction containing X contains both X and Y, that is, a measure of how often X and Y appear together among the transactions that contain X.

[Equation 3]

Equation 4 defines lift, which compares the probability that a transaction in the whole set contains Y with the probability that a transaction containing X contains both X and Y.

[Equation 4]
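The formula images for Equations 2 to 4 are not reproduced here. The sketch below uses the standard textbook definitions that match the verbal descriptions: support = P(X and Y), confidence = P(X and Y) / P(X), and lift = confidence / P(Y). The function name `rule_metrics` and the basket data are hypothetical.

```python
def rule_metrics(transactions, x, y):
    """Support, confidence, and lift of the rule x -> y over a list of item sets."""
    n = len(transactions)
    n_x = sum(1 for t in transactions if x in t)              # transactions with X
    n_y = sum(1 for t in transactions if y in t)              # transactions with Y
    n_xy = sum(1 for t in transactions if x in t and y in t)  # transactions with both
    support = n_xy / n                                # P(X and Y)
    confidence = n_xy / n_x if n_x else 0.0           # P(X and Y) / P(X)
    lift = confidence / (n_y / n) if n_y else 0.0     # P(Y | X) / P(Y)
    return support, confidence, lift

# Hypothetical market-basket transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
support, confidence, lift = rule_metrics(baskets, "bread", "milk")
```

A lift below 1, as in this example, indicates that seeing X makes Y slightly less likely than its baseline rate.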

Conventional association rules are not well suited to judging, within the site group for a specific keyword, the association with related keyword candidates.

Accordingly, the present invention is intended to solve the problems of the prior art, and one object is to raise the importance of keywords that appear in multiple documents of the same site group by using a modified TFIDF.

Another object is to use a modified association rule to determine how many site groups a related keyword candidate appears in, so that the association between a specific keyword and the related keyword candidates can be analyzed efficiently and a related keyword group can be generated.

An object of the present invention is achieved by a method for extracting a related keyword group comprising: a first step of forming a site group by grouping the sites retrieved by a specific keyword; a second step of extracting important keywords from the site group to generate related keyword candidates, using a modified TFIDF calculation that raises the importance of keywords appearing in documents of the same site group and lowers the importance of keywords appearing in multiple site groups; a third step of analyzing the association between the related keyword candidates and the specific keyword using a modified association rule that applies a concentration, which measures how intensively a related keyword candidate appears in the same site group, and a confidence, which measures its appearance across the documents of the site groups; a fourth step of analyzing the association between the related keyword candidates and the specific keyword using cosine similarity, which analyzes the content similarity between site groups; and a fifth step of extracting a related keyword group by taking the intersection of the related keyword candidates analyzed by the modified association rule and by the cosine similarity, respectively.

Another object of the present invention is achieved by a method for extracting a related keyword group comprising: a first step of forming a site group by grouping the sites retrieved by a specific keyword; a second step of extracting important keywords from the site group to generate related keyword candidates, using a modified TFIDF calculation that raises the importance of keywords appearing in documents of the same site group and lowers the importance of keywords appearing in multiple site groups; a third step of analyzing the association between the related keyword candidates and the specific keyword using a modified association rule that applies a concentration, which measures how intensively a related keyword candidate appears in the same site group, and a confidence, which measures its appearance across the documents of the site groups; a fourth step of analyzing the association between the related keyword candidates and the specific keyword using cosine similarity, which analyzes the content similarity between site groups; and a fifth step of extracting a related keyword group by taking the union of the related keyword candidates analyzed by the modified association rule and by the cosine similarity, respectively.

Yet another object of the present invention is achieved by a method for extracting a related keyword group comprising: a first step of forming a site group by grouping the sites retrieved by a specific keyword; a second step of extracting important keywords from the site group to generate related keyword candidates, using a modified TFIDF calculation that raises the importance of keywords appearing in documents of the same site group and lowers the importance of keywords appearing in multiple site groups; a third step of analyzing the association between the related keyword candidates and the specific keyword using a modified association rule that applies a concentration, which measures how intensively a related keyword candidate appears in the same site group, and a confidence, which measures its appearance across the documents of the site groups; and a fourth step of extracting a related keyword group using the related keyword candidates analyzed by the modified association rule.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Before that, the terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings; they should be interpreted with the meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way.

Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiments of the present invention and do not represent all of its technical idea; it should be understood that various equivalents and modifications capable of replacing them may exist at the time of filing.

FIG. 1 is a flowchart of the method for extracting a related keyword group according to the present invention.

First, the sites retrieved by a specific keyword (S111) are grouped to form a site group (S112).

If a site group contains too few sites, it is hard to say that an association between keywords exists. Therefore, when the number of sites in a site group is smaller than the minimum number of sites (N₁) entered by the user (S113), the site group is not generated (S114).
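Steps S111 to S114 can be sketched as follows. This is an illustrative outline only: the function name `build_site_groups`, the dictionary input format, the deduplication of sites with a set, and the example domains are assumptions, and the search itself (S111) is represented by precomputed results.

```python
def build_site_groups(search_results, min_sites):
    """Form one site group per keyword, skipping groups that are too small.

    search_results: {keyword: iterable of site identifiers returned by a search}.
    min_sites: the user-supplied minimum number of sites (N1 in the text).
    """
    groups = {}
    for keyword, sites in search_results.items():
        unique_sites = set(sites)            # assumption: count each site once
        if len(unique_sites) < min_sites:    # S113: fewer sites than N1
            continue                         # S114: do not generate the group
        groups[keyword] = unique_sites       # S112: site group formed
    return groups

# Hypothetical search results keyed by the specific keyword.
results = {
    "loan": ["bank-a.example", "bank-b.example", "bank-c.example"],
    "yacht loan": ["bank-a.example"],
}
groups = build_site_groups(results, min_sites=2)
```

Here "yacht loan" is dropped because its single site falls below the minimum of two.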

Important keywords representing each generated site group are extracted (S115). The important keywords representing each site group become related keyword candidates, and the representative keyword that defines a site group also becomes a related keyword candidate.

An important keyword in a related keyword group is one that appears often within a site and in multiple documents of the same site group, while appearing in only a small number of site groups. Conventional TFIDF, however, has no concept of a site group; it treats as important a keyword that appears often in one site and rarely in other sites, and so lowers the importance of keywords that appear evenly across the same site group.

To overcome this problem, the modified TFIDF devised in the present invention is used to extract the important keywords that become related keyword candidates.

Equation 5 is the modified TFIDF formula.

[Equation 5]

T : an arbitrary site
G : the site group to which T belongs
d_i : a keyword appearing in T
v_i : the importance of d_i in T
tf(d_i) : the frequency with which d_i appears in T
tf_max : the largest tf value in T
df_G(d_i) : the number of sites in G in which d_i appears
n_G : the number of sites belonging to G
gf(d_i) : the number of site groups in which d_i appears
N_G : the total number of site groups

Equation 6 is the formula that calculates df in the modified TFIDF, and Equation 7 is the formula that calculates and reflects the group appearance frequency.

[Equation 6]

[Equation 7]

In Equation 6, n_G is the number of sites belonging to the site group G that contains the arbitrary site T, and df_G(d_i) is the number of sites in G in which the keyword d_i appears. As df_G(d_i) increases, the value of Equation 6 increases, which raises the importance of keywords that appear in multiple documents of the same site group.

In Equation 7, gf(d_i) is the number of site groups in which the keyword appears and N_G is the total number of site groups. As gf(d_i) increases, the value of Equation 7 decreases, which lowers the importance of keywords that appear in many site groups.
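The images for Equations 5 to 7 are missing from this text, so their exact forms are unknown. The sketch below only illustrates the described behavior under loudly stated assumptions: the in-group factor is taken as df_G(d_i) / n_G (grows with df_G), the cross-group factor as log(N_G / gf(d_i)) (shrinks as gf grows), and the score as their product with tf(d_i) / tf_max. These forms, the function name `modified_tfidf`, and the sample data are hypothetical stand-ins, not the patent's formulas.

```python
import math
from collections import Counter

def modified_tfidf(site_groups, group, site):
    """Score the keywords of one site T inside its site group G.

    Assumed placeholder form (the real Equations 5-7 are not available):
        v_i = tf(d_i)/tf_max * df_G(d_i)/n_G * log(N_G/gf(d_i))
    site_groups: {group name: {site name: list of keywords}}.
    """
    members = site_groups[group]
    tf = Counter(members[site])
    tf_max = max(tf.values())
    n_g = len(members)               # n_G: sites in G
    total_groups = len(site_groups)  # N_G: all site groups
    scores = {}
    for term, freq in tf.items():
        # df_G(d_i): sites of G containing the keyword
        df_g = sum(1 for toks in members.values() if term in toks)
        # gf(d_i): site groups in which the keyword appears at all
        gf = sum(1 for sites in site_groups.values()
                 if any(term in toks for toks in sites.values()))
        scores[term] = (freq / tf_max) * (df_g / n_g) * math.log(total_groups / gf)
    return scores

# Hypothetical site groups for three specific keywords.
site_groups = {
    "loan": {
        "bank-a": ["loan", "rate", "rate", "credit"],
        "bank-b": ["loan", "rate", "mortgage"],
    },
    "travel": {
        "tour-a": ["flight", "hotel", "rate"],
        "tour-b": ["flight", "visa"],
    },
    "wedding": {
        "hall-a": ["hall", "dress"],
    },
}
scores = modified_tfidf(site_groups, "loan", "bank-a")
```

In this example "rate" is the most frequent keyword on the site, yet "loan" scores higher because "rate" also appears in the "travel" group, while "credit" scores lowest because it appears on only one site of the group.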

Because the purpose of generating related keyword groups is to sell them in a market, their sales potential must be assessed. Even if keywords are highly associated, a keyword that ordinary users do not search for often gives advertisements few chances of exposure. Related keyword candidates are therefore filtered using the number of exposures in the search engine (N₂) (S116).

Keywords whose exposure count falls below N₂ are removed from the related keyword candidates (S117), and the related keyword candidates are generated from the remaining important keywords (S118).
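The exposure filter of steps S116 to S118 can be sketched as a simple threshold test. The function name `filter_by_exposure`, the treatment of keywords with no recorded count as zero, and the sample counts are assumptions for illustration.

```python
def filter_by_exposure(candidates, exposure_counts, min_exposures):
    """Keep only candidates whose search-engine exposure count reaches the threshold.

    min_exposures plays the role of N2; keywords missing from exposure_counts
    are treated as having zero exposures (an assumption of this sketch).
    """
    return [kw for kw in candidates
            if exposure_counts.get(kw, 0) >= min_exposures]

# Hypothetical candidates and monthly exposure counts.
candidates = ["mortgage", "credit loan", "loan calculator"]
exposure_counts = {"mortgage": 5400, "credit loan": 320, "loan calculator": 40}
kept = filter_by_exposure(candidates, exposure_counts, min_exposures=100)
```

With a threshold of 100, "loan calculator" is removed and the other two keywords remain as related keyword candidates.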

Then, using the modified association rule devised in the present invention, the association between the generated related keyword candidates and the specific keyword is analyzed (S119). In this analysis, the list of important keywords of each site belonging to the site group of a specific keyword, for which the user wants to generate a related keyword group, forms one transaction. To shorten mining time, and unlike conventional association rules, which analyze associations among subsets of itemsets, both the antecedent and the consequent are single items: the antecedent is the specific keyword and the consequent is a related keyword candidate.

In the modified association rule according to the present invention, the association between a specific keyword and a related keyword candidate is regarded as high when the candidate appears evenly across the documents of the site group for the specific keyword while appearing in only a small number of site groups. Accordingly, the confidence of conventional association rule mining is used as the measure of how many sites in the site group for the specific keyword the candidate appears in, and a new concept, concentration, is introduced as the measure of how many site groups the candidate appears in. An association degree, the product of confidence and concentration, is used as the measure of association between the specific keyword and the candidate.

Equations 8 and 9 are the formulas for confidence and concentration, respectively.

[Equation 8]

d : the specific keyword
r : a related keyword candidate
df_G(r) : the number of sites in G in which r appears
n_G : the number of sites belonging to G

[Equation 9]

d : the specific keyword
r : a related keyword candidate
G : the site group for d
df_G(r) : the number of sites in G in which r appears
df_All(r) : the number of sites, among all sites, in which r appears
gf(r) : the number of site groups in which r appears
N_G : the total number of site groups

In Equation 8, n_G is the number of sites belonging to the site group G for the specific keyword d, and df_G(r) is the number of sites in G in which the related keyword candidate r appears. Thus, as df_G(r) increases, the confidence of the rule R : d→r increases, which means that r appears in many documents of site group G.

In Equation 9, df_All(r) is the number of sites among all sites in which r appears, gf(r) is the number of site groups in which r appears, and N_G is the total number of site groups. Thus, as df_All(r) and gf(r) decrease, the concentration of R : d→r increases, which means that r appears intensively in site group G.

Here df_G(r) is always less than or equal to n_G and less than or equal to df_All(r), and gf(r) is always less than or equal to N_G. Confidence and concentration therefore always take values between 0 and 1, and the association degree, their product, also takes a value between 0 and 1.
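The images for Equations 8 and 9 are not available. In the sketch below, confidence is taken as df_G(r) / n_G, which matches the description of Equation 8. The concentration formula is a hypothetical stand-in chosen only to satisfy the stated properties (it grows as df_All(r) and gf(r) shrink and stays between 0 and 1): (df_G(r) / df_All(r)) * ((N_G - gf(r) + 1) / N_G). The function name `association_degree` and the sample transactions are likewise assumptions.

```python
def association_degree(site_groups, d, r):
    """Confidence, concentration, and their product for the rule d -> r.

    site_groups: {specific keyword: list of per-site keyword sets}, where each
    set is one transaction (the important keywords of one site).
    confidence = df_G(r) / n_G, following the description of Equation 8.
    concentration is a hypothetical stand-in for Equation 9:
        (df_G(r) / df_All(r)) * ((N_G - gf(r) + 1) / N_G)
    """
    group = site_groups[d]
    n_g = len(group)                                    # n_G
    df_g = sum(1 for site in group if r in site)        # df_G(r)
    df_all = sum(1 for sites in site_groups.values()    # df_All(r)
                 for site in sites if r in site)
    gf = sum(1 for sites in site_groups.values()        # gf(r)
             if any(r in site for site in sites))
    total_groups = len(site_groups)                     # N_G
    confidence = df_g / n_g
    if df_all == 0:                                     # r never appears anywhere
        return confidence, 0.0, 0.0
    concentration = (df_g / df_all) * ((total_groups - gf + 1) / total_groups)
    return confidence, concentration, confidence * concentration

# Hypothetical transactions: one keyword set per site, grouped by specific keyword.
site_groups = {
    "loan": [{"loan", "mortgage", "rate"}, {"loan", "mortgage"}, {"loan", "rate"}],
    "travel": [{"travel", "hotel", "rate"}, {"travel", "flight"}],
}
conf_m, conc_m, deg_m = association_degree(site_groups, "loan", "mortgage")
conf_r, conc_r, deg_r = association_degree(site_groups, "loan", "rate")
```

Both candidates appear on two of the three "loan" sites, so their confidence is equal; "mortgage" gets the higher association degree because it appears only in the "loan" group, whereas "rate" is spread across both groups.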

Association analysis by the modified association rule extracts the keywords that commonly appear in the sites retrieved by a specific keyword. Keywords that do not appear in the site group therefore cannot belong to the related keyword group. If keyword B appears rarely or not at all in the site group for keyword A, keyword B cannot become a related keyword for keyword A. However, if the contents of the sites making up the site groups for keywords A and B are similar, the two keywords ought to be related keywords.

The present invention therefore uses cosine similarity to analyze the association between the contents of the sites that make up each site group (S120). In other words, whereas association analysis by the modified association rule analyzes the association between a specific keyword and the keywords that frequently appear in the sites retrieved by that keyword, association analysis by cosine similarity analyzes the content similarity between site groups.
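Cosine similarity itself is standard: the dot product of two vectors divided by the product of their lengths. The text does not say how each site group is turned into a vector, so the sketch below assumes, purely for illustration, that each group is represented by a sparse dictionary of keyword weights; the function name and the weights are hypothetical.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse keyword-weight vectors (dicts)."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:   # an empty vector is similar to nothing
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical keyword-weight vectors, one per site group.
group_vectors = {
    "loan": {"rate": 0.8, "mortgage": 0.6},
    "credit": {"rate": 0.6, "mortgage": 0.8},
    "travel": {"hotel": 1.0},
}
sim_loan_credit = cosine_similarity(group_vectors["loan"], group_vectors["credit"])
sim_loan_travel = cosine_similarity(group_vectors["loan"], group_vectors["travel"])
```

Under this representation the "loan" and "credit" groups come out as similar even though neither keyword needs to appear in the other's sites, which is the gap the cosine analysis is meant to cover.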

The final related keyword group is generated (S122) by combining the related keyword candidates produced by the modified association rule with those produced by cosine similarity (S121). The candidates can be combined in three ways: taking the intersection of the two candidate sets, taking their union, or using only the candidates produced by the modified association rule.
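The three combination options map directly onto set operations. The sketch below is illustrative; the function name `combine_candidates`, the `mode` strings, and the sample keywords are assumptions.

```python
def combine_candidates(rule_based, cosine_based, mode="intersection"):
    """Combine the two candidate sets into the final related keyword group."""
    rule_based, cosine_based = set(rule_based), set(cosine_based)
    if mode == "intersection":   # keywords supported by both analyses
        return rule_based & cosine_based
    if mode == "union":          # keywords supported by either analysis
        return rule_based | cosine_based
    if mode == "rule_only":      # skip the cosine similarity analysis
        return rule_based
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical candidate sets for the specific keyword "loan".
rule_based = {"mortgage", "credit loan", "interest rate"}
cosine_based = {"mortgage", "savings", "interest rate"}
both = combine_candidates(rule_based, cosine_based, "intersection")
either = combine_candidates(rule_based, cosine_based, "union")
rule_only = combine_candidates(rule_based, cosine_based, "rule_only")
```

The intersection yields the smallest group and the union the largest, so the choice trades precision against the number of keywords available.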

The following is an embodiment implementing the method for extracting a related keyword group according to the present invention. It was implemented in Java on the Windows platform of an IBM-compatible machine with a 1.7 GHz Pentium 4 CPU and 512 MB of memory. The experimental data drew on the sales status of keywords purchased over one month from the search engine of the search site 'Naver'. The specific keywords were marriage, loan, moving, travel agency, and start-up; the minimum exposure count was 100, the weight of the exposure count was 0.7, and the weight of the sales count was 0.3.

도 2는 본 발명에 따른 연관 키워드 그룹 추출 방법의 일실시예로, 두 방법의 교집합을 이용하여 생성된 연관 키워드 그룹이다. 수정된 연관규칙에 의한 방법과 코사인 유사도에 의한 방법은 각기 다른 의미의 연관성을 분석한다. 따라서 두 방법에 의한 연관 키워드 후보 간의 교집합을 이용하는 것이 가장 유의미하지만 이럴 경우 교집합에 속한 키워드가 많지 않다.2 is an embodiment of a method for extracting a related keyword group according to the present invention, which is a related keyword group generated using an intersection of two methods. The method by the modified association rule and the method by cosine similarity analyze the association of different meanings. Therefore, it is most meaningful to use the intersection between the candidate candidates of related keywords by the two methods, but in this case, there are not many keywords belonging to the intersection.

FIG. 3 shows another embodiment of the related keyword group extraction method according to the present invention: related keyword groups generated using the union of the two methods. Because a related keyword group generated by intersection contains few keywords, the union is used when many keywords are needed.

FIG. 4 shows another embodiment of the related keyword group extraction method according to the present invention: related keyword groups generated using only the modified association rule. Relationship analysis by cosine similarity can take a very long time, depending on the number of specific keywords for which related keyword groups are to be generated. Therefore, when many related keyword groups must be generated, the method omits the cosine similarity analysis and generates the related keyword groups from the modified association rule analysis alone.
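For reference, cosine similarity between two site groups can be sketched as follows. This uses the standard cosine formula and assumes, for illustration only, that each site group is represented as a keyword-to-weight mapping. The representation, the function name, and the sample weights are hypothetical and not taken from the patent.

```python
import math


def cosine_similarity(weights_a, weights_b):
    """Cosine similarity of two sparse keyword-weight vectors (dicts)."""
    shared = set(weights_a) & set(weights_b)
    dot = sum(weights_a[k] * weights_b[k] for k in shared)
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical site groups (illustration only).
group_a = {"honeymoon": 3.0, "hall": 4.0}
group_b = {"honeymoon": 3.0, "hall": 4.0}
group_c = {"loan": 1.0}

same = cosine_similarity(group_a, group_b)      # identical content
disjoint = cosine_similarity(group_a, group_c)  # no shared keywords
```

Each comparison touches every keyword in both groups, so the total cost grows with the number of site-group pairs to be compared, which is consistent with the long running time noted above.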

Although the present invention has been shown and described with reference to preferred embodiments, it is not limited to those embodiments. Those of ordinary skill in the art to which the invention pertains may make various changes and modifications without departing from the spirit of the invention.

Accordingly, by using the modified TFIDF, the related keyword group extraction method of the present invention can increase the importance of keywords that appear in multiple documents of the same site group.

In addition, by using the modified association rule, the method determines in how many site groups a related keyword candidate appears, which has the notable and advantageous effect of allowing the relationship between a specific keyword and its related keyword candidates to be analyzed efficiently when generating related keyword groups.

Claims

A first step of forming a site group by grouping sites searched by a specific keyword;

A second step of extracting important keywords from the site group and generating related keyword candidates by using a term frequency calculation method with a modified inverse document frequency, which increases the importance of keywords appearing in documents of the same site group and reduces the importance of keywords appearing in multiple site groups;

A third step of analyzing the relationship between the related keyword candidates and the specific keyword by using a modified association rule that applies a concentration, which measures how intensively the related keyword candidates appear in the same site group, and a reliability, which measures their appearance in documents of multiple site groups;

A fourth step of analyzing the relationship between the related keyword candidates and the specific keyword by using cosine similarity, which analyzes content similarity between site groups; and

A fifth step of extracting a related keyword group by intersecting the related keyword candidates analyzed by the modified association rule and by the cosine similarity;

A related keyword group extraction method comprising the above steps.

A first step of grouping sites searched by a specific keyword to form a site group;

A fifth step of extracting a related keyword group by combining the related keyword candidates analyzed by the modified association rule and by the cosine similarity;

A related keyword group extraction method comprising the above steps.

A second step of extracting important keywords from the site group and generating related keyword candidates by using a term frequency calculation method with a modified inverse document frequency, which increases the importance of keywords appearing in documents of the same site group and reduces the importance of keywords appearing in multiple site groups;

A third step of analyzing the relationship between the related keyword candidates and the specific keyword by using a modified association rule that applies a concentration, which measures how intensively the related keyword candidates appear in the same site group, and a reliability, which measures their appearance in documents of multiple site groups; and

A fourth step of extracting a related keyword group by using the related keyword candidates analyzed by the modified association rule;

A related keyword group extraction method comprising the above steps.

The method of claim 1, wherein, in the first step, the site group is not formed when the number of searched sites is small.

The method according to any one of claims 1 to 3, wherein, in the second step, an extracted keyword is deleted if it has been exposed fewer than a predetermined number of times.

The method according to any one of claims 1 to 3, wherein, in the second step, the term frequency calculation method with the modified inverse document frequency is calculated using the following equation.

[Equation]

[T: an arbitrary site; tf(d_i): frequency with which d_i occurs in T; G: the site group to which T belongs; df_G(d_i): number of sites in G in which d_i occurs; d_i: a keyword extracted from T; n_G: number of sites in G; v_i: importance of d_i in T; gf(d_i): number of site groups in which d_i appears; tf_max: the largest tf value in T; N_G: total number of site groups]

The method according to any one of claims 1 to 3, wherein, in the third step, the modified association rule uses the concept of concentration, and the concentration is calculated using the following equation.

[Equation]

[df_G(r): number of sites in G in which r appears; df_all(r): number of sites among all sites in which r appears; r: a related keyword candidate; G: the site group for d; d: the specific keyword; gf(r): number of site groups in which r appears; N_G: total number of site groups]

The method according to any one of claims 1 to 3, wherein, in the third step, the modified association rule uses the concept of reliability, and the reliability is calculated using the following equation.

[Equation]

[df_G(r): number of sites in G in which r appears; n_G: number of sites belonging to G]