KR20110023304A

KR20110023304A - Method and system of configuring user profile based on a concept network and personalized query expansion system using the same

Info

Publication number: KR20110023304A
Application number: KR1020090081082A
Authority: KR
Inventors: 김한준; 이성직; 이병정
Original assignee: 서울시립대학교 산학협력단
Priority date: 2009-08-31
Filing date: 2009-08-31
Publication date: 2011-03-08
Also published as: KR101140724B1

Abstract

PURPOSE: A method and system for configuring a user profile based on a concept network, and a personalized query expansion system using the same, are provided to recommend personalized query terms by comparing the words a user queries with the user profile. CONSTITUTION: A keyword extractor (100) extracts keywords from the documents a user has browsed. A session interest creation module (210) generates a session interest from the keywords and builds a user profile by accumulating session interests. A comparison module (220) compares each generated session interest with the concepts of the user profile. A user profile update module (230) merges the session interest with a concept of the user profile when the two are identical or similar, and adds it as a new concept otherwise.

Description

METHOD AND SYSTEM OF CONFIGURING USER PROFILE BASED ON A CONCEPT NETWORK AND PERSONALIZED QUERY EXPANSION SYSTEM USING THE SAME

The present invention relates to a method and system for configuring a user profile based on a concept network and to a personalized query expansion system using the same, and more particularly to a method and system for configuring a concept-network-based user profile for personalized search and a personalized query expansion system using the same.

In general, an information retrieval system is a system that quickly finds information matching a request in a database and provides it to the requester. The database holds information or data that information consumers are expected to need, collected and processed in advance and stored in an easily searchable form.

Conventionally, web portal sites have been built on such information retrieval systems. A web portal site uses the information retrieval system to collect and process information from websites over the Internet, and provides the retrieved information to a user terminal such as a computer or a wireless terminal.

Although commercial search engines perform well, they tend to return too many results. This is because a search engine is fundamentally a query-based system that compares the similarity between a query and documents. Because a single word can have several meanings and each user's intent may differ, accurate results cannot be guaranteed.

To solve this problem, many methods have been studied, such as query expansion and re-ranking of search results according to user preferences.

Query expansion improves the quality of search results by changing or extending the user's query with better words. It requires a dictionary in which words related to the query can be looked up, and building and maintaining such a dictionary is very costly.

The present invention addresses this problem. An object of the present invention is to provide a method of configuring a concept-network-based user profile for personalized search, using the association between the words a user queries and the keywords extracted from the web pages the user visits.

Another object of the present invention is to provide a concept-network-based user profile configuration system for performing the above user profile configuration method.

Still another object of the present invention is to provide a personalized query expansion system using the above concept-network-based user profile.

To achieve the above object, a method of configuring a concept-network-based user profile according to an embodiment includes: extracting keywords from the documents a user browsed while using a search engine, in order to create a user profile; generating a session interest from the extracted keywords and creating the user profile by accumulating the generated session interests; comparing, each time a new session interest is generated, the generated session interest with the concepts of the user profile; and merging the session interest with a concept of the user profile when the two are identical or similar, or adding the session interest as a new concept of the user profile when they differ.

In an embodiment, the similarity between the session interest and a concept of the user profile may be compared using the web directory of the Open Directory Project (ODP).

Here, comparing the session interest with a concept of the user profile may include: representing each of the session interest and the concept as a term vector; converting the term vector of the session interest and the term vector of the concept into a first vector and a second vector, respectively, whose dimensions are ODP categories; and comparing the first vector with the second vector using cosine similarity.
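The comparison just described can be sketched as follows. This is a minimal illustration, not the patented implementation: the `TERM_TO_CATEGORIES` lookup, the category names, and the rule of summing term weights per category are all assumptions made for the example, since the text does not specify how a term vector is mapped onto ODP categories.

```python
import math
from collections import defaultdict

# Hypothetical term-to-category lookup; a real system would derive it from ODP data.
TERM_TO_CATEGORIES = {
    "movie": ["Arts/Movies"],
    "thriller": ["Arts/Movies", "Arts/Literature"],
    "goal": ["Sports/Soccer"],
    "league": ["Sports/Soccer"],
}

def to_category_vector(term_vector):
    """Project a {term: weight} vector onto categories by summing term weights (assumed rule)."""
    categories = defaultdict(float)
    for term, weight in term_vector.items():
        for category in TERM_TO_CATEGORIES.get(term, []):
            categories[category] += weight
    return dict(categories)

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts; 0.0 if either is empty."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

session_interest = {"movie": 0.6, "thriller": 0.4}
profile_concept = {"thriller": 1.0}
unrelated_concept = {"goal": 0.7, "league": 0.3}

first = to_category_vector(session_interest)    # "first vector"
second = to_category_vector(profile_concept)    # "second vector"
third = to_category_vector(unrelated_concept)

print(cosine_similarity(first, second))
print(cosine_similarity(first, third))
```

Comparing in category space rather than term space lets two vectors that share few literal terms still score as similar when their terms fall under the same categories.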

In an embodiment, extracting the keywords may include: extracting words from each web document based on their TF-IDF weights; storing the words extracted from each document in a single table and counting their occurrences; and constructing the session interest using only the words whose occurrence count is at or above a given threshold.

To achieve another object of the present invention, a system for configuring a concept-network-based user profile according to an embodiment includes: a keyword extractor that extracts keywords from the documents a user browsed while using a search engine, in order to create a user profile; a session interest creation module that generates a session interest from the extracted keywords and creates the user profile by accumulating the generated session interests; a comparison module that compares, each time a new session interest is generated, the generated session interest with the concepts of the user profile; and a user profile update module that merges the session interest with a concept of the user profile when the two are identical or similar, and adds the session interest as a new concept of the user profile when they differ.

In an embodiment, the comparison module may compare the similarity between the session interest and a concept of the user profile using the web directory of the Open Directory Project (ODP).

Here, the comparison module may represent each of the session interest and the concept of the user profile as a term vector, convert the term vector of the session interest into a first vector whose dimensions are ODP categories, convert the term vector of the concept into a second vector whose dimensions are ODP categories, and then compare the first vector with the second vector using cosine similarity.

In an embodiment, the keyword extractor may extract words from each web document based on their TF-IDF weights, store the words extracted from each document in a single table, count their occurrences, and construct the session interest using only the words whose occurrence count is at or above a given threshold.

In an embodiment, the keyword extractor may include: a web document collection module that submits a query to a search engine and analyzes and stores the resulting web pages; a single noun extraction module that extracts single nouns from the stored web pages; a TF-IDF weight calculation module that calculates TF-IDF weights; and a table term frequency calculation module that calculates the table term frequency from the TF-IDF weights and selects keywords.

In one example, the table term frequency calculation module may extract the top specified percentage of words based on the number of words in a document, calculate the table term frequency, and select keywords.

In another example, the table term frequency calculation module may extract the top specified percentage of words based on the maximum TF-IDF weight in a document, calculate the table term frequency, and select keywords.

To achieve still another object of the present invention, a personalized query expansion system according to an embodiment includes: a concept network; a query expansion module that, when a query is received from a client unit, looks up the concept network for the received query to obtain expanded queries and provides the expanded queries to the client unit; a search module that, when the client unit selects a query, requests a search engine to search for documents corresponding to the selected query and provides the resulting documents to the client unit; and a user concept network management module that, when a session interest is received from the client unit, compares the session interest with the concepts of the user profile stored in the concept network, merges the session interest with a concept and stores the result in the concept network when the two are identical or similar, and adds the session interest as a new concept and stores it in the concept network when they differ.
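As a rough illustration of the query expansion module's lookup, the sketch below treats the concept network as a list of weighted term sets and expands a query with the terms that share a concept with it. The data layout, the `expand_query` function, its `max_terms` parameter, and the ranking by weight are assumptions for illustration only; the text does not specify how the concept network is stored or queried.

```python
def expand_query(query, concept_network, max_terms=3):
    """Return expanded queries built from terms that share a concept with `query`.

    `concept_network` is assumed to be a list of {term: weight} dicts,
    one per concept of the user profile.
    """
    candidates = {}
    for concept in concept_network:
        if query not in concept:
            continue
        for term, weight in concept.items():
            if term != query:
                # Keep the highest weight seen for each candidate term.
                candidates[term] = max(weight, candidates.get(term, 0.0))
    # Highest weight first; ties broken alphabetically for a stable order.
    ranked = sorted(candidates, key=lambda term: (-candidates[term], term))
    return [f"{query} {term}" for term in ranked[:max_terms]]

concept_network = [
    {"jaguar": 0.5, "car": 0.3, "engine": 0.2},
    {"soccer": 0.6, "league": 0.4},
]
expanded = expand_query("jaguar", concept_network)
print(expanded)
```

Because the concepts come from pages a particular user visited, the same query can expand differently for different users.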

In an embodiment, the client unit may include: a query input module that receives user input, extracts keywords from browsed web pages, and provides them to the query expansion module; a viewer module that displays the documents provided by the search module; and a keyword extraction module that extracts keywords from a document of interest selected by the user and provides the extracted keywords to the user concept network management module.

In an embodiment, the keyword extraction module may extract words from each web document based on their TF-IDF weights, store the words extracted from each document in a single table, count their occurrences, and construct the session interest using only the words whose occurrence count is at or above a given threshold.

In an embodiment, the keyword extraction module may include: a web document collection module that submits a query to a search engine and analyzes and stores the resulting web pages; a single noun extraction module that extracts single nouns from the stored web pages; a TF-IDF weight calculation module that calculates TF-IDF weights; and a table term frequency calculation module that calculates the table term frequency from the TF-IDF weights and selects keywords.

With this concept-network-based user profile configuration method and system and the personalized query expansion system using the same, the words a user queries can be compared with the user profile and personalized query terms recommended, which can be used to personalize search.

Hereinafter, the present invention is described in more detail with reference to the accompanying drawings. Because the present invention can be modified in various ways and can take various forms, specific embodiments are illustrated in the drawings and described in detail in the text. This is not intended to limit the present invention to the specific disclosed forms; it should be understood to include all modifications, equivalents, and substitutes that fall within the spirit and technical scope of the present invention.

In describing the drawings, like reference numerals are used for like elements. In the accompanying drawings, the dimensions of structures are shown enlarged relative to their actual size for clarity.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be called a second component, and similarly a second component may be called a first component. Singular expressions include plural expressions unless the context clearly indicates otherwise.

In this application, terms such as "include" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in this application.

The present invention proposes a method of creating a user profile from the web documents a user browses after using a search engine, and using the profile as a dictionary for query expansion. Because the web documents visited by the user are used, a different profile is built for each user, and using it for query expansion allows individual intent to be reflected in the search results.

To build the user profile, keywords are extracted from the web pages the user visited after using the search engine, using the table term frequency (TTF) technique.

The table term frequency technique extracts keywords from an entire document set, and uses term frequency-inverse document frequency (hereinafter TF-IDF) weights.

A TF-IDF weight is a statistical measure of how important a word is within a particular document that belongs to a document set.

TF (term frequency) indicates how often a particular word appears in a document; the higher the TF value, the more important the word can be considered in that document. However, if the word is used frequently throughout the document collection, the word is simply common. This is measured by DF (document frequency), and the inverse of the DF value is called IDF (inverse document frequency). TF-IDF is the product of TF and IDF. The IDF value depends on the nature of the document collection. For example, the word "atom" rarely appears in general documents, so its IDF value is high and it can be a key word of a document. In a collection of documents about atoms, however, the word becomes commonplace, and other words that distinguish the documents more finely receive higher weights.

It is assumed that these keywords and the word the user queried point to the same concept, so the keywords can be used for query expansion.

Below, a keyword extraction method using the table term frequency technique is described first, followed by the basic model for building a personal profile.

<Keyword Extraction from Search Results>

To find words strongly associated with the word a user queried, the present invention extracts keywords from the web pages the user visited after searching for that word in a search engine. The table term frequency is used to extract these keywords. Given a document set, the important words of each document are first extracted at a fixed ratio using TF-IDF weights. The important words extracted from each document are placed in a single table, duplicates allowed, and their occurrences are counted again. This occurrence count is the table term frequency, and words with a high count are regarded as keywords of the entire document set.
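The table term frequency procedure can be sketched as below. The input weights, the `top_ratio` and `min_count` values, and the function names are illustrative assumptions; the text only states that a fixed ratio of important words is taken per document and that words with a high count become keywords.

```python
import math
from collections import Counter

def table_term_frequency(doc_weights, top_ratio=0.5):
    """Pool each document's top-weighted words into one table and count them.

    `doc_weights` is a list of {word: tf_idf_weight} dicts, one per document.
    """
    table = []  # pooled important words; duplicates across documents are kept
    for weights in doc_weights:
        ranked = sorted(weights, key=lambda word: (-weights[word], word))
        keep = math.ceil(len(ranked) * top_ratio)
        table.extend(ranked[:keep])
    return Counter(table)

def select_keywords(ttf, min_count=2):
    """Keep the words whose table term frequency reaches the threshold."""
    return sorted(word for word, count in ttf.items() if count >= min_count)

# Hypothetical TF-IDF weights for three pages visited in one search session.
doc_weights = [
    {"jaguar": 0.5, "engine": 0.3, "dealer": 0.2, "price": 0.1},
    {"jaguar": 0.6, "engine": 0.2, "review": 0.15, "photo": 0.05},
    {"jaguar": 0.4, "speed": 0.35, "engine": 0.2, "forum": 0.05},
]
ttf = table_term_frequency(doc_weights)
keywords = select_keywords(ttf)
print(ttf, keywords)
```

Counting across documents after the per-document cut is what turns a within-document measure (TF-IDF) into a measure of importance for the whole set.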

Below, the method of obtaining the TF-IDF weights used in the present invention and the method of obtaining the table term frequency are described in turn.

<TF-IDF Weights>

The TF-IDF weight is a weight used in information retrieval and text mining. Given a collection of documents, it is a statistical value indicating how important a word is within a particular document. It can be used to extract the key words of a document, to rank search results in a search engine, or to measure the similarity between documents.

A word with a relatively large TF-IDF weight can be regarded as more important. In the present invention, the TF-IDF weight of word i appearing in document j is calculated by Equation 1 below.

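$$ \mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_{i} \qquad \text{(Equation 1)} $$

Here, $\mathrm{tf}_{i,j}$ is defined by Equation 2 below.

$$ \mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad \text{(Equation 2)} $$

Here, $n_{i,j}$ is the number of times word i appears in document j, and $\sum_{k} n_{k,j}$ is the total number of occurrences of all words in document $d_j$.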

Equation 1 calculates the TF-IDF weight of word i appearing in document j as the product of the TF value and the IDF value.

The TF value expresses the idea that a word occurring more frequently within a document is more important. As in Equation 2, it is normalized by dividing the number of occurrences of the word by the total number of occurrences of all words. The IDF value expresses the idea that a word appearing in fewer documents is more important, and is calculated using Equation 3 below.

$$ \mathrm{idf}_{i} = \log \frac{|D|}{|\{ d_j : t_i \in d_j \}|} \qquad \text{(Equation 3)} $$

Here, $|D|$ is the number of documents in the document set, and $|\{ d_j : t_i \in d_j \}|$ is the number of documents in which word $t_i$ appears.
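A minimal sketch of the weight calculation follows: TF is a word's count divided by the total word count of the document, and IDF is derived from the number of documents in the set divided by the number of documents containing the word. The natural logarithm in the IDF term is an assumption (the usual TF-IDF form); the text itself only describes IDF as the inverse of document frequency. The function name and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document.

    `documents` is a list of token lists, one list per visited page.
    """
    doc_count = len(documents)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())
        weights.append({
            # tf = n / total; idf = log(|D| / document frequency) (log assumed)
            word: (n / total) * math.log(doc_count / doc_freq[word])
            for word, n in counts.items()
        })
    return weights

docs = [
    ["movie", "thriller", "movie", "release"],
    ["movie", "korea", "thriller"],
    ["soccer", "league", "korea"],
]
weights = tf_idf(docs)
print(weights[0])
```

With the logarithm, a word that appears in every document of the set receives a weight of zero, which matches the remark that a word common to the whole collection stops being distinctive.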

In the present invention, each web page the user visited after submitting one query to a search engine is treated as an individual document, and these documents form one document set. After the TF-IDF weights are calculated for this document set, the table term frequency (TTF) described next is applied.

<Table Term Frequency>

The TF-IDF weights described above show which words are more important within a document. However, because a TF-IDF weight does not indicate importance across the entire document set, the table term frequency (TTF) is used to find the most important words in a given document set.

FIG. 1A is a block diagram illustrating a concept-network-based user profile configuration system according to an embodiment of the present invention. FIG. 1B is a block diagram illustrating an example of the keyword extractor shown in FIG. 1A.

Referring to FIGS. 1A and 1B, the concept-network-based user profile configuration system according to an embodiment of the present invention includes a keyword extractor (100) and a user profile generator (200).

The keyword extractor (100) includes a web document collection module (110), a single noun extraction module (120), a TF-IDF weight calculation module (130), and a table term frequency calculation module (140). In this embodiment the keyword extractor (100) is divided into these four modules, but the division is logical or functional only, not a division in hardware.

The web document collection module (110) submits a query to a search engine, analyzes the resulting web pages, and stores only the result pages selected by the user.

The single noun extraction module (120) extracts single nouns from the stored web pages.

The TF-IDF weight calculation module (130) calculates TF-IDF weights. The calculation of the TF-IDF weights by the TF-IDF weight calculation module (130) is described later.

The table term frequency calculation module (140) calculates the table term frequency from the TF-IDF weights and selects keywords.

In one example, the table term frequency may be calculated by extracting the top specified percentage of words based on the number of words in a document. Keywords are then selected based on the calculated table term frequency.

In another example, the table term frequency may be calculated by extracting the top specified percentage of words based on the maximum TF-IDF weight in a document. Keywords are then selected based on the calculated table term frequency. The calculation performed by the table term frequency calculation module (140) is described later.
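The two per-document selection examples can be contrasted with a short sketch. Both functions and the `ratio` parameter are hypothetical, and the second reflects only one possible reading of the text: it keeps the words whose weight is at least `ratio` times the largest weight in the document.

```python
import math

def top_by_word_count(weights, ratio):
    """First example: keep the top `ratio` share of the document's words, ranked by weight."""
    ranked = sorted(weights, key=lambda word: (-weights[word], word))
    keep = max(1, math.ceil(len(ranked) * ratio)) if ranked else 0
    return ranked[:keep]

def top_by_max_weight(weights, ratio):
    """Second example (assumed reading): keep words weighing at least `ratio` times the maximum."""
    if not weights:
        return []
    cutoff = max(weights.values()) * ratio
    selected = [word for word in weights if weights[word] >= cutoff]
    return sorted(selected, key=lambda word: (-weights[word], word))

# Hypothetical TF-IDF weights for a single document.
doc = {"movie": 0.5, "thriller": 0.4, "release": 0.1}

print(top_by_word_count(doc, 0.3))
print(top_by_max_weight(doc, 0.3))
```

The count-based rule always returns a number of words proportional to the document's vocabulary, while the weight-based rule returns more words when several weights are close to the maximum and fewer when one word dominates.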

사용자 프로파일 작성부(200)는 세션 인터레스트 작성모듈(210), 비교모듈(220) 및 사용자 프로파일 갱신모듈(230)을 포함한다. 본 실시예에서는 사용자 프로파일 작성부(200)를 세션 인터레스트 작성모듈(210), 비교모듈(220) 및 사용자 프로파일 갱신모듈(230)로 구분하였으나, 이는 논리적으로 또는 기능적으로 구분하였을 뿐 하드웨어적으로 구분한 것은 아니다. The user profile generator 200 includes a session interest creation module 210, a comparison module 220, and a user profile update module 230. In this embodiment, the user profile creation unit 200 is divided into a session interest creation module 210, a comparison module 220, and a user profile update module 230, but this is logically or functionally divided into hardware. There is no distinction.

세션 인터레스트 작성모듈(210)은 키워드 추출부(100)에 의해 추출된 키워드들을 이용하여 세션 인터레스트를 생성하고, 생성된 세션 인터레스트를 누적하여 사용자 프로파일을 작성한다. The session interest creation module 210 generates a session interest using the keywords extracted by the keyword extraction unit 100, and accumulates the generated session interest to create a user profile.

비교모듈(220)은 새로운 세션 인터레스트가 생성될 때마다 생성된 세션 인터레스트와 사용자 프로파일의 개념을 비교한다. 상기 비교모듈(220)은 오픈 디렉토리 프로젝트(ODP)의 웹 디렉토리를 사용하여 상기 세션 인터레스트와 상기 사용자 프로파일의 개념간의 유사도를 비교한다. 상기 비교모듈(220)은 상기 세션 인터레 스트와 상기 사용자 프로파일의 개념 각각을 텀 벡터로 표현하고, 상기 세션 인터레스트에 대응하는 텀 벡터를 상기 ODP의 카테고리를 차원으로 갖는 제1 벡터로 변경하며, 상기 사용자 프로파일의 개념에 대응하는 텀 벡터를 ODP의 카테고리로 차원으로 갖는 제2 벡터로 변경한 후, 코사인 유사도를 활용하여 상기 제1 벡터와 상기 제2 벡터를 비교한다. The comparison module 220 compares the generated session interest with the concept of the user profile whenever a new session interest is generated. The comparison module 220 compares the similarity between the concept of the session interest and the user profile using the web directory of the Open Directory Project (ODP). The comparison module 220 expresses each of the concepts of the session interest and the user profile as a term vector, and changes the term vector corresponding to the session interest into a first vector having a category of the ODP as a dimension. After changing the term vector corresponding to the concept of the user profile into a second vector having a dimension as a category of the ODP, the first vector and the second vector are compared using a cosine similarity.

Depending on the result from the comparison module (220), the user profile update module (230) acts in one of two ways. When the session interest is identical or similar to a concept of the user profile, it merges the two and stores the result in the concept network (300). When they differ, it adds the session interest as a new concept of the user profile and stores it in the concept network (300).

FIG. 2A is a flowchart illustrating a method of calculating the most-important-word occurrence frequency across all documents, based on the TF-IDF weights calculated for each document.

Referring to FIG. 2A, a TF-IDF weight is calculated for each document (step S110). Here, each web page the user visited after issuing a single query to the search engine is treated as a separate document.

FIG. 2B is a table showing an example of the words and TF-IDF weights corresponding to each document.

Referring to FIG. 2B, in document 1 the calculated TF-IDF weight of the word <movie> is 0.5, that of <thriller> is 0.4, and that of <release> is 0.1. In document 2 the calculated TF-IDF weight of <movie> is 0.6, that of <thriller> is 0.4, and that of <Korea> is 0.1.

The TF-IDF weights calculated in this way indicate which words are more important within each document.

Returning to FIG. 2A, only the most important words of each document, that is, the words with high TF-IDF weights, are extracted (step S120). In step S120 the top 60% of words can be selected either by the total number of words or by the maximum TF-IDF weight. In the latter case the maximum weight is multiplied by 0.4 rather than 0.6, and the words whose weight exceeds that product are extracted, which yields the important 60% of the words that appeared. For example, as shown in FIG. 2B, document 1 contains 4 words in total, so 4 * 0.6 = 2.4. Accordingly, the words within rank 2.4, <movie> and <thriller>, are extracted as the words with high TF-IDF weights, as shown in FIG. 2C.

Likewise, as shown in FIG. 2B, document 2 contains 3 words in total, so 3 * 0.6 = 1.8. Accordingly, the word within rank 1.8, <movie>, is extracted as the word with a high TF-IDF weight, as shown in FIG. 2C.
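The rank-based selection can be sketched as follows. This is only an illustration: the function name `top_by_rank` is hypothetical, and reading "within rank 2.4" as "keep the first floor(2.4) = 2 words" is an assumption that happens to agree with both worked examples in the text. The sample data is document 2 of FIG. 2B as quoted above.

```python
import math

def top_by_rank(weights, ratio=0.6):
    """Keep the floor(len(weights) * ratio) highest-weighted words (assumed rounding)."""
    keep = math.floor(len(weights) * ratio)
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:keep]]

# Document 2 from FIG. 2B: 3 words, 3 * 0.6 = 1.8, so one word is kept.
doc2 = {"movie": 0.6, "thriller": 0.4, "Korea": 0.1}
print(top_by_rank(doc2))  # ['movie']
```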

The example for step S120 above extracts a fixed proportion of words based only on the number of words appearing in each document, but words may additionally be extracted relative to the maximum TF-IDF value within the document. For example, as shown in FIG. 2B, the largest TF-IDF value in document 1 is 0.5, so 0.5 * 0.4 = 0.2. Accordingly, the words whose TF-IDF value is greater than 0.2, namely <movie>, <thriller>, and <director>, are extracted as the words with the highest TF-IDF weights.

Likewise, as shown in FIG. 2B, the largest TF-IDF value in document 2 is 0.6, so 0.6 * 0.4 = 0.24. Accordingly, the words whose TF-IDF value is greater than 0.24, <movie> and <thriller>, are extracted as the words with the highest TF-IDF weights.
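The maximum-weight criterion is a simple cutoff and can be sketched as below. The function name `above_max_fraction` is hypothetical; the factor 0.4 and the sample weights are the ones quoted in the text for document 2.

```python
def above_max_fraction(weights, factor=0.4):
    """Keep words whose weight exceeds factor * (largest weight in the document)."""
    if not weights:
        return []
    cutoff = max(weights.values()) * factor
    return [word for word, value in weights.items() if value > cutoff]

# Document 2 from FIG. 2B: cutoff is 0.6 * 0.4 = 0.24.
doc2 = {"movie": 0.6, "thriller": 0.4, "Korea": 0.1}
print(above_max_fraction(doc2))  # ['movie', 'thriller']
```

Unlike the rank-based rule, this cutoff adapts to how the weights are distributed: a document with many near-maximal words keeps more of them.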

Next, the extracted words are placed in a single table and their occurrences are counted to calculate the most-important-word occurrence frequency (TTF) of each word (step S130).
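A minimal sketch of step S130 follows. It assumes that each document contributes at most one count per word, which the text does not state explicitly; the function name `top_term_frequency` is hypothetical. The input lists are the words selected by the maximum-weight criterion for documents 1 and 2 in the example above.

```python
from collections import Counter

def top_term_frequency(extracted_per_doc):
    """Count, for each word, how many documents selected it as important."""
    counts = Counter()
    for words in extracted_per_doc:
        counts.update(set(words))  # assumption: one count per document per word
    return counts

extracted = [["movie", "thriller", "director"], ["movie", "thriller"]]
ttf = top_term_frequency(extracted)
print(ttf["movie"], ttf["thriller"], ttf["director"])  # 2 2 1
```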

FIG. 2C is a table showing the TTF of each word for each document.

Referring to FIG. 2C, for document 1 the TTF of the word <movie> is calculated as 2 and that of <thriller> as 1. For document 2 the TTF of <movie> is 2, that of <thriller> is 2, and that of <director> is 1.

The above has described the series of steps by which keywords are extracted, using the TTF, from the web documents a user browsed after querying a search engine, and by which the words with a high TTF are then selected.

The following describes a personalized dictionary model, namely a network-based personal profile model for query expansion, under the assumption that the high-TTF words extracted as above are related to one another in a way specific to the user.

<Network-Based Personal Profile Model for Query Expansion>

Query expansion aims to obtain more accurate search results by adding, to a query word, other words closely related to it. This requires a kind of dictionary that defines the relationships between words. Moreover, since the intent behind the same query word can differ from person to person, it is more reasonable to build such a dictionary per individual.

Accordingly, in an embodiment of the present invention, the keyword extraction method for query words is used to build a network-based personal profile for query expansion, which then serves as the dictionary for query expansion.

FIG. 3 is a conceptual diagram illustrating the construction model of a personal profile for query expansion.

Referring to FIG. 3, when the user submits query word 1 to a search engine and receives result documents, a document set is formed only from the documents the user actually browsed by clicking their links, and keywords are extracted from this document set using the TTF.

For example, if the extracted keywords are keyword 1, keyword 2, and keyword 3, then keywords 1, 2, and 3 together with query word 1 are regarded as words that all point to concept 1. In this relationship, concept 1 is defined as the concept these words have in common.

FIG. 4A is a conceptual diagram illustrating a concept network represented as a user profile. FIG. 4B is a conceptual diagram illustrating an example of query recommendation using the user profile. FIG. 4C is a conceptual diagram illustrating the concept network involved in the query recommendation.

Referring to FIGS. 4A to 4C, the present invention builds a profile for each individual and uses it to personalize search. When each individual's interests are expressed as a user profile, the keywords are represented as a concept network, as shown in FIG. 4A. For example, concept C1 is linked to the keywords Beer, Duff, and alcohol, and is defined by these three words.

The user profile can be used directly for query recommendation. When the user types a particular word or letter, the personalized query expansion system according to the present invention finds the words in the user profile that begin with that word or letter, and expands the user's query using the words linked to the concepts to which those words are linked.

For example, when the user types 'b' as shown in FIG. 4B, the personalized query expansion system according to the present invention looks in the user profile for words beginning with the letter 'b'. The word found is 'beer'. As shown in FIG. 4C, the word 'beer' is linked to concept C1 and concept C3. Accordingly, as shown in FIGS. 4B and 4C, the system combines the words linked to concepts C1 and C3 and can display the expanded queries 'beer', 'beer Duff', 'beer alcohol', 'beer drinking', 'beer Tavern', 'beer evening', 'beer Duff alcohol', 'beer Tavern drinking', and 'beer drinking evening'.
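The prefix lookup can be sketched as follows. This is a simplified illustration, not the method of the figures: it produces only the word itself and its two-word expansions, and it does not attempt the three-word combinations, whose selection rule the text does not spell out. The function name `suggest`, the dictionary layout, and the membership of concept C3 (inferred from the suggestion list above) are assumptions.

```python
def suggest(prefix, concepts):
    """Return expansions for every profile word that starts with `prefix`.

    `concepts` maps a concept id to the list of words that define it.
    """
    suggestions = []
    vocabulary = {w for words in concepts.values() for w in words}
    matches = sorted(w for w in vocabulary if w.lower().startswith(prefix.lower()))
    for word in matches:
        suggestions.append(word)
        for words in concepts.values():
            if word in words:
                # Pair the matched word with each other word of the same concept.
                suggestions.extend(f"{word} {other}" for other in words if other != word)
    return suggestions

profile = {
    "C1": ["beer", "Duff", "alcohol"],
    "C3": ["beer", "drinking", "Tavern", "evening"],  # membership assumed
}
print(suggest("b", profile))
# ['beer', 'beer Duff', 'beer alcohol', 'beer drinking', 'beer Tavern', 'beer evening']
```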

FIG. 5 is a conceptual diagram illustrating the session interest that makes up the concept network.

Referring to FIG. 5, to build the user's profile, keywords are extracted from the documents the user browsed while using a search engine. For this purpose a session and a session interest are defined. A session is the period from when the user submits a query to the search engine until the user submits the next query. A session interest is the main interest of one session.

A concept is defined by, for example, three words such as Beer, Duff, and alcohol. Beer is the word the user submitted to the search engine, and Duff and alcohol are keywords extracted from the web documents the user clicked to view on the results page after querying Beer.

Such a session interest is generated every time the user submits a query, and the user profile is built by accumulating the generated session interests.

FIG. 6 is a conceptual diagram illustrating the procedure of building a user profile by accumulating session interests.

Referring to FIG. 6, each time a new session interest is generated it is compared with the concepts of the stored user profile. If it is identical or similar to one of them, the concept corresponding to the session interest is merged with that concept of the user profile; if it is judged to be different, it is added as a new concept. To calculate the similarity between concepts, a web directory called the Open Directory Project can be used. The Open Directory Project is a web directory in which volunteers have classified a very large number of web pages.
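The merge-or-add rule can be sketched as below. Several details are assumptions made only for illustration, since the text does not fix them: the similarity threshold of 0.8, merging by set union of words and element-wise addition of vectors, and the dictionary layout of a concept. The names `cosine` and `update_profile` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def update_profile(profile, interest, threshold=0.8):
    """Merge `interest` into its most similar concept, or append it as a new one.

    Each concept is a dict: {"words": set of str, "vector": list of float}.
    """
    best = max(profile, key=lambda c: cosine(c["vector"], interest["vector"]), default=None)
    if best is not None and cosine(best["vector"], interest["vector"]) >= threshold:
        best["words"] |= interest["words"]  # assumed merge rule: union of words
        best["vector"] = [a + b for a, b in zip(best["vector"], interest["vector"])]
    else:
        profile.append({"words": set(interest["words"]), "vector": list(interest["vector"])})
    return profile

profile = []
update_profile(profile, {"words": {"beer", "Duff"}, "vector": [1.0, 0.0]})
update_profile(profile, {"words": {"beer", "alcohol"}, "vector": [0.9, 0.1]})  # similar: merged
update_profile(profile, {"words": {"movie"}, "vector": [0.0, 1.0]})            # different: added
print(len(profile))  # 2
```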

FIG. 7A is a conceptual diagram illustrating the hierarchy of categories in the Open Directory Project, and FIG. 7B is a table showing an example of the number of categories and web pages at each depth of the hierarchy.

Referring to FIG. 7A, the categories of the Open Directory Project form a hierarchy. Under the Top category are linked the Arts category, the Business category, the Computers category, and so on. Under the Arts category are linked the Animations category, the Antiques category, the Architecture category, and so on. Under the Animations category are linked the Animated Graphics category, the Animation Art Galleries category, and so on. Under the Animated Graphics category are linked the Adobe Flash category, the Animated GIFs category, and the Artists category.

Referring to FIG. 7B, each web page is classified into a particular category. For example, at depth 1 there are 89 web pages across 17 categories, and at depth 2 there are 6,427 web pages across 657 categories. Continuing in this way, at depth 8 there are 661,485 web pages across 165,616 categories.

Each category and web page has a title and a description recorded for it. The text data recorded for the categories and web pages of the Open Directory Project, such as titles and descriptions, can therefore be used when searching for web pages.

FIG. 8 is a conceptual diagram illustrating the change of dimension used to calculate the similarity between concepts in the ODP category dimension.

Referring to FIG. 8, the session interest and the concept of the user profile are each first represented as a term vector (or document vector).

Vector T_S is the term vector corresponding to the session interest, and vector T₁ is the term vector corresponding to the first concept of the user profile. T_S and T₁ are each converted into vectors whose dimensions are the ODP categories and are then compared. That is, the term vector T_S of the session interest is converted into a vector Ct whose dimensions are the ODP categories, and the term vector T₁ of the first concept is converted into a vector C₁ whose dimensions are the ODP categories.

FIG. 9 is a conceptual diagram illustrating the dimension change of FIG. 8 in detail.

Referring to FIG. 9, to change the dimensions of a term vector into ODP category dimensions, the ODP categories themselves are first represented as term vectors as well.

First, for each ODP category, the strings in the Title and Description fields of the category and of the web pages belonging to it are collected, producing one document per ODP category.

The super documents in FIG. 9 are these per-category documents; TF-IDF weights are computed for them in turn, and each is represented as a term vector. The TF-IDF weight indicates how important a word is within each document of a given document set. It is calculated as the product of the term frequency and the inverse of the document frequency. Here, the term frequency is the number of times a word appears in a document, and the document frequency is the number of documents in the set in which the word appears.
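As a rough illustration of this weighting, the sketch below computes TF-IDF for a small document set. The text says only that the weight is the term frequency times the inverse of the document frequency; the log-scaled form log(N / df) used here is a common variant and is an assumption, as are the function name `tf_idf` and the toy documents.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents in which each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

docs = [["movie", "thriller", "movie"], ["movie", "Korea"]]
weights = tf_idf(docs)
print(weights[0]["movie"])  # 0.0, because "movie" appears in every document
```

With this variant a word that occurs in every document gets weight zero, which is why topic-specific words such as "thriller" end up ranked above words shared by the whole set.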

In FIG. 9, T_Ct is the concept expressed as a term vector, and T₁, T₂, T₃, …, T_N are the categories expressed as term vectors. The similarity between each ODP category and the concept can be calculated as a cosine similarity. Cosine similarity is a way of measuring the similarity between documents using the vector model, which is the model most widely used in information retrieval.

Referring to FIG. 9, the similarity between the term vector T₁ of category 1 and the term vector T_Ct of the concept is cos(T₁, T_Ct), the similarity between the term vector T₂ of category 2 and T_Ct is cos(T₂, T_Ct), and the similarity between the term vector T₃ of category 3 and T_Ct is cos(T₃, T_Ct). In the same way, the similarity between the term vector T_N of category N and T_Ct is cos(T_N, T_Ct).

Accordingly, the concept similarity in the category dimension is expressed as Equation 4 below.
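The equation itself is not reproduced in this text. Judging only from the surrounding description, in which a concept becomes a vector whose components are its cosine similarities to the category term vectors, it presumably has a form like the following. The name C_{Ct} for the resulting vector and the exact notation are assumptions; the cosine definition is the standard one.

```latex
C_{Ct} = \bigl(\cos(T_1, T_{Ct}),\ \cos(T_2, T_{Ct}),\ \dots,\ \cos(T_N, T_{Ct})\bigr),
\qquad
\cos(T_i, T_{Ct}) = \frac{T_i \cdot T_{Ct}}{\lVert T_i \rVert \, \lVert T_{Ct} \rVert}
```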

In this way, the similarity of each concept to every category term vector is calculated, and the concept is represented as a vector whose value in each category dimension is the corresponding similarity. The similarity between concepts can now be calculated in the category dimension.
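The dimension change can be sketched with sparse term vectors as below. The two toy categories stand in for the ODP super documents and are purely hypothetical, as are the names `cosine` and `to_category_vector`. The example is chosen so that the session interest and the concept share no term, yet become comparable once both are projected onto the categories.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {key: weight} dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def to_category_vector(term_vec, category_term_vecs):
    """One component per category: cosine of the term vector with that category."""
    return [cosine(term_vec, cat) for cat in category_term_vecs]

categories = [  # toy stand-ins for ODP category term vectors
    {"beer": 1.0, "alcohol": 1.0},
    {"movie": 1.0, "thriller": 1.0},
]
t_s = {"beer": 1.0, "Duff": 1.0}         # session interest
t_1 = {"alcohol": 1.0, "drinking": 1.0}  # profile concept

c_s = to_category_vector(t_s, categories)
c_1 = to_category_vector(t_1, categories)
print(cosine(t_s, t_1))                                   # 0.0 in term space
print(cosine(dict(enumerate(c_s)), dict(enumerate(c_1)))) # close to 1.0 in category space
```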

FIGS. 10A and 10B are conceptual diagrams illustrating the keyword extraction method.

As shown in FIG. 10A, keywords are extracted from the web pages (documents) the user browsed. A document set is formed from the documents the user browsed within one session, TF-IDF weights are calculated, and the important words of each web document are selectively extracted based on these weights. Forming the document set, calculating the TF-IDF weights, and selectively extracting the important words, as shown in FIG. 10A, correspond to steps S110 and S120 described with reference to FIG. 2A.

Next, as described for step S130 of FIG. 2A, the words extracted from each document are placed in a single table and their occurrences are counted again. For example, as shown in FIG. 10B, the occurrences in the temporary table (Temp. Table) on the left are counted and entered in the term frequency (Term Frequency) table on the right.

Only the words whose occurrence count is at or above a given threshold are used to form the concept of the session interest. In FIG. 10B, the resulting session interest (Cs) is defined by three words: Beer, alcohol, and Duff.
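A minimal sketch of this thresholding step follows. The contents of the temporary table, the threshold value of 2, and the function name `build_session_interest` are all hypothetical; including the query word in the concept follows the earlier description of how Beer, Duff, and alcohol define one concept.

```python
from collections import Counter

def build_session_interest(query, temp_table, min_count=2):
    """Query word plus every extracted word seen at least `min_count` times."""
    counts = Counter(temp_table)
    keywords = [word for word, count in counts.items() if count >= min_count]
    return [query] + [word for word in keywords if word != query]

# Hypothetical temporary table: one entry per (document, extracted word).
temp_table = ["alcohol", "Duff", "bar", "alcohol", "Duff", "alcohol"]
print(build_session_interest("Beer", temp_table))  # ['Beer', 'alcohol', 'Duff']
```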

FIG. 11 is a block diagram illustrating a personalized query expansion system according to an embodiment of the present invention. In particular, it shows a personalized query expansion system based on the user profile construction technique.

Referring to FIG. 11, the personalized query expansion system according to an embodiment of the present invention includes a client unit (400) and a server unit (500).

The client unit (400) includes a query input module (410), a viewer module (420), and a keyword extraction module (430). It receives user input, extracts keywords from the web pages the user browsed, and transmits them to the server unit (500) so that a user profile can be built. In this embodiment the client unit (400) is divided into the query input module (410), the viewer module (420), and the keyword extraction module (430), but the division is only logical or functional, not a division in hardware.

The query input module (410) receives the query word entered by the user and provides it to the server unit (500). It then provides to the server unit (500) the query words the user selects from among the expanded query words returned by the server unit (500).

The viewer module (420) displays the documents provided by the server unit (500).

The keyword extraction module (430) extracts keywords (or a short-term concept network) from the documents of interest selected by the user and provides the extracted keywords to the user concept network management module (530). The keyword extraction module (430) may include the components of the keyword extractor described with reference to FIG. 1; a detailed description is omitted here.

The server unit (500) includes a query expansion module (510), a search module (520), and a user concept network management module (530), and it builds and manages the user's profile. In this embodiment the server unit (500) is divided into the query expansion module (510), the search module (520), and the user concept network management module (530), but the division is only logical or functional, not a division in hardware.

When a query word is provided from the client unit, the query expansion module (510) looks up the concept network for that query word, obtains expanded query words, and provides them to the client unit.

When a query word is selected at the client unit, the search module (520) requests a search from a network search engine for documents corresponding to the selected query word, and provides the resulting documents to the client unit (400).

When keywords (or a short-term concept network) are provided from the client unit (400), the user concept network management module (530) supplies them to the concept network (700).

Although FIG. 11 has been described with the query input module (410), the viewer module (420), and the keyword extraction module (430) provided in the client unit (400), at least one of these modules may instead be provided in the server unit (500).

FIG. 12 is a flowchart illustrating a personalized query expansion method according to an embodiment of the present invention.

Referring to FIG. 12, when the user enters a new query (step S210), a session interest is generated from the query words and the keywords extracted from the web pages (step S220).

Next, the similarity between the term vector of the session interest and the term vectors of the ODP categories is calculated (step S230). The similarity is calculated as a cosine similarity.

Next, the session interest expressed as a category vector is generated (step S240).

Next, the similarity between the concept of the session interest and the concepts of the stored user profile is calculated and compared (step S250). If the comparison shows that the concept of the session interest and a concept of the user profile are identical or similar, the two concepts are merged and stored in the concept network; if they are judged to be different, the session interest is added to the concept network as a new concept (step S260).

The following describes an experimental example carried out to verify the relationship between the words a user queries and the keywords extracted from the visited web pages, and to verify whether the TTF is an effective way of extracting those keywords.

<Experimental Example>

First, a query was issued on a search site using a particular word, keywords were extracted using the TTF, and these were compared with the query word. The experiment was carried out to verify the performance of the TTF by forming a document set from web documents with a clear topic and checking whether the keywords extracted using the TTF matched the topic of that set. It was also checked whether adding the extracted keywords to the query produced better search results.

- Experimental environment

The search site used in the experiment was Google. The experimental procedure of downloading the web documents returned as search results and calculating the TTF was implemented in the Java language. Calculating the TF-IDF weights that underlie the TTF involves a great deal of counting of word occurrences, which was handled efficiently with a database management system (DBMS); MySQL was used as the DBMS. The KLT (Korean Language Technology) library was used to extract single nouns from the collected web documents. The KLT library provides index-term extraction functions and morphological analysis functions. English text is processed with a natural language processing tool such as Standard POS Tagger.

- Keyword extractor for query words

In this experiment, a prototype keyword extractor was built, as described with reference to FIG. 1, to collect the web documents a user browsed after a query and to extract keywords from them. To store the web documents collected by the keyword extractor of FIG. 1 and to extract keywords, tables were created for document sets, documents, the dictionary, word occurrences, IDF, TF-IDF, and TTF. The keyword extractor then extracts keywords for a user query in the following order.

- Web document collection for a query and identification of word occurrence facts

The web document collection module 110 of FIG. 1 first receives a query word and submits it to the search engine 50 to obtain search results. The experimenter selects, from the search results shown by the web document collection module 110, only those web documents that match his or her intent; the module then collects the selected web documents and stores them in the document repository. All of these web documents are assigned the same document set number, and each is stored as a separate document.

The stored documents go through preprocessing, such as removing all non-Hangul characters and removing HTML tags, after which single nouns are extracted using the KLT library. The extracted nouns are stored in the word occurrence fact repository.
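The preprocessing step can be pictured with the following minimal Python sketch. It is only an illustration under stated assumptions: the function name `preprocess` and the helper class are hypothetical, the Hangul syllable range U+AC00 to U+D7A3 stands in for "Hangul characters", and whitespace splitting is a placeholder for the KLT noun extraction, which is not reproduced here.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def preprocess(html):
    """Strip HTML tags, then drop every character that is not a Hangul syllable.

    Returns whitespace-separated tokens; a real pipeline would pass the cleaned
    text to a morphological analyzer (assumed, not shown) to keep only nouns.
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    # Replace each run of non-Hangul characters with a single space.
    text = re.sub(r"[^\uac00-\ud7a3]+", " ", text)
    return text.split()
```

For example, `preprocess("<p>스릴러 영화 2009 best!</p>")` keeps only the two Hangul tokens and discards the digits and Latin letters.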

- TF-IDF weight calculation and most important word occurrence frequency calculation

Using the word occurrence facts, the IDF is calculated first and the TF-IDF is calculated next. The IDF and TF-IDF values are calculated and stored in the database management system using simple SQL statements built from 'INSERT' and 'SELECT'. Because calculating the TF-IDF consists of simple operations such as counting and summing word occurrences, it can easily be done in a database management system.
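A minimal sketch of computing IDF and TF-IDF with INSERT ... SELECT statements follows. It is not the schema used in the experiment: the table and column names are hypothetical, SQLite (through Python's `sqlite3`) stands in for MySQL so that the example is self-contained, and the weighting tf × ln(N / df) is the common textbook form, assumed here because the exact formula is not restated in this section.

```python
import math
import sqlite3


def build_tfidf(occurrences):
    """Load (doc_id, word, count) rows and derive IDF and TF-IDF tables in SQL.

    Table names (occurrence, idf, tfidf) are illustrative assumptions.
    """
    conn = sqlite3.connect(":memory:")
    # Register a natural-log function, since not every SQLite build ships one.
    conn.create_function("py_ln", 1, math.log)
    conn.executescript("""
        CREATE TABLE occurrence (doc_id INTEGER, word TEXT, cnt INTEGER);
        CREATE TABLE idf (word TEXT PRIMARY KEY, idf REAL);
        CREATE TABLE tfidf (doc_id INTEGER, word TEXT, weight REAL);
    """)
    conn.executemany("INSERT INTO occurrence VALUES (?, ?, ?)", occurrences)
    # IDF: ln(total documents / documents containing the word).
    conn.execute("""
        INSERT INTO idf
        SELECT word,
               py_ln((SELECT COUNT(DISTINCT doc_id) FROM occurrence) * 1.0
                     / COUNT(DISTINCT doc_id))
        FROM occurrence
        GROUP BY word
    """)
    # TF-IDF: raw count in the document times the word's IDF.
    conn.execute("""
        INSERT INTO tfidf
        SELECT o.doc_id, o.word, o.cnt * i.idf
        FROM occurrence AS o JOIN idf AS i ON i.word = o.word
    """)
    return conn
```

With this weighting, a word that appears in every document of the set receives a weight of zero, so only words that distinguish documents from one another rank highly.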

After the TF and TF-IDF values are calculated with SQL statements, the words judged important within each document are extracted at a fixed ratio based on the calculated TF-IDF weights, and the number of times each word is extracted is counted to obtain its most important word occurrence frequency.
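The counting step described above can be sketched as follows. This is one possible reading, under assumptions: the function name `top_word_frequency`, the default ratio, and rounding the per-document cutoff up to at least one word are all illustrative choices, not values taken from the experiment.

```python
import math
from collections import Counter


def top_word_frequency(doc_weights, ratio=0.2):
    """Count in how many documents each word falls in the top `ratio` by TF-IDF.

    doc_weights: list of {word: tfidf_weight} dicts, one per document.
    Returns a Counter mapping word -> number of documents that selected it.
    """
    counts = Counter()
    for weights in doc_weights:
        if not weights:
            continue
        # Number of words to keep from this document (at least one).
        k = max(1, math.ceil(len(weights) * ratio))
        ranked = sorted(weights, key=weights.get, reverse=True)
        counts.update(ranked[:k])
    return counts
```

A word that ranks among the top-weighted words in many documents of the set accumulates a high count, which is the quantity used to rank keyword candidates.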

- Comparison experiment between query words and keywords

To verify the performance of the keyword extraction technique used for personalized query expansion according to the present invention, an experiment was first conducted in which queries were issued with specific query words, and keywords were measured from the resulting web documents and compared.

The query words were 'movie', 'thriller', and 'action'. Each query always included 'movie' and at least one of 'thriller' and 'action', so that it searched for 'thriller movie' or 'action movie'. Keywords were extracted from the full set of search results using the most important word occurrence frequency, and the results are shown. In addition, to determine how many documents are needed to extract appropriate keywords, the number of documents included in the document set was varied and the results were checked.

- Query expansion experiment

The present invention proposes a personalized profile model for query expansion, and this model assumes that a query word is related to one or more concepts. The query expansion experiment checks whether the keywords extracted by the keyword extraction unit can properly narrow the meaning of a word to a single concept. The experimental method is shown in FIGS. 13A and 13B.

FIG. 13A is a conceptual diagram illustrating a search result screen under the conventional query method, and FIG. 13B is a conceptual diagram illustrating a search result screen obtained in the query expansion experiment according to the present invention.

Referring to FIG. 13A, when a search engine is queried with the word A in the usual way, the search engine performs the search on the unexpanded query and provides the results to the user. Because the query word has not been expanded, the search result screen presented to the user mixes in unwanted results. The user therefore has to either search again or patiently work through unnecessary results to find the ones he or she wants.

Referring to FIG. 13B, however, only the desired search results are displayed on the search result screen provided to the user.

For the query expansion experiment, the query words to be tested were chosen first: 'stock', 'music', and 'North Korea'. A concept was then assigned to each word and documents were collected: 'value investing' for 'stock', 'indie music' for 'music', and 'human rights issues' for 'North Korea'.

Next, web documents matching each abstract concept were collected and keywords were extracted from them. The keywords obtained in this way were submitted together with the corresponding query word, and the topics of the retrieved web documents were checked against the assigned concept.
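Submitting the keywords together with the query word can be sketched as below. How the terms are combined for the search engine is not specified in this section, so joining them with spaces is an assumption, and the function name `expand_query` and the `max_terms` parameter are hypothetical.

```python
def expand_query(query, keywords, max_terms=None):
    """Build an expanded query string: the original query word plus keywords.

    Duplicates (including the query word itself) are dropped while preserving
    the keyword order. max_terms optionally limits the total number of terms.
    """
    seen = {query}
    terms = [query]
    for keyword in keywords:
        if keyword not in seen:
            seen.add(keyword)
            terms.append(keyword)
    if max_terms is not None:
        terms = terms[:max_terms]
    return " ".join(terms)
```

For instance, expanding the query word for 'stock' with keywords extracted from value-investing documents yields a single query string that begins with the original word and is followed by the distinct keywords.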

In this experiment, the concepts assigned to the query words, such as 'value investing', 'indie music', and 'human rights issues', can be regarded as the correct topics that documents retrieved after query expansion should have. If the topics of the retrieved documents match the assigned concept, the keyword-based query expansion can be judged appropriate.

In this experiment, the document set for each concept consisted of four web documents, and the words with the highest most important word occurrence frequency in that document set were selected as keywords.

<Experimental Results>

- Keyword extraction results by query word

FIGS. 14A and 14B show the keywords extracted from the web documents retrieved using words such as 'movie', 'thriller', and 'action' as query words.

FIGS. 14A and 14B are tables showing the keyword extraction results for the movie, action, and thriller queries.

Referring to FIGS. 14A and 14B, the extracted keywords are presented as the number of web documents in the document set is varied from 2 to 100. The rankings shown in FIGS. 14A and 14B are ordered by most important word occurrence frequency.

The results show that when the document set consisted of eight or more collected web documents and the most important word occurrence frequency was calculated, the query words 'movie', 'action', and 'thriller' ranked near the top.

'Action' is missing from the keywords extracted from 2, 4, and 6 web documents because the query to the search site required 'action' or 'thriller', and the top-ranked web documents returned contained only 'thriller' and not 'action'. These results indicate that keywords can be extracted effectively even from a document set made up of a small number of web documents.

- Query expansion experiment

First, the keywords extracted from the web documents corresponding to each concept are presented in FIG. 15.

FIG. 15 is a table showing the keywords extracted from the web document set for each concept.

Referring to FIG. 15, in the case of <value investing>, the words <investment>, <market>, <long-term>, <stock>, <value investing>, and <investor> obtained the highest most important word occurrence frequency and were extracted as keywords. These keywords indicate that <value investing> is associated with <long-term> investment.

FIG. 16 presents the proportion of search results that belong to the assigned concept when the extracted keywords are submitted together with the query word to which that concept was assigned.

FIG. 16 is a table showing the proportion of results that belong to the assigned concept after query expansion.

Referring to FIG. 16, when the word <stock> was queried together with the keywords related to <value investing>, the topics of all search results matched the concept of <value investing>.

Likewise, the same result was obtained when the word <North Korea> was expanded using the keywords extracted from web documents related to <human rights issues>.

In the case of <music>, a small number of results did not belong to the assigned concept, and the number of search results was small, at 31. This is because too many keywords were extracted, which made the expanded query contain too many terms. So many keywords were extracted because many words shared the same highest most important word occurrence frequency.
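The effect of ties can be illustrated with a short sketch. It assumes, for illustration only, that keywords are chosen as every word sharing the maximum count; the function name `select_keywords` and the sample counts are hypothetical and are not the experiment's data.

```python
from collections import Counter


def select_keywords(counts):
    """Return every word whose count equals the highest count, sorted by word.

    With few documents, many words can tie at the maximum, which enlarges
    the keyword list and therefore the expanded query.
    """
    if not counts:
        return []
    top = max(counts.values())
    return sorted(word for word, count in counts.items() if count == top)
```

With a four-document set, for example, the maximum possible count is four, so any number of words can share it and all of them would be added to the query under this selection rule.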

In the experiments according to the present invention, keywords were extracted from document sets consisting of a small number of actual web documents, and the results verified the performance of the keyword extraction method based on the most important word occurrence frequency.

Although the invention has been described above with reference to embodiments, those skilled in the art will understand that it can be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

As described above, the present invention provides, for web document search, a concept network as the basic model of a personal profile to be used as a dictionary for query expansion, and a keyword extraction method using the most important word occurrence frequency as the way of finding the basic associations that make up that model.

Using the most important word occurrence frequency, which is calculated from TF-IDF weights, important keywords can be found in a given document set. In addition, keywords found in a document set built from the web documents a user visited after querying a search engine reveal words that are associated with the user's query word.

The associations between the words found in this way and the query words can be used to build a concept network-based user profile. Users visit only a very small number of search results when they actually use a search engine, and the method can extract meaningful keywords even from such a small document set.

FIG. 1A is a block diagram illustrating a concept network-based user profile configuration system according to an embodiment of the present invention.

FIG. 1B is a block diagram illustrating an example of the keyword extraction unit shown in FIG. 1A.

FIG. 4A is a conceptual diagram illustrating a concept network represented as a user profile. FIG. 3B is a conceptual diagram illustrating an example of query word recommendation using a user profile. FIG. 3C is a conceptual diagram illustrating a concept network resulting from query word recommendation.

FIG. 8 is a conceptual diagram illustrating the dimension change used to calculate the similarity between concepts in the ODP category dimension.

FIG. 9 is a conceptual diagram illustrating the dimension change in detail.

FIG. 11 is a block diagram illustrating a personalized query expansion system according to an embodiment of the present invention.

<Description of reference numerals for main parts of the drawings>

100: keyword extraction unit    110: web document collection module

120: single noun extraction module    130: TF-IDF weight calculation module

140: most important word occurrence frequency calculation module    200: user profile creation unit

210: session interest creation module    220: comparison module

230: user profile update module    400: client unit

410: query input module    420: viewer module

430: keyword extraction module    500: server unit

510: query expansion module    520: search module

530: user concept network management module

Claims

A method of configuring a concept network-based user profile, the method comprising:

extracting keywords from documents that a user has searched for using a search engine, in order to create a user profile;

generating a session interest using the extracted keywords, and creating the user profile by accumulating the generated session interests;

comparing the generated session interest with the concept of the user profile each time a new session interest is generated; and

if the session interest is the same as or similar to the concept of the user profile, adding the session interest and the concept of the user profile together, and if it is different, adding the session interest as a new concept of the user profile.

The method of claim 1, wherein the similarity comparison between the session interest and the concept of the user profile is made using the web directory of the Open Directory Project (ODP).

The method of claim 2, wherein the comparing of the session interest with the concept of the user profile comprises:

expressing each of the session interest and the concept of the user profile as a term vector;

converting the term vector corresponding to the session interest and the term vector corresponding to the concept of the user profile into a first vector and a second vector, respectively, each having the categories of the ODP as its dimensions; and

comparing the first vector with the second vector using cosine similarity.

The method of claim 3, wherein the converting into the first vector and the second vector comprises:

generating a super document for each category by collecting the strings of each category corresponding to the session interest and the web pages belonging to that category;

expressing the super documents as term vectors weighted by term frequency-inverse document frequency (TF-IDF); and

calculating a similarity with each category expressed as a term vector, and expressing the similarity values as a vector whose dimensions are the categories,

and wherein the comparing of the first vector with the second vector comprises

comparing the term vector corresponding to the concept and the term vector corresponding to the category in the category dimension by using cosine similarity.

The method of claim 1, wherein the extracting of the keywords comprises:

extracting words from each web document based on TF-IDF weights;

storing the words extracted from each document in a single table and counting the number of appearances of each word; and

constructing the session interest using only the words whose number of appearances is at or above a predetermined threshold.

A concept network-based user profile configuration system, comprising:

a keyword extraction unit that extracts keywords from documents a user has searched for using a search engine, in order to create a user profile;

a session interest creation module that generates a session interest using the extracted keywords and creates the user profile by accumulating the generated session interests;

a comparison module that compares the generated session interest with the concept of the user profile each time a new session interest is generated; and

a user profile update module that adds the session interest and the concept of the user profile together when the session interest is the same as or similar to the concept of the user profile, and adds the session interest as a new concept of the user profile when it is different.

The system of claim 6, wherein the comparison module compares the similarity between the session interest and the concept of the user profile using the web directory of the Open Directory Project (ODP).

The system of claim 7, wherein the comparison module:

expresses each of the session interest and the concept of the user profile as a term vector;

converts the term vector corresponding to the session interest into a first vector having the categories of the ODP as its dimensions;

converts the term vector corresponding to the concept of the user profile into a second vector having the categories of the ODP as its dimensions; and

compares the first vector with the second vector using cosine similarity.

The system of claim 6, wherein the keyword extraction unit:

extracts words from each web document based on TF-IDF weights;

stores the words extracted from each document in a single table and counts the number of appearances of each word; and

constructs the session interest using only the words whose number of appearances is at or above a predetermined threshold.

The system of claim 6, wherein the keyword extraction unit comprises:

a web document collection module that submits a query word to a search engine and parses and stores the resulting web pages;

a single noun extraction module that extracts single nouns from the stored web pages;

a TF-IDF weight calculation module that calculates TF-IDF weights; and

a most important word occurrence frequency calculation module that selects keywords by calculating the most important word occurrence frequency from the TF-IDF weights.

The system of claim 10, wherein the most important word occurrence frequency calculation module

extracts a specific top ratio of words based on the number of words in a document, calculates the most important word occurrence frequency, and selects keywords.

A concept network-based user profile configuration system characterized by extracting a specific top ratio of words based on the maximum TF-IDF weight in a document, calculating the most important word occurrence frequency, and selecting keywords.

A personalized query expansion system, comprising:

a concept network;

a query expansion module that, when a query word is provided from a client unit, queries the concept network with the provided query word to obtain expanded query words and provides the obtained expanded query words to the client unit;

a search module that, when a query word is selected by the client unit, requests a search by providing the selected query word to a search engine, and provides the documents returned in response to the request to the client unit; and

a user concept network management module that, when a session interest is provided from the client unit, compares the session interest with the concept of the user profile stored in the concept network, adds the session interest and the concept of the user profile together in the concept network if the session interest is the same as or similar to the concept of the user profile, and adds the session interest to the concept network as a new concept of the user profile if it is different.

The system of claim 13, wherein the client unit comprises:

a query input module that receives user input, extracts a keyword from the searched web page, and provides the keyword to the query expansion module;

a viewer module that displays the documents provided by the search module; and

a keyword extraction module that extracts keywords from a document of interest selected by the user and provides the extracted keywords to the user concept network management module.

The system of claim 13, wherein the keyword extraction module:

extracts words from each web document based on TF-IDF weights; and

constructs a session interest using only the words whose number of appearances is at or above a predetermined threshold.

The system of claim 13, wherein the keyword extraction module comprises:

a TF-IDF weight calculation module that calculates TF-IDF weights; and

a most important word occurrence frequency calculation module that selects keywords by calculating the most important word occurrence frequency from the TF-IDF weights.

The system of claim 16, wherein the most important word occurrence frequency calculation module

extracts a specific top ratio of words based on the number of words in a document, calculates the most important word occurrence frequency, and selects keywords.

A personalized query expansion system characterized by extracting a specific top ratio of words based on the maximum TF-IDF weight in a document, calculating the most important word occurrence frequency, and selecting keywords.