KR20170042080A

KR20170042080A - Method and apparatus for providing personalized search results

Info

Publication number: KR20170042080A
Application number: KR1020150141556A
Authority: KR
Inventors: 강한훈; 양재영; 윤수인; 강슬기; 우혜진; 구본경
Original assignee: 삼성에스디에스 주식회사
Priority date: 2015-10-08
Filing date: 2015-10-08
Publication date: 2017-04-18
Also published as: KR102292092B1

Abstract

According to an embodiment of the present invention, a method for providing personalized search results comprises the steps of: extracting a keyword from a document opened by a user; calculating a weight in consideration of the frequency of the keyword in the document; generating user model data for the user by mapping the weight to the keyword; and receiving a query from a terminal of the user and sorting the search results for the query by using the user model data.

Description

The present invention relates to a method and apparatus for providing personalized search results. More specifically, it relates to a method of quantifying a user's response to search results so that documents likely to interest the user are presented first, and to an apparatus that performs the method.

Search is commonly used to obtain only the desired information from among a wide variety of information. The user performs a search by entering a query describing the desired information. However, because most queries entered by users are only one or two words long, search quality is often poor. In other words, users only rarely enter the optimal query for finding the information they want.

To address this problem, search engines sort search results using their own algorithms, or offer services such as similar queries or recommended queries. However, most of these services are keyed to the query itself, so different users who enter the same query simply receive the same search results, the same similar queries, and the same recommended queries.

Even for the same query, the information a user hopes to obtain may differ from user to user. To take the extreme case of homonyms or polysemous words, the Korean query "밤" (bam) may be entered by one person looking for information about night, the dark time after sunset, and by another person looking for information about chestnuts, the fruit of the chestnut tree.

The problem to be solved by the present invention is to provide a method and apparatus for providing personalized search results.

Another problem to be solved by the present invention is to provide a method and apparatus for providing an optimal combination of algorithms for delivering personalized search results.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the following description.

A method for providing personalized search results according to one aspect of the present invention, for solving the above technical problem, may include: extracting keywords from a document viewed by a user; calculating a weight in consideration of the frequency of each keyword in the document; generating user model data for the user by mapping the weight to the keyword; and receiving a query from the user's terminal and sorting the search results for the query using the user model data.

In one embodiment, calculating the weight may include correcting the weight in consideration of the distribution of the keywords and the weights.

In another embodiment, calculating the weight may include correcting the weight in consideration of information about the time at which the user viewed the document.

In yet another embodiment, sorting the search results for the query may include: extracting search result keywords from each document in the search results; comparing the search result keywords with the keywords and calculating the importance of each document based on the weights mapped to the keywords; and sorting the search results for the query using the importance of each document.
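
As a non-limiting illustration, the importance calculation and sorting described above might be sketched in Python as follows. The function names `document_importance` and `personalize`, the representation of user model data as a keyword-to-weight dictionary, and the use of a plain sum of matching weights as the importance are assumptions made for this sketch; the specification does not fix a particular importance formula at this point.

```python
from typing import Dict, List


def document_importance(doc_keywords: List[str], user_model: Dict[str, float]) -> float:
    """Sum the user-model weights of keywords that also occur in the document.

    Assumption: importance is a plain sum of matching weights.
    """
    return sum(user_model.get(keyword, 0.0) for keyword in set(doc_keywords))


def personalize(results: Dict[str, List[str]], user_model: Dict[str, float]) -> List[str]:
    """Return document ids sorted by descending importance.

    `results` maps a (hypothetical) document id to its search result keywords.
    Python's sort is stable, so ties keep the search engine's original order.
    """
    return sorted(
        results,
        key=lambda doc_id: document_importance(results[doc_id], user_model),
        reverse=True,
    )


# Hypothetical data: a user whose viewed documents were mostly about night.
user_model = {"night": 2.0, "sky": 1.0, "chestnut": 0.1}
results = {
    "doc_chestnut": ["chestnut", "recipe"],
    "doc_night": ["night", "sky", "stars"],
}
ranking = personalize(results, user_model)
```

With this hypothetical user model, the document about night is ranked ahead of the document about chestnuts even though the search engine returned it second.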

In yet another embodiment, sorting the search results for the query may include updating the user model data by feeding back the user's response to the sorted search results.

In yet another embodiment, generating the user model data for the user may include forming clusters by clustering a plurality of users based on their user model data.
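
The specification does not name a clustering algorithm here. Purely as an assumed example, the following Python sketch groups users greedily by the cosine similarity of their keyword-weight dictionaries; the names `cosine` and `cluster_users`, the threshold of 0.5, and the choice of the first member as each cluster's representative are all hypothetical.

```python
import math
from typing import Dict, List

UserModel = Dict[str, float]


def cosine(a: UserModel, b: UserModel) -> float:
    """Cosine similarity between two sparse keyword-weight vectors."""
    dot = sum(weight * b.get(keyword, 0.0) for keyword, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_users(models: Dict[str, UserModel], threshold: float = 0.5) -> List[List[str]]:
    """Greedy clustering (an assumption, not the patent's stated method).

    Each user joins the first cluster whose representative (its first member)
    is at least `threshold` similar; otherwise the user starts a new cluster.
    """
    clusters: List[List[str]] = []
    for user, model in models.items():
        for cluster in clusters:
            if cosine(model, models[cluster[0]]) >= threshold:
                cluster.append(user)
                break
        else:
            clusters.append([user])
    return clusters


# Hypothetical user models.
models = {
    "A": {"night": 1.0, "sky": 0.5},
    "B": {"chestnut": 1.0, "farm": 0.5},
    "C": {"night": 0.8, "stars": 0.2},
}
clusters = cluster_users(models)
```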

An apparatus for providing personalized search results according to another aspect of the present invention, for solving the above technical problem, may include: a keyword extraction unit that extracts keywords from a document viewed by a user; a weight calculation unit that calculates a weight in consideration of the frequency of each keyword in the document; a user model data generation unit that generates user model data for the user by mapping the weight to the keyword; and a search result personalization unit that receives a query from the user's terminal and sorts the search results for the query using the user model data.

In one embodiment, the weight calculation unit may include a weight correction unit that corrects the weight in consideration of the distribution of the keywords and the weights.

In another embodiment, the search result personalization unit may include: a search result keyword extraction unit that extracts search result keywords from each document in the search results; a document importance calculation unit that compares the search result keywords with the keywords and calculates the importance of each document based on the weights mapped to the keywords; and a search result sorting unit that sorts the search results for the query using the importance of each document.

In yet another embodiment, the search result personalization unit may include a user model data update unit that updates the user model data by feeding back the user's response to the sorted search results.

An apparatus for providing personalized search results according to yet another aspect of the present invention, for solving the above technical problem, may include: a network interface; one or more processors; a memory into which a computer program executed by the processors is loaded; and storage that stores document data, user model data, and algorithm combination data. Here, the computer program may include: a keyword extraction operation that extracts keywords from a document, among the document data, viewed by a user; a weight calculation operation that calculates a weight in consideration of the frequency of each keyword in the document; a user model data generation operation that generates the user model data for the user by mapping the weight to the keyword; an algorithm combination learning operation that forms clusters by clustering a plurality of users based on their user model data and generates the algorithm combination data for each cluster by machine learning the responses of the users in the cluster to a keyword extraction algorithm, a weight calculation algorithm, a weight correction algorithm, and an importance calculation algorithm; and a search result personalization operation that receives a query from the user's terminal and sorts the search results for the query using the user model data.

According to the present invention as described above, a user's responses to search results are quantified and collected as user model data, so that when the user performs the next search, documents likely to match the user's interests can be presented first.

In addition, the plurality of algorithms that may be used in this process can be configured for each user through machine learning and used to provide personalized search results.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a conceptual diagram for explaining a conventional method of providing search results.
FIGS. 2 and 3 are conceptual diagrams for explaining a method of providing personalized search results according to an embodiment of the present invention.
FIG. 4 is a flowchart of a method of providing personalized search results according to an embodiment of the present invention.
FIG. 5 is a conceptual diagram for explaining the step of extracting keywords from a user's documents of interest and calculating weights according to an embodiment of the present invention.
FIGS. 6 and 7 are conceptual diagrams for explaining the step of calculating weights based on time information related to the viewing of documents according to an embodiment of the present invention.
FIGS. 8 and 9 are conceptual diagrams for explaining the step of personalizing search results using user model data according to an embodiment of the present invention.
FIG. 10 shows algorithms that may be used in each step of the method of providing personalized search results.
FIG. 11 shows the results of testing the importance of a particular document under each algorithm combination.
FIG. 12 is a flowchart for explaining a method of providing personalized search results by machine learning of algorithm combinations according to an embodiment of the present invention.
FIG. 13 shows the results of testing user responses under each algorithm combination.
FIG. 14 is a hardware configuration diagram of an apparatus for providing personalized search results according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that this disclosure is complete and fully conveys the scope of the invention to those of ordinary skill in the art to which the invention pertains, and the invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Unless otherwise defined, all terms used herein (including technical and scientific terms) have the meaning commonly understood by those of ordinary skill in the art to which the invention pertains. Terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive sense unless explicitly and specifically defined. The terminology used herein is for describing the embodiments and is not intended to limit the invention. In this specification, singular forms include plural forms unless the context specifically indicates otherwise.

As used in this specification, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those stated.

Hereinafter, the searches and search results referred to in the present invention concern keyword-based search over documents composed mainly of text. Fields such as image search or video search are not targeted. Here, a text-based document may be an ordinary web document or, depending on the case, a document such as an email or bulletin board post within groupware, or a document created with a document program such as Hangul, Word, or PowerPoint.

Hereinafter, the present invention will be described in more detail with reference to the accompanying drawings.

FIG. 1 is a conceptual diagram for explaining a conventional method of providing search results.

Referring to FIG. 1, when two different users, user A (101a) and user B (101b), search with the same query (① query step), the search engine (111) searches the target documents (113) with that query and retrieves the search results (② searching step). The search engine (111) then sorts the retrieved search results (115) using its own algorithm (③ ranking step). The sorting algorithm may use, for example, information such as how frequently the keyword appears in a target document. Information such as when a target document was created or modified may also be used by the sorting algorithm. The sorted search results (115) are provided to user A (101a) and user B (101b) (④ view / click step).

However, because user A (101a) and user B (101b) entered the same query, the search results (115) retrieved are necessarily the same. Since the search results (115) do not reflect each user's tastes or tendencies, satisfaction with them inevitably differs from user to user.

For example, suppose user A (101a) enters "밤" to search for night, the dark time after sunset, and user B (101b) enters "밤" to search for chestnuts, the fruit of the chestnut tree. If the search results (115) are filled only with information about chestnuts, user A (101a) must either page through the search results (115) and check them one by one, or revise the query and search again, in order to find the desired information.

If different, personalized search results (115) could instead be provided in consideration of the tastes or tendencies of user A (101a) and user B (101b), each user would be spared much of the effort of finding the desired information.

FIGS. 2 and 3 are conceptual diagrams for explaining a method of providing personalized search results according to an embodiment of the present invention.

Referring to FIG. 2, the search results (115) are personalized and sorted for each user using user model data (③ personalized ranking step). Here, the user model data (117) is data generated to reflect each user's tastes or tendencies, for example data capturing the types or characteristics of documents the user is likely to prefer. In other words, it is data that models the user's tastes or tendencies.

The user model data (117) may of course be data entered directly by the user in advance, for example through a user preferences menu, but is preferably data generated by monitoring the user's responses to search results. That is, based on which documents the user chiefly viewed in earlier search results, the next search sorts its results so that documents similar to those the user viewed come first.

As a result, even when both users enter the same query "밤", user A (101a) can be given mainly documents (119a) related to night, the dark time after sunset, and user B (101b) mainly documents (119b) related to chestnuts, the fruit of the chestnut tree. That is, user model data (117) can be generated for each user and used to provide personalized search results (119a, 119b).

FIG. 3 shows the process of generating the user model data (117). The user model data (117) is revised again according to which documents each user chiefly viewed in the personalized search results (119a, 119b) (⑤ feedback step). The more the user searches, the more the search results can consist of documents similar to those the user has viewed, so the quality of the search results can improve.
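
The feedback loop of FIG. 3 can be illustrated, under simplifying assumptions, with the short Python sketch below. The helpers `rank` and `feedback`, the fixed increment `step`, and the sample documents are hypothetical; how an embodiment actually converts a viewed document into weight updates is described with reference to the later figures.

```python
from typing import Dict, List

UserModel = Dict[str, float]


def rank(results: Dict[str, List[str]], model: UserModel) -> List[str]:
    """Sort document ids by the summed user-model weight of their keywords."""
    def score(doc_id: str) -> float:
        return sum(model.get(k, 0.0) for k in set(results[doc_id]))
    return sorted(results, key=score, reverse=True)


def feedback(model: UserModel, viewed_keywords: List[str], step: float = 1.0) -> UserModel:
    """Return a new model in which keywords of a viewed document gain weight.

    Assumption: every keyword of a viewed document gains the same fixed `step`.
    """
    updated = dict(model)
    for keyword in set(viewed_keywords):
        updated[keyword] = updated.get(keyword, 0.0) + step
    return updated


# Hypothetical search results for an ambiguous query.
results = {"chestnut_farm": ["chestnut", "farm"], "night_sky": ["night", "sky"]}

model: UserModel = {}
first = rank(results, model)                   # no preferences yet
model = feedback(model, results["night_sky"])  # the user opens "night_sky"
second = rank(results, model)                  # next search reflects the feedback
```

Before any feedback the original order is kept; after the user opens the night-related document, it moves to the top of the next ranking.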

So far, the conventional method of providing search results and the method of providing personalized search results according to an embodiment of the present invention have been examined conceptually with reference to FIGS. 1 to 3. The method of providing personalized search results according to an embodiment of the present invention is described in more detail below with reference to FIGS. 4 to 14.

FIG. 4 is a flowchart of a method of providing personalized search results according to an embodiment of the present invention.

Referring to FIG. 4, keywords are first extracted from each user's documents of interest (S1100). Here, as mentioned above, a user's documents of interest are the documents in the search results that the user responded to and viewed. In other words, the information a particular user wants to obtain through search is inferred from the documents that user has viewed.

For a user who entered a certain query and chiefly viewed certain documents in its search results, information about those documents of interest is quantified so that similar documents can be given priority in the next search. Various kinds of information about a document of interest could be used to quantify it, but here the characteristics of a document are quantified by focusing on the words, that is, the keywords, it contains.

After keywords are extracted from a document of interest, a weight can be calculated for each keyword (S1200). The weight may be calculated based on, for example, how frequently the keyword appears in the document of interest and how frequently it appears in other documents. The weight may additionally be corrected according to the distribution of the keywords and weights.
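
One well-known way to combine the frequency within a document with the frequency across other documents is a TF-IDF style weight. The Python sketch below is offered only as an assumed example of such a calculation; the function name `keyword_weights`, the smoothed inverse document frequency, and the sample corpus are not taken from the specification.

```python
import math
from collections import Counter
from typing import Dict, List


def keyword_weights(doc: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """TF-IDF style weights for the keywords of `doc` (an assumed formula).

    tf  = occurrences of the keyword in `doc` / number of keywords in `doc`
    idf = ln((1 + N) / (1 + df)) + 1, where N is the corpus size and df is
          the number of corpus documents that contain the keyword
    """
    term_counts = Counter(doc)
    n_docs = len(corpus)
    weights: Dict[str, float] = {}
    for keyword, count in term_counts.items():
        df = sum(1 for other in corpus if keyword in other)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[keyword] = (count / len(doc)) * idf
    return weights


# Hypothetical corpus of keyword lists; the last entry is the document of interest.
corpus = [
    ["business", "report"],
    ["business", "deadline"],
    ["business", "report", "hong"],
]
weights = keyword_weights(corpus[2], corpus)
```

In this sample, a keyword that appears in every document ("business") receives a lower weight than one that appears only in the document of interest ("hong").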

After keywords are extracted from the user's documents of interest and each keyword's weight is calculated, the weights are summed and the per-user keyword weights are generated and stored as user model data (S1300). Using this data, even if the user later searches with a different query, the retrieved documents can be personalized based on the user's responses to previously retrieved documents (S1400).

Because the documents each user has viewed inevitably differ, the user model data differ as well, so even the same query yields different search results for different users. This spares users the effort of going through the search results one by one to find the information they want.

For example, if user A (101a), who entered "밤" to search for night, later enters the query "travel", nighttime destinations that can be visited by setting out at night may be given priority in the search results. If user B (101b), who entered "밤" to search for chestnuts, later enters the query "travel", chestnut-producing regions or destinations offering chestnut-picking programs may be given priority.

FIG. 5 is a conceptual diagram for explaining the step of extracting keywords from a user's documents of interest and calculating weights according to an embodiment of the present invention.

First, keywords are extracted from the documents the user has viewed. Keywords may be extracted from a document's title, author, body, and so on. For a document used in groupware, information about the author's department or rank may also be extracted as keywords. A weight is then calculated for each extracted keyword based on, for example, its frequency within the document and its frequency in other documents.

The calculated weights may be used as user model data as they are, but may be corrected if necessary. Various statistical methods can be applied to weight correction, and the correction may be performed in several stages; that is, weights may be corrected several times by different correction algorithms. After the weights have been calculated and corrected for each document, the keyword weights of the documents are summed to generate and store the user's user model data.

FIG. 5 shows the process of extracting keywords from a particular document (121) among those the user has viewed and calculating their weights. From a document titled "Business Report Materials" written by "Hong Gil-dong" of "Division A", keywords such as "business", "report materials", "Hong Gil-dong", and "Division A" can be extracted. Only the title and author information are used here because of space constraints in the figure; keywords may also be extracted from the body and from other text information associated with the document.
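
As a rough sketch of the extraction step illustrated in FIG. 5, the Python code below tokenizes the title and body and keeps the author and department as whole keywords. The field names, the stop-word list, and the regular-expression tokenizer are assumptions for illustration; a practical implementation for Korean text would likely use a morphological analyzer instead.

```python
import re
from typing import Dict, List

# Hypothetical stop-word list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "for", "and"}


def extract_keywords(document: Dict[str, str]) -> List[str]:
    """Extract keywords from a document given as a field-name-to-text mapping.

    Title and body are split into lowercase word tokens; author and
    department are kept whole, mirroring the FIG. 5 example in which the
    author's name and division each form a single keyword.
    """
    keywords: List[str] = []
    for field in ("title", "body"):
        for token in re.findall(r"\w+", document.get(field, "").lower()):
            if token not in STOPWORDS:
                keywords.append(token)
    for field in ("author", "department"):
        value = document.get(field, "").strip()
        if value:
            keywords.append(value)
    return keywords


document = {
    "title": "Business Report Materials",
    "author": "Hong Gil-dong",
    "department": "Division A",
}
keywords = extract_keywords(document)
```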

For each extracted keyword (123), a weight (125) is first calculated using its frequency in the document and its frequency in other documents. In FIG. 5, "business" has a weight of 0.12, "report materials" 1.32, "Hong Gil-dong" 2.12, and "Division A" 4.12. Once this first-pass calculation is complete, the weights can be corrected as appropriate.

The weight correction step may be performed several times as needed. For example, the weights can be corrected when the variation among them is too large. In the example of FIG. 5, correction brings the weights to 0.02, 0.23, 0.11, and 0.12, respectively.
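
The text does not state which correction produces the values shown in FIG. 5, and the sketch below does not attempt to reproduce them. It only illustrates, as an assumption, one common way of reducing a large spread among weights: logarithmic damping followed by normalization so that the weights sum to 1. The function name `correct_weights` is hypothetical.

```python
import math
from typing import Dict


def correct_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Dampen large weights with log(1 + w), then normalize to sum to 1.

    Assumes non-negative input weights. This is an illustrative correction,
    not the one used to obtain the corrected values in FIG. 5.
    """
    damped = {keyword: math.log1p(w) for keyword, w in weights.items()}
    total = sum(damped.values())
    if total == 0:
        return damped
    return {keyword: value / total for keyword, value in damped.items()}


# First-pass weights from the FIG. 5 example.
raw = {"business": 0.12, "report materials": 1.32, "Hong Gil-dong": 2.12, "Division A": 4.12}
corrected = correct_weights(raw)

raw_spread = max(raw.values()) / min(raw.values())
corrected_spread = max(corrected.values()) / min(corrected.values())
```

Here the ratio between the largest and smallest weight shrinks after correction while the ordering of the keywords is preserved.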

Summing the corrected weights (127) of the documents completes the user model data. In the example of FIG. 5, keywords were also extracted and weighted from another document the user viewed, in addition to the particular document (121). The weights of the keywords "business", "deadline", "application", and "Kim Ji-young" extracted from the other document can be added to those of "business", "report materials", "Hong Gil-dong", and "Division A". Summing the keywords and keyword weights of multiple documents in this way produces the user model data (129). Specific methods of calculating the weights are described in detail later with reference to FIGS. 10 to 13.
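
The summation of per-document keyword weights into user model data can be sketched in Python as follows. The first dictionary uses the corrected weights from the FIG. 5 example; the weights of the second document are hypothetical, since the text does not list them, and the function name `build_user_model` is likewise an assumption.

```python
from collections import Counter
from typing import Dict, Iterable


def build_user_model(per_document_weights: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Sum keyword weights across all viewed documents.

    A keyword that occurs in several documents accumulates their weights.
    """
    model: Counter = Counter()
    for weights in per_document_weights:
        model.update(weights)  # Counter.update adds values per key
    return dict(model)


viewed_documents = [
    # Corrected weights of document (121) from the FIG. 5 example.
    {"business": 0.02, "report materials": 0.23, "Hong Gil-dong": 0.11, "Division A": 0.12},
    # Hypothetical weights for the second viewed document.
    {"business": 0.05, "deadline": 0.20, "application": 0.15, "Kim Ji-young": 0.10},
]
user_model = build_user_model(viewed_documents)
```

The shared keyword "business" ends up with the sum of its weights from both documents, while keywords unique to one document keep their single weight.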

FIGS. 6 and 7 are conceptual diagrams for explaining the step of calculating weights based on time information related to the viewing of documents according to an embodiment of the present invention.

When weights are calculated by monitoring the user's responses to search results, weighting documents solely on whether they were viewed may fail to reflect the relative importance of each document. Users typically open the documents offered in the search results to check whether they contain the desired information, and this needs to be taken into account. That is, when a user views documents from the search results, the time spent viewing each document should be considered, so that the weights reflect which documents in the search results actually contained the information the user wanted.

It is also necessary to reflect the relative importance of documents viewed from the results of past queries versus documents viewed from the results of recent queries. That is, documents found and viewed more recently should receive larger weights. To this end, the times at which the user views documents in the search results need to be monitored. In other words, a process of correcting the weights based on the cumulative time the user spent viewing a document and on the date and time of its most recent viewing is required.

Referring to FIG. 6, a search result 133 is shown for a search performed by the user entering the query "settlement" in the search box 131. Within the groupware, the results retrieved for the query "settlement" may include documents such as "development definition.pptx", "screen design.pptx", and "settlement documents.xls". The user considers which document holds the desired information by referring to the title or summary (not shown) of each document, and then opens several documents in the search results to confirm whether they really contain it. By monitoring the times at which the user starts and finishes viewing a document, the time spent viewing the document and the date and time of its most recent viewing can be obtained.

In the example of FIG. 6, the user spent 10 seconds viewing "development definition.pptx", 8 seconds viewing "screen design.pptx", and 6 minutes 32 seconds viewing "settlement documents.xls". The information the user was looking for is therefore most likely in "settlement documents.xls". If the user's viewing times for each document are monitored and stored separately as log data 135, both the order in which the documents were viewed and the cumulative viewing time of each can be obtained. That is, it can be determined which document was viewed most recently and how much time was spent viewing it.
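The log described above can be sketched as follows. This is a minimal illustration, assuming each log entry records a document name with the times at which viewing started and ended; the function name `summarize_views`, the record layout, and the timestamps are hypothetical, and only the three viewing durations come from FIG. 6.

```python
from datetime import datetime, timedelta

def summarize_views(log):
    """Return {document: (cumulative viewing time, most recent close time)}.

    `log` is an iterable of (document, opened_at, closed_at) tuples;
    this record layout is an assumption made for illustration.
    """
    summary = {}
    for doc, opened_at, closed_at in log:
        total, last = summary.get(doc, (timedelta(0), None))
        total += closed_at - opened_at
        if last is None or closed_at > last:
            last = closed_at
        summary[doc] = (total, last)
    return summary

start = datetime(2015, 10, 8, 9, 0, 0)  # hypothetical session start
log = [
    ("development definition.pptx", start, start + timedelta(seconds=10)),
    ("screen design.pptx", start + timedelta(seconds=15), start + timedelta(seconds=23)),
    ("settlement documents.xls", start + timedelta(seconds=30), start + timedelta(seconds=422)),
]
views = summarize_views(log)
# The document with the longest cumulative viewing time is the likeliest target.
longest = max(views, key=lambda d: views[d][0])
```

Keeping the raw open and close times, rather than only a duration, allows both the cumulative viewing time and the most recent viewing to be derived from the same log.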

Referring to FIG. 7, weights are calculated based on the time spent viewing each document and the time at which each document was most recently viewed, both obtained as in FIG. 6. Whereas FIG. 5 calculated the keyword weighting of the keywords contained in each document, FIG. 7 calculates a time weighting for each document.

If the time weight of each document is not considered, the user model data is generated simply by summing the keyword weights of the documents the user viewed. In the example of FIG. 7, the keyword weights of document 1 141a and document 2 141b were calculated and then summed without any time weighting to obtain user model data 151. Here the keyword weights of the documents were added without regard to the relative importance of each document, so the shared keyword "business" received a weight of 0.02 + 0.04 = 0.06. No other keyword is shared, so the resulting user model data 151 has weights of 0.23 for "report data", 0.11 for "Hong Gil-dong", 0.12 for "Division A", 0.36 for "deadline", 0.17 for "application", and 0.24 for "Kim Ji-young".
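The unweighted summation can be sketched as below. The keyword weights are those stated for FIG. 7; which document contributes 0.02 and which 0.04 for "business" is not stated in the text and is assumed here, as is the helper name `sum_keyword_weights`.

```python
from collections import defaultdict

def sum_keyword_weights(documents):
    """Add up per-document keyword weights into one user model."""
    model = defaultdict(float)
    for weights in documents:
        for keyword, weight in weights.items():
            model[keyword] += weight
    return dict(model)

# Assumption: document 1 carries the 0.02 weight for "business", document 2 the 0.04.
doc1 = {"business": 0.02, "report data": 0.23, "Hong Gil-dong": 0.11, "Division A": 0.12}
doc2 = {"business": 0.04, "deadline": 0.36, "application": 0.17, "Kim Ji-young": 0.24}

model_151 = sum_keyword_weights([doc1, doc2])
# "business" is the only shared keyword, so only its weight is a sum of two terms.
```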

However, as explained with reference to FIG. 6, the weights need to be corrected so that documents viewed longer ago, and documents on which less viewing time was spent, receive lower weights. In contrast to the keyword weighting of FIG. 5, this is called time weighting.

The weights 143a and 143b based on viewing time increase with the time spent viewing. Document 1 141a was viewed for a long time and has a time weight of 0.5 (143a), while document 2 141b was viewed briefly and has a time weight of 0.2 (143b). If desired, the time weight based on viewing time can be made to converge to a particular value as the viewing time grows. In the example of FIG. 7, it is set to converge to 1.

In addition to the weights 143a and 143b based on viewing time, more recently viewed documents can be given higher weights. Looking at the weights 145a and 145b based on recency, document 1 141a was viewed long ago and has a weight of 0.1, while document 2 141b was viewed recently and has a weight of 0.6. If desired, the most recently viewed document can be assigned a particular value. In the example of FIG. 7, the most recently viewed document is set to have a value of 1, and the time weight decreases as the viewing date recedes into the past, eventually converging to 0.
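The text fixes only the limiting behavior of the two time weights: the viewing-time weight grows toward 1, and the recency weight equals 1 for the most recent viewing and decays toward 0. The functional forms below (a saturating exponential and a half-life decay) and the parameters `tau` and `half_life_days` are assumptions chosen to satisfy those limits; they are not taken from the source.

```python
import math

def dwell_weight(seconds, tau=300.0):
    """Viewing-time weight: 0 for no viewing, approaching 1 for long viewing.

    The form 1 - exp(-t / tau) and the value of tau are illustrative assumptions.
    """
    return 1.0 - math.exp(-seconds / tau)

def recency_weight(days_since_view, half_life_days=30.0):
    """Recency weight: 1 for a document viewed just now, decaying toward 0.

    The half-life form is an illustrative assumption.
    """
    return 0.5 ** (days_since_view / half_life_days)

# Longer viewing gives a larger weight; older viewing gives a smaller one.
short_view, long_view = dwell_weight(10), dwell_weight(392)
recent, old = recency_weight(1), recency_weight(365)
```

Any monotonic function with the same limits would serve equally well; the choice mainly affects how quickly long viewing times saturate and how fast old documents fade.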

Compared with the user model data 151 generated without time weighting, the user model data 157 that takes time weighting into account has different values. When time weighting is considered, the keyword weights of each document can be corrected by multiplying them first by the time weight based on viewing time and second by the time weight based on recency. For document 1 141a, the keyword weights 147a are corrected by a factor of 0.05 (= 0.5*0.1), and for document 2 141b, the new keyword weights 147b are corrected by a factor of 0.12 (= 0.2*0.6). Summing the two yields new user model data 157 that reflects the time weights, and ordering the search results with it makes it possible to give priority to documents better suited to the user's interests.
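The correction can be sketched as follows, using the time weights given for FIG. 7 (0.5 and 0.1 for document 1, 0.2 and 0.6 for document 2). The assignment of the "business" weights 0.02 and 0.04 to the two documents is assumed, and `time_corrected_model` is a hypothetical helper name; the resulting values are derived here and are not quoted from the figure.

```python
def time_corrected_model(documents):
    """Sum keyword weights after scaling each document by its time weights.

    `documents` is a list of (keyword_weights, dwell_weight, recency_weight).
    """
    model = {}
    for weights, dwell_w, recency_w in documents:
        factor = dwell_w * recency_w  # e.g. 0.5 * 0.1 = 0.05 for document 1
        for keyword, weight in weights.items():
            model[keyword] = model.get(keyword, 0.0) + weight * factor
    return model

# Assumption: document 1 carries the 0.02 weight for "business", document 2 the 0.04.
doc1 = {"business": 0.02, "report data": 0.23, "Hong Gil-dong": 0.11, "Division A": 0.12}
doc2 = {"business": 0.04, "deadline": 0.36, "application": 0.17, "Kim Ji-young": 0.24}

model_157 = time_corrected_model([(doc1, 0.5, 0.1), (doc2, 0.2, 0.6)])
# Keywords of the recently viewed document 2 now outweigh those of document 1.
```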

FIGS. 8 and 9 are conceptual diagrams illustrating the step of personalizing search results using user model data, according to an embodiment of the present invention.

So far, the process of calculating weights by considering keyword weights and time weights in order to generate user model data has been described with reference to FIGS. 5 to 7. The user model data generated in this way can be used to order search results. That is, even when the same keyword is entered, the search results provided to different users differ because their user model data differ. For example, for a word with two meanings such as the Korean "밤" (night or chestnut), documents about the dark night after sunset can be provided first to user A 101a, and documents about chestnuts, the fruit of the chestnut tree, can be provided first to user B 101b.

Referring to FIG. 8, algorithms for determining the order in which the retrieved documents are sorted are introduced. Broadly, three algorithms can be used: a matching algorithm, a unique matching algorithm, and a language model algorithm.

The matching algorithm extracts keywords from a retrieved document and calculates the importance of the document using the frequency of each keyword in the retrieved document and the weight of that keyword in the user model data.

In the example of FIG. 8, the matching algorithm is applied to search result document 1 163 as follows. "Business" appears twice and has a weight of 0.1 in the user model data 161; "report data" appears once with a weight of 0.2; "Division A" appears twice with a weight of 0.4; and "meeting" appears once with a weight of 0. Multiplying each frequency by its weight and adding gives (2*0.1) + (1*0.2) + (2*0.4) + (1*0) = 1.2, an importance of 1.2.
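A minimal sketch of the matching calculation, using the frequencies and weights of FIG. 8. The function name `matching_score` and the dictionary representation are assumptions; keywords absent from the user model are treated as having weight 0, as with "meeting".

```python
def matching_score(keyword_counts, user_model):
    """Sum of (frequency in document) * (weight in user model)."""
    return sum(count * user_model.get(keyword, 0.0)
               for keyword, count in keyword_counts.items())

user_model_161 = {"business": 0.1, "report data": 0.2, "Division A": 0.4}
document_1 = {"business": 2, "report data": 1, "Division A": 2, "meeting": 1}

score = matching_score(document_1, user_model_161)  # approximately 1.2
```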

The unique matching algorithm is similar to the matching algorithm but differs in that it does not consider the frequency of keywords in the retrieved document. That is, as long as a keyword appears in the retrieved document, the importance of the document is calculated without distinguishing whether the keyword appears once or many times.

In the example of FIG. 8, applying the unique matching algorithm to search result document 1 163, the weights 0.1, 0.2, 0.4, and 0 in the user model data 161 of the keywords "business", "report data", "Division A", and "meeting" appearing in the document are summed: 0.1 + 0.2 + 0.4 + 0 = 0.7, an importance of 0.7.
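The unique matching calculation can be sketched in the same way, assuming the document is reduced to the set of keywords it contains. The function name `unique_matching_score` is hypothetical.

```python
def unique_matching_score(keywords, user_model):
    """Sum of user-model weights over the distinct keywords in the document."""
    return sum(user_model.get(keyword, 0.0) for keyword in set(keywords))

user_model_161 = {"business": 0.1, "report data": 0.2, "Division A": 0.4}
# Repeated keywords count once, so "business" and "Division A" are not doubled.
document_1 = ["business", "business", "report data", "Division A", "Division A", "meeting"]

score = unique_matching_score(document_1, user_model_161)  # approximately 0.7
```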

Compared with the two preceding algorithms, the language model algorithm takes a slightly different approach. The weights of the keywords are summed to obtain a total weight (w_total), and the importance is calculated by taking the logarithm of each keyword's ratio to that total. In this process, 1 is added to each keyword weight before the logarithm is taken. With the language model algorithm, unlike the matching and unique matching algorithms, a keyword whose weight is 0 in the user model data 161 can also affect the importance.

In the example of FIG. 8, applying the language model algorithm to search result document 1 163 gives 2*log((0.1+1)/1.5) + 1*log((0.2+1)/1.5) + 2*log((0.4+1)/1.5) + 1*log((0+1)/1.5) = 4.8, an importance of 4.8.
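The sketch below assumes that w_total is the sum of all weights in the user model, which gives 1.5 for the weights listed for FIGS. 8 and 9 (0.1, 0.2, 0.4, 0.3, 0.5). The per-keyword figures stated in the text (1.467, 0.8, 1.867, 0.667) appear to match frequency * (w + 1) / w_total with no logarithm applied, so the hypothetical function `language_model_score` exposes both variants through an assumed `use_log` flag.

```python
import math

def language_model_score(keyword_counts, user_model, use_log=True):
    """Sum of frequency * f((w + 1) / w_total) over the document's keywords.

    f is the natural logarithm when use_log is True, otherwise the identity.
    """
    w_total = sum(user_model.values())
    score = 0.0
    for keyword, count in keyword_counts.items():
        ratio = (user_model.get(keyword, 0.0) + 1.0) / w_total
        score += count * (math.log(ratio) if use_log else ratio)
    return score

user_model_161 = {"business": 0.1, "report data": 0.2, "Division A": 0.4,
                  "Hong Gil-dong": 0.3, "Kim Ji-young": 0.5}
document_1 = {"business": 2, "report data": 1, "Division A": 2, "meeting": 1}

# Without the logarithm, the stated total of 4.8 is reproduced.
ratio_score = language_model_score(document_1, user_model_161, use_log=False)
log_score = language_model_score(document_1, user_model_161, use_log=True)
```

In either variant, the added 1 lets a keyword such as "meeting", whose user-model weight is 0, still contribute a nonzero term.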

As seen in FIG. 8, even when importance is calculated for the same search result document 1 163 using the same user model data 161, the importance of the document can differ depending on the algorithm applied. Each algorithm is examined in more detail together with FIG. 9.

Continuing from FIG. 8, FIG. 9 shows the process of calculating the importance of search result document 2 165 with each algorithm.

The calculation of document importance for search result document 1 163 and search result document 2 165 is tabulated below.

Table 1
Keyword        Matching     Unique matching   Language model
business       2*0.1=0.2    0.1               2*log((0.1+1)/1.5)=1.467
report data    1*0.2=0.2    0.2               1*log((0.2+1)/1.5)=0.8
Division A     2*0.4=0.8    0.4               2*log((0.4+1)/1.5)=1.867
meeting        1*0=0        0                 1*log((0+1)/1.5)=0.667
Total          1.2          0.7               4.8

Table 2
Keyword        Matching     Unique matching   Language model
Hong Gil-dong  2*0.3=0.6    0.3               2*log((0.3+1)/1.5)=1.733
Division A     1*0.4=0.4    0.4               1*log((0.4+1)/1.5)=0.933
Kim Ji-young   1*0.5=0.5    0.5               1*log((0.5+1)/1.5)=1
Total          1.5          1.2               3.67

Table 1 summarizes the importance calculation for search result document 1 163, and Table 2 summarizes it for search result document 2 165. Referring to Tables 1 and 2, when importance is calculated with the matching algorithm or the unique matching algorithm, the importance of search result document 2 165 (1.5 or 1.2) is higher than that of search result document 1 163 (1.2 or 0.7), so search result document 2 165 can be treated as the more important document. When importance is calculated with the language model algorithm, however, the importance of search result document 2 165 (3.67) is lower than that of search result document 1 163 (4.8), so search result document 1 163 can be treated as the more important document.
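The reversal of order can be checked directly from the totals of Tables 1 and 2. The sketch below simply sorts the two documents by each algorithm's total; the dictionary layout and the function name `rank` are illustrative.

```python
# Totals taken from Tables 1 and 2.
totals = {
    "matching": {"document 1": 1.2, "document 2": 1.5},
    "unique matching": {"document 1": 0.7, "document 2": 1.2},
    "language model": {"document 1": 4.8, "document 2": 3.67},
}

def rank(scores):
    """Return document names ordered from most to least important."""
    return sorted(scores, key=scores.get, reverse=True)

rankings = {algorithm: rank(scores) for algorithm, scores in totals.items()}
# Document 2 leads under both matching variants; document 1 leads under the language model.
```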

As in the example of FIG. 8, the order in which search results are sorted can be reversed depending on the algorithm applied to calculate document importance. The step of calculating document importance in FIG. 8 corresponds to the step of personalizing the search results (S1400) in the flowchart of FIG. 4. Besides the personalization step (S1400) of FIG. 8, the ordering of search results can also vary considerably depending on the algorithms applied in the remaining steps (S1100, S1200, S1300). This is examined with reference to FIGS. 10 to 13.

FIGS. 10 to 13 are conceptual diagrams illustrating the process of machine learning the combination of algorithms used in personalizing search results, according to an embodiment of the present invention.

FIG. 10 shows algorithms that can be used in each step of the method for providing personalized search results. First, in the step of extracting keywords from a document of interest (S1100), algorithms such as n-gram, context, stemming, part-of-speech tagging, and NER (named entity recognition) can be used. Briefly, the n-gram algorithms include uni-gram, bi-gram, and tri-gram, which extract text one character, two characters, and three characters at a time, respectively. Applying the n-gram algorithms to the example sentence "철수가 학교에 간다" (Cheolsu goes to school) gives Table 3 below.

Table 3
Original: 철수가 학교에 간다. (Cheolsu goes to school.)
uni-gram: 철, 수, 가, 학, 교, 에, 간, 다
bi-gram: 철수, 수가, 학교, 교에, 간다
tri-gram: 철수가, 학교에, 간다
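The rows of Table 3 can be reproduced with a character n-gram extractor applied to each space-delimited token. Emitting a token whole when it is no longer than n characters is an assumption inferred from the tri-gram row, which lists the two-character token "간다"; the function name `char_ngrams` is hypothetical.

```python
def char_ngrams(sentence, n):
    """Character n-grams taken within each whitespace-delimited token."""
    grams = []
    for token in sentence.split():
        if len(token) <= n:
            # Assumed fallback: a short token is kept whole.
            grams.append(token)
        else:
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams

sentence = "철수가 학교에 간다"
uni = char_ngrams(sentence, 1)
bi = char_ngrams(sentence, 2)
tri = char_ngrams(sentence, 3)
```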

Next, the context algorithm extracts person names, task names, department names, dates, and the like from text. Applying the context algorithm to the example sentence "본 회계 자료는 경리 부서 김철수에 의해 발표된 자료입니다." (This accounting material was presented by Kim Cheol-su of the accounting department.) gives Table 4 below.

Table 4
Original: 본 회계 자료는 경리 부서 김철수에 의해 발표된 자료입니다. (This accounting material was presented by Kim Cheol-su of the accounting department.)
Keywords: person name: 김철수 (Kim Cheol-su); department name: 경리부서 (accounting department); task name: 회계 자료 (accounting material)

Next, the part-of-speech tagging algorithm tags text with parts of speech. Applying the part-of-speech tagging algorithm to the example sentence "철수가 학교에 간다." (Cheolsu goes to school.) gives Table 5 below.

Table 5
Original: 철수가 학교에 간다. (Cheolsu goes to school.)
Keywords: 철수/noun, 가/particle, 학교/noun, 에/particle, 가/verb

Next, the stemming algorithm extracts a single common word from English text according to specific rules. A simple example is shown in Table 6 below.

Table 6
Original: engineering, engineer, engine
Keyword: engine

Next, the NER (named entity recognition) algorithm extracts dates, names, place names, and the like in a manner similar to the context algorithm. Applying the NER algorithm to the example sentence "철수가 학교에 간다." (Cheolsu goes to school.) gives Table 7 below.

Table 7
Original: 철수가 학교에 간다. (Cheolsu goes to school.)
Keywords: person name: 철수 (Cheolsu); place: 학교 (school)

Even for the same example sentence, different keywords may be extracted depending on the keyword extraction algorithm applied, and the extracted keywords may be classified by part of speech or by type. The choice of keyword extraction algorithm can depend heavily on the nature of the documents. Although not shown in FIG. 10, a search service that analyzes reviews of movies or music, for example, would benefit from an algorithm that extracts positive and negative keywords, and an algorithm for extracting keywords when searching academic papers will necessarily differ from one used when searching gossip news.

It was mentioned with reference to FIG. 5 that, in the step of calculating keyword weights (S1200), the weight of each keyword can be calculated using its frequency. Among the algorithms that can be applied here, the TF-IDF (term frequency - inverse document frequency) algorithm is a generally applicable method that raises the weight of a word the more often it appears in the document concerned and lowers the weight the more often it appears across many other documents. That is, words that appear frequently only in the document concerned are weighted more highly. Concretely, TF-IDF can be obtained as the frequency of the keyword in the document divided by its frequency in other documents, where the frequency in other documents can be applied as the logarithm of its inverse.
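A minimal sketch of this weighting is given below, assuming the commonly used textbook form tf * log(N / df), where N is the number of documents and df the number of documents containing the term. The source's own formula is not reproduced in the text, so this form, the function name `tf_idf`, and the toy corpus are assumptions.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """TF-IDF weights for one document; `corpus` must include that document."""
    n_docs = len(corpus)
    weights = {}
    for term, freq in Counter(doc_tokens).items():
        # Document frequency: number of documents containing the term.
        df = sum(1 for tokens in corpus if term in tokens)
        weights[term] = freq * math.log(n_docs / df)
    return weights

corpus = [
    ["business", "report", "business", "deadline"],
    ["business", "meeting"],
    ["business", "application"],
]
weights = tf_idf(corpus[0], corpus)
# "business" occurs in every document, so its weight is 0 despite appearing twice.
```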

Next, the TF-IDF normalized algorithm, unlike the ordinary TF-IDF algorithm, calculates keyword weights after normalizing for document length. The longer a document is, the more often a given keyword is likely to appear, so the weight of the keyword is calculated with this in mind. In other words, for keywords with the same frequency, a keyword in a shorter document is weighted more highly. The algorithm is used to dampen the effect whereby words appearing in long documents would otherwise always tend to receive high weights.

Next, the BM25 algorithm is similar to the TF-IDF normalized method and evaluates the weights differently through the adjustment of a few parameters. The algorithms for calculating keyword weights described so far can be expressed as simple formulas, as in Table 8 below.

Table 8. Weight calculation formulas for TF-IDF, TF-IDF (norm), and BM25 (formula images not available).
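Because the formulas of Table 8 are not available in the text, the following sketch uses the widely published Okapi BM25 term weight rather than the source's own expression. The parameters `k1` and `b`, their default values, and the smoothed IDF form are the conventional ones and are assumptions here.

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 weight of one term in one document (conventional form).

    tf: term frequency in the document; df: number of documents with the term.
    b controls document-length normalization; k1 controls tf saturation.
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)

# Same term frequency, different document lengths.
in_short_doc = bm25_weight(tf=2, df=1, n_docs=10, doc_len=50, avg_doc_len=100)
in_long_doc = bm25_weight(tf=2, df=1, n_docs=10, doc_len=200, avg_doc_len=100)
```

With `b` greater than 0, the same frequency counts for more in a shorter document, which is the length normalization described for TF-IDF (norm); setting `b` to 0 removes the length effect entirely.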

After the keyword weights have been calculated, a step of correcting the weights may additionally be performed as needed. Among the algorithms that can be applied here, the SVD (singular value decomposition) algorithm decomposes a matrix of terms and documents into U, representing the term components, sigma, representing the singular values, and V (transposed), representing the document components, and is used for dimensionality reduction.

Next, the GMM (Gaussian mixture model) algorithm is a mixture of Gaussian distribution models that assigns a higher weight the closer a value lies to the center of the distribution of values. When the data exhibit the characteristics of several Gaussian distributions, these can be combined into a mixture model.

Finally, the algorithms that can be used in the step of generating user model data and personalizing the search results with it (S1400) include, as seen in FIGS. 8 and 9, the matching algorithm, the unique matching algorithm, and the language model algorithm. The matching algorithm, when a keyword of the user model data is present in a search result document, computes the sum of keyword weights while also taking into account the number of occurrences of the keyword in that document. The unique matching algorithm simply computes the sum of keyword weights when a keyword of the user model data is present in the search result. The language model algorithm computes with each keyword weight divided by the sum of the weights of all keywords.

Besides the algorithms shown in FIG. 10, various other algorithms can be used in each step of the method for providing personalized search results. The algorithm that is appropriate to apply may differ depending on the document, the user, and the keyword.

FIG. 11 shows the results of testing the importance of a specific document under each combination of algorithms.

Referring to FIG. 11, each combination of algorithms was tested against the same user model data and the same search result document. The POSTagging algorithm (not shown) was used to extract keywords; TF-IDF, TF-IDF (norm), and BM25 were each tested as the weight calculation algorithm, both with and without a weight correction algorithm. Even with the same user model data and the same search result document, the importance of the document is calculated differently depending on the algorithms applied.

The highest importance calculated for the search result document was 0.55, obtained with the combination (TF-IDF (norm) / GMM / unique matching algorithm). The lowest was 0.3, obtained with the combinations (TF-IDF / language model algorithm), (BM25 / language model algorithm), and (TF-IDF (norm) / GMM / matching algorithm). Depending on the combination of algorithms, the importance of the search result document thus ranges from 0.3 to 0.55.

When retrieved results are sorted for presentation, the importance of each document can be calculated and used to order the documents shown to the user. The importance of a document is affected not only by the user model data generated from the documents the user has already viewed, but also by the combination of algorithms used in the method for providing personalized search results. If the combination of algorithms could itself be personalized for a particular user, more reliable search results could be provided. Because finding the right combination of algorithms for a particular user is difficult, several methods for doing so simply are examined together with FIG. 12.

FIG. 12 is a flowchart illustrating a method of providing personalized search results by machine learning the combination of algorithms, according to an embodiment of the present invention.

Referring to FIG. 12, after the step of generating user model data (S1300), a step of clustering users based on the generated user model data (S1310) can be performed. Because the samples or experimental results available for a single user may be too few to machine learn a combination of algorithms per user, users are grouped based on their user model data. Clustering users who have similar keywords and similar keyword weights, taking into account the keywords contained in the user model data and the weight of each keyword, allows the combination of algorithms to be learned more efficiently.
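The text does not specify how users are clustered. As one possible sketch, user models can be compared by cosine similarity over their keyword weights and grouped greedily; the similarity measure, the threshold value, the function names, and the sample user models are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two keyword-weight dictionaries."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster_users(models, threshold=0.5):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []
    for user, model in models.items():
        for cluster in clusters:
            if cosine(model, models[cluster[0]]) >= threshold:
                cluster.append(user)
                break
        else:
            clusters.append([user])
    return clusters

models = {  # hypothetical user model data
    "A": {"business": 0.4, "deadline": 0.3},
    "B": {"business": 0.5, "deadline": 0.2},
    "C": {"chestnut": 0.6, "recipe": 0.2},
}
clusters = cluster_users(models)
```

Users A and B share keywords with similar weights and fall into one cluster, while user C, with no keywords in common, forms a separate one.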

After the user model data have been clustered, the user's response is machine learned for each cluster, using as features the algorithm for extracting keywords, the algorithm for calculating weights, the algorithm for correcting weights, and the algorithm for calculating document importance (S1320). Because the time spent viewing a particular document in the search results makes it possible to infer which document the user was looking for, measures such as the proportion of documents given priority by a particular combination of algorithms that the user was actually looking for can be used to learn which combination is more useful for the cluster. In the machine learning process, the statistical Naive Bayes algorithm, the vector-based SVM (support vector machine) algorithm, the kNN (k-nearest neighbor) algorithm, and the like can be applied.

FIG. 13 shows the results of testing the user's response under each combination of algorithms.

For a particular user model data cluster, search results were provided using several combinations of algorithms, and whether the information the user sought was given priority was quantified and evaluated based on the user's response to the search results presented first.

In that cluster, the combination of POSTagging as the keyword extraction algorithm, BM25 as the primary weight calculation algorithm, SVD as the secondary weight correction algorithm, and a matching algorithm (not shown) as the search result importance calculation algorithm proved the most suitable, with a precision of 0.79, a recall of 0.49, and an F-Measure of 0.60.
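
The three reported figures can be cross-checked with a short calculation. The sketch assumes that the F-Measure is the balanced harmonic mean of precision and recall (F1). The description does not state which variant was used, but the reported values are consistent with this one.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values reported for the best combination in FIG. 13.
score = f_measure(0.79, 0.49)
print(f"{score:.2f}")
```

With a precision of 0.79 and a recall of 0.49 the result rounds to 0.60, matching the reported F-Measure.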

If, as in the example of FIG. 13, an algorithm combination is learned for each user cluster formed from the user model data, then the documents a user views serve as feedback that refines not only that user's model data but also the algorithm combination applied to the cluster to which that user model data belongs. Personalized search results can thereby be provided more efficiently.

FIG. 14 is a hardware configuration diagram of a personalized search result providing apparatus according to an embodiment of the present invention.

Referring to FIG. 14, the personalized search result providing apparatus 10 may include one or more processors 510, a memory 520, a storage 560, and a network interface 570. The processor 510, the memory 520, the storage 560, and the network interface 570 exchange data via a system bus 550.

The processor 510 executes a computer program loaded into the memory 520, and the memory 520 loads the computer program from the storage 560. The computer program may include a keyword extraction operation 521, a weight calculation operation 523, a user model data generation operation 525, an algorithm combination learning operation 527, and a search result personalization operation 529.

The keyword extraction operation 521 may extract keywords from the documents the user has viewed among the document data 569 in the storage 560 by applying an algorithm such as Context or POSTagging. It may also extract keywords from each document retrieved for a query from the document data 569, so that the importance of each retrieved document can be calculated using the user model data 561.

The weight calculation operation 523 may calculate the weight of each keyword extracted by the keyword extraction operation 521 by applying an algorithm such as TF-IDF or BM25. As needed, the weight calculation operation 523 may also correct the calculated weights by applying an algorithm such as GMM or SVD, or by applying a time weight based on, for example, how long the user viewed the document or how recently the user viewed it.
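
A minimal sketch of the weight calculation and the time-based correction follows. It uses one common TF-IDF variant with a smoothed IDF term. The description does not give the exact formulas, so the smoothing, the logarithmic dwell-time factor, the exponential recency decay, the `half_life_days` parameter, and the function names are all assumptions. Keywords are assumed to have been extracted already.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Weight each keyword of one document by term frequency times smoothed IDF."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[term] = (count / len(doc_tokens)) * idf
    return weights

def time_corrected(weights, dwell_seconds, days_since_view, half_life_days=30.0):
    """Scale weights up for longer viewing and down for older viewing (assumed forms)."""
    dwell_factor = math.log1p(dwell_seconds)                     # grows with viewing duration
    recency_factor = 0.5 ** (days_since_view / half_life_days)   # halves every half-life
    return {t: w * dwell_factor * recency_factor for t, w in weights.items()}

# Hypothetical keyword lists for three documents.
corpus = [["search", "ranking", "search"], ["invoice", "tax"], ["search", "index"]]
weights = tf_idf(corpus[0], corpus)
recent = time_corrected(weights, dwell_seconds=60, days_since_view=0)
older = time_corrected(weights, dwell_seconds=60, days_since_view=30)
print(weights, recent, older)
```

In this sketch a document viewed one half-life ago contributes exactly half the weight of one viewed today, and a longer viewing duration increases the weight.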

The user model data generation operation 525 may generate the user model data 561 by summing the keywords calculated by the weight calculation operation 523, together with the weight information corresponding to those keywords, over the documents the user has viewed. The generated user model data 561 is stored via the system bus 550 as the user model data 561 in the storage 560.
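
The summation over viewed documents can be sketched as follows. The representation of user model data as a keyword-to-weight dictionary, the function name `build_user_model`, and the sample weights are assumptions for illustration.

```python
from collections import Counter

def build_user_model(per_document_weights):
    """Sum keyword weights across all documents the user has viewed."""
    model = Counter()
    for weights in per_document_weights:
        model.update(weights)  # Counter.update adds values for matching keys
    return dict(model)

# Hypothetical keyword weights for two documents viewed by one user.
viewed = [
    {"search": 0.5, "ranking": 0.25},
    {"search": 0.25, "index": 0.125},
]
user_model = build_user_model(viewed)
print(user_model)
```

A keyword that recurs across viewed documents, such as `search` here, accumulates a larger weight than one seen in a single document.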

The algorithm combination learning operation 527 may cluster users into a plurality of clusters based on the user model data 561 generated by the user model data generation operation 525. For each cluster, it may then machine-learn an algorithm combination suited to that cluster by taking the keyword extraction algorithm, the weight calculation algorithm, the weight correction algorithm, and the document importance calculation algorithm as inputs and monitoring users' responses to the search results. The algorithm combinations learned for each user cluster are stored via the system bus 550 as the algorithm combination data 563 in the storage 560.

The search result personalization operation 529 receives a query from the user through the network interface 570, searches the document data 569 in the storage 560, and then calculates the importance of the retrieved documents using the user model data 561 and the algorithm combination data 563. It sorts the search results by the calculated importance and provides each user with personalized search results.
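
The importance calculation and sorting can be sketched as below. The matching algorithm itself is not shown in the description, so scoring a document by the sum of the user model weights of its matching keywords is an assumption, as are the function names and the sample data.

```python
def importance(doc_keywords, user_model):
    """Sum the user model weights of the keywords that occur in the document."""
    return sum(user_model.get(k, 0.0) for k in set(doc_keywords))

def personalize(results, user_model):
    """Sort (doc_id, keywords) search results by descending importance."""
    return sorted(results, key=lambda r: importance(r[1], user_model), reverse=True)

# Hypothetical user model data and search results for one query.
user_model = {"search": 0.75, "ranking": 0.25}
results = [
    ("d1", ["invoice", "tax"]),
    ("d2", ["search"]),
    ("d3", ["search", "ranking"]),
]
ranked_ids = [doc_id for doc_id, _ in personalize(results, user_model)]
print(ranked_ids)
```

Documents sharing more heavily weighted keywords with the user model are placed first, and documents with no matching keyword fall to the end.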

Each component in FIG. 14 may be software, or hardware such as an FPGA (Field Programmable Gate Array) or an ASIC (Application-Specific Integrated Circuit). The components are not limited to software or hardware, however. They may be configured to reside in an addressable storage medium, or configured to run on one or more processors. The functions provided by the components may be implemented by further subdivided components, or several components may be combined into a single component that performs a specific function.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the invention may be embodied in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

Claims

A method for providing personalized search results, the method comprising:
extracting a keyword from a document viewed by a user;
calculating a weight in consideration of a frequency of the keyword in the document;
generating user model data for the user by mapping the weight to the keyword; and
receiving a query from a terminal of the user and sorting search results for the query using the user model data.

The method according to claim 1,
wherein calculating the weight comprises
correcting the weight in consideration of a distribution of the keyword and the weight.

The method according to claim 1,
wherein calculating the weight comprises
correcting the weight in consideration of time information on the user's viewing of the document.

The method of claim 3,
wherein correcting the weight comprises
increasing the weight the longer the user viewed the document.

The method of claim 3,
wherein correcting the weight comprises
increasing the weight the more recently the user viewed the document.

The method according to claim 1,
wherein sorting the search results for the query comprises:
extracting a search result keyword from each document belonging to the search results;
comparing the search result keyword with the keyword and calculating an importance of each document based on the weight mapped to the keyword; and
sorting the search results for the query using the importance of each document.

The method according to claim 1,
wherein sorting the search results for the query comprises
updating the user model data by feeding back the user's response to the sorted search results.

The method according to claim 1,
wherein generating the user model data for the user comprises
forming a cluster by clustering a plurality of users based on the user model data.

The method of claim 8,
wherein forming the cluster comprises
determining a keyword extraction algorithm for the cluster by machine learning the responses of users belonging to the cluster to keyword extraction algorithms.

The method of claim 8,
wherein forming the cluster comprises
determining a weight calculation algorithm for the cluster by machine learning the responses of users belonging to the cluster to weight calculation algorithms.

The method of claim 8,
wherein forming the cluster comprises
determining a weight correction algorithm for the cluster by machine learning the responses of users belonging to the cluster to weight correction algorithms.

The method of claim 8,
wherein forming the cluster comprises
determining an importance calculation algorithm for the cluster by machine learning the responses of users belonging to the cluster to importance calculation algorithms.

A personalized search result providing apparatus comprising:
a keyword extraction unit that extracts a keyword from a document viewed by a user;
a weight calculation unit that calculates a weight in consideration of a frequency of the keyword in the document;
a user model data generation unit that generates user model data for the user by mapping the weight to the keyword; and
a search result personalization unit that receives a query from a terminal of the user and sorts search results for the query using the user model data.

The apparatus of claim 13,
wherein the weight calculation unit comprises
a weight correction unit that corrects the weight in consideration of a distribution of the keyword and the weight.

The apparatus of claim 13,
wherein the search result personalization unit comprises:
a search result keyword extraction unit that extracts a search result keyword from each document belonging to the search results;
a document importance calculation unit that compares the search result keyword with the keyword and calculates an importance of each document based on the weight mapped to the keyword; and
a search result sorting unit that sorts the search results for the query using the importance of each document.

The apparatus of claim 13,
wherein the search result personalization unit comprises
a user model data updating unit that updates the user model data by feeding back the user's response to the sorted search results.

A personalized search result providing apparatus comprising:
a network interface;
one or more processors;
a memory into which a computer program executed by the processors is loaded; and
a storage that stores document data, user model data, and algorithm combination data,
wherein the computer program comprises:
a keyword extraction operation that extracts a keyword from a document viewed by a user among the document data;
a weight calculation operation that calculates a weight in consideration of a frequency of the keyword in the document;
a user model data generation operation that generates the user model data for the user by mapping the weight to the keyword;
an algorithm combination learning operation that forms a cluster by clustering a plurality of users based on the user model data, and generates the algorithm combination data by machine learning the responses of users belonging to the cluster to a keyword extraction algorithm, a weight calculation algorithm, a weight correction algorithm, and an importance calculation algorithm; and
a search result personalization operation that receives a query from a terminal of the user and sorts search results for the query using the user model data.