
KR20190005494A - System for recommending search keywords based on search behavior pattern

Info

Publication number: KR20190005494A
Application number: KR1020170086223A
Authority: KR
Inventors: 김선욱
Original assignee: 김선욱
Priority date: 2017-07-07
Filing date: 2017-07-07
Publication date: 2019-01-16

Abstract

The present invention provides a system for recommending a search keyword based on a search behavior pattern which can select a search pattern from keywords within a short period. According to an embodiment of the present invention, the system for recommending a search keyword based on a search behavior pattern comprises: a step of calculating vector values of a plurality of words existing in an internet network; a step of using the distance between a vector of a search word inputted from a user terminal for each word and a vector of the words to calculate a distance score; a step of using log data to calculate one or more feature values for the plurality of words; a step of using the one or more feature values for each word to calculate a real-time issue score; and a step of supplying a recommendation search word including at least one word among a plurality of words to the user terminal in accordance with the distance score and real-time issue score. The step of calculating a real-time issue score includes: a step of summing the one or more feature values for each word to calculate a feature score; a step of grouping a plurality of words with an identical search intention among the plurality of words into a single word group to generate at least one word group; and a step of comparing the feature score of the at least one word group (sum of the feature scores of the plurality of words included in the word group) and feature scores of a plurality of words which are not grouped to calculate the real-time issue score.

Description

System for recommending search keywords based on search behavior pattern

The present invention relates to a search word recommendation system based on a search behavior pattern, and more particularly, to a search word recommendation system based on a search behavior pattern that is built by extracting behavior patterns from search data collected from a plurality of users.

The basic method of search word recommendation is to show candidate queries that contain the user's input as a substring, ordered by frequency. Because this method does not consider the user's context at all, good performance is difficult to expect.

RECQ (Real-World Context Aware Querying) is a study that proposes a method for expanding queries in consideration of the context of mobile search. RECQ proposes selecting recommendation candidates in descending order of weight, where the weight is the ratio of the number of pages returned when the current user's location name is entered together with a query recommendation candidate to the number of documents in the search engine's entire corpus when the candidate is entered into the Google search engine.

Existing recommended search word services analyze search queries entered by users and recommend highly relevant search words when a user enters a search word or when search results are provided. Conventional services have used methods such as collaborative filtering to provide recommended search words. However, for search words with sparse, large-scale data, there are few or no overlapping sets with similar patterns, so accuracy suffers.

Korean Laid-Open Patent Publication No. 10-2012-0094562, published Aug. 27, 2012.

An object of the present invention is to provide a search word recommendation system based on a real-time search behavior pattern that processes keyword data searched in real time and can select search patterns from keywords within a short time.

Another object of the present invention is to provide a recommended search word providing system that can provide recommended search words simply and accurately even for search words with large-scale data.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

To solve the above problems, a search word recommendation system based on a search behavior pattern according to an embodiment of the present invention includes: receiving log data including a plurality of search words input from a plurality of user terminals; calculating at least one feature value for each search word using the received log data; calculating a score for each search word by summing the at least one feature value; generating at least one search word cluster by clustering, among the plurality of search words, search words having the same search intent into a single search word cluster; and determining search word rankings by comparing the score of the at least one search word cluster (the sum of the scores of the search words included in that cluster) with the scores of the search words that are not clustered.

In some embodiments, the step of determining search word rankings assigns rankings in descending order of score. For a search word cluster, the ranking is assigned to the search word with the highest score among the search words in that cluster, and the score of the cluster may then be recalculated as the sum of the scores of the remaining search words in the cluster.

In some embodiments, the step of generating the at least one search word cluster may cluster a plurality of search words into a single search word cluster when the same web documents appear in the search results of those search words at or above a predetermined ratio.

In some embodiments, the step of generating the at least one search word cluster may cluster a plurality of search words into a single search word cluster when the characters of those search words are identical at or above a predetermined ratio.

Other specific details of the present invention are included in the detailed description and the drawings.

By generating search word clusters, the search word recommendation system based on a search behavior pattern according to an embodiment of the present invention can accurately analyze the user's search intent. Because it also considers various feature values related to query volume and user behavior, it can raise the accuracy of real-time issue search words and minimize dependence on the system operator.

Because recommended search words are selected according to the distance score between the vector of the search word and the vector of a word, and according to the real-time issue score of that word, recommended search words can be provided simply and accurately even for search words with sparse, large-scale data.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a block diagram illustrating an environment in which a search word recommendation system based on a search behavior pattern according to an embodiment of the present invention is provided.
FIG. 2 is a block diagram illustrating a search word recommendation system based on a search behavior pattern according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only to make the disclosure of the present invention complete and to fully inform those skilled in the art to which the present invention belongs of the scope of the invention, and the present invention is defined only by the scope of the claims.

The terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural form unless the phrase specifically states otherwise. As used in the specification, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components other than those mentioned. Like reference numerals refer to like components throughout the specification, and "and/or" includes each of the mentioned components and every combination of one or more of them. Although "first", "second", and the like are used to describe various components, these components are of course not limited by these terms. These terms are used only to distinguish one component from another. Therefore, a first component mentioned below may also be a second component within the technical idea of the present invention.

Unless otherwise defined, all terms used herein (including technical and scientific terms) are used with the meaning commonly understood by those skilled in the art to which the present invention belongs. In addition, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are explicitly and specifically defined.

Spatially relative terms such as "below", "beneath", "lower", "above", and "upper" may be used to easily describe the relationship between one component and other components as shown in the drawings. Spatially relative terms should be understood as encompassing the different orientations of components in use or operation, in addition to the orientation shown in the drawings. For example, when a component shown in the drawings is turned over, a component described as "below" or "beneath" another component may be placed "above" that other component. Thus, the exemplary term "below" can include both the downward and upward directions. A component can also be oriented in other directions, and spatially relative terms can accordingly be interpreted according to orientation.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Referring to FIG. 1, a user terminal 110, a search server 120, and a real-time issue search word selection system 130 are connected to one another through a network. The user terminal 110, the search server 120, and the real-time issue search word selection system 130 can exchange data and/or information with one another.

The network may consist of networks of various sizes, such as a local area network (LAN), a metropolitan area network (MAN), or a wide area network (WAN). The network may be a wired or wireless network.

The user terminal 110 may be a personal computer (PC) such as a desktop or laptop. Alternatively, the user terminal 110 may be a portable electronic device such as a smartphone, a personal digital assistant (PDA), or a tablet PC. The user terminal 110 may also be another computing device, not illustrated here, that includes a processor, input/output means, and communication means.

The search server 120 provides search results to the user terminal 110 in response to a search query received from the user terminal 110. The search results may include content such as web documents, images, music, videos, and files. The search server 120 can select content according to the search word (or keyword) and the search conditions included in the search query. The search server 120 can provide search results in which the content is listed in the order best suited to the search word. For example, the search server 120 can provide search results ranked by the similarity between the search word and data such as web documents. Alternatively, the search server 120 may determine the ranking according to the freshness of the data, the intrinsic quality of the data, the user's search log, and the like.

In addition to providing a search engine service, the search server 120 may be a portal site server that provides various content services such as cafes (online communities), mail, blogs, shopping, maps, dictionaries, news, stocks, real estate, movies, music, and bulletin boards. That is, the search server 120 may be a combination of a search engine and a portal site.

Although FIG. 1 shows only one search server 120, the present invention is not limited thereto, and a plurality of search servers 120 may be connected to the user terminal 110 and the real-time issue search word selection system 130 through the network.

Although FIG. 1 shows the search server 120 and the real-time issue search word selection system 130 separately, depending on the embodiment the search server 120 may be provided in a form combined with the real-time issue search word selection system 130.

The real-time issue search word selection system 130 receives log data from the search server 120 and analyzes the received log data to select real-time issue search words. A real-time issue search word is a search word that has become an issue at the current point in time and whose query volume is rising sharply. The real-time issue search word selection system 130 transmits real-time issue search word information to the search server 120 so that the search server 120 displays the real-time issue search words in a web document.

In an embodiment of the present invention, the real-time issue search word selection system 130 can minimize dependence on the system operator, respond to attacks by fraudulent users, and process data in real time to select real-time issue search words within a short time.

Referring to FIG. 2, the search word selection server 130 that constitutes the search word recommendation system based on a search behavior pattern according to an embodiment of the present invention includes a log data receiving unit 131, a filtering unit 132, a feature value calculating unit 133, a score calculating unit 134, a cluster generating unit 135, and a ranking unit 136.

The log data receiving unit 131 receives log data from the search server 120. Log data is data that records users' search behavior.

The log data may include a plurality of search words input from a plurality of user terminals 110. Here, a search word may be defined based on the words entered in the search box of the search engine. That is, a search word may include one or more words.

The log data may also include user feedback data on the search results for the plurality of search words. The user feedback data may include data related to user behavior, such as whether the user selected (or clicked) content such as a web document provided in the search results, how long the user stayed on the search result screen, and whether the user issued a re-query containing another search word with the same search intent.

The log data may also record the times associated with the plurality of search words and with the user feedback on the search results.

The filtering unit 132 receives the log data from the log data receiving unit 131 and filters the received log data.

In some embodiments, the filtering unit 132 may remove values caused by a user's duplicated actions from the received log data during filtering. For example, when the same user repeatedly enters a query for the same search word during a predetermined unit time, or when the same user repeatedly clicks the same content (such as a web document) provided in the search results, the filtering unit 132 may count each such action only once.
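The duplicate-removal rule described above can be sketched as follows. This is a minimal illustration only, not the implementation of the filtering unit 132: the `LogEvent` fields, the function name, and the use of fixed 15-second buckets to stand in for "a predetermined unit time" are all assumptions made for the example.

```python
from typing import Iterable, NamedTuple


class LogEvent(NamedTuple):
    # Hypothetical log record layout; the source does not define one.
    user_id: str
    action: str       # "query" or "click"
    target: str       # search word or clicked document id
    timestamp: float  # seconds


def count_deduplicated(events: Iterable[LogEvent], unit_time: float = 15.0) -> dict:
    """Count each (action, target) at most once per user per unit-time bucket."""
    seen = set()
    counts = {}
    for e in events:
        bucket = int(e.timestamp // unit_time)  # assumed bucketing scheme
        key = (e.user_id, e.action, e.target, bucket)
        if key in seen:
            continue  # repeated action by the same user: ignore
        seen.add(key)
        counts[(e.action, e.target)] = counts.get((e.action, e.target), 0) + 1
    return counts
```

With this sketch, a user who submits the same query twice within one bucket contributes one count, while the same query from a different user, or from the same user in a later bucket, is counted again.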

In some other embodiments, the filtering unit 132 may remove values caused by abuse from the received log data during filtering. For example, when only queries are entered during a predetermined unit time and no web document provided in the search results is clicked, when the number of clicks on content such as a web document provided in the search results falls in an abnormal range, or when the interval between queries or clicks is constant so that the activity is presumed to come from a bot, the filtering unit 132 may judge the activity to be abuse and not count it.
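The three abuse signals named above (queries without clicks, an abnormal click count, and a constant interval) could be checked along the following lines. This is a hedged sketch: the function name and every threshold (`max_clicks`, `min_events`, `jitter_tol`) are hypothetical values chosen for illustration, and the source does not specify how "abnormal" or "constant" is measured.

```python
from statistics import pstdev


def looks_like_abuse(query_times, click_times,
                     max_clicks=50, min_events=5, jitter_tol=0.05):
    """Return True if one user's activity in a unit time matches an abuse signal.

    query_times / click_times: timestamps (seconds) of that user's queries and
    clicks for one search word. All thresholds are illustrative assumptions.
    """
    # Signal 1: queries were entered but no search result was clicked.
    if query_times and not click_times:
        return True
    # Signal 2: the click count is outside the assumed normal range.
    if len(click_times) > max_clicks:
        return True
    # Signal 3: near-constant spacing between queries suggests a bot.
    times = sorted(query_times)
    if len(times) >= min_events:
        gaps = [b - a for a, b in zip(times, times[1:])]
        if pstdev(gaps) <= jitter_tol:
            return True
    return False
```

Events from users flagged this way would simply be left out of the counts that feed the feature values.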

By filtering the log data in this way, the system can respond to attacks by fraudulent users who try to maliciously expose a specific search word among the real-time issue search words. The filtering process described above is exemplary, and those skilled in the art will clearly understand that other filtering methods well known in the art, not illustrated here, may be used to filter the log data.

The feature value calculating unit 133 receives the filtered log data from the filtering unit 132 and uses the received log data to calculate at least one feature value for each search word.

The feature value calculating unit 133 may use statistics of the log data over a predetermined unit time to calculate the at least one feature value. The predetermined unit time can be set by the system operator and can be set to a length suitable for processing data in real time and within a short time. For example, the predetermined unit time may be 15 seconds, but it may be varied depending on the embodiment, and the present invention is not limited thereto.

In an embodiment of the present invention, the feature value calculating unit 133 may use a sliding window corresponding to the unit time to determine the time range of the log data. Specifically, instead of determining the time range by conventional time division (for example, using the log data of the 0 to 15 second, 16 to 30 second, and 31 to 45 second intervals), the feature value calculating unit 133 determines the time range with a sliding window (for example, using the log data of the 0 to 15 second, 1 to 16 second, and 2 to 17 second intervals). Real-time issue search words can therefore be processed and updated in real time (every second, or every unit of seconds or minutes).
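The sliding-window intervals in the example (0 to 15 s, 1 to 16 s, 2 to 17 s) can be illustrated with a short sketch. The function name, the closed-interval convention, and the `horizon` parameter are assumptions for the example; the source only states that a window equal to the unit time is advanced in small steps.

```python
from collections import deque


def windowed_counts(timestamps, unit_time=15, step=1, horizon=None):
    """Return (start, end, count) for each window [end - unit_time, end].

    The window advances by `step` seconds, so consecutive windows overlap,
    unlike fixed time division where each event falls in exactly one interval.
    """
    ts = sorted(timestamps)
    if horizon is None:
        horizon = ts[-1] if ts else 0
    window = deque()
    out = []
    i = 0
    end = unit_time
    while end <= max(horizon, unit_time):
        # Admit events that have occurred up to the window's right edge.
        while i < len(ts) and ts[i] <= end:
            window.append(ts[i])
            i += 1
        # Evict events that fell off the window's left edge.
        while window and window[0] < end - unit_time:
            window.popleft()
        out.append((end - unit_time, end, len(window)))
        end += step
    return out
```

Because each step only adds the newest events and evicts the oldest, the statistic can be refreshed every second without rescanning the whole window.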

The query volume by moving average may represent the average query volume that occurred during a predetermined time (for example, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, or 3 hours). The input frequency and the persistence of queries for a search word can be analyzed from the query volume by moving average. The amount of exposure in news and other media may indicate how much news related to the search word is shown in the news section of the search results. The amount of real-time community exposure may indicate how many posts (or tweets) related to the search word are shown in a real-time community such as Twitter. The click volume of web documents may indicate the number of clicks on content such as web documents provided in the search results. The per-session activity amount and activity time may indicate how long the user stays on the search result screen. The amount of edited search result screen exposure accounts for cases where the search server 120 provides edited search results, as with search words such as "weather" or "stocks", because search words for which such edited search results are provided cannot be regarded as real-time issue search words. The exposure amount and click volume of collections such as sites and shortcuts may indicate, respectively, how often the search word appears in collections such as the site section and the shortcut section of the search results, and the number of clicks. When the user clicks the shortcut section of the search results and moves to a given site, the search word cannot be regarded as a real-time issue search word. However, even if the search results include a shortcut section, when the user clicks another section (for example, news, blogs, bulletin boards, or cafe posts) instead of the shortcut section, the search word may still be selected as a real-time issue search word.
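As a small illustration of the moving-average query volume, the sketch below averages a per-second query count series over several look-back horizons. The function name and the per-second input format are assumptions; the source lists the horizons (1 minute, 5 minutes, and so on) but not the data layout.

```python
def moving_average_query_volume(per_second_counts, horizons=(60, 300, 900)):
    """Average queries per second over the most recent `h` seconds.

    per_second_counts: query counts for one search word, oldest first
    (an assumed input format). Returns {horizon_seconds: average}.
    """
    result = {}
    for h in horizons:
        recent = per_second_counts[-h:]
        # Average over the samples actually available for this horizon.
        result[h] = sum(recent) / len(recent) if recent else 0.0
    return result
```

Comparing a short horizon with a long one gives a rough view of the two properties mentioned above: a short-horizon average well above the long-horizon average suggests a recent surge in input frequency, while similar values suggest sustained interest.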

Because the system considers various feature values related to user behavior together, rather than query volume alone, the accuracy of real-time issue search words can be increased. As a result, the system operator does not need to select real-time issue search words directly, and dependence on the system operator can be minimized.

The score calculating unit 134 receives the feature values for each search word from the feature value calculating unit 133 and calculates a score for each search word by summing its feature values. The score calculated for each search word can be used to determine the search word rankings, as described below. For example, the score calculating unit 134 may sum the feature values using a linear regression model, but the present invention is not limited thereto.
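A linear model of the kind mentioned above reduces to a weighted sum of the feature values. The sketch below assumes hypothetical feature names and weights; the source does not disclose the actual features' encoding, the weights, or how the regression is fitted.

```python
def score_search_word(features, weights, bias=0.0):
    """Score one search word as bias + sum(weight * feature value).

    features / weights: dicts keyed by feature name. The names and values
    used with this function are illustrative assumptions only.
    """
    return bias + sum(weights.get(name, 0.0) * value
                      for name, value in features.items())


# Illustrative call: a negative weight models a feature that argues against
# treating the word as a real-time issue (for example, shortcut clicks).
example_score = score_search_word(
    {"query_volume": 10.0, "news_exposure": 4.0, "shortcut_clicks": 2.0},
    {"query_volume": 1.0, "news_exposure": 0.5, "shortcut_clicks": -2.0},
)
```

In practice the weights would be learned from labeled data rather than set by hand; the hand-set values here only show the arithmetic.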

The cluster generating unit 135 receives the score calculated for each search word from the score calculating unit 134.

The cluster generating unit 135 generates at least one search word cluster by clustering, among the plurality of search words, search words having the same search intent into a single search word cluster. Search intent may represent what the user wants to obtain through the search query (or the user's purpose, idea, and so on). For example, "MyPeople" and "Daum MyPeople" have the same search intent and can therefore be clustered into a single search word cluster.

In some embodiments, the cluster generating unit 135 may cluster a plurality of search words into a single search word cluster when the same web documents appear in the search results of those search words at or above a predetermined ratio. For example, if the same news articles, blog posts, cafe posts, and the like appear in the search results for queries with different search words, those search words can be clustered.
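One way to realize clustering by shared search results is sketched below. It is an assumption-laden illustration: the overlap ratio is measured with the Jaccard index and the 0.5 threshold is arbitrary, whereas the source only says "a predetermined ratio". Search words are merged transitively with a small union-find structure.

```python
def cluster_by_shared_documents(results, min_ratio=0.5):
    """Group search words whose result sets overlap by at least `min_ratio`.

    results: {search_word: iterable of document ids} (assumed input format).
    Overlap is measured with the Jaccard index, which is an assumption.
    """
    words = list(results)
    parent = {w: w for w in words}

    def find(w):
        # Find the cluster representative, compressing the path as we go.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            docs_a, docs_b = set(results[a]), set(results[b])
            union = docs_a | docs_b
            if union and len(docs_a & docs_b) / len(union) >= min_ratio:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for w in words:
        clusters.setdefault(find(w), []).append(w)
    return sorted(sorted(c) for c in clusters.values())
```

The pairwise comparison is quadratic in the number of search words, which is acceptable for a sketch but would need candidate pruning at production scale.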

In some other embodiments, the cluster generating unit 135 may cluster a plurality of search words into a single search word cluster when the characters of those search words are identical at or above a predetermined ratio. For example, the cluster generating unit 135 may cluster a plurality of search words into a single search word cluster when the edit distance between them is at or below a predetermined value. In this case, search words that were treated as different data because of typos or errors in transcribing loanwords can be clustered together. Search words formed by adding one or more words to a given search word can also be clustered together with that search word.
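The edit-distance criterion can be illustrated with the standard Levenshtein distance. The threshold of 2 and the helper name `same_cluster` are assumptions for the example; the source only refers to "a predetermined value".

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def same_cluster(a: str, b: str, max_distance: int = 2) -> bool:
    """Treat two search words as one cluster if they differ by few edits."""
    return edit_distance(a, b) <= max_distance
```

A small fixed threshold captures typos, but it would not by itself group a search word with a longer variant that adds whole words; that case would need a different test, such as a ratio relative to length or a containment check.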

The score of a search word cluster formed in this way can be expressed as the sum of the scores of the search words included in that cluster.

Generating search word clusters makes it possible to analyze the user's search intent accurately, which increases the accuracy of real-time issue search words. As a result, the system operator does not need to select real-time issue search words directly, and dependence on the system operator can be minimized. Those skilled in the art will clearly understand that other methods not illustrated here may be used to generate search word clusters.

The ranking calculation unit 136 receives the score calculated for each search word and the search word cluster information from the cluster generation unit 135, and calculates search word rankings by comparing the score of at least one search word cluster with the scores of the non-clustered search words.

The ranking calculation unit 136 may assign search word ranks in descending order of score. When assigning a rank to a search word cluster, the ranking calculation unit 136 may assign that rank to the search word with the highest score among the search words included in the cluster. The ranking calculation unit 136 may then recalculate the score of the cluster as the sum of the scores of the remaining search words in the cluster.

Since the first search word Q1 has the highest score, 100, the ranking calculation unit 136 may assign first place to the first search word Q1.

Next, since the score 99 of the search word cluster containing the second search word Q2 and the third search word Q3 is higher than the score 77 of the fourth search word Q4, and the score 66 of the third search word Q3 is higher than the score 33 of the second search word Q2, the ranking calculation unit 136 may assign the next rank, second place, to the third search word Q3.

Next, because a rank has been assigned to the third search word Q3 and the only remaining search word in that cluster is the second search word Q2, the second search word Q2 is ranked according to its own score of 33. That is, since the score 77 of the fourth search word Q4 is higher than the score 33 of the second search word Q2, the ranking calculation unit 136 may assign the next rank, third place, to the fourth search word Q4 and the following rank, fourth place, to the second search word Q2.

As a result, because the rank of the highest-scoring search word in a cluster is determined using the sum of the scores of the search words in that cluster, the third search word Q3 may be ranked higher than the fourth search word Q4 even though its score is lower.
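The ranking procedure walked through above can be sketched in a few lines. This is an illustrative reading of the description rather than the actual implementation; the function name `rank_queries` and the data layout (a score dictionary plus a list of clusters) are assumptions, and ties are broken arbitrarily by input order.

```python
def rank_queries(scores, clusters):
    """Return search words in ranked order.

    scores:   dict mapping each search word to its score
    clusters: list of lists, each holding search words with the same intent
    """
    clustered = {q for c in clusters for q in c}
    # Non-clustered search words are treated as single-member groups.
    groups = [list(c) for c in clusters]
    groups += [[q] for q in scores if q not in clustered]

    ranking = []
    while groups:
        # The group with the highest summed score gets the next rank ...
        best = max(groups, key=lambda g: sum(scores[q] for q in g))
        # ... and the rank goes to its highest-scoring member.
        top = max(best, key=lambda q: scores[q])
        ranking.append(top)
        # Removing the member re-scores the group as the sum of the rest.
        best.remove(top)
        groups = [g for g in groups if g]
    return ranking


example = rank_queries(
    {"Q1": 100, "Q2": 33, "Q3": 66, "Q4": 77},
    [["Q2", "Q3"]],
)
# example == ["Q1", "Q3", "Q4", "Q2"]
```

With the scores from the example, the cluster {Q2, Q3} competes with a combined score of 99, so Q3 takes second place ahead of Q4; once Q3 is removed, Q2 competes alone with 33 and ends up last.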

The ranking calculation unit 136 may calculate the search word rankings by the method described above and select the search words up to a predetermined rank (for example, 10th place) as real-time issue search words.

The real-time issue search words may be displayed on a web document of the search server 120. The real-time issue search words are listed in order of their calculated rank and, depending on the embodiment, the increase in score may be displayed to the right of each.

The steps of a method or algorithm described in connection with the embodiments of the present invention may be implemented directly in hardware, in a software module executed by hardware, or in a combination of the two. The software module may reside in random access memory (RAM), read only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable recording medium well known in the art to which the present invention pertains.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the invention may be practiced in other specific forms without changing its technical idea or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

Claims

A system for recommending search words based on a search behavior pattern, comprising:
a distance score calculation module that calculates vector values of a plurality of words existing on the Internet and calculates, for each word, a distance score using the distance between a vector of a search word input from a user terminal and the vector of the word;
a real-time issue score calculation module that calculates, using log data, at least one feature value related to user behavior for each of the plurality of words, and calculates a real-time issue score using the at least one feature value for each word; and
a recommended search word providing unit that provides the user terminal with a recommended search word including at least one word among the plurality of words according to the distance score and the real-time issue score,
wherein the real-time issue score calculation module comprises:
a feature score calculation unit that calculates a feature score by summing the at least one feature value for each word; and
a cluster generation unit that generates at least one word cluster by grouping a plurality of words having the same search intent among the plurality of words into a single word cluster,
and wherein the real-time issue score calculation module calculates the real-time issue score by comparing the feature score of the at least one word cluster, which is the sum of the feature scores of the plurality of words included in the word cluster, with the feature scores of the plurality of non-clustered words.

The system according to claim 1,
wherein, when a plurality of search words are input from the user terminal, the distance score calculation module calculates the center coordinates of the vectors of the plurality of search words and calculates the distance between the center of the vectors of the plurality of search words and the vector of the word.

The system according to claim 1,
wherein, when a plurality of search words are input from the user terminal, the distance score calculation module calculates the distance between the vector of each of the plurality of search words and the vector of the word, and calculates the distance score by summing the calculated distances.

The system according to claim 1,
wherein, when a plurality of search words are input from the user terminal, the distance score calculation module calculates the distance between the vector of each of the plurality of search words and the vector of the word, and calculates the distance score using the minimum value or the maximum value among the calculated distances.

The system according to claim 1,
wherein the recommended search word providing unit calculates a recommendation score by assigning a weight to each of the distance score and the real-time issue score, and selects a recommended search word including at least one word among the plurality of words according to the recommendation score.