JP6261669B2 - Query calibration system and method

Info

Publication number: JP6261669B2
Application number: JP2016134985A
Authority: JP
Inventors: 泰壹金; グァンヒョンキム; デヌンソン
Original assignee: Naver Corp
Current assignee: Naver Corp
Priority date: 2015-09-14
Filing date: 2016-07-07
Publication date: 2018-01-17
Anticipated expiration: 2036-07-07
Also published as: KR101839121B1; KR20170032084A; JP2017059216A

Description

The present invention relates to a query proofreading system and method that provide a proofreading result for a query entered by a user, and more particularly to a query proofreading system and method that provide a proofreading result for a user's query based on a translation model.

A user can search for desired information through a site such as a search engine. The user enters a query into the search engine's query input window on the user's terminal and obtains the desired information by checking the search results that are returned.

However, when a user enters such a query on a terminal, a query containing typographical errors that differ from what the user intended may be entered because of a mistyped key and/or a wrong selection of the Korean/English toggle key. In particular, when the user's terminal has a touch screen, erroneous touch input can make a query containing typographical errors even more likely.

When a query containing typographical errors is entered into a search engine, the search results no longer include the information the user intended, which degrades search quality.

Therefore, there is a need for a method that converts an entered query into the intended, correctly spelled query and provides it to the user, so that the user can obtain the originally intended information as search results even when the entered query contains typographical errors.

Patent Document 1 (published January 25, 2011) discloses a system and method that proofread a user query judged, based on statistical data, to be a typographical-error query, either on a whole-query basis or on a word basis.

The information described above is provided only to aid understanding; it may include content that does not form part of the prior art, and it may not include what the prior art would suggest to a person of ordinary skill in the art.

Patent Document 1: Korean Patent Application Publication No. 10-2011-0007743

One embodiment provides a user query proofreading system and method that calculate, based on a translation model, the probability of converting a user query into each proofreading candidate and extract the optimal candidate, thereby providing the optimal proofreading candidate as the proofreading result without any limit on the edit distance between the user's query and the candidate.

In one aspect, a query proofreading system is provided that includes: a proofreading information extraction unit that extracts information on at least one proofreading candidate for an input query based on log information of search results for the query; a parameter acquisition unit that acquires, based on the extracted information, at least one parameter relating to conversion of the query into the proofreading candidate; and a proofreading result generation unit that calculates, based on the acquired parameter, a probability associated with conversion of the query into each of the at least one proofreading candidate and extracts, based on the calculated probability, at least one of the proofreading candidates as a proofreading result for the query.

The proofreading information extraction unit may use the log information to determine whether the query contains a typographical error.

When the query contains a typographical error, the proofreading information extraction unit may identify at least one proofreading candidate for the query.

The log information may include at least one of: the time from when a user enters a first query until the user enters a second query; the user's click information for the search results; information on the similarity between the first query and the second query; and an attribute of the search results.

The attribute of the search results may be a category of the content included in the search results.

The parameter acquisition unit may acquire, as a parameter, at least one of: a probability that an element of the query is converted into an element of each proofreading candidate; a numerical value indicating the positional relationship of an element of the query to an element of each proofreading candidate; and a probability indicating how natural the conversion of the query into each proofreading candidate is.

The elements of the query may be the syllables of the query, and the elements of each proofreading candidate may be the syllables of that candidate.

In extracting a proofreading candidate from the at least one proofreading candidate as the proofreading result, the proofreading result generation unit may assume that the order in which the elements of the query are arranged is the same as the order in which the corresponding elements of the at least one proofreading candidate are arranged.

The at least one proofreading candidate extracted as the proofreading result may be included in the search results for the user's entry of the query.

A plurality of parameters may be acquired.

The proofreading result generation unit may calculate the probability based on the product of the acquired parameters or the sum of their logarithms.

There may be a plurality of proofreading candidates.

Based on the distribution of the probabilities calculated for the plurality of proofreading candidates, the proofreading result generation unit may exclude at least one of the plurality of proofreading candidates from extraction as the proofreading result for the query.

The probability may be calculated by a formula.

The formula may be:

[Formula: equation image not reproduced in this text]

Here, l is the length of each proofreading candidate, m is the length of the query, j is the index of an element of the proofreading candidate, and i is the index of an element of the query.

TR is a function giving the probability that the i-th element of the query is converted into the j-th element of each proofreading candidate.

AL is a function giving the numerical value that indicates the positional relationship of an element of the query to an element of each proofreading candidate.

P_LM is a function giving the probability that indicates how natural the conversion of the query into each proofreading candidate is.

At least one of the proofreading information extraction unit and the parameter acquisition unit may be implemented as a distributed processing system.

The information on the proofreading candidates may include typo-correction pairs, each consisting of the query and one proofreading candidate.

The probability may be the conditional probability that each proofreading candidate occurs given that the query is entered.

In another aspect, a query proofreading method is provided that includes: extracting information on at least one proofreading candidate for an input query based on log information of search results for the query; acquiring, based on the extracted information, at least one parameter relating to conversion of the query into the proofreading candidate; calculating, based on the acquired parameter, a probability associated with conversion of the query into each of the at least one proofreading candidate; and extracting, based on the calculated probability, at least one of the proofreading candidates as a proofreading result for the query.

Extracting the information on the proofreading candidates may include using the log information to determine whether the query contains a typographical error.

Extracting the information on the proofreading candidates may include identifying at least one proofreading candidate for the query when the query contains a typographical error.

Acquiring the parameter may include acquiring, as a parameter, at least one of: a probability that an element of the query is converted into an element of each proofreading candidate; a numerical value indicating the positional relationship of an element of the query to an element of each proofreading candidate; and a probability indicating how natural the conversion of the query into each proofreading candidate is.

The query proofreading method may further include outputting the at least one proofreading candidate extracted as the proofreading result as a search result for the user's entry of the query.

Extracting the at least one proofreading candidate may include excluding, based on the distribution of the probabilities calculated for a plurality of proofreading candidates, at least one of the plurality of proofreading candidates from extraction as the proofreading result for the query.

By calculating, based on a translation model, the probability of converting a user query into each proofreading candidate and extracting the optimal candidate, the optimal proofreading candidate matching the user's intention can be provided as the proofreading result for the user's query, without any limit on the edit distance between the query and the candidate.

By calculating, based on a translation model, the probability of converting a user query into each proofreading candidate and extracting the optimal candidate, proofreading coverage can be improved without compromising the accuracy of proofreading for user queries.

A diagram illustrating a method of operating a user query proofreading system according to an embodiment.
A diagram illustrating a user query proofreading system according to an embodiment.
A diagram conceptually illustrating a method of acquiring parameters relating to conversion of a user query into proofreading candidates according to an embodiment.
Pseudo code illustrating a method of acquiring a translation probability and an alignment probability as parameters relating to conversion of a user query into proofreading candidates according to an embodiment.
Pseudo code illustrating a method of acquiring a translation probability and an alignment probability as parameters relating to conversion of a user query into proofreading candidates according to an embodiment.
A diagram illustrating the performance of the parameter acquisition method using the algorithm of FIG. 4A and FIG. 4B.
A diagram illustrating a method of acquiring an LM parameter as a parameter relating to conversion of a user query into proofreading candidates according to an embodiment.
A conceptual diagram illustrating a method of extracting a proofreading candidate as a proofreading result for a user query according to an embodiment.
Pseudo code illustrating a method of extracting a proofreading candidate as a proofreading result for a user query according to an embodiment.
A flowchart illustrating a method of operating a user query proofreading system according to an embodiment.
A flowchart illustrating a method of extracting information on at least one proofreading candidate for a user query according to an embodiment.
A flowchart illustrating a method of extracting a proofreading candidate as a proofreading result by excluding unnecessary candidates from the proofreading candidates for a user query according to an embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating a method of operating a user query proofreading system according to an embodiment.

The user query proofreading system 100 shown in the figure (hereinafter, the query proofreading system) processes a user query (hereinafter, a query) entered from a user's terminal and provides an appropriate proofreading result for the query. For example, the user can enter a query through a PC or a mobile terminal (for example, a mobile phone, a smartphone, a tablet, or a PDA). The query proofreading system 100 determines whether the query entered by the user contains a typographical error and, if so, can provide a proofreading candidate as the proofreading result for that query.

A user's query entered through the terminal may be a keyword or character string entered into a search engine or the like to request an operation such as searching for and/or looking up content or obtaining information. A query may consist of at least one element, and each element of the query may be a word or a syllable.

Based on log information of the search results for queries that users entered in the past, the query proofreading system 100 can determine whether an entered query is correctly spelled, identify the proofreading candidate(s) that could serve as the proofreading result for that query, and provide the optimal candidate among them as the proofreading result.

The proofreading candidate chosen as the proofreading result may be provided as a search result for the user query, either together with the user query or separately from it. Alternatively, the proofreading candidate chosen as the proofreading result may be included in the search results for the user query. Furthermore, the search results for the proofreading candidate chosen as the proofreading result may be provided as the search results for the user query.

The proofreading result may be provided in real time in response to the user's query input.

The method of identifying proofreading candidate(s) for an input query and providing a proofreading result is described in more detail with reference to FIGS. 2 to 7.

FIG. 2 is a diagram illustrating a user query proofreading system according to an embodiment. The query proofreading system 100 described above is explained in more detail with reference to FIG. 2. The query proofreading system 100 may include a processor 210. The processor 210 may be configured to execute the programs required to process a user's query and provide a proofreading result for the query, and to process related operations. The processor 210 may include a proofreading information extraction unit 220, a proofreading result generation unit 230, and a parameter acquisition unit 240. Each of the proofreading information extraction unit 220, the proofreading result generation unit 230, and the parameter acquisition unit 240 may be implemented as a separate hardware component within the processor 210 (or, unlike what is shown in the figure, outside the processor 210). Although a single processor 210 is shown in FIG. 2, there may be a plurality of processors, and the processor 210 may refer to at least one core within a processor. In other words, at least some of the proofreading information extraction unit 220, the proofreading result generation unit 230, and the parameter acquisition unit 240 may be implemented in a processor other than the processor 210 or in a different hardware component.

Alternatively, the proofreading information extraction unit 220, the proofreading result generation unit 230, and the parameter acquisition unit 240 may be components that represent functions executed by the processor 210. In other words, each of the proofreading information extraction unit 220, the proofreading result generation unit 230, and the parameter acquisition unit 240 may be configured as a software module.

The query proofreading system 100 may further include a communication unit 250. The communication unit 250 can transmit data and information to, and receive them from, an external server or other terminals. For example, the communication unit 250 may be configured to receive a query from a user's terminal, acquire log information of search results, and output the user query and the proofreading result for the user query.

The proofreading information extraction unit 220 can extract information on at least one proofreading candidate for a query entered from the user's terminal, based on log information of the search results for previously entered queries. The proofreading information extraction unit 220 may correspond to an SVM-based typo-correction candidate detector (SVM-based Errata-Correct Candidate Detector).

The proofreading information extraction unit 220 can use the log information to determine whether the query contains a typographical error and, when it does, identify at least one proofreading candidate for the query.

The information on the proofreading candidates may include typo-correction pairs, each consisting of the query and one proofreading candidate. In other words, each identified pair of a query and a proofreading candidate may be a typo-correction pair, and the identified typo-correction pairs may be stored and managed in a database (not shown).

The log information may be log information on the query or queries entered by one or more users during a predetermined period and on the search results for those queries. The log information may include at least one of: the time from when a user enters a first query until the user enters the next, second query; the user's click information for the search results for a query; information on the similarity between the first query and the second query; and an attribute of the search results.

When the log information includes the time from when a user enters a first query until the user enters the next, second query, the proofreading information extraction unit 220 may determine that the first query contains (or is likely to contain) a typographical error if, for example, the second query is entered within a predetermined time after the first query. If the second query is entered after the predetermined time has elapsed, the proofreading information extraction unit 220 may treat the first query and the second query as separate queries.

When the log information includes the user's click information for the search results for a query, the proofreading information extraction unit 220 may determine that the query does not contain (or is unlikely to contain) a typographical error if, for example, the user clicked on the search results for the query.

The information on the similarity between the first query and the second query included in the log information may be the edit distance between the first query and the second query. The edit distance may be the Levenshtein distance. When a user enters a first query and then a second query, and the edit distance between the two is less than or equal to a predetermined value, the proofreading information extraction unit 220 may determine that the first query contains (or is likely to contain) a typographical error. Conversely, when the edit distance between the first query and the second query exceeds the predetermined value, the proofreading information extraction unit 220 may treat the first query and the second query as separate queries.
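The edit-distance check described above can be illustrated with a minimal Python sketch. The threshold value and the function names are hypothetical and are not taken from the patent; the extra requirement that the distance be greater than zero (so that an identical re-entry is not counted) is also an assumption of this sketch.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


MAX_EDIT_DISTANCE = 2  # hypothetical "predetermined value"


def looks_like_retyped_query(first_query: str, second_query: str,
                             max_distance: int = MAX_EDIT_DISTANCE) -> bool:
    """True if the second query is close enough to be a retyped first query.

    Assumption of this sketch: a distance of 0 (identical queries) is ignored.
    """
    return 0 < levenshtein(first_query, second_query) <= max_distance
```

For example, `looks_like_retyped_query("gogle", "google")` holds because the two strings differ by a single insertion, while two unrelated queries exceed the threshold and would be treated as separate queries.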

The attribute of the search results included in the log information may be the type or category of the content included in the search results. For example, the category may be any one of web document, music, image, blog, news, and person information. When the content included in the search results for a query consists only of web documents, the proofreading information extraction unit 220 may determine that the query contains (or is likely to contain) a typographical error.

By taking the log information into account when identifying proofreading candidates for a query, as in the embodiments described above, it is possible to determine more accurately whether the query contains a typographical error and to identify more accurate proofreading candidate(s) for the query.
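The four log signals described above (re-query time, clicks, edit distance, and result category) can be combined in many ways. The text mentions an SVM-based detector; the sketch below is not that detector but a much simpler hand-written vote count, shown only to make the signals concrete. All field names, thresholds, and the category label "web" are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryLogEntry:
    """One consecutive (first query, second query) observation from a search log."""
    first_query: str
    second_query: str
    seconds_between: float              # time from first query to second query
    clicked_first_results: bool         # did the user click a result of the first query?
    edit_distance: int                  # e.g. Levenshtein distance between the queries
    first_result_categories: frozenset  # content categories in the first query's results


MAX_SECONDS = 10.0       # hypothetical "predetermined time"
MAX_EDIT_DISTANCE = 2    # hypothetical "predetermined value"
MIN_VOTES = 3            # hypothetical number of signals required


def typo_signals(entry: QueryLogEntry) -> dict:
    """Evaluate each log signal that suggests the first query contains a typo."""
    return {
        "quick_requery": entry.seconds_between <= MAX_SECONDS,
        "no_click": not entry.clicked_first_results,
        "small_edit_distance": entry.edit_distance <= MAX_EDIT_DISTANCE,
        "web_documents_only": entry.first_result_categories == frozenset({"web"}),
    }


def extract_pair(entry: QueryLogEntry):
    """Return a (typo, correction) pair if enough signals agree, else None."""
    votes = sum(typo_signals(entry).values())
    if votes >= MIN_VOTES:
        return (entry.first_query, entry.second_query)
    return None
```

A learned classifier would replace the fixed vote threshold with weights fitted to labeled log data; the feature dictionary returned by `typo_signals` is the kind of input such a classifier could consume.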

The proofreading information extraction unit 220 may be implemented using a distributed processing system, for example a Hadoop-based distributed processing system. Building on a distributed processing system allows the proofreading information extraction unit 220 to process very large amounts of data at high speed.

The parameter acquisition unit 240 can acquire at least one parameter relating to conversion of the input query into a proofreading candidate, based on the information extracted by the proofreading information extraction unit 220. The parameter may be used to calculate the probability associated with conversion of the query into each proofreading candidate.

The parameter acquisition unit 240 may acquire, as a parameter, at least one of: a probability that an element of the query is converted into an element of each proofreading candidate; a numerical value indicating the positional relationship of an element of the query to an element of each proofreading candidate; and a probability indicating how natural the conversion of the query into each proofreading candidate is. The elements of the query may be the syllables of the query, and the elements of each proofreading candidate may be the syllables of that candidate.

The parameter acquisition unit 240 may acquire the parameters needed to calculate the probability associated with conversion of the query into each proofreading candidate based on, for example, an algorithm that uses the IBM (registered trademark) Model 2 technique, and may include a probability and alignment parameter acquisition unit 242 and a language model parameter acquisition unit 244.

The probability and alignment parameter acquisition unit 242 may correspond, for example, to an IBM Model 2 parameter EM learner (IBM MODEL 2 Parameter EM Learner) that uses the expectation-maximization (EM) algorithm to calculate the probability that an element of the query is converted into an element of each proofreading candidate and the numerical value indicating the positional relationship of an element of the query to an element of each proofreading candidate.
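As a rough illustration of EM training for IBM Model 2 style parameters, the following Python sketch estimates an element-conversion table and an alignment table from typo-correction pairs. It is a textbook-style simplification under stated assumptions, not the patent's pseudo code: there is no NULL element, elements are single characters of a Python string (standing in for syllables), every query is non-empty, and the names `train_ibm_model2`, `tr`, and `al` are hypothetical.

```python
from collections import defaultdict


def train_ibm_model2(pairs, iterations=10):
    """EM estimation of conversion and alignment tables.

    pairs: list of (query, candidate), each a non-empty sequence of elements.
    Returns:
      tr[(c, q)]       ~ P(candidate element c | query element q)
      al[(i, j, l, m)] ~ P(query position i | candidate position j, lengths l, m)
    where m is the query length and l is the candidate length.
    """
    cand_vocab = {c for _, cand in pairs for c in cand}
    uniform = 1.0 / len(cand_vocab)
    tr = defaultdict(lambda: uniform)      # uniform initial conversion table
    al = {}
    for query, cand in pairs:              # uniform initial alignment table
        m, l = len(query), len(cand)
        for j in range(l):
            for i in range(m):
                al[(i, j, l, m)] = 1.0 / m

    for _ in range(iterations):
        t_count, t_total = defaultdict(float), defaultdict(float)
        a_count, a_total = defaultdict(float), defaultdict(float)
        # E-step: collect expected counts from posterior alignment weights.
        for query, cand in pairs:
            m, l = len(query), len(cand)
            for j, c in enumerate(cand):
                norm = sum(tr[(c, q)] * al[(i, j, l, m)]
                           for i, q in enumerate(query))
                for i, q in enumerate(query):
                    delta = tr[(c, q)] * al[(i, j, l, m)] / norm
                    t_count[(c, q)] += delta
                    t_total[q] += delta
                    a_count[(i, j, l, m)] += delta
                    a_total[(j, l, m)] += delta
        # M-step: renormalize expected counts into probabilities.
        tr = defaultdict(float, {(c, q): n / t_total[q]
                                 for (c, q), n in t_count.items()})
        al = {(i, j, l, m): n / a_total[(j, l, m)]
              for (i, j, l, m), n in a_count.items()}
    return tr, al
```

With training pairs such as `("ab", "ab")`, `("b", "b")`, and `("a", "a")`, the unambiguous single-element pairs pull probability mass toward the identity conversions, so `tr[("b", "b")]` ends up larger than `tr[("a", "b")]`.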

The language model parameter acquisition unit 244 may be, for example, an LM parameter learner (LM Parameter Learner) that calculates a language model (LM) parameter as the probability indicating how natural the conversion of the query into each proofreading candidate is. The probability and alignment parameter acquisition unit 242 and the language model parameter acquisition unit 244 may be implemented using a distributed processing system, for example a Hadoop-based distributed processing system. Building on a distributed processing system allows the parameter acquisition unit 240 to process very large amounts of data at high speed.
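One simple way to obtain a naturalness probability of this kind is an n-gram language model over the elements of correctly spelled queries. The sketch below is an assumption-laden example, not the patent's learner: the text does not state the n-gram order or the smoothing method, so a bigram model with add-one smoothing is used here, and the names `train_bigram_lm` and `log_p_lm` are hypothetical.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary markers (assumed, not from the source)


def train_bigram_lm(corpus):
    """Count element bigrams over a corpus of correctly spelled queries."""
    unigrams, bigrams = Counter(), Counter()
    vocab = {EOS}
    for text in corpus:
        symbols = [BOS, *text, EOS]
        vocab.update(text)
        for prev, cur in zip(symbols, symbols[1:]):
            unigrams[prev] += 1          # count of prev as a bigram history
            bigrams[(prev, cur)] += 1
    return unigrams, bigrams, len(vocab)


def log_p_lm(text, model):
    """Log-probability of a candidate under the add-one smoothed bigram model."""
    unigrams, bigrams, vocab_size = model
    symbols = [BOS, *text, EOS]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
        for prev, cur in zip(symbols, symbols[1:])
    )
```

Trained on a handful of correct queries, such a model assigns a higher log-probability to a candidate whose element sequences were seen in training than to a scrambled variant of the same length.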

確率および整列パラメータ取得部２４２および言語モデルパラメータ取得部２４４の詳しい動作については、図３乃至図６を参照しながらさらに詳しく説明する。 Detailed operations of the probability and alignment parameter acquisition unit 242 and the language model parameter acquisition unit 244 will be described in more detail with reference to FIGS.

Based on the parameters acquired by the parameter acquisition unit 240, the calibration result generation unit 230 can calculate the probability associated with converting the query into each of at least one calibration candidate. For example, the calibration result generation unit 230 may calculate this probability as the product of the acquired parameters or, equivalently, as the sum of their logarithms. The probability associated with converting the query into a calibration candidate may be the conditional probability that the calibration candidate occurs given that the query was input. The calibration result generation unit 230 may use formula (1) below to calculate the probability that the query is converted into each of the at least one calibration candidate.

Here, l is the length of each calibration candidate, m is the length of the query, j is the index of an element of the calibration candidate, and i is the index of an element of the query. TR is a function giving the probability that the i-th element of the query is converted into the j-th element of the calibration candidate. For example, TR may be a function for calculating the band probability (that is, the translation probability of IBM Model 2). AL is a function giving a numerical value that indicates the positional relationship of the elements of the query to the elements of each calibration candidate. For example, AL may be a function for calculating the alignment probability. P_LM is a function giving the probability that indicates how natural the conversion of the query into each calibration candidate is. For example, P_LM may be a function for calculating the LM parameter.
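As a minimal sketch of the "product or log addition" idea, the Python below sums the logarithms of TR, AL, and P_LM for one query/candidate pair. It is not formula (1) itself, which is not reproduced in this text. It assumes equal-length, position-by-position pairing (so j equals i), and the names `log_score`, `toy_tr`, `toy_al`, and `toy_p_lm` and all numeric values are hypothetical.

```python
import math
from typing import Callable, Sequence


def log_score(query: Sequence[str],
              candidate: Sequence[str],
              tr: Callable[[str, str], float],
              al: Callable[[int, int, int, int], float],
              p_lm: Callable[[Sequence[str]], float]) -> float:
    """Log-domain score: log P_LM + sum of log TR and log AL per position.

    Assumption for this sketch: query and candidate have the same length
    and are paired position by position (j == i).
    """
    if len(query) != len(candidate):
        raise ValueError("this sketch only handles equal-length pairs")
    l, m = len(candidate), len(query)
    total = math.log(p_lm(candidate))
    for pos, (q_syl, c_syl) in enumerate(zip(query, candidate), start=1):
        total += math.log(tr(q_syl, c_syl))    # TR(query element | candidate element)
        total += math.log(al(pos, pos, l, m))  # AL(j | i, l, m) with j == i
    return total


# Toy parameters, invented purely for illustration.
def toy_tr(q_syl: str, c_syl: str) -> float:
    return 0.9 if q_syl == c_syl else 0.1


def toy_al(j: int, i: int, l: int, m: int) -> float:
    return 1.0 if j == i else 0.0


def toy_p_lm(candidate: Sequence[str]) -> float:
    return 0.5


example = log_score(list("ab"), list("ac"), toy_tr, toy_al, toy_p_lm)
```

Working in the log domain avoids numerical underflow when many small probabilities are multiplied, which is presumably why log addition is offered as an alternative to the plain product.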

Based on the probability calculated for each calibration candidate, the calibration result generation unit 230 can extract at least one of the calibration candidates as the calibration result for the query. The extracted calibration candidate is the intended (optimal) calibration candidate and may be the correctly spelled query corresponding to the user query. When extracting a calibration candidate as the calibration result, the calibration result generation unit 230 may assume that the elements of the query are arranged in the same order as the corresponding elements of the at least one calibration candidate. In other words, a monotonic alignment may be assumed between the query and the calibration candidate.

In addition, based on the distribution of the probabilities calculated for a plurality of calibration candidates, the calibration result generation unit 230 may exclude at least one of those calibration candidates from the extraction of the calibration result for the query.

The calibration result generation unit 230 may correspond to, for example, an IBM Model 2 decoder (IBM Model 2 Decoder) that generates the calibration result for a query based on an algorithm that uses the IBM Model 2 technique.

The detailed operation of the calibration result generation unit 230 is described further with reference to FIGS. 7 and 8.

The description of the technical features given above with reference to FIG. 1 applies equally to FIG. 2, so redundant description is omitted.

FIG. 3 conceptually illustrates a method of acquiring parameters related to the conversion of a user query into a calibration candidate, according to one embodiment.

FIG. 3 conceptually illustrates the method, described above with reference to FIG. 2, of calculating the numerical value that indicates the positional relationship of the elements of the query to the elements of each calibration candidate. In FIG. 3, the query input by the user is assumed to be "New Balance Shinchiyun" (「ニューバランスシンチユン」) and the calibration candidate is assumed to be "New Balance Shinchon" (「ニューバランスシンチョン」). An "element" is a syllable of the query or of the calibration candidate.

The numerical value indicating the positional relationship of the elements of the query to the elements of each calibration candidate may mean the probability that each element of the query corresponds to the position of each element of the calibration candidate, and may mean the value attached to each arrow in FIG. 3. In other words, the value attached to each arrow may be the value (alignment probability) determined by the AL function described above.

As shown in the figure, the probability and alignment parameter acquisition unit 242 may calculate this numerical value by running an iterative process that uses the EM algorithm on a distributed system.

The description of the technical features given above with reference to FIGS. 1 and 2 applies equally to FIG. 3, so redundant description is omitted.

FIGS. 4A and 4B show pseudocode for a method of acquiring the band probability and the alignment probability as parameters related to the conversion of a user query into a calibration candidate, according to one embodiment. The algorithm shown in the figures may be executed by the probability and alignment parameter acquisition unit 242.

The probability and alignment parameter acquisition unit 242 may consist of a mapper and a reducer.

The operation of the mapper may be expressed by the following algorithm (see FIG. 4A).

Here, for a score calculated using syllable trigrams, the output esti_prob(k, i, j) may be used to calculate TR(Errata_i | Correction_j). For a score calculated using the Len function, the output esti_prob(k, i, j) may be used to calculate AL(j | i, l, m). The esti_prob value may be the output of the reducer at step t-1. The initial esti_prob value may be 1/Len(input). The calculation based on the above algorithm may be carried out by having each Hadoop node of the distributed system load data that is stored as a perfect hash using disk memory-mapped I/O.

As in the output data shown in FIG. 4, the reducer may distinguish the factors needed to calculate AL and TR as "C" and "M", respectively (1st key). Within the output data, the information that must form the denominator (syllable, index, string length) may be separated out, and whether a record belongs to the denominator or the numerator may be marked with "0" or "1" (2nd key, 3rd key). The information that forms the numerator (syllable, index) may be separated out in the same way. The last field of each output record may hold the score.

The result of summing the mapper outputs after they have been sorted may be expressed by the following algorithm, which shows the operation of the reducer (see FIG. 4B).

TR (band probability) and AL (alignment probability), which correspond to the parameters described above with reference to FIGS. 2 and 3, may be calculated by this algorithm. The calculated esti_prob(k, i, j) may be used to update the score value at step t+1, and may be stored and managed on disk or in a database in a perfect hash structure.
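The iterative estimation of TR and AL can be illustrated with a single-machine sketch of textbook IBM Model 2 EM at the syllable level. This is not the mapper/reducer pseudocode of FIGS. 4A and 4B, which is not reproduced in this text. The function name `em_ibm2`, the uniform initialization, and the toy pairs are assumptions made for illustration.

```python
from collections import defaultdict


def em_ibm2(pairs, iterations=10):
    """Estimate TR(e | c) and AL(j | i, l, m) from (errata, correction) pairs.

    pairs: list of (errata_syllables, correction_syllables) tuples.
    Returns two dicts: tr[(e, c)] and al[(j, i, l, m)].
    """
    tr = defaultdict(float)
    al = defaultdict(float)
    # Uniform initialization (an assumption of this sketch).
    for errata, corr in pairs:
        l, m = len(corr), len(errata)
        for i, e in enumerate(errata):
            for j, c in enumerate(corr):
                tr[(e, c)] = 1.0
                al[(j, i, l, m)] = 1.0 / l

    for _ in range(iterations):
        tr_num, tr_den = defaultdict(float), defaultdict(float)
        al_num, al_den = defaultdict(float), defaultdict(float)
        # E-step: distribute each errata syllable over candidate positions.
        for errata, corr in pairs:
            l, m = len(corr), len(errata)
            for i, e in enumerate(errata):
                weights = [tr[(e, c)] * al[(j, i, l, m)]
                           for j, c in enumerate(corr)]
                z = sum(weights)
                if z == 0.0:
                    continue
                for j, c in enumerate(corr):
                    delta = weights[j] / z
                    tr_num[(e, c)] += delta      # numerator for TR
                    tr_den[c] += delta           # denominator for TR
                    al_num[(j, i, l, m)] += delta
                    al_den[(i, l, m)] += delta
        # M-step: divide numerators by their denominators.
        for (e, c), value in tr_num.items():
            tr[(e, c)] = value / tr_den[c]
        for (j, i, l, m), value in al_num.items():
            al[(j, i, l, m)] = value / al_den[(i, l, m)]
    return dict(tr), dict(al)


# Toy training pairs, invented for illustration.
toy_pairs = [
    (list("abd"), list("abc")),
    (list("ab"), list("ab")),
    (list("bc"), list("bc")),
    (list("ac"), list("ac")),
]
tr_table, al_table = em_ibm2(toy_pairs)
```

The numerator/denominator split in the E-step mirrors the separation of numerator and denominator records described for the reducer keys. In a distributed run, the accumulation would be spread across mappers and the division done in reducers.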

The description of the technical features given above with reference to FIGS. 1 to 3 applies here as well, so redundant description is omitted.

FIG. 5 shows the performance of the parameter acquisition method based on the algorithms of FIGS. 4A and 4B.

In FIG. 5, 24 mappers and 20 reducers were used on a total of 15 nodes (2.2 GHz, 48 GB RAM). The input consisted of about 37 million misspelling/correct-spelling pairs, and about 257 million model parameters were obtained as output.

The results show that each EM step took 11 to 14 minutes (a total of 9 to 10 iterations were run).

FIG. 6 shows a method of acquiring the LM parameter as a parameter related to the conversion of a user query into a calibration candidate, according to one embodiment. The algorithm shown in FIG. 6 is a method of calculating the probability (LM parameter), described above with reference to FIG. 2, that indicates how natural the conversion of the query into each calibration candidate is. The algorithm shown in the figure may be executed by the language model parameter acquisition unit 244. The LM parameter expresses, as a probability, the context and/or degree of naturalness required for the query to be converted into a calibration candidate.

The language model parameter acquisition unit 244 may acquire the LM parameter by executing a smoothing process and a linear interpolation process. However, the smoothing process may be executed only when the number of misspelling/correct-spelling pairs to be processed is extremely large (for example, one billion or more).

The language model parameter acquisition unit 244 may consist of a mapper and a reducer. In acquiring the LM parameter, the denominator local sum may be computed, for example, by generating the denominator information as a First-Priority-Key in the reducer key.

In acquiring the LM parameter, statistical learning in the distributed environment may be performed using Spark.

In the algorithm shown in the figure, the input queries and their frequencies are assumed to be 「あかさたなはまや」 with frequency 99 and 「あかさたなはまが」 with frequency 1. The estimated 10-gram probability may be calculated as the LM parameter. The 10-gram probability may be estimated by interpolation after the 8-gram and 9-gram probabilities have been calculated.

In stage 1, the mapper may extract the 8- to 10-grams and the reducer may compute the frequency totals. Stage 1 may be, for example, a preprocessing pass for calculating P(d | abc) for the string "abcd". The LM parameter calculation may use a key structure of the form "min(N)gram", "N-1gram", "Ngram" (N = 8 to 10), which makes it easy to compute the frequency totals with the denominator first.
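The counting in stage 1 can be sketched on a single machine as follows. This is an illustrative sketch only, not the mapper/reducer of FIG. 6. The helper names `count_ngrams` and `conditional_prob` are hypothetical, and the example uses 7- and 8-grams (rather than 8- to 10-grams) so that the two 8-syllable example queries need no padding.

```python
from collections import Counter


def count_ngrams(query_freqs, n_min, n_max):
    """Frequency-weighted counts of all character n-grams, n_min <= n <= n_max."""
    counts = Counter()
    for query, freq in query_freqs.items():
        for n in range(n_min, n_max + 1):
            for start in range(len(query) - n + 1):
                counts[query[start:start + n]] += freq
    return counts


def conditional_prob(counts, ngram):
    """P(last syllable | preceding syllables) = count(ngram) / count(prefix)."""
    prefix = ngram[:-1]
    if counts[prefix] == 0:
        return 0.0
    return counts[ngram] / counts[prefix]


# Queries and frequencies taken from the example in the text.
query_freqs = {"あかさたなはまや": 99, "あかさたなはまが": 1}
counts = count_ngrams(query_freqs, 7, 8)
p_ya = conditional_prob(counts, "あかさたなはまや")
p_ga = conditional_prob(counts, "あかさたなはまが")
```

Here the shared 7-syllable prefix is counted 100 times, so the two continuations receive probabilities in proportion to their frequencies of 99 and 1.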

In stage 2, using the result of stage 1 (map = 'cat' & sort), the reducer may calculate the probability value of each of the 8- to 10-grams. Here, the denominator for the 8-grams may be set to sum(cnt_of_all(8gram)).

In stage 3, the 10-gram probability may be linearly interpolated using the 8- to 10-gram probability values calculated in stage 2. The mapper may generate the same keys as in stage 1 and reverse the character order of the first key (1st key), so that the 8- to 10-grams are sorted and grouped at the reducer by their "final" syllable. The reducer may generate the final LM parameter by performing linear interpolation. For example, for the string 「あいうえ」, the LM parameter P_LM(た｜さかあ) may be calculated as the sum of a × P_LM(た｜さかあ), b × P_LM(た｜さか), and c × P_LM(た｜さ). Here, the weights a, b, and c may be calculated based on the method proposed in "A Statistical Part-of-Speech Tagger" (T. Brants, 2000). The unseen probability may be calculated as 1/sum(cnt_of_all(10gram) + cnt_of_dic(10gram)).
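The linear interpolation in stage 3 can be sketched as a weighted sum of the probabilities from the longest to the shortest context. The function names and the numeric weights below are hypothetical. The sketch does not implement the weight-estimation method cited in the text and simply takes a, b, and c as given.

```python
def interpolate(p_long, p_mid, p_short, weights):
    """Return a * p_long + b * p_mid + c * p_short for weights (a, b, c)."""
    a, b, c = weights
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1")
    return a * p_long + b * p_mid + c * p_short


def unseen_prob(cnt_of_all, cnt_of_dic):
    """Fallback for unseen n-grams: 1 / (cnt_of_all + cnt_of_dic)."""
    return 1.0 / (cnt_of_all + cnt_of_dic)


# Illustrative values only; none of these numbers come from the source.
p_interp = interpolate(0.5, 0.4, 0.2, weights=(0.5, 0.3, 0.2))
p_unseen = unseen_prob(99, 1)
```

Interpolating with shorter contexts keeps the estimate usable when the longest n-gram is rare, at the cost of some sharpness when it is frequent.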

Table 1 below shows the performance of the LM parameter acquisition method based on the algorithm shown in the figure. For the results in Table 1, 24 mappers and 20 reducers were used on a total of 15 nodes (2.2 GHz, 48 GB RAM).

The description of the technical features given above with reference to FIGS. 1 to 5 applies here as well, so redundant description is omitted.

FIG. 7 is a conceptual diagram of a method of extracting a calibration candidate as the calibration result for a user query, according to one embodiment. FIG. 7 shows how the calibration result generation unit 230 extracts the optimal calibration candidate for a query, where the unit corresponds to, for example, an IBM Model 2 decoder that generates the calibration result based on an algorithm that uses the IBM Model 2 technique.

The optimal calibration candidate may be determined by formula (2) below.

In other words, among the plurality of calibration candidates for the query, the calibration candidate with the highest probability calculated by formula (1) above may be determined and extracted as the optimal calibration candidate.
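The selection rule stated in the preceding sentence can be written compactly as an argmax. Formula (2) itself is not reproduced in this text, so the notation below is an assumption: Q denotes the query, \mathcal{C}(Q) the set of calibration candidates for Q, and P(C \mid Q) the probability obtained from formula (1).

```latex
\hat{C} = \operatorname*{arg\,max}_{C \in \mathcal{C}(Q)} P(C \mid Q)
```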

A monotonic alignment may be assumed between the query and the calibration candidate. The optimal calibration candidate may be determined using a dynamic algorithm. In addition, based on the distribution of the probabilities calculated for a plurality of calibration candidates, at least one of those candidates may be excluded from the extraction of the calibration result for the query. For example, calibration candidates that correspond to intermediate probability values appearing as point values in the probability distribution may be excluded from the extraction process as unnecessary candidates. Through this process, the optimal calibration candidate can be determined at high speed even when the number of calibration candidates is large.

In stage 1, the calibration result generation unit 230 may use the calculated TR and AL parameters for the possible calibration candidates of the query to generate the information used to determine the optimal calibration candidate.

In stage 2, the calibration result generation unit 230 may determine the optimal calibration candidate by decoding the query syllable by syllable. In stage 2, unnecessary calibration candidates may be excluded based on, for example, a score calculated from at least one of the band probability, the alignment probability, and the LM parameter. In the example shown in the figure, 「ニューバランスシンチョン」 ("New Balance Shinchon") is determined as the optimal calibration candidate for the query 「ニューバランスシンチユン」 ("New Balance Shinchiyun").

The calibration result generation unit 230 may be implemented with a throughput of 1,000 to 1,500 TPS per core. Stages 1 and 2 are described further with reference to the algorithm of FIG. 8. The description of the technical features given above with reference to FIGS. 1 to 6 applies here as well, so redundant description is omitted.

FIG. 8 shows pseudocode for a method of extracting a calibration candidate as the calibration result for a user query, according to one embodiment. FIG. 8 describes in more detail stages 1 and 2 for extracting the optimal calibration candidate, which were introduced above with reference to FIG. 7.

In stage 1, the calibration result generation unit 230 may use the calculated TR and AL parameters for the possible calibration candidates of the query to generate the information used to determine the optimal calibration candidate. The query may be tokenized into syllables, and all possible AL and TR parameters for the query may be extracted based on a finite state transducer. The extracted AL and TR parameters may be stored (for example, in a database or in memory) and mapped to the syllable index of the query.
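Stage 1 can be pictured as building a per-syllable lattice of replacement options. The sketch below uses a plain dictionary lookup in place of the finite state transducer mentioned in the text, treats each character as one syllable, and keeps an unknown syllable unchanged with probability 1.0. All of these are simplifying assumptions, and `build_lattice` and `toy_tr_table` are hypothetical names.

```python
def build_lattice(query, tr_table):
    """Map each syllable index of the query to [(candidate_syllable, TR), ...].

    tr_table: {query_syllable: {candidate_syllable: TR value}}.
    Syllables missing from tr_table are kept as-is with TR = 1.0.
    """
    lattice = {}
    for index, syllable in enumerate(query):
        options = tr_table.get(syllable) or {syllable: 1.0}
        # Highest-probability option first.
        lattice[index] = sorted(options.items(),
                                key=lambda item: item[1],
                                reverse=True)
    return lattice


# Toy TR values, invented for illustration.
toy_tr_table = {"ユ": {"ョ": 0.7, "ユ": 0.3}}
toy_lattice = build_lattice("シンチユン", toy_tr_table)
```

Indexing the options by syllable position is what lets the later decoding stage walk the query from left to right under the monotonic alignment assumption.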

In stage 2, the calibration result generation unit 230 may determine the optimal calibration candidate from among the calibration candidates. The output of stage 1 (the input query and its AL and TR parameters) may be the input of stage 2. A score for each calibration candidate of the input may be calculated through the following algorithm, and one calibration candidate may be determined as the optimal calibration candidate based on the scores.

Unnecessary calibration candidates may be excluded according to the score calculated from the band probability, the alignment probability, and the LM parameter, and the optimal calibration candidate may be determined from the remaining candidates according to the calculated score. The determined optimal calibration candidate may correspond to the calibration candidate extracted as the calibration result described above, and may be provided as a search result for the user's query.
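The syllable-by-syllable decoding with pruning can be sketched as a small beam search. This is not the algorithm of FIG. 8, which is not reproduced in this text. It assumes a one-to-one monotonic alignment (so the AL term is constant and omitted), a bigram-style LM function, and a fixed beam width as the pruning rule. The lattice and LM values are invented for illustration.

```python
import math


def decode(lattice, lm_prob, beam_width=3):
    """Return (best_string, log_score) by left-to-right beam search.

    lattice: {index: [(candidate_syllable, TR), ...]}
    lm_prob: function (prefix_tuple, next_syllable) -> probability
    """
    beams = [((), 0.0)]
    for index in sorted(lattice):
        expanded = []
        for prefix, score in beams:
            for syllable, tr in lattice[index]:
                step = math.log(tr) + math.log(lm_prob(prefix, syllable))
                expanded.append((prefix + (syllable,), score + step))
        # Pruning: keep only the best-scoring partial candidates.
        expanded.sort(key=lambda hyp: hyp[1], reverse=True)
        beams = expanded[:beam_width]
    best_prefix, best_score = beams[0]
    return "".join(best_prefix), best_score


def toy_lm(prefix, syllable):
    """Toy LM: strongly prefers ョ over ユ after チ."""
    previous = prefix[-1] if prefix else ""
    if (previous, syllable) == ("チ", "ョ"):
        return 0.9
    if (previous, syllable) == ("チ", "ユ"):
        return 0.1
    return 0.5


demo_lattice = {
    0: [("シ", 1.0)],
    1: [("ン", 1.0)],
    2: [("チ", 1.0)],
    3: [("ユ", 0.6), ("ョ", 0.4)],
    4: [("ン", 1.0)],
}
best, best_score = decode(demo_lattice, toy_lm)
```

In this toy setup the TR values slightly favor keeping ユ, but the LM term outweighs them, so the decoder returns the corrected string. This shows how the three parameter types can jointly determine the result.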

The description of the technical features given above with reference to FIGS. 1 to 7 applies here as well, so redundant description is omitted.

FIG. 9 is a flowchart of a method of operating the user query calibration system, according to one embodiment.

In step 910, the calibration information extraction unit 220 can extract information on at least one calibration candidate for the user query, based on log information of search results for previously input queries. The extracted information on the calibration candidates may include misspelling/correct-spelling pairs, each consisting of a query and a calibration candidate.

In step 920, the parameter acquisition unit 240 can acquire at least one parameter related to the conversion of the query into a calibration candidate, based on the information extracted in step 910. The band probability parameter and the alignment probability parameter for the conversion may be calculated and acquired by the probability and alignment parameter acquisition unit 242 of the parameter acquisition unit 240. The LM parameter for the conversion may be calculated and acquired by the language model parameter acquisition unit 244 of the parameter acquisition unit 240.

In step 930, the calibration result generation unit 230 can calculate the probability associated with converting the query into each of the calibration candidates, based on the parameters acquired in step 920. For example, the calibration result generation unit 230 may perform this calculation using formula (1) above.

In step 940, the calibration result generation unit 230 can extract at least one of the calibration candidates as the calibration result for the query, based on the probabilities calculated in step 930. For example, the calibration result generation unit 230 may use formula (2) above to extract one of the calibration candidates as the calibration result.

In step 950, the communication unit 250 of the query calibration system 100 can output the calibration candidate extracted as the calibration result in step 940 as a search result for the user's query input. The search result for the user's query input may also include the extracted calibration result.

The description of the technical features given above with reference to FIGS. 1 to 8 applies here as well, so redundant description is omitted.

FIG. 10 is a flowchart of a method of extracting information on at least one calibration candidate for a user query, according to one embodiment.

Steps 1010 and 1020 described below may be included in step 910 described above with reference to FIG. 9.

In step 1010, the calibration information extraction unit 220 can use the log information of the search results to determine whether the query contains a typographical error. If the query contains no typographical error, the query is determined to be a correctly spelled query, so no separate calibration result needs to be provided.

If the query contains a typographical error, in step 1020 the calibration information extraction unit 220 can identify at least one calibration candidate for the query. For example, by identifying the calibration candidates, the calibration information extraction unit 220 may identify at least one misspelling/correct-spelling pair (query and calibration candidate) as the information on the calibration candidates.

The description of the technical features given above with reference to FIGS. 1 to 9 applies here as well, so redundant description is omitted.

FIG. 11 is a flowchart of a method of extracting a calibration candidate as the calibration result by excluding unnecessary candidates from the calibration candidates for a user query, according to one embodiment.

Steps 1110 and 1120 described below may be included in step 940 described above with reference to FIG. 9.

In step 1110, based on the distribution of the probabilities calculated in step 930 for a plurality of calibration candidates, the calibration result generation unit 230 can exclude at least one of those candidates from the process of extracting the calibration result for the query.

In step 1120, the calibration result generation unit 230 can extract at least one calibration candidate as the calibration result for the query from among the calibration candidates that remain after the unnecessary candidates were excluded in step 1110.
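Steps 1110 and 1120 can be illustrated with a simple two-stage selection. The text does not specify the exclusion criterion beyond "the distribution of the probabilities", so the relative threshold used below (drop anything below a fixed fraction of the best probability) is purely an assumption. The names `prune_candidates` and `extract_result`, the candidate labels, and the values are hypothetical.

```python
def prune_candidates(scored, ratio=0.01):
    """Step 1110 (sketch): drop candidates far below the best probability."""
    if not scored:
        return {}
    best = max(scored.values())
    return {cand: p for cand, p in scored.items() if p >= ratio * best}


def extract_result(scored, ratio=0.01):
    """Step 1120 (sketch): pick the most probable remaining candidate."""
    kept = prune_candidates(scored, ratio)
    if not kept:
        return None
    return max(kept, key=kept.get)


# Illustrative probabilities only.
toy_scores = {"candidate A": 0.6, "candidate B": 0.3, "candidate C": 0.001}
kept = prune_candidates(toy_scores)
result = extract_result(toy_scores)
```

Pruning before the final selection matters mainly when scoring is incremental, as in syllable-by-syllable decoding, because discarded candidates no longer need to be extended.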

The description of the technical features given above with reference to FIGS. 1 to 10 applies here as well, so redundant description is omitted.

The effects and technical improvements of the present embodiment compared with the prior art are described below.

In evaluating the effects and technical improvements, the data used for training IBM Model 2 are denoted IBM_SET-7 and IBM_SET-21. IBM_SET-7 is misspelling/calibration-candidate data for 7 days of queries, and IBM_SET-21 is misspelling/calibration-candidate data for 21 days of queries. The data used for training the LM parameter are denoted LM_SET-21_Q5 and LM_SET-2015_Q30. LM_SET-21_Q5 is data for queries with a cumulative frequency of 5 or more among 21 days of queries, and LM_SET-2015_Q30 is data for queries with a cumulative frequency of 30 or more among all queries up to 2015. As the test set, 21,219 misspelling-to-correct-spelling entries were built by random sampling of unseen data. The prior art used for comparison with the embodiment is assumed to use the latest training data.

Tables 2 and 3 below show whether coverage and quality improve when the IBM Model 2 training data and the LM training data are increased. In Table 2, LM_SET-2015_Q30 was used throughout. In Table 3, IBM_SET-21 was used throughout.

表２および表３の場合すべてにおいて、カバレージの増加および正確度の増加が確認された。下記の表４は、従来技術（ＡＳ−ＩＳ）および実施形態（ＴＯ−ＢＥ）のＳＥＥＮＴＥＳＴの結果を示している。従来技術のシステムおよび実施形態のシステムのモデリングパワー（性能）を比較するために、学習に既に使用されていた（ＳＥＥＮ）データが評価のために使用された。 In all cases of Tables 2 and 3, increased coverage and increased accuracy were observed. Table 4 below shows the results of SEEN TEST of the prior art (AS-IS) and the embodiment (TO-BE). In order to compare the modeling power (performance) of the prior art system and the system of the embodiment, the data already used for learning (SEEN) was used for the evaluation.

前記のように、カバレージおよび正確度において、実施形態の場合が従来技術よりもさらに優れることが確認された。 As described above, it was confirmed that the embodiment is further superior to the prior art in terms of coverage and accuracy.

Table 5 below shows the UNSEEN TEST results for the prior art (AS-IS) and the embodiment (TO-BE). To compare the modeling power (performance) of the prior-art system and the system of the embodiment, data not present in the training data (UNSEEN) was used for the evaluation.

As shown above, the embodiment increased coverage by about 222% relative to the prior art, while accuracy decreased by only 2%.
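The text does not define coverage and accuracy. One common reading, assumed here, is that coverage is the fraction of test queries for which the system proposes any correction, and accuracy is the fraction of proposed corrections that match the expected spelling. The sketch below computes both under that assumption; the function names, the corrector interface (returning `None` when no correction is proposed), and the toy data are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


def coverage_and_accuracy(
    pairs: Sequence[Tuple[str, str]],
    correct: Callable[[str], Optional[str]],
) -> Tuple[float, float]:
    """Return (coverage, accuracy) over (typo, expected spelling) pairs."""
    attempted = 0  # queries for which a correction was proposed
    right = 0      # proposed corrections that match the expected spelling
    for typo, expected in pairs:
        proposed = correct(typo)
        if proposed is None:
            continue
        attempted += 1
        if proposed == expected:
            right += 1
    coverage = attempted / len(pairs) if pairs else 0.0
    accuracy = right / attempted if attempted else 0.0
    return coverage, accuracy


def relative_change(before: float, after: float) -> float:
    """Relative change of a metric, e.g. 2.0 means a 200% increase."""
    return (after - before) / before


# Hypothetical toy test set and a dictionary-backed corrector.
toy_pairs = [("gogle", "google"), ("nver", "naver"), ("yaho", "yahoo"), ("bnig", "bing")]
toy_table = {"gogle": "google", "nver": "navel"}
cov, acc = coverage_and_accuracy(toy_pairs, toy_table.get)
```

With these definitions, a system can raise coverage substantially by attempting more corrections, and the accuracy figure shows how much precision that costs.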

Tables 6 and 7 below compare how often the prior art and the embodiment recognize a query that contains no typographical error as a correctly spelled query. In Table 6, evaluation data 1 consists of 5,000 shopping-domain QC top entries regarded as correctly spelled. In Table 7, evaluation data 2 consists of 17,040 UNSEEN & low QC map restaurant-name entries regarded as correctly spelled.

As shown above, the embodiment recognized correctly spelled queries at a higher rate than the prior art.

Table 8 below compares the prior art (AS-IS) and the embodiment (TO-BE) for the correction of typographical errors.

As shown above, the embodiment outperformed the prior art in proofreading coverage.

For example, it was confirmed that the system of the embodiment improves proofreading coverage by a factor of 2 to 3 over the prior-art system, while proofreading accuracy degrades very little or not at all.

In addition, because the system of the embodiment has wider proofreading coverage than the prior art, a user who enters a query containing a typographical error has to re-enter the correctly spelled query less often. This reduces the amount of data processing and computation required for query input (from the client's perspective) and for search result processing (from the search server's perspective) in order to output correct search results. Furthermore, when the user's terminal is a mobile terminal, the reduced need to re-enter correctly spelled queries also saves battery power on the terminal.

The apparatus described above may be implemented by hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the OS. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, a single processing device is sometimes described as being used, but those skilled in the art will understand that the processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or instruct the processing device independently or collectively. Software and/or data may be embodied permanently or temporarily in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over computer systems connected by a network and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that generated by a compiler, but also high-level language code that a computer executes using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to limited embodiments and drawings, those skilled in the art can make various modifications and variations from the above description. For example, appropriate results can be achieved even if the described techniques are performed in an order different from that described, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from that described, or are substituted or replaced by other components or equivalents.

Accordingly, other embodiments that are equivalent to the claims also fall within the scope of the appended claims.

100: Query proofreading system
210: Processor
220: Proofreading information extraction unit
230: Proofreading result generation unit
240: Parameter acquisition unit
250: Communication unit

Claims

A query proofreading system comprising:
a proofreading information extraction unit that extracts information on at least one proofreading candidate for a query based on log information related to one or more previously input queries;
a parameter acquisition unit that acquires, based on the extracted information, parameters related to conversion of the query into the proofreading candidates; and
a proofreading result generation unit that calculates, based on the acquired parameters, a probability associated with conversion of the query into each of the at least one proofreading candidate and, based on the calculated probabilities, extracts at least one of the proofreading candidates as a proofreading result for the query,
wherein the parameter acquisition unit acquires, as the parameters, a probability that an element included in the query is converted into an element included in each proofreading candidate, a positional relationship of the elements included in the query with respect to the elements of each proofreading candidate, and a degree of naturalness of the conversion of the query into each proofreading candidate.

The query proofreading system according to claim 1, wherein the proofreading information extraction unit determines, using the log information, whether the query includes a typographical error and, if the query includes a typographical error, identifies at least one proofreading candidate for the query.

The query proofreading system according to claim 1, wherein the log information includes at least one of: a time from when a first query is input by a user until a second query is input, user click information on a search result for the query, information on the similarity between the first query and the second query, and an attribute of the search result.

The query proofreading system according to claim 3, wherein the attribute of the search result is a category of content included in the search result.

The query proofreading system according to any one of claims 1 to 4, wherein the elements included in the query are syllables included in the query, and the elements included in each proofreading candidate are syllables included in that proofreading candidate.

The query proofreading system according to any one of claims 1 to 5, wherein, when the proofreading result generation unit extracts a proofreading candidate from the at least one proofreading candidate as the proofreading result, the order in which the elements included in the query are arranged is assumed to be the same as the order in which the corresponding elements of the at least one proofreading candidate are arranged.

The query proofreading system according to any one of claims 1 to 6, wherein the at least one proofreading candidate extracted as the proofreading result is included in a search result for an input of the query by a user.

The query proofreading system according to any one of claims 1 to 7, wherein there are a plurality of proofreading candidates, and
the proofreading result generation unit excludes at least one of the plurality of proofreading candidates from extraction as a proofreading result for the query, based on the distribution of the probabilities calculated for each of the plurality of proofreading candidates.

The query proofreading system according to any one of claims 1 to 8, wherein the probability associated with the conversion is calculated by a mathematical formula,
the formula being

[Formula omitted in the source text]

where l is the length of each proofreading candidate, m is the length of the query, j is the index of an element included in each proofreading candidate, i is the index of an element included in the query,
TR is a function giving the probability that the i-th element of the query is converted into the j-th element of each proofreading candidate, AL is a function giving a value that represents the positional relationship of the elements of the query with respect to the elements of each proofreading candidate, and P_LM is a function giving a probability that represents the degree of naturalness of the conversion of the query into each proofreading candidate.
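
The formula itself is not reproduced in this text, so the following is only an illustrative sketch of how TR, AL, and P_LM could be combined. It assumes the conventional IBM Model 2 noisy-channel form, P_LM(c) multiplied by the product over query positions i of the sum over candidate positions j of TR × AL, which may differ from the claimed formula. The function name `conversion_probability`, the argument order of `tr` and `al`, and all toy parameter values are hypothetical.

```python
from typing import Callable, Sequence

Element = str  # the claims treat elements as syllables; single characters stand in here


def conversion_probability(
    query: Sequence[Element],
    candidate: Sequence[Element],
    tr: Callable[[Element, Element], float],
    al: Callable[[int, int, int, int], float],
    p_lm: Callable[[Sequence[Element]], float],
) -> float:
    """Assumed score: P_LM(c) * prod_i sum_j TR(q_i, c_j) * AL(j, i, l, m)."""
    m, l = len(query), len(candidate)
    score = p_lm(candidate)
    for i in range(m):
        score *= sum(tr(query[i], candidate[j]) * al(j, i, l, m) for j in range(l))
    return score


# Toy, hand-picked parameters (hypothetical; real ones would be learned from query logs).
def toy_tr(q_elem: Element, c_elem: Element) -> float:
    return 0.9 if q_elem == c_elem else 0.1


def toy_al(j: int, i: int, l: int, m: int) -> float:
    return 1.0 / l  # uniform alignment, so position plays no role in this toy setup


def toy_lm(candidate: Sequence[Element]) -> float:
    return {"naver": 0.6}.get("".join(candidate), 0.001)


scores = {
    c: conversion_probability("nvaer", c, toy_tr, toy_al, toy_lm)
    for c in ("naver", "never")
}
best = max(scores, key=scores.get)
```

In this toy setup the language-model term dominates, so the candidate that looks most like a natural query is ranked first; a learned, non-uniform AL would additionally penalize candidates whose elements sit far from the corresponding query positions.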

The query proofreading system according to any one of claims 1 to 9, wherein at least one of the proofreading information extraction unit and the parameter acquisition unit is realized as a distributed processing system.

The query proofreading system according to any one of claims 1 to 10, wherein the information on the proofreading candidates includes a pair of a typographical error and its correct spelling, the pair consisting of the query and each proofreading candidate.

The query proofreading system according to any one of claims 1 to 11, wherein the probability associated with the conversion is a conditional probability that each proofreading candidate occurs when the query is input.

A query proofreading method comprising:
extracting, by a computer, information on at least one proofreading candidate for a query based on log information related to one or more previously input queries;
acquiring, by the computer, based on the extracted information, parameters related to conversion of the query into the proofreading candidates;
calculating, by the computer, based on the acquired parameters, a probability associated with conversion of the query into each of the at least one proofreading candidate; and
extracting, by the computer, based on the calculated probabilities, at least one of the proofreading candidates as a proofreading result for the query,
wherein acquiring the parameters comprises acquiring, as the parameters, a probability that an element included in the query is converted into an element included in each proofreading candidate, a positional relationship of the elements included in the query with respect to the elements of each proofreading candidate, and a degree of naturalness of the conversion of the query into each proofreading candidate.

The query proofreading method according to claim 13, wherein extracting the information on the proofreading candidates comprises:
determining, by the computer, whether the query includes a typographical error using the log information; and
identifying, by the computer, at least one proofreading candidate for the query if the query includes a typographical error.

The query proofreading method according to claim 13 or 14, further comprising outputting, by the computer, the at least one proofreading candidate extracted as the proofreading result as a search result for an input of the query by a user.

The query proofreading method according to any one of claims 13 to 15, wherein there are a plurality of proofreading candidates, and
extracting the at least one proofreading candidate comprises
excluding, by the computer, at least one of the plurality of proofreading candidates from extraction as a proofreading result for the query, based on the distribution of the probabilities calculated for each of the plurality of proofreading candidates.

The query proofreading method according to any one of claims 13 to 16, wherein the probability associated with the conversion is calculated by a mathematical formula,
the formula being

[Formula omitted in the source text]

where l is the length of each proofreading candidate, m is the length of the query, j is the index of an element included in each proofreading candidate, i is the index of an element included in the query,
TR is a function giving the probability that the i-th element of the query is converted into the j-th element of each proofreading candidate, AL is a function giving a value that represents the positional relationship of the elements of the query with respect to the elements of each proofreading candidate, and P_LM is a function giving a probability that represents the degree of naturalness of the conversion of the query into each proofreading candidate.

A computer program that causes a computer to execute the query proofreading method according to any one of claims 13 to 17.

A computer-readable recording medium storing the computer program according to claim 18.