JP2014006620A

JP2014006620A - Synonym estimation device, synonym estimation method, and synonym estimation program

Info

Publication number: JP2014006620A
Application number: JP2012140466A
Authority: JP
Inventors: Kei Uchiumi; 慶内海
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2012-06-22
Filing date: 2012-06-22
Publication date: 2014-01-16
Anticipated expiration: 2032-06-22
Also published as: JP5507620B2

Abstract

PROBLEM TO BE SOLVED: To provide a synonym estimation device, a synonym estimation method, and a synonym estimation program capable of more accurately identifying a synonym close in meaning to a query submitted for search.

SOLUTION: A synonym estimation device 10 generates a plurality of functions, each of which calculates a new feature value by performing a weighted operation on a plurality of feature values. It then learns the weights applied to the plurality of feature values and the new feature values, and the weights applied to the feature values combined within each function, such that the character strings registered in learning data 30 can be ranked in descending order of similarity. For a plurality of synonym candidates retrieved for a query, the synonym estimation device 10 calculates new feature values from the candidates' feature values using the functions obtained after learning, and identifies a synonym similar to the input query based on the result of combining the plurality of feature values and the calculated new feature values with the learned weights.

Description

The present invention relates to a synonym estimation device, a synonym estimation method, and a synonym estimation program.

In a conventional web page search, when a user inputs a query, a search engine on the web page searches with that query and presents the user with search results containing a plurality of URLs (Uniform Resource Locators).

A query may have synonyms whose constituent elements are dissimilar to it but whose meaning is similar. Searching with the input query and searching with a synonym of that query can return different results. For example, a search for A may return 20,000 results, while a search for A', a synonym of A, returns 3,000 results.

If the system can recognize that A and A' are synonyms, it can avoid lost opportunities by performing an OR search over the two queries in advance.

The following conventional techniques for extracting synonyms of a query are known. For example, a technique is known that searches for synonyms of an input query by label propagation using a search click-through log (see, for example, Patent Document 1). In machine translation, a technique is also known that extracts a set of features from each word in different languages and associates the words of the different languages with one another (see, for example, Patent Document 2).

Patent Document 1: JP 2012-79029 A
Patent Document 2: JP 2010-198438 A

However, the conventional techniques may fail to identify a synonym of the query for which a search is requested.

The present application has been made in view of the above, and its object is to provide a synonym estimation device, a synonym estimation method, and a synonym estimation program that can more accurately identify a synonym close to the query for which a search is requested.

The synonym estimation device according to the present application includes: a storage means that stores learning data in which character strings that can be regarded as synonyms are stored together with information indicating their closeness of meaning; a generation means that generates a plurality of nonlinear functions, each calculating a new feature value by a weighted operation on a plurality of feature values, while varying the number and combination of the feature values; a learning means that learns, based on the learning data, weights for the plurality of feature values and for the new feature values calculated from the respective nonlinear functions, as well as weights for the feature values combined within each nonlinear function, such that character strings can be ranked in order of closeness of meaning; and a specifying means that, for a plurality of synonym candidates retrieved for a query for which a search is requested, calculates new feature values from the plurality of feature values of each candidate using the nonlinear functions whose combined feature values are weighted with the weights learned by the learning means, and identifies, from among the plurality of synonym candidates, a synonym close in meaning to the query based on the result of applying the learned weights to the plurality of feature values and the calculated new feature values.

The synonym estimation device, the synonym estimation method, and the synonym estimation program according to the present application can more accurately identify a synonym close to the query for which a search is requested.

FIG. 1 is a diagram for explaining synonym search processing according to the embodiment.
FIG. 2 is a diagram illustrating an example of features.
FIG. 3 is a diagram illustrating an example of new feature values.
FIG. 4 is a graph showing synonym candidates A and B in terms of features X and Y.
FIG. 5 is a graph showing synonym candidates A, B, C, and D in terms of features X and Y.
FIG. 6 is a graph showing synonym candidates A, B, C, and D in terms of feature Z.
FIG. 7 is a diagram illustrating an example of a functional configuration of the synonym estimation device.
FIG. 8 is a diagram illustrating an example of a data configuration of feature data.
FIG. 9 is a diagram for explaining generation of functions.
FIG. 10 is a flowchart showing the procedure of the learning process.
FIG. 11 is a flowchart showing the procedure of the synonym specifying process.

DESCRIPTION OF EMBODIMENTS
Hereinafter, modes for carrying out the synonym estimation device, the synonym estimation method, and the synonym estimation program according to the present invention (hereinafter referred to as "embodiments") will be described in detail with reference to the drawings. The present invention is not limited by these embodiments.

[1. Synonym search processing]
First, the synonym search processing performed by the synonym estimation device according to the embodiment will be described. FIG. 1 is a diagram for explaining the synonym search processing according to the embodiment. The example of FIG. 1 shows a case where synonyms of the input query are identified and the synonym closest to the query is presented to the user as a suggestion query.

The synonym estimation device receives a query to be searched, input by the user. The synonym estimation device searches for a plurality of synonym candidates for the input query. It then treats each retrieved synonym candidate as a correction candidate, ranks the candidates in order of similarity to the input query, and presents the top-ranked correction candidate to the user as a suggestion query.

When ranking the correction candidates in order of similarity, the synonym estimation device extracts features for each synonym candidate treated as a correction candidate, in order to calculate the degree of relevance between the query and each correction candidate. FIG. 2 is a diagram illustrating an example of features. In the example of FIG. 2, when the synonym candidates are retrieved by label propagation, the score obtained during the label propagation search is extracted as a feature. The query likelihood of a correction candidate, given by a probability calculated using a language model, by TextRank, or the like, is also extracted as a feature. Further features in the example of FIG. 2 are the length of the correction candidate's character string, whether the query and the correction candidate match, whether the query and the correction candidate are in an acronym relationship, and the edit distance between the query and the correction candidate. An acronym is an abbreviation formed only from the leading part of each word of a name, such as abbreviating "Nihon Housou Kyokai" (Japan Broadcasting Corporation) as "NHK". The features are not limited to these, and other features may be used.
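As a rough illustration of the feature extraction described above, the following Python sketch builds a small feature vector for one query and candidate pair. The function names (`edit_distance`, `is_acronym`, `extract_features`), the feature order, and the treatment of the label propagation score as an externally supplied number are all assumptions made for this example; the language model and TextRank features are omitted because they would require a trained model.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def is_acronym(short: str, long_name: str) -> bool:
    """True if `short` equals the initials of the space-separated words of `long_name`."""
    initials = "".join(word[0] for word in long_name.split())
    return bool(initials) and short.upper() == initials.upper()


def extract_features(query: str, candidate: str,
                     propagation_score: float = 0.0) -> List[float]:
    """Hypothetical feature vector; the order of the entries is an assumption."""
    acronym = is_acronym(candidate, query) or is_acronym(query, candidate)
    return [
        propagation_score,                       # score from label propagation
        float(len(candidate)),                   # candidate string length
        1.0 if query == candidate else 0.0,      # exact match flag
        1.0 if acronym else 0.0,                 # acronym relationship flag
        float(edit_distance(query, candidate)),  # edit distance
    ]


features = extract_features("Nihon Housou Kyokai", "NHK", propagation_score=0.8)
```

Here `features[3]` is the acronym flag, which is set for this pair because "NHK" matches the initials of the three words.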

The synonym estimation device extracts each feature for each retrieved synonym candidate and, for each synonym, obtains a feature vector whose elements are the feature values of that synonym's features. The synonym estimation device then ranks the synonym candidates based on these feature vectors.

However, the extracted features may provide few clues about the retrieved synonym candidates, so the candidates cannot always be adequately classified according to their degree of similarity to the query. When the synonym candidates cannot be adequately classified in this way, a synonym close to the input query cannot be identified accurately.

Therefore, the synonym estimation device according to the present embodiment obtains new feature values for each synonym candidate by combining the feature values of a plurality of features in a calculation. FIG. 3 is a diagram illustrating an example of new feature values. In the example of FIG. 3, the feature values X_1, X_2, and X_3 are added to obtain a new feature value X'_1, and the feature values X_1, X_2, and X_3 are multiplied to obtain a new feature value X'_2.

Here, the effect of calculating a new feature value will be described. To simplify the description, two features, X and Y, are used. For example, suppose that synonym candidate A has feature X of "1" and feature Y of "0" and is a synonym similar to the query, and that synonym candidate B has feature X of "0" and feature Y of "1" and is a synonym dissimilar to the query. FIG. 4 is a graph showing the synonym candidates A and B in terms of the features X and Y. In this case, whether a synonym candidate is similar to the query can be determined by, for example, judging it similar if feature X is "1" and dissimilar if feature Y is "1".

On the other hand, suppose for example that synonym candidate A has feature X of "1" and feature Y of "1" and is a synonym similar to the query; that synonym candidate B has feature X of "−1" and feature Y of "1" and is dissimilar to the query; that synonym candidate C has feature X of "−1" and feature Y of "−1" and is similar to the query; and that synonym candidate D has feature X of "1" and feature Y of "−1" and is dissimilar to the query. FIG. 5 is a graph showing the synonym candidates A, B, C, and D in terms of the features X and Y. In this case, the value of feature X is "1" for both the similar candidate A and the dissimilar candidate D, and "−1" for both the dissimilar candidate B and the similar candidate C. Likewise, the value of feature Y is "1" for both the similar candidate A and the dissimilar candidate B, and "−1" for both the similar candidate C and the dissimilar candidate D. Consequently, neither feature X nor feature Y on its own indicates whether a synonym candidate is similar to the query.

Therefore, for each of the synonym candidates A to D, the value of feature X and the value of feature Y are multiplied to calculate the value of a new feature Z. Candidate A has feature X of "1" and feature Y of "1", so its feature Z is "1". Candidate B has feature X of "−1" and feature Y of "1", so its feature Z is "−1". Candidate C has feature X of "−1" and feature Y of "−1", so its feature Z is "1". Candidate D has feature X of "1" and feature Y of "−1", so its feature Z is "−1". FIG. 6 is a graph showing the synonym candidates A, B, C, and D in terms of feature Z. With this new feature Z, the values for the similar candidates A and C are separated from the values for the dissimilar candidates B and D. Whether a synonym candidate is similar to the query can therefore be determined by, for example, judging it similar if feature Z is "1" and dissimilar if feature Z is "−1". In this way, obtaining a new feature makes it possible to discriminate a characteristic that could not be discriminated before, so that synonyms that could not be distinguished can now be identified.
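The four-candidate example can be checked with a few lines of Python. This is only a sketch of the arithmetic in the text; the dictionary layout and the names `candidates`, `product_feature`, `z_values`, and `predicted` are introduced here for illustration.

```python
# Candidates from the example: name -> (feature X, feature Y, similar to query?)
candidates = {
    "A": (1, 1, True),
    "B": (-1, 1, False),
    "C": (-1, -1, True),
    "D": (1, -1, False),
}


def product_feature(x: float, y: float) -> float:
    """New feature Z obtained by multiplying features X and Y."""
    return x * y


# Feature Z for every candidate.
z_values = {name: product_feature(x, y) for name, (x, y, _) in candidates.items()}

# Decision rule from the text: Z = 1 means similar, Z = -1 means dissimilar.
predicted = {name: z > 0 for name, z in z_values.items()}
```

Neither X nor Y alone separates {A, C} from {B, D}, but the sign of Z does, which is why a multiplicative combination helps in this case.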

Finding functions that are effective for identifying synonyms requires considering many different combinations of features. Manual work by an administrator or the like is therefore limited in its ability to discover effective feature combinations. Moreover, if the features combined by a function are not effective, the accuracy of the identified synonyms may decrease. The synonym estimation device according to the present embodiment therefore performs learning using learning data in which character strings that can be regarded as synonyms close in meaning are registered in association with one another, obtains functions that are effective for identifying synonyms, and identifies synonyms using the obtained functions.

[2. Configuration of synonym estimation device]
Hereinafter, the synonym estimation device 10 according to the present embodiment will be described in more detail. FIG. 7 is a diagram illustrating an example of a functional configuration of the synonym estimation device. As illustrated in FIG. 7, the synonym estimation device 10 includes a communication I/F (interface) unit 20, a storage unit 21, and a control unit 22.

The communication I/F unit 20 is an interface such as a NIC (Network Interface Card). The communication I/F unit 20 transmits and receives various data to and from other devices via the network 11. Other devices, such as a client terminal 12 operated by a user and a management terminal 13 operated by an administrator, are communicably connected to the network 11.

The client terminal 12 is an information processing device used by a user. For example, the client terminal 12 is a desktop PC (Personal Computer), a tablet PC, a notebook PC, a mobile phone, a PDA (Personal Digital Assistant), or the like.

The client terminal 12 receives a web page provided by the synonym estimation device 10 and displays the received web page on a predetermined display unit (display). The web page has an input area in which a character string to be searched for can be entered. To perform a search, the user operates the client terminal 12 to enter the character string to be searched for in the input area of the displayed web page and instructs execution of the search. When instructed to execute the search, the client terminal 12 outputs the character string entered in the input area to the synonym estimation device 10 as a query.

The management terminal 13 is an information processing device that the administrator uses to manage the synonym estimation device 10. For example, the management terminal 13 is a desktop PC, a tablet PC, a notebook PC, or the like. The administrator operates the management terminal 13 to register various data and issue various instructions for operating and managing the synonym estimation device 10. For example, the administrator registers learning data and instructs the device to learn weighting conditions that are effective for identifying synonyms.

The communication I/F unit 20 receives queries from the client terminal 12 via the network 11. The communication I/F unit 20 also receives, via the network 11, the various data to be registered and the various instructions from the management terminal 13.

The storage unit 21 is a storage device such as a hard disk or an optical disk. The storage unit 21 is not limited to these types of storage device and may be a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory.

The storage unit 21 stores an OS (Operating System) executed by the control unit 22 and various programs used for searching for synonyms. The storage unit 21 also stores various data used by the programs executed by the control unit 22. For example, the storage unit 21 stores learning data 30, feature data 31, and function data 32.

The learning data 30 is data in which character strings that can be regarded as synonyms are stored together with information indicating their closeness of meaning. For example, in the learning data 30, each character string that can be regarded as a synonym is stored together with a score indicating its closeness of meaning. The value of this score is set by an administrator or the like. The learning data 30 is registered by an administrator or the like.

The feature data 31 is data that stores feature values for each retrieved synonym candidate. The feature data 31 is generated by, for example, a derivation unit 44 described later. FIG. 8 is a diagram illustrating an example of a data configuration of the feature data. The feature data 31 according to the present embodiment has one record per synonym candidate, in which the feature values of the features are stored in a predetermined feature order as tab-separated fields. As shown in FIG. 8, the feature data 31 has fields 31A to 31D. The first field 31A stores a label indicating whether the synonym is registered in the learning data 30: "1" is stored if the synonym is registered in the learning data 30, and "0" is stored if it is not. Field 31B and the subsequent fields store feature values. The example of FIG. 8 shows a case where each synonym candidate has three feature values, which are stored in fields 31B to 31D. The feature values are real numbers, so negative values and values with fractional parts can also be used. Using real numbers makes it possible to record not merely the presence or absence of a characteristic but the characteristic in detail.
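A record layout like the one described for the feature data 31 (a 0/1 label followed by tab-separated real-valued features) can be parsed as in the following sketch. The names `FeatureRecord` and `parse_feature_line` and the sample values are hypothetical; only the field order and the tab separator are taken from the description.

```python
from typing import List, NamedTuple


class FeatureRecord(NamedTuple):
    label: int             # field 31A: 1 if registered in the learning data, else 0
    features: List[float]  # fields 31B onward: real-valued feature values


def parse_feature_line(line: str) -> FeatureRecord:
    """Parse one tab-separated record: label first, then the feature values."""
    fields = line.rstrip("\n").split("\t")
    return FeatureRecord(label=int(fields[0]),
                         features=[float(value) for value in fields[1:]])


# Made-up sample with three features per candidate, as in the FIG. 8 example.
sample = "1\t0.75\t-1.0\t3\n0\t0.10\t2.5\t-0.5\n"
records = [parse_feature_line(line) for line in sample.splitlines() if line]
```

Negative and fractional values pass through `float` unchanged, which matches the statement that feature values are real numbers.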

The function data 32 is data that stores the functions whose parameters have been changed by the learning unit 42 described later. The function data 32 is registered by the learning unit 42.

Returning to FIG. 7, the control unit 22 has an internal memory for storing programs that define various processing procedures and control data, and executes various processes using them. The control unit 22 is realized by, for example, an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). Alternatively, the control unit 22 is realized by, for example, a CPU (Central Processing Unit) or an MPU (Micro Processing Unit) executing a program stored in an internal storage device (not shown), using a RAM as a work area.

The control unit 22 functions as various processing units through the operation of various programs. For example, the control unit 22 includes a reception unit 40, a generation unit 41, a learning unit 42, a search unit 43, a derivation unit 44, a specification unit 45, and a transmission unit 46.

The reception unit 40 receives various data and various instructions from the management terminal 13. For example, the reception unit 40 receives learning data and learning instructions from the management terminal 13. When the reception unit 40 receives learning data from the management terminal 13, it registers the data in the storage unit 21 as the learning data 30. When the reception unit 40 receives a learning instruction from the management terminal 13, it instructs the generation unit 41 and the learning unit 42 to operate in the learning mode.

When operation in the learning mode is instructed, the generation unit 41 generates a plurality of nonlinear functions, each of which calculates a new feature value by a weighted operation on a plurality of feature values, while varying the number and combination of the feature values. For example, the generation unit 41 generates a plurality of functions that perform an inner product calculation in which the feature values are each weighted with an independent parameter θ, varying the number and combination of the feature values. FIG. 9 is a diagram for explaining generation of the functions. The example of FIG. 9 shows that, for each of a plurality of synonym candidates t−1 to t+1, a plurality of feature values x_0 to x_{i−1}, x_i, and x_{i+1} to x_D have been obtained. In this case, the generation unit 41 varies the number and combination of the feature values and weights the feature values x_0 to x_{i−1}, x_i, and x_{i+1} to x_D with independent parameters θ to generate functions w_{1} to w_{g} to w_{K}. In the example of FIG. 9, the parameters are written as θ_1 to θ_g to θ_k so that the parameter θ of each function can be easily identified. The parameter θ is updated according to the degree to which the function contributes to identifying synonyms.

Here, the more feature values a function combines by weighting, the more the independence of the individual feature values is lost and the more the functions show the same tendency. In addition, the more feature values are combined, the more functions are generated and the greater the processing load of calculating the new feature values. The generation unit 41 therefore generates the nonlinear functions while varying the number of feature values within a range not exceeding a predetermined value. In the present embodiment, this predetermined value is, for example, 3. The predetermined value may be made specifiable by the administrator from the management terminal 13. That is, the generation unit 41 varies the combination of feature values within the range of three or fewer feature values and generates nonlinear functions that calculate new feature values by a weighted operation on those feature values. The synonym estimation device 10 according to the present embodiment identifies a synonym from the result of a weighted operation on the plurality of feature values and the new feature values calculated from the respective nonlinear functions. For example, a synonym is identified from the result of weighting the plurality of feature values and the new feature values calculated from the respective nonlinear functions with mutually independent parameters w.
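One way to picture the function generation is to enumerate every combination of at most three feature indices and attach an independent parameter vector θ to each combination. The sketch below does this with `itertools.combinations`. The names `generate_functions` and `apply_function`, the initial θ value, the inclusion of single-feature subsets, and the use of `tanh` as the nonlinearity are all assumptions made for illustration; the text here specifies only a weighted inner product and a limit on the number of combined features.

```python
import itertools
import math
from typing import Dict, List, Sequence


def generate_functions(num_features: int, max_size: int = 3,
                       min_size: int = 1,
                       initial_theta: float = 1.0) -> List[Dict]:
    """Create one function per feature subset with min_size..max_size members.

    Each function is a dict holding the chosen feature indices and its own
    independent parameter vector theta (one weight per chosen feature).
    """
    functions = []
    for size in range(min_size, max_size + 1):
        for indices in itertools.combinations(range(num_features), size):
            functions.append({"indices": indices,
                              "theta": [initial_theta] * size})
    return functions


def apply_function(fn: Dict, features: Sequence[float]) -> float:
    """Inner product of theta with the selected features, then tanh.

    tanh is an assumed nonlinearity, used here only as a placeholder.
    """
    inner = sum(t * features[i] for t, i in zip(fn["theta"], fn["indices"]))
    return math.tanh(inner)


functions = generate_functions(num_features=4)
new_features = [apply_function(fn, [0.5, 1.0, -0.5, 0.0]) for fn in functions]
```

With four base features and a limit of three, this enumerates 4 + 6 + 4 = 14 functions, which illustrates how quickly the number of functions grows and why a cap on the subset size keeps the processing load bounded.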

When operation in the learning mode is instructed, the learning unit 42 learns, based on the learning data 30, weights for the plurality of feature values and for the new feature values calculated from the respective nonlinear functions, as well as weights for the feature values combined within each nonlinear function, such that character strings can be ranked in order of closeness of meaning. Specifically, the learning unit 42 first derives feature values for each character string stored in the learning data 30. For example, for each character string that can be regarded as a synonym, the learning unit 42 derives the feature value of each feature shown in FIG. 2 as a real number. If the feature values of each character string are already stored in the learning data 30, they need not be derived. The learning unit 42 then generates feature data 31 for learning for the character strings of the learning data 30, with a separate record for each character string and the real-valued feature values of the features arranged in a predetermined feature order and separated by tabs. Next, with the parameters θ and w each set to predetermined initial values, the learning unit 42 performs listwise learning so that the result of weighting the feature values of each character string in the feature data 31 for learning, together with the new feature values calculated by substituting those feature values into each function, becomes closer to the score of each character string that can be regarded as a synonym in the learning data 30. Through this learning it updates the weight values for the feature values combined within each nonlinear function (the individual parameters θ) and the weight values for the feature values and new feature values (the individual parameters w). The learning unit 42 then registers in the storage unit 21, as the function data 32, the functions that calculate new feature values with the learned parameters θ set, together with the individual parameters w indicating the weights for the feature values and new feature values. The function data 32 may also store various parameters such as the parameters θ and w.

Here, an example of updating the function parameters is described. First, Top-k ListNet is explained with reference to the following reference: Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, ICML 07, pp. 129-136, New York, USA, 2007. ACM.

In Top-k ListNet, given a query q^(i) and the n^(i) synonym candidates obtained for it, the probability of obtaining a particular ordering of the top k documents is defined by the following equation (1).

Here, f_w denotes the ranking function applied to the feature vector x_jt of a document, and z^(i)(f_w) denotes the list of scores that f_w assigns to the documents in the permutation.
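The image of equation (1) is not reproduced in this text. For orientation, the top-k probability defined in the cited paper by Cao et al. can be written as below. That equation (1) of this document has exactly this form is an assumption; j_1, ..., j_k denote the candidates placed at positions 1 to k.

```latex
P_{z^{(i)}(f_w)}\bigl(g_k(j_1,\dots,j_k)\bigr)
  = \prod_{t=1}^{k}
    \frac{\exp\bigl(f_w(x_{j_t}^{(i)})\bigr)}
         {\sum_{l=t}^{n^{(i)}} \exp\bigl(f_w(x_{j_l}^{(i)})\bigr)}
```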

In training the model, synonyms are learned using the feature values and scores of the character strings in the learning data 30 that can be regarded as synonyms, and the parameter w of the ranking function is estimated by minimizing the cross entropy over the learning data 30. The cross entropy is expressed by the following equation (2).
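The image of equation (2) is likewise not reproduced here. A sketch of the listwise cross entropy in the form used by the cited paper is given below, where P_{y^(i)} is the distribution derived from the ground-truth scores and P_{z^(i)(f_w)} the distribution derived from the model scores. The set symbol for the collection of top-k groups is a name borrowed for this sketch, and the exact form used in this document is an assumption.

```latex
L\bigl(y^{(i)}, z^{(i)}(f_w)\bigr)
  = -\sum_{g \in \mathcal{G}_k} P_{y^{(i)}}(g)\,\log P_{z^{(i)}(f_w)}(g)
```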

g_k denotes a permutation of k documents, and g denotes a subset thereof. The parameter update formula in this case is shown in the following equation (3). P_y^(i) denotes the ground-truth occurrence probability of the document subset g for the query q^(i), and P_z^(i)(f_w) denotes the probability of the subset g computed from the parameters of the learned model.

When k is large, the above update formula is difficult to compute, so normally only the case k = 1 is handled. The update formula for k = 1 is shown in the following equation (4).
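The image of equation (4) is not reproduced in this text. For k = 1 the gradient can be worked out directly from the cross entropy between two top-one distributions, as sketched below; it uses only the fact that the ground-truth probabilities sum to 1. Whether equation (4) is written in exactly this form is an assumption.

```latex
P_{z^{(i)}(f_w)}(j)
  = \frac{\exp\bigl(f_w(x_j^{(i)})\bigr)}
         {\sum_{l=1}^{n^{(i)}} \exp\bigl(f_w(x_l^{(i)})\bigr)},
\qquad
\frac{\partial L}{\partial w}
  = \sum_{j=1}^{n^{(i)}}
    \bigl(P_{z^{(i)}(f_w)}(j) - P_{y^{(i)}}(j)\bigr)\,
    \frac{\partial f_w(x_j^{(i)})}{\partial w}
```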

The reference adopts, as the ranking function f_w, a simple linear neural network model in which each element of the feature vector x_jt is weighted by the parameter w, as shown in the following equation (5).
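The k = 1 case with a linear ranking function can be sketched in a few lines. This is an illustration under stated assumptions rather than the reference's implementation: the ranking function is taken to be the inner product of w and x, the ground-truth distribution is taken to be the softmax of the target scores, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Top-one probabilities: exp(score) normalized over the candidate list."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_score(w, x):
    """Assumed linear ranking function f_w(x) = <w, x>."""
    return sum(wi * xi for wi, xi in zip(w, x))


def top1_loss_and_grad(w, candidates, target_scores):
    """Cross entropy between target and model top-one distributions, with its gradient in w."""
    p_y = softmax(target_scores)  # assumption: targets are turned into probabilities by softmax
    p_z = softmax([linear_score(w, x) for x in candidates])
    loss = -sum(py * math.log(pz) for py, pz in zip(p_y, p_z))
    # dL/dw_d = sum_j (P_z(j) - P_y(j)) * x_j[d]
    grad = [
        sum((pz - py) * x[d] for py, pz, x in zip(p_y, p_z, candidates))
        for d in range(len(w))
    ]
    return loss, grad
```

A gradient-descent step would then subtract a small multiple of `grad` from `w`. Because the score is linear in x, this model can only express a linear relationship between input and output.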

Because the reference adopts a linear neural network model as the ranking function, it can learn only a linear relationship between the output z and the input x. This embodiment therefore extends the model so that a nonlinear relationship between the output and the input can be learned, using the functions φ_t generated by the generation unit 41 to make the relationship between the output and the input nonlinear. The ranking function is accordingly changed as shown in equation (6).

Φ(x_j^(i)) denotes a vector consisting of D functions φ_t. Each function φ_t is expressed by the sigmoid function shown in the following equation (7) and has its own independent weight parameter θ_t.
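Equation (7) is not reproduced in this text. The sketch below assumes that each φ_t applies a sigmoid to the θ_t-weighted sum of the feature values in its combination. The names `phi` and `feature_map` and the representation of a combination as a tuple of feature indices are hypothetical.

```python
import math


def phi(theta_t, x, subset):
    """Assumed form of phi_t: sigmoid of the theta_t-weighted sum of the selected feature values."""
    a = sum(th * x[i] for th, i in zip(theta_t, subset))
    # Numerically stable sigmoid for both signs of a.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def feature_map(thetas, subsets, x):
    """Phi(x): the vector of D new feature values, one per generated function."""
    return [phi(theta_t, x, subset) for theta_t, subset in zip(thetas, subsets)]
```

With D generated functions, `feature_map` maps an input vector of any length to a D-dimensional vector whose entries all lie between 0 and 1.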

The parameters w and θ_t are updated so as to minimize the loss function of Top-1 ListNet with the modified ranking function. The update formulas are shown in the following equations (8) to (10).

The function φ_t generates a combination of the features contained in the input x_j^(i) and applies a nonlinear weighting to it. This maps the input x_j^(i) into an arbitrary D-dimensional space and allows a nonlinear relationship between the input and the output to be learned.

The learning unit 42 uses equations (8) and (9) to update each of the independent parameters θ and w individually. As a result, the weights for feature combinations that contribute to predicting the correct answer are raised, and the weights for feature combinations that do not contribute are lowered. The method of changing the parameters is not limited to this, and various other methods can be applied. For example, for each character string in the learning data 30 that can be regarded as a synonym, the learning unit 42 may change the weighting that the nonlinear functions apply to the feature values, and the weighting for the feature values and the new feature values, so that the value of the weighted result is raised when it ranks the synonym closest in meaning at the top and lowered when it does not. That is, the learning unit 42 may update the parameters θ and w according to whether the synonym closest in meaning is identified.
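Equations (8) to (10) are not reproduced in this text. The sketch below shows one way such an update could look: a plain gradient-descent step on the Top-1 cross entropy, with the gradient for each θ_t obtained by the chain rule through an assumed sigmoid form of φ_t. The scoring rule (original feature values followed by new feature values, each with its own weight), the learning rate, and every function name are assumptions of this example.

```python
import math


def sigmoid(a):
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expand(x, thetas, subsets):
    """Original feature values followed by the new feature values phi_t(x)."""
    new = [sigmoid(sum(th * x[i] for th, i in zip(theta, subset)))
           for theta, subset in zip(thetas, subsets)]
    return list(x) + new


def loss(w, thetas, subsets, candidates, target_scores):
    """Top-1 cross entropy between target and model distributions."""
    p_y = softmax(target_scores)
    scores = [sum(wi * fi for wi, fi in zip(w, expand(x, thetas, subsets)))
              for x in candidates]
    p_z = softmax(scores)
    return -sum(py * math.log(pz) for py, pz in zip(p_y, p_z))


def update(w, thetas, subsets, candidates, target_scores, lr=0.1):
    """One gradient-descent step on w and on every theta_t; returns new copies."""
    p_y = softmax(target_scores)
    feats = [expand(x, thetas, subsets) for x in candidates]
    p_z = softmax([sum(wi * fi for wi, fi in zip(w, f)) for f in feats])
    err = [pz - py for pz, py in zip(p_z, p_y)]  # dL/d(score_j)
    n_raw = len(candidates[0])
    grad_w = [sum(e * f[d] for e, f in zip(err, feats)) for d in range(len(w))]
    new_thetas = []
    for t, (theta, subset) in enumerate(zip(thetas, subsets)):
        grad_theta = [0.0] * len(theta)
        for e, x, f in zip(err, candidates, feats):
            p = f[n_raw + t]
            common = e * w[n_raw + t] * p * (1.0 - p)  # chain rule through the sigmoid
            for k, i in enumerate(subset):
                grad_theta[k] += common * x[i]
        new_thetas.append([th - lr * g for th, g in zip(theta, grad_theta)])
    new_w = [wi - lr * g for wi, g in zip(w, grad_w)]
    return new_w, new_thetas
```

Repeating `update` over the training lists lowers the loss, which in effect raises the weights of functions whose output helps place the correct synonym first and lowers the others.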

When a query is input from the client terminal 12, the search unit 43 searches for a plurality of synonym candidates for the input query. Any search method may be used as long as it can retrieve synonym candidates. In this embodiment, for example, label propagation is used to retrieve the synonym candidates.

The derivation unit 44 derives real-valued feature values for each of the retrieved synonym candidates. For example, for each retrieved candidate it derives the real-valued feature value of each feature shown in FIG. 2. The derivation unit 44 then generates the feature data 31, with one record per synonym candidate, in which the feature values are listed in a predetermined feature order and separated by tabs.
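The record layout described above (one record per candidate, real-valued feature values in a fixed order, separated by tabs) can be illustrated with a short sketch. The helper names and the use of Python's `repr` for number formatting are assumptions; the document does not specify how the values are formatted.

```python
def to_feature_record(feature_values):
    """Serialize one candidate's feature values, in the fixed feature order, as one tab-separated record."""
    return "\t".join(repr(float(v)) for v in feature_values)


def parse_feature_record(line):
    """Read a tab-separated record back into a list of real-valued feature values."""
    return [float(field) for field in line.rstrip("\n").split("\t")]


record = to_feature_record([0.5, 2, 0.125])
print(record.split("\t"))
```

Because the feature order is fixed, the position of a value within the record identifies the feature, so no feature names need to be stored.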

When a query is input from the client terminal 12, the specifying unit 45 uses each function stored as the function data 32 in the storage unit 21 to compute new real-valued feature values from the plurality of real-valued feature values in the feature data 31. The specifying unit 45 then weights each feature value in the feature data 31 and each computed new feature value by the corresponding independent parameter w stored as the function data 32 in the storage unit 21 and, based on the result, identifies from among the synonym candidates a synonym close in meaning to the input query. For example, it ranks the synonym candidates in descending order of the weighted result and identifies the candidate with the largest value as the synonym close in meaning. Any identification method may be used as long as it can identify a synonym close in meaning. For example, the specifying unit 45 may rank the synonym candidates in descending order of feature value without distinguishing between the features.
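The ranking step can be sketched as follows, assuming each candidate already carries its full feature vector (original feature values followed by the new feature values). The function name `rank_candidates` and the sample strings are hypothetical.

```python
def rank_candidates(candidates, w):
    """Sort candidate strings by descending weighted score.

    candidates: list of (string, feature_vector) pairs, where feature_vector
    holds the original feature values followed by the new feature values.
    w: one learned weight per entry of the feature vector.
    """
    scored = [
        (sum(wi * fi for wi, fi in zip(w, feats)), text)
        for text, feats in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]


# Hypothetical candidates for a query such as "car".
ranked = rank_candidates(
    [("carpet", [0.1, 0.0]), ("automobile", [0.9, 0.8]), ("vehicle", [0.5, 0.4])],
    w=[1.0, 2.0],
)
print(ranked[0])
```

The first element of the returned list is the candidate that would be treated as the synonym closest in meaning.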

Here, for each function generated by the generation unit 41, the parameters θ and w are changed so that the value of the weighted result becomes larger when the function has a high probability of ranking the corresponding synonym at the top among the synonym candidates, and smaller when that probability is low. Consequently, a function with a low probability of ranking the synonym at the top yields a small feature value and thus has little influence when a synonym close in meaning to the query is identified.

The transmission unit 46 transmits the highest-ranked synonym to the client terminal 12 as a suggestion query. The client terminal 12 thereby displays, as a suggestion query, a synonym close to the query for which the search was requested.

[3. Operation (Operation of the Synonym Estimation Device)]
Next, the operation of the synonym estimation device 10 according to this embodiment is described. First, the flow of the learning process, in which the synonym estimation device 10 learns functions effective for identifying synonyms, is described. FIG. 10 is a flowchart showing the procedure of the learning process. The learning process is executed at a predetermined timing, for example, when a learning instruction is received from the management terminal 13.

As shown in FIG. 10, the generation unit 41 generates a plurality of nonlinear functions, each of which computes a new feature value by weighting a plurality of feature values, while varying the number and combination of feature values (step S10). The learning unit 42 derives feature values for each character string stored in the learning data 30 and generates the feature data 31 for learning, which stores the derived feature values (step S11).

The learning unit 42 then learns the weighting for the plurality of feature values and the new feature values, and the weighting each nonlinear function applies to its combined feature values, such that the character strings of the learning data 30 can be ranked in order of closeness in meaning (step S12). The learning unit 42 stores in the storage unit 21, as the function data 32, each function for computing a new feature value with the computed parameters θ set, together with the parameters w indicating the weights for the feature values and the new feature values (step S13), and ends the process.

Next, the flow of the synonym identification process, in which the synonym estimation device 10 according to this embodiment identifies a synonym, is described. FIG. 11 is a flowchart showing the procedure of the synonym identification process. This process is executed at a predetermined timing, for example, when a query is input from the client terminal 12.

As shown in FIG. 11, the search unit 43 searches for a plurality of synonym candidates for the input query (step S20). The derivation unit 44 derives feature values for each of the retrieved synonym candidates and stores, in the storage unit 21, the feature data 31 containing the derived feature values (step S21).

The specifying unit 45 reads the function data 32 (step S22). Using each function stored in the function data 32, the specifying unit 45 computes new feature values from the feature values in the feature data 31 (step S23). The specifying unit 45 then weights each feature value in the feature data 31 and each computed new feature value by the parameters w stored as the function data 32, ranks the synonym candidates in descending order of the result, and identifies from among the candidates a synonym close in meaning to the input query (step S24).

The transmission unit 46 transmits the highest-ranked synonym to the client terminal 12 as a suggestion query (step S25) and ends the process.

[4. Effects]
As described above, the synonym estimation device 10 stores, in the storage unit 21 (an example of storage means), the learning data 30 in which character strings that can be regarded as synonyms are stored together with information indicating their closeness in meaning. Using the generation unit 41 (an example of generating means), the synonym estimation device 10 generates a plurality of nonlinear functions, each of which computes a new feature value by weighting a plurality of feature values, while varying the number and combination of feature values. Using the learning unit 42 (an example of learning means), the synonym estimation device 10 learns, based on the learning data 30, the weighting for the plurality of feature values and for the new feature values computed by the plurality of nonlinear functions, as well as the weighting each nonlinear function applies to its combined feature values, such that character strings can be ranked in order of closeness in meaning. Using the specifying unit 45 (an example of specifying means), for the plurality of synonym candidates retrieved for the query for which a search was requested, the synonym estimation device 10 computes new feature values from the feature values of each candidate using the nonlinear functions with the learned weighting for their combined feature values, and identifies, from among the candidates, a synonym close in meaning to the query based on the result of weighting the feature values and the computed new feature values with the learned weighting. By generating a plurality of nonlinear functions that compute new feature values and learning both weightings, the synonym estimation device 10 can obtain effective new feature values without troubling the user, and because it takes the new feature values into account when identifying a synonym, it can identify a synonym close to the query more accurately.

Further, in the synonym estimation device 10, the learning unit 42 raises the weight value of a nonlinear function that has a high probability of ranking the synonym at the top and lowers the weight value of a nonlinear function for which that probability is low. The synonym estimation device 10 can thereby improve the accuracy with which it identifies synonym character strings.

Further, in the synonym estimation device 10, the generation unit 41 generates the nonlinear functions while varying the number of feature values within a range not exceeding a predetermined value. This prevents the functions that weight the feature values from becoming unnecessarily long and also suppresses the generation of expressions in which each individual feature value has little influence, so expressions effective for identifying synonyms can be generated.

[5. Others]
Some embodiments of the present application have been described in detail above with reference to the drawings. These are examples, and the present invention can be implemented in other forms with various modifications and improvements based on the knowledge of those skilled in the art, including the aspects described in the disclosure of the invention.

For example, the above embodiment describes the case in which the highest-ranked synonym candidate is transmitted as a suggestion query, but the present invention is not limited to this. For example, an OR search may be performed using the query together with the synonyms ranked within a predetermined number of top positions.
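The OR-search variation can be sketched as follows. The query syntax (quoted terms joined by `OR`) is an assumption about a hypothetical search engine, and the function name `build_or_query` and the sample terms are illustrative only.

```python
def build_or_query(query, ranked_synonyms, top_n=3):
    """Combine the original query with the top_n ranked synonyms into one OR expression.

    The quoting and the literal OR operator are assumptions about the target
    search engine's query syntax.
    """
    terms = [query] + [s for s in ranked_synonyms[:top_n] if s != query]
    return " OR ".join('"{}"'.format(t) for t in terms)


print(build_or_query("notebook", ["laptop", "note PC", "tablet"], top_n=2))
# prints "notebook" OR "laptop" OR "note PC"
```

Here `top_n` plays the role of the predetermined rank within which synonyms are included in the search.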

The synonym estimation device 10 described above may be implemented by a plurality of server computers, and its configuration can be changed flexibly; for example, depending on the function, it may be implemented by calling an external platform or the like through an API (Application Programming Interface) or network computing.

For example, in the above embodiment the synonym estimation device 10 both searches for synonym candidates for the input query and derives the feature values of the candidates, but the configuration is not limited to this. The search for synonym candidates and the derivation of their feature values may be performed by another server device such as a Web server, in which case the synonym estimation device 10 receives the feature data 31 from that server device and returns the highest-ranked synonym to it.

The term "means" recited in the claims may be read as "unit" (section, module), "circuit", or the like. For example, the search means may be read as a search unit or a search circuit.

DESCRIPTION OF SYMBOLS
10 Synonym estimation device
21 Storage unit
22 Control unit
30 Learning data
31 Feature data
32 Function data
40 Reception unit
41 Generation unit
42 Learning unit
43 Search unit
44 Derivation unit
45 Specifying unit
46 Transmission unit

Claims

1. A synonym estimation device comprising:
storage means for storing learning data in which character strings that can be regarded as synonyms are stored together with information indicating their closeness in meaning;
generating means for generating a plurality of nonlinear functions, each of which calculates a new feature value by performing a weighting operation on a plurality of feature values, while varying the number of feature values and the combination of feature values;
learning means for learning, based on the learning data, a weighting for the plurality of feature values and for the new feature values respectively calculated by the plurality of nonlinear functions, and a weighting for the feature values combined in each nonlinear function, such that the character strings can be determined in order of closeness in meaning; and
specifying means for calculating, for each of a plurality of synonym candidates retrieved for a query for which a search is requested, new feature values from the plurality of feature values of the synonym candidate by using each nonlinear function in which the feature values are weighted with the weighting for the combined feature values learned by the learning means, and for specifying, from among the plurality of synonym candidates, a synonym close in meaning to the query based on a result of operating on the plurality of feature values and the calculated new feature values with the weighting learned by the learning means.

2. The synonym estimation device according to claim 1, wherein the learning means raises the weight value of a nonlinear function having a high probability of enabling identification at the highest rank and lowers the weight value of a nonlinear function for which that probability is low.

3. The synonym estimation device according to claim 1, wherein the generating means generates the nonlinear functions while varying the number of feature values within a range not exceeding a predetermined value.

4. A synonym estimation method executed by a computer, the method comprising:
a storage step of storing learning data in which character strings that can be regarded as synonyms are stored together with information indicating their closeness in meaning;
a generation step of generating a plurality of nonlinear functions, each of which calculates a new feature value by performing a weighting operation on a plurality of feature values, while varying the number of feature values and the combination of feature values;
a learning step of learning, based on the learning data, a weighting for the plurality of feature values and for the new feature values respectively calculated by the plurality of nonlinear functions, and a weighting for the feature values combined in each nonlinear function, such that the character strings can be determined in order of closeness in meaning; and
a specifying step of calculating, for each of a plurality of synonym candidates retrieved for a query for which a search is requested, new feature values from the plurality of feature values of the synonym candidate by using each nonlinear function in which the feature values are weighted with the weighting for the combined feature values learned in the learning step, and of specifying, from among the plurality of synonym candidates, a synonym close in meaning to the query based on a result of operating on the plurality of feature values and the calculated new feature values with the weighting learned in the learning step.

5. A synonym estimation program causing a computer to execute:
a storage procedure of storing learning data in which character strings that can be regarded as synonyms are stored together with information indicating their closeness in meaning;
a generation procedure of generating a plurality of nonlinear functions, each of which calculates a new feature value by performing a weighting operation on a plurality of feature values, while varying the number of feature values and the combination of feature values;
a learning procedure of learning, based on the learning data, a weighting for the plurality of feature values and for the new feature values respectively calculated by the plurality of nonlinear functions, and a weighting for the feature values combined in each nonlinear function, such that the character strings can be determined in order of closeness in meaning; and
a specifying procedure of calculating, for each of a plurality of synonym candidates retrieved for a query for which a search is requested, new feature values from the plurality of feature values of the synonym candidate by using each nonlinear function in which the feature values are weighted with the weighting for the combined feature values learned in the learning procedure, and of specifying, from among the plurality of synonym candidates, a synonym close in meaning to the query based on a result of operating on the plurality of feature values and the calculated new feature values with the weighting learned in the learning procedure.