JP2016091374A - Parameter estimation device, category assignment device, method, and program

Info

Publication number: JP2016091374A
Application number: JP2014226385A
Authority: JP
Inventors: Tomoharu Iwata; Tomoya Yoshikawa
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-11-06
Filing date: 2014-11-06
Publication date: 2016-05-23

Abstract

PROBLEM TO BE SOLVED: To accurately assign a category to data. SOLUTION: A latent-vector estimation unit 32 estimates the latent vector x_v of each word included in the multiset of each document, on the basis of a parameter a_i indicating the degree to which each document in a document set D_tr contributes to category assignment, the category t_i assigned to each document, and the document set D_tr, such that an objective function of the latent vectors x_v and the parameters a_i, which indicates the degree of effectiveness for document categorization, is optimized. A discriminant-function estimation unit 34 estimates the parameter a_i that optimizes the objective function on the basis of the latent vectors x_v, the category t_i assigned to each document, and the document set D_tr. These estimations are repeated until a predetermined termination condition is satisfied. SELECTED DRAWING: Figure 1

Description

The present invention relates to a parameter estimation device, a category assignment device, a method, and a program, and more particularly to a parameter estimation device, a category assignment device, a method, and a program for assigning categories to data.

In recent years, large amounts of document data, such as papers, newspaper articles, and web pages, have been accumulated. Because it is difficult for a user to read through and classify every document, there is demand for a device that automatically classifies such documents into the desired categories. Such a device would make it possible not only to classify and organize documents efficiently but also to search efficiently for the documents a user needs.

In many cases, a document is represented as a set of words. Conventional document classification methods classify documents on the basis of the similarity between sets of words (for example, Non-Patent Document 1). However, words that differ in surface form but are similar in meaning (for example, 「テスト」 and 「試験」, both of which mean "test") are not reflected in the similarity, so documents may not be classified correctly.

One known way to solve this problem is to introduce latent vectors of words, so that words with different surface forms but similar meanings are reflected in the similarity. Known methods for obtaining latent vector representations of words include, for example, singular value decomposition (SVD) (for example, Non-Patent Document 2) and word2vec (for example, Non-Patent Document 3).

Japanese Patent Application Laid-Open No. 2013-246795

Thorsten Joachims, “Text categorization with Support Vector Machines: Learning with many relevant features,” Lecture Notes in Computer Science, Volume 1398, pp. 137-142, 1998.
Jun Zhu, Amr Ahmed, and Eric P. Xing, “MedLDA: maximum margin supervised topic models for regression and classification,” Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009.
Krikamol Muandet, et al., “Learning from distributions via support measure machines,” Advances in Neural Information Processing Systems, 2012.

However, in the methods of Non-Patent Documents 2 and 3, the latent vectors of words are computed in a preprocessing step that is independent of the classification task, so they contribute little to document classification accuracy.

The present invention has been made to solve the above problem, and an object thereof is to provide a parameter estimation device, method, and program capable of estimating the parameters of a discriminant function for accurately assigning categories to data.

Another object is to provide a category assignment device, method, and program capable of accurately assigning categories to data.

In order to achieve the above object, the parameter estimation device according to the first invention receives as input a data set whose data items are each assigned a category in advance and are each representable as a multiset of discrete features, and includes: an initialization unit that initializes the latent vector of each discrete feature and the parameter representing the degree to which each data item of the data set contributes to category assignment, both of which are used in a discriminant function that identifies the category to be assigned to data by using the latent vector of each discrete feature contained in the multiset of each data item of the data set, the parameter representing the degree to which each data item of the data set contributes to category assignment, the category assigned to each data item of the data set, and the data set; a latent vector estimation unit that estimates the latent vector of each discrete feature, on the basis of the parameter representing the degree to which each data item of the data set contributes to category assignment, the category assigned to each data item of the data set, and the data set, so as to optimize an objective function that depends on the latent vectors and on the contribution parameters and that represents the degree of effectiveness for category classification of data; a discriminant function estimation unit that estimates the parameter representing the degree to which each data item of the data set contributes to category assignment, on the basis of the latent vector of each discrete feature, the category assigned to each data item of the data set, and the data set, so as to optimize the objective function; and an iteration determination unit that repeats the estimation by the latent vector estimation unit and the estimation by the discriminant function estimation unit until a predetermined termination condition is satisfied.

In the parameter estimation device according to the first invention, the discriminant function may be represented by f(w_k^*) given by the following equation, and the objective function may be represented by L(A, X, γ) given by the following equation.

Here, w represents the multiset of a data item; w^* represents the multiset of a data item to which no category has been assigned; A represents the set of parameters a_i, each representing the degree to which a data item of the data set contributes to category assignment, with 0 ≤ a_i ≤ C; t_i represents the category assigned in advance to a data item; X represents the set of latent vectors x_v of the discrete features v; x_v ∈ R^q represents the q-dimensional latent vector of discrete feature v; γ is a parameter for controlling the value of the similarity between discrete features; and K(w_i, w_j; X, γ) is a kernel function representing the similarity between data item i and data item j.
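The equations referred to above are not reproduced in this text. As a rough guide only, one form that is consistent with the surrounding definitions would be the following. It assumes a standard soft-margin support vector machine dual without a bias term, plus an L2 penalty on the latent vectors weighted by the predetermined parameter ρ > 0 mentioned in the embodiment. N_tr (the number of labeled data items) and V (the set of discrete features) follow the notation of the embodiment. The penalty term and the absence of a bias term are assumptions, not statements of the patent's actual equations.

```latex
f(w^{*}) = \sum_{i=1}^{N_{tr}} a_i \, t_i \, K(w_i, w^{*}; X, \gamma)

L(A, X, \gamma) = \sum_{i=1}^{N_{tr}} a_i
  - \frac{1}{2} \sum_{i=1}^{N_{tr}} \sum_{j=1}^{N_{tr}}
      a_i a_j t_i t_j \, K(w_i, w_j; X, \gamma)
  + \frac{\rho}{2} \sum_{v \in V} \lVert x_v \rVert^{2}
```

Under this assumed form, L is maximized with respect to A (subject to 0 ≤ a_i ≤ C) and minimized with respect to X and γ, which matches the alternating estimation described in the embodiment.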

The category assignment device according to the second invention includes: an input unit that receives an unknown data set, which is a set of data items representable as multisets of discrete features and whose categories are unknown; and an assignment unit that assigns a category to each data item of the unknown data set received by the input unit in accordance with the discriminant function, on the basis of the latent vector of each discrete feature and the parameter representing the degree to which each data item of the data set contributes to category assignment, both estimated by the parameter estimation device according to the first invention, the category assigned to each data item of the data set, and the data set.

The parameter estimation method according to the first invention operates on an input data set whose data items are each assigned a category in advance and are each representable as a multiset of discrete features, and includes: a step in which an initialization unit initializes the latent vector of each discrete feature and the parameter representing the degree to which each data item of the data set contributes to category assignment, both of which are used in a discriminant function that identifies the category to be assigned to data by using the latent vector of each discrete feature contained in the multiset of each data item of the data set, the parameter representing the degree to which each data item of the data set contributes to category assignment, the category assigned to each data item of the data set, and the data set; a step in which a latent vector estimation unit estimates the latent vector of each discrete feature, on the basis of the parameter representing the degree to which each data item of the data set contributes to category assignment, the category assigned to each data item of the data set, and the data set, so as to optimize an objective function that depends on the latent vectors and on the contribution parameters and that represents the degree of effectiveness for category classification of data; a step in which a discriminant function estimation unit estimates the parameter representing the degree to which each data item of the data set contributes to category assignment, on the basis of the latent vector of each discrete feature, the category assigned to each data item of the data set, and the data set, so as to optimize the objective function; and a step in which an iteration determination unit repeats the estimation by the latent vector estimation unit and the estimation by the discriminant function estimation unit until a predetermined termination condition is satisfied.

In the parameter estimation method according to the first invention, the discriminant function may be represented by f(w_k^*) given by the following equation, and the objective function may be represented by L(A, X, γ) given by the following equation.

Here, x_v ∈ R^q represents the q-dimensional latent vector of discrete feature v; γ is a parameter for controlling the value of the similarity between discrete features; and K(w_i, w_j; X, γ) is a kernel function representing the similarity between data item i and data item j.

The category assignment method according to the second invention includes: a step in which an input unit receives an unknown data set, which is a set of data items representable as multisets of discrete features and whose categories are unknown; and a step in which an assignment unit assigns a category to each data item of the unknown data set received by the input unit in accordance with the discriminant function, on the basis of the latent vector of each discrete feature and the parameter representing the degree to which each data item of the data set contributes to category assignment, both estimated by the parameter estimation device according to claim 1 or 2, the category assigned to each data item of the data set, and the data set.

The program according to the third invention is a program for causing a computer to function as each unit of the parameter estimation device according to the first invention or of the category assignment device according to the second invention.

According to the parameter estimation device, method, and program of the present invention, the latent vectors are estimated, on the basis of the parameter representing the degree to which each data item of the data set contributes to category assignment, the category assigned to each data item, and the data set, so as to optimize an objective function that depends on the latent vector of each discrete feature contained in the multiset of each data item and on the contribution parameters and that represents the degree of effectiveness for category classification of data. The parameter representing the degree of contribution to category assignment is then estimated, on the basis of the latent vectors, the category assigned to each data item, and the data set, so as to optimize the objective function. Repeating these estimations until a predetermined termination condition is satisfied makes it possible to estimate the parameters of a discriminant function for accurately assigning categories.

According to the category assignment device, method, and program, a category is assigned to each data item of an unknown data set in accordance with the discriminant function, on the basis of the latent vector of each discrete feature contained in the multiset of each data item of the data set and the parameter representing the degree to which each data item of the data set contributes to category assignment, both estimated by the parameter estimation device, the category assigned to each data item, and the data set. This makes it possible to assign categories to data accurately.

Block diagram showing the functional configuration of the parameter estimation device and the category assignment device according to the present embodiment.
Flowchart showing the parameter estimation processing routine in the parameter estimation device according to the present embodiment.
Flowchart showing the category assignment processing routine in the category assignment device according to the present embodiment.
Graph showing an example of experimental results in which categories were assigned to documents using the method according to the present embodiment.
Schematic diagram showing an example of a visualization of experimental results in which categories were assigned to documents using the method according to the present embodiment.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

<Configuration of the Parameter Estimation Device According to the Embodiment of the Present Invention>

The configuration of the parameter estimation device according to the embodiment of the present invention will now be described. The present embodiment is described using, as an example, a case where the present invention is applied to a parameter estimation device that estimates parameters for assigning categories to documents. However, the present invention can be applied in the same way to any input data that can be represented as a multiset of discrete features. For example, the present invention can also be applied to images, music, and the like.

As shown in FIG. 1, the parameter estimation device 100 according to the embodiment of the present invention can be implemented by a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing a parameter estimation processing routine described later. Functionally, as shown in FIG. 1, the parameter estimation device 100 includes an input unit 10, a calculation unit 20, and a parameter set storage unit 40.

The input unit 10 receives, as input data, a document set D_tr = {(w_i, t_i)}_{i=1}^{N_tr} in which a category has been assigned to each document in advance. Here, w_i is the multiset of discrete features representing the i-th document (i = 1, ..., N_tr). Words, for example, can be used as the discrete features. In this case, w_i is a subset of the vocabulary set V. Further, t_i ∈ {+1, -1} represents the binary category label of document i. Multi-valued category labels can also be handled, for example by the approaches known as one-versus-one and one-versus-rest.
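As a small illustration of the representation described above, the following sketch builds word multisets with Python's `collections.Counter` and converts multi-valued labels into binary ones in a one-versus-rest fashion. The whitespace tokenizer, the sample sentences, and the names `to_multiset` and `one_vs_rest_labels` are assumptions made for illustration and do not come from the patent.

```python
from collections import Counter


def to_multiset(text):
    """Represent a document as a multiset (bag) of its words.

    Whitespace tokenization is an assumption made for this sketch.
    """
    return Counter(text.lower().split())


def one_vs_rest_labels(labels, positive):
    """Map multi-valued labels to {+1, -1} for one binary subproblem."""
    return [+1 if label == positive else -1 for label in labels]


# Hypothetical labeled document set D_tr = {(w_i, t_i)} with t_i in {+1, -1}.
raw_docs = [
    ("the exam was hard", +1),
    ("the test was a hard test", +1),
    ("stocks fell today", -1),
]
D_tr = [(to_multiset(text), label) for text, label in raw_docs]

# Vocabulary set V: every word that occurs in some document.
vocabulary = sorted(set().union(*(w for w, _ in D_tr)))
```

A `Counter` keeps the multiplicity of repeated words, which a plain `set` would discard.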

The calculation unit 20 includes an initialization unit 30, a latent vector estimation unit 32, a discriminant function estimation unit 34, and an iteration determination unit 36.

The initialization unit 30 initializes the latent vector x_v (v ∈ V) of each word v, the parameter a_i (i = 1, ..., N_tr) representing the degree to which each document of the document set D_tr contributes to category assignment, and the parameter γ. These are used in the discriminant function of equation (1) below, which identifies the category to be assigned to a document by using the latent vector x_v of each word v contained in the word multisets of the documents of the document set D_tr received by the input unit 10, the parameters a_i, the category t_i assigned to each document of the document set D_tr, and the document set D_tr.

Here, w represents the multiset of words representing a document, and w^* represents the multiset of words representing a document to which no category has been assigned. X = {x_v}_{v ∈ V} represents the set of latent vectors of the words v, and x_v ∈ R^q represents the q-dimensional latent vector of word v. K(w_i, w_j; X, γ) is a kernel function that takes the set X of latent vectors x_v as input and represents the degree of similarity between document i and document j.

In the present embodiment, the initialization unit 30 randomly initializes the latent vector x_v of each word v and the parameter γ. It also sets the parameter a_i, which represents the degree to which each document of the document set D_tr contributes to category assignment, to C. C is a predetermined parameter that determines how far misclassification is tolerated.
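The initialization described above might be sketched as follows. The Gaussian scale of 0.1, the range used for γ, and the function name `initialize` are assumptions, since the text only states that x_v and γ are initialized randomly and that each a_i is set to C.

```python
import random


def initialize(vocabulary, n_docs, q, C, seed=0):
    """Randomly initialize X and gamma, and set every a_i to C."""
    rng = random.Random(seed)
    # One q-dimensional latent vector per word; small Gaussian noise is an assumption.
    X = {v: [rng.gauss(0.0, 0.1) for _ in range(q)] for v in vocabulary}
    # gamma must be positive; the sampling range is an assumption.
    gamma = rng.uniform(0.5, 1.5)
    # Contribution parameters a_i start at the upper bound C.
    A = [C] * n_docs
    return X, gamma, A
```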

The latent vector estimation unit 32 estimates the latent vector x_v of each word v and the parameter γ so as to optimize the objective function of equation (2) below, which depends on the latent vectors x_v and the parameters a_i and represents the degree of effectiveness for category classification of documents. The estimation is based on the parameter a_i representing the degree to which each document of the document set D_tr contributes to category assignment, as initialized by the initialization unit 30 or estimated by the discriminant function estimation unit 34, the category t_i assigned to each document of the document set D_tr, and the document set D_tr.

Here, A = {a_i}_{i=1}^{N_tr} is the set of parameters a_i representing the degree to which each document of the document set D_tr contributes to category assignment, and the domain of a_i is 0 ≤ a_i ≤ C. When a_i = 0, document i does not contribute to category assignment. ρ > 0 is a predetermined parameter. As the kernel function, for example, equation (3) below can be used.

Here, γ > 0 is a parameter that controls the value of the similarity between words. Besides the kernel function of equation (3), any positive semidefinite kernel function, such as a linear kernel or a polynomial kernel, can be used. The latent vector estimation unit 32 performs estimation by updating the latent vectors and the parameter so that the kernel value between documents belonging to the same category becomes large and the kernel value between documents belonging to different categories becomes small. This is achieved by minimizing the objective function of equation (2) with the set A of parameters a_i held fixed. For example, the latent vector x_v of each word and the parameter γ can be estimated with an optimization method such as a quasi-Newton method. Given the latent vector x_v of each word v, the objective function attains its optimal value when the category t_i is predicted correctly for the word multisets w_i of all documents in the category-labeled document set D_tr.
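Equation (3) itself is not reproduced in this text. The sketch below shows one plausible kernel between word multisets that uses latent vectors. It assumes a Gaussian kernel between word vectors, exp(-(γ/2)·||x_u - x_v||²), averaged over all word pairs of the two documents, in the style of support measure machines. The names `word_kernel` and `doc_kernel` and the exact functional form are assumptions.

```python
import math
from collections import Counter


def word_kernel(x_u, x_v, gamma):
    """Gaussian similarity between two latent vectors (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_u, x_v))
    return math.exp(-0.5 * gamma * sq_dist)


def doc_kernel(w_i, w_j, X, gamma):
    """Average word-pair similarity between two word multisets (Counters)."""
    n_i = sum(w_i.values())
    n_j = sum(w_j.values())
    total = 0.0
    for u, count_u in w_i.items():
        for v, count_v in w_j.items():
            # Each pair is weighted by how often the two words occur.
            total += count_u * count_v * word_kernel(X[u], X[v], gamma)
    return total / (n_i * n_j)
```

With such a kernel, two documents that share no word can still be similar when their words have nearby latent vectors, which is the effect the latent vectors are meant to provide.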

The discriminant function estimation unit 34 estimates the parameter a_i representing the degree to which each document of the document set D_tr contributes to category assignment so as to optimize the objective function, on the basis of the latent vector x_v of each word v and the parameter γ estimated by the latent vector estimation unit 32, the category t_i assigned to each document of the document set D_tr, and the document set D_tr.

Intuitively, the discriminant function estimation unit 34 finds the set A of parameters a_i in equation (2) such that, for each document i, the sign of the discriminant function of equation (1) matches the category t_i.

This is achieved by maximizing the objective function of equation (2) with the set X of latent vectors x_v and the parameter γ, as estimated by the latent vector estimation unit 32, held fixed. For example, the parameters of the discriminant function can be estimated with an optimization method such as quadratic programming or SMO (Sequential Minimal Optimization).
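To make the maximization over A concrete, here is a minimal sketch that substitutes plain coordinate ascent for the quadratic programming or SMO solvers named above. It assumes that the part of the objective depending on A is the standard support vector machine dual, Σ a_i - ½ ΣΣ a_i a_j t_i t_j K_ij, with the box constraint 0 ≤ a_i ≤ C and no bias term. That form, the function names, and the precomputed kernel matrix `K` are assumptions.

```python
def dual_objective(A, K, t):
    """Assumed A-dependent part of the objective (standard SVM dual)."""
    n = len(t)
    quad = sum(A[i] * A[j] * t[i] * t[j] * K[i][j]
               for i in range(n) for j in range(n))
    return sum(A) - 0.5 * quad


def estimate_contributions(K, t, C, n_sweeps=100):
    """Coordinate ascent on the dual with the box constraint 0 <= a_i <= C."""
    n = len(t)
    A = [0.0] * n
    for _ in range(n_sweeps):
        for i in range(n):
            if K[i][i] <= 0.0:
                continue  # skip degenerate coordinates
            margin = t[i] * sum(A[j] * t[j] * K[i][j] for j in range(n))
            grad = 1.0 - margin  # partial derivative of the dual w.r.t. a_i
            # Exact one-dimensional maximizer, clipped to the box.
            A[i] = min(C, max(0.0, A[i] + grad / K[i][i]))
    return A
```

Each coordinate update never decreases the dual objective, so the sketch is easy to reason about, although a dedicated solver would typically converge faster on realistic problem sizes.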

The iteration determination unit 36 repeats the estimation by the latent vector estimation unit 32 and the estimation by the discriminant function estimation unit 34 until a predetermined termination condition is satisfied. As the termination condition, the number of iterations or the magnitude of the change in the objective function, for example, can be used, and the updates are repeated until the value of the objective function converges. When the objective function is determined to have converged, the set X of latent vectors x_v and the parameter γ estimated by the latent vector estimation unit 32, the set A of parameters a_i representing the degree to which each document of the document set D_tr contributes to category assignment, estimated by the discriminant function estimation unit 34, and the document set D_tr received by the input unit 10 are stored in the parameter set storage unit 40.
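The alternating procedure and its termination conditions can be sketched generically as below. The callables `update_latent`, `update_weights`, and `objective` are hypothetical placeholders for the two estimation steps and for the objective function. The loop stops either after `max_iters` iterations or when the objective changes by less than `tol`, corresponding to the two example conditions mentioned above.

```python
def alternate(state, update_latent, update_weights, objective,
              max_iters=50, tol=1e-6):
    """Alternate two estimation steps until the objective stops changing."""
    previous = objective(state)
    for iteration in range(1, max_iters + 1):
        state = update_latent(state)    # estimate X and gamma with A fixed
        state = update_weights(state)   # estimate A with X and gamma fixed
        current = objective(state)
        if abs(current - previous) < tol:
            return state, iteration     # objective has converged
        previous = current
    return state, max_iters             # iteration budget exhausted
```

Here `state` would bundle X, γ, and A, and the returned state is what would be written to the parameter set storage unit.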

<Configuration of the Category Assignment Device According to the Embodiment of the Present Invention>

Next, the configuration of the category assignment device according to the embodiment of the present invention will be described. As shown in FIG. 2, the category assignment device 300 according to the embodiment of the present invention can be implemented by a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing a category assignment routine described later. Functionally, as shown in FIG. 2, the category assignment device 300 includes an input unit 310, a calculation unit 320, a parameter set storage unit 40, and an output unit 350.

The input unit 310 receives, as input data, an unknown document set D_te = {w_k^*}_{k=1}^{N_te} whose categories are unknown.

The calculation unit 320 includes a reading unit 330 and an assignment unit 332.

The reading unit 330 reads, from the parameter set storage unit 40, the set X of latent vectors x_v of the words v, the parameter γ, the set A of parameters a_i representing the degree to which each document of the document set D_tr contributes to category assignment, and the document set D_tr.

The assignment unit 332 assigns a category to each document of the unknown document set D_te in accordance with the discriminant function of equation (1), on the basis of the set X of latent vectors x_v of the words v, the parameter γ, and the set A of parameters a_i representing the degree to which each document of the document set D_tr contributes to category assignment, all read by the reading unit 330, the category assigned to each document of the document set D_tr, and the document set D_tr. In the present embodiment, the value of the discriminant function is computed with equation (1) for the word multiset w_k^* (k = 1, ..., N_te) of each unknown document to which no category has been assigned. When f(w_k^*) ≥ 0, the category y_k of the word multiset w_k^* of the unknown document is +1; otherwise it is -1.
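The sign rule above can be sketched as follows, assuming that the discriminant function is a kernel expansion f(w_k^*) = Σ a_i t_i K(w_i, w_k^*) without a bias term. Equation (1) is not reproduced in this text, so this form is an assumption. `K_te[k][i]` is a hypothetical precomputed kernel value between training document i and unknown document k.

```python
def discriminant(kernel_row, A, t):
    """Value of the assumed discriminant function for one unknown document."""
    return sum(a_i * t_i * k_i for a_i, t_i, k_i in zip(A, t, kernel_row))


def assign_categories(K_te, A, t):
    """Assign +1 when f >= 0 and -1 otherwise, as described in the text."""
    return [+1 if discriminant(row, A, t) >= 0 else -1 for row in K_te]
```

Training documents with a_i = 0 drop out of the sum, which reflects the statement that such documents do not contribute to category assignment.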

The assignment unit 332 then outputs, via the output unit 350, the set of category labels y_k (k = 1, ..., N_te) obtained by this calculation for the word multisets w_k^* of the unknown documents.
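The decision rule above can be illustrated with a minimal Python sketch. Equation (1) is not reproduced in this section, so the sketch assumes the usual kernel-machine form f(w*) = Σ_i a_i t_i K(w_i, w*; X, γ) + b, and it further assumes that K is the mean pairwise Gaussian similarity between the latent vectors of the words in the two multisets. The function names, the `bias` argument, and the toy data are all hypothetical and serve only as illustration.

```python
import math

def rbf(x, y, gamma):
    """Assumed word-level similarity: Gaussian RBF on two latent vectors."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(x, y))
    return math.exp(-0.5 * gamma * sq_dist)

def multiset_kernel(w_i, w_j, X, gamma):
    """Assumed form of K(w_i, w_j; X, gamma): mean similarity over all word pairs.

    Both multisets must be non-empty and every word must have an entry in X.
    """
    total = sum(rbf(X[u], X[v], gamma) for u in w_i for v in w_j)
    return total / (len(w_i) * len(w_j))

def discriminant(w_new, train_docs, a, t, X, gamma, bias=0.0):
    """Assumed form of equation (1): weighted sum of kernels against D_tr."""
    return bias + sum(
        a_i * t_i * multiset_kernel(w_i, w_new, X, gamma)
        for w_i, a_i, t_i in zip(train_docs, a, t)
    )

def assign_category(w_new, train_docs, a, t, X, gamma, bias=0.0):
    """Return +1 when f(w*) >= 0 and -1 otherwise, as stated in the text."""
    return 1 if discriminant(w_new, train_docs, a, t, X, gamma, bias) >= 0 else -1

# Toy data (hypothetical): two-dimensional latent vectors for four words.
X = {"goal": (1.0, 0.0), "match": (0.9, 0.1),
     "stock": (-1.0, 0.0), "market": (-0.9, -0.1)}
train_docs = [["goal", "match"], ["stock", "market"]]  # word multisets of D_tr
t = [1, -1]        # categories t_i
a = [1.0, 1.0]     # contribution parameters a_i
gamma = 1.0

labels = [assign_category(w, train_docs, a, t, X, gamma)
          for w in (["match", "goal", "goal"], ["market"])]
```

A repeated word such as "goal" simply appears more than once in the list, which is how the multiset structure enters the kernel average.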

<Operation of the parameter estimation device according to the embodiment of the present invention>

Next, the operation of the parameter estimation device 100 according to the embodiment of the present invention will be described. When the input unit 10 receives a document set D_tr to which categories have been assigned, the parameter estimation device 100 executes the parameter estimation processing routine shown in FIG. 2.

First, in step S100, the document set D_tr received by the input unit 10 is acquired.

Next, in step S102, the following are initialized: the latent vector x_v of each word v included in the word multiset of each document in the document set D_tr acquired in step S100, the parameter a_i representing the degree to which each document in D_tr contributes to category assignment, and the parameter γ.

In step S104, the latent vector x_v of each word v and the parameter γ are estimated based on the parameters a_i representing the degree to which each document in D_tr contributes to category assignment (as initialized in step S102 or as estimated in step S106, described later), the category t_i assigned to each document in D_tr, and the document set D_tr. The estimation is performed so that the latent vectors x_v and the parameters a_i optimize the objective function of equation (2) above, which represents the degree of effectiveness for document categorization.

In step S106, the parameter a_i representing the degree to which each document in D_tr contributes to category assignment is estimated so as to optimize the objective function, based on the latent vector x_v of each word v and the parameter γ estimated in step S104, the category t_i assigned to each document in D_tr, and the document set D_tr.

In step S108, it is determined whether the objective function has converged. If it has not converged, the processing of steps S104 to S106 is repeated. If it has converged, the set X of latent vectors x_v and the parameter γ estimated in step S104, the set A of parameters a_i estimated in step S106, and the document set D_tr acquired in step S100 are stored in the parameter set storage unit 40, and the processing ends.
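The control flow of steps S102 to S108 can be sketched as a generic alternating loop. This is a minimal Python sketch, not the routine of FIG. 2 itself: the update rules are passed in as callables because equation (2) is not reproduced in this section, and the names `alternate_until_converged`, `update_latent`, `update_a`, `tol`, and `max_iter` are hypothetical. Convergence is assumed to mean that the objective value changes by less than `tol` between iterations.

```python
def alternate_until_converged(latent, a, update_latent, update_a, objective,
                              tol=1e-6, max_iter=1000):
    """Alternate two partial updates until the objective stops changing.

    latent : current (X, gamma)-like parameters, initialized as in S102
    a      : current contribution parameters a_i, initialized as in S102
    Returns the final parameters and the number of iterations performed.
    """
    previous = objective(latent, a)
    for iteration in range(1, max_iter + 1):
        latent = update_latent(latent, a)   # corresponds to S104: a held fixed
        a = update_a(latent, a)             # corresponds to S106: latent held fixed
        current = objective(latent, a)      # corresponds to S108: convergence check
        if abs(current - previous) < tol:
            return latent, a, iteration
        previous = current
    return latent, a, max_iter

# Toy stand-in objective (hypothetical): g(x, y) = (x - y)^2 + (y - 2)^2,
# minimized exactly in each coordinate in turn. The optimum is x = y = 2.
def toy_objective(x, y):
    return (x - y) ** 2 + (y - 2.0) ** 2

def toy_update_x(x, y):
    return y                    # argmin over x with y fixed

def toy_update_y(x, y):
    return (x + 2.0) / 2.0      # argmin over y with x fixed

x_hat, y_hat, n_iter = alternate_until_converged(
    0.0, 0.0, toy_update_x, toy_update_y, toy_objective)
```

The toy problem minimizes in both coordinates; whether each step of the actual method maximizes or minimizes the objective depends on equation (2) and is left open here.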

<Operation of the category assignment device according to the embodiment of the present invention>

Next, the operation of the category assignment device 300 according to the embodiment of the present invention will be described. When the input unit 310 receives an unknown document set D_te, the category assignment device 300 executes the category assignment processing routine shown in FIG. 3.

First, in step S300, the unknown document set D_te received by the input unit 310 is acquired.

Next, in step S302, the set X of latent vectors x_v of each word v, the parameter γ, the set A of parameters a_i representing the degree to which each document in the document set D_tr contributes to category assignment, and the document set D_tr are read from the parameter set storage unit 40.

In step S304, a category is assigned to each document in the unknown document set D_te according to the discriminant function of equation (1) above, based on the set X of latent vectors x_v of each word v, the parameter γ, and the set A of parameters a_i read in step S302, together with the category assigned to each document in the document set D_tr and the document set D_tr itself.

Then, in step S306, the category assignment result of step S304 is output by the output unit 350, and the processing ends.

<Experimental results>

Next, the results of experiments conducted with the method according to the present embodiment will be described.

To evaluate the method according to the present embodiment, experiments were conducted on three data sets (WebKB, Reuters-21578, and 20 Newsgroups). The following comparison methods were used: MedLDA (Maximum entropy discrimination Latent Dirichlet Allocation; see Non-Patent Document 2 above); SVD+SMM, which obtains vector representations of words by singular value decomposition (SVD) and then classifies with an SMM (Support Measure Machine; see Non-Patent Document 3 above); word2vec+SMM, which obtains vector representations of words with word2vec (Non-Patent Document 4: Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.) and then classifies with an SMM; and SVMs (Support Vector Machines; Non-Patent Document 5: Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." Machine learning 20.3 (1995): 273-297.) using an RBF kernel and a second-order polynomial (poly2) kernel.

FIG. 4 shows the classification accuracy obtained by training with the method according to the present embodiment and with the comparison methods above as the number of documents with assigned categories is varied. On all three data sets, the method according to the present embodiment (LatentSMM) achieves the highest accuracy, which suggests that it can classify documents with high accuracy.

FIG. 5 shows an example of the word visualization obtained with the method according to the present embodiment. Words of the same category are placed close together, which suggests that words characteristic of each category can be found efficiently.

As described above, the parameter estimation device according to the present embodiment repeats two estimations until a predetermined repetition end condition is satisfied. First, it estimates the latent vectors x_v based on the parameters a_i representing the degree to which each document in the document set D_tr contributes to category assignment, the category t_i assigned to each document, and the document set D_tr, so that the latent vectors x_v of the words included in the multiset of each document and the parameters a_i optimize the objective function representing the degree of effectiveness for document categorization. Second, it estimates the parameters a_i based on the latent vectors x_v, the category t_i assigned to each document, and the document set D_tr, so as to optimize the objective function. In this way, the parameters of a discriminant function for assigning categories accurately can be estimated.

Further, the category assignment device according to the present embodiment assigns a category to each document in the unknown document set D_te according to the discriminant function, based on the latent vector x_v of each word included in the multiset of each document in the document set and the parameters a_i representing the degree to which each document in the document set contributes to category assignment, both estimated by the parameter estimation device, together with the category t_i assigned to each document and the document set D_tr. Categories can thereby be assigned to documents accurately.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the embodiment above has been described on the assumption that the data are documents, but the invention is not limited to documents. It can also be applied to data such as images and music, provided that each data item can be represented as a multiset of discrete features. One example of a local image feature is SIFT (Scale Invariant Feature Transform), which is calculated from gradient information. Multiple SIFT features are extracted from a single image, so the set whose elements are those features discretized by vector quantization is a multiset of discrete features of that image. If this multiset is used as w in the discriminant function of equation (1) above, then, as in the case of documents, the parameters of a function that identifies categories can be estimated from a set of images with assigned categories, and categories can be assigned to a set of images without assigned categories.
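The discretization step for images can be sketched as follows. This is a minimal Python sketch under stated assumptions: the codebook is taken as given (it could come from any clustering of training descriptors, which the text does not specify), nearest-codeword assignment by squared Euclidean distance is assumed as the vector quantization rule, and two-dimensional toy vectors stand in for real SIFT descriptors. The names `nearest_codeword` and `quantize` are hypothetical.

```python
from collections import Counter

def nearest_codeword(descriptor, codebook):
    """Index of the codebook entry closest to the descriptor (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, codebook[k])),
    )

def quantize(descriptors, codebook):
    """Map each local descriptor of one image to a discrete feature index.

    The returned list, duplicates included, plays the role of the multiset w.
    """
    return [nearest_codeword(d, codebook) for d in descriptors]

# Toy stand-ins (hypothetical): three codewords and four local descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
descriptors = [(0.1, -0.1), (0.9, 1.2), (0.05, 0.1), (0.1, 0.8)]

w = quantize(descriptors, codebook)   # multiset of discrete features for the image
counts = Counter(w)                   # multiplicity of each discrete feature
```

Each codeword index then takes the place of a word v, so a latent vector x_v would be estimated per codeword rather than per vocabulary term.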

10, 310 Input unit
20, 320 Calculation unit
30 Initialization unit
32 Latent vector estimation unit
34 Discriminant function estimation unit
36 Repetition determination unit
40 Parameter set storage unit
100 Parameter estimation device
300 Category assignment device
330 Reading unit
332 Assignment unit
350 Output unit

Claims

A parameter estimation device including:
an initialization unit that, for an input data set consisting of data to each of which a category has been assigned in advance and each of which can be represented as a multiset of discrete features, initializes a latent vector of each discrete feature and a parameter representing the degree to which each data item of the data set contributes to category assignment, the latent vector and the parameter being used in a discriminant function that identifies the category to be assigned to data by using the latent vector of each discrete feature included in the multiset of each data item of the data set, the parameter representing the degree to which each data item of the data set contributes to category assignment, the category assigned to each data item of the data set, and the data set;
a latent vector estimation unit that estimates the latent vector of each discrete feature, based on the parameter representing the degree to which each data item of the data set contributes to category assignment, the category assigned to each data item of the data set, and the data set, such that the latent vector of each discrete feature and the parameter representing the degree to which each data item of the data set contributes to category assignment optimize an objective function representing the degree of effectiveness for categorizing the data;
a discriminant function estimation unit that estimates the parameter representing the degree to which each data item of the data set contributes to category assignment, based on the latent vector of each discrete feature, the category assigned to each data item of the data set, and the data set, so as to optimize the objective function; and
a repetition determination unit that repeats the estimation by the latent vector estimation unit and the estimation by the discriminant function estimation unit until a predetermined repetition end condition is satisfied.

The parameter estimation device according to claim 1, wherein the discriminant function is represented by f(w_k^*) shown in the following equation:
and the objective function is represented by L(A, X, γ) shown in the following equation:
where w represents the multiset of a data item, w^* represents the multiset of a data item to which no category is assigned, A represents the set of parameters a_i representing the degree to which each data item of the data set contributes to category assignment, with 0 ≦ a_i ≦ C, t_i represents the category assigned in advance to the data item, X represents the set of latent vectors x_v of each discrete feature v,
x_v represents the q-dimensional latent vector of the discrete feature v, γ is a parameter that controls the similarity value between discrete features, and K(w_i, w_j; X, γ) is a kernel function representing the similarity between data item i and data item j.

A category assignment device including:
an input unit that receives an input unknown data set, which is a set of data each of which can be represented as a multiset of discrete features and whose assigned categories are unknown; and
an assignment unit that assigns a category, according to the discriminant function, to each data item of the unknown data set received by the input unit, based on the latent vector of each discrete feature and the parameter representing the degree to which each data item of the data set contributes to category assignment, both estimated by the parameter estimation device according to claim 1, the category assigned to each data item of the data set, and the data set.

A parameter estimation method including:
a step in which an initialization unit, for an input data set consisting of data to each of which a category has been assigned in advance and each of which can be represented as a multiset of discrete features, initializes a latent vector of each discrete feature and a parameter representing the degree to which each data item of the data set contributes to category assignment, the latent vector and the parameter being used in a discriminant function that identifies the category to be assigned to data by using the latent vector of each discrete feature included in the multiset of each data item of the data set, the parameter representing the degree to which each data item of the data set contributes to category assignment, the category assigned to each data item of the data set, and the data set;
a step in which a latent vector estimation unit estimates the latent vector of each discrete feature, based on the parameter representing the degree to which each data item of the data set contributes to category assignment, the category assigned to each data item of the data set, and the data set, such that the latent vector of each discrete feature and the parameter representing the degree to which each data item of the data set contributes to category assignment optimize an objective function representing the degree of effectiveness for categorizing the data;
a step in which a discriminant function estimation unit estimates the parameter representing the degree to which each data item of the data set contributes to category assignment, based on the latent vector of each discrete feature, the category assigned to each data item of the data set, and the data set, so as to optimize the objective function; and
a step in which a repetition determination unit repeats the estimation by the latent vector estimation unit and the estimation by the discriminant function estimation unit until a predetermined repetition end condition is satisfied.

The parameter estimation method according to claim 1, wherein the discriminant function is represented by f(w_k^*) shown in the following equation:
and the objective function is represented by L(A, X, γ) shown in the following equation:
where w represents the multiset of a data item, w^* represents the multiset of a data item to which no category is assigned, A represents the set of parameters a_i representing the degree to which each data item of the data set contributes to category assignment, with 0 ≦ a_i ≦ C, t_i represents the category assigned in advance to the data item, X represents the set of latent vectors x_v of each discrete feature v,
x_v represents the q-dimensional latent vector of the discrete feature v, γ is a parameter that controls the similarity value between discrete features, and K(w_i, w_j; X, γ) is a kernel function representing the similarity between data item i and data item j.

A category assignment method including:
a step in which an input unit receives an input unknown data set, which is a set of data each of which can be represented as a multiset of discrete features and whose assigned categories are unknown; and
a step in which an assignment unit assigns a category, according to the discriminant function, to each data item of the unknown data set received by the input unit, based on the latent vector of each discrete feature and the parameter representing the degree to which each data item of the data set contributes to category assignment, both estimated by the parameter estimation device according to claim 1, the category assigned to each data item of the data set, and the data set.

A program for causing a computer to function as each unit of the parameter estimation device according to claim 1 or claim 2, or of the category assignment device according to claim 3.