JP2008276571A

JP2008276571A - Label assignment method, label assignment device, label assignment program and storage medium

Info

Publication number: JP2008276571A
Application number: JP2007120142A
Authority: JP
Inventors: Akinori Fujino; 昭典藤野; Hideki Isozaki; 秀樹磯崎
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-04-27
Filing date: 2007-04-27
Publication date: 2008-11-13
Anticipated expiration: 2027-04-27
Also published as: JP4976912B2

Abstract

PROBLEM TO BE SOLVED: To provide a label assignment technique capable of reducing "no label" determinations for content while suppressing the amount of calculation.

SOLUTION: The label assignment device assigns, to content having at least one of a character and an image, one or more labels showing the type of the content. The device generates a score function using a feature vector and a binary classifier, generates an identification function comprising the score function and a threshold for determining whether a label should be assigned, and determines whether to assign each label based on the identification function value calculated by the identification function.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a technique for assigning, to content containing text or images, a label that indicates the type of that content.

In recent years, techniques for assigning labels to content stored in databases have been actively researched and developed. Here, content refers to items consisting of text information, such as papers, patent gazettes, online news data, and e-mail; items consisting of text information and link information, such as Web data and blog data; and items consisting of image data. In every case, the content must be representable as a feature vector.

A feature vector is a vector that represents the ratio of the occurrence frequencies of the elements (features) contained in the content. For example, if the words (features) "data", "information", and "processing" occur 3 times, 1 time, and 2 times, respectively, in a piece of content, its feature vector can be written as {3, 1, 2} or, in normalized form, as {3/6, 1/6, 2/6}.
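The word-count example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `feature_vector`, the fixed vocabulary order, and the pre-tokenized input are assumptions made for illustration and are not part of the patent.

```python
from collections import Counter

def feature_vector(tokens, vocabulary, normalize=False):
    """Count how often each vocabulary feature occurs in `tokens`.

    With normalize=True the counts are divided by their sum, giving
    the ratio form of the feature vector.
    """
    counts = Counter(tokens)
    vector = [counts[term] for term in vocabulary]
    if normalize:
        total = sum(vector)
        if total:
            vector = [count / total for count in vector]
    return vector

# Hypothetical tokenized content matching the example in the text.
vocabulary = ["data", "information", "processing"]
tokens = ["data", "processing", "data", "information", "processing", "data"]

print(feature_vector(tokens, vocabulary))                  # [3, 1, 2]
print(feature_vector(tokens, vocabulary, normalize=True))  # ratios 3/6, 1/6, 2/6
```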

A label, as noted above, indicates the type of the content; examples include "computer", "sports", "music", and "mathematics". When the content is a patent document, for example, the IPC (International Patent Classification), F-term (File Forming Term), and FI (File Index) correspond to labels.

In label assignment techniques, the set of labels to be used is generally fixed in advance. One approach uses statistical information about a set of content items whose labels have already been determined (training data) to train a multi-label assigner (a program having this function) that takes the feature vector of a content item as input and outputs an estimate of which labels to assign. The trained multi-label assigner is then used to assign one or more labels (multi-label assignment) to unlabeled content (the content to be labeled).

The multi-label assigners described in Non-Patent Documents 1 and 2 address the problem of assigning multiple labels to content by using, for each individual label, a binary classifier that takes the feature vector of the content as input and outputs whether the label should be assigned. That is, a multi-label assigner is realized by combining the binary classifiers corresponding to the individual labels. A characteristic of these assigners is that both the training of each binary classifier on the training data and the decision whether to assign a label using the trained classifier are carried out independently for each label.

In the techniques of Non-Patent Documents 1 and 2, a binary classifier is thus designed for each individual label. The binary classifier for a given label is trained on the training data, and decides whether to assign its label, by contrasting that label with the other labels. Because a binary classifier is designed per label, the computation required to train the multi-label assigner and to label content is proportional to the number of labels, which means that multi-label assignment can be performed quickly.
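The per-label design can be pictured as a loop that runs once per label. The sketch below uses toy threshold rules as stand-ins for trained binary classifiers; the rules, the label names, and the three-feature layout are hypothetical and chosen only to make the example runnable.

```python
def predict_labels(x, classifiers):
    """Run every per-label binary classifier on x, independently of the others.

    `classifiers` maps a label name to a function returning True or False.
    The loop visits each label once, so the cost grows linearly with the
    number of labels.
    """
    return [label for label, accepts in classifiers.items() if accepts(x)]

# Toy stand-ins for trained classifiers (hypothetical rules): each one
# inspects a single entry of a feature vector over ("data", "goal", "melody").
classifiers = {
    "computer": lambda x: x[0] >= 2,
    "sports": lambda x: x[1] >= 2,
    "music": lambda x: x[2] >= 2,
}

print(predict_labels([3, 0, 2], classifiers))  # ['computer', 'music']
print(predict_labels([1, 1, 1], classifiers))  # []
```

The second call shows how independent decisions can leave a content item with no label at all when every classifier says no.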

The multi-label assigners described in Non-Patent Documents 3 and 4 are realized by a multi-class classifier that, given the feature vector of the input content, directly outputs the combination of labels to be assigned. Their characteristic is that they select the optimal combination for the content from among the candidate label combinations.
Non-Patent Document 1: K. Nigam, A. McCallum, S. Thrun, and T. Mitchell: Text classification from labeled and unlabeled documents using EM, Machine Learning, 39, 103-134 (2000).
Non-Patent Document 2: T. Joachims: Text categorization with support vector machines: Learning with many relevant features, Proceedings of the 10th European Conference on Machine Learning (ECML-98), 137-142 (1998).
Non-Patent Document 3: Naonori Ueda (上田修功) and Kazumi Saito (斉藤和巳): "Probabilistic models of multi-topic text: parametric mixture models" (in Japanese), IEICE Transactions, J87-D-II (3), 872-883 (2004).
Non-Patent Document 4: Hideto Kazawa (賀沢秀人), Tomonori Izumitani (泉谷知範), Hirotoshi Taira (平博順), Eisaku Maeda (前田英作), and Hideki Isozaki (磯崎秀樹): "Multi-label learning based on the maximum margin principle" (in Japanese), IEICE Transactions, J88-D-II (11), 2246-2259 (2005).

However, the techniques of Non-Patent Documents 1 and 2 have the drawback that they often decide not to assign a label that should have been assigned. In multi-label assignment problems, the number of content items to which a given label should be assigned is typically far smaller than the number to which it should not. In that situation, a binary classifier can achieve high classification accuracy even if it decides not to assign its label to any content at all. Training therefore tends to produce binary classifiers that withhold their label from the great majority of content, and as a result the multi-label assigner often ends up assigning no label at all to a content item.
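A small numeric illustration makes the imbalance argument concrete. The figures below (10 positive items out of 1,000) are invented for this sketch and do not come from the patent; the helper name `accuracy_and_recall` is likewise hypothetical.

```python
def accuracy_and_recall(true_flags, predicted_flags):
    """Return (accuracy, recall) for one label's binary decisions."""
    pairs = list(zip(true_flags, predicted_flags))
    correct = sum(t == p for t, p in pairs)
    true_positives = sum(t and p for t, p in pairs)
    positives = sum(true_flags)
    accuracy = correct / len(pairs)
    recall = true_positives / positives if positives else 0.0
    return accuracy, recall

# Invented data: the label applies to only 10 of 1,000 content items.
true_flags = [True] * 10 + [False] * 990
# A degenerate classifier that never assigns the label.
always_no = [False] * 1000

print(accuracy_and_recall(true_flags, always_no))  # (0.99, 0.0)
```

The classifier that never assigns the label is 99% accurate yet finds none of the items that need it, which is the failure mode described above.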

In the techniques of Non-Patent Documents 3 and 4, by contrast, the multi-label assigner is designed to select the optimal label combination for a content item from the candidate combinations, as described above. Such an assigner can be designed to exclude the case in which the optimal combination contains no label at all, that is, "no label". These techniques can therefore avoid the problem, seen in Non-Patent Documents 1 and 2, of judging many content items as "no label", and can be expected to assign multiple labels more accurately.

However, this kind of multi-label assigner must handle not individual labels but an enormous number of label combinations, which grows exponentially with the number of labels. When the number of labels is reasonably large, assigning labels to content therefore requires a huge amount of computation. Non-Patent Documents 3 and 4 use approximation algorithms to cope with this, but these still require computation on the order of the square and the cube of the number of labels, respectively. The techniques of Non-Patent Documents 3 and 4 can thus take longer to assign labels than those of Non-Patent Documents 1 and 2.
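To give a feel for the gap, the sketch below compares the number of per-label decisions with the number of non-empty label combinations for K labels. The count 2^K - 1 is simply the number of non-empty subsets of a K-element set; the function name and the sample values of K are chosen for illustration only.

```python
def candidates_per_content(num_labels):
    """Return (independent per-label decisions, non-empty label combinations)."""
    independent_decisions = num_labels
    non_empty_combinations = 2 ** num_labels - 1
    return independent_decisions, non_empty_combinations

for k in (3, 18, 35):
    decisions, combinations = candidates_per_content(k)
    print(f"K={k}: {decisions} decisions vs {combinations} combinations")
```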

The present invention has been made in view of these problems. Its object is to assign one or more appropriate labels to content with a small amount of computation while reducing "no label" decisions for content.

To solve the above problems, a label assignment device that executes the label assignment method according to the present invention assigns one or more labels indicating the type of content to content having at least one of text and an image. The label assignment device includes a storage unit and a processing unit that contains a score function generation unit, a discriminant function generation unit, and a label assignment unit.
The score function generation unit uses a feature vector, which collects the occurrence frequency of each feature of the content, and a binary classifier, which is a program designed for each individual label, both read from the storage unit, to generate a score function. The score function computes, for each content item to be labeled, a score for each label that is used to decide whether that label should be assigned.
The discriminant function generation unit uses a training data set read from the storage unit, consisting of content whose labels have already been determined, to generate a discriminant function that comprises the score function and a threshold used to decide, from the score computed by the score function, whether to assign a label.
The label assignment unit uses the discriminant function to compute the discriminant function value of the content to be labeled for each label, and performs label assignment by storing each label whose discriminant function value is equal to or greater than a predetermined value in the storage unit in association with that content.

According to this invention, a score function is generated using the feature vector and the binary classifier, a discriminant function comprising that score function and a threshold for the assignment decision is generated, and whether to assign each label is decided from the value computed by the discriminant function. The computation required for label assignment can therefore be kept proportional to the number of labels, and the likelihood of an erroneous "no label" decision can be reduced.
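One simple way to read the decision rule is sketched below. The text does not give the exact form of the discriminant function, so this sketch assumes it is the score minus a per-label threshold and that the predetermined value is 0; the function names, scores, and thresholds are all hypothetical.

```python
def discriminant_values(scores, thresholds):
    """Assumed form: discriminant value = score - threshold, per label."""
    return {label: scores[label] - thresholds[label] for label in scores}

def assign_labels(scores, thresholds, cutoff=0.0):
    """Keep each label whose discriminant value is at least `cutoff`."""
    values = discriminant_values(scores, thresholds)
    return [label for label, value in values.items() if value >= cutoff]

# Hypothetical scores for one content item and hypothetical learned thresholds.
scores = {"computer": 1.2, "sports": -3.0, "music": -0.4}
thresholds = {"computer": 0.0, "sports": 0.0, "music": -0.5}

print(assign_labels(scores, thresholds))  # ['computer', 'music']
```

Lowering the threshold for "music" lets a label with a slightly negative score through, which is how a learned threshold can reduce "no label" outcomes while the loop still visits each label only once.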

In the label assignment device according to the present invention, the score function generation unit may use, as the binary classifier, a classifier based on a probabilistic model that gives the probability of assigning a label to content, trained in advance on a training data set consisting of content whose labels have already been determined. In that case, the unit uses the binary classifier read from the storage unit to compute the probability under the assumption that the label is assigned to the content to be labeled and the probability under the assumption that it is not. It then generates a score function that computes, as the score for deciding whether to assign the label, the logarithm of the former probability divided by the latter.

According to this invention, the score function generation unit can be realized appropriately by using, as the score for deciding whether to assign a label to the content to be labeled, the logarithm of the probability under the assumption that the label is assigned divided by the probability under the assumption that it is not. In other words, with this score function generation unit, a score suitable for the computations of the discriminant function generation unit can be calculated for any content.
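The log-ratio score is straightforward to compute once the two probabilities are available. The sketch below assumes they are supplied by some already trained probabilistic classifier; the function name and the small `eps` guard against taking the logarithm of zero are additions made for this illustration.

```python
import math

def log_odds_score(p_assign, p_not_assign, eps=1e-300):
    """Score = log(P(label assigned) / P(label not assigned)).

    Computed as a difference of logarithms; `eps` is an illustrative
    guard so that a probability of exactly zero does not raise an error.
    """
    return math.log(max(p_assign, eps)) - math.log(max(p_not_assign, eps))

print(log_odds_score(0.8, 0.2))  # positive: assigning the label is more probable
print(log_odds_score(0.5, 0.5))  # 0.0: both hypotheses equally probable
print(log_odds_score(0.1, 0.9))  # negative: not assigning is more probable
```

A score of zero marks the point where both hypotheses are equally probable, so a threshold on this score shifts that decision boundary.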

In the label assignment device according to the present invention, when generating the discriminant function, the discriminant function generation unit computes predicted scores from estimates of the binary classifier parameters obtained with one training item excluded from the training data set. It then estimates the parameters of the discriminant function so that the label combination assigned to the excluded training item on the basis of those predicted scores satisfies the criterion of maximizing the harmonic mean of two values: the number of correctly assigned labels divided by the total number of labels that should be assigned to that training item (recall), and the number of correctly assigned labels divided by the total number of labels actually assigned (precision).
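The leave-one-out step can be sketched generically as follows. The callables `train` and `score` are placeholders for whatever binary classifier is used, and the toy mean-based model exists only so that the example runs; none of these names come from the patent, and a practical system might compute the held-out estimates more efficiently than by retraining from scratch.

```python
def leave_one_out_scores(training_set, train, score):
    """Predict a score for each training item with that item held out.

    training_set: list of (features, labels) pairs.
    train: callable that fits a model to a list of such pairs.
    score: callable (model, features) -> score.
    """
    predicted = []
    for n, (features, _labels) in enumerate(training_set):
        remaining = training_set[:n] + training_set[n + 1:]
        model = train(remaining)
        predicted.append(score(model, features))
    return predicted

# Toy stand-ins (hypothetical): the "model" is the mean feature value,
# and the score is the offset of an item from that mean.
def train_mean(items):
    return sum(x for x, _ in items) / len(items)

def score_offset(model, x):
    return x - model

training_set = [(1.0, {"a"}), (2.0, {"b"}), (3.0, {"a", "b"})]
print(leave_one_out_scores(training_set, train_mean, score_offset))  # [-1.5, 0.0, 1.5]
```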

According to this invention, the discriminant function generation unit can be realized appropriately by estimating the parameters of the discriminant function so that the label combinations satisfy the above criterion of maximizing the harmonic mean. In other words, labeling new content with the discriminant function obtained in this way can be expected to yield a large harmonic mean.
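The harmonic mean described above is the F value of a single content item. A minimal sketch, assuming labels are given as Python sets and using a hypothetical function name:

```python
def f_value(true_labels, predicted_labels):
    """Harmonic mean of recall and precision for one content item."""
    true_labels, predicted_labels = set(true_labels), set(predicted_labels)
    correct = len(true_labels & predicted_labels)
    if correct == 0:
        return 0.0  # covers the "no label" case and the all-wrong case
    recall = correct / len(true_labels)
    precision = correct / len(predicted_labels)
    return 2 * precision * recall / (precision + recall)

print(f_value({"a", "b"}, {"a", "b"}))  # 1.0
print(f_value({"a", "b"}, {"b", "c"}))  # 0.5
print(f_value({"a", "b"}, set()))       # 0.0
```

The last call shows why maximizing this measure discourages "no label" decisions: an empty prediction scores zero whenever the item has at least one correct label.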

In the label assignment device according to the present invention, the estimates of the discriminant function parameters that maximize an approximation of the harmonic mean, defined using a predetermined logistic function, are searched for with a gradient method.

According to this invention, using a gradient method such as a quasi-Newton method makes it possible to search efficiently for the estimates of the discriminant function parameters.
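The idea of a logistic approximation can be sketched as follows. The hard decision "score at or above threshold" is replaced by a sigmoid of (score - threshold), which makes the F value differentiable in the thresholds. The exact approximation and parameterization used in the patent are not given in this text, so the smoothed objective, the plain gradient ascent (used here in place of a quasi-Newton method), and all names and data below are assumptions for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def soft_f(scores, labels, theta):
    """Smoothed F value for one item; labels[k] is 1 if label k is correct."""
    z = [sigmoid(s - t) for s, t in zip(scores, theta)]
    overlap = sum(y * zk for y, zk in zip(labels, z))
    return 2.0 * overlap / (sum(labels) + sum(z))

def soft_f_gradient(scores, labels, theta):
    """Analytic gradient of soft_f with respect to each threshold."""
    z = [sigmoid(s - t) for s, t in zip(scores, theta)]
    a = sum(labels) + sum(z)
    b = sum(y * zk for y, zk in zip(labels, z))
    return [-zk * (1.0 - zk) * 2.0 * (y * a - b) / (a * a)
            for y, zk in zip(labels, z)]

def mean_soft_f(data, theta):
    return sum(soft_f(s, y, theta) for s, y in data) / len(data)

def fit_thresholds(data, theta, step=0.2, iterations=500):
    """Plain gradient ascent on the mean smoothed F value."""
    theta = list(theta)
    for _ in range(iterations):
        grad = [0.0] * len(theta)
        for scores, labels in data:
            for k, g in enumerate(soft_f_gradient(scores, labels, theta)):
                grad[k] += g / len(data)
        theta = [t + step * g for t, g in zip(theta, grad)]
    return theta

# Invented scores and correct-label indicators for two content items.
data = [([2.0, -1.0, 0.5], [1, 0, 1]), ([-0.5, 1.5, -2.0], [0, 1, 0])]
theta0 = [3.0, 3.0, 3.0]  # deliberately high thresholds: almost no labels pass
fitted = fit_thresholds(data, theta0)
print(mean_soft_f(data, theta0), mean_soft_f(data, fitted))
```

Starting from thresholds that are too high, the ascent moves them so that the smoothed F value increases, which mirrors the stated goal of the parameter search.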

A program according to the present invention causes a computer to execute the label assignment method performed by the label assignment device described above.

According to this invention, a computer can be made to execute the label assignment method.

A storage medium according to the present invention is a computer-readable storage medium on which the above program is stored.

According to this invention, the program for the label assignment method can be stored.

According to the present invention, a label assignment technique can assign one or more appropriate labels to content with a small amount of computation while reducing "no label" decisions for content.

The best mode for carrying out the present invention (hereinafter, the "embodiment") is described below with reference to the drawings. The description proceeds in the following order: outline of the embodiment, specific example of the embodiment, and working example.
(Outline of the embodiment)
FIG. 1 is a functional block diagram showing the configuration of the multi-label assignment device (label assignment device) of this embodiment. The multi-label assignment device 1 comprises a training data DB 2 (storage unit), a score function generation unit 3 (processing unit), a discriminant function generation unit 4 (processing unit), a memory 5 (storage unit), a multi-label assignment unit 6 (processing unit), an input unit 7, and an output unit 8.

Specifically, the multi-label assignment device 1 is a computer equipped with a CPU (Central Processing Unit), RAM (Random Access Memory), ROM (Read Only Memory), an HDD (Hard Disk Drive), and so on. For example, the training data DB 2 is realized by the HDD or a similar device, and the score function generation unit 3, the discriminant function generation unit 4, and the multi-label assignment unit 6 are realized by the CPU executing programs stored in the ROM or HDD. The memory 5 is realized by the RAM or a similar device.

The training data DB 2 stores a training data set: a collection of example content items in the same format as the content that is to be labeled (hereinafter, "content to be labeled"). The training data set is used to train the discriminant function (described in detail later) that decides whether to assign each label. For example, when the content consists of patent documents, multiple patent documents and the labels assigned to each (IPC, F-term, etc.) are prepared in advance, and the set of pairs of content body and label assignment vector is stored in the training data DB 2. A label assignment vector expresses, as a vector, whether each of the label candidates is assigned (described in detail later); it is provided in advance by the user of the multi-label assignment device 1 (hereinafter simply, the "user") or by the designer of the multi-label assignment device 1.

The score function generation unit 3 generates, from the training data set stored in the training data DB 2, a score function that computes the score of a content item for each individual label. The discriminant function generation unit 4 generates, from the training data set, a discriminant function for estimating from the scores whether labels should be assigned to content. The memory 5 stores the generated discriminant function and related data. The multi-label assignment unit 6 uses the discriminant function to decide whether to assign each label to content.

The input unit 7 is the interface through which the user inputs content for which label assignment is requested (content to be labeled), for example a keyboard or a mouse. The output unit 8 is the interface that displays the results of multi-label assignment by the multi-label assignment device 1 on a screen or outputs them to an external device, for example a liquid crystal display or a communication device.

Next, the processing of the multi-label assignment device 1 is described. FIG. 2 is a flowchart outlining the processing of the multi-label assignment device. FIG. 3 is a flowchart showing the computation of the predicted scores (step S103 in FIG. 2). FIG. 4 is a flowchart showing the learning of the discriminant function parameters (step S104 in FIG. 2). FIG. 5 is a flowchart showing the processing that uses the discriminant function to decide whether to assign labels to content (step S105 in FIG. 2).

In this way, the multi-label assignment device 1 of this embodiment first designs, based on conventional binary classifier techniques, a score function that gives, for each individual label, a score indicating how likely the label is to apply to the content. Next, it designs a discriminant function that decides whether to assign each label from the score obtained by the score function. The design of the discriminant function sets a threshold on the score that determines whether the label is assigned. The optimal threshold is obtained by training the discriminant function to maximize the F value defined for the problem of retrieving the labels that should be assigned to content.

That is, the multi-label assignment device 1 of this embodiment designs a score function for each individual label based on binary classifier techniques, uses the score function to compute the score of the content for each label, and uses the scores to decide label by label whether to assign each one. The computation required for label assignment can therefore be kept proportional to the number of labels. In addition, the device obtains the score threshold that governs the assignment decision by maximizing the F value defined for the problem of retrieving the labels that should be assigned. The F value ranges from 0 to 1 and is 0 when content that has labels to be assigned is erroneously judged as "no label". Using a threshold obtained by maximizing the F value therefore greatly reduces the likelihood of a "no label" decision and improves the accuracy of label assignment.

(Specific example of the embodiment)
Next, a specific example of the embodiment outlined above is described. It concerns the multi-label assignment problem of selecting one or more labels to assign to content from K candidate labels {1, ..., k, ..., K}, with the score function defined using any known binary classifier.

When the feature space formed by the words and other features contained in content is T = {t_1, ..., t_i, ..., t_V}, the feature vector x of a content item is expressed, based on the frequency of each t_i in the content, as x = {x_1, ..., x_i, ..., x_V}. V is the number of kinds of features that may be contained in content. For example, when the content is text data, V is the total number of distinct vocabulary items (words) that may occur in the content.

The binary classifier that decides, for the feature vector x of a content item, whether label k should be assigned can be an existing classifier such as one based on the naive Bayes (NB) model (Non-Patent Document 1); the maximum entropy (ME) model (K. Nigam, J. Lafferty, and A. McCallum: Using maximum entropy for text classification, IJCAI-99 Workshop on Machine Learning for Information Filtering, 61-67 (1999)); a classifier based on a hybrid of the NB and ME models (hybrid classifier; see JP 2006-107354 A); or a support vector machine (SVM) (Non-Patent Document 2).

The parameter learning and label assignment methods of such algorithms can be realized on the multi-label assignment device 1, which is a computer, by writing programs that execute the flowcharts of FIGS. 2 to 5. These programs can be stored on storage media such as a hard disk, flash memory, CD-ROM (Compact Disk Read Only Memory), or DVD (Digital Versatile Disk).

(Example)
Next, a working example of the present invention is described. Table 1 shows the results of applying the multi-label assignment device 1 of this embodiment (the present application) and the prior art (comparative example) to the problem of assigning F-terms to patent documents contained in the Japanese patent document database provided by the NTCIR-6 project, organized by the National Institute of Informatics.

This database contains patent documents as content, and each patent document has the relevant F-terms assigned. F-terms are defined for each theme, where a theme provides a broad classification of patents, and are used as search keys for patent documents. Each theme has between 18 and just under 1,000 F-terms.

In this example, the performance of the invention was evaluated on the problem of automatically assigning F-terms to new patent documents whose theme is known. The multi-label assignment device 1 of the specific example of the embodiment was designed using past patent documents with F-terms already assigned as training data and then applied to the new documents. Documents filed in 1993-1997 were used as the past patent documents, and 21,606 documents covering 108 themes filed in 1998-1999 were used as the new patent documents. The number of F-terms per theme averaged 230, with a minimum of 35 and a maximum of 797.

The hybrid classifier disclosed in JP 2006-107354 A was used as the binary classifier in the score function generation unit 3 of the multi-label assignment device 1 of the embodiment. A hybrid classifier is designed by applying an NB model to each component of the content. In this example, the hybrid classifier was designed with the following parts of a patent document as components: "title of the invention", "inventors and applicants", "abstract", "claims", and "other text (description, etc.)". The values of α and γ, which are set in advance when the discriminant function generation unit 4 computes the parameter w of the discriminant function, were 0.5 and 1, respectively.

The accuracy of F-term assignment and the accuracy of the F-term search ranking (priority) obtained by applying the multiple label assigning apparatus 1 of the specific example of the embodiment to the new patent documents were evaluated using the F-measure and the 11-point average precision (details are given later).

A larger 11-point average precision indicates that the search ranking of the F-terms given to a patent document is more accurate. That is, even when the number of correct F-terms assigned is the same, the 11-point average precision increases as the ranking becomes more accurate. Because the F-measure and the 11-point average precision are both evaluation measures computed per labeled patent document, the multiple label assigning apparatus 1 was evaluated using the mean F-measure and the mean 11-point average precision over all patent documents.
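The two per-document measures can be sketched as follows. This is a minimal illustration that assumes the standard information-retrieval definitions (F-measure as the harmonic mean of precision and recall, and interpolated precision averaged over the recall levels 0.0, 0.1, ..., 1.0). The exact formulas used in the specification are not reproduced in this section, and the function names and sample labels are hypothetical.

```python
def f_measure(assigned, correct):
    """Harmonic mean of precision and recall for one document's label set."""
    assigned, correct = set(assigned), set(correct)
    hits = len(assigned & correct)
    if hits == 0:
        return 0.0
    precision = hits / len(assigned)
    recall = hits / len(correct)
    return 2 * precision * recall / (precision + recall)


def eleven_point_average_precision(ranked, correct):
    """Mean interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    `ranked` lists candidate labels from highest to lowest priority
    (assumed here to be the ordering induced by the per-label scores).
    """
    correct = set(correct)
    if not correct:
        return 0.0
    points = []  # (recall, precision) measured after each rank
    hits = 0
    for rank, label in enumerate(ranked, start=1):
        if label in correct:
            hits += 1
        points.append((hits / len(correct), hits / rank))
    total = 0.0
    for i in range(11):
        level = i / 10
        # Interpolated precision: best precision at any recall >= level.
        total += max((p for r, p in points if r >= level), default=0.0)
    return total / 11


# Same correct labels, different rankings: the better ranking scores higher.
good = eleven_point_average_precision(["A", "B", "C", "D"], {"A", "C"})
poor = eleven_point_average_precision(["B", "A", "D", "C"], {"A", "C"})
```

Averaging these per-document values over all evaluated documents gives the figures used to compare methods.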

Both the method according to the present application and the method according to the comparative example shown in Table 1 use the hybrid classifier disclosed in Japanese Patent Application Laid-Open No. 2006-107354 as the binary classifier. The comparative method assigns labels using this hybrid classifier alone, whereas the method according to the present application differs in that the discriminant function generation unit 4 additionally sets up a discriminant function.
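The difference between the two methods can be illustrated with a small sketch. It assumes, following the claims, that the score is the log-odds of the classifier's label probability and that the discriminant function subtracts a per-label threshold from that score. The comparative method is assumed here to correspond to a threshold of zero. The threshold values, function names, and probabilities are hypothetical; in the described method the discriminant function parameters are learned from training data rather than set by hand.

```python
import math


def log_odds(p):
    """Score: log of P(label assigned) divided by P(label not assigned)."""
    return math.log(p / (1.0 - p))


def assign_labels(probabilities, thresholds, cutoff=0.0):
    """Assign every label whose discriminant value (score - threshold) >= cutoff.

    `probabilities` maps label -> probability from a binary classifier.
    `thresholds` maps label -> threshold; missing labels default to 0.0,
    which is the assumed classifier-only behaviour.
    """
    assigned = []
    for label, p in probabilities.items():
        value = log_odds(p) - thresholds.get(label, 0.0)
        if value >= cutoff:
            assigned.append(label)
    return assigned


# No label reaches probability 0.5, so the classifier alone assigns nothing.
probs = {"A": 0.4, "B": 0.1}
baseline = assign_labels(probs, {})
# A lowered (hypothetical) threshold lets the most plausible label through.
adjusted = assign_labels(probs, {"A": -1.0, "B": -1.0})
```

With these sample values the classifier-only rule returns no labels, while the thresholded rule returns one, which is the kind of "no label" outcome the discriminant function is meant to reduce.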

As Table 1 shows, the F-measure is higher for the method according to the present application than for the comparative method, while the 11-point average precision is about the same for both. The method according to the present application is therefore superior to the comparative method in the accuracy of label assignment and equivalent in the accuracy of the label search ranking. In other words, the method according to the present application is less likely than the comparative method to return a "no label" decision. Although no empirical data are shown here, it is clear from the specification as a whole that the method according to the present application requires significantly less computation than the methods described in Non-Patent Documents 3 and 4.

This concludes the description of the embodiment, but aspects of the present invention are not limited to it.
For example, the content need not be a patent document. As long as it can be represented by a feature vector, it may be another kind of document such as a paper or a novel, or it may be image data. In addition, specific details such as the labels and the flowcharts may be modified as appropriate without departing from the spirit of the present invention.

A functional block diagram showing the configuration of the multiple label assigning apparatus of this embodiment.
A flowchart showing an overview of the processing of the multiple label assigning apparatus.
A flowchart showing the processing for calculating the predicted values of the scores.
A flowchart showing the processing for learning the parameters of the discriminant function.
A flowchart showing the processing for determining, using the discriminant function, whether a label can be assigned to content.

Explanation of symbols

1 Multiple label assigning apparatus
2 Training data DB
3 Score function generation unit
4 Discriminant function generation unit
5 Memory
6 Multiple label assigning unit
7 Input unit
8 Output unit

Claims

A labeling method performed by a labeling device that assigns, to content having at least one of characters and images, one or more labels representing the type of that content, wherein:
the labeling device comprises a storage unit and a processing unit that includes a score function generation unit, a discriminant function generation unit, and a label assigning unit;
the score function generation unit generates a score function that calculates, for each content item to be labeled, a score for determining whether each label can be assigned, using a feature vector that aggregates the appearance frequency of each feature of the content read from the storage unit and a binary classifier that is a program designed for each of the labels;
the discriminant function generation unit generates, using a training data set that is read from the storage unit and consists of content whose labels to be assigned have already been determined, a discriminant function having the score function and a threshold for determining, from the score calculated by the score function, whether a label can be assigned; and
the label assigning unit calculates, using the discriminant function, a discriminant function value of the content to be labeled for each label, and performs label assignment by storing in the storage unit, in association with the content to be labeled, each label whose discriminant function value is equal to or greater than a predetermined value.

The labeling method according to claim 1, wherein the score function generation unit:
uses, as the binary classifier, a classifier that is based on a probability model giving the probability that a label is assigned to content and that has been trained in advance using a training data set consisting of content whose labels to be assigned have already been determined; and
uses the binary classifier read from the storage unit to calculate the probability under the assumption that a label is assigned to the content to be labeled and the probability under the assumption that the label is not assigned, and generates a score function that calculates, as the score for determining whether the label can be assigned to the content to be labeled, the logarithm of the value obtained by dividing the former probability by the latter.

The labeling method according to claim 1, wherein the discriminant function generation unit, when generating the discriminant function:
determines the combination of labels to be assigned to a training data item excluded from the training data set, using a predicted value of the score calculated from estimated values of the parameters of the binary classifier that were computed with that one item excluded; and calculates estimated values of the parameters of the discriminant function so as to satisfy a criterion that maximizes the harmonic mean of (a) the number of labels correctly assigned using the predicted value divided by the total number of labels that should be assigned to the training data and (b) the number of labels correctly assigned using the predicted value divided by the total number of labels assigned using the predicted value.

The labeling method according to claim 3, wherein the discriminant function generation unit:
when calculating the estimated values of the parameters of the discriminant function so as to satisfy the criterion that maximizes the harmonic mean, uses a gradient method to search for the estimated values of the parameters of the discriminant function that maximize an approximation of the harmonic mean, the approximation being defined using a logistic function whose variable is the difference between the score function value and the threshold given by the parameters of the discriminant function.

A labeling device that assigns, to content having at least one of characters and images, one or more labels representing the type of that content, the device comprising a processing unit and a storage unit, wherein
the processing unit comprises:
a score function generation unit that generates a score function that calculates, for each content item to be labeled, a score for determining whether each label can be assigned, using a feature vector that aggregates the appearance frequency of each feature of the content read from the storage unit and a binary classifier that is a program designed for each of the labels;
a discriminant function generation unit that generates, using a training data set that is read from the storage unit and consists of content whose labels to be assigned have already been determined, a discriminant function having the score function and a threshold for determining, from the score calculated by the score function, whether a label can be assigned; and
a label assigning unit that calculates, using the discriminant function, a discriminant function value of the content to be labeled for each label, and performs label assignment by storing in the storage unit, in association with the content to be labeled, each label whose discriminant function value is equal to or greater than a predetermined value.

A program for causing a computer to execute the labeling method according to any one of claims 1 to 4.

A computer-readable storage medium storing the program according to claim 6.