JP2004348239A

JP2004348239A - Text classification program

Info

Publication number: JP2004348239A
Application number: JP2003142007A
Authority: JP
Inventors: Mineki Takechi; 峰樹武智
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2003-05-20
Filing date: 2003-05-20
Publication date: 2004-12-09

Abstract

PROBLEM TO BE SOLVED: To classify a text to be classified with high accuracy.

SOLUTION: A function word/content word splitting means 1a splits a text A2 to be classified into function words and content words. An N-gram means 1b generates N-grams for the function words and for the content words separately, changing N step by step. A feature vector generation means 1c generates a function word feature vector and a content word feature vector for each N-gram. An area determining means 1f determines whether each feature vector belongs to the procedure area or the non-procedure area of a classification model 1e. A classification means 1g classifies whether or not the text A2 to be classified indicates a procedure by using an evaluation criterion that, as N is increased, takes a high evaluation value when the classification performance of the function word feature vector improves or that of the content word feature vector deteriorates, and takes a low evaluation value when the classification performance of the function word feature vector deteriorates or that of the content word feature vector improves.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明はテキスト分類プログラムに関し、特に手順を示したテキストを分類するテキスト分類プログラムに関する。
【０００２】
【従来の技術】
近年、電子文書の蓄積に加え、インターネットの普及によりＷｅｂ上の大量のテキストへのアクセスが容易となり、コンピュータによる情報検索技術の重要性が増している。ＣＲＭ（ＣｕｓｔｏｍｅｒＲｅｌａｔｉｏｎｓｈｉｐＭａｎａｇｅｍｅｎｔ）やＷｅｂナビゲーションの分野では、自動質問応答などにおいてテキスト自動分類技術が用いられる。これらの応用分野では、従来のテキストの分野やジャンルに基づく分類に加えて、それとは異なる軸を持つ分類、例えば、テキストの記述スタイルや特定の質問に対する回答候補の絞込みに役立つ分類などが必要となっている。
【０００３】
Ｗｅｂ上で手順に関するテキストの検索を可能にするには、コンピュータにテキストの記述スタイルを、例えば、ＳＶＭ（サポートベクトルマシン）により学習させる。そして、予め被検索対象となる被検索テキストを、手順を示すテキストとそうでないテキストとに、学習結果に基づいて分類し記憶しておく。ユーザから手順に関するテキストのキーワード検索要求があった場合、分類された手順を示す被検索テキストの中からそのキーワードに合致するテキストを検索する。このとき、手順を示すテキストを確実に検索できるには、コンピュータのＳＶＭ学習によるテキストの分類精度が高い必要がある。
【０００４】
なお、文章の構造を解析することは、従来から行われている。表、箇条書き、多段組等任意にレイアウトされた文書から、意味あるテキストブロックを抽出する文書処理方法がある（例えば、特許文献１参照）。
【０００５】
【特許文献１】
特開２００２−０３２７７０号公報（第６頁、第８図）
【０００６】
【発明が解決しようとする課題】
ところで、手順を示すか否かによってテキスト（被分類テキスト）を分類するには、その被分類テキストの特徴ベクトルを生成する。そして、その特徴ベクトルが、学習によって得られた分類モデルの手順を示す領域に属するか属さないかによって分類される。被分類テキストの分類精度は、特徴ベクトルの基となるパラメータに依存するので、分類精度を高めるためのパラメータを選択する必要がある。
【０００７】
本発明はこのような点に鑑みてなされたものであり、被分類テキストの機能語と内容語との各々において特徴ベクトルを生成することにより精度よく分類することができるテキスト分類プログラムを提供することを目的とする。
【０００８】
【課題を解決するための手段】
本発明では上記課題を解決するために、図１に示すような、手順を示しているか否かによってテキストを分類するテキスト分類プログラムにおいて、コンピュータに、被分類テキストＡ２を機能語と内容語とに分割し、機能語と内容語との各々において、組み合わせ単語数を段階的に変化させたＮ−ｇｒａｍを行い、Ｎ−ｇｒａｍごとにおける機能語の機能語特徴ベクトルと内容語の内容語特徴ベクトルとを生成し、機能語特徴ベクトルと内容語特徴ベクトルとの各々が、学習用テキストＡ１を学習して生成した分類モデルの手順を示している領域と手順を示してない領域とのどちらの領域に属するかを判断し、Ｎ−ｇｒａｍに用いるＮが増加するとともに、機能語特徴ベクトルによる分類の性能が向上した場合又は内容語特徴ベクトルによる分類の性能が悪化した場合において高い評価値をとり、機能語特徴ベクトルによる分類の性能が悪化した場合又は内容語特徴ベクトルによる分類の性能が向上した場合に、低い評価値をとるような評価基準を用いて、被分類テキストＡ２の手順を示しているか否かの分類をする、処理を実行させることを特徴とするテキスト分類プログラムが提供される。
【０００９】
このような、テキスト分類プログラムによれば、被分類テキストＡ２を機能語と内容語とに分割し、各々において単語数を段階的に変化させたＮ−ｇｒａｍを行う。Ｎ−ｇｒａｍごとにおける機能語の機能語特徴ベクトルと内容語の内容語特徴ベクトルとを生成し、分類モデルの手順を示している領域と手順を示してない領域とのどちらの領域に属するかを判断する。そして、Ｎ−ｇｒａｍに用いるＮが増加するとともに、機能語特徴ベクトルによる分類の性能が向上した場合又は内容語特徴ベクトルによる分類の性能が悪化した場合において高い評価値をとり、機能語特徴ベクトルによる分類の性能が悪化した場合又は内容語特徴ベクトルによる分類の性能が向上した場合に、低い評価値をとるような評価基準を用いることによって、被分類テキストＡ２の手順を示しているか否かの分類をする。これにより、被分類テキストを分類精度よく分類することが可能となる。
【００１０】
【発明の実施の形態】
以下、本発明の実施の形態を図面を参照して説明する。図１は、本発明の原理を説明する原理図である。図に示すように、コンピュータ１は、機能・内容語分割手段１ａ、Ｎ−ｇｒａｍ手段１ｂ、特徴ベクトル生成手段１ｃ、学習手段１ｄ、分類モデル１ｅ、領域判断手段１ｆ、及び分類手段１ｇを有している。また、図１には、学習用テキストＡ１、被分類テキストＡ２が示してある。
【００１１】
機能・内容語分割手段１ａは、被分類テキストＡ２を機能語と内容語とに分割する。
Ｎ−ｇｒａｍ手段１ｂは、機能・内容語分割手段１ａによって分割された機能語と内容語の各々においてＮ−ｇｒａｍを行う。Ｎ−ｇｒａｍは、機能語、内容語の組み合わせ単語数が段階的に変化するよう行われる。例えば、ｕｎｉ−ｇｒａｍ（１−ｇｒａｍ）、ｂｉ−ｇｒａｍ（１−ｇｒａｍと２−ｇｒａｍ）、ｔｒｉ−ｇｒａｍ（１−ｇｒａｍと２−ｇｒａｍと３−ｇｒａｍ）が行われる。
【００１２】
特徴ベクトル生成手段１ｃは、Ｎ−ｇｒａｍごと（例えば、ｕｎｉ／ｂｉ／ｔｒｉ−ｇｒａｍの各々）における機能語の機能語特徴ベクトルを生成する。また、特徴ベクトル生成手段１ｃは、Ｎ−ｇｒａｍごとにおける内容語の内容語特徴ベクトルを生成する。
【００１３】
学習手段１ｄは、学習用テキストＡ１を読込み、学習用テキストＡ１の機能語と内容語との各々において学習し分類モデル１ｅを生成する。分類モデル１ｅは、手順を示す領域と手順を示していない領域とを具備している。
【００１４】
領域判断手段１ｆは、機能語特徴ベクトルのＮ−ｇｒａｍごとにおいて、その機能語特徴ベクトルが、分類モデル１ｅの手順を示す領域と手順を示していない領域のどちらに属するかを判断する。また、領域判断手段１ｆは、内容語特徴ベクトルのＮ−ｇｒａｍごとにおいて、その内容語特徴ベクトルが、分類モデル１ｅの手順を示す領域と手順を示していない領域のどちらに属するかを判断する。
【００１５】
分類手段１ｇは、Ｎ−ｇｒａｍに用いるＮが増加するとともに、領域判断手段１ｆによって判断された機能語特徴ベクトルによる分類の性能が向上した場合又は内容語特徴ベクトルによる分類の性能が悪化した場合において高い評価値をとり、機能語特徴ベクトルによる分類の性能が悪化した場合又は内容語特徴ベクトルによる分類の性能が向上した場合に、低い評価値をとるような評価基準を用いて、分類テキストの手順を示しているか否かの分類をする。例えば、ｕｎｉ−ｇｒａｍ、ｂｉ−ｇｒａｍ、ｔｒｉ−ｇｒａｍの機能語特徴ベクトルでは、領域判断手段１ｆによるｕｎｉ−ｇｒａｍの機能語特徴ベクトルの判断結果より、ｔｒｉ−ｇｒａｍの機能語特徴ベクトルの判断結果を尊重する。また、ｕｎｉ−ｇｒａｍ、ｂｉ−ｇｒａｍ、ｔｒｉ−ｇｒａｍ内容語特徴ベクトルでは、領域判断手段１ｆによるｔｒｉ−ｇｒａｍの内容語特徴ベクトルの判断結果より、ｕｎｉ−ｇｒａｍの内容語特徴ベクトルの判断結果を尊重する。そして、分類手段１ｇは、その判断結果に基づいて被分類テキストＡ２の手順を示しているか否かの分類をする。
【００１６】
ところで、手順を示すテキストと手順を示さないテキストにおいて、機能語は、Ｎ−ｇｒａｍの単語数が増加するにつれ、テキストの分類精度が高くなり、内容語は、分類精度が悪くなるという傾向が実験的に確かめられている。すなわち、機能語と内容語の特徴ベクトルの、分類モデルによる分類結果をＮ−ｇｒａｍの単語数の増加に伴って重みを持たせることにより、精度よくテキストを分類することができる。
【００１７】
次に本発明の実施の形態の構成例について説明する。図２は、本発明の実施の形態の構成例を示す図である。図に示すように、分類サーバ１０は、ネットワーク３０を介して、端末装置２１、サーバ２２と接続されている。ネットワーク３０は、例えば、インターネットである。
【００１８】
分類サーバ１０は、サーバ２２から情報検索の対象となる被検索テキストを、そのＵＲＬ（ＵｎｉｆｏｒｍＲｅｓｏｕｒｃｅＬｏｃａｔｏｒ）とともにネットワーク３０を介して受信する。分類サーバ１０は、受信した被検索テキストを、分類モデルを参照して、手順を示しているか否かによって分類し、記憶する。手順の具体例としては、ソフトウェアのインストール手順や料理の手順などである。非手順の具体例としては、単なる記事の表示、情報の羅列である。
【００１９】
分類サーバ１０は、手順を示した学習用テキストと手順を示していない学習用テキストが入力される。分類サーバ１０は、入力された学習用テキストを学習して、被検索テキストを分類するためのモデルとなる分類モデルを生成する。分類サーバ１０は、サーバ２２から受信した被検索テキスト（被検索テキストの特徴ベクトル）を、分類モデルの手順を示すテキストの領域に属するか、手順を示さないテキストの領域に属するかによって分類する。
【００２０】
分類サーバ１０は、端末装置２１からの手順検索（手順を示す内容を含むテキスト検索）又は通常検索（手順を示していないテキストの検索）の指示を受付ける。分類サーバ１０は、端末装置２１からキーワードを受信し、受付けた手順検索又は通常検索の指示に従って、分類した被検索テキストを検索する。そして、分類サーバ１０は、キーワードに合致する被検索テキストのＵＲＬを端末装置２１に送信する。端末装置２１は、受信したＵＲＬ（サーバ２２）にアクセスすることによって、希望する被検索テキストを参照することができる。
【００２１】
なお、端末装置２１とサーバ２２は、１つしか示してないが、複数の端末装置とサーバがネットワーク３０に接続される。分類サーバ１０は、複数のサーバから被検索テキストが入力され、複数の端末装置から情報検索が行われる。
【００２２】
次に、分類サーバ１０のハードウェア構成について説明する。図３は、分類サーバのハードウェア構成を示すブロック図である。図に示すように、分類サーバ１０は、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）１０ａによって装置全体が制御されている。ＣＰＵ１０ａには、バス１０ｇを介してＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）１０ｂ、ハードディスクドライブ（ＨＤＤ：ＨａｒｄＤｉｓｋＤｒｉｖｅ）１０ｃ、グラフィック処理装置１０ｄ、入力インタフェース１０ｅ、及び通信インタフェース１０ｆが接続されている。
【００２３】
ＲＡＭ１０ｂには、ＣＰＵ１０ａに実行させるＯＳ（ＯｐｅｒａｔｉｎｇＳｙｓｔｅｍ）のプログラムや、学習用テキストを学習し、被検索テキストを分類するためのアプリケーションプログラムの少なくとも一部が一時的に格納される。また、ＲＡＭ１０ｂには、ＣＰＵ１０ａによる処理に必要な各種データが保存される。ＨＤＤ１０ｃには、上記のＯＳやアプリケーションプログラム、各種データなどが格納される。
【００２４】
グラフィック処理装置１０ｄには、モニタ１０ｈが接続されている。グラフィック処理装置１０ｄは、ＣＰＵ１０ａからの命令に従って、画像をモニタ１０ｈの表示画面に表示させる。入力インタフェース１０ｅには、キーボード１０ｉと、マウス１０ｊとが接続されている。入力インタフェース１０ｅは、キーボード１０ｉやマウス１０ｊから送られてくる信号を、バス１０ｇを介してＣＰＵ１０ａに送信する。
【００２５】
通信インタフェース１０ｆは、ネットワーク３０に接続されている。通信インタフェース１０ｆは、ネットワーク３０を介して、端末装置２１、サーバ２２と通信を行う。
【００２６】
なお、端末装置２１、サーバ２２も図３と同様のハードウェア構成で実現することができる。ただし、端末装置２１、サーバ２２のハードディスクドライブには、各々が処理するのに必要なプログラム及びデータが格納される。
【００２７】
以上のようなハードウェア構成によって、本実施の形態を実現することができる。
次に、分類サーバ１０の機能について説明する。図４は、分類サーバの機能ブロック図である。図に示すように、分類サーバ１０は、ＳＶＭ部１１を有している。ＳＶＭ部１１は、学習用に与えられるデータをサポートベクトルマシンによって学習する。そして、ＳＶＭ部１１は、別に与えられるデータを学習した結果に基づいて分類する。ＳＶＭ部１１は、特徴抽出部１２、モデル生成部１３、データベースである分類モデルＤＢ１４、分類器１５、分類結果テーブル１６、及び評価部１７を有している。
【００２８】
図４には学習用テキストＢ１と被検索テキストＢ２が示してある。学習用テキストＢ１、被検索テキストＢ２は、例えば、ＨＴＭＬ（ＨｙｐｅｒＴｅｘｔＭａｒｋｕｐＬａｎｇｕａｇｅ）で記述されている。学習用テキストＢ１は、人によって収集され、その内容が手順を示すものか否かの識別子が付与されて、ＳＶＭ部１１に入力される。被検索テキストＢ２は、ＳＶＭ部１１に入力され、ＳＶＭ部１１の学習結果に基づいて手順を示すテキストか否かに分類される。
【００２９】
特徴抽出部１２は、入力される学習用テキストＢ１、被検索テキストＢ２の機能語と内容語の各々において特徴ベクトルを生成する。特徴抽出部１２は、生成した特徴ベクトルをモデル生成部１３及び分類器１５に出力する。
【００３０】
図５は、特徴抽出部の詳細を示した機能ブロック図である。図に示すように、特徴抽出部１２は、形態素解析器１２ａ、解析済み文書ＤＢ（ＤＢ：データベース）１２ｂ、品詞フィルタ部１２ｃ、パターン辞書ＤＢ１２ｄ、パターンマイニング部１２ｅ、Ｎ−ｇｒａｍ計算部１２ｆ、及びベクトル生成部１２ｇを有している。
【００３１】
形態素解析器１２ａは、入力される学習用テキストＢ１、被検索テキストＢ２を形態素解析する。形態素解析器１２ａは、形態素解析した学習用テキストＢ１、被検索テキストＢ２を解析済み文書ＤＢ１２ｂに格納する。
【００３２】
品詞フィルタ部１２ｃは、解析済み文書ＤＢ１２ｂに格納されている形態素解析された学習用テキストＢ１、被検索テキストＢ２の機能語と内容語を抽出する。
【００３３】
パターンマイニング部１２ｅは、品詞フィルタ部１２ｃで抽出された機能語の繰り返し出現するパターンを、シーケンシャルパターンマイニング（Ｓｅｑｕｅｎｔｉａｌｐａｔｔｅｒｎｍｉｎｉｎｇ）によって抽出する。機能語の繰り返し出現するパターンは、予めパターン辞書ＤＢ１２ｄに格納されており、パターンマイニング部１２ｅは、パターン辞書ＤＢ１２ｄを参照することによって、抽出された機能語の繰り返し出現するパターンを抽出する。パターンマイニング部１２ｅは、機能語の場合と同様にして、品詞フィルタ部１２ｃで抽出された内容語の繰り返し出現するパターンを、パターン辞書ＤＢ１２ｄを参照し、シーケンシャルパターンマイニングによって抽出する。
【００３４】
Ｎ−ｇｒａｍ計算部１２ｆは、品詞フィルタ部１２ｃで抽出された機能語の、Ｎ（組み合わせ単語数）を変更した複数のＮ−ｇｒａｍを作成する。例えば、Ｎ−ｇｒａｍ計算部１２ｆは、まず、機能語のｕｎｉ（Ｎ＝１）−ｇｒａｍ、ｂｉ（Ｎ＝２）−ｇｒａｍ、及びｔｒｉ（Ｎ＝３）−ｇｒａｍを作成し、これらの組み合わせとして最終的にＮ＝ｕｎｉ，Ｎ＝ｕｎｉ＋ｂｉ，Ｎ＝ｕｎｉ＋ｂｉ＋ｔｒｉの計３つの組み合わせを作成する。同様にＮ−ｇｒａｍ計算部１２ｆは、品詞フィルタ部１２ｃで抽出された内容語の、Ｎを変更した複数のＮ−ｇｒａｍを作成する。
【００３５】
ベクトル生成部１２ｇは、パターンマイニング部１２ｅによって抽出された機能語の出現パターンと、Ｎ−ｇｒａｍ計算部１２ｆによって作成された機能語の複数のＮ−ｇｒａｍ各々において、特徴ベクトルを生成する。すなわち、ベクトル生成部１２ｇは、機能語の出現パターンとｕｎｉ−ｇｒａｍの特徴ベクトル、機能語の出現パターンと（ｕｎｉ＋ｂｉ）−ｇｒａｍの特徴ベクトル、及び機能語の出現パターンと（ｕｎｉ＋ｂｉ＋ｔｒｉ）−ｇｒａｍの特徴ベクトルを生成する。
【００３６】
図６は、特徴ベクトル成分を説明する図である。図に示すように、特徴ベクトルは、成分ｔｆ１，ｔｆ２，…，ｔｆｉ，…，ｔｆｌ，ｐ０，ｐ１，…，ｐｉ，…，ｐｍ（ｉ，ｌ，ｍ：正の整数でｉ＜ｌ＜ｍ）から構成されている。
【００３７】
成分ｔｆ１，ｔｆ２，…，ｔｆｉ，…，ｔｆｌの各々は、例えば、‘０’，‘１’の２値で表される。成分ｔｆ１，ｔｆ２，…，ｔｆｉ，…，ｔｆｌの各々は、Ｎ−ｇｒａｍの文字列が対応し、Ｎ−ｇｒａｍ計算部１２ｆによって作成された機能語の文字列の組み合わせが、その各成分に対応した文字列に一致していれば‘１’となる。一致していなければ、‘０’となる。
【００３８】
成分ｐ０，ｐ１，…，ｐｉ，…，ｐｍの各々は、例えば、‘０’，‘１’の２値で表される。成分ｐ０，ｐ１，…，ｐｉ，…，ｐｍの各々は、繰り返し出現するパターンが対応し、抽出した機能語の繰り返し出現するパターンが、その各成分に対応したパターンに一致していれば‘１’となる。一致していなければ、‘０’となる。
【００３９】
ベクトル生成部１２ｇは、同様にして内容語における特徴ベクトルを生成する。ベクトル生成部１２ｇは、生成した機能語と内容語の特徴ベクトルをモデル生成部１３、分類器１５に出力する。
【００４０】
図４の説明に戻る。モデル生成部１３は、特徴空間上に点在する、特徴抽出部１２から出力される学習用テキストＢ１の機能語における特徴ベクトルを、人によって付与された識別子を参照して、手順を示したものとそうでないものとに分ける識別平面を算出する。モデル生成部１３は、これらの特徴ベクトル、識別平面を分類モデルとして、分類モデルＤＢ１４に記憶する。同様にモデル生成部１３は、学習用テキストＢ１の内容語における分類モデルを生成し、分類モデルＤＢ１４に記憶する。
【００４１】
分類器１５は、特徴抽出部１２から出力される被検索テキストＢ２の機能語における特徴ベクトルが、分類モデルＤＢ１４に記憶されている機能語の分類モデルの、識別平面の手順を示している側の特徴空間に存在しているか、手順を示していない側の特徴空間に存在しているかを判断する。なお、上述したように被検索テキストＢ２の機能語における特徴ベクトルは、複数のＮ−ｇｒａｍにおいて生成されるので、分類器１５は、それぞれの特徴ベクトルにおいて判断をする。分類器１５は、同様に特徴抽出部１２から出力される被検索テキストＢ２の内容語における特徴ベクトルが、分類モデルＤＢ１４に記憶されている内容語の分類モデルの、識別平面の手順を示している側の特徴空間に存在しているか、手順を示していない側の特徴空間に存在しているかを判断する。
【００４２】
分類器１５は、判断結果を示した分類結果リストを生成する。分類器１５は、その分類結果リストに基づいて、被検索テキストＢ２が手順を示しているか否かの最終判断をするためのスコアを算出し、分類結果テーブル１６に記憶する。
【００４３】
図７は、分類器が生成する分類結果リストを示した図である。図に示す分類結果リスト４０は、被検索テキストＢ２の機能語におけるｕｎｉ−ｇｒａｍの特徴ベクトルの分類結果を示している。分類結果リスト４０には、被検索テキストＢ２に含まれている箇条書き部分に付与されるｉｄ（識別子）、その箇条書き部分の特徴ベクトルが、特徴空間の手順側の領域に属しているか（手順タイプ）非手順側の領域に属しているか（非手順タイプ）を示すクラスが示されている。
【００４４】
手順は、一般に箇条書きされていることが多く、被検索テキストＢ２に含まれている箇条書き部分にｉｄが付与される。クラスは、１で特徴ベクトルが手順タイプであることを示し、−１で非手順タイプであることを示している。
【００４５】
分類結果リストは、被検索テキストＢ２の複数のＮ−ｇｒａｍにおける特徴ベクトルごとに生成され、例えば、上記例では、ｂｉ−ｇｒａｍ、ｔｒｉ−ｇｒａｍにおける特徴ベクトルにおいても生成される。また、分類リストは、同様に内容語においても生成される。
【００４６】
図８は、分類結果テーブルのデータ構成例である。図に示すように、分類結果テーブル１６は、ｉｄ、Ｎ＝１、Ｎ＝１＋２、Ｎ＝１＋２＋３、及びスコアの欄が設けられている。
【００４７】
ｉｄの欄には、被検索テキストＢ２に含まれる箇条書き部分に付与された識別子が格納される。Ｎ＝１の欄には、被検索テキストＢ２の機能語と内容語のｕｎｉ−ｇｒａｍにおける特徴ベクトルのクラスが格納される。Ｎ＝１＋２の欄には、被検索テキストＢ２の機能語と内容語のｂｉ−ｇｒａｍにおける特徴ベクトルのクラスが格納される。Ｎ＝１＋２＋３の欄には、被検索テキストＢ２の機能語と内容語のｔｒｉ−ｇｒａｍにおける特徴ベクトルのクラスが格納される。スコアの欄には、被検索テキストＢ２が手順を示しているか否かを判断するスコアが格納される。
【００４８】
スコアの算出について説明する。機能語の特徴ベクトルの分類精度は、特徴ベクトルに与えられるパラメータ（成分）のＮ−ｇｒａｍのＮが大きくなる（ｕｎｉ、ｂｉ、ｔｒｉ）につれ高くなる傾向がある。また、内容語の特徴ベクトルによる分類精度は、特徴ベクトルに与えられるパラメータのＮ−ｇｒａｍのＮが大きくなるにつれ低くなる傾向がある。
【００４９】
従って、被検索テキストＢ２の手順を示しているか否かの判断をするには、機能語においては、Ｎ−ｇｒａｍのＮが大きくなるにつれ、クラスの重みを大きく評価し、内容語においては、Ｎ−ｇｒａｍのＮが大きくなるほど、クラスの重みを小さく評価するようにすればよい。これにより、手順を示しているか否かの分類精度が高くなる。このような評価を行うスコア計算式を、式（１）に示す。
【００５０】
【数１】

【００５１】
式（１）に示すα_１，α_２，β_１，β_２は、Ｎ−ｇｒａｍにおけるＮの値Ｎ＝１，１＋２，１＋２＋３別に与えられる正の整数で、分類モデルＤＢ１４を参照し、スコア計算式による分類が一番精度よく実行されるよう値を決める。具体的には、学習用テキストＢ１の分類結果の値を、式（１）に代入して、スコアの判定に用いるしきい値と比較して、４変数の連立不等式を解く。
【００５２】
式（１）で示されるｓｃｏｒｅ（スコア）は、Ｎ−ｇｒａｍに用いるＮが増加するとともに、機能語特徴ベクトルによる分類の性能が向上した場合又は内容語特徴ベクトルによる分類の性能が悪化した場合において高い値をとり、機能語特徴ベクトルによる分類の性能が悪化した場合又は内容語特徴ベクトルによる分類の性能が向上した場合に、低い値をとる。算出されるスコアの値が、所定のしきい値以上か否かによって、被検索テキストＢ２が手順を示しているか否かを判断、分類する。例えば、図８の例において、スコアが所定のしきい値未満であれば、被検索テキストＢ２は、手順を示さないテキスト、しきい値以上であれば、手順を示すテキストと分類することができる。
【００５３】
ところで、上記の機能語、内容語の特徴ベクトルによる分類精度の傾向は、実験において確認されている。図９は、機能語と内容語の分類結果を示す図である。図に示す表４１は、コンピュータ分野の箇条書き及びその他の分野の箇条書きを、学習と評価の役割を入れ替えたときの分類の再現率（再び同じ分類が行われる率）を示している。表４１に示すように、機能語における再現率は、Ｎ−ｇｒａｍのＮが大きくなるにつれて高くなる傾向にある。内容語における再現率は、Ｎ−ｇｒａｍのＮが大きくなるにつれて低くなる傾向にある。このような傾向は、以下のような理由により生じると考えられている。機能語は、助動詞、助詞、接辞などの品詞であり、テキストの記述スタイルの依存性が高く、手順（箇条書き）を示した文章の特徴を捉えやすい。一方、内容語は、名詞、動詞などの用言であり、テキストのトピックス、ジャンルの依存性が高く、通常（非手順）の文章の特徴を捉えやすいと考えられるからである。
【００５４】
図４の説明に戻る。評価部１７は、分類結果テーブル１６のスコアを参照して、ＳＶＭ部１１に入力された被検索テキストＢ２が手順を示しているか否かを評価する。
【００５５】
次に、図４、図５に示す分類サーバ１０の動作を、流れ図を用いて説明する。図１０は、分類サーバの学習時の動作の流れを示す流れ図である。分類サーバ１０は、以下のステップに従って学習し、分類モデルを生成する。
【００５６】
［ステップＳ１］特徴抽出部１２は、学習用テキストＢ１を受け付ける。
［ステップＳ２］特徴抽出部１２の形態素解析器１２ａは、学習用テキストＢ１の形態素解析を行う。形態素解析器１２ａは、形態素解析した学習用テキストＢ１を解析済み文書ＤＢ１２ｂに格納する。
【００５７】
［ステップＳ３］品詞フィルタ部１２ｃは、解析済み文書ＤＢ１２ｂに格納されている形態素解析された学習用テキストＢ１から抽出すべき語として機能語を選択する。なお、機能語の分類モデルが生成され、再びこのステップに戻った時は、品詞フィルタ部１２ｃは、内容語を選択する。
【００５８】
［ステップＳ４］品詞フィルタ部１２ｃは、解析済み文書ＤＢ１２ｂに格納されている形態素解析された学習用テキストＢ１から、ステップＳ３で選択された機能語、又は内容語を抽出する。
【００５９】
［ステップＳ５］パターンマイニング部１２ｅは、パターン辞書ＤＢ１２ｄを参照して、ステップＳ４で抽出された機能語又は内容語のパターンマイニングを行う。
【００６０】
［ステップＳ６］Ｎ−ｇｒａｍ計算部１２ｆは、ステップＳ４で抽出された機能語又は内容語のＮ−ｇｒａｍ作成を行う。
［ステップＳ７］ベクトル生成部１２ｇは、ステップＳ５、ステップＳ６のパターンマイニング処理、Ｎ−ｇｒａｍ作成より、学習用テキストＢ１の特徴ベクトルを生成する。
【００６１】
［ステップＳ８］モデル生成部１３は、ベクトル生成部１２ｇより生成された特徴ベクトルから分類モデルを生成する。モデル生成部１３は、生成した分類モデルを分類モデルＤＢ１４に記憶する。
【００６２】
［ステップＳ９］品詞フィルタ部１２ｃ、パターンマイニング部１２ｅ、Ｎ−ｇｒａｍ計算部１２ｆ、及びモデル生成部１３は、Ｎ−ｇｒａｍのＮ＝１，１＋２，１＋２＋３の全てにおいて、特徴ベクトルを生成するようステップＳ４〜ステップＳ８の処理を繰り返す。
【００６３】
［ステップＳ１０］内容語における特徴ベクトルを生成するようステップＳ３へ進む。機能語と内容語の両方において、特徴ベクトルを生成した場合は、ステップＳ１１へ進む。
【００６４】
［ステップＳ１１］分類器１５は、各学習用テキストＢ１のスコアを計算する。
［ステップＳ１２］分類器１５は、被検索テキストＢ２を精度よく分類できるようα_１，α_２，β_１，β_２を決定する。
【００６５】
図１１は、分類サーバの分類時の動作の流れを示す流れ図である。分類サーバ１０は、以下のステップに従って被検索テキストＢ２を分類する。
［ステップＳ２１］特徴抽出部１２は、被検索テキストＢ２を受け付ける。
【００６６】
［ステップＳ２２］特徴抽出部１２の形態素解析器１２ａは、被検索テキストＢ２の形態素解析を行う。形態素解析器１２ａは、形態素解析した被検索テキストＢ２を解析済み文書ＤＢ１２ｂに格納する。
【００６７】
［ステップＳ２３］品詞フィルタ部１２ｃは、解析済み文書ＤＢ１２ｂに格納されている形態素解析された被検索テキストＢ２から抽出すべき語として機能語を選択する。なお、機能語の分類モデルが生成され、再びこのステップに戻った時は、品詞フィルタ部１２ｃは、内容語を選択する。
【００６８】
［ステップＳ２４］品詞フィルタ部１２ｃは、解析済み文書ＤＢ１２ｂに格納されている形態素解析された被検索テキストＢ２から、ステップＳ２３で選択された機能語、又は内容語を抽出する。
【００６９】
［ステップＳ２５］パターンマイニング部１２ｅは、パターン辞書ＤＢ１２ｄを参照して、ステップＳ２４で抽出された機能語又は内容語のパターンマイニングを行う。
【００７０】
［ステップＳ２６］Ｎ−ｇｒａｍ計算部１２ｆは、ステップＳ２４で抽出された機能語又は内容語のＮ−ｇｒａｍ作成を行う。
［ステップＳ２７］ベクトル生成部１２ｇは、ステップＳ２５、ステップＳ２６のパターンマイニング処理、Ｎ−ｇｒａｍ作成より、被検索テキストＢ２の特徴ベクトルを生成する。
【００７１】
［ステップＳ２８］分類器１５は、分類モデルＤＢ１４を参照して、被検索テキストＢ２の特徴ベクトルが、分類モデルの手順側の領域に属するか非手順側の領域に属するかを判断する（クラスを算出する）。
【００７２】
［ステップＳ２９］品詞フィルタ部１２ｃ、パターンマイニング部１２ｅ、Ｎ−ｇｒａｍ計算部１２ｆ、及びモデル生成部１３は、Ｎ−ｇｒａｍのＮ＝１，１＋２，１＋２＋３の全てにおいて、特徴ベクトルを生成するようステップＳ２４〜ステップＳ２８の処理を繰り返す。
【００７３】
［ステップＳ３０］内容語における特徴ベクトルを生成するようステップＳ２３へ進む。機能語と内容語の両方において、特徴ベクトルを生成した場合は、ステップＳ３１へ進む。
【００７４】
［ステップＳ３１］分類器１５は、Ｎ−ｇｒａｍのＮ＝１，１＋２，１＋２＋３の全てにおける特徴ベクトルのクラスを用いてスコアを算出する。
［ステップＳ３２］評価部１７は、スコアを参照して、被検索テキストＢ２が手順を示しているか非手順を示しているかを判断する。
【００７５】
このように、機能語と内容語において、Ｎ−ｇｒａｍを段階的に変化させ、それぞれのＮ−ｇｒａｍにおいて、特徴ベクトルを生成し、クラスを算出する。Ｎ−ｇｒａｍに用いるＮが増加するとともに、機能語特徴ベクトルによる分類の性能が向上した場合又は内容語特徴ベクトルによる分類の性能が悪化した場合において高い値となり、機能語特徴ベクトルによる分類の性能が悪化した場合又は内容語特徴ベクトルによる分類の性能が向上した場合に、低い値となるスコアを算出する。そして、スコアの値によって、被検索テキストを分類するようにした。これにより、被分類テキストを精度よく分類することができる。
【００７６】
なお、上記の処理機能を実現するプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリなどがある。磁気記録装置には、ハードディスク装置（ＨＤＤ）フレキシブルディスク（ＦＤ）、磁気テープなどがある。光ディスクには、ＤＶＤ（ＤｉｇｉｔａｌＶｅｒｓａｔｉｌｅＤｉｓｃ）、ＤＶＤ−ＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）、ＣＤ−ＲＯＭ（ＣｏｍｐａｃｔＤｉｓｃＲｅａｄＯｎｌｙＭｅｍｏｒｙ）、ＣＤ−Ｒ（Ｒｅｃｏｒｄａｂｌｅ）／ＲＷ（ＲｅＷｒｉｔａｂｌｅ）などがある。光磁気記録媒体には、ＭＯ（Ｍａｇｎｅｔｏ−Ｏｐｔｉｃａｌｄｉｓｃ）などがある。
【００７７】
プログラムを流通させる場合には、例えば、そのプログラムが記録されたＤＶＤ、ＣＤ−ＲＯＭなどの可搬型記録媒体が販売される。また、プログラムをサーバコンピュータの記憶装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することもできる。
【００７８】
プログラムを実行するコンピュータは、例えば、可搬型記録媒体に記録されたプログラムもしくはサーバコンピュータから転送されたプログラムを、自己の記憶装置に格納する。そして、コンピュータは、自己の記憶装置からプログラムを読み取り、プログラムに従った処理を実行する。なお、コンピュータは、可搬型記録媒体から直接プログラムを読み取り、そのプログラムに従った処理を実行することもできる。また、コンピュータは、サーバコンピュータからプログラムが転送される毎に、逐次、受け取ったプログラムに従った処理を実行することもできる。
【００７９】
【発明の効果】
以上説明したように本発明では、被分類テキストを機能語と内容語とに分割し、各々において単語数を段階的に変化させたＮ−ｇｒａｍを行う。Ｎ−ｇｒａｍごとにおける機能語の機能語特徴ベクトルと内容語の内容語特徴ベクトルとを生成し、分類モデルの手順を示している領域と手順を示してない領域とのどちらの領域に属するかを判断する。そして、Ｎ−ｇｒａｍに用いるＮが増加するとともに、機能語特徴ベクトルによる分類の性能が向上した場合又は内容語特徴ベクトルによる分類の性能が悪化した場合において高い評価値をとり、機能語特徴ベクトルによる分類の性能が悪化した場合又は内容語特徴ベクトルによる分類の性能が向上した場合に、低い評価値をとるような評価基準を用いることによって、被分類テキストの手順を示しているか否かの分類をするようにした。これにより、被分類テキストを精度よく分類することができる。
【図面の簡単な説明】
【図１】本発明の原理を説明する原理図である。
【図２】本発明の実施の形態の構成例を示す図である。
【図３】分類サーバのハードウェア構成を示すブロック図である。
【図４】分類サーバの機能ブロック図である。
【図５】特徴抽出部の詳細を示した機能ブロック図である。
【図６】特徴ベクトル成分を説明する図である。
【図７】分類器から出力される分類結果リストを示した図である。
【図８】分類結果テーブルのデータ構成例である。
【図９】機能語と内容語の分類結果を示す図である。
【図１０】分類サーバの学習時の動作の流れを示す流れ図である。
【図１１】分類サーバの分類時の動作の流れを示す流れ図である。
【符号の説明】
１コンピュータ
１ａ機能・内容語分割手段
１ｂＮ−ｇｒａｍ手段
１ｃ特徴ベクトル生成手段
１ｄ学習手段
１ｅ分類モデル
１ｆ領域判断手段
１ｇ分類手段
１０分類サーバ
１１ＳＶＭ部
１２特徴抽出部
１２ａ形態素解析器
１２ｂ解析済み文書ＤＢ
１２ｃ品詞フィルタ部
１２ｄパターン辞書ＤＢ
１２ｅパターンマイニング部
１２ｆＮ−ｇｒａｍ計算部
１２ｇベクトル生成部
１３モデル生成部
１４分類モデルＤＢ
１５分類器
１６分類結果テーブル
１７評価部
Ａ１，Ｂ１学習用テキスト
Ａ２被分類テキスト
Ｂ２被検索テキスト
[0001]
[Technical field of the invention]
The present invention relates to a text classification program, and more particularly to a text classification program for classifying text indicating a procedure.
[0002]
[Prior art]
In recent years, in addition to the accumulation of electronic documents, the spread of the Internet has made it easy to access a large amount of text on the Web, and computer-based information retrieval technology has become increasingly important. In the fields of CRM (Customer Relationship Management) and Web navigation, automatic text classification techniques are used for automatic question answering and similar tasks. These application areas require, in addition to the conventional classification based on the field or genre of a text, classification along different axes, for example, classification by the descriptive style of a text or classification that helps narrow down the candidate answers to a specific question.
[0003]
To enable searching the Web for texts about procedures, a computer is made to learn the descriptive style of texts by, for example, an SVM (Support Vector Machine). The texts to be searched are then classified in advance, based on the learning result, into texts that indicate a procedure and texts that do not, and are stored. When a user issues a keyword search request for texts about a procedure, the texts matching the keyword are retrieved from among the searched texts classified as indicating a procedure. For texts indicating a procedure to be retrieved reliably, the accuracy of the text classification obtained through SVM learning must be high.
[0004]
Note that analysis of document structure has long been practiced. For example, there is a document processing method that extracts meaningful text blocks from documents with arbitrary layouts such as tables, itemized lists, and multi-column layouts (see, for example, Patent Document 1).
[0005]
[Patent Document 1]
JP-A-2002-032770 (page 6, FIG. 8)
[0006]
[Problems to be solved by the invention]
To classify a text (a text to be classified) according to whether or not it indicates a procedure, a feature vector of that text is generated. The text is then classified according to whether or not the feature vector belongs to the region indicating a procedure in the classification model obtained by learning. Because the classification accuracy depends on the parameters on which the feature vector is based, parameters that improve the classification accuracy must be selected.
[0007]
The present invention has been made in view of the above, and an object thereof is to provide a text classification program that can classify text accurately by generating a feature vector for each of the function words and the content words of the text to be classified.
[0008]
[Means for Solving the Problems]
To solve the above problem, the present invention provides a text classification program, as shown in FIG. 1, for classifying text according to whether or not it indicates a procedure. The program causes a computer to execute a process of: dividing the text to be classified A2 into function words and content words; generating, for each of the function words and the content words, N-grams in which the number of combined words is changed stepwise; generating, for each N-gram, a function word feature vector for the function words and a content word feature vector for the content words; determining whether each of the function word feature vector and the content word feature vector belongs to the region indicating a procedure or the region not indicating a procedure in a classification model generated by learning a learning text A1; and classifying whether or not the text to be classified A2 indicates a procedure by using an evaluation criterion that, as the N used for the N-grams increases, takes a high evaluation value when the performance of classification by the function word feature vector improves or the performance of classification by the content word feature vector deteriorates, and takes a low evaluation value when the performance of classification by the function word feature vector deteriorates or the performance of classification by the content word feature vector improves.
[0009]
With such a text classification program, the text to be classified A2 is divided into function words and content words, and N-grams in which the number of words is changed stepwise are generated for each. A function word feature vector and a content word feature vector are generated for each N-gram, and it is determined whether each vector belongs to the region indicating a procedure or the region not indicating a procedure in the classification model. Whether or not the text to be classified A2 indicates a procedure is then classified by using an evaluation criterion that, as the N used for the N-grams increases, takes a high evaluation value when the performance of classification by the function word feature vector improves or the performance of classification by the content word feature vector deteriorates, and takes a low evaluation value when the performance of classification by the function word feature vector deteriorates or the performance of classification by the content word feature vector improves. This makes it possible to classify the text to be classified with high accuracy.
[0010]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a principle diagram for explaining the principle of the present invention. As shown in the figure, the computer 1 has a function/content word dividing means 1a, an N-gram means 1b, a feature vector generation means 1c, a learning means 1d, a classification model 1e, an area determining means 1f, and a classification means 1g. FIG. 1 also shows a learning text A1 and a text to be classified A2.
[0011]
The function/content word dividing means 1a divides the text to be classified A2 into function words and content words.
The N-gram means 1b generates N-grams for each of the function words and the content words divided by the function/content word dividing means 1a. The N-grams are generated so that the number of combined function words or content words changes stepwise. For example, uni-grams (1-grams), bi-grams (1-grams and 2-grams), and tri-grams (1-grams, 2-grams, and 3-grams) are generated.
[0012]
The feature vector generation means 1c generates a function word feature vector for the function words for each N-gram (for example, for each of the uni-, bi-, and tri-grams). The feature vector generation means 1c likewise generates a content word feature vector for the content words for each N-gram.
[0013]
The learning means 1d reads the learning text A1, learns from each of the function words and the content words of the learning text A1, and generates a classification model 1e. The classification model 1e has a region indicating a procedure and a region not indicating a procedure.
[0014]
The area determining means 1f determines, for each N-gram, whether the function word feature vector belongs to the region of the classification model 1e indicating a procedure or to the region not indicating a procedure. Likewise, for each N-gram, the area determining means 1f determines whether the content word feature vector belongs to the region of the classification model 1e indicating a procedure or to the region not indicating a procedure.
[0015]
The classification means 1g classifies whether or not the text to be classified indicates a procedure by using an evaluation criterion that, as the N used for the N-grams increases, takes a high evaluation value when the performance of classification by the function word feature vector, as determined by the area determining means 1f, improves or the performance of classification by the content word feature vector deteriorates, and takes a low evaluation value when the performance of classification by the function word feature vector deteriorates or the performance of classification by the content word feature vector improves. For example, among the uni-gram, bi-gram, and tri-gram function word feature vectors, the determination result of the area determining means 1f for the tri-gram function word feature vector is given more weight than its determination result for the uni-gram function word feature vector. Among the uni-gram, bi-gram, and tri-gram content word feature vectors, the determination result for the uni-gram content word feature vector is given more weight than the determination result for the tri-gram content word feature vector. The classification means 1g then classifies whether or not the text to be classified A2 indicates a procedure on the basis of these determination results.
[0016]
It has been confirmed experimentally that, for texts indicating a procedure and texts not indicating one, the classification accuracy obtained with function words tends to increase as the number of words in the N-gram increases, whereas the classification accuracy obtained with content words tends to decrease. Texts can therefore be classified accurately by weighting the classification results that the classification model gives for the function word and content word feature vectors according to the number of words in the N-gram.
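The weighting idea can be illustrated with a minimal Python sketch. It assumes the class labels +1 (procedure) and -1 (non-procedure) used in the classification result list of FIG. 7 and the cumulative levels N = 1, 1+2, 1+2+3. The weight values, the zero threshold, and the function names are hypothetical; they are not the coefficients or the score formula (1) of this document, which determines its coefficients from the learning texts.

```python
from typing import Dict

# Hypothetical weights per cumulative N-gram level (assumed values, not the
# coefficients of the patent's score formula).
FUNCTION_WEIGHTS = {"1": 1.0, "1+2": 2.0, "1+2+3": 3.0}  # grows with N
CONTENT_WEIGHTS = {"1": 3.0, "1+2": 2.0, "1+2+3": 1.0}   # shrinks with N


def weighted_score(function_classes: Dict[str, int],
                   content_classes: Dict[str, int]) -> float:
    """Sum the class labels (+1 procedure, -1 non-procedure), scaled by weight."""
    score = 0.0
    for level, weight in FUNCTION_WEIGHTS.items():
        score += weight * function_classes[level]
    for level, weight in CONTENT_WEIGHTS.items():
        score += weight * content_classes[level]
    return score


def is_procedure(score: float, threshold: float = 0.0) -> bool:
    """Treat a score at or above the (assumed) threshold as a procedure text."""
    return score >= threshold
```

With these assumed weights, a tri-gram function word judgment outweighs a uni-gram function word judgment, and the reverse holds for content words.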
[0017]
Next, a configuration example of an embodiment of the present invention will be described. FIG. 2 is a diagram illustrating a configuration example of the embodiment of the present invention. As shown in the figure, the classification server 10 is connected to a terminal device 21 and a server 22 via a network 30. The network 30 is, for example, the Internet.
[0018]
The classification server 10 receives, from the server 22 via the network 30, the searched texts that are the targets of information retrieval, together with their URLs (Uniform Resource Locators). The classification server 10 classifies each received searched text, by referring to the classification model, according to whether or not it indicates a procedure, and stores it. Specific examples of procedures include software installation procedures and cooking procedures. Specific examples of non-procedures include plain presentation of articles and enumeration of information.
[0019]
Learning texts that indicate a procedure and learning texts that do not are input to the classification server 10. The classification server 10 learns from the input learning texts and generates a classification model that serves as the model for classifying searched texts. The classification server 10 classifies each searched text received from the server 22 (that is, its feature vector) according to whether it belongs to the region of the classification model for texts indicating a procedure or to the region for texts not indicating a procedure.
[0020]
The classification server 10 receives an instruction from the terminal device 21 for a procedure search (a text search including contents indicating a procedure) or a normal search (a search for text not indicating a procedure). The classification server 10 receives the keyword from the terminal device 21 and searches the classified text to be searched according to the received instruction of the procedure search or the normal search. Then, the classification server 10 transmits the URL of the searched text that matches the keyword to the terminal device 21. The terminal device 21 can refer to the desired search target text by accessing the received URL (server 22).
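The search flow above can be sketched as follows, under the assumption that each stored searched text keeps its URL, its body, and the result of the earlier classification. The `ClassifiedText` record, the `search` function, and the plain substring match are illustrative choices, not details given in this document.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClassifiedText:
    url: str            # URL received together with the searched text
    body: str           # text content used for keyword matching
    is_procedure: bool  # result of the earlier classification


def search(index: List[ClassifiedText], keyword: str,
           procedure_search: bool) -> List[str]:
    """Return URLs of texts in the requested class that contain the keyword."""
    return [
        text.url
        for text in index
        if text.is_procedure == procedure_search and keyword in text.body
    ]
```

A procedure search and a normal search for the same keyword therefore draw on disjoint sets of stored texts.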
[0021]
Although only one terminal device 21 and one server 22 are shown, a plurality of terminal devices and servers are connected to the network 30. The classification server 10 receives searched texts from a plurality of servers and handles information searches from a plurality of terminal devices.
[0022]
Next, the hardware configuration of the classification server 10 will be described. FIG. 3 is a block diagram illustrating a hardware configuration of the classification server. As shown in the figure, the classification server 10 is entirely controlled by a CPU (Central Processing Unit) 10a. To the CPU 10a, a RAM (Random Access Memory) 10b, a hard disk drive (HDD: Hard Disk Drive) 10c, a graphic processing device 10d, an input interface 10e, and a communication interface 10f are connected via a bus 10g.
[0023]
The RAM 10b temporarily stores at least a part of an OS (Operating System) program to be executed by the CPU 10a and at least a part of an application program for learning the learning text and classifying the searched text. The RAM 10b stores various data necessary for processing by the CPU 10a. The HDD 10c stores the OS, application programs, various data, and the like.
[0024]
A monitor 10h is connected to the graphic processing device 10d. The graphic processing device 10d displays an image on the display screen of the monitor 10h according to a command from the CPU 10a. A keyboard 10i and a mouse 10j are connected to the input interface 10e. The input interface 10e transmits signals transmitted from the keyboard 10i and the mouse 10j to the CPU 10a via the bus 10g.
[0025]
The communication interface 10f is connected to the network 30. The communication interface 10f communicates with the terminal device 21 and the server 22 via the network 30.
[0026]
Note that the terminal device 21 and the server 22 can also be realized with the same hardware configuration as in FIG. However, the hard disk drives of the terminal device 21 and the server 22 store programs and data necessary for the respective processes.
[0027]
The present embodiment can be realized by the above hardware configuration.
Next, the function of the classification server 10 will be described. FIG. 4 is a functional block diagram of the classification server. As shown in the figure, the classification server 10 has an SVM unit 11. The SVM unit 11 learns data provided for learning by a support vector machine. Then, the SVM unit 11 classifies the separately provided data based on the result of learning. The SVM unit 11 includes a feature extraction unit 12, a model generation unit 13, a classification model DB 14, which is a database, a classifier 15, a classification result table 16, and an evaluation unit 17.
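As a rough illustration of the learn-then-classify behaviour of the SVM unit 11, the following sketch trains a linear support vector machine by subgradient descent on the regularised hinge loss. The training procedure, the hyperparameters (`epochs`, `lr`, `reg`), and the function names are assumptions made for this example; the document does not state which SVM formulation or solver is used.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(samples: List[Tuple[Sequence[float], int]],
                     epochs: int = 200, lr: float = 0.1,
                     reg: float = 0.01) -> Tuple[List[float], float]:
    """Fit weights w and bias b on labelled vectors (label +1 or -1)."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularisation shrinks the weights on every step.
            w = [wi * (1.0 - lr * reg) for wi in w]
            if margin < 1.0:
                # Inside the margin or misclassified: move the plane toward y.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def classify(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 (procedure side) or -1 (non-procedure side)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```

In practice an established SVM library would normally replace this hand-written loop; the sketch only shows how labelled feature vectors yield a separating plane that is then applied to new vectors.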
[0028]
FIG. 4 shows a learning text B1 and a searched text B2. The learning text B1 and the searched text B2 are written in, for example, HTML (HyperText Markup Language). The learning text B1 is collected by a person, given an identifier indicating whether or not its content indicates a procedure, and input to the SVM unit 11. The searched text B2 is input to the SVM unit 11 and is classified, based on the learning result of the SVM unit 11, as to whether or not it is a text indicating a procedure.
[0029]
The feature extraction unit 12 generates a feature vector for each of the function words and the content words of the input learning text B1 and the searched text B2. The feature extraction unit 12 outputs the generated feature vector to the model generation unit 13 and the classifier 15.
[0030]
FIG. 5 is a functional block diagram showing details of the feature extraction unit. As shown in the figure, the feature extraction unit 12 includes a morphological analyzer 12a, an analyzed document DB (DB: database) 12b, a part-of-speech filter unit 12c, a pattern dictionary DB 12d, a pattern mining unit 12e, an N-gram calculation unit 12f, and a vector generation unit 12g.
[0031]
The morphological analyzer 12a performs morphological analysis on the input learning text B1 and searched text B2, and stores the morphologically analyzed learning text B1 and searched text B2 in the analyzed document DB 12b.
[0032]
The part-of-speech filter unit 12c extracts the function words and the content words from the morphologically analyzed learning text B1 and searched text B2 stored in the analyzed document DB 12b.
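The split performed by the part-of-speech filter can be sketched as follows. This is a minimal illustration, assuming the morphological analyzer yields (surface, part-of-speech) pairs; the tag names, the romanized tokens, and the function name `split_function_content` are hypothetical and not taken from the specification.

```python
# Hypothetical part-of-speech tags; a real morphological analyzer has its own tag set.
FUNCTION_POS = {"particle", "auxiliary_verb", "affix"}
CONTENT_POS = {"noun", "verb"}


def split_function_content(tokens):
    """Split (surface, pos) pairs into function words and content words, keeping order."""
    function_words = [surface for surface, pos in tokens if pos in FUNCTION_POS]
    content_words = [surface for surface, pos in tokens if pos in CONTENT_POS]
    return function_words, content_words


# Assumed analyzer output for one itemized line (romanized for readability).
tokens = [
    ("botan", "noun"),
    ("wo", "particle"),
    ("oshi", "verb"),
    ("masu", "auxiliary_verb"),
]
function_words, content_words = split_function_content(tokens)
```

Keeping the two word lists separate from this point on is what allows a separate set of feature vectors, and a separate classification model, to be built for each.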
[0033]
The pattern mining unit 12e uses sequential pattern mining to extract patterns in which the function words extracted by the part-of-speech filter unit 12c appear repeatedly. Such repeated-appearance patterns are stored in advance in the pattern dictionary DB 12d, and the pattern mining unit 12e extracts the patterns of the extracted function words by referring to the pattern dictionary DB 12d. In the same manner, the pattern mining unit 12e refers to the pattern dictionary DB 12d and extracts, by sequential pattern mining, patterns in which the content words extracted by the part-of-speech filter unit 12c appear repeatedly.
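The dictionary lookup part of this step can be sketched as below. This is only an illustration under stated assumptions: it checks which patterns of a pre-built dictionary occur in a word sequence in order (gaps allowed), and it does not implement a full sequential pattern mining algorithm. The function names and the sample dictionary are hypothetical.

```python
def contains_in_order(sequence, pattern):
    """Return True if the items of `pattern` occur in `sequence` in the same order.

    Gaps between matched items are allowed, as in a sequential pattern.
    """
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so each match must come
    # after the previous one.
    return all(item in remaining for item in pattern)


def match_patterns(words, pattern_dictionary):
    """For each dictionary pattern, report whether it appears in `words`."""
    return [contains_in_order(words, pattern) for pattern in pattern_dictionary]


# Hypothetical function-word sequence and pattern dictionary.
words = ["wo", "te", "wo", "masu"]
pattern_dictionary = [("wo", "wo"), ("te", "masu"), ("masu", "wo")]
flags = match_patterns(words, pattern_dictionary)
```

Each resulting flag corresponds to one dictionary pattern, which is the shape needed for the pattern components of a feature vector.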
[0034]
The N-gram calculation unit 12f creates a plurality of N-grams from the function words extracted by the part-of-speech filter unit 12c, changing N (the number of combined words). For example, the N-gram calculation unit 12f first creates the uni-gram (N = 1), bi-gram (N = 2), and tri-gram (N = 3) of the function words, and then combines them to create a total of three sets: uni, uni + bi, and uni + bi + tri. Similarly, the N-gram calculation unit 12f creates a plurality of N-grams, with N changed, from the content words extracted by the part-of-speech filter unit 12c.
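The three cumulative N-gram sets described above can be sketched as follows. The function names and the string labels "1", "1+2", and "1+2+3" are illustrative choices, not identifiers from the specification.

```python
def ngrams(words, n):
    """Return all runs of n consecutive words as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def cumulative_ngram_sets(words, max_n=3):
    """Build the sets N = 1, N = 1 + 2, N = 1 + 2 + 3 from one word sequence."""
    sets = {}
    combined = []
    label_parts = []
    for n in range(1, max_n + 1):
        combined = combined + ngrams(words, n)
        label_parts.append(str(n))
        sets["+".join(label_parts)] = list(combined)
    return sets


# Hypothetical function-word sequence.
sets = cumulative_ngram_sets(["wo", "te", "masu"])
```

Each later set contains every N-gram of the earlier ones, so the N = 1 + 2 + 3 set is a superset of the N = 1 set rather than a replacement for it.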
[0035]
The vector generation unit 12g generates a feature vector from the appearance patterns of the function words extracted by the pattern mining unit 12e and each of the N-gram sets of the function words created by the N-gram calculation unit 12f. That is, the vector generation unit 12g generates three feature vectors: one from the appearance patterns and the uni-gram, one from the appearance patterns and the (uni + bi)-gram, and one from the appearance patterns and the (uni + bi + tri)-gram.
[0036]
FIG. 6 is a diagram for explaining the feature vector components. As shown in the figure, a feature vector consists of the components tf1, tf2, ..., tfi, ..., tfl, p0, p1, ..., pi, ..., pm (i, l, m: positive integers, i < l < m).
[0037]
Each of the components tf1, tf2, ..., tfi, ..., tfl is represented by, for example, a binary value of '0' or '1'. Each component corresponds to one N-gram character string. A component is '1' if a combination of function-word character strings created by the N-gram calculation unit 12f matches the character string corresponding to that component, and '0' otherwise.
[0038]
Each of the components p0, p1, ..., pi, ..., pm is likewise represented by, for example, a binary value of '0' or '1'. Each component corresponds to one repeated-appearance pattern. A component is '1' if a repeated-appearance pattern of the extracted function words matches the pattern corresponding to that component, and '0' otherwise.
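The layout of these binary components can be sketched as below: the tf components come first, one per N-gram in a fixed vocabulary, followed by the p components, one per dictionary pattern. The vocabulary, the pattern flags, and the function name are hypothetical inputs chosen for illustration.

```python
def binary_feature_vector(ngram_set, ngram_vocabulary, pattern_flags):
    """Concatenate binary N-gram components (tf) and binary pattern components (p)."""
    present = set(ngram_set)
    tf = [1 if gram in present else 0 for gram in ngram_vocabulary]
    p = [1 if flag else 0 for flag in pattern_flags]
    return tf + p


# Hypothetical vocabulary: each position is the N-gram one tf component stands for.
ngram_vocabulary = [("wo",), ("masu",), ("wo", "te")]
# N-grams actually found in one itemized portion (N = 1 + 2 set).
ngram_set = [("wo",), ("te",), ("wo", "te")]
# One flag per dictionary pattern, True if the pattern appeared.
pattern_flags = [True, False]

vector = binary_feature_vector(ngram_set, ngram_vocabulary, pattern_flags)
```

Because the vocabulary order is fixed, vectors built from the learning text and from the searched text line up component by component.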
[0039]
The vector generation unit 12g similarly generates feature vectors for the content words. The vector generation unit 12g outputs the generated feature vectors of the function words and the content words to the model generation unit 13 and the classifier 15.
[0040]
Returning to the functional block diagram of the classification server, the model generation unit 13 takes the function-word feature vectors of the learning text B1 output from the feature extraction unit 12, which are scattered over the feature space, and, referring to the identifiers assigned by a person, calculates the identification plane that divides them into those that indicate a procedure and those that do not. The model generation unit 13 stores these feature vectors and the identification plane in the classification model DB 14 as a classification model. Similarly, the model generation unit 13 generates a classification model for the content words of the learning text B1 and stores it in the classification model DB 14.
[0041]
The classifier 15 determines whether each function-word feature vector of the searched text B2 output from the feature extraction unit 12 lies in the feature space on the procedure side or on the non-procedure side of the identification plane of the function-word classification model stored in the classification model DB 14. As described above, the function-word feature vectors of the searched text B2 are generated for a plurality of N-grams, so the classifier 15 makes this determination for each feature vector. Similarly, the classifier 15 determines whether each content-word feature vector of the searched text B2 output from the feature extraction unit 12 lies in the feature space on the procedure side or on the non-procedure side of the identification plane of the content-word classification model stored in the classification model DB 14.
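The side-of-plane determination can be sketched as follows, assuming a linear identification plane of the form w·x + b = 0. The specification does not state the kernel or the learned parameters, so the weights and bias here are hypothetical values, not a trained model.

```python
def classify_side(vector, weights, bias):
    """Return 1 (procedure side) or -1 (non-procedure side) of the plane w.x + b = 0."""
    value = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1 if value >= 0 else -1


# Hypothetical plane parameters; in practice these come from SVM learning.
weights = [0.8, -0.5, 0.3, 0.6, -0.2]
bias = -0.5

procedural_like = classify_side([1, 0, 1, 1, 0], weights, bias)
non_procedural_like = classify_side([0, 1, 0, 0, 1], weights, bias)
```

The same function would be applied once per feature vector, that is, once per N-gram setting and once each for the function words and the content words.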
[0042]
The classifier 15 generates a classification result list indicating the determination result. The classifier 15 calculates a score for making a final determination whether or not the searched text B2 indicates a procedure based on the classification result list, and stores the score in the classification result table 16.
[0043]
FIG. 7 is a diagram showing a classification result list generated by the classifier. The classification result list 40 shown in the figure gives the classification results for the uni-gram feature vectors of the function words of the searched text B2. The classification result list 40 contains an id (identifier) assigned to each itemized (bulleted) portion included in the searched text B2, and a class indicating whether the feature vector of that portion belongs to the procedure-side region of the feature space (procedural type) or to the non-procedure-side region (non-procedural type).
[0044]
Procedures are in many cases written as itemized lists, so an id is assigned to each itemized portion included in the searched text B2. A class of 1 indicates that the feature vector is of the procedural type, and -1 indicates that it is of the non-procedural type.
[0045]
A classification result list is generated for each feature vector of the plurality of N-grams of the searched text B2. In the above example, classification result lists are also generated for the (uni + bi)-gram and (uni + bi + tri)-gram feature vectors. Classification result lists are likewise generated for the content words.
[0046]
FIG. 8 is a data configuration example of a classification result table. As shown in the figure, the classification result table 16 has columns for id, N = 1, N = 1 + 2, N = 1 + 2 + 3, and score.
[0047]
The id column stores the identifier assigned to an itemized portion included in the searched text B2. The N = 1 column stores the classes of the uni-gram feature vectors of the function words and the content words of the searched text B2. The N = 1 + 2 column stores the classes of the (uni + bi)-gram feature vectors of the function words and the content words of the searched text B2. The N = 1 + 2 + 3 column stores the classes of the (uni + bi + tri)-gram feature vectors of the function words and the content words of the searched text B2. The score column stores a score for determining whether or not the searched text B2 indicates a procedure.
[0048]
The calculation of the score will now be described. The classification accuracy of the function-word feature vectors tends to increase as N of the N-gram used for the feature vector components increases (uni, bi, tri). Conversely, the classification accuracy of the content-word feature vectors tends to decrease as N of the N-gram used for the feature vector components increases.
[0049]
Therefore, in determining whether or not the searched text B2 indicates a procedure, the class may be weighted more heavily as N of the N-gram increases for the function words, and less heavily as N of the N-gram increases for the content words. This increases the accuracy of classifying whether or not a procedure is indicated. Equation (1) is a score calculation formula that performs such an evaluation.
[0050]
(Equation 1)

[0051]
α₁, α₂, β₁, and β₂ in equation (1) are positive integers given for each value of N of the N-gram (N = 1, 1 + 2, 1 + 2 + 3), and are set, with reference to the classification model DB 14, to values for which classification by the score calculation formula is most accurate. Specifically, the values of the classification results of the learning text B1 are substituted into equation (1) and compared with the threshold used for judging the score, and the resulting simultaneous inequalities in four variables are solved.
[0052]
The score given by equation (1) takes a high value when, as N used for the N-gram increases, the performance of classification using the function word feature vector improves or the performance of classification using the content word feature vector deteriorates, and takes a low value when the performance of classification using the function word feature vector deteriorates or the performance of classification using the content word feature vector improves. Whether or not the searched text B2 indicates a procedure is determined, and the text is classified, according to whether the calculated score is equal to or greater than a predetermined threshold. For example, in the example of FIG. 8, if the score is less than the threshold, the searched text B2 is classified as a text that does not indicate a procedure; if the score is equal to or greater than the threshold, it is classified as a text that indicates a procedure.
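The thresholding behavior can be illustrated with the sketch below. Equation (1) itself is not reproduced in this text, so the weighted sum used here is an assumed form, not the formula of the specification: function-word classes receive weights that grow with N and content-word classes receive weights that shrink with N. The weight values, the threshold of 0, and all names are hypothetical.

```python
N_LABELS = ("1", "1+2", "1+2+3")

# Hypothetical weights, NOT the coefficients of equation (1).
FUNCTION_WEIGHTS = {"1": 1, "1+2": 2, "1+2+3": 3}  # grow with N
CONTENT_WEIGHTS = {"1": 3, "1+2": 2, "1+2+3": 1}   # shrink with N


def score(row):
    """Assumed weighted sum over one table row.

    `row` maps each N label to a (function_class, content_class) pair,
    where each class is 1 (procedural) or -1 (non-procedural).
    """
    total = 0
    for label in N_LABELS:
        function_class, content_class = row[label]
        total += FUNCTION_WEIGHTS[label] * function_class
        total += CONTENT_WEIGHTS[label] * content_class
    return total


def indicates_procedure(row, threshold=0):
    """Classify as procedural when the score reaches the threshold."""
    return score(row) >= threshold


row_a = {"1": (-1, 1), "1+2": (1, 1), "1+2+3": (1, -1)}
row_b = {"1": (1, -1), "1+2": (-1, -1), "1+2+3": (-1, 1)}
```

With these assumed weights, a function-word class obtained at N = 1 + 2 + 3 moves the score three times as far as one obtained at N = 1, while the reverse holds for content-word classes.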
[0053]
Incidentally, the tendency of the classification accuracy of the function-word and content-word feature vectors described above has been confirmed by experiment. FIG. 9 is a diagram showing the classification results of the function words and the content words. Table 41 shown in the figure gives the recall of the classification (the proportion of items belonging to a class that are correctly classified into that class) when the roles of learning and evaluation are switched between itemized lists in the computer field and itemized lists in other fields. As shown in Table 41, the recall for the function words tends to increase as N of the N-gram increases, whereas the recall for the content words tends to decrease as N of the N-gram increases. This tendency is considered to arise for the following reasons. A function word is a part of speech such as an auxiliary verb, a particle, or an affix, and depends strongly on the descriptive style of the text, so it readily captures the characteristics of sentences indicating a procedure (itemized lists). A content word, on the other hand, is a part of speech such as a noun or a verb, and depends strongly on the topic and genre of the text, so it is considered to readily capture the characteristics of ordinary (non-procedural) sentences.
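For reference, recall in the standard sense can be computed as in the sketch below. The labels and predictions are made-up illustration data, not the experimental results of Table 41, and the function name is hypothetical.

```python
def recall(true_classes, predicted_classes, positive=1):
    """Fraction of items whose true class is `positive` that were predicted as `positive`."""
    pairs = list(zip(true_classes, predicted_classes))
    true_positives = sum(1 for t, p in pairs if t == positive and p == positive)
    false_negatives = sum(1 for t, p in pairs if t == positive and p != positive)
    denominator = true_positives + false_negatives
    return true_positives / denominator if denominator else 0.0


# Made-up example: 1 = procedural type, -1 = non-procedural type.
true_classes = [1, 1, 1, -1, -1]
predicted_classes = [1, -1, 1, 1, -1]
procedural_recall = recall(true_classes, predicted_classes)
```

Computing this value once per N-gram setting, separately for function words and content words, gives the kind of comparison that the table is described as showing.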
[0054]
Returning to the functional block diagram of the classification server, the evaluation unit 17 refers to the score in the classification result table 16 and evaluates whether the searched text B2 input to the SVM unit 11 indicates a procedure.
[0055]
Next, the operation of the classification server 10 shown in FIGS. 4 and 5 will be described using a flowchart. FIG. 10 is a flowchart showing the flow of the operation of the classification server at the time of learning. The classification server 10 learns according to the following steps and generates a classification model.
[0056]
[Step S1] The feature extraction unit 12 receives the learning text B1.
[Step S2] The morphological analyzer 12a of the feature extracting unit 12 performs a morphological analysis of the learning text B1. The morphological analyzer 12a stores the morphologically analyzed learning text B1 in the analyzed document DB 12b.
[0057]
[Step S3] The part of speech filter unit 12c selects a functional word as a word to be extracted from the morphologically analyzed learning text B1 stored in the analyzed document DB 12b. When a functional word classification model is generated and the process returns to this step, the part of speech filter unit 12c selects a content word.
[0058]
[Step S4] The part-of-speech filter unit 12c extracts the functional word or the content word selected in step S3 from the morphologically analyzed learning text B1 stored in the analyzed document DB 12b.
[0059]
[Step S5] The pattern mining unit 12e refers to the pattern dictionary DB 12d and performs pattern mining on the functional words or content words extracted in Step S4.
[0060]
[Step S6] The N-gram calculation unit 12f creates an N-gram of the function word or the content word extracted in step S4.
[Step S7] The vector generation unit 12g generates a feature vector of the learning text B1 from the results of the pattern mining in step S5 and the N-gram creation in step S6.
[0061]
[Step S8] The model generation unit 13 generates a classification model from the feature vectors generated by the vector generation unit 12g. The model generation unit 13 stores the generated classification model in the classification model DB 14.
[0062]
[Step S9] The part-of-speech filter unit 12c, the pattern mining unit 12e, the N-gram calculation unit 12f, and the model generation unit 13 repeat steps S4 to S8 until feature vectors have been generated for all of N = 1, 1 + 2, and 1 + 2 + 3 of the N-gram.
[0063]
[Step S10] The process returns to step S3 to generate feature vectors for the content words. If feature vectors have been generated for both the function words and the content words, the process proceeds to step S11.
[0064]
[Step S11] The classifier 15 calculates the score of each learning text B1.
[Step S12] The classifier 15 determines α₁, α₂, β₁, and β₂ so that the searched text B2 can be classified with high accuracy.
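One simple way to choose four positive integer coefficients from labeled learning data is an exhaustive search, sketched below. This swaps in a brute-force search for the simultaneous inequalities mentioned in the specification, and the score form is an assumption because equation (1) is not reproduced in this text. The placement of the coefficients, the search range, the threshold, and the sample data are all hypothetical.

```python
from itertools import product


def score(row, a1, a2, b1, b2):
    """Assumed score form, NOT equation (1).

    `row` is ((f1, c1), (f2, c2), (f3, c3)): function-word and content-word
    classes (1 or -1) for N = 1, N = 1 + 2, and N = 1 + 2 + 3.
    """
    (f1, c1), (f2, c2), (f3, c3) = row
    return (f1 + a1 * f2 + a2 * f3) + (b1 * c1 + b2 * c2 + c3)


def accuracy(rows, labels, coefficients, threshold=0):
    """Fraction of rows whose thresholded score matches the human label."""
    correct = 0
    for row, label in zip(rows, labels):
        predicted = 1 if score(row, *coefficients) >= threshold else -1
        correct += predicted == label
    return correct / len(rows)


def search_coefficients(rows, labels, max_value=3):
    """Try every combination of positive integers up to max_value; keep the most accurate."""
    candidates = product(range(1, max_value + 1), repeat=4)
    return max(candidates, key=lambda c: accuracy(rows, labels, c))


# Made-up learning data: per-N classes and the identifier assigned by a person.
rows = [
    ((-1, 1), (-1, 1), (1, -1)),
    ((1, -1), (1, -1), (-1, 1)),
    ((1, 1), (1, 1), (1, 1)),
    ((-1, -1), (-1, -1), (-1, -1)),
]
labels = [1, -1, 1, -1]
best = search_coefficients(rows, labels)
```

An exhaustive search is only practical because there are four small integer coefficients; it is shown here for clarity rather than as the method of the specification.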
[0065]
FIG. 11 is a flowchart showing the flow of the operation of the classification server at the time of classification. The classification server 10 classifies the searched text B2 according to the following steps.
[Step S21] The feature extraction unit 12 receives the searched text B2.
[0066]
[Step S22] The morphological analyzer 12a of the feature extracting unit 12 performs a morphological analysis of the searched text B2. The morphological analyzer 12a stores the searched text B2 that has undergone the morphological analysis in the analyzed document DB 12b.
[0067]
[Step S23] The part-of-speech filter unit 12c selects function words as the words to be extracted from the morphologically analyzed searched text B2 stored in the analyzed document DB 12b. When the classes for the function words have been calculated and the process returns to this step, the part-of-speech filter unit 12c selects content words.
[0068]
[Step S24] The part-of-speech filter unit 12c extracts the function words or the content words selected in step S23 from the morphologically analyzed searched text B2 stored in the analyzed document DB 12b.
[0069]
[Step S25] The pattern mining unit 12e refers to the pattern dictionary DB 12d and performs pattern mining on the functional words or content words extracted in Step S24.
[0070]
[Step S26] The N-gram calculation unit 12f creates an N-gram of the function word or the content word extracted in step S24.
[Step S27] The vector generation unit 12g generates a feature vector of the searched text B2 from the results of the pattern mining in step S25 and the N-gram creation in step S26.
[0071]
[Step S28] The classifier 15 refers to the classification model DB 14 and determines whether the feature vector of the searched text B2 belongs to the procedure-side region or the non-procedure-side region of the classification model (that is, it calculates the class).
[0072]
[Step S29] The part-of-speech filter unit 12c, the pattern mining unit 12e, the N-gram calculation unit 12f, and the classifier 15 repeat steps S24 to S28 until a feature vector has been generated and its class calculated for all of N = 1, 1 + 2, and 1 + 2 + 3 of the N-gram.
[0073]
[Step S30] The process returns to step S23 to generate feature vectors for the content words. If feature vectors have been generated for both the function words and the content words, the process proceeds to step S31.
[0074]
[Step S31] The classifier 15 calculates a score using the classes of the feature vectors in all of N = 1, 1 + 2, 1 + 2 + 3 of N-gram.
[Step S32] The evaluation unit 17 refers to the score and determines whether the searched text B2 indicates a procedure or a non-procedure.
[0075]
As described above, N of the N-gram is changed stepwise for both the function words and the content words, and for each N-gram a feature vector is generated and a class is calculated. A score is then calculated that takes a high value when, as N used for the N-gram increases, the performance of classification by the function word feature vector improves or the performance of classification by the content word feature vector deteriorates, and a low value when the performance of classification by the function word feature vector deteriorates or the performance of classification by the content word feature vector improves. The searched text is then classified according to the score value. This allows the text to be classified with high accuracy.
[0076]
The program for realizing the above processing functions can be recorded on a computer-readable recording medium. Computer-readable recording media include magnetic recording devices, optical disks, magneto-optical recording media, and semiconductor memories. The magnetic recording device includes a hard disk device (HDD), a flexible disk (FD), and a magnetic tape. Examples of the optical disc include a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), and a CD-R (Recordable) / RW (ReWritable). Magneto-optical recording media include MO (Magneto-Optical disc).
[0077]
When distributing the program, portable recording media such as DVDs and CD-ROMs on which the program is recorded are sold. Alternatively, the program may be stored in a storage device of a server computer, and the program may be transferred from the server computer to another computer via a network.
[0078]
The computer that executes the program stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. Then, the computer reads the program from its own storage device and executes processing according to the program. The computer can also read the program directly from the portable recording medium and execute processing according to the program. Further, the computer may execute the processing according to the received program each time the program is transferred from the server computer.
[0079]
[Effects of the Invention]
As described above, in the present invention, the text to be classified is divided into function words and content words, and N-grams in which the number of combined words is changed stepwise are performed for each. A function word feature vector and a content word feature vector are generated for each N-gram, and it is determined whether each belongs to the region of the classification model that indicates a procedure or to the region that does not. Whether or not the text to be classified indicates a procedure is then classified using an evaluation criterion that, as N used for the N-gram increases, takes a high evaluation value when the performance of classification by the function word feature vector improves or the performance of classification by the content word feature vector deteriorates, and a low evaluation value when the performance of classification by the function word feature vector deteriorates or the performance of classification by the content word feature vector improves. This allows the text to be classified with high accuracy.
[Brief description of the drawings]
FIG. 1 is a principle diagram illustrating the principle of the present invention.
FIG. 2 is a diagram illustrating a configuration example of an embodiment of the present invention.
FIG. 3 is a block diagram illustrating a hardware configuration of a classification server.
FIG. 4 is a functional block diagram of a classification server.
FIG. 5 is a functional block diagram showing details of a feature extraction unit.
FIG. 6 is a diagram illustrating a feature vector component.
FIG. 7 is a diagram showing a classification result list output from a classifier.
FIG. 8 is a data configuration example of a classification result table.
FIG. 9 is a diagram showing classification results of function words and content words.
FIG. 10 is a flowchart showing a flow of an operation at the time of learning of the classification server.
FIG. 11 is a flowchart showing a flow of an operation at the time of classification of the classification server.
[Explanation of symbols]
1 Computer
1a Function word/content word splitting means
1b N-gram means
1c Feature vector generation means
1d Learning means
1e Classification model
1f Area determining means
1g Classification means
10 Classification server
11 SVM unit
12 Feature extraction unit
12a Morphological analyzer
12b Analyzed document DB
12c Part-of-speech filter unit
12d Pattern dictionary DB
12e Pattern mining unit
12f N-gram calculation unit
12g Vector generation unit
13 Model generation unit
14 Classification model DB
15 Classifier
16 Classification result table
17 Evaluation unit
A1, B1 Learning text
A2 Text to be classified
B2 Searched text

Claims

A text classification program that classifies text according to whether it indicates a procedure,
the program causing a computer to execute processing comprising:
dividing the text to be classified into function words and content words;
performing, for each of the function words and the content words, N-grams in which the number of combined words is changed stepwise;
generating a function word feature vector of the function words and a content word feature vector of the content words for each of the N-grams;
determining whether each of the function word feature vector and the content word feature vector belongs to a region indicating a procedure or a region not indicating a procedure of a classification model generated by learning a learning text; and
classifying whether or not the text to be classified indicates a procedure by using an evaluation criterion that, as N used for the N-grams increases, takes a high evaluation value when the performance of classification by the function word feature vector improves or the performance of classification by the content word feature vector deteriorates, and takes a low evaluation value when the performance of classification by the function word feature vector deteriorates or the performance of classification by the content word feature vector improves.

The text classification program according to claim 1, wherein a uni-gram, a bi-gram, and a tri-gram are performed for each of the function word and the content word.

The text classification program according to claim 1, wherein N-grams in which the number of combined words is changed stepwise are performed for each of the function words and the content words of the learning text.

The text classification program according to claim 3, wherein the classification model is generated for each of the N-grams.