JP2003330922A

JP2003330922A - Method, device and storage medium for analyzing typical sentence

Info

Publication number: JP2003330922A
Application number: JP2002134334A
Authority: JP
Inventors: Satoshi Morinaga (森永 聡); Kenji Yamanishi (山西 健司)
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2002-05-09
Filing date: 2002-05-09
Publication date: 2003-11-21
Anticipated expiration: 2022-05-09
Also published as: JP3767516B2

Abstract

PROBLEM TO BE SOLVED: To provide a method, a device, and a storage medium for analyzing a target text set appropriately and in detail.

SOLUTION: A processing unit 2 of a typical sentence analyzer 100 has expression histogram calculation means 21, which counts, for each of the categories 10 to 1n, the number of occurrences of each expression used in the texts included in that category. Category histogram calculation means 22 counts, for each of the categories 10 to 1n, the number of texts (elements) included in that category. Posterior probability calculation means 23 uses these counts to determine, for each text S of the category 10 to be analyzed, a posterior probability that indicates the typicality of the text.

COPYRIGHT: (C)2004, JPO

Description

Detailed Description of the Invention

[0001]

TECHNICAL FIELD The present invention relates to a typical sentence analysis method, apparatus, and recording medium, and more particularly to a typical sentence analysis method, apparatus, and recording medium for analyzing, from the texts belonging to a specific set, typical sentences that represent that set.

[0002]

DESCRIPTION OF THE RELATED ART In the business field, marketing analysis based on questionnaire surveys is regarded as very important for obtaining customers' evaluations of products and services. In particular, questionnaires that include a free-text answer field can capture the unfiltered voice of the customer, and the same is true of the various opinions written freely on the web. Conventionally, when extracting a text that represents a set of multiple texts (character information) written in such free-text fields, techniques for summarizing those texts have been used (see, for example, Japanese Unexamined Patent Publication No. H10-134066). This type of document analysis technique creates a single summarized sentence from the texts included in the set.

[0003]

PROBLEMS TO BE SOLVED BY THE INVENTION However, such conventional document analysis techniques cannot analyze the target text set appropriately and in detail. In the analysis of questionnaire responses, text sets with low internal similarity that span multiple customer groups with different tastes and preferences, such as "the set of response texts from men," are commonly also subject to analysis, and a useful representative text is needed for them. With conventional document analysis techniques, however, when the texts in the set under analysis are not similar to one another, no appropriate sentence representing the set can be obtained.

[0004] Moreover, more useful information could be obtained if the degree to which each text represents the set, that is, its typicality, were available, for example to extract texts corresponding to "typical examples of responses from men" or "examples unusual for responses from men." Because conventional document analysis techniques generate new text from the original text set, however, they cannot analyze, for each text that is an element of the set, how well suited that text is as a representative of the set. The present invention solves these problems, and its object is to provide a typical sentence analysis method, apparatus, and recording medium capable of analyzing the target text set appropriately and in detail.

[0005]

MEANS FOR SOLVING THE PROBLEMS To achieve this object, the typical sentence analysis method according to the present invention analyzes, for an analysis target set whose elements are a plurality of texts describing a predetermined analysis target, typical sentences representing that analysis target set from among the texts. From the analysis target set and a plurality of comparison target sets, each of whose elements are a plurality of texts describing a comparison target to be compared with the analysis target, the method calculates, for each text included in the analysis target set, the Bayesian posterior probability of that text with respect to the analysis target set, as the degree to which the text represents the analysis target set.

[0006] In addition, among the texts included in the analysis target set, one or more texts may be extracted as typical sentences of the analysis target set in descending order of Bayesian posterior probability.

[0007] When calculating the Bayesian posterior probability, the method may count, from the texts included in the analysis target set and in each comparison target set, the number of occurrences within that set of each unit expression obtained by decomposing the texts; count, for each of the analysis target set and the comparison target sets, the number of texts included in that set; and calculate the Bayesian posterior probability for each text included in the analysis target set on the basis of these counts.

[0008] When calculating the Bayesian posterior probability from the counts, the method may estimate the occurrence probability of each unit expression in each set on the basis of the occurrence counts of the unit expressions, estimate the occurrence probability of each set on the basis of the number of texts in each set, and use these estimates to calculate the Bayesian posterior probability for each text included in the analysis target set. Each set in the group of comparison target sets may be an empty set.

[0009] The typical sentence analysis apparatus according to the present invention analyzes, for an analysis target set whose elements are a plurality of texts describing a predetermined analysis target, typical sentences representing that analysis target set from among the texts. The apparatus comprises posterior probability calculation means that, from the analysis target set and a plurality of comparison target sets, each of whose elements are a plurality of texts describing a comparison target to be compared with the analysis target, calculates, for each text included in the analysis target set, the Bayesian posterior probability of that text with respect to the analysis target set, as the degree to which the text represents the analysis target set.

[0010] In addition, text extraction means may be provided that extracts, from among the texts included in the analysis target set, one or more texts as typical sentences of the analysis target set in descending order of Bayesian posterior probability.

[0011] The apparatus may further comprise expression histogram calculation means that counts, from the texts included in the analysis target set and in each comparison target set, the number of occurrences within that set of each unit expression obtained by decomposing the texts, and category histogram calculation means that counts, for each of the analysis target set and the comparison target sets, the number of texts included in that set. The posterior probability calculation means may then calculate the Bayesian posterior probability for each text included in the analysis target set on the basis of these counts.

[0012] Furthermore, the posterior probability calculation means may estimate the occurrence probability of each unit expression in each set on the basis of the occurrence counts of the unit expressions, estimate the occurrence probability of each set on the basis of the number of texts in each set, and use these estimates to calculate the Bayesian posterior probability for each text included in the analysis target set. An empty set may also be used as each set in the group of comparison target sets.

[0013] The recording medium according to the present invention records a program that causes the computer of a typical sentence analysis apparatus, which analyzes, for an analysis target set whose elements are a plurality of texts describing a predetermined analysis target, typical sentences representing that analysis target set from among the texts, to execute each of the typical sentence analysis methods described above.

[0014]

DESCRIPTION OF THE PREFERRED EMBODIMENTS Next, an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a typical sentence analysis apparatus according to an embodiment of the present invention. The typical sentence analysis apparatus 100 consists as a whole of a computer and is provided with an accumulation unit 1, a processing unit 2, analysis target text input means 3, comparison target text set group input means 4, a storage unit 5, an operation input unit 6, and a screen display unit 7.

[0015] The analysis target text input means 3 is means for inputting a set of texts (character information) collected from questionnaire surveys or the web that describe the analysis target, for example evaluations of a predetermined product. The comparison target text set group input means 4 is means for inputting sets of texts (character information), likewise collected from questionnaire surveys or the web, that describe, for example, evaluations of other products belonging to the same product type as the predetermined product. The analysis target text input means 3 and the comparison target text set group input means 4 are each implemented by general information input means, such as a keyboard for entering text data or an application that downloads text data from the web.

[0016] The accumulation unit 1 is means for accumulating the text data input from the analysis target text input means 3 as a text set called category 10, and for accumulating the text data input from the comparison target text set group input means 4 as text sets called categories 11 to 1n, one for each comparison target. The accumulation unit 1 consists of an information storage device such as a hard disk or memory. FIG. 2 shows a configuration example of the categories. For example, category 10 is a set of texts describing "opinions on product A" and contains, as elements of the text set, texts such as "Product A is hard to use. It is slow." and "Product A has become larger. Smaller would be better."

[0017] The operation input unit 6 is means by which the operator inputs operations to instruct and control the analysis processing in the processing unit 2, and consists of a keyboard and a mouse. The screen display unit 7 is means for displaying on screen the progress and results of the processing in the processing unit 2, and consists of a CRT device or an LCD device.

[0018] The storage unit 5 consists of an information storage device, such as a hard disk or memory, that stores a program 51, read in advance from a recording medium 9 and executed by the processing unit 2, together with various information used in the analysis processing of the processing unit 2. The processing unit 2 is a functional unit in which a microprocessor such as a CPU and its peripheral circuits cooperate with the program 51 in the storage unit 5 to form various functional means. The functional means provided are expression histogram calculation means 21, category histogram calculation means 22, posterior probability calculation means 23, and text extraction means 24.

[0019] The expression histogram calculation means 21 counts, for each of the categories 10 to 1n, the number of occurrences of each expression used in the texts included in that category. An expression here is a unit expression, such as a word or phrase, obtained by decomposing each text. Expressions are delimited, for example, as words belonging to a specific part of speech such as independent words, or as phrases grouped into the same class by ignoring differences of tense (present, past, future, progressive, and so on). FIG. 3 shows an example of how expressions are delimited. In this example, expressions are extracted from a text on the basis of the independent-word part of speech: from text 30, "Product A is hard to use. It is slow.", the three expressions 31 to 33, "Product A," "hard to use," and "slow," are extracted. Known morphological analysis techniques can be applied to determine which expressions are present in such a text. The expressions extracted from the texts in this way are counted per identical expression.
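The counting performed by the expression histogram calculation means 21 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names are hypothetical, and the `tokenize` helper is only a stand-in for morphological analysis that assumes each text has already been segmented into expressions separated by "|".

```python
from collections import Counter


def tokenize(text):
    # Stand-in for morphological analysis (assumption): the toy input is
    # already segmented into expressions separated by "|".
    return [tok.strip() for tok in text.split("|") if tok.strip()]


def expression_histograms(categories):
    """Return, for each category, a Counter of expression occurrences."""
    return {
        name: Counter(expr for text in texts for expr in tokenize(text))
        for name, texts in categories.items()
    }


# Hypothetical toy data: c0 is the analysis target, c1 a comparison target.
categories = {
    "c0": ["Product A | hard to use | slow", "Product A | slow"],
    "c1": ["Product B | fast"],
}
hist = expression_histograms(categories)
```

Here `hist["c0"]` holds the occurrence count of each expression over all texts of category c0, which is the quantity later used to estimate the per-category expression probabilities.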

[0020] The category histogram calculation means 22 counts, for each of the categories 10 to 1n, the number of texts, that is, elements, included in that category. The posterior probability calculation means 23 calculates a posterior probability for each text of the text set to be analyzed, that is, category 10, on the basis of the counts obtained by the expression histogram calculation means 21 and the category histogram calculation means 22.

[0021] In general, in the probabilistic-model view, given data x is regarded as a realization of a random variable X. In particular, if the probability density function of this random variable is assumed to have a fixed functional form f(x|t) with a finite-dimensional parameter t, the family of density functions F = {f(x|t); t ∈ T} is called a parametric probability model. Inferring the value of the parameter t from the data x is called estimation. A common example is the maximum likelihood method, which regards f(x|t) as a function of t (the likelihood function) and takes the t that maximizes it as the estimate.

[0022] In Bayesian statistics, by contrast, the parameter t is also regarded as a random variable, and its probability density function g(t), that is, the prior probability (Bayesian prior probability), is considered. From this prior probability and the likelihood function, Bayes' theorem yields the posterior probability (Bayesian posterior probability) g(t|x) of t. The method that takes the mean of t under the posterior probability g(t|x) as the estimate of t is called Bayesian estimation, and the method that takes the t maximizing g(t|x) as the estimate is called maximum a posteriori (MAP) estimation.
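For reference, the relations described in the two preceding paragraphs can be written out as follows. These are the standard textbook forms in the notation used above (f for the likelihood, g for the prior and posterior); the integral assumes a continuous parameter t over the parameter space T.

```latex
% Bayes' theorem for the parameter t given data x
g(t \mid x) = \frac{f(x \mid t)\, g(t)}{\int_{T} f(x \mid t')\, g(t')\, dt'}

% Maximum likelihood, Bayesian (posterior mean), and MAP estimates
\hat{t}_{\mathrm{ML}} = \arg\max_{t \in T} f(x \mid t), \qquad
\hat{t}_{\mathrm{Bayes}} = \int_{T} t\, g(t \mid x)\, dt, \qquad
\hat{t}_{\mathrm{MAP}} = \arg\max_{t \in T} g(t \mid x)
```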

[0023] The posterior probability calculation means 23 calculates the posterior probability for each text of category 10 as follows. Let category 10 be c_0, let S be an arbitrary text included in category 10, and let w_1 to w_m be the expressions appearing in the text S. The posterior probability p(c_0|S) for the text S is then given by Equation 1.

[0024]

[Equation 1]
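The equation itself is not reproduced in this text. As a hedged reconstruction only, the quantities named in the surrounding paragraphs (the category priors p(c_i) and the per-category expression probabilities p(w_j|c_i)) are consistent with a naive-Bayes posterior of the following form, in which the index i runs over the analysis target category c_0 and the n comparison categories. The independence of expressions given the category is an assumption of this sketch, and the published equation may differ in detail.

```latex
p(c_0 \mid S) =
  \frac{p(c_0) \prod_{j=1}^{m} p(w_j \mid c_0)}
       {\sum_{i=0}^{n} p(c_i) \prod_{j=1}^{m} p(w_j \mid c_i)}
```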

[0025] Here, from the counting results of the category histogram calculation means 22, let N be the total number of texts included in all categories, N_0 the number of texts included in category c_0, and N_i the number of texts included in category c_i. The prior probabilities p(c_0) and p(c_i) used in Equation 1, which indicate the occurrence probability of each category, are then estimated by Equation 2 using a Laplace-type estimator, where n denotes the number of comparison target text sets.

[0026]

[Equation 2]
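The equation itself is not reproduced in this text. One common Laplace-type (add-one) estimator that uses only the quantities defined above is sketched below; it assumes one pseudo-count per category over the n + 1 categories c_0 to c_n, and the exact smoothing constants of the published equation may differ.

```latex
p(c_i) = \frac{N_i + 1}{N + n + 1}, \qquad i = 0, 1, \ldots, n
```

With this form the estimates sum to 1 over the n + 1 categories, since the N_i sum to N.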

[0027] Likewise, from the counting results of the expression histogram calculation means 21, let K be the total number of expressions w appearing in the texts included in all categories, K_{0,w_j} the number of occurrences of expression w_j in the texts included in category c_0, and K_{i,w_j} the number of occurrences of expression w_j in the texts included in category c_i. The occurrence probability p(w_j|c_0) of expression w_j in category c_0 and the occurrence probability p(w_j|c_i) of expression w_j in category c_i, both used in Equation 1, are then estimated by Equation 3 using a Laplace-type estimator, where K_0 is the total number of expression occurrences in category c_0 and K_i is the total number of expression occurrences in category c_i.

[0028]

[Equation 3]
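The equation itself is not reproduced in this text. A common Laplace-type (add-one) estimator built from the quantities defined above is sketched below. It assumes that K counts the distinct expressions observed across all categories, which is one reading of the definition given above; the published equation may use different smoothing constants.

```latex
p(w_j \mid c_i) = \frac{K_{i,w_j} + 1}{K_i + K}, \qquad i = 0, 1, \ldots, n
```

Under that assumption the estimates for a given category sum to 1 over the K distinct expressions, and an expression that never occurs in a category still receives a small nonzero probability.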

[0029] On the basis of the posterior probabilities obtained in this way by the posterior probability calculation means 23 for the texts of category 10, the text extraction means 24 extracts texts from category 10 in descending or ascending order of posterior probability and outputs them for display on the screen display unit 7.
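The extraction step can be sketched as a simple sort over (text, posterior) pairs. The function name `extract_texts` and its parameters are hypothetical; the posterior values in the example are made-up numbers used only to show the ordering.

```python
def extract_texts(texts, posteriors, k=1, highest_first=True):
    """Return the k texts with the highest (or lowest) posterior probability.

    texts and posteriors are parallel sequences; each result item is a
    (text, posterior) pair.
    """
    ranked = sorted(
        zip(texts, posteriors),
        key=lambda pair: pair[1],
        reverse=highest_first,
    )
    return ranked[:k]


# Illustrative values only.
texts = ["text a", "text b", "text c"]
posteriors = [0.2, 0.9, 0.5]
most_typical = extract_texts(texts, posteriors, k=2)
least_typical = extract_texts(texts, posteriors, k=1, highest_first=False)
```

Sorting with `highest_first=True` corresponds to extracting typical examples, and `highest_first=False` to extracting examples that are unusual for the set.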

[0030] Next, the operation of the typical sentence analysis apparatus according to this embodiment will be described with reference to FIG. 4. FIG. 4 is a flowchart showing the operation of the typical sentence analysis apparatus. In response to the operator's processing start operation on the operation input unit 6, the analysis target text input means 3 first takes in the text set to be analyzed and stores it in the accumulation unit 1 as category 10 (step 200). Before or after this, the comparison target text set group input means 4 takes in the group of text sets to be compared and stores them in the accumulation unit 1 as categories 11 to 1n (step 201).

[0031] Next, the expression histogram calculation means 21 of the processing unit 2 counts, for each of the categories 10 to 1n, the number of occurrences of each expression used in the texts included in that category (step 202). Before or after this, the category histogram calculation means 22 counts, for each of the categories 10 to 1n, the number of texts, that is, elements, included in that category (step 203). The posterior probability calculation means 23 then estimates the occurrence probabilities p(w_j|c_0) and p(w_j|c_i) of each expression from the counts of the expression histogram calculation means 21 (step 204) and estimates the prior probabilities p(c_0) and p(c_i) from the counts of the category histogram calculation means 22 (step 205), in either order, and uses these estimates to obtain the posterior probability p(c_0|S) for each text S of category 10 (step 206).
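Steps 202 to 206 can be sketched end to end as follows. This is an illustration under stated assumptions rather than the embodiment's implementation: texts are assumed to be already decomposed into lists of expressions, the add-one (Laplace) smoothing constants and the naive-Bayes product form are assumptions because the equations are not reproduced in this text, and the function name `posterior_probabilities` is hypothetical.

```python
from collections import Counter
from math import prod


def posterior_probabilities(categories, target="c0"):
    """Return p(target | S) for every text S in the target category.

    categories maps a category name to a list of texts, each text being a
    list of expressions (already segmented).
    """
    # Step 202: occurrence count of each expression per category.
    expr_counts = {
        c: Counter(e for text in texts for e in text)
        for c, texts in categories.items()
    }
    # Step 203: number of texts per category.
    text_counts = {c: len(texts) for c, texts in categories.items()}

    vocab = set().union(*expr_counts.values())  # distinct expressions
    total_texts = sum(text_counts.values())
    num_categories = len(categories)  # n comparison sets + 1

    # Step 204: add-one estimate of p(w | c) (assumed smoothing form).
    def expr_prob(expr, c):
        total = sum(expr_counts[c].values())
        return (expr_counts[c][expr] + 1) / (total + len(vocab))

    # Step 205: add-one estimate of the prior p(c) (assumed smoothing form).
    prior = {
        c: (text_counts[c] + 1) / (total_texts + num_categories)
        for c in categories
    }

    # Step 206: naive-Bayes posterior of the target category for each text.
    result = []
    for text in categories[target]:
        joint = {
            c: prior[c] * prod(expr_prob(e, c) for e in text)
            for c in categories
        }
        result.append(joint[target] / sum(joint.values()))
    return result


# Hypothetical toy data: c0 is the analysis target, c1 a comparison target.
toy = {
    "c0": [["Product A", "slow"], ["Product A", "hard to use"]],
    "c1": [["Product B", "fast"]],
}
scores = posterior_probabilities(toy)
```

For long texts the product of many small probabilities can underflow; summing logarithms instead is a common remedy, at the cost of a slightly longer sketch.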

[0032] Thereafter, as necessary, the text extraction means 24 extracts the texts included in category 10 in descending or ascending order of their posterior probabilities and outputs them for display on the screen display unit 7 (step 207), and the series of processing ends.

[0033] The posterior probability obtained for each text of category 10 under analysis can be interpreted as "the probability that a text having the same expressions as that text, when given, belongs to category 10." Accordingly, by obtaining a posterior probability for each text under analysis and using it as the degree to which the text represents the analysis target, that is, its typicality, the set can be analyzed appropriately even when the texts are not similar to one another, unlike with conventional techniques. In addition, because a typicality is obtained individually for each text rather than a new text being generated as in conventional techniques, more detailed analysis results are obtained.

[0034] The above description covers the case in which the posterior probability itself is used as the typicality, but a function of the posterior probability that is simpler to compute may be used as the typicality instead. For example, the reciprocal of the posterior probability minus 1 requires less computation than the posterior probability itself. Furthermore, a quantity such as the typicality per expression, calculated from the posterior probability or from a typicality defined as a function of it, may be newly defined as a modified typicality. In that case, the modified typicality of a text S is the unmodified typicality divided by the number of expressions in the text S.

[0035] FIG. 5 shows an example of the analysis results output by the typical sentence analysis apparatus 100 according to this embodiment. Here, as shown in FIG. 4, the analysis was performed with the texts describing product A as category 10, the analysis target, and the texts describing other products B, C, and so on as categories 11 to 1n, the comparison targets, and the ten texts with the highest typicality were extracted as typical sentences. The typicality value used is the natural logarithm of the reciprocal of the posterior probability minus 1, divided by the number of expressions in the text.

[0036] The above description uses, as an example, the case in which a group of comparison target text sets, that is, categories 11 to 1n, is used. Even if these text sets are empty sets, however, the typicality of each text can be obtained without any problem in the same way as described above. When only empty sets are input as comparison targets, the comparison target text set group input means 4 is unnecessary.

[0037]

EFFECTS OF THE INVENTION As described above, the present invention calculates, from an analysis target set whose elements are a plurality of texts relating to a predetermined analysis target and a comparison target group consisting of a plurality of comparison target sets, each of whose elements are a plurality of texts relating to a comparison target to be compared with the analysis target, the Bayesian posterior probability of each text included in the analysis target set with respect to the analysis target set, as the degree to which that text represents the analysis target set. As a result, the set can be analyzed appropriately even when the texts are not similar to one another, unlike with conventional techniques. In addition, because a typicality is obtained individually for each text rather than a new text being generated as in conventional techniques, more detailed analysis results are obtained.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of a typical sentence analysis apparatus according to an embodiment of the present invention.

FIG. 2 is a configuration example of the categories.

FIG. 3 is an example of how expressions are delimited.

FIG. 4 is a flowchart showing the operation of the typical sentence analysis apparatus.

FIG. 5 is an example of analysis result output.

[Explanation of symbols]

1: accumulation unit; 10 to 1n: categories; 2: processing unit; 21: expression histogram calculation means; 22: category histogram calculation means; 23: posterior probability calculation means; 24: text extraction means; 3: analysis target text input means; 4: comparison target text set group input means; 5: storage unit; 6: operation input unit; 7: screen display unit; 9: recording medium.

Claims

1. A typical sentence analysis method for analyzing a typical sentence representative of the analysis target set from each of the texts, with respect to the analysis target set having a plurality of texts that describe a predetermined analysis target as elements. From the analysis target set and a plurality of comparison target sets each having a plurality of texts indicating a description of each comparison target to be compared with the analysis target, each text included in the analysis target set represents the analysis target set. The typical sentence analysis method is characterized in that, for each text included in the analysis target set, a Bayesian posterior probability of the text with respect to the analysis target set is calculated.

2. The typical sentence analysis method according to claim 1, wherein among the texts included in the analysis target set, one or more texts are extracted as a typical sentence of the analysis target set in descending order of Bayesian posterior probabilities. A typical sentence analysis method characterized by the above.

3. The typical sentence analysis method according to claim 1, wherein, when the Bayesian posterior probability is calculated, the number of appearances in each set of each unit expression obtained by decomposing the texts included in the analysis target set and in each comparison target set is counted, the number of texts included in each of the analysis target set and the comparison target sets is counted for each set, and the Bayesian posterior probability for each text included in the analysis target set is calculated based on these counting results.

4. The typical sentence analysis method according to claim 3, wherein, when the Bayesian posterior probability is calculated from the counting results, the appearance probability of each unit expression in each set is estimated based on the number of appearances of that unit expression, the appearance probability of each set is estimated based on the number of texts in that set, and the Bayesian posterior probability for each text included in the analysis target set is calculated using these estimation results.
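
The counting and estimation steps recited in claims 3 and 4, together with the descending-order extraction of claim 2, can be illustrated with a minimal Python sketch. This is not the implementation disclosed in the specification. It assumes that each text has already been decomposed into a list of unit expressions, that unit expressions occur independently within a set (a naive Bayes style product), and that probabilities are estimated with additive smoothing controlled by a hypothetical parameter `alpha`. The function names `typicality_scores` and `top_typical` are likewise hypothetical.

```python
import math
from collections import Counter


def typicality_scores(target, comparisons, alpha=1.0):
    """Return the posterior P(target set | text) for each text in the target set.

    target:      list of texts, each a list of unit expressions
    comparisons: list of comparison sets, each a list of such texts
    alpha:       additive smoothing constant (an assumption of this sketch)
    """
    sets = [target] + list(comparisons)

    # Expression histogram: occurrences of each unit expression per set.
    expr_counts = [Counter(tok for text in s for tok in text) for s in sets]
    # Category histogram: number of texts per set.
    text_counts = [len(s) for s in sets]

    vocab_size = len(set().union(*expr_counts))
    n_texts = sum(text_counts)

    def log_joint(text, k):
        # log P(set k) + sum over expressions of log P(expression | set k),
        # both estimated from the counts with additive smoothing.
        prior = (text_counts[k] + alpha) / (n_texts + alpha * len(sets))
        total = sum(expr_counts[k].values())
        score = math.log(prior)
        for tok in text:
            score += math.log(
                (expr_counts[k][tok] + alpha) / (total + alpha * vocab_size)
            )
        return score

    scores = []
    for text in target:
        logs = [log_joint(text, k) for k in range(len(sets))]
        m = max(logs)  # subtract the maximum for numerical stability
        denom = sum(math.exp(x - m) for x in logs)
        scores.append(math.exp(logs[0] - m) / denom)
    return scores


def top_typical(target, comparisons, k=1):
    """Return the k texts of the target set with the highest posterior."""
    scores = typicality_scores(target, comparisons)
    order = sorted(range(len(target)), key=lambda i: scores[i], reverse=True)
    return [(target[i], scores[i]) for i in order[:k]]
```

Under these assumptions, a text built from expressions that are frequent in the analysis target set and rare in the comparison target sets receives a posterior close to 1, so it is ranked first by `top_typical`.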

5. The typical sentence analysis method according to claim 1, wherein each set of the comparison target set group is an empty set.

6. A typical sentence analysis apparatus for analyzing, with respect to an analysis target set having as elements a plurality of texts that describe a predetermined analysis target, a typical sentence representative of the analysis target set from those texts, the apparatus characterized by comprising posterior probability calculation means for calculating, using the analysis target set and a plurality of comparison target sets, each having as elements a plurality of texts that describe a respective comparison target to be compared with the analysis target, a Bayesian posterior probability of the text with respect to the analysis target set for each text included in the analysis target set, as the degree to which that text represents the analysis target set.

7. The typical sentence analysis apparatus according to claim 6, further comprising text extraction means for extracting, from among the texts included in the analysis target set, one or more texts as typical sentences of the analysis target set in descending order of Bayesian posterior probability.

8. The typical sentence analysis apparatus according to claim 1, further comprising: expression histogram calculation means for counting the number of appearances in each set of each unit expression obtained by decomposing the texts included in the analysis target set and in each comparison target set; and category histogram calculation means for counting, for each set, the number of texts included in each of the analysis target set and the comparison target sets, wherein the posterior probability calculation means calculates the Bayesian posterior probability for each text included in the analysis target set based on these counting results.

9. The typical sentence analysis apparatus according to claim 3, wherein the posterior probability calculation means estimates the appearance probability of each unit expression in each set based on the number of appearances of that unit expression, estimates the appearance probability of each set based on the number of texts in that set, and calculates the Bayesian posterior probability for each text included in the analysis target set using these estimation results.

10. The typical sentence analysis apparatus according to claim 6, wherein an empty set is used as each set of the comparison target set group.

11. A recording medium recording a program for causing a computer of a typical sentence analysis apparatus, which analyzes, with respect to an analysis target set having as elements a plurality of texts that describe a predetermined analysis target, a typical sentence representative of the analysis target set from those texts, to execute the typical sentence analysis method according to any one of claims 1 to 5.