JP5404287B2

JP5404287B2 - Document analysis apparatus and method

Info

Publication number: JP5404287B2
Application number: JP2009229501A
Authority: JP
Inventors: Haruo Hayashi (林 春男)
Original assignee: トレンドリーダーコンサルティング株式会社
Priority date: 2009-10-01
Filing date: 2009-10-01
Publication date: 2014-01-29
Anticipated expiration: 2029-10-01
Also published as: JP2011076524A

Description

The present invention relates to a document analysis apparatus and method, and more particularly to a novel document analysis apparatus and method that can meaningfully analyze not only language materials that grow over time, such as news, web news, blogs, newspapers, and magazines, but also language materials that do not grow over time, such as interview records, written statements, questionnaires, and novels.

In disaster research, social surveys using questionnaires administered by mail, by interview, or by the leave-and-collect method have become an indispensable means of investigating the actual conditions of disaster-stricken communities and residents' awareness of risks and risk countermeasures.

A questionnaire contains precoded question items, beginning with face-sheet items, for which the respondent selects one or more answers from a set of options, and open-ended question items, for which the respondent writes a free description. Among the latter, the item in which respondents write their own opinions and thoughts is customarily placed at the end of the questionnaire. Data obtained from such items are called free answers or free-answer descriptions.

In social surveys on disasters and in questionnaire surveys generally, free-answer data are very difficult to analyze and have been left unanalyzed in many social surveys. Systematic analysis methods such as simple tabulation, cross tabulation, and multivariate analysis are available for data obtained in precoded form, whereas free-answer descriptions are difficult to tabulate, which hinders analysis, so reports often merely list the descriptions themselves. For example, free-answer descriptions were obtained in a survey of people affected by the Miyakejima eruption disaster, but the report only presents a few cases under a rough classification.

Open-ended question items can yield important information that cannot be investigated with the precoded questions in the questionnaire. In surveys of disaster victims in particular, respondents often describe their post-disaster dissatisfaction and difficulties, so concrete, substantive data on the actual conditions of the affected community can be expected, and analyzing such data properly is of great significance.

Examples of analyzed free-answer descriptions in the field of disaster research are known from Non-Patent Documents 1 and 2. In one, free-answer descriptions of the lessons and experiences of the earthquake, collected in a questionnaire survey of households affected by the 1995 Hanshin-Awaji Earthquake, were classified into the seven elements of life reconstruction. In the other, free-answer descriptions from a similar survey, which asked what lessons the victims wanted to pass on to people in other regions, were classified and structured by the KJ method. However, these analyses were carried out by hand and presumably required a great deal of labor.

On the other hand, several studies that developed analysis methods for free-answer descriptions have been reported. These use natural language processing and text mining, and fall broadly into methods that extract keywords using word frequency and the like (Non-Patent Document 3), methods that automatically select important free-answer descriptions (Non-Patent Document 4), and methods that automatically cluster free-answer descriptions (Non-Patent Document 5, etc.).

Meanwhile, the present inventors proposed a novel document analysis apparatus and method in Patent Document 1. In this background art, a body of language material (corpus) relating to a disaster or crisis is treated as a corpus that grows over time, two indices are defined, namely an incremental TFIDF (増加型TFIDF) obtained by modifying TFIDF and a singular value, and keywords are extracted automatically.

Non-Patent Document 1: Haruo Hayashi (林春男) (ed.), "Survey on changes in place of residence and actual living conditions after the earthquake," Technical Report, Research Center for Disaster Reduction Systems, Disaster Prevention Research Institute, Kyoto University, 1999.
Non-Patent Document 2: Itsuki Nakabayashi (中林一樹), Kunihiro Fukudome (福留邦洋), Makiko Kawakami (河上牧子), "Lessons from victims of the Hanshin-Awaji Earthquake: from an analysis of questionnaire free answers in Hyogo, Nagata, and Suma Wards," Proceedings of the Institute of Social Safety Science (地域安全学会梗概集), No. 9, pp. 146-149, 1999.
Non-Patent Document 3: Noboru Ohsumi (大隈昇), Ludovic Lebart, "Analysis of free-answer data in surveys: exploratory textual data analysis with InforMiner," Proceedings of the Institute of Statistical Mathematics (統計数理), Vol. 48, No. 2, pp. 339-376, 2000.
Non-Patent Document 4: Naohiro Matsumura (松村真宏), Daisuke Kawahara (河原大輔), Masashi Okamoto (岡本雅史), Sadao Kurohashi (黒橋禎夫), Toyoaki Nishida (西田豊明), "Extraction of the 'questions' lurking behind messages," Transactions of the Japanese Society for Artificial Intelligence (人工知能学会論文誌), Vol. 22, No. 1, pp. 93-102, 2007.
Non-Patent Document 5: Hiroko Inui (乾裕子), Maki Tamura (田村真樹), Kiyotaka Uchimoto (内元清貴), Hitoshi Isahara (井佐原均), "Automatic classification of open-ended questionnaire answers based on intention, focusing on surface expressions," Journal of Natural Language Processing (自然言語処理), Vol. 10, No. 2, pp. 14-102, 2007.
Patent Document 1: WO 2008/062910 A1 [G06F 17/30]

The background art of Patent Document 1 is intended for analyzing a corpus that grows over time. Free-answer descriptions, by contrast, are data collected and formed at a single point in time; they do not lie on a real time axis and therefore cannot be arranged in temporal order. Consequently, a corpus made up of free-answer descriptions cannot be analyzed directly with the background art of Patent Document 1.

Therefore, a main object of the present invention is to provide a novel document analysis apparatus and method.

Another object of the present invention is to provide a document analysis apparatus and method capable of analyzing language material that does not grow over time, such as free-answer descriptions.

A further object of the present invention is to provide a document analysis apparatus and method capable of analyzing, on the basis of the concept of singular values, language material that does not grow over time, such as free-answer descriptions.

To solve the above problems, the present invention adopts the following configurations. The reference signs and supplementary notes in parentheses indicate correspondence with the embodiments described later in order to aid understanding of the invention, and do not limit the invention in any way.

A first invention is a document analysis apparatus that analyzes, on the basis of an incremental TFIDF, language material that is made to grow in a pseudo manner according to an order criterion, and that obtains a singular value for each morpheme by performing a residual analysis between an estimate based on the cumulative value of the incremental TFIDF up to the previous corpus and the cumulative value of the incremental TFIDF in the current corpus, the apparatus comprising: ascending cumulative singular value calculation means for calculating, for each morpheme, an ascending cumulative singular value obtained when the language material is arranged in ascending order of the order criterion; descending cumulative singular value calculation means for calculating, for each morpheme, a descending cumulative singular value obtained when the language material is arranged in descending order of the order criterion; and average cumulative singular value calculation means for calculating an average cumulative singular value by averaging the ascending cumulative singular value and the descending cumulative singular value.

In the first invention, the document analysis apparatus is typically implemented as a computer. Whereas the background art targeted language material (a corpus) whose unit documents increase over time, the present invention targets a corpus based on free-answer descriptions that can be arranged by an arbitrary order criterion, such as age or date and time of occurrence. The technique of morphologically analyzing each document (corpus text) and then, for the significant morphemes, extracting singular words by performing a residual analysis of the current corpus against the regression curve calculated up to the previous corpus can nevertheless be applied as is.

In the morphological analysis, when the text data are in a language whose morphemes are not delimited, such as Japanese, a morphological analysis tool such as ChaSen (茶筌, http://chasen.naist.jp/hiki/ChaSen/) is used to decompose the corpus text data into morphemes and to attach part-of-speech information to each morpheme. For a language such as English, in which the morphemes in the text are already separated, the work of splitting the text into morphemes (word segmentation, tokenization) is unnecessary; instead, the morphological analysis means restores inflected forms to their base forms by stemming and then attaches part-of-speech information to each morpheme of the text by, for example, tagging.

Based on the part-of-speech information attached to each morpheme, morphemes whose part of speech has been designated as unnecessary are removed. That is, during morphological analysis, the part-of-speech information given to each morpheme is used to decide whether to adopt the morpheme as a candidate singular word and/or common word. The parts of speech treated as unnecessary can be set arbitrarily. For English text, unnecessary morphemes are removed by referring to a list of stop words, that is, articles, prepositions, and other words that are used so frequently that they are unsuitable as keywords.
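As an illustration of the stop-word removal described for English text, the following is a minimal Python sketch, not the implementation of the embodiment. The stop-word list `STOP_WORDS` and the function name `candidate_morphemes` are hypothetical, and stemming and part-of-speech tagging are omitted for brevity.

```python
import re

# Hypothetical, deliberately tiny stop-word list (articles, prepositions, etc.).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "for", "and", "is", "was"}


def candidate_morphemes(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]


print(candidate_morphemes("The water supply was cut off in the shelter"))
# prints ['water', 'supply', 'cut', 'off', 'shelter']
```

For Japanese text, the tokenization step would instead be delegated to a morphological analyzer, and the filter would act on the part-of-speech tags rather than on a word list.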

For each morpheme remaining in the corpus, the TF (term frequency), that is, the frequency (total count) with which the keyword candidate appears in the unit document, is calculated, and an IDF (inverse document frequency) that takes the time parameter (order criterion) into account, that is, a uniqueness value indicating that the morpheme does not appear elsewhere, is also calculated. The incremental TFIDF (term frequency-inverse document frequency) of the morpheme in the corpus is then calculated as TF × IDF.
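The TF × IDF calculation with an order-dependent IDF can be sketched as follows. This passage does not give the exact IDF formula, so the sketch assumes IDF = log(k / df) + 1, where k is the position of the current unit document in the chosen order and df is the number of documents up to position k that contain the morpheme. This form and the function name `incremental_tfidf` are assumptions made for illustration only.

```python
import math
from collections import Counter


def incremental_tfidf(docs):
    """docs: list of token lists, already arranged by the order criterion.

    Returns one dict per document mapping each morpheme to TF * IDF, where
    the IDF only uses documents 1..k (assumed form: log(k / df) + 1).
    """
    df = Counter()  # number of documents seen so far that contain each morpheme
    scores = []
    for k, tokens in enumerate(docs, start=1):
        tf = Counter(tokens)   # term frequency within document k
        df.update(tf.keys())   # each morpheme counts once per document
        scores.append({t: n * (math.log(k / df[t]) + 1.0) for t, n in tf.items()})
    return scores


scores = incremental_tfidf([["water", "water", "road"], ["road"], ["water"]])
print(scores[0])  # prints {'water': 2.0, 'road': 1.0}
```

Because the document frequency is updated as the documents are read, the same corpus gives different scores when it is read in the opposite order, which is what makes the ascending and descending analyses differ.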

In the residual analysis, for example, the estimated cumulative value of the incremental TFIDF of the morpheme, estimated at the previous corpus, is compared with the measured cumulative value of the incremental TFIDF at the current corpus. The resulting residual (singular value) of the morpheme is obtained, and a morpheme with a positive singular value is selected as a singular word of that corpus.
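The residual analysis can be sketched in Python as below. The text only speaks of an estimate derived from the values up to the previous corpus, so fitting a straight line by least squares is an assumption of this sketch, as is the function name `singular_values`; the regression curve used in the embodiment may differ.

```python
def singular_values(tfidf_series):
    """tfidf_series: incremental TFIDF of one morpheme at positions 1..n.

    Returns the residuals D_k: the measured cumulative value at position k
    minus the value extrapolated from a straight line fitted to positions
    1..k-1. None is returned where fewer than two earlier points exist.
    """
    cumulative, total = [], 0.0
    for v in tfidf_series:
        total += v
        cumulative.append(total)

    residuals = []
    for k in range(len(cumulative)):  # k is 0-based; the position is k + 1
        if k < 2:
            residuals.append(None)
            continue
        xs = list(range(1, k + 1))
        ys = cumulative[:k]
        mean_x = sum(xs) / k
        mean_y = sum(ys) / k
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        predicted = mean_y + (sxy / sxx) * ((k + 1) - mean_x)
        residuals.append(cumulative[k] - predicted)
    return residuals


print(singular_values([1.0, 1.0, 5.0]))  # prints [None, None, 4.0]
```

A morpheme whose cumulative value jumps above the extrapolated trend, as at the third position here, receives a positive residual and would be selected as a singular word for that corpus.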

The cumulative singular value (ΣD) of each morpheme is calculated with this residual analysis technique of the background art. The ascending cumulative singular value calculation means (S31-S35) calculates, for each morpheme, the ascending cumulative singular value obtained when the corpus texts are arranged in ascending order of the order criterion and analyzed, and the descending cumulative singular value calculation means (S37-S41) calculates, for each morpheme, the descending cumulative singular value obtained when the corpus texts are arranged in descending order of the order criterion.

The average cumulative singular value calculation means (S43) then calculates the arithmetic mean of the ascending cumulative singular value ΣD(i,ord,asc) and the descending cumulative singular value ΣD(i,ord,dsc), that is, the average cumulative singular value aveΣD. This averaging yields a more objective (or representative) analysis result from which the characteristics of the particular order criterion have been removed. For example, when age is used as the order criterion, a morpheme (word) with a large ascending cumulative singular value is characteristic of older respondents, whereas a morpheme (word) with a large descending cumulative singular value is characteristic of younger respondents. The average cumulative singular value therefore indicates how prominent the morpheme (keyword) is in both the younger and older groups, in other words how representative a keyword it is of the corpus as a whole. The average cumulative singular value can thus also be called a representative keyword value, and this representative keyword value aveΣD serves as one index for evaluating free-answer descriptions (unit documents).
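The averaging step is a plain arithmetic mean per morpheme, as the following minimal Python sketch shows. The function name and the numbers are made up for illustration, and treating a morpheme that is missing from one ordering as 0 is an assumption of the sketch.

```python
def average_cumulative_singular_values(asc, dsc):
    """asc, dsc: {morpheme: cumulative singular value} for the ascending and
    descending orderings. Returns {morpheme: aveΣD}.

    A morpheme missing from one ordering is treated as 0 (an assumption).
    """
    return {m: (asc.get(m, 0.0) + dsc.get(m, 0.0)) / 2.0 for m in set(asc) | set(dsc)}


# Made-up values with age as the order criterion.
asc = {"pension": 8.0, "school": 1.0, "water": 6.0}
dsc = {"pension": 1.0, "school": 9.0, "water": 6.0}
ave = average_cumulative_singular_values(asc, dsc)
print(sorted(ave.items(), key=lambda kv: -kv[1]))
# prints [('water', 6.0), ('school', 5.0), ('pension', 4.5)]
```

In this toy case "water" scores moderately in both orderings and ends up as the most representative keyword, while "pension" and "school" are each prominent at only one end of the age range.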

A second invention depends on the first invention and is a document analysis apparatus further comprising cumulative singular value graph display means for displaying a cumulative singular value graph in which the ascending cumulative singular value is plotted on one of two axes and the descending cumulative singular value on the other.

In the second invention, the cumulative singular value graph display means (S45) displays, for example, a cumulative singular value graph (FIG. 21) with the ascending cumulative singular value on the vertical axis and the descending cumulative singular value on the horizontal axis. With this graph, when the order criterion is, for example, the date of occurrence of an incident, keywords characteristic of articles from older periods can easily be found where the horizontal-axis value is large, and keywords characteristic of newer periods where the vertical-axis value is large.

A third invention depends on the first or second invention and is a document analysis apparatus further comprising: ascending cumulative singular value sum calculation means for calculating the sum of the ascending cumulative singular values of morphemes having top-ranked cumulative singular values; descending cumulative singular value sum calculation means for calculating the sum of the descending cumulative singular values of morphemes having top-ranked cumulative singular values; and average cumulative singular value sum calculation means for calculating an average cumulative singular value sum by averaging the ascending cumulative singular value sum and the descending cumulative singular value sum.

The ascending/descending cumulative singular value ΣD described above is an index of how important the word (morpheme) is under the ascending/descending order of the order criterion. It follows that an article containing more morphemes (words) with large cumulative singular values ΣD, and hence high weights, can be regarded as a more important free description (article). The third invention therefore adopts ΣΣD (the cumulative singular value sum), which indicates how many important morphemes a single piece of corpus data contains. If ΣΣD were used as is, however, its value would depend on the length of the description (the number of words). In addition, the cumulative singular value ΣD is affected by whether the documents in the corpus are arranged in ascending or descending order of the order criterion, so the cumulative singular value sum ΣΣD, which is the sum of the cumulative singular values ΣD of the morphemes in each document (article), is likewise affected by the ascending/descending order. To eliminate this influence of the corpus, the third invention adopts the top-ranked ascending cumulative singular value sum ΣΣD(j,ord,asc,rank) and the top-ranked descending cumulative singular value sum ΣΣD(j,ord,dsc,rank), and the average cumulative singular value sum calculation means (S63) calculates their arithmetic mean to obtain the average cumulative singular value sum aveΣΣD(j,ord,rank).

The ascending cumulative singular value ΣD(j,ord,asc,rank) and the descending cumulative singular value ΣD(j,ord,dsc,rank) under the order criterion take high values for morphemes (words) with more specific meanings. If these words and their weights are used as they are, then, like the ascending cumulative singular value sum ΣΣD(j,ord,asc,rank) and the descending cumulative singular value sum ΣΣD(j,ord,dsc,rank), they strongly reflect the characteristics of one end of the order criterion. On the other hand, a homogeneous representative value that does not depend on the effect of the order criterion is also needed. Therefore, to cancel the effect of the order criterion, the mean of the two indices is taken and used as an index of how representative the unit document (free-answer description) is.
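The per-document sums restricted to top-ranked morphemes, and their mean, can be sketched in Python as follows. The function names are hypothetical, the numbers are made up, and counting each distinct morpheme of a document once (rather than once per occurrence) is an assumption of this sketch that the text does not settle.

```python
def top_ranked(cumulative, rank):
    """Return the `rank` morphemes with the largest cumulative singular values."""
    return set(sorted(cumulative, key=cumulative.get, reverse=True)[:rank])


def document_sum(tokens, cumulative, rank):
    """ΣΣD for one document: sum of ΣD over its distinct top-ranked morphemes."""
    top = top_ranked(cumulative, rank)
    return sum(cumulative[m] for m in set(tokens) if m in top)


def average_document_sum(tokens, asc, dsc, rank):
    """aveΣΣD: arithmetic mean of the ascending and descending document sums."""
    return (document_sum(tokens, asc, rank) + document_sum(tokens, dsc, rank)) / 2.0


# Made-up cumulative singular values for the two orderings.
asc = {"pension": 8.0, "water": 6.0, "school": 1.0}
dsc = {"school": 9.0, "water": 6.0, "pension": 1.0}
doc = ["water", "pension", "water"]

print(document_sum(doc, asc, rank=2))               # prints 14.0
print(document_sum(doc, dsc, rank=2))               # prints 6.0
print(average_document_sum(doc, asc, dsc, rank=2))  # prints 10.0
```

Limiting the sum to the top `rank` morphemes keeps long documents from accumulating a large score merely through many low-weight words.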

A fourth invention depends on the third invention and is a document analysis apparatus further comprising cumulative singular value sum difference calculation means for calculating the difference between the ascending cumulative singular value sum and the descending cumulative singular value sum.

In the fourth invention, the cumulative singular value sum difference calculation means (S65) calculates the difference between the ascending cumulative singular value sum and the descending cumulative singular value sum. A free-answer description with a high ascending cumulative singular value sum ΣΣD(j,ord,asc,rank) or a high descending cumulative singular value sum ΣΣD(j,ord,dsc,rank) has content that expresses the respective characteristic well. Some free-answer descriptions, however, may have both characteristics, in which case the effect of the order criterion cannot be expressed properly. The fourth invention therefore obtains the difference between the ascending cumulative singular value sum and the descending cumulative singular value sum and uses its absolute value to enable weighting that reflects the ascending/descending nature of the order criterion. This is called the difference cumulative singular value sum diffΣΣD(j,ord,rank), and it allows the effect of the order criterion to be emphasized further.
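The relationship between the mean and the difference of the two sums can be illustrated with a short Python sketch. The function name `ave_and_diff` and the three example documents are hypothetical; the difference is taken as an absolute value, following the description above.

```python
def ave_and_diff(asc_sum, dsc_sum):
    """Return (aveΣΣD, diffΣΣD) for one document from its two top-ranked sums."""
    return (asc_sum + dsc_sum) / 2.0, abs(asc_sum - dsc_sum)


# Made-up (ascending sum, descending sum) pairs for three documents.
docs = {"A": (14.0, 6.0), "B": (12.0, 12.0), "C": (2.0, 15.0)}
for name, (asc_sum, dsc_sum) in docs.items():
    print(name, ave_and_diff(asc_sum, dsc_sum))
# prints:
# A (10.0, 8.0)
# B (12.0, 0.0)
# C (8.5, 13.0)
```

Document B scores high in both orderings, so its mean is the largest while its difference is zero: it is representative of the whole corpus but says little about the order criterion. Document C has the largest difference and is therefore the one most strongly tied to one end of the order criterion.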

A fifth invention depends on the fourth invention and is a document analysis apparatus further comprising cumulative singular value sum graph display means for displaying a cumulative singular value sum graph in which the average cumulative singular value sum is plotted on one of two axes and the cumulative singular value sum difference on the other.

In the fifth invention, the cumulative singular value sum graph display means (S67) displays, for example, the cumulative singular value sum graph shown in FIG. 24. By displaying a graph with the average cumulative singular value sum aveΣΣD(j,ord,rank) on one axis and the difference cumulative singular value sum diffΣΣD(j,ord,rank) on the other, representative articles or cases that typify the whole corpus (those with a large average cumulative singular value sum aveΣΣD) can be found easily. In addition, the magnitude of the difference cumulative singular value sum diffΣΣD(j,ord,rank) makes the characteristics of each article with respect to the order criterion used to arrange the corpus clearer.

A sixth invention is a document analysis method for a document analysis apparatus that analyzes, on the basis of an incremental TFIDF, language material that grows according to an order criterion, the method obtaining a singular value for each morpheme by performing a residual analysis between an estimate based on the total value of the incremental TFIDF up to the previous corpus and the total value of the incremental TFIDF in the current corpus, wherein a computer of the document analysis apparatus executes: an ascending cumulative singular value calculation step of calculating, for each morpheme, an ascending cumulative singular value obtained when the language material is arranged in ascending order of the order criterion; a descending cumulative singular value calculation step of calculating, for each morpheme, a descending cumulative singular value obtained when the language material is arranged in descending order of the order criterion; and an average cumulative singular value calculation step of calculating an average cumulative singular value by taking the arithmetic mean of the ascending cumulative singular value and the descending cumulative singular value.

The sixth invention can be expected to provide the same effects as the first invention.

A seventh invention is a document analysis program that causes a computer of a document analysis apparatus, which analyzes, on the basis of an incremental TFIDF, language material that grows according to an order criterion and which obtains a singular value for each morpheme by performing a residual analysis between an estimate based on the total value of the incremental TFIDF up to the previous corpus and the total value of the incremental TFIDF in the current corpus, to execute: an ascending cumulative singular value calculation step of calculating, for each morpheme, an ascending cumulative singular value obtained when the language material is arranged in ascending order of the order criterion; a descending cumulative singular value calculation step of calculating, for each morpheme, a descending cumulative singular value obtained when the language material is arranged in descending order of the order criterion; and an average cumulative singular value calculation step of calculating an average cumulative singular value by taking the arithmetic mean of the ascending cumulative singular value and the descending cumulative singular value.

The seventh invention can be expected to provide the same effects as the first invention.

An eighth invention is a document analysis apparatus that analyzes, on the basis of an incremental TFIDF, language material that grows according to an order criterion, and that obtains a singular value for each morpheme by performing a residual analysis between an estimate based on the cumulative value of the incremental TFIDF up to the previous corpus and the cumulative value of the incremental TFIDF in the current corpus, the apparatus comprising: ascending cumulative singular value calculation means for calculating, for each morpheme, an ascending cumulative singular value obtained when the language material is arranged in ascending order of the order criterion; descending cumulative singular value calculation means for calculating, for each morpheme, a descending cumulative singular value obtained when the language material is arranged in descending order of the order criterion; ascending cumulative singular value sum calculation means for calculating the sum of the ascending cumulative singular values of morphemes having top-ranked cumulative singular values; descending cumulative singular value sum calculation means for calculating the sum of the descending cumulative singular values of morphemes having top-ranked cumulative singular values; and average cumulative singular value sum calculation means for calculating an average cumulative singular value sum by averaging the ascending cumulative singular value sum and the descending cumulative singular value sum.

The eighth invention can be expected to provide the same effects as the third invention.

A ninth invention is a document analysis method for a document analysis apparatus that analyzes, on the basis of an incremental TFIDF, language material that grows according to an order criterion, the method obtaining a singular value for each morpheme by performing a residual analysis between an estimate based on the cumulative value of the incremental TFIDF up to the previous corpus and the cumulative value of the incremental TFIDF in the current corpus, wherein a computer of the document analysis apparatus executes: an ascending cumulative singular value calculation step of calculating, for each morpheme, an ascending cumulative singular value obtained when the language material is arranged in ascending order of the order criterion; a descending cumulative singular value calculation step of calculating, for each morpheme, a descending cumulative singular value obtained when the language material is arranged in descending order of the order criterion; an ascending cumulative singular value sum calculation step of calculating the sum of the ascending cumulative singular values of morphemes having top-ranked cumulative singular values; a descending cumulative singular value sum calculation step of calculating the sum of the descending cumulative singular values of morphemes having top-ranked cumulative singular values; and an average cumulative singular value sum calculation step of calculating an average cumulative singular value sum by averaging the ascending cumulative singular value sum and the descending cumulative singular value sum.

第９の発明でも、第３の発明と同様の効果が期待できる。 In the ninth invention, the same effect as in the third invention can be expected.

第１０の発明は、順序基準に従って増量する言語資料を増加型ＴＦＩＤＦに基づいて解析する文書解析装置であって、前コーパスまでの増加型ＴＦＩＤＦの総和値に基づく推定値と現コーパスにおける増加型ＴＦＩＤＦの総和値との間で残差分析を実行することによって形態素毎の特異値を求める文書解析装置のコンピュータに、言語資料を順序基準における昇順に並べたときの形態素毎の昇順累積特異値を計算する昇順累積特異値計算ステップ、言語資料を順序基準における降順に並べたときの形態素毎の降順累積特異値を計算する降順累積特異値計算ステップ、特定上位の累積特異値を持つ形態素について昇順累積特異値の総和を計算する昇順累積特異値総和計算ステップ、特定上位の累積特異値を持つ形態素について降順累積特異値の総和を計算する降順累積特異値総和計算ステップ、および昇順累積特異値総和および降順累積特異値総和を平均して平均累積特異値総和を計算する平均累積特異値総和計算ステップを実行させることを特徴とする、文書解析プログラムである。 A tenth invention is a document analysis program for a document analysis apparatus that analyzes, on the basis of incremental TFIDF, language material that grows according to an order criterion and that obtains a singular value for each morpheme by performing residual analysis between an estimate based on the total incremental TFIDF up to the previous corpus and the total incremental TFIDF in the current corpus. The program causes a computer of the document analysis apparatus to execute: an ascending cumulative singular value calculation step of calculating, for each morpheme, the ascending cumulative singular value obtained when the language material is arranged in ascending order of the order criterion; a descending cumulative singular value calculation step of calculating, for each morpheme, the descending cumulative singular value obtained when the language material is arranged in descending order of the order criterion; an ascending cumulative singular value sum calculation step of calculating the sum of the ascending cumulative singular values over the morphemes having the specified top-ranked cumulative singular values; a descending cumulative singular value sum calculation step of calculating the sum of the descending cumulative singular values over those morphemes; and an average cumulative singular value sum calculation step of averaging the ascending and descending cumulative singular value sums to obtain an average cumulative singular value sum.

第１０の発明でも第３の発明と同様の効果が期待できる。 In the tenth invention, the same effect as in the third invention can be expected.

この発明によれば、適宜の順序基準に従って自由回答記述（単位ドキュメント）を並べてコーパスを作成することによって、自由回答記述を解析して代表的キーワードなどを選定することができる。 According to the present invention, by creating a corpus by arranging free answer descriptions (unit documents) according to an appropriate order criterion, it is possible to analyze a free answer description and select a representative keyword or the like.

この発明の上述の目的，その他の目的，特徴，および利点は、図面を参照して行う以下の実施例の詳細な説明から一層明らかとなろう。 The above object, other objects, features, and advantages of the present invention will become more apparent from the following detailed description of embodiments with reference to the drawings.

図１はこの発明の背景となる文書解析装置を示すブロック図である。 FIG. 1 is a block diagram showing a document analysis apparatus that forms the background of the present invention.

図２はこの文書解析装置で用いられるテキストデータテーブルの一例を示す図解図である。 FIG. 2 is an illustrative view showing an example of a text data table used in the document analysis apparatus.

図３は図１の文書解析装置のコンピュータの動作の一例を示すフロー図である。 FIG. 3 is a flowchart showing an example of the operation of the computer of the document analysis apparatus of FIG. 1.

図４は時間とともに増加するコーパスの一例を示す図解図である。 FIG. 4 is an illustrative view showing an example of a corpus that grows with time.

図５は各記事および形態素の出現頻度の解析結果の一例を示す表である。 FIG. 5 is a table showing an example of the analysis result of the appearance frequency for each article and morpheme.

図６は各記事および形態素に対する単位ドキュメント数Ｎを示す表である。図６（Ａ）は言語資料体が一定量である一般的な場合（時間の経過とともに増加しない場合）を示し、図６（Ｂ）は時系列的に増量する言語資料体を解析する場合を示す。図６（Ａ）は，他の図（図５〜８）との表記を統一させるために，表示例の形態素（t1，t2，t3・・・）毎に単位ドキュメント数Ｎを示してある。 FIG. 6 is a table showing the number N of unit documents for each article and morpheme. FIG. 6(A) shows the general case in which the language material body is of fixed size (does not grow over time), and FIG. 6(B) shows the case of analyzing a language material body that grows in time series. To keep the notation consistent with the other figures (FIGS. 5 to 8), FIG. 6(A) shows the number N of unit documents for each morpheme (t1, t2, t3, ...) of the display example.

図７は各記事および形態素に対するＤＦを示す表である。図７（Ａ）は言語資料体が一定量である一般的な場合（時間の経過とともに増加しない場合）を示し、図７（Ｂ）は時系列的に増量する言語資料体の場合を示す。 FIG. 7 is a table showing the DF for each article and morpheme. FIG. 7(A) shows the general case in which the language material body is of fixed size (does not grow over time), and FIG. 7(B) shows the case of a language material body that grows in time series.

図８は各記事および形態素に対するＴＦＩＤＦ（Ａ）および増加型ＴＦＩＤＦ（Ｂ）を示す表である。図８（Ａ）は言語資料体が一定量である一般的な場合（時間の経過とともに増加しない場合）を示し、図８（Ｂ）は時系列的に増量する言語資料体の場合を示す。 FIG. 8 is a table showing TFIDF (A) and incremental TFIDF (B) for each article and morpheme. FIG. 8(A) shows the general case in which the language material body is of fixed size (does not grow over time), and FIG. 8(B) shows the case of a language material body that grows in time series.

図９は回帰曲線の一例を示す図解図である。 FIG. 9 is an illustrative view showing an example of a regression curve.

図１０は回帰曲線とそれに対する残差（正負）を示すグラフであり、横軸にＴＦの総和を、縦軸に増加型ＴＦＩＤＦの総和をとる。 FIG. 10 is a graph showing a regression curve and the (positive and negative) residuals from it, with the sum of TF on the horizontal axis and the sum of incremental TFIDF on the vertical axis.

図１１は図１に示す文書解析装置のモニタに表示される１つの表示例を示す図解図である。 FIG. 11 is an illustrative view showing one display example shown on the monitor of the document analysis apparatus of FIG. 1.

図１２は図１に示す文書解析装置のモニタに表示される別の表示例を示す図解図である。 FIG. 12 is an illustrative view showing another display example shown on the monitor of the document analysis apparatus of FIG. 1.

図１３はコーパスと回帰曲線との関係を示す図解図である。 FIG. 13 is an illustrative view showing the relationship between corpora and regression curves.

図１４はこの発明の一実施例である文書解析装置を示すブロック図である。 FIG. 14 is a block diagram showing a document analysis apparatus according to an embodiment of the present invention.

図１５は図１４実施例で解析可能な自由回答形式の調査票の一例を示す図解図である。 FIG. 15 is an illustrative view showing an example of a free-answer questionnaire that can be analyzed in the embodiment of FIG. 14.

図１６は図１４実施例の文書解析装置においてモニタに表示されるＧＵＩの一例を示す図解図である。 FIG. 16 is an illustrative view showing an example of a GUI displayed on the monitor of the document analysis apparatus of the FIG. 14 embodiment.

図１７は図１４実施例において代表キーワード値（平均累積特異値：aveΣＤ）を求めるためのコンピュータの動作を示すフロー図である。 FIG. 17 is a flowchart showing the operation of the computer for obtaining the representative keyword value (average cumulative singular value: aveΣD) in the embodiment of FIG. 14.

図１８は図１７実施例における各特異語の平均累積特異値（aveΣＤ）を表示するグラフの一例を示す図解図である。 FIG. 18 is an illustrative view showing an example of a graph displaying the average cumulative singular value (aveΣD) of each singular word in the FIG. 17 embodiment.

図１９は図１７実施例においてコーパスを昇順で並べた場合の各特異語の昇順累積特異値（ΣＤ(発生年月日,asc)）を表示するグラフの一例を示す図解図である。 FIG. 19 is an illustrative view showing an example of a graph displaying the ascending cumulative singular value (ΣD(date of occurrence, asc)) of each singular word when the corpus is arranged in ascending order in the FIG. 17 embodiment.

図２０は図１７実施例においてコーパスを降順に並べた場合の各特異語の降順累積特異値（ΣＤ(発生年月日,dsc)）を表示するグラフの一例を示す図解図である。 FIG. 20 is an illustrative view showing an example of a graph displaying the descending cumulative singular value (ΣD(date of occurrence, dsc)) of each singular word when the corpus is arranged in descending order in the FIG. 17 embodiment.

図２１は図１７実施例において、縦軸が各特異語の昇順累積特異値（ΣＤ(発生年月日,asc)）であり、横軸が各特異語の降順累積特異値（ΣＤ(発生年月日,dsc)）であるグラフの一例を示す図解図である。 FIG. 21 is an illustrative view showing an example of a graph in the FIG. 17 embodiment whose vertical axis is the ascending cumulative singular value (ΣD(date of occurrence, asc)) of each singular word and whose horizontal axis is the descending cumulative singular value (ΣD(date of occurrence, dsc)) of each singular word.

図２２は図１７実施例において、縦軸に各特異語の平均累積特異値（aveΣＤ）をとり、横軸に各特異語のＴＦＩＤＦ(i)をとるグラフの一例を示す図解図である。 FIG. 22 is an illustrative view showing an example of a graph in the FIG. 17 embodiment with the average cumulative singular value (aveΣD) of each singular word on the vertical axis and the TFIDF(i) of each singular word on the horizontal axis.

図２３は図１４実施例において代表記事値（平均累積特異値総和：aveΣΣＤ）および累積特異値総和差分（diffΣΣＤ）求めるためのコンピュータの動作を示すフロー図である。 FIG. 23 is a flowchart showing the operation of the computer for obtaining the representative article value (average cumulative singular value sum: aveΣΣD) and the cumulative singular value sum difference (diffΣΣD) in the embodiment of FIG. 14.

図２４は図２３実施例において、縦軸が平均累積特異値総和（aveΣΣＤ）であり、横軸が累積特異値総和差分（diffΣΣＤ）であるグラフの一例を示す図解図である。 FIG. 24 is an illustrative view showing an example of a graph in the FIG. 23 embodiment whose vertical axis is the average cumulative singular value sum (aveΣΣD) and whose horizontal axis is the cumulative singular value sum difference (diffΣΣD).

図２５は図２３実施例において、Ｙ軸に平均累積特異値総和（aveΣΣＤ）の階級、Ｘ軸に累積特異値総和差分（diffΣΣＤ）の階級、Ｚ軸に各階級に該当するケース頻度をとった３次元グラフの一例を示す図解図である。 FIG. 25 is an illustrative view showing an example of a three-dimensional graph in the FIG. 23 embodiment, with classes of the average cumulative singular value sum (aveΣΣD) on the Y axis, classes of the cumulative singular value sum difference (diffΣΣD) on the X axis, and the frequency of cases falling in each class on the Z axis.

以下の説明では、図１‐図１３を参照してこの発明の背景である文書解析装置を、本件発明の理解に必要な範囲で説明し、その後、図１４‐図２５を参照して本件発明の実施例を説明する。 In the following description, the document analysis apparatus that forms the background of the present invention is first described with reference to FIGS. 1 to 13, to the extent necessary for understanding the present invention; embodiments of the present invention are then described with reference to FIGS. 14 to 25.

図１に示すこの発明の背景となる文書解析装置１０は、たとえばインターネットのような通信網（ネットワーク）１２に有線または無線で結合されるコンピュータ１４を含む。コンピュータ１４には、基本的に、キーボードやマウスのような操作手段１５Ａおよび液晶表示器のようなモニタ１５Ｂが設けられていて、このコンピュータ１４には、さらに、テキストデータベース１６および分析データベース１８が付設される。コンピュータ１４は当然、内部メモリを有し、その内部メモリ（図示せず）はワーキングメモリなどとして利用され、必要なプログラムを展開したり、計算して得られた結果データや、解析結果データ、さらにはその解析途中の各種データなどを一時的に記憶したりする。 The document analysis apparatus 10 shown in FIG. 1, which forms the background of the present invention, includes a computer 14 connected by wire or wirelessly to a communication network 12 such as the Internet. The computer 14 is basically provided with operating means 15A such as a keyboard and a mouse and a monitor 15B such as a liquid crystal display, and a text database 16 and an analysis database 18 are also attached to it. The computer 14 naturally has an internal memory (not shown), which is used as working memory: necessary programs are loaded into it, and it temporarily stores calculation results, analysis results, and intermediate data produced during analysis.

テキストデータベース１６には、たとえば、このコンピュータ１４がネットワーク１２を通して取得した時間順次のウェブニュースのテキストデータが逐次記憶され、コンピュータ１４はこのウェブニュースのテキストデータを順次分析または解析することによって、時系列的に変遷する特異語および共通語（キーワード）を抽出する。 The text database 16 sequentially stores, for example, text data of time-ordered web news that the computer 14 has acquired through the network 12. By analyzing this web news text data in sequence, the computer 14 extracts singular words and common words (keywords) that change over time.

テキストデータベース１６に蓄積されるテキストデータテーブル２０の一例が図２に示される。テキストデータテーブル２０は、具体的には、テキストデータで構成される言語資料から、任意の一定の大きさをもつ「単位ドキュメント」のテキストデータを１つのレコードに持つテーブルである。 An example of the text data table 20 stored in the text database 16 is shown in FIG. Specifically, the text data table 20 is a table having, in one record, text data of “unit document” having an arbitrary fixed size from a language material composed of text data.

単位ドキュメントの例としては、ウェブニュースの場合であれば、所定期間内の記事、１日の記事、１つの記事、１つの段落、１つの文などがある。新聞を例にとれば、１紙、１つの記事、１つの段落、１つの文などがある。文学作品（小説）などの場合には、１つの作品、１つの章、１つの段落、１つの文などがある。その他、ウェブ上のブログを解析対象とする場合には，１つの日記を単位ドキュメントとしたり、コールセンターへの１つの問い合わせや苦情などを単位ドキュメントにしたりするなど、言語資料に対して任意の単位を「単位ドキュメント」として定めて、データベース２０を作成する。 As an example of the unit document, in the case of web news, there are an article within a predetermined period, an article for a day, an article, a paragraph, a sentence, and the like. Taking a newspaper as an example, there are one paper, one article, one paragraph, one sentence, and the like. In the case of literary works (novels), there are one work, one chapter, one paragraph, one sentence, and the like. In addition, when analyzing blogs on the web, an arbitrary unit can be used for language materials such as one diary as a unit document or one inquiry or complaint to the call center as a unit document. The database 20 is created by defining it as “unit document”.

図２に示すように、1つのレコードに対しては、数字やアルファベットなどで形成される識別子（ＩＤ番号）２２およびテキストデータ２４のほか、時間情報（時刻スタンプ）２６をメタデータとして付与する。時間情報２６には、ウェブニュース記事であれば発信日時、コールセンターへの問い合わせであれば問い合わせ時間などが該当する。この背景技術の文書解析装置１０は、ニュースやブログなど時間とともに文字数が増加していく言語情報を対象としている。しかしながら、文学作品等のように常には更新されないような言語資料であっても、言語資料は線状性を有しているため、言語資料を読む人は、時間の経過ともに言語情報を理解することになる。したがって、小説や文学作品のように一見静的で時間情報を持たない言語資料については、図２に示す時間情報２６のフィールドに、時間情報の代わりに順序情報（１章、２章…、１段落目、２段落目…、１文目、２文目…など）をメタデータとして付与すればよい。その他、必要に応じて任意のフィールド、たとえばタイトル２６を設けて、データベーステーブル２０を作成する。さらに、後に説明するこの発明の実施例が解析可能な自由回答記述の場合には、この時間（順序）情報としてたとえば、災害の社会調査における、回答者の年齢、家屋の被害程度、被害額、世帯年収などの順序基準を用い、その順序基準に従って各自由回答記述（単位ドキュメント）を並べるようにすれば、この背景技術の文書解析装置と同じ手法を適用することができる。また、通勤事情に関する自由回答形式の調査票を解析する場合には、たとえば、通勤時間や通勤に係る交通費を「順序基準」として採用することも可能である。 As shown in FIG. 2, each record is given an identifier (ID number) 22 formed of numerals, letters, or the like, text data 24, and time information (time stamp) 26 as metadata. The time information 26 is, for example, the transmission date and time for a web news article, or the inquiry time for an inquiry to a call center. The document analysis apparatus 10 of this background art targets language information whose character count grows with time, such as news and blogs. However, even language material that is not continually updated, such as a literary work, is linear, so a reader comes to understand its content as time passes. Therefore, for language material that appears static and carries no time information, such as novels and other literary works, order information (chapter 1, chapter 2, ...; paragraph 1, paragraph 2, ...; sentence 1, sentence 2, ...) may be given as metadata in the time information 26 field shown in FIG. 2 in place of time information. Any other fields, for example a title 26, may be provided as needed when creating the database table 20. Furthermore, for the free answer descriptions that the embodiment of the present invention described later can analyze, an order criterion such as the respondent's age, the degree of damage to the house, the amount of damage, or the annual household income in a social survey of a disaster can serve as this time (order) information; if the free answer descriptions (unit documents) are arranged according to that order criterion, the same technique as in the document analysis apparatus of this background art can be applied. Likewise, when analyzing a free-answer questionnaire about commuting conditions, commuting time or commuting cost, for example, can be adopted as the "order criterion".

もし、このテキストデータテーブル２０をコンピュータ１４が作成するときには、たとえばコンピュータ１４の中にインストールされている、ＤＢＭＳ（Data Base Management System：データベース管理システム）のようなアプリケーションを用いて、たとえばネットワーク１２を通して取得したウェブニュースなどからテキストデータテーブルを作成することができる。 When the computer 14 creates the text data table 20, it can do so from, for example, web news acquired through the network 12, using an application such as a DBMS (Data Base Management System) installed on the computer 14.

なお、図２に示す１つの識別記号（ＩＤ）２２で区別されるかつ時系列（順序）情報２６が付された１つの単位ドキュメントのテキストデータ２４（図２）を含むものを、１レコードと呼ぶ。そして、言語資料体（コーパス）とは、このようなレコードの集合を意味する。 Note that one record is the text data 24 (FIG. 2) of one unit document, distinguished by one identifier (ID) 22 shown in FIG. 2 and accompanied by time-series (order) information 26. A language material body (corpus) means a set of such records.
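The record and corpus just defined can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Record`, the field names, and the helper `build_corpus` are hypothetical and do not come from the patent, and the order information is simplified to a single numeric key (a time stamp, a chapter number, a respondent's age, and so on).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """One row of the text data table (hypothetical field names)."""
    record_id: str    # identifier 22
    text: str         # text data 24 of one unit document
    order_key: float  # time or order information 26


def build_corpus(records: List[Record], descending: bool = False) -> List[Record]:
    """Return the records arranged by the chosen order criterion."""
    return sorted(records, key=lambda r: r.order_key, reverse=descending)


# Free answers ordered by respondent age (an example order criterion).
answers = [
    Record("a3", "roof was damaged", 62),
    Record("a1", "no damage at home", 35),
    Record("a2", "water supply stopped", 48),
]
corpus = build_corpus(answers)
ids = [r.record_id for r in corpus]
```

The `descending` flag is included because the claimed method arranges the same material in both ascending and descending order of the criterion.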

後述の背景技術の説明では、キーワード（特異語、共通語）を検出すべき時系列的に増量する言語資料体として、ウェブニュースを試用しているが、この種の言語資料としては、他に、新聞,雑誌,ブログ，インタビュー記録，供述調書，アンケート，小説，自由回答記述など任意の時間要素（順序基準要素）を含むデータが想定できる。 In the description of the background art below, web news is used as the language material body that grows in time series and from which keywords (singular words and common words) are to be detected. Other language material of this kind can be any data containing a time element (order criterion element), such as newspapers, magazines, blogs, interview records, written statements, questionnaires, novels, and free answer descriptions.

分析データベース１８には、後述の形態素分析のための品詞辞書など、この実施例において文書解析に必要な全ての辞書や文法ルールなどを予め記憶しているとともに、解析結果も蓄積する。ただし、この分析データベース１８は、上述のテキストデータベース１６も同様であるが、コンピュータ１４の内部メモリで構成されていてもよい。 The analysis database 18 stores in advance all dictionaries and grammar rules needed for document analysis in this embodiment, such as the part-of-speech dictionary for the morphological analysis described later, and also accumulates analysis results. The analysis database 18, like the text database 16 described above, may instead be implemented in the internal memory of the computer 14.

コンピュータ１４は、図３に示す文書解析プログラムに従って文書を解析してキーワードを抽出ないし検出する。 The computer 14 extracts or detects keywords by analyzing the document according to the document analysis program shown in FIG.

図３を参照して、最初のステップＳ１で、コンピュータ１４は、設定時間が経過したかどうか判断する。「設定時間」とは、時系列的に増量する言語資料から、時系列順序を有する各コーパスを画定するための、区切りの時間（Δｔ）である。この「設定時間」はユーザが自由に設定できる。たとえば、状況変化が短時間で生じるような言語資料を分析する際には、短い設定時間（Δｔ）を設定すればよく、逆の言語資料の場合には、設定時間Δｔを長くすればよい。Δｔの例としては、１時間、１０時間、１００時間、１日、１週間、１ヶ月など挙げられる。また、このΔｔを時間の経過とともに変更することも考えられる。一例として、災害発生から２４時間経過するまではたとえばΔｔを「１時間」に設定し、それ以降災害から３日目まではたとえばΔｔを「１０時間」に設定し、さらに１ヶ月以上経過したときにはたとえばΔｔを「１日」として設定する。 Referring to FIG. 3, in the first step S1 the computer 14 determines whether the set time has elapsed. The "set time" is the delimiting interval (Δt) used to define each corpus, in time-series order, from language material that grows in time series. The user can set it freely: a short Δt suits language material in which the situation changes quickly, and a longer Δt suits material in which it does not. Examples of Δt include 1 hour, 10 hours, 100 hours, 1 day, 1 week, and 1 month. Δt may also be changed as time passes. As an example, Δt may be set to "1 hour" until 24 hours have passed since a disaster occurred, to "10 hours" from then until the third day after the disaster, and to "1 day" once one month or more has passed.
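The staged Δt in the disaster example can be written as a small lookup. This is a sketch, not the patent's implementation: the function name `delta_t_hours` is hypothetical, one month is taken as 30 days, and, because the text gives no value between the third day and one month, the sketch simply keeps 10 hours over that interval as an assumption.

```python
def delta_t_hours(elapsed_hours: float) -> float:
    """Return the corpus delimiter Δt (in hours) for a given time since the event."""
    if elapsed_hours < 24:
        # First 24 hours after the disaster: Δt = 1 hour (stated in the text).
        return 1.0
    if elapsed_hours < 30 * 24:
        # The text states 10 hours only up to the third day; keeping it
        # until one month has passed is an assumption of this sketch.
        return 10.0
    # One month or more after the disaster: Δt = 1 day (stated in the text).
    return 24.0


schedule = [delta_t_hours(h) for h in (5, 48, 800)]
```

Step S1 would then compare the time elapsed since the last corpus was formed against the value this function returns.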

そして、ユーザによって任意の設定時間が設定されると、その設定時間はコンピュータ１４の適宜のメモリ領域（レジスタ）に記憶されるので、コンピュータ１４は、内部の時計データをレジスタに設定された設定時間と比較することによって、ステップＳ１で設定時間が経過したかどうか、判断することができる。 When an arbitrary set time is set by the user, the set time is stored in an appropriate memory area (register) of the computer 14, so that the computer 14 sets the internal clock data in the register. It is possible to determine whether or not the set time has elapsed in step S1.

ステップＳ１で“ＹＥＳ”が判断されると、続いてコンピュータ１４はステップＳ３においてコーパス作成処理を実行し、設定時間（Δｔ）の間に増量した単位ドキュメントのテキストデータを、たとえば図２に示すテキストデータテーブル２０から読み込み、今回のテキストコーパスＣ(t)を作成する。 If “YES” is determined in the step S1, the computer 14 subsequently executes a corpus creation process in a step S3, and the text data of the unit document increased during the set time (Δt), for example, the text shown in FIG. Read from the data table 20 to create the current text corpus C (t).

図４に示すコーパスＣ(t)は現在時間のコーパスを示すが、このコーパスＣ(t)は、それぞれより時系列順序が先のコーパスＣ(t-Δt)より、設定時間Δｔ後に形成したコーパスである。つまり、コーパスＣ(t)は、直前のコーパスＣ(t-Δt)と増量分のコーパスＣΔｔとを合計したものである。 The corpus C (t) shown in FIG. 4 indicates the corpus at the current time, and this corpus C (t) is a corpus formed after a set time Δt from the corpus C (t−Δt) whose time series order is earlier. It is. That is, the corpus C (t) is the sum of the immediately preceding corpus C (t−Δt) and the increased amount of corpus CΔt.
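The relation C(t) = C(t-Δt) + CΔt can be illustrated with time-stamped documents. The sketch below assumes a fixed Δt and represents each document as a `(time_stamp, text)` pair; the function name `cumulative_corpora` and this data layout are illustrative choices, not part of the patent.

```python
from typing import List, Tuple


def cumulative_corpora(docs: List[Tuple[float, str]], delta_t: float) -> List[List[str]]:
    """Split time-stamped documents into growing corpora C(Δt), C(2Δt), ...

    Each corpus holds every document whose time stamp is at or before its
    boundary, so each corpus is the previous one plus the newly added part.
    """
    if delta_t <= 0:
        raise ValueError("delta_t must be positive")
    docs = sorted(docs)  # arrange by time stamp
    if not docs:
        return []
    corpora = []
    boundary = delta_t
    last_time = docs[-1][0]
    while True:
        corpora.append([text for ts, text in docs if ts <= boundary])
        if boundary >= last_time:
            break
        boundary += delta_t
    return corpora


docs = [(0.5, "d1"), (1.2, "d2"), (1.8, "d3"), (2.5, "d4")]
corpora = cumulative_corpora(docs, delta_t=1.0)
```

With Δt = 1.0 the example yields three corpora, each a superset of the one before it.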

なお、「コーパス（corpus）」とは、言語分析のための文字言語、あるいは音声言語資料の集合体として定義されるもので、特に電子テキストで構築されたものを指し、一般には、電子的なオリジナルのテキスト群を収集したものを指すが、ここでは、上記の定義を広義にとらえ、オリジナルテキストに対して増加型ＴＦＩＤＦやＴＦ（いずれも後述）の情報をもつ形態素群を便宜的にコーパスと呼ぶことにする。したがって、ここでいうテキストコーパスは、少なくとも１つのレコードつまり少なくとも１つの単位ドキュメントのテキストデータを含む言語資料体を意味するものと理解されたい。 Note that a "corpus" is defined as a collection of written or spoken language material for linguistic analysis, and in particular one built from electronic text; it generally refers to a collection of original electronic texts. Here, that definition is taken broadly, and a group of morphemes carrying incremental TFIDF and TF information (both described later) for the original text is also called a corpus for convenience. The text corpus referred to here should therefore be understood as a language material body containing at least one record, that is, the text data of at least one unit document.

続いて、ステップＳ５において、そのコーパスに含まれるテキストデータ２４（図２）を形態素に分割し、品詞情報を付加する。ここで、形態素解析とは、自然言語で書かれた文を形態素(Morpheme、おおまかにいえば、言語で意味を持つ最小単位)の列に分割し、品詞を見分ける言語処理のことである。参照する情報源として、対象言語の文法の知識（ここでは文法のルールの集まり）と辞書（品詞等の情報付きの単語リスト）を用いるが、これらの文法ルールや辞書は、上述のように、上記分析データベース１８に予め準備されている。 Subsequently, in step S5, the text data 24 (FIG. 2) contained in the corpus is divided into morphemes and part-of-speech information is added. Morphological analysis is language processing that divides a sentence written in natural language into a sequence of morphemes (roughly, the smallest units that carry meaning in a language) and identifies their parts of speech. It refers to knowledge of the target language's grammar (here, a collection of grammar rules) and a dictionary (a word list with information such as parts of speech); as noted above, these grammar rules and dictionaries are prepared in advance in the analysis database 18.

なお、実施例では、一例として「茶筌」（http://chasen.naist.jp/hiki/ChaSen/）というフリーの形態素解析ソフトをコンピュータ１４に導入して利用した。 In the embodiment, as an example, the free morphological analysis software "ChaSen" (http://chasen.naist.jp/hiki/ChaSen/) was installed on the computer 14 and used.

なお、文書が日本語の場合、実施例では、まず形態素を分割して抽出しその抽出した形態素に付いて品詞を付与するように、上記「茶筌」のようなツールを利用した。しかしながら、たとえば英語のような言語体系では最小単位である単語は既に分割されているので、分かち書き処理（tokenization）は不要であるが、このステップＳ５では、活用形を原形に直す必要があるので、ステミング処理（stemming：活用形を原形に直すこと）を行い、さらに品詞を同定する必要があるので、タギング処理（tagging：語の品詞を見分けること）処理をすることになる。 When the document is in Japanese, the embodiment uses a tool such as "ChaSen" to first split the text into morphemes and then assign a part of speech to each extracted morpheme. In a language such as English, by contrast, the words that are the smallest units are already separated, so word segmentation (tokenization) is unnecessary. In that case step S5 performs stemming (restoring inflected forms to their base form) and then tagging (identifying the part of speech of each word).

また、このステップＳ５で解析した形態素（群）および品詞情報は、テキストデータベース１６に蓄積される。 Further, the morpheme (group) and the part of speech information analyzed in step S5 are stored in the text database 16.

続くステップＳ７において、コンピュータ１４は、上述の品詞情報に基づいて、不要語として設定しておいた品詞の種類の形態素を取り除くための不要形態素除去処理を実行する。 In the subsequent step S7, the computer 14 executes an unnecessary morpheme removal process for removing the morpheme of the part of speech type set as an unnecessary word based on the above-mentioned part of speech information.

つまり、形態素解析の際に、各形態素に付与される「品詞情報」に基づいて、当該形態素をキーワードの候補として採用するか否かを選定する。不要語とする形態素（特異語（キーワード）／共通語の候補）の品詞の種類は、形態素解析システムが出力する品詞体系と、ユーザの解析の意図によって異なる。不要形態素と認定する品詞の種類はユーザが任意で定められるものとする。発明者等が実際に解析を行なった実験では、「茶筌」を用いて分析した結果の、非自立や接尾の形を取らない名詞、動詞、副詞、形容詞以外を不要形態素とした。ただし、どのような品詞の形態素を不要語とするかという不要語除去規則もまた、分析データベース１８に予め設定しておけばよい。なお、英文の場合には、先に説明したように、ストップワード（stop word）と呼ばれる、極めて頻繁に使われるキーワードとして不適切な単語のリストを参照することによって、不要な形態素を取り除く。 That is, whether each morpheme is adopted as a keyword candidate is decided on the basis of the "part-of-speech information" assigned to it during morphological analysis. Which parts of speech are treated as unnecessary words depends on the part-of-speech system output by the morphological analysis system and on the user's analytical intent, and the user may define them freely. In the experiments the inventors actually carried out, every morpheme other than nouns, verbs, adverbs, and adjectives not in non-independent or suffix form, as analyzed by "ChaSen", was treated as an unnecessary morpheme. The unnecessary-word removal rule that specifies which parts of speech are unnecessary may also be set in the analysis database 18 in advance. For English text, as described above, unnecessary morphemes are removed by referring to a list of stop words, that is, words used so frequently that they are unsuitable as keywords.
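The part-of-speech filter of step S7 might look like the following sketch. The tag names are hypothetical English labels rather than ChaSen's actual output, the tokens are hand-written instead of coming from a real analyzer, and the sketch reads the inventors' rule as excluding non-independent and suffix forms from all four kept parts of speech, which is one possible interpretation.

```python
from typing import List, NamedTuple


class Morpheme(NamedTuple):
    surface: str
    pos: str      # coarse part of speech (hypothetical label)
    subtype: str  # finer category, "" when not applicable


# Parts of speech kept as keyword candidates in the inventors' experiment.
KEPT_POS = {"noun", "verb", "adverb", "adjective"}
# Forms excluded even when the coarse part of speech is kept (assumed reading).
EXCLUDED_SUBTYPES = {"non-independent", "suffix"}


def remove_unneeded(morphemes: List[Morpheme]) -> List[Morpheme]:
    """Drop morphemes whose part of speech marks them as unnecessary words."""
    return [
        m for m in morphemes
        if m.pos in KEPT_POS and m.subtype not in EXCLUDED_SUBTYPES
    ]


tokens = [
    Morpheme("jishin", "noun", "general"),
    Morpheme("ga", "particle", "case"),
    Morpheme("koto", "noun", "non-independent"),
    Morpheme("yureru", "verb", "independent"),
]
kept = [m.surface for m in remove_unneeded(tokens)]
```

Keeping the two sets as module-level constants mirrors the text's point that the removal rule is user-defined and can be stored ahead of time.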

ステップＳ７を実行した後には、たとえばテキストデータベース１６に蓄積されている当該コーパスの中に必要な１つ以上形態素が残っている。したがって、ステップＳ９の処理は、そのコーパスに除去されずに残っているすべての形態素ついて実行される。つまり、コンピュータ１４は、ステップＳ９において、残っているすべての形態素の各々について増加型ＴＦＩＤＦを求める。 After step S7 is executed, one or more needed morphemes remain in the corpus stored in, for example, the text database 16. The processing of step S9 is therefore executed for every morpheme that remains in the corpus without having been removed; that is, in step S9 the computer 14 obtains the incremental TFIDF for each remaining morpheme.

ここで、「ＴＦ」はTerm Frequency、つまり単位ドキュメント中にそのキーワード候補が出現する頻度(延べ数)（出現頻度）であり、時間のパラメータを考慮した「ＩＤＦ」は、Inversed Document Frequency（逆出現文書数）、つまり、他には出現していないという独自性を示す。したがって、「増加型ＴＦＩＤＦ」とは、「ＴＦ」×「ＩＤＦ」のことであり、Term Frequency Inversed Document Frequencyといい、ＴＦ＊ＩＤＦと表すこともあるが、ここでは、増加型ＴＦＩＤＦと表現する。増加型ＴＦＩＤＦは、一種の重み付け指標となる。なお、背景技術では時系列的に増量する記事を含むコーパスを対象としたため、「時間増加型ＴＦＩＤＦ」の語を用いたが、この発明は、以下に説明するように任意の順序基準に従って増量する文書を含むコーパス、たとえば自由回答形式の調査票の分析ないし解析を目的とするので、単に「増加型ＴＦＩＤＦ」の語を用いることにした。 "TF" here is Term Frequency, the number of times (total count) the keyword candidate appears in a unit document. "IDF", which here takes the time parameter into account, is Inverse Document Frequency, which indicates how distinctive the term is, that is, how rarely it appears elsewhere. "Incremental TFIDF" is thus "TF" × "IDF" (Term Frequency Inverse Document Frequency, sometimes written TF*IDF) and serves as a kind of weighting index. The background art targeted a corpus of articles that grows in time series and therefore used the term "time-increasing TFIDF"; because the present invention aims to analyze a corpus of documents that grows according to an arbitrary order criterion, such as free-answer questionnaires, as described below, the term "incremental TFIDF" is used instead.

仮に、図５に示すように記事数が逐次変化する場合であっても、一般的な解析の場合には、最終的に一定数Ｎの単位ドキュメントが蓄積された後に行なうので、単位ドキュメントの総数Ｎは、図６（Ａ）に示すとおり一定数である。そのため、そのような一般のテキストデータを解析する際のＴＦＩＤＦのＤＦ（Document Frequency）、その形態素が出現する文書の数は、図７（Ａ）に示すように一定数となる。したがって、一般的な解析手法の場合のＴＦＩＤＦは図８（Ａ）のようになる。 Even if the number of articles changes successively as shown in FIG. 5, a general analysis is performed only after a fixed number N of unit documents has finally been accumulated, so the total number N of unit documents is constant, as shown in FIG. 6(A). Consequently the DF (Document Frequency) used in TFIDF when analyzing such general text data, that is, the number of documents in which the morpheme appears, is also constant, as shown in FIG. 7(A). The TFIDF of the general analysis technique is therefore as shown in FIG. 8(A).

これに対して、背景技術のシステムで取り扱う１レコードは時間情報または順序情報２６（図２）を持っているため、各レコード（テキストデータ）は、時系列順または順序情報順に並べることができる。したがって、その際の増加型ＴＦＩＤＦのＤＦには、ｊの添え字（時間や順序の情報にもとづく添え字）が存在することになる。ここにいう「ｊ」は、時系列順または順序情報順にレコード（記事）を並べた際の順番を表すことになる。 In contrast, since one record handled in the background art system has time information or order information 26 (FIG. 2), each record (text data) can be arranged in time series order or order information order. Therefore, a subscript j (subscript based on time or order information) exists in the DF of the incremental TFIDF at that time. Here, “j” represents the order when records (articles) are arranged in time series order or order information order.

したがって、背景技術の文書解析装置１０では、たとえば、ある記事ｄｊに対するＴＦＩＤＦを求める場合、最終的に収集された全件の記事に基づく単位ドキュメントの総数Ｎやそれに基づくＤＦを用いるのではなく、記事ｄ(j)が発行されるまでの時間に発信されていた記事の数に基づく時間を考慮したＮ(j)（記事ｄ(j)が発信された時点までの記事の総数）や、ＤＦ(ti,dj)（記事ｄ(j)が発信された時点までの形態素ｔ(i)の出現文書数）を用いて、記事ｄ(j)が発信された時点で逐次ＴＦＩＤＦを計算する。この実施例の文書解析装置１０では、図４に示すようにそれが含む単位ドキュメント数が時系列順序にしたがって増加するコーパスを設定し、そのコーパスにおける各形態素のＴＦＩＤＦを計算することによって、時間的順序（順番）を有するテキストデータからその順序に従った特異語（キーワード）や共通語を抽出または検出する。 Therefore, when the document analysis apparatus 10 of the background art obtains the TFIDF for an article d(j), it does not use the total number N of unit documents based on all articles finally collected, or the DF based on it. Instead it uses time-aware quantities based on the articles issued up to the time d(j) appeared: N(j), the total number of articles up to the time article d(j) was issued, and DF(ti,dj), the number of documents in which morpheme t(i) has appeared up to that time. TFIDF is thus calculated successively, at the time each article d(j) is issued. As shown in FIG. 4, the document analysis apparatus 10 sets up corpora whose number of unit documents increases in time-series order and calculates the TFIDF of each morpheme in each corpus, thereby extracting or detecting, from text data having a temporal order, the singular words (keywords) and common words that follow that order.

具体的には、通常のＴＦＩＤＦは次式（１）で、ここに定義する増加型ＴＦＩＤＦは次式（２）で計算される。 Specifically, the normal TFIDF is calculated by the following equation (1), and the incremental TFIDF defined here is calculated by the following equation (2).

［数１］ [Equation 1]
\[
\mathrm{TFIDF}(t_i, d_j) = \mathrm{TF}(t_i, d_j) \times \mathrm{IDF}(t_i), \qquad
\mathrm{IDF}(t_i) = \log_{10}\frac{N}{\mathrm{DF}(t_i)} \tag{1}
\]

［数２］ [Equation 2]
\[
\text{incremental TFIDF}(t_i, d_j) = \mathrm{TF}(t_i, d_j) \times \mathrm{IDF}(t_i, d_j), \qquad
\mathrm{IDF}(t_i, d_j) = \log_{10}\frac{N(j)}{\mathrm{DF}(t_i, d_j)} \tag{2}
\]

ここで、ｔ(i)はｉを識別子(ＩＤ)にもつ形態素である。つまり、ＴＦＩＤＦ(ti,dj)を算出する対象となるキーワード候補のことである。 Here, t(i) is the morpheme having i as its identifier (ID), that is, the keyword candidate for which TFIDF(ti,dj) is calculated.

ｄ(j)はｊ番目の単位ドキュメント（記事）を表わす。つまり、ＴＦＩＤＦ(ti,dj)および増加型ＴＦＩＤＦ(ti,dj)を算出する対象となるキーワード候補が含まれている文書のことである。ただし、文書の単位は、文章、記事、文など任意に設定可能であるが、背景技術では、ウェブニュースの記事を文書単位とした。 d (j) represents the jth unit document (article). That is, it is a document that includes keyword candidates that are targets for calculating TFIDF (ti, dj) and incremental TFIDF (ti, dj). However, the unit of the document can be arbitrarily set such as a sentence, an article, and a sentence, but in the background art, an article of web news is set as a document unit.

ＴＦＩＤＦ(ti,dj)および増加型ＴＦＩＤＦ(ti,dj)は、ｊ番目の単位ドキュメントの形態素ｔ(i)毎に算出される値である。 TFIDF (ti, dj) and incremental TFIDF (ti, dj) are values calculated for each morpheme t (i) of the j-th unit document.

ＴＦ(ti,dj)は、ｊ番目の単位ドキュメントの形態素ｔ(i)ごとに算出される値で、単位ドキュメントｄ(j)中に形態素ｔ(i)が出現した回数（延べ数）である。 TF (ti, dj) is a value calculated for each morpheme t (i) of the j-th unit document, and is the number of times that the morpheme t (i) appears in the unit document d (j) (total number).

ＤＦ(ti,dj)は、１〜ｊ番目の単位ドキュメント中に形態素ｔ(i)が出現した単位ドキュメント数である。 DF (ti, dj) is the number of unit documents in which the morpheme t (i) appears in the 1st to jth unit documents.

なお、上記Ｎ(j)は、単位ドキュメントｄ(j)が発生している際に出現している単位ドキュメント数であり、数字のＩＤが１から順序だって単位ドキュメントに付与されていれば実際には、Ｎの値はｊと同値になる。 Note that N(j) is the number of unit documents that have appeared by the time unit document d(j) occurs; if numeric IDs are assigned to the unit documents in order starting from 1, N(j) is in practice equal to j.
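Equation (2) and the definitions of TF(ti,dj), DF(ti,dj), and N(j) can be checked with a short Python sketch. It assumes the unit documents are already arranged by the order criterion and given as lists of morphemes, takes N(j) = j as noted above, and counts d(j) itself in DF(ti,dj); the function name `incremental_tfidf` and the sample words are illustrative only.

```python
import math
from typing import Dict, List, Tuple


def incremental_tfidf(docs: List[List[str]]) -> Dict[Tuple[str, int], float]:
    """Compute incremental TFIDF(t_i, d_j) for documents given in order.

    docs[j - 1] is the list of morphemes of the j-th unit document.
    The result is keyed by (morpheme, j) with j starting at 1.
    """
    df: Dict[str, int] = {}  # DF(t_i, d_j): documents 1..j that contain t_i
    scores: Dict[Tuple[str, int], float] = {}
    for j, doc in enumerate(docs, start=1):
        n_j = j  # N(j): number of unit documents that exist when d_j appears
        tf: Dict[str, int] = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        for term in tf:  # update DF first so that d_j itself is counted
            df[term] = df.get(term, 0) + 1
        for term, count in tf.items():
            scores[(term, j)] = count * math.log10(n_j / df[term])
    return scores


docs = [
    ["flood", "rain"],
    ["flood", "flood", "river"],
    ["rain", "shelter"],
]
scores = incremental_tfidf(docs)
```

In this example "flood" appears in every document up to d(2), so its score there is zero, while "river", new in d(2), scores log10(2/1). A term is therefore weighted by how unusual it is relative to what had been seen up to that point, not relative to the final collection.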

たとえば図５に示すように、各記事（単位ドキュメント）ｄ(1)，ｄ(2)，ｄ(3)，…に出現する形態素ｔ(1)，ｔ(2)，ｔ(3)，…が変化する場合を想定する。この場合、単位ドキュメントの数Ｎ(j)をフィールドに持つテーブルが図６（Ｂ）に示すように表される。また、各単位ドキュメントのＤＦ(ti,dj)をフィールドに持つテーブルが図７（Ｂ）のように表され、Ｎ(j)の値によって形態素ｔ(i)を識別子にもった各単位ドキュメントの増加型ＴＦＩＤＦ(ti,dj)値をフィールドに持つテーブルが図８（Ｂ）のようになる。これらのテーブルは、いずれも、テキストデータベース１６に逐次蓄積される。 For example, as shown in FIG. 5, suppose that the morphemes t(1), t(2), t(3), ... appearing in each article (unit document) d(1), d(2), d(3), ... vary. In this case, the table whose fields hold the number of unit documents N(j) is as shown in FIG. 6(B). The table whose fields hold DF(ti,dj) for each unit document is as shown in FIG. 7(B), and the table whose fields hold the incremental TFIDF(ti,dj) of each unit document, computed from N(j) and keyed by morpheme t(i), is as shown in FIG. 8(B). All of these tables are stored successively in the text database 16.

After the incremental TFIDF of every morpheme has been calculated in step S9 in this way, in the following step S11 the computer 14 calculates the cumulative incremental TFIDF (Σ incremental TFIDF) and the cumulative TF (ΣTF) as observed values up to corpus C(t). Because the incremental TFIDF(ti,dj) is given by FIG. 8(B) and DF(ti,dj) by FIG. 7(B), TF(ti,dj) can also be calculated; ΣTF is obtained by calculating TF(ti,dj) and then accumulating it. The cumulative incremental TFIDF is obtained by accumulating the values in the table of FIG. 8(B).

In the following step S13, the computer 14 takes the cumulative value ΣTF of TF(ti,dj) obtained for corpus C(t) as X and the cumulative value Σ incremental TFIDF of incremental TFIDF(ti,dj) as Y, fits them to the following equation (3) to obtain the constants a and b, and creates the regression curve shown in FIG. 9. This regression curve estimates (predicts) the incremental TFIDF in the next corpus C(t+Δt) for the residual analysis performed on that corpus. That is, with the ΣTF values up to corpus C(t) on the horizontal axis, if the incremental TFIDF follows the same trend in the next corpus C(t+Δt), the incremental TFIDF values of C(t+Δt) will be plotted on this regression curve.
[Equation 3]
Y = aX^b (3)
In step S15, the computer 14 obtains the difference (residual) between the cumulative value Σ incremental TFIDF of incremental TFIDF(ti,dj) in corpus C(t) at time j, calculated in step S11, and the estimate Y given by the regression curve Y = aX^b obtained in step S13 for the previous corpus C(t-Δt) (FIG. 10). The larger the residual, whether positive or negative, the further the observed value deviates from the Σ incremental TFIDF predicted for the same morpheme t(i) from the immediately preceding corpus C(t-Δt); in other words, the less it could have been predicted from what was common knowledge up to the preceding corpus. This residual is therefore taken as the value representing the distinctiveness of the morpheme and is called its discriminating value (特異値, Discriminating Value). A morpheme whose Σ incremental TFIDF has a positive residual (discriminating value) is plotted above the regression curve and is distinctive or characteristic. A morpheme whose Σ incremental TFIDF has a negative residual (discriminating value) has no distinctiveness at all and is, conversely, a commonplace morpheme.
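Steps S13 and S15 can be illustrated with a short Python sketch. The text does not say how the constants a and b are fitted, so ordinary least squares on the log-transformed values is assumed here; the function names and all numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def fit_power_law(xs, ys):
    """Fit Y = a * X**b by least squares on log(Y) = log(a) + b*log(X).
    ASSUMPTION: log-log regression; requires every X and Y to be > 0."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b

def discriminating_values(a, b, current):
    """current maps morpheme -> (cumulative TF, cumulative incremental TFIDF)
    observed in C(t). Returns morpheme -> observed minus predicted value."""
    return {m: y - a * x ** b for m, (x, y) in current.items()}

# Hypothetical per-morpheme totals in the previous corpus C(t - Δt).
prev_x = [1.0, 2.0, 4.0, 8.0]      # cumulative TF
prev_y = [2.0, 4.0, 8.0, 16.0]     # cumulative incremental TFIDF
a, b = fit_power_law(prev_x, prev_y)

# Hypothetical totals observed in the current corpus C(t).
current = {"dispatch": (4.0, 12.0), "earthquake": (8.0, 10.0)}
dv = discriminating_values(a, b, current)
print(round(a, 6), round(b, 6))
print({m: round(v, 6) for m, v in dv.items()})
```

With these made-up numbers the fitted curve is approximately Y = 2X, so "dispatch" lies above the curve (positive residual) and "earthquake" lies below it (negative residual).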

Referring to FIG. 10, if the Σ incremental TFIDF of morpheme t(i) plots above the regression curve Y = aX^b, that morpheme has a positive residual. A positive residual means that t(i) had appeared little up to C(t-Δt) and appeared abruptly during the elapsed Δt. If the Σ incremental TFIDF of morpheme t(i) in C(t) lies below the regression curve, the morpheme had already appeared many times up to C(t-Δt).

In step S15, residual analysis between the estimated (predicted) and observed Σ incremental TFIDF is performed for each morpheme in this way, and the discriminating value (residual) of each morpheme is stored successively, for example by attaching it as metadata to the text data table 20 (FIG. 2) of the database 16.

In the next step S17, the computer 14 selects singular words (keywords) and common words (keywords) according to the discriminating values (residuals) stored in the database 16 as described above. For example, the morphemes whose positive residuals rank within an arbitrary number from the top are selected as the singular words representing that corpus. Conversely, the morphemes whose negative residuals rank within an arbitrary number from the bottom are selected as common words. Common words are keywords that represent the constructed text database (language material) as a whole. Using these singular and common words, text data (language material) on the same theme can be found efficiently.
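The selection in step S17 amounts to ranking morphemes by residual and cutting at a chosen rank. A minimal Python sketch follows; the function name, the cut-off parameters, and the residual values are all hypothetical (only the four labeled words echo the FIG. 11 example).

```python
def select_keywords(dv, n_top, n_bottom):
    """dv maps morpheme -> discriminating value (residual).
    Returns (singular_words, common_words): the n_top largest positive
    residuals and the n_bottom most negative residuals."""
    positives = sorted((m for m in dv if dv[m] > 0),
                       key=lambda m: dv[m], reverse=True)
    negatives = sorted((m for m in dv if dv[m] < 0),
                       key=lambda m: dv[m])
    return positives[:n_top], negatives[:n_bottom]

# Hypothetical residuals for one time section.
dv = {
    "death": 5.2, "dispatch": 3.1, "shelter": 0.4,
    "earthquake": -7.5, "Chuetsu": -4.0, "report": -0.2,
}
singular, common = select_keywords(dv, n_top=2, n_bottom=2)
print(singular)  # ['death', 'dispatch']
print(common)    # ['earthquake', 'Chuetsu']
```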

Subsequently, in the final step S19, the computer 14 displays the singular words and common words selected in step S17 on a display (not shown).

FIG. 11 shows a display example from an analysis of web news published about the 2004 Niigata Chuetsu Earthquake. In FIG. 11, singular words with positive residuals are plotted in the upper part of the screen against elapsed time (horizontal axis), and common words with negative residuals are plotted in the lower part. Because fine detail cannot be drawn in FIG. 11, only two singular words ("death" and "dispatch") and two common words ("earthquake" and "Chuetsu") are labeled, but note that each graph segment is displayed with the morpheme (word) it represents. A display such as FIG. 11 shows singular words and common words separately, above and below, which has the advantage that they can be surveyed at a glance.

A tabular display such as that shown in FIG. 12 is also possible. In the table of FIG. 12, the horizontal axis shows elapsed time and the vertical axis lists a suitable number of the top-ranked singular words (by Rank) for each time segment.

Other display forms are of course possible; the display is not limited to the examples of FIGS. 11 and 12.

As explained above, if a keyword at a given point in time carries index information expressing its degree of distinctiveness, more distinctive keywords can be identified from the evaluation of that index. When a certain matter is the main subject of web news at some point in time, words expressing that matter are likely to appear frequently. Frequent keywords, however, can be expected to be of two kinds: keywords used heavily in composing any news article whatsoever, and keywords that are frequent only in a subset of news articles. The keywords that characterize news articles are the latter.

The TFIDF described above is an index that gives high weight to keywords of the latter kind. As stated above, TF(ti,dj) is the number of times keyword t(i) appears in article d(j), and DF(ti) is the number of documents in which keyword t(i) appears; IDF(ti) is then the reciprocal of the ratio of the number of documents containing t(i) to the total number of documents. In this embodiment, a morpheme that appears in every article thus receives a low weight, and a morpheme that seldom appears in other articles receives a high weight. The incremental TFIDF, the product of this IDF and TF, is an index of how often a keyword appears in an article and how rarely it appears in other articles; it can be regarded as an index of the keyword's degree of distinctiveness.

In the inventors' experiments in the background art, the incremental TFIDF for an article d(j) was not computed from N and DF based on all 2,623 articles finally collected. Instead, TFIDF was computed successively at the time each article d(j) was published, using time-dependent quantities: N(j), the total number of articles published up to the time of d(j), and DF(ti,dj), the number of documents containing morpheme ti up to the time of d(j). This is called incremental TFIDF. Ordinary TFIDF treats N and DF as constants and therefore cannot weight morphemes extracted from a growing body of language material. In the background art, the total number of documents and the number of documents containing a given morpheme are therefore treated as parameters that change according to the order criterion, and the incremental TFIDF, a modification of TFIDF, is used.

However, it is difficult to judge whether a keyword is distinctive from the incremental TFIDF value alone. A high incremental TFIDF up to a given point can arise in two ways: TF is low but IDF is high (DF is low), or IDF is low (DF is high) but TF is extremely large. An extremely large TF suggests a highly general word that has to be used repeatedly in writing an article. Whether a morpheme is distinctive therefore cannot be judged simply from its incremental TFIDF value.

Whether the information at a given point in time is distinctive can be determined by comparing the group of keywords used up to the previous point with the group used at the given point. A difference between the two suggests a substantial qualitative change around that point. That is, comparing the corpus at one point in time with the corpus after some further interval has elapsed may make it possible to capture changes in the nature of the information and to identify the keywords responsible. In the background art of this invention, residual analysis (step S15) is therefore performed, as described above, to compare the characteristics of the corpus at one point in time with those at the next.

The inventors evaluated, for actual news about a certain disaster, the relationship between the cumulative TF and the cumulative incremental TFIDF of each morpheme at several different times after the disaster occurred, and found a strong relationship of the form of equation (3) above. Except during periods when the number of samples (keywords) was small, the power function fitted with R² of 0.90 to 0.99, showing that a power-law relationship holds systematically between the cumulative TF and the cumulative incremental TFIDF. Under this relationship, a keyword lying near the fitted curve has a cumulative TF to cumulative incremental TFIDF relationship similar to the corpus average; such keywords can be regarded as having an average pattern of appearance. Accordingly, when the observed cumulative incremental TFIDF falls below the estimate from the fitted curve, the cumulative incremental TFIDF is low relative to the corpus average, that is, the keyword is not very distinctive. Conversely, when the observed value exceeds the estimate, the incremental TFIDF is high and the keyword is distinctive. This evaluation is made possible by taking the difference (residual) between the observed cumulative incremental TFIDF and the estimate from the fitted curve. Applying this relationship, the degree of distinctiveness of keywords at an arbitrary point in time is evaluated with the model shown in FIG. 13.

The left side of FIG. 13 schematically shows how the corpus changes as a unit time width Δt elapses from a time t-Δt. This relationship can be expressed by the following equation (4).

As shown in FIG. 13(A), when C(Δt) consists largely of keywords that had already appeared, or contains only morphemes with low frequency, the relationship between cumulative TF and cumulative incremental TFIDF differs little between the corpus at time t-Δt (C(t-Δt)) and the corpus at time t (C(t)), as shown at the upper right of FIG. 13. In contrast, as shown in FIG. 13(B), when keywords that had not appeared before t-Δt appear during Δt, or morphemes appear with high frequency, the corpus at time t (C(t)) changes substantially, and the shape of the curve relating cumulative TF to cumulative incremental TFIDF also changes substantially, as shown at the lower right of FIG. 13.

In other words, the residual between the cumulative incremental TFIDF at time t and the estimate from the relation fitted on the corpus at time t-Δt represents the change in the corpus during Δt itself, and the morphemes with large residuals are the keywords (singular words and common words) that represent the language material generated during Δt.

Thus, in the embodiment, the index used to evaluate the feature value of a keyword expressing qualitative change in information content at time t is the difference (residual) between the observed cumulative incremental TFIDF at time t and the estimate of it given by the relation between cumulative TF and cumulative incremental TFIDF fitted on the corpus at time t-Δt. Keywords with markedly high residuals are called feature words or singular words (positive residual or discriminating value), and keywords with markedly low residuals are called general words or common words (negative residual or discriminating value).

In the background-art document analysis apparatus 10 shown in FIG. 1, the computer 14 follows the procedure of the flowchart in FIG. 3, which relies on quantitative indices (the incremental TFIDF and the residual) rather than subjective human judgment and consists of a continuous sequence of processes. Provided the tools and reference resources are properly prepared, records of past events can therefore be taken as input, and the keywords that are the final product can be detected automatically and objectively through the sequence of processes.

In summary, in the document analysis apparatus 10 of the embodiment shown in FIG. 1, the computer 14 executes the following steps.

1) Build a database of text data that grows over time (here, web news).

2) Split the text into morphemes and attach part-of-speech information.

3) Based on the part-of-speech information, extract the nouns, verbs, adverbs, and adjectives, excluding those tagged as non-independent or suffix.

4) For each morpheme, calculate TF and the time-dependent incremental TFIDF for every document (here, every web news article).

5) To extract keywords representing the distinctive text between times t-Δt and t, obtain the relation between cumulative TF and cumulative incremental TFIDF in the corpus up to t-Δt, and take the difference between the estimate of the cumulative incremental TFIDF at time t based on that relation and its observed value. This residual is the feature value, that is, the discriminating value, of a keyword appearing during Δt.

6) Select the keywords (singular words) from the largest residual (discriminating value) down to an arbitrary rank, and attach each singular word as language-material metadata to the articles in which it was detected.
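Step 3) of the list above, the part-of-speech filter, can be sketched in Python as follows. The input is a hypothetical list of already analyzed morphemes; the text here does not name a morphological analyzer, and the tag labels are illustrative rather than taken from any particular dictionary. The sketch also assumes that the "non-independent" and "suffix" exclusions apply to every retained part of speech, which is one possible reading of the step.

```python
# Hypothetical analyzer output: (surface form, part of speech, subcategory).
tagged = [
    ("地震", "名詞", "一般"),       # noun, general
    ("こと", "名詞", "非自立"),     # noun, non-independent
    ("さん", "名詞", "接尾"),       # noun, suffix
    ("起きる", "動詞", "自立"),     # verb, independent
    ("いる", "動詞", "非自立"),     # verb, non-independent
    ("が", "助詞", "格助詞"),       # particle
    ("大きい", "形容詞", "自立"),   # adjective, independent
    ("とても", "副詞", "一般"),     # adverb
]

CONTENT_POS = {"名詞", "動詞", "副詞", "形容詞"}   # noun, verb, adverb, adjective
EXCLUDED_SUBTYPES = {"非自立", "接尾"}            # non-independent, suffix

def is_keyword_candidate(pos, subtype):
    """Keep content words only, dropping non-independent and suffix forms.
    ASSUMPTION: the exclusion applies to all four parts of speech."""
    return pos in CONTENT_POS and subtype not in EXCLUDED_SUBTYPES

candidates = [surface for surface, pos, sub in tagged
              if is_keyword_candidate(pos, sub)]
print(candidates)  # ['地震', '起きる', '大きい', 'とても']
```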

As described above, the previously proposed document analysis method numerically evaluates, at arbitrary time sections, the distinctiveness of words in a corpus that grows over time, and generates a time-series data set. If trends in this time-series data can be captured accurately, it may be possible to predict how an event will develop.

In the background art described above, the incremental TFIDF was defined so as to be applicable to a corpus in which only the time of generation of each unit document is considered. However, even documents that do not lie on a real time axis, such as free-answer descriptions, can be assumed to have been generated according to some order criterion; the free-answer descriptions then line up on a pseudo time axis, and words can be weighted by the incremental TFIDF. By arranging the unit documents of free-answer descriptions according to a fixed order criterion in this way, the incremental TFIDF of the background art can be applied in a pseudo manner.

Accordingly, the document analysis apparatus 10 of one embodiment of this invention, shown in FIG. 14, uses the background art shown in FIGS. 1 and 3 to analyze a corpus whose unit documents are free-answer text data. Like the apparatus of FIG. 1, this document analysis apparatus 10 includes a computer 14 with an operation means 15A and a display means (monitor) 15B; the text database 16 and the analysis database 18 described earlier are attached to the computer 14, and the network 12 is connected to it.

Furthermore, in the embodiment of FIG. 14, an image scanner 30 may be attached to the computer 14, and a survey form 32 containing free-answer descriptions may be read with the image scanner 30 and converted to text data by applying character recognition. The face items (described later) that normally accompany free-answer descriptions may likewise be recovered from the scanner's read data. The contents of a survey form containing free-answer descriptions can then be imported automatically into the text database 16 as text data 20.

Alternatively, the text database 16 may be created manually from such survey forms: the collected forms are coded and the free-answer descriptions are typed in as text. In the coding step, as is well known, the actual responses (numbers and so on) are entered into a matrix of cases (respondents) × attributes (question items). Where the response is itself a number, as with "age", the number is entered as is. Once the matrix has been created by this coding process, the survey forms can be arranged in ascending or descending order according to a given order criterion such as age.
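The coded matrix and the ordering by an order criterion can be pictured with a small Python sketch. The field names and respondent data are hypothetical and stand in for one row per case (respondent) with a few attributes and the free-answer text.

```python
# Hypothetical coded matrix: one row per case (respondent).
respondents = [
    {"id": 1, "age": 52, "free_answer": "Evacuation information arrived late."},
    {"id": 2, "age": 23, "free_answer": "The school gym was used as a shelter."},
    {"id": 3, "age": 67, "free_answer": "I worry about my pension and housing."},
    {"id": 4, "age": 35, "free_answer": "Roads to the workplace were closed."},
]

def order_by(rows, criterion, descending=False):
    """Arrange unit documents along a pseudo time axis defined by one attribute."""
    return sorted(rows, key=lambda r: r[criterion], reverse=descending)

ascending = order_by(respondents, "age")
descending = order_by(respondents, "age", descending=True)
print([r["id"] for r in ascending])   # [2, 4, 1, 3]
print([r["id"] for r in descending])  # [3, 1, 4, 2]
```

Each ordered list can then be treated as the sequence d(1), d(2), … of unit documents. Note that Python's sort is stable, so respondents with equal values keep their input order in both directions.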

Even when the image scanner 30 reads the survey forms 32 automatically, the matrix can be created automatically from the recovered face items, so the survey forms can again be arranged in ascending or descending order according to a given order criterion.

FIG. 15 shows an example of a survey form containing a free-answer description that can be analyzed by the document analysis apparatus 10 of the embodiment of FIG. 14. The survey form 32 is filled in on, for example, A4 paper, with a face item entry area 34 at the top of the sheet and a free-answer description area 36 at the bottom. In this embodiment the survey form 32 is shown as a single sheet purely for illustration; it may be a set of several sheets, and in addition to the face item entry area 34 and the free-answer description area 36 it may include question areas (not shown) in precoded answer format (in which options are ticked).

The face item entry area 34 contains, for example, entry fields for sex, age, occupation, number of years (of service or in business), household size, and annual household income, and these items are set up as precoded questions. Values entered or selected in the face item entry area 34, such as age, number of years, or annual household income, can be used as the order criterion for arranging the free-answer descriptions in the free-answer description area 36 of the survey forms 32, as unit documents, in ascending or descending order. The free-answer description area 36 holds, as the name implies, freely written answers and opinions.

To analyze free-answer descriptions, the computer 14 of FIG. 14 displays, for example, the GUI 40 shown in FIG. 16 on the monitor (display means) 15B so that the user can make settings. The GUI 40 has analysis item selection buttons 42, 44, and 46 for analyzing free-answer descriptions. On the right side of the GUI 40 there are also a corpus selection area 48 and a target setting area 50. The corpus selection area 48 lists, as corpus names, descriptions of the corpora (bodies of language material) currently available; the user selects the corpus for which the evaluation item chosen with buttons 42 to 46 is to be obtained by operating (clicking) its name. In the target setting area 50, the user enters how many of the top-ranked morphemes (words), ranked by the cumulative discriminating value ΣD that serves as the evaluation criterion in this embodiment, are to be evaluated; that is, an arbitrary number n is entered in the window 52 in area 50. For example, setting window 52 to "50" means that the user has chosen to target the morphemes ranked in the top 50 by cumulative discriminating value ΣD.

The representative keyword value selection button 42 is clicked by the user, for example with a mouse (not shown), to select the representative keyword value as the evaluation item. As explained above, the sum of the discriminating values (Discriminating Value) from the beginning up to an arbitrary time (order criterion) section is called the cumulative discriminating value and serves as an index for identifying important singular words. Words with positive discriminating values characterize the individual time (order criterion) sections (singular words), and words with negative values are ubiquitous in the corpus (common words). If discriminating values were simply summed, negative values would be added once a word changed from a singular word to a common word. Although such a word was a singular word tied to an important event over a certain period, that is, over a certain range of the order criterion, its accumulated discriminating value would be reduced and it might not be identified as an important singular word. Here, therefore, only the positive discriminating values are summed to give the cumulative discriminating value (ΣD). The cumulative discriminating value ΣD indicates how heavily weighted the morpheme (word) is.
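The effect of summing only the positive discriminating values can be seen in a short Python sketch. The function name and the series of values are hypothetical; the series represents one word that is distinctive in early sections and later becomes a common word.

```python
def cumulative_discriminating_value(values):
    """Sum only the positive discriminating values of one morpheme
    across successive time (order criterion) sections."""
    return sum(v for v in values if v > 0)

# Hypothetical discriminating values of one word in four successive sections.
series = [3.0, 2.5, -1.0, -4.0]

naive_total = sum(series)                           # negatives cancel the early peak
sigma_d = cumulative_discriminating_value(series)   # early distinctiveness is kept
print(naive_total)  # 0.5
print(sigma_d)      # 5.5
```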

However, the background art of FIG. 1 in principle gives higher weight to words (morphemes) in documents placed later in the order. If free-answer descriptions are analyzed with that method unchanged, the extracted keywords can be expected to reflect the characteristics of the order criterion strongly. For example, if free-answer descriptions are arranged by an order criterion such as age and both the ascending-order cumulative discriminating value ΣD(i,ord,asc) and the descending-order cumulative discriminating value ΣD(i,ord,dsc) are obtained, the former is likely to extract keywords characteristic of older respondents and the latter keywords characteristic of younger respondents. In this embodiment, therefore, the arithmetic mean of the ascending ΣD(i,ord,asc) and the descending ΣD(i,ord,dsc) obtained by Equation 5, that is, the average cumulative discriminating value aveΣD, is calculated in order to obtain a more objective (or representative) result from which the characteristics of the chosen order criterion are removed. When the order criterion is age, for example, aveΣD indicates how prominent a morpheme (keyword) is among both younger and older respondents, that is, how representative a keyword it is. The average cumulative discriminating value aveΣD can therefore also be called the representative keyword value. This representative keyword value aveΣD is one index for evaluating free-answer descriptions (unit documents).

Here, D(i,j,ord,asc/dsc) is the singular value of word i in unit document j, obtained with the documents arranged in ascending (asc) or descending (dsc) order of the order criterion (ord).
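The averaging of the ascending and descending cumulative singular values can be sketched as follows. The dictionaries and their numbers are hypothetical. Treating a word that is missing from one direction as having a value of 0 is an assumption of this sketch, not something the description states.

```python
def average_cumulative(sigma_asc, sigma_dsc):
    """Arithmetic mean of the ascending and descending cumulative values.

    Assumption: a word absent from one direction counts as 0.0 there.
    """
    words = set(sigma_asc) | set(sigma_dsc)
    return {
        w: (sigma_asc.get(w, 0.0) + sigma_dsc.get(w, 0.0)) / 2.0
        for w in words
    }


# Hypothetical cumulative singular values per word.
sigma_asc = {"management": 10.0, "leakage": 2.0}
sigma_dsc = {"management": 0.0, "leakage": 12.0, "facility": 6.0}

ave_sigma_d = average_cumulative(sigma_asc, sigma_dsc)
ranked = sorted(ave_sigma_d, key=ave_sigma_d.get, reverse=True)
print(ranked)
```

A word that is strong in only one direction is damped by the mean, while a word that scores in both directions keeps a comparatively high value.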

The second index is based on the sum (total) of the ascending/descending cumulative singular values ΣD described above. The cumulative singular value ΣD indicates how important a word (morpheme) is. It follows that an article containing more morphemes (words) with a large ΣD, and hence a high weight, can be regarded as a more important free description (article). In the embodiment shown in FIG. 14, therefore, ΣΣD (the cumulative singular value sum), which indicates how many important morphemes one article contains, is adopted as the second index. If ΣΣD were used as is, however, its value would depend on the length of the description (the number of words). To avoid this, the cumulative singular value sum ΣΣD is computed by totaling the cumulative singular values ΣD of only those morphemes that rank above a certain position. This cumulative singular value sum ΣΣD is given by Equation 6.

Here, rank is the number of top-ranked words (the rank cutoff) taken into account in the cumulative singular value sum ΣΣD(j,ord,asc/dsc) when weighting a free-answer description, and rank(j) is the rank of ΣD(i,ord,asc/dsc) for word (morpheme) i.
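A minimal sketch of the rank-limited sum for one document follows. Equation 6 itself is not reproduced in this text, so two points are assumptions: the ranking is taken over the whole corpus by ΣD, and each distinct morpheme in the document is counted once. All names and values are hypothetical.

```python
def top_ranked(sigma_d, rank):
    """Return the set of the `rank` morphemes with the largest cumulative value."""
    ordered = sorted(sigma_d, key=sigma_d.get, reverse=True)
    return set(ordered[:rank])


def cumulative_sum_for_document(morphemes, sigma_d, rank):
    """Total the cumulative values of the document's top-ranked morphemes.

    Assumption: each distinct morpheme is counted once per document.
    """
    keep = top_ranked(sigma_d, rank)
    return sum(sigma_d[m] for m in set(morphemes) if m in keep)


# Hypothetical cumulative singular values for a four-word vocabulary.
sigma_d = {"Pu": 9.0, "leakage": 7.0, "device": 4.0, "suru": 0.5}

doc_a = ["Pu", "leakage", "suru", "leakage"]
doc_b = ["device", "suru"]

sum_a = cumulative_sum_for_document(doc_a, sigma_d, rank=2)
sum_b = cumulative_sum_for_document(doc_b, sigma_d, rank=2)
print(sum_a, sum_b)
```

Limiting the sum to top-ranked morphemes keeps a long document from scoring high merely because it contains many low-weight words.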

On the other hand, as explained above, the cumulative singular value ΣD depends on whether the documents in the corpus are arranged in ascending or descending order of the order criterion. It is therefore easy to predict that the cumulative singular value sum ΣΣD, which totals the ΣD of the morphemes in each document (article), is also affected by the choice of ascending or descending order.

In the embodiment, therefore, the ascending cumulative singular value sum ΣΣD(j,ord,asc,rank) and the descending cumulative singular value sum ΣΣD(j,ord,dsc,rank), each restricted to the specified top ranks, are adopted, and their arithmetic mean is calculated according to Equation 7 to obtain the average cumulative singular value sum aveΣΣD(j,ord,rank). This average cumulative singular value sum aveΣΣD(j,ord,rank) is called the representative document value or representative case value.

As explained earlier, TFIDF(i) by its nature can assign a high weight even to words that carry no content of their own or to comprehensive, highly abstract words, so it may not weight the free-answer descriptions targeted by this invention appropriately. In contrast, with the ascending cumulative singular value ΣD(j,ord,asc,rank) and the descending cumulative singular value ΣD(j,ord,dsc,rank) based on the order criterion, morphemes (words) with a more concrete meaning show high values, as described above. If these words and their weights are used as they are, however, the results strongly reflect one side of the order criterion, as do the ascending cumulative singular value sum ΣΣD(j,ord,asc,rank) and the descending cumulative singular value sum ΣΣD(j,ord,dsc,rank). A homogeneous representative value that does not depend on the effect of the order criterion, like TFIDF(j,rank), is also needed. To cancel out the effect of the order criterion, the mean of the two indices is therefore taken. This mean is called the average cumulative singular value sum aveΣΣD(j,ord,rank) and is adopted as an index of how representative the article (free-answer description) is.

A free-answer description for which both the ascending cumulative singular value sum ΣΣD(j,ord,asc,rank) and the descending cumulative singular value sum ΣΣD(j,ord,dsc,rank) are high is expected to express the characteristics of both well. Because some free-answer descriptions may have both characteristics, the effect of the order criterion cannot always be expressed appropriately. When the effect of the order criterion is to be emphasized, taking the difference between ΣΣD(j,ord,asc,rank) and ΣΣD(j,ord,dsc,rank) is considered to allow a weighting in which the absolute value of the difference reflects the ascending or descending nature of the order criterion. This difference is called the difference cumulative singular value sum diffΣΣD(j,ord,rank) and is expressed by Equation 8.

For example, a positive diffΣΣD(j,ord,rank) is expected to give a high weight to free-answer descriptions that express the ascending nature of the order criterion ord (for age, the nature of older respondents), and a negative value to those that express the descending nature (for age, the nature of younger respondents). In other words, the index of Equation 8 is a numerical value indicating how specific the article (free-answer description) is, that is, the specific document value or specific case value.
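The two document-level indices can be sketched together. The equations are not reproduced in this text, so the sign convention diff = ascending sum minus descending sum is an assumption, chosen to match the statement that a positive value reflects the ascending nature of the order criterion. The function name and the numbers are hypothetical.

```python
def representative_and_specific(sum_asc, sum_dsc):
    """Return (average, difference) of the two cumulative singular value sums.

    Assumption: difference = ascending sum - descending sum.
    """
    ave = (sum_asc + sum_dsc) / 2.0
    diff = sum_asc - sum_dsc
    return ave, diff


# Hypothetical document dominated by "late" (ascending) vocabulary.
ave_late, diff_late = representative_and_specific(1500.0, 300.0)

# Hypothetical document that scores equally in both directions.
ave_both, diff_both = representative_and_specific(900.0, 900.0)

print(ave_late, diff_late, ave_both, diff_both)
```

Both hypothetical documents have the same average, so only the difference separates the one-sided document from the balanced one.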

Thus, the document analysis apparatus 10 of the embodiment of FIG. 14 adopts three indices considered important for analyzing free-answer descriptions: the average cumulative singular value aveΣD(i,ord,rank), the average cumulative singular value sum aveΣΣD(j,ord,rank), and the difference cumulative singular value sum diffΣΣD(j,ord,rank).

First, with reference to FIG. 17, the operation is described for the case in which the user clicks the representative keyword value selection button 42 in the GUI 40 of FIG. 16 and thereby chooses to analyze the corpus of free-answer descriptions using the average cumulative singular value aveΣD(i,ord,rank) as the evaluation item.

In the background art, the target was documents that accumulate over time, so the timing for executing the operation of the flowchart of FIG. 3 was determined by whether a set time had elapsed. In this embodiment, however, the target is a corpus whose documents are arranged by an arbitrary order criterion. The computer 14 therefore executes the operation of FIG. 17 each time one or more documents (free-answer descriptions) are added, that is, each time the corpus grows by one or more documents along the order criterion. In other words, executing the operation of FIG. 17 requires that documents be added in accordance with the order criterion.

In step S31 of FIG. 17, all documents (corpus texts) in the corpus that the user selected through the corpus selection area 48 of the GUI 40, including the document i newly added at that point, are arranged in ascending order of the order criterion, for example age. Steps S3 to S19 of the background art of FIG. 3 are then executed to obtain the residual value, that is, the singular value D(i), of each morpheme. In doing so, the cumulative value ΣTF of TF(ti,dj) obtained for the previous corpus, which includes documents up to the immediately preceding document i-1 in the order criterion, is taken as X, and the cumulative value of the increased TFIDF(ti,dj) is taken as Y. These are fitted to Equation 3 above to determine the constants a and b, which yields the regression curve of FIG. 9 in step S13 (FIG. 3). In step S15 (FIG. 3), the difference (residual value) between the cumulative value of the increased TFIDF(ti,dj) in the current corpus, which includes documents up to document i, and the estimate Y given by the regression curve Y = aX^b obtained from the previous corpus, which includes documents up to document i-1, is computed as the singular value. Consequently, by step S33 of FIG. 17, the singular values of all morphemes in the current corpus, with documents up to document i arranged in ascending order of the order criterion, have been calculated.
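The regression and residual of steps S13 and S15 can be sketched as follows. This section does not state how the constants a and b of Y = aX^b are estimated, so ordinary least squares on log-transformed values is an assumption here, as is evaluating the curve at the morpheme's current cumulative TF. The function names and sample points are hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Fit Y = a * X**b by least squares on log-log axes; return (a, b).

    Assumption: all xs and ys are strictly positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def singular_value(x_current, y_current, a, b):
    """Residual between the observed value and the regression estimate."""
    return y_current - a * x_current ** b


# Hypothetical (cumulative TF, cumulative increased TFIDF) points of the
# previous corpus; they lie exactly on Y = 2 * X**0.5.
prev_x = [1.0, 4.0, 9.0, 16.0]
prev_y = [2.0, 4.0, 6.0, 8.0]
a, b = fit_power_law(prev_x, prev_y)

# A morpheme of the current corpus whose observed value exceeds the estimate.
d = singular_value(25.0, 13.0, a, b)
print(a, b, d)
```

A positive residual marks a morpheme weighted above what the previous corpus predicts, and a negative residual marks one weighted below it.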

Then, in step S35, the ascending cumulative singular value ΣD(i,ord,asc) is calculated according to Equation 5 above.

Next, steps S37 to S41 are executed in the same manner as steps S31 to S35: the singular values of all morphemes in the current corpus, with documents up to document i arranged in descending order of the order criterion, are calculated, and from them the descending cumulative singular value ΣD(i,ord,dsc) is calculated using Equation 5.

In the following step S43, the arithmetic mean of the ascending cumulative singular value ΣD(i,ord,asc) and the descending cumulative singular value ΣD(i,ord,dsc) is calculated to obtain the representative keyword value.

Then, in step S45, the monitor 15B displays the average cumulative singular value aveΣD, the ascending cumulative singular value ΣD(i,ord,asc) and descending cumulative singular value ΣD(i,ord,dsc) on which it is based, and a graph (described later) in which the ascending cumulative singular value ΣD(i,ord,asc) and the descending cumulative singular value ΣD(i,ord,dsc) are plotted on the vertical and horizontal axes.

In the inventors' experiment, the free-answer descriptions were collected from the "Accident/Failure Information Database" (http://www.n-linet.ne.jp/default.htm) published on the site operated by the Nuclear Safety Technology Center (IINET (Incident Information network system): http://www.n-linet.ne.jp). The results are shown in FIGS. 18 to 22.

In the experiment, the records were retrieved from the above database to create a corpus. The date of occurrence was adopted as the order criterion, the corpus texts were arranged in ascending and descending order, and the processing shown in FIG. 17 was executed with the unit document order changed according to this order criterion.

FIG. 18 is a graph listing morphemes in descending order of the average cumulative singular value aveΣD, which is the mean of ΣD in ascending order of the date of occurrence and ΣD in descending order. In this graph, however, only the top 50 morphemes (words), as specified in the GUI 40, are used: it shows the average cumulative singular value aveΣD(i,ord,rank), the mean of the ascending cumulative singular value ΣD(i,ord,asc,rank) and the descending cumulative singular value ΣD(i,ord,dsc,rank). FIG. 19 is a graph of the ascending cumulative singular value ΣD(i,ord,asc,rank), and FIG. 20 is a graph of the descending cumulative singular value ΣD(i,ord,dsc,rank).

Because FIG. 19 shows the ascending cumulative singular value ΣD(i,ord,asc,rank), morphemes that appear at later dates of occurrence (the order criterion) tend to have larger weights. Here the morphemes with the largest cumulative singular values are, in order, "discovery", "facility", "management", "system", "nuclear", and so on. Because FIG. 20 shows the descending cumulative singular value ΣD(i,ord,dsc,rank), morphemes that appear at earlier dates of occurrence tend to have larger weights. Here the morphemes with the largest cumulative singular values are, in order, "Pu (plutonium)", "leakage", "device", "allowable", "exposure", and so on. In other words, the nuclear incident reports show that failures and accidents caused by equipment (hardware) were frequent in the early period, whereas recently accidents and failures related to management and control systems have become frequent instead.

For the average cumulative singular value aveΣD(i,ord,rank) shown in FIG. 18, morphemes with a large cumulative singular value in ascending order and morphemes with a large cumulative singular value in descending order both came out with relatively large values. This shows that morphemes (words) with a large average cumulative singular value aveΣD(i,ord,rank) can serve as representative keywords of the corpus.

FIG. 21 is a graph in which the vertical axis shows the ascending cumulative singular value ΣD(i,ord,asc,rank) and the horizontal axis shows the descending cumulative singular value ΣD(i,ord,dsc,rank), indicating where each morpheme is plotted. As described above, morphemes such as "discovery", "facility", "management", "system", and "nuclear" are plotted high on the vertical axis, and morphemes such as "Pu (plutonium)", "leakage", "device", "allowable", and "exposure" are plotted high on the horizontal axis. The graph of FIG. 21 thus makes it easy to see which keywords represent this corpus of nuclear failure and accident reports.

FIG. 22 is a graph in which each morpheme is plotted with the average cumulative singular value aveΣD(i,ord,rank) on the vertical axis and the increased TFIDF(i) on the horizontal axis. Morphemes with a large aveΣD(i,ord,rank) as evaluated in this embodiment are plotted high, which indicates that the average cumulative singular value is reliable to some extent. For example, for keywords such as "facility", "allowable", "Pu", and "leakage", it gives roughly the same results as the conventional weighting index TFIDF(i), which does not take time information into account. At the same time, it does not give a high weight to words such as "suru" ("to do"), which carry no meaning of their own but are inevitably used frequently in Japanese, or to commonplace words such as "contamination" and "work", which are inevitably frequent given the nature of the corpus (articles). In this respect its reliability is supported.

When the selection button 38 or 40 is operated in the GUI 40 shown in FIG. 16, the computer 14 of the embodiment shown in FIG. 14 executes the processing shown in FIG. 23.

Steps S51 to S61 of FIG. 23 are the same as steps S31 to S41 shown in FIG. 17 except for the following points, and duplicate description is omitted here.

Specifically, whereas step S35 of FIG. 17 calculates the ascending cumulative singular value ΣD(i,ord,asc,rank), step S55 of this embodiment calculates the ascending cumulative singular value sum ΣΣD(j,ord,asc,rank) according to Equation 6 above. Likewise, whereas step S41 of FIG. 17 calculates the descending cumulative singular value ΣD(i,ord,dsc,rank), step S61 of this embodiment calculates the descending cumulative singular value sum ΣΣD(j,ord,dsc,rank).

Then, in step S63 of FIG. 23, the average cumulative singular value sum aveΣΣD(j,ord,rank) is calculated according to Equation 7. This evaluation value mitigates the influence of the order criterion on the ascending cumulative singular value sum ΣΣD(j,ord,asc,rank) and the descending cumulative singular value sum ΣΣD(j,ord,dsc,rank), and indicates how representative the article (free-answer description) is. That is, an article with a large average cumulative singular value sum aveΣΣD(j,ord,rank) is a representative article of the corpus at that time.

In step S65, the difference cumulative singular value sum diffΣΣD(j,ord,rank) is calculated. As explained earlier, articles with a high ascending cumulative singular value sum ΣΣD(j,ord,asc,rank) and articles with a high descending cumulative singular value sum ΣΣD(j,ord,dsc,rank) are strongly influenced by the characteristics of the ascending and descending orders, respectively. The corpus may, however, contain articles that have both characteristics, in which case the effect of the order criterion cannot be expressed appropriately. The difference cumulative singular value sum diffΣΣD(j,ord,rank) is therefore adopted as an index that emphasizes the effect of the order criterion and thereby indicates how specific the article is within the corpus.

Finally, in step S67, the monitor 15B displays a graph such as the one illustrated in FIG. 24, with the average cumulative singular value sum aveΣΣD(j,ord,rank) on the vertical axis and the difference cumulative singular value sum diffΣΣD(j,ord,rank) on the horizontal axis. As in the earlier example, accident and failure reports on nuclear power are selected as the corpus for this graph. Because the horizontal axis is diffΣΣD(j,ord,rank), characteristic articles are plotted in the right and left halves of the graph: articles with a large cumulative singular value sum in ascending order of the order criterion are plotted to the right of "0" on the horizontal axis, and articles with a large cumulative singular value sum in descending order are plotted to the left of "0". The vertical axis is aveΣΣD(j,ord,rank), so articles representative of this corpus are plotted in the upper part, while less important articles are plotted in the lower part. As is especially clear from FIG. 25, in which the height of each bar indicates the number of articles, the less important articles are densely concentrated in the upper half of the vertical axis (which corresponds to the lower half in FIG. 24). Hence, if time is limited, the articles plotted toward the bottom of the vertical axis need not be read.

Here, the results of the inventors' experiment are examined to see what kinds of articles were actually found in the three zones b-c-d, a-b, and d-e shown in FIG. 24.

Zone b-c-d contains the articles whose average cumulative singular value sum aveΣΣD(j,ord,rank) is in the top 5%. That is, important articles representative of the corpus are plotted in zone b-c-d. Articles 1, 2, and 3, which are examples of articles in zone b-c-d, are shown in Table 1 below. In all of the tables, proper nouns (place names and personal names) in the articles are masked so as to avoid unintended consequences.

Article 1 describes an event that occurred in the United Kingdom on December 5, 1978; its aveΣΣD was the maximum value, 2083 (diffΣΣD = 193). The important keywords in this article were "facility", "management", "allowable", "safety", "exceed", and "Pu". Article 2 describes an event that occurred in the United Kingdom on September 14, 1987; its aveΣΣD was 1719 (diffΣΣD = -730). The important keywords in this article were "allowable", "suspicion", "exceed", "result", and so on. Article 3 describes an event that occurred in the United Kingdom on August 1, 1978; its aveΣΣD was 1696 (diffΣΣD = -695). The important keywords in this article were "allowable", "suspicion", "exceed", "result", "monitoring", and so on.

Zone a-b contains the articles whose positive difference cumulative singular value sum diffΣΣD(j,ord,rank) is in the top 5%. That is, articles that are important in the relatively recent period are plotted in zone a-b. Articles 4, 5, and 6, which are examples of articles in zone a-b, are shown in Table 2 below.

Article 4 describes an event that occurred in the United States on March 11, 2008; its diffΣΣD was the maximum value, 1947 (aveΣΣD = 974). The important keywords in this article were "discovery", "nuclear", "safety", and "violation". Article 5 describes an event that occurred in Japan on March 11, 2007; its diffΣΣD was 1731 (aveΣΣD = 971). The important keywords in this article were "discovery", "management", "nuclear", and "fuel". Article 6 is another article describing the same event, which occurred in Japan on October 4, 2007, as described in Article 2 above; its diffΣΣD was 1731 (aveΣΣD = 971). The important keywords in this article were "discovery", "management", "nuclear", and "fuel".

Zone d-e contains the articles whose negative difference cumulative singular value sum diffΣΣD(j,ord,rank) is in the top 5%. That is, articles that are important in the relatively distant past are plotted in zone d-e. Articles 7, 8, and 9, which are examples of articles in zone d-e, are shown in Table 3 below.

Article 7 describes an event that occurred in France on December 1, 1978; its diffΣΣD was -1984 (aveΣΣD = 992). The important keywords in this article were "Pu", "leakage", "damage", and "exposure". Article 8 describes an event that occurred in the United Kingdom on June 13, 1989; its diffΣΣD was -1879 (aveΣΣD = 940). The important keywords in this article were "Pu", "leakage", "piping", and "tank". Article 9 describes an event that occurred in the United Kingdom on June 22, 1977; its diffΣΣD was -1872 (aveΣΣD = 1052). The important keywords in this article were "Pu", "leakage", "rise", and "abnormality".
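As a check on how the two reported indices relate, the ascending and descending sums can be recovered from a reported pair of values. This sketch assumes that the average is (ascending + descending) / 2 and that the difference is ascending minus descending; the latter sign convention is inferred from the description, not quoted from Equation 8. The reported figures appear to be rounded, so small remainders such as 0.5 should not be over-interpreted.

```python
def split_sums(ave, diff):
    """Invert ave = (asc + dsc) / 2 and diff = asc - dsc; return (asc, dsc)."""
    return ave + diff / 2.0, ave - diff / 2.0


# (aveΣΣD, diffΣΣD) pairs as reported for three of the example articles.
reported = {
    "article 1": (2083, 193),
    "article 4": (974, 1947),
    "article 7": (992, -1984),
}

recovered = {name: split_sums(ave, diff) for name, (ave, diff) in reported.items()}
for name, (asc, dsc) in recovered.items():
    print(name, asc, dsc)
```

Under these assumptions, Article 7 has an ascending sum of 0 and Article 4 a descending sum close to 0, which is consistent with their placement at the extremes of the horizontal axis, whereas Article 1 scores high in both directions.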

As shown in FIGS. 24 and 25, displaying a graph with the average cumulative singular value sum aveΣΣD(j,ord,rank) on one of two axes and the difference cumulative singular value sum diffΣΣD(j,ord,rank) on the other makes it easy to find representative articles or cases that stand for the corpus as a whole (those with a large aveΣΣD). In addition, the magnitude of diffΣΣD(j,ord,rank) makes the characteristics of each article with respect to the order criterion used to arrange the corpus much clearer. In the example above, articles characteristic of the earlier period of nuclear incidents and articles characteristic of the more recent period can both be found easily.
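The selection of the three zones can be sketched as a simple filter. The description does not say whether the top 5% of the positive (or negative) differences is taken among all articles or only among those of the same sign; this sketch assumes the latter and keeps at least one article per zone. The function names, field names, and generated data are hypothetical.

```python
def top_percent(items, key, percent=5):
    """Return the top `percent` of items by `key`, keeping at least one."""
    ordered = sorted(items, key=key, reverse=True)
    count = max(1, len(ordered) * percent // 100)
    return ordered[:count]


def zones(articles):
    """Split articles into representative, recent-specific, and early-specific."""
    representative = top_percent(articles, key=lambda a: a["ave"])
    recent = top_percent(
        [a for a in articles if a["diff"] > 0], key=lambda a: a["diff"]
    )
    early = top_percent(
        [a for a in articles if a["diff"] < 0], key=lambda a: -a["diff"]
    )
    return representative, recent, early


# Hypothetical corpus of 40 articles with synthetic index values.
articles = [{"id": n, "ave": n * 10, "diff": (n - 20) * 5} for n in range(40)]

rep, recent, early = zones(articles)
print([a["id"] for a in rep], [a["id"] for a in recent], [a["id"] for a in early])
```

A reader with limited time could start with the representative list and then consult the recent or early list depending on which end of the order criterion is of interest.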

In the embodiment described above, questionnaires and survey forms containing free-answer descriptions were assumed as the corpus. However, the invention is applicable to any corpus, such as news, web news, blogs, newspapers, magazines, interview records, written statements, questionnaires, and novels, as long as the unit documents can be arranged in ascending and descending order of an arbitrary order criterion in accordance with the method of the invention.

DESCRIPTION OF SYMBOLS
10 ... Document analysis apparatus
12 ... Network
14 ... Computer
16 ... Text database
18 ... Analysis database
30 ... Image scanner
32 ... Survey form
40 ... GUI

Claims

A document analysis apparatus for analyzing, on the basis of an increased TFIDF, language material that increases in accordance with an order criterion, the apparatus obtaining a singular value for each morpheme by residual analysis between an estimated value based on a cumulative value of the increased TFIDF up to the previous corpus and a cumulative value of the increased TFIDF in the current corpus, the apparatus comprising:
ascending cumulative singular value calculating means for calculating an ascending cumulative singular value for each morpheme when the language material is arranged in ascending order of the order criterion;
descending cumulative singular value calculating means for calculating a descending cumulative singular value for each morpheme when the language material is arranged in descending order of the order criterion; and
average cumulative singular value calculating means for calculating an average cumulative singular value by averaging the ascending cumulative singular value and the descending cumulative singular value.

The document analysis apparatus according to claim 1, further comprising cumulative singular value graph display means for displaying a cumulative singular value graph in which one of two axes represents one of the ascending cumulative singular value and the descending cumulative singular value and the other axis represents the other.

The document analysis apparatus according to claim 1, further comprising:
ascending cumulative singular value sum calculating means for calculating a sum of the ascending cumulative singular values for morphemes having specified top-ranked cumulative singular values;
descending cumulative singular value sum calculating means for calculating a sum of the descending cumulative singular values for morphemes having specified top-ranked cumulative singular values; and
average cumulative singular value sum calculating means for calculating an average cumulative singular value sum by averaging the ascending cumulative singular value sum and the descending cumulative singular value sum.

The document analysis apparatus according to claim 3, further comprising cumulative singular value sum difference calculation means for calculating a difference between the ascending cumulative singular value sum and the descending cumulative singular value sum.

The document analysis apparatus according to claim 4, further comprising cumulative singular value sum graph display means for displaying a cumulative singular value sum graph in which one of two axes represents the average cumulative singular value sum and the other axis represents the cumulative singular value sum difference.

A document analysis method for a document analysis apparatus that analyzes, based on an increased TFIDF, language material that increases in accordance with an order criterion, and that obtains a singular value for each morpheme by performing residual analysis between an estimated value based on the cumulative value of the increased TFIDF up to the previous corpus and the cumulative value of the increased TFIDF in the current corpus, wherein a computer of the document analysis apparatus executes:
an ascending cumulative singular value calculation step of calculating an ascending cumulative singular value for each morpheme when the language material is arranged in ascending order under the order criterion;
a descending cumulative singular value calculation step of calculating a descending cumulative singular value for each morpheme when the language material is arranged in descending order under the order criterion; and
an average cumulative singular value calculation step of calculating an average cumulative singular value by taking the arithmetic mean of the ascending cumulative singular value and the descending cumulative singular value.

A document analysis program for a document analysis apparatus that analyzes, based on an increased TFIDF, language material that increases in accordance with an order criterion, and that obtains a singular value for each morpheme by performing residual analysis between an estimated value based on the cumulative value of the increased TFIDF up to the previous corpus and the cumulative value of the increased TFIDF in the current corpus, the program causing a computer of the document analysis apparatus to execute:
an ascending cumulative singular value calculation step of calculating an ascending cumulative singular value for each morpheme when the language material is arranged in ascending order under the order criterion;
a descending cumulative singular value calculation step of calculating a descending cumulative singular value for each morpheme when the language material is arranged in descending order under the order criterion; and
an average cumulative singular value calculation step of calculating an average cumulative singular value by taking the arithmetic mean of the ascending cumulative singular value and the descending cumulative singular value.

A document analysis apparatus that analyzes, based on an increased TFIDF, language material that increases in accordance with an order criterion, and that obtains a singular value for each morpheme by performing residual analysis between an estimated value based on the cumulative value of the increased TFIDF up to the previous corpus and the cumulative value of the increased TFIDF in the current corpus, the apparatus comprising:
ascending cumulative singular value calculating means for calculating an ascending cumulative singular value for each morpheme when the language material is arranged in ascending order under the order criterion;
descending cumulative singular value calculating means for calculating a descending cumulative singular value for each morpheme when the language material is arranged in descending order under the order criterion;
ascending cumulative singular value sum calculating means for calculating a sum of the ascending cumulative singular values for morphemes having specified top-ranked cumulative singular values;
descending cumulative singular value sum calculating means for calculating a sum of the descending cumulative singular values for morphemes having specified top-ranked cumulative singular values; and
average cumulative singular value sum calculating means for calculating an average cumulative singular value sum by averaging the ascending cumulative singular value sum and the descending cumulative singular value sum.

A document analysis method for a document analysis apparatus that analyzes, based on an increased TFIDF, language material that increases in accordance with an order criterion, and that obtains a singular value for each morpheme by performing residual analysis between an estimated value based on the cumulative value of the increased TFIDF up to the previous corpus and the cumulative value of the increased TFIDF in the current corpus, wherein a computer of the document analysis apparatus executes:
an ascending cumulative singular value calculation step of calculating an ascending cumulative singular value for each morpheme when the language material is arranged in ascending order under the order criterion;
a descending cumulative singular value calculation step of calculating a descending cumulative singular value for each morpheme when the language material is arranged in descending order under the order criterion;
an ascending cumulative singular value sum calculating step of calculating a sum of the ascending cumulative singular values for morphemes having specified top-ranked cumulative singular values;
a descending cumulative singular value sum calculating step of calculating a sum of the descending cumulative singular values for morphemes having specified top-ranked cumulative singular values; and
an average cumulative singular value sum calculating step of calculating an average cumulative singular value sum by averaging the ascending cumulative singular value sum and the descending cumulative singular value sum.

A document analysis program for a document analysis apparatus that analyzes, based on an increased TFIDF, language material that increases in accordance with an order criterion, and that obtains a singular value for each morpheme by performing residual analysis between an estimated value based on the cumulative value of the increased TFIDF up to the previous corpus and the cumulative value of the increased TFIDF in the current corpus, the program causing a computer of the document analysis apparatus to execute:
an ascending cumulative singular value calculation step of calculating an ascending cumulative singular value for each morpheme when the language material is arranged in ascending order under the order criterion;
a descending cumulative singular value calculation step of calculating a descending cumulative singular value for each morpheme when the language material is arranged in descending order under the order criterion;
an ascending cumulative singular value sum calculating step of calculating a sum of the ascending cumulative singular values for morphemes having specified top-ranked cumulative singular values;
a descending cumulative singular value sum calculating step of calculating a sum of the descending cumulative singular values for morphemes having specified top-ranked cumulative singular values; and
an average cumulative singular value sum calculating step of calculating an average cumulative singular value sum by averaging the ascending cumulative singular value sum and the descending cumulative singular value sum.