JP7324577B2

JP7324577B2 - Text processing method and text processing device

Info

Publication number: JP7324577B2
Application number: JP2018200325A
Authority: JP
Inventors: 裕司皆川; 亮地主; 雅紀木村
Original assignee: Solize
Current assignee: Solize
Priority date: 2018-10-24
Filing date: 2018-10-24
Publication date: 2023-08-10
Anticipated expiration: 2038-10-24
Also published as: JP2020067831A

Description

Article 30, Paragraph 2 of the Patent Act applies. [Exhibition name] 2nd AI/Artificial Intelligence EXPO [Dates] April 4 to 6, 2018 [Venue] Tokyo Big Sight [Posting address] https://content-tokyo2018.tems-system.com/eguide/jp/AI/details?id=868

Article 30, Paragraph 2 of the Patent Act applies. [Seminar name] SOLIZE Innovations Seminar [Date] June 15, 2018 [Venue] TKP Shinagawa Conference Center, 6F, Banquet Hall 6G [Posting address] https://www.solize-group.com/event/2018/index.html

The present invention relates to a computer-based text processing method and device, and in particular to a technique for evaluating, for a topic of interest, the degree of relevance between the topic and related passages in input text data.

Many techniques have been proposed for retrieving a desired document from a large collection of electronic documents. The simplest is to enter a search word and then extract and display the documents that contain it. Internet search works the same way: web data containing the search term is extracted and presented from the enormous amount of data automatically collected by search robots.

Extracting documents that contain a search term from an extremely large collection in a short time is one of the major benefits of computer text processing. However, depending on the number and size of the extracted documents, a method is needed that evaluates not only whether the search term is present but also how relevant each document is.

For example, Patent Document 1 proposes a material retrieval device and the like that searches for books and materials matching a user's interests on the basis of an annotated document image. The disclosed device comprises: text extraction means that applies character recognition to a document image containing annotations and extracts text data; annotation extraction means that extracts the type and position of each annotation; storage means that stores a search index containing first feature words of the materials to be searched and their importance; feature word extraction means that extracts second feature words from the text data; feature data creation means that calculates the importance of the second feature words from the type and position of the annotations and creates feature word data for the text data; and relevance calculation means that calculates the degree of relevance between the search index and the feature word data.

Patent Document 2 discloses a method of extracting representative opinions from a set of texts on a specific subject. Specifically, in a representative comment extraction method for extracting representative comments from a text set containing a plurality of comments, representative comments are extracted by a key word extraction step of extracting, from the text set, key words that characterize the text set; a clustering step of clustering the comments on the basis of the key words; and a representative comment extraction step of extracting a representative comment from each cluster produced by the clustering.

As a related technique, Patent Document 3 proposes a method of automatically learning feature words corresponding to product categories. The disclosed system automatically learns feature words for each product category and comprises: means for obtaining up to 1,000 Web texts from a search engine using the category name as the query; means for extracting feature word candidates from those Web texts; means for calculating the degree of relevance between each extracted candidate and the category; and a feature word database that stores, for each category, the candidates whose degree of relevance exceeds a predetermined threshold, associated as feature words of that category.

Patent Document 1: JP 2015-179385 A
Patent Document 2: JP 2013-15971 A
Patent Document 3: JP 2010-9307 A

The conventional techniques above can display feature words and key words related to a user's interests or to representative opinions in text data. However, because every occurrence of such words is extracted, it is not possible to tell which parts of a document are particularly important.

The present invention was created in view of these problems of the prior art. Its object is to provide a technique for evaluating, for a topic of interest, the degree of relevance between the topic and related passages in text data.

To solve the above problems, the present invention provides the following text processing method and device.
First, according to a first embodiment, there is provided a computer-based text processing method that, for a topic (an item into which subjects of interest are classified), evaluates the degree of relevance between the topic and related passages in input first text data. The method is characterized by performing a learning process comprising:
(S1) a learning data input step in which input means inputs, as learning data, a plurality of second text data groups, a plurality of predefined topics, and related character string information that has been associated in advance and describes the topics within the second text data groups;
(S2) a feature word extraction step in which feature word extraction means extracts feature words from the related character string information; and
(S3) an aspect data creation step in which aspect data creation means creates aspect data, which is information associating the extracted feature words with each topic, and stores it in storage means;
followed by a relevance evaluation process comprising:
(S4) a text data input step in which the input means inputs the first text data;
(S5) a feature word search step in which feature word search means refers to the aspect data and, for at least one topic, searches for feature words contained in the first text data; and
(S6) an output step in which output means outputs, separately for each topic, a value based on the feature word search results as the degree of relevance.
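The two processes above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the claimed implementation: the function names `build_aspect_data` and `evaluate` are hypothetical, whitespace splitting stands in for real feature word extraction, and the degree of relevance is taken to be a plain count of matching words.

```python
from collections import Counter

def build_aspect_data(training):
    """Learning process (S1-S3): map each topic to the words of its labelled passages.

    `training` is a list of (topic, related_passage) pairs. Splitting on
    whitespace is an assumed stand-in for feature word extraction.
    """
    aspect = {}
    for topic, passage in training:
        aspect.setdefault(topic, Counter()).update(passage.lower().split())
    return aspect

def evaluate(aspect, text):
    """Relevance evaluation (S4-S6): per topic, count occurrences of its feature words."""
    tokens = text.lower().split()
    return {topic: sum(1 for t in tokens if t in words)
            for topic, words in aspect.items()}

aspect = build_aspect_data([
    ("US", "new york white house"),
    ("Europe", "brussels paris berlin"),
])
scores = evaluate(aspect, "Talks moved from Paris to New York")
```

Because the aspect data is an ordinary dictionary, it can be stored after the learning process and reused later by `evaluate` alone.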

According to a second embodiment, in the feature word extraction step of the above text processing method, when the feature word extraction means extracts feature words from the related character string information, it sets a weight value by a predetermined formula based on weight information defined for each feature word. The aspect data stores the weight value of each feature word extracted for each topic, and in the feature word search step the search conditions are determined based on the weight values.

According to a third embodiment, the weight value defined for a feature word may be a value related to at least one of the occurrence frequency and the co-occurrence frequency calculated when the feature word extraction means extracts the feature word.

According to a fourth embodiment, in the output step, a value calculated from the number of feature words found for each topic, or from the sum of their weight values, may be output as the degree of relevance.

According to a fifth embodiment, the aspect data stores, for each topic, the extracted feature words together with a distance value relating to a distance before and/or after the position where the feature word appears. In the feature word search step, search conditions for feature words of the same or different types contained in the first text data can be determined based at least on the distance value.

According to a sixth embodiment, in the feature word search step, the feature word search means searches for feature words based on the aspect data and reads the distance value of each feature word found. A value calculated by a predetermined operation from the weight values of the feature words lying within the range of the distance value can then be output as the degree of relevance.

According to a seventh embodiment, in the aspect data creation step, the aspect data creation means defines hierarchy information for each feature word extracted for each topic, and a pattern of weight value and distance value combinations is defined in the hierarchy information. In the feature word search step, the feature word search conditions can be determined according to the pattern.

According to an eighth embodiment, in a configuration where topics are classified into concept information of two or more levels, from superordinate to subordinate concepts, so that readers can easily understand the content of the text data, the aspect data creation means in the aspect data creation step can automatically define the hierarchy information according to the concept information and associate each feature word of each topic with the concept information.

According to a ninth embodiment, in the output step, the output means may display the extracted feature words together with their degree of relevance in a predetermined graph.

According to a tenth embodiment, in a configuration where the output means displays the first text data in the output step, the predetermined graph may be displayed aligned with the line positions of the extracted feature words.

According to an eleventh embodiment, in a configuration where the output means displays the first text data in the output step, the display style can be changed for the whole sentence containing a feature word, for a predetermined range of text near the feature word, or for the text within the range of the feature word's distance value.

According to a twelfth embodiment, in the output step, the output means may, for each feature word belonging to the concept information, change the display style of the whole sentence containing that feature word, of a predetermined range of text near it, or of the text within the range of its distance value, so that sentences containing each concept, from superordinate to subordinate, are displayed distinguishably.

According to a thirteenth embodiment, a text processing device can also be provided.
That is, the device is a computer-based text processing device that, for a topic (an item into which subjects of interest are classified), evaluates the degree of relevance between the topic and related passages in input first text data. It is characterized by comprising: learning data input means that inputs, as learning data, a plurality of second text data groups, a plurality of predefined topics, and related character string information that has been associated in advance and describes the topics within the second text data groups; feature word extraction means that extracts feature words from the related character string information; aspect data creation means that creates aspect data, which is information associating the extracted feature words with each topic, and stores it in storage means; text data input means that inputs the first text data; feature word search means that refers to the aspect data and, for at least one topic, searches for feature words contained in the first text data; and output means that outputs, separately for each topic, a value based on the feature word search results as the degree of relevance.

According to the present invention, it is possible to provide a technique for evaluating, for a topic of interest, the degree of relevance between the topic and related passages in text data.

FIG. 1 is an overall view of the text processing device (1) of the present invention.
FIG. 2 is a flowchart of the text processing method according to the present invention.
FIG. 3 is an explanatory diagram of the method of calculating the degree of relevance according to the present invention.
FIG. 4 is an explanatory diagram of the method of calculating the degree of relevance according to the present invention.
FIG. 5 is a first screen display example of the present invention.
FIG. 6 is a second screen display example of the present invention.
FIG. 7 is a third screen display example of the present invention.
FIG. 8 is a fourth screen display example of the present invention.

Embodiments of the present invention are described below with reference to the drawings. The present invention is not limited to the following examples and can be carried out as appropriate within the scope of the claims.
FIG. 1 is an overall view of the text processing device (1) of the present invention. The device (1) can be implemented on a known personal computer, or on a server device such as a web server. The details of such equipment are well known and are omitted here.
FIG. 2 is a flowchart of the text processing method according to the present invention.

The text processing method of the present invention is broadly divided into two processes: a learning process, in which machine learning is performed in advance, and a relevance evaluation process, in which the aspect data created by the learning process is used to evaluate the degree of relevance of related passages in input text. In this embodiment the two processes are described as running consecutively, but once aspect data has been created and stored, the relevance evaluation process alone can be implemented and carried out.

In the learning process, the input processing unit (101) in the CPU (10), which serves as the input means, first inputs the learning data (201) stored in a storage unit (20) such as a hard disk. (Learning data input step: S1)
The learning data (201) consists of a combination of a plurality of second text data groups, a plurality of predefined topics, and related character string information that has been associated in advance and describes the topics within the second text data groups.

A topic is an item into which subjects of interest are classified. For a technical document, for example, topics can correspond to headings organized in stages from major to minor items, or they can be general key points of the document extracted by hand. For information from news sites, newspapers, magazines, and the like, topics may be classified by field, such as "International", "Economy", and "Society", and may be further divided from a superordinate concept such as "International" into subordinate concepts such as "United States", "Europe", and "China". Alternatively, topics may be classified by theme, such as "inequality" and "housing".

As a further feature of the present invention, topics can be classified into concept information of two or more levels, from superordinate to subordinate concepts, so that readers can easily understand the content of the text data, and this concept information can be reflected in the search conditions used in the feature word search described later.

The second text data that is input is text data used for learning. It relates to documents that contain at least the above topics, but text data entirely unrelated to the topics may be input at the same time.

The related character string information is a predefined set of character strings in the second text data groups that describe a topic. In principle it is defined by a person judging the relevance.
For example, suppose a large volume of past news text is input as the second text data group and the topic "International" has the subtopics "United States", "Europe", and "China". The related character string information is then the article portions about "United States" extracted by hand. In this case the related character string information is the whole of each such article.

Using the input learning data (201), the feature word extraction unit (102) extracts feature words from the related character string information. (Feature word extraction step: S2)
A feature word is a word or phrase that marks the topic as being discussed in the text data. If the topic is "United States" as above, candidates include "United States", "New York", and "President Trump", that is, words that distinguish the article from articles on other topics.

Methods for extracting feature words are well known in the field of language processing; one frequently used technique is TF-IDF. TF is the frequency with which a word appears, and IDF is the inverse document frequency. Starting from the premise that frequently occurring words are important, the inverse document frequency adds a measure of how many of all the articles a word appears in. IDF is defined as the logarithm of the total number of articles divided by the number of articles in which the word appears, plus 1. It is therefore large when few articles contain the word and small when every article contains it. The weight value of the word is this IDF multiplied by TF.
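The weighting just described can be written as a short Python sketch. It assumes that TF is the raw count of a term in one article and that the logarithm is natural; the text fixes neither choice, and the function name `tf_idf` is illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists. IDF follows the description above:
    log(total articles / articles containing the term) + 1.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)    # assumed: TF is the raw in-document count
        weights.append({t: tf[t] * (math.log(n / df[t]) + 1.0) for t in tf})
    return weights

docs = [["trump", "new", "york", "trump"], ["paris", "new"], ["berlin", "new"]]
w = tf_idf(docs)
```

Here "new" appears in every article, so its IDF is the minimum value of 1, while "trump" appears twice in a single article and receives the largest weight in the first document.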

Various other methods for extracting feature words are known, such as SVM (Tsutomu Hirao, Hideki Isozaki, Eisaku Maeda, Yuji Matsumoto: Extracting Important Sentences Using Support Vector Machine, Transactions of Information Processing Society of Japan, Vol. 44, No. 8, pp. 2230-2243 (2003)) and KeyGraph (Internet URL: http://iit.kke.co.jp/keygraph/, retrieved September 10, 2018). Any known method can be used as appropriate in the present invention.
It is also well known to perform morphological analysis on text data and split it into morphemes in order to extract feature words.

The feature word extraction step (S2) associates the related character string information with feature words in the second text data group, so the topic corresponding to the related character string information is also associated with those feature words. In the present invention this associated information is called aspect data (202), and the aspect data creation unit (103) stores it in the storage unit (20). (Aspect data creation step: S3)

The simplest form of aspect data (202) defines feature words for each topic. Unlike the prior art, the present invention is characterized by associating aspect data with a variety of topics. As a minimal configuration, therefore, the aspect data may define feature words only.
In this embodiment, the aspect data stores a weight value for each feature word in addition to the feature words for each topic.

The weight value can be a value related to at least one of the occurrence frequency and the co-occurrence frequency calculated when the feature word extraction means extracts the feature word, such as the TF-IDF weight described above.
Alternatively, the weight value can be set by a predetermined formula based on weight information defined for the feature word, and the aspect data can store that weight value for each feature word extracted for each topic.

In addition, the aspect data can store, for each topic, the extracted feature words together with a distance value (called a margin) relating to a distance before and/or after the position where the feature word appears. The margin is the distance within which the feature word and another feature word are judged to appear together, and it is an important parameter in calculating the degree of relevance in the present invention. Table 1 shows an example of aspect data that includes distance values.

Table 1 should be read as follows for the text whose relevance is evaluated and the feature words of the topic "United States". The word "America" has a small weight value of 1, but the margin used to judge whether it co-occurs with other feature words, namely "United States", "New York", and "President Trump", is set wide at 2,500 characters before and 2,500 characters after. In contrast, when "President Trump" appears, the text is considered more likely to relate to the United States, so its weight value is 10. In this case the margin is set narrow, at 250 characters before and after, to pin down the range more precisely.

The distance value may be defined mechanically from the weight value, for example 2,500 when the weight value is 1 and 1,000 when it is 5, or it may be defined by hand with the individual feature words in mind.
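One way to hold aspect data with weights and margins is sketched below in Python. The class `FeatureWord`, the helper `make_feature_word`, and the lookup table are hypothetical names for illustration; only the weight and margin values quoted in the text (weight 1 with margin 2,500, weight 5 with margin 1,000, and "President Trump" with weight 10 and margin 250) come from the source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureWord:
    word: str
    weight: int
    margin_before: int  # characters before the word
    margin_after: int   # characters after the word

# Assumed lookup table; only these two weight-to-margin pairs appear in the text.
MARGIN_BY_WEIGHT = {1: 2500, 5: 1000}

def make_feature_word(word, weight, margin=None):
    """Use a hand-set margin if given, otherwise derive it from the weight."""
    if margin is None:
        margin = MARGIN_BY_WEIGHT[weight]
    return FeatureWord(word, weight, margin, margin)

aspect_data = {
    "United States": [
        make_feature_word("America", 1),                # margin derived: 2500
        make_feature_word("President Trump", 10, 250),  # margin set by hand
    ]
}
```

Keeping the before and after margins as separate fields leaves room for asymmetric margins, which the description allows ("before and/or after").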

The above processing completes the learning process and produces the aspect data (202).
In the relevance evaluation process, the first text data (203) to be evaluated is input through the input processing unit (101) and recorded in the storage unit (20). (Text data input step: S4)

Next, the feature word search unit (104) refers to the aspect data (202) and, for at least one topic, searches for feature words contained in the first text data (203). (Feature word search step: S5)
Specifically, the feature words stored in the aspect data (202) are searched for each topic, and the degree of relevance is evaluated according to, for example, the number of feature words found. Since it suffices to extract the places in the text data where feature words occur frequently as highly relevant places, a simple approach is to output, as related passages, the portions where feature words appear repeatedly within a range narrower than a certain threshold.
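The simple approach in the preceding paragraph can be sketched as follows. This is one possible reading, assuming that "repeatedly within a narrow range" means consecutive hits no more than `max_gap` characters apart; the function names and the `min_hits` parameter are illustrative and not taken from the source.

```python
def find_positions(text, words):
    """Return the sorted start offsets of every occurrence of any word."""
    hits = []
    for w in words:
        start = text.find(w)
        while start != -1:
            hits.append(start)
            start = text.find(w, start + 1)
    return sorted(hits)

def dense_regions(positions, max_gap, min_hits=2):
    """Group hits whose neighbours are at most max_gap apart.

    Returns (first_hit, last_hit) for each group with at least min_hits hits.
    """
    regions, group = [], []
    for p in positions:
        if group and p - group[-1] > max_gap:
            if len(group) >= min_hits:
                regions.append((group[0], group[-1]))
            group = []
        group.append(p)
    if len(group) >= min_hits:
        regions.append((group[0], group[-1]))
    return regions

text = "New York ... America ... " + "x" * 200 + " America"
positions = find_positions(text, ["America", "New York"])
regions = dense_regions(positions, max_gap=50)
```

The isolated third hit, more than 200 characters after the others, is not reported because it does not repeat within the threshold.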

This embodiment performs more elaborate processing. Because a weight value is set for each feature word of each topic, a value calculated from those weight values is used as the degree of relevance. FIG. 3 is an explanatory diagram of the method of calculating the degree of relevance according to the present invention.
In the figure, the horizontal axis runs from the beginning to the end of the first text data (203), with the first character position on the left and the last on the right. For each feature word, the vertical axis indicates the weight value.

For example, the feature word "America" in the first row occurs at the 3,000th character, and with a margin of 2,500 characters before and after, a graph of weight 1 is drawn over the range from character 500 to character 5,500. In the same way, graphs are drawn over characters 5,500 to 10,500 for the occurrence at the 8,000th character, and over characters 11,500 to 16,500 for the occurrence at the 14,000th character.
In this embodiment, search results based on the same rule are combined by an OR operation. The feature word "United States" in the second row appears at the 4,500th and 8,000th characters. As illustrated, even where the ranges found for a feature word overlap, the weights are not summed; taking the OR instead gives a graph of weight 1 from character 2,000 to character 10,500.
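The margin ranges and the OR operation for a single feature word can be reproduced with a short Python sketch. The function names are illustrative, and treating ranges that merely touch (such as 500 to 5,500 and 5,500 to 10,500) as one merged range is an assumption. The positions and margins are the ones given for FIG. 3.

```python
def spans(positions, margin):
    """Turn hit positions into (start, end) ranges of +/- margin characters."""
    return [(max(0, p - margin), p + margin) for p in positions]

def union(ranges):
    """OR of ranges from the same rule: merge overlapping or touching ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

america = union(spans([3000, 8000, 14000], 2500))
united_states = union(spans([4500, 8000], 2500))
```

Because the OR keeps only coverage, the weight inside a merged range stays at the feature word's own weight instead of being summed. The sketch does not clip ranges to the length of the text.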

The feature word "New York" appears at the 4,000th and 14,000th characters and is drawn with a width of 2,000 characters before and after and a weight of 3. "President Trump" appears at the 4,750th character and is drawn with a width of 500 characters before and after and a weight of 5.
Performing an AND operation on these weight values then yields a graph like the degree of relevance shown in the bottom row. This graph shows the related passages in the text data visually: location a has no relation to the topic "United States", while locations b and c are highly related.

To calculate the degree of relevance, the present invention introduces the concept of a margin alongside the weight value and performs operations using the margin, thereby providing an entirely new method of calculating the degree of relevance. As described above, it is preferable to apply an OR operation within the same rule (the same feature word) and an AND operation between different rules. However, an AND operation may also be applied within the same feature word, and any other calculated value may be used.
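The text does not spell out how weights are combined under the AND operation, so the following Python sketch covers only the part that is unambiguous: intersecting the coverage of two different rules. The function name `intersect` is illustrative, the "America" ranges are the OR-merged ranges from the FIG. 3 description, and the "New York" ranges assume that the stated width of 2,000 characters applies on each side.

```python
def intersect(a, b):
    """AND of two sorted, non-overlapping range lists: keep only the overlap."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever range finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

america = [(500, 10500), (11500, 16500)]   # OR-merged ranges for "America"
new_york = [(2000, 6000), (12000, 16000)]  # assumed: 2000 characters each side
both = intersect(america, new_york)
```

A score for each resulting range (for example the sum or the maximum of the participating weights) would be a further design choice that the source leaves open.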

The example of the aspect data (202) defines a weight value and a distance value for feature words each consisting of a single word or morpheme, but a feature word may also be a sequence of several words. For example, the word sequence "New York City, United States of America" may be used as a feature word.

A rule may also be defined as the case where two words co-occur within a predetermined range. For example, rule 1, under which word A co-occurs with word B or word C within 50 characters, is written A(BC)<50. "Within 50 characters" means, for example, that if A is "fish" and B is "bird", there are at most 50 characters between the "h" of "fish" and the "b" of "bird". If the feature degree of rule 1 is 1, a graph with a weight of 1 is drawn from the "f" of "fish" to the "d" of "bird", as shown in FIG. 4(a).
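A co-occurrence rule of this kind can be sketched as follows. The function name, its parameters, and the sample sentence are hypothetical. The sketch assumes that the two words may appear in either order and that the limit is inclusive; the source does not specify either point.

```python
import re


def cooccurrence_spans(text, a, others, max_gap):
    """Spans running from the first character of the earlier word to the last
    character of the later word, for each pair (a, other) whose gap is within
    max_gap characters. Either word order is accepted (an assumption)."""
    spans = []
    a_hits = [(m.start(), m.end()) for m in re.finditer(re.escape(a), text)]
    for other in others:
        for m in re.finditer(re.escape(other), text):
            o_start, o_end = m.start(), m.end()
            for a_start, a_end in a_hits:
                # Number of characters strictly between the two words.
                gap = o_start - a_end if a_end <= o_start else a_start - o_end
                if 0 <= gap <= max_gap:
                    spans.append((min(a_start, o_start), max(a_end, o_end)))
    return sorted(spans)


# Rule 1, A(BC)<50, with A = "fish", B = "bird", C = "cat"
text = "a fish swam below while a bird flew above"
spans = cooccurrence_spans(text, "fish", ["bird", "cat"], 50)
```

Each returned span is the range over which the rule's weight would be drawn, here from the "f" of "fish" to the "d" of "bird".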

Unlike the embodiment above, FIG. 4(b) of this embodiment shows an example in which the weight value varies within the range of the front and rear margins. The weight value thus need not be constant; it may be varied so that it decreases with distance from the feature word.
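One way to vary the weight within the margin is a linear falloff. The linear shape is an assumption made only for illustration; the text says no more than that the weight may decrease with distance from the feature word. The function name is hypothetical.

```python
def decayed_weight(position, occurrence, weight, margin):
    """Weight at `position` for a feature word found at `occurrence`.
    Full weight at the word itself, falling linearly to zero at the margin edge
    (linear falloff is an assumed shape, not taken from the source)."""
    distance = abs(position - occurrence)
    if distance > margin:
        return 0.0
    return weight * (1 - distance / margin)


# "President Trump" at character 4750, weight 5, margin 500
at_word = decayed_weight(4750, 4750, 5, 500)
halfway = decayed_weight(5000, 4750, 5, 500)
outside = decayed_weight(6000, 4750, 5, 500)
```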

As shown in FIG. 4(c), the graph for the co-occurrence of A and B and the graph for the co-occurrence of A and C are both defined by rule 1, so they belong to the same rule and are combined by OR.
On the other hand, as shown in FIG. 4(d), when rule 2 defines X(YZ)<100 and there is a graph for the co-occurrence of X and Y, the graph for the co-occurrence of A and B and the graph for the co-occurrence of X and Y are combined by AND.
As described above, the feature words of the present invention can include sequences of several words and combinations of words that co-occur within a predetermined range.

The display processing unit (105) displays on the monitor (30), separately for each topic, a value based on the feature word search results as the degree of relevance. The result is preferably displayed as a graph, as in FIG. 3.
FIG. 5 shows a first screen display example. A topic column (40) is arranged on the left side of the screen, and the user selects the topic to display using the keyboard (31), a mouse, or the like. In the figure, the major item "Production Strategy and Base Strategy" and the minor item "Domestic Return Trend" are selected, and the score column (41) to the right shows the calculated weight values as a line graph. The top of the score column (41) corresponds to the beginning of the text and the bottom to its end.

In the score column (41), positions where the score on the horizontal axis is high indicate highly relevant parts, so the user can easily see where in the whole text the descriptions related to the topic are located. Selecting a point on the line graph in the score column with a mouse or the like may open the corresponding passage for viewing.

Locations that exceed the threshold in the score column (41) are highlighted with a marker in the document list display (42) in the column to its right. By selecting a topic from the topic column (40), the user can see at a glance where, and how much, the document says about that topic.
The range highlighted with a marker may be the entire sentence containing a word or phrase whose calculated weight value exceeds a predetermined threshold, or a predetermined range of text near that word or phrase.
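Selecting whole sentences for marker display can be sketched as below. The function name, the per-character `scores` list, and the simple punctuation-based sentence splitter are all assumptions made for illustration; the source does not describe how sentences are delimited.

```python
import re


def sentences_to_mark(text, scores, threshold):
    """Return the sentences that contain at least one character whose score
    exceeds the threshold. `scores` holds one relevance value per character."""
    marked = []
    # Naive splitter: a run of non-terminators followed by an optional terminator.
    for m in re.finditer(r"[^.!?。]+[.!?。]?", text):
        if any(s > threshold for s in scores[m.start():m.end()]):
            marked.append(m.group().strip())
    return marked


text = "Sales rose. The plant returned to Japan. Weather was mild."
scores = [0] * len(text)
start = text.index("returned")
scores[start:start + len("returned")] = [7] * len("returned")

marked = sentences_to_mark(text, scores, threshold=5)
```

Here only the middle sentence is selected, because it is the only one containing characters scored above the threshold.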

FIG. 6 shows a second screen display example. In this embodiment, selecting a topic from the topic column (40) on the left displays the relevant parts of several documents in the relevant part display column (43) on the right. For example, if several years' editions of a document issued annually, such as a white paper, are specified and a topic is selected, the parts related to that one topic are extracted from each year's white paper. Arranging only the parts whose calculated weight value exceeds a predetermined value, as in the figure, makes it easy to compare and contrast what several documents say about the same topic.

FIG. 7 shows an example in which a web browser displays an article from a website with a bar graph indicating the degree of relevance beside it. The bars are aligned with the lines of the article, so that the user can easily see where in the article information related to a specified topic, for example one entered in a search box, is described.
This display method presents the result in a visually clear way using only a small area at the side of the web browser, and is well suited to, for example, the screen of a search service.
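Aligning a bar graph with article lines requires one value per line. The sketch below derives it from per-character scores by taking the maximum on each line; using the maximum, the function name, and the `scores` layout are assumptions made for illustration and are not stated in the source.

```python
def line_scores(text, scores):
    """One bar value per line of `text`: the highest per-character score found
    on that line (empty lines get 0). `scores` has one entry per character,
    including newline characters."""
    result, offset = [], 0
    for line in text.split("\n"):
        segment = scores[offset:offset + len(line)]
        result.append(max(segment, default=0))
        offset += len(line) + 1  # step over the newline
    return result


article = "ab\ncd\n\nef"
bars = line_scores(article, [0, 1, 0, 0, 3, 0, 0, 2, 0])
```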

Display examples according to the present invention have been shown above, but the relevance output step (S6) does not necessarily have to display the result; it may simply output, for a given topic, the degree of relevance of the relevant parts in the text data (203). The degree of relevance may be a value calculated from the weight values as described above, or merely information indicating whether a relation exists.
As for the form of output, the relevance information may be temporarily stored in a memory (not shown) or the like, or transmitted to another computer over a network via the communication unit (32).

In another embodiment of the present invention, in the aspect data creation step (S3), the aspect data creation unit (103) can define hierarchical information for each of the feature words extracted for each topic.
For example, when the topic is "President Trump" as in Table 2, the feature words are classified into conceptual information running from a superordinate concept to a subordinate concept, "country" → "region" → topic, where the country is level 1, the region is level 2, and the topic is level 3.

Classifying the feature words by conceptual information that is easy for a reader to understand lets the reader grasp a newspaper article in stages: from the broad parts that mention "America", to the parts narrowed to a specific region, and finally to the parts on the desired topic.

With such a classification, when a desired topic (50) is selected as shown in FIG. 8, the parts matching level 1 (country name) can be shown with a light marker (51), the parts matching level 2 (region) with a medium-density marker (52), and the parts matching the selected topic (50) with a dark marker (53).
As the marker range, for each feature word belonging to the conceptual information, the display mode can be changed for the entire sentence containing the feature word, for a predetermined range of text near the feature word, or for the margin range of the feature word. Displaying the text so that sentences containing each concept, from the superordinate to the subordinate, can be told apart makes it easier for the reader to grasp visually how strongly each part of the article is related.

The hierarchical information can further be used to define the weight values and distance values in the aspect data creation step (S3).
That is, initial values of the weight and the margin are set for each level. In the example of Table 2, level 1 has a weight of 1 and front and rear margins of 2500 characters, level 2 has a weight of 5 and front and rear margins of 1000 characters, and level 3 has a weight of 10 and front and rear margins of 250 characters.

With these values set in advance, once a feature word has been extracted, classifying it into conceptual information that the reader can easily understand also fixes its level, so the weight value and margin are set appropriately. In other words, patterns of combinations of weight value and distance value can be defined in the hierarchical information, and in the feature word search step (S5) the search condition for each feature word can be determined according to its pattern.
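The level-to-pattern lookup can be sketched as a small table. The numeric values are those given for the Table 2 example; the names `Pattern`, `LEVEL_PATTERNS`, and `search_condition`, and the dictionary shape of the search condition, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    weight: int
    margin: int  # characters before and after the feature word


# Initial values per level, as given for the Table 2 example.
LEVEL_PATTERNS = {
    1: Pattern(weight=1, margin=2500),   # country
    2: Pattern(weight=5, margin=1000),   # region
    3: Pattern(weight=10, margin=250),   # topic
}


def search_condition(feature_word, level):
    """Build the search condition for a feature word from its level's pattern."""
    pattern = LEVEL_PATTERNS[level]
    return {"word": feature_word, "weight": pattern.weight, "margin": pattern.margin}


conditions = [
    search_condition("America", 1),
    search_condition("New York", 2),
    search_condition("President Trump", 3),
]
```

Keeping the pattern per level rather than per word means a newly extracted feature word needs only a level assignment to receive a weight and margin.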

The conceptual information can also be assigned automatically, from superordinate to subordinate concepts, by consulting databases such as dictionary data or a thesaurus for countries, regions, and topics. Checking the extracted feature words against these databases shows, for example, that "America", "New York", and "President Trump" are a country, a region, and a proper noun, so they are classified as superordinate, intermediate, and subordinate concepts, and the corresponding hierarchical information, and in turn the weight values and distance values, can be defined.

As described above, the present invention can define, as patterns, the conceptual information that is easy for the reader to understand together with the set of values the computer uses to calculate the degree of relevance. This contributes to more accurate relevance calculation while giving the user results that feel natural.

1 text processing device
10 CPU
101 input processing unit
102 feature word extraction unit
103 aspect data creation unit
104 feature word search unit
105 display processing unit
20 storage unit
201 learning data
202 aspect data
203 text data
30 monitor
31 keyboard
32 communication unit

Claims

1. A text processing method performed by a computer for evaluating, from input first text data and for a topic that is an item obtained by classifying subject matter of interest, the degree of relevance between the topic and a related part described in the text data, the method comprising:
a learning step having
a learning data input step in which an input means inputs, as learning data, a plurality of second text data groups, a plurality of predefined topics, and related character string information that is described about the topics in the second text data groups and associated with them in advance,
a feature word extraction step in which a feature word extraction means extracts feature words from the related character string information, and
an aspect data creation step in which an aspect data creation means creates aspect data, which is information associating the feature words extracted for each topic, and stores the aspect data in a storage means; and
a relevance evaluation step, performed after the learning step, having
a text data input step in which the input means inputs the first text data,
a feature word search step in which a feature word search means searches, by referring to the aspect data, for feature words contained in the first text data for at least one topic, and
an output step in which an output means outputs, separately for each topic, a value based on the search result of the feature words as the degree of relevance,
wherein the aspect data stores the feature words extracted for each topic, a weight value for each of those feature words, and a distance value relating to at least one of the distances before and after the position where the feature word appears,
in the feature word extraction step, when the feature word extraction means extracts a feature word from the related character string information, the weight value is set by a predetermined arithmetic expression based on weight information defined for the feature word,
in the feature word search step, a search condition is determined based on the weight value, and a search condition for feature words of the same type or of different types included in the first text data is determined based at least on the distance value, and
the feature word search means searches for feature words based on the aspect data, reads the distance value of each extracted feature word, and outputs, as the degree of relevance, a value calculated by a predetermined operation from the weight values of the feature words within the range of the distance value.

2. The text processing method according to claim 1, wherein the weight value defined for the feature word is a value relating to at least one of an appearance frequency and a co-occurrence frequency calculated when the feature word extraction means extracts the feature word.

3. The text processing method according to claim 1 or 2, wherein, in the output step, a calculated value relating to the number of feature words found as search results for each topic, or to the sum of their weight values, is output as the degree of relevance.

4. The text processing method according to any one of claims 1 to 3, wherein, in the aspect data creation step, the aspect data creation means defines hierarchical information for each of the feature words extracted for each topic,
the hierarchical information defines patterns of combinations of the weight values and the distance values, and
in the feature word search step, search conditions for the feature words are determined according to the patterns.

5. The text processing method according to claim 4, wherein the topics are classified into two or more levels of conceptual information, from a superordinate concept to a subordinate concept, so that a reader can easily understand the content of the text data, and
in the aspect data creation step, the aspect data creation means automatically defines the hierarchical information according to the conceptual information and associates each feature word of each topic with the conceptual information.

6. The text processing method according to any one of claims 1 to 5, wherein, in the output step, the output means displays the degree of relevance in a predetermined graph together with the extracted feature words.

7. The text processing method according to any one of claims 1 to 6, wherein, in the output step, in a configuration that displays the first text data, the output means displays a predetermined graph aligned with the line position of each extracted feature word.

8. The text processing method according to any one of claims 1 to 7, wherein, in the output step, in a configuration that displays the first text data, the output means changes the display mode of the entire sentence containing a feature word, of a predetermined range of text near the feature word, or of the range of the distance value of the feature word.

9. The text processing method according to claim 8, having the configuration of claim 5, wherein, in the output step, the output means changes, for each feature word belonging to the conceptual information, the display mode of the entire sentence containing the feature word, of a predetermined range of text near the feature word, or of the range of the distance value of the feature word, so that sentences containing the respective concepts, from the superordinate concept to the subordinate concept, are displayed distinguishably.

10. A computer-based text processing device for evaluating, from input first text data and for a topic that is an item obtained by classifying subject matter of interest, the degree of relevance between the topic and a related part described in the text data, the device comprising:
learning data input means for inputting, as learning data, a plurality of second text data groups, a plurality of predefined topics, and related character string information that is described about the topics in the second text data groups and associated with them in advance;
feature word extraction means for extracting feature words from the related character string information;
aspect data creation means for creating aspect data, which is information associating the feature words extracted for each topic, and storing the aspect data in a storage means;
text data input means for inputting the first text data;
feature word search means for searching, by referring to the aspect data, for feature words included in the first text data for at least one topic; and
output means for outputting, separately for each topic, a value based on the search result of the feature words as the degree of relevance,
wherein the aspect data stores the feature words extracted for each topic, a weight value for each of those feature words, and a distance value relating to at least one of the distances before and after the position where the feature word appears,
when the feature word extraction means extracts a feature word from the related character string information, the weight value is set by a predetermined arithmetic expression based on weight information defined for the feature word,
the feature word search means determines a search condition based on the weight value, and
the feature word search means determines, based at least on the distance value, a search condition for feature words of the same type or of different types included in the first text data, searches for feature words based on the aspect data, reads the distance value of each extracted feature word, and outputs, as the degree of relevance, a value calculated by a predetermined operation from the weight values of the feature words within the range of the distance value.