JP4423004B2

JP4423004B2 - Text mining device, text mining method, and text mining program

Info

Publication number: JP4423004B2
Application number: JP2003345961A
Authority: JP
Inventors: 勇之相川; 泰博高山; 明人永井; 誠今村
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2003-10-03
Filing date: 2003-10-03
Publication date: 2010-03-03
Anticipated expiration: 2023-10-03
Also published as: JP2005115468A

Description

The present invention relates to a text mining apparatus, a text mining method, and a text mining program that extract important information needed for work such as product planning and quality control from a large volume of accumulated text, so that the information can be used for business improvement.

As documents are increasingly digitized, text mining apparatuses that extract important information needed for product planning, quality control, and similar work from accumulated documents are becoming more important. One such apparatus, which can retrieve documents whose content is similar to an input document even when the wording differs, is described in Document 1 (Katsuya Mimuro, "Text Mining to Keep the Voice of the Customer from Lying Dormant," Intellectual Asset Creation (知的資産創造), September 2002 issue). Related material appears in Document 2 ("Announcing TRUE TELLER Ver.2.0, a text mining tool for analyzing the 'voice of the customer'," NRI Nomura Research Institute news release, February 18, 2002) and Document 3 ("Announcing TRUE TELLER Ver.3.0, a text mining tool for analyzing the 'voice of the customer'," NRI Nomura Research Institute news release, December 19, 2002). The text mining scheme disclosed in Documents 1, 2, and 3 is described below with reference to FIG. 18.

In FIG. 18, word dividing means 1801 analyzes the text contained in an analysis target document 1821 and divides it into words. Word unifying means 1802 refers to a synonym dictionary 1803 and normalizes spelling variations in the word division result produced by the word dividing means 1801. Syntax analysis means 1804 extracts dependency relations between words from the word division result and stores them in an analysis database 1806. Information on which word appeared in which document is stored in the analysis database 1806 at the same time. A dependency synonym dictionary 1805 is also used to resolve synonymous dependency relations before they are stored in the analysis database 1806.

The syntax analysis means 1804 further refers to score definition information 1808, which defines the degree of dissatisfaction or the degree of demand associated with each word, and stores the score calculated for each document in the analysis database 1806 as well (the "scoring function" of Document 2). Analysis means 1809 refers to the analysis database 1806 and a customer attribute database 1807 and generates an analysis result 1823 for an analysis input 1822. With such a text mining apparatus, analysis support such as that shown in FIG. 19 can be provided, for example, for free-text responses to a questionnaire about cosmetics (Document 1).

Document 1: Katsuya Mimuro, "Text Mining to Keep the Voice of the Customer from Lying Dormant," Intellectual Asset Creation (知的資産創造), September 2002 issue, pp. 44-53
Document 2: "Announcing TRUE TELLER Ver.2.0, a text mining tool for analyzing the 'voice of the customer'," news release, NRI Nomura Research Institute, February 18, 2002
Document 3: "Announcing TRUE TELLER Ver.3.0, a text mining tool for analyzing the 'voice of the customer'," news release, NRI Nomura Research Institute, December 19, 2002

However, the techniques disclosed in Documents 1, 2, and 3 have the following problems.

First, although a score can be defined for each word for each kind of information to be extracted, such as favorable opinions, unfavorable opinions, and requests, this alone cannot extract sufficient information. Consider, for example, the word "high" (高い) in free-text responses to a product questionnaire: "the price is high" is a negative opinion, whereas "the reliability is high" is a favorable one. Defining scores for individual words is therefore sometimes insufficient to distinguish favorable opinions from negative ones.
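The limitation described above can be illustrated with a minimal sketch. The score tables, the English stand-in words, and the function names below are all hypothetical and are not taken from the cited documents; the point is only that a single per-word score cannot separate the two uses of "high", while a score keyed on a word pair can.

```python
# Hypothetical word-level score table: one score per word, no context.
WORD_SCORES = {"high": -1}

# Hypothetical pair-level table keyed by (subject word, predicate word).
PAIR_SCORES = {("price", "high"): -1, ("reliability", "high"): +1}


def word_score(words):
    """Sum per-word scores; context is ignored."""
    return sum(WORD_SCORES.get(w, 0) for w in words)


def pair_score(words):
    """Sum scores of adjacent word pairs, so context decides the sign."""
    return sum(PAIR_SCORES.get(p, 0) for p in zip(words, words[1:]))


price = ["price", "high"]
reliability = ["reliability", "high"]
print(word_score(price), word_score(reliability))  # same score for both
print(pair_score(price), pair_score(reliability))  # opposite signs
```

Adjacent pairs are used here only to keep the sketch short; the invention described later relies on phrase, sentence, and dependency patterns rather than adjacency.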

Second, to provide analysis support that copes with the varied expressions found in freely written text, a large synonym dictionary must be built by hand, which takes considerable effort. Document 4 (Satoshi Ishii, "How to Use Text Mining" (テキストマイニング活用法), Ric Telecom, November 2002) discloses a method that computes word similarity automatically by statistically processing the words in training documents. However, it does not take into account functional expressions that are important in analyzing free text, such as adjectives like "good" and "bad" or request expressions like "I want to ...", so it did not provide sufficient analysis support.

The present invention has been made in view of the above problems. By providing collation patterns that define the correspondence between information in the analysis target text and the information to be extracted, information extraction means that extracts the information needed for analysis from the text by matching it against the collation patterns, and extracted information storage means that stores the extracted information, the invention aims to classify free-text responses with multiple words taken into account and to use the result for analysis support.

The text mining apparatus according to the present invention comprises registration means, text analysis means, an attribute database, concept dictionary creation means, a concept dictionary, collation pattern storage means, information extraction means, extracted information index storage means, document index storage means, and analysis means, wherein:
the registration means reads analysis target documents;
the text analysis means analyzes the text of the analysis target documents read by the registration means, divides it into words, and extracts word co-occurrence frequencies and dependency relations between words;
the attribute database stores attribute information consisting of items by which the analysis target documents can be narrowed down according to the content of the analysis processing, such as customer information (for example, age and gender) attached to questionnaire documents, and the model name and failure occurrence date and time attached to failure case documents;
the concept dictionary creation means computes a concept vector for each word by singular value decomposition from the co-occurrence frequencies of the words divided by the text analysis means, and thereby creates concept dictionary data;
the concept dictionary stores the concept dictionary data created by the concept dictionary creation means;
the registration means further reads from the concept dictionary the concept vector corresponding to each word divided by the text analysis means, and combines these concept vectors for each text to generate a document vector, which is one kind of document index information;
the document index storage means stores the document vectors generated by the registration means;
the collation pattern storage means stores collation patterns that are created in advance, each having an analysis type serving as an analysis condition, a word related to that analysis type, and an evaluation value of that word for the analysis type;
the information extraction means, when a word obtained by analyzing an analysis target document with the text analysis means appears in a collation pattern stored in the collation pattern storage means, extracts the word, the analysis type, and the evaluation value for the analysis type;
the extracted information index storage means stores the word, the analysis type, and the evaluation value for the analysis type extracted by the information extraction means, together with the document id of the corresponding document; and
the analysis means comprises:
analysis condition input means with which an analyst enters analysis conditions;
frequency aggregation means that, referring to the attribute database and the concept dictionary, obtains from the document index storage means the words that match the analysis conditions entered through the analysis condition input means and the words that tend to co-occur with those words;
relevance calculation means that, for the words matching the analysis conditions obtained by the frequency aggregation means and the words that tend to co-occur with them, extracts the evaluation value of each word from the extracted information index storage means and sums the values; and
output means that formats the information obtained by the relevance calculation means as a graph or a table and outputs it.
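As a rough illustration of how the collation patterns, the extracted information index, and the relevance calculation fit together, the following sketch stores one record per matched word and sums evaluation values per word. The record layouts, the function names, the English stand-in words, and the evaluation values are assumptions made for this example; the apparatus described here also consults the attribute database and the concept dictionary, which the sketch leaves out.

```python
from collections import defaultdict
from typing import NamedTuple


class Pattern(NamedTuple):
    analysis_type: str  # e.g. "favorability"
    word: str
    value: int          # evaluation value of the word for the analysis type


class Extracted(NamedTuple):
    doc_id: int
    word: str
    analysis_type: str
    value: int


def extract(doc_id, words, patterns):
    """Emit one index record for every word that appears in a pattern."""
    by_word = defaultdict(list)
    for p in patterns:
        by_word[p.word].append(p)
    return [Extracted(doc_id, w, p.analysis_type, p.value)
            for w in words for p in by_word.get(w, [])]


def relevance(index, analysis_type, target_words):
    """Sum the stored evaluation values per word for one analysis type."""
    totals = defaultdict(int)
    for r in index:
        if r.analysis_type == analysis_type and r.word in target_words:
            totals[r.word] += r.value
    return dict(totals)


patterns = [Pattern("favorability", "quiet", 3),
            Pattern("favorability", "noisy", -2)]
index = extract(1, ["fan", "quiet"], patterns) \
    + extract(2, ["quiet", "noisy"], patterns)
totals = relevance(index, "favorability", {"quiet", "noisy"})
```

Keeping the document id in each record is what allows the totals to be narrowed later by attributes such as age or model name.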

The text mining method according to the present invention comprises:
a registration step of reading analysis target documents with registration means;
a text analysis step of analyzing, with text analysis means, the text of the analysis target documents read in the registration step, dividing it into words, and extracting word co-occurrence frequencies and dependency relations between words;
a concept dictionary creation step of computing, with concept dictionary creation means, a concept vector for each word by singular value decomposition from the co-occurrence frequencies of the words divided in the text analysis step, creating concept dictionary data, and storing the data in a concept dictionary;
a document vector generation step of reading from the concept dictionary, with document vector generation means, the concept vector corresponding to each word divided in the text analysis step, combining these concept vectors for each text to generate a document vector, which is one kind of document index information, and storing it in a document index;
an information extraction step of using stored collation patterns that are created in advance, each having an analysis type serving as an analysis condition, a word related to that analysis type, and an evaluation value of that word for the analysis type, and, when a word obtained by analyzing an analysis target document in the text analysis step appears in a collation pattern, extracting the word, the analysis type, and the evaluation value for the analysis type with information extraction means and storing them in extracted information index storage means together with the document id of the corresponding document; and
an analysis step of obtaining an analysis result with analysis means by referring to an attribute database that stores attribute information attached to the analysis target documents, such as customer information for questionnaire documents and the model name and failure occurrence date and time for failure case documents, to the document vectors stored in document index storage means, and to the information extracted in the information extraction step,
wherein the analysis step includes:
a frequency aggregation step of obtaining from the document index storage means, with reference to the attribute database and the concept dictionary, the words that match the analysis conditions entered by an analyst through analysis condition input means and the words that tend to co-occur with those words;
a relevance calculation step of extracting from the extracted information index storage means the evaluation value of each of the words matching the analysis conditions obtained in the frequency aggregation step and the words that tend to co-occur with them, and summing the values; and
an output step of formatting the information obtained in the relevance calculation step as a graph or a table and outputting it.

The text mining program according to the present invention causes a computer to execute the following steps:
a registration step of reading analysis target documents with registration means;
a text analysis step of analyzing, with text analysis means, the text of the analysis target documents read in the registration step, dividing it into words, and extracting word co-occurrence frequencies and dependency relations between words;
a concept dictionary creation step of computing, with concept dictionary creation means, a concept vector for each word by singular value decomposition from the co-occurrence frequencies of the words divided in the text analysis step, creating concept dictionary data, and storing the data in a concept dictionary;
a document vector generation step of reading from the concept dictionary, with document vector generation means, the concept vector corresponding to each word divided in the text analysis step, combining these concept vectors for each text to generate a document vector, which is one kind of document index information, and storing it in a document index;
an information extraction step of using stored collation patterns that are created in advance, each having an analysis type serving as an analysis condition, a word related to that analysis type, and an evaluation value of that word for the analysis type, and, when a word obtained by analyzing an analysis target document in the text analysis step appears in a collation pattern, extracting the word, the analysis type, and the evaluation value for the analysis type with information extraction means and storing them in extracted information index storage means together with the document id of the corresponding document; and
an analysis step of obtaining an analysis result with analysis means by referring to an attribute database that stores attribute information attached to the analysis target documents, such as customer information for questionnaire documents and the model name and failure occurrence date and time for failure case documents, to the document vectors stored in document index storage means, and to the information extracted in the information extraction step,
wherein the analysis step includes:
a frequency aggregation step of obtaining from the document index storage means, with reference to the attribute database and the concept dictionary, the words that match the analysis conditions entered by an analyst through analysis condition input means and the words that tend to co-occur with those words;
a relevance calculation step of extracting from the extracted information index storage means the evaluation value of each of the words matching the analysis conditions obtained in the frequency aggregation step and the words that tend to co-occur with them, and summing the values; and
an output step of formatting the information obtained in the relevance calculation step as a graph or a table and outputting it.

The present invention includes information extraction means that matches collation patterns, which define the information to be extracted, against the result of analyzing the text of the analysis target documents with text analysis means, extracts the information needed for analysis from the matching result, and stores it in an extracted information index. The analysis means obtains an analysis result by referring to an attribute database that stores attribute information, a document index in which document vectors generated from the analysis target documents are registered, and the information extracted by the information extraction means. As a result, free-text responses can be classified with multiple words taken into account and used for analysis support.
A further advantage is that many similar expressions can be covered, which reduces the effort of building a synonym dictionary for the analysis work.

Embodiment 1.
FIG. 1 shows the configuration of a text mining apparatus according to Embodiment 1 of the present invention. Text analysis means 101 analyzes the text contained in documents 121, divides it into words, and extracts the relations between the words. Concept dictionary creation means 102 computes a concept vector for each word from the occurrence tendencies of the words divided by the text analysis means 101 and stores it in a concept dictionary 103. Registration means 104 converts the text contained in the documents 121 into vector information based on the concept vectors registered in the concept dictionary 103 and registers it in a document index 105. Information extraction means 106 refers to predefined collation patterns 107, extracts the information needed for analysis from the text analysis result produced by the text analysis means 101, and registers it in an extracted information index 108 via the registration means 104. An attribute database 109 stores customer information such as gender and age for questionnaire analysis, and attribute information such as the model name and failure occurrence date and time for failure case analysis. Analysis means 110 reads an analysis input 122 entered by the user, refers to the concept dictionary 103, the document index 105, the extracted information index 108, and the attribute database 109, and outputs analysis information 123 that supports analysis in response to the analysis input 122.

FIG. 2 is a processing flow outlining the document analysis processing. An overview of the analysis processing is given below with reference to FIGS. 1 to 4 as appropriate.
First, the concept dictionary creation processing of step S201 is described. In step S201, the registration means 104 reads the documents 121, and the text analysis means 101 divides the text contained in the documents 121 into words. The registration means 104 then calls the concept dictionary creation means 102, which computes a concept vector for each word from the occurrence tendencies of the divided words (the frequencies of words that appear together, that is, co-occur) by an algebraic operation called singular value decomposition, creates concept dictionary data such as that shown in FIG. 3, and stores it in the concept dictionary 103.
The concept vectors are computed using, for example, the method disclosed in Document 5 (Takayama et al., "InfoMAP: An Information Retrieval System Based on Word Associations," Informatics Fundamentals (情報学基礎) 53-1, 1999-3).
Many publicly known documents describe methods for dividing text into words and extracting dependency relations between words, so their description is omitted here.
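The concept dictionary creation of step S201 can be sketched with a truncated singular value decomposition. This is a generic latent-semantic style computation written with NumPy, not the specific method of Document 5; the co-occurrence counts, the English stand-in words, the choice of a word-by-context matrix, and the number of retained dimensions `k` are all assumptions for illustration.

```python
import numpy as np


def concept_vectors(cooc, k):
    """Return k-dimensional word vectors from a co-occurrence matrix.

    Each row of `cooc` holds the co-occurrence counts of one word.
    """
    u, s, _vt = np.linalg.svd(cooc, full_matrices=False)
    return u[:, :k] * s[:k]  # scale the left singular vectors


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


words = ["quiet", "silent", "price"]
# Hypothetical counts: rows are words, columns are co-occurring contexts.
cooc = np.array([[4.0, 3.0, 0.0, 0.0],
                 [4.0, 3.0, 0.0, 1.0],
                 [0.0, 0.0, 5.0, 2.0]])
vecs = concept_vectors(cooc, k=2)
concept_dictionary = dict(zip(words, vecs))
```

Words with similar co-occurrence rows ("quiet" and "silent" in this toy data) end up with nearly parallel vectors, which is what allows similar expressions to be handled without a hand-built synonym dictionary.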

Although FIG. 1 shows only one concept dictionary 103 to keep the figure simple, a separate concept dictionary 103 is created for each field of analysis target documents. For example, a concept dictionary for mobile phone questionnaire analysis is created to analyze mobile phone questionnaire results, and a concept dictionary for washing machines is created to analyze inquiry e-mails about washing machines. A concept dictionary need only have been learned from text whose content resembles the documents to be registered. A concept dictionary 103 learned from one set of questionnaire results can therefore also be used to analyze another set of questionnaire results with similar content.

Next, document index creation processing is performed in step S202. In this processing, the documents 121 read by the registration means 104 are divided into words by the text analysis means 101, the concept vector corresponding to each word is read from the concept dictionary 103, and these concept vectors are combined to generate a document vector, which is one kind of document index information and is stored in the document index 105. FIG. 4 shows an example of the document vectors stored in the document index 105.
Although a document vector has been described above as an example of document index information, the document index information may instead be a lookup table that associates the words appearing in each document with that document, without using the concept dictionary 103.
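Both forms of document index information mentioned for step S202 can be sketched briefly. The text says only that the concept vectors are combined; summing them and normalizing to unit length is an assumption made here, as are the function names, the two-dimensional vectors, and the English stand-in words.

```python
import math
from collections import defaultdict


def document_vector(words, concept_dict):
    """Sum the concept vectors of known words and normalize the result."""
    dims = len(next(iter(concept_dict.values())))
    total = [0.0] * dims
    for w in words:
        vec = concept_dict.get(w)
        if vec is None:
            continue  # words missing from the dictionary are skipped
        total = [t + v for t, v in zip(total, vec)]
    norm = math.sqrt(sum(t * t for t in total))
    return [t / norm for t in total] if norm else total


def word_table(docs):
    """Alternative index: map each word to the ids of documents containing it."""
    table = defaultdict(set)
    for doc_id, words in docs.items():
        for w in words:
            table[w].add(doc_id)
    return dict(table)


concept_dict = {"quiet": [1.0, 0.0], "price": [0.0, 1.0]}  # hypothetical
doc_vec = document_vector(["quiet", "price", "unknown"], concept_dict)
table = word_table({1: ["quiet", "price"], 2: ["price"]})
```

The vector form supports similarity search across differently worded documents, while the table form supports exact word lookup only.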

Next, pattern extraction processing is performed in step S203. In this processing, the documents read by the registration means 104 are analyzed by the text analysis means 101, the information extraction means 106 checks whether the analysis result matches the collation patterns 107, and the necessary information is extracted from the matching result and stored in the extracted information index 108. The pattern extraction processing is described in detail later.

Finally, text analysis processing is performed in step S204. In this processing, the analysis means 110 reads the analysis input 122 entered by the analyst, refers to the concept dictionary 103, the document index 105, the extracted information index 108, and the attribute database 109, and outputs analysis information 123 that supports analysis in response to the analysis input 122. The text analysis processing is also described in detail later.

The pattern extraction processing (collation processing) of step S203 is described in detail below with reference to FIG. 1 and FIGS. 5 to 13 as appropriate. The description assumes that the analysis target data are the results of a questionnaire about air conditioners.

FIG. 5 shows the detailed configuration of the information extraction means 106. In-phrase pattern matching means 501 extracts matchable patterns for each phrase (bunsetsu) in the text analysis result, which is read by the registration means 104 and analyzed by the text analysis means 101, and matches them against the phrase collation patterns among the collation patterns 107. In-sentence pattern matching means 502 likewise extracts matchable patterns for each sentence in the text analysis result and matches them against the in-sentence collation patterns, which contain multiple words, among the collation patterns 107. Dependency pattern matching means 503 extracts matchable patterns for each dependency between words (a relation between two phrases) read by the registration means 104 and analyzed by the text analysis means 101, and matches them against the dependency collation patterns, which specify dependencies between multiple words, among the collation patterns 107.

FIG. 6 is the processing flow of the pattern extraction processing (information extraction processing) in the information extraction means 106. First, in step S601, the in-phrase pattern matching means 501 performs in-phrase pattern matching. The in-phrase pattern matching means 501 refers to the phrase collation patterns among the collation patterns 107 and extracts matchable patterns from the text analysis result of the text analysis means 101, which is input from the registration means 104. FIG. 8 shows an example of a text analysis result. The analysis result consists of a list of sentences (sentence), and each sentence consists of a list of phrases (pp). Each phrase has a morpheme list (morph-list), and the morpheme list consists of a list of morphemes (morph). The dependency relations between phrases are indicated by pp-rel tags. When this text analysis result matches a collation pattern described below, the necessary information is extracted according to the content of the collation pattern and stored in the extracted information index 108.
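The nesting of the text analysis result (sentence, pp, morph-list, morph, plus pp-rel for dependencies) can be mirrored in a small data model. FIG. 8 is not reproduced here, so the field names, the representation of pp-rel as a pair of phrase ids, and the hand-segmented example sentence are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Morph:
    hyouki: str  # surface form of the morpheme
    pos: str     # part of speech


@dataclass
class Phrase:
    id: int                  # position of the phrase in the sentence
    morph_list: List[Morph]


@dataclass
class Sentence:
    pp: List[Phrase]
    # Each pp-rel entry is assumed to be (dependent phrase id, head phrase id).
    pp_rel: List[Tuple[int, int]] = field(default_factory=list)


# "音が静かだ" ("the sound is quiet"), segmented by hand for illustration.
result = [Sentence(
    pp=[Phrase(0, [Morph("音", "名詞"), Morph("が", "助詞")]),
        Phrase(1, [Morph("静か", "形容動詞"), Morph("だ", "助動詞")])],
    pp_rel=[(0, 1)],
)]
```

With this shape, in-phrase matching iterates over `pp`, in-sentence matching over whole `Sentence` objects, and dependency matching over `pp_rel`.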

FIG. 7 shows an example of the collation patterns used in the phrase pattern matching. In this embodiment the patterns are written in XML. Any other notation may be used as long as it can hold the information described below. FIG. 7 shows two patterns. Each pattern is delimited by <pattern> and </pattern>.

Next, the information written inside each pattern in FIG. 7 is described. The <extract-object> tag at the top indicates the information to be extracted. Here, information whose attribute name is "favorability" (好感度) and whose value is "3" is extracted. The following <co-region> tag specifies the matching range; here "pp", meaning within a phrase, is specified. The span from <pp id="0" negative="false"> to </pp> is the pattern information used for matching against the text analysis result.
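Since FIG. 7 is not reproduced here, the following is only a guess at how such a pattern might be laid out and loaded. The tag names come from the description, but how <extract-object> encodes its name and value (XML attributes here), how <co-region> carries "pp" (element text here), the English attribute name, and the `load_pattern` helper are assumptions. The elements nested inside <pp> are the ones explained in the paragraphs that follow.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of one phrase collation pattern.
PATTERN_XML = """
<pattern>
  <extract-object name="favorability" value="3"/>
  <co-region>pp</co-region>
  <pp id="0" negative="false">
    <morph-list order="false">
      <morph match="strict"><hyouki>静か</hyouki><pos>形容動詞</pos></morph>
    </morph-list>
  </pp>
</pattern>
"""


def load_pattern(xml_text):
    """Parse one <pattern> element into a plain dictionary."""
    root = ET.fromstring(xml_text)
    target = root.find("extract-object")
    pp = root.find("pp")
    return {
        "extract": (target.get("name"), int(target.get("value"))),
        "region": root.findtext("co-region").strip(),
        "negative": pp.get("negative") == "true",
        "morphs": [(m.findtext("hyouki"), m.findtext("pos"), m.get("match"))
                   for m in pp.iter("morph")],
    }


pattern = load_pattern(PATTERN_XML)
```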

The <pp> tag has two attributes. The id attribute is an integer indicating the position at which the phrase appears in the text. The negative attribute is a flag indicating whether the phrase contains a negative expression: "true" is specified if it does and "false" if it does not. "false" is also specified for a phrase containing a double negation. Thanks to this negative attribute, the second pattern shown in FIG. 7 can extract the value favorability 3 by matching the negated expression "うるさくない" ("not noisy").

The child element of the <pp> tag is the <morph-list> tag. The <morph-list> tag has an order attribute. When its value is "false", matching ignores the order of the morphemes, and morphemes that are not written in the collation pattern are ignored. When the order attribute is "true", matching takes the order into account, and the match fails if a morpheme that is not written in the collation pattern appears. Patterns are normally matched without regard to order, which keeps the pattern description simple. For example, the matching condition of the first pattern in FIG. 7 is that the adjectival noun "静か" ("quiet") appears in the <morph-list>, so the pattern matches a variety of expressions such as "静かかもしれない" ("may be quiet"), "静かだと" ("if it is quiet"), and "静かならば" ("provided it is quiet").
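The two behaviours of the order attribute can be sketched as follows. The function name `match_morph_list`, the representation of morphemes as `(surface, part-of-speech)` tuples, and the hand-made segmentation are assumptions for illustration, not taken from the figures.

```python
def match_morph_list(pattern, phrase, ordered):
    """Match a pattern morph-list against the morphemes of one phrase.

    Both arguments are lists of (hyouki, pos) tuples.
    """
    if ordered:
        # order="true": the sequence must agree exactly, so any morpheme
        # that the pattern does not list makes the match fail.
        return list(pattern) == list(phrase)
    # order="false": every pattern morpheme must occur somewhere in the
    # phrase; order and unlisted morphemes are ignored.
    return all(m in phrase for m in pattern)


pattern = [("静か", "形容動詞")]
# "静かかもしれない", segmented by hand for illustration
phrase = [("静か", "形容動詞"), ("かも", "助詞"), ("しれ", "動詞"), ("ない", "助動詞")]

print(match_morph_list(pattern, phrase, ordered=False))  # extra morphemes ignored
print(match_morph_list(pattern, phrase, ordered=True))   # extra morphemes cause failure
```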

The child element of the <morph-list> tag is the <morph> tag, which corresponds to the smallest unit of word segmentation (a morpheme). The <morph> tag has two child elements: the <hyouki> tag, which gives the surface form of the morpheme, and the <pos> tag, which gives its part of speech. It also has a match attribute. When "strict" is specified, a match requires the surface form and the part of speech to agree exactly. When "ambiguous" is specified, the concept dictionary 103 is consulted and matching is also attempted against expressions similar to "静か" ("quiet"), such as "静音" ("silent") and "低騒音" ("low noise"); if the similarity is at or above a predetermined value, the match succeeds and the necessary information (here, favorability 3) is extracted just as for "静か".
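A pattern built from the tags described so far could be loaded as sketched below. Because FIG. 7 is not reproduced here, the XML text is a hypothetical reconstruction: the tag names come from the description, but the `name` and `value` attributes on <extract-object> and the exact layout are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the first pattern; layout is assumed.
PATTERN_XML = """
<pattern>
  <extract-object name="好感度" value="3"/>
  <co-region>pp</co-region>
  <pp id="0" negative="false">
    <morph-list order="false">
      <morph match="ambiguous"><hyouki>静か</hyouki><pos>形容動詞</pos></morph>
    </morph-list>
  </pp>
</pattern>
"""


def load_pattern(xml_text):
    """Parse one <pattern> element into a plain dictionary."""
    root = ET.fromstring(xml_text)
    obj = root.find("extract-object")
    pp = root.find("pp")
    morph_list = pp.find("morph-list")
    return {
        "extract": (obj.get("name"), int(obj.get("value"))),
        "co_region": root.findtext("co-region").strip(),
        "negative": pp.get("negative") == "true",
        "order": morph_list.get("order") == "true",
        "morphs": [
            {
                "hyouki": m.findtext("hyouki"),
                "pos": m.findtext("pos"),
                "match": m.get("match"),
            }
            for m in morph_list.findall("morph")
        ],
    }


pattern = load_pattern(PATTERN_XML)
print(pattern["extract"], pattern["co_region"], pattern["morphs"][0]["match"])
```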

To summarize, the first pattern in FIG. 7 means that when the free-text description contains "静か" ("quiet") or a word similar to it, and that word is not negated as in "静かではない" ("not quiet"), favorability 3 is assigned as a favorable opinion about the air conditioner. Likewise, the second pattern in FIG. 7 means that when the word "うるさい" ("noisy") is present and appears in a negated expression such as "うるさくはない" ("not noisy"), favorability 2 is assigned as a favorable opinion about the air conditioner.
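The combined effect of the negative flag and of "strict" versus "ambiguous" matching can be sketched for the first pattern as follows. The concept vectors are toy values, the threshold of 0.9 is arbitrary, cosine similarity is only one plausible similarity measure, and part-of-speech checking is omitted for brevity; all names are hypothetical.

```python
import math

# Toy concept vectors; real ones would come from the concept dictionary 103.
CONCEPT_VECTORS = {
    "静か": (0.9, 0.1),
    "静音": (0.8, 0.2),
    "暑い": (0.1, 0.9),
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0


def word_matches(pattern_word, word, mode, threshold=0.9):
    if word == pattern_word:
        return True
    if mode == "ambiguous" and pattern_word in CONCEPT_VECTORS and word in CONCEPT_VECTORS:
        # Similar words count as a match when the similarity is high enough.
        return cosine(CONCEPT_VECTORS[pattern_word], CONCEPT_VECTORS[word]) >= threshold
    return False


def match_phrase(pattern, phrase):
    """Return the pattern's extract value if the phrase matches, else None."""
    if pattern["negative"] != phrase["negative"]:
        return None  # negation state must agree with the pattern
    if any(word_matches(pattern["word"], w, pattern["match"]) for w in phrase["words"]):
        return pattern["extract"]
    return None


quiet = {"word": "静か", "match": "ambiguous", "negative": False, "extract": ("好感度", 3)}

print(match_phrase(quiet, {"words": ["静音", "だ"], "negative": False}))
print(match_phrase(quiet, {"words": ["静か", "で", "は", "ない"], "negative": True}))
```

With these toy vectors the first call yields the favorability entry because "静音" is close to "静か", while the second yields nothing because the phrase is negated.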

The in-phrase pattern matching means 501 applies the collation patterns of FIG. 7 in turn to each phrase (pp) in the text analysis result (FIG. 8). When a match succeeds, it extracts the information written in the extract-object of that collation pattern and stores it in the extracted information index 108 via the registration means 104. FIG. 9 shows an example of the extracted information that is obtained by the in-phrase pattern matching and stored in the extracted information index 108. The analysis results obtained with this extracted information are described later.

Next, in step S602 of FIG. 6, the in-sentence pattern matching means 502 performs in-sentence pattern matching. The in-sentence pattern matching means 502 refers to the in-sentence collation patterns among the collation patterns 107 and extracts the patterns that match the text analysis result, which is produced by the text analysis means 101 and supplied via the registration means 104.

FIG. 10 shows examples of the collation patterns used in the in-sentence pattern matching. As in FIG. 7, they are written in XML. The description below focuses on the parts that differ from FIG. 7. The extract-object is the same as in FIG. 7, so its description is omitted. First, the content of the co-region tag is sentence. The element following the co-region tag is the sentence tag to be matched. Like the text analysis result shown in FIG. 8, the sentence tag consists of a list of phrases. The sentence tag has an order attribute. When "true" is specified, matching also requires the order of the phrases in the list to agree. When "false" is specified, the order of the phrases is ignored and the match succeeds as long as the elements appear in the sentence.

FIG. 10 shows two patterns as examples. The first pattern matches when the words "音" ("sound") and "小さい" ("small") both appear in a sentence. The phrases containing "音" and "小さい" are each matched in the same way as in the in-phrase matching described above. The second pattern in FIG. 10 matches when the words "表示" ("display") and "小さい" ("small") both appear in a sentence.
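In-sentence matching over a list of phrases can be sketched as below. Each phrase is reduced to a list of surface forms and each pattern phrase to a single word. For order="true" the sketch assumes that the pattern words only need to appear in the same relative order, which the description does not spell out; the function name is hypothetical.

```python
def match_sentence(pattern_words, sentence, ordered=False):
    """sentence is a list of phrases; each phrase is a list of surface forms."""
    if not ordered:
        # order="false": each pattern word must occur in some phrase.
        return all(any(w in phrase for phrase in sentence) for w in pattern_words)
    # order="true" (assumed semantics): pattern words must be found in
    # phrases that appear in the same relative order.
    pos = 0
    for w in pattern_words:
        while pos < len(sentence) and w not in sentence[pos]:
            pos += 1
        if pos == len(sentence):
            return False
        pos += 1
    return True


# "音が小さい" ("the sound is small"), segmented by hand for illustration
sentence = [["音", "が"], ["小さい"]]

print(match_sentence(["音", "小さい"], sentence))    # first pattern
print(match_sentence(["表示", "小さい"], sentence))  # second pattern
```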

The in-sentence pattern matching means 502 applies the collation patterns of FIG. 10 in turn to each sentence (sentence) in the text analysis result shown in FIG. 8. When a match succeeds, it extracts the information written in the extract-object of that pattern and stores it in the extracted information index 108 via the registration means 104. FIG. 11 shows an example of the extracted information that is obtained by the in-sentence pattern matching and stored in the extracted information index 108. The analysis results obtained with this extracted information are described later.

Next, in step S603 of FIG. 6, the dependency pattern matching means 503 performs dependency pattern matching. The dependency pattern matching means 503 refers to the dependency collation patterns among the collation patterns 107 and extracts the patterns that match the text analysis result, which is produced by the text analysis means 101 and supplied via the registration means 104.

FIG. 12 shows an example of the collation patterns used in the dependency pattern matching. As in FIG. 10, they are written in XML. FIG. 12 differs from FIG. 10 in the <pp-rel> tag inside the sentence tag. This tag indicates that the phrase with id="0" (a phrase containing "腹が" or "腹の") depends on the phrase with id="1" (a phrase containing "立つ"); together they form the idiom "腹が立つ" ("to get angry"). Matching takes this dependency relation into account. For example, the sentence "側に立つと腹が冷える" ("standing beside it chills my stomach") contains both "腹" and "立つ", but there is no dependency relation between "腹" and "立つ", so the pattern shown in FIG. 12 does not match it.
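The difference between mere co-occurrence and a dependency match can be sketched with the two sentences from the text. The dependency pairs below are assigned by hand for illustration, and the function name and data layout are assumptions.

```python
def match_dependency(dep_word, head_word, phrases, pp_rel):
    """True if a phrase containing dep_word depends on one containing head_word.

    phrases maps phrase id -> list of surface forms;
    pp_rel is a list of (dependent id, head id) pairs.
    """
    return any(
        dep_word in phrases[d] and head_word in phrases[h] for d, h in pp_rel
    )


# "腹が立つ": phrase 0 ("腹が") depends on phrase 1 ("立つ")
angry = {0: ["腹", "が"], 1: ["立つ"]}
angry_rel = [(0, 1)]

# "側に立つと腹が冷える": "腹が" depends on "冷える", not on "立つと"
cold = {0: ["側", "に"], 1: ["立つ", "と"], 2: ["腹", "が"], 3: ["冷える"]}
cold_rel = [(0, 1), (1, 3), (2, 3)]

print(match_dependency("腹", "立つ", angry, angry_rel))
print(match_dependency("腹", "立つ", cold, cold_rel))
```

Both sentences contain "腹" and "立つ", so an in-sentence pattern would accept both; only the dependency check separates the idiom from the literal sentence.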

The dependency pattern matching means 503 applies the collation patterns of FIG. 12 in turn to each sentence (sentence) in the text analysis result (FIG. 8). When a match succeeds, it extracts the information written in the extract-object of that pattern and stores it in the extracted information index 108. FIG. 13 shows an example of the extracted information that is obtained by the dependency pattern matching and stored in the extracted information index 108. The analysis results obtained with this extracted information are described later.

This concludes the detailed description of the pattern matching performed by the information extraction means 106 in steps S601 to S603 of FIG. 6.

Next, the text analysis processing in step S204 of FIG. 2 is described in detail with reference to FIGS. 14 to 17.
FIG. 14 is a detailed configuration diagram of the analysis means 110. The analysis condition input means 1401 is a GUI (Graphical User Interface) through which the analyst interactively enters the conditions of the analysis input 122. The frequency counting means 1402 obtains the frequencies of the documents and words that satisfy the analysis conditions entered through the analysis condition input means 1401, referring to the concept dictionary 103, the document index 105, the extracted information index 108, and the attribute database 109. The relevance calculation means 1403 refers to the same four resources and calculates the similarity between two concept vectors: one synthesized from the document set selected by the entered analysis conditions, and one for the text or other item specified in those analysis conditions. The output means 1404 shapes the frequencies and relevance values obtained above into a form in which the analyst can easily grasp the trends, and outputs the analysis information 123 as a table or graph.
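The similarity calculation attributed to the relevance calculation means 1403 can be sketched as follows. The description does not state how concept vectors are synthesized or which similarity measure is used, so component-wise summation and cosine similarity are assumptions here, and the vectors are toy values.

```python
import math


def synthesize(vectors):
    """Combine concept vectors by component-wise summation (assumed method)."""
    return tuple(sum(component) for component in zip(*vectors))


def cosine(u, v):
    """Cosine similarity, used here as one plausible similarity measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0


# Toy concept vectors for the documents selected by the analysis condition
document_vectors = [(1.0, 0.0), (0.0, 1.0)]
query_vector = (1.0, 0.0)  # toy vector for the text given in the condition

combined = synthesize(document_vectors)
print(combined, round(cosine(combined, query_vector), 4))
```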

FIG. 15 is a processing flow showing the details of the text analysis processing. First, in step S1501, the analyst enters the analysis conditions through the analysis condition input means 1401. FIG. 16 shows an example of the analysis condition input screen. In this example, the analysis target is limited to the older age group of 50 to 80 years, "音" ("sound") and "表示" ("display") are selected as the topics of interest, and the analysis type "favorable/unfavorable" is specified in order to grasp how these topics are received.

Next, in step S1502, the frequency counting means 1402 obtains the words that tend to co-occur with the topics of interest "音" ("sound") and "表示" ("display") by referring to the table of word-document relations recorded in the document index 105. Before doing so, it refers to the attribute database 109 and narrows the responses down to those from customers aged 50 to 80. Here it is assumed that "小さい" ("small"), "静か" ("quiet"), "うるさい" ("noisy"), and so on are obtained as words co-occurring with "音", and that "見やすい" ("easy to see"), "小さい" ("small"), "見にくい" ("hard to see"), and so on are obtained as words co-occurring with "表示".
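Step S1502 can be sketched as an attribute filter followed by a co-occurrence count. The sample documents, customer ages, and dictionary layouts are invented for illustration and do not come from the figures; `cooccurring_words` is a hypothetical name.

```python
from collections import Counter


def cooccurring_words(topic, documents, attributes, min_age, max_age, top_n=3):
    """Words most often found in the same document as the topic word.

    documents maps doc id -> set of words (a stand-in for the word-document
    table of the document index); attributes maps doc id -> customer attributes.
    """
    counts = Counter()
    for doc_id, words in documents.items():
        if not (min_age <= attributes[doc_id]["age"] <= max_age):
            continue  # narrowed down using the attribute database
        if topic in words:
            counts.update(w for w in words if w != topic)
    return [word for word, _ in counts.most_common(top_n)]


documents = {
    1: {"音", "静か"},
    2: {"音", "静か", "小さい"},
    3: {"音", "うるさい"},
    4: {"表示", "見にくい"},
}
attributes = {1: {"age": 65}, 2: {"age": 72}, 3: {"age": 30}, 4: {"age": 55}}

print(cooccurring_words("音", documents, attributes, 50, 80))
print(cooccurring_words("表示", documents, attributes, 50, 80))
```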

Next, in step S1503, for each of the expressions obtained above, such as "音-小さい" ("sound-small"), "音-静か" ("sound-quiet"), and "表示-小さい" ("display-small"), the relevance calculation means 1403 totals the favorability and dissatisfaction values of the documents containing that expression, based on the document ids in the information extraction results shown in FIGS. 9, 11, and 13.
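The totalling in step S1503 can be sketched as a grouped sum over extraction records. The record layout and every value below are hypothetical, since FIGS. 9, 11, and 13 are not reproduced here.

```python
from collections import defaultdict


def total_scores(records, doc_ids):
    """Sum evaluation values per (expression, attribute) over selected documents."""
    totals = defaultdict(int)
    for record in records:
        if record["doc_id"] in doc_ids:
            totals[(record["expression"], record["attribute"])] += record["value"]
    return dict(totals)


# Hypothetical extraction records; real ones would come from the
# extracted information index 108.
records = [
    {"doc_id": 1, "expression": "音-静か", "attribute": "好感度", "value": 3},
    {"doc_id": 2, "expression": "音-静か", "attribute": "好感度", "value": 3},
    {"doc_id": 3, "expression": "表示-小さい", "attribute": "不満度", "value": 2},
    {"doc_id": 4, "expression": "表示-小さい", "attribute": "不満度", "value": 2},
]

# Document 4 is assumed to fall outside the selected age range.
print(total_scores(records, doc_ids={1, 2, 3}))
```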

Then, in step S1504, the output means 1404 plots the favorability and dissatisfaction values obtained above as a graph of the kind shown in FIG. 17. The graph shows that, among older respondents, the sound is well received but the display is poorly received.

Because the in-sentence pattern matching using the collation patterns of FIG. 10 extracts the information shown in FIG. 11, the same adjective "小さい" ("small") can be assigned to favorable or unfavorable as appropriate and analyzed accordingly (see "音-小さい" and "表示-小さい" in FIG. 17).

In addition, because the in-phrase pattern matching using the collation patterns of FIG. 7 extracts accurate information, such as that shown in FIG. 9, even from expressions containing a negated form, an expression can be recognized as favorable and analyzed accordingly even when it contains an adjective with a negative meaning such as "うるさい" ("noisy").

Each of the steps described above can also be carried out by a program running on a computer.

As described above, this embodiment includes the in-sentence pattern matching means and the dependency pattern matching means, so the necessary information can be extracted with more than one word taken into account.

Also, according to this embodiment, the automatically generated concept dictionary 103 is used in the in-phrase pattern matching, so a single pattern description can cover many similar expressions. This has the advantage of reducing the effort of building a synonym dictionary for the analysis work.

Furthermore, because the in-phrase pattern matching means is provided, information can be extracted accurately even from text containing negative expressions.

Furthermore, because the in-sentence pattern matching means can match regardless of word order, the amount of pattern description can be reduced.

Furthermore, because the dependency pattern matching means can match against a strictly specified dependency relation, idiomatic expressions in which word order carries important meaning can be analyzed accurately.

Because ambiguous expressions can be matched and information such as favorable opinions, unfavorable opinions, and requests can be extracted from text, the invention supports questionnaire analysis work and can be applied to questionnaire analysis services.

FIG. 1 is a block diagram of Embodiment 1 of the present invention.
FIG. 2 is a processing flow diagram showing an outline of the analysis processing.
FIG. 3 is an explanatory diagram of the concept dictionary data created by the concept dictionary creation means.
FIG. 4 is an explanatory diagram of a document vector stored in the document index.
FIG. 5 is a detailed configuration diagram of the information extraction means.
FIG. 6 is a processing flow diagram showing the details of the information extraction processing.
FIG. 7 is an explanatory diagram of the collation patterns used in the in-phrase pattern matching.
FIG. 8 is an explanatory diagram of a text analysis result produced by the text analysis means.
FIG. 9 is an explanatory diagram of the information extracted by the in-phrase pattern matching.
FIG. 10 is an explanatory diagram of the collation patterns used in the in-sentence pattern matching.
FIG. 11 is an explanatory diagram of the information extracted by the in-sentence pattern matching.
FIG. 12 is an explanatory diagram of the collation patterns used in the dependency pattern matching.
FIG. 13 is an explanatory diagram of the information extracted by the dependency pattern matching.
FIG. 14 is a detailed configuration diagram of the analysis means.
FIG. 15 is a processing flow diagram showing the details of the text analysis processing.
FIG. 16 is an explanatory diagram of an example of the analysis condition input screen.
FIG. 17 is an explanatory diagram of an example of a graph output by the output means.
FIG. 18 is a block diagram of a conventional text mining apparatus.
FIG. 19 is an explanatory diagram of an example of the analysis support function of the conventional apparatus.

Explanation of symbols

101: Text analysis means
102: Concept dictionary creation means
103: Concept dictionary
104: Registration means
105: Document index
106: Information extraction means
107: Collation pattern
108: Extracted information index
109: Attribute database
110: Analysis means
121: Document
122: Analysis input
123: Analysis information
501: In-phrase pattern matching means
502: In-sentence pattern matching means
503: Dependency pattern matching means
1401: Analysis condition input means
1402: Frequency counting means
1403: Relevance calculation means
1404: Output means

Claims

A text mining apparatus comprising registration means, text analysis means, an attribute database, concept dictionary creation means, a concept dictionary, collation pattern storage means, information extraction means, extracted information index storage means, document index storage means, and analysis means, wherein:
the registration means reads an analysis target document;
the text analysis means analyzes the text of the analysis target document read by the registration means, divides it into words, and extracts the co-occurrence frequencies of words and the dependency relations between words;
the attribute database stores attribute information made up of items that can narrow down the analysis target documents and the content of the analysis processing, such as customer information like age and gender given to questionnaire documents and the model name and failure derivation date given to failure case documents;
the concept dictionary creation means creates concept dictionary data by calculating a concept vector for each word by singular value decomposition from the co-occurrence frequencies of the words divided by the text analysis means;
the concept dictionary stores the concept dictionary data created by the concept dictionary creation means;
the registration means further reads from the concept dictionary the concept vector corresponding to each word divided by the text analysis means and synthesizes these concept vectors for each sentence to generate a document vector, which is one item of document index information, and the document index storage means stores the document vector generated by the registration means;
the collation pattern storage means stores a collation pattern that is created in advance and has an analysis type as an analysis condition, a word related to the analysis type, and an evaluation value of the word for the analysis type;
the information extraction means extracts the word, the analysis type, and the evaluation value for the analysis type when a word obtained by the text analysis means analyzing the analysis target document is present in a collation pattern stored in the collation pattern storage means;
the extracted information index storage means stores the word extracted by the information extraction means, the analysis type, the evaluation value for the analysis type, and the document id of the corresponding document; and
the analysis means comprises:
analysis condition input means with which an analyst inputs analysis conditions;
frequency counting means that obtains, from the document index storage means and with reference to the attribute database and the concept dictionary, a word that matches the analysis conditions input through the analysis condition input means and words that tend to co-occur with that word;
relevance calculation means that extracts from the extracted information index storage means, and totals, the evaluation value of each word for the word matching the analysis conditions and the co-occurring words obtained by the frequency counting means; and
output means that shapes the information obtained by the relevance calculation means into a graph or table format and outputs it.

The text mining apparatus according to claim 1, wherein the collation pattern created in advance and stored in the collation pattern storage means is an in-phrase collation pattern having an analysis type as an analysis condition, a word related to the analysis type, an evaluation value of the word for the analysis type, and a search target range that is within a phrase, and
the information extraction means has in-phrase pattern matching means that uses the in-phrase collation pattern stored in the collation pattern storage means to extract, from the words within a phrase of the text analysis result produced by the text analysis means, the analysis type as the analysis condition, the word related to the analysis type, and the evaluation value of the word for the analysis type.

The text mining apparatus according to claim 1, wherein the collation pattern created in advance and stored in the collation pattern storage means is an in-sentence collation pattern having an analysis type as an analysis condition, a plurality of words related to the analysis type, an evaluation value of the plurality of words for the analysis type, and a search target range that is within a sentence, and
the information extraction means has in-sentence pattern matching means that uses the in-sentence collation pattern stored in the collation pattern storage means to extract, from the words within a sentence of the text analysis result produced by the text analysis means, the analysis type as the analysis condition, the words related to the analysis type, and the evaluation value of the words for the analysis type.

The text mining apparatus according to any one of claims 1 to 3, wherein the collation pattern created in advance and stored in the collation pattern storage means is a dependency collation pattern having an analysis type as an analysis condition, a plurality of words related to the analysis type, a dependency relation between the plurality of words, an evaluation value for the analysis type based on that dependency relation, and a search target range that is within a sentence, and
the information extraction means has dependency pattern matching means that uses the dependency collation pattern stored in the collation pattern storage means to extract, from the words within a sentence of the text analysis result produced by the text analysis means, the analysis type as the analysis condition, the dependency relation between the words related to the analysis type, and the evaluation value for the analysis type based on that dependency relation.

A text mining method comprising:
a registration step of reading an analysis target document by registration means;
a text analysis step of analyzing, by text analysis means, the text of the analysis target document read in the registration step, dividing it into words, and extracting the co-occurrence frequencies of words and the dependency relations between words;
a concept dictionary creation step of calculating a concept vector for each word by singular value decomposition from the co-occurrence frequencies of the words divided in the text analysis step, and storing the resulting concept dictionary data in a concept dictionary;
a document vector generation step of reading from the concept dictionary, by document vector generation means, the concept vector corresponding to each word divided in the text analysis step, synthesizing these concept vectors for each sentence to generate a document vector, which is one item of document index information, and storing the document vector in a document index;
an information extraction step of using a collation pattern that is created in advance and stored, the collation pattern having an analysis type as an analysis condition, a word related to the analysis type, and an evaluation value of the word for the analysis type, and, when a word obtained by analyzing the analysis target document in the text analysis step is present in the collation pattern, extracting the word, the analysis type, and the evaluation value for the analysis type by information extraction means and storing them in extracted information index storage means together with the document id of the corresponding document; and
an analysis step of obtaining an analysis result by analysis means with reference to an attribute database that stores attribute information given to the analysis target documents, such as customer information for questionnaire documents and the model name and failure derivation date for failure case documents, to the document vectors stored in the document index storage means, and to the information extracted in the information extraction step,
wherein the analysis step has a frequency counting step of obtaining, from the document index storage means and with reference to the attribute database and the concept dictionary, a word that matches the analysis conditions input by an analyst through analysis condition input means and words that tend to co-occur with that word,
a relevance calculation step of extracting from the extracted information index storage means the evaluation value of each word for the word matching the analysis conditions and the co-occurring words obtained in the frequency counting step, and
an output step of shaping the information obtained in the relevance calculation step into a graph or table format and outputting it.

A text mining program that causes a computer to execute:
a registration step of reading an analysis target document by registration means;
a text analysis step of analyzing, by text analysis means, the text of the analysis target document read in the registration step, dividing it into words, and extracting the co-occurrence frequencies of words and the dependency relations between words;
a concept dictionary creation step of calculating a concept vector for each word by singular value decomposition from the co-occurrence frequencies of the words divided in the text analysis step, and storing the resulting concept dictionary data in a concept dictionary;
a document vector generation step of reading from the concept dictionary, by document vector generation means, the concept vector corresponding to each word divided in the text analysis step, synthesizing these concept vectors for each sentence to generate a document vector, which is one item of document index information, and storing the document vector in a document index;
an information extraction step of using a collation pattern that is created in advance and stored, the collation pattern having an analysis type as an analysis condition, a word related to the analysis type, and an evaluation value of the word for the analysis type, and, when a word obtained by analyzing the analysis target document in the text analysis step is present in the collation pattern, extracting the word, the analysis type, and the evaluation value for the analysis type by information extraction means and storing them in extracted information index storage means together with the document id of the corresponding document; and
an analysis step of obtaining an analysis result by analysis means with reference to an attribute database that stores attribute information given to the analysis target documents, such as customer information for questionnaire documents and the model name and failure derivation date for failure case documents, to the document vectors stored in the document index storage means, and to the information extracted in the information extraction step,
wherein the analysis step has a frequency counting step of obtaining, from the document index storage means and with reference to the attribute database and the concept dictionary, a word that matches the analysis conditions input by an analyst through analysis condition input means and words that tend to co-occur with that word,
a relevance calculation step of extracting from the extracted information index storage means the evaluation value of each word for the word matching the analysis conditions and the co-occurring words obtained in the frequency counting step, and
an output step of shaping the information obtained in the relevance calculation step into a graph or table format and outputting it.