JP6781123B2

JP6781123B2 - Data processing equipment, data processing method and data processing program

Info

Publication number: JP6781123B2
Application number: JP2017172062A
Authority: JP
Inventors: 須永　聡; 鎮成齋藤; 山人原田; 宮尾　浩
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-09-07
Filing date: 2017-09-07
Publication date: 2020-11-04
Anticipated expiration: 2037-09-07
Also published as: JP2019046414A

Description

The present invention relates to a data processing apparatus, a data processing method, and a data processing program.

Conventionally, related words for a target word have had to be extracted manually by experts. Because this takes time, it is difficult to keep a related-word dictionary updated with the latest information. Automating the extraction of such related words is therefore desirable.

Accordingly, in the field of natural language processing, techniques have been proposed for automatically extracting related terms for a target word from document data. One conventional method treats a word that appears in the same sentence as a given word in a document, or in the text around it, as co-occurring with that word, and extracts it as a related word on the grounds that it is connected with and relevant to the given word. Other proposals include a method that extracts related words from documents in a field matching the target field even when the related words do not necessarily co-occur within a single document (see, for example, Patent Document 1), and a method that selects highly important words, determines the degree of relevance between those important words, and extracts closely related words (see, for example, Patent Document 2).

Patent Document 1: Japanese Unexamined Patent Publication No. 2004-361992
Patent Document 2: Japanese Unexamined Patent Publication No. 2003-167894

However, the conventional methods treat every co-occurring word as a related word candidate. As a result, even when related words are narrowed down by number of occurrences and frequency of appearance, highly abstract concept words are still included, many words with little or no relevance are mixed in among the related words, and the accuracy of related word extraction is low.

The present invention has been made in view of the above, and its object is to provide a data processing apparatus, a data processing method, and a data processing program capable of extracting related words from document data with high accuracy.

To solve the above problem and achieve the object, a data processing apparatus according to the present invention includes: an acquisition unit that extracts, from document data, related word candidates related to target words based on co-occurrence of words and acquires a related word candidate group for each target word; a counting unit that counts, for each related word candidate included in the plurality of related word candidate groups, the number of occurrences within the plurality of related word candidate groups; and a related word determination unit that excludes, from the plurality of related word candidate groups, the related word candidates whose counted number of occurrences is equal to or greater than a predetermined threshold and determines that the remaining related word candidates are related words of the target words.

According to the present invention, related words can be extracted from document data with high accuracy.

FIG. 1 is a diagram schematically showing an example of the configuration of a data processing apparatus according to an embodiment.
FIG. 2 is a diagram illustrating the processing flow of the data processing apparatus shown in FIG. 1.
FIG. 3 is a diagram illustrating the processing flow of the data processing apparatus shown in FIG. 1.
FIG. 4 is a flowchart showing the processing procedure of a data processing method according to the embodiment.
FIG. 5 is a flowchart showing the processing procedure of the related word candidate appearance counting process shown in FIG. 4.
FIG. 6 is a diagram showing an example of a computer that realizes the data processing apparatus by executing a program.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings. The present invention is not limited by this embodiment. In the drawings, identical parts are denoted by identical reference numerals.

[Embodiment]
An embodiment of the present invention will be described. The embodiment assumes that the input is digitized text document data. In this embodiment, related word candidates for each target word are extracted from the document data based on co-occurrence of words, giving one related word candidate group per target word. Candidates whose number of occurrences across the plurality of related word candidate groups is equal to or greater than a threshold are then excluded, and the remaining candidates are taken as the related words of the target word. A target word is a word subjected to the related word extraction process. A related word is a word extracted from the document data based on co-occurrence as being related to a target word.

[Configuration of the data processing apparatus]
First, the configuration of the data processing apparatus according to the embodiment will be described. FIG. 1 is a diagram schematically showing an example of the configuration of the data processing apparatus according to the embodiment. As shown in FIG. 1, the data processing apparatus 1 includes an input unit 11, an output unit 12, a communication unit 13, a control unit 14, and a storage unit 15.

The input unit 11 is an input interface that receives various operations from the operator of the data processing apparatus 1. For example, the input unit 11 is composed of input devices such as a touch panel, a voice input device, a keyboard, and a mouse.

The communication unit 13 is a communication interface that transmits and receives various information to and from other devices connected via a network or the like. The communication unit 13 is realized by a NIC (Network Interface Card) or the like, and handles communication between the control unit 14 (described later) and other devices over a telecommunication line such as a LAN (Local Area Network) or the Internet. For example, the communication unit 13 receives electronic document file data via the network and outputs it to the control unit 14. The communication unit 13 also outputs information indicating the technical terms generated by the control unit 14 to an external device via the network.

The output unit 12 is realized by, for example, a display device such as a liquid crystal display, a printing device such as a printer, or an information communication device, and outputs information such as the related words of the target words generated by the control unit 14.

The control unit 14 controls the data processing apparatus 1 as a whole. The control unit 14 is, for example, an electronic circuit such as a CPU (Central Processing Unit) or an MPU (Micro Processing Unit), or an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). The control unit 14 has an internal memory for storing programs that define various processing procedures and control data, and executes each process using the internal memory. The control unit 14 functions as various processing units when the various programs run. The control unit 14 includes a related word candidate group acquisition unit 141 (acquisition unit), a related word candidate appearance counting unit 142 (counting unit), and a related word determination unit 143.

The related word candidate group acquisition unit 141 extracts, from the document data, related word candidates related to the target words based on co-occurrence of words, and acquires a related word candidate group for each target word. The target words are the nouns and compound nouns extracted from the document data by morphological analysis. For each target word, the related word candidates are the nouns and compound nouns extracted by morphological analysis from the parts of the document data that contain that target word, and the candidates are collected per target word into a related word candidate group. Specifically, the related word candidate group acquisition unit 141 extracts nouns and compound nouns from the document data to be processed by morphological analysis and creates a target word list containing each extracted word as a target word. Then, for each word in the target word list, it extracts the parts of the document data that contain the word. From the extracted parts it again extracts nouns and compound nouns by morphological analysis, treats them as related word candidates, and acquires a related word candidate group for each target word.
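The acquisition step described above can be sketched as follows. This is a minimal illustration, not the apparatus's actual implementation: the function name `acquire_candidate_groups` and the `extract_nouns` callback are hypothetical, the callback stands in for a real morphological analyzer, and dropping the target word from its own group is an assumption made to match the example groups given later in the description.

```python
from typing import Callable, Dict, List, Set


def acquire_candidate_groups(
    lines: List[str],
    extract_nouns: Callable[[str], List[str]],
) -> Dict[str, Set[str]]:
    """Build one related-word candidate group per target word."""
    # Nouns found in each line (stand-in for the morphological analysis step).
    nouns_per_line = [extract_nouns(line) for line in lines]

    # Target word list: every noun in the document, in first-seen order.
    targets: List[str] = []
    for nouns in nouns_per_line:
        for noun in nouns:
            if noun not in targets:
                targets.append(noun)

    groups: Dict[str, Set[str]] = {}
    for target in targets:
        group: Set[str] = set()  # a set removes duplicate candidates
        for nouns in nouns_per_line:
            if target in nouns:  # this line contains the target word
                group.update(nouns)
        group.discard(target)  # assumption: a word is not its own candidate
        groups[target] = group
    return groups
```

For a quick experiment, `str.split` can be passed as `extract_nouns` on whitespace-separated toy input; real Japanese text would need an actual morphological analyzer in its place.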

The related word candidate appearance counting unit 142 counts, for each related word candidate included in the plurality of related word candidate groups acquired by the related word candidate group acquisition unit 141, the number of occurrences within the plurality of related word candidate groups.
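Because duplicates are removed within each group, a candidate's number of occurrences equals the number of groups that contain it. A minimal sketch of this count, with the hypothetical function name `count_group_appearances` and groups represented as Python sets:

```python
from collections import Counter
from typing import Dict, Set


def count_group_appearances(groups: Dict[str, Set[str]]) -> Counter:
    """Count, for every candidate, how many candidate groups contain it."""
    counts: Counter = Counter()
    for group in groups.values():
        # Each group is a set, so it adds at most 1 per candidate.
        counts.update(group)
    return counts
```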

The related word determination unit 143 excludes, from the plurality of related word candidate groups, the related word candidates whose counted number of occurrences is equal to or greater than a predetermined threshold, and determines that the remaining related word candidates are related words of the target words. The related word determination unit 143 includes a related word candidate extraction unit 144, a related word candidate exclusion unit 145, and a related word data storage unit 146.

The related word candidate extraction unit 144 extracts, from the plurality of related word candidate groups, the related word candidates whose number of occurrences counted by the related word candidate appearance counting unit 142 is equal to or greater than the predetermined threshold. In doing so, the related word candidate extraction unit 144 uses a threshold set according to the document data to be processed.

The threshold is changed according to the document data to be processed. For example, the threshold is set as appropriate according to the amount of document data, its content, the period over which it was created, and so on. The threshold may also be set by simulation, based on the processing results accumulated from past document data processing by the data processing apparatus 1 and on the field, data amount, creation period, and so on of the document data to be processed. For example, the threshold is one half of the total number of target words. It may instead be one third or one quarter of the total number of target words. The threshold need not be based on the total number of target words; it may be set according to the total number of related word candidate groups or the number of related word candidates included in each group. The control unit 14 changes the threshold when, for example, the input unit 11 receives instruction information for setting or changing the threshold. Alternatively, the related word candidate extraction unit 144 may change the threshold according to a predetermined rule.
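A threshold expressed as a fraction of the total number of target words might be computed as in the sketch below. The helper name `threshold_from_ratio` is hypothetical. Rounding up is a convenience rather than something the description specifies: since occurrence counts are integers, "count is at least N/2" and "count is at least ceil(N/2)" select the same candidates.

```python
from fractions import Fraction
from math import ceil


def threshold_from_ratio(num_targets: int, ratio: Fraction = Fraction(1, 2)) -> int:
    """Smallest integer count that is >= ratio * num_targets."""
    if num_targets < 0 or ratio < 0:
        raise ValueError("num_targets and ratio must be non-negative")
    # Fraction keeps the product exact, so ceil never suffers float error.
    return ceil(num_targets * ratio)
```

With five target words and the default ratio of one half, the threshold is 3. Passing `Fraction(1, 3)` or `Fraction(1, 4)` gives the alternative settings mentioned in the text.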

The related word candidate exclusion unit 145 excludes the related word candidates extracted by the related word candidate extraction unit 144 from the plurality of related word candidate groups. The related word data storage unit 146 determines that the related word candidates remaining after the exclusion by the related word candidate exclusion unit 145 are related words of the target words, and stores them in the storage unit 15 as related word data 154 (described later).
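The extraction and exclusion steps amount to a filter followed by a set difference. The sketch below assumes the groups are sets keyed by target word and the counts are a plain mapping from candidate to count. The function name `determine_related_words` is hypothetical, and returning a dictionary stands in for writing to the storage unit.

```python
from typing import Dict, Mapping, Set


def determine_related_words(
    groups: Dict[str, Set[str]],
    counts: Mapping[str, int],
    threshold: int,
) -> Dict[str, Set[str]]:
    """Drop candidates that occur in too many groups; keep the rest."""
    # Extraction: candidates whose count reaches the threshold.
    common = {word for word, count in counts.items() if count >= threshold}
    # Exclusion: remove those candidates from every group.
    return {target: group - common for target, group in groups.items()}
```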

The storage unit 15 is a storage device such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), or an optical disk. The storage unit 15 may instead be a rewritable semiconductor memory such as a RAM (Random Access Memory), a flash memory, or an NVSRAM (Non Volatile Static Random Access Memory). The storage unit 15 stores the OS (Operating System) and various programs executed by the data processing apparatus 1, as well as various information used in executing the programs. The storage unit 15 stores document data 151, count data 152, threshold data 153, and related word data 154.

The document data 151 is digitized text document data and includes the document files to be processed by the data processing apparatus 1. The count data 152 associates each related word candidate with the count obtained for it by the related word candidate appearance counting unit 142. The threshold data 153 indicates the threshold, which can be changed according to the document file to be processed. The related word data 154 associates each target word with its related words.

[Data processing flow]
Next, the processing flow in the data processing apparatus 1 will be described in detail. FIGS. 2 and 3 are diagrams illustrating the processing flow of the data processing apparatus shown in FIG. 1.

First, the processing up to the acquisition of the related word candidate groups of the target words will be described with reference to FIG. 2. As shown in FIG. 2, the related word candidate group acquisition unit 141 extracts nouns and compound nouns from the electronic file document 151-1 to be processed by morphological analysis, and creates a target word list L_1 containing each extracted word as a target word (see (1) in FIG. 2). For example, the target word list L_1 lists the target words in the order "place division" (所分割), "emergency call", "transmission". In the following description, the target word list L_1 is assumed to contain N target words.

Then, the related word candidate group acquisition unit 141 converts the text of the electronic file document 151-1 into a one-sentence-per-line file F_1 with one sentence on each line, a one-paragraph-per-line file F_2 with one paragraph on each line, or a one-subsection-per-line file with one subsection on each line (see (2) in FIG. 2).
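The conversion into one-sentence-per-line and one-paragraph-per-line forms could look like the sketch below. The function names are hypothetical, and the splitting rules are assumptions: sentences end at a Japanese or ASCII full stop, exclamation mark, or question mark, and paragraphs are separated by blank lines. The description does not say how the apparatus detects these boundaries, and the one-subsection-per-line form is omitted because its delimiter is not described.

```python
import re
from typing import List


def to_sentence_lines(text: str) -> List[str]:
    """Return the text as a list with one sentence per entry."""
    # Split right after a sentence-ending mark, swallowing trailing whitespace.
    parts = re.split(r"(?<=[。．.!?！？])\s*", text)
    return [part.strip() for part in parts if part.strip()]


def to_paragraph_lines(text: str) -> List[str]:
    """Return the text as a list with one paragraph per entry."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse line breaks inside a paragraph into single spaces.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]
```

Writing each returned list to a file with one entry per line would give counterparts of F_1 and F_2. This naive splitter also breaks at decimal points and abbreviations, so it is only suitable for illustration.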

Next, the case of acquiring the related word candidate group for the first target word in the target word list L_1, "place division", will be described. In this case, the related word candidate group acquisition unit 141 extracts the lines containing "place division" from the one-sentence-per-line file F_1, the one-paragraph-per-line file F_2, and the one-subsection-per-line file (see (3) in FIG. 2). For example, it extracts from the one-sentence-per-line file F_1 the sentence "A bug occurred in which transmission of an emergency call fails under place division." In this way, the related word candidate group acquisition unit 141 extracts a plurality of lines containing "place division", as shown in data P_1.

Then, the related word candidate group acquisition unit 141 extracts nouns and compound nouns by morphological analysis from the extracted lines containing "place division", removes duplicates, and takes the result as the related word candidates of "place division" (see (4) in FIG. 2). That is, it acquires the set of related word candidates including "emergency call", "transmission", "failure", "bug", and "occurrence" as the related word candidate group G_1 of the target word "place division".

The related word candidate group acquisition unit 141 then proceeds to the second target word in the target word list L_1, "emergency call", and repeats for it the processing described in (3) and (4) of FIG. 2 (see (5) in FIG. 2). It thereby acquires the related word candidate group of the target word "emergency call". By repeating the processing described in (3) and (4) of FIG. 2 for each target word in the target word list L_1, a related word candidate group is acquired for every target word. The related word candidate group acquisition unit 141 thus acquires related word candidate groups G_1 to G_N for target words 1 to N, respectively.

Next, the processing of the related word candidate appearance counting unit 142 and the related word determination unit 143 will be described with reference to FIG. 3. As shown in FIG. 3, once the related word candidate group acquisition unit 141 has acquired the related word candidate groups G_1 to G_N for the N target words (see (6) in FIG. 3), the related word candidate appearance counting unit 142 counts, for each target word, the number of occurrences of each of its related word candidates across all related word candidate groups G_1 to G_N (see (7) in FIG. 3).

Then, the related word candidate extraction unit 144 extracts the related word candidates that appear in common in at least a certain number (for example, half the total number of target words) of the related word candidate groups G_1 to G_N (see (8) in FIG. 3). In the upper frame of FIG. 3, the words (related word candidates) that appear in at least that number of the related word candidate groups G_1 to G_N are marked with a star on their right side.

For example, from the related word candidate group G_1 of "place division", the related word candidate extraction unit 144 extracts "transmission", "failure", "bug", and "occurrence" as related word candidates that appear in common in at least half the total number of target words' worth of the related word candidate groups G_1 to G_N (see word group G_1' in the center frame of FIG. 3). From the related word candidate group G_2 of "emergency call", it likewise extracts "function", "transmission", and "occurrence" (see word group G_2' in the center frame of FIG. 3). From the related word candidate group G_3 of "transmission", it likewise extracts "bug", "failure", and "function" (see word group G_3' in the center frame of FIG. 3).

Subsequently, the related word candidate exclusion unit 145 excludes the words extracted by the related word candidate extraction unit 144, that is, the words that appear in common in at least the certain number of the related word candidate groups G_1 to G_N, and the related word data storage unit 146 takes the remaining related word candidates as the related words of each target word (see (9) in FIG. 3). The related word data storage unit 146 then stores the related words in the storage unit 15 in association with their target words.

For example, the related word data storage unit 146 stores "emergency call", the candidate remaining in the related word candidate group G_1 of "place division", in the storage unit 15 as the related word K_1 of "place division" (see the lower frame of FIG. 3). It stores "place division" and "number notification", the candidates remaining in the related word candidate group G_2 of "emergency call", as the related words K_2 of "emergency call" (see the lower frame of FIG. 3). It also stores "reception" and "data", the candidates remaining in the related word candidate group G_3 of "transmission", as the related words K_3 of "transmission" (see the lower frame of FIG. 3). As a result, the related words in the lower frame of FIG. 3 are stored in the storage unit 15 as related word data, each associated with its target word.
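The outcome for the three target words discussed in connection with FIG. 3 can be reproduced with set differences. In the sketch below, each group contains only the words the description names for it, and the starred words are entered directly as a set, because the description does not give the counts over all N groups. The variable names and the English glosses of the example words are illustrative.

```python
# Candidate groups G_1 to G_3, limited to the words named in the description.
groups = {
    "place division": {"emergency call", "transmission", "failure", "bug", "occurrence"},
    "emergency call": {"place division", "number notification", "function",
                       "transmission", "occurrence"},
    "transmission": {"bug", "failure", "function", "reception", "data"},
}

# Words starred in FIG. 3: they appear in at least the threshold number of groups.
starred = {"transmission", "failure", "bug", "occurrence", "function"}

# Exclude the starred words; what remains are the related words K_1 to K_3.
related = {target: group - starred for target, group in groups.items()}

for target, words in related.items():
    print(target, "->", sorted(words))
# place division -> ['emergency call']
# emergency call -> ['number notification', 'place division']
# transmission -> ['data', 'reception']
```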

[Processing procedure of the data processing method]
Next, the processing procedure of the data processing method performed by the data processing apparatus 1 shown in FIG. 1 will be described with reference to FIG. 4. FIG. 4 is a flowchart showing the processing procedure of the data processing method according to the embodiment.

First, as shown in FIG. 4, when the control unit 14 reads the document data to be processed, the related word candidate group acquisition unit 141 extracts the related word candidates of the target words from the document data based on co-occurrence of words and acquires a related word candidate group for each target word (step S1). The related word candidate appearance counting unit 142 performs a related word candidate appearance counting process that counts, for each related word candidate included in the plurality of related word candidate groups, the number of occurrences within the plurality of related word candidate groups (step S2).

Subsequently, in the related word determination unit 143, the related word candidate extraction unit 144 refers to the threshold corresponding to the document data (step S3), and extracts from the plurality of related word candidate groups the related word candidates whose number of occurrences counted by the related word candidate appearance counting unit 142 is equal to or greater than that threshold (step S4). The related word candidate exclusion unit 145 then excludes the related word candidates extracted by the related word candidate extraction unit 144 from the plurality of related word candidate groups (step S5).

Then, the related word data storage unit 146 determines that the related word candidates remaining after the exclusion by the related word candidate exclusion unit 145 are related words of the target words, and stores them in the storage unit 15 as the related word data 154 (step S6).
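Steps S1 to S6 can be chained into one function, sketched below under stated assumptions. The function name `extract_related_words` is hypothetical. The input is a list of already-extracted nouns per line, so morphological analysis is outside the sketch. The threshold is taken as a fraction of the number of target words, which is one of the options the description mentions, and the result is returned rather than stored.

```python
from collections import Counter
from fractions import Fraction
from math import ceil
from typing import Dict, List, Set


def extract_related_words(
    nouns_per_line: List[List[str]],
    ratio: Fraction = Fraction(1, 2),
) -> Dict[str, Set[str]]:
    """Run steps S1 to S6 on pre-extracted nouns (one list per line)."""
    # S1: one candidate group per target word, built from co-occurring nouns.
    targets = {noun for nouns in nouns_per_line for noun in nouns}
    groups = {
        t: {n for nouns in nouns_per_line if t in nouns for n in nouns} - {t}
        for t in targets
    }
    # S2: number of groups in which each candidate appears.
    counts = Counter(c for group in groups.values() for c in group)
    # S3: threshold for this document (assumed: ratio of the target word count).
    threshold = ceil(len(targets) * ratio)
    # S4: candidates that reach the threshold.
    common = {c for c, k in counts.items() if k >= threshold}
    # S5 and S6: exclude them and keep the rest as related words.
    return {t: group - common for t, group in groups.items()}
```

On very small inputs almost every word reaches a threshold of one half, so a toy run may need a larger ratio to leave anything behind.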

[Processing procedure of the related word candidate appearance counting process]
Next, the processing procedure of the related word candidate appearance counting process will be described with reference to FIG. 5. FIG. 5 is a flowchart showing the processing procedure of the related word candidate appearance counting process shown in FIG. 4.

As shown in FIG. 5, the related word candidate appearance counting unit 142 initializes n, the identification number of the target word, to n = 1 (step S11). It then sets the first related word candidate to be counted from among the related word candidates in the related word candidate group G_n of target word n (step S12). When n = 1, the related word candidate appearance counting unit 142 first sets, as the counting target, the first related word candidate in the related word candidate group G_1 for the first target word in the target word list (hereinafter, target word 1).

The related word candidate appearance counting unit 142 then counts the number of occurrences of the counting-target related word candidate in the related word candidate groups G_1 to G_N (step S13). It determines whether the number of occurrences in the related word candidate groups G_1 to G_N has been counted for all related word candidates in the related word candidate group G_n (step S14).

When the related word candidate appearance counting unit 142 determines that the number of occurrences in the related word candidate groups G_1 to G_N has not yet been counted for all related word candidates in the related word candidate group G_n (step S14: No), it sets the next related word candidate to be counted from the related word candidate group G_n for the n-th target word in the target word list (step S15). For example, when counting for the first related word candidate of target word 1 is finished, it sets the second related word candidate in the related word candidate group G_1 as the counting target. In this way, the related word candidate appearance counting unit 142 repeats steps S13 to S15 for all related word candidates in the related word candidate group G_1 of target word 1.

一方、関連語候補出現数カウント部１４２は、関連語候補群Ｇｎの全関連語候補について、各関連語候補群Ｇ_１〜Ｇ_Ｎ中の出現数をカウントしたと判定した場合（ステップＳ１４：Ｙｅｓ）、対象語１〜Ｎまでカウントしたか否かを判定する（ステップＳ１６）。関連語候補出現数カウント部１４２は、対象語１〜Ｎまでカウントしていないと判定した場合（ステップＳ１６：Ｎｏ）、対象語の識別番号ｎに対し、ｎ＝ｎ＋１とする（ステップＳ１７）。具体的には、関連語候補出現数カウント部１４２は、対象語１の関連語候補群Ｇ_１の関連語候補の全てについてカウントを終了した場合には、対象語リストＬ_１の２番目の対象語（以降、対象語２とする。）に進む。そして、関連語候補出現数カウント部１４２は、この対象語２の関連語候補群Ｇ_２の関連語候補について、順次、各関連語候補群Ｇ_１〜Ｇ_Ｎ中の出現数をカウントする。 On the other hand, when the related word candidate appearance number counting unit 142 determines that the number of occurrences in each of the related word candidate groups G_1 to G_N has been counted for all related word candidates in the related word candidate group G_n (step S14: Yes), it determines whether target words 1 to N have all been counted (step S16). When the related word candidate appearance number counting unit 142 determines that target words 1 to N have not all been counted (step S16: No), it increments the identification number n of the target word by setting n = n + 1 (step S17). Specifically, when the related word candidate appearance number counting unit 142 has finished counting all related word candidates in the related word candidate group G_1 of target word 1, it proceeds to the second target word in the target word list L_1 (hereinafter, target word 2). The related word candidate appearance number counting unit 142 then counts, in order, the number of occurrences in each of the related word candidate groups G_1 to G_N for the related word candidates in the related word candidate group G_2 of target word 2.

関連語候補出現数カウント部１４２は、対象語１〜Ｎまでカウントしたと判定した場合には（ステップＳ１６：Ｙｅｓ）、全対象語１〜Ｎの全関連語候補について、各関連語候補群Ｇ_１〜Ｇ_Ｎ中の出現数をカウントしたため、関連語候補出現数カウント処理を終了する。 When the related word candidate appearance number counting unit 142 determines that target words 1 to N have all been counted (step S16: Yes), the number of occurrences in each of the related word candidate groups G_1 to G_N has been counted for all related word candidates of all target words 1 to N, so the unit ends the related word candidate appearance number counting process.
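The counting flow of steps S11 to S17 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation disclosed in the patent: the function name `count_candidate_occurrences`, the representation of each related word candidate group G_n as a list of strings, the sample words, and the reading of "number of occurrences" as the number of groups G_1 to G_N that contain the candidate are all assumptions made for the example.

```python
def count_candidate_occurrences(candidate_groups):
    """Return, for each candidate, the number of groups G_1..G_N containing it.

    candidate_groups: list of lists; candidate_groups[n] plays the role of
    G_(n+1) (hypothetical representation, 0-based indexing).
    """
    counts = {}
    n = 0                                       # S11: initialize the target word index
    while n < len(candidate_groups):            # S16: stop once all target words are done
        for candidate in candidate_groups[n]:   # S12 / S15: first, then next, candidate of G_n
            # S13: count the groups in which this candidate appears
            counts[candidate] = sum(
                1 for group in candidate_groups if candidate in group
            )
        # S14: the for loop ends when every candidate of G_n has been counted
        n += 1                                  # S17: n = n + 1
    return counts


# Hypothetical candidate groups for three target words
groups = [
    ["routing table", "packet", "system"],
    ["filtering rule", "packet", "system"],
    ["index", "system"],
]
counts = count_candidate_occurrences(groups)
```

In this sample, "system" is counted in all three groups, "packet" in two, and each remaining word in one.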

［実施の形態の効果］ [Effects of the Embodiment]
このように、本実施の形態に係るデータ処理装置１は、文書データから、言葉の共起によって対象語の関連語候補を抽出し、対象語それぞれの関連語候補群を取得する。そして、データ処理装置１は、この複数の関連語候補群に含まれる関連語候補ごとに、複数の関連語候補群の中での出現数をカウントし、カウントした出現数が所定の閾値以上である関連語候補を複数の関連語候補群から除外して、残った関連語候補を対象語の関連語としている。 As described above, the data processing device 1 according to the present embodiment extracts related word candidates of the target words from the document data by co-occurrence of words, and acquires a related word candidate group for each target word. The data processing device 1 then counts, for each related word candidate included in the plurality of related word candidate groups, the number of occurrences in the plurality of related word candidate groups, excludes from the plurality of related word candidate groups any related word candidate whose counted number of occurrences is equal to or greater than a predetermined threshold value, and treats the remaining related word candidates as related words of the target word.
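As a rough illustration of the acquisition step summarized above, the following Python sketch collects, for each target word, the words that share a sentence with it. The function name `acquire_candidate_groups`, the use of sentence-level co-occurrence only, the assumption that each sentence has already been reduced to a list of nouns and compound nouns, and the sample data are all hypothetical and chosen only for the example.

```python
def acquire_candidate_groups(sentences, target_words):
    """Build one related word candidate group per target word.

    sentences: list of word lists (assumed to be pre-extracted nouns and
    compound nouns); a word co-occurs with a target word when both appear
    in the same sentence.
    """
    groups = {target: [] for target in target_words}
    for words in sentences:
        for target in target_words:
            if target not in words:
                continue
            for word in words:
                # Skip the target itself and avoid duplicate candidates
                if word != target and word not in groups[target]:
                    groups[target].append(word)
    return groups


# Hypothetical, already tokenized sentences
sentences = [
    ["router", "packet", "routing table"],
    ["firewall", "packet", "filtering rule"],
    ["router", "system"],
]
candidate_groups = acquire_candidate_groups(sentences, ["router", "firewall"])
```

Here the group for "router" gathers "packet", "routing table", and "system", while the group for "firewall" gathers "packet" and "filtering rule".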

ここで、従来の技術では、対象語に関連する関連語を、文書データから、言葉の共起によって自動抽出しているものの、共起した全ての言葉を関連語候補としているため、関連性が低い語、或いは、関連性がない語が関連語に含まれ、関連語群の抽出の精度は低かった。 In the conventional technique, related words for a target word are extracted automatically from document data by co-occurrence of words, but every co-occurring word is treated as a related word candidate. As a result, words with low relevance or no relevance were included among the related words, and the accuracy of extracting the related word group was low.

これらの関連性が低い語、或いは、関連性がない語は、数多く出現し抽象度の比較的高い語と考えられるため、別々の対象語に共通して共起することが多い。そこで、本実施の形態では、対象語それぞれについて取得した関連語候補群に共通して出現する語は、どの対象語にも関連性のある語であるものの、対象語との結びつきが弱い語である場合が多いこと、すなわち、対象語との関連性が低い語である場合が多いことに着目した。 These words with low or no relevance are considered to be frequently occurring words with a relatively high degree of abstraction, so they often co-occur with many different target words. The present embodiment therefore focuses on the following point: words that appear in common across the related word candidate groups acquired for the respective target words are related to every target word to some degree, but they are often only weakly connected to any particular target word, that is, they are often words with low relevance to the target word.

そして、本実施の形態では、これらの関連語候補群に共通して出現する語を取り除くことにより、対象語と関連性の強い語のみを絞り込んでいる。言い換えると、本実施の形態では、対象語それぞれの関連語候補群のうち一定数以上に共通して出現する語を抽出し、これらの語を除外し、対象語と関連性の高い語のみを絞りこむことによって、関連語を抽出している。このため、本実施の形態によれば、従来のデータ処理方法と比較して、関連性が低い語、或いは、関連性がない語を適切に除外することができるため、関連語を高精度に抽出できる。 In the present embodiment, the words that appear in common across these related word candidate groups are removed, which narrows the candidates down to only the words strongly related to the target word. In other words, the present embodiment extracts the words that appear in common in a certain number or more of the related word candidate groups of the respective target words, excludes those words, and obtains the related words by keeping only the words highly related to the target word. Therefore, compared with the conventional data processing method, the present embodiment can appropriately exclude words with low relevance or no relevance, and can extract related words with high accuracy.
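The exclusion step described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the patented implementation: the function name `select_related_words`, the dictionary representation of the candidate groups, the sample words, and the threshold value of 3 are hypothetical, and the number of occurrences is taken to be the number of candidate groups in which a word appears.

```python
from collections import Counter


def select_related_words(candidate_groups, threshold):
    """Drop candidates that appear in `threshold` or more candidate groups.

    candidate_groups: dict mapping each target word to its list of
    related word candidates (hypothetical representation).
    """
    # Number of groups in which each candidate appears (one vote per group)
    group_counts = Counter(
        candidate
        for group in candidate_groups.values()
        for candidate in set(group)
    )
    # Keep only candidates whose count is below the threshold
    return {
        target: [c for c in group if group_counts[c] < threshold]
        for target, group in candidate_groups.items()
    }


# Hypothetical candidate groups; "system" and "function" appear in all three
candidate_groups = {
    "router": ["routing table", "packet", "system", "function"],
    "firewall": ["packet", "filtering rule", "system", "function"],
    "database": ["index", "query", "system", "function"],
}
related = select_related_words(candidate_groups, threshold=3)
```

With a threshold of 3, the generic words "system" and "function" are removed from every group, while "packet", which appears in only two groups, is kept. A lower threshold would also remove "packet", which is one reason the threshold might be tuned to the document data.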

［システム構成等］ [System Configuration, etc.]
図示した各装置の各構成要素は機能概念的なものであり、必ずしも物理的に図示の如く構成されていることを要しない。すなわち、各装置の分散・統合の具体的形態は図示のものに限られず、その全部又は一部を、各種の負荷や使用状況等に応じて、任意の単位で機能的又は物理的に分散・統合して構成することができる。さらに、各装置にて行なわれる各処理機能は、その全部又は任意の一部が、ＣＰＵ及び当該ＣＰＵにて解析実行されるプログラムにて実現され、あるいは、ワイヤードロジックによるハードウェアとして実現され得る。 Each component of each illustrated device is a functional concept and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Further, all or any part of each processing function performed by each device may be realized by a CPU and a program analyzed and executed by the CPU, or may be realized as hardware by wired logic.

また、本実施形態において説明した各処理のうち、自動的に行われるものとして説明した処理の全部又は一部を手動的におこなうこともでき、あるいは、手動的に行なわれるものとして説明した処理の全部又は一部を公知の方法で自動的におこなうこともできる。この他、上記文書中や図面中で示した処理手順、制御手順、具体的名称、各種のデータやパラメータを含む情報については、特記する場合を除いて任意に変更することができる。 Further, among the processes described in the present embodiment, all or a part of the processes described as being automatically performed can be manually performed, or the processes described as being manually performed can be performed. All or part of it can be done automatically by a known method. In addition, the processing procedure, control procedure, specific name, and information including various data and parameters shown in the above document and drawings can be arbitrarily changed unless otherwise specified.

［プログラム］ [Program]
図６は、プログラムが実行されることにより、データ処理装置１が実現されるコンピュータの一例を示す図である。コンピュータ１０００は、例えば、メモリ１０１０、ＣＰＵ１０２０を有する。また、コンピュータ１０００は、ハードディスクドライブインタフェース１０３０、ディスクドライブインタフェース１０４０、シリアルポートインタフェース１０５０、ビデオアダプタ１０６０、ネットワークインタフェース１０７０を有する。これらの各部は、バス１０８０によって接続される。 FIG. 6 is a diagram showing an example of a computer in which the data processing device 1 is realized by executing a program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

メモリ１０１０は、ＲＯＭ（Read Only Memory）１０１１及びＲＡＭ１０１２を含む。ＲＯＭ１０１１は、例えば、ＢＩＯＳ（Basic Input Output System）等のブートプログラムを記憶する。ハードディスクドライブインタフェース１０３０は、ハードディスクドライブ１０９０に接続される。ディスクドライブインタフェース１０４０は、ディスクドライブ１１００に接続される。例えば磁気ディスクや光ディスク等の着脱可能な記憶媒体が、ディスクドライブ１１００に挿入される。シリアルポートインタフェース１０５０は、例えばマウス１１１０、キーボード１１２０に接続される。ビデオアダプタ１０６０は、例えばディスプレイ１１３０に接続される。 The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to the hard disk drive 1090. The disk drive interface 1040 is connected to the disk drive 1100. For example, a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, the display 1130.

ハードディスクドライブ１０９０は、例えば、ＯＳ１０９１、アプリケーションプログラム１０９２、プログラムモジュール１０９３、プログラムデータ１０９４を記憶する。すなわち、データ処理装置１の各処理を規定するプログラムは、コンピュータにより実行可能なコードが記述されたプログラムモジュール１０９３として実装される。プログラムモジュール１０９３は、例えばハードディスクドライブ１０９０に記憶される。例えば、データ処理装置１における機能構成と同様の処理を実行するためのプログラムモジュール１０９３が、ハードディスクドライブ１０９０に記憶される。なお、ハードディスクドライブ１０９０は、ＳＳＤにより代替されてもよい。 The hard disk drive 1090 stores, for example, the OS 1091, the application program 1092, the program module 1093, and the program data 1094. That is, the program that defines each process of the data processing device 1 is implemented as a program module 1093 in which a code that can be executed by a computer is described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, a program module 1093 for executing processing similar to the functional configuration in the data processing device 1 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD.

また、上述した実施形態の処理で用いられる設定データは、プログラムデータ１０９４として、例えばメモリ１０１０やハードディスクドライブ１０９０に記憶される。そして、ＣＰＵ１０２０が、メモリ１０１０やハードディスクドライブ１０９０に記憶されたプログラムモジュール１０９３やプログラムデータ１０９４を必要に応じてＲＡＭ１０１２に読み出して実行する。 Further, the setting data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, a memory 1010 or a hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 into the RAM 1012 and executes them as needed.

なお、プログラムモジュール１０９３やプログラムデータ１０９４は、ハードディスクドライブ１０９０に記憶される場合に限らず、例えば着脱可能な記憶媒体に記憶され、ディスクドライブ１１００等を介してＣＰＵ１０２０によって読み出されてもよい。あるいは、プログラムモジュール１０９３及びプログラムデータ１０９４は、ネットワーク（ＬＡＮ、ＷＡＮ（Wide Area Network）等）を介して接続された他のコンピュータに記憶されてもよい。そして、プログラムモジュール１０９３及びプログラムデータ１０９４は、他のコンピュータから、ネットワークインタフェース１０７０を介してＣＰＵ１０２０によって読み出されてもよい。 The program module 1093 and the program data 1094 are not limited to those stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN, WAN (Wide Area Network), etc.). Then, the program module 1093 and the program data 1094 may be read by the CPU 1020 from another computer via the network interface 1070.

以上、本発明者によってなされた発明を適用した実施形態について説明したが、本実施形態による本発明の開示の一部をなす記述及び図面により本発明は限定されることはない。すなわち、本実施形態に基づいて当業者等によりなされる他の実施形態、実施例及び運用技術等は全て本発明の範疇に含まれる。 Although the embodiment to which the invention made by the present inventor is applied has been described above, the present invention is not limited by the description and the drawings which form a part of the disclosure of the present invention according to the present embodiment. That is, all other embodiments, examples, operational techniques, and the like made by those skilled in the art based on the present embodiment are included in the scope of the present invention.

１ データ処理装置 1 Data processing device
１１ 入力部 11 Input unit
１２ 出力部 12 Output unit
１３ 通信部 13 Communication unit
１４ 制御部 14 Control unit
１５ 記憶部 15 Storage unit
１４１ 関連語候補群取得部 141 Related word candidate group acquisition unit
１４２ 関連語候補出現数カウント部 142 Related word candidate appearance number counting unit
１４３ 関連語判定部 143 Related word judgment unit
１４４ 関連語候補抽出部 144 Related word candidate extraction unit
１４５ 関連語候補除外部 145 Related word candidate exclusion unit
１４６ 関連語データ格納部 146 Related word data storage unit
１５１ 文書データ 151 Document data
１５２ カウントデータ 152 Count data
１５３ 閾値データ 153 Threshold data
１５４ 関連語データ 154 Related word data

Claims

A data processing device comprising:
an acquisition unit that extracts, from document data, related word candidates of target words by co-occurrence of words, and acquires a related word candidate group for each of the target words;
a counting unit that counts, for each related word candidate included in the plurality of related word candidate groups, the number of occurrences in the plurality of related word candidate groups; and
a related word judgment unit that excludes, from the plurality of related word candidate groups, related word candidates whose number of occurrences counted by the counting unit is equal to or greater than a predetermined threshold value, and determines the remaining related word candidates to be related words of the target word.

The data processing device according to claim 1, wherein the acquisition unit, from among the nouns and compound nouns extracted from the document data using morphological analysis, takes as related word candidates, for each target word, the nouns and compound nouns extracted from the portion of the document data that includes the target word, and acquires related word candidate groups in which the related word candidates are grouped for each target word.

The data processing device according to claim 1 or 2, wherein the predetermined threshold value is changed according to the document data.

A data processing method performed by a data processing device, the method comprising:
a step of extracting, from document data, related word candidates of target words by co-occurrence of words, and acquiring a related word candidate group for each of the target words;
a step of counting, for each related word candidate included in the plurality of related word candidate groups, the number of occurrences in the plurality of related word candidate groups; and
a step of excluding, from the plurality of related word candidate groups, related word candidates whose counted number of occurrences is equal to or greater than a predetermined threshold value, and determining that the related word candidates remaining in each related word candidate group are related words of the target word corresponding to that related word candidate group.

A data processing program that causes a computer to execute:
a step of extracting, from document data, related word candidates of target words by co-occurrence of words, and acquiring a related word candidate group for each of the target words;
a step of counting, for each related word candidate included in the plurality of related word candidate groups, the number of occurrences in the plurality of related word candidate groups; and
a step of excluding, from the plurality of related word candidate groups, related word candidates whose counted number of occurrences is equal to or greater than a predetermined threshold value, and determining that the related word candidates remaining in each related word candidate group are related words of the target word corresponding to that related word candidate group.