JP2009169689A - Data classification method and data processing apparatus

Info

Publication number: JP2009169689A
Application number: JP2008007223A
Authority: JP
Inventors: Tetsuro Takahashi; Aoshi Okamoto
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2008-01-16
Filing date: 2008-01-16
Publication date: 2009-07-30
Anticipated expiration: 2028-01-16
Also published as: JP5194818B2

Abstract

PROBLEM TO BE SOLVED: To shorten the processing time of similarity calculation by efficiently calculating the similarity between document data.

SOLUTION: A data processing apparatus 100 creates an inverted index 150d that associates each word ID in the document data with the document data containing the word corresponding to that word ID, and executes sequential pattern extraction on the inverted index 150d. The apparatus determines the occurrence frequency of the combinations of document data that appear as a result of the sequential pattern extraction (in particular, only the occurrence frequency of patterns of length 2), and calculates the similarity between the document data from the result of that determination.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a data processing apparatus that classifies document data stored in a storage device according to the similarity between the document data, and to a data classification method for that apparatus.

Conventionally, in a document search system composed of a server device and a client device, when the server device acquires a search keyword from the client device, for example, it searches the plurality of document data stored in a database for document data corresponding to the search keyword and provides the search result to the client device.

In recent years, the number of document data stored in databases has increased dramatically, and the number of document data retrieved for a search keyword has become enormous. To make the retrieved document data easier for the user to browse, the similarity between the document data is calculated using a technique such as the vector space model (see, for example, Non-Patent Document 1), and the document data are classified into groups of similar document data.

For example, in the vector space model described in Non-Patent Document 1, the presence or absence of each of a plurality of keywords is converted into a vector for each document data, and the similarity is calculated for every combination of the converted vectors.

Non-Patent Document 1: G. Salton, A. Wong, and C. S. Yang (Cornell University), "A Vector Space Model for Automatic Indexing", Communications of the ACM, November 1975, Volume 18, Number 11

However, when the similarity of document data is calculated with the vector space model as in the conventional technique described above, the calculation is performed for every combination of document data. The amount of computation therefore grows with the square of the number of document data, and the processing time required for the similarity calculation becomes long.
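The quadratic cost described above can be illustrated with a minimal sketch. The function names, the use of a plain dot product over binary keyword vectors as the similarity, and the sample documents are assumptions made for illustration; they are not taken from Non-Patent Document 1 or from this specification.

```python
from itertools import combinations


def binary_vector(doc_words, vocabulary):
    """Return a 0/1 vector marking which vocabulary words the document contains."""
    return [1 if word in doc_words else 0 for word in vocabulary]


def all_pairs_similarity(docs):
    """Compare every pair of documents (hypothetical baseline).

    docs maps a document ID to the set of keywords (or word IDs) it contains.
    The loop visits n * (n - 1) / 2 pairs, so the work grows with the square
    of the number of documents.
    """
    vocabulary = sorted(set().union(*docs.values()))
    vectors = {doc_id: binary_vector(words, vocabulary) for doc_id, words in docs.items()}
    similarities = {}
    for a, b in combinations(sorted(vectors), 2):
        # Dot product of binary vectors = number of shared keywords (assumed measure).
        similarities[(a, b)] = sum(x * y for x, y in zip(vectors[a], vectors[b]))
    return similarities


# Hypothetical sample data: word IDs contained in three documents.
sample_docs = {"A": {1, 2, 7, 10}, "B": {2, 5, 7, 9}, "C": {2, 5, 7, 9, 10}}
pair_similarities = all_pairs_similarity(sample_docs)
```

Every pair is visited even when two documents share no keyword at all, which is the inefficiency the invention aims to avoid.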

In other words, shortening the processing time required for similarity calculation by calculating the similarity of document data efficiently is an extremely important issue.

The present invention has been made to solve the above-described problems of the prior art, and its object is to provide a data classification method and a data processing apparatus that can calculate the similarity of document data efficiently and thereby shorten the processing time required for the similarity calculation.

To solve the above-described problems and achieve the object, this data classification method is a data classification method for a data processing apparatus that classifies document data stored in a storage device according to the similarity between the document data. The method includes: a list creation step in which the data processing apparatus reads the document data stored in the storage device and creates a list that associates each keyword in the document data with the document data containing that keyword; a determination step of executing sequential pattern extraction on the list and determining the number of occurrences of each combination of document data that appears; and a calculation step of calculating the similarity between the document data based on the determination result of the determination step.

Further, in the above data classification method, the determination step determines only the number of occurrences of patterns of length 2 in the sequential pattern extraction.

Further, the above data classification method further includes a classification step of classifying the document data by obtaining the combination of document data for which the similarity value is largest, and the classification step classifies the document data so that the number of classes of the document data equals the number of classes specified by the user.

Further, this data processing apparatus is a data processing apparatus that classifies document data stored in a storage device according to the similarity between the document data. The apparatus includes: list creation means for reading the document data stored in the storage device and creating a list that associates each keyword in the document data with the document data containing that keyword; determination means for executing sequential pattern extraction on the list and determining the number of occurrences of each combination of document data that appears; and calculation means for calculating the similarity between the document data based on the determination result of the determination means.

Further, in the above data processing apparatus, the determination means determines only the number of occurrences of patterns of length 2 in the sequential pattern extraction.

According to this data classification method, a list that associates each keyword in the document data with the document data containing that keyword is created, and sequential pattern extraction is executed on the list. The number of occurrences of each combination of document data that appears as a result of the sequential pattern extraction is then determined, and the similarity between the document data is calculated based on the determination result. The similarity of document data can therefore be calculated efficiently, and the processing time required for the similarity calculation can be shortened.

Further, according to this data classification method, the similarity between the document data is calculated by determining only the number of occurrences of patterns of length 2 in the sequential pattern extraction, so the processing load on the apparatus can be greatly reduced.

Further, according to this data classification method, the document data are classified so that the number of classes of the document data equals the number of classes specified by the user, so information can be provided in a form that matches the user's preference.

Preferred embodiments of the data classification method and the data processing apparatus according to the present invention are described in detail below with reference to the accompanying drawings.

First, the outline and features of the data processing apparatus according to this embodiment are described. The data processing apparatus according to this embodiment creates an inverted index list that associates each keyword in the document data with the document data containing that keyword, and executes sequential pattern extraction on the inverted index list.

The data processing apparatus then determines the number of occurrences of each combination of document data that appears as a result of the sequential pattern extraction (in this embodiment, in particular, only the number of occurrences of patterns of length 2 is determined), and calculates the similarity between the document data based on the determination result.

In this way, the data processing apparatus according to this embodiment calculates the similarity between the document data by executing sequential pattern extraction on the inverted index list. The similarity of document data can therefore be calculated efficiently, and the processing time required for the similarity calculation can be shortened.

Next, the configuration of the search system according to this embodiment is described. FIG. 1 is a diagram illustrating the configuration of the search system. As shown in the figure, the search system includes a terminal device 50 and a data processing apparatus 100, which are connected via a network 10.

Of these, the terminal device 50 is a device that, on receiving a search keyword from a user via an input device or the like, transmits the search keyword to the data processing apparatus 100. When the terminal device 50 acquires a search result from the data processing apparatus 100, it displays the acquired search result on a display.

The data processing apparatus 100 is a device that, on acquiring a search keyword from the terminal device 50, searches the storage device for document data corresponding to the search keyword, classifies the document data by executing the sequential pattern extraction described above, and outputs the classified document data to the terminal device 50 as the search result.

The configuration of the data processing apparatus 100 shown in FIG. 1 is now described in detail. FIG. 2 is a functional block diagram illustrating the configuration of the data processing apparatus 100 according to this embodiment. As shown in FIG. 2, the data processing apparatus 100 includes an input unit 110, an output unit 120, a communication control IF unit 130, an input/output control IF unit 140, a storage unit 150, and a control unit 160.

Of these, the input unit 110 is input means for entering various types of information, and is composed of a keyboard, a mouse, a microphone, and the like. The output unit 120 is output means for outputting various types of information, and is composed of a monitor (or a display or touch panel), a speaker, and the like.

The communication control IF unit 130 is means that mainly controls communication with the terminal device 50 (see FIG. 1). The input/output control IF unit 140 is means that controls the input and output of data by the input unit 110, the output unit 120, the communication control IF unit 130, the storage unit 150, and the control unit 160.

The storage unit 150 is storage means that stores the data and programs required for the various kinds of information processing performed by the control unit 160. As the items most closely related to the present invention, it holds, as shown in FIG. 2, document management data 150a, a word index 150b, a word ID management table 150c, an inverted index 150d, a similarity table 150e, and a cluster table 150f.

The document management data 150a is data that stores the various document data. FIG. 3 is a diagram illustrating an example of the data structure of the document management data 150a. As shown in the figure, the document management data 150a stores each document data in association with a document ID that identifies it.

The word index 150b is data that stores each document ID in association with the word IDs (the word ID sequence) contained in the document data identified by that document ID. FIG. 4 is a diagram illustrating an example of the data structure of the word index 150b.

As shown in the figure, the word index 150b stores document IDs and word ID sequences in association with each other. For example, the first row of FIG. 4 registers that the document data identified by the document ID "A" contains the words identified by the word IDs "1, 2, 3".

The word ID management table 150c is a table that stores each word ID in association with the word corresponding to that word ID. FIG. 5 is a diagram illustrating an example of the data structure of the word ID management table 150c.

The inverted index 150d is data that stores each word ID in association with the document IDs of the document data containing the word with that word ID. FIG. 6 is a diagram illustrating an example of the data structure of the inverted index 150d. For example, the first row of FIG. 6 registers that the document data containing the word identified by the word ID "1" is the document data identified by the document ID "A".

The similarity table 150e is a table that stores the similarity between each pair of document data. FIG. 7 is a diagram illustrating an example of the data structure of the similarity table 150e. "A" to "E" in FIG. 7 are document IDs, and each numerical value is a similarity. Referring to FIG. 7, for example, the table registers that the similarity between the document data with document ID "A" and the document data with document ID "B" is "2".

The cluster table 150f is data used when classifying the document data based on the similarity between the document data. FIG. 8 is a diagram illustrating an example of the data structure of the cluster table 150f. "A" to "E" in FIG. 8 are document IDs, and each numerical value is a similarity. The process of classifying document data based on the cluster table 150f is described later.

The control unit 160 is control means that has an internal memory for storing programs defining various processing procedures and control data, and executes various processes using them. As the items most closely related to the present invention, it includes a document data search unit 160a, an inverted index creation unit 160b, a similarity table creation unit 160c, and a clustering processing unit 160d.

Of these, the document data search unit 160a is means that, on receiving a search keyword from the terminal device 50, searches the document management data 150a for document data containing the received search keyword. The document data search unit 160a outputs the document IDs of the retrieved document data to the inverted index creation unit 160b.

The inverted index creation unit 160b is means that acquires document IDs from the document data search unit 160a and creates the inverted index 150d corresponding to the acquired document IDs. Specifically, the inverted index creation unit 160b executes a word index creation process, which creates the word index 150b and the word ID management table 150c, and an inverted index creation process, which creates the inverted index 150d. The two processes are described in that order below.

First, the word index creation process executed by the inverted index creation unit 160b is described. On acquiring document IDs from the document data search unit 160a, the inverted index creation unit 160b acquires the document data corresponding to those document IDs from the document management data 150a and executes morphological analysis on the acquired document data (hereinafter, the document data group).

The inverted index creation unit 160b then assigns a word ID to each word obtained by the morphological analysis of the document data group, and registers each word ID in the word ID management table 150c in association with the corresponding word.

By comparing the word ID management table 150c with the document data group, the inverted index creation unit 160b determines which document data contain each word in the word ID management table 150c, and creates the word index 150b (see FIG. 4).

Next, the inverted index creation process executed by the inverted index creation unit 160b is described. The inverted index creation unit 160b acquires the word index 150b and, for each word ID, determines the document IDs of the document data that contain that word ID, thereby creating the inverted index 150d (see FIG. 6).
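The two creation processes described above can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace splitting stands in for morphological analysis, word IDs are assigned sequentially from 1, and the function names and sample documents are hypothetical rather than taken from the specification.

```python
def build_word_index(documents):
    """Create a word ID table and a word index (document ID -> word IDs).

    documents maps a document ID to its text. str.split() is only a
    stand-in for the morphological analysis mentioned in the text.
    """
    word_ids = {}    # word -> word ID (role of the word ID management table)
    word_index = {}  # document ID -> sorted list of word IDs
    for doc_id, text in documents.items():
        ids_in_doc = set()
        for word in text.split():
            if word not in word_ids:
                word_ids[word] = len(word_ids) + 1  # assumed numbering scheme
            ids_in_doc.add(word_ids[word])
        word_index[doc_id] = sorted(ids_in_doc)
    return word_ids, word_index


def build_inverted_index(word_index):
    """Invert the word index: word ID -> list of document IDs containing it."""
    inverted = {}
    for doc_id in sorted(word_index):
        for word_id in word_index[doc_id]:
            inverted.setdefault(word_id, []).append(doc_id)
    return inverted


# Hypothetical sample documents.
sample_documents = {"A": "apple banana", "B": "banana cherry"}
sample_word_ids, sample_word_index = build_word_index(sample_documents)
sample_inverted = build_inverted_index(sample_word_index)
```

Because the documents are visited in sorted order, each document ID list in the inverted index comes out sorted, which is convenient when the lists are later treated as sequences.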

The similarity table creation unit 160c is means that executes sequential pattern extraction on the inverted index 150d, determines the number of occurrences of each combination of document data (document IDs) that appears as a result (in this embodiment, in particular, only the number of occurrences of patterns of length 2 is determined), and creates the similarity table 150e based on the determination result.

The processing of the similarity table creation unit 160c is described concretely below. First, the similarity table creation unit 160c extracts from the inverted index 150d the document IDs that form frequent sequences with one item (in other words, each document ID contained in the inverted index 150d). For example, extracting these document IDs (hereinafter, sequence document IDs) from the inverted index 150d shown in FIG. 6 yields the document IDs "A", "B", and "C".

The similarity table creation unit 160c then projects the inverted index 150d by each sequence document ID to create projection data.

Here, the definition of projection is given. For a sequence $s = \langle a_1, a_2, \ldots, a_m \rangle$ and an item $a$, if there exists an integer $j$ ($1 \le j \le m$) such that $a_1 \ne a$, $a_2 \ne a$, $\ldots$, $a_{j-1} \ne a$, and $a_j = a$, then the sequence $\langle a_1, a_2, \ldots, a_j \rangle$ is defined as the prefix of $s$ with respect to $a$, written prefix(s, a), and the sequence $\langle a_{j+1}, \ldots, a_m \rangle$ is defined as the postfix of $s$ with respect to $a$, written postfix(s, a). If no such $j$ exists, prefix and postfix are undefined.

Projecting a sequence database S (a database containing a plurality of such sequences) by an item a to create the projection data S|a is defined as the operation of creating postfix(s, a) for each sequence s in S and treating the results as a new sequence database.
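The prefix, postfix, and projection definitions above translate directly into a short sketch. Sequences are represented as Python lists; returning None for the undefined case and skipping such sequences during projection are assumptions made for illustration.

```python
def prefix(s, a):
    """Return <a_1, ..., a_j>, where a_j is the first occurrence of a in s.

    Returns None when a does not occur in s (the undefined case).
    """
    if a not in s:
        return None
    j = s.index(a)  # 0-based position of the first occurrence
    return s[: j + 1]


def postfix(s, a):
    """Return <a_{j+1}, ..., a_m>, the part of s after the first occurrence of a."""
    if a not in s:
        return None
    return s[s.index(a) + 1:]


def project(database, a):
    """Build the projection data S|a from the postfixes of all sequences in S.

    Sequences for which the postfix is undefined are skipped (assumed handling).
    """
    return [postfix(s, a) for s in database if a in s]


# Hypothetical sequence database of document ID sequences.
sample_database = [["A", "B", "C"], ["A", "C"], ["B", "C"]]
projected_by_a = project(sample_database, "A")
```

With this sample database, projecting by "A" leaves the sequences that followed "A", and the sequence that never contained "A" drops out.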

Specifically, the similarity table creation unit 160c projects each document ID sequence (corresponding to a sequence s) contained in the inverted index 150d (corresponding to the sequence database S) by the items (the sequence document IDs "A", "B", and "C"). For example, the similarity table creation unit 160c executes the projection in the order of the sequence document IDs "A", "B", "C".

(Projection by sequence document ID "A")
When the similarity table creation unit 160c executes projection by the sequence document ID "A" on the inverted index 150d (see FIG. 6), it extracts, from each document ID sequence that contains the sequence document ID "A", the document ID sequence with "A" removed (corresponding to postfix(s, a) for the sequence document ID "A") together with the word ID corresponding to that document ID sequence, and creates the projection data.

FIG. 9 is a diagram illustrating an example of the projection data created by projection by the sequence document ID "A". In the projection data of FIG. 9, the combination of the sequence document ID "A" and the document ID "B" occurs "2" times, and the combination of the sequence document ID "A" and the document ID "C" occurs "3" times. The similarity table creation unit 160c therefore determines that the similarity between the document data with document ID "A" and the document data with document ID "B" is "2", and that the similarity between the document data with document ID "A" and the document data with document ID "C" is "3".

The similarity table creation unit 160c then updates the inverted index 150d (see FIG. 6) to match the projection data (see FIG. 9). Specifically, it updates the document ID sequence for word ID "2" in the inverted index 150d to "B, C", the document ID sequence for word ID "7" to "B, C", and the document ID sequence for word ID "10" to "C". FIG. 10 is a diagram (1) illustrating an example of the data structure of the updated inverted index 150d.

(Projection by sequence document ID "B")
When the similarity table creation unit 160c executes projection by the sequence document ID "B" on the inverted index 150d (see FIG. 10), it extracts, from each document ID sequence that contains the sequence document ID "B", the document ID sequence with "B" removed (corresponding to postfix(s, a) for the sequence document ID "B") together with the word ID corresponding to that document ID sequence, and creates the projection data.

FIG. 11 is a diagram illustrating an example of the projection data created by projection by the sequence document ID "B". In the projection data of FIG. 11, the combination of the sequence document ID "B" and the document ID "C" occurs "4" times, so the similarity table creation unit 160c determines that the similarity between the document data with document ID "B" and the document data with document ID "C" is "4".

The similarity table creation unit 160c then updates the inverted index 150d (see FIG. 10) to match the projection data (see FIG. 11). Specifically, it updates the document ID sequences for word IDs "2", "5", "7", and "9" in the inverted index 150d to "C". FIG. 12 is a diagram (2) illustrating an example of the data structure of the updated inverted index 150d.

(Projection by sequence document ID "C")
When the similarity table creation unit 160c executes projection by the sequence document ID "C" on the inverted index 150d (see FIG. 12), no document ID sequence remains once "C" is removed from the document ID sequences containing "C" (corresponding to postfix(s, a) for the sequence document ID "C"), and hence no corresponding word ID exists either. The similarity table creation unit 160c therefore ends the projection process.

Next, the similarity table creation unit 160c creates the similarity table 150e based on the similarities between the document data obtained by executing the projections by the sequence document IDs "A", "B", and "C". Although the case of calculating similarities from the sequence document IDs "A", "B", and "C" has been described here as an example, the similarities are calculated by the same sequential pattern extraction when further sequence document IDs such as "D" and "E" are present.
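The projection walk-through above can be condensed into a short sketch that counts only length-2 patterns. The function name and the sample inverted index are hypothetical; the sample data is not a copy of FIG. 6, only a small index chosen so that the counts agree with the values stated in the text (A-B: 2, A-C: 3, B-C: 4).

```python
from collections import Counter


def similarity_from_inverted_index(inverted_index):
    """Count length-2 patterns by projecting the index on each document ID in turn.

    inverted_index maps a word ID to the list of document IDs containing it.
    Returns a dict mapping (doc_x, doc_y) with doc_x < doc_y to the number of
    words the two documents share.
    """
    # Work on sorted copies so that each list behaves as an ordered sequence.
    sequences = {word_id: sorted(docs) for word_id, docs in inverted_index.items()}
    items = sorted({doc for docs in sequences.values() for doc in docs})
    similarity = Counter()
    for item in items:
        for word_id, docs in sequences.items():
            if item in docs:
                rest = docs[docs.index(item) + 1:]  # postfix(s, item)
                for other in rest:
                    similarity[(item, other)] += 1  # one occurrence of <item, other>
                sequences[word_id] = rest           # update the index with the projection
    return dict(similarity)


# Hypothetical inverted index consistent with the counts given in the text.
sample_index = {
    1: ["A"],
    2: ["A", "B", "C"],
    5: ["B", "C"],
    7: ["A", "B", "C"],
    9: ["B", "C"],
    10: ["A", "C"],
}
sample_similarity = similarity_from_inverted_index(sample_index)
```

In this sketch a pair of documents is touched only when the two share at least one word, so documents with no word in common never generate any work.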

The clustering processing unit 160d is means that classifies the document data group based on the similarity table 150e. The clustering processing unit 160d classifies the document data group so that the number of groups into which it is classified (hereinafter, the number of clusters) equals the number of clusters specified by the user (hereinafter, the specified number of clusters). The user may specify the number of clusters through the input unit 110, or the user of the terminal device 50 may enter the specified number of clusters, in which case the terminal device 50 transmits it to the data processing apparatus 100.

The processing of the clustering processing unit 160d is described concretely below. FIG. 13 is a diagram for explaining the processing of the clustering processing unit 160d. Here, as an example, the case where the specified number of clusters is "3" is described.

First, the clustering processing unit 160d initializes the cluster table 150f and then copies the data of the similarity table 150e (see FIG. 7) into it, thereby generating the cluster table 150f shown in FIG. 8 (or at the upper left of FIG. 13).

Then, the clustering processing unit 160d detects the pair of document IDs with the maximum similarity in the cluster table 150f and deletes the rows and columns corresponding to the document IDs of the detected pair from the cluster table 150f.

In the example shown in FIG. 13, the pair of document IDs with the maximum similarity is the pair “A” and “D”, with similarity “8”, so the clustering processing unit 160d deletes the rows and columns corresponding to document IDs “A” and “D” from the cluster table 150f (see step S10 in FIG. 13).

The clustering processing unit 160d creates a cluster consisting of the pair with the maximum similarity (the deleted pair) and adds it to the cluster table 150f. In the example shown in FIG. 13, it creates the cluster “A, D” from document IDs “A” and “D” and adds the cluster “A, D” to the cluster table 150f (see step S20 in FIG. 13).

For each other document ID, the clustering processing unit 160d takes the maximum of the similarities between that document ID and the document IDs contained in the cluster added to the cluster table 150f, and registers it in the row corresponding to the cluster. It then registers the same similarities in the cluster's column so that they match the values registered in the cluster's row.

This is explained using FIG. 13 as an example. As the similarity between the cluster “A, D” and document ID “B”, the larger of the similarity “2” between document IDs “A” and “B” and the similarity “0” between document IDs “D” and “B” is registered; that is, the similarity “2” is registered in the corresponding cell.

As the similarity between the cluster “A, D” and document ID “C”, the larger of the similarity “5” between document IDs “A” and “C” and the similarity “3” between document IDs “D” and “C” is registered; that is, the similarity “5” is registered in the corresponding cell.

As the similarity between the cluster “A, D” and document ID “E”, the larger of the similarity “1” between document IDs “A” and “E” and the similarity “2” between document IDs “D” and “E” is registered; that is, the similarity “2” is registered in the corresponding cell. The clustering processing unit 160d then registers the similarities in the cluster's column so that they match those registered in the cluster's row (see step S30 in FIG. 13).

The clustering processing unit 160d repeats the above processing until the number of clusters in the cluster table 150f equals the designated number of clusters. At the end of step S30 in FIG. 13, the number of clusters is “4” while the designated number of clusters is “3”, so the clustering processing unit 160d repeats the above processing once more.

That is, in the cluster table 150f shown at the upper right of FIG. 13, the pair of document IDs with the maximum similarity is the pair “C” and “E”, with similarity “6”, so the clustering processing unit 160d deletes the rows and columns corresponding to document IDs “C” and “E” from the cluster table 150f (see step S40 in FIG. 13).

Then, the clustering processing unit 160d creates the cluster “C, E” from document IDs “C” and “E” and adds it to the cluster table 150f (see step S50 in FIG. 13).

Since the similarity between document IDs “C” and “B” is “3” and the similarity between document IDs “E” and “B” is “4”, the clustering processing unit 160d registers the larger value, “4”, as the similarity between the cluster “C, E” and document ID “B”.

Likewise, since the similarity between document ID “C” and the cluster “A, D” is “5” and the similarity between document ID “E” and the cluster “A, D” is “2”, the clustering processing unit 160d registers the larger value, “5”, as the similarity between the cluster “C, E” and the cluster “A, D” (see step S60 in FIG. 13).

When step S60 in FIG. 13 is completed, the number of clusters in the cluster table 150f is “3”, which equals the designated number of clusters “3”, so the clustering processing unit 160d classifies the document data group based on the cluster table 150f. In the example shown in FIG. 13, the document data group is classified into the document data with document ID “B”, the document data with document IDs “A” and “D”, and the document data with document IDs “C” and “E”.
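The FIG. 13 walkthrough can be reproduced with a short Python sketch. It is an illustration only, not the patented implementation: the function name `cluster`, the dictionary layout, and the tie-breaking behaviour (the first maximal pair wins) are assumptions. The pairwise similarity values are the ones quoted in the walkthrough, and the similarity between two clusters is taken as the maximum over their member documents, as described for the cluster table 150f.

```python
from itertools import combinations


def cluster(similarity, num_clusters):
    """Merge the most similar pair of clusters until num_clusters remain.

    similarity maps frozenset({doc_a, doc_b}) to the similarity of two documents.
    The similarity of two clusters is the maximum over their member documents.
    """
    docs = sorted({d for pair in similarity for d in pair})
    clusters = [frozenset([d]) for d in docs]  # start with one cluster per document

    def linkage(c1, c2):
        return max(similarity.get(frozenset([a, b]), 0) for a in c1 for b in c2)

    while len(clusters) > max(num_clusters, 1):
        # Pick the pair with the maximum similarity (ties: first pair found).
        c1, c2 = max(combinations(clusters, 2), key=lambda p: linkage(*p))
        # Remove both members and add the merged cluster.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


# Pairwise similarities quoted in the FIG. 13 walkthrough.
SIMILARITY = {
    frozenset(pair): value
    for pair, value in {
        ("A", "B"): 2, ("A", "C"): 5, ("A", "D"): 8, ("A", "E"): 1,
        ("B", "C"): 3, ("B", "D"): 0, ("B", "E"): 4,
        ("C", "D"): 3, ("C", "E"): 6, ("D", "E"): 2,
    }.items()
}

result = sorted(sorted(c) for c in cluster(SIMILARITY, 3))
print(result)  # [['A', 'D'], ['B'], ['C', 'E']]
```

With the designated number of clusters set to 3, the sketch first merges “A” and “D” (similarity 8) and then “C” and “E” (similarity 6), leaving “B” on its own, which matches the classification described above.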

The clustering processing unit 160d outputs the classified document data to the terminal device 50 as the response to the search keyword transmitted from the terminal device 50.

Next, the processing procedure of the data processing apparatus 100 according to the present embodiment is described. FIG. 14 is a flowchart showing this processing procedure. As shown in the figure, the data processing apparatus 100 acquires a search keyword from the terminal device 50 (step S101) and searches for the document data corresponding to the search keyword (step S102).

The data processing apparatus 100 then executes the inverted index creation process (step S103), the similarity table creation process (step S104), and the clustering process (step S105), and outputs the clustering result (the classified document data) to the terminal device 50 (step S106).

Next, the inverted index creation process shown in step S103 of FIG. 14 is described. FIG. 15 is a flowchart showing the inverted index creation process. As shown in the figure, in the data processing apparatus 100, the inverted index creation unit 160b performs morphological analysis on each piece of document data (step S201).

The inverted index creation unit 160b then creates the word index 150b (step S202) and creates the inverted index 150d keyed by word ID (step S203).
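Steps S201 to S203 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: whitespace splitting stands in for morphological analysis, word IDs are assigned as consecutive integers, and the function name `build_inverted_index` and the sample documents are hypothetical and do not come from the patent.

```python
def build_inverted_index(documents):
    """Build a word-ID table and an inverted index.

    documents maps a document ID to its text. Returns (word_ids, inverted_index),
    where inverted_index maps a word ID to the sorted list of document IDs
    that contain the word.
    """
    word_ids = {}        # word -> word ID (stand-in for a word ID management table)
    inverted_index = {}  # word ID -> document IDs containing the word
    for doc_id in sorted(documents):
        # Assumption: whitespace splitting replaces morphological analysis.
        words = sorted(set(documents[doc_id].lower().split()))
        for word in words:
            # Assign the next free ID the first time a word is seen.
            word_id = word_ids.setdefault(word, len(word_ids) + 1)
            inverted_index.setdefault(word_id, []).append(doc_id)
    return word_ids, inverted_index


# Hypothetical sample documents.
DOCS = {
    "A": "apple banana",
    "B": "banana cherry",
    "C": "apple cherry banana",
}
word_ids, index = build_inverted_index(DOCS)
print(word_ids)  # {'apple': 1, 'banana': 2, 'cherry': 3}
print(index)     # {1: ['A', 'C'], 2: ['A', 'B', 'C'], 3: ['B', 'C']}
```

Because documents are visited in sorted order and each word is counted once per document, every posting list comes out sorted and free of duplicates, which is convenient for the pattern extraction that follows.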

Next, the similarity table creation process shown in step S104 of FIG. 14 is described. FIG. 16 is a flowchart showing the similarity table creation process. As shown in the figure, in the data processing apparatus 100, the similarity table creation unit 160c initializes the similarity table 150e (step S301) and extracts from the inverted index 150d the sequence document IDs that form frequent sequences with one item (step S302).

The similarity table creation unit 160c then creates projection data based on a sequence document ID and the inverted index 150d (step S303), calculates the appearance frequency of each pair (the number of occurrences of the combination) (step S304), and updates the inverted index 150d (step S305).

The similarity table creation unit 160c determines whether all sequence document IDs have been selected (step S306). If not all have been selected (step S307, No), it selects an unselected sequence document ID (step S308) and returns to step S303.

If all sequence document IDs have been selected (step S307, Yes), the similarity table creation unit 160c registers the similarity of each document pair in the similarity table 150e (step S309).
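The loop of steps S302 to S309 can be illustrated with the following Python sketch. It is a simplified reading of the flowchart, not the patented implementation: the function name `pair_counts` and the sample index are hypothetical, no minimum-support threshold is applied when choosing the sequence document IDs, and using the raw pair count directly as the similarity is an assumption.

```python
from collections import Counter


def pair_counts(inverted_index):
    """Count how many posting lists contain each pair of document IDs.

    inverted_index maps a word ID to the document IDs containing that word.
    Only patterns of length 2 (document pairs) are counted.
    """
    postings = [sorted(set(docs)) for docs in inverted_index.values()]
    counts = Counter()
    # Length-1 sequences: every document ID that occurs in the index.
    prefixes = sorted({d for p in postings for d in p})
    for prefix in prefixes:
        # Projection: keep what follows `prefix` in each list that contains it.
        projected = [p[p.index(prefix) + 1:] for p in postings if prefix in p]
        for suffix in projected:
            for other in suffix:
                counts[(prefix, other)] += 1  # one occurrence of the pair
        # Update the index: drop `prefix` so later projections skip it.
        postings = [[d for d in p if d != prefix] for p in postings]
    return counts


# Hypothetical inverted index: word ID -> document IDs.
INDEX = {1: ["A", "C"], 2: ["A", "B", "C"], 3: ["B", "C"]}
counts = pair_counts(INDEX)
print(dict(counts))  # {('A', 'C'): 2, ('A', 'B'): 1, ('B', 'C'): 2}
```

Each document pair is counted once per word the two documents share, so in this sample “A” and “C” share two words while “A” and “B” share one.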

Next, the clustering process shown in step S105 of FIG. 14 is described. FIG. 17 is a flowchart showing the clustering process. As shown in the figure, the clustering processing unit 160d copies the similarity table 150e into the cluster table 150f (step S401) and determines whether the number of clusters in the cluster table 150f equals the designated number of clusters (step S402).

If the number of clusters equals the designated number of clusters (step S403, Yes), the clustering process ends. If they differ (step S403, No), the clustering processing unit 160d determines the maximum value in the cluster table 150f (step S404).

The clustering processing unit 160d deletes the rows and columns of the pair having the maximum value (step S405), generates a cluster consisting of that pair, and adds it to the cluster table 150f (step S406).

Subsequently, the clustering processing unit 160d selects an unselected element from the elements of the added cluster's row (step S407), refers to the similarity table 150e, and registers in that element the maximum similarity between the added cluster and the document or cluster corresponding to that element (step S408).

The clustering processing unit 160d then determines whether all elements have been selected (step S409). If not (step S410, No), it returns to step S407. If all elements have been selected (step S410, Yes), it registers the value of each element of the row added to the cluster table 150f in the corresponding element of the column (step S411) and returns to step S402.

As described above, the data processing apparatus 100 according to the present embodiment creates the inverted index 150d, which associates each word ID in the document data with the document data containing the word corresponding to that word ID, and executes sequential pattern extraction on the inverted index 150d. It then determines the number of occurrences of each combination of document data that appears as a result of the sequential pattern extraction (in the present embodiment, only the number of occurrences of patterns of length 2) and calculates the similarity between the document data based on the result. The similarity between document data can therefore be calculated efficiently, and the processing time required for similarity calculation can be shortened compared with the conventional technique.

For example, when similarity is calculated with the vector space model as in the conventional technique, the similarity between documents is obtained by comparing documents with each other, so the number of comparisons F(n) for n documents is given by the quadratic expression

$$F(n) = \frac{n(n-1)}{2}$$

For example, when the number of documents is 4, the similarity must be calculated six times (for documents a, b, c, and d: a-b, a-c, a-d, b-c, b-d, and c-d).
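The count for four documents can be checked by enumerating the pairs directly. The short Python sketch below is only an arithmetic illustration; the function name `comparison_count` is hypothetical.

```python
from itertools import combinations


def comparison_count(n):
    """Number of pairwise comparisons among n documents: n(n-1)/2."""
    return n * (n - 1) // 2


pairs = list(combinations("abcd", 2))  # every unordered pair of documents
print(pairs)
# [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
print(len(pairs), comparison_count(4))  # 6 6
```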

To estimate the computational cost of the vector space model, the number of comparisons can be expressed in O notation as

$$O(n^2)$$

In contrast, the data processing apparatus 100 according to the present embodiment calculates the similarity between documents by comparing each document with the inverted index 150d. The amount of computation is therefore considered to grow as a linear function of the number of documents (for example, each additional document adds one comparison with the inverted index 150d), and the number of comparisons can be expressed in O notation as

$$O(n)$$

Accordingly, whereas the processing time of the conventional technique grows roughly in proportion to n^2 as the number of documents n increases (about fourfold when n doubles), the processing time of the data processing apparatus 100 according to the present embodiment grows only roughly in proportion to n (about twofold when n doubles). The processing time required for similarity calculation can therefore be shortened compared with the conventional technique.

Of the processes described in the present embodiment, all or part of those described as being performed automatically may be performed manually, and all or part of those described as being performed manually may be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and the drawings may be changed arbitrarily unless otherwise specified.

The components of the data processing apparatus 100 shown in FIG. 2 are functional and conceptual, and need not be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to the illustrated one, and all or part of them may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of the processing functions performed by each device may be realized by a CPU and a program analyzed and executed by the CPU, or as hardware using wired logic.

FIG. 18 is a diagram showing the hardware configuration of the computer 40 constituting the data processing apparatus 100 according to the embodiment. As shown in FIG. 18, the computer (article search device) 40 comprises an input device 41, a monitor 42, a RAM (Random Access Memory) 43, a ROM (Read Only Memory) 44, a medium reading device 45 that reads data from a storage medium, a communication device 46 that exchanges data with other devices (for example, the terminal device 50), a CPU (Central Processing Unit) 47, and an HDD (Hard Disk Drive) 48, all connected by a bus 49.

The HDD 48 stores a clustering processing program 48b that provides the same functions as the data processing apparatus 100 described above. When the CPU 47 reads and executes the clustering processing program 48b, a clustering processing process 47a is started. The clustering processing process 47a corresponds to the document data search unit 160a, the inverted index creation unit 160b, the similarity table creation unit 160c, and the clustering processing unit 160d shown in FIG. 2.

The HDD 48 also stores various data 48a corresponding to the document management data 150a, the word index 150b, the word ID management table 150c, the inverted index 150d, the similarity table 150e, and the cluster table 150f. The CPU 47 reads the various data 48a stored in the HDD 48, stores them in the RAM 43, and classifies the document data using the various data 43a stored in the RAM 43.

The clustering processing program 48b shown in FIG. 18 need not be stored in the HDD 48 from the outset. For example, the clustering processing program 48b may be stored on a “portable physical medium” inserted into the computer, such as a flexible disk (FD), CD-ROM, DVD, magneto-optical disk, or IC card; on a “fixed physical medium” such as a hard disk drive (HDD) provided inside or outside the computer; or on “another computer (or server)” connected to the computer via a public line, the Internet, a LAN, a WAN, or the like. The computer may then read the clustering processing program 48b from any of these and execute it.

The following supplementary notes are disclosed with respect to the embodiments, including the example described above.

(Supplementary note 1) A data classification method for a data processing apparatus that classifies document data according to the similarity between the document data stored in a storage device, wherein the data processing apparatus performs:
a list creation step of reading the document data stored in the storage device and creating a list that associates keywords in each piece of document data with the document data containing those keywords;
a determination step of executing sequential pattern extraction on the list and determining the number of occurrences of each combination of document data that appears; and
a calculation step of calculating the similarity between the document data based on the determination result of the determination step.

(Supplementary note 2) The data classification method according to supplementary note 1, wherein the determination step determines, in the sequential pattern extraction, only the number of occurrences of patterns of length 2.

(Supplementary note 3) The data classification method according to supplementary note 1 or 2, further comprising a classification step of classifying the document data by finding the combination of document data with the maximum similarity value, wherein the classification step classifies the document data so that the number of classifications of the document data equals the number of classifications designated by the user.

(Supplementary note 4) A data processing apparatus that classifies document data according to the similarity between the document data stored in a storage device, comprising:
list creation means for reading the document data stored in the storage device and creating a list that associates keywords in each piece of document data with the document data containing those keywords;
determination means for executing sequential pattern extraction on the list and determining the number of occurrences of each combination of document data that appears; and
calculation means for calculating the similarity between the document data based on the determination result of the determination means.

(Supplementary note 5) The data processing apparatus according to supplementary note 4, wherein the determination means determines, in the sequential pattern extraction, only the number of occurrences of patterns of length 2.

(Supplementary note 6) The data processing apparatus according to supplementary note 4 or 5, further comprising classification means for classifying the document data by finding the combination of document data with the maximum similarity value, wherein the classification means classifies the document data so that the number of classifications of the document data equals the number of classifications designated by the user.

As described above, the data classification method and data processing apparatus according to the present invention are useful for search systems that search document data, and are particularly suitable when document data must be classified without long processing times.

FIG. 1 is a diagram showing the configuration of the search system.
FIG. 2 is a functional block diagram showing the configuration of the data processing apparatus according to the present embodiment.
FIG. 3 is a diagram showing an example of the data structure of the document management data.
FIG. 4 is a diagram showing an example of the data structure of the word index.
FIG. 5 is a diagram showing an example of the data structure of the word ID management table.
FIG. 6 is a diagram showing an example of the data structure of the inverted index.
FIG. 7 is a diagram showing an example of the data structure of the similarity table.
FIG. 8 is a diagram showing an example of the data structure of the cluster table.
FIG. 9 is a diagram showing an example of the projection data created by projection with the sequence document ID “A”.
FIG. 10 is a diagram (1) showing an example of the data structure of the updated inverted index.
FIG. 11 is a diagram showing an example of the projection data created by projection with the sequence document ID “B”.
FIG. 12 is a diagram (2) showing an example of the data structure of the updated inverted index.
FIG. 13 is a diagram for explaining the processing of the clustering processing unit.
FIG. 14 is a flowchart showing the processing procedure of the data processing apparatus according to the present embodiment.
FIG. 15 is a flowchart showing the inverted index creation process.
FIG. 16 is a flowchart showing the similarity table creation process.
FIG. 17 is a flowchart showing the clustering process.
FIG. 18 is a diagram showing the hardware configuration of the computer constituting the data processing apparatus according to the embodiment.

Explanation of symbols

10 Network
40 Computer
41 Input device
42 Monitor
43 RAM
43a, 48a Various data
44 ROM
45 Medium reading device
46 Communication device
47 CPU
47a Clustering processing process
48 HDD
48b Clustering processing program
49 Bus
50 Terminal device
100 Data processing apparatus
110 Input unit
120 Output unit
130 Communication control IF unit
140 Input/output control IF unit
150 Storage unit
150a Document management data
150b Word index
150c Word ID management table
150d Inverted index
150e Similarity table
150f Cluster table
160 Control unit
160a Document data search unit
160b Inverted index creation unit
160c Similarity table creation unit
160d Clustering processing unit

Claims

1. A data classification method for a data processing apparatus that classifies document data according to the similarity between the document data stored in a storage device, wherein the data processing apparatus performs:
a list creation step of reading the document data stored in the storage device and creating a list that associates keywords in each piece of document data with the document data containing those keywords;
a determination step of executing sequential pattern extraction on the list and determining the number of occurrences of each combination of document data that appears; and
a calculation step of calculating the similarity between the document data based on the determination result of the determination step.

2. The data classification method according to claim 1, wherein in the determination step, only the number of occurrences of a length 2 pattern is determined by sequential pattern extraction.

3. The data classification method according to claim 1, further comprising a classification step of classifying the document data by finding the combination of document data with the maximum similarity value, wherein the classification step classifies the document data so that the number of classifications of the document data equals the number of classifications designated by the user.

4. A data processing apparatus that classifies document data according to the similarity between the document data stored in a storage device, comprising:
list creation means for reading the document data stored in the storage device and creating a list that associates keywords in each piece of document data with the document data containing those keywords;
determination means for executing sequential pattern extraction on the list and determining the number of occurrences of each combination of document data that appears; and
calculation means for calculating the similarity between the document data based on the determination result of the determination means.

5. The data processing apparatus according to claim 4, wherein the determination means determines, in the sequential pattern extraction, only the number of occurrences of patterns of length 2.