JP4940251B2 - Document processing program and document processing apparatus

Info

Publication number: JP4940251B2
Application number: JP2009001851A
Authority: JP
Inventors: 早織倉田; 佳美齋藤; 敏行加納
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2009-01-07
Filing date: 2009-01-07
Publication date: 2012-05-30
Anticipated expiration: 2029-01-07
Also published as: JP2010160645A

Description

The present invention relates to a document processing program and a document processing apparatus for classifying a large collection of documents.

In recent years, document processing apparatuses have become known that classify a large collection of documents (texts) into several sets of mutually similar documents (clusters).

One document classification method used in such an apparatus is, for example, an inter-document similarity calculation method based on a vector space model built from the words that appear in the documents.
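As a rough illustration of the vector space model mentioned above, the following sketch represents each document as a word-count vector and compares two documents by cosine similarity. The choice of cosine similarity and the sample word lists are assumptions made for illustration; the source only states that a vector space model built from words is used.

```python
import math
from collections import Counter


def term_vector(tokens):
    """Represent a document as a sparse word-count vector."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical, already tokenized documents.
doc1 = ["analog", "image", "input", "conversion"]
doc2 = ["analog", "signal", "conversion"]
doc3 = ["printer", "paper", "jam"]

sim_12 = cosine_similarity(term_vector(doc1), term_vector(doc2))
sim_13 = cosine_similarity(term_vector(doc1), term_vector(doc3))
```

Documents that share words (doc1 and doc2) receive a positive similarity, while documents with no words in common (doc1 and doc3) receive zero.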

As a related technique, a technique (hereinafter referred to as the prior art) has been disclosed that can automatically and accurately classify text information such as digitized comments from customers and residents collected by companies and local governments, for example at call centers and mail centers, and sales reports written by sales staff (see, for example, Patent Document 1).

The prior art adopts two ideas. The first is to classify input data (the data to be classified) on the basis of data that has already been classified (a correct-answer set). The second is to divide the data into text (text data) and non-text data (data in the narrow sense), to calculate for each, by mining (text mining and data mining), a reference value for judging similarity, and to classify on the basis of that reference value.

Patent Document 1: JP 2005-71229 A

However, the prior art described above does not classify documents in consideration of the meaning of the sentences they contain. In other words, the "important descriptive parts" and the "unimportant parts" of a document are treated with equal weight, so the classification accuracy is low.

Classification accuracy is generally evaluated on the basis of whether the documents classified into the same cluster have the same meaning, for example whether they share the same 「Ａ（目的語）をＢ（動詞）する」 ("do B (verb) to A (object)") relation.

Therefore, to improve classification accuracy, classification must take the pair of "object" and "verb" into account.

Accordingly, an object of the present invention is to provide a document processing program and a document processing apparatus capable of improving document classification accuracy.

According to one aspect of the present invention, there is provided a document processing program executed by a computer in a document processing apparatus that comprises an external storage device, which has document storage means for storing a plurality of documents made up of sentences containing character strings and has feature storage means, and the computer, which uses the external storage device. The program causes the computer to execute the steps of: extracting, for each document stored in the document storage means, a character string that is important in the document as an important word, based on the appearance frequency of character strings in the document; extracting a sentence containing the extracted important word, as a summary sentence, from the document from which the important word was extracted; analyzing the dependency between the character strings contained in the extracted summary sentence; generating a paraphrase sentence of the summary sentence that contains the important word by simplifying the expressions in the summary sentence other than the important word, based on the important word contained in the extracted summary sentence and on the analysis result; extracting, from the paraphrase sentence, a feature set composed of a plurality of character strings including the important word contained in the generated paraphrase sentence; storing the extracted feature set in the feature storage means; calculating, for each document stored in the document storage means, document vector component values based on the appearance frequency, in the summary sentence extracted from the document, of the feature sets stored in the feature storage means; generating a document vector for each document stored in the document storage means based on the calculated document vector component values; and generating, based on the extracted important word and the analysis result, a template in which the important word is the object or the verb. In the step of extracting the feature set, the feature set is extracted by matching the generated template against the generated paraphrase sentence.

According to the present invention, document classification accuracy can be improved.

FIG. 1 is a block diagram showing the hardware configuration of a document processing apparatus according to a first embodiment of the present invention.
FIG. 2 is a block diagram mainly showing the functional configuration of the document processing apparatus 30 shown in FIG. 1.
FIG. 3 is a flowchart showing the processing procedure of the document processing apparatus 30 according to the present embodiment.
FIG. 4 is a diagram showing an example of a dependency analysis result produced by the dependency analysis unit 313.
FIG. 5 is a diagram for explaining a specific example of the paraphrase process when there is one important word.
FIG. 6 is a diagram for explaining a specific example of the paraphrase process when there are two important words.
FIG. 7 is a block diagram mainly showing the functional configuration of a document processing apparatus according to a second embodiment of the present invention.
FIG. 8 is a flowchart showing the processing procedure of the document processing apparatus 40 according to the present embodiment.
FIG. 9 is a diagram showing an example of a template generated by the template generation unit 41 when the important word extracted by the important word extraction unit 311 is 「アナログ」 ("analog").
FIG. 10 is a diagram showing an example of a template generated by the template generation unit 41 when the important words extracted by the important word extraction unit 311 are 「アナログ」 ("analog") and 「変換」 ("conversion").
FIG. 11 is a diagram for explaining the feature set extracted when a paraphrase sentence does not match a template.
FIG. 12 is a block diagram mainly showing the functional configuration of a document processing apparatus according to a third embodiment of the present invention.
FIG. 13 is a flowchart showing the processing procedure of the document processing apparatus 50 according to the present embodiment.

Embodiments of the present invention will be described below with reference to the drawings.

[First Embodiment]
First, a first embodiment of the present invention will be described with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the hardware configuration of the document processing apparatus according to the present embodiment. As shown in FIG. 1, a computer 10 is connected to an external storage device 20 such as a hard disk drive (HDD). The external storage device 20 stores a program 21 executed by the computer 10. The computer 10 and the external storage device 20 constitute a document processing apparatus 30.

FIG. 2 is a block diagram mainly showing the functional configuration of the document processing apparatus 30 shown in FIG. 1. As shown in FIG. 2, the document processing apparatus 30 includes a summary sentence specifying unit 31, a paraphrase processing unit 32, and a clustering unit 33. In the present embodiment, these units 31 to 33 are realized by the computer 10 shown in FIG. 1 executing the program 21 stored in the external storage device 20. The program 21 can be stored in advance on a computer-readable storage medium and distributed. The program 21 may also be downloaded to the computer 10 via a network, for example.

The document processing apparatus 30 also includes a document storage unit 22, a synonym phrase dictionary storage unit 23, a paraphrase sentence storage unit 24, and a document cluster storage unit 25. In the present embodiment, these storage units 22 to 25 are held in, for example, the external storage device 20.

The document storage unit 22 stores a plurality of items of document data (text data). Each item of document data (hereinafter referred to as a document) consists of sentences containing character strings.

The summary sentence specifying unit 31 includes an important word extraction unit 311, a target sentence extraction unit 312, and a dependency analysis unit 313.

The important word extraction unit 311 extracts, as an important word, a character string that is important in a document stored in the document storage unit 22, based on the appearance frequency of character strings (words) in that document.

The target sentence extraction unit 312 extracts a sentence (target sentence) containing the important word extracted by the important word extraction unit 311 from the document from which that important word was extracted. Because the target sentence contains the extracted important word, it is an important descriptive part of the document. The target sentence extracted by the target sentence extraction unit 312 is therefore referred to as a summary sentence (of the document).

The dependency analysis unit 313 analyzes the dependency between the character strings contained in the summary sentence extracted by the target sentence extraction unit 312 (dependency analysis). A specific example of a dependency analysis result is described later. Executing this dependency analysis also makes it possible to obtain, for example, the part of speech of a character string (important word) contained in the summary sentence.

The synonym phrase dictionary storage unit 23 stores synonymous expressions of words (character strings) in advance. For example, 「映像」 ("video") is stored in the synonym phrase dictionary storage unit 23 as a synonymous expression of the word 「画像」 ("image"). The synonym phrase dictionary storage unit 23 is used, for example, in synonym processing that unifies important words and other strings extracted by the important word extraction unit 311.

The paraphrase processing unit 32 includes a number determination unit 321 and a paraphrase sentence generation unit 322. The number determination unit 321 determines whether the number of important words contained in the summary sentence extracted by the target sentence extraction unit 312 is one, or two or more.

The paraphrase sentence generation unit 322 paraphrases the summary sentence extracted by the target sentence extraction unit 312, based on the important word extracted by the important word extraction unit 311, the dependency analysis result for the summary sentence produced by the dependency analysis unit 313, and the synonymous expressions stored in the synonym phrase dictionary storage unit 23. In this way the paraphrase sentence generation unit 322 executes a process (paraphrase process) that generates a paraphrase sentence of the summary sentence. The generated paraphrase sentence contains the important word extracted by the important word extraction unit 311.

The paraphrase sentence generation unit 322 executes the paraphrase process according to the determination result of the number determination unit 321. Details of the paraphrase process are described later.

The paraphrase sentence generation unit 322 stores the generated paraphrase sentence in the paraphrase sentence storage unit 24. That is, the paraphrase sentence storage unit 24 stores the paraphrase sentences of the summary sentences extracted from each document stored in the document storage unit 22, in other words, the paraphrase sentences for each stored document.

The clustering unit 33 classifies the paraphrase sentences stored in the paraphrase sentence storage unit 24, for example based on the appearance frequency of the character strings contained in them. The classification result produced by the clustering unit 33 is stored in the document cluster storage unit 25.

Next, the processing procedure of the document processing apparatus 30 according to the present embodiment will be described with reference to the flowchart of FIG. 3. The processing described below is executed, for example, in response to an instruction (operation) from the user.

First, the important word extraction unit 311 included in the summary sentence specifying unit 31 acquires one of the documents stored in the document storage unit 22 from the document storage unit 22 (step S1).

Next, the important word extraction unit 311 extracts the important words of the acquired document based on the appearance frequency of character strings (words) in that document (step S2). Specifically, the important word extraction unit 311 uses a characteristic word extraction method that, for example, selects characteristic words on the basis of their TF/IDF value (score). More than one important word may be extracted.
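The scoring in step S2 can be sketched as follows. This is a minimal illustration that assumes a common TF-IDF formulation (relative term frequency multiplied by the logarithm of the inverse document frequency); the source does not specify the exact formula, and the function names, the `top_k` parameter, and the sample corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(doc_tokens, corpus):
    """Score each word of one document against the whole corpus."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        # Number of documents containing the term (at least 1 to avoid
        # division by zero when the document is not part of the corpus).
        df = max(1, sum(1 for d in corpus if term in d))
        idf = math.log(n_docs / df)
        scores[term] = (count / len(doc_tokens)) * idf
    return scores


def extract_important_words(doc_tokens, corpus, top_k=2):
    """Return the top_k highest-scoring words (ties broken alphabetically)."""
    scores = tfidf_scores(doc_tokens, corpus)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k]


# Hypothetical, already tokenized corpus.
corpus = [
    ["analog", "image", "input", "conversion", "execute", "analog", "conversion"],
    ["image", "input", "execute", "printer"],
    ["image", "execute", "paper", "input"],
]
important = extract_important_words(corpus[0], corpus)
```

In this toy corpus, "analog" and "conversion" occur only in the first document, so they outrank words such as "image" that occur in every document.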

The important words may be extracted from the whole of the document acquired by the important word extraction unit 311, or from a specific part of the document such as an important paragraph (for example, the first paragraph). In other words, the important words may be extracted based on position within the acquired document.

The target sentence extraction unit 312 extracts, from the document acquired by the important word extraction unit 311, the summary sentences (target sentences) that contain an important word extracted by the important word extraction unit 311 (step S3). More than one summary sentence may be extracted.
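Step S3 can be sketched as a simple filter over the sentences of a document. The sentence splitting on the Japanese full stop 「。」 and the substring matching are simplifying assumptions, and the sample document is invented for illustration, apart from the example sentence that the description itself uses.

```python
def split_sentences(text):
    """Split Japanese text into sentences on the full stop '。'."""
    return [s.strip() + "。" for s in text.split("。") if s.strip()]


def extract_summary_sentences(text, important_words):
    """Keep only the sentences that contain at least one important word."""
    return [
        sentence
        for sentence in split_sentences(text)
        if any(word in sentence for word in important_words)
    ]


# Hypothetical document; the second sentence is the example from the description.
document = "本装置は画像を扱う。アナログの画像を入力し変換を実行する。結果は保存される。"
summary = extract_summary_sentences(document, ["アナログ", "変換"])
```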

The summary sentences may be extracted from the whole of the document acquired by the important word extraction unit 311, or from a specific part of the document (an important paragraph). In other words, the summary sentences may be extracted based on position within the acquired document.

The dependency analysis unit 313 performs dependency analysis on the summary sentence extracted by the target sentence extraction unit 312 (step S4).

FIG. 4 shows an example of a dependency analysis result produced by the dependency analysis unit 313. It is the result obtained when, for example, the important words extracted by the important word extraction unit 311 are 「アナログ」 ("analog") and 「変換」 ("conversion") and the summary sentence extracted by the target sentence extraction unit 312 is 「アナログの画像を入力し変換を実行する」 ("input an analog image and execute conversion").

By performing dependency analysis on the summary sentence, the dependency analysis unit 313 also obtains information such as the part of speech of the character strings (important words) contained in it, for example 「アナログ」 ("analog") or 「変換」 ("conversion").

Next, the number determination unit 321 included in the paraphrase processing unit 32 determines whether the number of important words contained in the summary sentence extracted by the target sentence extraction unit 312 is one, or two or more.

The paraphrase sentence generation unit 322 executes a process that paraphrases the summary sentence extracted by the target sentence extraction unit 312 (paraphrase process), based on the important words extracted by the important word extraction unit 311, the dependency analysis result produced by the dependency analysis unit 313, and the determination result of the number determination unit 321. The paraphrase sentence generation unit 322 thereby generates a paraphrase sentence of the summary sentence (step S5).

At this time, the paraphrase sentence generation unit 322 uses the synonymous expressions stored in the synonym phrase dictionary storage unit 23 to perform synonym processing on character strings, such as important words, contained in the summary sentence extracted by the target sentence extraction unit 312.
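A minimal sketch of such synonym processing is a dictionary lookup that maps each synonymous expression to one canonical form. The single entry mapping 「映像」 to 「画像」 follows the example given for the synonym phrase dictionary; the dictionary layout, the function name, and the pre-tokenized input are assumptions.

```python
# Synonymous expression -> canonical word (layout is an assumption).
SYNONYMS = {"映像": "画像"}


def unify_synonyms(tokens, synonyms=SYNONYMS):
    """Replace every token that has a canonical form; leave the rest as-is."""
    return [synonyms.get(token, token) for token in tokens]


unified = unify_synonyms(["アナログ", "の", "映像", "を", "入力", "する"])
```

After unification, sentences that differ only in the choice between 「映像」 and 「画像」 yield identical strings and can therefore be counted together.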

The paraphrase sentence generation unit 322 also stores the generated paraphrase sentence in the paraphrase sentence storage unit 24.

The paraphrase process described above rewrites complex expressions in the summary sentence extracted by the target sentence extraction unit 312 as simpler expressions. It consists of, for example, first to fourth paraphrase processes.

The first paraphrase process is the simplification of noun phrases. Specifically, it rewrites the expression 「ＡのＢ」 ("B of A") simply as 「Ａ」. In other words, the 「のＢ」 part of 「ＡのＢ」 is omitted.

The second paraphrase process is the simplification of functional verb phrases. Specifically, it rewrites the expression 「Ａを実行する」 ("execute A") simply as 「Ａする」 ("do A").

The third paraphrase process is the simplification of subordinate clauses. Specifically, it rewrites the expression 「ＡをＢしたらＣする」 ("once B is done to A, do C") simply as 「ＡをＢしてＣする」 ("do B to A and do C").

The fourth paraphrase process is the simplification of case-sharing structures. Specifically, it rewrites the expression 「ＡをＢしてＣする」 ("do B to A and do C") simply as 「ＡをＣする」 ("do C to A"). In other words, the 「Ｂして」 part of 「ＡをＢしてＣする」 is omitted.

The paraphrase sentence generation unit 322 generates a paraphrase sentence of the summary sentence extracted by the target sentence extraction unit 312 by means of the first to fourth paraphrase processes described above.
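The four paraphrase processes can be approximated with string rewriting, as in the sketch below. The apparatus described here works on the dependency analysis result, so these regular expressions over raw strings are only a rough stand-in; the patterns, the set of case particles, and the function names are assumptions.

```python
import re


def simplify_noun_phrase(s):
    """First process: 'AのB' -> 'A' (drop 'のB' before a case particle)."""
    return re.sub(r"の[^のをがには]+(?=[をがには])", "", s)


def simplify_functional_verb(s):
    """Second process: 'Aを実行する' -> 'Aする'."""
    return s.replace("を実行する", "する")


def simplify_subordinate_clause(s):
    """Third process: 'AをBしたらCする' -> 'AをBしてCする'."""
    return s.replace("したら", "して")


def simplify_case_sharing(s):
    """Fourth process: 'AをBしてCする' -> 'AをCする' (also accepts 'Bし')."""
    return re.sub(r"^(.+?を)(.+?)して?(.+する)$", r"\1\3", s)


p1 = simplify_noun_phrase("アナログの画像を入力し変換を実行する")
p2 = simplify_functional_verb(p1)
p3 = simplify_subordinate_clause("画像を入力したら変換する")
p4 = simplify_case_sharing(p3)
```

Note that this sketch applies each rule unconditionally; it does not check whether the dropped part is an important word.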

Although paraphrase sentences are generated by the first to fourth paraphrase processes, an important word contained in the summary sentence is never omitted when a paraphrase sentence is generated. In other words, the first to fourth paraphrase processes are not all applied to every summary sentence.

Specifically, when the summary sentence is, for example, 「ＡをＢしてＣする」, the fourth paraphrase process would rewrite it as 「ＡをＣする」. If 「Ｂ」 is an important word, however, applying the fourth paraphrase process would omit the important word, so the fourth paraphrase process is not applied to that summary sentence.
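The guard described here can be sketched as a check on the part that the fourth paraphrase process would drop. As before, the regular expression over a raw string is an assumed stand-in for working on the dependency analysis result, and the function name and sample sentence are hypothetical.

```python
import re

# 'AをBしてCする' (or 'AをBしCする'): group 2 is the verb stem B to be dropped.
CASE_SHARING = re.compile(r"^(.+?を)(.+?)して?(.+する)$")


def simplify_case_sharing_guarded(sentence, important_words):
    """Apply the fourth process unless it would delete an important word."""
    match = CASE_SHARING.match(sentence)
    if match is None:
        return sentence
    dropped = match.group(2)
    if any(word in dropped for word in important_words):
        return sentence  # B is important: leave the sentence untouched
    return match.group(1) + match.group(3)


kept = simplify_case_sharing_guarded("画像を変換して保存する", ["変換"])
simplified = simplify_case_sharing_guarded("画像を変換して保存する", ["画像"])
```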

In addition to the first to fourth paraphrase processes described above, the paraphrase sentence generation unit 322 may perform paraphrasing by, for example, pruning the dependency analysis result (syntax tree). Pruning is a process that removes unnecessary expressions (character strings) from the syntax tree.

A specific example of the paraphrase process when the summary sentence contains one important word will now be described with reference to FIG. 5. Here, the important word extracted by the important word extraction unit 311 is assumed to be 「アナログ」 ("analog"), and the summary sentence extracted by the target sentence extraction unit 312 is assumed to be 「アナログの画像を入力し変換を実行する」 ("input an analog image and execute conversion").

In FIG. 5, the summary sentence 101 and its paraphrase sentences 102 to 104 are shown in the form of dependency analysis results like the one in FIG. 4.

First, applying the first paraphrase process to the summary sentence 101 rewrites the summary sentence 「アナログの画像を入力し変換を実行する」 ("input an analog image and execute conversion") 101 as the paraphrase sentence 「アナログを入力し変換を実行する」 ("input analog and execute conversion") 102 (step S11).

Next, applying the second paraphrase process to the paraphrase sentence 102 rewrites 「アナログを入力し変換を実行する」 ("input analog and execute conversion") 102 as the paraphrase sentence 「アナログを入力し変換する」 ("input analog and convert") 103 (step S12).

From the paraphrase sentence 「アナログを入力し変換する」 ("input analog and convert") 103, the paraphrase sentences 「アナログを入力する」 ("input analog") and 「アナログを変換する」 ("convert analog") 104 are generated (step S13).

In this way, the paraphrase sentence generation unit 322 executes the paraphrase process on the summary sentence 「アナログの画像を入力し変換を実行する」 ("input an analog image and execute conversion") 101 and generates the paraphrase sentences 「アナログを入力する」 ("input analog") and 「アナログを変換する」 ("convert analog") 104. These paraphrase sentences 104 are stored in the paraphrase sentence storage unit 24.
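Steps S11 to S13 can be traced on this example with the following sketch. The string patterns stand in for operations that the apparatus performs on the dependency analysis result, so the regular expressions and function names are assumptions; only the input and the resulting sentences come from the description.

```python
import re


def step_s11(s):
    """First paraphrase process: drop 'のB' from 'AのB'."""
    return re.sub(r"の[^のをがには]+(?=[をがには])", "", s)


def step_s12(s):
    """Second paraphrase process: 'Aを実行する' -> 'Aする'."""
    return s.replace("を実行する", "する")


def step_s13(s):
    """Split 'AをBしCする' into 'AをBする' and 'AをCする'."""
    match = re.match(r"^(.+?を)(.+?)し(.+?)する$", s)
    if match is None:
        return [s]
    obj, verb1, verb2 = match.groups()
    return [f"{obj}{verb1}する", f"{obj}{verb2}する"]


sentence_101 = "アナログの画像を入力し変換を実行する"
sentence_102 = step_s11(sentence_101)
sentence_103 = step_s12(sentence_102)
sentences_104 = step_s13(sentence_103)
```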

In the example shown in FIG. 5, one could consider applying the fourth paraphrase process to the paraphrase sentence 「アナログを入力し変換する」 ("input analog and convert") 103 to obtain the paraphrase sentence 「アナログを変換する」 ("convert analog"). However, in the paraphrase sentence 103, 「入力」 ("input") and 「変換」 ("conversion") are parallel with respect to the important word 「アナログ」 ("analog") and are considered to have the same weight, so no paraphrase process that omits only one of them is performed.

As described above, depending on the result of the dependency analysis that the dependency analysis unit 313 performs on the summary sentence extracted by the target sentence extraction unit 312, the number of paraphrase sentences generated from a summary sentence is not limited to one; two or more paraphrase sentences may be generated, as shown in FIG. 5.

Next, a specific example of the paraphrase process when the summary sentence contains two important words will be described with reference to FIG. 6. Here, the important words extracted by the important word extraction unit 311 are assumed to be 「アナログ」 ("analog") and 「変換」 ("conversion"), and the summary sentence extracted by the target sentence extraction unit 312 is assumed to be 「アナログの画像を入力し変換を実行する」 ("input an analog image and execute conversion"). That is, the summary sentence 201 shown in FIG. 6 is the same as the summary sentence 101 shown in FIG. 5.

In FIG. 6, as in FIG. 5, the summary sentence 201 and its paraphrase sentences 202 to 204 are shown in the form of dependency analysis results.

First, applying the first paraphrase process to the summary sentence 201 rewrites the summary sentence 「アナログの画像を入力し変換を実行する」 ("input an analog image and execute conversion") 201 as the paraphrase sentence 「アナログを入力し変換を実行する」 ("input analog and execute conversion") 202 (step S21).

Next, applying the second paraphrase process to the paraphrase sentence 202 rewrites 「アナログを入力し変換を実行する」 ("input analog and execute conversion") 202 as the paraphrase sentence 「アナログを入力し変換する」 ("input analog and convert") 203 (step S22).

In the example shown in FIG. 6, unlike the case in which the summary sentence contains one important word, 「入力」 ("input") and 「変換」 ("conversion") are parallel with respect to the important word 「アナログ」 ("analog"), but 「変換」 is itself an important word and is therefore considered to carry more weight than 「入力」. The fourth paraphrase process is accordingly applied to the paraphrase sentence 203. As a result, the paraphrase sentence 「アナログを入力し変換する」 ("input analog and convert") 203 is rewritten as the paraphrase sentence 「アナログを変換する」 ("convert analog") 204 (step S23).

このように、言い換え文生成部３２２は、要旨文「アナログの画像を入力し変換を実行する」２０１に対して言い換え処理を実行することにより、言い換え文「アナログを変換する」２０４を生成する。この言い換え文生成部３２２によって生成された言い換え文「アナログを変換する」２０４は、言い換え文格納部２４に格納される。 As described above, the paraphrase sentence generation unit 322 generates the paraphrase sentence “convert analog” 204 by executing the paraphrase process on the abstract sentence “input analog image and execute conversion” 201. The paraphrase text “convert analog” 204 generated by the paraphrase text generation unit 322 is stored in the paraphrase text storage unit 24.

なお、要旨文に重要語が３つ以上である場合には、当該重要語のうちの２つの重要語の組み合わせ毎に、上記した図６に示すような処理が実行される。 When there are three or more important words in the summary sentence, the above-described process shown in FIG. 6 is executed for each combination of two important words of the important words.
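
The pairing of important words described above can be sketched as follows. This is only an illustration: the function name `keyword_pairs` and the English glosses of the important words are assumptions, and the text does not say in what order the pairs are processed.

```python
from itertools import combinations


def keyword_pairs(keywords):
    """Return every unordered pair of distinct important words.

    Hypothetical helper: each returned pair would be handed to the
    two-keyword paraphrase process illustrated in FIG. 6.
    """
    return list(combinations(keywords, 2))


# Three important words give three pairs to process.
pairs = keyword_pairs(["analog", "conversion", "image"])
for first, second in pairs:
    print(first, second)
```

With n important words this yields n(n-1)/2 pairs, so the number of FIG. 6 style passes grows quadratically with the number of important words in a sentence.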

再び図３に戻ると、文書格納部２２に格納されている全ての文書について上記したステップＳ１〜ステップＳ５の処理が実行されたか否かが判定される（ステップＳ６）。 Returning to FIG. 3 again, it is determined whether or not the processing in steps S1 to S5 described above has been executed for all documents stored in the document storage unit 22 (step S6).

文書格納部２２に格納されている全ての文書について上記したステップＳ１〜ステップＳ５の処理が実行されていないと判定された場合（ステップＳ６のＮＯ）、上記したステップＳ１に戻って処理が繰り返される。この場合、ステップＳ１においては、ステップＳ１〜ステップＳ５の処理が実行されていない文書が文書格納部２２から取得される。 When it is determined that the processing in steps S1 to S5 described above has not been executed for all the documents stored in the document storage unit 22 (NO in step S6), the processing returns to step S1 described above and is repeated. In this case, in step S1, a document for which the processing in steps S1 to S5 has not yet been executed is acquired from the document storage unit 22.

一方、文書格納部２２に格納されている全ての文書についてステップＳ１〜ステップＳ５の処理が実行されたと判定された場合（ステップＳ６のＹＥＳ）、クラスタリング部３３は、言い換え文格納部２４に格納されている言い換え文を分類（クラスタリング）する（ステップＳ７）。クラスタリング部３３は、例えば言い換え文に含まれる文字列の出現頻度に基づいて文書分類を実行する。ここでは、言い換え文に含まれる文字列の出現頻度に基づいて分類処理が実行されるものとして説明したが、言い換え文の分類方法についてはここで説明した方法以外にも種々の方法が考えられる。 On the other hand, when it is determined that the processing in steps S1 to S5 has been executed for all the documents stored in the document storage unit 22 (YES in step S6), the clustering unit 33 classifies (clusters) the paraphrase sentences stored in the paraphrase text storage unit 24 (step S7). The clustering unit 33 performs the classification based on, for example, the appearance frequency of character strings included in the paraphrase sentences. Although the classification has been described here as being based on the appearance frequency of character strings included in the paraphrase sentences, various other methods are conceivable for classifying the paraphrase sentences.
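
As a rough illustration of step S7, the sketch below groups documents whose paraphrase sentence is the same character string and orders the groups by how often that string occurs. This is a deliberately simplified reading of "appearance frequency of character strings"; the function name, the one-paraphrase-per-document input, and the English glosses are all assumptions, not the patent's own procedure.

```python
from collections import defaultdict


def cluster_by_paraphrase(paraphrase_by_doc):
    """Group document IDs by identical paraphrase string.

    paraphrase_by_doc: dict mapping a document ID to its paraphrase
    sentence (assumed here to be a single string per document).
    Returns a dict mapping each paraphrase string to the documents
    that share it, most frequent string first.
    """
    clusters = defaultdict(list)
    for doc_id, paraphrase in paraphrase_by_doc.items():
        clusters[paraphrase].append(doc_id)
    # Sort clusters by size so the most frequent string comes first.
    ordered = sorted(clusters.items(), key=lambda item: len(item[1]), reverse=True)
    return dict(ordered)


result = cluster_by_paraphrase({
    "d1": "convert analog",
    "d2": "input analog",
    "d3": "convert analog",
})
print(result)
```

Exact string equality is the crudest possible grouping criterion; a real system would more likely compare substrings or weighted terms, as the text itself notes that other classification methods are conceivable.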

なお、言い換え文格納部２４には、上記したように文書格納部２２に格納されている文書毎に言い換え文が格納されている。 The paraphrase text storage unit 24 stores a paraphrase text for each document stored in the document storage unit 22 as described above.

つまり、クラスタリング部３３は、言い換え文格納部２４に格納されている言い換え文を分類することにより、文書格納部２２に格納されている文書群の分類を行う。クラスタリング部３３による文書格納部２２に格納されている文書群の分類結果は、文書クラスタ格納部２５に格納される。 That is, the clustering unit 33 classifies the document group stored in the document storage unit 22 by classifying the paraphrase sentences stored in the paraphrase text storage unit 24. The result of the clustering unit 33 classifying the document group stored in the document storage unit 22 is stored in the document cluster storage unit 25.

上記したように本実施形態においては、文書格納部２２に格納されている文書毎に、重要語抽出部３１１によって抽出された重要語が含まれる要旨文を抽出する。本実施形態においては、抽出された要旨文に対して係り受け解析を実行し、重要語及び係り受け解析結果に基づいて要旨文に対して言い換え処理を行う。したがって、本実施形態においては、文書格納部２２に格納されている文書毎の言い換え文を分類することにより、当該文書群の分類を行うことが可能となる。 As described above, in the present embodiment, for each document stored in the document storage unit 22, a summary sentence including the important word extracted by the important word extraction unit 311 is extracted. In the present embodiment, dependency analysis is performed on the extracted summary sentence, and paraphrase processing is performed on the summary sentence based on the important word and the dependency analysis result. Therefore, in the present embodiment, it is possible to classify the document group by classifying the paraphrase text for each document stored in the document storage unit 22.

本実施形態においては、例えば文書格納部２２に格納されている文書全体に基づいて当該文書の分類を行う場合と比較して、言い換え文のみについて係り受け解析等の分類処理が実行される、つまり、当該文書において重要でない文等については分類処理が実行されないため、分類精度を向上させ、かつ、処理量を軽減することが可能となる。 In the present embodiment, compared with, for example, classifying a document based on the entire document stored in the document storage unit 22, classification processing such as dependency analysis is executed only on the paraphrase sentences. That is, the classification processing is not executed on sentences and the like that are unimportant in the document, so the classification accuracy can be improved and the processing load can be reduced.

［第２の実施形態］ [Second Embodiment]
次に、図７を参照して、本発明の第２の実施形態について説明する。図７は、本実施形態に係る文書処理装置の主として機能構成を示すブロック図である。なお、前述した図２と同様の部分には同一参照符号を付してその詳しい説明を省略する。ここでは、図２と異なる部分について主に述べる。 Next, a second embodiment of the present invention will be described with reference to FIG. 7. FIG. 7 is a block diagram mainly showing the functional configuration of the document processing apparatus according to the present embodiment. Parts that are the same as those in FIG. 2 described above are denoted by the same reference numerals, and their detailed description is omitted. Here, the parts that differ from FIG. 2 will mainly be described.

また、本実施形態に係る文書処理装置のハードウェア構成は、前述した第１の実施形態と同様であるため、適宜、図１を用いて説明する。以下の実施形態についても同様である。 The hardware configuration of the document processing apparatus according to this embodiment is the same as that of the first embodiment described above, and will be described with reference to FIG. 1 as appropriate. The same applies to the following embodiments.

本実施形態においては、言い換え文格納部２４に格納された言い換え文（言い換え文生成部３２２によって生成された言い換え文）の文中から後述する素性の組を抽出し、当該素性の組に基づいて文書格納部２２に格納されている文書毎に文書ベクトルを生成する点が、前述した第１の実施形態とは異なる。 The present embodiment differs from the first embodiment described above in that feature sets, described later, are extracted from the paraphrase sentences stored in the paraphrase text storage unit 24 (the paraphrase sentences generated by the paraphrase text generation unit 322), and a document vector is generated for each document stored in the document storage unit 22 based on those feature sets.

図７に示すように、文書処理装置４０は、テンプレート生成部４１、素性抽出部４２、素性出力部４３及び文書ベクトル処理部４４を含む。本実施形態において、これらの各部４１乃至４４は、図１に示すコンピュータ１０が外部記憶装置２０に格納されているプログラム２１を実行することにより実現されるものとする。 As illustrated in FIG. 7, the document processing apparatus 40 includes a template generation unit 41, a feature extraction unit 42, a feature output unit 43, and a document vector processing unit 44. In the present embodiment, these units 41 to 44 are realized by the computer 10 shown in FIG. 1 executing the program 21 stored in the external storage device 20.

また、文書処理装置４０は、素性格納部２６及び文書ベクトル格納部２７を含む。本実施形態において、素性格納部２６及び文書ベクトル格納部２７は、例えば外部記憶装置２０に格納される。 The document processing apparatus 40 includes a feature storage unit 26 and a document vector storage unit 27. In the present embodiment, the feature storage unit 26 and the document vector storage unit 27 are stored in, for example, the external storage device 20.

テンプレート生成部４１は、重要語抽出部３１１によって抽出された重要語及び係り受け解析部３１３による要旨文の係り受け解析結果に基づいて、当該重要語から構成されるテンプレートを生成する。テンプレート生成部４１によって生成されるテンプレートのデータ構造の詳細については後述する。 The template generation unit 41 generates a template composed of the key words based on the key word extracted by the key word extraction unit 311 and the dependency analysis result of the summary sentence by the dependency analysis unit 313. Details of the data structure of the template generated by the template generation unit 41 will be described later.

素性抽出部４２は、言い換え文格納部２４に格納された言い換え文（言い換え文生成部３２２によって生成された言い換え文）の文中から素性の組を抽出する。素性抽出部４２は、言い換え文格納部２４に格納された言い換え文に対し、テンプレート生成部４１によって生成されたテンプレートをマッチングさせる。これにより、素性抽出部４２は、言い換え文格納部２４に格納された言い換え文に含まれる重要語を含む素性の組を当該言い換え文から抽出する。素性抽出部４２は、抽出された素性の組を素性格納部２６に格納する。 The feature extraction unit 42 extracts a set of features from the sentence of the paraphrase text (paraphrase text generated by the paraphrase text generation unit 322) stored in the paraphrase text storage unit 24. The feature extraction unit 42 matches the template generated by the template generation unit 41 with the paraphrase text stored in the paraphrase text storage unit 24. As a result, the feature extraction unit 42 extracts a feature set including important words included in the paraphrase text stored in the paraphrase text storage unit 24 from the paraphrase text. The feature extraction unit 42 stores the extracted feature set in the feature storage unit 26.

素性抽出部４２によって抽出される素性の組には、例えば「目的語」及び「動詞」の組または「目的語」、「道具格」及び「動詞」の組等が含まれる。 The feature sets extracted by the feature extraction unit 42 include, for example, a set of “object” and “verb”, or a set of “object”, “instrumental case”, and “verb”.

素性出力部４３は、素性抽出部４２によって抽出された素性の組をユーザに出力（表示）する。 The feature output unit 43 outputs (displays) the feature set extracted by the feature extraction unit 42 to the user.

文書ベクトル処理部４４は、文書ベクトル成分値算出部４４１及び文書ベクトル生成部４４２を含む。 The document vector processing unit 44 includes a document vector component value calculation unit 441 and a document vector generation unit 442.

文書ベクトル成分値算出部４４１は、文書格納部２２に格納されている文書毎に、文書ベクトル成分値を算出する。文書ベクトル成分値算出部４４１は、文書格納部２２に格納されている文書から抽出された要旨文（対象文抽出部３１２によって抽出された要旨文）における素性格納部２６に格納されている各素性の組の出現頻度に基づいて文書ベクトル成分値を算出する。文書ベクトル成分値算出部４４１は、１つの文書につき、素性格納部２６に格納されている素性の組の数の文書ベクトル成分値を算出する。 The document vector component value calculation unit 441 calculates document vector component values for each document stored in the document storage unit 22. It calculates each document vector component value based on how often each feature set stored in the feature storage unit 26 appears in the summary sentences extracted from the document (the summary sentences extracted by the target sentence extraction unit 312). For one document, the document vector component value calculation unit 441 calculates as many document vector component values as there are feature sets stored in the feature storage unit 26.

なお、文書ベクトル成分値は、例えば相互情報量のような単語の重み算出方法を用いて算出されてもよい。 The document vector component value may be calculated using a word weight calculation method such as a mutual information amount.

文書ベクトル生成部４４２は、文書ベクトル成分値算出部４４１によって算出された文書ベクトル成分値に基づいて、文書格納部２２に格納されている文書毎に文書ベクトルを生成する。 The document vector generation unit 442 generates a document vector for each document stored in the document storage unit 22 based on the document vector component value calculated by the document vector component value calculation unit 441.

文書ベクトル生成部４４２は、文書格納部２２に格納されている文書毎に生成された文書ベクトルを、文書ベクトル格納部２７に格納する。この文書ベクトルは、例えば文書格納部２２に格納されている文書（群）を分類する際に用いられる。 The document vector generation unit 442 stores the document vector generated for each document stored in the document storage unit 22 in the document vector storage unit 27. This document vector is used, for example, when classifying documents (groups) stored in the document storage unit 22.

次に、図８のフローチャートを参照して、本実施形態に係る文書処理装置４０の処理手順について説明する。 Next, a processing procedure of the document processing apparatus 40 according to the present embodiment will be described with reference to the flowchart of FIG.

まず、前述した図３に示すステップＳ１〜ステップＳ５の処理に相当するステップＳ３１〜ステップＳ３５の処理が実行される。なお、ステップＳ３５において生成された言い換え文は、前述したように言い換え文格納部２４に格納される。 First, the processing in steps S31 to S35, which corresponds to the processing in steps S1 to S5 shown in FIG. 3 described above, is executed. The paraphrase sentence generated in step S35 is stored in the paraphrase text storage unit 24 as described above.

次に、テンプレート生成部４１は、重要語抽出部３１１によって抽出された重要語及び係り受け解析部３１３による係り受け解析結果に基づいて、当該重要語から構成されるテンプレートを生成する（ステップＳ３６）。 Next, the template generation unit 41 generates a template composed of the important words based on the important words extracted by the important word extraction unit 311 and the dependency analysis result by the dependency analysis unit 313 (step S36). .

ここで、図９は、例えば重要語抽出部３１１によって抽出された重要語が「アナログ」である場合にテンプレート生成部４１によって生成されるテンプレートの一例を示す。テンプレート生成部４１は、係り受け解析部３１３による係り受け解析結果により、重要語「アナログ」の品詞（ここでは、名詞）を取得する。これにより、テンプレート生成部４１は、図９に示すように重要語「アナログ」を「目的語」とするテンプレート３０１を生成する。 Here, FIG. 9 shows an example of a template generated by the template generation unit 41 when the keyword extracted by the keyword extraction unit 311 is “analog”, for example. The template generation unit 41 acquires the part of speech (here, a noun) of the important word “analog” based on the dependency analysis result by the dependency analysis unit 313. As a result, the template generation unit 41 generates a template 301 having the keyword “analog” as the “object” as shown in FIG.

また、図１０は、例えば重要語抽出部３１１によって抽出された重要語が「アナログ」及び「変換」である場合にテンプレート生成部４１によって生成されるテンプレートの一例を示す。テンプレート生成部４１は、係り受け解析部３１３による係り受け解析結果により、重要語「アナログ」及び「変換」の品詞（ここでは、名詞及び動詞）を取得する。これにより、テンプレート生成部４１は、図１０に示すように重要語「アナログ」を「目的語」、重要語「変換」を「動詞」とするテンプレート３０２を生成する。 FIG. 10 shows an example of a template generated by the template generation unit 41 when the important words extracted by the important word extraction unit 311 are “analog” and “conversion”, for example. The template generation unit 41 acquires parts of speech (here, nouns and verbs) of the important words “analog” and “conversion” based on the dependency analysis result by the dependency analysis unit 313. As a result, the template generation unit 41 generates a template 302 having the important word “analog” as “object” and the important word “conversion” as “verb” as shown in FIG.

なお、重要語抽出部３１１によって抽出された重要語が３つ以上である場合には、当該重要語の中から例えば名詞及び動詞の組がテンプレートとして生成される。また、重要語抽出部３１１によって抽出された重要語が２つである場合であっても、当該２つの重要語がともに名詞である場合には、それぞれの重要語について上記した図９で説明したようなテンプレート（当該重要語を「目的語」とするテンプレート）が生成される。つまり、重要語抽出部３１１によって抽出された重要語に応じて、複数のテンプレートが生成される場合がある。 When there are three or more important words extracted by the important word extraction unit 311, for example, a noun and verb pair is generated as a template from the important words. Further, even when there are two important words extracted by the important word extraction unit 311, when the two important words are both nouns, each important word has been described with reference to FIG. 9 described above. Such a template (a template having the important word as the “object”) is generated. That is, a plurality of templates may be generated according to the important word extracted by the important word extraction unit 311.
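
The template rules described for FIGS. 9 and 10 can be sketched as below. The dictionary layout, the function name `build_templates`, and the choice to pair every noun with every verb when there are three or more important words are assumptions made for illustration; the text only says that noun and verb pairs are generated "for example".

```python
def build_templates(keywords_with_pos):
    """Build templates from important words and their parts of speech.

    keywords_with_pos: list of (word, pos) tuples, where pos is
    "noun" or "verb" as obtained from the dependency analysis result.
    Returns a list of templates, each a dict from role to word.
    """
    nouns = [word for word, pos in keywords_with_pos if pos == "noun"]
    verbs = [word for word, pos in keywords_with_pos if pos == "verb"]
    if nouns and verbs:
        # Noun and verb available: one (object, verb) template per pair.
        return [{"object": n, "verb": v} for n in nouns for v in verbs]
    # Only nouns: one object-only template per important word (FIG. 9 style).
    return [{"object": n} for n in nouns]


print(build_templates([("analog", "noun")]))
print(build_templates([("analog", "noun"), ("conversion", "verb")]))
print(build_templates([("analog", "noun"), ("image", "noun")]))
```

The first call corresponds to template 301, the second to template 302, and the third to the case where two noun important words each receive their own object template. The case of verb-only important words is not covered by the text, so the sketch returns no template for it.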

再び図８に戻ると、素性抽出部４２は、テンプレート生成部４１によって生成されたテンプレートを用いて、言い換え文格納部２４に格納された言い換え文の文中から重要語または文字列（素性）から構成される組（素性の組）を抽出する（ステップＳ３７）。素性抽出部４２は、言い換え文格納部２４に格納された言い換え文に対して、テンプレート生成部４１によって生成されたテンプレートをマッチングさせることにより、素性の組を抽出する。素性の組とは、例えば「目的語」及び「動詞」から構成される。 Returning to FIG. 8, the feature extraction unit 42 uses the template generated by the template generation unit 41 to extract, from the paraphrase sentences stored in the paraphrase text storage unit 24, a set (feature set) composed of important words or character strings (features) (step S37). The feature extraction unit 42 extracts the feature set by matching the template generated by the template generation unit 41 against the paraphrase sentences stored in the paraphrase text storage unit 24. A feature set is composed of, for example, an “object” and a “verb”.

なお、言い換え文、重要語及びテンプレートによっては、上記した「目的語」、「道具格」及び「動詞」から構成される素性の組が抽出される場合もある。また、上記したように複数のテンプレートが生成された場合には、当該テンプレート毎に素性の組の抽出処理が実行される。 Depending on the paraphrase sentence, the important words, and the template, a feature set composed of the “object”, “instrumental case”, and “verb” mentioned above may be extracted. When a plurality of templates have been generated as described above, the feature set extraction processing is executed for each template.

素性抽出部４２によって抽出された素性の組は、素性格納部２６に格納される。このとき、素性格納部２６においては、例えば異なる言い換え文から抽出された同一の素性の組は１つの素性の組として扱われる。 The feature set extracted by the feature extraction unit 42 is stored in the feature storage unit 26. At this time, in the feature storage unit 26, for example, the same feature set extracted from different paraphrase sentences is treated as one feature set.

また、素性出力部４３は、素性抽出部４２によって抽出された素性の組を例えばユーザに対して出力（表示）する。これにより、ユーザは、素性抽出部４２によって抽出された素性の組、つまり、素性の組に含まれる表現（文字列）を確認することができる。 The feature output unit 43 outputs (displays) the feature set extracted by the feature extraction unit 42 to, for example, the user. Thereby, the user can check the feature set extracted by the feature extraction unit 42, that is, the expression (character string) included in the feature set.

ここで、素性抽出部４２による素性の組の抽出処理について具体的に説明する。例えば重要語が「アナログ」の１つであり、当該重要語「アナログ」のテンプレートは上記した図９に示すテンプレート３０１であり、言い換え文格納部２４に格納された言い換え文は、前述した図５において説明したように「アナログを入力する」及び「アナログを変換する」１０４であるものとする。この場合には、言い換え文「アナログを入力する」及び「アナログを変換する」１０４において重要語「アナログ」は目的語として用いられているため、素性抽出部４２は、「目的語」及び「動詞」から構成される（アナログ，入力）及び（アナログ，変換）の素性の組を抽出する。 Here, the feature set extraction processing by the feature extraction unit 42 will be described concretely. Assume, for example, that there is one important word, “analog”, that the template for the important word “analog” is the template 301 shown in FIG. 9 described above, and that the paraphrase sentences stored in the paraphrase text storage unit 24 are “input analog” and “convert analog” 104, as described with reference to FIG. 5. In this case, the important word “analog” is used as the object in the paraphrase sentences “input analog” and “convert analog” 104, so the feature extraction unit 42 extracts the feature sets (analog, input) and (analog, conversion), each composed of an “object” and a “verb”.

一方、重要語が「アナログ」及び「変換」の２つであり、当該重要語「アナログ」及び「変換」のテンプレートは上記した図１０に示すテンプレート３０２であり、言い換え文格納部２４に格納された言い換え文は、前述した図６において説明したように「アナログを変換する」２０４であるものとする。この場合には、言い換え文「アナログを変換する」２０４において重要語「アナログ」は目的語として用いられており、重要語「変換」は動詞として用いられているため、この言い換え文は図１０に示すテンプレート３０２にマッチする。このため、素性抽出部４２は、「目的語」及び「動詞」から構成される（アナログ，変換）の素性の組を抽出する。 On the other hand, assume that there are two important words, “analog” and “conversion”, that the template for these important words is the template 302 shown in FIG. 10 described above, and that the paraphrase sentence stored in the paraphrase text storage unit 24 is “convert analog” 204, as described with reference to FIG. 6. In this case, in the paraphrase sentence “convert analog” 204 the important word “analog” is used as the object and the important word “conversion” is used as the verb, so this paraphrase sentence matches the template 302 shown in FIG. 10. The feature extraction unit 42 therefore extracts the feature set (analog, conversion), composed of an “object” and a “verb”.

ここでは、重要語が１つ及び２つの場合において言い換え文がテンプレートにマッチする場合について説明したが、以下、図１１を参照して、言い換え文がテンプレートにマッチしない場合に抽出される素性の組について説明する。 The cases described so far are those in which the paraphrase sentence matches the template, with one and with two important words. The feature sets extracted when the paraphrase sentence does not match the template will now be described with reference to FIG. 11.

例えば重要語が「アナログ」及び「変換」の２つであり、当該重要語「アナログ」及び「変換」のテンプレートは上記した図１０に示すテンプレート３０２であるものとする。 Assume, for example, that there are two important words, “analog” and “conversion”, and that the template for these important words is the template 302 shown in FIG. 10 described above.

また、対象文抽出部３１２によって抽出された要旨文は、「文字を音声に変換しアナログで出力する」であるものとする。 Further, the abstract sentence extracted by the target sentence extraction unit 312 is assumed to be “converting characters into speech and outputting them in analog”.

なお、図１１においては、要旨文「文字を音声に変換しアナログで出力する」４０１が係り受け解析結果の形式で示されている。 In FIG. 11, the summary sentence “convert characters into speech and output in analog” 401 is shown in the form of a dependency analysis result.

図１１に示すように、この要旨文４０１を例えば枝刈りすることにより、要旨文「文字を音声に変換しアナログで出力する」４０１が言い換え文「変換しアナログで出力する」４０２に言い換えられたものとする。つまり、言い換え文「変換しアナログで出力する」４０２が、言い換え文生成部３２２によって生成されたものとする。 As shown in FIG. 11, assume that the summary sentence “convert characters into speech and output in analog” 401 has been paraphrased into the paraphrase sentence “convert and output in analog” 402 by, for example, pruning the summary sentence 401. In other words, assume that the paraphrase sentence “convert and output in analog” 402 has been generated by the paraphrase text generation unit 322.

この場合、言い換え文「変換しアナログで出力する」４０２において重要語「変換」は動詞として用いられているが、重要語「アナログ」は目的語として用いられていないため、当該言い換え文４０２は、図１０に示すテンプレート３０２にマッチしない。 In this case, in the paraphrase sentence “convert and output in analog” 402, the important word “conversion” is used as a verb, but since the important word “analog” is not used as an object, the paraphrase sentence 402 is It does not match the template 302 shown in FIG.

この場合、素性抽出部４２は、重要語「アナログ」及び「変換」を素性の組とする。つまり、素性抽出部４２は、素性の組として例えば（変換，アナログ）及び（アナログ，変換）を抽出する。この場合には、上記したテンプレートにマッチする場合と異なり、「目的語」及び「動詞」が考慮されていない素性の組が抽出されることになる。 In this case, the feature extraction unit 42 sets the key words “analog” and “conversion” as a set of features. That is, the feature extraction unit 42 extracts, for example, (conversion, analog) and (analog, conversion) as feature sets. In this case, unlike the case of matching with the template described above, a set of features in which “object” and “verb” are not considered is extracted.

上記したように、対象文抽出部３１２によって抽出された要旨文によっては、言い換え文生成部３２２によって生成された言い換え文とテンプレート生成部４１によって生成されたテンプレートがマッチせず、「目的語」及び「動詞」が考慮されていない素性の組が抽出される。 As described above, depending on the summary sentence extracted by the target sentence extraction unit 312, the paraphrase sentence generated by the paraphrase text generation unit 322 may not match the template generated by the template generation unit 41, in which case a feature set that does not take “object” and “verb” into account is extracted.

なお、言い換え文に含まれる重要語が１つである場合に当該言い換え文がテンプレートにマッチしない場合には、重要語を素性の組にすることができないため、素性の組は抽出されない。 Note that if the paraphrase text includes one important word and the paraphrase text does not match the template, the key word cannot be made into a feature set, and the feature set is not extracted.
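
The matching and fallback behaviour described for FIGS. 9 to 11 can be sketched as follows. The representation of a paraphrase sentence as (word, role) tuples, the role labels, and the function name `extract_feature_sets` are assumptions for illustration; how roles are actually read off the dependency analysis result is not specified in this section.

```python
from itertools import permutations


def extract_feature_sets(paraphrase, template, keywords):
    """Extract feature sets from one paraphrase sentence.

    paraphrase: list of (word, role) tuples, role being e.g. "object",
        "verb" or "instrument" (hypothetical labels).
    template: dict from role to important word, containing at least "object".
    keywords: the important words the template was built from.
    """
    roles = {}
    for word, role in paraphrase:
        roles.setdefault(role, []).append(word)
    # The template matches when every slot is filled by its word in that role.
    matched = all(word in roles.get(role, []) for role, word in template.items())
    if matched:
        obj = template["object"]
        verbs = [template["verb"]] if "verb" in template else roles.get("verb", [])
        return {(obj, verb) for verb in verbs}
    if len(keywords) >= 2:
        # No match: pair the important words without regard to object/verb.
        return set(permutations(keywords, 2))
    # No match and a single important word: nothing can be paired.
    return set()


print(extract_feature_sets(
    [("conversion", "verb"), ("analog", "instrument"), ("output", "verb")],
    {"object": "analog", "verb": "conversion"},
    ["analog", "conversion"],
))
```

Returning a set means that an identical feature set extracted from different paraphrase sentences collapses into one entry when the results are merged, which is the behaviour described for the feature storage unit 26.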

再び図８に戻ると、文書格納部２２に格納されている全ての文書について上記したステップＳ３１〜ステップＳ３７の処理が実行されたか否かが判定される（ステップＳ３８）。 Returning to FIG. 8 again, it is determined whether or not the processing in steps S31 to S37 described above has been executed for all the documents stored in the document storage unit 22 (step S38).

文書格納部２２に格納されている全ての文書について上記したステップＳ３１〜ステップＳ３７の処理が実行されていないと判定された場合（ステップＳ３８のＮＯ）、上記したステップＳ３１に戻って処理が繰り返される。この場合、ステップＳ３１においては、ステップＳ３１〜ステップＳ３７の処理が実行されていない文書が文書格納部２２から取得される。 When it is determined that the processing in steps S31 to S37 described above has not been executed for all the documents stored in the document storage unit 22 (NO in step S38), the processing returns to step S31 described above and is repeated. In this case, in step S31, a document for which the processing in steps S31 to S37 has not yet been executed is acquired from the document storage unit 22.

一方、文書格納部２２に格納されている全ての文書についてステップＳ３１〜ステップＳ３７の処理が実行されたと判定された場合（ステップＳ３８のＹＥＳ）、文書ベクトル処理部４４に含まれる文書ベクトル成分値算出部４４１は、文書格納部２２に格納されている文書の１つを、当該文書格納部２２から取得する（ステップＳ３９）。以下、文書ベクトル成分値算出部４４１によって取得された文書を対象文書と称する。 On the other hand, when it is determined that the processing in steps S31 to S37 has been executed for all the documents stored in the document storage unit 22 (YES in step S38), the document vector component value included in the document vector processing unit 44 is calculated. The unit 441 acquires one of the documents stored in the document storage unit 22 from the document storage unit 22 (step S39). Hereinafter, the document acquired by the document vector component value calculation unit 441 is referred to as a target document.

次に、文書ベクトル成分値算出部４４１は、対象文書の文書ベクトル成分値を、当該対象文書から抽出された要旨文及び素性格納部２６に格納されている素性の組に基づいて算出する（ステップＳ４０）。文書ベクトル成分値算出部４４１は、対象文書から抽出された要旨文における当該素性の組の出現頻度を示す文書ベクトル成分値を、素性格納部２６に格納されている素性の組毎に算出する。つまり、文書ベクトル成分値算出部４４１は、１つの対象文書について、素性格納部２６に格納されている素性の組毎の文書ベクトル成分値を算出する。 Next, the document vector component value calculation unit 441 calculates the document vector component values of the target document based on the summary sentences extracted from the target document and the feature sets stored in the feature storage unit 26 (step S40). For each feature set stored in the feature storage unit 26, the document vector component value calculation unit 441 calculates a document vector component value indicating how often that feature set appears in the summary sentences extracted from the target document. That is, for one target document, the document vector component value calculation unit 441 calculates one document vector component value per feature set stored in the feature storage unit 26.

対象文書から抽出された要旨文において素性の組が出現するとは、当該要旨文において素性の組の例えば「目的語」及び「動詞」の組が出現することを言う。具体的には、例えば素性の組が（アナログ，入力）である場合を想定すると、対象文書から抽出された要旨文中において「アナログ」が目的語として、「入力」が動詞として用いられている場合には、当該要旨文におけるこの素性の組の出現頻度は例えば１となる。なお、１つの要旨文においてこの素性の組が２回以上出現する場合には、出現頻度の値は大きくなる。 A feature set appears in a summary sentence extracted from a target document means that a feature set, for example, a “object” and “verb” set appears in the summary sentence. Specifically, for example, assuming that the feature set is (analog, input), when “analog” is used as the object and “input” is used as the verb in the abstract extracted from the target document The appearance frequency of this feature set in the summary sentence is, for example, 1. In addition, when this feature group appears twice or more in one summary sentence, the value of the appearance frequency becomes large.

ここでは、文書ベクトル成分値算出部４４１が対象文書の要旨文における素性の組の出現頻度を文書ベクトル成分値として算出するものとして説明したが、上記したように例えば相互情報量のような単語の重み算出方法を用いて文書ベクトル成分値を算出する構成であっても構わない。相互情報量とは、文書中に出現する２つの単語が同時に出現する度合い等により、当該単語間の関連度を表す量である。 Here, the document vector component value calculation unit 441 has been described as calculating the appearance frequency of the feature set in the abstract of the target document as the document vector component value. However, as described above, for example, a word such as a mutual information amount is calculated. The document vector component value may be calculated using a weight calculation method. The mutual information amount is an amount representing the degree of association between the words based on the degree of appearance of two words appearing in the document at the same time.
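
For reference, one common way to quantify how strongly two words co-occur is pointwise mutual information, PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ). The sketch below computes it from raw counts. Whether the patent intends this exact variant is not stated, so the formula choice, the function name, and the parameter names are assumptions.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information of two words, in bits.

    count_xy: number of contexts containing both words.
    count_x, count_y: number of contexts containing each word.
    total: total number of contexts (for example, summary sentences).
    """
    if count_xy == 0:
        # The words never co-occur, so the logarithm diverges.
        return float("-inf")
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Two words that co-occur twice as often as chance would predict.
print(pmi(count_xy=2, count_x=4, count_y=4, total=16))
```

A value of 0 means the two words co-occur exactly as often as independence would predict; positive values indicate association. Used as a component value, such a weight would replace the raw appearance frequency of the feature set.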

文書ベクトル生成部４４２は、対象文書の文書ベクトルを、当該文書ベクトル成分値算出部４４１によって算出された文書ベクトル成分値に基づいて生成する（ステップＳ４１）。 The document vector generation unit 442 generates a document vector of the target document based on the document vector component value calculated by the document vector component value calculation unit 441 (step S41).

文書ベクトル生成部４４２は、生成された文書ベクトルを文書ベクトル格納部２７に格納する。この文書ベクトル格納部２７に格納された文書ベクトルは、例えば文書格納部２２に格納されている複数の文書を分類する際に用いられる。 The document vector generation unit 442 stores the generated document vector in the document vector storage unit 27. The document vectors stored in the document vector storage unit 27 are used, for example, when classifying the plurality of documents stored in the document storage unit 22.

例えば素性の組（アナログ、入力）及び（アナログ、変換）が素性格納部２６に格納されている場合を想定する。この場合、対象文書から抽出された要旨文中における素性の組（アナログ、入力）の出現頻度が１、素性の組（アナログ、変換）の出現頻度が０であれば、対象文書ｄの文書ベクトルは、ｄ（１，０）となる。 For example, assume that the feature sets (analog, input) and (analog, conversion) are stored in the feature storage unit 26. In this case, if the appearance frequency of the feature set (analog, input) in the summary sentences extracted from the target document is 1 and the appearance frequency of the feature set (analog, conversion) is 0, the document vector of the target document d is d(1, 0).

なお、この文書ベクトルｄ（１，０）の１は、対象文書における素性の組（アナログ，入力）の文書ベクトル成分値である。同様に、文書ベクトルｄ（１，０）の０は、対象文書における素性の組（アナログ，変換）の文書ベクトル成分値である。 Note that 1 in the document vector d (1, 0) is a document vector component value of a feature set (analog, input) in the target document. Similarly, 0 in the document vector d (1, 0) is a document vector component value of a feature set (analog, conversion) in the target document.

上記したように、文書ベクトルは、素性の組毎に算出された文書ベクトル成分値を組み合わせることによって生成される。 As described above, the document vector is generated by combining the document vector component values calculated for each feature set.
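
The d(1, 0) example above can be reproduced with a short sketch. The function name `document_vector` and the assumption that the occurrences of feature sets in a document's summary sentences are already available as a list of pairs are illustrative only.

```python
from collections import Counter


def document_vector(feature_sets, occurrences):
    """Build a document vector from feature set appearance frequencies.

    feature_sets: ordered list of the feature sets held in the feature
        store; its length is the dimensionality of the vector.
    occurrences: feature sets observed in the summary sentences of the
        target document, one entry per appearance.
    """
    counts = Counter(occurrences)
    # One component per stored feature set, in a fixed order.
    return [counts[feature] for feature in feature_sets]


stored = [("analog", "input"), ("analog", "conversion")]
d = document_vector(stored, [("analog", "input")])
print(d)  # [1, 0]
```

Keeping the order of `stored` fixed across all documents matters, because component i must refer to the same feature set in every vector for the vectors to be comparable.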

上記したようにステップＳ４１の処理が実行されると、文書格納部２２に格納されている全ての文書について上記したステップＳ３９〜ステップＳ４１の処理が実行されたか否かが判定される（ステップＳ４２）。 When the process of step S41 is executed as described above, it is determined whether or not the processes of step S39 to step S41 described above have been executed for all the documents stored in the document storage unit 22 (step S42). .

文書格納部２２に格納されている全ての文書について上記したステップＳ３９〜ステップＳ４１の処理が実行されていないと判定された場合（ステップＳ４２のＮＯ）、上記したステップＳ３９に戻って処理が繰り返される。この場合、ステップＳ３９においては、ステップＳ３９〜ステップＳ４１の処理が実行されていない文書が文書格納部２２から取得される。 If it is determined that the processes in steps S39 to S41 described above have not been executed for all the documents stored in the document storage unit 22 (NO in step S42), the process returns to the above step S39 and is repeated. . In this case, in step S39, a document for which the processes in steps S39 to S41 are not executed is acquired from the document storage unit 22.

一方、文書格納部２２に格納されている全ての文書についてステップＳ３９〜ステップＳ４１の処理が実行されたと判定された場合（ステップＳ４２のＹＥＳ）、処理は終了される。 On the other hand, when it is determined that the processing in steps S39 to S41 has been executed for all the documents stored in the document storage unit 22 (YES in step S42), the processing ends.

上記したように、文書格納部２２に格納されている全ての文書から抽出された素性の組の数（異なり数）を次元数とするベクトルを文書ベクトルとし、当該文書毎に文書ベクトルの値を算出することにより当該各文書に対応する文書ベクトルが生成される。 As described above, a document vector is a vector whose number of dimensions equals the number of distinct feature sets extracted from all the documents stored in the document storage unit 22, and the document vector corresponding to each document is generated by calculating the values of that vector for the document.

上記したように本実施形態においては、文書格納部２２に格納されている文書毎に、重要語抽出部３１１によって抽出された重要語が含まれる要旨文を抽出する。本実施形態においては、抽出された要旨文に対して係り受け解析を実行し、重要語及び係り受け解析結果に基づいて要旨文に対して言い換え処理を行うことにより、文書分類に対して適切な素性の組の抽出が可能となる。したがって、本実施形態においては、文書分類において適切な素性の組に基づいて文書ベクトルを生成することができるため、当該文書ベクトルを用いて行われる文書分類の精度を向上させることができる。 As described above, in the present embodiment, for each document stored in the document storage unit 22, a summary sentence including the important word extracted by the important word extraction unit 311 is extracted. In this embodiment, dependency analysis is performed on the extracted summary sentence, and the paraphrase process is performed on the summary sentence based on the important word and the dependency analysis result, so that it is appropriate for document classification. The feature set can be extracted. Therefore, in the present embodiment, since a document vector can be generated based on an appropriate feature set in document classification, the accuracy of document classification performed using the document vector can be improved.

また、本実施形態においては、抽出された要旨文に対してのみ係り受け解析を実行するため、無駄な係り受け解析処理を削減することができる。また、本実施形態においては、抽出された素性の組数を文書ベクトルの次元数とすることで、文書分類精度を低下させることなく当該文書ベクトルの次元数を削減することができるため、文書分類処理の高速化を図ることができる。 Further, in the present embodiment, since dependency analysis is performed only on the extracted summary sentence, useless dependency analysis processing can be reduced. In the present embodiment, the number of feature sets extracted is used as the number of dimensions of the document vector, so that the number of dimensions of the document vector can be reduced without reducing the document classification accuracy. Processing speed can be increased.

［第３の実施形態］ [Third Embodiment]
次に、図１２を参照して、本発明の第３の実施形態について説明する。図１２は、本実施形態に係る文書処理装置の主として機能構成を示すブロック図である。なお、前述した図２及び図７と同様の部分には同一参照符号を付してその詳しい説明を省略する。ここでは、図２及び図７と異なる部分について主に述べる。 Next, a third embodiment of the present invention will be described with reference to FIG. 12. FIG. 12 is a block diagram mainly showing the functional configuration of the document processing apparatus according to the present embodiment. Parts that are the same as those in FIGS. 2 and 7 described above are denoted by the same reference numerals, and their detailed description is omitted. Here, the parts that differ from FIGS. 2 and 7 will mainly be described.

本実施形態においては、文書ベクトル格納部２７に格納された文書ベクトルを用いて文書格納部２２に格納されている文書（群）を文書分類（クラスタリング）する点が、前述した第１及び第２の実施形態とは異なる。 In the present embodiment, the document classification (clustering) of the document (group) stored in the document storage unit 22 using the document vector stored in the document vector storage unit 27 is the first and second described above. This is different from the embodiment.

図１２に示すように、文書処理装置５０は、文書分類処理部５１を含む。本実施形態において、文書分類処理部５１は、図１に示すコンピュータ１０が外部記憶装置２０に格納されているプログラム２１を実行することにより実現されるものとする。 As shown in FIG. 12, the document processing apparatus 50 includes a document classification processing unit 51. In the present embodiment, the document classification processing unit 51 is realized by the computer 10 illustrated in FIG. 1 executing the program 21 stored in the external storage device 20.

The document classification processing unit 51 includes a similarity calculation unit 511 and a clustering unit 512. The similarity calculation unit 511 calculates the similarity between document vectors using the document vectors stored in the document vector storage unit 27.

The clustering unit 512 clusters (classifies) the document group stored in the document storage unit 22 based on the similarity values calculated by the similarity calculation unit 511.

The clustering unit 512 stores the classification result for the document group stored in the document storage unit 22 in the document cluster storage unit 25.

Next, the processing procedure of the document processing apparatus 50 according to the present embodiment will be described with reference to the flowchart shown in FIG. 13.

First, steps S51 to S62, which correspond to steps S31 to S42 shown in FIG. 8 described above, are executed. The document vector generated in step S61 is stored in the document vector storage unit 27 as described above.

Next, the similarity calculation unit 511 included in the document classification processing unit 51 calculates the similarity between document vectors based on the per-document document vectors stored in the document vector storage unit 27 (step S63). The similarity calculation unit 511 calculates the similarity between document vectors based on the document vector component value for each feature set in the document vectors.
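The description does not state which similarity measure is used in step S63. As one common choice for vector space models, the following minimal sketch uses cosine similarity. The function names `cosine_similarity` and `similarity_matrix` are hypothetical, and each vector position is assumed to hold the component value for one feature set.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length document vectors."""
    if len(u) != len(v):
        raise ValueError("document vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a document with no matching feature set is treated
        # as having zero similarity to every document.
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise similarities for a list of document vectors."""
    n = len(vectors)
    return [
        [cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
        for i in range(n)
    ]
```

Cosine similarity ignores vector length, so a long document and a short document that use the same feature sets in the same proportions are scored as highly similar.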

The clustering unit 512 executes clustering processing (classification processing) on the document group stored in the document storage unit 22 based on the similarity calculated by the similarity calculation unit 511 (step S64).

The clustering unit 512 clusters (classifies) the document group stored in the document storage unit 22 by gathering together documents whose inter-document similarity values, as calculated by the similarity calculation unit 511, are close (similar).
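The description does not name a specific clustering algorithm for step S64. The following is a minimal sketch of one possible approach, assuming a precomputed pairwise similarity matrix and a fixed threshold: documents are merged into the same cluster whenever their similarity reaches the threshold (single-link grouping). The function name `cluster_by_similarity`, the threshold value, and the sample matrix are hypothetical.

```python
def cluster_by_similarity(sim, threshold=0.5):
    """Group document indices whose pairwise similarity is >= threshold.

    `sim` is a square matrix where sim[i][j] is the similarity between
    documents i and j. Uses a union-find structure (single-link grouping).
    """
    n = len(sim)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical similarities for three documents.
sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
clusters = cluster_by_similarity(sim, threshold=0.5)
```

Single-link grouping is simple but can chain loosely related documents together; other methods such as k-means or complete-link hierarchical clustering could be substituted without changing the surrounding flow.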

The clustering unit 512 stores the classification result for the document group stored in the document storage unit 22 in the document cluster storage unit 25.

As described above, in the present embodiment, as in the second embodiment described above, document vectors can be generated based on feature sets that are appropriate for document classification, and the document group stored in the document storage unit 22 can be classified using those document vectors. The present embodiment therefore makes it possible to improve the classification accuracy for the document group stored in the document storage unit 22.

In the present embodiment, step S63 has been described as being executed when it is determined in step S62 shown in FIG. 13 that all documents have been processed. However, a configuration is also possible in which, after it is determined in step S62 that all documents have been processed, that is, after document vectors have been generated for all documents stored in the document storage unit 22, step S63 is executed only when the user gives an instruction. In other words, the document vector generation process described in the second embodiment and the document group classification process performed using those document vectors may be executed separately.
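Separating the two processes amounts to persisting the generated document vectors and reading them back later when classification is requested. The following minimal sketch shows that decoupling under stated assumptions: a JSON file stands in for the document vector storage unit 27, and the function names `save_document_vectors` and `load_document_vectors` and the file layout are hypothetical.

```python
import json
from pathlib import Path


def save_document_vectors(vectors, path):
    """Phase 1: persist {document_id: [component values]} after generation."""
    Path(path).write_text(json.dumps(vectors), encoding="utf-8")


def load_document_vectors(path):
    """Phase 2: read the stored vectors back when the user requests classification."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

With this split, vector generation can run as a batch job, while similarity calculation and clustering can be rerun on demand without regenerating the vectors.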

The present invention is not limited to the above embodiments as they stand; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the plurality of constituent elements disclosed in the above embodiments. For example, some constituent elements may be deleted from all the constituent elements shown in each embodiment. Furthermore, constituent elements from different embodiments may be combined as appropriate.

DESCRIPTION OF SYMBOLS 10 ... Computer, 20 ... External storage device, 22 ... Document storage unit, 23 ... Synonym phrase dictionary storage unit, 24 ... Paraphrase text storage unit, 25 ... Document cluster storage unit, 26 ... Feature storage unit, 27 ... Document vector storage unit, 30, 40, 50 ... Document processing apparatus, 31 ... Summary sentence specifying unit, 32 ... Paraphrase processing unit, 33 ... Clustering unit, 41 ... Template generation unit, 42 ... Feature extraction unit, 43 ... Feature output unit, 44 ... Document vector processing unit, 51 ... Document classification processing unit, 311 ... Important word extraction unit, 312 ... Target sentence extraction unit, 313 ... Dependency analysis unit, 321 ... Number determination unit, 322 ... Paraphrase text generation unit, 441 ... Document vector component value calculation unit, 442 ... Document vector generation unit, 511 ... Similarity calculation unit, 512 ... Clustering unit.

Claims

A document processing program executed by a computer in a document processing apparatus comprising an external storage device, which has a document storage means for storing a plurality of documents including sentences including character strings and a feature storage means, and the computer, which uses the external storage device,
The program causing the computer to execute the steps of:
For each document stored in the document storage means, extracting a character string that is important in the document as an important word, based on the appearance frequency of the character string in the document;
Extracting, as a summary sentence, a sentence including the extracted important word from the document from which the important word was extracted;
Analyzing a dependency between character strings included in the extracted summary sentence;
Generating a paraphrase text of the summary sentence that includes the important word by simplifying, based on the important word included in the extracted summary sentence and the analysis result, expressions other than the important word included in the summary sentence;
Extracting, from the paraphrase text, a feature set composed of a plurality of character strings including the important word included in the generated paraphrase text;
Storing the extracted feature set in the feature storage means;
For each document stored in the document storage means, calculating a document vector component value based on the appearance frequency, in the summary sentence extracted from the document, of the feature set stored in the feature storage means;
Generating a document vector for each document stored in the document storage means based on the calculated document vector component value; and
Generating, based on the extracted important word and the analysis result, a template in which the important word serves as an object or a verb,
Wherein, in the step of extracting the feature set, the feature set is extracted by matching the generated template against the generated paraphrase text.

The document processing program according to claim 1, further causing the computer to execute the steps of:
Calculating a similarity between the document vectors based on the generated document vector of each document stored in the document storage means; and
Classifying the plurality of documents stored in the document storage means based on the calculated similarity.

A document processing program executed by a computer in a document processing apparatus comprising an external storage device, which has a document storage means for storing a plurality of documents composed of sentences including character strings and a paraphrase text storage means, and the computer, which uses the external storage device,
The program causing the computer to execute the steps of:
For each document stored in the document storage means, extracting a character string that is important in the document as an important word, based on the appearance frequency of the character string in the document;
Extracting, as a summary sentence, a sentence including the extracted important word from the document from which the important word was extracted;
Analyzing a dependency between character strings included in the extracted summary sentence;
Generating a paraphrase text corresponding to the document from which the important word was extracted by simplifying, based on the important word included in the extracted summary sentence and the analysis result, expressions other than the important word included in the summary sentence;
Storing the generated paraphrase text in the paraphrase text storage means;
Classifying the paraphrase text based on a character string included in the paraphrase text stored in the paraphrase text storage means; and
Classifying the document stored in the document storage means that corresponds to the paraphrase text, based on the classification result of the paraphrase text.

A document processing apparatus comprising:
Document storage means for storing a plurality of documents composed of sentences including character strings;
Important word extraction means for extracting, for each document stored in the document storage means, a character string that is important in the document as an important word, based on the appearance frequency of the character string in the document;
Summary sentence extraction means for extracting, as a summary sentence, a sentence including the important word extracted by the important word extraction means from the document from which the important word was extracted;
Analysis means for analyzing a dependency between character strings included in the summary sentence extracted by the summary sentence extraction means;
Paraphrase text generation means for generating a paraphrase text of the summary sentence that includes the important word by simplifying, based on the important word included in the summary sentence extracted by the summary sentence extraction means and the analysis result of the analysis means, expressions other than the important word included in the summary sentence;
Feature extraction means for extracting, from the paraphrase text, a feature set composed of a plurality of character strings including the important word included in the paraphrase text generated by the paraphrase text generation means;
Feature storage means for storing the feature set extracted by the feature extraction means;
Calculation means for calculating, for each document stored in the document storage means, a document vector component value based on the appearance frequency, in the summary sentence extracted from the document, of the feature set stored in the feature storage means; and
Document vector generation means for generating a document vector for each document stored in the document storage means based on the document vector component value calculated by the calculation means,
Wherein a template in which the important word serves as an object or a verb is generated based on the extracted important word and the analysis result, and
The feature extraction means extracts the feature set by matching the generated template against the generated paraphrase text.