JP2010170211A - Important sentence extraction program and device

Info

Publication number: JP2010170211A
Application number: JP2009010086A
Authority: JP
Inventors: Saori Kurata; 早織倉田; Yoshimi Saito; 佳美齋藤; Toshiyuki Kano; 敏行加納
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2009-01-20
Filing date: 2009-01-20
Publication date: 2010-08-05
Anticipated expiration: 2029-01-20
Also published as: JP4922319B2

Abstract

PROBLEM TO BE SOLVED: To extract an important sentence from a document even when an important word corresponding to the field of the document is absent from the document.

SOLUTION: An extraction target document input part 31 inputs, in accordance with a user operation, an extraction target document to which a category has been assigned. An important word determination part 32 reads, from a hierarchical important word storage part 22, a first important word corresponding to the category assigned to the extraction target document and a second important word corresponding to a category related to that category. The important word determination part 32 determines the first and second important words to be first and second important words for extraction when they are included in the extraction target document. An important sentence extraction part 33 extracts, from the extraction target document, each sentence including at least one of the first and second important words for extraction as an important sentence. An important sentence output part 35 outputs the important sentences in order of an importance level calculated based on the first or second important word for extraction included in each important sentence.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to an important sentence extraction program and an important sentence extraction apparatus for extracting, from a document composed of a plurality of sentences, the important sentences in that document.

In recent years, important sentence extraction apparatuses have become known that extract, from a document composed of a plurality of sentences, important sentences that appropriately represent the content of the document. Extracting important sentences in this way is necessary for grasping the content of a document without reading its full text.

As a technique for extracting important sentences from a document as described above, a technique capable of extracting the important sentences in a document easily and with high accuracy (hereinafter referred to as the prior art) has been disclosed (see, for example, Patent Document 1). According to this prior art, an important expression table (important word dictionary) describing important expressions (important words) corresponding to, for example, the field of a document is prepared, and sentences containing those important expressions are extracted as important sentences.

Patent Document 1: Japanese Patent Laid-Open No. 11-272686

However, in the prior art described above, when an important word corresponding to the field of a document does not exist in the document, no important sentence can be extracted from the document.

Moreover, in the prior art, the important word dictionary is not structured to represent relationships between important words (that is, it has no hierarchical structure), so it is difficult to find in the dictionary the important words related to the important word corresponding to the field of the document.

Accordingly, an object of the present invention is to provide an important sentence extraction program and an important sentence extraction apparatus capable of extracting an important sentence from a document even when an important word corresponding to the field of the document does not exist in the document.

According to one aspect of the present invention, there is provided an important sentence extraction program executed by a computer in an important sentence extraction apparatus that comprises an external storage device and the computer using the external storage device, the external storage device having important word storage means that stores important words corresponding to each of the categories into which documents are classified and in which the categories related to each category are represented in a hierarchical structure. The program causes the computer to execute the steps of: inputting, in accordance with a user operation, an extraction target document that is composed of sentences including a plurality of words and to which the category into which the extraction target document is classified has been assigned; reading, from the important word storage means, a first important word corresponding to the category assigned to the input extraction target document; determining whether the read first important word is included in the input extraction target document; determining the first important word to be a first important word for extraction when it is determined to be included in the extraction target document; reading, from the important word storage means, a second important word corresponding to a category related to the category assigned to the input extraction target document; determining whether the read second important word is included in the input extraction target document; determining the second important word to be a second important word for extraction when it is determined to be included in the extraction target document; extracting, from the input extraction target document, each sentence including at least one of the determined first and second important words for extraction as an important sentence; calculating the importance of each extracted important sentence based on the determined first or second important word for extraction included in it; and outputting the extracted important sentences in order of the calculated importance.

According to the present invention, an important sentence can be extracted from a document even when an important word corresponding to the field of the document does not exist in the document.

FIG. 1 is a block diagram showing the hardware configuration of an important sentence extraction apparatus according to a first embodiment of the present invention.
FIG. 2 is a block diagram mainly showing the functional configuration of the important sentence extraction apparatus 30 shown in FIG. 1.
FIG. 3 is a diagram showing an example of the data structure of the hierarchical important word storage unit 22 shown in FIG. 2.
FIG. 4 is a flowchart showing the procedure of the important sentence extraction process of the important sentence extraction apparatus 30 according to this embodiment.
FIG. 5 is a diagram showing an example of an extraction target document input by the extraction target document input unit 31.
FIG. 6 is a diagram showing an example in which important sentences with a score of 5 or more are output.
FIG. 7 is a block diagram mainly showing the functional configuration of an important sentence extraction apparatus according to a second embodiment of the present invention.
FIG. 8 is a diagram showing an example of the data structure of the important sentence storage unit 24 shown in FIG. 7.
FIG. 9 is a flowchart showing the procedure of the document classification process of the important sentence extraction apparatus 40 according to this embodiment.
FIG. 10 is a diagram showing an example of a classification target document input by the classification target document input unit 41.
FIG. 11 is a diagram showing an example in which a classification result is output by the classification result output unit 44.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

[First Embodiment]
First, a first embodiment of the present invention will be described with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the hardware configuration of the important sentence extraction apparatus according to this embodiment. As shown in FIG. 1, a computer 10 is connected to an external storage device 20 such as a hard disk drive (HDD). The external storage device 20 stores a program 21 executed by the computer 10. The computer 10 and the external storage device 20 constitute an important sentence extraction apparatus 30.

In this embodiment, the important sentence extraction apparatus 30 has a function of extracting, from a document composed of, for example, a plurality of sentences, important sentences that appropriately represent the content of the document.

FIG. 2 is a block diagram mainly showing the functional configuration of the important sentence extraction apparatus 30 shown in FIG. 1. As shown in FIG. 2, the important sentence extraction apparatus 30 includes an extraction target document input unit 31, an important word determination unit 32, an important sentence extraction unit 33, a score calculation unit 34, and an important sentence output unit 35. In this embodiment, the units 31 to 35 are realized by the computer 10 shown in FIG. 1 executing the program (important sentence extraction program) 21 stored in the external storage device 20. The program 21 can be stored in advance on a computer-readable storage medium and distributed. The program 21 may also be downloaded to the computer 10 via, for example, a network.

The important sentence extraction apparatus 30 also has a hierarchical important word storage unit (hierarchical important word dictionary) 22, a synonym dictionary 23, and an important sentence storage unit 24. In this embodiment, the hierarchical important word storage unit 22, the synonym dictionary 23, and the important sentence storage unit 24 are stored in, for example, the external storage device 20.

The hierarchical important word storage unit 22 stores, in association with each category into which documents composed of sentences including a plurality of words (character strings) are classified, the important words corresponding to that category. An important word is predetermined as, for example, a word that appropriately represents the content of the corresponding category. In the hierarchical important word storage unit 22, the relationships between the categories into which documents are classified (and thus between the corresponding important words) are represented in a hierarchical structure. That is, in the hierarchical important word storage unit 22, the categories related to each category into which documents are classified are represented in a hierarchical structure.

When the documents to be classified are, for example, patent documents (patent literature), FI (File Index, assigned by the Japan Patent Office) is used as the category in the hierarchical important word storage unit 22, and the explanatory text corresponding to each FI (defined by the Japan Patent Office; hereinafter referred to as "FI explanatory text") is used as the important word corresponding to that category.

The contents of the hierarchical important word storage unit 22 are, for example, built manually in advance. Alternatively, the hierarchical important word storage unit 22 may be created by statistically determining the important words corresponding to a category based on the frequency of occurrence of the words (character strings) included in the documents classified into that category.

The extraction target document input unit 31 inputs, in accordance with a user operation, a document from which important sentences are to be extracted (hereinafter referred to as an extraction target document). The extraction target document is composed of a plurality of sentences including a plurality of words. The category into which the extraction target document is classified is assigned to the extraction target document.

As described above, the extraction target documents input by the extraction target document input unit 31 include, for example, patent documents. A patent document is only one example; the extraction target document may be any text data.

The synonym dictionary 23 stores, for example, synonyms of the important words corresponding to each of the categories stored in the hierarchical important word storage unit 22. For example, when the important word "storage medium" is stored in the hierarchical important word storage unit 22, the synonym dictionary 23 stores "recording device" and "recording member" as synonyms of the important word "storage medium".

The important word determination unit 32 reads, from the hierarchical important word storage unit 22, the important word (first important word) corresponding to the category assigned to the extraction target document input by the extraction target document input unit 31. Hereinafter, the important word corresponding to the category assigned to the input extraction target document is referred to as the corresponding important word.

The important word determination unit 32 determines whether the read corresponding important word is included in the extraction target document input by the extraction target document input unit 31. When the corresponding important word is determined to be included in the extraction target document, the important word determination unit 32 determines it to be an important word for extraction (hereinafter referred to as a first important word for extraction).

When a synonym of the corresponding important word stored in the synonym dictionary 23 is included in the extraction target document, the important word determination unit 32 determines that the corresponding important word is included in the extraction target document. In that case, in the subsequent processing, the synonym of the corresponding important word found in the extraction target document is handled in the same way as the corresponding important word itself.
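The synonym-aware containment check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the name `contains_keyword`, the `SYNONYMS` table layout, and the use of plain substring matching are assumptions. Only the "storage medium" entry comes from the text.

```python
# Hypothetical in-memory stand-in for the synonym dictionary 23.
# The single entry mirrors the example given in the description.
SYNONYMS = {
    "storage medium": ["recording device", "recording member"],
}


def contains_keyword(document_text, keyword, synonyms=SYNONYMS):
    """Return True if the keyword or any registered synonym occurs in the text.

    Plain substring matching is an assumption; the source does not say
    how occurrences are detected.
    """
    candidates = [keyword] + synonyms.get(keyword, [])
    return any(term in document_text for term in candidates)
```

With this sketch, a document that mentions only "recording member" is still treated as containing the important word "storage medium".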

The important word determination unit 32 reads, from the hierarchical important word storage unit 22, the important words (second important words) corresponding to the categories related to the category assigned to the extraction target document input by the extraction target document input unit 31 (that is, the category to which the read corresponding important word belongs). Hereinafter, an important word corresponding to a category related to the category assigned to the input extraction target document is referred to as a neighboring important word (a neighboring important word of the corresponding important word).

Here, the neighboring important words include, for example, first to third important words. The first important word is an important word corresponding to a category positioned, in the hierarchical important word storage unit 22, at the same level as the category assigned to the extraction target document. The second important word is an important word corresponding to a category positioned one level above or one level below the category assigned to the extraction target document. The third important word is an important word corresponding to a category positioned two levels above or two levels below the category assigned to the extraction target document. The neighboring important words may consist of, for example, only the first and second important words among the first to third important words, or may include important words other than the first to third important words.
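One way to picture the three kinds of neighboring categories is a lookup over a parent-pointer tree. The sketch below is illustrative only: the function name `neighbor_categories`, the `parent` mapping, and the category labels in the usage note are hypothetical, and "same level" is read here as "shares the same parent", which is an assumption about the hierarchy.

```python
def neighbor_categories(parent, category):
    """Group the categories near `category` by their relation to it.

    `parent` maps each category to its parent category (None for a
    top-level category). Returns a dict with three lists:
      same_level : categories sharing the same parent
      adjacent   : the parent and the direct children
      two_levels : the grandparent and the grandchildren
    """
    # Invert the parent map so children can be looked up directly.
    children = {}
    for node, par in parent.items():
        children.setdefault(par, []).append(node)

    def kids(node):
        return children.get(node, [])

    up = parent.get(category)
    same_level = [c for c in kids(up) if c != category]

    adjacent = ([up] if up is not None else []) + kids(category)

    grandparent = parent.get(up) if up is not None else None
    grandchildren = [g for c in kids(category) for g in kids(c)]
    two_levels = ([grandparent] if grandparent is not None else []) + grandchildren

    return {"same_level": same_level, "adjacent": adjacent, "two_levels": two_levels}
```

For a hypothetical tree `{"A": None, "A1": "A", "A2": "A", "A1x": "A1"}`, calling `neighbor_categories(tree, "A1")` yields `A2` at the same level, `A` and `A1x` as adjacent categories, and no categories two levels away. The important words of the returned categories would then serve as the neighboring important words.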

The important word determination unit 32 determines whether each read neighboring important word is included in the extraction target document input by the extraction target document input unit 31. When a neighboring important word is determined to be included in the extraction target document, the important word determination unit 32 determines it to be an important word for extraction (hereinafter referred to as a second important word for extraction).

When a synonym of a neighboring important word stored in the synonym dictionary 23 is included in the extraction target document, the important word determination unit 32 determines that the neighboring important word is included in the extraction target document. In that case, in the subsequent processing, the synonym of the neighboring important word found in the extraction target document is handled in the same way as the neighboring important word itself.

The important sentence extraction unit 33 extracts, from the extraction target document input by the extraction target document input unit 31, each sentence that includes at least one of the important words for extraction (the first and second important words for extraction) determined by the important word determination unit 32, as an important sentence. The details of the extraction process performed by the important sentence extraction unit 33 will be described later.
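The "at least one important word for extraction" rule can be sketched in a few lines. This is a simplified illustration under stated assumptions: the document is assumed to be already split into a list of sentences, matching is by plain substring, and the name `extract_important_sentences` is hypothetical.

```python
def extract_important_sentences(sentences, extraction_keywords):
    """Keep the sentences that contain at least one extraction keyword.

    `sentences` is a list of sentence strings (sentence splitting is
    assumed to have happened upstream); document order is preserved.
    """
    return [
        sentence
        for sentence in sentences
        if any(keyword in sentence for keyword in extraction_keywords)
    ]
```

A sentence containing several important words for extraction is still returned only once; how such a sentence is weighted is a matter for the score calculation, not for this filtering step.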

The score calculation unit 34 calculates the score (importance) of each important sentence extracted by the important sentence extraction unit 33 based on the important words for extraction included in that sentence. Specifically, the score calculation unit 34 calculates the score according to whether each important word included in the extracted important sentence is a first important word for extraction (corresponding important word) or a second important word for extraction (neighboring important word). The details of the score calculation process performed by the score calculation unit 34 will be described later.

The score calculation unit 34 stores each important sentence extracted by the important sentence extraction unit 33 in the important sentence storage unit 24 in association with its calculated score. At this time, the score calculation unit 34 stores each pair of an important sentence and its score in association with the category assigned to the extraction target document (that is, the document from which the important sentence was extracted).

The important sentence output unit 35 outputs the important sentences extracted by the important sentence extraction unit 33 in order of the scores calculated by the score calculation unit 34. Specifically, the important sentence output unit 35 presents (displays) the important sentences to the user in score order.
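The scoring and ordering steps can be pictured as follows. This passage defers the actual scoring rule to a later part of the description, so the weights below (`w_first=2`, `w_second=1`) and the additive formula are placeholders chosen purely for illustration, and the function names are hypothetical.

```python
def score_sentence(sentence, first_keywords, second_keywords, w_first=2, w_second=1):
    """Score a sentence by the extraction keywords it contains.

    Assumption: corresponding important words (first) count more than
    neighboring important words (second). The weights are placeholders,
    not values taken from the source.
    """
    score = sum(w_first for k in first_keywords if k in sentence)
    score += sum(w_second for k in second_keywords if k in sentence)
    return score


def rank_sentences(sentences, first_keywords, second_keywords):
    """Return (score, sentence) pairs, highest score first."""
    scored = [
        (score_sentence(s, first_keywords, second_keywords), s) for s in sentences
    ]
    # sorted() is stable, so equally scored sentences keep document order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Keeping the score next to each sentence matches the description, in which each important sentence is stored together with its score before being presented in score order.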

FIG. 3 shows an example of the data structure of the hierarchical important word storage unit 22 shown in FIG. 2. As shown in FIG. 3, the hierarchical important word storage unit 22 represents the relationships between the categories into which documents are classified in a hierarchical structure. The hierarchical important word storage unit 22 also stores the important words corresponding to (that is, relating to) each of those categories.

The example shown in FIG. 3 assumes that the documents from which the important sentence extraction apparatus 30 extracts important sentences are patent documents (patent literature). That is, in the hierarchical important word storage unit 22, FI is used as the category into which documents are classified, and the FI explanatory text is used as the important word corresponding to each category. FI is, for example, a code used as a search key for patent documents.

To explain concretely with reference to FIG. 3, the hierarchical important word storage unit 22 stores, for example, the FI explanatory text "inkjet" as the important word corresponding to FI "B41J 3/04 101", which is a category into which documents are classified.

Likewise, the hierarchical important word storage unit 22 stores the FI explanatory text "color" as the important word corresponding to FI "B41J 3/04 101A", "ink mist" for FI "B41J 3/04 101B", "storage medium" for FI "B41J 3/04 101Y", "ink supply device" for FI "B41J 3/04 102", and "head cleaning" for FI "B41J 3/04 102H".

As shown in FIG. 3, each FI in the hierarchical important word storage unit 22 is represented in a hierarchical structure. In the example shown in FIG. 3, FI "B41J 3/04 101A", "B41J 3/04 101B", and "B41J 3/04 101Y" are positioned one level below FI "B41J 3/04 101". That is, FI "B41J 3/04 101A", "B41J 3/04 101B", and "B41J 3/04 101Y" are positioned at the same level as one another.

FI "B41J 3/04 102" is positioned at the same level as FI "B41J 3/04 101".

Furthermore, FI "B41J 3/04 102H" is positioned one level below FI "B41J 3/04 102".

In this way, in the hierarchical important word storage unit 22, the categories (FI) that are related to one another (positioned above, below, or at the same level) are represented by a hierarchical structure.
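The FIG. 3 example can be written down as a small table to make the hierarchy concrete. The FI codes, important words, and parent relations below are taken from the description; the dictionary layout, the use of `None` for the two codes whose parent is not named in this passage, and the helper names are assumptions made for illustration.

```python
# Illustrative stand-in for the hierarchical important word storage unit 22.
# "parent" is None where the description does not name a parent category.
FI_TABLE = {
    "B41J 3/04 101": {"keyword": "inkjet", "parent": None},
    "B41J 3/04 101A": {"keyword": "color", "parent": "B41J 3/04 101"},
    "B41J 3/04 101B": {"keyword": "ink mist", "parent": "B41J 3/04 101"},
    "B41J 3/04 101Y": {"keyword": "storage medium", "parent": "B41J 3/04 101"},
    "B41J 3/04 102": {"keyword": "ink supply device", "parent": None},
    "B41J 3/04 102H": {"keyword": "head cleaning", "parent": "B41J 3/04 102"},
}


def keyword_for(fi):
    """Return the important word (FI explanatory text) stored for an FI code."""
    return FI_TABLE[fi]["keyword"]


def same_level(fi):
    """Return the other FI codes that share this code's parent."""
    parent = FI_TABLE[fi]["parent"]
    return [
        code
        for code, entry in FI_TABLE.items()
        if entry["parent"] == parent and code != fi
    ]
```

In this sketch, `keyword_for("B41J 3/04 101B")` returns "ink mist", and `same_level("B41J 3/04 101B")` returns the codes "B41J 3/04 101A" and "B41J 3/04 101Y", which agrees with the same-level relation stated above.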

Next, the procedure of the important sentence extraction process of the important sentence extraction apparatus 30 according to this embodiment will be described with reference to the flowchart of FIG. 4. The hierarchical important word storage unit 22 included in the important sentence extraction apparatus 30 is assumed to have the data structure shown in FIG. 3 described above.

First, the extraction target document input unit 31 inputs, in accordance with a user operation, a document (extraction target document) composed of a plurality of sentences including a plurality of words (step S1). The user can designate the extraction target document by, for example, operating the important sentence extraction apparatus 30. The extraction target document input by the extraction target document input unit 31 has been assigned the category (FI) into which it is classified.

FIG. 5 shows an example of an extraction target document input by the extraction target document input unit 31. In the example shown in FIG. 5, the extraction target document 100 is a patent document. The category assigned to the extraction target document 100 is assumed to be FI "B41J 3/04 101B". The following description assumes that the extraction target document 100 shown in FIG. 5 has been input by the extraction target document input unit 31.

Next, the important word determination unit 32 acquires the extraction target document 100 input by the extraction target document input unit 31. The important word determination unit 32 reads, from the hierarchical important word storage unit 22, the important word (corresponding important word) corresponding to the category assigned to the acquired extraction target document 100 (step S2).

The process of step S2 will now be described concretely with reference to FIGS. 3 and 5. The category assigned to the extraction target document 100 shown in FIG. 5 is FI "B41J 3/04 101B". Therefore, according to the hierarchical important word storage unit 22 shown in FIG. 3, the important word determination unit 32 reads "ink mist" from the hierarchical important word storage unit 22 as the corresponding important word.

The important word determination unit 32 determines whether the read corresponding important word exists in (is included in) the acquired extraction target document 100 (step S3). Because the extraction target document 100 shown in FIG. 5 includes the corresponding important word "ink mist", the important word determination unit 32 determines that the corresponding important word exists in the extraction target document 100 (YES in step S3).

When the corresponding important word is determined to exist in the extraction target document 100, the important word determination unit 32 determines the corresponding important word (here, "ink mist") to be an important word for extraction (first important word for extraction) (step S4).

Next, the important word determination unit 32 reads, from the hierarchical important word storage unit 22, the important words corresponding to categories related to the category assigned to the acquired extraction target document 100 (the neighboring important words) (step S5).

Here, the neighboring important words include the first to third important words described above. As described above, the first important word is an important word corresponding to a category positioned at the same level of the hierarchy as the category assigned to the extraction target document 100. The second important word is an important word corresponding to a category positioned one level above or one level below the category assigned to the extraction target document 100. The third important word is an important word corresponding to a category positioned two levels above or two levels below the category assigned to the extraction target document 100.

To describe this concretely with reference to FIG. 3, the hierarchical important word storage unit 22 contains FI “B41J 3/04 101A” and “B41J 3/04 101Y” as categories positioned at the same level as FI “B41J 3/04 101B”, the category assigned to the extraction target document 100. The important word determination unit 32 therefore reads “color” and “storage medium”, which correspond to FI “B41J 3/04 101A” and “B41J 3/04 101Y”, from the hierarchical important word storage unit 22 as neighboring important words of the corresponding important word. “Color” and “storage medium” are neighboring important words that fall under the first important word described above.

The hierarchical important word storage unit 22 also contains FI “B41J 3/04 101” as a category positioned one level above FI “B41J 3/04 101B”, the category assigned to the extraction target document 100. The important word determination unit 32 therefore reads “inkjet”, which corresponds to FI “B41J 3/04 101”, from the hierarchical important word storage unit 22 as a neighboring important word of the corresponding important word. “Inkjet” is a neighboring important word that falls under the second important word described above.

The important word determination unit 32 determines whether the neighboring important words that were read are present in (included in) the acquired extraction target document 100 (step S6). Because the extraction target document 100 shown in FIG. 5 contains the neighboring important words “color” and “inkjet”, the important word determination unit 32 determines that neighboring important words are present in the extraction target document 100 (YES in step S6).

Note that the extraction target document 100 shown in FIG. 5 does not contain the neighboring important word “storage medium”. However, if the synonym dictionary 23 stores “recording device” and “recording member” as synonyms of “storage medium”, as described above, the important word determination unit 32 determines whether “recording device” and “recording member” are included in the extraction target document 100. In this case both synonyms are included in the extraction target document 100, so the neighboring important word is determined to be present in the document. The occurrences of “recording device” and “recording member” in the extraction target document 100 are then handled in the subsequent processing in the same way as the neighboring important word “storage medium”.

When neighboring important words are determined to be present in the extraction target document 100, the important word determination unit 32 determines those neighboring important words as extraction important words (the second extraction important words) (step S7). In this case, the important word determination unit 32 determines the neighboring important words “color”, “recording device”, “recording member” and “inkjet” as extraction important words. Note that “storage medium” itself, although a neighboring important word, is not determined as an extraction important word because the term does not literally appear in the extraction target document 100.
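Steps S2 to S7 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the `HIERARCHY` and `SYNONYMS` tables are a hypothetical excerpt standing in for the hierarchical important word storage unit 22 and the synonym dictionary 23, the function names are invented here, matching is plain substring search on English text, and only one level above or below the target category is considered (the third important words, two levels away, are omitted).

```python
# Hypothetical excerpt of the hierarchical important word store (FIG. 3):
# category -> (parent category, important word). Names are assumptions.
HIERARCHY = {
    "B41J 3/04 101": (None, "inkjet"),
    "B41J 3/04 101A": ("B41J 3/04 101", "color"),
    "B41J 3/04 101B": ("B41J 3/04 101", "ink mist"),
    "B41J 3/04 101Y": ("B41J 3/04 101", "storage medium"),
}
# Hypothetical excerpt of the synonym dictionary.
SYNONYMS = {"storage medium": ["recording device", "recording member"]}


def neighbor_words(category):
    """Return important words of same-level, upper and lower categories (step S5)."""
    parent, _ = HIERARCHY[category]
    words = []
    if parent is not None:
        # Same-level categories share the target category's parent.
        for cat, (p, word) in HIERARCHY.items():
            if cat != category and p == parent:
                words.append(word)
        # Upper category: the parent itself.
        words.append(HIERARCHY[parent][1])
    # Lower categories: direct children of the target category.
    for cat, (p, word) in HIERARCHY.items():
        if p == category:
            words.append(word)
    return words


def surface_forms(word, text):
    """Return the word and its synonyms that literally occur in the text."""
    candidates = [word] + SYNONYMS.get(word, [])
    return [c for c in candidates if c in text]


def determine_extraction_words(category, text):
    """Steps S2-S7: first (corresponding) and second (neighboring) extraction words."""
    first = surface_forms(HIERARCHY[category][1], text)
    second = []
    for word in neighbor_words(category):
        second.extend(surface_forms(word, text))
    return first, second


if __name__ == "__main__":
    doc = ("an inkjet recording device ejects ink toward a recording member "
           "and prevents defects due to ink mist on the color filter")
    print(determine_extraction_words("B41J 3/04 101B", doc))
```

With the toy document above, “storage medium” is not returned because it does not occur literally, while its synonyms “recording device” and “recording member” are, mirroring the example in the text.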

Next, the important sentence extraction unit 33 extracts from the extraction target document 100, as important sentences, the sentences that contain an extraction important word determined by the important word determination unit 32 (at least one of the first and second extraction important words) (step S8). During this extraction process, the important sentence extraction unit 33 parses the extraction target document 100 syntactically as necessary.

The important sentence extraction process performed by the important sentence extraction unit 33 is now described in detail. The important sentence extraction unit 33 extracts, as important sentences, the portions of the extraction target document 100 that satisfy the extraction conditions (extraction rules) described below.

The extraction conditions include a first and a second extraction condition. Under the first extraction condition, when there is a portion that starts with an extraction important word and contains one verb related to that extraction important word, the portion from the extraction important word through the verb is taken as an important sentence. Under the second extraction condition, when there is one verb that modifies the extraction important word immediately before it, the portion from that verb through the extraction important word is taken as an important sentence.

The important sentences extracted from the extraction target document 100 shown in FIG. 5 are now described concretely for the case where, as described above, the extraction important words determined by the important word determination unit 32 are “ink mist”, “color”, “recording device”, “recording member” and “inkjet”.

First, for the extraction important word “ink mist”, “Prevent defects due to ink mist” is extracted as an important sentence from the extraction target document 100 shown in FIG. 5 under the first extraction condition. Although the extraction target document 100 reads “preventing defects due to ink mist” (continuative form), the verb at the end of an important sentence extracted under the first extraction condition is normalized to its terminal (dictionary) form.

For the extraction important word “color”, “Color filter capable of preliminary ink ejection” is extracted as an important sentence from the extraction target document 100 shown in FIG. 5 under the second extraction condition. In addition, “Provide a color filter manufacturing apparatus” is extracted from the extraction target document 100 as an important sentence under the first extraction condition.

For the extraction important word “recording device”, “Inkjet recording device that performs recording” is extracted as an important sentence from the extraction target document 100 shown in FIG. 5 under the second extraction condition. The phrase “Inkjet recording device that performs recording” appears twice in the extraction target document 100, but it is extracted only once as an important sentence.

For the extraction important word “recording member”, “Eject ink toward the recording member by the inkjet head” is extracted as an important sentence from the extraction target document 100 shown in FIG. 5 under the first extraction condition. The important sentence extracted here also contains the extraction important word “inkjet”. Although “toward” is also a verb in the original Japanese, syntactic analysis shows that the verb related to the extraction important word “inkjet” is “eject”, so the important sentence above is the one extracted.

For the extraction important word “inkjet”, “Eject ink toward the recording member by the inkjet head” is extracted from the extraction target document 100 shown in FIG. 5 under the first extraction condition, and “Inkjet recording device that performs recording” is extracted under the second extraction condition. In addition, “Cover the nozzle surface of the inkjet head 120a”, “Move the inkjet head 120a and the cap member 31a relative to each other” and “Eject ink from the inkjet head 120a” are extracted from the extraction target document 100 as important sentences under the first extraction condition.

As a result, the important sentence extraction unit 33 extracts the following eight important sentences from the extraction target document 100: “Eject ink toward the recording member by the inkjet head”, “Inkjet recording device that performs recording”, “Prevent defects due to ink mist”, “Color filter capable of preliminary ink ejection”, “Provide a color filter manufacturing apparatus”, “Cover the nozzle surface of the inkjet head 120a”, “Move the inkjet head 120a and the cap member 31a relative to each other” and “Eject ink from the inkjet head 120a”.

Next, the score calculation unit 34 calculates a score for each important sentence extracted by the important sentence extraction unit 33 (step S9). For each important sentence, the score calculation unit 34 calculates the score of the sentence based on the scores predetermined for the extraction important words contained in it (hereinafter simply called the scores of the extraction important words).

The score of an extraction important word is determined by the positional relationship, within the hierarchical structure of the hierarchical important word storage unit 22, between the category corresponding to that extraction important word and the category assigned to the extraction target document 100. For example, a high score is set for the first extraction important word (the corresponding important word), while the score of a second extraction important word (a neighboring important word) is set according to its position in the hierarchy, that is, its distance from the category assigned to the extraction target document 100.

In the following description, the category assigned to the extraction target document 100 is called the target category, a category positioned one level above the target category is called an upper hierarchy category, a category positioned one level below the target category is called a lower hierarchy category, and a category positioned at the same level as the target category is called a same hierarchy category.

Here, a score of 6 is assumed for the first extraction important word (corresponding important word), which corresponds to the target category. Among the second extraction important words (neighboring important words), a score of 3 is assumed for those corresponding to an upper or lower hierarchy category, and a score of 2 for those corresponding to a same hierarchy category.

Although no such category exists in the hierarchical important word storage unit 22 shown in FIG. 3, a second extraction important word corresponding to a category positioned one further level above the upper hierarchy category is given a score of, for example, 1. The same score is set for a second extraction important word corresponding to a category positioned one further level below the lower hierarchy category.

The score calculation unit 34 calculates the score of each important sentence by adding up the scores of the extraction important words contained in that sentence. With the example scores above, the score of an important sentence is given by the score formula “6 × (number of corresponding important words, i.e., first extraction important words) + 3 × (number of neighboring important words corresponding to an upper or lower hierarchy category) + 2 × (number of neighboring important words corresponding to a same hierarchy category) (+ 1 × (number of neighboring important words corresponding to a category positioned one further level above the upper hierarchy category or one further level below the lower hierarchy category))”.
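The score formula can be written compactly as follows. The symbols are introduced here purely for illustration and do not appear in the source: for an important sentence s, n_corr(s) is the number of corresponding important words it contains, n_ul(s) the number of neighboring important words from upper or lower hierarchy categories, n_same(s) the number from same hierarchy categories, and n_far(s) the number from categories two levels away. The weights are the example values given in the text.

```latex
\[
  \mathrm{score}(s) \;=\; 6\,n_{\mathrm{corr}}(s) \;+\; 3\,n_{\mathrm{ul}}(s)
  \;+\; 2\,n_{\mathrm{same}}(s) \;+\; 1\,n_{\mathrm{far}}(s)
\]
% Example: a sentence containing one upper-level word and one same-level word
\[
  \mathrm{score}(s) \;=\; 6\cdot 0 + 3\cdot 1 + 2\cdot 1 + 1\cdot 0 \;=\; 5
\]
```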

The score calculation for each important sentence extracted by the important sentence extraction unit 33 is now described concretely. Of the extraction important words determined by the important word determination unit 32, “ink mist” is the corresponding important word (first extraction important word); “color”, “recording device” and “recording member” are neighboring important words (second extraction important words) corresponding to same hierarchy categories; and “inkjet” is a neighboring important word (second extraction important word) corresponding to the upper hierarchy category.

As described above, the important sentences extracted by the important sentence extraction unit 33 are “Eject ink toward the recording member by the inkjet head”, “Inkjet recording device that performs recording”, “Prevent defects due to ink mist”, “Color filter capable of preliminary ink ejection”, “Provide a color filter manufacturing apparatus”, “Cover the nozzle surface of the inkjet head 120a”, “Move the inkjet head 120a and the cap member 31a relative to each other” and “Eject ink from the inkjet head 120a”.

The important sentence “Eject ink toward the recording member by the inkjet head” contains one occurrence each of “inkjet”, a neighboring important word corresponding to the upper hierarchy category, and “recording member”, a neighboring important word corresponding to a same hierarchy category. By the score formula above, the score of this important sentence is therefore 3 × 1 + 2 × 1 = 5.

The important sentence “Inkjet recording device that performs recording” contains one occurrence each of “inkjet”, a neighboring important word corresponding to the upper hierarchy category, and “recording device”, a neighboring important word corresponding to a same hierarchy category. By the score formula above, the score of this important sentence is therefore 3 × 1 + 2 × 1 = 5.

The important sentence “Prevent defects due to ink mist” contains one occurrence of the corresponding important word “ink mist”. By the score formula above, the score of this important sentence is therefore 6 × 1 = 6.

The important sentence “Color filter capable of preliminary ink ejection” contains one occurrence of “color”, a neighboring important word corresponding to a same hierarchy category. By the score formula above, the score of this important sentence is therefore 2 × 1 = 2.

The important sentence “Provide a color filter manufacturing apparatus” contains one occurrence of “color”, a neighboring important word corresponding to a same hierarchy category. By the score formula above, the score of this important sentence is therefore 2 × 1 = 2.

The important sentence “Cover the nozzle surface of the inkjet head 120a” contains one occurrence of “inkjet”, a neighboring important word corresponding to the upper hierarchy category. By the score formula above, the score of this important sentence is therefore 3 × 1 = 3.

The important sentence “Move the inkjet head 120a and the cap member 31a relative to each other” contains one occurrence of “inkjet”, a neighboring important word corresponding to the upper hierarchy category. By the score formula above, the score of this important sentence is therefore 3 × 1 = 3.

Likewise, the important sentence “Eject ink from the inkjet head 120a” contains one occurrence of “inkjet”, a neighboring important word corresponding to the upper hierarchy category. By the score formula above, the score of this important sentence is therefore 3 × 1 = 3.

Note that even when a single important sentence contains the same important word (corresponding important word or neighboring important word) more than once, that word is counted only once in the score calculation.
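The worked scores above, including the rule that a repeated important word counts only once, can be reproduced with a short Python sketch. The dictionary and function names are assumptions made for illustration, matching is simple case-insensitive substring search on the English renderings of the sentences, and each synonym surface form (“recording device”, “recording member”) is treated as its own same-level word, which is one possible reading of the text.

```python
# Example weights from the text, keyed by the relation between the word's
# category and the target category. Key names are illustrative assumptions.
WEIGHTS = {
    "corresponding": 6,     # target category itself
    "upper_or_lower": 3,    # one level above or below
    "same_level": 2,        # same level as the target category
    "two_levels_away": 1,   # two levels above or below
}

# Extraction important words determined in steps S4 and S7.
KEYWORD_RELATION = {
    "ink mist": "corresponding",
    "color": "same_level",
    "recording device": "same_level",
    "recording member": "same_level",
    "inkjet": "upper_or_lower",
}


def sentence_score(sentence):
    """Sum the weights of the distinct extraction words found in the sentence."""
    text = sentence.lower()
    # Iterating over the dictionary visits each word once, so a word that
    # occurs several times in the sentence still contributes only once.
    return sum(WEIGHTS[rel] for word, rel in KEYWORD_RELATION.items()
               if word in text)


if __name__ == "__main__":
    for s in [
        "Eject ink toward the recording member by the inkjet head",
        "Inkjet recording device that performs recording",
        "Prevent defects due to ink mist",
        "Color filter capable of preliminary ink ejection",
    ]:
        print(sentence_score(s), s)
```

Run as a script, this prints the scores 5, 5, 6 and 2 for the four sample sentences, in line with the calculations in the text.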

Once the score of each important sentence has been calculated as described above, the score calculation unit 34 stores, in the important sentence storage unit 24, the category assigned to the extraction target document 100 (here, FI “B41J 3/04 101B”), the important sentences extracted by the important sentence extraction unit 33 and the scores of those important sentences in association with one another.

In other words, each time the important sentence extraction process described above is performed, the important sentence storage unit 24 stores, for each category, pairs consisting of an important sentence extracted from a document classified into that category and the score of that important sentence.
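One way to picture the contents of the important sentence storage unit 24 is as a mapping from category to a growing list of (sentence, score) pairs. The following is a minimal in-memory sketch; the names `important_sentence_store` and `store_results` are hypothetical, and the source does not specify how the storage unit is actually implemented.

```python
from collections import defaultdict

# category -> list of (important sentence, score) pairs; an in-memory
# stand-in for the important sentence storage unit (assumption).
important_sentence_store = defaultdict(list)


def store_results(category, scored_sentences):
    """Append the (sentence, score) pairs of one extraction run to the category."""
    important_sentence_store[category].extend(scored_sentences)


if __name__ == "__main__":
    store_results("B41J 3/04 101B", [
        ("Prevent defects due to ink mist", 6),
        ("Inkjet recording device that performs recording", 5),
    ])
    print(dict(important_sentence_store))
```

Because entries are appended per run, repeating the extraction process over many documents accumulates pairs under each category, which is the state the second embodiment presupposes.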

Next, the important sentence output unit 35 outputs (for example, displays to the user) the important sentences extracted by the important sentence extraction unit 33 in order of the scores calculated by the score calculation unit 34 (step S10). The important sentences are output, for example, in descending order of score.

The important sentence output unit 35 need not output all of the important sentences extracted by the important sentence extraction unit 33; for example, it may output only the important sentences whose score is 5 or higher.

FIG. 6 shows an example in which the important sentences with a score of 5 or higher are output (displayed). In the example shown in FIG. 6, of the important sentences extracted by the important sentence extraction unit 33, “Prevent defects due to ink mist” (score 6), “Eject ink toward the recording member by the inkjet head” (score 5) and “Inkjet recording device that performs recording” (score 5) are output in descending order of score.
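The output step (S10) with an optional score threshold amounts to a filter followed by a descending sort. The sketch below assumes the scores listed in the text; the function name `select_output` and its `min_score` parameter are illustrative and not taken from the source.

```python
def select_output(scored_sentences, min_score=None):
    """Return (sentence, score) pairs in descending score order.

    If min_score is given, pairs scoring below it are dropped first.
    sorted() is stable, so ties keep their extraction order.
    """
    kept = [(s, sc) for s, sc in scored_sentences
            if min_score is None or sc >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# The eight important sentences and their scores from the worked example.
SCORED = [
    ("Eject ink toward the recording member by the inkjet head", 5),
    ("Inkjet recording device that performs recording", 5),
    ("Prevent defects due to ink mist", 6),
    ("Color filter capable of preliminary ink ejection", 2),
    ("Provide a color filter manufacturing apparatus", 2),
    ("Cover the nozzle surface of the inkjet head 120a", 3),
    ("Move the inkjet head 120a and the cap member 31a relative to each other", 3),
    ("Eject ink from the inkjet head 120a", 3),
]

if __name__ == "__main__":
    for sentence, score in select_output(SCORED, min_score=5):
        print(f"{sentence} (score {score})")
```

With `min_score=5`, three sentences remain, led by the score-6 sentence, which matches the example of FIG. 6.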

On the other hand, when it is determined in step S3 that the corresponding important word is not present in the extraction target document 100, processing proceeds to step S5. That is, when the corresponding important word is not present in the extraction target document 100, it is not determined as an extraction important word.

When it is determined in step S6 that no neighboring important word is present in the extraction target document 100, the processing ends. However, even when step S6 finds no neighboring important word in the extraction target document 100, if step S3 determined that the corresponding important word is present, the important sentences containing that corresponding important word are extracted from the extraction target document 100 and output. In other words, the processing ends without outputting any important sentence only when neither the corresponding important word nor any neighboring important word is present in the extraction target document 100.

As described above, in the present embodiment, among the important word corresponding to the category assigned to the extraction target document 100 (the corresponding important word) and the important words corresponding to categories related to that category, such as upper hierarchy, lower hierarchy and same hierarchy categories (the neighboring important words), those present in the extraction target document 100 are determined as extraction important words, and sentences containing those extraction important words are extracted from the extraction target document 100 as important sentences.

As a result, in the present embodiment, even when the corresponding important word for the category assigned to the extraction target document 100 is not present in that document, important sentences can still be extracted using neighboring important words corresponding to categories related to that category.

In addition, by using neighboring important words as described above, the present embodiment maintains the accuracy of important sentence extraction better than an approach that uses important words of categories unrelated to the category into which the extraction target document 100 is classified, while extracting more important sentences than an approach that uses only the corresponding important word.

The present embodiment also makes it possible to extract important sentences that match the granularity at which the content of the extraction target document 100 is described. That is, when the content is described at a coarse granularity, neighboring important words corresponding to upper hierarchy categories are used, and when it is described at a fine granularity, neighboring important words corresponding to lower hierarchy categories are used, so that important sentences suited to the content of the extraction target document 100 can be extracted. Furthermore, because neighboring important words corresponding to upper and lower hierarchy categories are scored differently from those corresponding to same hierarchy categories, the degree of importance can be expressed in a fine-grained manner.

Although the present embodiment has been described as extracting important sentences from patent documents using FI and FI explanatory text, it is also applicable to, for example, a collection of e-mails accumulated on a mail server. In that case, if the e-mails are stored in a directory structure and a keyword is assigned to each directory, important sentences can be extracted from the e-mails in each directory using the assigned keywords.

[Second Embodiment]
Next, a second embodiment of the present invention is described with reference to FIG. 7. FIG. 7 is a block diagram mainly showing the functional configuration of the important sentence extraction apparatus according to the present embodiment. Parts that are the same as in FIG. 2 described above are given the same reference numerals, and their detailed description is omitted. The description here focuses mainly on the parts that differ from FIG. 2.

The hardware configuration of the important sentence extraction apparatus according to the present embodiment is the same as that of the first embodiment described above, so it is described with reference to FIG. 1 as appropriate.

The present embodiment differs from the first embodiment described above in that, in addition to the important sentence extraction process of the first embodiment, it performs a process of classifying documents using the important sentences stored in the important sentence storage unit 24 (document classification process). That is, the present embodiment presupposes that the important sentence extraction process of the first embodiment has been repeated so that pairs of important sentences and their scores have been accumulated by category in the important sentence storage unit 24.

図７に示すように、重要文抽出装置４０は、分類対象文書入力部４１、概念検索部４２、カテゴリ分類部４３及び分類結果出力部４４を含む。本実施形態において、これらの各部４１乃至４４は、図１に示すコンピュータ１０が外部記憶装置２０に格納されているプログラム２１を実行することにより実現されるものとする。 As illustrated in FIG. 7, the important sentence extraction device 40 includes a classification target document input unit 41, a concept search unit 42, a category classification unit 43, and a classification result output unit 44. In the present embodiment, these units 41 to 44 are realized by the computer 10 shown in FIG. 1 executing the program 21 stored in the external storage device 20.

また、重要文抽出装置４０は、分類結果格納部２５を有する。本実施形態において、分類結果格納部２５は、例えば外部記憶装置２０に格納される。 The important sentence extraction device 40 includes a classification result storage unit 25. In the present embodiment, the classification result storage unit 25 is stored in, for example, the external storage device 20.

分類対象文書入力部４１は、分類される対象となる文書（以下、分類対象文書と表記）を入力する。分類対象文書入力部４１は、ユーザの操作に応じて分類対象文書を入力する。この分類対象文書は、複数の語を含む複数の文から構成される。なお、この分類対象文書には、前述した第１の実施形態における抽出対象文書とは異なりカテゴリは付与されていない。 The classification target document input unit 41 inputs a document to be classified (hereinafter referred to as a classification target document). The classification target document input unit 41 inputs a classification target document in accordance with a user operation. This classification target document is composed of a plurality of sentences including a plurality of words. Unlike the extraction target document in the first embodiment described above, no category is assigned to this classification target document.

分類対象文書入力部４１によって入力される分類対象文書としては、例えば特許文書（特許文献）が含まれる。なお、特許文書は分類対象文書の一例であり、分類対象文書はテキストデータであればよい。 The classification target document input by the classification target document input unit 41 includes, for example, a patent document (patent document). A patent document is an example of a classification target document, and the classification target document may be text data.

概念検索部４２は、重要文格納部２４に格納されている重要文を検索キー、分類対象文書入力部４１によって入力された分類対象文書を検索対象として、概念検索（自然文検索）を行う。これにより、分類対象文書入力部４１によって入力された分類対象文書が重要文格納部２４に格納されている重要文に合致するか否かが判定される。なお、概念検索部４２は、重要文格納部２４に格納されている重要文の全てについて概念検索を行う。 The concept search unit 42 performs a concept search (natural sentence search) by using the important sentence stored in the important sentence storage unit 24 as a search key and the classification target document input by the classification target document input unit 41 as a search target. Thus, it is determined whether or not the classification target document input by the classification target document input unit 41 matches the important sentence stored in the important sentence storage unit 24. The concept search unit 42 performs a concept search for all the important sentences stored in the important sentence storage unit 24.

カテゴリ分類部４３は、概念検索結果に基づいて、分類対象文書入力部４１によって入力された分類対象文書（検索対象）が重要文格納部２４に格納されている重要文（検索キー）に合致するか否かを判定する。 Based on the concept search result, the category classification unit 43 determines whether or not the classification target document (search target) input by the classification target document input unit 41 matches the important sentence (search key) stored in the important sentence storage unit 24.

カテゴリ分類部４３は、重要文格納部２４に格納されているカテゴリに対応付けられている重要文であって、分類対象文書入力部４１によって入力された分類対象文書が合致すると判定された重要文に対応付けられているスコアの合計値を当該カテゴリのスコア（確信度）として、当該重要文格納部２４に格納されているカテゴリ毎に算出する。つまり、カテゴリ分類部４３は、重要文格納部２４に格納されている全ての重要文についての概念検索結果に基づいて、重要文格納部２４に格納されているカテゴリ別にスコアを算出する。 For each category stored in the important sentence storage unit 24, the category classification unit 43 calculates, as the score (confidence) of the category, the total of the scores associated with those important sentences that are associated with the category and that the classification target document input by the classification target document input unit 41 has been determined to match. That is, the category classification unit 43 calculates a score for each category stored in the important sentence storage unit 24 based on the concept search results for all the important sentences stored in the important sentence storage unit 24.

分類結果出力部４４は、重要文格納部２４に格納されているカテゴリ及び算出された当該カテゴリのスコアを出力する。このとき、分類結果出力部４４は、重要文格納部２４に格納されているカテゴリの各々を、カテゴリ分類部４３によって算出された当該カテゴリのスコア順に出力する。 The classification result output unit 44 outputs the category stored in the important sentence storage unit 24 and the calculated score of the category. At this time, the classification result output unit 44 outputs each of the categories stored in the important sentence storage unit 24 in the order of the score of the category calculated by the category classification unit 43.

また、分類結果出力部４４は、出力されたカテゴリ及び当該カテゴリのスコアに基づくユーザの操作に応じて、分類対象文書入力部４１によって入力された分類対象文書をカテゴリに分類し、当該分類結果を分類結果格納部２５に格納する。 In addition, in accordance with a user operation based on the output categories and their scores, the classification result output unit 44 classifies the classification target document input by the classification target document input unit 41 into a category and stores the classification result in the classification result storage unit 25.

つまり、ユーザは、分類結果出力部４４によって出力されたカテゴリ及び当該カテゴリのスコアを参照して、分類対象文書入力部４１によって入力された分類対象文書が分類されるべきカテゴリを指定することができる。つまり、分類対象文書は、分類結果出力部４４によって出力されたカテゴリのうちユーザによって指定されたカテゴリに分類される。 That is, by referring to the categories and scores output by the classification result output unit 44, the user can designate the category into which the classification target document input by the classification target document input unit 41 should be classified. The classification target document is thus classified into the category that the user designates from among the categories output by the classification result output unit 44.

図８は、図７に示す重要文格納部２４のデータ構造の一例を示す。図８に示すように、重要文格納部２４には、カテゴリ名、重要文及びスコア値が対応付けて格納されている。重要文は、対応付けられているカテゴリに分類されている文書から抽出された重要文を示す。スコア値は、対応付けられている重要文のスコアを示す。 FIG. 8 shows an example of the data structure of the important sentence storage unit 24 shown in FIG. As shown in FIG. 8, the important sentence storage unit 24 stores category names, important sentences, and score values in association with each other. The important sentence indicates an important sentence extracted from a document classified in the associated category. The score value indicates the score of the associated important sentence.

なお、重要文格納部２４には、カテゴリ毎に複数の重要文及びスコアの組が格納される。 The important sentence storage unit 24 stores a plurality of sets of important sentences and scores for each category.

図８に示す例では、重要文格納部２４には、カテゴリ名「B41J 3/04 101A」に対応付けて重要文１「カラー画像を印刷する」及びスコア値「６」の組が格納されている。また、重要文格納部２４には、カテゴリ名「B41J 3/04 101A」に対応付けて重要文２「カラー印刷方法を実行する」及びスコア値「６」の組が格納されている。 In the example illustrated in FIG. 8, the important sentence storage unit 24 stores a set of the important sentence 1 “print color image” and the score value “6” in association with the category name “B41J 3/04 101A”. Yes. The important sentence storage unit 24 stores a set of an important sentence 2 “execute color printing method” and a score value “6” in association with the category name “B41J 3/04 101A”.

重要文格納部２４には、カテゴリ名「B41J 3/04 101B」に対応付けて重要文１「インクミストが発生する」及びスコア値「６」の組が格納されている。また、重要文格納部２４には、カテゴリ名「B41J 3/04 101B」に対応付けて重要文２「インクジェット記録装置を有する」及びスコア値「５」の組が格納されている。 The important sentence storage unit 24 stores a set of an important sentence 1 “Ink mist occurs” and a score value “6” in association with the category name “B41J 3/04 101B”. Further, the important sentence storage unit 24 stores a set of an important sentence 2 “having an inkjet recording apparatus” and a score value “5” in association with the category name “B41J 3/04 101B”.
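The FIG. 8 layout described above (category name, important sentence, score value) can be pictured with a small in-memory table. The following is only a minimal sketch: the class name `ImportantSentence`, the list `IMPORTANT_SENTENCE_STORE`, and the helper `sentences_by_category` are hypothetical names chosen for illustration, and the English sentence strings are translations of the FIG. 8 examples rather than the stored originals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImportantSentence:
    """One row of the important sentence storage unit 24 (hypothetical model)."""
    category: str   # category name, e.g. an FI code
    sentence: str   # important sentence extracted for that category
    score: int      # score value of the important sentence


# Rows corresponding to the FIG. 8 example in the text.
IMPORTANT_SENTENCE_STORE = [
    ImportantSentence("B41J 3/04 101A", "print a color image", 6),
    ImportantSentence("B41J 3/04 101A", "execute a color printing method", 6),
    ImportantSentence("B41J 3/04 101B", "ink mist is generated", 6),
    ImportantSentence("B41J 3/04 101B", "have an inkjet recording apparatus", 5),
]


def sentences_by_category(store):
    """Group rows by category; each category holds several (sentence, score) pairs."""
    grouped = {}
    for row in store:
        grouped.setdefault(row.category, []).append(row)
    return grouped


grouped = sentences_by_category(IMPORTANT_SENTENCE_STORE)
for category, rows in grouped.items():
    print(category, [(r.sentence, r.score) for r in rows])
```

A flat list of rows is used here because the text only states that the three fields are stored in association with each other; an actual implementation could just as well use a database table.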

次に、図９のフローチャートを参照して、本実施形態に係る重要文抽出装置４０の文書分類処理の処理手順について説明する。なお、重要文格納部２４には、前述した第１の実施形態において説明したような処理によって抽出された重要文及び当該重要文のスコアがカテゴリ別に蓄積されているものとする。 Next, a processing procedure of document classification processing of the important sentence extraction device 40 according to the present embodiment will be described with reference to the flowchart of FIG. In the important sentence storage unit 24, the important sentences extracted by the process described in the first embodiment and the scores of the important sentences are accumulated for each category.

まず、分類対象文書入力部４１は、ユーザの操作に応じて、複数の語を含む複数の文から構成される文書（分類対象文書）を入力する（ステップＳ２１）。ユーザは、例えば重要文抽出装置４０を操作することによって分類対象文書を指定することができる。なお、分類対象文書入力部４１によって入力された分類対象文書には、カテゴリは付与されていない。 First, the classification target document input unit 41 inputs a document (classification target document) composed of a plurality of sentences including a plurality of words in accordance with a user operation (step S21). For example, the user can designate a classification target document by operating the important sentence extraction device 40. The classification target document input by the classification target document input unit 41 is not given a category.

ここで、図１０は、分類対象文書入力部４１によって入力された分類対象文書の一例を示す。図１０に示す例では、分類対象文書２００は、特許文書である。また、分類対象文書２００には、前述した図５に示す抽出対象文書１００とは異なり、カテゴリは付与されていない。以下、分類対象文書入力部４１によって図１０に示す分類対象文書２００が入力されたものとして説明する。 Here, FIG. 10 shows an example of the classification target document input by the classification target document input unit 41. In the example shown in FIG. 10, the classification target document 200 is a patent document. Unlike the extraction target document 100 shown in FIG. 5 described above, no category is assigned to the classification target document 200. The following description assumes that the classification target document 200 shown in FIG. 10 has been input by the classification target document input unit 41.

次に、概念検索部４２は、重要文格納部２４に格納されている重要文を１つ取得する（ステップＳ２２）。 Next, the concept search unit 42 acquires one important sentence stored in the important sentence storage unit 24 (step S22).

概念検索部４２は、取得された重要文を用いて分類対象文書入力部４１によって入力された分類対象文書２００に対する概念検索（自然文検索）を行う（ステップＳ２３）。この場合、概念検索部４２は、重要文を検索キー、分類対象文書２００を検索対象として概念検索を行う。概念検索とは、例えば検索したい内容の文章を検索キーとし、当該文章に近い内容の情報を検索する検索手法である。 The concept search unit 42 performs a concept search (natural sentence search) on the classification target document 200 input by the classification target document input unit 41 using the acquired important sentence (step S23). In this case, the concept search unit 42 performs a concept search using an important sentence as a search key and the classification target document 200 as a search target. The concept search is a search method for searching for information having contents close to the sentence using, for example, a sentence having a content to be searched as a search key.

カテゴリ分類部４３は、取得された重要文に分類対象文書入力部４１によって入力された分類対象文書２００が合致するか否かを、概念検索部４２による概念検索結果に基づいて判定する（ステップＳ２４）。概念検索における検索結果は、検索対象（分類対象文書２００）が検索キー（重要文）に合致する度合い（％）によって表される。カテゴリ分類部４３は、概念検索結果における検索対象が検索キーに合致する度合いが一定の値（例えば、７０％）以上である場合に、分類対象文書２００が重要文に合致すると判定する。 The category classification unit 43 determines whether or not the classification target document 200 input by the classification target document input unit 41 matches the acquired important sentence based on the concept search result by the concept search unit 42 (step S24). ). The search result in the concept search is represented by the degree (%) that the search target (classification target document 200) matches the search key (important sentence). The category classification unit 43 determines that the classification target document 200 matches the important sentence when the degree of matching of the search target in the concept search result is a certain value (for example, 70%) or more.
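The match decision in step S24 can be sketched as a threshold test on a match degree expressed in percent. The text does not specify how the concept search computes that degree, so the function `match_degree` below is a deliberately simple stand-in (the share of the search key's words that occur in the document), not the concept search engine itself. The names `match_degree`, `is_match`, and `MATCH_THRESHOLD` are hypothetical; only the 70% example value comes from the text.

```python
import re

MATCH_THRESHOLD = 70.0  # percent; the text gives 70% as an example value


def match_degree(search_key: str, document: str) -> float:
    """Stand-in for the concept search: percent of key words found in the document.

    This word-overlap measure is an assumption made for illustration only.
    """
    key_words = set(re.findall(r"\w+", search_key.lower()))
    doc_words = set(re.findall(r"\w+", document.lower()))
    if not key_words:
        return 0.0
    return 100.0 * len(key_words & doc_words) / len(key_words)


def is_match(search_key: str, document: str,
             threshold: float = MATCH_THRESHOLD) -> bool:
    """Step S24: the document matches when the degree reaches the threshold."""
    return match_degree(search_key, document) >= threshold


document = "The apparatus prints a color image on paper."
print(match_degree("print a color image", document))   # 3 of 4 key words occur
print(is_match("print a color image", document))
print(is_match("ink mist is generated", document))
```

Keeping the threshold as a parameter reflects the text's wording that 70% is only an example of the "certain value".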

分類対象文書２００が重要文に合致すると判定された場合（ステップＳ２４のＹＥＳ）、カテゴリ分類部４３は、当該重要文（ステップＳ２３において検索キーとして用いられた重要文）に対応付けて重要文格納部２４に格納されているスコア（当該重要文のスコア）を、当該重要文格納部２４から取得する。 When it is determined that the classification target document 200 matches the important sentence (YES in step S24), the category classification unit 43 stores the important sentence in association with the important sentence (important sentence used as the search key in step S23). The score (score of the important sentence) stored in the unit 24 is acquired from the important sentence storage unit 24.

カテゴリ分類部４３は、取得された重要文のスコアを、重要文格納部２４において概念検索部４２によって取得された重要文に対応付けられているカテゴリのスコアに加算する（ステップＳ２５）。 The category classification unit 43 adds the score of the acquired important sentence to the score of the category associated with the important sentence acquired by the concept search unit 42 in the important sentence storage unit 24 (step S25).

一方、分類対象文書２００が重要文に合致しないと判定された場合（ステップＳ２４のＮＯ）、後述するステップＳ２６の処理が実行される。 On the other hand, when it is determined that the classification target document 200 does not match the important sentence (NO in step S24), a process in step S26 described later is executed.

なお、上記したステップＳ２２〜ステップＳ２５の処理は、重要文格納部２４に格納されている全ての重要文について実行される。これにより、カテゴリ分類部４３は、重要文格納部２４に格納されているカテゴリに対応付けられている重要文であって、分類対象文書２００が合致すると判定された重要文に対応付けられているスコア（重要文のスコア）の合計値を、当該カテゴリのスコアとして算出する。また、上記したステップＳ２２〜ステップＳ２５の処理を全ての重要文について実行されることにより、カテゴリ分類部４３は、重要文格納部２４に格納されているカテゴリの全てについてスコア（カテゴリ別のスコア）を算出する。 Note that the processes of steps S22 to S25 described above are executed for all the important sentences stored in the important sentence storage unit 24. As a result, the category classification unit 43 calculates, as the score of a category, the total of the scores (important sentence scores) associated with those important sentences that are associated with the category in the important sentence storage unit 24 and that the classification target document 200 has been determined to match. Because steps S22 to S25 are executed for all the important sentences, the category classification unit 43 calculates a score (score by category) for every category stored in the important sentence storage unit 24.

ここで、カテゴリのスコアの算出処理について上記した図８を用いて具体的に説明する。例えば分類対象文書２００が図８に示すカテゴリ名「B41J 3/04 101A」に対応付けられている重要文１「カラー画像を印刷する」及び重要文２「カラー印刷方法を実行する」に合致する場合を想定する。この場合、カテゴリ名「B41J 3/04 101A」のスコアは、重要文１「カラー画像を印刷する」に対応付けられているスコア値「６」及び重要文２「カラー印刷方法を実行する」に対応付けられているスコア値「６」の合計、つまり、「１２」となる。 The category score calculation process will now be described concretely with reference to FIG. 8 above. Suppose, for example, that the classification target document 200 matches both important sentence 1 "print a color image" and important sentence 2 "execute a color printing method", which are associated with the category name "B41J 3/04 101A" shown in FIG. 8. In this case, the score of the category "B41J 3/04 101A" is the sum of the score value "6" associated with important sentence 1 and the score value "6" associated with important sentence 2, that is, "12".

また、例えば分類対象文書２００が図８に示すカテゴリ名「B41J 3/04 101B」に対応付けられている重要文１「インクミストが発生する」には合致せず、重要文２「インクジェット記録装置を有する」に合致する場合想定する。この場合、カテゴリ名「B41J 3/04 101B」のスコアは、重要文２「インクジェット記録装置を有する」に対応付けられているスコア値「５」となる。 Suppose further that the classification target document 200 does not match important sentence 1 "ink mist is generated" but does match important sentence 2 "have an inkjet recording apparatus", both of which are associated with the category name "B41J 3/04 101B" shown in FIG. 8. In this case, the score of the category "B41J 3/04 101B" is the score value "5" associated with important sentence 2.

つまり、重要文格納部２４に格納されている全ての重要文について上記したステップＳ２２〜ステップＳ２５の処理が実行されると、全てのカテゴリのスコアが算出される。 That is, when the above-described steps S22 to S25 are executed for all important sentences stored in the important sentence storage unit 24, scores of all categories are calculated.
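The loop over steps S22 to S26 can be sketched as a single pass over the stored important sentences that adds each matched sentence's score to its category. This is a minimal sketch, not the patented implementation: the function name `category_scores`, the tuple layout of `STORE`, and the `matched` set (which stands in for the concept search decisions of steps S23 and S24) are assumptions. The data reproduces the worked example above, in which the document matches both 101A sentences and only the second 101B sentence.

```python
# (category, important sentence, score) rows from the FIG. 8 example.
STORE = [
    ("B41J 3/04 101A", "print a color image", 6),
    ("B41J 3/04 101A", "execute a color printing method", 6),
    ("B41J 3/04 101B", "ink mist is generated", 6),
    ("B41J 3/04 101B", "have an inkjet recording apparatus", 5),
]


def category_scores(store, is_match):
    """Sum the scores of matched important sentences per category.

    `is_match(sentence)` stands in for steps S23-S24 (concept search plus
    threshold decision); it is a hypothetical callback.
    """
    # Every stored category gets a score, even if nothing matches (score 0).
    scores = {category: 0 for category, _sentence, _score in store}
    for category, sentence, score in store:      # S22: take one sentence
        if is_match(sentence):                   # S23-S24: search and decide
            scores[category] += score            # S25: add to category score
    return scores                                # loop exhausted: S26 is YES


# Match decisions assumed in the worked example of the text.
matched = {
    "print a color image",
    "execute a color printing method",
    "have an inkjet recording apparatus",
}
scores = category_scores(STORE, lambda sentence: sentence in matched)
print(scores)  # {'B41J 3/04 101A': 12, 'B41J 3/04 101B': 5}
```

Initializing every category to zero is a design choice made here so that categories with no matched sentence still appear in the result, which fits the statement that scores are calculated for all categories.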

上記したステップＳ２５の処理が実行されると、重要文格納部２４に格納されている全ての重要文について上記したステップＳ２２〜ステップＳ２５の処理が実行されたか否かが判定される（ステップＳ２６）。全ての重要文についてステップＳ２２〜ステップＳ２５の処理が実行されていないと判定された場合には、ステップＳ２２に戻って処理が繰り返される。 When the process of step S25 described above is executed, it is determined whether or not the processes of steps S22 to S25 described above have been executed for all the important sentences stored in the important sentence storage unit 24 (step S26). . If it is determined that the processes in steps S22 to S25 are not executed for all important sentences, the process returns to step S22 and is repeated.

一方、全ての重要文についてステップＳ２２〜ステップＳ２５の処理が実行されたと判定された場合（ステップＳ２６のＹＥＳ）、つまり、重要文格納部２４に格納されているカテゴリ別のスコアが算出された場合、分類結果出力部４４は、当該カテゴリ及び当該カテゴリのスコアを出力する。このとき、分類結果出力部４４は、重要文格納部２４に格納されているカテゴリ（つまり、分類対象文書が分類されるカテゴリ候補）を当該カテゴリ別のスコア順に出力する（ステップＳ２７）。 On the other hand, when it is determined that the processing of steps S22 to S25 has been executed for all important sentences (YES in step S26), that is, when the scores for each category stored in the important sentence storage unit 24 are calculated. The classification result output unit 44 outputs the category and the score of the category. At this time, the classification result output unit 44 outputs the categories stored in the important sentence storage unit 24 (that is, category candidates into which the classification target document is classified) in the order of score for each category (step S27).

ここで、図１１は、分類結果出力部４４によってカテゴリ候補が出力（表示）された場合の一例を示す。図１１に示す例では、例えばカテゴリ名「B41J 3/04 101A」及び「B41J 3/04 101A」が当該カテゴリのスコア（確信度）とともに当該スコア順に出力されている。つまり、分類結果出力部４４によって分類対象文書２００が分類されるべき複数のカテゴリ候補がスコア（確信度）順に出力される。 Here, FIG. 11 shows an example when category candidates are output (displayed) by the classification result output unit 44. In the example illustrated in FIG. 11, for example, category names “B41J 3/04 101A” and “B41J 3/04 101A” are output in the order of the scores together with the score (confidence) of the category. That is, the classification result output unit 44 outputs a plurality of category candidates to which the classification target document 200 should be classified in the order of scores (confidence levels).
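The score-ordered output of step S27 can be sketched as a descending sort of the per-category scores. The scores below are the values from the worked example in the text (12 and 5), not the values shown in FIG. 11, which are not reproduced here; the function name `rank_categories` and the printed line format are assumptions.

```python
def rank_categories(scores):
    """Return (category, score) pairs ordered from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical input: category scores taken from the worked example in the text.
example_scores = {"B41J 3/04 101B": 5, "B41J 3/04 101A": 12}

ranked = rank_categories(example_scores)
for position, (category, score) in enumerate(ranked, start=1):
    # Each line is one category candidate with its score (confidence).
    print(f"{position}. {category}  score={score}")
```

Python's `sorted` is stable, so categories with equal scores keep their original relative order; the text does not say how ties should be ordered.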

上記した図１１に示すようにカテゴリ候補が出力された場合、ユーザは、当該カテゴリ名（カテゴリ候補）及び当該カテゴリのスコアを参照して分類対象文書２００が分類されるべきカテゴリを指定することができる。つまり、カテゴリのスコアが低い場合であっても、ユーザが、当該カテゴリが適切であると考える場合には、分類対象文書２００を当該カテゴリに分類することができる。 When category candidates are output as shown in FIG. 11 above, the user can designate the category into which the classification target document 200 should be classified by referring to the category names (category candidates) and their scores. That is, even if a category's score is low, the user can classify the classification target document 200 into that category if the user considers it appropriate.

分類結果出力部４４は、上記したユーザの操作（指定）に応じて、分類対象文書２００が分類されるカテゴリを分類結果として分類結果格納部２５に格納する（ステップＳ２８）。 The classification result output unit 44 stores the category into which the classification target document 200 is classified in the classification result storage unit 25 as a classification result in accordance with the above-described user operation (designation) (step S28).

上記したように本実施形態においては、カテゴリが付与されていない分類対象文書が入力された場合、重要文格納部２４に格納されたカテゴリ別の重要文及び当該重要文のスコアを用いて当該分類対象文書が合致する重要文のスコアの合計値を当該カテゴリのスコア（確信度）として算出する。これにより、本実施形態においては、算出されたカテゴリのスコアに基づいて分類対象文書が分類されるべきカテゴリに当該分類対象文書を分類することが可能となる。このとき、カテゴリ毎のスコアを出力することにより、ユーザは、当該スコアが最も高いカテゴリだけでなく２番目以下のカテゴリについても参照して分類対象文書が分類されるべきカテゴリを指定することができる。 As described above, in the present embodiment, when a classification target document to which no category is assigned is input, the total of the scores of the important sentences that the document matches is calculated as the score (confidence) of each category, using the important sentences and their scores stored by category in the important sentence storage unit 24. This makes it possible to classify the classification target document into the category into which it should be classified, based on the calculated category scores. Because the score of each category is output, the user can refer not only to the category with the highest score but also to the second and lower-ranked categories when designating the category into which the classification target document should be classified.

なお、本実施形態においては、算出されたカテゴリ別のスコアを出力し、当該カテゴリ別のスコアを参照したユーザによって指定されたカテゴリに分類対象文書を分類するものとして説明したが、当該カテゴリ別のスコアに基づいて自動的に分類対象文書が分類される構成であっても構わない。この場合、上記したカテゴリ分類部４３は、算出されたカテゴリのスコア（確信度）に基づいて、例えば当該スコアが最も高いカテゴリに分類対象文書を分類する。 In the present embodiment, the calculated score for each category is output, and the classification target document is classified into the category specified by the user referring to the score for each category. A configuration may be adopted in which the classification target documents are automatically classified based on the score. In this case, the above-described category classification unit 43 classifies the classification target document into, for example, a category having the highest score based on the calculated category score (confidence).
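The automatic variant described above, in which the document goes to the category with the highest score, can be sketched as follows. The function name `auto_classify` is hypothetical. The text does not say what happens when no important sentence matches or when two categories tie, so returning `None` when every score is zero and taking the first category on a tie are assumptions of this sketch.

```python
def auto_classify(scores):
    """Pick the category with the highest score (confidence).

    Returns None when there are no categories or when no category has a
    positive score; this fallback is an assumption, not part of the text.
    """
    if not scores:
        return None
    # max() returns the first category encountered among equal top scores.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical input: category scores from the worked example in the text.
print(auto_classify({"B41J 3/04 101A": 12, "B41J 3/04 101B": 5}))
print(auto_classify({"B41J 3/04 101A": 0, "B41J 3/04 101B": 0}))
```

In a system that still wants a human in the loop for weak results, the `None` case could be routed back to the user-designated flow of steps S27 and S28.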

また、本実施形態においては、分類対象文書が入力される際には、重要文格納部２４には既にカテゴリ別に重要文及び当該重要文のスコアが蓄積されているものとして説明したが、重要文格納部２４内に十分な量の重要文及び当該重要文のスコアが蓄積されていない場合には、例えば文書ＤＢ等に蓄積されている大量の文書に対して前述した第１の実施形態における図４に示すような処理を繰り返し実行することによって重要文格納部２４を構築した後に分類対象文書に対する分類処理が行われても構わない。 The present embodiment has also been described on the assumption that, when a classification target document is input, important sentences and their scores have already been accumulated by category in the important sentence storage unit 24. If a sufficient number of important sentences and scores have not been accumulated in the important sentence storage unit 24, the classification process for the classification target document may instead be performed after the important sentence storage unit 24 has been built by repeatedly executing the process shown in FIG. 4 of the first embodiment on a large number of documents accumulated in, for example, a document DB.

なお、本願発明は、上記各実施形態そのままに限定されるものではなく、実施段階ではその要旨を逸脱しない範囲で構成要素を変形して具体化できる。また、上記各実施形態に開示されている複数の構成要素の適宜な組合せにより種々の発明を形成できる。例えば、各実施形態に示される全構成要素から幾つかの構成要素を削除してもよい。更に、異なる実施形態に亘る構成要素を適宜組合せてもよい。 Note that the present invention is not limited to the above embodiments as they are; in the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the above embodiments. For example, some constituent elements may be deleted from all the constituent elements shown in each embodiment. Furthermore, constituent elements from different embodiments may be combined as appropriate.

１０…コンピュータ、２０…外部記憶装置、２２…階層重要語格納部（重要語格納手段）、２３…同義語辞書、２４…重要文格納部、２５…分類結果格納部、３０，４０…重要文抽出装置、３１…抽出対象文書入力部、３２…重要語決定部、３３…重要文抽出部、３４…スコア算出部、３５…重要文出力部、４１…分類対象文書入力部、４２…概念検索部、４３…カテゴリ分類部、４４…分類結果出力部。 DESCRIPTION OF SYMBOLS: 10 ... computer, 20 ... external storage device, 22 ... hierarchical important word storage unit (important word storage means), 23 ... synonym dictionary, 24 ... important sentence storage unit, 25 ... classification result storage unit, 30, 40 ... important sentence extraction device, 31 ... extraction target document input unit, 32 ... important word determination unit, 33 ... important sentence extraction unit, 34 ... score calculation unit, 35 ... important sentence output unit, 41 ... classification target document input unit, 42 ... concept search unit, 43 ... category classification unit, 44 ... classification result output unit.

Claims

An important word storage means for storing important words corresponding to each of the categories into which the document is classified, and an external storage device having important word storage means in which categories related to the categories are represented in a hierarchical structure; In an important sentence extraction device composed of a computer using the external storage device, an important sentence extraction program executed by the computer,
In the computer,
Inputting an extraction target document composed of a sentence including a plurality of words according to a user's operation, to which a category to which the extraction target document is classified is assigned;
Reading a first important word corresponding to a category assigned to the input extraction target document from the important word storage means;
Determining whether the read first important word is included in the input extraction target document;
When it is determined that the first important word is included in the extraction target document, the first important word is determined as a first extraction important word;
Reading a second important word corresponding to a category associated with a category assigned to the input extraction target document from the important word storage means;
Determining whether the read second important word is included in the input extraction target document;
When it is determined that the second important word is included in the extraction target document, the second important word is determined as a second extracting important word;
Extracting from the input extraction target document a sentence including at least one of the determined first and second extraction important words as an important sentence;
Calculating the importance of the important sentence based on the determined first extraction important word or second extracted important word included in the extracted important sentence;
An important sentence extraction program for executing the step of outputting the extracted important sentences in the order of the calculated importance.

In the computer,
Storing the category assigned to the input extraction target document, the extracted important sentence and the calculated importance of the important sentence in association with each other in an important sentence storage unit;
Inputting a classification target document composed of sentences including a plurality of words according to a user operation;
Determining whether the input classification target document matches an important sentence stored in the important sentence storage means;
Calculating, for each category stored in the important sentence storage means, as the certainty factor of the category, the total value of the importance levels associated with those important sentences that are associated with the category and that the classification target document has been determined to match;
Classifying the input classification target document based on the certainty of the calculated category;
The important sentence extraction program according to claim 1, further comprising the step of storing the classification result in a classification result storage means.

The important sentence extraction program according to claim 2, wherein the step of classifying the classification target document includes:
Outputting the category stored in the important sentence storage means and the calculated certainty factor of the category; and
Classifying the input classification target document into the category in accordance with the user's operation based on the output category and the certainty factor of the category.

The important sentence extraction program according to claim 1, wherein the important word corresponding to each of the categories stored in the important word storage means includes a score predetermined based on the positional relationship, in the hierarchical structure, between the category corresponding to the important word and the category assigned to the input extraction target document, and
In the step of calculating the importance of the important sentence, the importance of the important sentence is calculated by adding the predetermined score of the determined first extraction important word or second extraction important word included in the important sentence.

Important word storage means for storing important words corresponding to each of categories into which a document is classified, and important word storage means in which categories related to the category are represented in a hierarchical structure;
Input means for inputting an extraction target document composed of a sentence including a plurality of words according to a user's operation, to which a category to which the extraction target document is classified is assigned;
First reading means for reading a first important word corresponding to a category assigned to the input extraction target document from the important word storage means;
First judgment means for judging whether or not the read first important word is included in the input extraction target document;
First determination means for determining the first important word as a first extraction important word when it is judged that the first important word is included in the extraction target document;
Second reading means for reading a second important word corresponding to a category associated with a category assigned to the input extraction target document from the important word storage means;
Second judgment means for judging whether or not the second important word read by the second reading means is included in the input extraction target document;
Second determination means for determining the second important word as a second extraction important word when the second judgment means judges that the second important word is included in the extraction target document;
Extracting means for extracting, from the input extraction target document, as an important sentence, a sentence including at least one of the first extraction important word determined by the first determination means and the second extraction important word determined by the second determination means;
Calculation means for calculating the importance of the important sentence based on the first extraction important word determined by the first determination means or the second extraction important word determined by the second determination means included in the extracted important sentence; and
An important sentence extraction apparatus comprising: output means for outputting the extracted important sentences in the order of the calculated importance.