JP2017010107A - Information processing device, information processing system and program

Info

Publication number: JP2017010107A
Application number: JP2015121927A
Authority: JP
Inventors: 和久大野; Kazuhisa Ono
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 2015-06-17
Filing date: 2015-06-17
Publication date: 2017-01-12
Anticipated expiration: 2035-06-17
Also published as: JP6613644B2

Abstract

PROBLEM TO BE SOLVED: To provide an information processing device, an information processing system, and a program for extracting sentences suitable for a summary from a document.

SOLUTION: A document analysis device 1 for evaluating sentences included in a document includes a pre-processing part 12 for dividing the document into sentence units, an importance calculation part 14 for calculating an importance score indicating the degree of importance of a sentence, a simplicity calculation part 15 for calculating a simplicity score indicating the degree of simplicity of the sentence, and an evaluation score calculation part 16 for calculating an evaluation score, which indicates the degree to which the sentence is both important and simple, by multiplying the calculated importance score by the calculated simplicity score.

SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing apparatus, an information processing system, and a program that process documents.

A summary condenses the main points of a document, so its content should help the user understand the document. One disclosed method of generating a summary determines the user's level from the user's usage history, determines the difficulty of the document from the document frequency of the words in its sentences, and generates a summary from the document difficulty and the user's level (see, for example, Patent Document 1).

Patent Document 1: Japanese Patent No. 5201727

Because the method described in Patent Document 1 relies on the user's usage history, it gives no consideration to users who are using the system for the first time. In addition, its means for summarizing the document relies on conventional techniques.

Accordingly, an object of the present invention is to provide an information processing apparatus, an information processing system, and a program for extracting sentences suitable for a summary from a document.

The present invention solves the above problem by the following means.
A first invention is an information processing apparatus that evaluates sentences included in a document, comprising: sentence dividing means for dividing the document into sentence units; importance calculating means for calculating an importance score indicating the degree of importance of a sentence; simplicity calculating means for calculating a simplicity score indicating the degree of simplicity of the sentence; and evaluation score calculating means for calculating, using the calculated importance score and the simplicity score, an evaluation score indicating the degree to which the sentence is both important and simple.
A second invention is the information processing apparatus of the first invention, further comprising: document processing means for executing the evaluation score calculating means on all sentences included in the document; and sentence extracting means for extracting sentences with a high evaluation score based on the evaluation score of each sentence calculated by the document processing means.
A third invention is the information processing apparatus of the first or second invention, communicably connected to a headword reference number database that stores the number of references to each headword of a technical term dictionary database, which is a dictionary having technical terms as headwords, wherein the simplicity calculating means calculates the simplicity score by referring to the headword reference number database.
A fourth invention is the information processing apparatus of any one of the first to third inventions, communicably connected to a technical term dictionary database, which is a dictionary having technical terms as headwords, wherein the simplicity calculating means weights the simplicity score when the sentence includes a headword of the technical term dictionary database.
A fifth invention is the information processing apparatus of the fourth invention, wherein the simplicity calculating means lowers the weight when the sentence includes a headword that is included in the technical term dictionary database and is also included in a general term dictionary database having general terms as headwords.
A sixth invention is the information processing apparatus of any one of the first to fifth inventions, wherein the importance calculating means calculates a higher importance score the more the sentence includes nouns that appear frequently in the document.
A seventh invention is an information processing system comprising: the information processing apparatus of any one of the third to fifth inventions; and a headword reference number database that stores the number of references to each headword of a technical term dictionary database, which is a dictionary having technical terms as headwords.
An eighth invention is a program for causing a computer to function as the information processing apparatus of any one of the first to sixth inventions.

According to the present invention, it is possible to provide an information processing apparatus, an information processing system, and a program for extracting sentences suitable for a summary from a document.

FIG. 1 is a diagram showing the functional blocks of the document analysis system according to the present embodiment.
FIG. 2 is a diagram showing examples of the various DBs in the document analysis system according to the present embodiment.
FIG. 3 is a flowchart showing the document analysis processing in the document analysis device according to the present embodiment.
FIG. 4 is a diagram showing a display example on the terminal according to the present embodiment.
FIG. 5 is a diagram showing an example of a target document analyzed by the document analysis device according to the present embodiment.
FIG. 6 is a flowchart showing the pre-processing in the document analysis device according to the present embodiment.
FIG. 7 is a diagram for explaining the pre-processing in the document analysis device according to the present embodiment.
FIG. 8 is a flowchart showing the score calculation processing in the document analysis device according to the present embodiment.
FIG. 9 is a diagram for explaining the score calculation processing and the reordering of evaluation target sentences in the document analysis device according to the present embodiment.

Hereinafter, embodiments for carrying out the present invention will be described with reference to the drawings. Note that this is merely an example, and the technical scope of the present invention is not limited to it.
(Embodiment)
<Overall Configuration of Document Analysis System 100>
FIG. 1 is a diagram showing the functional blocks of the document analysis system 100 according to the present embodiment.
FIG. 2 is a diagram showing examples of the various DBs in the document analysis system 100 according to the present embodiment.

As shown in FIG. 1, in the document analysis system 100 (information processing system), a document analysis device 1 (information processing apparatus), a technical term dictionary DB (database) 4, a headword reference number DB 5, a general term dictionary DB 6, and a terminal 8 are connected via a communication network N.
The document analysis system 100 extracts important and simple sentences from a target document designated by the terminal 8 and outputs them to the terminal 8.

<Document Analysis Device 1>
The document analysis device 1 receives a target document and extracts from it important and simple sentences suitable for a summary.
The document analysis device 1 includes a control unit 10, a storage unit 20, and a communication interface unit 29.
The control unit 10 is a CPU (central processing unit) that controls the entire document analysis device 1. The control unit 10 reads and executes, as appropriate, the OS (operating system) and the various application programs stored in the storage unit 20, thereby cooperating with the hardware described above to execute various functions.
The control unit 10 includes a document receiving unit 11, a pre-processing unit 12 (sentence dividing means), a sentence processing unit 13 (document processing means), an importance calculation unit 14 (importance calculating means), a simplicity calculation unit 15 (simplicity calculating means), an evaluation score calculation unit 16 (evaluation score calculating means), and a sentence output unit 17 (sentence extracting means).

The document receiving unit 11 is a control unit that receives a target document. For example, the document receiving unit 11 receives the target document from the terminal 8.
Here, the target document includes a plurality of sentences. In the following description, the unit of extraction is the set of sentences contained in a paragraph. That is, the sentences contained in one paragraph form one evaluation target sentence. An evaluation target sentence may therefore consist of a single sentence delimited by a period, or of several such sentences.
The pre-processing unit 12 is a control unit that performs pre-processing on the target document before the sentence processing unit 13 processes it.
The sentence processing unit 13 is a control unit that calculates an evaluation score for each evaluation target sentence of the target document.
The importance calculation unit 14 is a control unit that calculates an importance score indicating the degree of importance of one evaluation target sentence.
The simplicity calculation unit 15 is a control unit that calculates a simplicity score indicating the degree of simplicity of one evaluation target sentence.
The evaluation score calculation unit 16 is a control unit that calculates, from the importance score and the simplicity score, an evaluation score indicating the degree to which one evaluation target sentence is both important and simple.
The sentence output unit 17 is a control unit that extracts evaluation target sentences with a high evaluation score from the target document and outputs them to the terminal 8.
Details of these functions will be described later.
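As a rough illustration of how the units above fit together, the following Python sketch multiplies per-sentence importance and simplicity scores (the abstract states that the evaluation score is their product) and returns the highest-scoring evaluation target sentences. The function name `rank_sentences` and the `top_n` cutoff are hypothetical; this excerpt does not say how many sentences are output.

```python
def rank_sentences(sentences, importance, simplicity, top_n=3):
    """Return (evaluation score, sentence) pairs, best first.

    `top_n` is a hypothetical cutoff; the source only says that
    sentences with a high evaluation score are extracted.
    """
    # Evaluation score = importance score x simplicity score.
    scored = [(p * e, s) for s, p, e in zip(sentences, importance, simplicity)]
    # Sort by score only, highest first (stable for ties).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


ranked = rank_sentences(["a", "b", "c"], [1.0, 2.0, 1.5], [2.0, 0.5, 2.0], top_n=2)
```

Sorting on the score alone keeps sentences with equal scores in document order, which seems a reasonable default when the source gives no tie-breaking rule.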

The storage unit 20 is a storage device, such as a hard disk or a semiconductor memory element, for storing the programs, data, and the like necessary for the operation of the document analysis device 1.
Note that a computer refers to an information processing apparatus including a control unit, a storage device, and the like. The document analysis device 1 is an information processing apparatus including the control unit 10, the storage unit 20, and the like, and therefore falls within the concept of a computer.
The storage unit 20 includes a program storage unit 21 and a document storage unit 22.
The program storage unit 21 is a storage area for storing programs. The program storage unit 21 stores a document analysis program 21a.
The document analysis program 21a is a program for executing the various functions of the control unit 10.
The document storage unit 22 is a storage area for storing the target document received from the terminal 8.
The storage unit 20 also has a temporary storage area and the like used temporarily for calculating the evaluation scores of the target document.
The communication interface unit 29 is an interface for communicating with the various DBs and the terminal 8 via the communication network N.

Note that there is no limit on the number of pieces of hardware constituting the document analysis device 1; it may consist of one or more pieces as needed. The hardware of the document analysis device 1 may also include various servers, such as a Web server, a DB (database) server, and an application server, as needed, and these may be implemented on a single server or on separate servers.

<DB>
The technical term dictionary DB 4 is a dictionary DB containing a large number of technical terms. As shown in FIG. 2(A), the technical term dictionary DB 4 associates each technical term with its meaning, usage, the content it denotes, and the like. Although FIG. 1 shows a single technical term dictionary DB 4, a separate technical term dictionary DB 4 may be provided for each field.
The headword reference number DB 5 is a DB that stores the access log of the technical term dictionary DB 4. As shown in FIG. 2(B), the headword reference number DB 5 stores each technical term of the technical term dictionary DB 4 in association with the number of times that term has been referenced. When a user who wants to look up a technical term accesses the technical term dictionary DB 4 from the terminal 8 and checks the meaning of the term, 1 is added to the reference number in the headword reference number DB 5 for the accessed term.
The general term dictionary DB 6 is a dictionary DB containing a large number of words. The general term dictionary DB 6 may be, for example, the DB of a Japanese-language dictionary or the DB of a morphological analysis dictionary. The general term dictionary DB 6 may also include, among its headwords, some technical terms that have passed into general use. As shown in FIG. 2(C), the technical term dictionary DB 4 and the general term dictionary DB 6 have the same structure but contain different words.
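The bookkeeping described for the headword reference number DB 5 can be sketched in Python as follows. This is only an in-memory stand-in under stated assumptions: the names `reference_counts`, `look_up`, and `technical_dictionary`, and the sample entry, are hypothetical, and the source does not describe the actual storage or access mechanism.

```python
from collections import Counter

# Hypothetical in-memory stand-in for the headword reference number DB 5.
reference_counts = Counter()


def look_up(term, dictionary):
    """Return the dictionary entry for `term` and record the access."""
    entry = dictionary.get(term)
    if entry is not None:
        # Each successful lookup of a headword adds 1 to its reference number.
        reference_counts[term] += 1
    return entry


# Hypothetical stand-in for the technical term dictionary DB 4.
technical_dictionary = {"machine learning": "a placeholder definition"}

look_up("machine learning", technical_dictionary)
look_up("machine learning", technical_dictionary)
look_up("unknown term", technical_dictionary)  # not a headword, so not counted
```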

<Terminal 8>
Returning to FIG. 1, the terminal 8 is a terminal used by the user U. The terminal 8 can be, for example, a personal computer (PC) or a tablet terminal. Although not shown, the terminal 8 includes a control unit, a storage unit, a display unit, and the like.
The communication network N is a network connecting the document analysis device 1, the various DBs 4 to 6, and the terminal 8, and is, for example, a communication network such as an Internet line.

<Document Analysis Processing>
Next, the processing performed by the document analysis device 1 will be described.
FIG. 3 is a flowchart showing the document analysis processing in the document analysis device 1 according to the present embodiment.
FIG. 4 is a diagram showing a display example on the terminal 8 according to the present embodiment.
FIG. 5 is a diagram showing an example of the target document 30 analyzed by the document analysis device 1 according to the present embodiment.
FIG. 6 is a flowchart showing the pre-processing in the document analysis device 1 according to the present embodiment.
FIG. 7 is a diagram for explaining the pre-processing in the document analysis device 1 according to the present embodiment.
FIG. 8 is a flowchart showing the score calculation processing in the document analysis device 1 according to the present embodiment.
FIG. 9 is a diagram for explaining the score calculation processing and the reordering of evaluation target sentences in the document analysis device 1 according to the present embodiment.

In step S (hereinafter simply "S") 10 of FIG. 3, the control unit 10 (document receiving unit 11) of the document analysis device 1 receives the target document 30 from the terminal 8.
By connecting to the document analysis device 1, the terminal 8 displays the document input screen 80 shown in FIG. 4(A) on its display unit. The user U of the terminal 8 then enters the target document 30 to be summarized into the document input unit 80a. If the target document 30 already exists as a data file, the user may instead select the button 80b and specify, in the data name input unit 80c, the name of the file stored in the terminal 8. If the target document 30 is on the Internet, the user may specify, in the data name input unit 80c, the URI (Uniform Resource Identifier) of the Web page on which the target document is published. After the target document 30 has been entered or specified, the user U selects the button 80d, and the control unit of the terminal 8 transmits the target document 30 to the document analysis device 1. In this way, through the user U's operation of the terminal 8, the control unit 10 (document receiving unit 11) of the document analysis device 1 can receive the target document 30 from the terminal 8. FIG. 5 shows an example of the target document 30 received from the terminal 8.

Returning to FIG. 3, in S11, the control unit 10 (pre-processing unit 12) performs pre-processing on the target document 30.
The pre-processing will now be described with reference to FIG. 6.
In S20 of FIG. 6, the control unit 10 acquires main words from the target document 30 and creates a main word list 35. Here, a main word is, for example, a noun included in at least one of the document title 31, the opening sentence 32, and the chapter and section headings 33 of the target document 30 shown in FIG. 5. This is based on the empirical rule that the main (important) words of the target document 30 often appear in the document title 31, the opening sentence 32, or the chapter and section headings 33.
In this example, the control unit 10 acquires the nouns included in the document title 31 of the target document 30 shown in FIG. 5 as main words. In FIG. 5, the nouns included in the document title 31 are underlined to make them easy to identify. The control unit 10 then stores the acquired main words in the main word list 35.
The main word list 35 shown in FIG. 7(A) stores the acquired main words.
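A minimal Python sketch of building the main word list from the title follows. It assumes the title has already been run through a morphological analyzer that yields (surface form, part of speech) pairs; the source does not name an analyzer, and the function name `build_main_word_list`, the `"noun"` tag, and the sample tokens are all hypothetical.

```python
def build_main_word_list(title_tokens):
    """Collect the distinct nouns of the title in order of first appearance.

    `title_tokens` is assumed to be a list of (surface, part_of_speech)
    pairs produced by some morphological analyzer (not specified here).
    """
    main_words = []
    for surface, pos in title_tokens:
        if pos == "noun" and surface not in main_words:
            main_words.append(surface)
    return main_words


# Hypothetical, pre-tokenized title.
title_tokens = [
    ("document", "noun"),
    ("of", "particle"),
    ("summary", "noun"),
    ("document", "noun"),
]
main_word_list = build_main_word_list(title_tokens)
```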

Returning to FIG. 6, in S21, the control unit 10 extracts nouns (words) from the target document 30 and calculates a noun weighting coefficient fn(word). The nouns extracted here are all the nouns in the target document 30.
Specifically, the control unit 10 extracts all nouns (word) from the target document 30 shown in FIG. 5 and stores the extracted nouns in a noun list 40. Next, for each noun, the control unit 10 counts how many times it occurs in the target document 30 and stores the count in the noun list 40 as the appearance frequency f(word). The control unit 10 then calculates the noun weighting coefficient fn(word).
The noun weighting coefficient fn(word) can be calculated by the following equation, where F is the set of f(word) values (f(word) ∈ F).

$$ fn(\mathrm{word}) = 1 + \frac{f(\mathrm{word})}{\max F} $$

That is, the weighting coefficient fn(word) of a noun is the ratio of that noun's appearance frequency to the maximum appearance frequency of any noun in the document, plus 1. With this equation, the weighting coefficient fn(word) is normalized to the range of 1 to 2: the higher the appearance frequency of the word, the closer the value is to 2, and the lower the frequency, the closer it is to 1.
The control unit 10 then stores each calculated weighting coefficient in the noun list 40 in association with its noun.
FIG. 7(B) shows the noun list 40 created by this processing.
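The weighting described above (frequency divided by the maximum frequency, plus 1) can be sketched in Python as follows. The function name `noun_weights` and the sample nouns are hypothetical, and the sketch assumes the nouns have already been extracted by a morphological analyzer.

```python
from collections import Counter


def noun_weights(nouns):
    """Map each noun to fn(word) = 1 + f(word) / max(F)."""
    freq = Counter(nouns)  # f(word): appearance frequency of each noun
    if not freq:
        return {}
    max_freq = max(freq.values())  # max(F)
    return {word: 1 + count / max_freq for word, count in freq.items()}


# Hypothetical noun sequence extracted from a document.
nouns = ["document", "summary", "document", "sentence", "document", "summary"]
weights = noun_weights(nouns)
```

The most frequent noun always receives exactly 2, and every other noun falls strictly between 1 and 2.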

Returning to FIG. 6, in S22, the control unit 10 divides the target document 30 into evaluation target sentences, which are the units of processing. FIG. 7(C) shows the target document 30 divided into evaluation target sentences. Because the evaluation target sentences here are paragraphs, this processing divides the target document 30 into a plurality of sentences 51, 52, ..., 56, and so on. In this example the document title 31 (see FIG. 5) is not included, but it may be included.
The control unit 10 then ends this processing and returns to FIG. 3.
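A paragraph-level split of this kind might look like the following Python sketch. It assumes that paragraphs are separated by one or more blank lines, which the source does not specify; the function name `split_into_paragraphs` is hypothetical.

```python
import re


def split_into_paragraphs(text):
    """Split text into paragraphs, assuming blank lines separate them."""
    # A newline, optional whitespace, and another newline marks a boundary.
    parts = re.split(r"\n\s*\n", text.strip())
    return [part.strip() for part in parts if part.strip()]


paragraphs = split_into_paragraphs("A. B.\n\nC.\n\n\nD.")
```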

In S12 of FIG. 3, the control unit 10 acquires one evaluation target sentence.
In S13, the control unit 10 performs score calculation processing on the acquired evaluation target sentence.
The score calculation processing will now be described with reference to FIG. 8.
In S30 of FIG. 8, the control unit 10 (sentence processing unit 13, importance calculation unit 14) calculates the importance score P(j) of the acquired j-th evaluation target sentence.
The importance score P(j) of the j-th evaluation target sentence can be calculated by the following equation, where M(j) is the total number of noun occurrences in the j-th evaluation target sentence, α is the main word weighting coefficient, and the sum runs over every noun occurrence in that sentence.

$$ P(j) = \frac{1}{M(j)} \sum_{\mathrm{word} \,\in\, \text{sentence } j} \alpha(\mathrm{word}) \cdot fn(\mathrm{word}) $$

The main word weighting coefficient α may be set arbitrarily. In the following example, α is 2 if the word (noun) is a main word and 1 otherwise.

That is, for every noun in the j-th evaluation target sentence, the noun's weighting coefficient is multiplied by the main word weighting coefficient, and the sum of these products divided by the total number of noun occurrences is the importance score P(j) of the j-th evaluation target sentence.
According to this equation, the importance score P(j) is high when the frequently appearing nouns in the j-th evaluation target sentence are main words.
In calculating the importance score P(j), the sum is divided by the total number of noun occurrences M(j) so that the length of the evaluation target sentence does not affect the score.
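The importance calculation described above can be sketched in Python as follows. The function name `importance_score`, the sample weights, and the fallbacks (a score of 0.0 for a sentence with no nouns, a weight of 1.0 for a noun missing from the weight table) are assumptions made for illustration; the source does not describe these edge cases.

```python
def importance_score(sentence_nouns, weights, main_words, alpha_main=2.0):
    """Average of alpha(word) * fn(word) over the nouns of one sentence.

    alpha is `alpha_main` for main words and 1.0 otherwise, as in the
    example in the text.
    """
    if not sentence_nouns:
        return 0.0  # assumption: a sentence without nouns scores 0
    total = 0.0
    for word in sentence_nouns:
        alpha = alpha_main if word in main_words else 1.0
        total += alpha * weights.get(word, 1.0)  # assumption: default weight 1.0
    return total / len(sentence_nouns)  # divide by M(j)


# Hypothetical noun weights fn(word) and main word list.
weights = {"document": 2.0, "summary": 1.5, "printer": 1.25}
main_words = {"document"}
score = importance_score(["document", "summary"], weights, main_words)
```

Here the score is (2 × 2.0 + 1 × 1.5) / 2, because only "document" is a main word.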

In S31, the control unit 10 (sentence processing unit 13, simplicity calculation unit 15) calculates the simplicity score E(j) of the acquired j-th evaluation target sentence.
The simplicity score E(j) of the j-th evaluation target sentence is calculated from the kanji rate k(j), the consecutive noun rate nc(j), the headword rate er(j) for the technical term dictionary DB 4, the technical term rate t(j), and the average reference number ea(j) in the technical term dictionary DB 4, all for the j-th evaluation target sentence.

The kanji rate k(j) of the j-th evaluation target sentence is the ratio of the number of kanji to the number of characters in that sentence. This is because the more kanji a sentence contains, the more difficult it is.
The consecutive noun rate nc(j) of the j-th evaluation target sentence is the ratio of the number of consecutive nouns to the number of nouns in that sentence. Here, a consecutive noun is a compound such as 人工知能 ("artificial intelligence", formed from 人工 "artificial" + 知能 "intelligence"). This is because the more consecutive nouns a sentence contains, the more difficult it is.
The kanji rate k(j) and the consecutive noun rate nc(j) are based on Hideko Shibasaki, "Construction of a Japanese Readability Formula and Development of a Measurement Tool," Proceedings of the FY2008 Public Workshop (Research Results Report Meeting) of the Priority Area Research "Japanese Corpus," pp. 155-160, 2009.
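The two surface-level rates can be sketched in Python under explicit assumptions. The source does not say whether punctuation counts as a character or exactly how consecutive nouns are counted, so this sketch counts every non-whitespace character, treats only the CJK Unified Ideographs block (U+4E00 to U+9FFF) as kanji, and counts a noun as "consecutive" when a neighboring token is also a noun. The function names and the `"noun"` tag are hypothetical.

```python
def kanji_rate(sentence):
    """Share of non-whitespace characters that are kanji (U+4E00..U+9FFF)."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return 0.0
    kanji = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return kanji / len(chars)


def consecutive_noun_rate(pos_tags):
    """Share of nouns that sit directly next to another noun.

    `pos_tags` is the part-of-speech sequence of one sentence, assumed to
    come from a morphological analyzer.
    """
    def is_noun(i):
        return 0 <= i < len(pos_tags) and pos_tags[i] == "noun"

    noun_positions = [i for i in range(len(pos_tags)) if is_noun(i)]
    if not noun_positions:
        return 0.0
    in_run = sum(1 for i in noun_positions if is_noun(i - 1) or is_noun(i + 1))
    return in_run / len(noun_positions)


k = kanji_rate("人工知能です")  # 4 kanji out of 6 characters
nc = consecutive_noun_rate(["noun", "noun", "particle", "noun", "verb"])
```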

The headword rate er(j) of the j-th evaluation target sentence in the technical term dictionary DB4 is the ratio of the number of headwords of the technical term dictionary DB4 to the number of morphemes. This rate is used because the more headwords of the technical term dictionary DB4 a sentence contains, the more difficult the sentence is.
The technical word rate t(j) of the j-th evaluation target sentence is the ratio of the number of words that are headwords of the technical term dictionary DB4 and are not headwords of the general term dictionary DB6 to the number of headwords of the technical term dictionary DB4. This rate is used because the more true technical words (words that are headwords of the technical term dictionary DB4 and not headwords of the general term dictionary DB6) a sentence contains, the more difficult the sentence is. Words that are headwords of the general term dictionary DB6 are regarded as easy and are not incorporated into the calculation of the simplicity score, because incorporating them would make the simplicity score E(j) too high.
The average ea(j) of the number of references in the technical term dictionary DB4 for the j-th evaluation target sentence is the ratio of the total number of references to the true technical words to the number of true technical words. This value is used because the more often a true technical word is referenced, the more it is recognized by people, and therefore the easier it is.
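The three dictionary-based values defined above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the technical term dictionary DB4 and the headword reference number DB5 are modeled together as one hypothetical dict from headword to reference count, the general term dictionary DB6 as a set, and the fallback `ea_if_none` for sentences with no true technical words is an assumption, since the text does not define that case.

```python
def dictionary_rates(morphemes, technical_refs, general_headwords, ea_if_none=1.0):
    """Return (er, t, ea) for one sentence.

    morphemes:         list of morpheme surface forms in the sentence
    technical_refs:    dict {technical headword: reference count}
                       (hypothetical stand-in for DB4 and DB5)
    general_headwords: set of general-dictionary headwords (stand-in for DB6)
    ea_if_none:        value of ea when no true technical word occurs
                       (assumption; not specified in the source)
    """
    if not morphemes:
        return 0.0, 0.0, ea_if_none

    # Morphemes that are headwords of the technical term dictionary.
    technical = [m for m in morphemes if m in technical_refs]
    # "True" technical words: technical headwords absent from the general dictionary.
    true_terms = [m for m in technical if m not in general_headwords]

    er = len(technical) / len(morphemes)
    t = len(true_terms) / len(technical) if technical else 0.0
    if true_terms:
        ea = sum(technical_refs[m] for m in true_terms) / len(true_terms)
    else:
        ea = ea_if_none
    return er, t, ea
```

With five morphemes of which two are technical headwords, one of them also being a general headword, the sketch gives er = 0.4 and t = 0.5, and ea equals the reference count of the single remaining true technical word.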

Here, the average ea of the number of references in the technical term dictionary DB4 is used on the basis of the description in Okada, Shimada, Fukuhara, and Sato (岡田仁之，島田諭，福原知宏，佐藤哲司), "A Study on the Written Expression of Web Content: Evaluating the Familiarity of Text Using Word Familiarity," 1st Forum on Data Engineering and Information Management, 2009. According to that description, easily recognized words are said to have high word familiarity, and the higher the familiarity, the lower the difficulty is considered to be. Here, whether a word is recognized is judged from how many references it has in the headword reference number DB5. In other words, the more references a term has in the headword reference number DB5, the more often users look it up, and the more it can be judged to be a word recognized by users.

The simplicity score E(j) of the j-th evaluation target sentence can be calculated by the following equation.

E(j) = (1 - k(j)) × (1 - nc(j)) × (1 - er(j)) × (1 - t(j)) × ea(j)

That is, the simplicity score E(j) of the j-th evaluation target sentence is the product of the values obtained by subtracting from 1 each of the kanji rate k(j), the continuous noun rate nc(j), the headword rate er(j) in the technical term dictionary DB4, and the technical word rate t(j) of that sentence, multiplied by the average ea(j) of the number of references in the technical term dictionary DB4.
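The product described above can be written out as a short function. This is a sketch that follows the verbal description only; the function and parameter names are illustrative, and whether ea(j) is normalized before the multiplication is not stated in this passage, so the sketch uses the value as given.

```python
def simplicity_score(k, nc, er, t, ea):
    """Return E(j) as the product described in the text.

    k, nc, er, t: rates in [0, 1] (kanji, continuous noun, headword, technical word)
    ea:           average reference count of the true technical words
                  (used as-is; any normalization is outside this sketch)
    """
    for name, rate in (("k", k), ("nc", nc), ("er", er), ("t", t)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} must be a rate between 0 and 1, got {rate}")
    # Each difficulty-related rate lowers the score; ea raises it.
    return (1 - k) * (1 - nc) * (1 - er) * (1 - t) * ea
```

As a worked case, if all four rates are 0.5 and ea is 1.0, the score is 0.5 × 0.5 × 0.5 × 0.5 × 1.0 = 0.0625; raising any one rate to 1.0 drives the score to 0.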

In this way, the document analysis device 1 uses the reference counts in the headword reference number DB5 to calculate the simplicity score of a sentence. Because no usage history or the like is needed for each user U, any user U can obtain the degree of simplicity of a sentence using the document analysis device 1. In addition, the document analysis device 1 can obtain the degree of simplicity of a sentence using the reference counts in the headword reference number DB5, which is a resource already available on the Internet.

In S32, the control unit 10 (sentence processing unit 13, evaluation score calculation unit 16) calculates the evaluation score Score(j) of the acquired j-th evaluation target sentence.
The evaluation score Score(j) can be calculated by the following equation using the importance score P(j) and the simplicity score E(j).

Score(j) = P(j) × E(j)

That is, the evaluation score Score(j) of the j-th evaluation target sentence is the product of the importance score P(j) and the simplicity score E(j). Multiplying the normalized importance score P(j) by the simplicity score E(j) makes the evaluation score Score(j) a normalized value as well. The more important and simple a sentence is, the higher its evaluation score Score(j).
The control unit 10 then ends this process and returns to FIG. 3.
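The multiplication in S32 can be illustrated with a short sketch. The function name is illustrative, and the example values are invented solely to show how the product behaves; they are not taken from the embodiment.

```python
def evaluation_score(importance, simplicity):
    """Return Score(j) = P(j) * E(j) for one evaluation target sentence."""
    return importance * simplicity


# Illustrative values only: an important but difficult sentence versus a
# moderately important, easy sentence.
difficult = evaluation_score(importance=0.9, simplicity=0.2)
easy = evaluation_score(importance=0.6, simplicity=0.8)
```

Because the two scores are multiplied rather than added, a sentence must be both important and simple to score highly: in the values above, the easier sentence ends up with the higher evaluation score even though its importance score is lower.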

Returning to FIG. 3, in S14, the control unit 10 determines whether all the evaluation target sentences in the target document 30 have been processed. If all the evaluation target sentences in the target document 30 have been processed (S14: YES), the control unit 10 moves the process to S15. If not all the evaluation target sentences in the target document 30 have been processed (S14: NO), the control unit 10 moves the process to S15a.
In S15a, the control unit 10 acquires one evaluation target sentence that has not yet been processed. The control unit 10 then moves the process to S13 and performs the score calculation process on the acquired evaluation target sentence.
In S15, on the other hand, the control unit 10 sorts the evaluation target sentences in descending order of evaluation score.
FIG. 9(A) shows the importance score P(j) of each evaluation target sentence calculated by the score calculation process of S13 in FIG. 3. Similarly, FIGS. 9(B) and 9(C) show the simplicity score E(j) and the evaluation score Score(j) of each evaluation target sentence. FIG. 9(D) shows an example in which the evaluation target sentences are sorted in descending order of the evaluation score Score(j).

Returning to FIG. 3, in S16, the control unit 10 outputs the evaluation target sentences with high evaluation scores to the terminal 8. The control unit 10 then ends this process.
On the terminal 8, the evaluation target sentences with high evaluation scores are displayed in the sentence output section 81a of the processing result screen 81 shown in FIG. 4(B). When the button 81b is selected, the sentence output section 81a may display the high-scoring evaluation target sentences together with the evaluation target sentence with the next highest evaluation score.
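The display behavior described above, in which selecting the button reveals the next highest-scoring sentence in addition to those already shown, can be sketched as follows. The class and method names are hypothetical, and the number of sentences shown initially is an assumption, since the text does not specify it.

```python
class SentenceOutput:
    """Hypothetical model of the sentence output section and its button."""

    def __init__(self, ranked_sentences, initial=1):
        # ranked_sentences must already be sorted by descending evaluation score.
        self.ranked = list(ranked_sentences)
        self.shown = min(initial, len(self.ranked))

    def visible(self):
        """Sentences currently displayed, highest score first."""
        return self.ranked[:self.shown]

    def press_button(self):
        """Reveal the sentence with the next highest score, if any remains."""
        if self.shown < len(self.ranked):
            self.shown += 1
        return self.visible()
```

Pressing the button after every sentence is already visible leaves the display unchanged in this sketch.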

As described above, according to the present embodiment, the document analysis device 1 provides the following effects.
(1) Because the evaluation score of a sentence is calculated from the importance score and the simplicity score, an important and simple sentence can be identified by its evaluation score. The higher the evaluation score, the more important and simple the sentence can be judged to be. The degree to which a sentence is important and simple can therefore be measured using an objective index.
(2) Because the simplicity score is calculated using the reference counts in the headword reference number DB5, an appropriate score can be calculated regardless of the usage status of the user U. The more references a word has in the headword reference number DB5, the more widely known and easily recognized it is taken to be, and the higher the simplicity score can be calculated.

(3) Using the technical term dictionary DB4, the more technical terms a sentence contains, the more difficult the sentence is judged to be, and by weighting the simplicity score accordingly, the simplicity score of the sentence can be calculated lower.
(4) When a sentence includes a word (noun) that is a headword of the technical term dictionary DB4 and is also a headword of the general term dictionary DB6, the weight applied to the simplicity score of the sentence is lowered. Because such a word is a general word, calculating the technical word rate and lowering the weight attributable to general words makes the value of the simplicity score more precise.

(5) The more nouns with a high frequency of appearance in the document a sentence contains, the more important the sentence is judged to be, and the higher the importance score of the sentence can be calculated.
(6) Sentences are extracted based on their evaluation scores. Important and simple sentences can therefore be extracted from a document regardless of who uses the device. Because a sentence extracted for its high evaluation score is an important and simple sentence, it can be used, for example, as a summary sentence.

Although an embodiment of the present invention has been described above, the present invention is not limited to the embodiment described above. The effects described in the embodiment are merely a list of the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiment. The embodiment described above and the variations described below may also be used in appropriate combination, but a detailed description is omitted.

(Variations)
(1) In the present embodiment, the unit of a sentence to be evaluated has been described as a sentence included in a paragraph, but the unit is not limited to this. The unit of a sentence to be evaluated may be a single sentence delimited by a period or a plurality of sentences delimited by periods. The unit of a sentence to be evaluated has also been described as being determined in advance by the document analysis device, but this is not a limitation. For example, the unit of a sentence may be received from the terminal.
(2) In the present embodiment, a main word has been described as a noun included in at least one of the document title, the opening sentence, and a heading, but this is not a limitation. For example, it may be a noun included in the first paragraph or the like.
(3) The present embodiment has been described as outputting sentences with high evaluation scores to the terminal, but this is not a limitation. Sentences with high evaluation scores may be output together with their evaluation scores.

(4) In the present embodiment, formulas for calculating the importance score and the simplicity score have been described. These are examples; other values from which the degree of importance or simplicity can be calculated may be incorporated, and some of the values may be excluded.
(5) The present embodiment has been described as calculating weight coefficients for all nouns, but this is not a limitation. For example, after the appearance frequencies of the nouns are calculated, weight coefficients may be calculated only for nouns whose frequency is at or above a certain value relative to the maximum appearance frequency, and the weight coefficient may be set to 1 for the other nouns.
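Variation (5) can be sketched as follows. This is an illustration under stated assumptions: the weight coefficient formula itself is not given in this passage, so it is passed in as a hypothetical callback `weight_fn`, and "a certain value relative to the maximum appearance frequency" is interpreted here as a ratio `min_ratio` of the maximum frequency.

```python
from collections import Counter


def weight_coefficients(nouns, weight_fn, min_ratio=0.5):
    """Return {noun: weight coefficient} for the nouns of a document.

    nouns:     list of noun occurrences in the document
    weight_fn: hypothetical callback (frequency, max_frequency) -> weight
    min_ratio: threshold as a fraction of the maximum frequency (assumption)
    """
    freq = Counter(nouns)
    if not freq:
        return {}
    max_freq = max(freq.values())
    weights = {}
    for noun, count in freq.items():
        if count >= min_ratio * max_freq:
            weights[noun] = weight_fn(count, max_freq)  # frequent noun: compute weight
        else:
            weights[noun] = 1.0                          # infrequent noun: weight fixed to 1
    return weights
```

Fixing the weight of infrequent nouns to 1 avoids evaluating the weight formula for every noun, which is the saving this variation appears to aim at.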

DESCRIPTION OF SYMBOLS
1 Document analysis device
4 Technical term dictionary DB
5 Headword reference number DB
6 General term dictionary DB
8 Terminal
10 Control unit
12 Preprocessing unit
13 Sentence processing unit
14 Importance calculation unit
15 Simplicity calculation unit
16 Evaluation score calculation unit
17 Sentence output unit
20 Storage unit
21a Document analysis program
30 Target document
35 Main word list
40 Noun list
100 Document analysis system

Claims

An information processing apparatus for evaluating a sentence included in a document, comprising:
sentence dividing means for dividing the document into sentence units;
importance calculation means for calculating an importance score indicating the degree of importance of the sentence;
simplicity calculating means for calculating a simplicity score indicating the degree of simplicity of the sentence; and
evaluation score calculating means for calculating an evaluation score, which is the degree to which the sentence is important and simple, using the calculated importance score and the calculated simplicity score.

The information processing apparatus according to claim 1, further comprising:
document processing means for executing the evaluation score calculating means for all sentences included in the document; and
sentence extraction means for extracting a sentence with a high evaluation score based on the evaluation score of each sentence calculated by the document processing means.

The information processing apparatus according to claim 1 or 2, wherein
the apparatus is connected to a headword reference number database that stores the number of references to each headword in a technical term dictionary database, which is a dictionary having technical terms as headwords, and
the simplicity calculating means calculates the simplicity score by referring to the headword reference number database.

The information processing apparatus according to any one of claims 1 to 3, wherein
the apparatus is connected to a technical term dictionary database, which is a dictionary having technical terms as headwords, and
the simplicity calculating means weights the simplicity score when the sentence includes a headword of the technical term dictionary database.

The information processing apparatus according to claim 4, wherein
the simplicity calculating means lowers the weight when the sentence includes a headword that is a headword in the technical term dictionary database and is also included in a general term dictionary database having general terms as headwords.

The information processing apparatus according to any one of claims 1 to 5, wherein
the importance calculation means calculates the importance score higher as the sentence includes nouns that appear more frequently in the document.

An information processing system comprising:
the information processing apparatus according to any one of claims 3 to 5; and
a headword reference number database that stores the number of references to each headword in a technical term dictionary database, which is a dictionary having technical terms as headwords.

A program for causing a computer to function as the information processing apparatus according to any one of claims 1 to 6.