JP2009146121A

JP2009146121A - Document editing device, document editing method and computer program

Info

Publication number: JP2009146121A
Application number: JP2007322093A
Authority: JP
Inventors: Yasuhide Miura; 康秀三浦; Hiroshi Masuichi; 博増市; Motoyuki Takaai; 基行鷹合
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2007-12-13
Filing date: 2007-12-13
Publication date: 2009-07-02
Anticipated expiration: 2027-12-13
Also published as: JP5115170B2

Abstract

PROBLEM TO BE SOLVED: To provide a device and a method that modify a document including words on different levels of abstraction into a document using words on a uniform level of abstraction.

SOLUTION: Words are extracted from an input document. Classes on an ontology corresponding to the extracted words are selected as corresponding classes. Classes having a high statistic representing information on the frequency of occurrence in a document corpus are selected as reference classes from neighboring classes of the corresponding classes. A document abstraction level that is the variance or mean of the levels of abstraction of all the extracted corresponding classes is compared with a threshold to determine a level of deviation of word abstraction levels of the input document. If the level of deviation is high, the words are replaced with words corresponding to the reference classes or reference class information is presented. This technique can modify a document using words on different levels of abstraction to generate a document using words on a uniform level of abstraction.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a document editing apparatus, a document editing method, and a computer program. More specifically, it relates to a document editing apparatus, a document editing method, and a computer program that adjust the abstraction level of the words in a document so that the abstraction level is substantially constant.

In document creation, the words used in a document are chosen by its author, and words of differing abstraction levels may be used haphazardly depending on, for example, biases in the author's knowledge. For example, documents written by specialists in the medical field, specifically documents describing diagnostic results, contain many technical terms, but general terms are mixed in as well, and the abstraction levels of the individual words often differ greatly. When the words in a document differ in abstraction level, the consistency of the description suffers and the document may be hard for readers to understand.

The following documents are examples of conventional techniques that edit an arbitrarily created document to unify its format and terminology.

Patent Document 1 (Japanese Patent Laid-Open No. 07-219930) discloses an apparatus and method that take a set of documents to be learned, learn the format of that document set, and, when a document to be processed is input, unify its format based on the learning results. Specifically, for example, a document format management table storing correspondences between headwords and their variant notations is prepared, a unification flag is set for headwords that appear in the learning documents, and, when a sentence to be unified is input, warnings and unification candidates are presented based on this flag to support unification of the format.

Patent Document 2 (Japanese Patent Laid-Open No. 2004-240859) discloses a system that, when phrases requiring specialized knowledge appear in a text, rephrases them in an easy-to-understand way. For example, for phrases appearing in text entered by the user, paraphrase candidates are presented using a dictionary that stores paraphrases. The candidates are presented based on the level set by the user, the level of the phrase, and the user's past paraphrase history.

Patent Document 3 (Japanese Patent Laid-Open No. 2000-268304) discloses an apparatus and method in which text types and the pre-editing process for each type are defined in advance, the type of an input text is identified automatically, and the text is edited using the pre-editing process for that type. First, morphological analysis is performed on the text to be edited automatically, and lexical attributes such as parts of speech are extracted. Next, the text type is determined from the lexical attributes according to prepared identification rules, and the text is edited automatically using the pre-editing process for that type.

As described above, various processing configurations have been proposed for apparatuses that perform document editing. In much of the prior art, words are converted (unified, paraphrased, or edited) according to a reference document set, preset phrase levels, or the document type. However, these processes basically convert a word into another notation with the same meaning. They do not perform standardization that takes the abstraction level of meanings and concepts into account, such as converting an entire document into more abstract or more detailed expressions.

In documents where consistency of description matters, it is desirable that the abstraction level of the expressions be uniform across the whole document. For example, suppose a medical text reads:
"In left lung S3, size … with spiculated projections …, lung cancer is raised as a differential diagnosis."
This text contains
the site "left lung S3",
the condition "spiculated projections", and
the disease name "lung cancer".
Comparing these notations, the disease name "lung cancer" is more abstract than the others. A disease name whose abstraction level matches that of the site and condition notations, that is, a more detailed expression such as "primary lung cancer", would be preferable.
Patent Document 1: Japanese Patent Laid-Open No. 07-219930
Patent Document 2: Japanese Patent Laid-Open No. 2004-240859
Patent Document 3: Japanese Patent Laid-Open No. 2000-268304

An object of the present invention is to provide a document editing apparatus, a document editing method, and a computer program that perform standardization in the sense of making the description consistent, that is, making the abstraction level uniform across the whole document.

Specifically, for example, a document corpus and an ontology for the specific field that serves as the basis for standardization are prepared, and a statistic (the number of occurrences or a value derived from it) is calculated for each class in the ontology based on the words that appear in the document corpus. When a user inputs a document to the system, the ontology class corresponding to each word in the document is extracted, and the abstraction level of each class is calculated using the precomputed class statistics. A method specific to the present invention is used to calculate the abstraction level. When the abstraction levels of the classes corresponding to the words in the document vary, the system either notifies the user of the words that cause the variation or automatically corrects those words. The invention provides a document editing apparatus, a document editing method, and a computer program that perform this processing.

A first aspect of the present invention is a document editing apparatus that executes editing processing on an input document, comprising:
a word extraction unit that extracts words contained in the input document;
a corresponding class extraction unit that extracts, as corresponding classes, the classes on an ontology that correspond to the extracted words;
a class statistic database that stores class IDs, which are identifiers of the classes defined on the ontology, in association with statistics indicating the frequency of occurrence in a document corpus of the words corresponding to each class;
a reference class extraction unit that selects, as a reference class, a class having a high statistic from among the classes neighboring the corresponding class extracted by the corresponding class extraction unit in the class hierarchy defined on the ontology;
a class abstraction level calculation unit that calculates the abstraction level of each of the reference class and the corresponding class;
a document abstraction level calculation unit that calculates, as a document abstraction level, the variance or mean of the abstraction levels of all corresponding classes extracted for the words in the input document; and
a document standardization processing unit that determines the degree of variation in the abstraction levels of the words contained in the input document by comparing the document abstraction level with a threshold and, when the degree of variation is determined to be large, either replaces words contained in the input document with words corresponding to the reference class or presents reference class information.
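The decision made by the document standardization processing unit can be sketched as follows. This is only a minimal illustration, assuming the per-word abstraction levels and the reference-class words have already been computed. The function names, the use of the population variance, and the threshold value in the usage note are hypothetical and are not taken from the specification.

```python
from statistics import pvariance


def document_abstraction_level(levels):
    """Population variance of the per-word abstraction levels (assumed choice)."""
    return pvariance(levels) if levels else 0.0


def standardize(words, levels, reference_words, threshold):
    """Return (new_words, suggestions).

    If the variance exceeds the threshold, each word is replaced by the word
    for its reference class; suggestions lists the (original, replacement)
    pairs that actually changed. Otherwise the words are returned unchanged.
    """
    if document_abstraction_level(levels) <= threshold:
        return list(words), []
    suggestions = [(w, r) for w, r in zip(words, reference_words) if w != r]
    return list(reference_words), suggestions
```

For instance, with the words "left lung S3", "spiculated projections", and "lung cancer", hypothetical abstraction levels [0, 0, 2], and a hypothetical threshold of 0.5, the sketch would propose replacing "lung cancer" with its reference-class word.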

Furthermore, in one embodiment of the document editing apparatus of the present invention, the corresponding class extraction unit extracts, as a corresponding class, a class on the ontology that has a property corresponding to a word extracted from the input document.

Furthermore, in one embodiment of the document editing apparatus of the present invention, the class statistic database records, as the statistic, TF-IDF (Term Frequency-Inverse Document Frequency), which indicates frequency-of-occurrence information in the document corpus.
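A class-level TF-IDF statistic of this kind might be computed as in the sketch below. The specification does not give the formula at this point, so the sketch assumes a common textbook form: the total number of occurrences of the class in the corpus multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the class. The function name and the corpus representation (each document as a list of class IDs) are also assumptions.

```python
import math


def class_tf_idf(corpus, class_id):
    """TF-IDF of one class over a corpus (assumed formula, see lead-in).

    corpus: list of documents, each a list of ontology class IDs.
    """
    n_docs = len(corpus)
    tf = sum(doc.count(class_id) for doc in corpus)        # total occurrences
    df = sum(1 for doc in corpus if class_id in doc)       # documents containing it
    if df == 0:
        return 0.0                                          # class never appears
    return tf * math.log(n_docs / df)
```

The resulting value would be stored in the class statistic database keyed by class ID.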

Furthermore, in one embodiment of the document editing apparatus of the present invention, the class statistic database records, as the statistic, the reciprocal of the variance of the TF-IDF values obtained by calculating TF-IDF (Term Frequency-Inverse Document Frequency), the frequency-of-occurrence information in the document corpus, separately for each document author.
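The author-based statistic can be illustrated with a short sketch. It assumes the per-author TF-IDF values for one class are already available as a mapping from author to value. The function name, the use of the population variance, and the handling of zero variance (returned as infinity) are assumptions made only for this illustration.

```python
from statistics import pvariance


def inverse_author_variance(per_author_tfidf):
    """Reciprocal of the variance of one class's TF-IDF across authors.

    per_author_tfidf: dict mapping author identifier -> TF-IDF value.
    A class used equally often by every author gets a large statistic.
    """
    variance = pvariance(list(per_author_tfidf.values()))
    if variance == 0:
        return float("inf")   # assumed convention: perfectly consistent usage
    return 1.0 / variance
```

Under this statistic, a class whose usage differs little from author to author scores higher than one favored by only a few authors.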

Furthermore, in one embodiment of the document editing apparatus of the present invention, the reference class extraction unit takes the neighboring classes within a preset number of levels [n] of the corresponding class extracted by the corresponding class extraction unit, in the class hierarchy defined on the ontology, excludes classes in a disjoint (separation) relationship, and selects the class with the largest statistic as the reference class.
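This selection can be sketched as a breadth-first search over the class hierarchy. The sketch assumes the hierarchy is given as an undirected adjacency mapping of parent-child links, that the corresponding class itself is a candidate, and that disjoint classes are listed per class. All names and data structures are hypothetical.

```python
from collections import deque


def neighbors_within(graph, start, n):
    """Classes reachable from start in at most n parent/child steps."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == n:
            continue                      # do not expand beyond n levels
        for nxt in graph.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return set(depth)


def select_reference_class(graph, stats, disjoint, start, n):
    """Pick the neighbor with the largest statistic, skipping disjoint classes."""
    candidates = neighbors_within(graph, start, n) - disjoint.get(start, set())
    return max(candidates, key=lambda c: stats.get(c, 0.0))
```

Because the corresponding class is kept among the candidates, it is returned unchanged when no neighbor has a larger statistic.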

Furthermore, in one embodiment of the document editing apparatus of the present invention, the reference class extraction unit extracts, as the reference class, the class whose statistic is larger than that of the corresponding class by a predetermined value (p%) and is the largest.
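One possible reading of the p% condition is sketched below. It assumes that "larger by p%" means the candidate's statistic must exceed the corresponding class's statistic multiplied by (1 + p/100), and that the corresponding class itself is kept when no neighbor qualifies. Both points, along with the function name, are assumptions rather than statements from the specification.

```python
def select_reference_with_margin(stats, corresponding, neighbors, p):
    """Choose the neighbor with the largest statistic among those that beat
    the corresponding class by more than p percent (assumed interpretation)."""
    floor = stats[corresponding] * (1 + p / 100.0)
    eligible = [c for c in neighbors if stats.get(c, 0.0) > floor]
    if not eligible:
        return corresponding          # assumed fallback: keep the original class
    return max(eligible, key=lambda c: stats.get(c, 0.0))
```

The margin keeps a word from being replaced when a neighboring class is only marginally more frequent in the corpus.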

Furthermore, in one embodiment of the document editing apparatus of the present invention, the class abstraction level calculation unit determines the abstraction level of a corresponding class in the vicinity of the reference class based on its hierarchical distance from the reference class on the ontology, assigning higher abstraction levels to higher classes and lower abstraction levels to lower classes.
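A distance-based abstraction level could look like the following sketch. It assumes the reference class has abstraction level 0, a class k levels above it has level +k, and a class k levels below it has level -k. The signed-distance convention, the child-to-parent mapping, and the function name are assumptions; classes that are neither ancestors nor descendants of the reference class are not handled here.

```python
def abstraction_by_distance(parent_of, cls, reference):
    """Signed hierarchical distance of cls from the reference class.

    parent_of: dict mapping each class to its superclass (roots are absent).
    """
    def ancestors(c):
        chain = []
        while c in parent_of:
            c = parent_of[c]
            chain.append(c)
        return chain

    if cls == reference:
        return 0
    above = ancestors(reference)
    if cls in above:                       # cls is a superclass: more abstract
        return above.index(cls) + 1
    below = ancestors(cls)
    if reference in below:                 # cls is a subclass: less abstract
        return -(below.index(reference) + 1)
    raise ValueError("class is not on the same branch as the reference class")
```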

Furthermore, in one embodiment of the document editing apparatus of the present invention, the class abstraction level calculation unit determines the abstraction level of a corresponding class in the vicinity of the reference class based on the difference between its number of properties and that of the reference class, assigning higher abstraction levels to classes with fewer properties and lower abstraction levels to classes with more properties.
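The property-count variant can be sketched in a few lines. It assumes the abstraction level is simply the reference class's property count minus the class's own property count, so that a class with fewer properties receives a higher value. The exact formula and the names used are assumptions for illustration.

```python
def abstraction_by_properties(property_count, cls, reference):
    """Abstraction level from the difference in number of properties.

    property_count: dict mapping class ID -> number of properties.
    Fewer properties than the reference class gives a positive (more
    abstract) level; more properties gives a negative level.
    """
    return property_count[reference] - property_count[cls]
```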

Furthermore, a second aspect of the present invention is a document editing method for executing editing processing on an input document in a document editing apparatus, comprising:
a word extraction step in which a word extraction unit extracts words contained in the input document;
a corresponding class extraction step in which a corresponding class extraction unit extracts, as corresponding classes, the classes on an ontology that correspond to the extracted words;
a reference class extraction step in which a reference class extraction unit selects, as a reference class, a class having a high statistic indicating frequency of occurrence in a document corpus from among the classes neighboring the corresponding class extracted by the corresponding class extraction unit in the class hierarchy defined on the ontology;
a class abstraction level calculation step in which a class abstraction level calculation unit calculates the abstraction level of each of the reference class and the corresponding class;
a document abstraction level calculation step in which a document abstraction level calculation unit calculates, as a document abstraction level, the variance or mean of the abstraction levels of all corresponding classes extracted for the words in the input document; and
a document standardization processing step in which a document standardization processing unit determines the degree of variation in the abstraction levels of the words contained in the input document by comparing the document abstraction level with a threshold and, when the degree of variation is determined to be large, either replaces words contained in the input document with words corresponding to the reference class or presents reference class information.

Furthermore, in one embodiment of the document editing method of the present invention, the corresponding class extraction step extracts, as a corresponding class, a class on the ontology that has a property corresponding to a word extracted from the input document.

Furthermore, in one embodiment of the document editing method of the present invention, the reference class extraction step takes the neighboring classes within a preset number of levels [n] of the corresponding class extracted by the corresponding class extraction unit, in the class hierarchy defined on the ontology, excludes classes in a disjoint (separation) relationship, and selects the class with the largest statistic as the reference class.

Furthermore, in one embodiment of the document editing method of the present invention, the reference class extraction step extracts, as the reference class, the class whose statistic is larger than that of the corresponding class by a predetermined value (p%) and is the largest.

Furthermore, in one embodiment of the document editing method of the present invention, the class abstraction level calculation step determines the abstraction level of a corresponding class in the vicinity of the reference class based on its hierarchical distance from the reference class on the ontology, assigning higher abstraction levels to higher classes and lower abstraction levels to lower classes.

Furthermore, in one embodiment of the document editing method of the present invention, the class abstraction level calculation step determines the abstraction level of a corresponding class in the vicinity of the reference class based on the difference between its number of properties and that of the reference class, assigning higher abstraction levels to classes with fewer properties and lower abstraction levels to classes with more properties.

Furthermore, a third aspect of the present invention is a computer program that causes a document editing apparatus to execute editing processing on an input document, the program comprising:
a word extraction step of causing a word extraction unit to extract words contained in the input document;
a corresponding class extraction step of causing a corresponding class extraction unit to extract, as corresponding classes, the classes on an ontology that correspond to the extracted words;
a reference class extraction step of causing a reference class extraction unit to select, as a reference class, a class having a high statistic indicating frequency of occurrence in a document corpus from among the classes neighboring the corresponding class extracted by the corresponding class extraction unit in the class hierarchy defined on the ontology;
a class abstraction level calculation step of causing a class abstraction level calculation unit to calculate the abstraction level of each of the reference class and the corresponding class;
a document abstraction level calculation step of causing a document abstraction level calculation unit to calculate, as a document abstraction level, the variance or mean of the abstraction levels of all corresponding classes extracted for the words in the input document; and
a document standardization processing step of causing a document standardization processing unit to determine the degree of variation in the abstraction levels of the words contained in the input document by comparing the document abstraction level with a threshold and, when the degree of variation is determined to be large, either to replace words contained in the input document with words corresponding to the reference class or to present reference class information.

The computer program of the present invention can be provided, for example, to a general-purpose computer system capable of executing various program codes, by means of a storage medium or communication medium that supplies it in computer-readable form. Providing the program in computer-readable form allows processing according to the program to be carried out on the computer system.

Other objects, features, and advantages of the present invention will become apparent from the more detailed description based on the embodiments described later and the accompanying drawings. In this specification, a system is a logical collection of a plurality of devices, and the constituent devices are not necessarily housed in the same casing.

According to the configuration of one embodiment of the present invention, words contained in an input document are extracted, and the classes on the ontology that correspond to the extracted words are selected as corresponding classes. From the classes neighboring each corresponding class, a class with a high statistic indicating frequency of occurrence in a document corpus is selected as a reference class. The variance or mean of the abstraction levels of all corresponding classes extracted for the words in the input document is calculated as a document abstraction level, and the degree of variation in the abstraction levels of the words is determined by comparing the document abstraction level with a threshold. When the degree of variation is determined to be large, words in the input document are replaced with words corresponding to the reference class, or reference class information is presented. A document that uses words of varying abstraction levels can thus be modified into a document that uses words of a uniform abstraction level.

Hereinafter, a document editing apparatus, a document editing method, and a computer program according to an embodiment of the present invention will be described in detail with reference to the drawings.

FIG. 1 shows the configuration of a document editing apparatus according to an embodiment of the present invention. The document editing apparatus 100 of the present invention receives an arbitrary document to be processed through a document input unit 101 and creates a document in which the abstraction levels of the words used in the input document are adjusted to a substantially uniform level, that is, a standardized document.

The document editing apparatus 100 of the present invention prepares an ontology and a document corpus that collects document data from the specific field serving as the basis for standardization, and calculates a statistic (the number of occurrences or a value derived from it) for each class in the ontology based on the words that appear in the corpus. It extracts the ontology class corresponding to each word in the input document and calculates the abstraction level of each class using the precomputed class statistics. When the abstraction levels of the classes corresponding to the words vary, it notifies the user of the words in the document that cause the variation, or automatically corrects those words.

As shown in FIG. 1, the document editing apparatus 100 according to an embodiment of the present invention includes a document input unit 101, a document extraction unit 102, a word extraction unit 103, a corresponding class extraction unit 104, a class statistic calculation unit 105, a reference class extraction unit 106, a class abstraction level calculation unit 107, a document abstraction level calculation unit 108, and a document standardization processing unit 109. It further includes a document corpus storage unit 121 that collects document data from the specific field serving as the basis for standardization, an ontology database 122 that stores an ontology, and a class statistic database 123 that holds the class IDs of the ontology together with values representing their statistics in the document corpus.

Before describing the processing performed with the document editing apparatus 100, the document corpus storage unit 121, the ontology database 122, and the class statistic database 123 are each described. The embodiment below describes an example of processing medical documents. Accordingly, the corpus and ontology used in this example relate to medical documents and medical terminology.

An example of the text corpus stored in the document corpus storage unit 121 is described with reference to FIG. 2. The document corpus storage unit 121 is a database that stores various texts. As shown in FIG. 2, each text is assigned an ID as a text identifier and an identifier of the text's author (such as the author's name).

Next, the ontology stored in the ontology database 122 is described with reference to FIG. 3. An ontology is a knowledge classification system together with a set of inference rules. It is a kind of dictionary of concepts representing entities, organized as a dictionary that describes the relationships between superordinate and subordinate concepts. For example, it has a hierarchical structure in which the relationships between a concept and its superordinate and subordinate concepts are defined as connections between nodes. An ontology describes classes corresponding to [concepts], properties as information on the nature of a class, attributes as attribute information of a class, and so on.

FIG. 3 shows an example of the information recorded for one registered word, described in OWL, a well-known ontology description language. The ontology database 122 records registration information such as that shown in FIG. 3 for each class corresponding to a concept that is a registered word.

As shown in FIG. 3, the following information, for example, is registered in the ontology:
(a) a class corresponding to a [concept];
(b) a label as property information of the class; and
(c) information on relationships with other classes (for example, parent-child relationships) as attribute information of the class.

[Class] is a field that records the name of the concept representing a registered entity. A word or the conceptual information of a word is recorded in the class field as a class identifier (class ID). In the example shown in the figure,
class identifier (class ID) = [EpithelialNeoplasm].
EpithelialNeoplasm is a medical term and denotes the disease name "epithelial tumor".
[Property] records property information of the class, such as a label. In addition, information on relationships with other classes is recorded as attribute information.

The relationships with other classes will be described with reference to FIG. 4. The ontology records the registration information shown in FIG. 3 for each class corresponding to a concept, treats each class as one node, and represents relationships between classes as connections between nodes. Broadly, three kinds of class relationship are defined. Taking class A 201 in FIG. 4 as the class of interest, its relationships with the other classes are as follows.

First, class P 211 is the class of the superordinate concept of the concept corresponding to class A 201.
Class P 211 is the superclass (parent class) of class A 201.

Class A 201 has class S 221 and class T 222 as classes of subordinate concepts of the concept corresponding to class A 201.
Class S 221 and class T 222 are subclasses (child classes) of class A 201.

Furthermore, the classes registered as subordinate concepts of class P 211 include class B 212 and class C 213 in addition to class A 201. Class B 212 and class C 213 have the same superclass (parent class) as class A 201, namely class P 211.
Class B 212 and class C 213 are sibling classes of class A 201.

Thus, the following class relationships are defined in the ontology:
(a) superclass (parent class)
(b) subclass (child class)
(c) sibling class
Subclasses of subclasses (grandchild classes) and so on are also defined.
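The parent, child, and sibling relationships described above can be pictured with a small data structure. The following Python sketch is illustrative only: the OntologyClass type and the add_subclass and siblings helpers are hypothetical names, not part of the described apparatus or of OWL, and the class IDs P, A, B, C, S, T simply mirror the classes of FIG. 4.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    """One node of the ontology (hypothetical in-memory form)."""
    class_id: str
    labels: list = field(default_factory=list)    # property information
    parents: list = field(default_factory=list)   # superclass IDs
    children: list = field(default_factory=list)  # subclass IDs


def add_subclass(onto, parent_id, child_id):
    """Record that child_id is a subclass of parent_id."""
    onto[parent_id].children.append(child_id)
    onto[child_id].parents.append(parent_id)


def siblings(onto, class_id):
    """Classes that share a superclass with class_id, excluding itself."""
    result = []
    for parent_id in onto[class_id].parents:
        for child_id in onto[parent_id].children:
            if child_id != class_id and child_id not in result:
                result.append(child_id)
    return result


# Hierarchy of FIG. 4: P is the parent of A, B, C; A is the parent of S, T.
onto = {cid: OntologyClass(cid) for cid in ["P", "A", "B", "C", "S", "T"]}
for parent, child in [("P", "A"), ("P", "B"), ("P", "C"), ("A", "S"), ("A", "T")]:
    add_subclass(onto, parent, child)

sibling_ids = siblings(onto, "A")
```

Here the sibling relationship is derived from the parent and child links rather than stored, which is one possible design; an actual ontology store may record it differently.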

Next, the class statistic database 123 will be described with reference to FIG. 5. The class statistic database 123 holds, for each class recorded in the ontology, the class ID and a value representing a statistic of that class in the document corpus. The construction of the class statistic database 123 is described later.

As shown in FIG. 5, each entry of the class statistic database 123 pairs a class ID with its statistic. As described above with reference to FIG. 3, the class ID identifies the concept recorded in the concept-name field of a word registered in the ontology database. The statistic is frequency information on how often the word appears in the documents stored in the document corpus DB 121. Specifically, TF-IDF (Term Frequency-Inverse Document Frequency) = (number of occurrences of the word / number of documents in which the word appears) is used.

TF-IDF is used as an index of the importance of a word appearing in a document, for example its usefulness as a search term. The TF-IDF of a word that appears frequently within one document is high, but if the word appears frequently across many documents, the value becomes small. In other words, TF-IDF tends to be small for words in general use and high for words that appear frequently only in specific documents, so words with high TF-IDF values are effective as keywords for selecting a specific document from among many.

Next, processing using the document editing apparatus 100 shown in FIG. 1 will be described. The following two processes are described in order.
(a) Class statistic database construction process
(b) Input document standardization process
Of these, (a) builds the class statistic database used by (b), and is executed in advance, before the input document standardization process. Once the database has been constructed, the construction process need not be repeated. However, when the document corpus database 121 or the ontology database 122 is updated, it is preferable to update the class statistic database 123 as necessary by the same procedure as the construction process.

(a) Class statistic database construction process
First, the sequence for constructing the class statistic database will be described with reference to the flow shown in FIG. 6.

In step S101, one document is extracted from the document corpus. This is performed by the document extraction unit 102 of the document editing apparatus 100 shown in FIG. 1, which extracts the document from the document corpus database 121.

Next, in step S102, morphological analysis is performed on the text of the extracted document to segment it into words, and a set of words is extracted. This is performed by the word extraction unit 103 of the document editing apparatus 100 shown in FIG. 1. An existing morphological analysis program can be used, for example ChaSen (http://chasen.naist.jp/hiki/ChaSen/).

Next, in step S103, the correspondence between each extracted word and the classes of the ontology stored in the ontology database 122 is analyzed. This is performed by the corresponding class extraction unit 104 of the document editing apparatus 100 shown in FIG. 1.

In this corresponding class extraction process, each word extracted from a document stored in the document corpus database 121 is checked against the labels, that is, the class property information recorded in the per-class data of the ontology (see FIG. 3) stored in the ontology database 122, to determine whether a matching label exists.

For example, suppose that morphological analysis of the document with document ID = 1 extracted from the document corpus yields segmented text such as
「…／残留／腫瘍／も／否定／でき／ませ／ん。」 ("... residual tumor cannot be ruled out either."),
and that the ontology in the ontology database 122 contains the class [Neoplasm], whose label information records
"腫", "腫瘍" (tumor), and "腫瘤" (mass).
In this case the ontology contains a class whose label [腫瘍] matches the word [腫瘍] in the document, so the following correspondence information (a) to (c) is generated.
(a) the ID of the document extracted from the document corpus
(b) the word in that document
(c) the name of the class corresponding to the word
In the above example,
(document ID, word, class name) = <1, 腫瘍, Neoplasm>.
The correspondence among documents, words, and classes is acquired in this data format.
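The label matching of step S103 can be sketched as a dictionary lookup. This is a minimal illustration under stated assumptions: extract_correspondences and label_index are hypothetical names, each label is assumed to map to a single class, and the word list is taken as already segmented by the morphological analyzer.

```python
def extract_correspondences(doc_id, words, label_index):
    """Return (document ID, word, class name) triples for every word
    that exactly matches a label recorded in the ontology."""
    triples = []
    for word in words:
        class_name = label_index.get(word)
        if class_name is not None:
            triples.append((doc_id, word, class_name))
    return triples


# Labels of the class [Neoplasm] as given in the example above.
label_index = {"腫": "Neoplasm", "腫瘍": "Neoplasm", "腫瘤": "Neoplasm"}

# Segmented text of document ID = 1 from the example above.
words = ["残留", "腫瘍", "も", "否定", "でき", "ませ", "ん"]

triples = extract_correspondences(1, words, label_index)
```

With these inputs the only matching word is 腫瘍, so the result contains the single triple (1, "腫瘍", "Neoplasm").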

In step S104, it is determined whether all documents in the document corpus have been processed. If not, the process returns to step S101 and steps S101 to S103 are executed for an unprocessed document. If all documents have been processed, the process proceeds to step S105.

In step S105, the statistic of each class is calculated from the extracted word-class correspondences. This is performed by the class statistic calculation unit 105 of the document editing apparatus 100 shown in FIG. 1. As described above, the statistic is, for example, the commonly known TF-IDF (Term Frequency-Inverse Document Frequency) = (number of occurrences of the word / number of documents in which the word appears).

As described above, TF-IDF indicates the importance of a word appearing in a document, for example its usefulness as a search term. It tends to be high for words that appear frequently only in specific documents and low for words that appear frequently across many documents, so words with high TF-IDF values are effective as keywords for selecting a specific document from among many.

In step S106, the class ID corresponding to each extracted word and its statistic (TF-IDF) are stored in the class statistic database 123 of the document editing apparatus 100 shown in FIG. 1. As a result, the class statistic database 123 described above with reference to FIG. 5 is constructed.
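Steps S105 and S106 can be sketched as follows, using the ratio stated in this description (number of occurrences / number of documents in which the word appears). This is an illustrative sketch, not the apparatus's implementation: class_statistics is a hypothetical name, counting per class over the whole corpus is an assumption, and the second class and its label 癌 in the sample data are invented for the example.

```python
from collections import Counter


def class_statistics(triples):
    """Map each class name to (occurrences / documents containing it),
    computed from (document ID, word, class name) triples."""
    occurrences = Counter()
    documents = {}
    for doc_id, _word, class_name in triples:
        occurrences[class_name] += 1
        documents.setdefault(class_name, set()).add(doc_id)
    return {
        class_name: occurrences[class_name] / len(documents[class_name])
        for class_name in occurrences
    }


# Hypothetical correspondences collected over two documents.
triples = [
    (1, "腫瘍", "Neoplasm"),
    (1, "腫瘤", "Neoplasm"),
    (2, "腫瘍", "Neoplasm"),
    (2, "癌", "Carcinoma"),
]

# Plays the role of the class statistic database 123 (class ID -> statistic).
class_statistic_db = class_statistics(triples)
```

In this sample, [Neoplasm] occurs three times in two documents and [Carcinoma] once in one document, giving statistics of 1.5 and 1.0 respectively.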

(b) Input document standardization process
Next, the sequence of the input document standardization process executed by the document editing apparatus 100 of FIG. 1 will be described with reference to the flowchart shown in FIG. 7.

First, in step S201, the document to be processed is input. This is performed by the document input unit 101 of the document editing apparatus 100 shown in FIG. 1.

Next, in step S202, morphological analysis is performed on the text of the input document to segment it into words, and a set of words is extracted. This is performed by the word extraction unit 103 of the document editing apparatus 100 shown in FIG. 1. An existing morphological analysis program, such as ChaSen (http://chasen.naist.jp/hiki/ChaSen/) mentioned above, can be used.

Next, in step S203, the correspondence between each extracted word and the classes of the ontology stored in the ontology database 122 is analyzed. This is performed by the corresponding class extraction unit 104 of the document editing apparatus 100 shown in FIG. 1. For each extracted word, it is determined whether a matching label exists among the labels (class property information) recorded in the per-class data of the ontology (see FIG. 3).

As in step S103 of the flow of FIG. 6 described above, when the ontology contains a class whose label matches a word extracted from the document, the following correspondence information (a) to (c) is generated.
(a) the document ID
(b) the word in the document
(c) the name of the class corresponding to the word
That is, the correspondence among the document, words, and classes is acquired in the form (document ID, word, class name).

In step S203, classes whose labels match words in the input document are thus extracted. In the analysis of the input document, step S204 then takes each class corresponding to a word in the input document as a corresponding class and acquires its neighboring classes according to the class hierarchy of the ontology (see FIG. 4). Steps S204 and S205 are performed by the reference class extraction unit 106 of the document editing apparatus 100 shown in FIG. 1.

For example, classes within n levels of the corresponding class on the ontology (n is an arbitrary value of about 1 to 3) are extracted, following the hierarchical structure described above with reference to FIG. 4. The extracted classes include the superclasses (parent classes) and subclasses (child classes) of the corresponding class, that is, of the class whose label matches the word in the input document.

Note that even among the neighboring classes, classes meeting either of the following conditions are not selected:
(a) classes for which no statistic is set,
(b) classes that are in a disjoint relationship (disjointWith) on the ontology, and classes reached by way of such disjoint classes.

An example of two classes in a disjoint relationship (disjointWith) is as follows. Suppose the ontology records classes indicating body part names, such as
class [hand] and
class [foot].
A disjoint relationship (disjointWith) may be defined between these classes. This means that an entity (instance) of a body part cannot be both a [hand] and a [foot]. In other words, two classes in a disjoint relationship cannot both be the correct class of the same entity.
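The neighborhood extraction of step S204, including the two exclusion rules, might look like the breadth-first search below. It is a sketch under stated assumptions: neighborhood is a hypothetical name, one level is counted as one parent or child link, classes without a statistic are traversed but not returned, and the small hierarchy is invented for illustration rather than copied from a figure.

```python
from collections import deque


def neighborhood(start, parents, children, disjoint, stats, n):
    """Classes within n parent/child links of start, skipping classes
    disjoint with start (and anything reachable only through them) and
    omitting classes that have no statistic."""
    blocked = disjoint.get(start, set())
    seen = {start}
    queue = deque([(start, 0)])
    result = []
    while queue:
        cls, depth = queue.popleft()
        if depth == n:
            continue  # do not expand beyond n levels
        for nxt in parents.get(cls, []) + children.get(cls, []):
            if nxt in seen or nxt in blocked:
                continue
            seen.add(nxt)
            queue.append((nxt, depth + 1))
            if nxt in stats:
                result.append(nxt)
    return result


# Hypothetical hierarchy: Root > {X, D}, X > {XChild}, D > {DChild}.
parents = {"X": ["Root"], "D": ["Root"], "XChild": ["X"], "DChild": ["D"]}
children = {"Root": ["X", "D"], "X": ["XChild"], "D": ["DChild"]}
disjoint = {"X": {"D"}}                              # D is disjointWith X
stats = {"X": 4.2, "XChild": 5.1, "D": 6.0, "DChild": 3.0}  # Root has none

neighbors = neighborhood("X", parents, children, disjoint, stats, n=2)
```

In this example only XChild is returned: Root has no statistic, D is disjoint with X, and DChild can be reached only through D.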

Next, in step S205, from among the extracted neighboring classes, the class whose statistic exceeds that of the corresponding class by at least p% (p is an arbitrary value of about 10 to 50) and is the largest among such classes is extracted as the reference class.

An example of reference class extraction will be described with reference to FIG. 8, which shows the class hierarchy of the ontology stored in the ontology database 122.
The corresponding class [Carcinoma] 301 is the class extracted in step S203 as having a label that matches a word in the input document. With n = 2, the hierarchy in FIG. 8 shows the classes within two levels above and below the corresponding class 301.

The value [w] shown for each class is the statistic recorded in the class statistic database 123 described above with reference to FIG. 5, that is, a statistic such as TF-IDF that serves as an index of the frequency of appearance in the document corpus.

The disjoint classes 303 shown in FIG. 8 are in the disjoint relationship (disjointWith) described above with the corresponding class 301 and are excluded from reference class selection.

Suppose that the class currently being processed is the corresponding class [Carcinoma] 301 shown in FIG. 8, with
number of levels [n] of neighboring classes to extract: n = 2,
statistic threshold [p] for selection as the reference class: p = 20%.
In this case the reference class [LungCarcinoma] is selected from the classes shown in FIG. 8.
Among the classes in FIG. 8 other than the corresponding class 301 and the disjoint classes 303, the class with the largest statistic is [LungCarcinoma], with statistic [w] = 5.1. The difference Δw between this value and the statistic w = 4.2 of the corresponding class [Carcinoma] 301 is
Δw = 5.1 − 4.2 = 0.9.

Meanwhile, the statistic threshold calculated from the statistic w = 4.2 of the corresponding class [Carcinoma] 301 with the setting p = 20% is
4.2 × 20% = 0.84.
The difference Δw = 0.9 is greater than 0.84, so the statistic w = 5.1 of the class [LungCarcinoma] exceeds the statistic w = 4.2 of the corresponding class [Carcinoma] 301 by at least the threshold p (p = 20%).

The selection conditions for the reference class are the following conditions 1 and 2.
(Condition 1) The class is within n levels of the corresponding class on the ontology (n is an arbitrary value of about 1 to 3) [n = 2 in this example], excluding disjoint classes.
(Condition 2) The class has a statistic at least p% larger than that of the corresponding class (p is an arbitrary value of about 10 to 50) [p = 20% in this example] and has the largest statistic among such classes.

The class [LungCarcinoma] satisfies conditions 1 and 2 above and is therefore selected as the reference class 302 for the corresponding class [Carcinoma] 301, as shown in FIG. 8.
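Condition 2 can be checked with a few lines of code. The sketch below assumes that the neighboring classes satisfying condition 1 have already been collected; select_reference_class is a hypothetical name, the statistics 4.2 and 5.1 are those of the example above, and the class OtherNeighbor with statistic 4.5 is invented to show a candidate that fails the threshold.

```python
def select_reference_class(corresponding, candidates, stats, p):
    """Among candidates whose statistic exceeds that of the corresponding
    class by at least p percent, return the one with the largest
    statistic, or None if no candidate qualifies."""
    base = stats[corresponding]
    margin = base * p / 100  # e.g. 4.2 * 20% = 0.84
    eligible = [c for c in candidates if stats[c] - base >= margin]
    if not eligible:
        return None
    return max(eligible, key=lambda c: stats[c])


stats = {"Carcinoma": 4.2, "LungCarcinoma": 5.1, "OtherNeighbor": 4.5}
candidates = ["LungCarcinoma", "OtherNeighbor"]

reference = select_reference_class("Carcinoma", candidates, stats, p=20)
```

With p = 20 the required margin is 0.84, so LungCarcinoma (difference 0.9) qualifies and OtherNeighbor (difference 0.3) does not. When no candidate qualifies, the function returns None, which corresponds to a word for which no reference class exists.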

In the example shown in FIG. 8, the statistic of the class [yyyyy] among the disjoint classes 303 is larger than that of the reference class 302. However, [yyyyy], [zzzz], and [adenoma] are classes in a disjoint relationship (disjointWith) or subclasses of such classes, and are therefore excluded from extraction.

Next, in step S206, the abstraction level is calculated for each class (corresponding class) extracted for a word in the input sentence and for the reference class selected for that corresponding class. This is performed by the class abstraction level calculation unit 107 of the document editing apparatus 100 shown in FIG. 1.

An example of calculating the class abstraction level will now be described. Several calculation methods are applicable, but basically the abstraction level [a] of the reference class is set to
a = 0,
and the abstraction levels of the classes in its neighborhood, such as the corresponding class, are determined by one of the following methods (A) and (B).
(A) calculating the abstraction level from the hierarchical distance from the reference class on the ontology,
(B) calculating the abstraction level from the difference in the number of properties relative to the reference class.
These two methods are described below with reference to FIGS. 9 and 10.

(A) Calculating the abstraction level from the hierarchical distance from the reference class on the ontology
First, the method of calculating the abstraction level from the hierarchical distance from the reference class on the ontology will be described with reference to FIG. 9.

FIG. 9 shows a class hierarchy containing the class selected for a word in the input document, namely the corresponding class 321, and the reference class 322 selected from its neighboring classes by the method described above.
The corresponding class 321 is the class selected for a word in the input document, that is, a class in which a label matching the word is recorded as property information. The reference class 322 is the class selected as satisfying the reference class selection conditions described above:
(Condition 1) the class is within n levels of the corresponding class on the ontology (n is an arbitrary value of about 1 to 3), excluding disjoint classes,
(Condition 2) the class has a statistic at least p% larger than that of the corresponding class (p is an arbitrary value of about 10 to 50) and has the largest statistic among such classes.

In this example, the abstraction level of each class is determined by its hierarchical distance from the reference class on the ontology. Specifically, the abstraction level of each class is set according to the rule that moving from a class toward its superclass (parent class) adds 1 to the abstraction level, and moving toward a subclass (child class) subtracts 1.

In the example shown in FIG. 9, the abstraction level [a] of the reference class 322 is first set to a = 0. Then
the abstraction level of the superclass (parent class) of the reference class 322 is a = 1,
the abstraction level of a subclass (child class) of the reference class 322 is a = −1,
the abstraction level of a sibling class of the reference class 322 is a = 0.
Further, the abstraction level of the superclass (parent class) of the superclass of the reference class 322 is a = 2.
The abstraction level is thus set according to the class hierarchy.

This method of setting the abstraction level basically follows the class hierarchy of the ontology. In an ontology, moving toward superclasses (parent classes) leads to superordinate concepts whose abstraction as words increases, and moving toward subclasses (child classes) leads to subordinate concepts whose abstraction decreases. This method assigns abstraction levels accordingly.
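Method (A) amounts to a traversal outward from the reference class that adds 1 on each step to a parent and subtracts 1 on each step to a child. The sketch below is one way to write it: abstraction_by_distance is a hypothetical name, the hierarchy is an invented tree, and for a class reachable by several paths the first value found is kept, which is an assumption the description does not address.

```python
from collections import deque


def abstraction_by_distance(reference, parents, children):
    """Assign abstraction level 0 to the reference class, +1 per step
    toward a superclass, and -1 per step toward a subclass."""
    level = {reference: 0}
    queue = deque([reference])
    while queue:
        cls = queue.popleft()
        for parent in parents.get(cls, []):
            if parent not in level:
                level[parent] = level[cls] + 1
                queue.append(parent)
        for child in children.get(cls, []):
            if child not in level:
                level[child] = level[cls] - 1
                queue.append(child)
    return level


# Hypothetical tree: G > P > {R, Sib}, R > {Child}; R is the reference class.
parents = {"P": ["G"], "R": ["P"], "Sib": ["P"], "Child": ["R"]}
children = {"G": ["P"], "P": ["R", "Sib"], "R": ["Child"]}

levels = abstraction_by_distance("R", parents, children)
```

The sibling receives level 0 because it is reached by one step up to the parent (+1) followed by one step down (−1), matching the values listed for FIG. 9.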

(B) Calculating the abstraction level from the difference in the number of properties relative to the reference class
Next, the method of calculating the abstraction level from the difference in the number of properties relative to the reference class will be described with reference to FIG. 10.

As described above with reference to FIG. 3, properties (property information) such as labels are set for each class of the ontology. This method sets the abstraction level of each class according to the number of properties set for the class.

For example, the more properties are set for a class, the lower its abstraction level, and the fewer properties are set, the higher its abstraction level.

Specifically, as shown in FIG. 10 for example, the abstraction level of the reference class 332 is first set to 0. Four properties, p1 to p4, are set for this reference class.

Next, the abstraction level of each neighboring class chosen during reference class selection is set according to its number of properties.
The reference class has abstraction level a = 0 and [4] properties.
The number of properties of each other class is examined and, taking the reference class's count [4] as the baseline, the abstraction level a is increased by 1 for each property fewer and decreased by 1 for each property more.

In the example shown in FIG. 10, the corresponding class 331 has two properties recorded, which is two fewer than the [4] properties of the reference class, so its abstraction level is a = 2.

This method of setting the abstraction level basically reflects how properties (property information) are set in the ontology. In an ontology, words for superordinate concepts tend to have higher abstraction and fewer properties set, whereas words for subordinate concepts tend to have lower abstraction and more properties set. The method described with reference to FIG. 10 follows this tendency.
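Method (B) reduces to a subtraction: the abstraction level of a class is the reference class's property count minus the class's own count. In the sketch below, abstraction_by_properties is a hypothetical name; the counts 4 and 2 are those of the FIG. 10 example, and the class Neighbor with 5 properties is invented to show a negative result.

```python
def abstraction_by_properties(reference, property_counts):
    """Abstraction level = (properties of the reference class)
    - (properties of the class); the reference class itself gets 0."""
    base = property_counts[reference]
    return {cls: base - count for cls, count in property_counts.items()}


property_counts = {
    "Reference": 4,      # reference class 332 with properties p1 to p4
    "Corresponding": 2,  # corresponding class 331
    "Neighbor": 5,       # hypothetical class with one more property
}

levels = abstraction_by_properties("Reference", property_counts)
```

This gives a = 2 for the corresponding class, as in the example above, and a = −1 for the hypothetical class with five properties.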

In step S206 of the flow of FIG. 7, the abstraction level of each class (corresponding class) extracted for a word in the input sentence, and of the reference class selected for that corresponding class, is calculated by the method described with reference to FIG. 9 or FIG. 10. This is performed by the class abstraction level calculation unit 107 of the document editing apparatus 100 shown in FIG. 1.

The process then proceeds to step S207, where the variance [V] of the abstraction levels of all classes (corresponding classes) extracted for the words in the input document is calculated, and the calculated variance [V] is set as the [document abstraction level]. The document abstraction level is not limited to the variance; the mean or the like may be used instead. Steps S207 to S209 are performed by the document abstraction level calculation unit 108 of the document editing apparatus 100 shown in FIG. 1.

Next, in step S208, the variance [V] is compared with a predetermined threshold [m], and in step S209 it is determined whether the following expression holds:
Variance V ≧ threshold m
If the expression does not hold, that is, if the variance [V] is less than the threshold [m], the abstraction levels of the words in the input sentence being processed vary little. The sentence is therefore judged to use words of approximately the same abstraction level, and the process ends.

On the other hand, if the expression
Variance V ≧ threshold m
holds, that is, if the variance [V] is equal to or greater than the threshold [m], the abstraction levels of the words in the input sentence being processed vary widely. The sentence is therefore judged to use words of differing abstraction levels, and the process proceeds to step S210.
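The decision in steps S207 to S209 can be sketched as follows. This is a minimal illustration, not the apparatus's implementation: it assumes the population variance (the text does not say whether population or sample variance is meant), and the function names and the threshold value are hypothetical.

```python
from statistics import pvariance


def document_abstraction(levels):
    """Variance [V] of the abstraction levels of all corresponding classes.

    Assumption: population variance. The text also allows a mean or the like.
    """
    return pvariance(levels)


def needs_standardization(levels, threshold_m):
    """Return True when V >= m, i.e. the flow should proceed to step S210."""
    return document_abstraction(levels) >= threshold_m


uniform = [2, 2, 2, 2]   # words at the same abstraction level
mixed = [0, 4, 0, 4]     # words at widely differing abstraction levels

print(needs_standardization(uniform, threshold_m=1.0))
print(needs_standardization(mixed, threshold_m=1.0))
```

With the hypothetical threshold of 1.0, only the second list would be passed on to the standardization step.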

Step S210 is the document standardization process, executed by the document standardization processing unit 109 shown in FIG. 1. In the standardization process, for each word for which a reference class was detected in the neighborhood of the word's corresponding class, an editing process replaces the word used in the input sentence with the word corresponding to the reference class. Through this editing, the words in the edited sentence are replaced with reference-class words, yielding a sentence composed of words at a uniform abstraction level.

Such automatic editing may be employed. Alternatively, without automatic editing, the reference class information selected for each word in the input sentence may be presented to the user, who decides whether to replace the word with the word corresponding to the reference class, so that editing is performed according to the user's intention.
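The two editing modes (automatic replacement and user-confirmed replacement) can be sketched as below. The function name, the dictionary that maps a word to its reference-class word, and the `confirm` callback are all hypothetical illustrations; how reference classes are actually selected is outside this sketch.

```python
def standardize(words, reference_word, confirm=None):
    """Replace words that have a reference-class word.

    words: tokens of the input sentence.
    reference_word: dict mapping a token to the word of its reference class;
        tokens with no detected reference class are simply absent.
    confirm: optional callback (original, candidate) -> bool. When given,
        a replacement is made only if the callback returns True, which
        models presenting the candidate to the user.
    """
    result = []
    for word in words:
        candidate = reference_word.get(word)
        if candidate is None or candidate == word:
            result.append(word)          # nothing to replace
        elif confirm is None or confirm(word, candidate):
            result.append(candidate)     # automatic or user-approved edit
        else:
            result.append(word)          # user declined the suggestion
    return result


tokens = ["poodle", "chases", "animal"]
suggestions = {"poodle": "dog", "animal": "dog"}

print(standardize(tokens, suggestions))
print(standardize(tokens, suggestions, confirm=lambda old, new: old == "poodle"))
```

Passing no callback corresponds to fully automatic editing; passing one leaves every replacement to the user's decision.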

As described above, the document editing apparatus of the present invention executes processing that makes the abstraction levels of the words in a document uniform. It can automatically correct a sentence that uses words of varying abstraction levels and generate a document (standardized document) composed of words at a uniform abstraction level.

[Second Embodiment]
FIG. 11 shows the configuration of a second embodiment of the document editing apparatus of the present invention. The document editing apparatus 500 shown in FIG. 11 differs from the document editing apparatus 100 shown in FIG. 1 in that the class statistic calculation unit 105 of FIG. 1 is replaced in FIG. 11 by a document-creator-specific class statistic calculation unit 501 and a class statistic calculation unit (variance) 502, and in that the data stored in the class statistic database 503 is changed.

That is, in the second embodiment, unlike the preceding embodiment, the data stored in the class statistic database 503 is data from which variation among individual document creators has been removed.

Specifically, the class statistic database construction sequence described with reference to FIG. 6 for the preceding embodiment is different. The statistics of the class statistic database 503 are also applied in the input document standardization process described with reference to FIG. 7.

The class statistic database construction sequence of this embodiment is described with reference to the flow shown in FIG. 12.

In step S301, one document is extracted from the document corpus. In this processing, the document extraction unit 102 of the document editing apparatus 500 shown in FIG. 11 extracts a document from the document corpus database 121. Document creator information is also acquired in this processing.

Next, in step S302, morphological analysis is performed on the text of the document extracted from the document corpus, the text is segmented into words, and a set of words is extracted. This processing is performed by the word extraction unit 103 of the document editing apparatus 500 shown in FIG. 11. An existing morphological analysis program can be used; for example, morphological analysis is performed with ChaSen (http://chasen.naist.jp/hiki/ChaSen/).

Next, in step S303, the correspondence between each extracted word and the classes on the ontology stored in the ontology database 122 is analyzed. This processing is executed by the corresponding class extraction unit 104 of the document editing apparatus 500 shown in FIG. 11.

In step S304, it is determined whether all documents in the document corpus have been processed. If not, the process returns to step S301, and steps S301 to S303 are executed for an unprocessed document. If it is determined in step S304 that all documents in the document corpus have been processed, the process proceeds to step S305.

In step S305, class statistics are calculated based on the correspondence between the extracted words and classes. This processing is performed by the document-creator-specific class statistic calculation unit 501 of the document editing apparatus 500 shown in FIG. 11. As described above, the statistic is, for example, the commonly known TF-IDF (Term Frequency-Inverse Document Frequency) = (number of occurrences of the word / number of documents in which the word appears).
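The statistic as defined in this text (occurrence count divided by the number of documents containing the word) can be sketched at class level as follows. This is an assumption-laden illustration: the function name and the input layout (each document already reduced to a list of class IDs) are hypothetical, and the ratio follows the definition given here rather than the more common TF-IDF formulation that uses a logarithm of the inverse document frequency.

```python
from collections import Counter


def class_statistics(docs):
    """Compute (occurrences / documents containing the class) per class ID.

    docs: list of documents, each a list of class IDs, one per matched word.
    """
    occurrences = Counter()   # total occurrences of each class in the corpus
    doc_counts = Counter()    # number of documents in which each class appears
    for class_ids in docs:
        occurrences.update(class_ids)
        doc_counts.update(set(class_ids))
    return {cid: occurrences[cid] / doc_counts[cid] for cid in occurrences}


corpus = [["c1", "c1", "c2"], ["c1"], ["c2", "c2", "c2"]]
print(class_statistics(corpus))
```

Running the same function separately on each document creator's documents would give the per-creator statistics that the second embodiment relies on.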

In this embodiment, however, this statistic is calculated as separate data for each document creator. Each document creator has particular writing habits, and the words a creator uses frequently or avoids are biased accordingly. The statistic may therefore differ greatly from one creator to another.

In step S306, for each class corresponding to an extracted word, the variance of the per-document-creator statistics (TF-IDF) is calculated, and the reciprocal of the calculated variance is taken as the statistic of that class. This processing is performed by the class statistic calculation unit 502 of the document editing apparatus 500 shown in FIG. 11.

In step S307, the class ID corresponding to the extracted word and the [reciprocal of the variance of the per-document-creator statistics (TF-IDF)] calculated in step S306 are associated with each other as the class statistic and stored in the class statistic DB.

Through this processing, the class statistic database 503 of the document editing apparatus 500 shown in FIG. 11 records statistics with little variation that are not affected by, for example, the writing habits of individual creators.
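Steps S306 and S307 can be sketched as below. The sketch makes several assumptions that the text does not state: population variance is used, a class a creator never used counts as 0.0 for that creator, and a small `eps` is added so that a variance of zero does not cause a division by zero. All names are hypothetical.

```python
from statistics import pvariance


def inverse_variance_statistics(per_author, eps=1e-9):
    """Map each class ID to 1 / (variance of its per-creator statistics).

    per_author: dict creator -> dict class_id -> statistic (e.g. TF-IDF).
    eps: assumed guard against zero variance; not part of the source text.
    """
    class_ids = set()
    for stats in per_author.values():
        class_ids.update(stats)

    result = {}
    for cid in class_ids:
        # Assumption: a class missing for a creator contributes 0.0.
        values = [stats.get(cid, 0.0) for stats in per_author.values()]
        result[cid] = 1.0 / (pvariance(values) + eps)
    return result


per_author = {
    "creator_a": {"c1": 1.0, "c2": 1.0},
    "creator_b": {"c1": 3.0, "c2": 1.0},
}
db = inverse_variance_statistics(per_author)
print(db["c2"] > db["c1"])
```

A class used consistently by every creator (here `c2`) receives a larger statistic than one whose usage depends on the creator (here `c1`), which is the effect described for the second embodiment.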

The document standardization process of this embodiment is the same as the flow shown in FIG. 7 described for the preceding embodiment, except that the statistic used in step S205 of the flow of FIG. 7 has a different value. It is the statistic with little variation described above (the reciprocal of the variance of the per-document-creator statistics (TF-IDF)), which is not affected by, for example, the writing habits of individual creators. As a result, an appropriate reference class is selected and higher-quality document standardization is achieved.

Finally, an example hardware configuration of the information processing apparatus constituting the document editing apparatus that executes the above processing is described with reference to FIG. 13. The information processing apparatus constituting the document editing apparatus can be implemented in hardware by, for example, a PC. A CPU (Central Processing Unit) 701 executes processing corresponding to the OS (Operating System) and the processes described in the above embodiments, including:
(a) the class statistic database construction process
(b) the input document standardization process
These processes are executed according to computer programs stored in a data storage unit, such as a ROM or hard disk, of each information processing apparatus.

A ROM (Read Only Memory) 702 stores programs, calculation parameters, and the like used by the CPU 701. A RAM (Random Access Memory) 703 stores programs used in execution by the CPU 701 and parameters that change as appropriate during that execution. These are interconnected by a host bus 704 composed of a CPU bus or the like.

The host bus 704 is connected via a bridge 705 to an external bus 706 such as a PCI (Peripheral Component Interconnect/Interface) bus. A keyboard 708 and a pointing device 709 are input devices operated by the user. A display 710 consists of a liquid crystal display device, a CRT (Cathode Ray Tube), or the like, and displays various information as text and images.

An HDD (Hard Disk Drive) 711 contains a hard disk, drives it, and records or reproduces programs executed by the CPU 701 and information. The hard disk is used, for example, as storage for the ontology, text corpus, class statistics, dictionaries, and the like, and also stores various computer programs such as data processing programs.

A drive 712 reads data or programs recorded on a mounted removable recording medium 721, such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory, and supplies the data or programs to the RAM 703, which is connected via the interface 707, the external bus 706, the bridge 705, and the host bus 704.

A connection port 714 is a port for connecting an externally connected device 722 and has a connection unit such as USB or IEEE 1394. The connection port 714 is connected to the CPU 701 and other components via the interface 707, the external bus 706, the bridge 705, the host bus 704, and so on. A communication unit 715 is connected to a network and executes data searches by communicating with, for example, an external database 801.

The hardware configuration of the information processing apparatus shown in FIG. 13 is one example of an apparatus configured using a PC. The document editing apparatus of the present invention is not limited to the configuration shown in FIG. 13 and may have any configuration capable of executing the processing described in the above embodiments.

The present invention has been described in detail above with reference to specific embodiments. It is obvious, however, that those skilled in the art can modify or substitute the embodiments without departing from the gist of the present invention. In other words, the present invention has been disclosed by way of example and should not be interpreted restrictively. The claims should be consulted to determine the gist of the present invention.

The series of processes described in the specification can be executed by hardware, by software, or by a combination of both. When the processing is executed by software, a program recording the processing sequence can be installed and executed in memory in a computer built into dedicated hardware, or installed and executed on a general-purpose computer capable of executing various kinds of processing. For example, the program can be recorded on a recording medium in advance. Besides being installed on a computer from a recording medium, the program can be received via a network such as a LAN (Local Area Network) or the Internet and installed on a recording medium such as a built-in hard disk.

The various processes described in the specification need not be executed in time series in the order described; they may be executed in parallel or individually according to the processing capability of the apparatus executing them or as needed. In this specification, a system is a logical collection of a plurality of apparatuses, and the constituent apparatuses are not necessarily housed in the same casing.

As described above, according to the configuration of one embodiment of the present invention, words contained in an input document are extracted, and the classes on the ontology corresponding to the extracted words are selected as corresponding classes. From the neighboring classes of each corresponding class, a class with a high statistic indicating appearance frequency information in the document corpus is selected as a reference class. The variance or mean of the abstraction levels of all corresponding classes extracted for the words in the input document is calculated as the document abstraction level, and the degree of variation in the abstraction levels of the words in the input document is determined by comparing the calculated document abstraction level with a threshold. When the degree of variation is determined to be large, the words in the input document are replaced with words corresponding to the reference classes, or the reference class information is presented. A document that uses words of varying abstraction levels can thus be modified to generate a document that uses words of a uniform abstraction level.

FIG. 1 is a block diagram showing a configuration example of a document editing apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram explaining an example of a document corpus used in the document editing apparatus according to an embodiment of the present invention.
FIG. 3 is a diagram explaining an example of an ontology used in the document editing apparatus according to an embodiment of the present invention.
FIG. 4 is a diagram explaining an example of an ontology used in the document editing apparatus according to an embodiment of the present invention.
FIG. 5 is a diagram explaining an example of class statistics used in the document editing apparatus according to an embodiment of the present invention.
FIG. 6 is a flowchart explaining the class statistic database construction sequence executed in the document editing apparatus according to an embodiment of the present invention.
FIG. 7 is a flowchart explaining the input document standardization sequence executed in the document editing apparatus according to an embodiment of the present invention.
FIG. 8 is a diagram explaining an example of the reference class extraction process executed in the document editing apparatus according to an embodiment of the present invention.
FIG. 9 is a diagram explaining a method, executed in the document editing apparatus according to an embodiment of the present invention, of calculating the abstraction level based on the hierarchical distance from the reference class on the ontology.
FIG. 10 is a diagram explaining a method, executed in the document editing apparatus according to an embodiment of the present invention, of calculating the abstraction level based on the difference from the number of properties of the reference class.
FIG. 11 is a block diagram showing a configuration example of a document editing apparatus according to an embodiment of the present invention.
FIG. 12 is a flowchart explaining the class statistic database construction sequence executed in the document editing apparatus according to an embodiment of the present invention.
FIG. 13 is a diagram explaining a hardware configuration example of the information processing apparatus constituting the document editing apparatus according to an embodiment of the present invention.

Explanation of symbols

100 Document editing apparatus
101 Document input unit
102 Document extraction unit
103 Word extraction unit
104 Corresponding class extraction unit
105 Class statistic calculation unit
106 Reference class extraction unit
107 Class abstraction degree calculation unit
108 Document abstraction degree calculation unit
109 Document standardization processing unit
121 Document corpus storage unit
122 Ontology database
123 Class statistic database
201 Class A
211-213 Classes
221-222 Classes
301 Corresponding class
302 Reference class
303 Separation relation class
321 Corresponding class
322 Reference class
331 Corresponding class
332 Reference class
500 Document editing apparatus
501 Document-creator-specific class statistic calculation unit
502 Class statistic calculation unit
503 Class statistic database
701 CPU (Central Processing Unit)
702 ROM (Read Only Memory)
703 RAM (Random Access Memory)
704 Host bus
705 Bridge
706 External bus
707 Interface
708 Keyboard
709 Pointing device
710 Display
711 HDD (Hard Disk Drive)
712 Drive
714 Connection port
715 Communication unit
721 Removable recording medium
722 Externally connected device
801 Database

Claims

A document editing apparatus that executes editing processing of an input document, comprising:
a word extraction unit that extracts words contained in the input document;
a corresponding class extraction unit that extracts classes on an ontology corresponding to the extracted words as corresponding classes;
a class statistic database that stores, in association with each other, a class ID that is an identifier of a class defined on the ontology and a statistic indicating appearance frequency information, in a document corpus, of the word corresponding to the class;
a reference class extraction unit that selects, as a reference class, a class having a high statistic from among the neighboring classes, in the class hierarchy defined on the ontology, of the corresponding class extracted by the corresponding class extraction unit;
a class abstraction degree calculation unit that calculates the abstraction level of each of the reference class and the corresponding class;
a document abstraction degree calculation unit that calculates, as a document abstraction level, the variance or mean of the abstraction levels of all corresponding classes extracted for the words in the input document; and
a document standardization processing unit that determines the degree of variation in the abstraction levels of the words contained in the input document by comparing the document abstraction level with a threshold and, when the degree of variation is determined to be large, replaces the words contained in the input document with words corresponding to the reference classes or presents reference class information.

The document editing apparatus according to claim 1, wherein the corresponding class extraction unit extracts, as a corresponding class, a class on the ontology having a property that corresponds to a word extracted from the input document.

The document editing apparatus according to claim 1, wherein the class statistic database records, as the statistic, a TF-IDF (Term Frequency-Inverse Document Frequency) indicating appearance frequency information in the document corpus.

The document editing apparatus according to claim 1, wherein a TF-IDF (Term Frequency-Inverse Document Frequency), which is appearance frequency information in the document corpus, is calculated for each document creator, and the class statistic database records, as the statistic, the reciprocal of the variance of the [TF-IDF] values calculated for a plurality of document creators.

The document editing apparatus according to claim 1, wherein the reference class extraction unit selects, as the reference class, the class with the maximum statistic from among the neighboring classes that lie, in the class hierarchy defined on the ontology, within a preset number of hierarchy levels [n] of the corresponding class extracted by the corresponding class extraction unit, excluding separation relation classes.

The document editing apparatus according to claim 1, wherein the reference class extraction unit extracts, as the reference class, a class whose statistic exceeds the statistic of the corresponding class by a predetermined value (p%) and is the maximum.

The document editing apparatus according to claim 1, wherein the class abstraction degree calculation unit determines the abstraction level of a corresponding class in the neighborhood of the reference class based on its hierarchical distance from the reference class on the ontology, and sets the abstraction levels so that higher-level classes have higher abstraction levels and lower-level classes have lower abstraction levels.

The document editing apparatus according to claim 1, wherein the class abstraction degree calculation unit determines the abstraction level of a corresponding class in the neighborhood of the reference class based on the difference from the number of properties of the reference class, and sets the abstraction levels so that classes with fewer properties have higher abstraction levels and classes with more properties have lower abstraction levels.

A document editing method for executing editing processing of an input document in a document editing apparatus, comprising:
a word extraction step in which a word extraction unit extracts words contained in the input document;
a corresponding class extraction step in which a corresponding class extraction unit extracts classes on an ontology corresponding to the extracted words as corresponding classes;
a reference class extraction step in which a reference class extraction unit selects, as a reference class, a class having a high statistic indicating appearance frequency information in a document corpus from among the neighboring classes, in the class hierarchy defined on the ontology, of the corresponding class extracted by the corresponding class extraction unit;
a class abstraction degree calculation step in which a class abstraction degree calculation unit calculates the abstraction level of each of the reference class and the corresponding class;
a document abstraction degree calculation step in which a document abstraction degree calculation unit calculates, as a document abstraction level, the variance or mean of the abstraction levels of all corresponding classes extracted for the words in the input document; and
a document standardization processing step in which a document standardization processing unit determines the degree of variation in the abstraction levels of the words contained in the input document by comparing the document abstraction level with a threshold and, when the degree of variation is determined to be large, replaces the words contained in the input document with words corresponding to the reference classes or presents reference class information.

The document editing method according to claim 9, wherein the corresponding class extraction step extracts, as a corresponding class, a class on the ontology having a property that corresponds to a word extracted from the input document.

The document editing method according to claim 9, wherein the reference class extraction step selects, as the reference class, the class with the maximum statistic from among the neighboring classes that lie, in the class hierarchy defined on the ontology, within a preset number of hierarchy levels [n] of the corresponding class extracted by the corresponding class extraction unit, excluding separation relation classes.

The document editing method according to claim 9, wherein the reference class extraction step extracts, as the reference class, a class whose statistic exceeds the statistic of the corresponding class by a predetermined value (p%) and is the maximum.

The document editing method according to claim 9, wherein the class abstraction degree calculation step determines the abstraction level of a corresponding class in the neighborhood of the reference class based on its hierarchical distance from the reference class on the ontology, and sets the abstraction levels so that higher-level classes have higher abstraction levels and lower-level classes have lower abstraction levels.

The document editing method according to claim 9, wherein the class abstraction degree calculation step determines the abstraction level of a corresponding class in the neighborhood of the reference class based on the difference from the number of properties of the reference class, and sets the abstraction levels so that classes with fewer properties have higher abstraction levels and classes with more properties have lower abstraction levels.

A computer program that causes a document editing apparatus to execute editing processing of an input document, the program comprising:
a word extraction step of causing a word extraction unit to extract words contained in the input document;
a corresponding class extraction step of causing a corresponding class extraction unit to extract classes on an ontology corresponding to the extracted words as corresponding classes;
a reference class extraction step of causing a reference class extraction unit to select, as a reference class, a class having a high statistic indicating appearance frequency information in a document corpus from among the neighboring classes, in the class hierarchy defined on the ontology, of the corresponding class extracted by the corresponding class extraction unit;
a class abstraction degree calculation step of causing a class abstraction degree calculation unit to calculate the abstraction level of each of the reference class and the corresponding class;
a document abstraction degree calculation step of causing a document abstraction degree calculation unit to calculate, as a document abstraction level, the variance or mean of the abstraction levels of all corresponding classes extracted for the words in the input document; and
a document standardization processing step of causing a document standardization processing unit to determine the degree of variation in the abstraction levels of the words contained in the input document by comparing the document abstraction level with a threshold and, when the degree of variation is determined to be large, to replace the words contained in the input document with words corresponding to the reference classes or to present reference class information.