JP2003263451A

JP2003263451A - Thesaurus formulation support device and thesaurus formulation support method

Info

Publication number: JP2003263451A
Application number: JP2002061343A
Authority: JP
Inventors: Mitsuo Nunome; 光生布目; Yumi Ichimura; 由美市村
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2002-03-07
Filing date: 2002-03-07
Publication date: 2003-09-19
Anticipated expiration: 2022-03-07
Also published as: JP4739637B2

Abstract

PROBLEM TO BE SOLVED: When a thesaurus is constructed from a document that contains character decoration information, applied beforehand as an expression of the user's viewpoint, that information is lost during processing, so the user must extract again the important words and phrases that had already been distinguished in the existing document.

SOLUTION: The device comprises tag content extraction means for extracting the types of tag information and the corresponding character strings from a document in which tag information is embedded, candidate generation means for generating thesaurus registration candidates, by a prescribed document analysis process, from the character strings corresponding to the tag types extracted by the extraction means, and candidate storage means for storing the registration candidates generated by the generation means for the thesaurus.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a thesaurus construction support device that extracts character strings from a document to support the construction of a thesaurus.

[0002]

[Prior Art] When documents are searched, or when multiple documents are automatically classified from some viewpoint, supplying background knowledge that defines the various associations between the words appearing in the documents allows a computer to process the documents with their meaning taken into account. At present, a knowledge base (electronic dictionary) called a thesaurus is often used as this background knowledge. A thesaurus records synonym relations and broader/narrower relations between words, for example that "PC" and "パソコン" (pasokon, "personal computer") are synonyms, or that "dahlia" and "daisy" are narrower concepts of "flower".
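The two kinds of relation described above can be pictured with a small data structure. The following is only a minimal sketch: the class name `Thesaurus` and its methods are hypothetical and are not taken from the patent, which does not specify an internal representation.

```python
from collections import defaultdict


class Thesaurus:
    """Toy thesaurus holding synonym sets and broader/narrower links."""

    def __init__(self):
        self._synonyms = defaultdict(set)   # term -> set of synonyms
        self._broader = {}                  # term -> broader (parent) term
        self._narrower = defaultdict(set)   # term -> narrower (child) terms

    def add_synonym(self, a, b):
        # Synonymy is symmetric, so record it in both directions.
        self._synonyms[a].add(b)
        self._synonyms[b].add(a)

    def add_narrower(self, broader, narrower):
        # Record the link from both ends so either side can be looked up.
        self._broader[narrower] = broader
        self._narrower[broader].add(narrower)

    def synonyms(self, term):
        return sorted(self._synonyms.get(term, set()))

    def narrower(self, term):
        return sorted(self._narrower.get(term, set()))

    def broader(self, term):
        return self._broader.get(term)


# The examples given in the text: a synonym pair and two narrower concepts.
thesaurus = Thesaurus()
thesaurus.add_synonym("PC", "パソコン")
thesaurus.add_narrower("flower", "dahlia")
thesaurus.add_narrower("flower", "daisy")
```

With this sketch, `thesaurus.narrower("flower")` lists the two flower names and `thesaurus.broader("daisy")` returns `"flower"`.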

[0003] Although the thesaurus plays an important role in meaning-aware document processing, conventional construction methods have been limited to the following: creating a data list in a format readable by the construction system and registering its entries in bulk (JP-A-8-77204, "Related word dictionary creation support device"), or having the user pick out, one by one, the expressions to be stored while referring to existing documents and then register them (JP-A-8-221427, "Thesaurus registration support device and thesaurus registration support method").

[0004] The former registration approach is widely used for building a thesaurus. For example, when a large vocabulary is newly registered as a user dictionary (called an external dictionary in kana-kanji conversion systems) or as a postal code dictionary, the vocabulary list to be registered in bulk is often expressed with general-purpose delimiters such as commas, spaces, or tabs and supplied as input.
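The delimiter-separated input described above could be read as follows. This is a minimal sketch under assumptions: the function name `parse_vocabulary_list` is hypothetical, entries are assumed to contain no internal spaces, and duplicates are dropped, which the text does not require.

```python
import re


def parse_vocabulary_list(text):
    """Split a bulk-registration list on commas, tabs, spaces, or newlines."""
    entries = re.split(r"[,\s]+", text.strip())
    seen = set()
    result = []
    for entry in entries:
        if entry and entry not in seen:   # skip blanks and duplicates
            seen.add(entry)
            result.append(entry)
    return result


# Mixed delimiters in one input string, with one repeated entry.
words = parse_vocabulary_list("dahlia, daisy\tflower PC,daisy")
```

Multi-word entries would need a stricter delimiter (for example tabs only), because this sketch also splits on spaces.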

[0005] Many systems based on the latter approach, on the other hand, display the contents of the thesaurus and a reference document containing the candidate vocabulary side by side on the screen, so that the user can register vocabulary in the thesaurus simply by specifying, for any expression on the screen, the extraction source and the registration destination. There is also a document classification system whose dictionary is easy to maintain because it generates and presents a list of candidates to be registered in the dictionary (JP-A-2001-184351, "Document information extraction device and document classification device").

[0006] Given its purpose, a thesaurus usually stores many of the words that the user regards as important among the vocabulary of the documents to be processed. Conventionally, however, even when a document referred to as a source of registration candidates during thesaurus construction already carries visual character decoration such as emphasis, italics, or underlining, applied by the user to reflect importance, processing could not make use of that formatting information. The document was usually converted to, and processed as, a general document format known as plain text, in which such formatting information is dropped.

[0007]

[Problem to Be Solved by the Invention] In such a conventional thesaurus construction system, however, even if character decoration information is present in a document as an example of the user's viewpoint having been reflected in it beforehand, that information is lost during processing. Important words and phrases that should already have been distinguished in the existing document therefore had to be extracted again by the user when the thesaurus was constructed.

[0008]

[Means for Solving the Problem] To solve the above problem, the present invention is characterized by comprising: tag information adding means for adding, to an arbitrary character string in a document, tag information that defines character decoration such as bold, underline, or color; tag content extraction means for extracting the tag information types and the corresponding character strings from the document in which tag information has been embedded by the tag information adding means; candidate generation means for generating thesaurus registration candidates, by a predetermined document analysis process, from the tag types extracted by the tag content extraction means and the corresponding character strings; and candidate storage means for storing the registration candidates generated by the candidate generation means as a thesaurus.

[0009] This provides a means of constructing a thesaurus from a document given as input by using tag information as formatting information. The thesaurus construction device recognizes formatting information such as bold, italics, and underlining when extracting information useful for thesaurus construction from the document, and includes means for presenting the vocabulary to be registered in the thesaurus or to be referred to as registration candidates. Extraction can thus be performed on the basis of tags that define formatting which visually characterizes character strings in the document.

[0010]

[Embodiments of the Invention] FIG. 1 shows one embodiment of the present invention. The thesaurus construction support device of this embodiment comprises: a tag information adding unit 102, which is one function of a document creation/viewing unit 101 implemented in software (a word processor, a document editor, or the like) on a computer; an input processing unit 103 for tagged documents; a tag content extraction unit 104 for the tag information added to a document; a candidate generation unit 105 provided with an extraction candidate designation unit 106 that receives user instructions; a candidate storage unit 107 for registering the output candidates as a thesaurus; and a thesaurus storage unit 108 as the storage destination of the thesaurus.

[0011] FIG. 2 shows an example of character strings 210 that have been visually characterized in the document creation/viewing unit 101. The decorations include emphasis 211, color 212, underline 213, and italics 214.

[0012] The tag information adding unit 102 converts the visual characterization applied to character strings in the document creation/viewing unit 101 into a predetermined format and outputs a tagged document that contains the formatting information. Here, the RTF (Rich Text Format) format shown in FIG. 3 is taken as an example of a tagged document. RTF is a file format specified and publicly released by Microsoft for describing document information that includes character attributes such as font and size.

[0013] Besides RTF, tagging formats for documents include HTML and XML, which are de facto standards on the Web from the W3C (World Wide Web Consortium) and whose specifications are likewise public.

[0014] FIG. 4 shows a tagged document 420 corresponding to the document of FIG. 2. Emphasis 421, red text 422, underline 423, and italics 424 are the tag information and corresponding character strings for emphasis 211, color 212, underline 213, and italics 214 in FIG. 2.

[0015] The document tagged by the tag information adding unit 102 is sent to the tag content extraction unit 104. The tag content extraction unit 104 analyzes the input document and extracts each tag type and the character string corresponding to that tag. The candidate generation unit 105 decides, for the analyzed tag types, which tag types' character strings are to be extracted, and outputs from the input document the specified tag information and the corresponding character strings. Where necessary, the extraction candidate designation unit 106 receives the tag type selection from the user.

[0016] The candidate storage unit 107 stores the list output by the candidate generation unit 105 as a thesaurus, or presents it as thesaurus registration candidates. This completes the overview of the system according to the embodiment. The details of each processing unit are described next with reference to flowcharts.

[0017] First, the flow of processing in the input processing unit 103, the tag content extraction unit 104, and the candidate generation unit 105 is described with reference to the flowchart of FIG. 5. In the document input process, a document in tagged form, for example the text shown in FIG. 4, is received (step S502).

[0018] Next, in the tag content extraction process, the tag types in the document data (called control codes in RTF) and the character strings to which the tags apply are recognized, and each tag type and tagged character string is stored, for example, in the tree structure shown in FIG. 6 (step S503).

[0019] As one example of a storage method, the procedure for storing the data in a tree structure is shown here. FIG. 7 is a flowchart of this procedure, and the storage procedure is described along it.

[0020] First, the destination node is initialized and the initial node depth is set to 0 (step S701). Next, one character is read from the tagged document (step S702), and it is determined whether the end of the input document has been reached (step S703).

[0021] If the data read in step S703 is not the end of the file, it is determined whether the character is "{" (step S704). Because the symbol "{" marks the start of the range affected by a tag, when the character read is "{" a new node is created and the node depth is incremented (step S707).

[0022] Next, the tag information and the character string subject to that tagging are stored in the newly created node (step S708). After storage, processing returns to step S702 and input is read from the document again.

[0023] If, on the other hand, the input character is not "{" in step S704, it is determined whether the character read is "}" (step S705). If it is "}", the node depth is decremented and the current destination is replaced with the parent node one level up (step S706). Processing then returns to step S702 to read the next input. If the condition of step S705 is not satisfied, processing likewise returns to S702.

[0024] Through the document input process of step S502 and the tag content extraction process of step S503, a tree structure that stores the tag types and tagged character strings is obtained from the tagged document data.
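The brace-driven storage procedure of FIG. 7 can be sketched roughly as follows. This is a simplified illustration, not a full RTF parser: the names `Node`, `build_tree`, and `tagged_strings` are hypothetical, the sample is an RTF-style fragment with the header omitted, and control codes are assumed to be a backslash followed by lower-case letters and an optional number. Where the flowchart stores the tag and string in one step (S708), the sketch accumulates characters in the current node and separates tag from text afterwards.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    depth: int
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    raw: str = ""                       # characters read directly inside this group
    children: List["Node"] = field(default_factory=list)

    @property
    def tags(self):
        # Control codes look like \b or \ul in this simplified input.
        return re.findall(r"\\([a-z]+)", self.raw)

    @property
    def text(self):
        # Drop control codes (with any numeric parameter and one trailing space).
        return re.sub(r"\\[a-z]+-?\d* ?", "", self.raw).strip()


def build_tree(document):
    root = Node(depth=0)                # S701: initialise the node, depth 0
    current = root
    for ch in document:                 # S702/S703: read characters until the end
        if ch == "{":                   # S704 -> S707: new node, one level deeper
            child = Node(depth=current.depth + 1, parent=current)
            current.children.append(child)
            current = child
        elif ch == "}":                 # S705 -> S706: return to the parent node
            if current.parent is not None:
                current = current.parent
        else:                           # S708: store content in the current node
            current.raw += ch
    return root


def tagged_strings(node):
    """Yield (tags, text) for every node that carries both a tag and text."""
    if node.tags and node.text:
        yield node.tags, node.text
    for child in node.children:
        yield from tagged_strings(child)


sample = r"{A {\b stimulus} causes {\ul an excited state}.}"
pairs = list(tagged_strings(build_tree(sample)))
```

Keeping a parent reference in each node is what lets the "}" branch move back up one level without a separate stack.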

[0025] Returning to FIG. 5, the tag types of the data to be output are determined for the analyzed document (step S504). The tag types to be output can be selected, for example, in the following three ways.

[0026] (1) The tag types to be extracted are fixed in advance, and the character strings corresponding to the matching tag information are extracted. This corresponds to the case in step S504 where there is only one extraction type or the extraction target has been specified beforehand.

[0027] (2-1) When the tag types cannot be determined in this way, the user designates the tag types to be extracted and the matching data is extracted. This corresponds to the process of step S505. FIG. 8 shows an example user interface. After the document data has been analyzed, the interface of FIG. 8 displays the tag types that appear in the document, and the user designates extraction candidates by ticking check boxes such as [Bold] 801 and [Underline] 802.

[0028] (2-2) As a variation of the selection candidate presentation step S505, the extraction candidate designation unit 106 may offer an interface that, in addition to designating the tag types to be extracted, lets the user specify which concept in the thesaurus the character strings of each tag type are to be labeled with. FIG. 9 shows an example interface that realizes such an extraction candidate designation unit 106.

[0029] In this example, extraction candidates can be selected by ticking check boxes such as [Bold] 901 and [Underline] 902. In addition, by entering label names in the text input boxes associated with the checked items, for example [Basic concept] for [Bold] in box 903 and [Extended concept] for [Underline] in box 904, the user specifies which concept label in the thesaurus each tag type is tied to. This completes the selection candidate presentation step S505. FIG. 10 shows an example of the output data format obtained in the extraction process of the extraction candidate designation unit 106.
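The selection of tag types and the optional concept labels can be illustrated as below. This is only a sketch under assumptions: the function `select_candidates`, the tag names, and the record layout are hypothetical, since the actual output format of FIG. 10 is not reproduced in the text, and the italic entry is an invented sample.

```python
def select_candidates(tagged, selection):
    """Keep strings whose tag type was ticked, attaching the concept label.

    tagged    -- list of (tag_type, text) pairs from the extraction step
    selection -- dict mapping each ticked tag type to a concept label
                 (None when the user entered no label, as in FIG. 8)
    """
    output = []
    for tag_type, text in tagged:
        if tag_type in selection:
            output.append(
                {"tag": tag_type, "label": selection[tag_type], "text": text}
            )
    return output


tagged = [
    ("bold", "stimulation"),
    ("underline", "diverges in the excited state"),
    ("italic", "for example"),          # invented sample; not ticked below
]
# FIG. 9 style input: [Bold] -> "Basic concept", [Underline] -> "Extended concept".
selection = {"bold": "Basic concept", "underline": "Extended concept"}
records = select_candidates(tagged, selection)
```

Method (1), with tag types fixed in advance, corresponds to passing a constant `selection` rather than one collected from the user.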

[0030] In the candidate generation process, vocabulary of low importance is removed from the output obtained in this way, using grammatical interpretation by morphological analysis, which automatically performs phrase segmentation and part-of-speech determination, and statistical measures such as the frequency with which character strings occur in the document (step S506). The result is passed as input to the subsequent candidate storage process (step S507).

[0031] Step S507, the candidate storage process, provides an interface for registering data (vocabulary) in the thesaurus 108. The candidate storage unit 107 presents the user with an environment in which the contents of the thesaurus and various text data serving as candidate vocabulary for registration can be viewed. The data output by the extraction candidate designation unit 106 may be used as the thesaurus itself in the candidate storage unit 107, or may be offered to the user for viewing as a reference document that presents registration candidates.

[0032] Next, the candidate storage process (step S507) is described with reference to the flowchart of FIG. 11. The flow shown is for the case where the output obtained so far is presented as a reference document.

[0033] First, the output from the extraction candidate designation unit 106, such as that shown in FIG. 10, is received (step S1101). Then, so that the data can serve as a reference document in the candidate storage unit 107, morphological analysis is applied to it, automatically splitting it into phrase units and estimating parts of speech (step S1102). Morphological analysis is a widely known technique, and various systems, both commercial and free, already exist.

[0034] Next, from the morphologically analyzed data, lists are generated such as the word-segmented list 1201 shown in FIG. 12 and the so-called co-occurrence list 1202, which contains two or three words that appear adjacently in the document, extracted according to a predetermined structure (step S1103). To mechanically eliminate vocabulary combinations that appear inappropriate, the generated lists are refined using statistical information such as frequency of occurrence (step S1104). The resulting lists are output as vocabulary registration candidates (step S1105). Through this procedure, the candidate storage process of step S507 presents vocabulary registration candidates for the thesaurus.

[0035] Referring to the output vocabulary registration candidates, the user manually registers them at the appropriate places in the thesaurus. FIG. 13 shows an example of the thesaurus finally constructed. Area 1303 shows the result of the user adopting the expressions "excitation", "co-occurrence", "stimulation", "resonance", and "divergence", which were presented as registration candidates, and registering them in the thesaurus. The user builds the thesaurus by assigning broader concepts to the presented candidates, such as "plus" and "minus" in 1302 and "vocabulary image" in 1301.

[0036] The following describes the process of extracting the decorated portions of the document data shown in FIG. 14 and actually registering them in the thesaurus through the candidate storage unit 107. This example follows the extraction of "stimulation", "excitation", "divergence", and "resonance", which are emphasized in bold, and of "diverges in the excited state", which is emphasized by underlining.

[0037] First, the document is read by the input process of step S502. The tag content extraction process of step S503 then interprets the tag information attached to the document data of FIG. 14 and stores each piece of tag information in a tree structure in the form shown in FIG. 15. In FIG. 15, 1501 is the storage location of the bold data "stimulation", "excitation", "divergence", and "resonance", and 1502 is the storage location of the underlined data "diverges in the excited state".

[0038] In this case more than one decoration type appears, so the condition of step S504 is not satisfied and the selection candidate presentation process of step S505 asks for user input. Here, as in the input of FIG. 8, bold 801 and underline 802 are selected for extraction. The tag information matching the selected decorations is located, and the corresponding data 1501 and 1502 shown in FIG. 15 is output.

[0039] Next, in the candidate generation process of step S506, morphological analysis is performed and outputs such as the word list 1201 and the co-occurrence list 1202 shown in FIG. 12 are obtained. The process may further include automatically removing, from the lists so obtained, vocabulary judged to be of little use to the user, by applying a statistical measure such as TF/IDF to the frequency of occurrence of each word in the document.
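As an illustration of filtering with a TF/IDF-style measure, the sketch below scores each term and drops those at or below a threshold. The text names the measure but not a formula, so the variant used here (term count divided by document length, multiplied by the logarithm of the number of documents over the document frequency) is an assumption, as are the function names, the threshold, and the sample documents.

```python
import math
from collections import Counter


def tf_idf_scores(documents):
    """Return one {term: score} dict per document.

    documents -- list of token lists
    Uses tf = count / document length and idf = log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))    # count each term once per document
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def drop_low_scores(score_map, threshold):
    # Keep only the terms whose score exceeds the threshold.
    return sorted(term for term, score in score_map.items() if score > threshold)


documents = [
    ["the", "stimulation", "causes", "the", "excitation"],
    ["the", "resonance", "of", "the", "system"],
]
scores = tf_idf_scores(documents)
kept = drop_low_scores(scores[0], threshold=0.0)
```

A term that appears in every document, such as "the" in this sample, gets an inverse document frequency of zero and is therefore removed.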

[0040] The data list obtained in this way is used as a reference document in the candidate registration process, and the user registers entries in the thesaurus. FIG. 13 shows an example of registration in the thesaurus, in which a three-level concept hierarchy has been built. The output of the candidate generation process corresponds to presenting the vocabulary 1303 that appears in the document as registration candidates.

[0041] As described above, the thesaurus construction processing method of this embodiment recognizes formatting information such as underlining, emphasis, and italics applied beforehand by the user when vocabulary is extracted from existing document data to construct a thesaurus. Character strings can therefore be extracted on the basis of that formatting information, and processing can reflect the intent the user expressed for specific vocabulary in the existing document data.

[0042] Furthermore, by providing a correspondence between formatting information and the character labels in the thesaurus, a thesaurus can easily be constructed from existing document data. Presenting the candidate vocabulary to be registered in the thesaurus also reduces the labor required for thesaurus construction.

[0043] The processing of the embodiment of the present invention may also be implemented as a computer-executable program, and the program may be provided on a computer-readable storage medium.

[0044] The storage medium in the present invention may take any storage form, such as a magnetic disk, flexible disk, hard disk, optical disc (CD-ROM, CD-R, DVD, etc.), magneto-optical disc (MO, etc.), or semiconductor memory, provided that it can store a program and can be read by a computer.

[0045] An OS (operating system) running on the computer, or MW (middleware) such as database management software or network software, may execute part of each process for realizing this embodiment, based on the instructions of the program installed on the computer from the storage medium.

[0046] The storage medium in the present invention is not limited to a medium independent of the computer, and also includes a storage medium that stores, or temporarily stores, a program downloaded after transmission over a LAN, the Internet, or the like.

[0047] The storage medium is not limited to a single medium. The case where the processing of this embodiment is executed from a plurality of media is also covered by the storage medium of the present invention, and the media may have any configuration.

[0048] The computer in the present invention executes each process of this embodiment based on the program stored in the storage medium, and may have any configuration, such as a single apparatus like a personal computer or a system in which a plurality of apparatuses are connected over a network.

[0049] The term computer in the present invention is not limited to a personal computer. It also covers arithmetic processing units and microcomputers contained in information processing equipment, and is a generic term for equipment and apparatus capable of realizing the functions of the present invention by means of a program.

[0050]

[Effects of the Invention] As described above, the thesaurus construction processing device of the present invention recognizes formatting information such as underlining, emphasis, and italics applied beforehand by the user when vocabulary is extracted from existing document data to construct a thesaurus. Character strings can therefore be extracted on the basis of that formatting information, and processing can reflect the intent the user expressed for specific vocabulary in the existing document data.

[Brief description of drawings]

【図１】本願発明の実施の形態に係る構成ブロック図
である。FIG. 1 is a configuration block diagram according to an embodiment of the present invention.

【図２】抽出対象箇所である下線、強調表現、斜体等
の書式が施された語彙の具体例[Fig. 2] A concrete example of a vocabulary that is underlined, emphasized, or italicized, which is the extraction target part.

【図３】入力としてのタグ付き文書形式の一例を表す
説明図である。FIG. 3 is an explanatory diagram showing an example of a document format with a tag as an input.

FIG. 4 is a diagram showing the tag information corresponding to each of the extraction targets: underlining, emphasis, and italics.

FIG. 5 is a flowchart outlining the processing.

FIG. 6 is an explanatory diagram showing an example of a tree structure used as the data storage structure.

FIG. 7 is a flowchart describing the procedure for storing tag information and the corresponding character strings in the tree structure.

FIG. 8 is an example of an interface that asks the user to select extraction candidates.

FIG. 9 is an example of an interface for entering extraction candidates and the corresponding concept names.

FIG. 10 is an example of the output data format obtained during the extraction process.

FIG. 11 is a flowchart describing the processing of the candidate generation unit.

FIG. 12 is a diagram showing an example of an output list.

FIG. 13 is an example of a diagram showing the internal structure of the thesaurus.

FIG. 14 is an example of an input document.

FIG. 15 is an example of a tree structure representation when data is extracted based on the designated extraction candidates.

[Explanation of symbols]

101 ... Document creation/viewing unit
102 ... Tag information addition unit
103 ... Input processing unit
104 ... Tag content extraction unit
105 ... Candidate generation unit
106 ... Extraction candidate designation unit
107 ... Candidate storage unit
108 ... Thesaurus storage unit

Continued from front page
F-terms (reference): 5B075 ND03 NK35 NR12 QM07 QM08 UU01; 5B091 AB00 CB24 CC02 CC04 CC16 DA06

Claims

1. A thesaurus construction support device comprising: tag information adding means for adding, to an arbitrary character string in a document, tag information defining character decoration such as bold, underline, or coloring; tag content extraction means for extracting the tag information type and the corresponding character string from the document in which tag information has been embedded by the tag information adding means; candidate generating means for generating thesaurus registration candidates, by a predetermined document analysis process, from the character string corresponding to the tag type extracted by the tag content extraction means; and candidate storing means for storing the registration candidates generated by the candidate generating means as a thesaurus.

2. The thesaurus construction support device according to claim 1, wherein the candidate storing means stores a plurality of character strings and a label indicating a higher-level concept common to those character strings in association with one another as a thesaurus.

3. The thesaurus construction support device according to claim 1, further comprising label designating means for designating a label indicating a thesaurus concept for a combination of a tag type, which defines a format for visually characterizing an arbitrary character string in a document, and the character string corresponding to that tag.

4. The thesaurus construction support device according to claim 1, wherein the tag content extraction means comprises tag type designation means capable of designating some tag types from among a plurality of types of tags that define formats for visually characterizing an arbitrary character string in a document, and extracts, from the tagged document, the tag type and the corresponding character string for the designated tags.

5. A thesaurus construction support method comprising: adding, to an arbitrary character string in a document, tag information defining character decoration such as bold, underline, or coloring; extracting the tag information type and the corresponding character string from the document in which this tag information is embedded; generating thesaurus registration candidates, by a predetermined document analysis process, from the character string corresponding to the extracted tag type; and storing the generated registration candidates as a thesaurus.
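
For readers who want a concrete picture of the tag type designation and the label-associated storage recited in the claims above, the following is a minimal, non-authoritative sketch in Python. The names `filter_by_tag_type` and `ThesaurusStore`, the representation of extracted items as `(kind, string)` pairs, and the in-memory dictionary are all assumptions made for illustration; the specification does not prescribe this data layout.

```python
from collections import defaultdict


def filter_by_tag_type(extracted, designated_kinds):
    """Keep only the (kind, string) pairs whose tag type was designated."""
    return [(kind, text) for kind, text in extracted if kind in designated_kinds]


class ThesaurusStore:
    """Hypothetical store associating a higher-level concept label with strings."""

    def __init__(self):
        # concept label -> list of registered strings, in registration order
        self._entries = defaultdict(list)

    def register(self, concept_label, strings):
        """Store each string under the concept label, skipping duplicates."""
        for s in strings:
            if s not in self._entries[concept_label]:
                self._entries[concept_label].append(s)

    def terms(self, concept_label):
        """Return a copy of the strings registered under the concept label."""
        return list(self._entries.get(concept_label, []))
```

A caller might, for example, designate only underlined strings, pass them through whatever candidate generation step is in use, and then register the confirmed candidates under a label such as "computer" chosen by the user.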