JPH06208580A

JPH06208580A - Dictionary preparing device

Info

Publication number: JPH06208580A
Application number: JP4148331A
Authority: JP
Inventors: Sakiko Honma; 咲子本間
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1992-05-14
Filing date: 1992-05-14
Publication date: 1994-07-26

Abstract

PURPOSE: To obtain data for a customized dictionary efficiently without requiring the user to be aware of the information in the general-purpose dictionary. CONSTITUTION: A sample text extraction unit 1 extracts sample text from a corpus 12 of the field to be processed. The extracted text is sent to an analysis information instruction unit 2, where the user specifies analysis information for the presented sample text. The analysis result is processed as appropriate by an analysis result storage unit 3 and either stored in an analysis result file 13 or sent directly to the next stage. A dictionary information specifying unit 4 identifies the information that matches the sample analysis result. The identified information is consolidated in units of base forms by a dictionary information storage unit 5. The stored information is synthesized into dictionary descriptions by a dictionary synthesis unit 6 according to the format stored in a format file 10, and is stored in a customized dictionary file 11.

Description

Detailed Description of the Invention

[0001]

[Technical Field] The present invention relates to a dictionary creation device and, more specifically, to a dictionary creation device for natural language processing that extracts the information needed for dictionary synthesis from the results of analyzing sample text and synthesizes a dictionary suited to the documents to be processed.

[0002]

[Prior Art] In natural language processing such as machine translation, dictionary data describing information about the words of the language to be processed is essential. A general-purpose dictionary usually describes the polysemy of words, and how their senses are distinguished, as finely and exhaustively as possible in order to cope with diverse linguistic phenomena. However, when the field or the documents to be processed are restricted, much of the polysemy is resolved. As a result, not only is much of the sense-discrimination information wasted, but inappropriate information also causes harm, such as erroneous processing.

[0003] As a means of solving this problem, a method is used in which the user directly registers information on the words of the documents to be processed, for example through user dictionary registration. For example, the "translation processing method and apparatus with dictionary creation support function" proposed in Japanese Patent Laid-Open No. 2-67679 lets the user select information such as translations from a general-purpose dictionary on the basis of information such as the frequency of occurrence of words extracted from a sample text, and creates a user dictionary from that selection. That is, by having the user select the necessary items from the general-purpose dictionary information on the basis of frequency information for the target documents, it aims to reduce the burden of user dictionary registration and to avoid registration mistakes. In this prior art, however, the choice of which sense of a polysemous word to select is left entirely to the user's subjective judgment, and frequency information alone provides no means of showing the user objectively which sense is used in the target documents.

[0004]

[Object] The present invention has been made in view of the circumstances described above. Its objects are as follows: to extract sample text of appropriate quality and quantity from the documents to be processed and, from the result of linguistically analyzing that sample text, to obtain data for a customized dictionary efficiently without making the user aware of the information in the general-purpose dictionary; in particular, to use as the analysis method a dependency analysis in which each sample sentence is divided into words and, for each word, the word it modifies and the name of the relation are specified, and to identify the usage in the sample sentence automatically by checking the general-purpose dictionary information against the dependency consistency; where the usage cannot be identified by dependency consistency alone, to refine the processing by means such as syntactic rules, argument structure matching, and priority order; to convert the information obtained from the sample sentences, which is in units of the surface forms appearing in the sentences, into information in units of base forms and consolidate it into information that can be registered directly as dictionary data; where one surface form has several possible base forms, to narrow them down to a single base form by correspondence with the identified information or with the translation specified by the user, and thus to obtain accurate information; and to provide a dictionary creation device that creates a customized dictionary suited to processing the target documents by synthesizing the information thus obtained according to a predefined format, adding information specified by the user that is not in the general-purpose dictionary, and supplementing from the general-purpose dictionary, as appropriate, information that cannot be obtained directly from the sample sentences.

[0005]

[Constitution] To achieve the above object, the present invention is characterized by the following.

(1) It has: a sample text extraction unit that extracts sample text from documents in the field to be processed; an analysis information instruction unit that presents the extracted sample text and through which analysis information for the sentences in the text is specified; an analysis result storage unit that stores the specified analysis information; a dictionary information specifying unit that, based on the analysis results stored in the analysis result storage unit, identifies from the general-purpose dictionary information, for each word in a sample sentence, the information suited to its usage in the sample; a dictionary information storage unit that stores the information identified by the dictionary information specifying unit together with frequency information such as the frequency of occurrence of words; and a dictionary synthesis unit that synthesizes the information stored in the dictionary information storage unit into dictionary form based on a prescribed description format, referring to the general-purpose dictionary information if necessary.

(2) In the analysis information instruction unit, the information specified is the word that each word in the sample sentence modifies and the name of that modification relation; based on this information, the analysis result storage unit stores, for each word, the word it modifies, the words that modify it, and the name of each modification relation; and the dictionary information specifying unit has specifying means for identifying, from the dictionary lookup information obtained for each word from the general-purpose dictionary, the information that matches the dependency consistency.

(3) In (2), the dictionary information specifying unit has exclusion means that, where dependency consistency alone cannot eliminate inappropriate information in the general-purpose dictionary, applies syntactic rules to eliminate it.

(4) In (2), the dictionary information specifying unit has extraction means that identifies the argument structure of verbs and the like in the sample sentence from the number and types of arguments extracted from the dependency information, and extracts from the general-purpose dictionary information the information that matches the identified argument structure.

(5) In (3) or (4), the dictionary information specifying unit has means that, where the exclusion means and the extraction means cannot identify from the general-purpose dictionary information the information suited to the usage in the sample sentence, identifies the information according to a priority order determined arbitrarily in accordance with the nature of the language of the sample text.

(6) In (1) or (2), the dictionary information storage unit has means for consolidating the dictionary information obtained in units of surface forms in the sample sentences into information in units of base forms in the general-purpose dictionary.

(7) In (1), the dictionary information storage unit has selection means that, where one surface form has several possible base forms, selects the base form that matches the information identified by the dictionary information specifying unit.

(8) In (1), the sample sentence analysis unit has designation means for specifying, in addition to the dependency analysis, a second-language translation corresponding to each word of the sample sentence, and the dictionary information storage unit has selection means that, where one surface form has several possible base forms, selects the base form that matches the specified translation.

(9) In (1), the dictionary synthesis unit has synthesis means for synthesizing a customized dictionary from the dictionary information stored in the dictionary information storage unit, according to the prescribed description format and in accordance with the frequency information stored in that storage unit.

(10) In (1), the dictionary synthesis unit has means that, where the pair consisting of the information identified by the dictionary information specifying unit and the translation specified in the sample sentence analysis unit is not in the general-purpose dictionary, newly adds that pair as dictionary information.

(11) In (1), the dictionary synthesis unit has addition means that adds to the dictionary information, in addition to the information identified by the dictionary information specifying unit, information given a high priority in the general-purpose dictionary, for example after adjusting its priority.

(12) In (1), a translation dictionary consisting of translations, parts of speech, and co-occurrence case frames is created from a bilingual general-purpose dictionary, and addition means is provided that uses the pair of the part-of-speech information identified by the dictionary information specifying unit and the translation specified by the user in the sample sentence analysis unit to extract the corresponding co-occurrence case frame from the translation dictionary and add it to the dictionary information.

The invention is described below on the basis of an embodiment.

[0006] FIG. 1 is a configuration diagram for explaining an embodiment of the dictionary creation device according to the present invention. In the figure, 1 is a sample text extraction unit, 2 is an analysis information instruction unit, 3 is an analysis result storage unit, 4 is a dictionary information specifying unit, 5 is a dictionary information storage unit, 6 is a dictionary synthesis unit, 7 is a sample file, 8 is a table file, 9 is a dictionary information file, 10 is a format file, 11 is a customized dictionary file, 12 is a corpus, 13 is an analysis result file, 14 is a general-purpose dictionary file, and 15 is a translation dictionary file.

[0007] The sample text extraction unit 1 extracts sample text of appropriate quantity and quality from the corpus 12 of the field to be processed. As required, the extracted text is sent to the analysis information instruction unit 2 or stored in the sample file 7. In the analysis information instruction unit 2, the user specifies analysis information for the presented sample text. The analysis method, for example dependency analysis or phrase structure analysis, is chosen to suit the characteristics of the language being processed and the ability of the user. Information in the general-purpose dictionary file 14 can be consulted as needed. The analysis result is processed as appropriate by the analysis result storage unit 3 and then stored in the analysis result file 13 or sent directly to the next stage. During analysis, the table file 8, which stores dependency patterns and the like, is consulted.

[0008] The dictionary information specifying unit 4 identifies, from the information obtained by searching the general-purpose dictionary, the information that matches the result of the sample analysis. The identified information is consolidated in units of base forms by the dictionary information storage unit 5. The consolidated information can be stored in the dictionary information file 9. The stored information is synthesized into dictionary descriptions by the dictionary synthesis unit 6 according to the format in the format file 10 and stored in the customized dictionary file 11. The format of the dictionary descriptions may also follow the format of the general-purpose dictionary file directly. During dictionary synthesis, information from the general-purpose dictionary file 14 and the translation dictionary file 15 is added as needed.

[0009] FIG. 2 shows the structure of the table file of FIG. 1, in which 21 is a dependency pattern table, 22 is a dependency consistency table, 23 is a syntactic rule table, 24 is an argument structure match table, 25 is a priority table, and 26 is a base form / inflected form correspondence table. The dependency pattern table 21 stores the dependency patterns used in sample sentence analysis. The dependency consistency table 22 is used to identify, from the general-purpose dictionary search results, the information suited to the usage in the sample sentence. The base form / inflected form correspondence table 26 is used to consolidate the identified information in units of base forms. FIG. 3 shows an example of the pattern table for dependency analysis in this embodiment. Every word except the head modifies exactly one word other than itself through one of these relations.
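The constraint that every word except the head modifies exactly one other word can be checked mechanically. The following is a minimal sketch under stated assumptions: the `Dep` record, the `validate` function, and the relation names are hypothetical, since the actual patterns of FIG. 3 are not reproduced in this text.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical relation names; the real ones are defined in FIG. 3.
RELATIONS = {"subj", "obj", "mod"}


@dataclass
class Dep:
    word: str
    head: Optional[int]      # index of the modified word, None for the head
    relation: Optional[str]  # name of the modification relation


def validate(sentence: List[Dep]) -> List[str]:
    """Return the violations of the one-modified-word-per-word constraint."""
    errors = []
    roots = [i for i, d in enumerate(sentence) if d.head is None]
    if len(roots) != 1:
        errors.append(f"expected exactly one head, found {len(roots)}")
    for i, d in enumerate(sentence):
        if d.head is None:
            continue  # the head itself modifies nothing
        if d.head == i or not 0 <= d.head < len(sentence):
            errors.append(f"{d.word!r} must modify one word other than itself")
        if d.relation not in RELATIONS:
            errors.append(f"{d.word!r} has unknown relation {d.relation!r}")
    return errors
```

An empty result means the annotation of the sentence is well formed and can be passed on to the analysis result storage unit.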

[0010] FIG. 4 is a flowchart of the dependency analysis in the sample sentence analysis unit. The steps are described in order below. The sample text is read (step 1), each sentence is divided into words (step 2), and the general-purpose dictionary is searched for each word (step 3). The sentences are displayed to the user one at a time (step 4), and the device waits for the user's input. For each word, the user specifies the word it modifies and the relation name (steps 5 to 7) or, as needed, a translation (step 8). When specifying a translation, the user can consult the general-purpose dictionary search results as needed (steps 13 and 14). The specified information is stored together with the entry word form obtained from the general-purpose dictionary search (steps 9 and 10). If an instruction to cancel a specification is given along the way, the subsequent specifications are cancelled and the device waits for new input (step 15). Unless a sentence skip instruction (step 16) or an end instruction (step 17) is given, this is repeated for every word in the sentence (step 11) and for every sentence in the sample text (step 12).
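The control flow of FIG. 4 can be sketched as a loop over sentences and words. This is only an illustration: the function `annotate`, the callbacks `lookup` and `ask`, and the command values are hypothetical, word division is reduced to whitespace splitting, and the cancel instruction of step 15 is left out for brevity.

```python
def annotate(sentences, lookup, ask):
    """Collect user annotations sentence by sentence (cf. FIG. 4).

    sentences: list of strings
    lookup(word) -> general-purpose dictionary entries for the word
    ask(words, i) -> ("set", head, relation, translation), "skip" or "quit"
    """
    records = []
    for sentence in sentences:                    # step 12: every sentence
        words = sentence.split()                  # step 2: naive word division
        entries = [lookup(w) for w in words]      # step 3: dictionary search
        pending = []
        for i, word in enumerate(words):          # step 11: every word
            answer = ask(words, i)                # steps 4-8: user input
            if answer == "quit":                  # step 17: end instruction
                return records                    # unfinished sentence is dropped
            if answer == "skip":                  # step 16: skip this sentence
                pending = []
                break
            _, head, relation, translation = answer
            pending.append({"word": word, "entries": entries[i],
                            "head": head, "relation": relation,
                            "translation": translation})
        records.extend(pending)                   # steps 9-10: store the result
    return records
```

Passing the user interaction in as a callback keeps the loop testable without a display; an interactive front end would supply its own `ask`.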

[0011] FIGS. 5(a) and 5(b) are flowcharts of the dictionary information specifying process in the dictionary information specifying unit. The steps are described in order below. The stored information is read one word at a time (step 1), the dependency consistency table 22 is called (step 2), and the parts of speech that are inconsistent with the modifier/modified relation are deleted (steps 3 and 4). Next, the syntactic rule table is called (step 5); if there is a part of speech to which a rule applies (step 6), the parts of speech that meet the deletion condition are deleted (steps 7 and 8). This is repeated until no applicable part of speech remains (step 9). Next, the argument structure match table is called (step 10); if there is a part of speech listed in the table (step 11), those whose argument structure pattern does not match the information on how the word is modified are deleted (steps 12 and 13), and this is repeated until no part of speech remains to be checked (step 14). If the candidate parts of speech have not been narrowed down to one by this point (step 15), the priority table is called (step 16); if the modification relation and the candidate parts of speech are listed in the table (steps 17 and 18), the parts of speech with lower priority are deleted (step 19). This is repeated up to the last word (step 20). FIGS. 6 to 9 show examples of the dependency consistency table, the syntactic rule table, the argument structure match table, and the priority table, respectively. FIG. 10 shows an example of the result of the dependency analysis of FIG. 4 and the dictionary information specifying process of FIG. 5.
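The four-stage narrowing of FIG. 5 can be pictured as a cascade of filters over the candidate parts of speech of one word. The sketch below is an assumption-laden illustration, not the tables of FIGS. 6 to 9: the function `specify_pos`, the table layouts, the part-of-speech labels, and the relation name "root" for the head are all hypothetical.

```python
def specify_pos(candidates, ctx, consistency, syntax_rules, arg_patterns, priority):
    """Narrow the candidate parts of speech for one word (cf. FIG. 5).

    ctx: {"relation": relation by which the word modifies its target,
          "args": set of relation names of the words that modify it}
    consistency: relation -> set of parts of speech allowed for the modifier
    syntax_rules: list of (part of speech, deletion condition on ctx)
    arg_patterns: part of speech -> required set of argument relations
    priority: relation -> parts of speech, highest priority first
    """
    # Steps 2-4: drop parts of speech inconsistent with the modification relation.
    allowed = consistency.get(ctx["relation"])
    pos = [p for p in candidates if allowed is None or p in allowed]
    # Steps 5-9: drop parts of speech that meet a rule's deletion condition.
    for target, should_delete in syntax_rules:
        if target in pos and should_delete(ctx):
            pos.remove(target)
    # Steps 10-14: drop parts of speech whose argument pattern does not match.
    pos = [p for p in pos
           if p not in arg_patterns or arg_patterns[p] == ctx["args"]]
    # Steps 15-19: if still ambiguous, keep only the highest-priority candidate.
    if len(pos) > 1:
        ranked = [p for p in priority.get(ctx["relation"], []) if p in pos]
        if ranked:
            pos = ranked[:1]
    return pos
```

Each stage only removes candidates, so the order of the stages reflects how much trust is placed in each table: the priority table is consulted last because it is the least tied to the evidence in the sample sentence.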

[0012] FIG. 11 is a flowchart of the process of consolidating information in units of base forms in the dictionary information storage unit. The steps are described in order below. The identified information is read one word at a time (step 1); if there is more than one base form (step 2), the base forms that do not match the identified part of speech or the specified translation are deleted (steps 11 and 13). If the base form and the surface form do not coincide (step 3), the word form is converted to the base form (step 4); if, in addition, the part of speech carries an inflected-form code (step 5), it is converted to the base-form code (step 6), and any duplicate parts of speech that result are deleted (step 7). If the word is already registered in the dictionary information file (step 8), its frequency information and the like are updated (step 9); otherwise new information is created (step 14). Processing is repeated up to the last word (step 10). FIG. 12 shows an example of the result of the base form identification process of steps 11 to 13. In the upper row the base form is identified by the identified part of speech; in the lower row it is identified by correspondence with the specified translation. FIG. 13 shows an example of the base form / inflected form correspondence table of FIG. 2. FIG. 14 shows an example of the result of consolidating the analysis information identification results of FIG. 10 in units of base forms by the process of FIG. 12.
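The consolidation of FIG. 11 amounts to mapping each as-written (surface) word form to a base form and merging the records that share one. The following sketch assumes hypothetical table layouts and part-of-speech codes (the real ones are in FIG. 13), and the romanized translations are placeholders.

```python
def consolidate(records, base_table, code_table, translations):
    """Merge surface-form records into base-form entries (cf. FIG. 11).

    records: dicts with "surface", "pos" (list) and optional "translation"
    base_table: surface form -> list of (base form, inflected POS code)
    code_table: inflected POS code -> base POS code
    translations: base form -> set of known translations
    """
    store = {}
    for rec in records:                                        # step 1
        cands = base_table.get(rec["surface"], [(rec["surface"], None)])
        if len({base for base, _ in cands}) > 1:               # step 2
            # Steps 11-13: keep the base forms that match the identified
            # part of speech or the translation specified by the user.
            cands = [(base, code) for base, code in cands
                     if code in rec["pos"]
                     or rec.get("translation") in translations.get(base, ())]
        for base in sorted({base for base, _ in cands}):
            # Steps 3-7: use the base form, map inflected POS codes to
            # base codes, and drop duplicates.
            pos = sorted({code_table.get(p, p) for p in rec["pos"]})
            entry = store.get(base)
            if entry is None:                                  # step 14: new entry
                store[base] = {"pos": pos, "freq": 1}
            else:                                              # step 9: update
                entry["pos"] = sorted(set(entry["pos"]) | set(pos))
                entry["freq"] += 1
    return store
```

Because the store is keyed by base form, occurrences such as "ran" and "run" accumulate into one entry, which is what allows the frequency information to be used later during synthesis.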

[0013] FIG. 18 is a flowchart of the dictionary synthesis process in the dictionary synthesis unit. The steps are described in order below. The sample information consolidated as in FIG. 14 is read one word at a time (step 1) and written to the customized dictionary file according to the format (step 2). Missing information is supplemented from the general-purpose dictionary file (step 3), and, if necessary, the priority is adjusted according to the number of occurrences (steps 4 and 8). If the part of speech requires a case frame (step 5), the case frame is copied from the translation dictionary (step 9). Information that is not in the sample information but is given a high priority in the general-purpose dictionary is also written to the customized dictionary file, with its priority adjusted if necessary (steps 6 and 10 to 12). This is repeated up to the last word (step 13).
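A rough sketch of the synthesis loop of FIG. 18 follows. Everything concrete in it is an assumption made for illustration: the function `synthesize`, the entry layout, the rule of adding the occurrence count to the priority, and the threshold `high`. Writing the entries out in the format of the format file 10 is omitted.

```python
def synthesize(sample_info, general, frames, needs_frame=("v",), high=5):
    """Build customized dictionary entries (cf. FIG. 18).

    sample_info: base form -> {"pos": [...], "freq": int, optional "translation"}
    general: base form -> list of {"pos", "translation", "priority"}
    frames: (translation, part of speech) -> case frame
    """
    out = []
    for word, info in sample_info.items():                     # step 1
        seen = set()
        for pos in info["pos"]:
            match = next((g for g in general.get(word, [])
                          if g["pos"] == pos), {})
            # Step 3: fill in what the sample lacks from the general dictionary.
            translation = info.get("translation") or match.get("translation")
            # Steps 4, 8: raise the priority with the number of occurrences.
            priority = match.get("priority", 0) + info["freq"]
            entry = {"word": word, "pos": pos,
                     "translation": translation, "priority": priority}
            # Steps 5, 9: copy a case frame for parts of speech that need one.
            if pos in needs_frame:
                entry["frame"] = frames.get((translation, pos))
            out.append(entry)                                  # step 2
            seen.add(pos)
        # Steps 6, 10-12: also keep high-priority general entries that the
        # sample did not attest.
        for g in general.get(word, []):
            if g["pos"] not in seen and g["priority"] >= high:
                out.append({"word": word, **g})
    return out
```

Keeping unattested but high-priority entries guards against a small sample hiding a common usage; how far their priority should be lowered relative to attested entries is left open here.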

[0014] FIG. 15 shows an example of the information in the general-purpose dictionary file. FIG. 16 shows an example of the format of the customized dictionary. FIG. 17 shows an example of the translation dictionary created from the general-purpose dictionary information. Even when a word is given a translation that the general-purpose dictionary does not list for it, an appropriate case frame can be assigned by extracting information from the translation dictionary using the combination of the specified translation and the identified part of speech. FIG. 19 shows an example of the information in the customized dictionary synthesized by the process of FIG. 18.
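The translation dictionary can be thought of as the general-purpose dictionary re-indexed by translation and part of speech. The sketch below assumes a hypothetical entry layout with a "frame" field; the actual format of FIG. 17 is not reproduced in this text.

```python
def build_translation_dict(general):
    """Index case frames by (translation, part of speech) (cf. FIG. 17).

    general: source word -> list of {"pos", "translation", optional "frame"}
    """
    index = {}
    for entries in general.values():
        for e in entries:
            if "frame" in e:
                # The first frame found for a pair is kept; later ones are ignored.
                index.setdefault((e["translation"], e["pos"]), e["frame"])
    return index
```

With such an index, a word that the general-purpose dictionary does not pair with the user's translation can still receive a case frame, as long as some other entry shares that translation and part of speech.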

[0015]

[Effects] As is clear from the above description, the present invention has the following effects.

(1) Effect corresponding to claim 1: Sample text of appropriate quality and quantity is extracted from the documents to be processed, and data for a customized dictionary can be obtained efficiently from the result of linguistically analyzing that sample text, without making the user aware of the information in the general-purpose dictionary.

(2) Effect corresponding to claim 2: In particular, the analysis method is a dependency analysis in which each sample sentence is divided into words and, for each word, the word it modifies and the relation name are specified. By checking the dependency consistency against the general-purpose dictionary information, the usage in the sample sentence is identified automatically, so that mistakes in selecting general-purpose dictionary information caused by the user's subjective judgment can be avoided and more accurate information is obtained efficiently.

(3) Effect corresponding to claims 3 to 5: Where the usage cannot be identified by dependency consistency alone, the processing can be refined by means such as syntactic rules, argument structure matching, and priority order.

(4) Effect corresponding to claim 6: By converting the surface-form information obtained from the sample sentences into base-form information and consolidating it, information that can be registered directly as dictionary data is obtained.

(5) Effect corresponding to claims 7 and 8: Where one surface form has several possible base forms, the base form can be narrowed down to one by correspondence with the information identified by the above means or with the translation specified by the user, so that accurate information can be obtained.

(6) Effect corresponding to claims 9 to 12: By synthesizing the information obtained by the above means according to a predefined format, adding information specified by the user that is not in the general-purpose dictionary, and supplementing from the general-purpose dictionary, as appropriate, information that cannot be obtained directly from the sample sentences, a customized dictionary suited to processing the target documents can be created efficiently and precisely.

[Brief description of drawings]

FIG. 1 is a configuration diagram for explaining an embodiment of the dictionary creation device according to the present invention.

FIG. 2 is a diagram showing the structure of the table file in FIG. 1.

FIG. 3 is a diagram showing the dependency pattern table of the present invention.

FIG. 4 is a flowchart of the dependency analysis in the sample sentence analysis unit of the present invention.

FIG. 5 is a flowchart of the dictionary information specifying process in the dictionary information specifying unit of the present invention.

FIG. 6 is a diagram showing the dependency consistency table of the present invention.

FIG. 7 is a diagram showing the syntactic rule table of the present invention.

FIG. 8 is a diagram showing the argument structure match table of the present invention.

FIG. 9 is a diagram showing the priority table of the present invention.

FIG. 10 is a diagram showing an analysis information identification result of the present invention.

FIG. 11 is a flowchart of the process of consolidating information in units of base forms in the dictionary information storage unit of the present invention.

FIG. 12 is a diagram showing the result of the base form identification process of the present invention.

FIG. 13 is a diagram showing the base form / inflected form correspondence table of the present invention.

FIG. 14 is a diagram showing the sample information of the present invention.

FIG. 15 is a diagram showing the general-purpose dictionary information of the present invention.

FIG. 16 is a diagram showing the format information of the present invention.

FIG. 17 is a diagram showing the translation dictionary of the present invention.

FIG. 18 is a flowchart of the dictionary synthesis process in the dictionary synthesis unit of the present invention.

FIG. 19 is a diagram showing the information in the customized dictionary of the present invention.

[Explanation of Reference Numerals] 1: sample text extraction unit, 2: analysis information instruction unit, 3: analysis result storage unit, 4: dictionary information specifying unit, 5: dictionary information storage unit, 6: dictionary synthesizing unit, 7: sample file, 8: table file, 9: dictionary information file, 10: format file, 11: customized dictionary file, 12: corpus, 13: analysis result file, 14: general-purpose dictionary file, 15: translation dictionary file.

─────────────────────────────────────────────────────

[Procedure Amendment]

[Submission date] November 19, 1993

[Procedure Amendment 1]

[Document to be amended] Drawings

[Item to be amended] All drawings

[Method of amendment] Change

[Content of amendment]

[FIG. 1] to [FIG. 19] (amended drawings, given in the order FIGS. 7, 1, 2, 3, 9, 16, 4, 5, 8, 13, 19, 6, 12, 10, 14, 11, 15, 17, 18)

Claims

[Claims]

1. A dictionary creation device comprising: a sample text extraction unit that extracts sample text from documents in the field to be processed; an analysis information instruction unit that presents the extracted sample text and accepts instructions of analysis information for the sentences in the text; an analysis result storage unit that stores the analysis information; a dictionary information specifying unit that, based on the analysis results stored in the analysis result storage unit, specifies for each word in a sample sentence the information in a general-purpose dictionary that suits the word's usage in the sample; a dictionary information storage unit that stores the information specified by the dictionary information specifying unit together with frequency information such as the frequency of appearance of words; and a dictionary synthesizing unit that synthesizes the information stored in the dictionary information storage unit into dictionary form according to a prescribed description format, drawing on the information of the general-purpose dictionary where necessary.

2. The dictionary creation device according to claim 1, wherein the information instructed through the analysis information instruction unit is the modification (dependency) relation of each word in a sample sentence and the name of that relation; the analysis result storage unit stores, based on that information, the modification target, the modification source, and the name of each modification relation for each word; and the dictionary information specifying unit has specifying means for specifying, from the dictionary lookup information obtained from the general-purpose dictionary for each word, the information that is consistent with the dependency relations.

3. The dictionary creation device according to claim 2, wherein the dictionary information specifying unit has excluding means for excluding inappropriate information by applying syntactic rules when the inappropriate information in the general-purpose dictionary cannot be eliminated by dependency consistency alone.

4. The dictionary creation device according to claim 2, wherein the dictionary information specifying unit has extracting means that identifies the term (argument) structure of a verb or the like in a sample sentence from the number and types of terms extracted from the dependency information, and extracts from the general-purpose dictionary the information that matches the identified term structure.

5. The dictionary creation device according to claim 3, wherein the dictionary information specifying unit has means for specifying information according to a priority order, set arbitrarily according to the linguistic properties of the sample text, when the excluding means and the extracting means cannot specify, from the information of the general-purpose dictionary, the information suited to the usage in the sample sentence.

6. The dictionary creation device, wherein the dictionary information storage unit has means for consolidating the dictionary information obtained per surface (inflected) word form in the sample sentences into information per original (base) form as used in the general-purpose dictionary.

7. The dictionary creation device according to claim 1, wherein the dictionary information storage unit has selecting means for selecting, when one surface form has a plurality of candidate original forms, the original form that matches the information specified by the dictionary information specifying unit.

8. The dictionary creation device according to claim 1, wherein the sample sentence analysis unit has, in addition to dependency analysis, designating means for designating a second-language translation corresponding to each word of the sample sentence, and the dictionary information storage unit has selecting means for selecting, when one surface form has a plurality of original forms, the original form that matches the designated translation.

9. The dictionary creation device according to claim 1, wherein the dictionary synthesizing unit has synthesizing means for synthesizing the dictionary information stored in the dictionary information storage unit in accordance with a prescribed description format and according to the frequency information stored in that storage unit.

10. The dictionary creation device according to claim 1, wherein the dictionary synthesizing unit has means for newly adding, as dictionary information, the pair of the information specified by the dictionary information specifying unit and the translation designated in the sample sentence analysis unit when that pair is not in the general-purpose dictionary.

11. The dictionary creation device according to claim 1, wherein the dictionary synthesizing unit has adding means that, in addition to the information specified by the dictionary information specifying unit, adds to the dictionary the information that has a high priority in the general-purpose dictionary, with its priority adjusted.

12. The dictionary creation device according to claim 1, wherein a translation dictionary consisting of translations, parts of speech, and co-occurrence frames is created from a bilingual general-purpose dictionary, and the device has adding means that uses the pair of the part-of-speech information specified by the dictionary information specifying unit and the translation designated by the user in the sample sentence analysis unit to extract the corresponding co-occurrence frame from the translation dictionary and add it to the dictionary information.
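
The consolidation and synthesis steps recited in claims 6, 7, and 9 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `VARIANTS` table, the function names, the sample observations, and the tab-separated output format are all hypothetical stand-ins for the original form / variant form correspondence table, the dictionary information storage unit, and the format file.

```python
from collections import Counter, defaultdict

# Hypothetical surface form -> candidate (original form, part of speech) table,
# standing in for the original form / variant form correspondence table.
VARIANTS = {
    "saw": [("see", "verb"), ("saw", "noun")],
    "sees": [("see", "verb")],
    "files": [("file", "noun"), ("file", "verb")],
}

def select_original_form(surface, specified_pos):
    """Pick the first candidate original form whose part of speech matches
    the part of speech specified for this occurrence (assumed rule)."""
    for original, pos in VARIANTS.get(surface, []):
        if pos == specified_pos:
            return original
    return surface  # no candidate matched: keep the surface form as-is

def consolidate(observations):
    """Group (surface form, part of speech) observations by original form,
    counting how often each part of speech was specified."""
    store = defaultdict(Counter)
    for surface, pos in observations:
        store[select_original_form(surface, pos)][pos] += 1
    return store

def synthesize(store):
    """Emit one entry per original form, most frequent part of speech first."""
    lines = []
    for original in sorted(store):
        ranked = [f"{pos}:{n}" for pos, n in store[original].most_common()]
        lines.append(f"{original}\t" + " ".join(ranked))
    return lines

# Invented sample data: the part of speech specified for each word occurrence.
observations = [("saw", "verb"), ("sees", "verb"), ("files", "noun"),
                ("files", "verb"), ("files", "noun"), ("saw", "noun")]
entries = synthesize(consolidate(observations))
```

Here `entries` holds one line per original form, with the information that appears most often in the sample text listed first, which is one plausible reading of synthesizing "according to the frequency information".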