JPH0816910B2

JPH0816910B2 - Language analyzer

Info

Publication number: JPH0816910B2
Application number: JP61234327A
Authority: JP
Inventors: 壽彦横川
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1986-10-03
Filing date: 1986-10-03
Publication date: 1996-02-21
Anticipated expiration: 2011-02-21
Also published as: JPS6389975A

Description

Technical field

The present invention relates to a language analysis device, and in particular to a language analysis device that analyzes natural language and is useful in, for example, an automatic translation device.

Prior art

When a Japanese sentence is produced from a sentence in a foreign language such as English, the morphemes of the input English sentence are analyzed, its syntax is parsed, its sentence structure is converted, and the Japanese translation is then generated.

When the morphemes of a sentence are analyzed, whether a given word is used on its own or as part of a collocation or idiom combined with other words is an important question that affects the analysis result. Conventional methods fall into two types. One keeps dictionary information for single words side by side with dictionary information for idioms and collocations. The other always gives priority to an idiom or collocation that satisfies a specific condition, as in the longest-match method.

In the former method, the number of candidate combinations to choose from becomes very large, and every one of them has to be analyzed. Processing therefore takes a long time, and there is a risk that an incorrect combination is accepted as a correct answer, so the method is wasteful.

In the latter method, the idiomatic reading is always selected in preference to the others. The idiomatic reading is therefore chosen even for collocations whose words are only loosely bound to one another, and the risk of deriving an incorrect analysis result is high.

If every possible candidate in morphological analysis is kept as a solution, the number of candidates grows, and the number of solutions produced by the subsequent structure conversion and translation generation becomes enormous. Such a large number of candidate solutions markedly slows down all subsequent processing.

To improve the efficiency of the automatic translation process as a whole, it is therefore necessary to reduce the number of such wasted solutions and make the analysis more efficient.

Object

In view of this need, an object of the present invention is to provide a language analysis device capable of performing morphological analysis efficiently.

Configuration

To achieve the above object, the present invention is a language analysis device having dictionary means that stores dictionary data, including morpheme data, for words, collocations, and idioms, and analysis means that performs morphological analysis on an input sentence by referring to the dictionary means. The dictionary means includes, for collocations and idioms, binding-degree data indicating how strongly the constituent words are bound to one another. The analysis means refers to the dictionary means for each word in the input sentence and, when several dictionary entries are found for one word in combination with other words, refers to the binding-degree data and selects the combination of words with the higher degree of binding. The invention is described concretely below on the basis of one embodiment.

FIG. 2 shows the overall configuration of an embodiment in which the language analysis device according to the present invention is applied to an English-to-Japanese automatic translation device. Needless to say, the present invention applies effectively not only to an English-to-Japanese automatic translation device but also to an analysis device for any language that analyzes sentences of an input language, chiefly when translating from one language into another.

This embodiment has an input unit 10, through which the English text 12 to be translated into Japanese is entered. The input unit 10 may include, for example, a keyboard with character keys such as alphanumeric keys and with function keys, an optical character reader (OCR) that reads English text recorded on paper, and/or a file storage device that reads English text recorded on a storage medium such as a magnetic disk.

The English text entered through the input unit 10 is read into the pre-editing unit 14, where preprocessing for translation is performed. This stage mainly identifies sentences and handles unknown words, and it functions as part of the morphological analysis.

The pre-edited English data is transferred to the morphological analysis unit 16 together with the information obtained in pre-editing. The morphological analysis unit 16 consults the word dictionary 18, divides the text into sentences, analyzes the morphemes of the English sentences, handles unknown words, groups together items such as proper nouns, time expressions, and numeric expressions, and performs whole-sentence processing such as recognizing tag questions and apposition. The morphological analysis rules are stored in the analysis rule file 36.

The morphologically analyzed English data is transferred to the syntax analysis I unit 20 together with the dictionary information obtained in the morphological analysis. The syntax analysis I unit 20 is a functional unit that applies grammar rules to the English data, analyzes the surface structure of each sentence, and finds all syntactic possibilities.

The English data parsed by the syntax analysis I unit 20 is sent, together with its analysis information, to the syntax analysis II unit 22. This unit applies structural descriptions to the surface-level results of syntax analysis I and selects a solution, thereby building a plausible parse tree for the English sentence and establishing its structure. These syntax analysis rules are likewise stored in the analysis rule file 36.

The parsed English data is transferred to the structure conversion unit 24 as parse tree data. The structure conversion unit 24 builds the syntax tree of the corresponding Japanese sentence from the syntax tree that represents the intermediate structure of the English sentence, converting it into a Japanese base structure from which a Japanese sentence is easy to produce.

The syntax tree data representing the Japanese base structure obtained by this conversion is sent to the translation generation unit 26, which generates the translation. This functional unit generates a Japanese sentence from the tree structure of the Japanese syntax tree.

The generated Japanese sentence data, that is, the translation data, is sent to the post-editing unit 30. Using the information employed in the translation process, the post-editing unit 30 consults the dictionary 18, corrects the translation data, and completes a more natural Japanese sentence. This Japanese sentence data is transferred to the output unit 32, which outputs it as the translated Japanese text 34. The output unit 32 includes, for example, a printer, a display, and/or a file storage device such as a magnetic disk.

This sequence of translation steps is governed by the control unit 38, which supervises the control of the whole device. In this embodiment the word dictionary 18 stores dictionary data for English and Japanese words and records not only vocabulary but also various information such as dependency relations (that is, co-occurrence relations), meaning, number, and part of speech. The analysis rule file 36 stores the rule data for morphological analysis and syntax analysis.

An operation display unit 40 is connected to the control unit 38. The operation display unit 40 has operation keys, such as a translation instruction key and cursor keys, with which the operator gives instructions to the device, and a display and indicators that visibly present the input English text, the Japanese translation result, intermediate data such as dictionary information, and various prompts to the operator. Many of these operation and display functions may be provided by the keyboard of the input unit 10, when it has one, and by the display of the output unit 32, when it has one.

FIG. 1 shows, by way of example, a detailed configuration of the morphological analysis unit 16. The morphological analysis unit 16 has an input interface 104 that interfaces with an input device 100, such as the keyboard of the input unit 10, and with an input document file 102. English character string data is supplied to the input interface 104 from the input device 100 or the input document file 102 in the form of coded data such as ASCII, and the interface has an input character string buffer that temporarily holds this data. These input character strings may already have been pre-edited by the pre-editing unit 14.

As illustrated, the morphological analysis unit 16 has a processing unit 106, a dictionary search unit 108, a conflict resolution rule processing unit 110, and a control unit 112. The processing unit 106 is the functional unit that carries out morphological analysis, and it has a searched dictionary information buffer, that is, a dictionary information storage table 120 (FIG. 9). Morphological analysis proceeds by requesting dictionary searches for search key strings taken in order from the start of the input character string, storing the dictionary information returned by the dictionary search unit 108 in the searched dictionary information buffer 120, and carrying out processing such as the prioritization governed by the highest-priority flag described below.

The dictionary search unit 108 is a functional unit that searches the word dictionary 18 using the search key string specified by the processing unit 106, retrieves the dictionary information, and transfers it to the processing unit 106.

As the example entries in FIG. 3 show, the word dictionary 18 stores, for each entry, a highest-priority flag in addition to grammatical information such as part of speech and inflection. This dictionary is referred to here as the dictionary file with priority flags. The "highest-priority flag" indicates how strongly the words contained in the collocation or idiom that forms a dictionary entry are bound to one another: "0" indicates weak or no binding, and "1" indicates strong binding. Within a sentence, a collocation or idiom judged to be strongly bound is therefore presumed to be used as an idiom; otherwise, the possibility that each word is used on its own is also considered in parallel.

As illustrated in FIG. 3, the entries in the word dictionary 18 do not distinguish single words from collocations or idioms; collocations, idioms, and the individual words that make them up are each listed as entries. Each inflected form also constitutes its own entry. When a word has several inflected forms, each is registered as a separate entry, and the inflection field shows which form it is. The same applies to parts of speech: several parts of speech may be registered, each with its own part-of-speech information. Other registered information includes, for example, whether a noun is countable or uncountable, whether a verb is intransitive or transitive, and the translation equivalents.

For example, "get" is the base form of a verb, and its highest-priority flag is "0". The idiom "get up" is a base-form verb phrase, and its highest-priority flag is "1". The prepositional phrase "up to" has a highest-priority flag of "1", whereas the noun phrase "white house", a collocation, has a highest-priority flag of "0", indicating that its words are only loosely bound. In the figure, a dedicated symbol stands for the space character.
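As a minimal sketch of how such entries might be represented, the following Python fragment models the four examples above. The class name `DictEntry`, its field names, and the part-of-speech labels are illustrative assumptions; the text specifies only that each entry carries grammatical information and a highest-priority flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictEntry:
    text: str          # headword; contains spaces for collocations and idioms
    pos: str           # part of speech (labels here are illustrative)
    top_priority: int  # 1 = strongly bound unit, 0 = weak or no binding


# The four example entries named in the description.
DICTIONARY = [
    DictEntry("get", "verb", 0),
    DictEntry("get up", "verb", 1),
    DictEntry("up to", "prep", 1),
    DictEntry("white house", "noun", 0),
]

# Entries that would be treated as a single unit when they occur in a sentence.
strongly_bound = [e.text for e in DICTIONARY if e.top_priority == 1]
```

Single words and multi-word expressions share one entry type here, mirroring the statement that the dictionary does not distinguish between them.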

The dictionary information retrieved by the dictionary search unit 108 thus includes the highest-priority flag. When the highest-priority flag is "1" for entries covering the same character string or overlapping character strings, the conflict must be resolved. This is done by the conflict resolution rule processing unit 110, which refers to the highest-priority flag conflict resolution rules stored in the analysis rule file 36.

In this embodiment, the conflict resolution rules are applied in the following order, (1) to (3), to make the priority selection:

(1) an idiom or word whose part of speech is verb;
(2) the collocation, idiom, or word with the larger number of characters;
(3) the collocation, idiom, or word that appears earlier in the sentence.
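The ordering of rules (1) to (3) can be sketched as a comparison between two overlapping candidates. This is an illustration only: the dictionary keys `text`, `pos`, and `start` are hypothetical, and the sketch assumes that a rule decides the outcome only when it actually distinguishes the two candidates (for example, rule (1) is skipped when both or neither are verbs).

```python
def select_by_rules(a, b):
    """Return whichever of two overlapping candidates rules (1)-(3) prefer."""
    a_verb, b_verb = a["pos"] == "verb", b["pos"] == "verb"
    if a_verb != b_verb:                          # rule (1): the verb wins
        return a if a_verb else b
    if len(a["text"]) != len(b["text"]):          # rule (2): more characters wins
        return a if len(a["text"]) > len(b["text"]) else b
    return a if a["start"] <= b["start"] else b   # rule (3): earlier position wins


get_up = {"text": "get up", "pos": "verb", "start": 8}
up_to = {"text": "up to", "pos": "prep", "start": 12}
winner = select_by_rules(get_up, up_to)
```

With these two candidates, rule (1) already settles the choice in favor of the verb phrase, so rules (2) and (3) are never consulted.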

The usage selected in this way, in other words the analysis unit, is marked in the activity information of the searched dictionary information buffer 120 of the processing unit 106. An activity value of "1" indicates that the analysis unit is valid, and "0" indicates that the possibility is not adopted.

The control unit 112 is a functional unit that supervises and controls the operation and processing of the functional units of the morphological analysis unit 16. It may be included in the control unit 38, which controls the whole device.

The result of morphological analysis is transferred through the output interface 114 to the syntax analysis I unit 20. When it is not transferred directly, it is first stored in the syntax analysis input file 116 and the syntax analysis dictionary information file 118.

In this embodiment, morphological analysis extracts every word, collocation, or idiom that starts at the cut position of a dictionary lookup unit, but for a collocation or idiom recognized as a single "unit" according to the highest-priority flag, the dictionary information obtained for its individual words is discarded. That is, the highest-priority flag in the dictionary information obtained during morphological analysis is used to judge how strongly the words in the sentence are bound to one another. A collocation or idiom judged to be strongly bound is presumed to be used as an idiom in that sentence; otherwise, the possibility that each word is used on its own is also considered in parallel.

Processing based on the highest-priority flag follows the sequence shown in FIG. 4. Input character string data is received from the input unit 10 (200), the input string is cut into dictionary lookup units in order to consult the dictionary file 18 with highest-priority flags (201), and the dictionary 18 is searched accordingly (203). Once this has been done up to the final position of the sentence represented by the input data (202), conflicts between highest-priority flags are resolved (204) and the morphological analysis result is output to the syntax analysis I unit 20 (205).

In the input process 200, data is first read from the input document file 102 or the input device 100 into the input character string buffer of the input interface 104 (210, FIG. 5). The input character string data arrives in a code such as ASCII; when the EOF code is read, the processing unit 106 writes a NULL code into the input character string buffer to mark the final position.

Next, the processing unit 106 shapes the input character string (211). For example, when two or more characters belonging to the space-class character type, such as blanks, occur in succession, they are merged into a single blank. The space-class character type includes the blank character, tabs, line breaks, and so on. Space-class characters between the start of the input character string buffer and the first character of any other type are deleted.

For example, shaping an input character string produces the result shown in FIG. 6, in which the code [NULL] marks the final position of the buffer.
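The shaping step can be sketched in Python as follows. The function name `shape_input` and the sample string are hypothetical, the NULL code is modeled as the character `"\0"`, and the space-class set is assumed to be blank, tab, carriage return, and newline.

```python
import re

SPACE_CLASS = " \t\r\n"  # assumed members of the space-class character type


def shape_input(raw):
    """Collapse runs of space-class characters and drop leading ones."""
    # Two or more consecutive space-class characters become a single blank.
    collapsed = re.sub(r"[ \t\r\n]{2,}", " ", raw)
    # Space-class characters before the first other character are deleted.
    trimmed = collapsed.lstrip(SPACE_CLASS)
    # A NULL code marks the final position of the buffer.
    return trimmed + "\0"


shaped = shape_input("  I  will\t\nget up")
```

Only runs of two or more space-class characters are rewritten, following the wording of the description; a lone tab between words would be left as it is in this sketch.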

The dictionary lookup delimiters used in the lookup-unit cutting process 201 are placed at every character other than a letter, digit, apostrophe, hyphen, or period, and at any apostrophe that follows a blank. The processing unit 106 has a head pointer for dictionary lookup, which is initially set to the start of the buffer.
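The delimiter rule can be expressed as a small predicate. The sketch below is an assumption-laden illustration: the name `is_delimiter` is hypothetical, "letter" and "digit" are taken to mean ASCII letters and digits, and positions are 0-based Python indices rather than the 1-based positions used in the figures.

```python
def is_delimiter(buf, i):
    """True if the character at index i acts as a dictionary lookup delimiter."""
    ch = buf[i]
    if ch == "'":
        # An apostrophe is a delimiter only when it follows a blank.
        return i > 0 and buf[i - 1] == " "
    if ch in "-.":
        return False  # hyphens and periods never delimit
    # Anything that is not an ASCII letter or digit delimits.
    return not (ch.isascii() and ch.isalnum())


sample = "I won't go, ok"
delimiter_positions = [i for i in range(len(sample)) if is_delimiter(sample, i)]
```

In the sample, the apostrophe inside "won't" is not a delimiter, while the blanks and the comma are.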

The dictionary search unit 108 then consults the dictionary file 18 with highest-priority flags, using as the search key the character string from the character indicated by the head pointer up to the character just before the next delimiter. Dictionary entries are compared with the search key string, and the dictionary information of each matching entry is fetched (203). A match is declared when the entire string of the dictionary entry coincides with at least the part of the search key string that starts at its beginning, and the character immediately after that part is a dictionary lookup delimiter, an apostrophe, or a period. For example, as shown in FIG. 7, when the head pointer indicates the first character "g" of the search key string, the dictionary entries "get" and "get upon" match it.
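The match condition can be sketched as follows. The buffer contents, the entry list, and the function names are hypothetical; the sketch assumes a NULL-terminated buffer and reads "delimiter, apostrophe, or period" as "any character that is not an ASCII letter, digit, or hyphen".

```python
def is_boundary(ch):
    # Delimiters are all characters except letters, digits, apostrophes,
    # hyphens, and periods; adding apostrophe and period back in leaves
    # only letters, digits, and hyphens as non-boundary characters.
    return ch != "-" and not (ch.isascii() and ch.isalnum())


def matching_entries(buf, head, entries):
    """Return the entries that match the buffer starting at index head."""
    hits = []
    for entry in entries:
        end = head + len(entry)
        # The whole entry must match, and a boundary character must follow.
        if buf[head:end] == entry and end < len(buf) and is_boundary(buf[end]):
            hits.append(entry)
    return hits


buf = "get upon it\0"  # illustrative buffer, NULL-terminated
hits = matching_entries(buf, 0, ["get", "get up", "get upon", "go"])
```

Here "get up" is rejected because the character after it is the letter "o", whereas "get" and "get upon" are each followed by a blank.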

The retrieved dictionary information is stored in the searched dictionary information buffer 120 of the processing unit 106. Along with it, the start and end positions of the matched string are recorded; these identify character positions in the input buffer, counted from its start. The searched dictionary information buffer 120 has a field for activity information, which indicates whether the retrieved dictionary information is valid for later processing; at this stage it is set to "1" for every entry.

From then on the head pointer is updated at each dictionary lookup: reading the string from left to right, it is set to the character immediately after the next delimiter that follows its current position. Dictionary searches are carried out one after another in this way. In the example above, the pointer indicates the first character of each lookup unit in turn: first the "I" of "I", then the "w" of "will", then the "g" of "get", and so on. When the head pointer passes the NULL code, the final position is deemed to have been reached (202). FIG. 9 shows an example of the dictionary information retrieved in this way for the English input string of the example.
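The movement of the head pointer can be sketched as a simple loop. The function names and the sample buffer are hypothetical, indices are 0-based, and the buffer is assumed to be already shaped and NULL-terminated.

```python
def is_delimiter(buf, i):
    ch = buf[i]
    if ch == "'":
        return i > 0 and buf[i - 1] == " "
    return ch not in "-." and not (ch.isascii() and ch.isalnum())


def lookup_unit_starts(buf):
    """Return the successive head-pointer positions for a shaped buffer."""
    starts = []
    head = 0  # the head pointer starts at the beginning of the buffer
    while head < len(buf) and buf[head] != "\0":
        starts.append(head)
        i = head + 1
        # Find the next delimiter after the current head position.
        while i < len(buf) and not is_delimiter(buf, i):
            i += 1
        head = i + 1  # move to the character just after that delimiter
    return starts


buf = "I will get up\0"
first_chars = [buf[s] for s in lookup_unit_starts(buf)]
```

The loop stops once the pointer has moved past the NULL code, which corresponds to the final-position test (202).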
The first character of the dictionary lookup unit is designated in this order. When the leading pointer passes the NULL code, it is determined to be the final position (202). FIG. 9 shows an example of the dictionary information found in this way for the example of the English input character string.

With reference to FIGS. 8A to 8D, the highest-priority flag conflict resolution process 204, which the conflict resolution rule processing unit 110 performs by referring to the highest-priority flag conflict resolution rule file 36, is described next. The flows of FIGS. 8A and 8B handle the case where the positions of words whose highest-priority flag is set overlap. The flows of FIGS. 8C and 8D eliminate analysis units, that is, elements, on the basis of the highest-priority flag by setting their activity information to "0". In these flow charts, the symbol "<=" denotes assignment, "→" denotes reference, and "p→x" denotes the content of field x of the entry indicated by pointer p.

First, pairs of words whose highest-priority flag is "1" and whose positions in the sentence overlap are detected (steps 220 to 223). Next, the highest-priority flag conflict resolution rules are applied to each detected pair, and the valid member is selected (steps 224 to 235).
Next, the highest priority flag elimination rule is applied to each of the detected pairs, and the valid one is selected (steps 224 to 235).

In the example above, as shown in FIG. 9, the highest-priority flag is "1" both for "get up", with start position "8" and end position "13", and for "up to", with start position "12" and end position "16", and their character positions overlap. Rule (1) is applied first: the part of speech of the entry at the save pointer psave and that of the entry at pointer p are examined to determine whether each is a verb (224). In this example "get up" is a verb, so the combination "get up" is selected.

When rule (1) does not decide the matter, rule (2) is applied (228): the string length lens of the entry at the save pointer psave is compared with the string length len of the entry at pointer p. When rule (2) does not decide it either, rule (3) is applied (229): the start position starts of the entry at psave is compared with the start position start of the entry at pointer p.

When the conflict resolution rules are applied in the order (1) to (3) and one of them is satisfied, the activity information of the entry that did not satisfy it, that is, the invalid entry, is set to "0" (232), while that of the other, valid entry is left at "1" (231). The rules are applied to each entry in turn up to the final position while the pointer p is advanced (234, 235), so that only valid entries keep an activity value of "1". FIG. 10 shows the resulting state for the example above; for instance, the activity information of the entry "up to" has been set to "0".
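The first pass (FIGS. 8A and 8B) can be approximated by a pairwise loop over the flagged entries. This is a simplified sketch, not the pointer-based flow of the figures: the table layout, the helper names, and the part-of-speech labels are hypothetical, and the positions are the 1-based positions quoted above for "get up" (8 to 13) and "up to" (12 to 16).

```python
def overlaps(a, b):
    return a["start"] <= b["end"] and b["start"] <= a["end"]


def first_is_preferred(a, b):
    a_verb, b_verb = a["pos"] == "verb", b["pos"] == "verb"
    if a_verb != b_verb:                      # rule (1): verb
        return a_verb
    len_a = a["end"] - a["start"] + 1
    len_b = b["end"] - b["start"] + 1
    if len_a != len_b:                        # rule (2): more characters
        return len_a > len_b
    return a["start"] <= b["start"]           # rule (3): earlier position


def resolve_flag_conflicts(table):
    """Deactivate the loser of each overlapping pair of flagged entries."""
    flagged = [e for e in table if e["flag"] == 1]
    for i, a in enumerate(flagged):
        for b in flagged[i + 1:]:
            if a["active"] and b["active"] and overlaps(a, b):
                loser = b if first_is_preferred(a, b) else a
                loser["active"] = 0
    return table


def entry(text, pos, start, flag):
    return {"text": text, "pos": pos, "start": start,
            "end": start + len(text) - 1, "flag": flag, "active": 1}


table = [entry("get", "verb", 8, 0), entry("get up", "verb", 8, 1),
         entry("up", "adv", 12, 0), entry("up to", "prep", 12, 1),
         entry("to", "prep", 15, 0)]
resolve_flag_conflicts(table)
inactive = [e["text"] for e in table if e["active"] == 0]
```

After this pass only "up to" is inactive; the component words "get" and "up" are still active and are handled by the second pass.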

Next, entries whose positions overlap, even partially, with a combination for which both the activity information and the highest-priority flag are "1" are detected (236 to 241), and their activity information is set to "0" (242, 249). This is done for each entry in turn up to the final position while the pointer p is advanced (243, 248), setting the activity information of invalid entries to "0". As a result, for the entry "get up", for example, the activity information of "get" and "up" is set to "0" (FIG. 10). For the entries "white", "white house", and "house", the positions overlap but every highest-priority flag is "0", so their activity information remains "1".
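The second pass (FIGS. 8C and 8D) can be sketched in the same simplified style. The table layout and function names are hypothetical, and the positions given for "white", "white house", and "house" assume an illustrative sentence "I will get up to the white house", which is not quoted in the text.

```python
def overlaps(a, b):
    return a["start"] <= b["end"] and b["start"] <= a["end"]


def suppress_overlapped(table):
    """Deactivate every entry that overlaps an active, flagged unit."""
    units = [e for e in table if e["active"] == 1 and e["flag"] == 1]
    for e in table:
        if any(u is not e and overlaps(u, e) for u in units):
            e["active"] = 0
    return table


def row(text, start, flag, active):
    return {"text": text, "start": start, "end": start + len(text) - 1,
            "flag": flag, "active": active}


# State after the first pass: "up to" has already lost to "get up".
table = [row("get", 8, 0, 1), row("get up", 8, 1, 1), row("up", 12, 0, 1),
         row("up to", 12, 1, 0), row("to", 15, 0, 1),
         row("white", 22, 0, 1), row("white house", 22, 0, 1),
         row("house", 28, 0, 1)]
suppress_overlapped(table)
still_active = [e["text"] for e in table if e["active"] == 1]
```

The three "white house" entries overlap one another but none carries a flag of "1", so all of them stay active and are left for syntax analysis to disambiguate.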

When processing has reached the point just before the final position [NULL], the contents of the input buffer of the input interface 104 and of the searched dictionary information buffer 120 are output from the output interface 114 to the syntax analysis I unit 20. From the searched dictionary information buffer 120, only entries whose activity information is "1" are output. For example, the contents of the input buffer may be written to the syntax analysis input file 116, and the contents of the searched dictionary information buffer 120 to the syntax analysis dictionary information file 118. Because the activity information and the highest-priority flag are output as well, the syntax analysis dictionary information file 118 has the same structure as the searched dictionary information buffer 120. The device may also be configured not to output the activity information and the highest-priority flag.

EFFECT According to the present invention, in the morphological analysis of a sentence, all words, collocations and idioms that start at each cut-out position of the dictionary lookup unit are retrieved; however, for a collocation or idiom that is recognized as a single unit according to the highest priority flag contained in the dictionary information, the dictionary information obtained for its individual constituent words is discarded. In other words, the strength of the connection between the words in the sentence is judged: a collocation or idiom whose words are strongly connected is presumed to be used as an idiom in that sentence, and otherwise the possibility that each word is used on its own is also considered in parallel. Morphological analysis is thereby performed efficiently.
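As a rough illustration of the retrieval step summarized here, the sketch below collects every headword that starts at each word position and carries the highest priority flag along with it. The toy dictionary, the `lookup_all` function, the `max_len` limit and the word-based positions are all assumptions; the embodiment's actual dictionary file and cut-out procedure are not reproduced.

```python
def lookup_all(words, dictionary, max_len=3):
    """Collect every dictionary headword that starts at each word position."""
    found = []
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(words):
                break  # the candidate would run past the end of the sentence
            key = " ".join(words[start:end])
            if key in dictionary:
                # (start, end, headword, highest priority flag)
                found.append((start, end, key, dictionary[key]))
    return found


# Toy dictionary: headword -> highest priority flag (values are assumptions).
dictionary = {
    "get": 0, "get up": 1, "up": 0, "near": 0,
    "the": 0, "white": 0, "white house": 0, "house": 0,
}
words = "get up near the white house".split()
found = lookup_all(words, dictionary)
```

Every candidate is kept at this stage. Only the flag value attached to "get up" later allows its constituent words to be dropped, whereas "white house" continues to compete with "white" and "house".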

[Brief description of drawings]

FIG. 1 is a functional block diagram showing a detailed configuration example of the morphological analysis unit of the embodiment shown in FIG. 2.
FIG. 2 is a functional block diagram showing the overall configuration of an embodiment in which the language analysis device according to the present invention is applied to an English-Japanese automatic translation device.
FIG. 3 is an explanatory diagram showing a configuration example of the dictionary file with highest priority flag in the embodiment shown in FIG. 1.
FIG. 4 is a flow chart showing an example of the morphological analysis processing in the same embodiment.
FIG. 5 is a flow chart showing an example of the input processing in the morphological analysis processing.
FIG. 6 is an explanatory diagram showing an example of the shaping of an input character string in the same embodiment.
FIG. 7 is an explanatory diagram showing an example of dictionary search in the same embodiment.
FIGS. 8A to 8D are flow charts showing an example of the processing that resolves contradictions of the highest priority flag in morphological analysis.
FIG. 9 is an explanatory diagram showing an example of the contents of the searched dictionary information buffer after dictionary lookup.
FIG. 10 is an explanatory diagram showing an example of the contents of the searched dictionary information buffer after the highest priority flag contradiction resolution processing has been performed.

Explanation of reference numerals of main parts
16: morphological analysis unit
18: dictionary file with highest priority flag
36: highest priority flag contradiction resolution rule file
106: processing unit
108: dictionary search unit
110: contradiction resolution rule processing unit
112: control unit

Claims

[Claims]

1. A language analysis device comprising: dictionary means for storing dictionary data containing morpheme data for words, collocations and idioms; and analysis means for performing morphological analysis on an input sentence by referring to the dictionary means, wherein the dictionary means includes a priority flag for establishing a collocation or idiom as a collocation or idiom, and the analysis means refers to the dictionary means for each word included in the input sentence and, when a plurality of dictionary data are retrieved for one word in combination with other words, judges whether the priority flag is set and, when the priority flag is set, presumes usage as a collocation or idiom, discards the dictionary data for each word constituting the collocation or idiom, and performs morphological analysis on it as a collocation or idiom.

2. The language analysis device according to claim 1, wherein, when priority flags are set for a plurality of combinations of a word retrieved from the dictionary means with other words and a contradiction arises, the analysis means resolves the contradiction by giving priority to one of the plurality of combinations according to a predetermined contradiction resolution rule.