JP2007052700A

JP2007052700A - Machine translation apparatus and machine translation program

Info

Publication number: JP2007052700A
Application number: JP2005238416A
Authority: JP
Inventors: Masaki Shindo; 正樹新藤; Takashi Shibuya; 貴志澁谷; Etsuo Ito; 悦雄伊藤; Naoko Takigawa; 直子瀧川; Enko Sai; 遠航蔡; Yumiko Yoshimura; 裕美子吉村
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2005-08-19
Filing date: 2005-08-19
Publication date: 2007-03-01
Anticipated expiration: 2025-08-19
Also published as: JP4886244B2

Abstract

PROBLEM TO BE SOLVED: To reduce the occurrence of unknown words and non-sentences when translating an original text into another language, and to improve translation accuracy and work efficiency, by correcting words that contain missing or mistyped characters when the input text to be translated is acquired, based on the characteristics of the source from which the input text is acquired.

SOLUTION: A translation original text estimation apparatus is provided with: a morpheme analysis part 14 that divides a character string acquired from a web page into words, determines whether the first character of the first word is an uppercase or a lowercase letter, and, when the first character is a lowercase letter, generates a plurality of first-word replacement words by adding an uppercase letter of the alphabet to the head of the first word; a syntax analysis part 15 that selects, from the generated replacement words, those that exist in a dictionary part 10 as replacement candidates and, based on the dictionary data of the dictionary part 10, performs syntax analysis of one or more sentences in which a replacement candidate is substituted for the first word of the original; and a syntax analysis control part 16 that selects a grammatically correct original-text candidate from the candidates and outputs the selected candidate to a translation part 18.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a machine translation apparatus and a machine translation program that convert text in a first language, such as English, into text in a second language, such as Japanese.

There are machine translation systems in which a novel, newspaper article, or similar document written in a language such as English is read with a scanner, the image data is input to a computer, character recognition software installed on the computer converts the image into a character string of text data (character codes), and translation software then translates that character string into text in another language, such as Japanese.

On the front page of documents such as novels and newspaper articles, the first character of a sentence often differs from the characters that follow: it may be written as a large character spanning several lines, or a decorative character (image data) may be pasted in its place.

A human reader can read the content of such a document as a single sentence. When character recognition software on a computer reads it, however, the first character differs in size from the body text, so the software may fail to connect it to the body text, or it may drop only the decorative character and analyze the result as a sentence containing an unread word. The correct character string of the original text therefore cannot be passed to the translation software, and the desired translation result may not be obtained.

Machine translation systems are also often equipped with a function for translating an English Web page displayed on a computer screen into a Japanese page. The first character (leading character) of an English Web page may be created as something other than text, for example as an image (image data), or only the leading character may be written in a font that differs greatly from the other characters.
In this case too, as when a newspaper article is read with the scanner described above, the character recognition software cannot interpret the leading character of the first word and the characters that follow it as one continuous word. It may then output a word that does not exist in the translation dictionary, that is, an unknown word or unread word, fail in syntax analysis, or output an incorrect translation result.

Four specific cases (cases (1) to (4) below) are described with reference to FIG. 20.
Case (1) is a case where the first character "W" of the original text is a decorative character. The body text of this sentence becomes "orld is too big.", so the character recognition software outputs the sentence (character string) "orld is too big." to the translation software. When the translation software translates this input into Japanese, "orld" is extracted as an unknown word and the translation result is 「orldは大きすぎます。」 ("orld is too big."), which differs in meaning from the original text and is not a correct translation.

Case (2) is a case where the first character "T" of the original text is a decorative character. The body text of this sentence becomes "his dish is very delicious." If this is translated as is, the translation result is 「彼の料理は非常においしい。」 ("His dish is very delicious."). The analysis itself succeeds, but the result differs from the translation expected for the original text, 「この料理は非常においしい。」 ("This dish is very delicious."). Because the result still makes sense, the user cannot tell that the translation is wrong without consulting the original text.

Case (3) is a case where the first character "T" of the original text is a decorative character. The body text of this sentence becomes "he problems of the last few days have been fixed." If this is translated as is, the translation result is 「彼、過去数時期の問題が解決されました。」 (roughly, "He, the problems of the past several periods have been resolved."): syntax analysis fails and the sentence cannot be translated correctly.

Case (4) is a case where the first character "T" of the original text is set in an extra-large font. Here the translation software cannot determine whether "T" is the first character of "he problems of the last" or the first character of "few days have been fixed.", so an accurate translation result cannot be obtained if translation is executed as is.

As in cases (1) to (4) above, if the translation software translates, without modification, a character recognition result from a scanned image or a text string obtained by copying a Web page, it may be unable to output the translation result that the user expects.

As prior art of this kind, a technique has been disclosed for a machine translation apparatus that detects words not present in a dictionary (unknown words): unknown words are detected during machine translation, and the user can freely specify the display format of the detection result (the unknown word list), which improves work efficiency and the effectiveness of processing (see, for example, Patent Document 1).

Alternatively, instead of relying on automatic character recognition such as the scanner described above, a person can read the content of a document such as a novel or newspaper article, or of a Web page, and enter it directly from a keyboard.
In this case, because a person reads and enters the content of the article or Web page, extremely large leading characters and decorative characters can be judged and entered correctly. However, because a person enters the text, input mistakes such as key misoperation become more likely.
Patent Document 1: JP-A-7-319878

Thus, with the above prior art, although the user can freely specify the display format of the unknown word list, unknown words are only detected at translation time, so the number of unknown words itself does not decrease. A character string containing the specified unknown words is still what gets translated, so no further improvement in translation accuracy can be expected.

Furthermore, if a person makes input mistakes when reading the original text and keying it in, the number of unknown words in the character string to be translated increases, rework is needed to correct the characters, and work efficiency drops.

The present invention has been made to solve these problems. Its object is to provide a machine translation apparatus and a machine translation program that, when machine-translating a character string obtained by key input of an original text or by capturing it from a Web page or scanner, reduce the detection of words as unknown words or non-sentences, and thereby improve both translation accuracy and work efficiency.

To achieve the above object, a machine translation apparatus of the present invention is a machine translation apparatus including a translation unit that translates an original text created in a first language into text in a second language based on dictionary data registered in advance, characterized by comprising: data receiving means for receiving a character string created in the first language; dividing means for dividing the character string received by the data receiving means into words based on the dictionary data; unknown word extracting means for extracting, as unknown words, those words divided by the dividing means that are not registered in the dictionary data; replacement candidate character generating means for generating replacement candidate characters by adding or replacing characters in an unknown word extracted by the unknown word extracting means according to a predetermined rule; and character string output means for outputting to the translation unit, as the character string to be translated, a sentence in which the unknown word is replaced with the replacement candidate characters generated by the replacement candidate character generating means.
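The flow in this paragraph (divide into words, extract unknown words, generate replacement candidates) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: the tiny `DICTIONARY`, the whitespace-based `split_words`, and the concrete rule in `replacement_candidates` (prepend one letter or replace one letter, then keep only dictionary hits) are hypothetical stand-ins, since the text says only that characters are added or replaced "according to a predetermined rule".

```python
import string

# Toy dictionary (assumption); a real system would use the registered dictionary data.
DICTIONARY = {"world", "is", "too", "big", "this", "dish", "very", "delicious"}


def split_words(text):
    """Split on whitespace and strip punctuation.

    Stand-in for the dictionary-based division into words.
    """
    return [w.strip(".,!?") for w in text.split()]


def unknown_words(words, dictionary=DICTIONARY):
    """Return the words that are not registered in the dictionary."""
    return [w for w in words if w.lower() not in dictionary]


def replacement_candidates(word, dictionary=DICTIONARY):
    """Hypothetical rule: prepend one letter, or replace one letter,
    and keep only results that exist in the dictionary."""
    letters = string.ascii_lowercase
    generated = {c + word for c in letters}  # character added at the head
    for i in range(len(word)):  # one character replaced
        generated |= {word[:i] + c + word[i + 1:] for c in letters}
    return sorted(g for g in generated if g in dictionary and g != word)
```

With the case (1) input, `unknown_words(split_words("orld is too big."))` yields `["orld"]`, and `replacement_candidates("orld")` yields `["world"]` under this toy dictionary.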

A machine translation apparatus of the present invention is a machine translation apparatus including a translation unit that translates an original text created in a first language into text in a second language based on a dictionary unit in which dictionary data including characters, words, dictionaries, translated words, and parts of speech is stored in advance, characterized by comprising: data receiving means for receiving the original text in the first language as a character string that can be processed by a computer; dividing means for dividing the character string received by the data receiving means into words by performing morphological analysis based on the dictionary data stored in the dictionary unit; word generating means for determining whether the first character of the first word among the words divided by the dividing means is lowercase or uppercase and, when the first character is lowercase, generating a plurality of new words by adding an uppercase letter of the alphabet to the head of the first word; replacement candidate word extracting means for extracting, from the words generated by the word generating means, those that exist in the dictionary data; and character string output means for outputting to the translation unit, as the character string to be translated, a sentence in which the first word of the original text is replaced with a replacement candidate word extracted by the replacement candidate word extracting means.

A machine translation apparatus of the present invention is, in a machine translation system that translates an original text created in a first language into text in a second language, characterized by comprising: data receiving means for receiving a character string of the original text entered by key input; dividing means for dividing the character string input through the data receiving means into words by performing morphological analysis based on preset dictionary data; unknown word extracting means for extracting, as unknown words, those words divided by the dividing means that are not registered in the dictionary data; replacement candidate word generating means for extracting replacement candidate words, namely new words generated by replacing characters of an unknown word extracted by the unknown word extracting means according to a predetermined rule that yield a grammatically correct sentence when substituted for the word in the original text; and character string output means for replacing the word at the corresponding position in the character string of the original text with a replacement candidate word generated by the replacement candidate word generating means and outputting the result to the translation unit as the character string to be translated.

A machine translation program of the present invention is a program that causes a machine translation apparatus, which includes a translation unit that translates an original text created in a first language into text in a second language based on dictionary data stored in advance, to execute processing, characterized by causing the machine translation apparatus to function as: data receiving means for receiving a character string created in the first language; dividing means for dividing the character string received by the data receiving means into words based on the dictionary data; unknown word extracting means for extracting, as unknown words, those words divided by the dividing means that are not registered in the dictionary data; replacement candidate character generating means for generating replacement candidate characters by adding or replacing characters in an unknown word extracted by the unknown word extracting means according to a predetermined rule; and character string output means for outputting to the translation unit, as the character string to be translated, a sentence in which the unknown word is replaced with the replacement candidate characters generated by the replacement candidate character generating means.

A machine translation program of the present invention is a program that causes a machine translation apparatus, which includes a translation unit that translates an original text created in a first language into text in a second language based on a dictionary unit in which dictionary data including characters, words, dictionaries, translated words, and parts of speech is stored in advance, to execute processing, characterized by causing the machine translation apparatus to function as: data receiving means for receiving the original text in the first language as a character string that can be processed by a computer; dividing means for dividing the character string received by the data receiving means into words by performing morphological analysis based on the dictionary data stored in the dictionary unit; word generating means for determining whether the first character of the first word among the words divided by the dividing means is lowercase or uppercase and, when the first character is lowercase, generating a plurality of new words by adding an uppercase letter of the alphabet to the head of the first word; replacement candidate word extracting means for extracting, from the words generated by the word generating means, those that exist in the dictionary data; and character string output means for outputting to the translation unit, as the character string to be translated, a sentence in which the first word of the original text is replaced with a replacement candidate word extracted by the replacement candidate word extracting means.

A machine translation program of the present invention is a program that causes a machine translation apparatus, which translates an original text created in a first language into text in a second language, to execute processing, characterized by causing the machine translation apparatus to function as: data receiving means for receiving a character string of the original text entered by key input; dividing means for dividing the character string input through the data receiving means into words by performing morphological analysis based on preset dictionary data; unknown word extracting means for extracting, as unknown words, those words divided by the dividing means that are not registered in the dictionary data; replacement candidate word generating means for extracting replacement candidate words, namely new words generated by replacing characters of an unknown word extracted by the unknown word extracting means according to a predetermined rule that yield a grammatically correct sentence when substituted for the word in the original text; and character string output means for replacing the word at the corresponding position in the character string of the original text with a replacement candidate word generated by the replacement candidate word generating means and outputting the result to the translation unit as the character string to be translated.

In the present invention, characters that were lost or mistyped when the original text was input are correctly restored before machine translation is performed. As a result, when text in the first language is translated into text in the second language, the number of words and phrases in the first-language text that become unknown words or non-sentences can be kept as small as possible.

That is, words and phrases damaged when the original text was input are restored as known words that exist in the dictionary, so machine translation is performed with fewer unknown words and non-sentences. This improves translation accuracy and reduces the rework caused by mistakes made when inputting the original text.

As described above, according to the present invention, when a character string obtained by key input of an original text or by capturing it from a Web page or scanner is machine-translated, the detection of words as unknown words or non-sentences is reduced, and both translation accuracy and work efficiency can be improved.

Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a diagram showing the configuration of a translation original text estimation apparatus according to a first embodiment of the machine translation apparatus and machine translation program of the present invention. FIG. 2 is a diagram showing an example in which, in the translation original text estimation apparatus of FIG. 1, each word generated by prepending an uppercase letter of the alphabet in turn to the first word is given a flag indicating whether the word exists in the dictionary.

As shown in FIG. 1, the translation original text estimation apparatus of the first embodiment includes a dictionary unit 10, an input unit 11, an original text estimation unit 12, an original text estimation determination unit 13, a statistical data storage unit 17, a translation unit 18, and an output unit 19. The original text estimation unit 12 includes a morpheme analysis unit 14 that performs morphological analysis on the input original text with reference to the dictionary unit 10, a syntax analysis unit 15 that performs syntax analysis on the result of the morphological analysis by the morpheme analysis unit 14 with reference to the dictionary unit 10, a syntax analysis control unit 16 that determines, with reference to the dictionary unit 10, whether the result of the syntax analysis by the syntax analysis unit 15 is appropriate, and so on.

In other words, this translation original text estimation apparatus includes the translation unit 18, which translates an original text created in the first language (English) into text in the second language (Japanese) based on the dictionary unit 10, in which dictionary data including characters, words, dictionaries, translated words, and parts of speech is stored in advance.

The input unit 11 is realized by, for example: means for communicating over a LAN or the Internet together with software such as a web browser; a keyboard together with character input software that accepts commands entered by key operations on the keyboard and converts them into character codes; and an image reading device (hereinafter referred to as a scanner) that scans the surface of a form with a CCD line sensor or the like to capture the text written on the form as an image, together with software that performs character recognition on the document image data obtained by the scanner.

The input unit 11 receives sentences and text in the first language to be translated, input from the Internet, the keyboard, or the like. It functions as data receiving means that receives the input of a sequence of text data (hereinafter referred to as a character string) such as first-language text shown on a form or Web page, for example an English sentence (hereinafter referred to as the original text).

The input unit 11 receives the input of a character string recognized from image data obtained by reading the original text with the scanner. It also receives the input of a character string obtained by copying the text of a Web page. That is, the input unit 11 receives a computer-processable character string obtained either by reading the original text as an image and then performing character recognition, or by key input of the original text.

The dictionary unit 10 is realized by a memory or a hard disk device. The dictionary unit 10 stores dictionary data such as translated words, parts of speech, and a semantic analysis dictionary associated with headwords (words and characters). The parts of speech are, for example, noun, verb, adjective, and adverb. The dictionary data also includes the character code of each character that makes up a word. The translated words in the dictionary data are data for converting between different languages in one direction or both directions, for example Japanese to English or English to Japanese. The dictionary data also includes character tables for each language, such as the English alphabet table and the Japanese kanji, hiragana, and katakana tables.

The original text estimation unit 12 performs morphological analysis on the sentence (text data) input from the input unit 11, or on the sentence obtained by character recognition of image data from the scanner, and divides the sentence into a plurality of character strings. It then determines whether the first character of the first of these character strings is lowercase and, if it is, prepends each uppercase letter of the alphabet from A to Z in turn to that character string, generating 26 new character strings.

For each of the 26 new character strings, the original text estimation unit 12 refers to the dictionary unit 10 to determine whether the string is an unknown word and attaches a flag to it: a flag such as ○ if it is not an unknown word, and a flag such as × if it is.
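The generation and flagging of the 26 candidates can be illustrated with a minimal sketch. The function name `flag_first_word_candidates` and the toy `DICTIONARY` are assumptions made for illustration; the lookup is done case-insensitively here, which is also an assumption rather than something the text specifies.

```python
import string

# Toy dictionary (assumption); stands in for the dictionary unit 10.
DICTIONARY = {"the", "she", "he", "his", "this", "world"}


def flag_first_word_candidates(first_word, dictionary=DICTIONARY):
    """Prepend A..Z to a first word that starts with a lowercase letter
    and flag each result as known (○) or unknown (×)."""
    if not first_word or not first_word[0].islower():
        return []  # an uppercase first character needs no restoration
    flagged = []
    for letter in string.ascii_uppercase:
        candidate = letter + first_word
        known = candidate.lower() in dictionary
        flagged.append((candidate, "○" if known else "×"))
    return flagged
```

For the first word "he" from case (3), this produces 26 flagged strings, of which only "She" and "The" carry the ○ flag under the toy dictionary; those two then serve as replacement candidates.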

The original text estimation unit 12 treats each character string flagged with ○ as a replacement candidate, substitutes it for the first character string of the input original text, performs syntax analysis to check whether the resulting sentence makes sense, and selects the sentences that do as original-text candidates.

For the one or more original-text candidates selected (extracted) by the syntax analysis unit 15 of the original text estimation unit 12, the original text estimation determination unit 13 uses statistical data set (stored) in advance in the statistical data storage unit 17 to choose the candidate whose first word is appropriate as the first word of a sentence, and outputs it to the translation unit 18 as the character string to be translated.

In other words, the original text estimation determination unit 13 functions as character string output means that outputs to the translation unit 18, as the character string to be translated, a sentence in which the first word of the original text is replaced with a replacement candidate word extracted by the original text estimation unit 12.

When a plurality of replacement candidate words are extracted, the original text estimation determination unit 13 functions as first sentence candidate selection means that selects, based on the dictionary data, grammatically correct sentence candidates from among the sentences in which the first word of the original text has been replaced.

The original text estimation determination unit 13 also functions as second sentence candidate selection means (original text estimation determination unit): from the one or more grammatically correct sentence candidates selected, it uses the statistical data set in advance in the statistical data storage unit 17 to select the candidate whose first word is appropriate as a sentence-initial word, and outputs it to the translation unit 18 as the character string to be translated.

The morphological analysis unit 14 of the original text estimation unit 12 functions as dividing means that divides the character string received by the input unit 11 into words by performing morphological analysis based on the data in the dictionary unit 10. The morphological analysis unit 14 also determines, based on the dictionary data in the dictionary unit 10 (the English character table, that is, the alphabet), whether the first character of the first of the divided words is a lowercase or an uppercase letter. When the first character is lowercase, it functions as word generation means that generates a plurality of new words for replacing the first word by prepending an uppercase letter to the first word.

The morphological analysis unit 14 also functions as first-word replacement candidate selection (extraction) means that, based on the dictionary data in the dictionary unit 10, selects (extracts) from the generated first-word replacement words those that exist in the dictionary unit 10.

The syntax analysis unit 15 functions as original text candidate selection means: for each sentence in which a first-word replacement candidate selected (extracted) by the morphological analysis unit 14 has replaced the first word of the original text, it performs syntax analysis based on the dictionary data in the dictionary unit 10 and selects the grammatically correct original text candidates.

The syntax analysis control unit 16 functions as sentence analysis means that, for each sentence in which the first word of the original text has been replaced, selects grammatically correct sentence candidates by syntax analysis based on the dictionary data, or semantically correct sentence candidates by semantic analysis.

The statistical data storage unit 17 stores statistical data that ranks the words in the dictionary unit 10 by frequency of appearance, derived in advance by compiling statistics for each part of speech. The statistical data comprises at least one of, for example, word frequency data within sentences, sentence-initial word frequency data, and sentence-initial character frequency data.
That is, the statistical data storage unit 17 stores, for each part of speech, data on words that frequently appear at the start of a sentence and data on characters that frequently appear at the start of a sentence. For example, for the proper nouns Sato and Suzuki, Sato is ranked first and Suzuki second.
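The ranking described above can be illustrated with a minimal Python sketch. The table contents, the name `HEAD_WORD_RANK`, and the function `pick_by_statistics` are hypothetical and chosen only for illustration; the patent does not specify how the statistical data is stored.

```python
# Hypothetical statistical data: sentence-initial word ranks per part of speech.
# A lower rank means the word appears more often at the start of a sentence.
HEAD_WORD_RANK = {
    "proper noun": {"Sato": 1, "Suzuki": 2},
}


def pick_by_statistics(candidates, pos, ranks=HEAD_WORD_RANK):
    """Return the candidate with the best (lowest) rank for the given part
    of speech, or None when no candidate appears in the statistics."""
    table = ranks.get(pos, {})
    ranked = [word for word in candidates if word in table]
    if not ranked:
        return None
    return min(ranked, key=lambda word: table[word])
```

With this toy table, `pick_by_statistics(["Suzuki", "Sato"], "proper noun")` picks "Sato", matching the ordering given in the text.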

In accordance with the translation direction (for example, English → Japanese) specified from the input unit 11, the translation unit 18 uses the dictionary data in the dictionary unit 10, including the relevant character table, to translate a sentence in the first language (here, English) into a sentence in the second language (for example, Japanese).

The dictionary unit 10, the input unit 11, the original text estimation unit 12 (morphological analysis unit 14, syntax analysis unit 15, and syntax analysis control unit 16), the original text estimation determination unit 13, the statistical data storage unit 17, the translation unit 18, the output unit 19, and so on are realized by the cooperation of hardware (a computer's keyboard, scanner, CPU, memory, hard disk device, monitor, and the like) and software (the operating system and the translation original text estimation program installed on the hard disk device).

The operation of the translation original text estimation apparatus according to the first embodiment is described below with reference to FIGS. 2 to 6.
FIG. 2 shows the word lists created by the dictionary lookup processing performed by the apparatus of this embodiment; FIG. 3 shows the dictionary data in the dictionary unit 10 of FIG. 1; FIG. 4 is a flowchart of the overall processing of the apparatus; FIG. 5 is a flowchart of the dictionary lookup word estimation processing; and FIG. 6 is a flowchart of the syntax analysis original text estimation processing.

In describing this embodiment, the input first language is English and the output second language is Japanese. That is, the example covers the case where the original text is English and is translated into Japanese, but the present invention is not limited to these languages.

In the translation original text estimation apparatus of the first embodiment, in case 1 described in the prior art section above, "orld is too big." is input to the PC 1 from the input unit 11. This is referred to as input original text A.

In the PC 1, when the original text estimation unit 12 acquires the input original text A from the input unit 11 (S601), it performs morphological analysis on it with reference to the dictionary unit 10, divides it into words, and determines whether the language of the input original text A is one for which estimation is possible (S602).

If, as in this example, the language of the input original text A is English and English text data exists in the dictionary unit 10, the original text estimation unit 12 next determines whether it is necessary to estimate the original text, that is, the text from which the input original text A was derived (S603).

When the language of the input original text A is English, as in this example, whether the original text needs to be estimated is determined by whether the first character is uppercase or lowercase. When the input original text A spans two or more lines, the determination uses character position information.

For example, when the input language is English, the first character of a sentence is normally an uppercase letter. Therefore, when the first character of the input data is uppercase, the original text estimation unit 12 judges that the correct original text has been input and that there is no need to estimate the text underlying the input original text A. When the first character is lowercase, it judges that the original text needs to be estimated.

When the input original text A spans two or more lines and character position information is used, consider the case where the font size of the first character on the first line is twice that of the other characters: the position of the second character on the first line then differs greatly from the position of the second character on the second line. Accordingly, when these two positions differ greatly, the original text estimation unit 12 judges that the original text needs to be estimated.
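The two checks described above (first-character case, and second-character position for multi-line input) might be sketched as follows. This is only an illustration under stated assumptions: the function name `needs_estimation`, the `second_char_x` parameter, the `tolerance` threshold, and the way the two checks are combined are not given in the source.

```python
def needs_estimation(text, second_char_x=None, tolerance=2.0):
    """Decide whether the original text behind `text` should be estimated.

    text          -- input original text A
    second_char_x -- optional horizontal positions of the second character
                     on each line (hypothetical layout information)
    tolerance     -- assumed threshold for "differs greatly"
    """
    stripped = text.lstrip()
    # Case check: a lowercase first letter suggests a lost initial character.
    if stripped and stripped[0].isalpha() and stripped[0].islower():
        return True
    # Position check: an oversized first character on line 1 shifts the
    # second character of line 1 relative to the second character of line 2.
    if second_char_x is not None and len(second_char_x) >= 2:
        return abs(second_char_x[0] - second_char_x[1]) > tolerance
    return False
```

For the example in the text, `needs_estimation("orld is too big.")` is true because the first letter is lowercase, while `needs_estimation("World is too big.")` is false.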

In this way, the original text estimation determination unit 13 determines whether to estimate the original text. In example 1, the first character is lowercase, so the original text must be estimated, and the word estimation step of the dictionary lookup original text estimation processing (the dictionary lookup word estimation processing) is performed first (S604).

In the dictionary lookup word estimation processing, as shown in FIG. 5, the original text estimation determination unit 13 first splits the input original text A at the first delimiter position (delimiter character) to obtain the character string "orld" (S701). This word "orld" is referred to as word B.

The original text estimation determination unit 13 then varies the guess character x to be prepended to word B from 'A' to 'Z' (S702) and appends word B after the guess character x, in other words prepends x to word B, giving "Aorld" for x = 'A' (S703). Each such string is a guess word Cn (n ranges over A to Z). As shown in FIG. 2(a), 26 guess words Cn are generated for the word B "orld" and temporarily stored in memory.

The original text estimation determination unit 13 then looks up the temporarily stored guess words, in order starting from the first guess word CA, in the translation dictionary of the dictionary unit 10 (S704) and determines whether the corresponding word (text data) exists in the dictionary unit 10 (S705).

If the word exists in the dictionary unit 10, the original text estimation determination unit 13 keeps that guess word Cn (for example, "World") in memory (S706).
After guess word CA, the lookup is performed for guess word CB, with the guess character x set to 'B', and so on until the guess character reaches 'Z'. Guess words Cn registered in the translation dictionary are kept in memory in turn, as above. Words that do not exist in the dictionary unit 10 are deleted from memory.

As shown in FIG. 2(b), when word B is, for example, "he", the guess words Cn "Ahe", "Bhe", ..., "Zhe" are generated. Words found in the translation dictionary, such as "She" and "The", are flagged ○, words not found are flagged ×, and only the words registered in the translation dictionary are kept in memory.

As shown in FIG. 2(c), when word B is, for example, "at", the guess words Cn "Aat", "Bat", ..., "Zat" are generated in the same way. Words found in the translation dictionary are flagged ○, words not found are flagged ×, and only the words registered in the translation dictionary are kept in memory.

As shown in FIG. 2(d), when word B is, for example, "ato", the guess words Cn "Aato", "Bato", ..., "Zato" are generated in the same way. Words found in the translation dictionary are flagged ○, words not found are flagged ×, and only the words registered in the translation dictionary are kept in memory.

Of the guess words Cn generated by varying the guess character x through 'Z' as described above, the original text estimation determination unit 13 keeps in memory those registered in the translation dictionary of the dictionary unit 10.
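Steps S701 to S706 can be sketched in Python as below. The word set `DICTIONARY` is a small hypothetical stand-in for the translation dictionary of the dictionary unit 10, and `guess_words` is an illustrative name; whitespace splitting is assumed as the delimiter rule.

```python
import string

# Hypothetical stand-in for the translation dictionary of dictionary unit 10.
DICTIONARY = {"World", "She", "The", "Bat", "Cat", "Eat", "Fat", "Hat",
              "Mat", "Oat", "Pat", "Rat", "Sat", "Vat", "Kato", "Sato"}


def guess_words(text, dictionary=DICTIONARY):
    """Return (flagged, kept) for the first word of `text`.

    flagged -- 26 (guess word, found-in-dictionary) pairs, 'A' through 'Z'
    kept    -- the guess words that exist in the dictionary
    """
    parts = text.split()
    if not parts:
        return [], []
    word_b = parts[0]                                  # S701: first word B
    flagged = []
    for x in string.ascii_uppercase:                   # S702: x = 'A'..'Z'
        guess = x + word_b                             # S703: prepend x
        flagged.append((guess, guess in dictionary))   # S704-S705: lookup
    kept = [guess for guess, found in flagged if found]  # S706: keep hits
    return flagged, kept
```

With this toy dictionary, "orld is too big." yields the single kept word "World", and a sentence starting with "he" yields "She" and "The", mirroring FIG. 2(a) and 2(b). The boolean in each pair plays the role of the ○/× flag.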

In this example, the guess words Cn found in the translation dictionary are held in memory as pairs of headword and headword part of speech, as shown in FIG. 3: for example, "The" is a determiner, "She" is a pronoun, and "Bat" is a noun.
When one or more guess words Cn are held in memory (S605: != 0), the original text estimation determination unit 13 replaces word B in the input original text A with each guess word Cn that exists in the dictionary unit 10, generating translation original text candidates Dm (m ranges over A to Z). That is, the original text estimation determination unit 13 sequentially outputs the translation original text candidates Dm, in which a guess word Cn replaces word B, to the original text estimation unit 12 (syntax analysis unit 15 and morphological analysis unit 14) for syntax analysis and morphological analysis.

When only one translation original text candidate Dm exists (S609: = 1), the original text estimation determination unit 13 outputs that candidate, in which word B has been replaced with the guess word Cn, to the translation unit 18, which translates it (S611).

When the translation original text candidates Dm from the original text estimation determination unit 13 are input to the original text estimation unit 12 (syntax analysis unit 15 and morphological analysis unit 14), the syntax analysis unit 15 and the morphological analysis unit 14 perform the syntax analysis original text estimation processing, and the syntax analysis control unit 16 makes the determination (S606).

In the syntax analysis original text estimation processing of S606, the content of each translation original text candidate Dm is analyzed using syntax analysis and morphological analysis techniques, and the syntax analysis control unit 16 determines whether the candidate is appropriate.

In the syntax analysis original text estimation processing, as shown in FIG. 6, the morphological analysis unit 14 performs morphological analysis on the input translation original text candidate Dm (S802), after which the syntax analysis unit 15 performs syntax analysis on it (S803). These steps are repeated for each input candidate Dm (S801, S806).

When syntax analysis of a translation original text candidate Dm succeeds (Yes in S804), the syntax analysis unit 15 temporarily holds that candidate in memory (S805), and the original text estimation unit 12 moves on to the next candidate Dm (S806). The processing is repeated until no candidates remain.

When all translation original text candidates Dm have been processed, the syntax analysis control unit 16 checks the number of candidates Dm held in memory.
If no translation original text candidate Dm is held in memory (S607), the syntax analysis control unit 16 outputs the input original text A itself, as received from the input unit 11, to the translation unit 18, and the translation unit 18 translates it based on the data in the dictionary unit 10 (S608).

When the input original text A is, for example, "orld is too big.", the syntax analysis original text estimation processing narrows the translation original text candidates Dm down to exactly one.

However, when the original text is "at eats rat.", there are multiple translation original text candidates Dm, such as DB, DC, DE, DF, DG, and so on, and the syntax analysis original text estimation processing alone cannot narrow them down to one.

When syntax analysis cannot narrow the candidates down to one (S607: > 1), the plurality of translation original text candidates Dm are output from the original text estimation unit 12 to the original text estimation determination unit 13.
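The candidate loop of FIG. 6 and the three-way branch on the number of surviving candidates can be sketched as follows. The parser is passed in as a predicate `parses_ok`, because the source does not describe the parser itself; the function name and the returned labels are likewise hypothetical.

```python
def select_by_syntax(input_a, guesses, parses_ok):
    """Build candidates Dm from the guess words, keep those that parse,
    and report which branch of the flow applies.

    input_a   -- input original text A
    guesses   -- guess words Cn found in the dictionary
    parses_ok -- callable(sentence) -> bool standing in for units 14 and 15
    """
    _word_b, _sep, rest = input_a.partition(" ")
    # Replace word B with each guess word to form the candidates Dm.
    candidates = [f"{guess} {rest}".strip() for guess in guesses]
    # S801-S806: analyze every candidate and keep the ones that parse.
    kept = [c for c in candidates if parses_ok(c)]
    if not kept:
        # No candidate survived: translate input original text A as it is.
        return "translate_input", [input_a]
    if len(kept) == 1:
        # Exactly one candidate: it goes to the translation unit.
        return "translate", kept
    # Several candidates: hand over for semantic and statistical selection.
    return "needs_further_selection", kept
```

A real parser would replace the predicate. With a predicate that accepts everything, "at eats rat." with the guesses "Bat" and "Cat" falls into the third branch, which is the situation the text goes on to describe.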

On receiving the plurality of translation original text candidates Dm, the original text estimation determination unit 13 performs semantic analysis on the remaining part of the input original text A using the semantic analysis dictionary of the dictionary unit 10 (S608), and estimates the word from the part of speech and meaning expected of the guess word.

In the example "at eats rat.", the keywords "rat = ねずみ" and "eat = 食べる" suggest that the word to be estimated carries the information "noun" and "living thing". Based on such information, the original text estimation determination unit 13 selects a translation original text candidate Dm and sends it to the translation unit 18 for translation.
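Filtering guess words by expected part of speech and meaning might look like the sketch below. The entries in `SEMANTIC_DICT` and the feature labels are hypothetical; the source only says that the semantic analysis dictionary supplies information such as "noun" and "living thing".

```python
# Hypothetical semantic-analysis dictionary: part of speech plus features.
SEMANTIC_DICT = {
    "Bat": {"pos": "noun", "features": {"living thing"}},
    "Cat": {"pos": "noun", "features": {"living thing"}},
    "Hat": {"pos": "noun", "features": {"artifact"}},
    "Eat": {"pos": "verb", "features": set()},
}


def filter_by_semantics(guesses, expected_pos, expected_feature,
                        sem=SEMANTIC_DICT):
    """Keep the guess words whose dictionary entry has the expected part of
    speech and carries the expected semantic feature."""
    return [
        guess for guess in guesses
        if guess in sem
        and sem[guess]["pos"] == expected_pos
        and expected_feature in sem[guess]["features"]
    ]
```

With these toy entries, `filter_by_semantics(["Bat", "Cat", "Eat", "Hat"], "noun", "living thing")` leaves "Bat" and "Cat", so more than one candidate can still remain after this step.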

However, in some cases several words have the expected meaning (S609: > 1), and the translation original text candidates cannot be narrowed down to one.
When multiple translation original text candidates remain in this way, the original text estimation determination unit 13 selects the one with the highest priority, using the statistical data stored in the statistical data storage unit 17 and the frequency of appearance of words in the same document. Selecting, from a plurality of translation original text candidates, the one with the highest statistical priority is referred to as estimating the translation original text candidate (S610).

For example, the input original text A "ato is a Japanese name." is a pattern in which several words have the expected meaning. In such a case, the original text estimation determination unit 13 searches the statistical data storage unit 17 using the sentence-initial word as the search key, and selects one candidate based on statistical data, such as words that frequently appear at the start of a sentence and characters that frequently appear at the start of a sentence, together with the frequency of appearance of words in the same document.
The original text estimation determination unit 13 then outputs the selected translation original text candidate to the translation unit 18.
The translation unit 18 translates the input translation original text candidate based on the data in the dictionary unit 10 (S611).

Thus, the translation original text estimation apparatus of the first embodiment determines whether the first character of the input original text A obtained from a Web page is lowercase or uppercase. If it is lowercase, the first character of the input original text A may have been lost, so the apparatus prepends uppercase letters to the first word obtained by morphological analysis, treats the resulting words that are not unknown words as replacement candidates, substitutes them for the word in the input original text, and performs syntax analysis. The first word, which was often treated as an unknown word when text was taken from a Web page, is thereby corrected (completed) before being output to the translation unit 18, which improves the accuracy of the translation result.

In other words, taking the characteristics of Web pages into account, when the first word of the input original text A obtained from a Web page would otherwise be treated as an unknown word, the apparatus prepends one character to it, replaces it with a resulting word that exists in the dictionary, and then translates, so that a correct translation result can be obtained.

The first embodiment has been described using translation from a Web page as an example, but the present invention can also be used for characters lost when image data read by a scanner is character-recognized. When character recognition is performed on scanned data, the image often contains characters that are hard to recognize, such as decorative characters, and recognition may be inaccurate. By applying the present invention, characters that are missing relative to the original text on the scanned form, because they were hard to recognize in the image data, can be estimated, yielding a more accurate recognition result.

Next, a translation original text estimation apparatus according to a second embodiment of the present invention is described with reference to FIGS. 7 to 19. FIG. 7 shows the configuration of this apparatus.

As shown in FIG. 7, the translation original text estimation apparatus of the second embodiment has: an input unit 201 serving as data receiving means that accepts the character string of an original text keyed in from a keyboard; an input character string storage unit 202 that stores the input first-language character string, type information on the input means, and the like; a determination unit 203 that performs morphological analysis on the input character string to divide it into words and check for unknown words, performs syntax analysis, and determines non-sentences, unknown words, and the like; an unknown word list storage unit 204 that stores the unknown words extracted by the determination unit 203; a spell correct unit 205 that generates words by performing spell correct processing on the unknown words stored in the unknown word list storage unit 204 and substitutes them for the words of the original sentence; a display unit 206 that displays the result of translating the sentence, output to a translation unit 208, in which a word generated by the spell correct unit 205 has replaced the word of the original sentence; and a dictionary unit 207 in which a translation dictionary (hereinafter, dictionary data) containing syntax rules and vocabulary for sentences in the first and second languages is registered.

The relationship between the hardware and the software, including the OS, is the same as in the first embodiment. That is, the system is a PC equipped with a CPU, memory, a hard disk device, a keyboard, a mouse, a monitor, and so on, and each of the above units is realized by the CPU reading the software installed on the hard disk device into memory and executing it. Storage units such as the input character string storage unit 202 and the unknown word list storage unit 204, as well as the dictionary unit 207, are realized by the memory or the hard disk device.

When the user enters a character string in a first language such as English from the keyboard, the input unit 201 accepts the keyed-in character string and stores it in the input character string storage unit 202 together with type information on the input means. The type information identifies the input means, for example keyboard, scanner, or microphone, and also includes the type of keyboard.

The dictionary unit 207 stores dictionary data such as translations and parts of speech associated with headwords (words and characters), and a semantic analysis dictionary. Parts of speech are, for example, noun, verb, adjective, and adverb. The dictionary data also includes the character code of each character making up a word. The translations in the dictionary data convert between languages in one direction or both, for example Japanese → English and English → Japanese. The dictionary data includes character tables for each language, such as the English alphabet table and the Japanese kanji, hiragana, and katakana tables. Many words of each language, such as English and Japanese words corresponding to the headwords, are also registered in the dictionary data.

In accordance with the translation direction (for example, English → Japanese) specified by the user from the input unit 201, the translation unit 208 uses the dictionary data in the dictionary unit 207, including the relevant character table, to translate a sentence in the first language (here, English) into a sentence in the second language (for example, Japanese).
The determination unit 203 functions as dividing means that divides the character string input from the input unit 201 into words by performing morphological analysis based on preset dictionary data.

The determination unit 203 also functions as unknown word extraction means that extracts, as unknown words, those divided words that are not registered as dictionary data in the dictionary unit 207.

The determination unit 203 functions as a part-of-speech combination list generation means: it extracts from the dictionary data the parts of speech of the words immediately before and after each extracted unknown word in the original character string, and generates a part-of-speech combination list that associates each unknown word with the parts of speech it can grammatically take.
The determination unit 203 functions as a syntax analysis means that parses the original character string based on the generated part-of-speech combination list and the dictionary data.

The determination unit 203 replaces words of the original sentence with the replacement candidate words generated by the spell correction unit 205 and outputs the result to the translation unit 208. That is, the determination unit 203 functions as a character string output means that substitutes a replacement candidate word for the word at the corresponding position in the original character string and outputs the result to the translation unit 208 as the character string to be translated. The determination unit 203 also functions as a sentence analysis means: for each sentence in which the first word of the original sentence has been replaced, it selects grammatically correct sentence candidates by syntax analysis based on the dictionary data, or semantically correct sentence candidates by semantic analysis.

The spell correction unit 205 provides an English spell-correction function. The key-input character string is morphologically analyzed using the English words in the dictionary unit 207, and the unit determines whether each resulting word is registered in the dictionary unit 207. If a word is an unknown word not registered there, the unit corrects it to the proper spelling using a table set according to the characteristics of the input means, for example a table of characters that are easily mistyped because of the keyboard layout (see FIGS. 16 to 18).

The spell correction unit 205 functions as an unknown word part-of-speech specifying means that specifies the part of speech for which syntax analysis succeeded as the part of speech of the unknown word.
The determination unit 203 functions as a candidate character string generation means that generates candidate character strings by replacing, according to a predetermined rule, characters of an unknown word whose part of speech has been specified. The spell correction unit 205 functions as a replacement candidate word extraction means that outputs to the determination unit 203, as replacement candidate words, those generated candidate character strings that exist in the dictionary data and whose part of speech matches.

When a word obtained by dividing the input character string is an unknown word not registered in the dictionary unit 207, the spell correction unit 205 checks the word for spelling errors according to the preset characteristics of the input means (type information) and generates replacement candidate words in which the misspelled word is corrected to the proper spelling.

In other words, the spell correction unit 205 functions as a replacement candidate word generation means: it generates new words by replacing characters of the unknown word extracted by the determination unit 203 according to a predetermined rule, and extracts those that yield a grammatically correct sentence when substituted for the word in the original.

Together, the determination unit 203 and the spell correction unit 205 function as a replacement candidate word generation means that generates new words by replacing characters of an extracted unknown word according to a predetermined rule and extracts, as replacement candidate words, those that yield a grammatically correct sentence when substituted for the word in the original.

The operation of the translation original text estimation apparatus of the second embodiment is described below with reference to FIGS. 8 to 19.
To make the processing flow easy to follow, the description assumes that the first language input by the user is English and the second language is Japanese, takes an English input sentence as an example, and follows the flowcharts shown in FIGS. 8 to 10.

When the translator issues a translation request from the input unit 201, the determination unit 203 acquires the input character string one sentence at a time from the input character string storage unit 202 in accordance with that request (S100, S110).

Next, the determination unit 203 performs morphological analysis on the input character string using the word data in the dictionary unit 207, thereby dividing the character string into words and assigning a number to each word.
For example, when the English input sentence is “I yake a bictcle.”, it is divided into words such as “I”, “yake”, “a”, and “bictcle”, as shown in FIG. 11 (S120), and each word is associated with a number and held in memory.
In this example, “I” is associated with 1, “yake” with 2, “a” with 3, and “bictcle” with 4.
The determination unit 203 then initializes to 1 the counter that determines the processing order, that is, the variable m indicating the position of a word in the input sentence (S130).

Next, the determination unit 203 takes the m-th word in numerical order and looks it up in the dictionary unit 207, using the word as the search key (S140).
If the lookup finds no matching word in the dictionary unit 207 (No in S150), the determination unit 203 outputs the word as an unknown word to the unknown word list storage unit 204, where it is stored in the unknown word list 21 shown in FIG. 12 (S160). Each unknown word is again numbered, so that it can be addressed by incrementing the variable p, and held in memory.

After storing the unknown word in the unknown word list 21, the determination unit 203 determines from the number assigned to each divided word whether the word currently being processed is the last one (S170). If it is not (No in S170), the unit adds 1 to the variable m (S180) and repeats S140 to S180 until the last word has been processed.
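The loop of S120 to S180 can be sketched as follows. This is a minimal illustration, not the apparatus itself: the tiny `DICTIONARY`, the regular-expression word splitter standing in for morphological analysis, and the function names are all assumptions introduced here.

```python
import re

# Hypothetical stand-in for the dictionary unit 207: headword -> parts of speech.
DICTIONARY = {
    "i": {"pronoun"},
    "a": {"article"},
    "take": {"verb"},
    "bicycle": {"noun"},
}


def split_into_words(sentence):
    """Crude substitute for morphological analysis (S120): keep runs of letters."""
    return re.findall(r"[A-Za-z]+", sentence)


def extract_unknown_words(sentence):
    """Number the words from 1 and collect those missing from the dictionary."""
    unknown = []
    for m, word in enumerate(split_into_words(sentence), start=1):  # m: word position
        if word.lower() not in DICTIONARY:   # dictionary lookup (S140, S150)
            unknown.append((m, word))        # store in the unknown word list (S160)
    return unknown


print(extract_unknown_words("I yake a bictcle."))
# [(2, 'yake'), (4, 'bictcle')]
```

An empty result corresponds to the No branch of S190, where no unknown word has been found.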

When all words have been processed (Yes in S170), the determination unit 203 determines whether any unknown word exists in the unknown word list storage unit 204 (S190) and branches accordingly.
If an unknown word exists (Yes in S190), the determination unit 203 executes the processing of the flowchart shown in FIG. 9; if none exists (No in S190), it executes the processing of the flowchart shown in FIG. 10.

If an unknown word exists (Yes in S190), the determination unit 203 checks how many words are stored in the unknown word list storage unit 204 and, according to that number, creates an unknown word part-of-speech combination list 23 such as the one shown in FIG. 13 (S300 in FIG. 9).
In doing so, the determination unit 203 uses the part-of-speech information of the words before and after each unknown word and does not extract candidates that would give an invalid sequence of parts of speech.
For example, for the unknown word “bictcle” in the unknown word part-of-speech combination list 23, the preceding word is the indefinite article “a” and the following token is the period that ends the sentence, so the part-of-speech candidates for “bictcle” can be limited to noun only.
After creating the unknown word part-of-speech combination list 23 in S300, the determination unit 203 initializes to 1 the variable n indicating the position in the list (S310), attaches the parts of speech from the list to each unknown word in the sentence (S320), and performs syntax analysis (S330).

If the syntax analysis succeeds (Yes in S340), the determination unit 203 specifies those parts of speech as the parts of speech of the unknown words (S340).
If the syntax analysis fails (No in S340), the determination unit 203 adds 1 to the variable n (S350) and repeats S320 to S350 until the syntax analysis succeeds.

In the example sentence of this embodiment, the unknown word “bictcle” is limited to noun in the unknown word part-of-speech combination list 23, and the syntax analysis succeeds when verb, among the part-of-speech candidates in the list, is applied to the other unknown word in the sentence, “yake”.
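The trial of part-of-speech combinations in S300 to S350 might look like the sketch below. It is only an illustration under stated assumptions: the syntax analysis is reduced to matching against a single hypothetical pattern in `VALID_PATTERNS`, the pruning rule covers only the "after an article, at the end of the sentence" case from the text, and all names are introduced here.

```python
from itertools import product

# Hypothetical stand-in for the dictionary unit 207 (known words only).
DICTIONARY = {"I": "pronoun", "a": "article"}
OPEN_CLASSES = ["noun", "verb", "adjective", "adverb"]
# Hypothetical toy grammar: a listed part-of-speech sequence counts as a successful parse.
VALID_PATTERNS = {("pronoun", "verb", "article", "noun")}


def candidate_tags(words, i):
    """Parts of speech an unknown word may take, pruned by its neighbours (S300)."""
    after_article = i > 0 and DICTIONARY.get(words[i - 1]) == "article"
    ends_sentence = i == len(words) - 1
    if after_article and ends_sentence:
        return ["noun"]
    return OPEN_CLASSES


def identify_parts_of_speech(words):
    """Try each combination in turn until the toy parse succeeds (S310 to S350)."""
    unknown = [i for i, w in enumerate(words) if w not in DICTIONARY]
    for combo in product(*(candidate_tags(words, i) for i in unknown)):
        assigned = dict(zip(unknown, combo))
        tags = tuple(assigned.get(i, DICTIONARY.get(w)) for i, w in enumerate(words))
        if tags in VALID_PATTERNS:
            return {words[i]: tag for i, tag in assigned.items()}
    return None


print(identify_parts_of_speech(["I", "yake", "a", "bictcle"]))
# {'yake': 'verb', 'bictcle': 'noun'}
```

Pruning “bictcle” to noun before the loop keeps the combination list to four entries instead of sixteen for this sentence.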

When the syntax analysis by the determination unit 203 succeeds, the spell correction unit 205 initializes to 1 the variable p indicating the position in the unknown word list (S360), the variable q indicating the character position within unknown word p (S370), and the variable r indicating the position in the alphabet (a to z) (S380), and creates candidate character strings by replacing the single character at position q with each of a to z (S390).

The spell correction unit 205 compares each created candidate character string with the translation dictionary of the dictionary unit 207 (S400) and determines whether the candidate exists in the dictionary unit 207 with a matching part of speech (S410).
If the candidate character string satisfies this condition, the spell correction unit 205 outputs it as a replacement word (S430).

In this way the spell correction unit 205 increments the variable r indicating the alphabet position and the variable q indicating the character position in the unknown word (S420, S460), compares each generated character string with the data in the dictionary unit 207, extracts as a replacement word any string that satisfies the above condition (S390 to S460), and registers it in the replacement candidate table 30 (see FIG. 14) held in memory.
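The nested loop over character position q and alphabet position r (S370 to S460) can be sketched as below. The small `DICTIONARY` and the function name are assumptions made for illustration; the real dictionary unit 207 would hold far more entries and could therefore return more candidates.

```python
import string

# Hypothetical stand-in for the dictionary unit 207: headword -> parts of speech.
DICTIONARY = {
    "bake": {"verb"},
    "make": {"verb", "noun"},
    "take": {"verb"},
    "cake": {"noun"},
    "lake": {"noun"},
    "bicycle": {"noun"},
}


def replacement_candidates(unknown_word, part_of_speech):
    """Replace one character at a time with a to z and keep dictionary hits."""
    found = []
    for q in range(len(unknown_word)):                 # character position q
        for letter in string.ascii_lowercase:          # alphabet position r
            candidate = unknown_word[:q] + letter + unknown_word[q + 1:]   # S390
            in_dictionary = part_of_speech in DICTIONARY.get(candidate, set())  # S400, S410
            if in_dictionary and candidate not in found:
                found.append(candidate)                # register the replacement word (S430)
    return found


print(replacement_candidates("yake", "verb"))     # ['bake', 'make', 'take']
print(replacement_candidates("bictcle", "noun"))  # ['bicycle']
```

Requiring the part of speech to match is what excludes “cake” and “lake” for “yake” in this sketch, since they are registered here only as nouns.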

By performing the series of steps S370 to S480 for every unknown word, the spell correction unit 205 registers the replacement candidates for each unknown word in the replacement candidate table 30.
The determination unit 203 then reads the replacement candidates for each unknown word from the replacement candidate table 30, generates original sentences from their combinations, and outputs the generated sentences to the translation unit 208.
The translation unit 208 translates each sentence received from the determination unit 203 based on the translation dictionary (dictionary data) of the dictionary unit 207 and outputs the translation results (see FIG. 15) to the display unit 206. That is, the unknown words of the original are replaced with spell-correction candidates, and the result is translated and displayed on the screen of the display unit 206 (S490).
In this example, the only replacement candidate registered in the replacement candidate table 30 for the unknown word “bictcle” is “bicycle”, while the unknown word “yake” has three replacement candidates, “bake”, “make”, and “take”. The combinations of replacement candidates therefore yield three original sentences, as shown in FIG. 15, and three translation results.
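Generating the original sentences from the combinations of replacement candidates amounts to a Cartesian product over the per-word options. The sketch below assumes the replacement candidate table 30 can be modeled as a plain mapping from unknown word to candidate list; the function name and the way the period is re-attached are illustrative choices.

```python
from itertools import product


def candidate_sentences(words, replacement_table):
    """Build one sentence per combination of replacement candidates."""
    # A word without an entry in the table is kept as it is.
    options = [replacement_table.get(word, [word]) for word in words]
    return [" ".join(choice) + "." for choice in product(*options)]


# Contents of the replacement candidate table for the example sentence.
table = {"yake": ["bake", "make", "take"], "bictcle": ["bicycle"]}

for sentence in candidate_sentences(["I", "yake", "a", "bictcle"], table):
    print(sentence)
# I bake a bicycle.
# I make a bicycle.
# I take a bicycle.
```

The number of sentences is the product of the candidate counts, here 3 × 1 = 3, which matches the three translation results described for FIG. 15.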

If no unknown word is detected in S190, the determination unit 203 proceeds as shown in the flowchart of FIG. 10.
That is, the determination unit 203 parses the sentence made up of the divided words (S500) and, if the syntax analysis succeeds (Yes in S510), outputs the (English) sentence to the translation unit 208 for translation.
If the syntax analysis fails (No in S510), the determination unit 203 compares the sentence with a similar basic sentence pattern, outputs any word whose part of speech differs to the unknown word list storage unit 204 as an unknown word (S520), and performs the processing of S300 to S490 described above.

Thus, according to the translation original text estimation apparatus of the second embodiment, when an original text written in English is to be translated into Japanese, the input unit 201 accepts the character string keyed in by a translator or other user and stores it in the input character string storage unit 202. The determination unit 203 reads the character string from the input character string storage unit 202, divides it into word units by morphological analysis based on the dictionary data set in advance in the dictionary unit 207, and lists (extracts) as unknown words those words not registered as dictionary data. The spell correction unit 205 then generates new words by replacing characters of the listed unknown words according to predetermined rules (grammatical checks by syntax analysis, association with parts of speech, character replacement using the alphabet table, and so on), and produces replacement candidate words that yield a grammatically correct sentence when substituted for the word in the original. Each replacement candidate word is substituted for the word at the corresponding position in the original character string, and the result is output to the translation unit 208 as the character string to be translated. Words that would be treated as unknown because of typing errors are therefore removed before the character string reaches the translation unit 208, which improves the accuracy of the translation result.
In other words, when an English sentence is machine-translated into Japanese, improving the accuracy of the English source sentence before translation improves the accuracy of the translation result.

The correction of typing errors (wrong key presses) described in the second embodiment is only one example of the present invention; the following methods are also possible.
[Third Embodiment]
Consider as an example a translator who intends to type a sentence such as “I take a bus.” but, by pressing the wrong key, types “I yake a bus.”, so that the input sentence contains the unknown word “yake”.

The character string “I yake a bus.” keyed in through the input unit 201 is stored in the input character string storage unit 202 in association with type information indicating that the input means is a keyboard.

The determination unit 203 judges “yake” in the key-input character string to be an unknown word and stores it in the unknown word list storage unit 204.
The spell correction unit 205 reads the unknown word “yake” from the unknown word list storage unit 204 and generates spell-correction candidate character strings (other words obtained by changing a character of the unknown word) based on the character replacement data of the replacement candidate table 31 shown in FIG. 16, which has been set in memory in advance.

Besides being divided simply by the type of input means, such as keyboard, microphone, or scanner, the replacement candidate table 31 may be divided more finely within the keyboard type according to key layouts that differ between keyboards, such as the 101/102 English keyboard, 106 English keyboard, 109 English keyboard, and thumb-shift keyboard.

In this case, in the character replacement processing shown in FIG. 9 (S390), when the input means for the target character string is designated as keyboard, the spell correction unit 205 preferentially uses the key-layout replacement candidate table 31, shown in FIG. 16, in which keys located close to each character of the target character string on the keyboard are registered as replacement candidates. The key-layout replacement candidate table 31 may be stored in advance in memory or on a hard disk device, or may be built into the program as conversion rules.

When the key-layout replacement candidate table 31 is used, for the character string “I yake a bus.” the replacement candidates first generated for “yake” are “uake”, “take”, “gake”, “jake”, and “hake”, and of these “take”, which exists as a word and has a matching part of speech, is adopted.
As a result, the candidate translations generated for this example sentence are ranked as shown in FIG. 17, and the correct translation is displayed as the top candidate, which confirms the effect.
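A key-layout replacement table of this kind can be sketched as a mapping from a typed key to its neighbouring keys. In the sketch below only the row for “y” is taken from the example in the text; the table format, the small `DICTIONARY`, and the function names are assumptions, since the actual contents of FIG. 16 are not reproduced here.

```python
# Neighbouring keys per typed key. Only the "y" row follows the example in the text;
# a full table would carry one row per key of the chosen layout.
NEIGHBOUR_KEYS = {"y": "utgjh"}

# Hypothetical stand-in for the dictionary unit 207: headword -> parts of speech.
DICTIONARY = {"take": {"verb"}, "bake": {"verb"}, "make": {"verb"}}


def keyboard_candidates(word):
    """Replace each character with each of its neighbouring keys."""
    candidates = []
    for i, typed in enumerate(word):
        for neighbour in NEIGHBOUR_KEYS.get(typed, ""):
            candidates.append(word[:i] + neighbour + word[i + 1:])
    return candidates


def keyboard_corrections(word, part_of_speech):
    """Keep candidates that exist in the dictionary with a matching part of speech."""
    return [c for c in keyboard_candidates(word)
            if part_of_speech in DICTIONARY.get(c, set())]


print(keyboard_candidates("yake"))           # ['uake', 'take', 'gake', 'jake', 'hake']
print(keyboard_corrections("yake", "verb"))  # ['take']
```

Compared with trying all of a to z, restricting the search to neighbouring keys drops “bake” and “make” for this input, which is why “take” alone remains.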

When the input means is a keyboard, some keys are pressed with the left hand and others with the right. Where a key pressed with a left-hand finger and a key pressed with a right-hand finger are adjacent in a spelling, the keys may be struck in the wrong order: for example, a user trying to type “make” may press “a” first and enter “amke”.

For this reason, a finger-based replacement candidate table 32, as shown in FIG. 18, may be stored in memory in advance. The spell correction unit 205 determines whether the input character string contains a sequence of a key pressed with a left-hand finger and a key pressed with a right-hand finger and, if so, uses the finger-based replacement candidate table 32 on that part of the string, for example swapping the characters “am” in “amke” to “ma” to derive the correctly spelled word “make”.
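The left-hand and right-hand check can be illustrated with the sketch below. It does not reproduce the finger-based replacement candidate table 32; instead it assumes a conventional touch-typing split of the QWERTY letter keys between the two hands, and the `DICTIONARY` and function names are likewise introduced only for illustration.

```python
# Assumed touch-typing assignment of QWERTY letter keys to each hand.
LEFT_HAND = set("qwertasdfgzxcvb")
RIGHT_HAND = set("yuiophjklnm")

# Hypothetical stand-in for the dictionary unit 207 (headwords only).
DICTIONARY = {"make", "take", "bake"}


def swapped_candidates(word):
    """Swap each adjacent pair of characters typed with different hands."""
    candidates = []
    for i in range(len(word) - 1):
        first, second = word[i], word[i + 1]
        crosses_hands = (
            (first in LEFT_HAND and second in RIGHT_HAND)
            or (first in RIGHT_HAND and second in LEFT_HAND)
        )
        if crosses_hands:
            candidates.append(word[:i] + second + first + word[i + 2:])
    return candidates


def hand_order_corrections(word):
    """Keep only the swapped strings that are registered words."""
    return [c for c in swapped_candidates(word) if c in DICTIONARY]


print(swapped_candidates("amke"))      # ['make', 'amek']
print(hand_order_corrections("amke"))  # ['make']
```

The pair “mk” is skipped because both keys fall to the right hand under this assignment, so only two swaps are tried for “amke”.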

On the other hand, when the input means is character recognition software using a scanner or the like, characters tend to be misrecognized as similarly shaped characters, depending on the condition of the image data being recognized.

Therefore, as shown in FIG. 19, a similar-character replacement candidate table 33 that associates similar-looking characters, for example across lowercase, uppercase, half-width, and full-width letters and digits, may be stored in memory in advance, and when scanner is designated as the input means, the spell correction unit 205 may preferentially use the similar-character replacement candidate table 33 to replace characters.
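A similar-character table for scanner input can be sketched in the same way as the key-layout table. The pairs in `SIMILAR_CHARACTERS` below are hypothetical examples of look-alike characters, not the contents of FIG. 19, and the `DICTIONARY` and function names are also assumptions.

```python
# Hypothetical look-alike pairs; the actual table 33 is not reproduced here.
SIMILAR_CHARACTERS = {
    "0": "Oo",
    "1": "lI",
    "5": "S",
    "l": "1I",
    "O": "0",
}

# Hypothetical stand-in for the dictionary unit 207 (headwords only).
DICTIONARY = {"hello", "world"}


def similar_character_candidates(word):
    """Replace each character with each character registered as similar to it."""
    candidates = []
    for i, recognised in enumerate(word):
        for lookalike in SIMILAR_CHARACTERS.get(recognised, ""):
            candidates.append(word[:i] + lookalike + word[i + 1:])
    return candidates


def scanner_corrections(word):
    """Keep only the candidates that are registered words."""
    return [c for c in similar_character_candidates(word) if c in DICTIONARY]


print(scanner_corrections("he1lo"))  # ['hello']
print(scanner_corrections("w0rld"))  # ['world']
```

Selecting between this table and a key-layout table by the stored input-means type is the point of the third embodiment; the lookup code itself stays the same.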

Thus, according to the third embodiment, unknown words in the keyed-in source character string can be corrected with higher accuracy by preferentially applying character replacement rules based on the key-layout replacement candidate table 31, on the replacement candidate table 32 based on the finger positions of the left and right hands, or on the similar-character replacement candidate table 33.
In other words, using the replacement candidate tables 31, 32, and 33 according to the type of input means allows more appropriate replacement candidates to be selected; the character string corrected to the proper words is output to the translation unit 208, which can then produce a highly accurate translation result.

FIG. 1 is a diagram showing the configuration of the translation original text estimation apparatus according to the first embodiment of the present invention.
FIG. 2(a) is a diagram showing an example in which uppercase letters are added to the beginning of the first word “orld”, detected as an unknown word, together with a flag indicating whether each resulting word exists in the dictionary. FIG. 2(b) shows the same for the first word “he”, FIG. 2(c) for the first word “at”, and FIG. 2(d) for the first word “ato”.
FIG. 3 is a diagram showing an example of the words stored in memory as existing in the dictionary, out of those formed by adding a character to the beginning of the first word.
FIG. 4 is a flowchart showing the operation of the translation original text estimation apparatus.
FIG. 5 is a flowchart showing the dictionary lookup word estimation processing of the translation original text estimation apparatus.
FIG. 6 is a flowchart showing the syntax analysis original text estimation processing of the translation original text estimation apparatus.
FIG. 7 is a diagram showing the configuration of the translation original text estimation apparatus according to the second embodiment of the present invention.
FIG. 8 is a flowchart showing the operation of the translation original text estimation apparatus of the second embodiment.
FIG. 9 is a flowchart showing the processing performed when an unknown word is detected in the translation original text estimation apparatus of the second embodiment.
FIG. 10 is a flowchart showing the processing performed when no unknown word is detected in the translation original text estimation apparatus of the second embodiment.
FIG. 11 is a diagram showing the divided words.
FIG. 12 is a diagram showing the unknown word list.
FIG. 13 is a diagram showing the unknown word part-of-speech combination table.
FIG. 14 is a diagram showing the replacement candidate table for replacing an unknown word with a word registered in the dictionary unit.
FIG. 15 is a diagram showing the translation results obtained using the replacement candidate table.
FIG. 16 is a diagram showing the key-layout replacement candidate table.
FIG. 17 is a diagram showing the translation results obtained using the key-layout replacement candidate table.
FIG. 18 is a diagram showing the finger-based replacement candidate table.
FIG. 19 is a diagram showing the similar-character replacement candidate table.
FIG. 20 is a diagram showing four cases in which an incorrect translation result may be output.

Explanation of symbols

10 ... dictionary unit, 11 ... input unit, 12 ... original text estimation unit, 13 ... original text estimation determination unit, 14 ... morphological analysis unit, 15 ... syntax analysis unit, 16 ... syntax analysis control unit, 17 ... statistical data storage unit, 18 ... translation unit, 19 ... output unit, 30 ... replacement candidate table, 31 ... key-layout replacement candidate table, 32 ... finger-based replacement candidate table, 33 ... similar-character replacement candidate table, 201 ... input unit, 202 ... input character string storage unit, 203 ... determination unit, 204 ... unknown word list storage unit, 205 ... spell correction unit, 206 ... display unit, 207 ... dictionary unit, 208 ... translation unit.

Claims

A machine translation apparatus including a translation unit that translates an original sentence written in a first language into a sentence in a second language based on dictionary data registered in advance, the apparatus comprising:
data receiving means for receiving a character string written in the first language;
dividing means for dividing the character string received by the data receiving means into word units based on the dictionary data;
unknown word extracting means for extracting, as an unknown word, any word among the words divided by the dividing means that is not registered in the dictionary data;
replacement candidate character generation means for generating a replacement candidate character by adding a character to, or replacing a character of, the unknown word extracted by the unknown word extracting means according to a predetermined rule; and
character string output means for outputting to the translation unit, as a character string to be translated, a sentence in which the replacement-source unknown word is replaced with the replacement candidate character generated by the replacement candidate character generation means.
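
The flow recited in this claim (divide, extract unknown words, generate candidates by a rule, output the corrected string) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the specification: the toy `DICTIONARY`, the one-entry `RULE` table, the regular-expression word splitter standing in for morphological analysis, and the choice to accept the first candidate found in the dictionary.

```python
import re

# Hypothetical toy dictionary; a real dictionary unit would also hold
# translations and parts of speech.
DICTIONARY = {"this", "is", "a", "pen"}

# Hypothetical replacement rule: each character maps to characters that may
# have been intended instead.
RULE = {"w": "e"}


def split_words(text):
    """Divide a character string into word units (simple alphabetic split)."""
    return re.findall(r"[A-Za-z]+", text)


def extract_unknown_words(words, dictionary):
    """Return the words that are not registered in the dictionary."""
    return [w for w in words if w.lower() not in dictionary]


def replace_one_char(word, rule):
    """Yield every word formed by replacing one character according to the rule."""
    for i, ch in enumerate(word):
        for sub in rule.get(ch, ""):
            yield word[:i] + sub + word[i + 1:]


def correct_text(text, dictionary, rule):
    """Replace each unknown word with its first candidate found in the dictionary."""
    for unknown in extract_unknown_words(split_words(text), dictionary):
        for candidate in replace_one_char(unknown, rule):
            if candidate.lower() in dictionary:
                pattern = rf"\b{re.escape(unknown)}\b"
                text = re.sub(pattern, candidate, text, count=1)
                break
    return text
```

With these toy tables, `correct_text("This is a pwn", DICTIONARY, RULE)` yields `"This is a pen"`, which would then be passed to the translation unit.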

A machine translation apparatus including a dictionary unit that stores in advance dictionary data including characters, words, dictionaries, translated words, and parts of speech, and a translation unit that translates an original sentence written in a first language into a sentence in a second language based on the dictionary data, the apparatus comprising:
data receiving means for receiving the original sentence in the first language as a computer-processable character string;
dividing means for dividing the character string received by the data receiving means into word units by performing morphological analysis based on the dictionary data stored in the dictionary unit;
word generation means for determining whether the first letter of the first word among the plurality of words divided by the dividing means is a lowercase letter and, if it is, generating a plurality of new words, each formed by adding an uppercase alphabetic letter to the beginning of the first word;
replacement candidate word extraction means for extracting, as replacement candidate words, those of the plurality of words generated by the word generation means that exist in the dictionary data; and
character string output means for outputting to the translation unit, as a character string to be translated, a sentence in which the first word of the original sentence is replaced with a replacement candidate word extracted by the replacement candidate word extraction means.
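
As an illustration of the head-word restoration recited in this claim, the following Python sketch prepends each uppercase letter to a first word that begins with a lowercase letter and keeps only the results found in a dictionary. The toy `DICTIONARY` and the function names are hypothetical, and the dictionary lookup is assumed to be case-insensitive.

```python
import string

# Hypothetical toy dictionary, stored in lowercase.
DICTIONARY = {"there", "where", "here", "is", "a", "pen"}


def head_word_candidates(words, dictionary):
    """Generate head-word candidates when the first word starts in lowercase.

    Returns an empty list when there are no words or when the first word
    does not begin with a lowercase letter.
    """
    if not words or not words[0][:1].islower():
        return []
    head = words[0]
    generated = [letter + head for letter in string.ascii_uppercase]
    return [w for w in generated if w.lower() in dictionary]


def candidate_sentences(words, dictionary):
    """Build one sentence per replacement candidate for the first word."""
    return [
        " ".join([candidate] + words[1:])
        for candidate in head_word_candidates(words, dictionary)
    ]
```

For the word list `["here", "is", "a", "pen"]`, the candidates are `There` and `Where`, so two sentence candidates are produced; choosing between them is left to later selection steps.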

The machine translation apparatus according to claim 1, wherein the character string output means includes:
first sentence candidate selection means for selecting, when a plurality of replacement candidate words are extracted by the replacement candidate word extraction means, a grammatically correct sentence candidate based on the dictionary data from among the sentences in which the first word of the original sentence is replaced with each replacement candidate word.

The machine translation apparatus according to claim 3, wherein the first sentence candidate selection means includes:
sentence analysis means for selecting, for each sentence in which the first word of the original sentence has been replaced, a grammatically correct sentence candidate by performing syntax analysis based on the dictionary data, or a semantically correct sentence candidate by performing semantic analysis.

The machine translation apparatus according to claim 3, further comprising second sentence candidate selection means for selecting, from the one or more grammatically correct sentence candidates selected by the first sentence candidate selection means, a sentence candidate in which an appropriate word is used as the first word of the sentence based on preset statistical data, and for outputting the selected sentence candidate to the translation unit as a character string to be translated.

The machine translation apparatus according to claim 5, wherein the statistical data is appearance frequency data of words in sentences.

The machine translation apparatus according to claim 5, wherein the statistical data is appearance frequency data of head words.

The machine translation apparatus according to claim 5, wherein the statistical data is appearance data of initial characters.
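
The statistical selection described in the preceding claims can be sketched as follows, using head-word frequency as the statistic. The counts in `HEAD_WORD_FREQUENCY` are invented for illustration and are not taken from the specification; a real statistical data storage unit would be built from a corpus.

```python
# Hypothetical head-word counts; real statistics would come from a corpus.
HEAD_WORD_FREQUENCY = {"There": 120, "Where": 45}


def select_by_head_word_frequency(sentences, frequency):
    """Pick the sentence candidate whose first word is the most frequent head word.

    Returns None when no candidates are given. Head words missing from the
    frequency table count as zero.
    """
    if not sentences:
        return None

    def head_score(sentence):
        words = sentence.split()
        return frequency.get(words[0], 0) if words else 0

    return max(sentences, key=head_score)
```

Given the candidates `"There is a pen"` and `"Where is a pen"`, this sketch selects `"There is a pen"` because `There` has the higher count in the toy table. The other statistics named in the claims could be substituted by changing the scoring function.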

A machine translation apparatus that translates an original sentence written in a first language into a sentence in a second language, the apparatus comprising:
data receiving means for receiving a character string of the original sentence entered by key input;
dividing means for dividing the character string received by the data receiving means into word units by performing morphological analysis based on dictionary data set in advance;
unknown word extracting means for extracting, as an unknown word, any word among the words divided by the dividing means that is not registered in the dictionary data;
replacement candidate word generation means for generating new words by replacing characters of the unknown word extracted by the unknown word extracting means according to a predetermined rule, and extracting, as a replacement candidate word, a new word that yields a grammatically correct sentence when substituted for the original word; and
character string output means for replacing the word at the corresponding position in the character string of the original sentence with the replacement candidate word generated by the replacement candidate word generation means, and outputting the result to a translation unit as a character string to be translated.

The machine translation apparatus according to claim 9, wherein the replacement candidate word generation means includes:
part-of-speech combination list generating means for extracting from the dictionary data the parts of speech of the words before and after the unknown word extracted by the unknown word extracting means, and generating a part-of-speech combination list in which each part of speech that can hold as the part of speech of the unknown word is associated with the unknown word;
syntax analysis means for performing syntax analysis of the character string of the original sentence based on the part-of-speech combination list generated by the part-of-speech combination list generating means and the dictionary data;
unknown word part-of-speech specifying means for specifying, as the part of speech of the unknown word, a part of speech for which the syntax analysis by the syntax analysis means succeeded;
candidate character string generating means for generating candidate character strings by replacing characters of the unknown word whose part of speech has been specified by the unknown word part-of-speech specifying means according to the predetermined rule; and
replacement candidate word extraction means for outputting to the character string output means, as a replacement candidate word, any candidate character string generated by the candidate character string generating means that exists in the dictionary data and has a matching part of speech.
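
The part-of-speech filtering in this claim can be pictured with a simplified Python sketch. It is only an approximation under stated assumptions: the table `ALLOWED_BETWEEN` is a hypothetical stand-in for the part-of-speech combination list and the syntax analysis step (no actual parsing is performed), and `POS_DICTIONARY` is toy data.

```python
# Hypothetical toy dictionary mapping words to a single part of speech.
POS_DICTIONARY = {
    "this": "pronoun", "is": "verb", "a": "article",
    "pen": "noun", "pert": "adjective",
}

# Hypothetical table: (part of speech before, part of speech after) ->
# parts of speech the unknown word may take. None marks a sentence boundary.
# This lookup stands in for the syntax analysis recited in the claim.
ALLOWED_BETWEEN = {
    ("article", None): ["noun"],
    ("verb", None): ["noun", "adjective"],
}


def possible_pos(words, index, pos_dictionary, allowed):
    """List the parts of speech the word at `index` may take, given its neighbours."""
    before = pos_dictionary.get(words[index - 1]) if index > 0 else None
    after = pos_dictionary.get(words[index + 1]) if index + 1 < len(words) else None
    return allowed.get((before, after), [])


def filter_by_pos(candidates, pos_list, pos_dictionary):
    """Keep candidates that are in the dictionary with a matching part of speech."""
    return [c for c in candidates if pos_dictionary.get(c) in pos_list]
```

For the words `["this", "is", "a", "pwn"]`, the unknown word at index 3 follows an article and ends the sentence, so only `noun` is allowed, and of the candidate strings `own`, `pen`, and `pert` only `pen` survives the filter.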

The machine translation apparatus according to claim 9, wherein the predetermined rule is based on a replacement candidate table derived from the key arrangement or from the positions of the fingers of the left and right hands.

The machine translation apparatus according to claim 9, wherein the predetermined rule is based on a similar character replacement candidate table in which similar characters are associated with each other.
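
To make the two kinds of replacement table concrete, here is a minimal Python sketch. The contents of the tables in the specification are not reproduced here: the adjacency table is derived from an assumed QWERTY layout using only left and right neighbours on the same row, and `SIMILAR_CHARACTERS` is a hypothetical set of look-alike characters.

```python
# Assumed keyboard layout (QWERTY letter rows only).
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Hypothetical similar-character table (look-alike digits and letters).
SIMILAR_CHARACTERS = {"0": "o", "1": "l"}


def build_key_adjacency_table(rows):
    """Map each key to its left and right neighbours on the same row."""
    table = {}
    for row in rows:
        for i, key in enumerate(row):
            table[key] = row[max(i - 1, 0):i] + row[i + 1:i + 2]
    return table


def generate_candidates(word, table):
    """Return every string formed by replacing one character using the table."""
    result = []
    for i, ch in enumerate(word):
        for sub in table.get(ch, ""):
            result.append(word[:i] + sub + word[i + 1:])
    return result
```

With the adjacency table, the mistyped `pwn` produces `own`, `pqn`, `pen`, `pwb`, and `pwm`; with the similar-character table, `w0rd` produces `word`. Filtering these strings against the dictionary is a separate step.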

A machine translation program for causing a machine translation apparatus, which includes a translation unit that translates an original sentence written in a first language into a sentence in a second language based on dictionary data stored in advance, to execute processing,
the program causing the machine translation apparatus to function as:
data receiving means for receiving a character string written in the first language;
dividing means for dividing the character string received by the data receiving means into word units based on the dictionary data;
unknown word extracting means for extracting, as an unknown word, any word among the words divided by the dividing means that is not registered in the dictionary data;
replacement candidate character generation means for generating a replacement candidate character by adding a character to, or replacing a character of, the unknown word extracted by the unknown word extracting means according to a predetermined rule; and
character string output means for outputting to the translation unit, as a character string to be translated, a sentence in which the replacement-source unknown word is replaced with the replacement candidate character generated by the replacement candidate character generation means.

A machine translation program for causing a machine translation apparatus to execute processing, the apparatus including a dictionary unit that stores in advance dictionary data including characters, words, dictionaries, translated words, and parts of speech, and a translation unit that translates an original sentence written in a first language into a sentence in a second language based on the dictionary data,
the program causing the machine translation apparatus to function as:
data receiving means for receiving the original sentence in the first language as a computer-processable character string;
dividing means for dividing the character string received by the data receiving means into word units by performing morphological analysis based on the dictionary data stored in the dictionary unit;
word generation means for determining whether the first letter of the first word among the plurality of words divided by the dividing means is a lowercase letter and, if it is, generating a plurality of new words, each formed by adding an uppercase alphabetic letter to the beginning of the first word;
replacement candidate word extraction means for extracting, as replacement candidate words, those of the plurality of words generated by the word generation means that exist in the dictionary data; and
character string output means for outputting to the translation unit, as a character string to be translated, a sentence in which the first word of the original sentence is replaced with a replacement candidate word extracted by the replacement candidate word extraction means.

A machine translation program for causing a machine translation apparatus, which translates an original sentence written in a first language into a sentence in a second language, to execute processing,
the program causing the machine translation apparatus to function as:
data receiving means for receiving a character string of the original sentence entered by key input;
dividing means for dividing the character string received by the data receiving means into word units by performing morphological analysis based on dictionary data set in advance;
unknown word extracting means for extracting, as an unknown word, any word among the words divided by the dividing means that is not registered in the dictionary data;
replacement candidate word generation means for generating new words by replacing characters of the unknown word extracted by the unknown word extracting means according to a predetermined rule, and extracting, as a replacement candidate word, a new word that yields a grammatically correct sentence when substituted for the original word; and
character string output means for replacing the word at the corresponding position in the character string of the original sentence with the replacement candidate word generated by the replacement candidate word generation means, and outputting the result to a translation unit as a character string to be translated.