JP2007156545A

JP2007156545A - Symbol string conversion method, word translation method, its device, its program and recording medium

Info

Publication number: JP2007156545A
Application number: JP2005346898A
Authority: JP
Inventors: Katsuto Sudo; 克仁須藤; Hideki Isozaki; 秀樹磯崎; Hajime Tsukada; 元塚田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-11-30
Filing date: 2005-11-30
Publication date: 2007-06-21
Anticipated expiration: 2025-11-30
Also published as: JP4266222B2

Abstract

PROBLEM TO BE SOLVED: To provide a device for converting a first word belonging to a certain language system into a second word corresponding to the other language system. SOLUTION: A word translation device 5B is provided with: a word output part 5 for preparing a database by referring to a transliteration probability model 7, and using approximation under the consideration of a word set maximizing conditioned probability with the character history of the word set as conditions, and for retrieving a second word corresponding to an input first word; a conversion candidate retrieving part 40 for outputting a third word extracted from document data acquired by electronic equipment 50 connected to a communication network NW based on the first word as a conversion candidate to the second word; and a conversion possibility calculating part 30 for calculating status transition weight based on the third word acquired by the conversion candidate retrieval part 40 by referring to a composite database acquired by compounding the history relating to characters comprising the third word with the database prepared by the word output part 5. COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a symbol string conversion method, a word translation method, a device and a program therefor, and a recording medium, for use in, for example, information retrieval systems, question answering systems, and machine translation systems.

Conventionally, information retrieval systems, question answering systems, machine translation systems, and the like sometimes require cross-language conversion, that is, conversion (translation) of a source-language word or compound word (hereinafter simply referred to as a word) into a target-language word. When such cross-language conversion is necessary, a database that describes word conversion rules, such as a dictionary, is generally used.

A technique called "transliteration" is also known, which focuses on the characters (symbols) that make up a word and treats word translation (symbol string conversion) as character-by-character conversion (for example, Non-Patent Document 1 and Non-Patent Document 2). With transliteration, even when many kinds of languages must be converted, it is expected that a separate dictionary need not be prepared for each of them.

In the transliteration technique disclosed in Non-Patent Document 1, a probability model indicating the likelihood of a transliteration (the likelihood of a word conversion) is created in advance using symbols that correspond to word pronunciations, and the target-language word (characters) that maximizes this likelihood is obtained from the source-language word (characters). Specifically, when the probability model is created from learning data containing many pairs of corresponding words in the two languages, three probabilities are obtained statistically: the probability that source-language characters are converted into a source-language pronunciation, the probability that the source-language pronunciation changes into a target-language pronunciation, and the probability that the target-language pronunciation is converted into target-language characters. The likelihood of a transliteration is then calculated as the product of these probabilities.

The transliteration technique disclosed in Non-Patent Document 2 transliterates words written in katakana (Japanese) into words written in the alphabet (English). Specifically, a probability model is created in advance that indicates the conversion probability from each character of the romanized katakana spelling to each character of the English word, and transliteration is performed using this model. In this case, the technique uses not only single-character conversions but also conversion probabilities between groups of characters that can be associated with each other, taken from the characters before and after the character of interest.
Non-Patent Document 1: K. Knight et al., “Machine Transliteration”, Computational Linguistics, 1998, vol. 24, no. 4, p. 599-612
Non-Patent Document 2: E. Brill et al., “Automatically Harvesting Katakana-English Term Pairs from Search Engine Query Logs”, in Proceedings of the 6th Natural Language Processing Pacific Rim Symposium, 2001, p. 393-399

However, the transliteration techniques (symbol conversion techniques) described above have the following problems.
With the technique disclosed in Non-Patent Document 1, the probability model cannot be created unless the readings (pronunciations) of both the source-language words and the target-language words in the learning data are known. In addition, for language pairs with different pronunciation systems, it is difficult to establish a correspondence between pronunciations.

With the technique disclosed in Non-Patent Document 2, on the other hand, romanizing katakana makes it easier to establish a correspondence with the alphabet. However, to transliterate between languages other than Japanese and English, additional information that is as effective as pronunciation information is required.

The present invention has been made in view of the above problems, and its object is to provide a technique capable of converting a symbol string belonging to a given symbol system into a corresponding symbol string belonging to an arbitrary symbol system.

To solve the above problem, the symbol string conversion method according to claim 1 is a method performed by a symbol string conversion device that uses the co-occurrence frequency of symbols in a symbol string set, which is a combination of symbol strings that have the same meaning and belong to different symbol systems. The symbol string conversion device executes the steps of: inputting a first symbol string belonging to a first symbol system; estimating a second symbol string that belongs to a second symbol system and corresponds to the input first symbol string, using the co-occurrence frequency and the frequency of the appearance order of symbol sets within the symbol string set; and outputting the estimated second symbol string.

With this procedure, the symbol string conversion device can estimate a second symbol string made up of symbols that are likely to appear in an order corresponding to the order of the symbols in the input first symbol string. The device can therefore output a second symbol string even when the input first symbol string is an unlearned (unknown) symbol string that was not registered in the learning database used to calculate the co-occurrence frequency. When the frequency of the appearance order of symbol sets is used, it is also possible to use a state transition weight generated by, for example, taking the logarithm of the appearance-order probability and inverting its sign. When this state transition weight is used, the symbol string conversion device searches for the second symbol string that minimizes the state transition weight.
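The state transition weight mentioned above can be illustrated with a minimal Python sketch. It assumes the weight is the negative natural logarithm of a probability and that the weight of a whole candidate is the sum of its per-step weights; the function names and the probability values are hypothetical and are not taken from the patent.

```python
import math

def transition_weight(probability: float) -> float:
    """Negative log of a probability: a smaller weight means a more likely step."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(probability)

def path_weight(probabilities) -> float:
    """Total weight of a candidate; adding weights equals multiplying probabilities."""
    return sum(transition_weight(p) for p in probabilities)

# Hypothetical per-step appearance-order probabilities for two candidates.
candidates = {
    "candidate_a": [0.5, 0.4, 0.9],
    "candidate_b": [0.6, 0.1, 0.9],
}

# The candidate with the minimum total weight has the maximum product of probabilities.
best = min(candidates, key=lambda name: path_weight(candidates[name]))
print(best, round(path_weight(candidates[best]), 4))
```

Working with summed weights rather than multiplied probabilities avoids numerical underflow for long strings and lets a shortest-path style search be used.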

A symbol conversion probability model creation device, which creates the symbol conversion probability model storing the co-occurrence frequency used by the symbol string conversion device according to claim 1, can use the following symbol conversion probability model creation method. Specifically, the symbol conversion probability model creation device may execute the steps of: based on the first symbol strings and second symbol strings stored as data in a learning database, calculating the degree of association (co-occurrence frequency, chi-square value, etc.) between a first symbol and a second symbol that constitute the first symbol string and the second symbol string, respectively, calculating a virtual degree of association using a virtual empty symbol for the case where either the first symbol or the second symbol has no corresponding symbol, and creating an inter-symbol association database that stores the calculated degrees of association and virtual degrees of association as data; based on the data stored in the learning database and the data stored in the inter-symbol association database, generating symbol string sets, each consisting of two symbol strings associated with each other so that the sum or product of the degrees of association and virtual degrees of association between the symbols associated between the first symbol string and the second symbol string is maximized, and creating a symbol string set database that stores the generated symbol string sets as data; and, referring to the data stored in the symbol string set database, calculating the co-occurrence frequency as the frequency of the appearance order in the symbol string sets, and creating the symbol conversion probability model.
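The alignment step described above, in which symbols are associated so that the total degree of association is maximized and an unmatched symbol is paired with a virtual empty symbol, can be sketched as a small dynamic program. This is only an illustration under stated assumptions: it maximizes the sum (not the product) of association values, it treats unseen pairs as having association 0.0, and the names `EMPTY` and `align` as well as the toy association values are hypothetical.

```python
EMPTY = ""  # hypothetical marker for the virtual empty symbol

def align(source, target, assoc, default=0.0):
    """Return (score, pairs) for the alignment with the largest summed association."""
    n, m = len(source), len(target)
    neg_inf = float("-inf")
    # best[i][j]: best score for aligning source[:i] with target[:j]
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            moves = []
            if i < n and j < m:  # pair a source symbol with a target symbol
                moves.append((i + 1, j + 1, (source[i], target[j])))
            if i < n:            # source symbol with no counterpart
                moves.append((i + 1, j, (source[i], EMPTY)))
            if j < m:            # target symbol with no counterpart
                moves.append((i, j + 1, (EMPTY, target[j])))
            for ni, nj, pair in moves:
                score = best[i][j] + assoc.get(pair, default)
                if score > best[ni][nj]:
                    best[ni][nj] = score
                    back[ni][nj] = (i, j, pair)
    # Walk the back-pointers from the end to recover the aligned symbol pairs.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        i, j, pair = back[i][j]
        pairs.append(pair)
    pairs.reverse()
    return best[n][m], pairs

# Toy association values (illustrative only).
toy_assoc = {("A", "x"): 2.0, ("B", "z"): 2.0, (EMPTY, "y"): 0.5, ("A", "y"): 0.1}
score, pairs = align(["A", "B"], ["x", "y", "z"], toy_assoc)
print(score, pairs)
```

To maximize a product instead of a sum, the same routine could be run on the logarithms of positive association values.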

The symbol string conversion method according to claim 2 is the symbol string conversion method according to claim 1, wherein the step of estimating the second symbol string calculates, based on the co-occurrence frequency, the frequency of the appearance order for each arbitrary symbol string set formed by associating the first symbol string with the second symbol string symbol by symbol, searches for the symbol string set whose appearance-order frequency is the maximum based on the result of this calculation, and estimates the second symbol string using the appearance-order frequency of the symbol string set found by the search.

With this procedure, the symbol string conversion device estimates the symbol string set using an approximation that, among the appearance-order frequencies of the symbol string sets formed by associating the first symbol string with the second symbol string symbol by symbol, considers only the symbol sets whose association gives the maximum appearance-order frequency. The solution search space can therefore be reduced by, for example, pruning the search.

The symbol string conversion method according to claim 3 is the symbol string conversion method according to claim 1 or claim 2, wherein the step of estimating the second symbol string includes the steps of: calculating, based on the co-occurrence frequency, the frequency of the appearance order for each arbitrary symbol string set formed by associating the first symbol string with the second symbol string symbol by symbol, searching for the symbol string set whose appearance-order frequency is the maximum based on the result of this calculation, and creating a database that stores, as data, the appearance-order frequency of the symbol string set found by the search; and searching for the second symbol string with reference to the database.

With this procedure, the symbol string conversion device creates a database that stores, as data, the appearance-order frequencies of the symbol string sets found using the approximation that considers only the symbol sets whose association gives the maximum appearance-order frequency, and searches for the second symbol string with reference to the created database. Therefore, when the input first symbol string is a learned (known) symbol string that was registered in advance in the learning database used to calculate the co-occurrence frequency, the second symbol string registered as its pair in the learning database can be output as the conversion result.

The word translation method according to claim 4 is the symbol string conversion method according to any one of claims 1 to 3 in which the symbol string is a word composed of characters, wherein the symbol string conversion device further executes the steps of: acquiring document data from an electronic device connected to a communication network based on the input word; and extracting, from the acquired document data, a predetermined number of words as conversion candidates for the input word.

With this procedure, the symbol string conversion device functions as a word translation device and converts a first word belonging to an arbitrary language system into the corresponding second word in another language system. The characters used in the language systems are preferably phonetic characters. The symbol string conversion device extracts words from the document data acquired over the communication network as conversion candidates for the first word input for translation. Therefore, even when the input first word is an unlearned (unknown) word that was not registered in the learning database used to calculate the co-occurrence frequency, an existing word extracted via the communication network can be adopted and output as the translation result.

The word translation device according to claim 5 is a word translation device that uses the co-occurrence frequency of characters in a word set, which is a combination of words that have the same meaning and belong to different language systems. The device comprises: input means for inputting a first word belonging to a first language system; word search means for estimating a second word that belongs to a second language system and corresponds to the input first word, using the co-occurrence frequency and the frequency of the appearance order of character sets within the word set; and output means for outputting the estimated second word.

With this configuration, the word translation device can estimate a second word made up of characters that are likely to appear in an order corresponding to the order of the characters in the input first word. For example, the first word may be written in katakana and the second word in the alphabet. The word translation device can output a second word even when the input first word is an unlearned (unknown) word that was not registered in the learning database used to calculate the co-occurrence frequency. When the frequency of the appearance order of character sets is used, it is also possible to use a state transition weight generated by, for example, taking the logarithm of the appearance-order probability and inverting its sign. When this state transition weight is used, the word translation device searches for the second word that minimizes the state transition weight.

The word translation device according to claim 6 is the word translation device according to claim 5, wherein the word search means calculates, based on the co-occurrence frequency, the frequency of the appearance order for each arbitrary word set formed by associating the first word with the second word character by character, searches for the word set whose appearance-order frequency is the maximum based on the result of this calculation, and estimates the second word using the appearance-order frequency of the word set found by the search.

With this configuration, the word translation device estimates the word using an approximation that, among the appearance-order frequencies of the word sets formed by associating the first word with the second word character by character, considers only the character sets whose association gives the maximum appearance-order frequency. The solution search space can therefore be reduced by, for example, pruning the search.

The word translation device according to claim 7 is the word translation device according to claim 5 or claim 6, further comprising database creation means that calculates, based on the co-occurrence frequency, the frequency of the appearance order for each arbitrary word set formed by associating the first word with the second word character by character, searches for the word set whose appearance-order frequency is the maximum based on the result of this calculation, and creates a database that stores, as data, the appearance-order frequency of the word set found by the search, wherein the word search means searches for the second word with reference to the database.

With this configuration, the word translation device creates a database that stores, as data, the appearance-order frequencies of the word sets found using the approximation that considers only the character sets whose association gives the maximum appearance-order frequency, and searches for the second word with reference to the created database. Therefore, when the input first word is a learned (known) word that was registered in advance in the learning database used to calculate the co-occurrence frequency, the second word registered as its pair in the learning database can be output as the translation result.

The word translation device according to claim 8 is the word translation device according to any one of claims 5 to 7, further comprising: document data acquisition means for acquiring document data from an electronic device connected to a communication network based on the first word input to the input means; and conversion candidate extraction means for extracting, from the acquired document data, a predetermined number of words as conversion candidates for the first word.

With this configuration, the word translation device extracts words from the document data acquired over the communication network as conversion candidates for the first word input for translation. Therefore, even when the input first word is an unlearned (unknown) word that was not registered in the learning database used to calculate the co-occurrence frequency, an existing word extracted via the communication network can be adopted and output as the translation result.

The symbol string conversion program according to claim 9 causes a computer to execute the symbol string conversion method according to any one of claims 1 to 3. With this configuration, a computer on which the program is installed can realize each function based on the program.

The word translation program according to claim 10 causes a computer to execute the word translation method according to claim 4. With this configuration, a computer on which the program is installed can realize each function based on the program.

The recording medium according to claim 11 has recorded on it the symbol string conversion program according to claim 9 or the word translation program according to claim 10. With this configuration, a computer in which the recording medium is loaded can realize each function based on the program recorded on the medium.

According to the present invention, a symbol string belonging to a given symbol system can be converted into a corresponding symbol string belonging to an arbitrary symbol system. In particular, the conversion can take into account the appearance order of known symbol conversion results without using information such as pronunciation or romanization rules.

Hereinafter, embodiments of the present invention will be described with reference to the drawings as appropriate.
[Configuration of word translation system]
(First embodiment)
FIG. 1 is a diagram illustrating a configuration example of a word translation system including a word translation device according to the first embodiment of the present invention. The word translation system (symbol string conversion system) 1 converts an input first word into a second word and outputs it, using a transliteration probability model that stores, as data, the co-occurrence probability (co-occurrence frequency) of the characters that respectively constitute the first word (symbol string), which is the conversion source character string, and the second word, which is the conversion destination character string corresponding to the first word. Here, the first word is composed of a plurality of first characters belonging to a first language system. Similarly, the second word is composed of a plurality of second characters belonging to a second language system. The co-occurrence probability is the probability that the appearance of a first character and the appearance of a second character as the conversion result of that first character occur together. In the following, the first word may be called the source word, the first character the source character, the second word the target word, and the second character the target character.

As shown in FIG. 1, the word translation system 1 includes a storage device 2, a storage device 3, a transliteration probability model creation device (symbol conversion probability model creation device) 4, and a word translation device (symbol string conversion device) 5.
The storage device 2 stores a learning database 6 and is a storage means such as a general hard disk.
The learning database 6 is a collection of pairs of source words and target words.

The storage device 3 stores a transliteration probability model (symbol conversion probability model) 7 and is a storage means such as a general hard disk.
The transliteration probability model 7 stores, as data, the transliteration probability from a source character to a target character in the form of the co-occurrence probability of the source character and the target character.

The transliteration probability model creation device (symbol conversion probability model creation device) 4 and the word translation device (symbol string conversion device) 5 are general-purpose computers, each including, for example, a CPU (Central Processing Unit), RAM (Random Access Memory), ROM (Read Only Memory), an HDD (Hard Disk Drive), a KB/CRT (Key Board/Cathode Ray Tube), and an input/output interface.

The transliteration probability model creation device (symbol conversion probability model creation device) 4 obtains the correspondence between source characters and target characters based on the learning database 6, and creates the transliteration probability model 7 by modeling the transliteration probability between a source character and a target character as an N-gram model, in which the probability is determined by taking the immediately preceding (N-1) transliteration results into account.
The word translation device (symbol string conversion device) 5 takes one source word as input and outputs the target word corresponding to that source word, using the transliteration probability model 7.
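The N-gram modeling performed by the transliteration probability model creation device 4 can be illustrated for the simplest case, N = 2. The following Python sketch assumes that each training word has already been aligned into a sequence of (source character, target character) pairs and estimates the probability of each pair given the previous pair by relative frequency, without smoothing. The names `START`, `train_bigram`, and `sequence_probability` and the toy data are hypothetical, and the patent text shown here does not specify these details.

```python
from collections import Counter, defaultdict

START = ("<s>", "<s>")  # hypothetical start-of-word marker

def train_bigram(aligned_words):
    """Estimate P(pair | previous pair) by relative frequency (no smoothing)."""
    counts = defaultdict(Counter)
    for pairs in aligned_words:
        prev = START
        for pair in pairs:
            counts[prev][pair] += 1
            prev = pair
    model = {}
    for prev, following in counts.items():
        total = sum(following.values())
        model[prev] = {pair: c / total for pair, c in following.items()}
    return model

def sequence_probability(model, pairs):
    """Probability of a whole aligned word under the bigram model."""
    prob, prev = 1.0, START
    for pair in pairs:
        prob *= model.get(prev, {}).get(pair, 0.0)  # unseen transitions get 0.0
        prev = pair
    return prob

# Toy aligned training data (illustrative only).
aligned_words = [
    [("a", "x"), ("b", "y")],
    [("a", "x"), ("b", "z")],
    [("c", "x"), ("b", "y")],
]
model = train_bigram(aligned_words)
print(sequence_probability(model, [("a", "x"), ("b", "y")]))
```

Because each pair is conditioned on the preceding pair, the same source character can receive different target characters depending on its context, which is the effect the N-gram formulation is meant to capture.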

[Configuration of transliteration probability model creation device]
FIG. 2 is a functional block diagram illustrating a configuration example of the transliteration probability model creation device shown in FIG. 1.
As shown in FIG. 2, the transliteration probability model creation device 4 includes input means 10, storage means (RAM, etc.) 11, inter-character association database creation means (inter-symbol association database creation means) 12, word set database creation means (symbol string set database creation means) 13, occurrence probability calculation means 14, and writing means 15.

The input means 10 is an input interface that reads source characters and target characters from the learning database 6 and outputs them to the inter-character relevance database creation means 12 and the word set database creation means 13. The input means 10 also receives instructions, such as a database creation instruction, from the input device M. The input device M is, for example, a keyboard or a pointing device such as a mouse.
The storage means 11 includes RAM, ROM, and an HDD, and stores the inter-character relevance database (inter-symbol relevance database) 16 and the word set database (symbol string set database) 17 on the HDD.

The inter-character relevance database 16 stores, as data, the statistical relevance between source characters and target characters. Here, the relevance Assoc(s, t) is a measure of how readily the target character t appears as a transliteration candidate for the source character s. For example, Assoc(s, t) is high when a target word T₀ corresponding to a source word S₀ that contains s contains t frequently, or when a target word T₁ that does not correspond to S₀ rarely contains t. Concretely, the relevance can be the co-occurrence frequency, the chi-square value used in statistical tests, or φ², which is the chi-square value normalized to the range 0 to 1.
The word set database 17 stores, as data, word sets each consisting of a pair of words (a source word and a target word) whose characters are aligned so that the product of the relevance values of the aligned characters is maximized.
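As an illustration of one of the measures named above, the following minimal sketch computes φ² for a character pair from a 2x2 contingency table over word pairs. The counting unit (whether a word contains the character at all) and the function names `phi_squared` and `assoc_phi2` are assumptions made for this example; the source does not specify how the counts are taken.

```python
def phi_squared(n_both, n_source_only, n_target_only, n_neither):
    """phi^2 = chi^2 / N for a 2x2 contingency table; the result lies in [0, 1]."""
    a, b, c, d = n_both, n_source_only, n_target_only, n_neither
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    return (a * d - b * c) ** 2 / denom


def assoc_phi2(word_pairs, s, t):
    """Assumed Assoc(s, t): phi^2 over word pairs, counting presence of s and t."""
    a = b = c = d = 0
    for source_word, target_word in word_pairs:
        has_s, has_t = s in source_word, t in target_word
        if has_s and has_t:
            a += 1
        elif has_s:
            b += 1
        elif has_t:
            c += 1
        else:
            d += 1
    return phi_squared(a, b, c, d)


# Toy learning data (illustrative only, not from the source).
pairs = [("アイス", "ice"), ("クリーム", "cream"), ("ミルク", "milk"), ("ケーキ", "cake")]
score = assoc_phi2(pairs, "ミ", "m")
```

A high value means that s and t tend to occur in corresponding words and tend to be absent together, which matches the qualitative description of Assoc(s, t) above.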

The inter-character relevance database creation means 12 calculates the statistical relevance between source characters and target characters from the data stored in the learning database 6. When a source character or a target character has no counterpart, it also calculates a virtual relevance using a virtual empty character φ. It then creates the inter-character relevance database 16 from the calculated relevance and virtual relevance values. This embodiment uses two kinds of virtual relevance: Assoc(s, φ_t), for the case where the source character s corresponds to no target character, and Assoc(φ_s, t), for the case where the target character t corresponds to no source character.

The word set database creation means 13 uses the data stored in the learning database 6 and in the inter-character relevance database 16 to generate word sets, each consisting of a source word and a target word whose characters are aligned so that the product of the relevance and virtual relevance values of the aligned characters is maximized (that is, the alignment is optimal), and creates the word set database 17 from the generated word sets.

A concrete example of how the word set database 17 is created is described with reference to FIG. 3. FIG. 3 is an explanatory diagram showing the steps up to the creation of the word set database shown in FIG. 2. Here, the first language system is Japanese (katakana) and the second language system is English (alphabet).

As shown in FIG. 3(a), assume "アイスクリーム" as the first word (source word) S and "ice cream" as the second word (target word) T. As this example shows, a word is not limited to a single word and may be a compound consisting of several words (for example, "ice cream").

The first word S consists of m first characters (source characters s₁, s₂, …, s_m). For "アイスクリーム", m = 7, so seven character IDs (s₁ to s₇) are assigned, as shown in FIG. 3(b).

Similarly, the second word T consists of n second characters (target characters t₁, t₂, …, t_n). For "ice cream", n = 8, so eight character IDs (t₁ to t₈) are assigned, as shown in FIG. 3(c). Here the strings are joined with the space ignored, but the space may instead be treated as a character. The same applies to other symbols, such as the underscore.

Comparing FIG. 3(b) with FIG. 3(c), the numbers of characters differ (m < n). In this embodiment, characters correspond one-to-one (one first character to one second character), and an empty character is needed wherever no counterpart exists. When the first word S and the second word T are optimally aligned character by character, two empty characters φ (character ID "φ_s") are inserted into the first word, as shown in FIG. 3(d), and one empty character φ (character ID "φ_t") is inserted into the second word, as shown in FIG. 3(e). After this optimal alignment, both words have the same number of characters. Denoting this number by l (lowercase L), l ≥ m and l ≥ n hold in general. In this example, l = 9.

From the words with empty characters shown in FIG. 3(d) and FIG. 3(e), a word set is generated as shown in FIG. 3(f). Let the alignment A between the characters of the two languages in this word set be A = a₁, a₂, …, a_l. Each element of A, that is, each character pair ID, is written a_i = (s_j, t_k), where s_j is one of s₁, …, s_m or φ_s, and t_k is one of t₁, …, t_n or φ_t.

In this embodiment, the optimal alignment keeps the characters s₁, …, s_m and t₁, …, t_n in their original order (the order before the empty characters were inserted). In other words, for a_i = (s_j, t_k) and a_I = (s_J, t_K) with I > i, the relations J > j and K > k hold. Concretely, as shown in FIG. 6(f), compare the character pairs (イ, i) and (ス, c) with character pair IDs "a₂" and "a₃": on the first-word (source) side, "イ" precedes "ス" as in the original "アイスクリーム", and on the second-word (target) side, "i" precedes "c" as in the original "ice cream". The alignment therefore preserves the order.

In contrast, suppose that, as in FIG. 6(g), the character pairs with IDs "a₂" and "a₃" were (イ, e) and (ス, c). The order of "イ" and "ス" on the source side matches the original "アイスクリーム", but the order of "e" and "c" on the target side is reversed relative to the original "ice cream"; this alignment changes the order. In this embodiment, the optimal alignment excludes alignments such as FIG. 6(g) and produces alignments such as FIG. 6(f).

Based on Expression (1), the word set database creation means 13 finds the alignment Â (A-hat) that maximizes the product of the relevance and virtual relevance values between characters (the optimal alignment). In Expression (1), Assoc(a_i) is the relevance between the source character and the target character of the character pair a_i under a given alignment A, and "argmax_A (y), where y = f(A)" denotes the A at which y is maximal.
Alternatively, the word set database creation means 13 may find the optimal alignment Â based on Expression (2). In that case, it finds the alignment that maximizes the sum of the relevance and virtual relevance values between characters.
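The order-preserving alignment that maximizes the product of relevance values can be found by dynamic programming. The sketch below is one possible realization, not the method stated in the source: it maximizes the sum of log relevance values (equivalent to maximizing the product), represents the empty character φ by an empty string, and uses a hypothetical `floor` value for pairs missing from the relevance table. The names `best_alignment`, `NULL`, and `floor` are assumptions.

```python
import math

NULL = ""  # stands in for the virtual empty character (phi_s or phi_t)


def best_alignment(source, target, assoc, floor=1e-6):
    """Return the order-preserving list of (s, t) pairs maximizing prod Assoc."""
    def score(s, t):
        # Log of the relevance; unseen pairs get a small assumed floor value.
        return math.log(max(assoc.get((s, t), 0.0), floor))

    m, n = len(source), len(target)
    # best[j][k]: best log score for aligning source[:j] with target[:k]
    best = [[-math.inf] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0.0
    for j in range(m + 1):
        for k in range(n + 1):
            moves = []
            if j > 0 and k > 0:  # pair (s_j, t_k)
                moves.append((best[j - 1][k - 1] + score(source[j - 1], target[k - 1]), (j - 1, k - 1)))
            if j > 0:  # pair (s_j, phi_t): source character with no counterpart
                moves.append((best[j - 1][k] + score(source[j - 1], NULL), (j - 1, k)))
            if k > 0:  # pair (phi_s, t_k): target character with no counterpart
                moves.append((best[j][k - 1] + score(NULL, target[k - 1]), (j, k - 1)))
            for value, prev in moves:
                if value > best[j][k]:
                    best[j][k], back[j][k] = value, prev

    # Trace back from the end of both words to recover the pairs a_1 ... a_l.
    pairs, j, k = [], m, n
    while (j, k) != (0, 0):
        pj, pk = back[j][k]
        s = source[pj] if pj < j else NULL
        t = target[pk] if pk < k else NULL
        pairs.append((s, t))
        j, k = pj, pk
    pairs.reverse()
    return pairs


# Toy relevance table (illustrative values only).
toy_assoc = {("ミ", "m"): 0.9, (NULL, "i"): 0.5, ("ル", "l"): 0.9}
alignment = best_alignment("ミル", "mil", toy_assoc)
```

Because every move advances in the source, the target, or both, the recovered pairs never reorder characters, which corresponds to the order constraint described above.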

Returning to FIG. 2, the description of the configuration example of the transliteration probability model creation device 4 continues.
The occurrence probability calculation means 14 refers to the data stored in the word set database 17, calculates, as the co-occurrence probability, the probability of the order of appearance (frequency of the order of appearance) of the character pairs of source and target characters in the source and target words of each word set, and creates the transliteration probability model 7. Here, the probability of the order of appearance is a conditional probability conditioned on a history, that is, the sequence of state transitions of the characters up to the appearance of the source or target character of interest. In other words, the occurrence probability calculation means 14 creates the transliteration probability model 7 by expressing the probability that a source character and the target character that is its transliteration occur together, using the history of the (N-1) source characters immediately preceding that source character and the history of the (N-1) target characters immediately preceding that target character. For example, with the source character (character ID "s_j") and target character (character ID "t_k") described with reference to FIG. 3, the probability (co-occurrence probability) P(a_i) that the character pair (character pair ID "a_i") appears in an aligned word set can be expressed as a probability conditioned on the immediately preceding (N-1) character pairs (a_{i-1}, …, a_{i-N+1}). N is the "N" of an N-gram language model. Hereinafter, "probability" alone means the co-occurrence probability.

The occurrence probability calculation means 14 therefore calculates the conditional probability P(a_i | a_{i-1}, …, a_{i-N+1}) using the word set database 17. If N is set to a large value, most conditional probabilities become 0 and the probability model generalizes poorly, so the occurrence probability calculation means 14 performs smoothing using the probability values obtained with relatively small N (for example, 1, 2, or 3). Because the probability conditioned on the immediately preceding (N-1) characters then never becomes 0, a nonzero probability can be assigned to any transliteration result. Any known smoothing technique for N-gram language models used in natural language processing and speech recognition can be applied (see, for example, Kenji Kita, "Probabilistic Language Models" (確率的言語モデル), Language and Computation 4, University of Tokyo Press, 1999, Chapter 3).
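The source does not say which smoothing technique is used. As one possibility only, the sketch below linearly interpolates bigram (N = 2) and unigram (N = 1) relative frequencies over aligned character-pair sequences. The function name `train_interpolated_bigram`, the start marker `BOS`, and the interpolation weight `lam` are assumptions made for illustration.

```python
from collections import Counter

BOS = ("<s>", "<s>")  # hypothetical start-of-word marker used as initial history


def train_interpolated_bigram(aligned_words, lam=0.7):
    """Estimate P(a_i | a_{i-1}) as lam * bigram + (1 - lam) * unigram."""
    unigrams, bigrams, contexts = Counter(), Counter(), Counter()
    for word in aligned_words:  # each word is a list of (source, target) pairs
        prev = BOS
        for pair in word:
            unigrams[pair] += 1
            bigrams[(prev, pair)] += 1
            contexts[prev] += 1
            prev = pair
    total = sum(unigrams.values())

    def prob(pair, prev):
        p_uni = unigrams[pair] / total if total else 0.0
        p_bi = bigrams[(prev, pair)] / contexts[prev] if contexts[prev] else 0.0
        return lam * p_bi + (1 - lam) * p_uni

    return prob


# Two toy aligned words (illustrative only); "" stands for the empty character.
toy_words = [
    [("ミ", "m"), ("", "i"), ("ル", "l")],
    [("ミ", "m"), ("ル", "l")],
]
prob = train_interpolated_bigram(toy_words)
p_seen = prob(("ル", "l"), ("ミ", "m"))      # history seen in training
p_backoff = prob(("ミ", "m"), ("ル", "l"))   # history never seen as a context
```

With this scheme a pair that was observed at least once receives a nonzero probability after any history. A pair never observed at all would still get 0, so a full implementation would need an additional floor or a different smoothing method.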

The writing means 15 writes the probability values calculated by the occurrence probability calculation means 14 to the storage device 3 (see FIG. 1) as the transliteration probability model 7.

The inter-character relevance database creation means 12, the word set database creation means 13, and the occurrence probability calculation means 14 described above are realized by the CPU loading a predetermined program stored in the ROM or the like of the storage means 11 into the RAM and executing it.

[Configuration of the word translation device]
FIG. 4 is a functional block diagram showing a configuration example of the word translation device shown in FIG. 1.
Based on the transliteration probability model 7 created by the transliteration probability model creation device 4, the word translation device 5 translates (converts) a source word input from the input device M into a target word by transliterating its source characters into target characters, and outputs the resulting target word to the output device D.

(Translation principle)
The principle of translation (symbol string conversion) in the word translation device 5 is described here using mathematical expressions. In this description, "target word T" covers both the correctly translated word that corresponds one-to-one to the source word S and words similar to it; it is, so to speak, a target word candidate.

The co-occurrence probability with which an input source word S and a target word T (which includes the correct translation result) appear as a word set under the transliteration probability model 7 differs depending on the alignment A between the source characters of S and the target characters of T. The co-occurrence probability P(S, T, A) of the input source word S and the target word T can be expressed as a product of conditional probabilities, as shown in Expression (3).

Because many alignments A are possible, the final probability P(S, T) that the source word S and the target word T appear as a word pair transliterated under the transliteration probability model 7, taking all of them into account, is given by Expression (4).

According to Expression (4), obtaining P(S, T) exactly requires summing the probability values over every alignment A. Considering all alignments A makes the computation enormous and is impractical. To reduce the amount of computation, this embodiment introduces the following approximation: as shown in Expression (5), the alignment A that maximizes the probability value is adopted as the optimal alignment A′.

The approximation then considers only the optimal alignment A′ given by Expression (5). The well-known Viterbi algorithm can be used for this computation. With this approximation, the probability P(S, T) of Expression (4) is approximated as in Expression (6). Expression (3) is used in the actual computation of Expression (6).
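The expressions referred to above are not reproduced in this text. Reconstructed from the surrounding definitions only (so the exact form in the original may differ, for instance in how the start of a word is handled), Expressions (3) to (6) can plausibly be written as:

```latex
\begin{align}
P(S, T, A) &= \prod_{i=1}^{l} P(a_i \mid a_{i-1}, \dots, a_{i-N+1}) \tag{3} \\
P(S, T) &= \sum_{A} P(S, T, A) \tag{4} \\
A' &= \operatorname*{argmax}_{A} P(S, T, A) \tag{5} \\
P(S, T) &\approx P(S, T, A') \tag{6}
\end{align}
```

Under this reading, replacing the sum in (4) by its largest term is what allows a Viterbi-style search: only the single best alignment has to be tracked for each candidate target word.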

To find the optimal target word T′ for the source word S, the word translation device 5 considers arbitrary character alignments with arbitrary target words T and searches among them for the one satisfying Expression (6); the solution search space can therefore be reduced by techniques such as search pruning.

To carry out the search of Expression (6), that is, to find the optimal target word T′, this embodiment performs an efficient search using a well-known finite state machine called a weighted finite state transducer (WFST) (see Non-Patent Document 1). In a WFST, weights for state transitions are defined in advance, and the transducer takes a sequence of source characters as input and outputs a sequence of target characters.

The functions of several finite state machines can be integrated by composing several WFSTs (see Non-Patent Document 1). That is, the translation from the source word S to the target word T may be realized by composing a first WFST, which translates the source word S into a word I of an intermediate language (a language that is neither the language of S nor the language of T), with a second WFST, which translates the intermediate-language word I into the target word T. With this configuration, even when no learning database is available for transliteration directly between the language of S and the language of T, the first and second WFSTs can be created from a learning database for transliteration between the language of S and the intermediate language and a learning database for transliteration between the intermediate language and the language of T. The number of intermediate languages used in the conversion from S to T is not limited to one; several intermediate languages may be interposed, provided learning databases for the corresponding transliterations exist.

Specifically, in this embodiment the word translation device 5 implements one kind of WFST as a pair of one WFST database and a WFST search program, as described below. Several kinds of WFST that can be composed through intermediate languages may be used instead, in which case several kinds of transliteration probability model 7 are used.

(Concrete configuration example)
To realize the translation (symbol string conversion) principle described above, the word translation device 5 includes, as shown in FIG. 4, input means (first input means) 21, storage means 22, state transition information database creation means (database creation means) 23, word search means 24, output means (first output means) 25, and a state transition information database 26.

The input means (first input means) 21 is an input interface that receives a source word (first word) from the input device M and outputs it to the state transition information database creation means 23 and the word search means 24. The input means 21 also reads source character strings and target character strings from the transliteration probability model 7 and outputs them to the state transition information database creation means 23.
The storage means 22 includes RAM, ROM, and an HDD, and stores the state transition information database 26 on the HDD.

The state transition information database 26 corresponds to the WFST database mentioned above. It stores, as data, a weight (state transition weight) corresponding to the probability of the order of appearance of the character pairs in a word set, in which a source word and the target word aligned with it are associated character by character, together with the transition source state and the transition destination state. The probability of the order of appearance itself may be stored instead of the weight.
Specifically, the state transition information database 26 holds, as input-side data, the source character sequences of the word sets stored in the transliteration probability model 7. As output-side data, it holds the target character sequences of those word sets and, as state transition weights, the weights of the probability values of Expression (6).

The state transition information database creation means (database creation means) 23 creates the state transition information database 26 by referring to the data stored in the transliteration probability model 7. It calculates, as the state transition weight, a weight corresponding to the probability obtained with the approximation (corresponding to Expression (6)) that considers only the word set maximizing the probability of the order of appearance of the character pairs, among word sets consisting of a source word and a target word aligned with it. The probability values needed to calculate the probability of the order of appearance are obtained in advance.

The state transition weight calculated by the state transition information database creation means 23 is described here. The condition a_{i-1}, …, a_{i-N+1} in the conditional probability P(a_i | a_{i-1}, …, a_{i-N+1}) of Expression (3) is called the history. The history is a sequence of state transitions, one for each character pair. Consider the i-th character pair a_i = (s_j, t_k). It corresponds to a state transition that takes the source character s_j as input and outputs the target character t_k. Before a_i appears, the transducer has passed through the sequence of state transitions for the immediately preceding (N-1) character pairs a_{i-1}, …, a_{i-N+1}. The state transition corresponding to a_i is therefore given, as its state transition weight, a weight corresponding to the conditional probability P(a_i | a_{i-1}, …, a_{i-N+1}). Here the state transition weight is the negated logarithm of the conditional probability, that is, -log P(a_i | a_{i-1}, …, a_{i-N+1}). The base of the logarithm is, for example, 2.
When the source character s_j is the empty character φ (character ID "φ_s"), the transition is realized as a state transition made independently of the input source characters (called an ε transition). When the target character t_k is the empty character φ (character ID "φ_t"), it is realized as a state transition with no output.
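The weight definition above can be sketched as follows, using base 2 as in the example given. The arc layout (source state, destination state, input symbol, output symbol, weight) and the names `transition_weight`, `path_weight`, `make_arc`, and `EPS` are assumptions for illustration; the source does not describe the storage format of the state transition information database 26.

```python
import math

EPS = None  # assumed marker for "no input" (epsilon) or "no output" on an arc


def transition_weight(prob):
    """State transition weight: negated base-2 logarithm of the probability."""
    return -math.log2(prob)


def path_weight(probs):
    """Sum of weights along a path; equals -log2 of the product of the probabilities."""
    return sum(transition_weight(p) for p in probs)


def make_arc(src_state, dst_state, pair, prob):
    """Build one arc from a character pair; an empty string stands for phi."""
    s, t = pair
    return (src_state, dst_state, s or EPS, t or EPS, transition_weight(prob))


# Example: a pair (phi_s, "i") becomes an epsilon-input arc that outputs "i".
arc = make_arc(0, 1, ("", "i"), 0.5)
total = path_weight([0.5, 0.25])
```

Because the logarithm turns products into sums, the path with the smallest total weight is the path with the largest product of conditional probabilities, which is why the search selects the minimum-weight target character string.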

The word search means 24 estimates the second word corresponding to the input first word and corresponds to the WFST search program mentioned above. Referring to the data stored in the state transition information database 26, it searches for (estimates) the optimal target word T that satisfies Expression (6) for the input source word S and outputs it to the output means 25. Specifically, the word search means 24 feeds the source characters s₁, …, s_m of the input source word S in order as input-side data of the state transition information database 26, searches the output-side data while also taking ε transitions into account, and selects, among the character sequences (target character strings) corresponding to the output-side data found, the target character string with the smallest state transition weight. In this embodiment the word search means 24 selects the single target character string with the smallest state transition weight, but it may instead select several, in which case the top few target words are output as conversion candidates.
The word search means 24 may also output the state transition weight together with the target word (target character string). The weight can then be used as a conversion possibility indicating how plausible the translation (symbol string conversion) between the target word and the input source word is.

The state transition information database creation means 23 and the word search means 24 described above are realized by the CPU loading a predetermined program stored in the ROM or the like of the storage means 22 into the RAM and executing it.
The output means (first output means) 25 is an output interface to the output device D and outputs the target word found by the word search means 24 to the output device D. The output device D is, for example, a display device such as a liquid crystal display.

[Operation of the transliteration probability model creation device]
The operation of the transliteration probability model creation device 4 will be described with reference to FIG. 5 (and FIG. 2 as appropriate).
FIG. 5 is a flowchart showing the operation of the transliteration probability model creation device shown in FIG. 2.
Using the inter-character relevance database creation means 12, the transliteration probability model creation device 4 calculates the relevance between source characters and target characters (symbols) based on the data stored in the learning database 6, and creates the inter-character relevance database 16 (step S1).
Next, using the word set database creation means 13, the transliteration probability model creation device 4 generates, based on the data stored in the learning database 6 and the data stored in the inter-character relevance database 16, the word (symbol string) sets for which the product of the relevance values is maximized, and creates the word set database 17 (step S2).
Next, using the occurrence probability calculation means 14, the transliteration probability model creation device 4 calculates, based on the data stored in the word set database 17, the joint occurrence probability of the characters in each word (source word and target word) of a word set as a conditional probability conditioned on the history, and creates the transliteration probability model (symbol conversion probability model) 7 (step S3).
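Step S3 can be illustrated with a small sketch that estimates the probability of each aligned character pair conditioned on its history. The sketch assumes a history of exactly one preceding pair and plain relative-frequency estimates without smoothing; the history length, the marker `START`, the toy data `ALIGNED`, and the function name are all hypothetical and are not taken from the description.

```python
from collections import Counter

START = ("<s>", "<s>")  # hypothetical start-of-word marker

def train_pair_bigrams(aligned_words):
    """Estimate P(pair | previous pair) by relative frequency (no smoothing)."""
    history_counts = Counter()
    bigram_counts = Counter()
    for pairs in aligned_words:
        prev = START
        for pair in pairs:
            history_counts[prev] += 1
            bigram_counts[(prev, pair)] += 1
            prev = pair
    return {
        (prev, pair): count / history_counts[prev]
        for (prev, pair), count in bigram_counts.items()
    }

# Toy aligned word sets; "" stands for the empty character.
ALIGNED = [
    [("ド", "d"), ("", "o"), ("ナ", "n"), ("", "a")],
    [("ド", "d"), ("", "o"), ("ル", "l")],
]
MODEL = train_pair_bigrams(ALIGNED)
```

In this toy data the pair ("", "o") is followed once by ("ナ", "n") and once by ("ル", "l"), so each continuation receives probability 0.5.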

[Operation of the word translation device]
The operation of the word translation device 5 will be described with reference to FIG. 6 (and FIG. 4 as appropriate).
FIG. 6 is a flowchart showing the operation of the word translation device shown in FIG. 4.
Using the state transition information database creation means 23, the word translation device 5 calculates, based on the transliteration probability model (symbol conversion probability model) 7, the state transition weights corresponding to the conditional probabilities of the characters (symbols) for the source characters and target characters that respectively make up the source word and target word of each word set, and creates the state transition information database 26 in advance (step S11).
Then, with the state transition information database 26 created in advance, the word translation device 5 uses the input means 21 to receive from the input device M the first word (symbol string) to be translated, as the source word (step S12).
Next, based on the state transition information database 26 created in advance in step S11, the word translation device 5 uses the word search means 24 to search for the second word (symbol string) as the target word corresponding to the first word (source word) (step S13).
Next, the word translation device 5 outputs the second word (target word) found by the search as the translation result (step S14). The output device D thereby displays the target word. The word translation device 5 may also output the value of the state transition weight together with the target word.

According to the first embodiment, a word belonging to a given language system (the first word) can be converted into the corresponding word belonging to an arbitrary language system (the second word). In the word translation system 1, the transliteration probability model creation device 4 creates the transliteration probability model 7 using only the set of pairs of first and second words registered in the learning database 6, without using information such as pronunciation or romanization rules. Consequently, the word translation device 5 that uses the transliteration probability model 7 can perform translation that takes the history of known symbol conversion results into account, without having to handle words of unknown pronunciation, establish correspondences between pronunciations, or rely on knowledge for notation conversion such as romanization. As a result, words not covered by a translation dictionary can be handled in translation processing in, for example, a retrieval system for English (alphabetic, phonographic) documents that takes Japanese katakana (phonographic characters) as queries, a question answering system of the same kind, or a machine translation system. The word translation device 5 has been described in its best mode, which includes the state transition information database creation means 23 and the state transition information database 26, but these are not essential components.

(Second Embodiment)
FIG. 7 is a diagram showing a configuration example of a word translation system including a word translation device according to the second embodiment of the present invention.
The word translation system (symbol string conversion system) 1A receives as input a first word (source word) and one or more third words, which are candidate words for conversion of the source word into the second word (target word). The word translation system 1A is the same as the word translation system 1 shown in FIG. 1 except that it includes a word translation device (symbol string conversion device) 5A. For convenience of explanation, identical components are therefore given identical reference numerals, and their description and drawings are omitted as appropriate.

As shown in FIG. 7, the word translation device (symbol string conversion device) 5A includes a word output unit 5 and a conversion possibility calculation unit 30.
The word output unit 5 refers to the word translation device 5 shown in FIG. 4 (first embodiment), and is given the same reference numeral.
The conversion possibility calculation unit 30 outputs, as a probability value, the conversion possibility indicating how plausible the translation (symbol string conversion) is between a third word input from outside the word translation device 5A and the input source word.

The conversion possibility calculation unit 30 takes both the source word (first word) S and a third word as input, and has the function of calculating and outputting, as the probability for that pair, the weight obtained under the alignment that maximizes the product of the probabilities given by equation (6) above. In other words, the conversion possibility calculation unit 30 calculates the conversion possibility (likelihood) of the third word from the source word S based on the transliteration probability model 7. To do so, it uses the finite state machine created by the state transition information database creation means 23 (see FIG. 4).
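The link between maximizing a product of probabilities and minimizing a total weight can be written out explicitly. The relation below assumes that each state transition weight is the negative logarithm of the corresponding conditional probability, which is the usual convention for weighted finite state machines; this passage does not restate the definition, so the negative-log form and the symbols $c_i$ (the i-th character pair) and $h_i$ (its history) are assumptions introduced for illustration.

```latex
w_i = -\log P(c_i \mid h_i),
\qquad
\operatorname*{arg\,max} \prod_i P(c_i \mid h_i)
  \;=\; \operatorname*{arg\,min} \sum_i w_i
```

Under this assumption, a total weight $W$ corresponds to the probability $e^{-W}$, so either quantity can be reported as the conversion possibility.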

[Configuration of the conversion possibility calculation unit]
FIG. 8 is a functional block diagram showing a configuration example of the conversion possibility calculation unit shown in FIG. 7.
As the method by which the conversion possibility calculation unit 30 searches for the optimal state transition sequence, the present embodiment uses a weighted finite state automaton (WFSA) obtained by composing the state transition information database 26 (WFST database) with a finite state automaton (FSA) that accepts the target character string making up the target word T. In the present embodiment, this WFSA specifically consists of a WFSA database and a WFSA search program.
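The composition of a weighted transducer with an automaton that accepts a single target string can be sketched as below. This is a minimal illustration, not the embodiment's data format: the tuple layout of a transition, the single-state toy transducer, and the function `compose_with_target` are hypothetical, and the accepting automaton is represented implicitly by a position in the target string.

```python
from collections import deque

EPS = ""  # empty symbol (epsilon)

def compose_with_target(transitions, start, target):
    """Compose a weighted transducer with a chain automaton accepting `target`.

    transitions: list of (state, in_sym, out_sym, weight, next_state).
    Returns weighted-automaton arcs ((state, pos), in_sym, weight, (state', pos')),
    where pos counts how many target characters have been matched so far.
    """
    by_state = {}
    for q, a, b, w, r in transitions:
        by_state.setdefault(q, []).append((a, b, w, r))
    composed = []
    seen = {(start, 0)}
    queue = deque([(start, 0)])
    while queue:
        q, i = queue.popleft()
        for a, b, w, r in by_state.get(q, []):
            if b == EPS:                               # emits nothing
                nxt = (r, i)
            elif i < len(target) and b == target[i]:   # emits the next target char
                nxt = (r, i + 1)
            else:                                      # output not accepted: drop
                continue
            composed.append(((q, i), a, w, nxt))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return composed

# Toy transducer with one state; weights are arbitrary illustrative values.
TOY_WFST = [
    (0, "ド", "d", 0.5, 0),
    (0, "ド", "t", 1.2, 0),
    (0, EPS, "o", 0.3, 0),
]
ARCS = compose_with_target(TOY_WFST, 0, "do")
```

Only the transitions whose output matches the target string survive, so the result is an automaton over source symbols whose path weights score that one target string.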

To realize the WFSA described above, the conversion possibility calculation unit 30 includes, as shown in FIG. 8, an input means (second input means) 31, a storage means 32, a composite state transition information database creation means 33, a state transition weight calculation means 34, and an output means (second output means) 35.

The input means (second input means) 31 is an input interface. It receives from the input device M one or more third words, which are candidate words for conversion of the source word (first word) into the target word (second word), and outputs them to the state transition weight calculation means 34. The input means 31 also receives the state transition information database 26 from the word output unit 5 and outputs it to the composite state transition information database creation means 33.

The storage means 32 includes a RAM, a ROM, and an HDD, and stores the composite state transition information database 36 on the HDD.
The composite state transition information database 36 corresponds to the WFSA database described above, and stores as its data the result of composing the history of the third characters that make up a third word expected as input with the data stored in the state transition information database 26. Although FIG. 8 shows only one composite state transition information database 36, in the second embodiment each third word expected as input is composed with the state transition information database 26, so that as many composite state transition information databases are created in advance as there are third words expected as input.

The composite state transition information database creation means 33 composes the history of the third characters that make up a third word expected as input with the data stored in the state transition information database 26 of the word output unit 5, and creates the composite state transition information database 36 with the result as its data. The composite state transition information database creation means 33 creates the FSA required for the composition from the third word received from the input means 31. Alternatively, it may receive an FSA created in advance from the input means 31 and perform the composition with that.

The state transition weight calculation means 34 corresponds to the WFSA search program described above. Referring to the data stored in the composite state transition information database 36, the state transition weight calculation means 34 calculates the state transition weight described above as the probability of the order of appearance of the character pairs formed from the source characters (first characters) making up the first word input to the input means 21 and the third characters making up the third word input to the input means 31. The conditional probability itself may be calculated instead of the state transition weight.
Specifically, the state transition weight calculation means 34 takes the third characters making up the third word, in order, as the input-side data of the composite state transition information database 36, into which the history of the third characters of that third word has been composed as an FSA, and, also taking ε transitions into account, calculates the total of the state transition weights from the source characters (first characters) to the third characters. This calculation is executed for each composite state transition information database 36 corresponding to an input third word. The third word whose total is the minimum among the plurality of input third words is then found, and that minimum value is output to the output means 35 as the conversion possibility.

The composite state transition information database creation means 33 and the state transition weight calculation means 34 described above are realized by the CPU loading a predetermined program stored in the ROM or the like of the storage means 32 into the RAM and executing it.

The output means (second output means) 35 is an output interface to the output device D, and outputs to the output device D the state transition weight (or probability) of the third word selected by the state transition weight calculation means 34. The state transition weights that are output may be the individual values or their total.

[Operation of the conversion possibility calculation unit]
The operation of the conversion possibility calculation unit 30 will be described with reference to FIG. 9 (and FIG. 8 as appropriate).
FIG. 9 is a flowchart showing the operation of the conversion possibility calculation unit shown in FIG. 8.
Using the composite state transition information database creation means 33, the conversion possibility calculation unit 30 composes the history of the third characters (symbols) that make up each of the one or more known third words (symbol strings) expected as input with the state transition information database 26, and creates the composite state transition information database 36 in advance (step S21).
Then, with the composite state transition information database 36 created in advance, the conversion possibility calculation unit 30 uses the input means 31 to receive from the input device M the third words (symbol strings) that are conversion candidates for the first word (symbol string) serving as the source word (step S22).
Next, using the state transition weight calculation means 34, the conversion possibility calculation unit 30 selects the third word for which the total of the state transition weights from the first characters (symbols) making up the first word to the third characters (symbols) making up the third word is the minimum (step S23).
Next, the conversion possibility calculation unit 30 uses the output means 35 to output the state transition weight of the third word selected by the state transition weight calculation means 34 (step S24). The output device D thereby displays the state transition weight as the conversion possibility.
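Steps S23 and S24 amount to scoring every candidate against the source word and keeping the candidate with the smallest total weight. The sketch below replaces the composite database with a simple dynamic program over character pairs, so it is a stand-in technique rather than the WFSA search itself. It assumes context-free pair weights (no history); the table `TOY_WEIGHTS` and both function names are hypothetical.

```python
import math

EMPTY = ""  # stands for the empty character (epsilon)

def total_weight(source, candidate, pair_weight):
    """Smallest total weight over all alignments of source and candidate."""
    m, n = len(source), len(candidate)
    best = [[math.inf] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            here = best[i][j]
            if here == math.inf:
                continue
            if i < m and j < n:  # source character paired with candidate character
                w = pair_weight.get((source[i], candidate[j]), math.inf)
                best[i + 1][j + 1] = min(best[i + 1][j + 1], here + w)
            if i < m:            # source character paired with the empty character
                w = pair_weight.get((source[i], EMPTY), math.inf)
                best[i + 1][j] = min(best[i + 1][j], here + w)
            if j < n:            # empty character paired with candidate character
                w = pair_weight.get((EMPTY, candidate[j]), math.inf)
                best[i][j + 1] = min(best[i][j + 1], here + w)
    return best[m][n]

def select_candidate(source, candidates, pair_weight):
    """Return (candidate, weight) with the minimum total weight."""
    weight, word = min((total_weight(source, c, pair_weight), c) for c in candidates)
    return word, weight

# Illustrative weights only; a weight of infinity means "no such pair".
TOY_WEIGHTS = {("ド", "d"): 0.2, ("ド", "t"): 1.0, (EMPTY, "o"): 0.5}
WORD, WEIGHT = select_candidate("ド", ["to", "do"], TOY_WEIGHTS)
```

A candidate that cannot be aligned at all receives an infinite weight, which makes it easy to reject candidates above a chosen threshold.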

In the above description of the second embodiment, the state transition weight calculation means 34 calculates the total of the state transition weights and finds the third word whose total is the minimum among the plurality of input third words. Alternatively, it may simply output only the totals or the individual state transition weights. In that case, the user can visually check the state transition weights displayed on the output device D and select the third word with the minimum value.

According to the second embodiment, when a plurality of words (third words) are input as conversion candidates for the source word (first word), the plausibility of the conversion from the source word (first word) to each third word can be obtained, which improves translation accuracy. Even when the source word (first word) is an unlearned (unknown) word that is not registered in advance in the learning database 6, a third word can be adopted as the translation result (conversion candidate) of the source word (first word). In that case, a third word whose conversion possibility is higher than a predetermined value may be output (displayed) as the translation result.

(Third Embodiment)
FIG. 10 is a diagram showing a configuration example of a word translation system including a word translation device according to the third embodiment of the present invention.
The word translation system (symbol string conversion system) 1B acquires from outside the word translation device 5B the third words, which are candidate words for conversion of the source word into the second word (target word) and which are input to the word translation device 5B together with the first word (source word).
The word translation system 1B is the same as the word translation system 1A shown in FIG. 7 except that it includes a word translation device (symbol string conversion device) 5B. For convenience of explanation, identical components are therefore given identical reference numerals, and their description and drawings are omitted as appropriate.

As shown in FIG. 10, the word translation device (symbol string conversion device) 5B includes a word output unit 5, a conversion possibility calculation unit 30, and a conversion candidate search unit 40.
The conversion candidate search unit 40 inputs to the conversion possibility calculation unit 30, as third words, the group of words extracted from document data acquired from an electronic device 50 connected to the communication network NW.
The communication network NW consists of, for example, the Internet.
The electronic device 50 is, for example, a computer (information processing device) such as a Web server, or a storage device such as a hard disk device holding a database.

[Configuration of the conversion candidate search unit]
FIG. 11 is a functional block diagram showing a configuration example of the conversion candidate search unit shown in FIG. 10.
As shown in FIG. 11, the conversion candidate search unit 40 includes an input means 41, a storage means 42, a document data acquisition means 43, a conversion candidate extraction means 44, and an output means 45.

The input means 41 is an input interface. It receives the source word (first word) from the input device M and outputs it to the document data acquisition means 43. The input means 41 also receives document data from the communication network NW and outputs it to the document data acquisition means 43.
The storage means 42 includes a RAM, a ROM, and an HDD, and stores data such as the document data received from the input means 41, as well as various operating programs.

The document data acquisition means 43 acquires document data from the electronic device 50 connected to the communication network NW, based on the first word (source word) input to the input means 41. The document data acquisition means 43 searches for documents containing the input source word using a known technique, namely a document retrieval method on the Internet or a document retrieval method for a document database. The number of documents to acquire may be specified from the input device M, or a number specified in advance may be stored in the storage means 42.

The conversion candidate extraction means 44 extracts a predetermined number of third words from the document data acquired by the document data acquisition means 43, and outputs them to the output means 45. Any extraction method may be used by the conversion candidate extraction means 44; for example, regular-expression matching on the character codes used in the target language may be used. The number of words to extract may be specified from the input device M, or a number specified in advance may be stored in the storage means 42.
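Regular-expression extraction of candidates can be sketched as follows for an English target language. The description leaves the extraction method open, so the details here are assumptions: the pattern matches runs of ASCII letters, words are lower-cased, and candidates are ranked by how often they occur in the retrieved documents. The function name and the sample documents are hypothetical.

```python
import re
from collections import Counter

# Runs of ASCII letters, i.e. the character codes used by the English target side.
WORD_RE = re.compile(r"[A-Za-z]+")

def extract_candidates(documents, max_words):
    """Return up to max_words candidate words, most frequent first."""
    counts = Counter()
    for text in documents:
        counts.update(word.lower() for word in WORD_RE.findall(text))
    return [word for word, _ in counts.most_common(max_words)]

# Hypothetical retrieved documents that contain the source word ドナルド.
DOCS = [
    "ドナルド (Donald) はDonald Duckの主人公",
    "ドナルド donald trump",
]
TOP = extract_candidates(DOCS, 1)
```

Ranking by frequency is only one possible choice; the extracted words would normally be passed on for scoring rather than trusted directly.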

The document data acquisition means 43 and the conversion candidate extraction means 44 described above are realized by the CPU loading a predetermined program stored in the ROM or the like of the storage means 42 into the RAM and executing it.
The output means 45 is an output interface to the output device D, and outputs the third words extracted by the conversion candidate extraction means 44 to the output device D.

[Operation of the conversion candidate search unit]
The operation of the conversion candidate search unit 40 will be described with reference to FIG. 12 (and FIG. 11 as appropriate).
FIG. 12 is a flowchart showing the operation of the conversion candidate search unit shown in FIG. 11.
Using the input means 41, the conversion candidate search unit 40 receives from the input device M the first word (symbol string) to be translated, as the source word (step S31).
Next, using the document data acquisition means 43, the conversion candidate search unit 40 acquires document data from the communication network NW based on the input first word (source word) (step S32).
Next, using the conversion candidate extraction means 44, the conversion candidate search unit 40 extracts third words (symbol strings), which are conversion candidates, from the acquired document data (step S33).
Then, using the output means 45, the conversion candidate search unit 40 outputs the extracted third words to the conversion possibility calculation unit 30 (step S34). In the conversion possibility calculation unit 30, the third words are thereby input by the input means 31 (see FIG. 8) to the composite state transition information database creation means 33 (see FIG. 8).

According to the third embodiment, words extracted from document data acquired over the communication network can be input as third words, and the conversion possibility of each third word can be calculated. Therefore, even when no suitable second word is found by the search, a third word acquired over the communication network can be adopted as a conversion candidate if its conversion possibility is adequate.

The embodiments of the present invention have been described above, but the present invention is not limited to them and can be implemented in various ways without departing from its gist. For example, in each embodiment a word composed of characters belonging to a certain language system is the conversion target (translation target), but "language" here is not limited to a natural language and may be a symbol system based on predetermined rules. In that case, the invention can be realized as a symbol string conversion method, a symbol string conversion device, and a symbol string conversion program that convert symbol strings composed of symbols belonging to that symbol system.

In the word translation device 5A of the second embodiment and the word translation device 5B of the third embodiment, input means, storage means, and output means are provided separately from the input means 21, storage means 22, and output means 25 described for the word translation device 5 of the first embodiment. These may instead be shared components, and the input device M and the output device D may also be shared.

The word translation device 5B of the third embodiment has been described on the premise that the conversion possibility of the words acquired from the communication network NW is calculated, but the conversion possibility calculation unit 30 is not an essential component. When the word translation device 5B does not include the conversion possibility calculation unit 30, then, for example, when the word output unit 5 determines that the source word (first word) is an unknown word, one or more words acquired from the communication network NW may be adopted as they are as conversion candidates for the source word (first word) and output to (displayed on) the output device D.

Next, several examples in which the effect of the present invention was confirmed will be described. In each example, words were converted with Japanese (written in katakana) as the source language and English (written in the alphabet) as the target language.

[Example 1]
In the word translation system 1 (see FIG. 1), the transliteration probability model 7 was created in advance by the transliteration probability model creation device 4, and the target word "donald" was obtained from the source word "ドナルド" using the word translation device 5 of the first embodiment.
In this case, the transliteration probability model creation device 4 created the transliteration probability model 7 as follows.
First, pairs of a word in katakana notation 1301 and a word in alphabet notation 1302 were stored in the learning database 6, as illustrated in FIG. 13(a).

The inter-character relevance database creation means 12 (see FIG. 2) of the transliteration probability model creation device 4 used φ²(s, t), given by equation (7), as the relevance. This φ²(s, t) is the chi-square value normalized to the range 0 to 1 (for details, see W. A. Gale and K. W. Church, "Identifying word correspondences in parallel texts", Proceedings of the 4th DARPA Workshop on Speech and Natural Language, 1991).
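The body of equation (7) is not reproduced in this text. Assuming it is the standard φ² statistic of the cited Gale and Church paper, expressed with the counts freq(·) and the total L that the description defines, it would take the form below. This is a reconstruction under that assumption, not the patent's own rendering.

```latex
\phi^2(s, t) =
  \frac{\bigl(L \cdot \mathrm{freq}(s, t) - \mathrm{freq}(s)\,\mathrm{freq}(t)\bigr)^2}
       {\mathrm{freq}(s)\,\mathrm{freq}(t)\,\bigl(L - \mathrm{freq}(s)\bigr)\,\bigl(L - \mathrm{freq}(t)\bigr)}
\tag{7}
```

The value is 1 when s and t always occur in the same word pairs and 0 when their occurrences are statistically independent.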

Here, freq(*) denotes the number of word sets in the learning database 6 in which the symbol * appears. That is, freq(s) is the number of word sets in which the source character s appears, freq(t) is the number of word sets in which the target character t appears, and freq(s, t) is the number of word sets in which both appear. L is the total number of word sets stored in the learning database 6.

The inter-character relevance database creation means 12 deleted the blanks that appear on the target side as separators between English words, and treated each target entry as one continuous word.
In the resulting inter-character relevance database 16, as illustrated in FIG. 13(b), the relevance 1312 to source characters is stored for each target character 1311 in the learning database 6. For example, the target character 1311 “a” has a relevance of “0.312370273233768” to the source character “ア”, and also has given relevance values to the source characters “ラ”, “ナ”, and so on. Similarly, the target character 1311 “b” is shown to have a relevance of “0.247172957562107” to the source character “ブ”.
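The following is a minimal sketch of how such a relevance table could be computed, assuming that equation (7) takes the standard 2×2 contingency-table form of φ² from Gale and Church, that is, φ²(s, t) = (L·freq(s, t) - freq(s)·freq(t))² / (freq(s)·freq(t)·(L - freq(s))·(L - freq(t))). The function name `phi_squared` and the toy word sets are hypothetical, and each katakana character is treated as one source character for simplicity.

```python
from collections import Counter


def phi_squared(word_sets):
    """Return {(s, t): phi^2} for every source/target character pair that co-occurs.

    word_sets is a list of (katakana_word, alphabet_word) tuples.
    Assumption: equation (7) is the standard 2x2 contingency-table phi^2.
    """
    total = len(word_sets)  # L: total number of word sets
    freq_s, freq_t, freq_st = Counter(), Counter(), Counter()
    for source, target in word_sets:
        src_chars = set(source)
        # Blanks between English words are deleted before counting.
        tgt_chars = set(target.replace(" ", ""))
        freq_s.update(src_chars)
        freq_t.update(tgt_chars)
        freq_st.update((s, t) for s in src_chars for t in tgt_chars)

    scores = {}
    for (s, t), both in freq_st.items():
        denom = freq_s[s] * freq_t[t] * (total - freq_s[s]) * (total - freq_t[t])
        if denom == 0:
            continue  # a character that occurs in every word set: phi^2 undefined
        scores[(s, t)] = (total * both - freq_s[s] * freq_t[t]) ** 2 / denom
    return scores


# Hypothetical toy learning database (not the data of FIG. 13).
toy_word_sets = [
    ("ドナルド", "donald"),
    ("ドア", "door"),
    ("ナイト", "night"),
    ("アイス", "ice"),
]
scores = phi_squared(toy_word_sets)
print(scores[("ド", "d")], scores[("ド", "n")])
```

In this toy data “ド” and “d” always appear together, so their score reaches the upper bound of 1, while “ド” and “n” co-occur exactly as often as chance predicts and score 0.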

As the relevance Assoc(s, φ_t) between a source character s and the empty character φ_t, the inter-character relevance database creation means 12 used the geometric mean of the relevance between the source character s and the other target characters. Likewise, as the relevance Assoc(φ_s, t) between the empty character φ_s and a target character t, it used the geometric mean of the relevance between the target character t and the other source characters.
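The geometric-mean rule for the empty characters can be sketched as follows. This is an illustration only: the function names and the toy relevance values are hypothetical, and it is an assumption here that only the positive relevance values actually stored in the table enter the mean.

```python
import math


def geometric_mean(values):
    """Geometric mean of the positive values; 0.0 if there are none."""
    positive = [v for v in values if v > 0]
    if not positive:
        return 0.0
    return math.exp(sum(math.log(v) for v in positive) / len(positive))


def assoc_empty_target(source_char, relevance):
    """Assoc(s, phi_t): mean relevance of source character s over all target characters."""
    return geometric_mean(v for (s, _t), v in relevance.items() if s == source_char)


def assoc_empty_source(target_char, relevance):
    """Assoc(phi_s, t): mean relevance of target character t over all source characters."""
    return geometric_mean(v for (_s, t), v in relevance.items() if t == target_char)


# Hypothetical relevance table keyed by (source character, target character).
relevance = {("ア", "a"): 0.4, ("ア", "r"): 0.1, ("ラ", "a"): 0.2}
print(assoc_empty_target("ア", relevance))  # sqrt(0.4 * 0.1)
print(assoc_empty_source("a", relevance))   # sqrt(0.4 * 0.2)
```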

Next, the word set database creation means 13 (see FIG. 2) of the transliteration probability model creation device 4 uses the learning database 6 shown in FIG. 13(a) and the inter-character relevance database 16 shown in FIG. 13(b) to find, for each word set in the learning database 6, the character-to-character correspondence that satisfies equation (1) above, and creates the word set database 17 illustrated in FIG. 14.

As shown in FIG. 14, each correspondence 1401 is written with the associated target character and source character joined by a colon. Here, <eps> is the symbol for the empty character φ; <s>:<s> at the left end of each line marks the start of a word, and </s>:</s> at the right end of each line marks the end of a word.

Next, the occurrence probability calculation means 14 (see FIG. 2) of the transliteration probability model creation device 4 created a trigram model (N = 3) from the word set database 17 shown in FIG. 14. In practice, the occurrence probability calculation means 14 created a unigram model (N = 1), a bigram model (N = 2), and a trigram model (N = 3). As a result, the transliteration probability model 7 stores the co-occurrence probability of each character set in three formats, as shown in FIG. 15.

As shown in FIG. 15(a), the unigram model (N = 1) contains an occurrence probability 1501, a first notation 1502, and a smoothing coefficient 1503.
The occurrence probability 1501 is the logarithm of the probability that a transliterated character set occurs independently of the preceding word.
The first notation 1502 is the notation of the character set.
The smoothing coefficient 1503 is a coefficient for smoothing, and is used to estimate N-gram probabilities for N > 1.

As shown in FIG. 15(b), the bigram model (N = 2) contains an occurrence probability 1511, a second notation 1512, a third notation 1513, and a smoothing coefficient 1514.
The occurrence probability 1511 is the logarithm of the probability that a transliterated character set occurs depending on the one immediately preceding word.
The second notation 1512 is the notation of the immediately preceding character set.
The third notation 1513 is the character set for which the occurrence probability 1511 was obtained.
The smoothing coefficient 1514 is a coefficient for smoothing, and is used to estimate N-gram probabilities for N > 2.

As shown in FIG. 15(c), the trigram model (N = 3) contains an occurrence probability 1521, a fourth notation 1522, a fifth notation 1523, and a sixth notation 1524.
The occurrence probability 1521 is the logarithm of the probability that a transliterated character set occurs depending on the two immediately preceding words.
The fourth notation 1522 is the notation of the character set two positions before.
The fifth notation 1523 is the notation of the immediately preceding character set.
The sixth notation 1524 is the character set for which the occurrence probability was obtained.
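As a rough illustration of how the three tables could be used together, the sketch below looks up a trigram log probability and falls back to the shorter models through the smoothing coefficients. It assumes the usual back-off convention of ARPA-style language model files, which the text does not state explicitly; the function name `log_prob`, the symbol spellings, and all numeric values are hypothetical.

```python
def log_prob(symbol, history, unigrams, bigrams, trigrams):
    """Back-off log probability of `symbol` given up to two preceding symbols.

    unigrams: {w: (log_prob, smoothing_coefficient)}
    bigrams:  {(v, w): (log_prob, smoothing_coefficient)}
    trigrams: {(u, v, w): log_prob}
    Assumption: coefficients are added (in the log domain) when backing off.
    """
    history = tuple(history)[-2:]
    if len(history) == 2:
        u, v = history
        if (u, v, symbol) in trigrams:
            return trigrams[(u, v, symbol)]
        backoff = bigrams[(u, v)][1] if (u, v) in bigrams else 0.0
        return backoff + log_prob(symbol, (v,), unigrams, bigrams, trigrams)
    if len(history) == 1:
        (v,) = history
        if (v, symbol) in bigrams:
            return bigrams[(v, symbol)][0]
        backoff = unigrams.get(v, (0.0, 0.0))[1]
        return backoff + log_prob(symbol, (), unigrams, bigrams, trigrams)
    return unigrams[symbol][0]


# Hypothetical tables; each colon-joined character set is one symbol.
unigrams = {"d:ド": (-1.0, -0.5), "o:<eps>": (-1.2, -0.4), "n:ナ": (-1.1, -0.3)}
bigrams = {("d:ド", "o:<eps>"): (-0.3, -0.2)}
trigrams = {("<s>:<s>", "d:ド", "o:<eps>"): -0.1}

seen = log_prob("o:<eps>", ("<s>:<s>", "d:ド"), unigrams, bigrams, trigrams)
unseen = log_prob("n:ナ", ("d:ド", "o:<eps>"), unigrams, bigrams, trigrams)
print(seen, unseen)
```

The first lookup finds the trigram directly; the second has no trigram or bigram entry, so it adds the bigram coefficient, the unigram coefficient, and the unigram log probability.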

To create these models, the occurrence probability calculation means 14 used the known language model creation tool “CMU-Cambridge SLM Toolkit” (P. Clarkson et al., “Statistical language modeling using the CMU-Cambridge toolkit”, in Proceedings of EUROSPEECH97, 1997, p. 2707-2710). In doing so, each character set joined by a colon in FIG. 14 or FIG. 15 is treated as a single symbol.

As the smoothing method for these models, the occurrence probability calculation means 14 used the known Witten-Bell smoothing method (I. H. Witten and T. C. Bell, “The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression”, IEEE Transactions on Information Theory, 1991, vol. 37, no. 4, p. 1085-1094).
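For orientation, the idea behind Witten-Bell smoothing can be sketched for a bigram model as below. This uses the interpolated form, P(w | h) = (c(h, w) + T(h)·P(w)) / (c(h) + T(h)), where T(h) is the number of distinct symbols seen after h; whether the toolkit applies exactly this variant is not stated in the text, so the code is an assumption-laden illustration with hypothetical names and toy sequences.

```python
from collections import Counter, defaultdict


def witten_bell_bigram(sequences):
    """Return prob(cur, prev) using interpolated Witten-Bell smoothing."""
    unigram = Counter()
    followers = defaultdict(Counter)  # followers[prev][cur] = c(prev, cur)
    for seq in sequences:
        unigram.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            followers[prev][cur] += 1
    total = sum(unigram.values())

    def prob(cur, prev):
        p_unigram = unigram[cur] / total
        seen = followers.get(prev)
        if not seen:
            return p_unigram  # unseen history: fall back to the unigram estimate
        count_prev = sum(seen.values())  # c(prev)
        types = len(seen)                # T(prev): distinct followers
        return (seen[cur] + types * p_unigram) / (count_prev + types)

    return prob


# Hypothetical toy symbol sequences.
toy_sequences = [["<s>", "d", "o", "n", "</s>"], ["<s>", "d", "a", "</s>"]]
prob = witten_bell_bigram(toy_sequences)
vocabulary = {sym for seq in toy_sequences for sym in seq}
print(prob("o", "d"), prob("n", "d"))
```

The symbol “n” never follows “d” in the toy data, yet it receives a small non-zero probability, which is the purpose of the smoothing.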

Next, the word translation device 5 of the first embodiment uses the state transition information database creation means 23 (see FIG. 4) to create the state transition information database 26 illustrated in FIG. 16 from the transliteration probability model 7 illustrated in FIG. 15. In doing so, each character set joined by a colon in FIG. 15 is split: the source character is used as the input-side data of the state transition information database 26, and the target character is used as its output-side data.

As shown in FIG. 16, the state transition information database 26 stores a state identifier 1601, a first state number 1602, a second state number 1603, a source character 1604, a target character 1605, and a state transition weight 1606.
The state identifier 1601 indicates the initial state “I”, a transition state “T”, or the end (accepting) state “F”.
The first state number 1602 is the transition source state number. For the initial state “I” and the end (accepting) state “F”, however, it gives the state number of the initial state or of the end state. The second state number 1603 is the transition destination state number.
The source character 1604 is the input-side data corresponding to an input symbol, and is taken from the source character obtained by splitting each character set joined by a colon in FIG. 15.
The target character 1605 is the output-side data corresponding to an output symbol, and is taken from the target character obtained by splitting each character set joined by a colon in FIG. 15.
The state transition weight 1606 is the weight given to a transition. For the initial state “I” and the end (accepting) state “F”, however, it gives the weight of the initial state or of the end state. In FIG. 16, the weights of the initial state and the end state are effectively “0”.

This state transition information database 26 reflects the context information (conditional probabilities) of the transliteration probability model 7. Specifically, in the transliteration probability model 7 illustrated in FIG. 15, the co-occurrence probability of a transliterated character set is determined by at most the two preceding transliterated character sets (the case of FIG. 15(c)). In FIG. 16, this co-occurrence probability (context information, or conditional probability) is held in the states identified by the first state number 1602 and the second state number 1603.

With the databases prepared as described above, the source word “ドナルド” was entered as the input word 1701 into the word translation device 5 of the first embodiment and translated, as shown in FIG. 17(a). That is, the source word was converted into a target word using a WFST in which the state transition information database 26 illustrated in FIG. 16 serves as the WFST database and the word search means 24 (see FIG. 4) serves as the WFST search program. An example of the output of the word search means 24 is shown in FIG. 17(b).

The output example in FIG. 17(b) contains a corresponding state number 1711, an input symbol 1712, an output symbol 1713, and a state transition weight 1714.
The corresponding state number 1711 is the state number indicating the state of a character set making up the word set formed by the source word and its corresponding target word. Here, the state of a character set reflects the conditional probability.
The input symbol 1712 gives the sequence of source characters obtained when the optimal correspondence is computed for the source word. Where there is no character corresponding to the target word, <eps> is written in place of the empty character φ. <s> marks the start of a word, and </s> marks the end of a word.

The output symbol 1713 is the counterpart of the input symbol 1712 for the target word. Concatenating the sequence of output symbols 1713 yields “donald” as the output word 1721, as shown in FIG. 17(c). The concatenated “donald” is then displayed on the output device D (see FIG. 4).
The state transition weight 1714 is the logarithm of the conditional probability with its sign reversed.
In this example the word was converted with the correct spelling. Even when the spelling of a conversion result is not correct, however, widening the search space to obtain more conversion candidates makes it possible to use them, for example, by including them in a query in an information retrieval system.
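The search over the state transition information can be pictured as a cheapest-path search. The sketch below is not the WFST search program of the embodiment; it is a small Dijkstra-style illustration that assumes non-negative weights, ignores the initial and end state weights (which FIG. 16 treats as effectively 0), and uses a hypothetical five-field transition record and toy transitions.

```python
import heapq

EPS = "<eps>"


def best_path(transitions, initial, finals, source_chars):
    """Return (total_weight, output_chars) of the cheapest accepting path, or None.

    transitions: iterable of (src_state, dst_state, input_sym, output_sym, weight)
    """
    arcs = {}
    for src, dst, in_sym, out_sym, weight in transitions:
        arcs.setdefault(src, []).append((dst, in_sym, out_sym, weight))

    # Search nodes are (state, number of source characters consumed so far).
    queue = [(0.0, initial, 0, ())]
    settled = set()
    while queue:
        cost, state, pos, outputs = heapq.heappop(queue)
        if (state, pos) in settled:
            continue
        settled.add((state, pos))
        if pos == len(source_chars) and state in finals:
            return cost, [o for o in outputs if o != EPS]
        for dst, in_sym, out_sym, weight in arcs.get(state, []):
            if in_sym == EPS:  # target character with no source character
                heapq.heappush(queue, (cost + weight, dst, pos, outputs + (out_sym,)))
            elif pos < len(source_chars) and in_sym == source_chars[pos]:
                heapq.heappush(queue, (cost + weight, dst, pos + 1, outputs + (out_sym,)))
    return None


# Hypothetical toy transducer (not the data of FIG. 16).
toy_transitions = [
    (0, 1, "ド", "d", 0.4),
    (1, 2, EPS, "o", 0.3),
    (2, 3, "ナ", "n", 0.2),
    (2, 3, "ナ", "m", 1.5),
    (3, 4, EPS, "a", 0.1),
]
result = best_path(toy_transitions, initial=0, finals={4}, source_chars=["ド", "ナ"])
print(result)
```

Concatenating the returned output characters gives the target string, in the same way that the output symbols 1713 are concatenated into the output word 1721.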

The word translation device 5 may also output the values of the state transition weights 1714 shown in FIG. 17(b) together with the target word. In that case, each state transition weight can be used as a conversion possibility indicating how plausible the translation (character conversion) is between a target character and the input source character. Outputting the sum of the state transition weights 1714 shown in FIG. 17(b) likewise indicates how plausible the translation is between the target word and the input source word.
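Because each weight is a sign-reversed logarithm of a conditional probability, the sum of the weights along a path corresponds to the product of those probabilities, and a smaller sum means a more plausible conversion. The short sketch below checks this arithmetic; the base of the logarithm is not given in the text, so base 10 is an assumption, and the probabilities are hypothetical.

```python
import math


def to_weight(probability, base=10.0):
    """Sign-reversed logarithm of a probability (base is an assumption)."""
    return -math.log(probability, base)


def to_probability(weight, base=10.0):
    """Inverse of to_weight."""
    return base ** (-weight)


# Hypothetical conditional probabilities along one path.
conditional_probabilities = [0.5, 0.25, 0.8]
weights = [to_weight(p) for p in conditional_probabilities]
total_weight = sum(weights)

# The summed weight maps back to the product of the probabilities.
print(total_weight, to_probability(total_weight))
```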

[Example 2]
Example 2 adds the following to Example 1. In the word translation system 1A (see FIG. 7), the word translation device 5A of the second embodiment was used to calculate, for the source word “レオパード” as the first word, the conversion possibility to each of three third words entered as conversion candidates: “leopard”, “lion”, and “leopon”. All alphabetic characters were converted to lowercase.

In the conversion possibility calculation unit 30 (see FIG. 8) of the word translation device 5A, the composite state transition information database creation means 33 acquires a third word, for example “leopard”, via the input means 31 and creates an FSA (finite state automaton), a database corresponding to “leopard”, as shown in FIG. 18. This FSA has a transition source state number 1801, a transition destination state number 1802, an input character notation 1803, an output character notation 1804, and a state transition weight 1805. For example, the row whose transition source state number 1801 is “1” (one) means that when the current state is “1” and the symbol “l” (the letter el) is received as the input character notation 1803, the symbol “<eps>” is output as the output character notation 1804 (that is, nothing is output) with a state transition weight 1805 of “0”, and the automaton moves to state “2”, the transition destination state number 1802. The notation is as explained above.
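A linear automaton of this kind can be generated mechanically from a candidate word. The sketch below follows the row described above (state 1, input “l”, output <eps>, weight 0, next state 2); the function name, the tuple layout, and the choice of 1 as the first state are assumptions, and any word-boundary rows that FIG. 18 may contain are not modeled.

```python
EPS = "<eps>"


def word_to_fsa(word, first_state=1):
    """Build the rows of a linear FSA that accepts exactly `word`.

    Each row is (source_state, destination_state, input_char, output_char, weight).
    Returns (rows, final_state).
    """
    rows = []
    state = first_state
    for ch in word.lower():  # alphabetic characters are handled in lowercase
        rows.append((state, state + 1, ch, EPS, 0.0))
        state += 1
    return rows, state


rows, final_state = word_to_fsa("leopard")
for row in rows:
    print(row)
```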

The composite state transition information database creation means 33 (see FIG. 8) then composes the FSA illustrated in FIG. 18 with the state transition information database 26 illustrated in FIG. 16. This creates a composite state transition information database 36 (WFSA database) into which “leopard” has been composed. Similarly, although not illustrated, FSAs are created for the other third words to be entered, “lion” and “leopon”, and each is composed with the state transition information database 26 illustrated in FIG. 16 to create its own composite state transition information database 36. Then, for example when “leopard” is entered, the word translation device 5A uses the state transition weight calculation means 34 (see FIG. 8) to calculate the state transition weights by referring to the composite state transition information database 36 that was created. The state transition weights are calculated in the same way for “lion” and “leopon”. The results are shown in FIG. 19.

As shown in FIG. 19(a) to (c), the conversion possibility calculation unit 30 outputs, for each third word entered as a conversion candidate, an output word 1901 and a cumulative weight 1902. The cumulative weight 1902 is the sum of the state transition weights from the source characters to the characters of the third word. For example, for the input “leopard”, it outputs “leopard” as the output word 1901 and “12.799” as the cumulative weight 1902, as shown in FIG. 19(a). Similarly, for “lion” it outputs “23.4622” as the cumulative weight 1902, as shown in FIG. 19(b), and for “leopon” it outputs “17.98” as the cumulative weight 1902, as shown in FIG. 19(c). Thus, in this example, the cumulative weight 1902 of “leopard”, found for “レオパード”, is the smallest. In other words, of the three third words entered, “leopard” has the highest conversion possibility.
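The selection step amounts to ranking the candidates by cumulative weight in ascending order. A minimal sketch using the three values reported above (the variable names are hypothetical):

```python
# Cumulative weights 1902 reported for the three candidate words.
cumulative_weights = {"leopard": 12.799, "lion": 23.4622, "leopon": 17.98}

# A smaller cumulative weight means a higher conversion possibility.
ranked = sorted(cumulative_weights.items(), key=lambda item: item[1])
best_word, best_weight = ranked[0]

for word, weight in ranked:
    print(f"{word}\t{weight}")
```

The same ranking applies unchanged when the candidates are obtained from retrieved documents rather than entered directly.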

As shown in FIG. 19(d) to (f), the conversion possibility calculation unit 30 may also output the calculation results in the form of an FSA (finite state automaton) database. Each of these FSAs has a corresponding state number 1911, an input character notation 1912, an output character notation 1913, and a state transition weight 1914. The items are the same as those of the FSA shown in FIG. 18.

Instead of the procedure of this example, a single composite state transition information database 36 may be created by composing the third words “leopard”, “lion”, and “leopon” with the state transition information database 26 all at once. In that case, the probability of each entered third word is calculated from this single composite state transition information database 36; the word with the highest conversion possibility (here, “leopard”) is output as the top candidate, followed by the others in descending order of conversion possibility. The conversion possibilities can thus be compared in a single pass.

[Example 3]
Example 3 adds the following to Example 2. In the word translation system 1B (see FIG. 10), the word translation device 5B of the third embodiment was used to obtain, via the Internet, third words as conversion candidates for the source word “スーパーカミオカンデ” as the first word, and to calculate the conversion possibility to the correct answer “Super-Kamiokande”.

The conversion candidate search unit 40 of the word translation device 5B used the document data acquisition means 43 (see FIG. 11) to search for documents containing the source word “スーパーカミオカンデ” and stored them in the storage means 42 as a database. An example of the database stored in the storage means 42 is shown in FIG. 20(a). As shown there, the database holds a document title 2001 and a URL 2002 giving the address of the web page on which the document is published, and stores ten items of document data (No. 501 to 510).

Using the conversion candidate extraction means 44 (see FIG. 11), the conversion candidate search unit 40 accessed the URL 2002 of each item of document data (No. 501 to 510) and extracted words suitable as conversion candidates for the source word “スーパーカミオカンデ”, as shown in FIG. 20(b). Here, as the extracted words 2011, the conversion candidate extraction means 44 output to the conversion possibility calculation unit 30, as third words, a total of ten words in alphabet notation, one extracted from each item of document data (No. 501 to 510).

The conversion possibility calculation unit 30 then performs the operation described in Example 2 on the ten third words received from the conversion candidate search unit 40, and calculates and outputs their conversion possibilities (weights). An example of the output data is shown in FIG. 20(c). As shown there, the output data has an extracted word 2021 and a conversion possibility (weight) 2022 as items. According to this output data, “Super-Kamiokande” (No. 611) has a conversion possibility (weight) 2022 of “19.9722”, which is the smallest value among the ten extracted words 2021. That is, “Super-Kamiokande” (No. 611) has the highest conversion possibility. The word translation device 5B was thus able to obtain multiple candidate words for the source word “スーパーカミオカンデ” via the Internet and to determine that, among them, the correct answer “Super-Kamiokande” has the highest conversion possibility.

According to this third example, even when the input source word (and likewise the target word) was not registered in the learning database 6 when the transliteration probability model 7 used by the word translation device 5B was created, the source word can be converted into the most likely target word. In a first stage, the third words obtained from the Internet are composed, as FSAs, with the state transition information database 26. In a second stage, the first word and the obtained third words are taken as input, the conversion possibilities are calculated from the composite state transition information database 36 created in advance, and the third word with the highest conversion possibility is selected. Because the word translation device 5B does not depend on a fixed dictionary but takes as conversion candidates real words obtained via the communication network NW, it is highly practical. It is therefore suited to translating proper nouns such as personal names and place names, and to translating articles, news, and other text in which new words appear one after another.

FIG. 1 is a diagram showing a configuration example of a word translation system including the word translation device according to the first embodiment of the present invention.
FIG. 2 is a functional block diagram showing a configuration example of the transliteration probability model creation device shown in FIG. 1.
FIG. 3 is an explanatory diagram showing a specific example of the steps up to the creation of the word set database shown in FIG. 2.
FIG. 4 is a functional block diagram showing a configuration example of the word translation device shown in FIG. 1.
FIG. 5 is a flowchart showing the operation of the transliteration probability model creation device shown in FIG. 2.
FIG. 6 is a flowchart showing the operation of the word translation device shown in FIG. 4.
FIG. 7 is a diagram showing a configuration example of a word translation system including the word translation device according to the second embodiment of the present invention.
FIG. 8 is a functional block diagram showing a configuration example of the conversion possibility calculation unit shown in FIG. 7.
FIG. 9 is a flowchart showing the operation of the conversion possibility calculation unit shown in FIG. 8.
FIG. 10 is a diagram showing a configuration example of a word translation system including the word translation device according to the third embodiment of the present invention.
FIG. 11 is a functional block diagram showing a configuration example of the conversion candidate search unit shown in FIG. 10.
FIG. 12 is a flowchart showing the operation of the conversion candidate search unit shown in FIG. 11.
FIG. 13 is an explanatory diagram of an example that converts katakana into the alphabet; (a) is an example of the learning database, and (b) is an example of the inter-character relevance database.
FIG. 14 is an explanatory diagram showing an example of the word set database with inter-character correspondences created from the learning database shown in FIG. 13(a).
FIG. 15 is an explanatory diagram showing an example of the transliteration probability model database shown in FIG. 1; (a) shows unigram data, (b) bigram data, and (c) trigram data.
FIG. 16 is an explanatory diagram showing an example of the state transition information database shown in FIG. 4.
FIG. 17 is an explanatory diagram showing an example of word translation; (a) shows the input word, (b) the search result, and (c) the output word.
FIG. 18 is an explanatory diagram showing an example of an FSA for a third word.
FIG. 19 is an explanatory diagram showing an example of the search results when three third words are input to the word translation device according to the second embodiment.
FIG. 20 is an explanatory diagram showing an example of the search for third words by the conversion candidate search unit of the word translation device according to the third embodiment; (a) shows the retrieved documents, (b) the extracted words, and (c) the output results.

Explanation of symbols

1 (1A, 1B) Word translation system (symbol string conversion system)
2 Storage device
3 Storage device
4 Transliteration probability model creation device (symbol conversion probability model creation device)
5 Word translation device (word output unit)
5A, 5B Word translation device (symbol string conversion device)
6 Learning database
7 Transliteration probability model (symbol conversion probability model)
10 Input means
11 Storage means
12 Inter-character relevance database creation means (inter-symbol relevance database creation means)
13 Word set database creation means (symbol string set database creation means)
14 Occurrence probability calculation means
15 Writing means
16 Inter-character relevance database (inter-symbol relevance database)
17 Word set database (symbol string set database)
M Input device
21 Input means (first input means)
22 Storage means
23 State transition information database creation means (database creation means)
24 Word search means
25 Output means (first output means)
26 State transition information database
D Output device
30 Conversion possibility calculation unit
31 Input means (second input means)
32 Storage means
33 Composite state transition information database creation means
34 State transition weight calculation means
35 Output means (second output means)
36 Composite state transition information database
40 Conversion candidate search unit
50 Electronic device
N Communication network
41 Input means
42 Storage means
43 Document data acquisition means
44 Conversion candidate extraction means
45 Output means

Claims

A symbol string conversion method performed by a symbol string conversion device that uses the co-occurrence frequency of symbols in a symbol string set, which is a combination of symbol strings of the same meaning belonging to different symbol systems, the method comprising the following steps performed by the symbol string conversion device:
inputting a first symbol string belonging to a first symbol system;
estimating a second symbol string that belongs to a second symbol system and corresponds to the input first symbol string, using the co-occurrence frequency and the frequency of the appearance order of symbol groups in the symbol string set; and
outputting the estimated second symbol string.

The symbol string conversion method according to claim 1, wherein the step of estimating the second symbol string includes:
calculating, based on the co-occurrence frequency, the frequency of the appearance order for each symbol string set formed by associating the first symbol string and the second symbol string with each other in symbol units; searching, based on the calculation result, for the symbol string set that maximizes the frequency of the appearance order; and estimating the second symbol string using the frequency of the appearance order related to the found symbol string set.

The symbol string conversion method according to claim 1 or 2, wherein the step of estimating the second symbol string includes:
calculating, based on the co-occurrence frequency, the frequency of the appearance order for each symbol string set formed by associating the first symbol string and the second symbol string with each other in symbol units, searching, based on the calculation result, for the symbol string set that maximizes the frequency of the appearance order, and creating a database that stores, as data, the frequency of the appearance order related to the found symbol string set; and
searching for the second symbol string with reference to the database.

A word translation method that is the symbol string conversion method according to any one of claims 1 to 3 in which each symbol string is a word composed of characters, wherein the symbol string conversion device further performs the steps of:
obtaining document data, based on an input word, from an electronic device connected to a communication network; and
extracting, from the acquired document data, a predetermined number of words as conversion candidates for the input word.

A word translation device that uses the co-occurrence frequency of characters in a word set, which is a combination of words of the same meaning belonging to different language systems, the device comprising:
input means for inputting a first word belonging to a first language system;
word search means for estimating a second word that belongs to a second language system and corresponds to the input first word, using the co-occurrence frequency and the appearance order frequency of the word set; and
output means for outputting the estimated second word.

The word translation device according to claim 5, wherein the word search means
calculates, based on the co-occurrence frequency, the frequency of the appearance order for each arbitrary word set formed by associating the first word and the second word with each other in units of characters, searches for the word set that maximizes the frequency of the appearance order, and estimates the second word using the frequency of the appearance order related to the found word set.

The word translation device according to claim 5 or 6, further comprising database creation means for calculating, based on the co-occurrence frequency, the frequency of the appearance order for each arbitrary word set formed by associating the first word and the second word with each other in units of characters, searching for the word set that maximizes the frequency of the appearance order, and creating a database that stores, as data, the frequency of the appearance order related to the found word set,
wherein the word search means searches for the second word with reference to the database.

The word translation device according to any one of claims 5 to 7, further comprising:
document data acquisition means for acquiring document data, based on the first word input to the input means, from an electronic device connected to a communication network; and
conversion candidate extraction means for extracting, from the acquired document data, a predetermined number of words as conversion candidates for the first word.

A symbol string conversion program for causing a computer to execute the symbol string conversion method according to any one of claims 1 to 3.

A word translation program for causing a computer to execute the word translation method according to claim 4.

A recording medium in which the symbol string conversion program according to claim 9 or the word translation program according to claim 10 is recorded.
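
For illustration only, the estimation step recited in claims 1 to 3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: it reads the "frequency of the appearance order" as n-gram counts (here unigrams and bigrams) over character pairs aligned in symbol units, interpolates the two with a fixed weight, and searches for the best conversion by dynamic programming. The toy aligned data, the function names `train`, `log_prob`, and `convert`, and the weight `lam` are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: word sets already aligned in symbol units as
# (katakana chunk, alphabet chunk) pairs. Not taken from the patent.
ALIGNED = [
    [("ト", "to"), ("ム", "m")],
    [("キ", "ki"), ("ム", "m")],
    [("ム", "mu"), ("ト", "to")],
    [("ミ", "mi"), ("キ", "ki")],
]

BOS = ("<s>", "<s>")  # start-of-word marker used as the first context


def train(aligned_word_sets):
    """Count unigrams and bigrams of aligned pairs (appearance-order counts)."""
    unigrams, bigrams, context = Counter(), Counter(), Counter()
    for pairs in aligned_word_sets:
        prev = BOS
        for unit in pairs:
            unigrams[unit] += 1
            bigrams[(prev, unit)] += 1
            context[prev] += 1
            prev = unit
    return unigrams, bigrams, context


def log_prob(unit, prev, model, lam=0.7):
    """Interpolated bigram/unigram log-probability (lam is an assumed weight)."""
    unigrams, bigrams, context = model
    p_uni = unigrams[unit] / sum(unigrams.values())
    p_bi = bigrams[(prev, unit)] / context[prev] if context[prev] else 0.0
    return math.log(lam * p_bi + (1 - lam) * p_uni)


def convert(source, model):
    """Return the highest-scoring target string, or None if no path exists."""
    by_source = defaultdict(list)
    for unit in model[0]:
        by_source[unit[0]].append(unit)
    max_len = max(len(chunk) for chunk in by_source)
    # best[i] maps the last unit used to (score, output) for source[:i].
    best = [dict() for _ in range(len(source) + 1)]
    best[0][BOS] = (0.0, "")
    for i in range(len(source)):
        for prev, (score, out) in best[i].items():
            for n in range(1, max_len + 1):
                j = i + n
                if j > len(source):
                    break
                for unit in by_source.get(source[i:j], []):
                    cand = (score + log_prob(unit, prev, model), out + unit[1])
                    if unit not in best[j] or cand[0] > best[j][unit][0]:
                        best[j][unit] = cand
    final = best[len(source)]
    return max(final.values())[1] if final else None


MODEL = train(ALIGNED)
```

With this toy data, `convert("キム", MODEL)` yields `"kim"` while `convert("ムキ", MODEL)` yields `"muki"`: the same character ム is rendered differently depending on the pairs that precede it, which is the effect the appearance-order frequency is meant to capture. The sketch omits an end-of-word marker and any trigram term for brevity.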