JP2007257644A

JP2007257644A - Program, method and device for acquiring translation word based on translation word candidate character string prediction

Info

Publication number: JP2007257644A
Application number: JP2007077693A
Authority: JP
Inventors: Gaolin Fang; 高林方; Hao Yu; 浩于
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2006-03-24
Filing date: 2007-03-23
Publication date: 2007-10-04
Also published as: CN101042692B; CN101042692A

Abstract

PROBLEM TO BE SOLVED: To provide a program, a method, and a device for acquiring translation words based on prediction of translation word candidate character strings.

SOLUTION: The translation word acquisition program causes a computer, which has a word element bilingual dictionary storing correspondences between word elements of a source language and translation word elements (word elements of a target language), to function as: a word element division part that divides an input source-language query into word elements; a translation word element extraction part that extracts the corresponding translation word elements by looking up each divided word element in the word element bilingual dictionary; a logical operation expression construction part that constructs a logical operation expression for retrieval by combining the extracted translation word elements with the query; a retrieval part that uses the constructed logical operation expression as a second query (query 2) to retrieve matching documents from a network or a database; and a translation word candidate acquisition part that extracts translation word candidates from the retrieved documents.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to computer information processing, and more particularly to processing that uses web mining to discover knowledge. Specifically, the invention estimates the target-language translation (the target language being the language translated into) of a specific expression (an entity item) from ordinary documents on a network or in a database, and improves the efficiency of translation acquisition by predicting the translated word elements contained in the translation. Here, an entity item is, for example, a word, a proper noun, an idiom, or a set phrase. A word element is the smallest unit of meaning, finer than an ordinary word.

When writing in a foreign language, one often encounters technical terms, proper nouns, idioms, set phrases, and the like that are not listed in general dictionaries. Finding an accurate translation of such an expression takes a long time, whether by consulting dictionaries or by searching documents. Examples of expressions whose translations are hard to find include [Character 1] (automobile license plate; in English, "License plate number") and [Character 2] (Sangokushi; in English, "The Romance of Three Kingdoms"). In a machine translation system, the translation accuracy of the whole system typically drops sharply when it encounters an unknown expression (one not held in its machine translation dictionary). Likewise, in cross-lingual information retrieval, if a query word is not in the dictionary and no corresponding translation exists, this becomes a bottleneck that hinders improvement of the overall performance of the retrieval system.

It is therefore extremely important to acquire translations of important entry words that are not recorded in dictionaries (unregistered words). Some practitioners have considered using network search engines to solve this problem, but such searches return large numbers of irrelevant web pages and redundant information, making it very difficult to find the useful translations and related knowledge that the user needs. One object of the present invention is to efficiently acquire, through a search engine, translation pairs that are not in any dictionary by drawing on the vast information resources of the Internet. To exclude irrelevant web pages from the translation search, the invention predicts character strings that are likely to appear in the translation (translated word elements).

Techniques for acquiring translation pairs between two languages, or for translating between two languages, fall roughly into the following types.
(1) Acquiring translations from a parallel corpus. This method requires a large-scale parallel corpus. To reduce the labor of building one by hand, some researchers have proposed automatically constructing an aligned corpus from corpora collected from the net. However, building an aligned corpus from unstructured data on the net is quite difficult.

(2) Acquiring translations from a non-parallel corpus. This method exploits the observation that, in a large-scale balanced corpus, the contexts of a given word in the two languages are identical or similar. It makes full use of corpora that already exist while avoiding the shortage of parallel corpora. However, it is less effective at acquiring translations than the preceding method. Moreover, as research results to date show, this method focuses mainly on acquiring translations of individual words.

(3) Acquiring the translation of a phrase from combinations of translation candidates for each of its words. The translation candidates of the words making up the phrase are combined, and the appropriateness of each combination is evaluated. This method is used to acquire translations of noun collocations and the like, but it often fails.

(4) Acquiring translations by transliteration (character-for-character conversion) or phonetic transcription. This method is used between two similar languages. For example, obtaining the Chinese translation of a Japanese kanji word may require only a conversion between characters (transliteration). Converting English into Japanese can be done by writing the English pronunciation in katakana (phonetic transcription).

(5) Acquiring translations from anchor text information. This method relies on the hypothesis that anchor texts linking to the same web page are likely to have the same meaning. When anchor texts in multiple languages are extracted, they are regarded as words expressing the same concept (that is, as translations of one another). It works relatively well for acquiring translations of the names of companies that have websites written in multiple languages.

The above is the background art on extracting translations from networks, on which the present invention builds. Next, the technique that forms the basis of the present invention is described.

(6) Acquiring translations from partially bilingual texts on the Internet. In scientific and technical papers and specialized articles, particularly in Asian languages such as Chinese, Japanese, and Korean, technical terms and little-known proper nouns are often accompanied by their English translations, as when a native-language term is followed by its English equivalent in parentheses, for example "(partially bilingual texts)". As more documents in specialized fields, such as scientific and technical papers, are published on the Internet, web mining techniques that acquire translations of technical terms and the like from Internet content have become an effective means.

The system described in Non-Patent Document 1 provides a means of obtaining the target-language translation e of a source-language word j. Among all documents containing the word j, it takes those that also contain at least one target-language word. For every target-language word e appearing in such a document, it computes the translation likelihood between j and e and selects the word e most likely to be the translation. The document also shows that, in practice, a search engine is used to narrow the set down to a specific number of documents instead of using all documents, and that the distance between j and e is used when computing the translation likelihood between them in each document.

The system described in Non-Patent Document 2 addresses cross-language information retrieval, such as entering a Chinese query to search English documents, where the query may contain unregistered words. It attempts to solve this problem (maintaining cross-language retrieval performance) by using documents on the network that contain partial bilingual text. This can be regarded as one application of network-based translation search.

Non-Patent Document 3 likewise addresses unregistered words in queries in cross-language information retrieval, using network search to maintain cross-language retrieval performance.

Non-Patent Document 1: M. Nagata, T. Saito, and K. Suzuki, Using the Web as a Bilingual Dictionary, Proc. ACL 2001 Workshop Data-Driven Methods in Machine Translation, 2001
Non-Patent Document 2: P.J. Cheng, et al., Translating unknown queries with web corpora for cross-language information retrieval, SIGIR 2004
Non-Patent Document 3: Y. Zhang, P. Vines, RMIT Chinese-English CLIR at NTCIR-4, In Working Notes of the Fourth NTCIR Workshop Meeting

As described so far, partial bilingual text on the network is used to obtain translations of technical terms and proper nouns, and network search with the source-language word as the query is used to obtain documents containing such text. However, documents containing the source-language word (the result documents of a search using that word as the query) are numerous, and downloading and using all of them is not realistic. The documents cited above therefore use only a realistic number of documents, cutting off the results according to the search engine's ranking.

However, search engine rankings are determined by factors such as the frequency of the query words and the prominence of the site hosting the document that contains them (for example, Google's PageRank algorithm). A high rank does not mean that a document contains partial bilingual text. If anything, documents without partial bilingual text are generally the majority.

To maintain the accuracy of translation extraction, a sufficient number of documents containing partial bilingual text must be secured. Doing so requires retrieving (downloading) many documents from the search results and checking each for partial bilingual text, which lengthens processing time. Some means is needed to maintain extraction accuracy without increasing processing time.

The present invention has been made to solve the above problems of the prior art. Its object is to provide a translation acquisition method and a translation acquisition apparatus that efficiently retrieve documents containing useful partial bilingual text by predicting translation candidates and giving the search engine a query that retrieves only documents containing them.

An input source-language query (a word or phrase) is divided into word elements (the smallest units of meaning, finer than ordinary words). A word element bilingual dictionary is then used to obtain the set of candidate target-language word elements corresponding to each word element of the query, and a logical expression for search is constructed so as to retrieve documents that contain not only the original query but also at least one of the candidate word elements. Compared with conventional network-based translation search methods, this improves retrieval efficiency for documents containing partial bilingual text for the query (that is, the proportion of retrieved documents that contain a partial translation of the query). As a result, with the same cutoff as before, a substantial improvement in translation extraction accuracy can be expected. Alternatively, if the same number of useful documents is to be obtained, processing time is substantially reduced.

Preferred embodiments of the translation acquisition program, translation acquisition method, and translation acquisition apparatus according to the present invention are described in detail below with reference to the accompanying drawings.

FIG. 2 is a block diagram of the translation acquisition apparatus based on translation candidate character string prediction according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes a query reception unit 10, a word element division unit 12, a word element dictionary 14, a translated word element set extraction unit (translated word element extraction unit) 16, a word element bilingual dictionary 18, a logical expression construction unit 20, a search unit 22, and a translation candidate acquisition unit 24. The translation candidate acquisition unit 24 in turn includes a document download unit 30, a translation candidate extraction unit 32, and a translation candidate evaluation unit 34.

The query reception unit 10 receives, as a query, the word whose translation is to be looked up. The character string entered as the query is displayed on a display unit, for example as shown in FIG. 3. The word element division unit 12 divides the input word or phrase (the query) into word elements. The word element bilingual dictionary 18 holds, for each source-language word element, the corresponding set of target-language translated word elements. The translated word element set extraction unit 16 looks up each word element output by the word element division unit 12 in the word element bilingual dictionary 18 and retrieves the corresponding set of target-language translated word elements. The logical expression construction unit 20 combines the translated word element sets retrieved for each source-language word element with the originally input query to construct a logical expression for search. The search unit 22 uses the constructed logical expression as a query to retrieve matching documents from a network or a database. The translation candidate acquisition unit 24 downloads the documents retrieved by the search unit 22, finds and extracts partial bilingual text for the query, and creates translation candidates together with their evaluation values. It then computes the likelihood of the candidates extracted from multiple documents using various heuristics and selects translation candidates for the query based on the result.

Input to the query reception unit 10 may be, for example, a word or phrase entered from a keyboard or the like. Words and phrases can also be input over a network such as a LAN or the Internet, in which case the input unit may be configured as a network interface. They can also be input from a storage device such as a hard disk drive, or from a scanner or the like, in which case the input unit may be configured to connect to and exchange data with the scanner, storage device, or the like. The connection can be wired, such as USB (Universal Serial Bus), or wireless, such as Bluetooth. Phrase items stored on a storage medium such as a flash memory, floppy (registered trademark) disk, CD (compact disc), or DVD (digital versatile disc, digital video disc) can also be input, in which case the input unit may be configured as a device that reads data from the storage medium. Such data readers include flash memory readers, floppy (registered trademark) disk drives, CD drives, and DVD drives.

Furthermore, the input unit may be configured to support more than one of the above cases.

The translation information output by the translation candidate acquisition unit can be output over a network, in which case the output unit is configured with a network interface. The translation information can also be output to another information processing apparatus, such as a personal computer, or to a storage device, in which case the output unit is configured to perform data communication with that apparatus or device. The translation information can also be written to a storage medium, in which case the output unit is configured as a writing device that writes data to a storage device or storage medium. Such writing devices include, for example, flash memory recorders, floppy (registered trademark) disk drives, CD-R drives, and DVD-R drives.

The output unit may also be configured to output and display the translation information on a display device such as a monitor so that the data can be used. For example, it may be configured as an interface that performs data communication with the display device, or as an interface that transmits data to a built-in information processing apparatus connected to the display device.

Furthermore, the output unit may be configured to support more than one of the above cases.

FIG. 4 is a flowchart showing the word element division procedure performed by the word element division unit 12. In this process, the word element division unit 12 receives the input query as the target of division and divides the word or phrase into word elements, the smallest meaningful units. The division method may be the same as the technique generally called morphological analysis. The difference is that morphological analysis divides text using a morpheme dictionary, whereas word element division uses a word element dictionary. For example, "三国志" can be regarded as a single word (morpheme), but as word elements it can be divided into "三", "国", and "志". Since most kanji are considered minimal units that carry meaning as a single character, the text may simply be split into individual kanji without preparing a word element dictionary. In that case, the dictionary may hold only those specific character strings whose characters do not carry a minimal meaning individually. Such strings are treated as word elements only when they match, and the remaining kanji are split one character at a time.

Specifically, on receiving the character string to be divided, the word element division unit 12 sets it as the target string s (step S401) and sets the division list L to the empty list (step S402). If the target string s is empty (step S403, Yes), the process ends. If s is not empty (step S403, No), a character string (word element) matching the word element dictionary is taken from the head of s (step S404). The extracted string is added to the division list L (step S405), and the matched substring is removed from s (step S406). The processing from step S403 onward is then repeated.
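The loop of steps S401 to S406 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the use of a set as the word element dictionary, and the longest-match rule are not taken from the source, which says only that a string matching the dictionary is taken from the head of s. Characters with no dictionary entry fall back to single-character elements, as the text allows for kanji.

```python
def split_into_word_elements(text, word_element_dict=frozenset()):
    """Split text into word elements (hypothetical sketch of steps S401-S406).

    word_element_dict holds multi-character strings that must stay together;
    anything else is split one character at a time.
    """
    s = text                       # S401: target string s
    elements = []                  # S402: division list L starts empty
    max_len = max((len(w) for w in word_element_dict), default=1)
    while s:                       # S403: finish when s is empty
        # S404: take the longest dictionary entry at the head of s
        # (longest match is an assumption), else a single character.
        match = s[0]
        for length in range(min(max_len, len(s)), 1, -1):
            if s[:length] in word_element_dict:
                match = s[:length]
                break
        elements.append(match)     # S405: add the element to L
        s = s[len(match):]         # S406: remove the matched prefix from s
    return elements
```

With an empty dictionary, "三国志" is split into its three characters. With "流通" registered as a multi-character element, "流通股" is split into "流通" and "股".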

The word element bilingual dictionary 18 holds sets of word elements and their translations, for example "三" with "three, third, triple" and "国" with "nation, country, land, state, world".

The translated word element set extraction unit 16 creates a list of translation candidates by looking up each word element in the list produced by the word element division unit 12 in the word element bilingual dictionary. FIG. 5 shows the procedure by which the translated word element set extraction unit 16 creates the translation candidate list. Specifically, on receiving the word element list, the unit sets it as the list L (step S501) and empties the translation list T (step S502). If the list L is empty (step S503, Yes), it returns the translation list T and ends the process. If the list L is not empty (step S503, No), it takes a word element w from L (step S504). It then refers to the word element bilingual dictionary and adds the translations t1 to tn of the word element w to the translation list T (step S505). Finally, the word element w is removed from the list L, and the processing from step S503 onward is repeated.
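A minimal Python sketch of steps S501 to S505 is shown below. The function name and the dictionary layout (a plain mapping from word element to a list of translations) are assumptions for illustration. The two sample entries are the ones given in the text. Skipping duplicates is an addition not stated in the source.

```python
# Sample entries quoted in the text; a real dictionary would be far larger.
WORD_ELEMENT_TRANSLATIONS = {
    "三": ["three", "third", "triple"],
    "国": ["nation", "country", "land", "state", "world"],
}


def collect_translation_elements(word_elements, bilingual_dict):
    """Build the translation list T from the word element list L."""
    remaining = list(word_elements)    # S501: list L
    translations = []                  # S502: translation list T starts empty
    while remaining:                   # S503: finish when L is empty
        w = remaining.pop(0)           # S504: take w (and remove it from L)
        for t in bilingual_dict.get(w, []):   # S505: add t1..tn to T
            if t not in translations:  # de-duplication is an assumption
                translations.append(t)
    return translations
```

Word elements with no dictionary entry, such as "志" in this toy dictionary, simply contribute nothing to T.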

The logical expression construction unit 20 constructs a logical expression for search from the source-language query and the translation list retrieved by the translated word element set extraction unit 16. The expression is output in a format that the search engine of the downstream search unit 22 can interpret. For example, when the translation list {three, third, triple, nation, country, land, state, world, ...} is output for the input [Character 2], the logical expression "[Character 2]" & ("three"|"third"|"triple"|"nation"|"country"|"land"|"state"|"world"|...) is constructed and output.
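As a rough sketch of how such an expression might be assembled, the Python function below joins the quoted query with an OR-group of quoted translated word elements. The function name and the exact operator syntax (ASCII `&` and `|`, no surrounding spaces) are assumptions. The source requires only that the output be in a form the search engine can interpret.

```python
def build_search_expression(query, translation_elements):
    """Combine the original query with an OR-group of translated word elements."""
    quoted_query = f'"{query}"'
    if not translation_elements:
        # With no predicted elements, fall back to the plain query (assumption).
        return quoted_query
    alternatives = "|".join(f'"{t}"' for t in translation_elements)
    return f"{quoted_query}&({alternatives})"
```

A document matches this expression only if it contains the original query and at least one predicted translated word element, which is the filtering effect the description relies on.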

The search unit 22 uses the constructed logical expression as a query (query 2) to retrieve matching documents from a network or a database. Specifically, the search is carried out by supplying a URL in a format conforming to the search engine's API, for example as shown below. Japanese, Chinese, and similar text is encoded according to URL encoding.
http://www.google.co.jp/search?hl=ja&q=%E4%B8%89%E5%9B%BD+%28three+%7C+third%29&btnG=Google+%E6%A4%9C%E7%B4%A2&lr=
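The URL encoding step can be illustrated with Python's standard `urllib.parse.urlencode`, which percent-encodes UTF-8 bytes and turns spaces into `+`. The function name and defaults below are hypothetical. The base address and the `hl` and `q` parameters are taken from the example URL in the text, and the `btnG` and `lr` parameters of that example are left out for brevity. Whether a present-day search engine accepts this form is not something the source establishes.

```python
from urllib.parse import urlencode


def build_search_url(expression, base="http://www.google.co.jp/search", lang="ja"):
    """Return a search URL with the expression URL-encoded as the q parameter."""
    # urlencode applies quote_plus to each value, so non-ASCII text is
    # percent-encoded as UTF-8 and spaces become '+'.
    return f"{base}?{urlencode({'hl': lang, 'q': expression})}"
```

For the expression `三国 (three | third)`, this yields the same `hl` and `q` portion as the example URL above.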

The document download unit 30 downloads the search results. Normally, an appropriate number of documents are downloaded starting from the top of the ranking, according to a suitable threshold (such as a document count or a time limit).

The translation candidate extraction unit 32 extracts candidates for the translation of the query. FIG. 6 shows the procedure of the translation candidate extraction process. Any character string in the target language can be a candidate, but in practice heuristics are used to narrow the candidates down appropriately. As shown in Non-Patent Document 1, such heuristics include the distance from the query, the difference in length between the query and the target-language string, and the presence or absence of specific symbols or expressions. Specifically, the translation candidate extraction unit 32 searches the document for the query string (step S601), takes the text within a threshold distance before and after it (step S602), and extracts from that text the strings consisting only of the target language, in units not interrupted by another language (step S603). For each extracted string, it then extracts partial word sequences as candidates (step S604). Finally, it discards unnecessary candidates based on the heuristics (step S605).

For example, if [Character 3] appears in a retrieved document for the query [Character 2], then "Kingdoms", "Three Kingdoms", "of Three Kingdoms", "Romance of Kingdoms", "Book", "online", "online read", "online read book", and so on are extracted as candidates.

The heuristics described above are only examples, and the translation candidates for the query can also be narrowed down by other means. For example, a heuristic that discards candidate strings beginning with a preposition can be introduced, in which case "of Three Kingdoms" is removed from the translation candidates.
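
Steps S601 to S605, together with the preposition heuristic, can be sketched roughly as below. This is a minimal sketch, not the implementation of the extraction unit 32: it assumes an English target language (approximated by runs of ASCII letters), contiguous word sequences as the "partial word sequences", and a small hand-written preposition list. The names `extract_candidates`, `PREPOSITIONS`, `window`, and `max_words` are hypothetical.

```python
import re
from typing import List, Set

# Assumed, deliberately short preposition list used for the S605 heuristic.
PREPOSITIONS = {"of", "in", "on", "for", "to", "with", "by", "at", "from"}


def extract_candidates(document: str, query: str,
                       window: int = 60, max_words: int = 4) -> List[str]:
    """Collect target-language word sequences found near the query string."""
    candidates: List[str] = []
    seen: Set[str] = set()
    # S601: find every occurrence of the query string in the document.
    for match in re.finditer(re.escape(query), document):
        # S602: take the text within `window` characters before and after it.
        before = document[max(0, match.start() - window):match.start()]
        after = document[match.end():match.end() + window]
        for context in (before, after):
            # S603: keep runs of target-language words not interrupted by
            # another language (assumption: ASCII letters separated by spaces).
            for run in re.findall(r"[A-Za-z]+(?: [A-Za-z]+)*", context):
                words = run.split()
                for i in range(len(words)):
                    # S605: discard candidates that begin with a preposition.
                    if words[i].lower() in PREPOSITIONS:
                        continue
                    # S604: contiguous sub-sequences of up to `max_words` words.
                    for j in range(i + 1, min(len(words), i + max_words) + 1):
                        candidate = " ".join(words[i:j])
                        if candidate not in seen:
                            seen.add(candidate)
                            candidates.append(candidate)
    return candidates
```

Called on a document in which the query is followed by "Romance of Three Kingdoms", this sketch keeps "Three Kingdoms" and "Kingdoms" but drops "of Three Kingdoms". The other heuristics mentioned in the text (distance from the query, length difference, particular symbols) are not modelled here.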

The translation candidate evaluation unit 34 evaluates how likely each candidate is to be the correct translation. The simplest evaluation is to count the frequency of each candidate. Alternatively, an individual evaluation value can be assigned to each candidate string when it is extracted, and a weighted frequency computed from those values.
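
The two evaluation schemes mentioned above, plain frequency and weighted frequency, can be illustrated as follows. The function names and the form of the per-occurrence weights are assumptions made for this sketch; the specification does not fix how the individual evaluation values are computed.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def rank_by_frequency(candidates: Iterable[str]) -> List[Tuple[str, int]]:
    """Rank candidates by how often they were extracted."""
    return Counter(candidates).most_common()


def rank_by_weighted_frequency(
    scored: Iterable[Tuple[str, float]],
) -> List[Tuple[str, float]]:
    """Rank candidates by the sum of their per-occurrence evaluation values."""
    totals: Dict[str, float] = {}
    for candidate, weight in scored:
        totals[candidate] = totals.get(candidate, 0.0) + weight
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

With weighted frequency, a candidate seen once with a high evaluation value can outrank a candidate seen several times with low values.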

The translated word acquisition device according to the embodiment of the present invention also includes an original word candidate extraction unit. For a given word element, the original word candidate extraction unit refers to an abbreviation restoration dictionary and attaches the unabbreviated original word to that word element. The abbreviation restoration dictionary associates an abbreviated form with the word before abbreviation, for example "大学" (university) for "大". For example, the word "流通股" is divided into the two word elements "流通" and "股". The translations of the word element "股" are "section" and "thigh", but this word element can also be regarded as an abbreviation of "股票" (stock) or "[Character 4]" (stockholder). The original word candidate extraction unit therefore uses the abbreviation restoration dictionary to add "股票" and "[Character 4]" to the word elements for "股". An embodiment that extracts original word candidates has been shown here, but the processing of the word-element bilingual dictionary 18 and the translation word element set extraction unit 16 may instead be combined, so that when the translation candidates "section" and "thigh" are retrieved directly for "股", "stock" and "stockholder" are added as well.
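
The abbreviation restoration step can be pictured with a small sketch. The dictionary below contains only the entries that are legible in the text ("大" and "股" with "股票"); the entry shown as [Character 4] is omitted because its characters are not available. The names `ABBREVIATION_DICT` and `expand_word_elements` are hypothetical.

```python
from typing import Dict, List

# Assumed toy abbreviation restoration dictionary: abbreviated word element
# mapped to its candidate original (unabbreviated) words.
ABBREVIATION_DICT: Dict[str, List[str]] = {
    "大": ["大学"],
    "股": ["股票"],
}


def expand_word_elements(elements: List[str],
                         abbreviation_dict: Dict[str, List[str]]) -> List[str]:
    """Append original word candidates to the list of word elements."""
    expanded = list(elements)
    for element in elements:
        for original in abbreviation_dict.get(element, []):
            if original not in expanded:
                expanded.append(original)  # treat the original as a new word element
    return expanded
```

For the word elements "流通" and "股", this sketch yields "流通", "股", and "股票", so that the later dictionary lookup can also reach "stock".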

Many Chinese technical terms and proper nouns are composed of several words or phrases. It is very difficult to obtain the exact meaning of such a term directly from each word or phrase, but a partial or related meaning, and related words, can be inferred from the words or phrases that make it up. For example, the items that make up "[Character 2]" have the following translations: "three" for "三"; "country, nation" for "国"; "act, practice" for "演"; and "meaning, justice" for "[Character 5]". From these a person can form the rough impression that the term concerns "three countries". Likewise, the word elements that make up "[Character 1]" have the following translations: "vehicle, car" for "[Character 6]"; "brand, plate, board" for "牌"; and "number, size, data" for "号". From these a person can form the rough impression that the term means "car number".
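
The lookup of translated word elements can be sketched as a plain dictionary lookup. The table below holds only the word elements whose characters are legible in the text; the elements shown as [Character 5] and [Character 6] are left out. `WORD_ELEMENT_DICT` and `lookup_translation_elements` are hypothetical names, and a real word-element bilingual dictionary 18 would be far larger.

```python
from typing import Dict, List

# Assumed excerpt of a word-element bilingual dictionary.
WORD_ELEMENT_DICT: Dict[str, List[str]] = {
    "三": ["three"],
    "国": ["country", "nation"],
    "演": ["act", "practice"],
    "牌": ["brand", "plate", "board"],
    "号": ["number", "size", "data"],
}


def lookup_translation_elements(elements: List[str],
                                dictionary: Dict[str, List[str]]) -> List[str]:
    """Collect the translated word elements for each word element, in order."""
    translations: List[str] = []
    for element in elements:
        for translation in dictionary.get(element, []):
            if translation not in translations:
                translations.append(translation)  # skip duplicates
    return translations
```

Word elements missing from the dictionary simply contribute nothing, which matches the idea that only a partial meaning needs to be recovered.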

In this translation candidate prediction method, the query is expanded by using, as keywords, the words or translations predicted from the word-element bilingual dictionary described above. Translation candidates are taken from the documents retrieved with the expanded query, statistical processing is applied, and words closely related to the original query term are obtained from those documents; in this way documents containing partial translations are obtained.
The generated logical operation expressions look like the following:
"[Character 2]"+(three|country|nation|act|practice|justice|meaning)
"[Character 1]"+(vehicle|car|brand|plate|board|number|size|data)

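Construction of the logical operation expression can be sketched as below. The sketch assumes that "+" requires the quoted query to appear and that the translated word elements are joined as alternatives with "|"; the actual operator syntax depends on the search engine used. The function name `build_search_expression` is hypothetical.

```python
from typing import List


def build_search_expression(query: str, translation_elements: List[str]) -> str:
    """Combine the original query with its predicted translated word elements."""
    quoted_query = f'"{query}"'
    if not translation_elements:
        return quoted_query  # nothing to expand with
    # Assumption: "|" separates alternatives, "+" joins them to the query.
    alternatives = "|".join(translation_elements)
    return f"{quoted_query}+({alternatives})"
```

For the query "流通股" with the translated word elements "stock", "section", and "thigh", the sketch produces `"流通股"+(stock|section|thigh)`.
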
Furthermore, in each embodiment, each operation can be executed by one or more processors, by specific circuits or connections (for example, discrete logic gates interconnected to perform a specific function), or by a combination of the two. Accordingly, each aspect of the invention can be realized in different forms, and all such forms are considered to fall within the scope of the invention disclosed here. For each aspect of the invention, such a form may be referred to as "logic configured to perform the operation" or "logic that performs or is capable of performing the operation".

Furthermore, according to the embodiments of the present invention, the object of the invention can also be achieved by a computer-readable medium carrying the above program. A computer-readable medium means any mechanism that can contain, store, communicate, transmit, or transport the program so that it can be used in combination with, or executed by, an instruction-executing system, device, or apparatus. For example, the computer-readable medium may be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, apparatus, or distribution medium. More specific examples of computer-readable media (a non-exhaustive list) include an electrical connection having one or more lead wires, a portable computer disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), optical fiber, and compact disc read-only memory (CD-ROM).

The above description of the embodiments of the present invention is given for purposes of illustration and explanation only. It is not intended to be exhaustive or to limit the invention to the particular forms disclosed. Various modifications and variations of the invention will be apparent to those skilled in the art. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, and to enable those skilled in the art to understand the various embodiments and modifications of the invention and thereby apply it to the particular uses contemplated. The scope of the invention is to be understood as defined by the appended claims and their equivalents.

As described above, the translated word acquisition program, translated word acquisition method, and translated word acquisition device based on translation candidate character string prediction according to the present invention are suited to processing that uses web mining to discover knowledge. They are particularly useful as a translated word acquisition method and device that estimate the target-language translation of a specific term (entity item) from ordinary documents on a network or in a database.

FIG. 1 is a diagram showing the principle of the translated word acquisition device according to the present invention.
FIG. 2 is a block diagram of the translated word acquisition device based on translation candidate character string prediction according to the embodiment of the present invention.
FIG. 3 is a diagram showing the input position of a character string accepted as a query.
FIG. 4 is a flowchart showing the procedure of the word element division process performed by the word element division unit.
FIG. 5 is a flowchart showing the procedure of the translation candidate list creation process performed by the translation word element extraction unit.
FIG. 6 is a flowchart showing the procedure of the translation candidate extraction process performed by the translation candidate extraction unit.

Explanation of symbols

10 Query reception unit
12 Word element division unit
14 Word element dictionary
16 Translation word element set extraction unit
18 Word-element bilingual dictionary
20 Logical operation expression construction unit
22 Search unit
24 Translation candidate acquisition unit
30 Document download unit
32 Translation candidate extraction unit
34 Translation candidate evaluation unit

Claims

A translated word acquisition program that causes a computer having a word-element bilingual dictionary, which stores correspondences between word elements of a source language and translated word elements that are word elements of a target language, to function as:
a word element division unit that divides an input source-language query into word elements;
a translated word element extraction unit that, for the divided word elements, extracts the corresponding translated word elements by reading and referring to the word-element bilingual dictionary;
a logical operation expression construction unit that constructs a logical operation expression for search by combining the extracted translated word elements and the query;
a search unit that uses the constructed logical operation expression as query 2 and retrieves documents matching query 2 from a network or database; and
a translation candidate acquisition unit that extracts translation candidates from the retrieved documents.

The translated word acquisition program according to claim 1, wherein the computer further has an abbreviation restoration dictionary that stores correspondences between abbreviated word elements and original word candidates, which are the word elements before abbreviation,
the program further causing the computer to function as an original word candidate extraction unit that reads the abbreviation restoration dictionary, extracts the original word candidates for the word elements divided by the word element division unit, and treats those original word candidates as new word elements.

A translated word acquisition device comprising:
a word-element bilingual dictionary that stores correspondences between word elements of a source language and translated word elements that are word elements of a target language;
a word element division unit that divides an input source-language query into word elements;
a translated word element extraction unit that, for the divided word elements, extracts the corresponding translated word elements by reading and referring to the word-element bilingual dictionary;
a logical operation expression construction unit that constructs a logical operation expression for search by combining the extracted translated word elements and the query;
a search unit that uses the constructed logical operation expression as query 2 and retrieves documents matching query 2 from a network or database; and
a translation candidate acquisition unit that extracts translation candidates from the retrieved documents.

The translated word acquisition device according to claim 3, further comprising:
an abbreviation restoration dictionary that stores correspondences between abbreviated word elements and original word candidates, which are the word elements before abbreviation; and
an original word candidate extraction unit that reads the abbreviation restoration dictionary, extracts the original word candidates for the word elements divided by the word element division unit, and treats those original word candidates as new word elements.

A translated word acquisition method in which a computer having a word-element bilingual dictionary, which stores correspondences between word elements of a source language and translated word elements that are word elements of a target language:
divides an input source-language query into word elements (referred to as the word element division step);
for the divided word elements, extracts the corresponding translated word elements by reading and referring to the word-element bilingual dictionary;
constructs a logical operation expression for search by combining the extracted translated word elements and the query;
uses the constructed logical operation expression as query 2 and retrieves documents matching query 2 from a network or database; and
extracts translation candidates from the retrieved documents.

The translated word acquisition method according to claim 5, wherein the computer further has an abbreviation restoration dictionary that stores correspondences between abbreviated word elements and original word candidates, which are the word elements before abbreviation, and
after the word element division step, the computer reads the abbreviation restoration dictionary, extracts the original word candidates for the word elements obtained by the division, and treats those original word candidates as new word elements.