JP5510069B2

JP5510069B2 - Translation device

Info

Publication number: JP5510069B2
Application number: JP2010119569A
Authority: JP
Inventors: 千加志杉浦
Original assignee: Fujitsu Mobile Communications Ltd
Current assignee: Fujitsu Mobile Communications Ltd
Priority date: 2010-05-25
Filing date: 2010-05-25
Publication date: 2014-06-04
Anticipated expiration: 2030-05-25
Also published as: JP2011248002A

Description

The present invention relates to a translation apparatus, and more particularly to processing for determining the translation source language and the translation destination language.

When an apparatus that translates input speech translates from the language entered by voice (the translation source language) into the language it outputs as speech or as displayed text (the translation destination language), it must determine both the translation source language and the translation destination language. Processing that makes this determination from the input speech, without relying on operation by the user of the apparatus, is known (see, for example, Patent Document 1).

Patent Document 1: JP 2007-322523 A (pages 2-3, FIG. 1)

However, the method disclosed in Patent Document 1 has the problem that determining the translation source language and the translation destination language requires a prior learning process that makes it possible to identify the speaker from the input speech. This problem is pronounced when, for example, translating a conversation on the street at a travel destination. In such a case, appropriate translation is required without a long wait from the very first utterance of an unspecified speaker, and prior learning from the speaker's voice cannot be expected.

The present invention has been made to solve the above problem, and its object is to provide a translation apparatus that determines the input language within a short time after speech input begins.

To achieve the above object, the translation apparatus of the present invention is a translation apparatus that translates utterances in a first language into a second language and utterances in the second language into the first language, and comprises the following means.

Language feature storage means stores a language feature of utterances in the first language and a language feature of utterances in the second language. Input means inputs an utterance. Language determination means determines whether an utterance input by the input means is in the first language or the second language by comparing the language feature of that utterance with the language features stored in the language feature storage means. Speaker determination means determines whether the utterance is in the first language or the second language according to the result of determining whether the utterance input by the input means was made by a first speaker or by a second speaker. Speech translation means refers to the determination by the language determination means and the determination by the speaker determination means to determine whether the utterance input by the input means is in the first language or the second language; when the utterance is determined to be in the first language, it translates the utterance from the first language into the second language, and when the utterance is determined to be in the second language, it translates the utterance from the second language into the first language.

The speech translation means determines whether the utterance first input by the input means is in the first language or the second language according to the determination by the language determination means. The speaker determination means stores a speaker feature and spoken-language probabilities for the first speaker and a speaker feature and spoken-language probabilities for the second speaker, and determines the speaker of an utterance input by the input means by comparing the speaker feature of that utterance with the speaker feature of the first speaker and the speaker feature of the second speaker. It then (a) determines, separately from the language determination means, that the utterance is in the language that the determined speaker has the higher probability of speaking, and (b) updates by learning the stored speaker feature of the determined speaker using the speaker feature of the input utterance, and updates by learning the stored spoken-language probabilities of the determined speaker using the language of that utterance as determined by the language determination means.

According to the present invention, the input language is determined within a short time after speech input begins.

FIG. 1 is a block diagram showing the configuration of a mobile communication device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of a conversation translation unit according to the embodiment of the present invention.
FIG. 3 is a diagram showing an example of the configuration of a language dictionary according to the embodiment of the present invention.
FIG. 4 is a diagram showing an example of the configuration of a speaker dictionary according to the embodiment of the present invention.
FIG. 5 is a flowchart of the learning operation of a speaker determination unit according to the embodiment of the present invention.
FIG. 6 is a diagram outlining how the speaker determination unit according to the embodiment of the present invention judges the speaker during learning.
FIG. 7 is a flowchart of the operation by which a translation direction determination unit according to the embodiment of the present invention determines the language of input speech.

An embodiment of the translation apparatus according to the present invention will be described below with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a mobile communication device to which the translation apparatus according to the embodiment of the present invention is applied. This mobile communication device comprises a control unit 11 that controls the entire device, a mobile communication network communication unit 12, an antenna 12a that transmits and receives radio signals to and from a base station (not shown) belonging to the mobile communication network, a mobile communication network transmission/reception unit 13, a voice processing unit 14, a speaker 14a used for outputting received voice and the like, a microphone 14b used for inputting transmitted voice and the like, a display unit 15 that visually presents information to the user, an input unit 16 that inputs operation instructions from the user, and a conversation translation unit 20 that translates a conversation held in a first language and a second language.

Here, the first language and the second language are determined before the conversation translation unit 20 starts operating. The conversation consists of utterances by a first speaker and utterances by a second speaker; one speaker speaks in the first language and the other speaker speaks in the second language.

FIG. 2 is a block diagram showing the configuration of the conversation translation unit 20. The conversation translation unit 20 comprises a translation control unit 21 that is connected to the control unit 11 and controls the entire conversation translation unit 20, a buffer 20-2 connected to the voice processing unit 14, a speech speed estimation unit 22, a phoneme type detection unit 23, a speech speed normalization unit 24, a language determination unit 25, a language dictionary 25-2, a speaker feature amount calculation unit 26, a speaker determination unit 27, a speaker dictionary 27-2, a translation direction determination unit 28, and a speech translation unit 29.

The voice processing unit 14 stores input speech in the buffer 20-2, and the translation control unit 21 erases input speech once its processing has finished. In FIG. 2, the flow of signals and the flow of control are indicated by arrows; these arrows are intended to aid understanding and do not necessarily show every flow.

FIG. 3 shows an example of the configuration of the language dictionary 25-2. The language dictionary 25-2 consists of a language 25-2a and a language feature 25-2b, which is a feature of utterances in that language. It holds one entry whose language 25-2a is the "first language" and one entry whose language 25-2a is the "second language". The language feature 25-2b is a feature based on a language model built by a technique such as vector quantization (VQ) or a Gaussian mixture model (GMM). The language dictionary 25-2 is stored when the device is manufactured, or is downloaded from a predetermined language data server device (not shown) before the conversation translation unit 20 starts operating.

FIG. 4 shows an example of the configuration of the speaker dictionary 27-2. The speaker dictionary 27-2 consists of a speaker 27-2a, a speaker feature 27-2b, which is a feature of that speaker's utterances, a first language probability 27-2c, which is the probability that the speaker's utterance is in the first language, and a second language probability 27-2d, which is the probability that the speaker's utterance is in the second language. It holds one entry whose speaker 27-2a is the "first speaker" and one entry whose speaker 27-2a is the "second speaker". The speaker feature 27-2b is, for example, a set of mel-frequency cepstral coefficients (MFCC) and may include the speech speed.

The speaker feature 27-2b, the first language probability 27-2c, and the second language probability 27-2d are stored by the operation of the conversation translation unit 20 as the conversation takes place. That is, at the start of the conversation, in other words when the conversation translation unit 20 starts operating, no value is stored in the speaker feature 27-2b, and the feature converges to a particular value as operation proceeds. At the start of the conversation, 0.5 is stored in each of the first language probability 27-2c and the second language probability 27-2d, and these also converge to particular values as operation proceeds. The sum of the first language probability 27-2c and the second language probability 27-2d is 1, and as operation proceeds, one of the first language probability 27-2c and the second language probability 27-2d of each speaker 27-2a approaches 1.
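The initial state of the speaker dictionary 27-2 described above can be sketched as follows. This is only an illustration under stated assumptions: the class name, field names, and the choice of a list of floats for the speaker feature are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpeakerEntry:
    """One entry of the speaker dictionary 27-2 (field names are hypothetical)."""
    speaker: str                            # speaker 27-2a
    feature: Optional[List[float]] = None   # speaker feature 27-2b, not stored at the start
    p_first_language: float = 0.5           # first language probability 27-2c
    p_second_language: float = 0.5          # second language probability 27-2d


def new_speaker_dictionary() -> List[SpeakerEntry]:
    """Return the dictionary as it stands when a conversation starts."""
    return [SpeakerEntry("first speaker"), SpeakerEntry("second speaker")]
```

Both entries start with no speaker feature and with language probabilities of 0.5 each, so the two probabilities of an entry sum to 1 from the outset.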

The function of each unit of the mobile communication device according to the embodiment of the present invention, configured as described above, will be described with reference to FIGS. 1 and 2.

The mobile communication network communication unit 12 obtains a high-frequency signal from the radio signal that the antenna 12a receives from the base station, and sends this high-frequency signal to the mobile communication network transmission/reception unit 13. It also sends the high-frequency signal received from the mobile communication network transmission/reception unit 13 to the antenna 12a.

The mobile communication network transmission/reception unit 13 amplifies, frequency-converts, and demodulates the high-frequency signal from the mobile communication network communication unit 12, and sends the resulting digital audio signal to the voice processing unit 14 and the resulting control signal to the control unit 11. In addition, the mobile communication network transmission/reception unit 13 modulates, frequency-converts, and amplifies the digital audio signal sent from the voice processing unit 14 and the control signal sent from the control unit 11 to obtain a high-frequency signal, and sends it to the mobile communication network communication unit 12.

The voice processing unit 14 converts the digital audio signal sent from the mobile communication network transmission/reception unit 13 and the digital audio signal sent from the conversation translation unit 20 into analog audio signals, amplifies them, and sends them to the speaker 14a. It also amplifies the analog audio signal sent by the microphone 14b, converts it into a digital audio signal, and sends it to the mobile communication network transmission/reception unit 13 or to the conversation translation unit 20. When exchanging audio signals with the conversation translation unit 20, the voice processing unit 14 performs processing to cancel sound that was emitted from the speaker 14a as a result of the operation of the conversation translation unit 20 and then picked up by the microphone 14b.

The display unit 15 is, for example, an LCD (Liquid Crystal Display). Under the control of the control unit 11, it displays prompts for user operation, the content of operations performed by the user, the operating state of the device, and the like.

The input unit 16 has character and numeric keys used to enter digits and characters, including telephone numbers, and a plurality of function keys used to enter operation instructions such as powering the mobile communication device on and off and operating instructions such as placing and answering calls. It notifies the control unit 11 of a code signal identifying the key operated by the user.

The speech speed estimation unit 22 repeatedly estimates the speech speed of the input speech stored in the buffer 20-2 for as long as the utterance continues, in other words for as long as the conversation continues. The phoneme type detection unit 23 repeatedly detects the phoneme appearance frequency and the phoneme appearance duration in the input speech stored in the buffer 20-2 for as long as the utterance continues, in other words for as long as the conversation continues. That is, it detects features of the uttered phonemes, such as the amount of fricatives (a sound is judged to be a fricative when it has many components at 4 kHz or above), the amount of voiced sounds (a sound is judged to be voiced when a fundamental frequency is present), the length of vowel segments, and the way consonants and vowels are combined. These features depend on the language and are little affected by frequency distortion arising from the characteristics of the microphone 14b.

The speech speed normalization unit 24 normalizes the features detected by the phoneme type detection unit 23 by the speech speed estimated by the speech speed estimation unit 22. It repeatedly calculates the normalized phoneme appearance frequency and phoneme appearance duration for as long as the utterance continues, in other words for as long as the conversation continues, and uses them as the language feature of the input speech.

The language determination unit 25 compares the language feature of the input speech calculated by the speech speed normalization unit 24 with the language features 25-2b stored in the language dictionary 25-2, and thereby calculates the probability Pl(La) that the input speech is in the first language and the probability Pl(Lb) that it is in the second language. This calculation is repeated for as long as the utterance continues, in other words for as long as the conversation continues, but the translation direction determination unit 28 refers to the result calculated from the input speech in a short period after the start of the utterance, for example within one second. Here, an utterance is judged to have started when the volume of the input speech is equal to or greater than a predetermined utterance start volume threshold. An utterance is judged to have ended when the volume of the input speech stays below an utterance end volume threshold for at least a predetermined utterance end time threshold.
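The utterance start and end rules above can be sketched as follows. This is a minimal sketch, assuming the input volume is available as one value per fixed-length frame; the function name, parameter names, and the frame representation are hypothetical and are not taken from the specification.

```python
from typing import List, Tuple


def find_utterances(volumes: List[float], frame_ms: int,
                    start_volume: float, end_volume: float,
                    end_time_ms: int) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs, with end_frame exclusive.

    An utterance starts at the first frame whose volume is at or above
    start_volume. It ends once the volume has stayed below end_volume for
    at least end_time_ms; end_frame is the first frame of that quiet run.
    """
    utterances = []
    start = None        # frame index where the current utterance began
    quiet_frames = 0    # length of the current run of quiet frames
    for i, volume in enumerate(volumes):
        if start is None:
            if volume >= start_volume:
                start = i
                quiet_frames = 0
        elif volume < end_volume:
            quiet_frames += 1
            if quiet_frames * frame_ms >= end_time_ms:
                utterances.append((start, i - quiet_frames + 1))
                start = None
        else:
            quiet_frames = 0    # a loud frame interrupts the quiet run
    if start is not None:
        # The input ended while an utterance was still in progress.
        utterances.append((start, len(volumes)))
    return utterances
```

A short dip in volume that does not last for the end time threshold is treated as part of the same utterance, which matches the requirement that the quiet period persist before the end is declared.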

The speaker feature amount calculation unit 26 repeatedly calculates the speaker feature of the input speech stored in the buffer 20-2 for as long as the utterance continues, in other words for as long as the conversation continues. As its first function, the speaker determination unit 27 repeatedly trains the speaker dictionary 27-2 for as long as the utterance continues, in other words for as long as the conversation continues. Here, learning is performed by updating the values stored in the speaker dictionary 27-2.

That is, the speaker determination unit 27 compares the speaker feature calculated by the speaker feature amount calculation unit 26 (hereinafter, this speaker feature may include the speech speed estimated by the speech speed estimation unit 22) with the two speaker features 27-2b, and obtains the speaker 27-2a whose speaker feature 27-2b is close to the calculated speaker feature. It then trains the speaker feature 27-2b of that speaker 27-2a with the speaker feature calculated by the speaker feature amount calculation unit 26, and stores the learned speaker feature as the updated speaker feature 27-2b.

Furthermore, when the language determined by the language determination unit 25 is the first language, the speaker determination unit 27 increases the first language probability 27-2c of that speaker 27-2a and decreases the second language probability 27-2d. When the language is the second language, it decreases the first language probability 27-2c of that speaker 27-2a and increases the second language probability 27-2d.
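One way to realize this increase and decrease while keeping the two probabilities summing to 1 is sketched below. The specification does not state the size of each update; the exponential-style step and the `rate` parameter here are assumptions made purely for illustration, as are the function and parameter names.

```python
from typing import Tuple


def update_language_probability(p_first: float, judged_first: bool,
                                rate: float = 0.2) -> Tuple[float, float]:
    """Return updated (first language probability, second language probability).

    The first language probability moves a fraction `rate` of the way toward
    1 when the utterance was judged to be in the first language, and toward 0
    otherwise. The step rule and `rate` are assumptions, not from the source.
    """
    target = 1.0 if judged_first else 0.0
    p_first = p_first + rate * (target - p_first)
    return p_first, 1.0 - p_first
```

Starting from 0.5, repeated judgments of the same language drive the corresponding probability toward 1, which is consistent with the convergence behavior described for the speaker dictionary 27-2.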

At the start of the conversation, no values are stored in the two speaker features 27-2b. The speaker determination unit 27 therefore treats the speaker 27-2a of the first utterance of the conversation as the first speaker, and newly stores the speaker feature calculated by the speaker feature amount calculation unit 26 as the speaker feature 27-2b whose speaker 27-2a is the "first speaker".

At least immediately after the speaker feature 27-2b whose speaker 27-2a is the "first speaker" has been newly stored, no value is stored in the speaker feature 27-2b whose speaker 27-2a is the "second speaker". In that case, the speaker determination unit 27 compares the speaker feature calculated by the speaker feature amount calculation unit 26 with the speaker feature 27-2b whose speaker 27-2a is the "first speaker", authenticates whether the speaker is the first speaker, and thereby determines whether the speaker is the first speaker or the second speaker. When the speaker is determined to be the first speaker, it stores learned values in the speaker dictionary 27-2 entry whose speaker 27-2a is the "first speaker". When the speaker is determined to be the second speaker, it newly stores values in the speaker dictionary 27-2 entry whose speaker 27-2a is the "second speaker".

As its second function, the speaker determination unit 27 determines the spoken language by way of determining the speaker. That is, it determines the speaker by comparing the speaker feature calculated by the speaker feature amount calculation unit 26 with the speaker features 27-2b stored in the speaker dictionary 27-2, and obtains the probability that the speaker 27-2a is the "first speaker" and the probability that the speaker 27-2a is the "second speaker". It then multiplies each of these probabilities by the corresponding first language probability 27-2c and adds the products to obtain the probability Ps(La) that the utterance is in the first language. Likewise, it multiplies each of these probabilities by the corresponding second language probability 27-2d and adds the products to obtain the probability Ps(Lb) that the utterance is in the second language. This calculation is performed at the start of each utterance using the input speech in a short period after the start of the utterance, for example within one second, and the translation direction determination unit 28 refers to the result.
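The sum of products that yields Ps(La) and Ps(Lb) can be sketched as follows. The function and parameter names are hypothetical, and the numbers in the usage note are invented solely to illustrate the arithmetic.

```python
from typing import Tuple


def language_probability_via_speaker(
    p_is_speaker: Tuple[float, float],
    p_first_language: Tuple[float, float],
    p_second_language: Tuple[float, float],
) -> Tuple[float, float]:
    """Return (Ps(La), Ps(Lb)).

    p_is_speaker holds the probabilities that the utterance was made by the
    first and by the second speaker. p_first_language and p_second_language
    hold each speaker's stored language probabilities (27-2c and 27-2d),
    in the same speaker order.
    """
    ps_la = sum(p * q for p, q in zip(p_is_speaker, p_first_language))
    ps_lb = sum(p * q for p, q in zip(p_is_speaker, p_second_language))
    return ps_la, ps_lb
```

For example, if the utterance is attributed to the first speaker with probability 0.8, and the stored first language probabilities are 0.9 for the first speaker and 0.1 for the second, then Ps(La) = 0.8 × 0.9 + 0.2 × 0.1 = 0.74 and Ps(Lb) = 0.8 × 0.1 + 0.2 × 0.9 = 0.26.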

Here, the learning of the speaker feature 27-2b and the determination that refers to the speaker feature 27-2b follow a statistical model built by a technique such as vector quantization or a Gaussian mixture model. The determination may instead follow a discriminative model such as linear discrimination or a support vector machine (SVM).

In the above description, the operation of the speaker determination unit 27 was divided into a learning operation and a determination operation, but this was only to make the operation easier to understand. Since these operations contain common steps, the common steps may naturally be shared as appropriate.

Immediately after the conversation translation unit 20 starts operating, the speaker feature 27-2b may not yet hold a value. In addition, the speaker feature 27-2b, the first language probability 27-2c, and the second language probability 27-2d may change greatly because of the learning performed at every utterance. At that stage, therefore, the speaker determination unit 27 cannot obtain valid probabilities Ps(La) and Ps(Lb) by any processing, such as speaker authentication. As learning proceeds and the values of the speaker feature 27-2b, the first language probability 27-2c, and the second language probability 27-2d converge, valid values are obtained for the probabilities Ps(La) and Ps(Lb). That is, the speaker determination unit 27 becomes able to determine correctly whether the speaker's utterance is in the first language or the second language.

The translation direction determination unit 28 determines whether the input speech is in the first language or the second language from the probability Pl(La) that the input speech is in the first language and the probability Pl(Lb) that it is in the second language, both calculated by the language determination unit 25, and from the probability Ps(La) that the input speech is in the first language and the probability Ps(Lb) that it is in the second language, both calculated by the speaker determination unit 27. It then determines the translation direction. That is, if the input speech is in the first language, it determines that the translation direction is from the first language to the second language, and if the input speech is in the second language, it determines that the translation direction is from the second language to the first language.

In accordance with the translation direction determined by the translation direction determination unit 28, the speech translation unit 29 reads the input speech stored in the buffer 20-2, going back to the speech at the start of the utterance, translates it, and sends the translated output speech to the voice processing unit 14. If, while the speech translated by the speech translation unit 29 is being output, a predetermined operation on the input unit 16 indicates that the translation direction is reversed, the translation control unit 21 causes the translation direction determination unit 28 to have the input speech stored in the buffer 20-2 read, going back to the speech at the start of the utterance, and translated in the direction opposite to the above translation direction. It also causes the speaker determination unit 27 to cancel the learning performed on that utterance.

The details of the learning operation of the speaker determination unit 27 will now be described with reference to the flowchart shown in FIG. 5. The following operation applies when values are stored in both speaker features 27-2b. The speaker determination unit 27 starts the learning operation when an utterance starts or partway through an utterance (step S27a), and receives the speaker feature calculated by the speaker feature amount calculation unit 26 and the language determined by the language determination unit 25 (step S27b). It then calculates the distance D1 between the received speaker feature and the speaker feature 27-2b whose speaker 27-2a is the "first speaker", and the distance D2 between the received speaker feature and the speaker feature 27-2b whose speaker 27-2a is the "second speaker".

The speaker determination unit 27 then judges the speaker of the utterance from the distance D1 and the distance D2 (step S27c). The details of this judgment are described later. When the speaker is judged to be the first speaker, the speaker determination unit 27 updates by learning the speaker dictionary 27-2 entry whose speaker 27-2a is the "first speaker" (step S27d), and returns to the receiving step S27b. In addition to the learning update of the speaker feature 27-2b, this update increases the first language probability 27-2c and decreases the second language probability 27-2d when the language determined by the language determination unit 25 is the first language. When that language is the second language, the update decreases the first language probability 27-2c and increases the second language probability 27-2d.

When the speaker is determined to be the second speaker, the speaker determination unit 27 updates, by learning, the speaker dictionary 27-2 whose speaker 27-2a is the "second speaker" (step S27e), and returns to the receiving step S27b. As described for step S27d, this update covers the learning update of the speaker feature amount 27-2b together with the update of the first language probability 27-2c and the second language probability 27-2d.

When the speaker is determined to be unknown, the speaker determination unit 27 returns to the receiving step S27b without learning, that is, without updating the speaker dictionary 27-2. When the utterance ends, the speaker determination unit 27 ends the learning operation at whichever operation step it is in (not shown).
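The update performed in steps S27d and S27e can be sketched as follows. This is a minimal illustration, not the method of the embodiment: the description only says that the feature amount is updated by learning and that the language probabilities are increased or decreased, so the exponential moving average, the `rate` parameter, and all names (`SpeakerEntry`, `learn_step`) are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SpeakerEntry:
    """One entry of the speaker dictionary 27-2 (field names are hypothetical)."""
    feature: List[float]  # speaker feature amount 27-2b
    p_first: float        # first language probability 27-2c
    p_second: float       # second language probability 27-2d


def learn_step(dictionary: Dict[str, SpeakerEntry],
               speaker: Optional[str],
               feature: List[float],
               language: str,
               rate: float = 0.1) -> None:
    """Update the entry of the determined speaker; do nothing if the speaker is unknown.

    `speaker` is "first", "second", or None (unknown).
    `language` is "first" or "second", as judged by the language determination unit.
    The moving-average update rule is an assumption, not taken from the source.
    """
    if speaker is None:
        return  # unknown speaker: the dictionary is left untouched
    entry = dictionary[speaker]
    # Move the stored feature amount toward the feature amount of this utterance.
    entry.feature = [(1 - rate) * old + rate * new
                     for old, new in zip(entry.feature, feature)]
    # Raise the probability of the judged language and lower the other one.
    target_first = 1.0 if language == "first" else 0.0
    entry.p_first += rate * (target_first - entry.p_first)
    entry.p_second += rate * ((1.0 - target_first) - entry.p_second)
```

The two probabilities are updated separately rather than derived from each other, which mirrors the separate calculation of per-language probabilities in the description.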

The determination in step S27c is now described in detail. The speaker determination unit 27 determines that the speaker is the first speaker when the distance D1 is smaller than the distance D2 and the difference between D1 and D2 is equal to or greater than a predetermined speaker feature amount threshold. It determines that the speaker is the second speaker when the distance D2 is smaller than the distance D1 and the difference between D1 and D2 is equal to or greater than the predetermined speaker feature amount threshold. In all other cases, that is, when the difference between D1 and D2 is less than the predetermined speaker feature amount threshold, it determines that the speaker is unknown.
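The three-way decision of step S27c can be written compactly. The sketch below assumes the distances D1 and D2 have already been computed; the names `Speaker` and `decide_speaker` are hypothetical, and equal distances are treated as unknown because neither distance is then smaller than the other.

```python
from enum import Enum


class Speaker(Enum):
    FIRST = "first"
    SECOND = "second"
    UNKNOWN = "unknown"


def decide_speaker(d1: float, d2: float, threshold: float) -> Speaker:
    """Decide the speaker from the distances to the two stored feature amounts.

    d1, d2: distances to the first and second speaker's feature amounts.
    threshold: the predetermined speaker feature amount threshold.
    """
    gap = abs(d1 - d2)
    if d1 < d2 and gap >= threshold:
        return Speaker.FIRST
    if d2 < d1 and gap >= threshold:
        return Speaker.SECOND
    # The gap is below the threshold (or the distances are equal).
    return Speaker.UNKNOWN
```

Requiring a minimum gap, rather than simply picking the nearer entry, is what lets the learning step skip ambiguous utterances instead of training the dictionary on a possibly wrong speaker.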

FIG. 6 outlines how these determinations depend on the speaker feature amount of an utterance. For the speaker feature amount of utterance p, the distance Dp1 to the speaker feature amount 27-2b whose speaker 27-2a is the "first speaker" is compared with the distance Dp2 to the speaker feature amount 27-2b whose speaker 27-2a is the "second speaker". The distance Dp1 is smaller than the distance Dp2, and the difference between them is equal to or greater than the predetermined speaker feature amount threshold. The speaker of utterance p is therefore determined to be the first speaker.

For the speaker feature amount of utterance q, the distance Dq1 to the speaker feature amount 27-2b whose speaker 27-2a is the "first speaker" is compared with the distance Dq2 to the speaker feature amount 27-2b whose speaker 27-2a is the "second speaker". The distance Dq2 is smaller than the distance Dq1, and the difference between them is equal to or greater than the predetermined speaker feature amount threshold. The speaker of utterance q is therefore determined to be the second speaker.

For the speaker feature amount of utterance r, the distance Dr1 to the speaker feature amount 27-2b whose speaker 27-2a is the "first speaker" is compared with the distance Dr2 to the speaker feature amount 27-2b whose speaker 27-2a is the "second speaker". The distance Dr1 is smaller than the distance Dr2, but the difference between them is less than the predetermined speaker feature amount threshold. The speaker of utterance r is therefore determined to be unknown.

For the speaker feature amount of utterance s, the distance Ds1 to the speaker feature amount 27-2b whose speaker 27-2a is the "first speaker" is compared with the distance Ds2 to the speaker feature amount 27-2b whose speaker 27-2a is the "second speaker". The distance Ds2 is smaller than the distance Ds1, but the difference between them is less than the predetermined speaker feature amount threshold. The speaker of utterance s is therefore determined to be unknown.

In this way, during learning, the speaker determination unit 27 may determine that the speaker is unknown and skip the learning update. Likewise, when no value is stored in the speaker feature amount 27-2b whose speaker 27-2a is the "second speaker", it may determine that the speaker is unknown and skip the learning update. At the start of an utterance, however, except for the utterance that opens the conversation, it does determine the speaker, determines the spoken language based on that result, and sends the result to the translation direction determination unit 28. The purpose is to supply the information that the translation direction determination unit 28 refers to while keeping the speaker dictionary 27-2 from being trained in the wrong direction.

The operation by which the translation direction determination unit 28 determines whether the input speech is in the first language or the second language is now described in detail with reference to the flowchart shown in FIG. 7. The translation direction determination unit 28 is started by the translation control unit 21, which is itself started under the control of the control unit 11 in response to a predetermined operation of the input unit 16, and begins operating (step S28a). It then sets to 0 the weight r that is applied to the probability calculated by the speaker determination unit 27 when determining whether the input speech is in the first language or the second language (step S28b). The weight r is a number from 0 to 1 inclusive (0 ≤ r ≤ 1), and 1 - r is the weight applied to the probability calculated by the language determination unit 25.

The translation direction determination unit 28 determines whether an utterance has started (step S28c) and, if not, repeats this determination. When it determines that an utterance has started, it calculates the probability P(La) that the spoken language is the first language and the probability P(Lb) that it is the second language by the following equations.
P(La) = r × Ps(La) + (1 - r) × Pl(La)
P(Lb) = r × Ps(Lb) + (1 - r) × Pl(Lb)

If P(La) is greater than P(Lb), the spoken language is determined to be the first language and the translation direction is determined to be from the first language to the second language. If P(Lb) is greater than P(La), the spoken language is determined to be the second language and the translation direction is determined to be from the second language to the first language (step S28d). When the difference between P(La) and P(Lb) is less than a predetermined language determination threshold, the translation direction determination unit 28 may display on the display unit 15 a sentence or image prompting the user to input whether the spoken language is the first language or the second language, and determine the translation direction based on an operation of the input unit 16.
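The weighted combination and the decision of step S28d can be illustrated as follows. Ps denotes the probability from the speaker determination unit 27 and Pl the probability from the language determination unit 25, consistent with the weights r and 1 - r defined for step S28b. The function names, the string return values, and the choice to always fall back to asking the user when the margin is small are assumptions; the description presents that fallback as optional.

```python
def combine(r: float, ps: float, pl: float) -> float:
    """P(L) = r * Ps(L) + (1 - r) * Pl(L), with 0 <= r <= 1."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("weight r must lie in [0, 1]")
    return r * ps + (1.0 - r) * pl


def decide_direction(r: float,
                     ps_a: float, ps_b: float,
                     pl_a: float, pl_b: float,
                     threshold: float) -> str:
    """Return the translation direction, or "ask_user" when the margin is too small.

    ps_a, ps_b: speaker-based probabilities for the first and second language.
    pl_a, pl_b: language-feature-based probabilities for the same languages.
    threshold: the predetermined language determination threshold.
    """
    p_a = combine(r, ps_a, pl_a)
    p_b = combine(r, ps_b, pl_b)
    if p_a == p_b or abs(p_a - p_b) < threshold:
        return "ask_user"  # optional fallback in the description
    return "first_to_second" if p_a > p_b else "second_to_first"
```

With r = 0, as set at the start of a conversation, the speaker-based probabilities have no effect and the decision rests entirely on the language determination unit.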

Next, the translation direction determination unit 28 corrects the weight r (step S28e) and returns to step S28c, in which it determines whether an utterance has started. Regardless of which operation step it is in, the translation direction determination unit 28 ends its operation in accordance with an end instruction from the translation control unit 21, issued under the control of the control unit 11 in response to a predetermined operation of the input unit 16 (not shown).

The correction of the weight r is now described. As noted above, the determination by the speaker determination unit 27 is often unreliable for the first utterances of a conversation, whereas the determination by the language determination unit 25 is generally reliable even at the beginning of a conversation. The initial value of the weight r is therefore 0. As the conversation progresses, that is, as utterances are repeated, the determination by the speaker determination unit 27 becomes more accurate, whereas the determination by the language determination unit 25 does not become more accurate as the conversation progresses.

The weight r is therefore set to a larger value as utterances are repeated, either approaching 1 asymptotically or being set to 1. However, it is not necessarily appropriate to increase it unconditionally with the number of repeated utterances. It is appropriate to increase it after utterances have been repeated a predetermined number of times and/or over a predetermined time in which learning is considered to have progressed, and as the accuracy of speaker determination by the speaker determination unit 27 using the speaker feature amounts 27-2b rises and, for each speaker 27-2a, the difference between the first language probability 27-2c and the second language probability 27-2d grows. The details of how the weight r is increased depend not only on the performance of each unit in the conversation translation unit 20 but also on the performance of the microphone 14b, and are therefore determined by prior trial use.
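One possible shape for the correction of r in step S28e is sketched below. The description leaves the actual schedule to prior trial use, so everything here is hypothetical: the warm-up count, the exponential approach to 1, the scaling by the smallest per-speaker gap between the two language probabilities, and the names `update_weight`, `warmup`, and `rate`. The accuracy of speaker determination, which the description also names as a factor, is not modeled.

```python
import math
from typing import Sequence


def update_weight(n_utterances: int,
                  margins: Sequence[float],
                  warmup: int = 3,
                  rate: float = 0.5) -> float:
    """Return a weight r in [0, 1] for the speaker-based probability.

    n_utterances: number of utterances heard so far in the conversation.
    margins: for each speaker, |first language probability - second language probability|.
    warmup: utterances to wait before trusting the speaker dictionary (assumed value).
    rate: how fast r approaches its ceiling after the warm-up (assumed value).
    """
    if n_utterances < warmup or not margins:
        return 0.0  # early in the conversation, rely on language determination only
    # Approaches 1 asymptotically as utterances accumulate.
    growth = 1.0 - math.exp(-rate * (n_utterances - warmup + 1))
    # Trust the dictionary only as far as its least decisive speaker entry.
    confidence = min(margins)
    return max(0.0, min(1.0, growth * confidence))
```

Because the result is scaled by the probability gap, r can stay small, or even shrink, when the dictionary has not become decisive, which matches the remark that r should not grow with the utterance count alone.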

In the description above, the speech translation unit 29 outputs the translation result as output speech, but this is not a limitation. In addition to or instead of speech output, the translation result may be output by displaying a sentence on the display unit 15. The translation result may also be output as speech in response to a predetermined key operation of the input unit 16.

In the description above, the language determination unit 25, the speaker determination unit 27, and the translation direction determination unit 28 separately calculate the probability that the spoken language is the first language and the probability that it is the second language, but this is not a limitation. For example, the probability of the first language may be calculated and the probability of the second language obtained as 1 minus that probability. Calculating them separately as described above, however, allows the spoken language to be identified in exactly the same way even when three or more languages may be spoken.

In the description above, the two speakers in the conversation are unspecified, but this is not a limitation. For example, one of the speakers may be the owner of the device, who is treated as speaking in the first language, the owner's native language. In this case, for example, in the speaker dictionary 27-2 whose speaker 27-2a is the "first speaker", the feature amount of the device owner's utterances in the first language is stored in advance in the speaker feature amount 27-2b, 1 is stored in the first language probability 27-2c, and 0 is stored in the second language probability 27-2d. Furthermore, in the speaker dictionary 27-2 whose speaker 27-2a is the "second speaker", 0 is stored in advance in the first language probability 27-2c and 1 in the second language probability 27-2d. This is because there is no need to operate the conversation translation unit 20 if the second speaker speaks in the first language.

In this case, the determination operation of the speaker determination unit 27 becomes, instead of or in addition to the method described above, an authentication of whether the speaker is the first speaker. Because the feature amount of the owner's utterances in the first language is stored in the speaker feature amount 27-2b of the speaker dictionary 27-2 whose speaker 27-2a is the "first speaker", this authentication result is more likely to be correct than the determination described above, regardless of whether the speaker is the first speaker or the second speaker. Moreover, it is sufficiently accurate even for the first utterance of the conversation; that is, the difference between Ps(La) and Ps(Lb) is equal to or greater than a predetermined value. Even for the first utterance of the conversation, the translation direction determination unit 28 may therefore determine the translation direction relying heavily on the determination result of the speaker determination unit 27 or, in the extreme case, on that result alone. In other words, the weight r is larger than in the description above.

The native language of the device owner can be determined from the language-in-use information and the country information stored in the SIM card (Subscriber Identity Module Card) of the device. The feature amount of the owner's utterances in the first language can be stored in the speaker dictionary 27-2 by having the translation control unit 21 operate the conversation translation unit 20 in a speaker feature amount learning mode, which runs the speaker feature amount calculation unit 26 and the learning function of the speaker determination unit 27. Alternatively, running the speaker feature amount calculation unit 26 and the learning function of the speaker determination unit 27 during calls over the mobile communication network allows the feature amount to be stored in the speaker dictionary 27-2 without troubling the owner.

Furthermore, regardless of whether a person is the owner of the device, and whether there is one such person or several, the language spoken may be stored in advance and the speaker feature amount learned in advance for the voice of anyone who will be a speaker. When the spoken language and the speaker feature amount are stored for several people, the translation control unit 21 displays on the display unit 15, before a conversation, a sentence or image prompting selection of the people who will speak in that conversation, and stores the speaker feature amount 27-2b, the first language probability 27-2c, and the second language probability 27-2d based on the information for the people selected by a predetermined operation of the input unit 16.

In the description above, the start of an utterance is detected when the volume of the input speech is equal to or greater than a predetermined utterance start volume threshold, but this is not a limitation. For example, it may be detected from a predetermined key operation of the input unit 16. When the display unit 15 is a touch panel, it may be detected from a touch on the touch panel by a finger or the like. Even when such an operation is required, applying the present invention means that the translation direction is determined without user operation, which greatly improves usability when the device is used by unspecified speakers.

In the description above, the speaker 14a and the microphone 14b are shared between translation by the conversation translation unit 20 and calls over the mobile communication network, but this is not a limitation. Either or both of the speaker 14a and the microphone 14b may be provided separately for translation and for calls.

The description above uses an example in which the present invention is applied to a mobile communication device, but the present invention can naturally be applied to any device that translates conversation, such as a PDA or a personal computer. The present invention is not limited to the configuration described above, and various modifications are possible.

11 Control unit
14 Audio processing unit
14a Speaker
14b Microphone
20 Conversation translation unit
21 Translation control unit
22 Speech rate estimation unit
23 Phoneme type detection unit
24 Speech rate normalization unit
25 Language determination unit
25-2 Language dictionary
25-2a Language
25-2b Language feature amount
26 Speaker feature amount calculation unit
27 Speaker determination unit
27-2 Speaker dictionary
27-2a Speaker
27-2b Speaker feature amount
27-2c First language probability
27-2d Second language probability
28 Translation direction determination unit
29 Speech translation unit

Claims

A translation device that translates an utterance in a first language into a second language and an utterance in the second language into the first language, comprising:
language feature amount storage means for storing a language feature amount of utterances in the first language and a language feature amount of utterances in the second language;
input means for inputting an utterance;
language determination means for determining whether the utterance is in the first language or the second language by comparing the language feature amount of the utterance input by the input means with the language feature amounts stored in the language feature amount storage means;
speaker determination means for determining whether the utterance is in the first language or the second language according to a result of determining whether the utterance input by the input means is an utterance by a first speaker or an utterance by a second speaker; and
speech translation means for determining, with reference to the determination by the language determination means and the determination by the speaker determination means, whether the utterance input by the input means is in the first language or the second language, translating the utterance from the first language into the second language when it is determined to be in the first language, and translating the utterance from the second language into the first language when it is determined to be in the second language,
wherein the speech translation means determines whether the utterance first input by the input means is in the first language or the second language according to the determination by the language determination means, and
the speaker determination means stores the speaker feature amount of the first speaker and the probability of each language that speaker utters, and the speaker feature amount of the second speaker and the probability of each language that speaker utters, determines the speaker of the utterance by comparing the speaker feature amount of the utterance input by the input means with the speaker feature amount of the first speaker and the speaker feature amount of the second speaker, and
(A) determines, separately from the language determination means, that the utterance is in the language that the determined speaker has a high probability of uttering, and
(B) updates by learning the stored speaker feature amount of the determined speaker using the speaker feature amount of the input utterance, and updates by learning the stored language probability of the determined speaker using the language of the utterance determined by the language determination means.

The translation device according to claim 1, wherein the speaker determination means:
(C) stores the speaker feature amount of the utterance first input by the input means as the speaker feature amount of the first speaker, and stores the language of that utterance determined by the language determination means as the language of the first speaker;
(D) when an utterance is input by the input means for the second or a later time and the speaker feature amount of the second speaker is not stored, determines whether the utterance is by the first speaker or by the second speaker by comparing the speaker feature amount of the input utterance with the stored speaker feature amount of the first speaker; when the utterance is determined to be by the first speaker, updates by learning the stored speaker feature amount and language of the first speaker with reference to the speaker feature amount of the input utterance and the language of the utterance determined by the language determination means; and when the utterance is determined to be by the second speaker, stores the speaker feature amount of the input utterance as the speaker feature amount of the second speaker and stores the language of the utterance determined by the language determination means as the language of the second speaker;
(E) when an utterance is input by the input means and the speaker feature amount of the first speaker and the speaker feature amount of the second speaker are both stored, determines whether the utterance is by the first speaker or by the second speaker by comparing the speaker feature amount of the input utterance with the stored speaker feature amount of the first speaker and the stored speaker feature amount of the second speaker, and updates by learning the stored speaker feature amount and language of the determined speaker with reference to the speaker feature amount of the input utterance and the language of the utterance determined by the language determination means; and
(F) determines whether the utterance is by the first speaker or by the second speaker by comparing the speaker feature amount of the utterance input by the input means with the stored speaker feature amount of the first speaker and/or the stored speaker feature amount of the second speaker, and determines that the utterance is in the language stored as the language of the determined speaker.

The translation device according to claim 1, wherein the speech translation means determines whether an utterance input by the input means after a predetermined number of utterances is in the first language or the second language by giving a greater weight to the determination by the speaker determination means than to the determination by the language determination means.

The translation device according to claim 1 or claim 2, wherein, in determining whether an utterance input by the input means is in the first language or the second language, the speech translation means gives, for a later-input utterance as compared with an earlier-input utterance, a weight that is not smaller to the determination by the speaker determination means and a weight that is not larger to the determination by the language determination means.