JP2006189544A

JP2006189544A - Interpretation system, interpretation method, recording medium with interpretation program recorded thereon, and interpretation program

Info

Publication number: JP2006189544A
Application number: JP2005000396A
Authority: JP
Inventors: Tatsuya Kimura; 達也木村
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2005-01-05
Filing date: 2005-01-05
Publication date: 2006-07-20

Abstract

PROBLEM TO BE SOLVED: To provide an interpretation system capable of outputting synthesized speech that more closely approximates the speaker's vocal quality, together with an interpretation method, a recording medium with an interpretation program recorded thereon, and an interpretation program.

SOLUTION: The interpretation system is equipped with a speech recognition section 101 which recognizes input speech entered in a first language, an interpretation section 102 which interprets the speech recognition result into a second language, a speech synthesis section 103 which synthesizes speech in the interpreted second language, a vocal quality analysis section 104 which analyzes the vocal quality of the first language, a vocal quality similarity metering section 105 which meters the similarity between the vocal quality of the first language and the vocal quality of the second language, and a vocal quality controller 106 which controls the vocal quality of the second language synthesized in the speech synthesis section 103 based on the similarity metering result obtained in the vocal quality similarity metering section 105. The vocal quality of the first language and that of the second language are thereby made similar, and any sense of incongruity is reduced as far as possible.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to an interpreting apparatus that interprets a speech signal uttered and input in a first language into a second language and outputs the interpretation result as synthesized speech, and to an interpreting method, a recording medium on which an interpreting program is recorded, and an interpreting program.

Conventionally, an interpreting apparatus has been realized by a configuration in which a speech signal input from a microphone is automatically recognized by speech recognition means, the recognition result is translated into a desired foreign language by automatic translation means, and the translation result is then synthesized as speech in that foreign language by speech synthesis means.

Among such interpreting apparatuses, one is also known that controls the speech rate of the synthesized speech according to the speech rate of the input, with the aim of obtaining a natural interpretation result close to the speaker's voice quality (see, for example, Patent Document 1).

FIG. 4 is a block diagram showing the configuration of a conventional interpreting apparatus configured to control the speech rate in this way.

In FIG. 4, speech uttered toward the microphone 401 is converted into an electric signal by the microphone 401 and amplified by the input amplifier 402. The amplified input speech signal is recognized by the speech analysis circuit 403. The recognition result is supplied to the electronic translation circuit 404, where it is automatically translated into the desired foreign language. The translation is then synthesized by the speech synthesis circuit 405 and output as a synthesized speech signal, which is amplified by the output amplifier 406 and output from the speaker 407 as synthesized speech.

Meanwhile, the measurement circuit 408 measures the speech rate of the input speech and supplies the result to the control circuit 409. According to the speech rate supplied from the measurement circuit 408, the control circuit 409 controls the electronic translation circuit 404 and the output speed control circuit 410 so that the speech rate of the synthesized speech follows the speech rate of the input speech.

As a result, the synthesized speech output from the speaker 407 matches the speech rate of the speech input to the microphone 401, and an interpretation result with a natural speech rate can be obtained.

The character display device 411 displays the translation result as text.

Similarly, an apparatus is also known that controls the prosody of the synthesized speech based on the speaker's input speech, with the aim of obtaining a natural interpretation result close to the speaker's voice quality (see, for example, Patent Document 2).

According to this, when translating from a first language into a different second language, accent is taken into account, so that an interpretation result that is more natural in terms of accent can be obtained and speech comprehension can be improved.

Patent Document 1: JP-A-57-57375
Patent Document 2: JP-A-6-332494

However, the conventional interpreting apparatuses described above merely control the speech rate of the synthesized speech to match that of the input speech, or control the accent of the synthesized speech so that it appropriately matches that of the input speech. Because they attend only to the speech rate and accent of the synthesized speech, they cannot produce synthesized speech that properly matches the speaker's voice quality.

The present invention has been made in view of these conventional problems, and provides an interpreting apparatus, an interpreting method, and an interpreting program capable of outputting synthesized speech that is closer to the speaker's voice quality.

The interpreting apparatus of the present invention comprises: speech recognition means for recognizing input speech input in a first language; translation means for translating the speech recognition result into a second language; speech synthesis means for synthesizing speech in the translated second language; voice quality analysis means for analyzing the voice quality of the first language; voice quality similarity measurement means for measuring the similarity between the voice quality of the first language and the voice quality of the second language; and voice quality control means for controlling the voice quality of the second language synthesized by the speech synthesis means, based on the similarity measurement result obtained by the voice quality similarity measurement means.

With this configuration, the similarity between the voice quality of the first language and that of the second language is measured by the voice quality similarity measurement means, and the speech synthesis means is controlled so that the two become more similar. The voice quality of the second language therefore resembles that of the first language, and any sense of incongruity is minimized.

In the interpreting apparatus of the present invention, the speech recognition means recognizes the input speech signal input in the first language as a character string, a word, a word string, a sentence, or a semantic representation, and the speech synthesis means synthesizes speech from a character string, a word, a word string, or a sentence in the second language.

With this configuration, the first language can be recognized as a character string, a word, a word string, a sentence, or a semantic representation, and the second language can be synthesized as a character string, a word, a word string, or a sentence.

In the interpreting apparatus of the present invention, the voice quality analysis means extracts a voice quality feature that characterizes the individuality of the input speech, and the voice quality similarity measurement means compares the voice quality feature extracted by the voice quality analysis means with the voice quality feature of the second language synthesized by the speech synthesis means.

With this configuration, the similarity in voice quality can easily be judged from the voice quality features that characterize the individuality of the speech.

In the interpreting apparatus of the present invention, the voice quality feature is the spectral envelope, as a vocal tract characteristic, contained in the input speech signal and in the synthesized speech signal of the second language.

With this configuration, the voice quality feature can be extracted easily.

In the interpreting apparatus of the present invention, the voice quality analysis means comprises vocal tract feature extraction means for extracting a vocal tract feature from the input speech signal, and pitch frequency extraction means for extracting the pitch frequency from the input speech signal. The voice quality similarity measurement means comprises vocal tract feature similarity measurement means for comparing the vocal tract feature extracted by the vocal tract feature extraction means with the vocal tract feature of the second language synthesized by the speech synthesis means, and pitch frequency similarity measurement means for comparing the pitch frequency extracted by the pitch frequency extraction means with the pitch frequency of the second language synthesized by the speech synthesis means.

With this configuration, similarity is judged on the pitch frequency as well as on the vocal tract feature, so that more of the speaker's individual characteristics are captured and it becomes easier to tell who spoke.

The interpreting apparatus of the present invention further comprises signal power extraction means for extracting the signal power of the input speech signal, and controls the signal power of the second-language speech signal synthesized by the speech synthesis means based on the signal power extracted by the signal power extraction means.

With this configuration, the signal power of the second language can be controlled to follow the signal power of the first language, so that the output is loud when the input power is large and quiet when it is small.
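As a rough illustration of the power matching described above, the following Python sketch computes the mean power of an input frame and rescales a synthesized frame to the same power. The function names `mean_power` and `match_power` and the toy sample values are hypothetical; the patent does not specify how the signal power is extracted or applied.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a block of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(s * s for s in samples) / len(samples)


def match_power(synth, target_power):
    """Scale `synth` so that its mean power equals `target_power`."""
    current = mean_power(synth)
    if current == 0.0:
        return list(synth)  # silence cannot be rescaled
    gain = math.sqrt(target_power / current)
    return [s * gain for s in synth]


# Toy data (assumed values, not taken from the patent).
input_frame = [0.8, -0.8, 0.8, -0.8]   # louder input speech
synth_frame = [0.1, -0.2, 0.1, -0.2]   # quieter synthesized speech
matched = match_power(synth_frame, mean_power(input_frame))
```

A single gain per utterance is the simplest choice; a per-frame gain would follow the input's loudness contour more closely at the cost of possible artifacts.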

The interpreting apparatus of the present invention further comprises utterance speed extraction means for extracting the utterance speed of the input speech from the input speech signal, and controls the utterance speed of the second-language speech signal synthesized by the speech synthesis means based on the utterance speed extracted by the utterance speed extraction means.

With this configuration, the utterance speed of the second language can be made faster or slower to match the utterance speed of the first language, which adds one more characteristic of the person speaking the first language.
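One possible way to turn a measured utterance speed into a synthesizer setting is sketched below in Python. Everything here is an assumption made for illustration: the patent does not define the unit of utterance speed, and `utterance_speed`, `rate_factor`, the reference speed, and the clamping range are hypothetical.

```python
def utterance_speed(unit_count, duration_s):
    """Speech units (for example morae or syllables) per second."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return unit_count / duration_s


def rate_factor(input_speed, reference_speed, lo=0.5, hi=2.0):
    """Multiplier for the synthesizer's default rate, clamped to [lo, hi].

    `reference_speed` is an assumed typical speed for the input language,
    so a factor above 1 means the speaker is faster than typical.
    """
    if reference_speed <= 0:
        raise ValueError("reference_speed must be positive")
    return min(hi, max(lo, input_speed / reference_speed))


# 12 units spoken in 2 seconds against an assumed reference of 4 units/s.
factor = rate_factor(utterance_speed(12, 2.0), 4.0)
```

Normalizing against a per-language reference avoids comparing raw speeds across languages, whose units per second differ for the same content.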

Furthermore, the interpreting method of the present invention comprises: a speech recognition step of recognizing input speech input in a first language; a translation step of translating the result recognized in the speech recognition step into a second language; a speech synthesis step of synthesizing speech in the second language translated in the translation step; a voice quality analysis step of analyzing the voice quality of the input speech input in the first language; a voice quality similarity measurement step of measuring the similarity between the voice quality of the input speech in the first language and the voice quality of the synthesized speech translated into the second language; and a voice quality control step of controlling the voice quality of the second language synthesized in the speech synthesis step, based on the similarity measurement result obtained in the voice quality similarity measurement step.

With this configuration, the voice quality of the first language and that of the second language are compared in the voice quality similarity measurement step, and the speech synthesis is controlled according to the result so that the two voice qualities become more similar. The second-language speech can thus be brought closer to the first-language speech.

In the interpreting method of the present invention, the speech recognition step recognizes the input speech signal input in the first language as a character string, a word, a word string, a sentence, or a semantic representation, and the speech synthesis step synthesizes speech from a character string, a word, a word string, or a sentence in the second language.

Furthermore, the recording medium of the present invention records an interpreting program for causing a computer to execute: a speech recognition step of recognizing input speech input in a first language; a translation step of translating the result recognized in the speech recognition step into a second language; a speech synthesis step of synthesizing speech in the second language translated in the translation step; a voice quality analysis step of analyzing the voice quality of the input speech input in the first language; a voice quality similarity measurement step of measuring the similarity between the voice quality of the input speech in the first language and the voice quality of the synthesized speech translated into the second language; and a voice quality control step of controlling the voice quality of the second language synthesized in the speech synthesis step, based on the similarity measurement result obtained in the voice quality similarity measurement step.

With this configuration, the interpreting program can be read out and its steps executed, so that speech uttered in the first language can easily be rendered in the second language with a similar voice quality.

Furthermore, the interpreting program of the present invention causes a computer to execute: a speech recognition procedure of recognizing input speech input in a first language; a translation procedure of translating the result recognized in the speech recognition procedure into a second language; a speech synthesis procedure of synthesizing speech in the second language translated in the translation procedure; a voice quality analysis procedure of analyzing the voice quality of the input speech input in the first language; a voice quality similarity measurement procedure of measuring the similarity between the voice quality of the input speech in the first language and the voice quality of the synthesized speech translated into the second language; and a voice quality control procedure of controlling the voice quality of the second language synthesized in the speech synthesis procedure, based on the similarity measurement result obtained in the voice quality similarity measurement procedure.

With this configuration, a computer can be made to render speech uttered in the first language in the second language with a similar voice quality.

The interpreting apparatus of the present invention comprises: speech recognition means for recognizing input speech input in a first language; translation means for translating the speech recognition result into a second language; speech synthesis means for synthesizing speech in the translated second language; voice quality analysis means for analyzing the voice quality of the first language; voice quality similarity measurement means for measuring the similarity between the voice quality of the first language and the voice quality of the second language; and voice quality control means for controlling the voice quality of the second language synthesized by the speech synthesis means, based on the similarity measurement result obtained by the voice quality similarity measurement means. Because the similarity between the two voice qualities is measured by the voice quality similarity measurement means and the speech synthesis means is controlled so that they become more similar, the voice quality of the second language resembles that of the first language, which has the effect of reducing any sense of incongruity.

Embodiments of the present invention are described below with reference to the drawings.

(Embodiment 1)
FIG. 1 is a block diagram showing a schematic configuration of the interpreting apparatus according to Embodiment 1 of the present invention.

As shown in FIG. 1, the interpreting apparatus according to Embodiment 1 of the present invention includes: a speech recognition unit 101 that receives the input speech signal; a translation unit 102 that receives the output of the speech recognition unit 101; a speech synthesis unit 103 that receives the output of the translation unit 102; a voice quality analysis unit 104 that receives the input speech signal; a voice quality similarity measurement unit 105 that receives the output of the voice quality analysis unit 104 and the speech synthesis result of the speech synthesis unit 103; and a voice quality control unit 106 that controls the speech synthesis unit 103 based on these results.

Next, the operation of the interpreting apparatus according to this embodiment is described.

Speech uttered in a first language (for example, Japanese) is converted into a speech signal by a microphone (not shown), amplified by an input amplifier, and input to the speech recognition unit 101 as the input speech signal. The speech recognition unit 101 recognizes the input speech signal and outputs the result in a predesignated form such as a word, a word string, a sentence, or a semantic representation. The recognition result is input to the translation unit 102, where it is translated into a second language (for example, English). Like the recognition result, the translation result is output in a predesignated form such as a word, a word string, or a sentence. The translation result is supplied to the speech synthesis unit 103, which synthesizes it and outputs a synthesized speech signal.

The speech synthesis unit 103 is speech synthesis means capable of voice quality control, and is configured to output synthesized speech signals of different voice qualities under the control of the voice quality control unit 106. Specifically, a method based on a vocoder, which is a known technique, may be used, for example (Sadaoki Furui, "Speech Information Processing", Morikita Publishing, 1998, p. 40), but the synthesis method is not limited to this.

This is described in more detail below.

The input speech signal is input to the speech recognition unit 101 and, at the same time, to the voice quality analysis unit 104. The voice quality analysis unit 104 extracts a voice quality feature that characterizes the individuality of the speech, such as the spectral envelope (vocal tract characteristic) contained in the input speech signal. A vector quantity such as low-order cepstrum coefficients is used as this feature, for example.
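The low-order cepstrum coefficients mentioned above can be obtained from the log magnitude spectrum of a windowed frame. The Python sketch below uses only the standard library and a naive DFT, which is adequate for short frames. The function names, the Hamming window, the order of 12, and the toy frame are illustrative assumptions; the patent does not prescribe an analysis method.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (fine for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def low_order_cepstrum(frame, order=12, floor=1e-10):
    """Return real-cepstrum coefficients c[1..order] of one frame.

    c[0] is dropped because it reflects overall level rather than
    the shape of the spectral envelope.
    """
    n = len(frame)
    # Hamming window to limit spectral leakage.
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    log_mag = [math.log(max(abs(v), floor)) for v in dft(windowed)]
    # The log magnitude of a real signal is even, so its inverse DFT is
    # real and reduces to a cosine sum.
    cep = [sum(log_mag[k] * math.cos(2 * math.pi * k * q / n)
               for k in range(n)) / n
           for q in range(order + 1)]
    return cep[1:]


# Toy frame: two sinusoids, 64 samples (assumed data).
frame = [math.sin(2 * math.pi * 5.3 * i / 64)
         + 0.5 * math.sin(2 * math.pi * 11.7 * i / 64) for i in range(64)]
features = low_order_cepstrum(frame)
```

Keeping only the low-order terms retains the smooth envelope associated with the vocal tract and discards the fine harmonic structure due to pitch.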

When the voice quality analysis unit 104 has extracted the voice quality feature from the input speech signal, the feature is input to the voice quality similarity measurement unit 105. The voice quality similarity measurement unit 105 and the voice quality control unit 106 then control the speech synthesis unit 103 so that the voice quality of the synthesized speech signal it outputs is as close as possible to that of the input speech signal.

That is, the voice quality similarity measurement unit 105 compares the voice quality feature of the input speech signal with the voice quality feature of the speech that the speech synthesis unit 103 would synthesize under the current control conditions, and measures the similarity in voice quality between the input speech and the output speech. The similarity can be measured using, for example, the Euclidean distance between vector quantities such as low-order cepstrum coefficients, or a perceptually weighted distance.
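The two distance measures named above can be written in a few lines of Python. The weights in `weighted_distance` are a stand-in for a perceptual weighting, which the patent does not specify; the function names and example vectors are likewise hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance; `weights` stands in for a perceptual
    weighting whose actual values are not given in the source."""
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    return math.sqrt(sum(w * (x - y) ** 2
                         for x, y, w in zip(a, b, weights)))


input_features = [0.0, 0.0]   # toy cepstral vector of the input speech
synth_features = [3.0, 4.0]   # toy cepstral vector of the synthesized speech
d = euclidean_distance(input_features, synth_features)
```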

The measured voice quality similarity is input to the voice quality control unit 106, which controls the speech synthesis unit 103 so that the similarity becomes optimal.

When a distance is used as the measure of voice quality similarity, a smaller distance indicates a better control criterion (that is, a closer resemblance).
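Under the assumption that the synthesizer exposes a finite set of candidate control settings, minimizing the distance reduces to a simple search. In the Python sketch below, `select_control_setting`, the `predict_features` callback, and the scalar `warp` settings are all hypothetical; the patent does not say how the control conditions are parameterized or searched.

```python
import math


def select_control_setting(target, candidates, predict_features):
    """Return (setting, distance) for the candidate whose predicted
    voice quality features are closest to `target`."""
    best_setting, best_dist = None, math.inf
    for setting in candidates:
        predicted = predict_features(setting)
        dist = math.sqrt(sum((t - p) ** 2
                             for t, p in zip(target, predicted)))
        if dist < best_dist:   # smaller distance = closer resemblance
            best_setting, best_dist = setting, dist
    return best_setting, best_dist


# Toy model: a single scalar setting scales a fixed feature template.
def toy_predict(warp):
    return [warp * 1.0, warp * 2.0]


setting, dist = select_control_setting([1.05, 2.1], [0.8, 1.0, 1.2],
                                       toy_predict)
```

An exhaustive search is only practical for a small candidate set; a continuous parameter space would call for an iterative optimizer instead.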

With this configuration, the voice quality control unit 106 controls the speech synthesis unit 103 while the voice quality similarity measurement unit 105 compares the voice quality of the input speech signal with that of the synthesized speech signal, so that the voice quality of the synthesized speech can be brought closer to that of the input speech.
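The data flow of Embodiment 1 can be summarized as a short Python sketch in which each unit is passed in as a callable. This is a structural illustration only: the function `interpret`, its parameters, and the toy stand-ins are hypothetical, and real recognition, translation, and synthesis components would replace the lambdas. The sketch also assumes that candidate settings are evaluated by synthesizing and re-analyzing the output, which is one of several ways to obtain the synthesized speech's features.

```python
import math


def interpret(signal, recognize, translate, synthesize, analyze, candidates):
    """Recognize, translate, then synthesize with the control setting whose
    output features are closest to those of the input speech."""
    text = recognize(signal)             # corresponds to unit 101
    translated = translate(text)         # corresponds to unit 102
    target = analyze(signal)             # corresponds to unit 104

    def distance(setting):               # corresponds to unit 105
        synth_features = analyze(synthesize(translated, setting))
        return math.sqrt(sum((t - s) ** 2
                             for t, s in zip(target, synth_features)))

    best = min(candidates, key=distance)        # corresponds to unit 106
    return synthesize(translated, best), best   # corresponds to unit 103


# Toy stand-ins so the sketch runs end to end (not real components).
output, chosen = interpret(
    signal=[0.3, -0.3, 0.3],
    recognize=lambda sig: "konnichiwa",
    translate=lambda text: "hello",
    synthesize=lambda text, level: [level] * len(text),
    analyze=lambda sig: [sum(abs(s) for s in sig) / len(sig)],
    candidates=[0.1, 0.3, 0.5],
)
```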

(Embodiment 2)
FIG. 2 is a block diagram showing a schematic configuration of the interpreting apparatus according to Embodiment 2 of the present invention.

As shown in FIG. 2, the interpreting apparatus according to Embodiment 2 of the present invention includes: a speech recognition unit 201 that receives the input speech signal; a translation unit 202 that receives the output of the speech recognition unit 201; a speech synthesis unit 203 that receives the output of the translation unit 202; a vocal tract feature extraction unit 204, a pitch frequency extraction unit 205, a signal power extraction unit 206, and an utterance speed extraction unit 207, each of which receives the input speech signal; a vocal tract feature similarity measurement unit 208 that receives the output of the vocal tract feature extraction unit 204; a pitch frequency similarity measurement unit 209 that receives the output of the pitch frequency extraction unit 205; and a voice quality control unit 210 that receives the outputs of the vocal tract feature similarity measurement unit 208, the pitch frequency similarity measurement unit 209, the signal power extraction unit 206, and the utterance speed extraction unit 207, and controls the speech synthesis unit 203.

Speech uttered in the first language is converted into a speech signal by a microphone (not shown), amplified by an input amplifier, and input to the speech recognition unit 201 as the input speech signal. The speech recognition unit 201 recognizes the input speech signal and outputs the result in a predesignated form such as a word, a word string, a sentence, or a semantic representation. The recognition result is input to the translation unit 202, where it is translated into the second language. Like the recognition result, the translation result is output in a predesignated form such as a word, a word string, or a sentence. The translation result is supplied to the speech synthesis unit 203, which synthesizes it and outputs a synthesized speech signal.

The speech synthesis unit 203 is speech synthesis means capable of voice quality control, and is configured to output synthesized speech signals of different voice qualities under the control of the voice quality control unit 210. The voice quality control unit 210 receives the outputs of the vocal tract feature similarity measurement unit 208, the pitch frequency similarity measurement unit 209, the signal power extraction unit 206, and the utterance speed extraction unit 207, and controls the speech synthesis unit 203 according to those outputs.

This is described in more detail below.

First, the vocal tract feature extraction unit 204 extracts a voice quality feature that characterizes the individuality of the speech, such as the spectral envelope (vocal tract characteristic) contained in the input speech signal. A vector quantity such as low-order cepstrum coefficients is used as this feature, for example. The extracted feature is supplied to the vocal tract feature similarity measurement unit 208, where it is compared with the voice quality feature of the synthesized speech produced by the speech synthesis unit 203.

The pitch frequency extraction unit 205 extracts the pitch frequency of the input speech signal, or its transition pattern. The extracted pitch frequency or transition pattern is input to the pitch frequency similarity measurement unit 209, where it is compared with the pitch frequency, or its transition pattern, of the synthesized speech produced by the speech synthesis unit 203.
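A common way to estimate the pitch frequency of a voiced frame is to locate the strongest peak of its autocorrelation within a plausible lag range. The Python sketch below illustrates that approach. Autocorrelation is a choice made here for illustration, not a method stated in the patent, and `estimate_pitch_hz`, the 60 to 400 Hz search range, and the test tone are assumptions.

```python
import math


def estimate_pitch_hz(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate pitch by picking the autocorrelation peak between the
    lags that correspond to fmax and fmin. Returns None if no positive
    peak is found (for example, for silence)."""
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]          # remove DC offset
    lag_min = max(1, int(sample_rate / fmax))
    lag_max = min(n - 1, int(sample_rate / fmin))
    best_lag, best_r = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(x[i] * x[i + lag] for i in range(n - lag))
        if r > best_r:
            best_r, best_lag = r, lag
    if best_lag is None:
        return None
    return sample_rate / best_lag


# Toy input: a 200 Hz sine sampled at 8 kHz for 50 ms (assumed data).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(400)]
pitch = estimate_pitch_hz(tone, sample_rate)
```

Applying the estimator to successive frames yields a pitch contour, which could serve as the transition pattern compared by the similarity measurement unit.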

声道特徴量類似性計量部２０８で比較され得られた結果と、ピッチ周波数類似性計量部２０９で比較され得られた結果、並びに、信号パワー抽出部２０６、発声速度抽出部２０７でそれぞれ抽出された入力音声信号のパワー、発声速度が、それぞれ声質制御部２１０に入力される。その結果、次のように、音声合成部２０３が制御される。 The comparison result from the vocal tract feature quantity similarity measurement unit 208, the comparison result from the pitch frequency similarity measurement unit 209, and the power and utterance speed of the input speech signal extracted by the signal power extraction unit 206 and the utterance speed extraction unit 207, respectively, are each input to the voice quality control unit 210. As a result, the speech synthesis unit 203 is controlled as follows.

すなわち、まず、声道特徴量類似性計量部２０８、ピッチ周波数類似性計量部２０９、声質制御部２１０がそれぞれ動作することにより合成音の声質が入力音声に近い声質になるように音声合成部２０３が制御される。 That is, first, the vocal tract feature quantity similarity measurement unit 208, the pitch frequency similarity measurement unit 209, and the voice quality control unit 210 each operate so that the speech synthesis unit 203 is controlled to make the voice quality of the synthesized speech close to that of the input speech.

声道特徴量類似性計量部２０８では、入力音声信号の声質の特徴量と、現在の制御条件を仮定した場合に音声合成部２０３により合成される音声の声質の特徴量とを比較し、入力音声と出力音声の声質の類似性を計量する。声質の類似性の計量方法としては、上記ベクトル量のユークリッド距離あるいは聴覚重み付け距離等を用いればよい。このようにして計量した声質の類似性は、声質制御部２１０に加えられ、それに基づいて音声合成部２０３が制御される。したがって、音声合成部２０３で合成される合成音の声質は、入力音声に近い声質になる。なお、類似性として距離を用いる場合は、距離の値が小さいほど良好な制御規範である（良く似ている）ことを意味する。 The vocal tract feature quantity similarity measurement unit 208 compares the voice quality feature quantity of the input speech signal with the voice quality feature quantity of the speech that the speech synthesis unit 203 would synthesize under the current control condition, and thereby measures the similarity in voice quality between the input speech and the output speech. The Euclidean distance between the vector quantities mentioned above, an auditory weighted distance, or the like may be used to measure the similarity. The similarity measured in this way is supplied to the voice quality control unit 210, which controls the speech synthesis unit 203 on that basis. The voice quality of the synthesized speech produced by the speech synthesis unit 203 therefore becomes close to that of the input speech. When a distance is used as the similarity, a smaller distance value means a better control criterion (that is, the two voices are more alike).
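The two distance measures named above can be sketched as follows. The Euclidean distance is standard. The patent does not define the auditory weighted distance, so the per-coefficient `weights` argument here is a hypothetical stand-in for whatever perceptual weighting an implementation would choose.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_distance(a, b, weights):
    """Distance with one non-negative weight per coefficient.

    Assumption: the weights model perceptual importance; with all
    weights equal to 1 this reduces to the Euclidean distance.
    """
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))
```

In both cases a smaller value means the synthesized voice is closer to the input voice, which matches the control criterion stated above.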

ピッチ周波数類似性計量部２０９では、入力音声信号のピッチ周波数パターンと現在の制御条件を仮定した場合に音声合成部２０３により合成される音声のピッチ周波数パターンの類似性を計量する。ピッチ周波数パターンの類似性も声質制御部２１０に供給され、それに基づいて音声合成部２０３が制御される。したがって、音声合成部２０３で合成される合成音のピッチ周波数パターンは、入力音声に近いピッチ周波数パターンになる。 The pitch frequency similarity measurement unit 209 measures the similarity between the pitch frequency pattern of the input speech signal and the pitch frequency pattern of the speech that the speech synthesis unit 203 would synthesize under the current control condition. The similarity of the pitch frequency patterns is also supplied to the voice quality control unit 210, which controls the speech synthesis unit 203 on that basis. The pitch frequency pattern of the synthesized speech produced by the speech synthesis unit 203 therefore becomes close to that of the input speech.
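How two pitch frequency patterns are compared is left open in the text. Because the input and the translated utterance generally differ in length, one plausible approach is to resample both contours to a fixed number of points and take the root-mean-square difference of the log frequencies. The sketch below does this; the function names, the log-frequency scale, and the default of 20 points are assumptions, and unvoiced frames are assumed to have been removed beforehand.

```python
import math


def resample(contour, num_points):
    """Linearly interpolate a contour to num_points evenly spaced samples."""
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    if len(contour) == 1:
        return [contour[0]] * num_points
    out = []
    for i in range(num_points):
        pos = i * (len(contour) - 1) / (num_points - 1)
        lo = int(pos)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out


def pitch_pattern_distance(input_f0, synth_f0, num_points=20):
    """RMS difference of log pitch after resampling to a common length.

    Smaller values mean more similar pitch patterns.
    """
    a = resample([math.log(f) for f in input_f0], num_points)
    b = resample([math.log(f) for f in synth_f0], num_points)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / num_points)


# Usage: a low-pitched input compared with two candidate synthesized voices.
speaker = [110.0, 120.0, 115.0]
close_voice = [112.0, 118.0, 116.0]
octave_up = [220.0, 240.0, 230.0]
```

With these example contours the candidate one octave higher lies at a distance of about ln 2 from the speaker, far more than the close candidate, so a controller using this measure would prefer the latter.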

また、声質の制御を行う際に入力音声信号のパワー、および、入力音声の発声速度も参照することになり、合成音のパワー、および、発声速度も入力音声に連動したパワー、および、発声速度になる。 In addition, the power of the input speech signal and the utterance speed of the input speech are also referred to when the voice quality is controlled, so the power and utterance speed of the synthesized speech follow those of the input speech.
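A minimal sketch of how the synthesized speech could be made to follow the input power and utterance speed is shown below. The patent gives no formulas, so the use of RMS level as "power", a unit count per second (for example syllables or morae) as "utterance speed", and all function names are assumptions for illustration.

```python
import math


def rms_power(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def match_power(synth_samples, input_samples):
    """Scale the synthesized samples so their RMS level equals the input's."""
    current = rms_power(synth_samples)
    if current == 0.0:
        return list(synth_samples)  # silence cannot be rescaled
    gain = rms_power(input_samples) / current
    return [s * gain for s in synth_samples]


def rate_scale(input_units, input_seconds, synth_units, synth_seconds):
    """Factor by which the synthesized speech should be sped up (>1) or
    slowed down (<1) so that its units-per-second rate matches the input."""
    input_rate = input_units / input_seconds
    synth_rate = synth_units / synth_seconds
    return input_rate / synth_rate


# Usage: the speaker says 12 units in 2.0 s; the default synthesis says 10 in 2.5 s.
scale = rate_scale(12, 2.0, 10, 2.5)  # 6 units/s against 4 units/s
matched = match_power([0.1, -0.1, 0.1, -0.1], [0.5, -0.5, 0.5, -0.5])
```

Comparing unit rates across two languages is itself a simplification, since the languages may differ in typical syllable rate.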

このように、かかる構成によれば、声質特徴量抽出部２０４、ピッチ周波数抽出部２０５、信号パワー抽出部２０６、発声速度抽出部２０７により声質の特徴をそれぞれ抽出し、声道特徴量類似性計量部２０８、および、ピッチ周波数類似性計量部２０９を用いて声道特徴量類似度、ピッチ周波数類似度を算出し、それらを声質制御部２１０に入力して、声質制御部２１０で音声合成部２０３を制御しており、声道特徴量、ピッチ周波数、信号パワー、発声速度をそれぞれ入力音声により近いものにすることができ、より入力音声に近い声質の合成音を得ることができるという効果を有する。 Thus, according to this configuration, the voice quality feature quantity extraction unit 204, the pitch frequency extraction unit 205, the signal power extraction unit 206, and the utterance speed extraction unit 207 each extract a feature of the voice quality; the vocal tract feature quantity similarity measurement unit 208 and the pitch frequency similarity measurement unit 209 calculate the vocal tract feature quantity similarity and the pitch frequency similarity; and these are input to the voice quality control unit 210, which controls the speech synthesis unit 203. The vocal tract feature quantity, pitch frequency, signal power, and utterance speed can each be brought closer to those of the input speech, so synthesized speech whose voice quality is closer to the input speech can be obtained.

(実施の形態３) (Embodiment 3)
図３は、本発明の実施の形態３として、通訳方法を説明するためのフローチャートである。 FIG. 3 is a flowchart for explaining an interpretation method as the third embodiment of the present invention.

図３において、ステップＳ３０１は、第１の言語で発声された音声信号を入力する音声入力ステップである。入力された第１の言語の音声信号は、次の音声認識ステップＳ３０２において音声認識され、その結果を単語または単語列または文または意味表現など予め指定された形式で出力する。翻訳ステップＳ３０３では、認識結果を第２の言語に翻訳し、翻訳結果を単語または単語列または文などあらかじめ指定された表現形式で出力する。音声合成ステップＳ３０４では、翻訳結果を入力し、第２の言語による合成音声信号を出力する。声質分析ステップＳ３０５では、例えば、入力音声に含まれるスペクトル包絡（声道特性）などの音声の個人性を特徴づける声質の特徴量を抽出する。この特徴量としては、例えば、低次ケプストラム係数等のベクトル量を使用すればよい。 In FIG. 3, step S301 is a speech input step in which a speech signal uttered in the first language is input. The input first-language speech signal is recognized in the next speech recognition step S302, and the result is output in a predesignated format such as a word, a word string, a sentence, or a semantic expression. In the translation step S303, the recognition result is translated into the second language, and the translation result is output in a predesignated format such as a word, a word string, or a sentence. In the speech synthesis step S304, the translation result is input and a synthesized speech signal in the second language is output. In the voice quality analysis step S305, a voice quality feature quantity that characterizes the individuality of the speech is extracted, for example the spectral envelope (vocal tract characteristic) contained in the input speech. A vector quantity such as low-order cepstrum coefficients may be used as this feature quantity, for example.

声質類似性計量ステップＳ３０６と次の声質制御ステップＳ３０７とは、互いに連動して処理を行うことにより、合成音の声質が入力音声に近い声質になるように合成音の声質を制御する。 The voice quality similarity measurement step S306 and the next voice quality control step S307 perform processing in conjunction with each other, thereby controlling the voice quality of the synthesized sound so that the voice quality of the synthesized sound is close to that of the input voice.

すなわち、声質類似性計量ステップＳ３０６では、入力音声信号の声質の特徴量と、現在の制御条件を仮定した場合に音声合成ステップＳ３０４により合成される音声の声質の特徴量とを比較することにより、入力音声と出力音声の声質の類似性を計量する。声質の類似性の計量方法としては、ベクトル量のユークリッド距離あるいは聴覚重み付け距離等を用いればよい。なお、類似性として距離を用いる場合は、距離の値が小さいほど良好な制御規範である（良く似ている）ことを意味する。声質制御ステップＳ３０７では、声質の類似性計量の結果が最適値になるように合成音声の制御を行う。 That is, in the voice quality similarity measurement step S306, the similarity in voice quality between the input speech and the output speech is measured by comparing the voice quality feature quantity of the input speech signal with the voice quality feature quantity of the speech that would be synthesized in the speech synthesis step S304 under the current control condition. The Euclidean distance between the vector quantities, an auditory weighted distance, or the like may be used to measure the similarity. When a distance is used as the similarity, a smaller distance value means a better control criterion (that is, the two voices are more alike). In the voice quality control step S307, the synthesized speech is controlled so that the result of the voice quality similarity measurement takes its optimum value.
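The interplay of steps S306 and S307 can be read as a search over control conditions: for each candidate condition, predict the features the synthesizer would produce, measure the distance to the input features, and keep the condition with the smallest distance. The sketch below shows that reading under stated assumptions. The names `select_control_setting` and `predict_features`, the finite candidate list, and the toy feature table are all hypothetical; the patent does not describe how the optimum is searched for.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors (step S306)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_control_setting(input_features, candidate_settings, predict_features):
    """Pick the control condition whose predicted synthesized-voice features
    are closest to the input speaker's features (step S307).

    predict_features(setting) is a hypothetical callable returning the
    feature vector the synthesizer would produce under that setting.
    """
    best_setting, best_distance = None, math.inf
    for setting in candidate_settings:
        d = distance(input_features, predict_features(setting))
        if d < best_distance:
            best_setting, best_distance = setting, d
    return best_setting, best_distance


# Usage with a toy table standing in for the synthesizer's voice presets.
presets = {
    "voice_a": [1.0, 0.2, -0.1],
    "voice_b": [0.4, 0.5, 0.3],
    "voice_c": [-0.6, 0.1, 0.8],
}
speaker_features = [0.5, 0.4, 0.2]
chosen, score = select_control_setting(speaker_features, list(presets), presets.get)
```

With continuous control parameters the same criterion could drive an iterative search instead of an exhaustive scan, but the selection rule (smaller distance is better) stays the same.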

かかる方法によれば、声質類似性計量ステップＳ３０６と、声質制御ステップＳ３０７を用いて入力音声の声質と合成音声の声質を比較しながら合成音声の声質を制御することが可能であり、入力音声に近い声質の合成音声を出力することができる。 According to this method, the voice quality of the synthesized speech can be controlled while the voice quality of the input speech and that of the synthesized speech are compared using the voice quality similarity measurement step S306 and the voice quality control step S307, so synthesized speech with a voice quality close to that of the input speech can be output.

なお、本実施の形態において、これらのステップを含む通訳プログラムを記録媒体に記録した場合には、この記録媒体をコンピュータなどに装着し、コンピュータを用いてこれらのステップを含むプログラムを読み出し、任意に第１の言語で発声した音声を第２の言語に翻訳し、音声として音声合成することができる。 In this embodiment, when an interpretation program including these steps is recorded on a recording medium, the recording medium can be loaded into a computer or the like, the program including these steps can be read out by the computer, and speech uttered in the first language can be translated at will into the second language and synthesized as speech.

また、本実施の形態において、これらのステップを含む通訳プログラムをインターネットなどの通信媒体を介してコンピュータなどに配信あるいは移動などした場合には、配信あるいは移動されたコンピュータはこのプログラムをそのまま実行して、任意に第１の言語で発声した音声を第２の言語に翻訳し、音声として音声合成することができる。 In this embodiment, when an interpreter program including these steps is distributed or moved to a computer or the like via a communication medium such as the Internet, the distributed or moved computer executes the program as it is. The voice uttered in the first language can be arbitrarily translated into the second language and synthesized as a voice.

そして、本実施の形態によれば、翻訳された第２の言語の声質を第１の言語の声質に近づけることができ、例えば、自分が発生したのに他人の声で翻訳されるとか、男性が発声したのに女性の声で翻訳されるとかといった違和感を生じることが極力少なくなり、より違和感の少ない翻訳を可能にする。 According to the present embodiment, the voice quality of the translated second language can be brought close to the voice quality of the first language. The sense of incongruity that arises when, for example, one's own utterance is translated in another person's voice, or a man's utterance is translated in a woman's voice, is reduced as far as possible, which enables translation with less sense of incongruity.

なお、上記各実施の形態において、第１の言語とは、翻訳される側の言語を意味し、第２の言語とは、翻訳された後の言語を意味している。すなわち、上記実施の形態のように日本語を英語に翻訳するのであれば、日本語が第１の言語、英語が第２の言語である。そして、同じ日本語でも、大阪弁を標準語に翻訳するのであれば、大阪弁が第１の言語、標準語が第２の言語である。すなわち、第１、第２の言語には、所謂、各国の言語のみならず、方言、現地語、その他言い回しの異なる全ての言語を含む。 In each of the above embodiments, the first language means the language being translated, and the second language means the language after translation. That is, if Japanese is translated into English as in the above embodiments, Japanese is the first language and English is the second language. Even within Japanese, if the Osaka dialect is translated into standard Japanese, the Osaka dialect is the first language and standard Japanese is the second language. In other words, the first and second languages include not only the so-called national languages but also dialects, local languages, and all other forms of language that differ in expression.

本発明の通訳装置は、第１の言語で入力された入力音声を音声認識する音声認識手段と、音声認識された結果を第２の言語に翻訳する翻訳手段と、翻訳された第２の言語を音声合成する音声合成手段と、第１の言語の声質を分析する声質分析手段と、第１の言語の声質と第２の言語の声質との類似性を計量する声質類似性計量手段と、声質類似性計量手段で得られた声質類似性計量結果に基づいて音声合成手段によって音声合成される第２の言語の声質を制御する声質制御手段とを備えたものであり、第１の言語の声質と第２の言語の声質とが声質類似性計量手段によって計量され、その類似性が近づくように音声合成手段が制御されるため、第２の言語の声質が第１の言語の声質に類似し、違和感を生じることが極力少なくなリ、音声合成を行う各種機器に有用である。 The interpreting apparatus of the present invention comprises speech recognition means for recognizing input speech entered in a first language, translation means for translating the speech recognition result into a second language, speech synthesis means for performing speech synthesis of the translated second language, voice quality analysis means for analyzing the voice quality of the first language, voice quality similarity measurement means for measuring the similarity between the voice quality of the first language and the voice quality of the second language, and voice quality control means for controlling the voice quality of the second language synthesized by the speech synthesis means based on the voice quality similarity measurement result obtained by the voice quality similarity measurement means. Because the voice quality of the first language and that of the second language are measured by the voice quality similarity measurement means and the speech synthesis means is controlled so that the two become more similar, the voice quality of the second language resembles that of the first language and the sense of incongruity is reduced as far as possible. The apparatus is useful for various devices that perform speech synthesis.

本発明の実施の形態１における通訳装置の構成を示すブロック図 Block diagram showing the configuration of the interpreting apparatus according to Embodiment 1 of the present invention
本発明の実施の形態２における通訳装置の構成を示すブロック図 Block diagram showing the configuration of the interpreting apparatus according to Embodiment 2 of the present invention
本発明の実施の形態３における通訳方法を説明するためのフローチャート Flowchart for explaining the interpretation method according to Embodiment 3 of the present invention
従来の通訳装置の構成を示すブロック図 Block diagram showing the configuration of a conventional interpreting apparatus

Explanation of symbols

１０１、２０１音声認識部
１０２、２０２翻訳部
１０３、２０３音声合成部
１０４声質分析部
１０５声質類似性計量部
１０６、２１０声質制御部
２０４声道特徴量抽出部
２０５ピッチ周波数抽出部
２０６信号パワー抽出部
２０７発声速度抽出部
２０８声道特徴量類似性計量部
２０９ピッチ周波数類似性計量部

101, 201 Speech recognition unit
102, 202 Translation unit
103, 203 Speech synthesis unit
104 Voice quality analysis unit
105 Voice quality similarity measurement unit
106, 210 Voice quality control unit
204 Vocal tract feature value extraction unit
205 Pitch frequency extraction unit
206 Signal power extraction unit
207 Speech rate extraction unit
208 Vocal tract feature value similarity measurement unit
209 Pitch frequency similarity measurement unit

Claims

An interpreting apparatus comprising: speech recognition means for recognizing input speech entered in a first language; translation means for translating the speech recognition result into a second language; speech synthesis means for performing speech synthesis of the translated second language; voice quality analysis means for analyzing the voice quality of the first language; voice quality similarity measurement means for measuring the similarity between the voice quality of the first language and the voice quality of the second language; and voice quality control means for controlling the voice quality of the second language synthesized by the speech synthesis means, based on the voice quality similarity measurement result obtained by the voice quality similarity measurement means.

The interpreting apparatus according to claim 1, wherein the speech recognition means recognizes the input speech signal entered in the first language as a character string, a word, a word string, a sentence, or a semantic expression, and the speech synthesis means performs speech synthesis of the second language as a character string, a word, a word string, or a sentence.

The interpreting apparatus according to claim 1, wherein the voice quality analysis means extracts a voice quality feature quantity characterizing the individuality of the input speech, and the voice quality similarity measurement means compares the voice quality feature quantity extracted by the voice quality analysis means with the voice quality feature quantity of the second language synthesized by the speech synthesis means.

The interpreting apparatus according to claim 3, wherein the voice quality feature quantity is a spectral envelope, as a vocal tract characteristic, contained in the input speech signal and in the synthesized speech signal of the second language.

The interpreting apparatus according to any one of claims 1 to 4, wherein the voice quality analysis means comprises vocal tract feature quantity extraction means for extracting a vocal tract feature quantity from the input speech signal and pitch frequency extraction means for extracting a pitch frequency from the input speech signal, and the voice quality similarity measurement means comprises vocal tract feature quantity similarity measurement means for comparing the vocal tract feature quantity extracted by the vocal tract feature quantity extraction means with the vocal tract feature quantity of the second language synthesized by the speech synthesis means, and pitch frequency similarity measurement means for comparing the pitch frequency extracted by the pitch frequency extraction means with the pitch frequency of the second language synthesized by the speech synthesis means.

The interpreting apparatus according to claim 1, further comprising signal power extraction means for extracting the signal power of the input speech signal, wherein the signal power of the second-language speech signal synthesized by the speech synthesis means is controlled based on the signal power extracted by the signal power extraction means.

The interpreting apparatus according to claim 1, further comprising utterance speed extraction means for extracting the utterance speed of the input speech from the input speech signal, wherein the utterance speed of the second-language speech signal synthesized by the speech synthesis means is controlled based on the utterance speed extracted by the utterance speed extraction means.

An interpretation method comprising: a speech recognition step of recognizing input speech entered in a first language; a translation step of translating the speech recognition result of the speech recognition step into a second language; a speech synthesis step of performing speech synthesis of the second language translated in the translation step; a voice quality analysis step of analyzing the voice quality of the input speech entered in the first language; a voice quality similarity measurement step of measuring the similarity between the voice quality of the input speech entered in the first language and the voice quality of the synthesized speech translated into the second language; and a voice quality control step of controlling the voice quality of the second language synthesized in the speech synthesis step, based on the voice quality similarity measurement result obtained in the voice quality similarity measurement step.

The interpretation method according to claim 8, wherein the speech recognition step recognizes the input speech signal entered in the first language as a character string, a word, a word string, a sentence, or a semantic expression, and the speech synthesis step performs speech synthesis of the second language as a character string, a word, a word string, or a sentence.

A computer-readable recording medium on which is recorded an interpretation program for causing a computer to execute: a speech recognition step of recognizing input speech entered in a first language; a translation step of translating the speech recognition result of the speech recognition step into a second language; a speech synthesis step of performing speech synthesis of the second language translated in the translation step; a voice quality analysis step of analyzing the voice quality of the input speech entered in the first language; a voice quality similarity measurement step of measuring the similarity between the voice quality of the input speech entered in the first language and the voice quality of the synthesized speech translated into the second language; and a voice quality control step of controlling the voice quality of the second language synthesized in the speech synthesis step, based on the voice quality similarity measurement result obtained in the voice quality similarity measurement step.

An interpretation program for causing a computer to execute: a speech recognition procedure of recognizing input speech entered in a first language; a translation procedure of translating the speech recognition result of the speech recognition procedure into a second language; a speech synthesis procedure of performing speech synthesis of the second language translated in the translation procedure; a voice quality analysis procedure of analyzing the voice quality of the input speech entered in the first language; a voice quality similarity measurement procedure of measuring the similarity between the voice quality of the input speech entered in the first language and the voice quality of the synthesized speech translated into the second language; and a voice quality control procedure of controlling the voice quality of the second language synthesized in the speech synthesis procedure, based on the voice quality similarity measurement result obtained in the voice quality similarity measurement procedure.