JP2009146043A

JP2009146043A - Unit and method for voice translation, and program

Info

Publication number: JP2009146043A
Application number: JP2007320893A
Authority: JP
Inventors: Atsushi Yoshimoto; 淳善本; Toru Shimizu; 徹清水
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2007-12-12
Filing date: 2007-12-12
Publication date: 2009-07-02

Abstract

PROBLEM TO BE SOLVED: To provide a voice translation unit that also translates fillers.

SOLUTION: The voice translation unit includes: a voice information reception section 11 for receiving original language voice information; a voice recognition section 13 for performing voice recognition on the original language voice information; a machine translation section 15 for machine-translating voice recognition result information; a voice generation section 17 for generating target language voice information corresponding to translation result information; a filler time position identification section 20 for identifying a filler position in the original language voice information; a filler information extraction section 21 for extracting original language filler information including the paralanguage of the filler; a filler text position identification section 22 for identifying the filler position in the voice recognition result information; a filler insertion position identification section 23 for identifying a filler insertion position, that is, a filler position in the target language voice information; a filler information insertion section 25 for inserting target language filler information, corresponding to the original language filler information, at the filler insertion position; and a voice information output section 19 for outputting the target language voice information including the target language filler information.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a speech translation apparatus and the like that also inserts information about fillers into the translated target language speech information.

In conventional translation apparatuses, techniques have been developed for reflecting information other than the text itself of the source language in the target language. For example, a machine translation apparatus has been developed that also reflects character decoration information, such as italics, from the source language text in the target language text (see, for example, Patent Document 1).
Patent Document 1: JP 2000-123012 A

As described above, there is a demand to convey information other than the text itself by reflecting such information from the source language in the target language, and the same applies to speech translation apparatuses. That is, in speech translation, there is a demand for more lifelike translation of spoken language by conveying information other than the uttered words themselves.

The present invention has been made in view of the above circumstances, and an object thereof is to provide a speech translation apparatus and the like that can also convey information other than the words themselves uttered in the source language.

To achieve the above object, a speech translation apparatus according to the present invention includes: a speech information receiving unit that receives source language speech information, which is information obtained by picking up uttered source language speech with a microphone; a speech recognition unit that performs speech recognition on the source language speech information received by the speech information receiving unit and acquires speech recognition result information, which is text information corresponding to the source language speech information; a machine translation unit that machine-translates the speech recognition result information acquired by the speech recognition unit and acquires translation result information, which is target language text information corresponding to the speech recognition result information; a speech generation unit that generates target language speech information, which is speech information in the target language corresponding to the translation result information acquired by the machine translation unit; a filler time position specifying unit that specifies, in the source language speech information received by the speech information receiving unit, the temporal position of a filler, which is an utterance inserted between meaningful utterances; a filler information extraction unit that extracts source language filler information, which is information about the filler whose temporal position in the source language speech information was specified by the filler time position specifying unit and which includes at least paralanguage, that is, non-verbal information in speech information; a filler text position specifying unit that specifies the position in the speech recognition result information corresponding to the temporal position of the filler specified by the filler time position specifying unit in the source language speech information; a filler insertion position specifying unit that specifies a filler insertion position, which is the temporal position in the target language speech information corresponding to the position in the speech recognition result information specified by the filler text position specifying unit; a filler information insertion unit that inserts target language filler information, which is filler speech information for the target language having the same paralanguage as the paralanguage included in the source language filler information extracted by the filler information extraction unit, at the filler insertion position specified by the filler insertion position specifying unit in the target language speech information generated by the speech generation unit; and a speech information output unit that outputs the target language speech information into which the target language filler information has been inserted by the filler information insertion unit.

With this configuration, fillers can also be translated. Because the target language filler information has the same paralanguage as the filler in the source language, the target language filler information inserted into the target language speech information is output with the same feel as when the source language speaker uttered the filler. It may therefore be possible to generate a more lifelike speech translation result that carries the speaker's emotion.

In the speech translation apparatus according to the present invention, the source language filler information may be the speech information of the filler whose temporal position was specified by the filler time position specifying unit, and the target language filler information may be the source language filler information.
With this configuration, the source language filler itself can be inserted into the target language speech information.

The speech translation apparatus according to the present invention may further include a filler information generation unit that generates target language filler information having the same paralanguage as the paralanguage included in the source language filler information, and the filler information insertion unit may insert the target language filler information generated by the filler information generation unit into the target language speech information.
With this configuration, a filler having the same paralanguage as the source language filler can be generated and inserted into the target language speech information.

In the speech translation apparatus according to the present invention, the filler information generation unit may generate target language filler information having the same vowel as the filler whose temporal position was specified by the filler time position specifying unit.

With this configuration, a filler having the same vowel as the source language filler can be generated and inserted into the target language speech information. This filler may sound different from the fillers normally used in the target language, but because its vowel and paralanguage are shared with the source language filler, it can be expected to convey the source language speaker's emotion to the listener well enough.

In the speech translation apparatus according to the present invention, the filler information generation unit may generate target language filler information that is target language speech information corresponding to the result of performing speech recognition and machine translation on the filler in the source language speech information.

With this configuration, the filler inserted into the target language speech information can be a filler commonly used in the target language, which makes the inserted filler sound more natural.

In the speech translation apparatus according to the present invention, the paralanguage may be at least one item of information selected from frequency, volume, change in frequency, and change in volume.
In the speech translation apparatus according to the present invention, the paralanguage may further include the temporal length of the filler in the source language speech information, the temporal length of the silence on the start point side of the filler, and the temporal length of the silence on the end point side of the filler.

According to the speech translation apparatus and the like of the present invention, information including the paralanguage of fillers can also be added to the target language speech information in speech translation, which makes more lifelike speech translation possible.

Hereinafter, a speech translation apparatus according to the present invention will be described using embodiments. In the following embodiments, components and steps denoted by the same reference numerals are the same or equivalent, and their description may not be repeated.

(Embodiment 1)
A speech translation apparatus according to Embodiment 1 of the present invention will be described with reference to the drawings. The speech translation apparatus according to the present embodiment performs speech translation that also uses the paralanguage of fillers in the source language.

FIG. 1 is a block diagram showing the configuration of a speech translation apparatus 1 according to the present embodiment. The speech translation apparatus 1 according to the present embodiment includes a speech information receiving unit 11, a source language speech information storage unit 12, a speech recognition unit 13, a speech recognition result information storage unit 14, a machine translation unit 15, a translation result information storage unit 16, a speech generation unit 17, a target language speech information storage unit 18, a speech information output unit 19, a filler time position specifying unit 20, a filler information extraction unit 21, a filler text position specifying unit 22, a filler insertion position specifying unit 23, a filler information generation unit 24, and a filler information insertion unit 25.

The speech information receiving unit 11 receives source language speech information, which is information obtained by picking up uttered source language speech with a microphone. For example, the speech information receiving unit 11 may receive the source language speech information directly from the microphone, or may receive source language speech information that was picked up by the microphone and stored once. The source language speech information is what is generally called audio signal information.

For example, the speech information receiving unit 11 may receive source language speech information input from an input device (for example, a microphone), may receive source language speech information transmitted via a wired or wireless communication line, or may receive source language speech information read from a predetermined recording medium (for example, an optical disk, a magnetic disk, or a semiconductor memory). The speech information receiving unit 11 may or may not include a device for receiving (for example, a modem or a network card). The speech information receiving unit 11 may be implemented by hardware, or by software such as a driver that drives a predetermined device.

The source language speech information storage unit 12 stores the source language speech information received by the speech information receiving unit 11 on a predetermined recording medium. The recording medium is, for example, a semiconductor memory, an optical disk, or a magnetic disk, and may be included in the source language speech information storage unit 12 or may exist outside it. The recording medium may or may not be one that stores the source language speech information only temporarily.

The speech recognition unit 13 performs speech recognition on the source language speech information received by the speech information receiving unit 11 and acquires speech recognition result information, which is text information corresponding to the source language speech information. Speech recognition methods are already known, and a detailed description is omitted. The speech recognition unit 13 may perform speech recognition using, for example, an acoustic model, dictionary information, and a language model.

The speech recognition result information storage unit 14 stores the speech recognition result information acquired by the speech recognition unit 13 on a predetermined recording medium. The recording medium is, for example, a semiconductor memory, an optical disk, or a magnetic disk, and may be included in the speech recognition result information storage unit 14 or may exist outside it. The recording medium may or may not be one that stores the speech recognition result information only temporarily.

The machine translation unit 15 machine-translates the speech recognition result information acquired by the speech recognition unit 13 and acquires translation result information, which is target language text information corresponding to the speech recognition result information. The speech recognition result information is text information in the source language, and the text information that is its translation is the translation result information. Machine translation methods are already known, and a detailed description is omitted. The source language and the target language need only be different languages, and any combination may be used. For example, the source language may be Japanese and the target language English, or the source language may be English and the target language French.

The translation result information storage unit 16 stores the translation result information, which is the result of machine translation by the machine translation unit 15, on a predetermined recording medium. The recording medium is, for example, a semiconductor memory, an optical disk, or a magnetic disk, and may be included in the translation result information storage unit 16 or may exist outside it. The recording medium may or may not be one that stores the translation result information only temporarily.

The speech generation unit 17 generates target language speech information, which is speech information in the target language corresponding to the translation result information acquired by the machine translation unit 15. The target language speech information is what is generally called audio signal information, in the target language. Methods of generating speech corresponding to a text are already known as speech synthesis techniques, and a detailed description is omitted.

The target language speech information storage unit 18 stores the target language speech information generated by the speech generation unit 17 on a predetermined recording medium. The recording medium is, for example, a semiconductor memory, an optical disk, or a magnetic disk, and may be included in the target language speech information storage unit 18 or may exist outside it. The recording medium may or may not be one that stores the target language speech information only temporarily.

The speech information output unit 19 outputs the target language speech information into which target language filler information has been inserted by the filler information insertion unit 25, described later. The target language filler information and the processing for inserting it into the target language speech information are also described later.

This output may be, for example, audio output from a speaker, transmission to a predetermined device via a communication line, storage on a recording medium, or delivery to another component. The speech information output unit 19 may or may not include a device that performs the output (for example, a speaker or a communication device). The speech information output unit 19 may be implemented by hardware, or by software such as a driver that drives such a device.

The filler time position specifying unit 20 specifies the temporal position of a filler in the source language speech information received by the speech information receiving unit 11. A filler is an utterance inserted between meaningful utterances. Japanese examples include 「え〜っと」 ("eetto"), 「あのぉー」 ("anoo"), and 「そのー」 ("sonoo").

The methods by which the filler time position specifying unit 20 specifies the temporal position of a filler are described below.

[Method using the speech recognition result (the portions that could not be recognized)]
Because a filler is not a meaningful utterance, lightweight speech recognition that does not prioritize recognition accuracy, for example speech recognition with a small RTF (real-time factor), generally may not recognize the filler regions. Therefore, if the speech recognition unit 13 performs such speech recognition, the filler time position specifying unit 20 may specify, as the temporal position of a filler, a temporal range of the source language speech information for which the speech recognition unit 13 produced no recognition result.
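The idea of treating unrecognized time ranges as filler candidates could be sketched roughly as follows. This is only an illustration under assumptions that are not in the source: the recognizer is assumed to return words with start and end times, and the names `RecognizedWord`, `unrecognized_spans`, and the `min_length` threshold are hypothetical. A practical system would probably also need to exclude pure silence, which this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass
class RecognizedWord:
    """One recognized word with its time range in seconds (hypothetical format)."""
    text: str
    start: float
    end: float


def unrecognized_spans(words, total_duration, min_length=0.2):
    """Return (start, end) ranges not covered by any recognized word.

    Ranges shorter than min_length seconds are ignored (assumed threshold).
    """
    spans = []
    cursor = 0.0  # end time of the recognized speech seen so far
    for word in sorted(words, key=lambda w: w.start):
        if word.start - cursor >= min_length:
            spans.append((cursor, word.start))
        cursor = max(cursor, word.end)
    if total_duration - cursor >= min_length:
        spans.append((cursor, total_duration))
    return spans


if __name__ == "__main__":
    words = [
        RecognizedWord("今日", 0.0, 0.5),
        RecognizedWord("は", 0.5, 0.7),
        RecognizedWord("晴れ", 1.6, 2.1),
    ]
    print(unrecognized_spans(words, total_duration=2.1))
```

Each returned range is a candidate filler position; whether it really is a filler would have to be decided by other means.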

[Method using the speech recognition result (the portions that were recognized)]
In contrast to the above case, when speech recognition that prioritizes recognition accuracy, for example speech recognition with a large RTF, is performed using a filler dictionary (a dictionary for converting the speech signal of a filler such as 「え〜っと」 into the text 「え〜っと」), the filler speech signals contained in the source language speech information are also converted into text. Therefore, if the speech recognition unit 13 performs such speech recognition, the filler time position specifying unit 20 can use filler text information held in advance on a recording medium (not shown), for example 「え〜っと」, 「あのぉー」, and 「そのー」. When text corresponding to a filler is included in the speech recognition result information, the unit specifies the temporal position in the source language speech information that corresponds to that filler text, and thereby specifies the temporal position of the filler.
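A minimal sketch of this dictionary-based lookup is shown below. It assumes, beyond what the source states, that the recognition result is available as a list of `(text, start_sec, end_sec)` tuples; the names `FILLER_TEXTS` and `filler_time_positions` are hypothetical.

```python
# Filler texts held in advance (the three examples given in the description).
FILLER_TEXTS = frozenset({"え〜っと", "あのぉー", "そのー"})


def filler_time_positions(tokens, filler_texts=FILLER_TEXTS):
    """Return the (start_sec, end_sec) range of every token whose text is a filler.

    tokens: iterable of (text, start_sec, end_sec) tuples (assumed format).
    """
    return [(start, end) for text, start, end in tokens if text in filler_texts]


if __name__ == "__main__":
    tokens = [
        ("え〜っと", 0.0, 0.6),
        ("今日", 0.8, 1.3),
        ("は", 1.3, 1.5),
        ("そのー", 1.7, 2.2),
        ("晴れ", 2.4, 2.9),
    ]
    print(filler_time_positions(tokens))
```

An exact string match is used here for simplicity; a real recognizer may emit spelling variants of the same filler, which would need normalization.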

[Method using waveform pattern matching]
The filler time position specifying unit 20 may use waveforms of speech signals corresponding to fillers, held in advance on a recording medium (not shown), to perform pattern matching of those waveforms against the source language speech information, and when a temporal region similar to a filler waveform exists, may specify that temporal region as the temporal position of the filler. Waveform pattern matching is already known, and a detailed description is omitted. This pattern matching need not be exact; for example, the waveform envelopes may be matched instead, as long as matching is achieved to a necessary and sufficient degree.
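As one possible illustration of envelope matching, the sketch below computes a coarse amplitude envelope for the signal and for a stored filler template, then slides the template envelope over the signal envelope and scores each offset by cosine similarity. The source does not name a matching technique, so the choice of cosine similarity, the frame size, the threshold, and all function names are assumptions made for this example.

```python
import math


def envelope(samples, frame):
    """Mean absolute amplitude of each non-overlapping frame of `frame` samples."""
    return [
        sum(abs(s) for s in samples[i:i + frame]) / frame
        for i in range(0, len(samples) - frame + 1, frame)
    ]


def similarity(a, b):
    """Cosine similarity of two equal-length sequences (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_filler_by_envelope(signal, template, frame=4, threshold=0.95):
    """Return (start_sample, end_sample) of the best-matching region, or None."""
    sig_env = envelope(signal, frame)
    tpl_env = envelope(template, frame)
    best = None  # (score, offset in frames)
    for offset in range(len(sig_env) - len(tpl_env) + 1):
        score = similarity(sig_env[offset:offset + len(tpl_env)], tpl_env)
        if score >= threshold and (best is None or score > best[0]):
            best = (score, offset)
    if best is None:
        return None
    start = best[1] * frame
    return (start, start + len(tpl_env) * frame)


if __name__ == "__main__":
    template = [0, 0, 0, 0, 1, -1, 1, -1, 3, -3, 3, -3, 1, -1, 1, -1]
    signal = [2, -2, 2, -2] * 2 + template + [2, -2, 2, -2] * 2
    print(find_filler_by_envelope(signal, template))
```

Because envelopes are non-negative, cosine similarity tends to be high even for poor matches, so the threshold would need tuning on real data; a mean-subtracted correlation is one alternative.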

Three methods of specifying the temporal position of a filler have been described here, but other methods may of course be used.

Specifying the temporal position of a filler may mean, for example, storing information that specifies the temporal position of the filler on a recording medium (not shown), or adding information from which the temporal position of the filler can be determined to the source language speech information. In the former case, for example, information indicating the start point of the filler and information indicating the end point of the filler may be stored on a recording medium (not shown). The start point and end point of the filler may be indicated, for example, by a time code, by the amount of data (for example, the number of bytes) from the beginning, the end, or a specific position of the source language speech information, or by the waveform of the source language speech information itself. In the latter case, information indicating the start point of the filler (for example, something like a flag) may be added at the position of the start point in the source language speech information and information indicating the end point of the filler may be added at the position of the end point, or information indicating a filler may be added continuously from the start point to the end point of the filler. As long as the temporal position of the filler can be specified, the way of specifying it is of course not limited to these.
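The alternative representations described above (start and end time codes, byte offsets from the beginning of the audio, and flags attached continuously over the filler) can be illustrated with a small sketch. The class and function names, and the assumption of uncompressed audio with a fixed number of bytes per sample, are hypothetical choices for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FillerSpan:
    """Filler position stored as start and end time codes in seconds."""
    start_sec: float
    end_sec: float


def to_byte_offsets(span, sample_rate, bytes_per_sample):
    """Express the span as byte offsets from the beginning of the audio data."""
    start = int(round(span.start_sec * sample_rate)) * bytes_per_sample
    end = int(round(span.end_sec * sample_rate)) * bytes_per_sample
    return (start, end)


def to_frame_flags(span, n_frames, frame_sec):
    """Return one flag per frame; True where the frame overlaps the filler."""
    return [
        i * frame_sec < span.end_sec and (i + 1) * frame_sec > span.start_sec
        for i in range(n_frames)
    ]


if __name__ == "__main__":
    span = FillerSpan(0.5, 1.0)
    print(to_byte_offsets(span, sample_rate=16000, bytes_per_sample=2))
    print(to_frame_flags(span, n_frames=6, frame_sec=0.25))
```

The time-code form is the most portable of the three; the byte-offset and flag forms are tied to a particular encoding and frame size.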

The filler information extraction unit 21 extracts source language filler information about the filler whose temporal position in the source language speech information was specified by the filler time position specifying unit 20. This source language filler information is information that includes at least the paralanguage of that filler. Paralanguage is information on the prosodic features of speech information. It includes the loudness of the voice, changes in loudness, pitch (frequency), changes in pitch, speaking rate, intonation, tremor, and voice quality, and it covers not only information during the utterance but also the silences (pauses) that occur in speech and their lengths. For example, it is difficult to infer a writer's feelings from the written text "Thank you" alone, but with speech information in which the phrase "Thank you" is read aloud, it usually becomes easier to infer the speaker's feelings (for example, affection, anger, or sympathy). Conversely, a speaker can read the phrase aloud so as to express those feelings. Both are possible because paralinguistic information exists. The paralanguage of a filler may be, for example, the frequency of the filler in the source language speech information, the volume of the filler, the change in the frequency of the filler, the change in the volume of the filler, the temporal length of the filler in the source language speech information, the temporal length of the silence on the start point side of the filler, the temporal length of the silence on the end point side of the filler, the speaking rate of the filler, or a combination of any two or more of these. Any other information may also be used as long as it appropriately represents the paralanguage of the filler. The speaking rate of a filler can be measured, for example, as the number of morae or the number of syllables per unit time.

The filler information extraction unit 21 may extract from the source language speech information, as the source language filler information, for example, the speech information itself of the filler whose temporal position was specified by the filler time position specifying unit 20. Alternatively, the filler information extraction unit 21 may extract from the source language speech information, as the source language filler information, only the paralanguage of that filler. As described above, the paralanguage may be, for example, at least one item of information selected from frequency, volume, change in frequency, and change in volume, and may further include the temporal length of the filler in the source language speech information, the temporal length of the silence on the start point side of the filler, and the temporal length of the silence on the end point side of the filler. The source language filler information extracted by the filler information extraction unit 21 may be stored temporarily on a recording medium (not shown).

Here, the frequency may be the frequency itself of the filler's speech signal in the source language speech information, or it may be the difference between the frequency of the non-filler speech signal and the frequency of the filler's speech signal in the source language speech information. The frequency may be averaged over the interval of the filler's speech signal. The change in frequency can be extracted, for example, by detecting the average frequency for each short interval (for example, 10 ms or 30 ms) within the filler. This frequency may be the so-called fundamental frequency F0.

The volume may be the volume (voltage) itself of the filler's speech signal in the source language speech information, or it may be the difference between the volume of the non-filler speech signal and the volume of the filler's speech signal in the source language speech information. The volume may be averaged over the interval of the filler's speech signal. The change in volume can be extracted, for example, by detecting the average volume for each short interval (for example, 10 ms or 30 ms) within the filler.
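
The per-interval averaging described for frequency and volume can be sketched as follows. This is only an illustration under stated assumptions: the filler is taken to be a plain list of amplitude samples, "volume" is approximated by the mean absolute amplitude, and the function names `frame_means` and `frame_changes` are hypothetical.

```python
def frame_means(samples, sample_rate_hz, frame_ms=10):
    """Mean absolute amplitude of each short frame (e.g. 10 ms) of a filler."""
    frame_len = max(1, int(sample_rate_hz * frame_ms / 1000))
    means = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        means.append(sum(abs(s) for s in frame) / len(frame))
    return means


def frame_changes(means):
    """Change between consecutive frame averages (the 'change in volume')."""
    return [later - earlier for earlier, later in zip(means, means[1:])]
```

The same framing could be applied to a per-frame frequency estimate instead of amplitude; estimating F0 itself is outside this sketch.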

The filler text position specifying unit 22 specifies the position in the speech recognition result information that corresponds to the temporal position of the filler specified in the source language speech information by the filler time position specifying unit 20.

For example, when the text in the speech recognition result information is associated with temporal positions in the source language speech information (for example, when time codes of the source language speech information are attached to the text of the speech recognition result information), the position of the text in the speech recognition result information corresponding to a temporal position in the source language speech information can be specified. In such a case, therefore, the filler text position specifying unit 22 can specify the text position corresponding to the temporal position of the filler specified by the filler time position specifying unit 20 by using the information that associates the text in the speech recognition result information with temporal positions in the source language speech information.
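
The time-code case can be sketched as below. The representation is an assumption made for illustration: the recognition result is a list of `(text, start_s, end_s)` tuples whose texts are concatenated without separators, and the function name `text_position_for_time` is hypothetical.

```python
def text_position_for_time(timed_words, filler_start_s):
    """Character offset in the recognized text at which the filler falls.

    timed_words: list of (text, start_s, end_s) in utterance order.
    The offset counts the characters of every word that starts before
    the filler does.
    """
    offset = 0
    for text, start_s, _end_s in timed_words:
        if start_s >= filler_start_s:
            break
        offset += len(text)
    return offset
```

With words timed at 0.0 to 0.6 s and 1.4 to 1.9 s, a filler starting at 0.7 s maps to the boundary between the two words.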

When no information associates the text in the speech recognition result information with temporal positions in the source language speech information, the filler text position specifying unit 22 may, for example, acquire the speech signal preceding the filler whose temporal position has been specified by the filler time position specifying unit 20, the speech signal following it, or both (these speech signals do not include the filler), and obtain the result of speech recognition of that signal. The text position corresponding to the temporal position of the filler can then be specified according to where that result is located in the speech recognition result information. More specifically, when the result of speech recognition of the speech signal preceding the filler has been obtained, the filler text position specifying unit 22 may locate that result in the speech recognition result information and specify the position immediately after it as the text position corresponding to the temporal position of the filler specified by the filler time position specifying unit 20.
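
The "immediately after the preceding segment" rule can be sketched with a plain substring search. This assumes that the separately recognized segment appears verbatim in the full recognition result and that its first occurrence is the right one; neither assumption is stated in the description, and the function name is hypothetical.

```python
def position_after_preceding(full_result, preceding_result):
    """Text position right after the recognized pre-filler segment.

    Returns None when the segment cannot be found in the full result.
    """
    index = full_result.find(preceding_result)
    if index < 0:
        return None
    return index + len(preceding_result)
```

A practical implementation would probably need approximate matching, because recognizing a fragment in isolation can yield slightly different text from recognizing the whole utterance.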

When the position of the filler has been specified by exploiting a failure of speech recognition, the filler text position specifying unit 22 may specify the position in the speech recognition result information at which speech recognition failed as the text position corresponding to the temporal position of the filler specified by the filler time position specifying unit 20.

When the position of the filler has been specified using the speech recognition result, the speech recognition result information may contain text corresponding to the filler. In that case, the filler text position specifying unit 22 may specify the position of the text corresponding to the filler in the speech recognition result information as the text position corresponding to the temporal position of the filler specified by the filler time position specifying unit 20.

If some other method can appropriately specify the text position corresponding to the temporal position of the filler specified by the filler time position specifying unit 20, the filler text position specifying unit 22 may use that method to specify the text position.

Specifying the position in the speech recognition result information that corresponds to the temporal position of the filler specified in the source language speech information by the filler time position specifying unit 20 may mean, for example, storing information specifying that position in a recording medium (not shown), or adding information from which the position can be identified to the speech recognition result information. In the former case, when the speech recognition result information contains no text corresponding to the filler, the information specifying the position may be expressed, for example, as the number of characters from the beginning, the end, or some particular position of the speech recognition result information to the position in question, as a data size, as the text that precedes the position, or as the text that follows the position. Also in the former case, when the speech recognition result information does contain text corresponding to the filler, the information specifying the position may be, for example, information specifying the start point and end point of that position. The start point and end point may be expressed, for example, as the number of characters from the beginning, the end, or some particular position of the speech recognition result information to the position in question, as a data size, or by the text preceding the position together with the text following it. In the latter case, information indicating the position (for example, something like a flag) may be added at the position in question within the speech recognition result information. Needless to say, the method is not limited to these as long as the position in the speech recognition result information corresponding to the temporal position of the filler specified in the source language speech information by the filler time position specifying unit 20 can be appropriately specified.

The filler insertion position specifying unit 23 specifies the filler insertion position, which is the temporal position in the target language speech information corresponding to the position in the speech recognition result information specified by the filler text position specifying unit 22. The filler insertion position is specified, for example, by specifying the position in the translation result information that corresponds to the position specified in the speech recognition result information, and then specifying the temporal position in the target language speech information that corresponds to that position in the translation result information.

First, the method by which the filler insertion position specifying unit 23 specifies the position in the translation result information corresponding to the position specified in the speech recognition result information will be described.

The filler insertion position specifying unit 23 specifies, for example, the chunk that follows the position specified in the speech recognition result information. Here, a chunk is a coherent unit of text; it may be, for example, a single morpheme or word, or a sequence of consecutive morphemes or consecutive words. The filler insertion position specifying unit 23 then acquires the target language chunk corresponding to the specified source language chunk. This can be obtained, for example, by machine translation, which may be performed by the filler insertion position specifying unit 23 itself or by the machine translation unit 15. The filler insertion position specifying unit 23 may locate the target language chunk in the translation result information and specify the position immediately before that chunk as the position in the translation result information corresponding to the position specified in the speech recognition result information.

Alternatively, the filler insertion position specifying unit 23 may, for example, specify the chunk that precedes the position specified in the speech recognition result information and acquire the target language chunk corresponding to that source language chunk. The filler insertion position specifying unit 23 may then locate the target language chunk in the translation result information and specify the position immediately after that chunk as the position in the translation result information corresponding to the position specified in the speech recognition result information.
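
The two chunk-based rules (immediately before the translated following chunk, or immediately after the translated preceding chunk) can be sketched as follows. The translated chunks are assumed to be given already and to occur verbatim in the translation result; the function name and the order in which the two rules are tried are choices made here, not taken from the description.

```python
def translation_position(translation, following_chunk=None, preceding_chunk=None):
    """Character position in the translation where the filler belongs.

    following_chunk / preceding_chunk are the target-language chunks
    corresponding to the source chunks after / before the filler.
    """
    if following_chunk:
        index = translation.find(following_chunk)
        if index >= 0:
            return index  # immediately before the following chunk
    if preceding_chunk:
        index = translation.find(preceding_chunk)
        if index >= 0:
            return index + len(preceding_chunk)  # immediately after it
    return None
```

Because word order differs between languages, the two rules can give different positions for the same filler; which one is preferable is left open here.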

Further, for example, when the speech recognition result information contains text corresponding to the filler, the filler insertion position specifying unit 23 may specify the position in the translation result information corresponding to the position specified in the speech recognition result information by locating, in the translation result information, the target language filler that is the translation equivalent of that text. The filler insertion position specifying unit 23 may obtain the target language filler equivalent to the source language filler, for example, from a dictionary that associates source language fillers with their target language equivalents (this dictionary may be held, for example, in a recording medium (not shown)), or from the machine translation unit 15.

Specifying, by the filler insertion position specifying unit 23, the position in the translation result information corresponding to the position specified in the speech recognition result information may mean storing information specifying the position in the translation result information in a recording medium (not shown), or adding information from which the position can be identified to the translation result information. These methods are the same as those used when specifying the position in the speech recognition result information corresponding to the temporal position of the filler specified in the source language speech information by the filler time position specifying unit 20, and a detailed description is omitted.

Next, the method by which the filler insertion position specifying unit 23 specifies the temporal position in the target language speech information (that is, the filler insertion position) corresponding to the specified position in the translation result information will be described.

For example, when the text in the translation result information is associated with temporal positions in the target language speech information (for example, when time codes of the target language speech information are attached to the text of the translation result information), the temporal position in the target language speech information corresponding to a text position in the translation result information can be specified. In such a case, therefore, the filler insertion position specifying unit 23 can specify the position in the target language speech information corresponding to the specified position in the translation result information by using the information that associates the text in the translation result information with temporal positions in the target language speech information.

When no information associates the text in the translation result information with temporal positions in the target language speech information, the filler insertion position specifying unit 23 may, for example, acquire the text preceding the specified position in the translation result information, the text following it, or both, and generate a speech signal from that text. The position in the target language speech information corresponding to the specified position in the translation result information can then be specified according to where that speech signal is located in the target language speech information. More specifically, when a speech signal has been generated from the text preceding the specified position in the translation result information, the filler insertion position specifying unit 23 may locate the generated speech signal in the target language speech information and specify the position immediately after it as the temporal position in the target language speech information corresponding to the specified position in the translation result information.

When the filler has also been speech-recognized and machine-translated, so that the translation result information contains text corresponding to the filler, the filler insertion position specifying unit 23 may specify the position of the speech signal corresponding to the filler text in the target language speech information as the temporal position in the target language speech information corresponding to the specified position in the translation result information.

Specifying the temporal position in the target language speech information corresponding to the specified position in the translation result information may mean, for example, storing information specifying that position in a recording medium (not shown), or adding information from which the position can be identified to the target language speech information. These methods are the same as the processing used when the filler time position specifying unit 20 specifies the temporal position of the filler in the source language speech information, and a detailed description is omitted.

If some other method can appropriately specify the temporal position in the target language speech information corresponding to the position in the speech recognition result information specified by the filler text position specifying unit 22, the filler insertion position specifying unit 23 may use that method to specify the temporal position in the target language speech information (that is, the filler insertion position).

The filler information generation unit 24 generates target language filler information having the same paralanguage as the paralanguage included in the source language filler information extracted by the filler information extraction unit 21. This target language filler information may or may not be something recognized as a filler in the target language. For example, when the target language is English, target language filler information of the former kind may be a speech signal of "Well" or "Let me see", and target language filler information of the latter kind may be a speech signal of 「え〜」 ("e...") or 「そのぉ〜」 ("sono...").

The filler information generation unit 24 may, for example, generate target language filler information having the same vowels as the filler whose temporal position has been specified by the filler time position specifying unit 20. The filler information generation unit 24 may also, for example, generate target language filler information that is target language speech information corresponding to the result of speech-recognizing and machine-translating the filler in the source language speech information. In either case, as described above, the target language filler information is generated so as to have the same paralanguage as the paralanguage included in the source language filler information. Consequently, the target language filler information generated by the filler information generation unit 24 shares its paralanguage with the source language filler information: for example, the frequency, the volume, their changes, the temporal length of the filler, the pause before the filler, and the pause after the filler are common to both.

The processing performed when the filler information generation unit 24 generates target language filler information having the same vowels as the filler whose temporal position has been specified by the filler time position specifying unit 20 will now be described. First, the filler information generation unit 24 applies processing similar to speech recognition to that filler and obtains the sequence of vowels and consonants corresponding to it. It then takes only the vowels from that sequence and synthesizes a speech signal that corresponds to them and has the same paralanguage as the paralanguage included in the source language filler information. The speech signal may be synthesized from the source language vowels (in which case it is a speech signal in the source language) or from target language vowels (in which case it is a speech signal in the target language). In the latter case, information associating source language vowels with target language vowels may be held in a recording medium (not shown), and the filler information generation unit 24 may convert the source language vowels into target language vowels by referring to that information. Thus, "the same vowels" in "target language filler information having the same vowels as the filler whose temporal position has been specified by the filler time position specifying unit 20" may be exactly the vowels of the source language filler, or may be the vowels that are their target language equivalents.
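
The vowel-only step can be sketched as below. This is an illustration under assumptions: the phoneme sequence is taken as already available (for example, from a recognizer), the five-vowel set applies to romanized Japanese, and both the function name `filler_vowels` and the optional `vowel_map` argument are hypothetical. Synthesis and paralanguage matching are not shown.

```python
SOURCE_VOWELS = {"a", "i", "u", "e", "o"}  # assumed source-language vowel set


def filler_vowels(phonemes, vowel_map=None):
    """Keep only the vowels of a filler, optionally mapped to target vowels.

    phonemes: sequence of phoneme symbols, e.g. ["e", "t", "t", "o"].
    vowel_map: optional dict from source vowel to target-language vowel;
               vowels missing from the map are kept unchanged.
    """
    vowels = [p for p in phonemes if p in SOURCE_VOWELS]
    if vowel_map is None:
        return vowels
    return [vowel_map.get(v, v) for v in vowels]
```

Passing no map corresponds to synthesizing from the source language vowels; passing a map corresponds to converting them to target language vowels first.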

The filler information generation unit 24 may also use speech signal correspondence information, which associates source language speech signals of fillers with the corresponding target language speech signals of fillers. In this case, it identifies the source language filler speech signal in the speech signal correspondence information that is similar to the speech signal of the filler whose temporal position has been specified by the filler time position specifying unit 20, acquires the target language filler speech signal associated with it, and generates target language filler information by converting the acquired speech signal so that it has the same paralanguage as the paralanguage included in the source language filler information. The speech signal correspondence information is assumed to be held in a recording medium (not shown) accessible to the filler information generation unit 24. In the speech signal correspondence information, for example, the speech signal of 「え〜っと」 ("etto") in the source language (Japanese) may be associated with the speech signal of "Well" in the target language (English).

The filler information generation unit 24 may also hold in advance, in a recording medium (not shown), speech signals to be generated as fillers, and generate target language filler information by converting the paralanguage of such a speech signal so that it matches the paralanguage included in the source language filler information. For example, when the target language is English, the filler information generation unit 24 holds in advance speech signals corresponding to "Well" and "Let me see" as fillers. It may then generate the target language filler information by adjusting only the paralanguage of the speech signal to the source language filler information. In this case, the filler information generation unit 24 may hold speech signals of a plurality of target language fillers and select one of them according to the length of the source language filler. For example, it may select "Well" when the source language filler is short and "Let me see" when the source language filler is long.
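
Selecting a stored filler by length can be sketched as follows. The nominal durations and the nearest-duration rule are assumptions made for illustration; the description only says that a short source filler may select "Well" and a long one "Let me see".

```python
# Hypothetical stored fillers with made-up nominal durations in seconds.
STORED_FILLERS = [("Well", 0.4), ("Let me see", 0.9)]


def select_filler(source_duration_s, fillers=STORED_FILLERS):
    """Return the stored filler whose nominal duration is closest
    to the duration of the source language filler."""
    label, _duration = min(fillers, key=lambda f: abs(f[1] - source_duration_s))
    return label
```

A fixed threshold would serve equally well for two fillers; the nearest-duration rule simply extends to a larger inventory without changes.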

When speech recognition, machine translation, or speech synthesis is required, the filler information generation unit 24 may perform the processing itself, or may request another component (for example, the speech recognition unit 13, the machine translation unit 15, or the speech generation unit 17) or another apparatus to perform the processing and receive the result.

When the source language filler information is the speech information of the filler itself, the filler information generation unit 24 may perform processing to extract the paralanguage information (for example, frequency and volume) from the source language filler information.

The processing by which the filler information generation unit 24 changes the paralanguage of a speech signal, and the like, is already well known, and a detailed description is omitted.

The filler information insertion unit 25 inserts target language filler information at the filler insertion position specified by the filler insertion position specifying unit 23 in the target language speech information generated by the speech generation unit 17. The target language filler information is speech information of a filler in the target language and has the same paralanguage as the paralanguage included in the source language filler information extracted by the filler information extraction unit 21. In other words, the filler information insertion unit 25 inserts the target language filler information generated by the filler information generation unit 24 into the target language speech information.

Inserting the target language filler information at the filler insertion position of the target language speech information may mean either of the following. When the target language speech information does not contain speech information corresponding to the filler whose temporal position has been specified by the filler time position specifying unit 20, it may mean adding the target language filler information at the filler insertion position. When the target language speech information does contain speech information corresponding to that filler, it may mean overwriting that speech information, at the position indicated by the filler insertion position, with the target language filler information.
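
The two meanings of "insert" (add, or overwrite an existing filler) can be sketched on sample lists as follows. Representing audio as a list of samples and positions as sample indices is an assumption of this sketch, and the function name `insert_filler` is hypothetical.

```python
def insert_filler(target_samples, filler_samples, start, end=None):
    """Insert filler audio into target language audio.

    end is None: the target has no filler audio, so the filler is
                 added at index `start`.
    end is set:  the target already has filler audio in [start, end),
                 which is overwritten by the new filler.
    """
    if end is None:
        return target_samples[:start] + filler_samples + target_samples[start:]
    return target_samples[:start] + filler_samples + target_samples[end:]
```

In the overwrite case the replacement need not have the same length as the span it replaces, so the total duration of the output may change.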

The recording medium in which the source language speech information is stored, the recording medium in which the speech recognition result information is stored, the recording medium in which the translation result information is stored, the recording medium in which the target language speech information is stored, and so on may be realized by the same recording medium or by separate recording media. In the former case, for example, the area in which the source language speech information is stored serves as the recording medium in which the source language speech information is stored, and the area in which the speech recognition result information is stored serves as the recording medium in which the speech recognition result information is stored.

Next, the operation of the speech translation apparatus 1 according to this embodiment will be described using the flowchart of FIG. 2.
(Step S101) The speech information receiving unit 11 determines whether source language speech information has been received. If it has been received, the process proceeds to step S102; otherwise, step S101 is repeated until it is received.

(Step S102) The source language speech information storage unit 12 stores the source language speech information received by the speech information receiving unit 11. When the speech information receiving unit 11 receives source language speech information directly from a microphone in real time, steps S101 and S102 may be executed repeatedly so that a continuous stretch of source language speech information is accumulated.

(Step S103) The speech recognition unit 13 performs speech recognition on the source language speech information stored by the source language speech information storage unit 12 and acquires speech recognition result information corresponding to that source language speech information.

(Step S104) The speech recognition result information storage unit 14 stores the speech recognition result information acquired by the speech recognition unit 13.

(Step S105) The machine translation unit 15 machine-translates the speech recognition result information stored by the speech recognition result information storage unit 14 and acquires translation result information in the target language corresponding to that speech recognition result information.

(Step S106) The translation result information storage unit 16 stores the translation result information acquired by the machine translation unit 15.

(Step S107) The speech generation unit 17 generates target language speech information corresponding to the translation result information stored by the translation result information storage unit 16.

(Step S108) The target language speech information storage unit 18 stores the target language speech information generated by the speech generation unit 17.

(Step S109) The filler time position specifying unit 20 specifies the temporal position of a filler in the source language speech information stored by the source language speech information storage unit 12. The specified information may be temporarily stored in a recording medium (not shown).

(Step S110) The filler information extraction unit 21 extracts, from the source language speech information stored by the source language speech information storage unit 12, source language filler information on the filler whose temporal position was specified by the filler time position specifying unit 20. The extracted source language filler information may be temporarily stored in a recording medium (not shown).

(Step S111) The filler text position specifying unit 22 specifies the position, in the speech recognition result information stored by the speech recognition result information storage unit 14, that corresponds to the temporal position of the filler specified by the filler time position specifying unit 20. The specified information may be temporarily stored in a recording medium (not shown).

(Step S112) The filler insertion position specifying unit 23 specifies the filler insertion position, that is, the temporal position in the target language speech information stored by the target language speech information storage unit 18 that corresponds to the position in the speech recognition result information specified by the filler text position specifying unit 22. Information indicating the specified filler insertion position may be temporarily stored in a recording medium (not shown).

(Step S113) The filler information generation unit 24 generates target language filler information having the same paralanguage as the paralanguage included in the source language filler information extracted by the filler information extraction unit 21. The generated target language filler information may be temporarily stored in a recording medium (not shown).

(Step S114) The filler information insertion unit 25 inserts the target language filler information generated by the filler information generation unit 24 into the target language speech information at the filler insertion position specified by the filler insertion position specifying unit 23.

(Step S115) The speech information output unit 19 outputs the target language speech information into which the target language filler information has been inserted. The process then returns to step S101.
In the flowchart of FIG. 2, the process ends when the power is turned off or a process termination interrupt occurs. The order of processing in the flowchart of FIG. 2 is also flexible to some extent. For example, the filler insertion position specifying process (step S112) and the target language filler information generating process (step S113) may be performed in the reverse order.
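The data flow of steps S103 to S114 can be summarized in a short sketch. It is illustrative only and makes several assumptions: every stage is passed in as a callable, the names in `Stages` and the function `run_once` are hypothetical, and the receiving, storing and output steps (S101, S102, S104, S106, S108, S115) are left to the caller.

```python
from typing import Any, Callable, NamedTuple


class Stages(NamedTuple):
    """Hypothetical callables standing in for the units of the apparatus."""
    recognize: Callable[..., Any]           # S103: source speech -> recognition result
    translate: Callable[..., Any]           # S105: recognition result -> translation result
    synthesize: Callable[..., Any]          # S107: translation result -> target speech
    locate_fillers: Callable[..., Any]      # S109: -> list of filler time ranges
    extract: Callable[..., Any]             # S110: -> source language filler information
    text_position: Callable[..., Any]       # S111: -> position in the recognition result
    insertion_position: Callable[..., Any]  # S112: -> position in the target speech
    generate: Callable[..., Any]            # S113: -> target language filler speech
    insert: Callable[..., Any]              # S114: (target speech, placements) -> speech


def run_once(source_speech: Any, s: Stages) -> Any:
    """Run one pass of S103 to S114 and return the speech to be output."""
    recognition = s.recognize(source_speech)
    translation = s.translate(recognition)
    target_speech = s.synthesize(translation)

    placements = []
    for time_range in s.locate_fillers(source_speech, recognition):
        info = s.extract(source_speech, time_range)
        text_pos = s.text_position(recognition, time_range)
        # S112 and S113 do not use each other's results,
        # so the next two lines could be swapped.
        position = s.insertion_position(text_pos, translation, target_speech)
        filler_speech = s.generate(info)
        placements.append((position, filler_speech))

    return s.insert(target_speech, placements)
```

Because the insertion position and the generated filler are computed independently, the sketch makes visible why the order of steps S112 and S113 can be reversed.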

Next, the operation of the speech translation apparatus 1 according to this embodiment will be described using a specific example.
In this specific example, a Japanese speaker speaks in Japanese and the utterance is speech-translated into English.

First, suppose that the speaker says into a microphone, 「私は便宜的に三つの時期に分けたのですが、え〜、それぞれの時期に若干の重複があります。」 ("I divided it into three periods for convenience, but, um, there is some overlap in each period."). The source language speech information is then received by the speech information receiving unit 11 and stored by the source language speech information storage unit 12 (steps S101 and S102). The source language speech information in FIG. 3 shows an example of the source language speech information stored in this way.

Next, the speech recognition unit 13 performs speech recognition on the stored source language speech information and acquires speech recognition result information as the recognition result (step S103). Suppose the speech recognition result information is 「私は便宜的に三つの時期に分けたのですが<音声認識不可>それぞれの時期に若干の重複があります」. In this specific example, the speech recognition unit 13 could not recognize the filler 「え〜」 ("um") and inserted the mark <音声認識不可> (speech unrecognizable) into the speech recognition result information. The acquired speech recognition result information is stored by the speech recognition result information storage unit 14 (step S104). The speech recognition result information in FIG. 3 shows the information stored in this way. As shown in FIG. 3, the speech recognition result information also carries the time codes of the source language speech information, so that the speech signal of the source language speech information can be associated with the text of the speech recognition result information.

Next, the machine translation unit 15 performs machine translation on the stored speech recognition result information and acquires translation result information as the machine translation result (step S105). The machine translation unit 15 does not translate the mark <音声認識不可> (speech unrecognizable). Suppose the translation result information is "so for practical reason, I divided the era into three eras. these eras are somehow overlapping." The acquired translation result information is stored by the translation result information storage unit 16 (step S106). The translation result information in FIG. 3 shows the information stored in this way.

Next, the speech generation unit 17 performs speech synthesis on the stored translation result information and generates target language speech information corresponding to that translation result information (step S107). The target language speech information is stored by the target language speech information storage unit 18 (step S108). The target language speech information in FIG. 3 shows an example of the information stored in this way. As shown in FIG. 3, time codes are also attached to the target language speech information, but they do not correspond to the time codes of the source language speech information. In addition, because the filler has not been translated, the target language speech information contains no filler. Time codes are attached to the translation result information as well; these are the time codes of the target language speech information, attached after speech synthesis to the corresponding positions in the translation result information. The translation result information in FIG. 3 shows time codes attached every 0.5 seconds, but the time codes may of course be attached at finer intervals.

Next, the filler time position specifying unit 20 refers to the speech recognition result information in FIG. 3 and specifies the section marked <音声認識不可> (speech unrecognizable). Here, suppose that time codes 3.1 to 4.5 are specified as the unrecognizable section. Because these time codes correspond to the time codes of the source language speech information, the time code range is used as is as the temporal position of the filler in the source language speech information. The filler time position specifying unit 20 temporarily stores the specified time code range in a recording medium (not shown) (step S109).
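As a rough illustration of this step, the sketch below looks up unrecognizable sections in a time-coded recognition result. The representation is an assumption: each segment carries start and end time codes and its text, and a text of `None` stands for the <音声認識不可> mark. The names `Segment` and `filler_time_ranges` are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    start: float         # time code of the segment start, in seconds
    end: float           # time code of the segment end, in seconds
    text: Optional[str]  # recognized text; None stands for the unrecognizable mark


def filler_time_ranges(segments: List[Segment]) -> List[Tuple[float, float]]:
    """Return the time code range of every unrecognizable segment."""
    return [(seg.start, seg.end) for seg in segments if seg.text is None]
```

For a recognition result whose middle segment spans time codes 3.1 to 4.5 with no text, the function returns `[(3.1, 4.5)]`, which matches the range used in this example.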

The filler information extraction unit 21 refers to the time code range specified by the filler time position specifying unit 20 and extracts paralanguage from the source language speech information within that range (step S110). This paralanguage is the source language filler information. In this specific example, the following are assumed to be extracted as paralanguage: the change in frequency, the change in volume, the length of the filler in the source language speech information, the duration of the silence on the start side of the filler (the pre-pause), and the duration of the silence on the end side of the filler (the post-pause). The filler information extraction unit 21 temporarily stores the extracted source language filler information in a recording medium (not shown).
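Part of this extraction can be sketched as follows. The sketch is an illustration under stated assumptions, not the method of the embodiment: speech is a list of samples at a known sampling rate, the given range covers only the voiced filler, silence is any sample whose magnitude is below a threshold, and volume is summarized as a single RMS level. The change in frequency is omitted. All names are hypothetical.

```python
import math
from typing import Dict, List


def rms(samples: List[float]) -> float:
    """Root-mean-square level, used here as a simple stand-in for volume."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def _silent_run(samples: List[float], index: int, step: int, threshold: float) -> int:
    """Count consecutive silent samples from `index`, moving by `step` (+1 or -1)."""
    count = 0
    while 0 <= index < len(samples) and abs(samples[index]) < threshold:
        count += 1
        index += step
    return count


def extract_paralanguage(samples: List[float], rate: int, start_s: float,
                         end_s: float, threshold: float = 0.01) -> Dict[str, float]:
    """Measure the filler in [start_s, end_s) and the silence on either side."""
    first = int(round(start_s * rate))
    last = int(round(end_s * rate))
    pre = _silent_run(samples, first - 1, -1, threshold)  # silence before the filler
    post = _silent_run(samples, last, +1, threshold)      # silence after the filler
    return {
        "duration": (last - first) / rate,
        "volume_rms": rms(samples[first:last]),
        "pre_pause": pre / rate,
        "post_pause": post / rate,
    }
```

A practical system would more likely track volume and pitch as contours over time rather than as single values; the single RMS figure here only keeps the sketch short.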

The filler text position specifying unit 22 specifies the position in the speech recognition result information that corresponds to the temporal position of the source language filler. It does so by locating the mark <音声認識不可> (speech unrecognizable) in the speech recognition result information (step S111). Specifically, the filler text position specifying unit 22 makes this specification by temporarily storing, in a recording medium (not shown), the chunk 「それぞれの時期に若干の重複があります」 ("there is some overlap in each period") that immediately follows the mark in time.

The filler insertion position specifying unit 23 refers to the chunk 「それぞれの時期に若干の重複があります」 specified by the filler text position specifying unit 22 and passes that text to the machine translation unit 15 to obtain its translation, "these eras are somehow overlapping." The filler insertion position specifying unit 23 then refers to the translation result information stored by the translation result information storage unit 16 and uses the attached time codes to specify the position immediately before "these eras are somehow overlapping." In this example that time code is "5.6". Because this time code corresponds to the time codes of the target language speech information, its position is used as is as the filler insertion position. The filler insertion position specifying unit 23 temporarily stores the time code in a recording medium (not shown) (step S112). In FIG. 3, the filler insertion position is indicated by an arrow (this is only for convenience of explanation; the translation result information and the like need not actually contain arrow information).
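The lookup of the insertion time code can be sketched as a search for the translated chunk within the time-coded translation result. This is a simplified illustration that assumes a time code is attached to every word and that the chunk translation appears verbatim in the translation result; the function name `find_insertion_time` is hypothetical.

```python
from typing import List, Optional, Tuple


def find_insertion_time(timed_words: List[Tuple[float, str]],
                        chunk: List[str]) -> Optional[float]:
    """Return the time code of the first word of `chunk`, or None if absent.

    `timed_words` pairs each word of the translation result with the time
    code of the target language speech at which that word starts.
    """
    if not chunk:
        return None
    words = [word for _, word in timed_words]
    n = len(chunk)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == chunk:
            return timed_words[i][0]  # the position immediately before the chunk
    return None
```

A verbatim match is a strong assumption: if the machine translation unit renders the isolated chunk differently from how it rendered the chunk within the full sentence, the search fails and returns `None`, and some fallback would be needed.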

Suppose the filler information generation unit 24 holds a speech signal of "Well" as a filler in the target language in a recording medium (not shown). The filler information generation unit 24 refers to the source language filler information extracted by the filler information extraction unit 21 and generates target language filler information by making the paralanguage of the "Well" speech signal match the paralanguage included in the source language filler information (step S113). The filler information generation unit 24 temporarily stores the target language filler information in a recording medium (not shown).
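One crude way to match part of the paralanguage is sketched below. It is not the method of the embodiment and rests on several assumptions: the stored filler is a list of samples, the length is matched by nearest-neighbour resampling (which also shifts pitch, so a real system would need a pitch-preserving method), the volume is matched by a single gain, and the pauses are added as silent samples. The function and parameter names are hypothetical.

```python
import math
from typing import List


def rms(samples: List[float]) -> float:
    """Root-mean-square level of a sample list (0.0 for an empty list)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def match_paralanguage(template: List[float], length: int, volume_rms: float,
                       pre_pause: int = 0, post_pause: int = 0) -> List[float]:
    """Fit a stored filler to a target length, level and surrounding pauses.

    `length`, `pre_pause` and `post_pause` are given in samples.
    """
    if not template or length <= 0:
        return [0.0] * (pre_pause + post_pause)
    # Nearest-neighbour resampling to the requested number of samples.
    body = [template[min(len(template) - 1, k * len(template) // length)]
            for k in range(length)]
    level = rms(body)
    gain = volume_rms / level if level > 0 else 0.0
    body = [x * gain for x in body]
    return [0.0] * pre_pause + body + [0.0] * post_pause
```

The change in frequency and the change in volume over time, which the example also lists as paralanguage, are not handled by this sketch.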

The filler information insertion unit 25 refers to the filler insertion position specified by the filler insertion position specifying unit 23 and inserts the target language filler information generated by the filler information generation unit 24 at that position (step S114). The "target language speech information with target language filler information inserted" in FIG. 3 shows an example of the target language speech information after the insertion. Finally, the speech information output unit 19 outputs the target language speech information into which the target language filler information has been inserted (step S115). That is, speech corresponding to "so for practical reason, I divided the era into three eras. Well these eras are somehow overlapping." is output. Moreover, because the paralanguage of the filler "Well" matches the paralanguage of the 「え〜」 ("um") that the speaker uttered in the source language, the speech translation result is more lifelike and carries the speaker's emotions and the like.

This specific example described the case where the source language speech information contains only one filler. When the source language speech information contains two or more fillers, the same processing as described above is performed for each filler. For example, each of steps S109 to S114 in the flowchart of FIG. 2 then processes a plurality of fillers.
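When several fillers are inserted, one practical detail is that each insertion shifts everything after it. A minimal sketch of one way to handle this, assuming that all insertion positions are sample indices into the target language speech as it was before any insertion, is to insert from the latest position to the earliest. The function name `insert_all` is hypothetical.

```python
from typing import List, Tuple


def insert_all(target: List[float],
               placements: List[Tuple[int, List[float]]]) -> List[float]:
    """Insert every (position, filler) pair into a copy of `target`.

    Positions refer to the original `target`. Working from the last position
    backwards keeps the earlier positions valid while fillers are added.
    """
    out = list(target)
    for position, filler in sorted(placements, key=lambda p: p[0], reverse=True):
        out[position:position] = filler  # slice assignment inserts without removing
    return out
```

Inserting in ascending order would also work, but each later position would then have to be increased by the total length of the fillers already inserted.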

Of the specific data in FIG. 3 used in this example, the portion relating to the target language speech information is shown only for the purpose of this explanation and may differ from actual data.

As described above, the speech translation apparatus 1 according to this embodiment can translate fillers as well, and can give the translated target language filler the same paralanguage as the source language filler. Fillers are information that is normally discarded in speech translation, but by translating them as well, it becomes possible to realize speech translation that is more lifelike and conveys the speaker's emotions and the like. Inserting the target language filler information into the target language speech information also gives the target language speech information the rhythm of the spoken source language. Furthermore, because the inserted target language filler information has the same paralanguage as the source language filler information, the atmosphere of the source language utterance can be sensed from the target language speech information as well. As a result, a person listening to the target language speech information may, for example, notice that he or she had misunderstood the context, or may notice a mistranslation made by the machine translation.

The speech translation apparatus 1 according to this embodiment was described as generating the target language filler information using the source language filler information extracted by the filler information extraction unit 21, but this generation need not be performed. For example, the source language filler information may be the speech information (speech signal) of the filler whose temporal position was specified by the filler time position specifying unit 20, and the target language filler information may be the source language filler information itself. In this case, the source language filler itself is inserted into the target language speech information. Because a filler itself carries no significant information, it is considered to be understandable to some extent even in a different language. Therefore, inserting the source language filler itself into the target language speech information is not expected to hinder the listener's understanding of the target language speech information. In this case, the speech translation apparatus 1 need not include the filler information generation unit 24.
As described above, the filler information insertion unit 25 only has to insert, at the filler insertion position in the target language speech information, target language filler information that is speech information of a filler in the target language and has the same paralanguage as the paralanguage included in the source language filler information. It need not insert target language filler information generated by the filler information generation unit 24.

The above embodiment described the case where the speech translation apparatus 1 is a stand-alone apparatus, but the speech translation apparatus 1 may be either a stand-alone apparatus or a server apparatus in a server-client system. In the latter case, the output unit and the receiving unit receive input and output screens via a communication line.

In the above embodiment, each process or function may be realized by centralized processing in a single apparatus or a single system, or by distributed processing across a plurality of apparatuses or systems.

In the above embodiment, information related to the processing executed by each component may be held temporarily or over a long period in a recording medium (not shown), even where the above description does not say so explicitly. Such information includes, for example, information that each component has received, acquired, selected, generated, transmitted, or been sent, as well as thresholds, formulas, addresses, and the like that each component uses in its processing. The information may be stored in that recording medium by each component or by a storage unit (not shown), and may be read from that recording medium by each component or by a reading unit (not shown).

In the above embodiment, when two or more components of the speech translation apparatus 1 each have a communication device, an input device, or the like, those components may share a single physical device or may have separate devices.

In the above embodiment, each component may be configured by dedicated hardware, or a component that can be realized by software may be realized by executing a program. For example, each component can be realized by a program execution unit such as a CPU reading and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory. The software that realizes the speech translation apparatus 1 in the above embodiment is the following program. That is, this program causes a computer to function as: a speech information receiving unit that receives source language speech information, which is information obtained by collecting uttered source language speech with a microphone; a speech recognition unit that performs speech recognition on the source language speech information received by the speech information receiving unit and acquires speech recognition result information, which is text information corresponding to the source language speech information; a machine translation unit that machine-translates the speech recognition result information acquired by the speech recognition unit and acquires translation result information, which is target language text information corresponding to the speech recognition result information; a speech generation unit that generates target language speech information, which is target language speech information corresponding to the translation result information acquired by the machine translation unit; a filler time position specifying unit that specifies, in the source language speech information received by the speech information receiving unit, the temporal position of a filler, which is an utterance inserted between significant utterances; a filler information extraction unit that extracts source language filler information, which is information on the filler whose temporal position in the source language speech information was specified by the filler time position specifying unit and which includes at least paralanguage, that is, non-linguistic information in speech information; a filler text position specifying unit that specifies the position in the speech recognition result information corresponding to the temporal position of the filler specified in the source language speech information by the filler time position specifying unit; a filler insertion position specifying unit that specifies a filler insertion position, which is the temporal position in the target language speech information corresponding to the position in the speech recognition result information specified by the filler text position specifying unit; a filler information insertion unit that inserts, at the filler insertion position specified by the filler insertion position specifying unit in the target language speech information generated by the speech generation unit, target language filler information, which is speech information of a filler in the target language and has the same paralanguage as the paralanguage included in the source language filler information extracted by the filler information extraction unit; and a speech information output unit that outputs the target language speech information into which the target language filler information has been inserted by the filler information insertion unit.

In the above program, the output step of outputting information, the receiving step of receiving information, and the like do not include, at the least, processing that is performed only by hardware, for example, processing performed by a modem or an interface card in the output step.

This program may be executed after being downloaded from a server or the like, or may be executed by reading a program recorded on a predetermined recording medium (for example, an optical disc such as a CD-ROM, a magnetic disk, or a semiconductor memory).

The program may be executed by a single computer or by a plurality of computers. That is, either centralized processing or distributed processing may be performed.

FIG. 4 is a schematic diagram showing an example of the external appearance of a computer that executes the above program to realize the speech translation apparatus according to the above embodiment. The above embodiment is realized by computer hardware and a computer program executed on it.

In FIG. 4, a computer system 100 includes a computer 101 that contains a CD-ROM (Compact Disk Read Only Memory) drive 105 and an FD (Flexible Disk) drive 106, a keyboard 102, a mouse 103, and a monitor 104.

FIG. 5 is a diagram showing the computer system. In FIG. 5, the computer 101 includes, in addition to the CD-ROM drive 105 and the FD drive 106, a CPU (Central Processing Unit) 111; a ROM (Read Only Memory) 112 for storing programs such as a boot-up program; a RAM (Random Access Memory) 113 that is connected to the CPU 111, temporarily stores the instructions of application programs, and provides temporary storage space; a hard disk 114 that stores application programs, system programs, and data; and a bus 115 that interconnects the CPU 111, the ROM 112, and the other components. The computer 101 may also include a network card (not shown) that provides a connection to a LAN.

The program that causes the computer system 100 to execute the functions of the speech translation apparatus according to the above embodiment may be stored on a CD-ROM 121 or an FD 122, which is inserted into the CD-ROM drive 105 or the FD drive 106, and then transferred to the hard disk 114. Alternatively, the program may be transmitted to the computer 101 via a network (not shown) and stored on the hard disk 114. The program is loaded into the RAM 113 when it is executed. The program may also be loaded directly from the CD-ROM 121, the FD 122, or the network.

The program need not include an operating system (OS), third-party programs, or the like that cause the computer 101 to execute the functions of the speech translation apparatus according to the above embodiment. The program may contain only those instructions that call the appropriate functions (modules) in a controlled manner so that the desired results are obtained. How the computer system 100 operates is well known, so a detailed description is omitted.

Needless to say, the present invention is not limited to the above embodiment; various modifications are possible, and these are also included within the scope of the present invention.

As described above, the speech translation apparatus and the like according to the present invention translate fillers as well, which makes it possible to realize more lifelike speech translation; the invention is therefore useful as a speech translation system and the like.
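To make the filler handling concrete, the following Python sketch walks through three of the steps the apparatus performs: locating the filler's position in the recognized text, mapping that position to a time in the target language speech, and splicing filler audio in at that time. It is only an illustration under stated assumptions: the `Word` and `Filler` types, the function names, the word-level `alignment` list, and the rule of inserting the filler right after the aligned target word are all hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """A recognized or synthesized word with its time span in seconds."""
    text: str
    start: float
    end: float


@dataclass
class Filler:
    """A filler located in the source speech (hypothetical structure)."""
    start: float
    end: float
    samples: List[float]  # filler audio cut out of the source speech


def filler_text_position(source_words: List[Word], filler: Filler) -> int:
    """Count the recognized words that end before the filler starts."""
    return sum(1 for w in source_words if w.end <= filler.start)


def filler_insertion_time(text_pos: int, alignment: List[int],
                          target_words: List[Word]) -> float:
    """Map a position in the recognized text to a time in the target speech.

    Assumption: alignment[i] is the index of the target word aligned with
    source word i, and the filler goes right after the target word aligned
    with the source word that precedes it.
    """
    if text_pos == 0:
        return 0.0  # the filler preceded every word, so place it at the start
    return target_words[alignment[text_pos - 1]].end


def insert_filler(target_samples: List[float], filler_samples: List[float],
                  insert_time: float, sample_rate: int) -> List[float]:
    """Splice the filler audio into the target audio at insert_time."""
    cut = round(insert_time * sample_rate)
    cut = max(0, min(cut, len(target_samples)))  # clamp to the valid range
    return target_samples[:cut] + filler_samples + target_samples[cut:]
```

In this sketch the filler audio is reused unchanged, which corresponds to the variant in which the target language filler information is the source language filler information itself. A separately generated filler with matching pitch and volume could be passed as `filler_samples` instead.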

FIG. 1 is a block diagram showing the configuration of the speech translation apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing the operation of the speech translation apparatus according to the embodiment.
FIG. 3 is a diagram showing an example of source language speech information and the like according to the embodiment.
FIG. 4 is a schematic diagram showing an example of the external appearance of the computer system in the embodiment.
FIG. 5 is a diagram showing an example of the configuration of the computer system in the embodiment.

Explanation of symbols

1 Speech translation apparatus
11 Speech information receiving unit
12 Source language speech information storage unit
13 Speech recognition unit
14 Speech recognition result information storage unit
15 Machine translation unit
16 Translation result information storage unit
17 Speech generation unit
18 Target language speech information storage unit
19 Speech information output unit
20 Filler time position specifying unit
21 Filler information extraction unit
22 Filler text position specifying unit
23 Filler insertion position specifying unit
24 Filler information generation unit
25 Filler information insertion unit

Claims

A speech translation apparatus comprising:
a speech information receiving unit that receives source language speech information, which is information obtained by picking up spoken source language speech with a microphone;
a speech recognition unit that performs speech recognition on the source language speech information received by the speech information receiving unit and acquires speech recognition result information, which is text information corresponding to the source language speech information;
a machine translation unit that machine-translates the speech recognition result information acquired by the speech recognition unit and acquires translation result information, which is target language text information corresponding to the speech recognition result information;
a speech generation unit that generates target language speech information, which is speech information in the target language corresponding to the translation result information acquired by the machine translation unit;
a filler time position specifying unit that specifies, in the source language speech information received by the speech information receiving unit, the temporal position of a filler, which is an utterance inserted between meaningful utterances;
a filler information extraction unit that extracts source language filler information, which is information on the filler whose temporal position in the source language speech information has been specified by the filler time position specifying unit and which includes at least paralanguage, that is, non-linguistic information in speech information;
a filler text position specifying unit that specifies the position in the speech recognition result information corresponding to the temporal position of the filler specified by the filler time position specifying unit in the source language speech information;
a filler insertion position specifying unit that specifies a filler insertion position, which is the temporal position in the target language speech information corresponding to the position in the speech recognition result information specified by the filler text position specifying unit;
a filler information insertion unit that inserts, at the filler insertion position specified by the filler insertion position specifying unit in the target language speech information generated by the speech generation unit, target language filler information, which is speech information of a filler in the target language and has the same paralanguage as the paralanguage included in the source language filler information extracted by the filler information extraction unit; and
a speech information output unit that outputs the target language speech information into which the target language filler information has been inserted by the filler information insertion unit.

The speech translation apparatus according to claim 1, wherein the source language filler information is the speech information of the filler whose temporal position has been specified by the filler time position specifying unit, and
the target language filler information is the source language filler information.

The speech translation apparatus according to claim 1, further comprising a filler information generation unit that generates target language filler information having the same paralanguage as the paralanguage included in the source language filler information,
wherein the filler information insertion unit inserts the target language filler information generated by the filler information generation unit into the target language speech information.

The speech translation apparatus according to claim 3, wherein the filler information generation unit generates target language filler information having the same vowel as the vowel of the filler whose temporal position has been specified by the filler time position specifying unit.

The speech translation apparatus according to claim 3, wherein the filler information generation unit generates target language filler information that is target language speech information corresponding to the result of performing speech recognition and machine translation on the filler in the source language speech information.

The speech translation apparatus according to claim 1, wherein the paralanguage is at least one item of information selected from frequency, volume, change in frequency, and change in volume.

The speech translation apparatus according to claim 6, wherein the paralanguage further includes the duration of the filler in the source language speech information, the duration of the silent interval on the start side of the filler, and the duration of the silent interval on the end side of the filler.

A speech translation method carried out using a speech information receiving unit, a speech recognition unit, a machine translation unit, a speech generation unit, a filler time position specifying unit, a filler information extraction unit, a filler text position specifying unit, a filler insertion position specifying unit, a filler information insertion unit, and a speech information output unit, the method comprising:
a speech information receiving step in which the speech information receiving unit receives source language speech information, which is information obtained by picking up spoken source language speech with a microphone;
a speech recognition step in which the speech recognition unit performs speech recognition on the source language speech information received in the speech information receiving step and acquires speech recognition result information, which is text information corresponding to the source language speech information;
a machine translation step in which the machine translation unit machine-translates the speech recognition result information acquired in the speech recognition step and acquires translation result information, which is target language text information corresponding to the speech recognition result information;
a speech generation step in which the speech generation unit generates target language speech information, which is speech information in the target language corresponding to the translation result information acquired in the machine translation step;
a filler time position specifying step in which the filler time position specifying unit specifies, in the source language speech information received in the speech information receiving step, the temporal position of a filler, which is an utterance inserted between meaningful utterances;
a filler information extraction step in which the filler information extraction unit extracts source language filler information, which is information on the filler whose temporal position in the source language speech information has been specified in the filler time position specifying step and which includes at least paralanguage, that is, non-linguistic information in speech information;
a filler text position specifying step in which the filler text position specifying unit specifies the position in the speech recognition result information corresponding to the temporal position of the filler specified in the filler time position specifying step in the source language speech information;
a filler insertion position specifying step in which the filler insertion position specifying unit specifies a filler insertion position, which is the temporal position in the target language speech information corresponding to the position in the speech recognition result information specified in the filler text position specifying step;
a filler information insertion step in which the filler information insertion unit inserts, at the filler insertion position specified in the filler insertion position specifying step in the target language speech information generated in the speech generation step, target language filler information, which is speech information of a filler in the target language and has the same paralanguage as the paralanguage included in the source language filler information extracted in the filler information extraction step; and
a speech information output step in which the speech information output unit outputs the target language speech information into which the target language filler information has been inserted in the filler information insertion step.

A program for causing a computer to function as:
a speech information receiving unit that receives source language speech information, which is information obtained by picking up spoken source language speech with a microphone;
a speech recognition unit that performs speech recognition on the source language speech information received by the speech information receiving unit and acquires speech recognition result information, which is text information corresponding to the source language speech information;
a machine translation unit that machine-translates the speech recognition result information acquired by the speech recognition unit and acquires translation result information, which is target language text information corresponding to the speech recognition result information;
a speech generation unit that generates target language speech information, which is speech information in the target language corresponding to the translation result information acquired by the machine translation unit;
a filler time position specifying unit that specifies, in the source language speech information received by the speech information receiving unit, the temporal position of a filler, which is an utterance inserted between meaningful utterances;
a filler information extraction unit that extracts source language filler information, which is information on the filler whose temporal position in the source language speech information has been specified by the filler time position specifying unit and which includes at least paralanguage, that is, non-linguistic information in speech information;
a filler text position specifying unit that specifies the position in the speech recognition result information corresponding to the temporal position of the filler specified by the filler time position specifying unit in the source language speech information;
a filler insertion position specifying unit that specifies a filler insertion position, which is the temporal position in the target language speech information corresponding to the position in the speech recognition result information specified by the filler text position specifying unit;
a filler information insertion unit that inserts, at the filler insertion position specified by the filler insertion position specifying unit in the target language speech information generated by the speech generation unit, target language filler information, which is speech information of a filler in the target language and has the same paralanguage as the paralanguage included in the source language filler information extracted by the filler information extraction unit; and
a speech information output unit that outputs the target language speech information into which the target language filler information has been inserted by the filler information insertion unit.