JP2015108705A

JP2015108705A - Voice reproduction system, voice reproduction method and program

Info

Publication number: JP2015108705A
Application number: JP2013250976A
Authority: JP
Inventors: Junichi Rekimoto (暦本 純一); Xiang Li (翔李)
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2013-12-04
Filing date: 2013-12-04
Publication date: 2015-06-11

Abstract

PROBLEM TO BE SOLVED: To provide a technique that lets an audience feel as if the user is speaking naturally in person, even when the user is not fluent in the language being used. SOLUTION: The voice reproduction system comprises a storage unit, an imaging unit, and a control unit. The storage unit stores in advance voice data including a plurality of partial waveform data. The imaging unit is capable of imaging the mouth of a user. The control unit determines, on the basis of an image captured by the imaging unit, that the mouth of the user has opened, and controls the timing of reproduction of the voice data so that the partial waveform data are reproduced sequentially in response to the mouth opening.

Description

The present technology relates to an audio reproduction system, an audio reproduction method, and a program for reproducing audio.

In recent years, opportunities to give speeches, presentations, and similar talks in a language other than one's native language have been increasing at international conferences and international events. For example, a Japanese presenter may present in English or in the language of the country being visited.

Presenting fluently in a non-native language requires an enormous effort to learn that language. For a presenter who cannot present fluently, a common approach is to have an interpreter simultaneously interpret the presentation and to deliver the interpreter's voice to the audience through earphones (see, for example, Patent Document 1 below).

Patent Document 1: JP 2007-306420 A

However, when simultaneous interpretation is used, the naturalness of the presentation is lost. There is therefore a demand for a technology that can make the audience feel as if the user is speaking naturally in person, even if the user is not necessarily fluent in the language used.

In view of the above circumstances, an object of the present technology is to provide a technology that can make the audience feel as if the user is speaking naturally in person, even if the user is not necessarily fluent in the language used.

The audio reproduction system according to the present technology includes a storage unit, an imaging unit, and a control unit.
The storage unit stores in advance audio data including a plurality of partial waveform data.
The imaging unit is capable of imaging a user's mouth.
The control unit determines, based on an image captured by the imaging unit, that the user's mouth has opened, and controls the timing of reproduction of the audio data so that the partial waveform data are reproduced sequentially in response to the user's mouth opening.

In the present technology, for example, audio data to be used for a presentation or the like is stored in the storage unit in advance. Then, when the presentation is actually given, the partial waveform data in the audio data are reproduced sequentially each time the user's mouth opens, so the reproduction timing of the audio data coincides with the timing at which the mouth opens. As a result, even if the user is not necessarily fluent in the language used, the audience can be made to feel as if the user is speaking naturally in person.

In the audio reproduction system, the imaging unit may be capable of imaging the user's face.
In this case, the control unit may extract face information from the captured image and, based on the face information, control the intonation of the audio data currently being reproduced.

Controlling the intonation of the currently reproduced audio data based on the face information in this way makes it possible to add emotional expression to that audio data.

In the audio reproduction system, the control unit may extract the distance between the eyes and the eyebrows as the face information and control the volume of the audio based on that distance.

This makes it possible to add emotional expression to the currently reproduced audio data.

In the audio reproduction system, the control unit may extract the vertical orientation of the face as the face information and control the pitch of the audio based on that orientation.

The audio reproduction system may further include a prompter unit.
In this case, the storage unit may further store text data corresponding to the audio data.
In this case, the control unit may control the prompter unit so that the text data is displayed on the prompter unit.

In this audio reproduction system, when the user reads aloud the text data displayed on the prompter unit (merely mouthing the words is also acceptable), the partial waveform data in the audio data are reproduced sequentially in time with the opening of the user's mouth.

In the audio reproduction system, the text data may include a plurality of partial text data.
In this case, the storage unit may store the partial waveform data and the partial text data corresponding to the partial waveform data in association with each other.
In this case, the control unit may control the display of the prompter unit so that, in time with the reproduction of the partial waveform data, the partial text data corresponding to the currently reproduced partial waveform data is highlighted.

In this audio reproduction system, the user can easily recognize which part of the text data is currently being reproduced by looking at the highlighted partial text data.

In the audio reproduction system, the control unit may execute a process of associating the partial waveform data with the partial text data, so that the partial waveform data and the corresponding partial text data are stored in the storage unit in association with each other.

This allows the partial waveform data and the partial text data to be associated appropriately.

In the audio reproduction system, the storage unit may store, as the audio data, data acquired by having a narrator read the text data aloud.

In the audio reproduction system, the storage unit may store, as the audio data, synthesized speech produced by speech synthesis.

The audio reproduction method according to the present technology includes storing in advance audio data including a plurality of partial waveform data.
The user's mouth is imaged.
Based on the captured image, it is determined that the user's mouth has opened.
The timing of reproduction of the audio data is controlled so that the partial waveform data are reproduced sequentially in response to the user's mouth opening.

A program according to the present technology causes an audio reproduction system to execute the steps of:
storing in advance audio data including a plurality of partial waveform data;
imaging a user's mouth;
determining, based on the captured image, that the user's mouth has opened; and
controlling the timing of reproduction of the audio data so that the partial waveform data are reproduced sequentially in response to the user's mouth opening.

As described above, the present technology can provide a technology that makes the audience feel as if the user is speaking naturally in person, even if the user is not necessarily fluent in the language used.

FIG. 1 is a block diagram showing the electrical configuration of an audio reproduction system according to a first embodiment of the present technology.
FIG. 2 is a diagram for explaining the operation of the audio reproduction system at the preparation stage.
FIG. 3 is a diagram for explaining the process of finding the minimum-cost association by dynamic programming.
FIG. 4 is a flowchart showing the operation of the audio reproduction system when the user gives a presentation.
FIG. 5 is a schematic diagram for explaining the operation of the audio reproduction system when the user gives a presentation.
FIG. 6 is a diagram showing changes in the distance between the eyes and the eyebrows.
FIG. 7 is a diagram showing changes in the vertical orientation of the face.

Hereinafter, embodiments according to the present technology will be described with reference to the drawings.

<First Embodiment>
[Overall Configuration of Audio Reproduction System 10 and Configuration of Each Unit]
FIG. 1 is a block diagram showing the electrical configuration of an audio reproduction system 10 according to the first embodiment of the present technology. In the description of the present embodiment, the audio reproduction system 10 is described as a system used when giving a presentation such as a speech. The audio reproduction system 10 according to the present embodiment can also be used for purposes other than presentations, such as theater.

The description of the first embodiment also assumes that the audio reproduction system 10 is used by a user who cannot fluently speak the language used for the presentation (English).

As shown in FIG. 1, the audio reproduction system 10 includes a control unit 1, a storage unit 2, an input unit 3, a display unit 4, an imaging unit 5, a prompter unit 6, a microphone 7, and a speaker 8.

The storage unit 2 includes a non-volatile memory (for example, a ROM (Read Only Memory)) that stores the various programs necessary for the processing of the control unit 1, and a volatile memory (for example, a RAM (Random Access Memory)) used as a work area of the control unit 1. The various programs may be read from a portable recording medium such as an optical disc or a semiconductor memory.

In particular, in the present embodiment, the storage unit 2 stores text data (a speech manuscript) prepared by the user in the language used for the presentation (English) and audio data corresponding to that text data, in association with each other.

The display unit 4 is, for example, a liquid crystal display or an EL (Electro Luminescence) display, and displays various images on its screen under the control of the control unit 1. For example, the screen of the display unit 4 shows the same data as the text data (one page) displayed on the prompter unit 6.

The input unit 3 includes a keyboard, a mouse, and the like. The input unit 3 receives instructions from the user and outputs signals corresponding to those instructions to the control unit 1.

The control unit 1 includes a CPU (Central Processing Unit) or the like. The control unit 1 executes various calculations based on the programs stored in the storage unit 2 and controls the units of the audio reproduction system 10 as a whole. For example, the control unit 1 determines, based on an image captured by the imaging unit 5, that the user's mouth has opened, and controls the timing of reproduction of the audio data so that the partial waveform data (described later) are reproduced sequentially in response to the user's mouth opening. Details of the processing of the control unit 1 will be described later.

The control unit 1, the storage unit 2, the input unit 3, and the display unit 4 may be implemented by, for example, a PC (Personal Computer). A storage unit of a server device may also be used as the storage unit 2. In this case, the control unit 1 may acquire the various data stored in the storage unit 2 via a network.

The imaging unit 5 includes an imaging element such as a CCD (Charge Coupled Device) sensor or a CMOS (Complementary Metal Oxide Semiconductor) sensor, and an optical system such as an imaging lens that forms an image on the imaging surface of the imaging element. The imaging unit 5 is configured to be able to image at least the user's mouth. In the present embodiment, the imaging unit 5 is configured to be able to image not only the mouth but the user's entire face.

The prompter unit 6 displays the text data one page at a time under the control of the control unit 1. The prompter unit 6 includes a plate-shaped half mirror arranged at an incline to the vertical plane, and an upward-facing liquid crystal display device that projects the text data (one page) onto the half mirror. The prompter unit 6 is configured so that the user (presenter) sees the characters of the text data on the half mirror, while the audience does not see the characters and the half mirror appears transparent. The prompter unit 6 is placed, for example, on the podium from which the user gives the presentation.

The prompter unit 6 is not limited to a configuration using a half mirror and a liquid crystal display device. For example, the prompter unit 6 may consist of a liquid crystal display device alone. In this case, the liquid crystal display device is placed where the user (presenter) can easily see the characters displayed on it.

The microphone 7 converts input sound into an electrical signal and outputs it to the control unit 1. The microphone 7 is used at the preparation stage before the presentation, when a narrator reads aloud the text data (speech manuscript) written in the language used for the presentation (English). The narrator is typically a person who can read the text data (English) aloud fluently.

The speaker 8 outputs sound under the control of the control unit 1. The speaker 8 is placed in the venue where the presentation is given. The number of speakers 8 placed in the venue may be one, or two or more.

[Description of Operation]
Next, the operation of the audio reproduction system 10 will be described. The description first covers the operation of the audio reproduction system 10 at the preparation stage before the presentation, and then the operation of the audio reproduction system 10 when the user gives the presentation.

"Operation of the audio reproduction system 10 at the preparation stage"
FIG. 2 is a diagram for explaining the operation of the audio reproduction system 10 at the preparation stage. It shows the correspondence between text data written in English and the audio data acquired when that text data was read aloud.

First, referring to the upper part of FIG. 2, the user prepares the text data (speech manuscript), written in English, that will be used for the presentation. The upper diagram of FIG. 2 shows, as an example, the opening portion of the text data: "I am a cat. I don't have a name. I am a cat". The actual text data is longer than the text shown in FIG. 2.

Next, the user has a narrator who can read the prepared text data fluently read it aloud. The voice of the narrator is converted into an electrical signal by the microphone 7. Fluently read audio data is thereby acquired (see the lower diagram of FIG. 2). The acquired audio data is stored in the storage unit 2.

Next, the control unit 1 detects silent portions in the acquired audio data (see the portions enclosed in squares in the lower diagram of FIG. 2) and divides the audio data at the detected silent portions. The audio data is thereby divided into a plurality of partial waveform data. Each item of partial waveform data corresponds to one unit (one segment) reproduced when the user gives the presentation. Each item of partial waveform data spans from the time a silent portion starts, through a group of waveforms, to the time the next silent portion starts.
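The silence-based division described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_at_silence`, the amplitude `threshold`, and the `min_silence` run length are all assumptions, since the description does not say how a silent portion is detected.

```python
from typing import List, Sequence, Tuple


def split_at_silence(
    samples: Sequence[float],
    threshold: float = 0.02,   # hypothetical amplitude below which a sample is "silent"
    min_silence: int = 3,      # hypothetical minimum run length of a silent portion
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges, one per partial waveform.

    Each range runs from the start of one silent portion to the start of
    the next, matching the definition in the description.
    """
    silence_starts: List[int] = []
    run_start = None  # index where the current quiet run began
    for i, x in enumerate(samples):
        if abs(x) < threshold:
            if run_start is None:
                run_start = i
            # Record the run once, at the moment it becomes long enough.
            if i - run_start + 1 == min_silence:
                silence_starts.append(run_start)
        else:
            run_start = None

    # Cut points: beginning of the data, each silence start, end of the data.
    cuts = sorted({0, *silence_starts, len(samples)})
    segments = list(zip(cuts, cuts[1:]))
    # Drop ranges with no audible sample (for example, trailing silence).
    return [
        (a, b) for a, b in segments
        if any(abs(x) >= threshold for x in samples[a:b])
    ]
```

With real recordings, `min_silence` would be expressed in thousands of samples (a fraction of a second) rather than the tiny value used here for readability.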

Next, the control unit 1 divides the text data into partial text data. Typically, the control unit 1 does this by splitting the text data at the delimiters (periods and commas) it contains.

Before the text data is divided, the user adds commas to the text data as markers for division. These commas are entered via the input unit 3. In the example shown in FIG. 2, the user has added a comma between "I am" and "a cat" (in two places), and a comma between "I don't have" and "a name".

The user adds commas to the text data while estimating the length of the partial text data that corresponds to each item of partial waveform data. Alternatively, the control unit 1 may estimate the length of the partial text data corresponding to the partial waveform data and divide the text data into partial text data automatically.
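The delimiter-based division of the text can be sketched as below, using the example sentence from FIG. 2 with the user-added commas already in place. The function name `split_text` is an assumption for illustration; the description only states that periods and commas act as delimiters.

```python
import re
from typing import List


def split_text(text: str) -> List[str]:
    """Split a manuscript into partial text data at periods and commas."""
    parts = re.split(r"[.,]", text)
    # Discard empty pieces (for example, after a final period) and trim spaces.
    return [p.strip() for p in parts if p.strip()]
```

Note that this simple rule also splits at periods inside abbreviations or decimal numbers, so a real manuscript may need extra handling.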

Once the audio data has been divided into partial waveform data and the text data has been divided into partial text data, the control unit 1 executes a process of associating the partial waveform data with the partial text data. In the example shown in FIG. 2, the first, second, and subsequent partial waveform data are associated, in order, with the first, second, and subsequent partial text data "I am", "a cat", and so on. The partial waveform data and the partial text data are thereby stored in the storage unit 2 in association with each other.

The divisions of the audio data and the divisions of the text data may not match exactly. For example, the number of partial waveform data may differ from the number of partial text data. In such a case, if the partial waveform data and the partial text data were simply associated one-to-one in order, the association would be inaccurate.

Therefore, when the divisions of the audio data and the divisions of the text data do not match exactly, the control unit 1 executes a process that uses dynamic programming to find the association with the minimum cost.

FIG. 3 is a diagram for explaining the process of finding the minimum-cost association by dynamic programming. Referring to FIG. 3, let the reproduction times of the partial waveform data in the audio data be $a_1, a_2, a_3, \ldots$. Similarly, let the reproduction times of the partial text data in the text data be $t_1, t_2, t_3, \ldots$. The reproduction time of a partial text data can be estimated from the reproduction time obtained when the text data is fed to a voice synthesizer and converted to speech.

The association between the audio data and the text data can then be represented by a single polyline, as shown in FIG. 3. Let the segments of this polyline be $S_1, S_2, S_3, \ldots$. Let the text data corresponding to $S_i$ be $t\{S_i\} = \{t_k, t_{k+1}, \ldots\}$ and the audio data corresponding to $S_i$ be $a\{S_i\} = \{a_l, a_{l+1}, \ldots\}$. In the example shown in FIG. 3, $t\{S_2\} = \{t_2, t_3\}$ and $a\{S_2\} = \{a_2\}$. The cost of the association is then

$$\sum_{i=1}^{n} \Bigl( \sum t\{S_i\} - \sum a\{S_i\} \Bigr)^2 .$$

The control unit 1 uses dynamic programming to find the association that minimizes this cost.

As a result, even when the divisions of the audio data and the divisions of the text data do not match exactly, the partial waveform data and the partial text data can be associated appropriately.
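A minimal sketch of such a dynamic-programming alignment is given below. It minimizes the cost stated in the description, the sum over polyline segments of the squared difference between the summed text durations and the summed audio durations. The function name `align`, the `max_group` limit on group size, and the tie-break that prefers more (finer) groups when costs are equal are assumptions added for illustration and are not taken from the patent. Indices are 0-based, whereas FIG. 3 counts from 1.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple

Group = Tuple[List[int], List[int]]  # (text indices, audio indices) of one segment S_i


def align(
    t: Sequence[float],
    a: Sequence[float],
    max_group: int = 3,  # hypothetical cap on how many items one S_i may merge
) -> Tuple[float, List[Group]]:
    """Group consecutive text and audio durations so that the summed
    squared duration mismatch over all groups is minimal."""
    n, m = len(t), len(a)
    pt = [0.0, *accumulate(t)]  # prefix sums of text durations
    pa = [0.0, *accumulate(a)]  # prefix sums of audio durations
    inf = float("inf")
    # best[i][j] = (cost, -number_of_groups) for the first i text and j audio items
    best = [[(inf, 0)] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for k in range(max(0, i - max_group), i):
                for q in range(max(0, j - max_group), j):
                    prev_cost, prev_neg = best[k][q]
                    if prev_cost == inf:
                        continue
                    diff = (pt[i] - pt[k]) - (pa[j] - pa[q])
                    cand = (prev_cost + diff * diff, prev_neg - 1)
                    if cand < best[i][j]:  # lower cost first, then more groups
                        best[i][j] = cand
                        back[i][j] = (k, q)
    if best[n][m][0] == inf:
        raise ValueError("no alignment exists within max_group")
    # Walk the back-pointers to recover the groups in order.
    groups: List[Group] = []
    i, j = n, m
    while (i, j) != (0, 0):
        k, q = back[i][j]
        groups.append((list(range(k, i)), list(range(q, j))))
        i, j = k, q
    groups.reverse()
    return best[n][m][0], groups
```

For text durations `[1.0, 0.5, 0.5, 1.0]` and audio durations `[1.0, 1.0, 1.0]`, the second and third text items are merged and paired with the second audio item, which mirrors the $t\{S_2\} = \{t_2, t_3\}$, $a\{S_2\} = \{a_2\}$ example.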

"Operation of the audio reproduction system 10 when the user gives a presentation"

FIG. 4 is a flowchart showing the operation of the audio reproduction system 10 when the user gives a presentation. FIG. 5 is a schematic diagram for explaining the operation of the audio reproduction system 10 when the user gives a presentation.

First, the control unit 1 causes the prompter unit 6 to display the text data corresponding to the first page (step 101). The control unit 1 may also display, on the screen of the display unit 4, the same data as the text data displayed on the prompter unit 6.

Next, the control unit 1 determines, by image analysis of the image captured by the imaging unit 5, whether the user's mouth has opened (step 102). If the user's mouth is closed (NO in step 102), the control unit 1 returns to step 102 and again determines whether the user's mouth has opened. Reproduction of the partial waveform data therefore does not start while the user's mouth is closed.

For example, when the user reads aloud the text data (one page) displayed on the prompter unit 6 (actually voicing the words is not necessary; mouthing them is sufficient), the control unit 1 determines that the user's mouth has opened (YES in step 102) and starts reproducing the partial waveform data (step 103). When reproduction of the partial waveform data starts, the fluent English read by the narrator is output from the speaker 8.
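The description says only that the open-mouth determination in step 102 is made by image analysis. One possible way to do this, shown purely as an assumption, is to take lip coordinates from some face-landmark detector and compare the lip gap with the mouth width. The `MouthLandmarks` type, the `mouth_is_open` function, and the `ratio_threshold` value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MouthLandmarks:
    """Pixel coordinates assumed to come from a face-landmark detector."""
    upper_lip_y: float
    lower_lip_y: float
    left_corner_x: float
    right_corner_x: float


def mouth_is_open(lm: MouthLandmarks, ratio_threshold: float = 0.25) -> bool:
    """Treat the mouth as open when the lip gap is large relative to its width."""
    width = abs(lm.right_corner_x - lm.left_corner_x)
    if width == 0:
        return False  # degenerate detection; treat as closed
    gap = abs(lm.lower_lip_y - lm.upper_lip_y)
    return gap / width > ratio_threshold
```

Normalizing by mouth width keeps the decision roughly independent of how far the user stands from the camera; the threshold would need tuning per setup.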

Next, the control unit 1 controls the display of the prompter unit 6 so that, in time with the reproduction of the partial waveform data, the partial text data corresponding to the currently reproduced partial waveform data is highlighted (step 104).

The user can thereby easily recognize which part of the text data is currently being reproduced by looking at the highlighted partial text data.

Examples of highlighting methods include blinking all or part of the partial text data and changing the color of all or part of the partial text data. Any highlighting method may be used.

After executing the highlighting process, the control unit 1 extracts face information by image analysis of the image captured by the imaging unit 5 and, based on the face information, controls the intonation of the audio data currently being reproduced (step 105).

In step 105, the control unit 1 extracts the distance between the eyes and the eyebrows as face information and controls the volume of the audio based on that distance.

FIG. 6 is a diagram showing changes in the distance between the eyes and the eyebrows. The control unit 1 controls the volume so that the audio becomes louder as the distance between the eyes and the eyebrows increases.

Also in step 105, the control unit 1 extracts the vertical orientation of the face as face information and controls the pitch of the audio based on that orientation.

FIG. 7 is a diagram showing changes in the vertical orientation of the face. When the face is oriented lower than the reference, the control unit 1 lowers the pitch of the audio. Conversely, when the face is oriented higher than the reference, the control unit 1 raises the pitch of the audio.
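The two mappings in step 105 can be sketched as simple clamped linear functions. The description fixes only the direction of each effect (larger eye-to-eyebrow distance gives louder audio; a face tilted up gives higher pitch, tilted down gives lower pitch). The linear form, the function names, and every numeric parameter below are assumptions for illustration.

```python
def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def volume_gain(
    brow_distance: float,
    neutral_distance: float,   # hypothetical calibration value for a relaxed face
    sensitivity: float = 2.0,  # hypothetical gain change per unit of relative change
    min_gain: float = 0.5,
    max_gain: float = 2.0,
) -> float:
    """Playback gain that grows with the eye-to-eyebrow distance."""
    if neutral_distance <= 0:
        raise ValueError("neutral_distance must be positive")
    relative = (brow_distance - neutral_distance) / neutral_distance
    return clamp(1.0 + sensitivity * relative, min_gain, max_gain)


def pitch_shift_semitones(
    face_pitch_deg: float,              # positive = face tilted up
    reference_deg: float = 0.0,         # the "reference" orientation
    semitones_per_degree: float = 0.2,  # hypothetical sensitivity
    max_shift: float = 4.0,
) -> float:
    """Pitch shift: positive above the reference, negative below it."""
    shift = (face_pitch_deg - reference_deg) * semitones_per_degree
    return clamp(shift, -max_shift, max_shift)
```

Clamping keeps an extreme or misdetected facial measurement from producing an unnaturally loud or high-pitched output.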

ステップ１０３で部分波形データの再生が開始されてから、その部分波形データの再生時間（部分波形データ毎に異なる）が経過すると、部分波形データの再生が終了する（ステップ１０６）。 When the reproduction time of the partial waveform data (different for each partial waveform data) has elapsed after the reproduction of the partial waveform data is started in step 103, the reproduction of the partial waveform data is finished (step 106).

部分波形データの再生が終了されると、制御部１は、次の部分波形データが存在するかどうかを判定する（ステップ１０７）。次の部分波形データが存在する場合（ステップ１０７のＹＥＳ）、制御部１は、プロンプタ部６上に現在表示されているページ内のテキストデータに対応する音声データの再生が完了したかどうかを判定する（ステップ１０８）。 When the reproduction of the partial waveform data is completed, the control unit 1 determines whether or not the next partial waveform data exists (step 107). When the next partial waveform data exists (YES in Step 107), the control unit 1 determines whether or not the reproduction of the audio data corresponding to the text data in the page currently displayed on the prompter unit 6 is completed. (Step 108).

If reproduction of the audio data corresponding to the text data in the page currently displayed on the prompter unit 6 has been completed (YES in step 108), the control unit 1 displays the text data corresponding to the next page on the prompter unit 6 (step 109). The control unit 1 then returns to step 102 and determines whether the user's mouth has opened.

On the other hand, if reproduction of the audio data corresponding to the text data in the page currently displayed on the prompter unit 6 has not been completed (NO in step 108), the control unit 1 returns to step 102 without switching pages and determines whether the user's mouth has opened.

That is, when reproduction of the partial waveform data ends, the control unit 1 again determines whether the user's mouth has opened. When it determines that the user's mouth has opened, the control unit 1 starts reproduction of the next partial waveform data (step 103). As a result, the partial waveform data in the audio data is reproduced sequentially, in step with the timing at which the user's mouth opens.

If it is determined in step 107 that no next partial waveform data exists (NO in step 107), the control unit 1 ends the process.
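The flow of steps 102 to 109 can be sketched as a small simulation. This is an illustrative model rather than the implementation described in the specification: the `Segment` type, the `run_playback` function, and the idea of feeding mouth-open observations as a list of booleans are all assumptions. Playback itself is reduced to a log entry, and each observation is assumed to be taken after the previous segment has finished.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str  # hypothetical label standing in for one piece of partial waveform data
    page: int  # prompter page whose text this segment belongs to


def run_playback(segments, mouth_samples):
    """Replay steps 102-109 over a sequence of mouth-open observations."""
    log = []
    samples = iter(mouth_samples)
    index = 0
    while index < len(segments):
        # Step 102: wait until the mouth is judged to be open.
        for is_open in samples:
            if is_open:
                break
        else:
            return log  # observations ran out before the next opening
        # Steps 103-106: reproduce the current segment to its end.
        segment = segments[index]
        log.append(f"play {segment.name}")
        index += 1
        # Step 107: end the process when no next segment exists.
        if index == len(segments):
            break
        # Steps 108-109: switch pages once the current page's audio is done.
        if segments[index].page != segment.page:
            log.append(f"show page {segments[index].page}")
    return log
```

For example, with two segments on page 0 and one on page 1, the observations `[False, True, False, False, True, True]` produce three playbacks with a page switch before the last one. While the mouth stays closed, nothing is played, which is what lets the user set the tempo.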

[Operation, etc.]
In the present embodiment, the audio data used for the presentation is stored in advance in the storage unit. When the presentation is actually given and the user's mouth opens, the partial waveform data in the audio data is reproduced sequentially, so the reproduction timing of the audio data coincides with the timing at which the mouth opens. As a result, even if the user is not fluent in the language being used, the audience can be made to feel as if the user is speaking naturally.

Here, as a comparative example, assume that the audio data is not divided into partial waveform data and is reproduced from beginning to end without interruption. In such a case, it is not impossible for the user to lip-sync to the reproduced audio data, but the tempo of the presentation is dictated by the tempo of the prerecorded audio data. The user therefore cannot give the presentation at the user's own tempo while watching the reaction of the audience.

In contrast, in the audio reproduction system 10 according to the present embodiment, the next partial waveform data is not reproduced while the user keeps the mouth closed. The user can therefore give the presentation at the user's own tempo while watching the reaction of the audience.

Furthermore, in the present embodiment, the inflection of the voice in the audio data currently being reproduced is controlled based on the face information, which makes it possible to add emotional expression to that audio data.

<Various modifications>
In the above description, the audio data is data obtained by a narrator reading the text data aloud. Alternatively, the audio data may be synthesized speech produced by speech synthesis. Like the narrated audio, the synthesized speech is divided into partial waveform data and stored in advance in the storage unit 2 in the preparatory stage before the presentation. When synthesized speech is used as the audio data, the process of associating the partial waveform data with the partial text data by dynamic programming is unnecessary.

When synthesized speech is used as the audio data, it tends to be more monotonous than audio data read aloud by a narrator, and it may also sound somewhat less fluent. However, executing the inflection control described above (see step 105) can eliminate the monotony and also improve the fluency.

The above description covers the case where a user who cannot speak English fluently uses the audio reproduction system 10 in a presentation. The audio reproduction system 10 according to the present technology can also be used in a presentation when the user cannot speak aloud for some reason (such as a hearing impairment or a vocal cord disorder).

The audio reproduction system may have a pause function and a skip function. For example, the user pauses reproduction of the audio data when a question is asked during the presentation. When the time left for the presentation runs short, the user skips the reproduction position of the audio data to an arbitrary position. In this case, when a pause button or a skip button provided on a remote controller (not shown) is operated, the control unit executes the pause or the skip based on the resulting operation signal.
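One way the pause and skip functions could interact with mouth-triggered playback is sketched below. The `PlaybackCursor` class and its method names are hypothetical; the specification only states that the control unit acts on operation signals from a remote controller.

```python
class PlaybackCursor:
    """Tracks which piece of partial waveform data is reproduced next."""

    def __init__(self, segment_count):
        self.segment_count = segment_count
        self.index = 0
        self.paused = False

    def toggle_pause(self):
        # Assumed behavior: one button press pauses, the next one resumes.
        self.paused = not self.paused

    def skip_to(self, index):
        # Move the reproduction position to an arbitrary segment.
        if not 0 <= index < self.segment_count:
            raise IndexError("segment index out of range")
        self.index = index

    def on_mouth_open(self):
        """Return the index of the segment to play, or None if nothing plays."""
        if self.paused or self.index >= self.segment_count:
            return None
        current = self.index
        self.index += 1
        return current
```

While paused, a mouth opening is ignored, so the user can answer a question in their own voice without triggering the next segment.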

The above description covers the use of the audio reproduction system 10 in a presentation, but the system can also be used for other purposes, such as theater. Furthermore, the audio reproduction system 10 can be given entertainment value by using, as the audio data, the voices of characters in animations or movies, or the voices of actors. In this case, when the user moves the mouth, the voice of the character or actor is output from the speaker 8. A woman can also be made to appear to speak in a man's voice (and vice versa), and a child can be made to appear to speak in an adult's voice (and vice versa).

Furthermore, the audio reproduction system 10 can be used for learning. For example, the text data of an English textbook is displayed on the prompter unit 6 page by page. When the user reads aloud the English displayed on the prompter unit 6, the partial waveform data in the audio data is reproduced sequentially, in step with the timing at which the mouth opens. Fluent model English is thus output from the speaker 8. By comparing their own pronunciation with the fluent model pronunciation, the user can correct mistakes and learn accurate English pronunciation.

The above description covers the case where audio is reproduced in one language. However, audio may be reproduced in two or more languages. In this case, for example, audio data in a different language is output on each of two or more channels and received by earphones. The earphones are configured so that the channel to be received can be selected. By selecting the channel corresponding to the desired language, an audience member can listen to the presentation or play in that language through the earphones.
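A minimal sketch of the multi-language variant follows. It assumes that every language track has been divided into the same number of segments, so that one mouth opening triggers the segment at the same position on every channel; the specification does not state this, and the `segments_for_channels` function and the channel keys are hypothetical.

```python
def segments_for_channels(tracks, index):
    """Pick the segment at position `index` from every language track.

    `tracks` maps a channel identifier to that language's list of segments.
    """
    # Assumption: tracks are segmented in parallel, one segment per utterance.
    lengths = {len(segments) for segments in tracks.values()}
    if len(lengths) != 1:
        raise ValueError("all language tracks must have the same number of segments")
    return {channel: segments[index] for channel, segments in tracks.items()}
```

For instance, `segments_for_channels({"en": ["hello", "thanks"], "fr": ["bonjour", "merci"]}, 1)` returns `{"en": "thanks", "fr": "merci"}`, where the strings stand in for waveform segments.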

In addition to the text data, the audio data may be displayed on the prompter unit 6. For example, when the user gives a presentation or performs a play in a language of which the user has almost no knowledge, viewing the waveform of the audio data gives the user a cue for how to move the mouth.

The present technology can also employ the following configurations.
(1) An audio reproduction system including:
a storage unit that stores in advance audio data including a plurality of pieces of partial waveform data;
an imaging unit capable of imaging a user's mouth; and
a control unit that determines, based on an image captured by the imaging unit, that the user's mouth has opened, and controls the timing of reproduction of the audio data so that the partial waveform data is reproduced sequentially in response to the user's mouth opening.
(2) The audio reproduction system according to (1) above, in which
the imaging unit is capable of imaging the user's face, and
the control unit extracts face information based on the captured image and, based on the face information, controls the inflection of the voice in the audio data currently being reproduced.
(3) The audio reproduction system according to (2) above, in which
the control unit extracts the distance between the eyes and the eyebrows as the face information and controls the volume of the voice based on the distance between the eyes and the eyebrows.
(4) The audio reproduction system according to (2) or (3) above, in which
the control unit extracts the vertical orientation of the face as the face information and controls the pitch of the voice based on the vertical orientation of the face.
(5) The audio reproduction system according to any one of (1) to (4) above, further including
a prompter unit, in which
the storage unit further stores text data corresponding to the audio data, and
the control unit controls the display of the prompter unit so that the text data is displayed on the prompter unit.
(6) The audio reproduction system according to (5) above, in which
the text data includes a plurality of pieces of partial text data,
the storage unit stores the partial waveform data in association with the partial text data corresponding to the partial waveform data, and
the control unit controls the display of the prompter unit so that, in step with the reproduction timing of the partial waveform data, the partial text data corresponding to the partial waveform data currently being reproduced is highlighted.
(7) The audio reproduction system according to (6) above, in which
the control unit executes a process of associating the partial waveform data with the partial text data so that the partial waveform data and the partial text data corresponding to the partial waveform data are stored in the storage unit in association with each other.
(8) The audio reproduction system according to any one of (1) to (7) above, in which
the storage unit stores, as the audio data, data obtained by a narrator reading the text data aloud.
(9) The audio reproduction system according to any one of (1) to (7) above, in which
the storage unit stores, as the audio data, synthesized speech produced by speech synthesis.

DESCRIPTION OF SYMBOLS
1…Control unit
2…Storage unit
3…Input unit
4…Display unit
5…Imaging unit
6…Prompter unit
7…Microphone
8…Speaker
10…Audio reproduction system

Claims

An audio reproduction system comprising:
a storage unit that stores in advance audio data including a plurality of pieces of partial waveform data;
an imaging unit capable of imaging a user's mouth; and
a control unit that determines, based on an image captured by the imaging unit, that the user's mouth has opened, and controls the timing of reproduction of the audio data so that the partial waveform data is reproduced sequentially in response to the user's mouth opening.

The audio reproduction system according to claim 1, wherein
the imaging unit is capable of imaging the user's face, and
the control unit extracts face information based on the captured image and, based on the face information, controls the inflection of the voice in the audio data currently being reproduced.

The audio reproduction system according to claim 2, wherein
the control unit extracts the distance between the eyes and the eyebrows as the face information and controls the volume of the voice based on the distance between the eyes and the eyebrows.

The audio reproduction system according to claim 2, wherein
the control unit extracts the vertical orientation of the face as the face information and controls the pitch of the voice based on the vertical orientation of the face.

The audio reproduction system according to claim 1, further comprising
a prompter unit, wherein
the storage unit further stores text data corresponding to the audio data, and
the control unit controls the display of the prompter unit so that the text data is displayed on the prompter unit.

The audio reproduction system according to claim 5, wherein
the text data includes a plurality of pieces of partial text data,
the storage unit stores the partial waveform data in association with the partial text data corresponding to the partial waveform data, and
the control unit controls the display of the prompter unit so that, in step with the reproduction timing of the partial waveform data, the partial text data corresponding to the partial waveform data currently being reproduced is highlighted.

The audio reproduction system according to claim 6, wherein
the control unit executes a process of associating the partial waveform data with the partial text data so that the partial waveform data and the partial text data corresponding to the partial waveform data are stored in the storage unit in association with each other.

The audio reproduction system according to claim 1, wherein
the storage unit stores, as the audio data, data obtained by a narrator reading text data aloud.

An audio reproduction method comprising:
storing in advance audio data including a plurality of pieces of partial waveform data;
imaging a user's mouth;
determining, based on the captured image, that the user's mouth has opened; and
controlling the timing of reproduction of the audio data so that the partial waveform data is reproduced sequentially in response to the user's mouth opening.

A program that causes an audio reproduction system to execute:
a step of storing in advance audio data including a plurality of pieces of partial waveform data;
a step of imaging a user's mouth;
a step of determining, based on the captured image, that the user's mouth has opened; and
a step of controlling the timing of reproduction of the audio data so that the partial waveform data is reproduced sequentially in response to the user's mouth opening.