JP2016080908A

JP2016080908A - Signal processing device

Info

Publication number: JP2016080908A
Application number: JP2014212921A
Authority: JP
Inventors: Hayato Oshita; Yoshitaka Uratani; Masashi Yoshida; Hiroshi Kayama; Yoshifumi Mizuno; Yu Takahashi; Kazunobu Kondo
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2014-10-17
Filing date: 2014-10-17
Publication date: 2016-05-16

Abstract

PROBLEM TO BE SOLVED: To allow a sound to be processed visually while preventing the synchronization between a video and the sound from being lost. SOLUTION: A video operation input unit 104 is operated by the user of a karaoke apparatus 100, generates operation content data corresponding to the operation, and supplies the operation content data to an audio processing parameter generation unit 105 and a video processing parameter generation unit 106. The audio processing parameter generation unit 105 generates an audio processing parameter from the operation content data supplied by the video operation input unit 104 and supplies it to an audio processing unit 102, where it controls the audio processing. Likewise, the video processing parameter generation unit 106 generates a video processing parameter from the same operation content data and supplies it to a video processing unit 108, where it controls the video processing. SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for processing signals that represent video and sound reproduced in synchronization with each other.

With the development of digital technology, it has become common for general users to create and publish content consisting of video and sound that are reproduced in synchronization with each other, and to apply further processing to one of the video and the sound that make up such content. Videos posted on video-sharing sites are an example of the former, and karaoke apparatuses are an example of the latter. Among karaoke apparatuses in particular, those offering a rich variety of sound processing and video processing have become widespread in recent years.

One example of sound processing in a karaoke apparatus is frequency conversion, of which key conversion is a specific case, applied to a sound signal representing the waveform of the accompaniment sound. For example, the user of a karaoke apparatus with a key conversion function can adjust the key of the accompaniment by moving a slider on the remote controller of the apparatus, or a virtual slider displayed on a display device attached to the apparatus, up or down, and can thus sing a karaoke song to an accompaniment that suits the user's own key.

A specific example of video processing in a karaoke apparatus, on the other hand, is processing applied to the various video content displayed on the display device as the karaoke song progresses (for example, a background video that matches the image of the song, or a lyrics subtitle video superimposed on that background video). Specific examples of karaoke apparatuses having this kind of video processing function include those disclosed in Patent Documents 1, 2, and 3. The video processing in the karaoke apparatuses disclosed in these documents is as follows.

Patent Document 1 discloses video processing that composites a video of the singer, in real time, with the background video and lyrics subtitle video displayed on the display device. Patent Document 2 discloses video processing that uses video from a digital camera as the background video and composites it with the lyrics subtitle video.

Patent Document 3 discloses video processing in which an avatar representing the singer is processed, for example by superimposing an image of an accessory chosen by the singer, and is then composited with the background video. An avatar is a character modeled on a human being.

Patent Document 1: JP 2001-045373 A
Patent Document 2: JP 2009-020234 A
Patent Document 3: Japanese Patent No. 5018797

The karaoke apparatuses disclosed in Patent Documents 1, 2, and 3 take video processing into account, but not sound processing that corresponds to that video processing. The video reproduced by a karaoke apparatus is inherently reproduced in synchronization with sound, and it is undesirable for that synchronization to be lost because only the video has been processed.

It is equally undesirable for the synchronization to be lost because only the audio, out of video and audio reproduced in synchronization, has been processed, but conventional sound processing paid no attention to this. In addition, conventional sound processing allowed coarse processing, such as raising or lowering the volume of a sound signal, to be performed by direct operation, but did not allow fine-grained processing such as changing the frequency distribution of the sound signal. The former can be done without visually grasping the characteristics of the sound signal (for example, its frequency distribution), whereas the latter is difficult without doing so. A further problem with conventional sound processing was that the fine changes in sound produced by the processing could not be grasped visually.

The present invention has been made in view of the above circumstances. Its object is to provide a technique with which the synchronization between video and sound that are reproduced in synchronization is not lost even when at least one of them is processed, and with which the user can process the sound visually.

To solve the above problem, the present invention provides a signal processing device comprising: a parameter generation unit that generates, in accordance with an input sound signal, a video processing parameter representing the processing to be applied to a video displayed on a display device; and a video processing unit that applies processing based on the video processing parameter to video data representing a video to be displayed on the display device in synchronization with the sound represented by the input sound signal, and supplies the processed video data to the display device. According to this invention, the video processing parameter for processing the video displayed on the display device in synchronization with the sound reproduced from the input sound signal is generated in accordance with that same input sound signal. The synchronization between video and sound is therefore not lost.

In a more preferred aspect, the signal processing device comprises: a video generation unit that generates, based on the input sound signal, the video data to be processed by the video processing unit; a video operation input unit to which operations on the video displayed on the display device are input; and a sound processing unit that applies processing based on a sound processing parameter to the input sound signal and outputs the result. Instead of generating the video processing parameter in accordance with the input sound signal, the parameter generation unit either generates the video processing parameter in accordance with the operation input to the video operation input unit, or generates it in accordance with both the input sound signal and that operation. The parameter generation unit also generates the sound processing parameter in accordance with the operation input to the video operation input unit.

For example, in an aspect in which the parameter generation unit generates the video processing parameter in accordance with an operation on the video operation input unit, when an operation for processing the video displayed on the display device is performed, a video processing parameter representing that processing is generated in accordance with the operation, and a sound processing parameter representing the processing to be applied to the sound output in synchronization with the video is generated in accordance with the same operation. Processing based on the video processing parameter is then applied to the video data representing the displayed video, and processing based on the sound processing parameter is applied to the input sound signal representing the output sound. Consequently, if the video processing parameter and the sound processing parameter generated for a given operation are kept consistent with each other (for example, a pinch-out operation yields a video processing parameter that enlarges the video together with a sound processing parameter that raises the volume, while a pinch-in operation yields a video processing parameter that reduces the video together with a sound processing parameter that lowers the volume), the synchronization between video and sound is not lost even when the video is processed, and the sound can be processed visually.
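The pairing of parameters described here can be illustrated with a minimal Python sketch. It is not taken from the patent: the names `VideoParam`, `SoundParam`, and `generate_params`, and the choice of using one shared multiplier for both image scale and gain, are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VideoParam:
    scale: float  # multiplier applied to the displayed video size (hypothetical)


@dataclass(frozen=True)
class SoundParam:
    gain: float  # multiplier applied to the sound amplitude (hypothetical)


def generate_params(operation: str, ratio: float) -> Tuple[VideoParam, SoundParam]:
    """Derive both parameters from one operation so that they stay consistent.

    `operation` is "pinch_out" or "pinch_in"; `ratio` (>= 1.0) expresses how
    strongly the fingers moved. Both encodings are assumptions.
    """
    if ratio < 1.0:
        raise ValueError("ratio must be >= 1.0")
    if operation == "pinch_out":
        factor = ratio          # enlarge the video and raise the volume
    elif operation == "pinch_in":
        factor = 1.0 / ratio    # reduce the video and lower the volume
    else:
        raise ValueError(f"unsupported operation: {operation}")
    return VideoParam(scale=factor), SoundParam(gain=factor)
```

Because both values come from the same `factor`, a single gesture cannot enlarge the picture while leaving the sound unchanged, which is the consistency the paragraph asks for.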

In a preferred aspect, the input sound signal includes a sound signal representing the user's singing voice, the video data is data representing the parts of an avatar modeled on the user, and a different acoustic effect is assigned to each part of the avatar. For each part of the avatar operated on through the video operation input unit, the parameter generation unit generates a video processing parameter representing the processing to be applied to that part, and also generates a sound processing parameter that adjusts the acoustic effect corresponding to the operated part. In this aspect, the sound includes the user's singing voice. Furthermore, the video displayed on the display device is an avatar modeled on the user, and operating on a part of the avatar applies to the sound the acoustic effect assigned to that part. Therefore, the synchronization between video and sound is not lost even when the video of an avatar part is processed, and sound processing that applies the acoustic effect corresponding to that part can be performed visually.
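A small Python sketch may make the per-part assignment concrete. The specific part names, the effects assigned to them, and the dictionary layout of the parameters are all hypothetical; the text above only states that each avatar part has a different acoustic effect.

```python
from typing import Dict, Tuple

# Hypothetical assignment of one acoustic effect to each avatar part.
PART_TO_EFFECT: Dict[str, str] = {
    "head": "reverb",
    "torso": "compressor",
    "limbs": "delay",
}


def params_for_part(part: str, amount: float) -> Tuple[dict, dict]:
    """Return (video parameter, sound parameter) for an operation on one part.

    `amount` is an assumed signed quantity describing how far the part was
    stretched (+) or shrunk (-) by the user's operation.
    """
    if part not in PART_TO_EFFECT:
        raise KeyError(f"unknown avatar part: {part}")
    video_param = {"part": part, "scale_change": amount}
    sound_param = {"effect": PART_TO_EFFECT[part], "depth_change": amount}
    return video_param, sound_param
```

For instance, `params_for_part("head", 0.25)` would yield a video parameter for the head and a sound parameter that deepens the effect assigned to the head by the same amount.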

In another preferred aspect, the invention provides a program that causes a computer to execute: parameter generation processing that generates, in accordance with an input sound signal, a video processing parameter representing the processing to be applied to a video displayed on a display device; and video processing that applies processing based on the video processing parameter to video data representing a video to be displayed on the display device in synchronization with the sound represented by the input sound signal, and supplies the processed video data to the display device. In this aspect as well, the video processing parameter for processing the video displayed on the display device in synchronization with the sound reproduced from the input sound signal is generated in accordance with that same input sound signal. The synchronization between video and sound is therefore not lost.

FIG. 1 is a block diagram showing the configuration of a karaoke apparatus 100, which is a first embodiment of the signal processing device of the present invention.
FIG. 2 is a conceptual diagram of the audio processing table stored in the karaoke apparatus 100.
FIG. 3 is a conceptual diagram of the video processing table stored in the karaoke apparatus 100.
FIG. 4 is a conceptual diagram illustrating the reverb change table stored in the karaoke apparatus 100.
FIG. 5 is a conceptual diagram illustrating the background video change table stored in the karaoke apparatus 100.
FIG. 6 is a flowchart showing the flow of processing in which the audio processing parameter generation unit 105 generates an audio processing parameter.
FIG. 7 is a flowchart showing the flow of processing in which the video processing parameter generation unit 106 generates a video processing parameter.
FIG. 8 is a diagram illustrating a video displayed on the video operation input unit 104 of the karaoke apparatus 100.
FIG. 9 is a diagram illustrating the video displayed after a pinch-out operation is performed on the video operation input unit 104.
FIG. 10 is a diagram illustrating the video displayed after a touch operation is performed on the video operation input unit 104.
FIG. 11 is a flowchart showing the flow of processing in which the audio processing parameter generation unit 105 generates an audio processing parameter.
FIG. 12 is a flowchart showing the flow of processing in which the video processing parameter generation unit 106 generates a video processing parameter.
FIG. 13 is a conceptual diagram of the avatar part processing table stored in a karaoke apparatus that is a second embodiment of the present invention.
FIG. 14 is a conceptual diagram of the compressor processing table stored in the same karaoke apparatus.
FIG. 15 is a block diagram showing the configuration of a karaoke apparatus 200, which is a third embodiment of the present invention.
FIG. 16 is a flowchart showing the flow of processing in which the feature analysis unit 204 generates a video processing parameter.

Embodiments of the present invention will be described below with reference to the drawings.
<First Embodiment>
(A: Configuration)
FIG. 1 is a block diagram showing the configuration of a karaoke apparatus 100, which is a first embodiment of the signal processing device of the present invention. The karaoke apparatus 100 picks up the user's singing voice and outputs it together with the accompaniment sound to a sound emitting device such as a speaker, and also composites an avatar, a character modeled on the user, with a background video and displays the result on a display device. As shown in FIG. 1, the karaoke apparatus 100 has an audio input unit 101, an audio processing unit 102, an audio output unit 103, a video operation input unit 104, an audio processing parameter generation unit 105, a video processing parameter generation unit 106, a video generation unit 107, a video processing unit 108, a background video acquisition unit 109, and a video output unit 110. Of these, the audio processing unit 102, the audio processing parameter generation unit 105, the video processing parameter generation unit 106, the video generation unit 107, and the video processing unit 108 are software modules realized by the CPU (Central Processing Unit; not shown in FIG. 1) of the karaoke apparatus 100 in accordance with a control program (not shown in FIG. 1) stored in advance in the karaoke apparatus 100.

The background video acquisition unit 109 is connected to a host computer via a network. The host computer distributes to the background video acquisition unit 109 accompaniment sound data representing the accompaniment sound of the karaoke song selected by the user, together with background video data. Background video data is data representing video to be displayed on the display device in synchronization with the accompaniment sound. In this embodiment, the background video data consists of a plurality of video data items, each representing a different video; for example, it may contain video data representing a music hall and video data representing a bathroom. The background video acquisition unit 109 receives the background video data and the accompaniment sound data distributed from the host computer, supplies the background video data to the video processing unit 108, and supplies the accompaniment sound data to the audio processing unit 102. The signal line connecting the background video acquisition unit 109 to the audio processing unit 102 is omitted from FIG. 1.

The audio input unit 101 is, for example, a microphone. As shown in FIG. 1, the karaoke apparatus 100 of this embodiment has a plurality of audio input units 101. Each audio input unit 101 picks up the voice of the singer holding it and supplies an audio signal representing the waveform of the picked-up voice to the audio processing unit 102. In this embodiment, when the karaoke apparatus 100 is used by several singers, each singer sings with a different audio input unit 101. The number of audio input units 101 must therefore be equal to or greater than the number of singers. An audio input unit 101 that no singer is using picks up no voice and generates no audio signal.

The audio processing unit 102 processes the audio signals supplied from the audio input units 101 and the background video acquisition unit 109, and distributes the processed audio signals among the plurality of audio output units 103. The audio output unit 103 is, for example, a speaker, and emits the sound represented by the audio signal supplied from the audio processing unit 102. In this embodiment, the number of audio input units 101 differs from the number of audio output units 103 (N and T in FIG. 1 are not equal), but the two numbers may be the same. Also, although this embodiment provides a plurality of audio input units 101 and a plurality of audio output units 103, there may be just one of each.

Examples of the audio signal processing performed by the audio processing unit 102 include raising or lowering the volume of the audio signal supplied from an audio input unit 101, adding an acoustic effect such as reverb to the audio signal, and localizing the sound image by adjusting the output balance among the plurality of audio output units 103. This processing is performed in accordance with the audio processing parameters generated by the audio processing parameter generation unit 105.
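Two of these operations, volume change and sound image localization by output balance, can be sketched in Python as follows. The sketch assumes only two audio output units and uses a constant-power pan law; the patent does not specify either choice, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def pan_gains(position: float) -> Tuple[float, float]:
    """Constant-power gains for two outputs (an assumed pan law).

    position 0.0 places the sound image fully at the first output,
    1.0 fully at the second output.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be within [0.0, 1.0]")
    angle = position * math.pi / 2
    return math.cos(angle), math.sin(angle)


def apply_volume_and_pan(
    samples: List[float], volume: float, position: float
) -> Tuple[List[float], List[float]]:
    """Scale the signal by `volume`, then split it across two outputs."""
    first_gain, second_gain = pan_gains(position)
    first = [s * volume * first_gain for s in samples]
    second = [s * volume * second_gain for s in samples]
    return first, second
```

With more than two audio output units, the same idea would apply pairwise between adjacent outputs; that extension is left out to keep the sketch short.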

The audio signal generated by an audio input unit 101 is also supplied to the video generation unit 107. From a plurality of avatar data items stored in a storage device (not shown in FIG. 1), the video generation unit 107 reads the avatar data corresponding to each supplied audio signal and outputs it to the video processing unit 108. More specifically, each avatar data item is stored in the storage device in association with a distinct audio signal identifier. The audio signal identifiers are the numbers 1 to N assigned to the audio input units 101 in FIG. 1. In other words, the number of audio input units 101 provided in the karaoke apparatus 100 equals the number of avatar data items stored in the storage device. When an audio signal is supplied from an audio input unit 101, the video generation unit 107 reads from the storage device the avatar data associated with the audio signal identifier of that audio input unit 101 and supplies it to the video processing unit 108.
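The lookup performed by the video generation unit 107 can be pictured with the following Python sketch. The in-memory dictionary standing in for the storage device, the avatar names, and the function name are assumptions for illustration only.

```python
from typing import Dict, List

# Stand-in for the storage device: audio signal identifier (1..N) -> avatar data.
# The avatar names are placeholders.
AVATAR_STORE: Dict[int, str] = {1: "avatar_A", 2: "avatar_B", 3: "avatar_C"}


def avatars_to_display(active_signals: Dict[int, List[float]]) -> Dict[int, str]:
    """Return avatar data only for the inputs that currently supply a signal.

    `active_signals` maps an audio signal identifier to the samples received
    from that input; inputs that are not in use are simply absent.
    """
    return {ident: AVATAR_STORE[ident] for ident in sorted(active_signals)}
```

An input that supplies no signal has no entry in `active_signals`, so no avatar data is produced for it.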

The audio signal identifier stored in the storage device in association with avatar data is not limited to the number assigned to an audio input unit 101; it may instead be waveform data of the audio signal or feature data representing characteristics of the audio signal. When waveform data or feature data is used as the audio signal identifier, the video generation unit 107, on receiving an audio signal from an audio input unit 101, determines the waveform or features of that signal and reads from the storage device the avatar data associated with the audio signal identifier indicating that waveform or those features. In this aspect, even if a singer switches to a different microphone partway through a song, the same avatar is displayed on the display device after the switch as before it.

The avatar data associated with an audio signal identifier may be selected, from avatar data prepared in advance, by the user of the audio input unit 101 indicated by that identifier according to the user's own preference. Alternatively, it may be assembled from data representing individual avatar parts prepared in advance (for example, the avatar's face, torso, and limbs) that the user has selected according to preference. The selection or generation of avatar data may be performed on the karaoke apparatus 100, or it may be performed on a device other than the karaoke apparatus 100, such as a personal computer, with the result then stored in the karaoke apparatus 100.

The video processing unit 108 processes both the avatar data supplied from the video generation unit 107 and one video data item selected from the plurality of video data items contained in the background video data supplied from the background video acquisition unit 109, composites the two, and supplies the result to the video output unit 110 as output video data. The video output unit 110 is a display device such as a monitor, and displays the video represented by the output video data. The selection, processing, and compositing in the video processing unit 108 are performed in accordance with the video processing parameters generated by the video processing parameter generation unit 106. Examples of processing performed in accordance with the video processing parameters include enlarging or reducing the avatar, changing the avatar's display position, and changing the background video. After a singer has selected a song on the karaoke apparatus 100 and the accompaniment has started, but before the singer has begun to sing, the video output unit 110 displays only the background video, without an avatar. This is because, until the singer starts singing, no audio signal is supplied from the audio input unit 101 to the video generation unit 107, so the video generation unit 107 does not output avatar data to the video processing unit 108. Once the singer starts singing, the video processing unit 108 composites the avatar data and the background video data so that the avatar is displayed at a predetermined size in the center of the background video until a video processing parameter is supplied.
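The default placement and its later adjustment can be sketched in Python as a computation of the rectangle in which the avatar is drawn. The parameter keys `scale` and `center`, the pixel units, and the function name are hypothetical; the patent only states that the avatar appears centered at a predetermined size until a video processing parameter arrives.

```python
from typing import Optional, Tuple


def avatar_rect(
    bg_w: int, bg_h: int, avatar_w: int, avatar_h: int,
    param: Optional[dict] = None,
) -> Tuple[int, int, int, int]:
    """Return (left, top, width, height) of the avatar on the background.

    With no parameter, the avatar keeps its predetermined size and is centered.
    An assumed parameter may carry "scale" (size multiplier) and/or "center"
    (x, y) to enlarge, reduce, or move the avatar.
    """
    param = param or {}
    scale = param.get("scale", 1.0)
    cx, cy = param.get("center", (bg_w // 2, bg_h // 2))
    w, h = round(avatar_w * scale), round(avatar_h * scale)
    return cx - w // 2, cy - h // 2, w, h
```

Calling `avatar_rect(1920, 1080, 200, 400)` gives the centered default, and passing `{"scale": 2.0}` doubles the avatar while keeping it centered.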

The video operation input unit 104 is a touch panel provided so as to cover the entire display surface of the video output unit 110. The video operation input unit 104 is operated by the user of the karaoke apparatus 100, generates operation content data corresponding to the operation, and supplies it to the audio processing parameter generation unit 105 and the video processing parameter generation unit 106. In this embodiment the entire display surface of the video output unit 110 serves as the video operation input unit 104, but only part of the display surface may do so. The video operation input unit 104 may also be part of a device separate from the video output unit 110 (for example, the remote controller of the karaoke apparatus 100). When the video operation input unit 104 is part of the remote controller, however, the video processing unit 108 outputs video to the remote controller as well as to the video output unit 110. In that case, the video output unit 110 and the video operation input unit 104 display the same video.

Specific examples of operations performed on the video operation input unit 104 include pinch-in and pinch-out operations (hereinafter, pinch-in/out operations) and touch operations. A pinch-in operation is an operation of touching the video operation input unit 104 with two fingers and moving the fingers so as to narrow the distance between them. A pinch-out operation is an operation of touching in the same way and moving the fingers so as to widen the distance between them. When a pinch-in/out operation is performed, the video operation input unit 104 generates operation content data indicating that operation and supplies it to the audio processing parameter generation unit 105 and the video processing parameter generation unit 106. This operation content data contains, for each of the two fingers involved, data representing the trajectory from the coordinates at which the touch began to the coordinates at which the finger was lifted (coordinates take, for example, the upper-left corner of the display area of the video operation input unit 104 as the origin; the same applies below).
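As a rough illustration of how such trajectory data could be interpreted, the Python sketch below compares the finger spacing at touch-down with the spacing at lift-off. The list-of-tuples trajectory format, the return values, and the function name are assumptions, not the patent's data format.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y), origin assumed at the upper-left corner


def classify_pinch(traj_a: List[Point], traj_b: List[Point]) -> Tuple[str, float]:
    """Classify a two-finger gesture from the two finger trajectories.

    Returns the operation name and the ratio of the final finger spacing to
    the initial spacing (greater than 1 when the fingers moved apart).
    """
    start = math.dist(traj_a[0], traj_b[0])
    end = math.dist(traj_a[-1], traj_b[-1])
    ratio = end / start if start > 0 else math.inf
    if end > start:
        return "pinch_out", ratio
    if end < start:
        return "pinch_in", ratio
    return "none", 1.0
```

The ratio returned here is the kind of quantity from which an enlargement factor or a volume change could later be derived.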

The touch operation is an operation in which the user touches the video operation input unit 104 with one finger, traces across the screen of the video operation input unit 104 while keeping the finger in contact, and releases the finger from the screen at an arbitrary position. In response to this touch operation, the video operation input unit 104 generates operation content data indicating the touch operation (that is, data representing the trajectory from the coordinates at which the touch started to the coordinates at which the finger was released) and supplies the operation content data to the audio processing parameter generation unit 105 and the video processing parameter generation unit 106. In the present embodiment, a touch operation performed on the avatar is distinguished from a touch operation performed on the background video, and the latter in particular is referred to as a “flick operation” or a “swipe operation”.
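The operation content data described above can be pictured as one trajectory per finger. The following is a minimal sketch only: the `OperationContent` class and its method names are hypothetical, since the description does not define a concrete data format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x, y) with the origin at the upper left corner of the display area
Point = Tuple[float, float]


@dataclass(frozen=True)
class OperationContent:
    """Hypothetical container: one trajectory per finger, touch-down to release."""
    trajectories: Tuple[Tuple[Point, ...], ...]

    @property
    def finger_count(self) -> int:
        return len(self.trajectories)

    def start_points(self) -> List[Point]:
        # Coordinates at which each finger started touching
        return [t[0] for t in self.trajectories]

    def end_points(self) -> List[Point]:
        # Coordinates at which each finger was released
        return [t[-1] for t in self.trajectories]


# Two trajectories: a pinch-out whose fingers move apart horizontally
pinch = OperationContent((((100, 100), (80, 100)), ((200, 100), (220, 100))))
# One trajectory: a touch operation traced across the screen
touch = OperationContent((((50, 50), (60, 70), (90, 120)),))
```

With this shape, the number of trajectories alone is enough to tell a pinch-in/out operation from a touch operation, which is the criterion the later flowchart steps rely on.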

The audio processing parameter generation unit 105 generates audio processing parameters from the operation content data supplied from the video operation input unit 104 and supplies them to the audio processing unit 102. The audio processing parameter generation unit 105 stores in advance the audio processing table shown in FIG. 2 and the reverb change table shown in FIG. 4. As shown in FIG. 2, the audio processing table stores, in association with each of the various operations performed on the video operation input unit 104, data representing the processing to be applied to the audio when that operation is performed. As shown in FIG. 4, the reverb change table stores, in association with an identifier (a number in FIG. 4) indicating a type of reverb to be applied to the audio, data indicating the content of that reverb. The audio processing parameter generation unit 105 also stores the identifier of the reverb currently applied. From the operation content indicated by the operation content data supplied from the video operation input unit 104 and the contents of the audio processing table, the audio processing parameter generation unit 105 generates an audio processing parameter representing the processing corresponding to that operation content. If the operation is a flick operation or a swipe operation, the audio processing parameter generation unit 105 additionally refers to the reverb change table to generate the audio processing parameter.

The video processing parameter generation unit 106 generates video processing parameters from the operation content data supplied from the video operation input unit 104 and supplies them to the video processing unit 108. The video processing parameter generation unit 106 stores in advance the video processing table shown in FIG. 3 and the background video change table shown in FIG. 5. As shown in FIG. 3, the video processing table stores, in association with each of the various operations performed on the video operation input unit 104, data representing the processing to be applied to the video when that operation is performed. As shown in FIG. 5, the background video change table stores, in association with an identifier (a number in FIG. 5) indicating a type of background video, data indicating the content of that background video. The video processing parameter generation unit 106 also stores the identifier of the background video currently displayed. A reverb and a background video that have the same number in FIG. 4 and FIG. 5 correspond to each other. From the operation content indicated by the operation content data supplied from the video operation input unit 104 and the contents of the video processing table, the video processing parameter generation unit 106 generates a video processing parameter representing the processing corresponding to that operation content. If the operation is a flick operation or a swipe operation, the video processing parameter generation unit 106 additionally refers to the background video change table to generate the video processing parameter.
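The four tables can be thought of as lookups that share keys: the audio and video processing tables are keyed by the same operations, and the reverb and background video change tables are keyed by the same numbers. The sketch below is illustrative only. FIG. 2 to FIG. 5 are not reproduced in the text, so the table contents, the `Operation` enum, and the `lookup` helper are assumptions based on the processing described elsewhere in this section.

```python
from enum import Enum
from typing import Dict, Tuple


class Operation(Enum):
    PINCH = "pinch-in/out on avatar"
    AVATAR_TOUCH = "touch on avatar"
    FLICK_OR_SWIPE = "flick/swipe on background"


# Hypothetical stand-ins for the tables of FIG. 2 and FIG. 3
AUDIO_PROCESSING_TABLE: Dict[Operation, str] = {
    Operation.PINCH: "change volume",
    Operation.AVATAR_TOUCH: "move sound image",
    Operation.FLICK_OR_SWIPE: "change reverb",
}
VIDEO_PROCESSING_TABLE: Dict[Operation, str] = {
    Operation.PINCH: "resize avatar",
    Operation.AVATAR_TOUCH: "move avatar",
    Operation.FLICK_OR_SWIPE: "change background",
}

# Hypothetical stand-ins for FIG. 4 and FIG. 5; equal numbers denote a matching pair
REVERB_CHANGE_TABLE: Dict[int, str] = {1: "no reverb", 2: "music hall", 3: "bathroom"}
BACKGROUND_CHANGE_TABLE: Dict[int, str] = {1: "initial background", 2: "music hall", 3: "bathroom"}


def lookup(op: Operation) -> Tuple[str, str]:
    """Return the (audio, video) processing associated with one operation."""
    return AUDIO_PROCESSING_TABLE[op], VIDEO_PROCESSING_TABLE[op]
```

Because one operation indexes both processing tables, a single user gesture always yields a matching pair of audio and video processing.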

FIG. 6 is a flowchart showing the flow of processing in which the audio processing parameter generation unit 105 generates audio processing parameters, and FIG. 7 is a flowchart showing the flow of processing in which the video processing parameter generation unit 106 generates video processing parameters. As is clear from a comparison of FIG. 6 and FIG. 7, steps SA103, SA105, and SA106 correspond to steps SB103, SB105, and SB106, respectively, and the remaining steps are identical in content, so the same step numbers are used for those remaining steps in FIG. 6 and FIG. 7. The audio processing parameter generation unit 105 and the video processing parameter generation unit 106 start generating the audio processing parameter and the video processing parameter, respectively, upon receiving operation content data from the video operation input unit 104. Each of the audio processing parameter generation unit 105 and the video processing parameter generation unit 106 first determines whether the operation indicated by the operation content data is a pinch-in/out operation (step SA101). Specifically, each unit determines that the operation content data indicates a pinch-in/out operation when the data contains the trajectories of two fingers. If the determination result of step SA101 is “Yes”, the audio processing parameter generation unit 105 and the video processing parameter generation unit 106 determine whether at least one of the touch start coordinates of the two fingers indicated by the operation content data lies within the area corresponding to the avatar (step SA102). If the determination result of step SA102 is “Yes”, the audio processing parameter generation unit 105 refers to the audio processing table and generates an audio processing parameter that increases or decreases the volume of the audio signal according to the interval between the two fingers after the pinch-in/out operation (step SA103). Similarly, if the determination result of step SA102 is “Yes”, the video processing parameter generation unit 106 refers to the video processing table and generates a video processing parameter that reduces or enlarges the avatar to a size corresponding to the interval between the two fingers after the pinch-in/out operation and composites the avatar with the background video as it was before the pinch-in/out operation (step SB103). If, on the other hand, the determination result of step SA102 is “No”, the audio processing parameter generation unit 105 and the video processing parameter generation unit 106 end the generation processing without executing step SA103 or step SB103, and therefore without generating an audio processing parameter or a video processing parameter.
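The pinch branch (steps SA101 to SA103 and SB103) can be sketched as follows. This is a sketch under stated assumptions: the description only says the volume and avatar size follow "the interval between the two fingers after the operation", so using the ratio of end spacing to start spacing as a common factor is an assumption, as are the rectangular avatar area and all function and key names.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # left, top, width, height (assumed shape)


def inside(p: Point, r: Rect) -> bool:
    left, top, width, height = r
    return left <= p[0] <= left + width and top <= p[1] <= top + height


def pinch_ratio(trajectories: Sequence[Sequence[Point]], avatar: Rect) -> Optional[float]:
    """Return end/start finger spacing, or None when the pinch branch does not apply."""
    if len(trajectories) != 2:  # step SA101 "No": handled by the touch branch instead
        return None
    a, b = trajectories
    if not (inside(a[0], avatar) or inside(b[0], avatar)):  # step SA102 "No"
        return None
    start = math.dist(a[0], b[0])
    end = math.dist(a[-1], b[-1])
    if start == 0:
        return None  # degenerate input: both fingers started at the same point
    return end / start  # > 1 for pinch-out, < 1 for pinch-in


def generate_pinch_parameters(trajectories: Sequence[Sequence[Point]], avatar: Rect) -> Optional[dict]:
    ratio = pinch_ratio(trajectories, avatar)
    if ratio is None:
        return None
    # Step SA103 (audio) and step SB103 (video) are driven by the same factor here
    return {"audio": {"volume_gain": ratio}, "video": {"avatar_scale": ratio}}
```

Driving both parameters from one factor is a design choice of this sketch; it keeps the avatar size and the loudness moving in the same direction, which is the behavior the description treats as intuitive.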

Next, the case where the audio processing parameter generation unit 105 and the video processing parameter generation unit 106 receive operation content data indicating a touch operation will be described, again with reference to FIG. 6 and FIG. 7. If the determination result of step SA101 is “No”, each of the audio processing parameter generation unit 105 and the video processing parameter generation unit 106 determines whether the operation represented by the operation content data is a touch operation on the avatar (step SA104). Specifically, each unit determines that the operation is a touch operation on the avatar if the coordinates of the touch start position indicated by the received operation content data lie within the area corresponding to the avatar, and determines that the operation is a flick operation or a swipe operation if the coordinates lie outside that area.
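Taken together, steps SA101, SA102, and SA104 form a small decision tree over the touch start coordinates. A minimal sketch follows, assuming a rectangular avatar area; the function names and the returned labels are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # left, top, width, height (assumed shape)


def in_avatar_area(p: Point, avatar: Rect) -> bool:
    left, top, width, height = avatar
    return left <= p[0] < left + width and top <= p[1] < top + height


def classify(trajectories: Sequence[Sequence[Point]], avatar: Rect) -> str:
    """Classify operation content data by finger count and touch start position."""
    starts = [t[0] for t in trajectories]
    if len(trajectories) == 2:  # step SA101 "Yes": pinch-in/out
        if any(in_avatar_area(p, avatar) for p in starts):  # step SA102
            return "pinch_on_avatar"
        return "no_parameter"  # generation ends without producing parameters
    if in_avatar_area(starts[0], avatar):  # step SA104 "Yes"
        return "avatar_touch"
    return "flick_or_swipe"  # step SA104 "No"
```

Only the start position matters for the classification; the end position is used later, when the parameter for the chosen branch is generated.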

If the determination result of step SA104 is “Yes”, the video processing parameter generation unit 106 refers to the contents of the video processing table and generates a video processing parameter instructing that the display position of the avatar be moved and the avatar be composited with the background video (step SB105). Specifically, the video processing parameter generation unit 106 generates a video processing parameter instructing that the avatar be moved to the touch end position indicated by the received operation content data and composited with the background video. Meanwhile, if the determination result of step SA104 is “Yes”, the audio processing parameter generation unit 105 refers to the contents of the audio processing table and generates an audio processing parameter instructing that the localization position of the sound image corresponding to the avatar be moved (step SA105). Here, the sound image corresponding to the avatar is the sound image of the audio signal indicated by the audio signal identifier associated with the avatar, that is, the sound image of the singing voice of the singer corresponding to the avatar. Specifically, the audio processing parameter generation unit 105 generates an audio processing parameter instructing that the output balance of the audio output units 103 be adjusted so that the sound image is localized at the touch end position indicated by the received operation content data.
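One way to picture steps SB105 and SA105 is to derive both parameters from the same touch end position. The sketch below assumes two audio output units (left and right) and uses constant-power panning on the horizontal coordinate; the description only requires that the output balance be adjusted, so the panning law, the two-speaker layout, and all names here are assumptions.

```python
import math
from typing import Tuple


def stereo_gains(end_x: float, screen_width: float) -> Tuple[float, float]:
    """Map a horizontal screen position to (left, right) gains with constant power."""
    pan = min(max(end_x / screen_width, 0.0), 1.0)  # 0 = left edge, 1 = right edge
    angle = pan * math.pi / 2
    return math.cos(angle), math.sin(angle)


def move_parameters(end: Tuple[float, float], screen_width: float) -> dict:
    """Build the video and audio parameters for one avatar move from one end position."""
    left, right = stereo_gains(end[0], screen_width)
    return {
        "video": {"avatar_position": end},                   # step SB105
        "audio": {"left_gain": left, "right_gain": right},   # step SA105
    }
```

Because both parameters are computed from the same coordinates, the displayed avatar and the perceived source of the singing voice move together.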

If, on the other hand, the determination result of step SA104 is “No”, the video processing parameter generation unit 106 treats the received operation content data as indicating a flick operation or a swipe operation and generates a video processing parameter instructing that the background video be changed and composited with the avatar (step SB106). Specifically, the video processing parameter generation unit 106 first refers to the video processing table and determines that a change of the background video has been instructed, and then identifies the identifier of the destination background video from the identifier indicating the background video at that time (that is, the identifier stored in the video processing parameter generation unit 106) and the contents of the background video change table. The video processing parameter generation unit 106 then updates the identifier it stores to the identifier thus identified and generates a video processing parameter instructing that the background video indicated by that identifier be composited with the avatar. For example, when the background video is a video arbitrarily set by the manufacturer or the user of the karaoke apparatus 100 (that is, the initial background video corresponding to number 1 in FIG. 5), a flick or swipe operation causes the video processing parameter generation unit 106 to generate a video processing parameter instructing that the music hall (that is, the background video corresponding to number 2 in FIG. 5) be selected as the background video. When a flick or swipe operation is then performed again, the video processing parameter generation unit 106 generates a video processing parameter instructing that the bathroom (that is, the background video corresponding to number 3 in FIG. 5) be selected as the background video. As flick or swipe operations are performed one after another, the video processing parameter generation unit 106 generates video processing parameters that select the background videos in turn. If the background video before the operation is the one corresponding to the last number in the background video change table shown in FIG. 5, the operation causes the unit to generate a video processing parameter that selects the initial background video corresponding to number 1 next. In this way, however many times the flick or swipe operation is repeated, the video processing parameter generation unit 106 generates video processing parameters that keep selecting the background videos in turn. Note that the avatar does not change before and after the flick or swipe operation, nor does the position at which it is composited with the background video.
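The wrap-around selection of step SB106 can be sketched as below. The table holds only the three entries named in the description; the actual number of entries in FIG. 5 is not stated, and the class, function, and key names are hypothetical.

```python
from typing import Dict

# Hypothetical contents of the background video change table (FIG. 5);
# only numbers 1 to 3 are named in the description.
BACKGROUND_CHANGE_TABLE: Dict[int, str] = {
    1: "initial background",
    2: "music hall",
    3: "bathroom",
}


def next_identifier(current: int, table: Dict[int, str]) -> int:
    """Return the number after `current`, wrapping from the last number back to 1."""
    last = max(table)
    return 1 if current >= last else current + 1


class VideoParameterGenerator:
    """Remembers the identifier of the background video currently displayed."""

    def __init__(self) -> None:
        self.current_id = 1  # initial background video

    def on_flick_or_swipe(self) -> dict:
        self.current_id = next_identifier(self.current_id, BACKGROUND_CHANGE_TABLE)
        # The avatar and its composited position are deliberately left untouched
        return {
            "background_id": self.current_id,
            "background": BACKGROUND_CHANGE_TABLE[self.current_id],
        }
```

Keeping the current identifier inside the generator mirrors the description, in which the unit stores the identifier itself and updates it each time a new background video is selected.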

Similarly, if the determination result of step SA104 is “No”, the audio processing parameter generation unit 105 treats the received operation content data as indicating a flick operation or a swipe operation and generates an audio processing parameter instructing that the reverb applied to the audio corresponding to the avatar be changed (step SA106). Specifically, the audio processing parameter generation unit 105 first refers to the audio processing table and determines that a change of the reverb has been instructed, and then identifies the identifier of the destination reverb from the identifier indicating the reverb at that time (that is, the identifier stored in the audio processing parameter generation unit 105) and the contents of the reverb change table. The audio processing parameter generation unit 105 then updates the identifier it stores to the identifier thus identified and generates an audio processing parameter instructing that the reverb indicated by that identifier be applied to the audio corresponding to the avatar. For example, when no reverb is applied (that is, the state corresponding to number 1 in FIG. 4), a flick or swipe operation causes the audio processing parameter generation unit 105 to generate an audio processing parameter instructing that a reverb that makes the audio sound as if it were heard in a music hall be applied (that is, the reverb corresponding to number 2 in FIG. 4). When a flick or swipe operation is then performed again, the audio processing parameter generation unit 105 generates an audio processing parameter instructing that a reverb that makes the audio sound as if it were heard in a bathroom be applied (that is, the reverb corresponding to number 3 in FIG. 4). As flick or swipe operations are performed one after another, the audio processing parameter generation unit 105 generates audio processing parameters that select the reverbs in turn. If the reverb before the operation is the one corresponding to the last number in the reverb change table shown in FIG. 4, the operation causes the unit to generate an audio processing parameter that selects number 1 next, that is, applies no reverb. In this way, however many times the flick or swipe operation is repeated, the audio processing parameter generation unit 105 generates audio processing parameters that keep selecting the reverbs in turn.
The above is the configuration of the karaoke apparatus 100 of the present embodiment.

(B: Operation)
Hereinafter, the operation of the karaoke apparatus 100 will be described, taking as an example the case where the karaoke apparatus 100 has a single user. When the user turns on the power of the karaoke apparatus 100, a menu screen prompting the selection of a karaoke song is displayed on the video output unit 110. Having seen this menu screen, the user can select the karaoke song to be sung and input an instruction to start the performance by operating the remote controller or the like. When a karaoke song is selected, the background video acquisition unit 109 acquires the background video data corresponding to the karaoke song from the host computer and also acquires the accompaniment sound data of the karaoke song from the same host computer. When the instruction to start the performance is input, the background video acquisition unit 109 starts outputting the acquired background video data to the video processing unit 108 and starts outputting the acquired accompaniment sound data to the audio processing unit 102 (the latter path is not shown in FIG. 1). The volume of the accompaniment sound is adjusted, for example, by moving a slider on the remote controller of the karaoke apparatus 100 up or down.

When the output of the background video data from the background video acquisition unit 109 to the video processing unit 108 is started in this way, the video output unit 110 starts displaying the video corresponding to the karaoke song to be sung, and the audio output unit 103 starts emitting the accompaniment sound of the karaoke song. On recognizing from the video and the accompaniment sound that the singing start timing has arrived, the user takes up the audio input unit 101 and starts singing. Because the user does not start singing until the singing start point is reached, the avatar corresponding to the user is not displayed on the video output unit 110 from the time the output of the background video data from the background video acquisition unit 109 to the video processing unit 108 starts until the singing start timing. When the user starts singing, an audio signal representing the singing voice is supplied to the audio processing unit 102 and also to the video generation unit 107. The video generation unit 107 then starts outputting avatar data representing the avatar corresponding to the user to the video processing unit 108. The video processing unit 108 composites the supplied avatar data with the background video data so that the avatar is positioned at the center coordinates of the background video, and supplies the result to the video output unit 110 as output video data. The video output unit 110 therefore displays a video in which the avatar is placed at the center of the background video, as shown in FIG. 8. In addition, the output balance is adjusted so that the user's singing voice output from the audio output units 103 is heard from the position of the avatar displayed on the video output unit 110.

Once the display of the avatar has started as described above, the user of the karaoke apparatus 100 can perform an operation on the avatar or an operation on the background video using the video operation input unit 104. Operations on the avatar include the pinch-in/out operation described above and a touch operation that moves the position of the avatar. The operations performed by the karaoke apparatus 100 in response to these user operations are described below.

(B-1: Operation when an operation is performed on the avatar)
As described above, operations on the avatar include the pinch-in/out operation and the touch operation.

(B-1-1: Operation when a pinch-in/out operation is performed on the avatar)
First, the operation when a pinch-in/out operation is performed on the avatar will be described.
When a pinch-in/out operation is performed on the avatar displayed on the video output unit 110, the video operation input unit 104 generates operation content data indicating the pinch-in/out operation and supplies it to the audio processing parameter generation unit 105 and the video processing parameter generation unit 106, as described above. As also described above, the audio processing parameter generation unit 105 generates an audio processing parameter instructing that the volume of the audio corresponding to the avatar be decreased or increased according to the operation amount of the pinch-in/out operation, and the video processing parameter generation unit 106 generates a video processing parameter instructing that the avatar be reduced or enlarged.

The audio processing parameter is supplied to the audio processing unit 102. In accordance with the audio processing parameter, the audio processing unit 102 decreases or increases the volume of the audio signal supplied from the audio input unit 101 corresponding to the avatar on which the pinch-in/out operation was performed, and supplies the processed audio signal to the audio output unit 103. The audio output unit 103 emits the supplied audio signal as sound. Regarding the relationship between the pinch-in/out operation on the avatar and the decrease or increase in volume, the volume of the audio signal could instead be increased by a pinch-in operation and decreased by a pinch-out operation. However, such an association between operation and processing does not match the user's intuition and is therefore by no means desirable. The desirable association is one in which a pinch-in operation decreases the volume of the audio signal and a pinch-out operation increases it.
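As an illustration of the volume processing performed by the audio processing unit 102, the sketch below scales a block of samples by a gain carried in the audio processing parameter. The sample format (floats in the range -1.0 to 1.0), the hard clipping, and the function name are assumptions; the description does not specify how the volume change is implemented.

```python
from typing import List, Sequence


def apply_gain(samples: Sequence[float], gain: float) -> List[float]:
    """Scale samples by `gain` and clip to the assumed range [-1.0, 1.0].

    A gain above 1 corresponds to a pinch-out (louder) and a gain below 1
    to a pinch-in (quieter), matching the association the text recommends.
    """
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


louder = apply_gain([0.5, -0.8], 1.5)   # pinch-out; the second sample is clipped
quieter = apply_gain([0.5, -0.8], 0.5)  # pinch-in
```

Clipping is included only so that the sketch stays within the assumed sample range; a real implementation might limit the gain instead.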

In accordance with the video processing parameter, the video processing unit 108 reduces or enlarges the avatar data of the avatar on which the pinch-in/out operation was performed, composites the background video data with the processed avatar data to produce output video data, and supplies the output video data to the video output unit 110. The video output unit 110 displays the supplied output video data to the user. As a result, the entire avatar displayed on the video output unit 110 and the video operation input unit 104 is reduced or enlarged. The avatar could instead be enlarged by a pinch-in operation and reduced by a pinch-out operation, but such an association between operation and processing does not match the user's intuition and is therefore by no means desirable.

FIG. 9 is a diagram illustrating the video displayed after a pinch-out operation is performed on the video operation input unit 104. As is clear from a comparison of FIG. 8 and FIG. 9, when a pinch-out operation is performed on the entire avatar, the entire avatar is displayed enlarged. In addition, the audio corresponding to the enlarged avatar is emitted from the audio output unit 103 at an increased volume. Thus, according to the present embodiment, the synchronization between video and audio is not lost.

(B-1-2: Operation when a touch operation is performed on the avatar)
Next, the operation when a touch operation that moves the position of the avatar is performed will be described.
When a touch operation that moves the position of the avatar is performed, the video processing parameter generation unit 106 generates a video processing parameter that moves the avatar to the touch operation end position. The video processing parameter generation unit 106 supplies the video processing parameter to the video processing unit 108, and the video processing unit 108 composites the avatar data with the background video data on the basis of the video processing parameter and supplies the result to the video output unit 110 as output video data. The video output unit 110 displays the video represented by the supplied output video data. As a result, the avatar displayed on the video output unit 110 moves to the end position of the touch operation performed on it (see FIG. 10). If the avatar displayed on the video operation input unit 104 is moved off the screen, the video processing parameter generation unit 106 generates a video processing parameter that causes the avatar not to be displayed, and the audio processing parameter generation unit 105 generates an audio processing parameter instructing that the audio corresponding to the avatar be muted. As a result, the avatar is no longer displayed on the video output unit 110 or the video operation input unit 104, and the user can no longer hear the audio corresponding to the avatar.
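The off-screen rule can be sketched as a single bounds check that drives both parameters. This assumes that the release coordinates can fall outside the display area and that the screen is described by its width and height; the function name and parameter keys are hypothetical.

```python
from typing import Tuple


def avatar_move_parameters(end: Tuple[float, float], screen: Tuple[float, float]) -> dict:
    """Hide and mute the avatar when it is dragged off screen; otherwise move it."""
    x, y = end
    width, height = screen
    on_screen = 0 <= x < width and 0 <= y < height
    return {
        "video": {
            "avatar_visible": on_screen,
            "avatar_position": end if on_screen else None,
        },
        "audio": {"mute": not on_screen},
    }
```

Deriving visibility and muting from the same condition means an avatar can never be hidden while its voice is still audible, or the reverse.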

When a touch operation that moves the position of the avatar is performed, the audio processing parameter generation unit 105 also generates, as described above, an audio processing parameter instructing that the output balance of the audio output units 103 be adjusted so that the sound image corresponding to the avatar (that is, the sound image of the user's singing voice) is localized at the touch operation end position. The audio processing parameter generation unit 105 supplies the audio processing parameter to the audio processing unit 102, and the audio processing unit 102 applies this processing to the audio signal supplied from the audio input unit 101 corresponding to the avatar on which the touch operation was performed and supplies the result to the audio output unit 103. The audio output unit 103 emits the sound represented by the supplied audio signal. As a result, the user hears the avatar as if it were singing from the end position of the touch operation performed on it. In this case as well, the synchronization between video and audio is not lost.

(B-2: Operation when an operation is performed on the background video)
Finally, the operation when an operation is performed on the background video will be described.
When a flick or swipe operation is performed on the background video displayed on the video operation input unit 104, the video processing parameter generation unit 106, as described above, generates a video processing parameter that instructs selection of a background video different from the one shown before the operation. The video processing parameter generation unit 106 gives the video processing parameter to the video processing unit 108, and the video processing unit 108, based on the video processing parameter, synthesizes the same avatar data as before the operation with background video data different from that before the operation and gives the result to the video output unit 110 as output video data. In the output video data, the position of the avatar relative to the background video is the same before and after the operation. The video output unit 110 displays the video represented by the given output video data. The video output unit 110 therefore shows a background video different from the one before the flick or swipe operation, while the same avatar is displayed at the same position relative to the background video.

Meanwhile, the audio processing parameter generation unit 105 generates an audio processing parameter that instructs application of reverb corresponding to the background video selected by the flick or swipe operation. The audio processing parameter generation unit 105 gives the audio processing parameter to the audio processing unit 102, and the audio processing unit 102 applies reverb to the audio signal based on the audio processing parameter and gives the result to the audio output unit 103. The audio output unit 103 emits the sound represented by the given audio signal. The audio output unit 103 therefore emits audio with reverb corresponding to the background video shown after the flick or swipe operation. In this case, too, the synchronization between video and audio is not lost.
The above is the operation of the karaoke apparatus 100 of the present embodiment.

As described above, according to the karaoke apparatus 100 of the present embodiment, operating the video operation input unit 104 generates both an audio processing parameter and a video processing parameter corresponding to that operation, so the synchronization between video and audio is not lost. Furthermore, by operating the video displayed on the video operation input unit 104, the user of the karaoke apparatus 100 can visually grasp the characteristics of the audio and can process the audio visually.
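The central idea, that one piece of operation content data drives both parameter generators, can be sketched as follows for the background flick/swipe case. The background names, the reverb names, and the rule of cycling to the next background are hypothetical stand-ins; the actual contents of the background video change table (FIG. 5) and the reverb change table (FIG. 4) are not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Operation:
    kind: str    # e.g. "flick" or "swipe"
    target: str  # "background" in this sketch

# Hypothetical stand-ins for the background video change table
# and the reverb change table.
BACKGROUNDS = ["concert_hall", "small_club", "stadium"]
REVERB_FOR_BACKGROUND = {
    "concert_hall": "hall",
    "small_club": "room",
    "stadium": "arena",
}

def generate_parameters(op, current_background):
    """Derive the video and the audio parameter from the same operation data."""
    if op.target == "background" and op.kind in ("flick", "swipe"):
        idx = BACKGROUNDS.index(current_background)
        next_bg = BACKGROUNDS[(idx + 1) % len(BACKGROUNDS)]
        video_param = {"select_background": next_bg}
        audio_param = {"reverb": REVERB_FOR_BACKGROUND[next_bg]}
        return video_param, audio_param
    return None, None  # operation not handled in this sketch
```

Since both parameters are computed in one call from one operation, there is no path by which the video could change without the matching audio change.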

(C: Modifications)
Various modifications of the karaoke apparatus 100 of the present embodiment are conceivable. They are described below.

(1) The pinch-in/out and touch operations on the avatar and the flick or swipe operation on the background video may instead be operations on a control (for example, a slider) provided on the remote controller of the karaoke apparatus 100. The video operation input unit 104 may be replaced with a pointing device such as a mouse. The apparatus may also be configured so that an operation equivalent to a pinch-in/out operation or the like becomes possible only when the pointing device is operated while a specific button on the keyboard or remote controller of the karaoke apparatus 100 is pressed. The pinch-in/out operation on the avatar is not limited to an operation performed inside the area of the video operation input unit 104 that corresponds to the avatar, and may be performed near that area. That is, operation content data may be generated for a pinch-in/out operation performed near the area corresponding to the avatar in the same way as for one performed inside that area. Furthermore, a motion capture device or a voice identification device may be used as the video operation input unit 104. The video operation input unit 104 may also generate operation content data in response to an operation of pinching and stretching an avatar in a pseudo-3D space.

(2) More users than the number of audio input units 101 provided in the karaoke apparatus 100 may take turns using the audio input units 101 and sing one at a time. In this case, if the number of the audio input unit 101 is used as the audio signal identifier, the avatar displayed on the video output unit 110 does not change even when the user of an audio input unit 101 changes. That is, the video output unit 110 displays as many avatars as there are audio input units 101 in the karaoke apparatus 100. In contrast, if waveform data or feature data is used as the audio signal identifier, a different avatar can be displayed for each user. Likewise, to display different avatars depending on whether one user or several users at once are singing into a single audio input unit 101, waveform data or feature data should be used as the audio signal identifier; to display the same avatar in both cases, the number of the audio input unit 101 should be used.

(3) Lyric subtitle display data may be delivered together with the background video data from a host computer to the background video acquisition unit 109 via a network. In this mode, the background video acquisition unit 109 gives the lyric subtitle display data and the background video data to the video processing unit 108, and the video processing unit 108 processes the avatar data under the control of the video processing parameter, synthesizes the processed avatar data with the background video data and the lyric subtitle display data, and gives the result to the video output unit 110. Specifically, relative coordinates indicating the position of the avatar's mouth (coordinates whose origin is the upper left corner of the image representing the avatar) are determined in advance for each avatar. From the coordinates of the avatar's placement position on the background video, the video processing unit 108 calculates the coordinates of the avatar's mouth at which the lyric subtitle display data is to be composited (coordinates whose origin is the upper left corner of the display surface). The video processing unit 108 then positions the lyric subtitle display data near the avatar's mouth and performs the synthesis so that the output video data shows the lyrics in a speech balloon at the avatar's mouth. In this case, the video processing unit 108 synthesizes the avatar data, the background video data, and the lyric subtitle display data so that the lyrics are displayed in front of the avatar and remain visible.
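The coordinate conversion in modification (3) amounts to adding the avatar's placement position to the predefined mouth offset. A minimal sketch follows; the function name and the layer labels are hypothetical, and the sketch assumes the avatar is drawn unscaled (the description does not say how the offset behaves when the avatar has been enlarged or reduced).

```python
def mouth_display_position(avatar_origin, mouth_offset):
    """Convert the avatar-relative mouth offset to display coordinates.

    avatar_origin: (x, y) of the avatar image's upper left corner,
                   measured from the upper left corner of the display.
    mouth_offset:  (dx, dy) of the mouth, measured from the upper left
                   corner of the avatar image.
    """
    ax, ay = avatar_origin
    dx, dy = mouth_offset
    return (ax + dx, ay + dy)

# Drawing order, back to front, so that the lyrics stay visible
# in front of the avatar (layer names are illustrative).
LAYER_ORDER = ["background", "avatar", "lyrics"]
```

For example, an avatar placed at (200, 120) with a mouth offset of (40, 65) would have its lyrics anchored at (240, 185).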

(4) When a shadow or afterimage is attached to the avatar, an echo effect may be applied to the audio signal corresponding to the avatar in response to an operation on the shadow or afterimage. Specifically, data representing application of an echo effect is stored in the audio processing table in association with data representing the operation of touching the avatar displayed on the video operation input unit 104 and lifting the finger from the video operation input unit 104 without moving it. In addition, data representing display of a shadow around the avatar is stored in the video processing table in association with data representing the same operation. In this mode, when the user touches the avatar and lifts the finger without moving it, a shadow is displayed around the avatar and an echo effect is applied to the audio. Further, data representing weakening or strengthening of the echo effect is stored in the audio processing table in association with data representing a pinch-in/out operation performed on the area corresponding to the shadow around the avatar displayed on the video operation input unit 104, and data representing reduction/enlargement of the extent of the shadow is stored in the video processing table in association with data representing the same operation. When a pinch-in/out operation is performed on the shadow around the avatar, the echo effect applied to the audio corresponding to the avatar becomes weaker/stronger and the extent of the shadow shrinks/grows. In addition, data representing a change in the type of echo applied may be stored in the audio processing table, and data representing a change of the avatar's costume may be stored in the video processing table, both in association with data representing the operation of touching the shadow around the avatar and lifting the finger without moving it.

(5) Data representing blurring of the area around the avatar may be stored in the video processing table, and data representing application of an overlapping singing effect may be stored in the audio processing table, both in association with data representing the operation of touching the avatar displayed on the video operation input unit 104 and lifting the finger without moving it, performed twice in succession. In this mode, when that operation is performed twice in succession, the area around the avatar is displayed blurred and the overlapping singing effect is applied to the audio. The operation performed twice in succession may be either a pinch-in/out operation or a touch operation. Further, data representing weakening or strengthening of the overlapping singing effect may be stored in the audio processing table, and data representing reduction/enlargement of the extent of the blur may be stored in the video processing table, both in association with data representing a pinch-in/out operation performed on the area corresponding to the blur around the avatar. When a pinch-in/out operation is performed on the blur around the avatar displayed on the video operation input unit 104, the overlapping singing effect applied to the audio corresponding to the avatar becomes weaker/stronger and the extent of the blur shrinks/grows. Instead of data representing blurring of the area around the avatar, the video processing table may store data representing display of multiple overlapping copies of the same avatar at shifted positions. In that case, data representing weakening or strengthening of the overlapping singing effect may be stored in the audio processing table, and data representing a decrease/increase in the number of overlapping copies may be stored in the video processing table, both in association with data representing a pinch-in/out operation performed on the area corresponding to the overlapping copies around the avatar.

(6) The avatar described above has a human-like shape, but it may have any other shape, for example, that of a building, a car, a tree, a mountain, or a musical instrument. The shape may be entirely unrelated to the input audio. The avatar data may be video data of the singer captured by a camera (not shown). The video displayed on the video output unit 110 is not limited to a composite of a background video and an avatar, and may consist of an avatar alone.

(7) The karaoke apparatus 100 may have a function of scoring the input audio, and the video processing unit 108 may change the entire avatar according to the scoring result. In detail, a storage device (not shown) stores three types of avatar data corresponding to the same audio signal identifier. These avatar data represent the same character in a joyful, a normal, and a dejected state, and each is associated with a range of scores of the karaoke apparatus 100. For example, when scoring is out of 100 points, a score from 70 to 100 corresponds to the joyful avatar data, a score from 31 to 69 to the normal avatar data, and a score from 0 to 30 to the dejected avatar data. The video generation unit 107 reads the avatar data corresponding to the scoring result from the storage device and gives it to the video processing unit 108. The video processing unit 108 synthesizes the avatar data and the background video and gives the result to the video output unit 110 as output video data, and the video output unit 110 displays the output video data to the user. The video output unit 110 therefore displays an avatar corresponding to the scoring result. However, when the karaoke apparatus 100 has not performed scoring or scoring is still in progress, the video generation unit 107 gives the avatar data representing the normal state to the video processing unit 108.
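The score-to-avatar mapping of modification (7) can be written as a small lookup. The thresholds come from the example in the text; the state labels ("joyful", "normal", "dejected"), the function name, and the use of `None` for "not scored or scoring in progress" are choices made for this sketch.

```python
def select_expression(score):
    """Pick the avatar state for a score out of 100.

    score is an integer from 0 to 100, or None when scoring has not
    been performed or is still in progress.
    """
    if score is None:
        return "normal"
    if score < 0 or score > 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 70:
        return "joyful"
    if score >= 31:
        return "normal"
    return "dejected"
```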

(8) The avatar's face may be displayed as a morph between the avatar's own face and the face of the original artist of the song being sung. When the karaoke apparatus 100 has a scoring function as described above, the avatar's face may approach the original artist's face as the score becomes higher. In detail, image data of the face of the original artist of the song is stored in a storage device (not shown). After scoring, the video generation unit 107 reads the image data of the original artist's face from the storage device, performs the following calculation using the avatar's face image data and the scoring result, and outputs the result to the video processing unit 108. Let the score be C out of 100, let A be a pixel in the original artist's face image data, and let B be the pixel at the same position in the original avatar's face image data. The sum of A multiplied by C/100 and B multiplied by 1 - C/100 is used as the value of the pixel at that position after scoring.
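The blend in modification (8) is a per-pixel linear interpolation, A x C/100 + B x (1 - C/100). A minimal sketch for one pixel follows; treating a pixel as a tuple of integer channel values and rounding the result are assumptions of this sketch, not details given in the description.

```python
def morph_pixel(artist_pixel, avatar_pixel, score):
    """Blend one pixel of the artist's face (A) with the avatar's (B).

    score is C, out of 100. Each pixel is a tuple of channel values,
    e.g. (R, G, B). The result is A * C/100 + B * (1 - C/100) per channel.
    """
    w = score / 100.0
    return tuple(
        round(a * w + b * (1.0 - w))
        for a, b in zip(artist_pixel, avatar_pixel)
    )
```

A score of 100 reproduces the artist's face exactly, a score of 0 leaves the avatar's face unchanged, and intermediate scores give a proportionate mix.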

Second Embodiment
In the first embodiment, pinch-in/out and touch operations are performed on the entire avatar displayed on the video operation input unit 104. In the present embodiment, by contrast, operations such as pinch-in/out are performed on a designated part of the avatar rather than on the entire avatar. This is the main difference between the present embodiment and the first embodiment. Because the hardware configuration of the karaoke apparatus of the present embodiment is the same as that of the karaoke apparatus 100 of the first embodiment, FIG. 1 is referred to again and a detailed description is omitted.

An operation on a part of the avatar consists of an operation that designates the target part (in the present embodiment, touching the same position inside the area corresponding to the target part three times in succession) and an operation on the part thus designated (pinch-in/out or touch). The operation content data for an operation on a part of the avatar therefore includes data representing the operation that designates the target part, and in this respect it differs from the operation content data for an operation on the entire avatar. FIG. 11 is a flowchart showing the flow of processing in which the audio processing parameter generation unit 105 generates an audio processing parameter, and FIG. 12 is a flowchart showing the flow of processing in which the video processing parameter generation unit 106 generates a video processing parameter. As a comparison of FIG. 11 and FIG. 12 makes clear, steps SA203 and SA204 correspond to steps SB203 and SB204, respectively, and the remaining steps are identical, so the same step numbers are used for them in both figures. The audio processing parameter generation unit 105 and the video processing parameter generation unit 106 each start generating their parameter when they receive operation content data from the video operation input unit 104. They first determine whether the received operation content data includes data representing an operation that designates a target part (step SA201). If the determination result in step SA201 is "No", they end the generation processing without generating an audio processing parameter or a video processing parameter. If the determination result in step SA201 is "Yes", they determine whether the operation content data includes data representing a pinch-in/out operation (step SA202). If the determination result in step SA202 is "Yes", the audio processing parameter generation unit 105 and the video processing parameter generation unit 106 generate the audio processing parameter and the video processing parameter, respectively (steps SA203 and SB203). Various choices are conceivable for the operation that serves to designate the target part: for example, touching inside the area corresponding to the part three times in succession, or a long press inside that area (that is, continuing to touch the same position for at least a preset time). The present embodiment adopts the former.
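The decision flow shared by FIG. 11 and FIG. 12 can be sketched as below. The dictionary layout of the operation content data and the gesture labels are hypothetical; only the step numbers and the branching follow the description. The sketch returns which step each generator would execute rather than the parameters themselves.

```python
def dispatch_part_operation(op):
    """Follow steps SA201/SA202 and report which generation steps run.

    op is a dict with (hypothetical) keys:
      "part":    the designated avatar part, or None if none was designated
      "gesture": "pinch_in", "pinch_out", or "touch"
    """
    # Step SA201: does the data include a part-designating operation?
    if op.get("part") is None:
        return None  # end without generating any parameter

    # Step SA202: does the data include a pinch-in/out operation?
    if op["gesture"] in ("pinch_in", "pinch_out"):
        # Audio side runs SA203, video side runs SB203.
        return {"part": op["part"], "audio_step": "SA203", "video_step": "SB203"}

    # Otherwise the operation is treated as a touch (SA204 / SB204).
    return {"part": op["part"], "audio_step": "SA204", "video_step": "SB204"}
```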

The video processing parameter generation unit 106 stores in advance a video processing table, which holds data representing the processing to be applied to the avatar in association with data representing each operation on each part of the avatar; the background video change table (see FIG. 5); and the avatar part processing table shown in FIG. 13. The video processing table stores, in association with data representing an operation on a part of the avatar, data indicating that the part is to be processed, and the avatar part processing table stores, in association with each part of the avatar, data representing the specific processing to be applied to that part. For example, the video processing table stores data indicating reduction/enlargement of a part in association with data indicating a pinch-in/out operation on that part. It also stores data indicating movement of a part in association with data indicating a touch operation on a part other than the body, and data representing a change of the avatar's costume in association with data indicating a touch operation on the body. The video processing parameter generation unit 106 also stores an identifier representing the costume of the avatar currently displayed on the video output unit 110.

When the determination result in step SA202 is "Yes", the video processing parameter generation unit 106 generates a video processing parameter that instructs reduction/enlargement of the part according to the operation content included in the received operation content data (step SB203). When the determination result in step SA202 is "No", the video processing parameter generation unit 106 treats the received operation content data as indicating a touch operation and generates a video processing parameter corresponding to the touch operation (step SB204). More specifically, if the touch position indicated by the operation content data is on a part other than the avatar's body, the video processing parameter generation unit 106 treats the operation as a touch on that part and generates a video processing parameter that instructs movement of the display position of the part corresponding to the touch position. If the touch position is on the avatar's body, it identifies the next costume from the contents of the video processing table and the identifier of the avatar's current costume and generates a video processing parameter that instructs a change to that costume.

The audio processing parameter generation unit 105 stores in advance an audio processing table, which holds data representing the processing to be applied to the audio corresponding to the avatar in association with data representing each operation on each part of the avatar; the reverb change table (see FIG. 4); and the compressor processing table shown in FIG. 14. In the present embodiment, the audio processing table stores data representing a decrease/increase of a parameter that defines the compressor's processing in association with a pinch-in/out operation on a part of the avatar.

Processing by a compressor means that, when the volume of the sound represented by the audio signal being processed exceeds a preset threshold, the excess volume is compressed at a set ratio and released over a set release time, thereby lowering the maximum volume, which fluctuates as the karaoke song progresses. Besides the threshold, ratio, and release time, the parameters that define the compressor's processing include gain, attack time, and knee. Gain is a value indicating an increase or decrease in volume. Attack time is the time from when the volume of the sound represented by the audio signal exceeds the threshold (that is, from when compression begins) until the ratio is reached. Knee is a value indicating how compression progresses near the threshold until the ratio is reached.
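The effect of the threshold, ratio, and gain can be illustrated with the static (steady-state) transfer curve of a compressor. This is a generic hard-knee textbook form working on levels in decibels, not the apparatus's actual implementation; it deliberately ignores attack time, release time, and knee, and the function and parameter names are hypothetical.

```python
def compressed_level_db(level_db, threshold_db, ratio, gain_db=0.0):
    """Static hard-knee compressor curve on a level given in dB.

    Below the threshold the level passes unchanged; above it, the excess
    over the threshold is divided by the ratio. gain_db is then added.
    Attack, release, and knee are not modelled in this sketch.
    """
    if ratio < 1.0:
        raise ValueError("ratio must be at least 1")
    if level_db <= threshold_db:
        out_db = level_db
    else:
        out_db = threshold_db + (level_db - threshold_db) / ratio
    return out_db + gain_db
```

With a threshold of -20 dB and a ratio of 4, an input at -8 dB (12 dB over the threshold) comes out at -17 dB (3 dB over), while an input at -30 dB is left unchanged.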

The compressor processing table of the present embodiment stores, in association with each part of the avatar, data indicating which compressor parameter is adjusted. In the compressor processing table shown in FIG. 14, for example, gain is associated with the avatar's face, knee with the body, attack time with the right hand, release time with the left hand, the threshold with the right foot, and ratio with the left foot.
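The part-to-parameter assignment of FIG. 14 maps naturally onto a lookup table. In the sketch below the mapping follows the text, and pinch-in is taken to decrease and pinch-out to increase the selected parameter; the key names, the step sizes, and the settings layout are hypothetical.

```python
# Avatar part -> compressor parameter, as listed for FIG. 14.
COMPRESSOR_PART_TABLE = {
    "face": "gain",
    "body": "knee",
    "right_hand": "attack_time",
    "left_hand": "release_time",
    "right_foot": "threshold",
    "left_foot": "ratio",
}

# Hypothetical adjustment applied per pinch gesture.
STEP = {
    "gain": 1.0, "knee": 1.0, "attack_time": 1.0,
    "release_time": 10.0, "threshold": 1.0, "ratio": 0.5,
}

def adjust_compressor(settings, part, gesture):
    """Return a copy of settings with the part's parameter adjusted."""
    if gesture not in ("pinch_in", "pinch_out"):
        raise ValueError("gesture must be pinch_in or pinch_out")
    name = COMPRESSOR_PART_TABLE[part]
    sign = -1.0 if gesture == "pinch_in" else 1.0  # in: decrease, out: increase
    updated = dict(settings)
    updated[name] = settings[name] + sign * STEP[name]
    return updated
```

Returning a new dictionary rather than modifying the input keeps the previous settings available, which is convenient if the same operation data is also consumed by the video side.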

The audio processing table of the present embodiment stores, in association with data indicating a touch operation on a part of the avatar, data indicating that processing other than compression is to be applied to the audio corresponding to that avatar. For example, data representing a touch operation on the avatar's right hand is associated with data representing a boost of the low range of that avatar's audio; data representing a touch operation on the avatar's left hand is associated with data representing a boost of the high range; and data representing a touch operation on the avatar's body is associated with data representing a change of the acoustic effect according to the costume as updated by that touch operation. For instance, if the costume after the change is a kimono, the acoustic effect is changed to an enka-style one.

When the determination result in step SA202 is "Yes", the audio processing parameter generation unit 105 consults the processing table and generates an audio processing parameter instructing that a compressor effect corresponding to the operation be applied to the audio signal corresponding to the avatar (step SA203). For example, for a pinch-in/out operation on the avatar's face, the audio processing parameter generation unit 105 generates an audio processing parameter instructing a decrease/increase of the gain of the audio signal. For a pinch-in/out operation on the avatar's body, it generates an audio processing parameter instructing a decrease/increase of the knee. For a pinch-in/out operation on the avatar's right hand, it generates an audio processing parameter instructing a shortening/extension of the attack time. For a pinch-in/out operation on the avatar's left hand, it generates an audio processing parameter instructing a shortening/extension of the release time. For a pinch-in/out operation on the avatar's right foot, it generates an audio processing parameter instructing a decrease/increase of the threshold value. For a pinch-in/out operation on the avatar's left foot, it generates an audio processing parameter instructing a decrease/increase of the ratio. In contrast, when the determination result in step SA202 is "No", the audio processing parameter generation unit 105 refers to the contents of the audio processing table and generates an audio processing parameter instructing adjustment of the acoustic effect corresponding to the operated part (step SA204). For example, when the operation is a touch operation on the avatar's body and the costume updated by that touch operation is a kimono, the audio processing parameter generation unit 105 generates an audio processing parameter instructing that an enka-style acoustic effect be applied.
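The table lookup in steps SA203 and SA204 can be sketched as two dictionaries and a single function. This is an illustration under stated assumptions, not the actual implementation: it assumes that the step SA202 check separates pinch-in/out operations ("Yes") from touch operations ("No"), and the key names, gesture names, and the shape of the returned parameter are all hypothetical. Shortening and extension of a time parameter are expressed here as "decrease" and "increase".

```python
# Avatar part -> compressor parameter, following the FIG. 14 associations.
COMPRESSOR_TABLE = {
    "face": "gain",
    "body": "knee",
    "right_hand": "attack_time",
    "left_hand": "release_time",
    "right_foot": "threshold",
    "left_foot": "ratio",
}

# Avatar part -> non-compressor effect applied on a touch operation
# (the audio processing table); effect names are hypothetical labels.
AUDIO_TABLE = {
    "right_hand": "boost_low_range",
    "left_hand": "boost_high_range",
    "body": "effect_by_costume",
}


def make_audio_parameter(part, gesture):
    """Build an audio processing parameter for an operation on one avatar part."""
    if gesture in ("pinch_in", "pinch_out"):
        # Assumed SA202 "Yes" branch -> SA203: adjust the associated compressor parameter.
        direction = "decrease" if gesture == "pinch_in" else "increase"
        return {"effect": "compressor",
                "target": COMPRESSOR_TABLE[part],
                "direction": direction}
    # Assumed SA202 "No" branch -> SA204: look up the effect for a touch operation.
    # Parts without an entry yield an effect of None.
    return {"effect": AUDIO_TABLE.get(part), "target": None, "direction": None}


print(make_audio_parameter("right_hand", "pinch_out"))
print(make_audio_parameter("body", "touch"))
```

Keeping the associations in tables rather than in branching code means that a different assignment of parts to parameters only requires rewriting table entries.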

Note that the correspondence between the reduction/enlargement of the size or length of an avatar part in response to a pinch-in/out operation and the decrease/increase or shortening/extension of the compressor parameter may be reversed. Furthermore, the avatar part on which the pinch-in/out operation is performed need not match the avatar part that is reduced/enlarged via the video operation input unit 104. For example, a pinch-in operation on the avatar's body may lengthen the avatar's right hand and increase the gain of the audio signal. However, such an association between operation and processing does not match the user's intuition and is therefore not necessarily desirable. In addition, the relationship between operations on avatar parts and compressor parameters need not be the one described above. For example, a pinch-in/out operation on the avatar's body may reduce/enlarge the avatar's body and shorten/extend the attack time. These variations are realized by rewriting the entries corresponding to the operated avatar part in the avatar part processing table of FIG. 13 and the compressor processing table of FIG. 14.
The above is the configuration of the karaoke apparatus 100 of the present embodiment.

With this configuration, in the karaoke apparatus of the present embodiment as well, the synchronization between video and audio is not lost even when some processing is applied to the video (avatar). The second embodiment of the present invention has been described above; this embodiment may be modified as follows.

(1) As modifications of the present embodiment, modifications (1) to (8) of the first embodiment may be adopted. A mode may also be adopted in which, as in the first embodiment, audio and video are processed by an operation on the avatar as a whole.

(2) When waveform data or feature amount data is used as the audio signal identifier, avatar part data associated with waveforms or features of audio signals may be prepared in advance, and the karaoke apparatus 100 may select each part from the avatar part data according to the input audio signal and combine the selected parts into one avatar.

(3) A mode may be adopted in which a touch operation on the avatar's face changes the shape of the face and a strong compressor effect is applied to the audio signal. In this mode, the avatar's face may change from a circle, which imitates the shape of a human face, to a square or a hexagon. A touch operation on the avatar's face may also change the avatar as a whole. A change of the whole avatar means a change in the avatar's facial expression, hairstyle, or physique; the whole avatar may even change into something other than a human, for example kelp.

<Third Embodiment>
FIG. 15 is a block diagram showing the configuration of a karaoke apparatus 200 according to a third embodiment of the present invention. In FIG. 15, components identical to those in FIG. 1 are denoted by the same reference numerals as in FIG. 1. As is clear from a comparison of FIG. 1 and FIG. 15, the karaoke apparatus 200 differs from the karaoke apparatus 100 in that a feature amount analysis unit 204 is provided in place of the video operation input unit 104, the audio processing parameter generation unit 105, and the video processing parameter generation unit 106. The feature amount analysis unit 204 is likewise a software module realized by the CPU of the karaoke apparatus 200 in accordance with a control program (not shown in FIG. 15) stored in the karaoke apparatus 200. The following description focuses on the feature amount analysis unit 204, which most clearly characterizes the present embodiment.

FIG. 16 is a flowchart showing the flow of processing in which the feature amount analysis unit 204 generates a video processing parameter. The feature amount analysis unit 204 divides the audio signal received from the audio input unit 101 into segments of a preset fixed duration (step SA301), performs audio analysis on each segment thus obtained (step SA302), and generates a video processing parameter reflecting the result of that analysis (step SA303). The audio analysis performed by the feature amount analysis unit 204 has three modes: one that analyzes only the volume of the audio signal, one that analyzes the volume and dynamic range of the audio signal, and one that analyzes only the spectral envelope. The dynamic range is the ratio between the maximum and minimum volume of the audio signal. The spectral envelope is obtained, for example, by applying a Fourier transform to the audio signal of a fixed duration, taking the base-10 logarithm, and multiplying by 20.
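The segmentation of step SA301 and the spectral quantity defined above (Fourier transform, base-10 logarithm, multiplied by 20) can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, a direct discrete Fourier transform is used for self-containedness rather than speed, the logarithm is assumed to be taken of the magnitude of each Fourier coefficient, and a small floor value is an added assumption that avoids taking the logarithm of zero.

```python
import cmath
import math


def split_frames(samples, frame_len):
    """Cut the signal into consecutive fixed-length segments (step SA301).

    A final segment shorter than frame_len is dropped (an assumption).
    """
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def log_spectrum_db(frame, floor=1e-12):
    """Return 20 * log10 of the magnitude of each DFT coefficient of one segment."""
    n = len(frame)
    result = []
    for k in range(n):
        # Direct DFT: correlate the segment with the k-th complex exponential.
        coeff = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        result.append(20.0 * math.log10(max(abs(coeff), floor)))
    return result


signal = [0.0, 0.5, 0.0, -0.5, 0.0, 0.5, 0.0, -0.5, 0.1]
for frame in split_frames(signal, 4):
    print(log_spectrum_db(frame))
```

For real-time use a fast Fourier transform would normally replace the direct sum; the direct form is kept here only so that the sketch needs nothing beyond the standard library.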

In the mode in which the feature amount analysis unit 204 analyzes only the volume of the audio signal, the feature amount analysis unit 204 measures the average volume for each of the fixed-duration segments described above. The feature amount analysis unit 204 stores in advance an average volume table (not shown) that associates the average volume with the overall size of the avatar. The average volume table of the present embodiment stores data indicating that the avatar is made larger as the average volume increases; conversely, it may store data indicating that the avatar is made smaller as the average volume increases. Each time the fixed duration elapses, the feature amount analysis unit 204 generates, from the contents of the average volume table and the measured average volume, a video processing parameter that determines the overall size of the avatar, and gives that video processing parameter to the video processing unit 108.
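The lookup from average volume to avatar size can be sketched as below. The average volume table is not shown in the specification, so the table values, the use of root-mean-square amplitude as the "average volume", and the names `AVERAGE_VOLUME_TABLE`, `average_volume`, and `avatar_size_parameter` are all assumptions made for illustration.

```python
import math

# Hypothetical average volume table: (upper bound of average volume, avatar scale).
# Louder segments map to a larger avatar, as in the present embodiment.
AVERAGE_VOLUME_TABLE = [
    (0.1, 0.5),
    (0.3, 1.0),
    (0.6, 1.5),
    (float("inf"), 2.0),
]


def average_volume(frame):
    """Root-mean-square amplitude of one fixed-duration segment (assumed measure)."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def avatar_size_parameter(frame, table=AVERAGE_VOLUME_TABLE):
    """Return a video processing parameter giving the overall avatar scale."""
    volume = average_volume(frame)
    for upper_bound, scale in table:
        if volume <= upper_bound:
            return {"avatar_scale": scale}
    # Not reached with the default table, whose last bound is infinite.
    return {"avatar_scale": table[-1][1]}


print(avatar_size_parameter([0.5, -0.5, 0.5, -0.5]))
```

Reversing the scale column of the table gives the alternative mode in which a louder voice makes the avatar smaller.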

The video processing unit 108 generates video output data from the background video given by the background video acquisition unit 109, the avatar given by the video generation unit 107, and the video processing parameter given by the feature amount analysis unit 204, and gives the video output data to the video output unit 110. The video output unit 110 displays the video represented by the video output data. The video processing unit 108 receives a plurality of background video data from the background video acquisition unit 109; the background video data arbitrarily selected by the manufacturer or user of the karaoke apparatus 100 (or randomly selected by the video processing unit 108) is used for composition with the processed avatar data. The video processing unit 108 performs the composition so that the avatar data is positioned at the center of the background video data. Note that the audio input unit 101 may pick up ambient noise as audio even when the singer is not singing. In this mode, therefore, the feature amount analysis unit 204 may be made to generate, and give to the video processing unit 108, a video processing parameter that hides the avatar when the volume of the audio signal given from the audio input unit 101 is at or below a preset threshold. The same effect is obtained by preventing the video generation unit 107 from outputting avatar data corresponding to the audio signal when the volume of that audio signal is at or below the preset threshold.

Next, in the mode in which the feature amount analysis unit 204 analyzes the volume and dynamic range of the audio signal, the feature amount analysis unit 204 measures the average volume and dynamic range of the audio signal given from the audio input unit 101 for each preset fixed duration, and generates a video processing parameter that changes the avatar's face color to a warm color. More specifically, the feature amount analysis unit 204 judges that the singer is singing with high enthusiasm when the average volume of the audio signal exceeds a threshold and the average dynamic range of the audio signal is smaller than a threshold. So that the avatar also appears to be singing enthusiastically, the feature amount analysis unit 204 then generates a video processing parameter that changes the avatar's face color to a warm color and gives it to the video processing unit 108. The subsequent processing is the same as in the mode in which the feature amount analysis unit 204 analyzes only the average volume. The feature amount analysis unit 204 may also store the average volume table and generate a video processing parameter that changes the avatar's face color to a warm color and also changes the avatar's size according to the average volume of the audio signal.
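The enthusiasm judgment described above reduces to two threshold comparisons. The sketch below assumes that each fixed-duration segment has already been reduced to a list of short-term volume values, from which the average volume and the dynamic range (maximum divided by minimum) are computed; the function names, the threshold values in the usage lines, and the parameter format are hypothetical.

```python
def dynamic_range(volumes, floor=1e-9):
    """Ratio of the maximum to the minimum volume; the floor avoids division by zero."""
    return max(volumes) / max(min(volumes), floor)


def face_color_parameter(volumes, volume_threshold, range_threshold):
    """Return a video processing parameter for the avatar's face color."""
    average = sum(volumes) / len(volumes)
    # Enthusiastic singing: loud on average and with little variation in volume.
    enthusiastic = average > volume_threshold and dynamic_range(volumes) < range_threshold
    return {"face_color": "warm" if enthusiastic else "unchanged"}


# Loud and steady: judged enthusiastic.
print(face_color_parameter([0.8, 0.9, 0.85], volume_threshold=0.5, range_threshold=2.0))
# Loud on average but with a wide swing in volume: not judged enthusiastic.
print(face_color_parameter([0.1, 0.9, 0.8], volume_threshold=0.5, range_threshold=2.0))
```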

Finally, in the mode in which the feature amount analysis unit 204 analyzes only the spectral envelope, the feature amount analysis unit 204 measures the spectral envelope of the audio signal given from the audio input unit 101 for each preset fixed duration. The feature amount analysis unit 204 then reads, from a storage device (not shown), the spectral envelope of the voice of the original artist of the song being sung and that artist's avatar, and compares the stored envelope with the measured spectral envelope of the singer's voice using, for example, a correlation function. According to the result of this comparison (that is, the value of the correlation function), the feature amount analysis unit 204 generates a video processing parameter that morphs between the avatar corresponding to the input audio signal and the original artist's avatar, and gives it to the video processing unit 108. The subsequent processing is the same as in the mode in which the feature amount analysis unit 204 analyzes only the average volume. In this mode, when the measured spectral envelope completely matches the original spectral envelope, the avatar displayed on the video output unit 110 becomes the original artist's avatar.
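One way to turn the comparison result into a morphing amount is sketched below. The specification says only that a correlation function is used, for example; the choice of the Pearson correlation coefficient, the clamping of its value to the range 0 to 1, and the names `correlation` and `morph_weight` are assumptions. A weight of 1 stands for the original artist's avatar and 0 for the singer's own avatar.

```python
import math


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length envelopes (assumed measure)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                     sum((y - mean_b) ** 2 for y in b))
    # A flat envelope has no variation to correlate; treat it as no similarity.
    return cov / norm if norm > 0.0 else 0.0


def morph_weight(singer_envelope, artist_envelope):
    """Map the correlation to a morphing weight clamped to the range 0..1."""
    return min(1.0, max(0.0, correlation(singer_envelope, artist_envelope)))


artist = [-20.0, -12.0, -6.0, -15.0]
print(morph_weight(artist, artist))                      # identical envelopes
print(morph_weight([-6.0, -15.0, -20.0, -12.0], artist)) # dissimilar envelopes
```

Under this mapping, identical envelopes give a weight of essentially 1, which corresponds to the case described above in which the displayed avatar becomes the original artist's avatar.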

In this mode as well, the average volume table may be stored in the feature amount analysis unit 204 so that, in addition to the morphing according to similarity to the original spectral envelope, the feature amount analysis unit 204 generates a video processing parameter that changes the avatar's size according to the average volume; it may further generate a video processing parameter that changes the avatar's face color to a warm color according to the degree of enthusiasm. A mode may also be adopted in which the feature amount analysis unit 204 is connected to a host computer via a network, and the spectral envelope of the original artist's voice and the artist's avatar are delivered directly from the host computer to the feature amount analysis unit 204 rather than read from the storage device (not shown).

The audio processing unit 102 distributes the audio signal given from the audio input unit 101 to the audio output unit 103 and outputs it. This is the same as in the first embodiment; however, the audio processing unit 102 of the present embodiment differs from that of the first embodiment in that it is given no audio processing parameter and applies no processing to the audio signal.
The above is the configuration of the karaoke apparatus 200 of the present embodiment.

With this configuration, the karaoke apparatus 200 of the present embodiment generates video processing parameters corresponding to the input audio signal, so the synchronization between video and audio is not lost. Furthermore, because the avatar changes according to the audio signal, the user of the karaoke apparatus 200 can visually grasp changes in the audio from the video displayed on the video output unit 110. The karaoke apparatus 200 of the third embodiment of the present invention has been described above; this embodiment may be modified as follows.

(1) As modifications of the present embodiment, modifications (2), (3), (6), (7), and (8) of the first embodiment may be adopted.

(2) When generating a video processing parameter that changes the avatar's face color to a warm color, the degree of enthusiasm need not be judged from the volume and dynamic range; instead, the singer's average body temperature may be measured for each preset fixed duration using thermography and used to judge the degree of enthusiasm. A singer's body temperature should rise while singing enthusiastically. Therefore, when the singer's average body temperature measured by thermography exceeds a preset threshold, the feature amount analysis unit 204 generates a video processing parameter that changes the avatar's face color to a warm color and gives it to the video processing unit 108. The mode that measures the average volume and average dynamic range of the audio signal may also be combined with the mode that measures the average body temperature. In that case, the feature amount analysis unit 204 generates a video processing parameter that changes the avatar's face color to a warm color when the average volume of the audio signal exceeds its threshold, the average dynamic range of the audio signal is smaller than its threshold, and the singer's average body temperature exceeds its threshold. Of course, the feature amount analysis unit 204 may also store the average volume table and generate a video processing parameter that changes the avatar's size according to the average volume of the audio signal, as described above.

(3) The present embodiment may be combined with the first embodiment. In this case, the video processing parameter generation unit 106 and the feature amount analysis unit 204 may coexist, or they need not. In the mode in which they do not coexist, the feature amount analysis unit 204 is absent, and the video processing parameter generation unit 106 receives the audio signal from the audio input unit 101 and generates the video processing parameter; that is, the video processing parameter generation unit 106 takes over the function of the feature amount analysis unit 204.

<Other Embodiments>
The first to third embodiments describe examples of applying the present invention to a karaoke apparatus. However, the present invention may also be applied to signal processing apparatuses other than karaoke apparatuses, specifically as follows.

(1) The sound picked up by the audio input unit 101 is not limited to the human voice; it may be the sound of a musical instrument, or an accompaniment in which various sounds such as voices and instruments are mixed. Even for such sounds other than the human voice, the video generation unit 107 generates an avatar corresponding to each sound, so that the user can visually grasp the characteristics of each sound and process each sound visually.

(2) The first to third embodiments describe the case in which the audio signal representing the sound picked up by the audio input unit 101 is input in real time. However, for prerecorded audio and prerecorded background video that are reproduced in synchronization with each other, audio processing corresponding to the video processing applied to the video may be applied to the audio, or video processing corresponding to the audio may be applied to the video. In this mode, the audio input unit 101 and the background video acquisition unit 109 can be omitted. Furthermore, because the audio output unit 103 and the video output unit 110 can be replaced by, for example, the output devices of a personal computer, this mode can be realized in software alone. This mode is thus suitable for video editing and the like on a personal computer.

(3) The present invention may be applied to a portable device such as a mobile phone. When applied to a mobile phone, for example, the audio input unit 101 corresponds to the phone's microphone and the audio output unit 103 to its speaker. The background video data acquired by the background video acquisition unit 109 corresponds, for example, to video data stored in the phone, and the video output unit 110 and the video operation input unit 104 correspond to the phone's screen. Sound picked up by the microphone can be processed according to the stored video, and the stored video can be processed according to the sound picked up by the microphone. The present invention may also be applied to stationary or portable game machines. In this case, the controller or touch panel of the game machine corresponds to the video operation input unit 104. The touch panel is operated with a touch pen or stylus. Furthermore, when the game machine supports a motion capture device, the motion sensor corresponds to the video operation input unit 104.

(4) The present invention may be applied to a server apparatus for an application service provider (ASP). Specifically, a server apparatus having the audio processing unit 102, audio processing parameter generation unit 105, video processing parameter generation unit 106, video generation unit 107, and video processing unit 108 of the first or second embodiment, or a server apparatus having the audio processing unit 102, video generation unit 107, video processing unit 108, and feature amount analysis unit 204 of the third embodiment, is connected to a telecommunication line such as the Internet. The former server apparatus receives an audio signal, background video data, and operation content data via the telecommunication line, applies audio processing and video processing, and transmits the processed audio signal and output video data via the telecommunication line. The latter server apparatus receives an audio signal and background video data via the telecommunication line, applies audio processing and video processing, and transmits the processed audio signal and output video data via the telecommunication line. In this mode, the user transmits from his or her own personal computer the signals representing audio and video to be reproduced in synchronization with each other (plus operation content data in the case of the former server apparatus) to the server apparatus via the telecommunication line, and receives and reproduces the audio signal and video data processed by the server apparatus via the network, thereby obtaining the same video processing and audio processing as in the above embodiments.

In this mode, the user may also have the server apparatus process the video data and audio signal that the user transmitted together with the video data and audio signal that another user transmitted to the server apparatus, and receive the processed video data and audio signal. For example, the user's own avatar and the other user's avatar may be combined to generate video data in which the two avatars appear to be singing a duet, and the audio signal representing the user's own singing voice and the audio signal representing the other user's singing voice may be combined to generate an audio signal that sounds like a duet.

(5) In the first to third embodiments, the audio output unit 103 is a sound emitting device that emits the sound represented by the audio signal given from the audio processing unit 102, but it may be an external device other than a sound emitting device, such as an amplifier or mixer. When the karaoke apparatus 100 is used for recording, the external device may be a recording device. A mode in which the external device and a sound emitting device are used together is also possible.

Description of reference numerals: 100, 200: karaoke apparatus; 101: audio input unit; 102: audio processing unit; 103: audio output unit; 104: video operation input unit; 105: audio processing parameter generation unit; 106: video processing parameter generation unit; 107: video generation unit; 108: video processing unit; 109: background video acquisition unit; 110: video output unit; 204: feature amount analysis unit.

Claims

1. A signal processing apparatus comprising:
a parameter generation unit that generates, according to an input sound signal, a video processing parameter representing processing content for a video to be displayed on a display device; and
a video processing unit that applies processing based on the video processing parameter to video data representing a video to be displayed on the display device in synchronization with the sound represented by the input sound signal, and supplies the processed video data to the display device.

2. The signal processing apparatus according to claim 1, further comprising:
a video generation unit that generates, based on the input sound signal, the video data to be processed by the video processing unit;
a video operation input unit for inputting an operation on a video displayed on the display device; and
a sound processing unit that applies processing based on a sound processing parameter to the input sound signal and outputs the processed signal,
wherein the parameter generation unit executes, in place of the process of generating the video processing parameter according to the input sound signal, a process of generating the video processing parameter according to an operation input to the video operation input unit, or a process of generating the video processing parameter according to both the input sound signal and an operation input to the video operation input unit, and generates the sound processing parameter according to the operation input to the video operation input unit.

3. The signal processing apparatus according to claim 2, wherein
the input sound signal includes a sound signal representing a user's singing voice,
the video data is data representing each part of an avatar that imitates the user,
a different acoustic effect is assigned to each part of the avatar, and
the parameter generation unit generates, for each part of the avatar operated via the video operation input unit, a video processing parameter representing processing content for that part, and generates a sound processing parameter for adjusting the acoustic effect corresponding to the operated part.