JP2009049873A

JP2009049873A - Information processing apparatus

Info

Publication number: JP2009049873A
Application number: JP2007215778A
Authority: JP
Inventors: Atsushi Mae; 篤前
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2007-08-22
Filing date: 2007-08-22
Publication date: 2009-03-05

Abstract

PROBLEM TO BE SOLVED: To produce pseudo 5.1ch sound information from 2ch input sound signals.

SOLUTION: An information processing apparatus comprises: a video frame buffer 100; a sound processing block 200 which receives left and right 2ch sound signals and produces 4ch sound signals; a sound synthesizing block 300 which weights and synthesizes the 4ch sound signals supplied from the sound processing block 200 to produce 5.1ch surround sound signals; and a sound synthesis control block 500 which includes an image recognizing function and controls a synthesis parameter used for synthesizing the 4ch sound signals in the sound synthesizing block 300, based on a result of recognizing image signals synchronized with the sound signals.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to an information processing apparatus that records, or reproduces and outputs, information content consisting of an image signal and an audio signal synchronized with it, and in particular to an information processing apparatus that processes audio information for a surround reproduction environment.

More specifically, the present invention relates to an information processing apparatus that obtains a sense of presence like that of 5.1ch surround by creating pseudo 5.1ch audio information when reproducing content shot with an ordinary 2ch stereo microphone, and to an information processing apparatus that creates and records pseudo 5.1ch audio information in a video camera equipped only with an ordinary 2ch stereo microphone.

Home video cameras are already in widespread use. In recent years, a growing number of digital video cameras digitally encode moving-image and audio data and record and manage them as computer files on media such as DVDs (Digital Versatile Discs) and hard disks. In addition, combining a digital camera with image recognition technology enables subject recognition, and techniques for automating camera work, such as autofocus (AF) and automatic exposure (AE) according to the position and size of the subject image, are also being advanced.

Meanwhile, as a reproduction system for moving images and image data, surround reproduction techniques are known in which a plurality of speakers are arranged around a viewing user at an assumed position to produce sound that is close to the actual sound source, that is, sound with a sense of presence. As an audio data signal format for realizing a surround reproduction environment, AC-3, a high-efficiency coding scheme for digital multi-channel audio signals developed by Dolby Laboratories in the United States, is known, for example. An AC-3 surround reproduction system consists of, for example, five speakers: a left-channel speaker L arranged at the front left of the viewer, a center-channel speaker C arranged at the front center of the viewer, a right-channel speaker R arranged at the front right of the viewer, and surround-channel speakers Ls and Rs arranged at the rear left and rear right of the viewer, respectively (see FIG. 3). The number of audio channels is 5.1: two front left and right channels, one front center channel, and two rear channels, plus a low-frequency-only channel (0.1 channel) for driving a subwoofer.

With the spread of large-capacity recording media such as DVDs, surround reproduction environments with 5.1 channels, home theaters among them, are becoming common in the home. Home video cameras that can record audio in 5.1 channels have also appeared. When content shot with 5.1-channel audio is reproduced in a 5.1ch surround environment, the user can enjoy a sense of presence as if actually at the scene, which is a significant benefit for the user.

For example, a video camera has been proposed that includes a plurality of microphones and records on a recording medium, simultaneously with the video signal, both a first plurality of audio signals obtained by processing the audio signals output from the respective microphones into audio signals of a plurality of channels and a second audio signal obtained by processing the audio signals of all channels output from the respective microphones into an audio signal of one channel, the recording being performed so that the first plurality of audio signals and the second audio signal can be reproduced independently (see, for example, Patent Document 1).

A recording apparatus for a video camera has also been proposed that includes: four or more microphones M1, M2, M3, M4, ... Mn arranged on the video camera so as to pick up sound from at least four different directions during shooting; audio synthesis means that synthesizes the audio output signals m1, m2, m3, m4, ... mn output from the respective microphones to generate four-channel (Rch, Lch, Cch, Sch) audio signals consisting of an audio signal R from the front right with respect to the shooting direction, an audio signal L from the front left, an audio signal C from the front center, and a surround audio signal S from a direction different from these three; a matrix encoder that performs signal processing to convert the four-channel audio signals R, L, C, and S output by the audio synthesis means into two-channel audio data Lt and Rt according to predetermined arithmetic expressions and outputs the result; and audio data recording means that records the two-channel (Ltch, Rtch) audio data Lt and Rt output from the matrix encoder on a recording medium. Although the recorded audio data has the same two channels as before, matrix decoding at reproduction time enables multi-channel surround reproduction of four or more channels, yielding reproduced sound with a sense of presence (see, for example, Patent Document 2).

However, for home digital cameras, whose pricing is constrained, supporting 5.1ch surround involves various restrictions such as license acquisition, and the shape of the unit makes it difficult to arrange multi-channel microphones (five microphones L, C, R, Ls, and Rs corresponding to the five speakers). As a result, many video cameras can still record only two channels.

Patent Document 1: JP 2003-18543 A

Patent Document 2: JP 2005-223706 A

An object of the present invention is to provide an excellent information processing apparatus capable of suitably processing audio information for a surround reproduction environment when recording, or reproducing and outputting, information consisting of moving images and audio.

A further object of the present invention is to provide an excellent information processing apparatus capable of obtaining a sense of presence like that of 5.1ch surround by creating pseudo 5.1ch audio information when reproducing content shot with an ordinary 2ch stereo microphone.

A further object of the present invention is to provide an excellent information processing apparatus capable of creating and recording pseudo 5.1ch audio information in a video camera equipped only with an ordinary 2ch stereo microphone.

The present invention has been made in view of the above problems, and is an information processing apparatus that records, or reproduces and outputs, information content consisting of an image signal and an audio signal synchronized with it, the apparatus comprising:
an audio processing block that applies signal processing to input audio signals L and R consisting of left and right two channels to create an omnidirectional audio signal C, further creates from the audio signal C an audio signal E to which a specific effect has been applied, and outputs four-channel audio signals L, R, C, and E;
an audio synthesis block that performs weighted synthesis of the four-channel audio signals L, R, C, and E output from the audio processing block to generate a surround audio signal including five channels consisting of a left-channel audio signal L corresponding to the front left of the viewer, a center-channel audio signal C corresponding to the front center of the viewer, a right-channel audio signal R corresponding to the front right of the viewer, and surround-channel audio signals Ls and Rs corresponding to the rear left and rear right of the viewer, respectively; and
an audio synthesis control block that includes image recognition means for recognizing an input image signal synchronized with the audio signal and that controls, based on the image recognition result, the synthesis parameters used when the audio synthesis block synthesizes the four-channel audio signals L, R, C, and E.

The audio synthesis block may, however, also perform weighted synthesis of the four-channel audio signals L, R, C, and E output from the audio processing block to additionally generate an audio signal LFE for a low-frequency-only channel (0.1 channel) for driving a subwoofer, and thereby synthesize and output 5.1-channel audio signals.

The audio processing block uses an audio filter to create, from the audio signal C, the audio signal E to which a specific filter effect has been applied. Specifically, this audio filter is a band-pass filter that passes only components in a specific frequency band.

As a reproduction system for moving images and image data, 5.1-channel surround reproduction techniques are known, typified by AC-3 developed by Dolby Laboratories in the United States, in which a plurality of speakers are arranged around the viewing user to produce sound that is close to the actual sound source, that is, sound with a sense of presence. The user can enjoy a sense of presence as if actually at the scene, which is a significant benefit.

However, for home digital cameras, whose pricing is constrained, supporting 5.1-channel surround involves various restrictions such as license acquisition, and the shape of the unit makes it difficult to arrange multi-channel microphones.

In contrast, the information processing apparatus according to the present invention is configured to create pseudo 5.1ch audio information using image recognition information when reproducing or recording content shot with an ordinary 2ch stereo microphone, so that a sense of presence like that of 5.1ch surround can be obtained from AV content such as that captured by a video camera equipped only with a two-channel microphone.

Specifically, the audio processing block first applies signal processing to the input audio signals L and R consisting of left and right two channels to create an omnidirectional audio signal C, and further creates from the audio signal C an audio signal E to which a specific effect has been applied, yielding four-channel audio signals L, R, C, and E. The audio synthesis block then synthesizes, from these four-channel audio signals L, R, C, and E, a total of 5.1 channels: five channels consisting of a left-channel audio signal L corresponding to the front left of the viewer, a center-channel audio signal C corresponding to the front center of the viewer, a right-channel audio signal R corresponding to the front right of the viewer, and surround-channel audio signals Ls and Rs corresponding to the rear left and rear right of the viewer, respectively, plus an audio signal LFE for the low-frequency-only channel (0.1 channel) for driving a subwoofer. The audio synthesis control block controls the synthesis parameters used when the audio synthesis block synthesizes the four-channel audio signals L, R, C, and E, based on the image recognition result for the input image signal synchronized with the audio signal.

The audio synthesis control block may, for example, determine the synthesis parameters used when the audio synthesis block synthesizes the four-channel audio signals L, R, C, and E based on the position and size of the subject in the screen recognized by the image recognition means.

The audio synthesis control block may also control the audio filter that the audio processing block uses to create, from the omnidirectional audio signal C, the audio signal E to which a specific effect has been applied. For example, the characteristics of the audio filter in the audio processing block may be determined based on the number or type of subjects recognized by the image recognition means.

According to the present invention, it is possible to provide an excellent information processing apparatus capable of obtaining a sense of presence like that of 5.1ch surround by creating pseudo 5.1ch audio information using image recognition information when reproducing content shot with an ordinary 2ch stereo microphone.

Further, according to the present invention, it is possible to provide an excellent information processing apparatus capable of creating and recording pseudo 5.1ch audio information using image recognition information in a video camera equipped only with an ordinary 2ch stereo microphone.

Other objects, features, and advantages of the present invention will become apparent from the more detailed description based on the embodiments of the present invention described later and the accompanying drawings.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 schematically shows the configuration of an information processing apparatus according to an embodiment of the present invention. This information processing apparatus, for example, receives a video signal and a 2ch audio signal from a DVD playback device and reproduces and outputs them in surround, or receives a video signal and an audio signal from a video camera equipped only with a 2ch stereo microphone and performs processing for recording compatible with surround reproduction.

As shown in FIG. 1, the information processing apparatus includes: a video frame buffer 100; an audio processing block 200 that receives left and right 2ch audio signals and creates 4ch audio signals; an audio synthesis block 300 that performs weighted synthesis of the 4ch audio signals supplied from the audio processing block 200 to generate 5.1ch surround audio signals; and an audio synthesis control block 500 that has an image recognition function and controls the synthesis parameters used when the audio synthesis block 300 synthesizes the 4ch audio signals, based on the result of recognizing the image signal synchronized with the audio signal.

The video frame buffer 100 temporarily stores the transmitted video signal for image recognition. The video signal is a reproduced video signal supplied from a DVD playback device (not shown) or the like, or a video signal captured by a video camera (not shown).

The audio processing block 200 applies signal processing, such as superimposing or combining the input left and right 2ch audio signals L and R, to create an omnidirectional audio signal C, and further creates an audio signal E by applying a specific effect to this omnidirectional audio signal C. The audio processing block 200 then outputs four channels to the downstream audio synthesis block 300: the left and right 2ch audio signals L and R, the omnidirectional audio signal C, and the audio signal E obtained by applying the specific effect to the audio signal C.
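The 2ch-to-4ch step described above can be sketched as follows. This is a minimal illustration, not the implementation of the audio processing block 200: the equal-weight average used to form C and the function names `make_center` and `make_four_channels` are assumptions, and the effect that produces E is passed in as a callable because the text leaves its details open.

```python
from typing import Callable, List, Sequence, Tuple

Signal = List[float]


def make_center(left: Sequence[float], right: Sequence[float]) -> Signal:
    """Combine L and R into one signal C (assumed here: equal-weight average)."""
    if len(left) != len(right):
        raise ValueError("L and R must have the same number of samples")
    return [0.5 * (l + r) for l, r in zip(left, right)]


def make_four_channels(
    left: Sequence[float],
    right: Sequence[float],
    effect: Callable[[Signal], Signal],
) -> Tuple[Signal, Signal, Signal, Signal]:
    """Return (L, R, C, E), where E is C passed through the given effect."""
    center = make_center(left, right)
    return list(left), list(right), center, effect(center)
```

Any function that maps a list of samples to a list of samples, such as a band-pass filter, can be supplied as `effect`.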

The audio signal E is obtained by extracting only specific components from the omnidirectional audio signal C through an audio filter. Passing through the audio filter delays the audio signal E somewhat, so appropriate delay elements are placed on the transmission paths of the other three audio signals L, R, and C to keep all four channels synchronized.
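The role of the delay elements can be illustrated with a short sketch. It assumes the filter latency is known as a whole number of samples; the function name `delay` and the zero padding at the start are choices made for this example only.

```python
from typing import List, Sequence


def delay(signal: Sequence[float], n_samples: int) -> List[float]:
    """Delay a signal by n_samples, padding the start with silence and keeping the length."""
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    if n_samples == 0:
        return list(signal)
    padded = [0.0] * n_samples + list(signal)
    return padded[: len(signal)]
```

Applying `delay` with the same `n_samples` to L, R, and C would line them up with an E that lags by that many samples.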

The audio filter for generating the audio signal E is, for example, a band-pass filter (BPF) that passes only the components of the omnidirectional audio signal C within a specific frequency band. For example, the audio filter can be configured with a band-pass filter that passes only the band of male voices.
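One possible band-pass filter is sketched below as a second-order (biquad) section using the widely published "Audio EQ Cookbook" coefficients. The text does not specify the filter design, so the biquad form, the function name `bandpass`, and the parameters `fs`, `f0`, and `q` are all assumptions for illustration.

```python
import math
from typing import List, Sequence


def bandpass(signal: Sequence[float], fs: float, f0: float, q: float = 1.0) -> List[float]:
    """Second-order band-pass filter with 0 dB peak gain at the center frequency f0."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0            # b1 is zero for this filter
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0

    out: List[float] = []
    x1 = x2 = y1 = y2 = 0.0                      # previous two inputs and outputs
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out
```

The center frequency and Q that would correspond to a "male voice band" are not given in the text and would have to be chosen by the implementer.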

The frequency characteristics of the audio filter need not be fixed; the audio synthesis control block 500, which has an image recognition function, may control them based on the image recognition result for the image signal input in synchronization with the audio signal. For example, when the image recognition block 550 recognizes the image signal temporarily held in the video frame buffer 100 and finds that the (main) subject is male, the audio filter may be set to a frequency band that passes only the band of male voices.

Furthermore, the audio synthesis control block 500 may switch the frequency characteristics of the audio filter according to the number of (target) subjects recognized by the image recognition block 550, so that the audio processing block 200 generates a plurality of audio signals E. For example, when an adult male and a child are recognized in one screen, two band-pass filters extract the adult male voice band and the child voice band, respectively, to generate two audio signals E1 and E2, which are output to the downstream audio synthesis block 300. In this case the number of audio signals between the audio processing block 200 and the audio synthesis block 300 is variable, so the audio signals may be exchanged as digital data.
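The selection of one filter per recognized subject can be sketched as a simple lookup. The subject labels and the pass-band values in `VOICE_BANDS` are hypothetical placeholders chosen for this example; the text names the adult male and child voice bands but gives no frequencies.

```python
from typing import Dict, List, Tuple

# Hypothetical pass bands in Hz as (low, high); placeholder values, not from the source.
VOICE_BANDS: Dict[str, Tuple[float, float]] = {
    "adult_male": (85.0, 180.0),
    "child": (250.0, 400.0),
}


def select_bands(subject_types: List[str]) -> List[Tuple[float, float]]:
    """Return one pass band per distinct recognized subject type, in first-seen order."""
    bands: List[Tuple[float, float]] = []
    seen = set()
    for kind in subject_types:
        if kind in VOICE_BANDS and kind not in seen:
            seen.add(kind)
            bands.append(VOICE_BANDS[kind])
    return bands
```

Each returned band would configure one band-pass filter, producing signals such as E1 and E2; because the list length varies with the scene, the number of E signals varies as well.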

The audio synthesis block 300 synthesizes, from the 4ch audio signals L, R, C, and E output from the audio processing block 200, 5.1ch surround audio signals in total, consisting of a left-channel audio signal L corresponding to the front left of the viewer, a center-channel audio signal C corresponding to the front center of the viewer, a right-channel audio signal R corresponding to the front right of the viewer, and surround-channel audio signals Ls and Rs corresponding to the rear left and rear right of the viewer, respectively. This makes pseudo 5.1ch reproduction or pseudo 5.1ch recording possible. Specifically, based on the sets of four synthesis parameters P_L, P_R, P_C, and P_E that the audio synthesis control block 500 determines from the image recognition result of the image recognition block 550, the audio synthesis block 300 performs weighted synthesis of the 4ch audio signals L, R, C, and E according to the following equation to calculate the 5.1ch audio signals L, R, C, Ls, and Rs.
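The weighted synthesis can be sketched as a per-sample linear combination. The equation itself is not reproduced in this text, so the form below, in which each output channel is a weighted sum of L, R, C, and E with its own set of four gains, is an assumption; the function name `synthesize` and the dictionary layout are likewise illustrative.

```python
from typing import Dict, List, Sequence

INPUTS = ("L", "R", "C", "E")
OUTPUTS = ("L", "R", "C", "Ls", "Rs")


def synthesize(
    channels: Dict[str, Sequence[float]],
    weights: Dict[str, Dict[str, float]],
) -> Dict[str, List[float]]:
    """Mix the four input channels into five output channels.

    weights[out][inp] is the gain applied to input `inp` in output `out`
    (assumed to play the role of P_L, P_R, P_C, P_E for that output).
    """
    n = len(channels["L"])
    if any(len(channels[name]) != n for name in INPUTS):
        raise ValueError("all input channels must have the same length")
    result: Dict[str, List[float]] = {}
    for out in OUTPUTS:
        w = weights[out]
        result[out] = [
            sum(w[name] * channels[name][i] for name in INPUTS) for i in range(n)
        ]
    return result
```

Changing the entries of `weights` between frames corresponds to the control block updating the synthesis parameters as the recognized scene changes.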

The audio synthesis control block 500 accesses memory via the signal line 600, analyzes the image in the video frame buffer 100 with the image recognition block 550, and, based on the resulting attribute information for the recognition targets in the image (target type, position, size, and so on), changes the synthesis parameters of the audio synthesis block 300 as needed via the signal line 610. The audio synthesis control block 500 may also determine the frequency characteristics of the audio filter in the audio processing block 200 via a signal line (not shown) according to the attribute information of the targets recognized in the image (subject type, sex, age, and so on), as described above.

In the example shown in FIG. 1, the audio synthesis control block 500 includes a processor 510, a ROM (Read Only Memory) 520, a RAM (Random Access Memory) 530, an input/output interface 540, an image recognition block 550 that can recognize video images in the video frame buffer 100 via the signal line 600, and a bus 560 that interconnects them.

By executing a predetermined program, the processor 510 performs processing to change, as needed, the synthesis parameters used when the audio synthesis block 300 performs weighted synthesis of the 4ch audio signals L, R, C, and E, and sets them in the audio synthesis block 300 via the signal line 610. Likewise, by executing a predetermined program, the processor 510 performs processing to change, as needed, the characteristics of the audio filter in the audio processing block 200, and sets them in the audio processing block 200 via a signal line (not shown).

The ROM 520 is a memory that holds the programs executed by the processor 510, various parameters, and the like, and is configured as, for example, an EEPROM such as flash memory. The programs stored in the ROM 520 include programs that implement the algorithm described above for changing the synthesis parameters used when the audio synthesis block 300 performs weighted synthesis of the audio signals and the algorithm for changing the characteristics of the audio filter in the audio processing block 200.

The RAM 530 is a memory that holds working data and the like needed for program execution by the processor 510. It is configured as a readable and writable memory device such as SRAM (Static RAM) or DRAM (Dynamic RAM), and is used mainly as the working memory of the processor 510.

The input/output interface 540 implements the interface protocol for exchanging data with an external device (not shown), and is used, for example, to update the programs in the ROM 520.

The image recognition block 550 accesses memory via the signal line 600, analyzes the image in the video frame buffer 100, and creates attribute information for the recognition targets in the image (target type, position, size, and so on). In particular, the image recognition block 550 applies face recognition to detect and recognize subjects. The face recognition processing consists of, for example, face detection processing that detects the position of a face image and extracts it as a detected face, facial organ detection processing that detects the positions of the main facial organs in the detected face, and face identification processing that identifies the detected face (determines who the person is). However, the gist of the present invention is not limited to any specific image recognition technique, so no further description is given in this specification.

As already described, by executing a predetermined program, the processor 510 performs processing to change, as needed, the sets of four synthesis parameters P_L, P_R, P_C, and P_E used when the audio synthesis block 300 performs weighted synthesis of the 4ch audio signals L, R, C, and E to generate the 5.1ch audio signals L, R, C, Ls, and Rs. One algorithm for changing the synthesis parameters is to determine them based on the position and size of the target (subject) in the screen recognized by the image recognition block 550.

Here, when the image recognition block 550 detects a person (or another target such as a pet, for example a dog, or a specific machine, for example an automobile) in the screen, the synthesis parameters P_L, P_R, P_C, and P_E used to generate each of the 5.1ch audio signals L, R, C, Ls, and Rs are determined, for example, as shown in Table 1 below.

The variables α, β, γ, and δ in the above table are determined according to the position of the target detected in the screen. For example, as shown in FIG. 2, letting a be the distance from the center of the screen to the target and l be the distance from the left edge of the screen to the center of the screen, the variables α, β, γ, and δ can each be determined by the following equations. The five speakers L, C, R, Ls, and Rs in the figure are assumed to be arranged as in a 5.1ch surround reproduction system.
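How a screen position might be turned into channel weights can be illustrated with a simple linear pan driven by the ratio a / l. The equations for α, β, γ, and δ are not reproduced in this text, so the sketch below does not attempt to recover them; the linear pan law, the function name `pan_gains`, and the `side` argument are all assumptions introduced for illustration.

```python
from typing import Tuple


def pan_gains(a: float, l: float, side: str) -> Tuple[float, float]:
    """Return (left_gain, right_gain) for a target offset `a` from the screen center.

    `l` is the distance from the screen edge to the center; `side` is "left" or "right".
    Linear panning is an assumption made for illustration only.
    """
    if l <= 0:
        raise ValueError("l must be positive")
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    x = min(max(a / l, 0.0), 1.0)      # normalized offset, clamped to [0, 1]
    near, far = (1.0 + x) / 2.0, (1.0 - x) / 2.0
    return (near, far) if side == "left" else (far, near)
```

A target at the screen center gets equal gains, and a target at the edge is routed entirely to the channel on its side.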

On the other hand, when the image recognition block 560 does not detect a target such as a person in the screen, the synthesis parameters P_L, P_R, P_C, and P_E used to generate each of the 5.1ch audio signals L, R, C, Ls, and Rs are determined, for example, as shown in Table 2 below.

The present invention has been described in detail above with reference to specific embodiments. However, it is obvious that those skilled in the art can modify or substitute the embodiments without departing from the gist of the present invention.

The information processing apparatus according to the present invention can be applied when content shot with an ordinary 2ch stereo microphone is reproduced in pseudo 5.1ch, or when a video camera equipped with only an ordinary 2ch stereo microphone records in pseudo 5.1ch.

In short, the present invention has been disclosed by way of example, and the contents of this specification should not be interpreted restrictively. To determine the gist of the present invention, the claims should be taken into consideration.

FIG. 1 is a diagram schematically showing the configuration of an information processing apparatus that records or reproduces and outputs information consisting of moving images and audio, according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining the formulas used to determine the synthesis parameters P_L, P_R, P_C, and P_E used to generate each of the 5.1ch audio signals L, R, C, Ls, and Rs when a target such as a person is detected in the screen.
FIG. 3 is a diagram schematically showing the configuration of an AC-3 surround reproduction system.

Explanation of symbols

100 ... Video frame buffer
200 ... Audio processing block
300 ... Audio synthesis block
500 ... Audio synthesis control block
510 ... Processor
520 ... ROM
530 ... RAM
540 ... Input/output interface
550 ... Image recognition block
560 ... Bus
600, 610 ... Signal lines

Claims

An information processing apparatus for recording or reproducing and outputting information content including an image signal and an audio signal synchronized with the image signal, the apparatus comprising:
an audio processing block that applies signal processing to input audio signals L and R of two left and right channels to generate an omnidirectional audio signal C, further generates from the audio signal C an audio signal E to which a specific effect is applied, and outputs 4-channel audio signals L, R, C, and E;
an audio synthesis block that weights and synthesizes the 4-channel audio signals L, R, C, and E output from the audio processing block to generate surround audio signals including a left-channel audio signal L corresponding to the front left of the viewer, a center-channel audio signal C corresponding to the front center of the viewer, a right-channel audio signal R corresponding to the front right of the viewer, and surround-channel audio signals Ls and Rs corresponding to the left and right sides of the viewer, respectively; and
an audio synthesis control block that includes image recognition means for recognizing an input image signal synchronized with the audio signal, and that controls, based on the image recognition result, the synthesis parameters used when the audio synthesis block synthesizes the 4-channel audio signals L, R, C, and E.

The information processing apparatus according to claim 1, wherein the audio processing block uses an audio filter to generate, from the audio signal C, the audio signal E to which a specific filter effect is applied.

The information processing apparatus according to claim 2, wherein the audio filter is a bandpass filter that passes only components in a specific frequency band.

The information processing apparatus according to claim 3, wherein the audio synthesis control block controls the frequency characteristics of the audio filter based on the result of the image recognition means recognizing the image signal input in synchronization with the audio signal.

The information processing apparatus according to claim 4, wherein the audio synthesis control block switches the frequency characteristics of the audio filter according to the number of subjects recognized by the image recognition means, and causes a plurality of audio signals E to be generated by the audio synthesis block.

The information processing apparatus according to claim 1, wherein the audio synthesis block weights and synthesizes the 4-channel audio signals L, R, C, and E output from the audio processing block to further generate an audio signal LFE of the low-frequency dedicated channel (0.1 channel) for driving a superwoofer.

The information processing apparatus according to claim 1, wherein the audio synthesis block determines the characteristics of the audio filter in the audio processing block based on the number or type of subjects recognized by the image recognition means.

The information processing apparatus according to claim 1, wherein the audio synthesis block determines, based on the position of the subject in the screen recognized by the image recognition means, the synthesis parameters used when synthesizing the 4-channel audio signals L, R, C, and E.

The information processing apparatus according to claim 1, wherein the audio synthesis block determines, based on the size of the subject in the screen recognized by the image recognition means, the synthesis parameters used when synthesizing the 4-channel audio signals L, R, C, and E.

The information processing apparatus according to claim 1, further comprising moving image recording means for recording the surround audio signals generated by the audio synthesis block in synchronization with the input image signal.

The information processing apparatus according to claim 1, further comprising moving image reproduction means for reproducing and outputting the surround audio signals generated by the audio synthesis block in synchronization with the input image signal.