JP2010050755A

JP2010050755A - Video audio output device

Info

Publication number: JP2010050755A
Application number: JP2008213357A
Authority: JP
Inventors: Daisuke Kobayashi; 大祐小林
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2008-08-21
Filing date: 2008-08-21
Publication date: 2010-03-04

Abstract

PROBLEM TO BE SOLVED: To provide a reproduction technique that gives a sense of presence to sound that includes BGM.

SOLUTION: A video/audio output device includes a video display unit; detecting means for detecting a person's face and lip movement in the video displayed on the video display unit; an audio processing unit that controls the characteristics of an audio filter based on address coordinates indicating where in the video the detecting means found the person's face and lip movement; determining means for determining whether the audio signal accompanying the video displayed on the video display unit is a monaural audio signal; and a plurality of speakers that receive the output of the audio filter, which the audio processing unit controls based on the result from the determining means.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a video/audio output device, and more particularly to a method of reproducing audio with a sense of presence.

As a method of reproducing monaural audio with a sense of presence, Patent Document 1 describes controlling the frequency characteristics of the output audio signal of each channel according to the result of a frequency spectrum analysis. Specifically, it separates sounds within a unit period and changes the frequency characteristics according to tempo. However, it has no model of the audio source, which limits its performance.

Meanwhile, digital cameras and similar devices already use detection techniques that find a person's face and mouth, for example to correct the brightness of a photograph. Video/audio output devices are likewise expected to use such techniques to detect and exploit the position of the person who is talking.

Relatedly, Patent Document 2 describes a video/audio output device that detects the position of the talker in the video and controls the volume. This can heighten the sense of presence for an audio signal without BGM, but applying the same processing to an audio signal that includes BGM leaves the result sounding unnatural.
Patent Document 1: JP 2006-86558 A
Patent Document 2: JP-A-11-313272 (Claim 1)

An object of the present invention is to provide a reproduction technique that gives a sense of presence to audio that includes BGM.

To solve the above problem, the video/audio output device of the present invention comprises: a video display unit; detecting means for detecting a person's face and lip movement in the video displayed on the video display unit; an audio processing unit that controls the characteristics of an audio filter based on address coordinates indicating where in the video the detecting means found the person's face and lip movement; determining means for determining whether the audio signal accompanying the video displayed on the video display unit is a monaural audio signal; and a plurality of speakers that receive the output of the audio filter, which the audio processing unit controls based on the determination result of the determining means.

According to the present invention, a reproduction technique that gives a sense of presence to audio including BGM is obtained.

Embodiments of the present invention are described below.
(Embodiment 1)
Embodiment 1 of the present invention is described with reference to FIGS. 1 to 7.
FIG. 1 schematically shows the signal processing system of the television broadcast receiving apparatus described in this embodiment. The circuit blocks that make up this signal processing system are housed inside the cabinet 12.

A digital television broadcast signal received by the antenna 22 for digital television broadcasts is supplied to the tuner unit 24 via the input terminal 23. The tuner unit 24 selects the signal of the desired channel from the input digital television broadcast signal and demodulates it. The signal output from the tuner unit 24 is supplied to the decoder unit 25, where it undergoes, for example, MPEG (Moving Picture Experts Group) 2 decoding, and is then supplied to the selector 26.

An analog television broadcast signal received by the antenna 27 for analog television broadcasts is supplied to the tuner unit 29 via the input terminal 28. The tuner unit 29 selects the signal of the desired channel from the input analog television broadcast signal and demodulates it. The signal output from the tuner unit 29 is digitized by the A/D (analog/digital) conversion unit 30 and then output to the selector 26.

Analog video and audio signals supplied to the analog signal input terminal 31 are digitized by the A/D conversion unit 32 and then output to the selector 26. Digital video and audio signals supplied to the digital signal input terminal 33 are supplied to the selector 26 as they are.

The selector 26 selects one of the four kinds of input digital video and audio signals and supplies it to the signal processing unit 34. The signal processing unit 34 applies predetermined signal processing to the input digital video signal and passes it to the video display unit 14 for display. The video display unit 14 is, for example, a flat panel display such as a liquid crystal display or a plasma display. The signal processing unit 34 also applies predetermined signal processing to the input digital audio signal, converts it to analog, and outputs it to the speaker 15 for audio reproduction.

In the television broadcast receiving apparatus 11, the control unit 35 exercises overall control of the various operations, including the receiving operations described above. The control unit 35 is a microprocessor with a built-in CPU (central processing unit) and other components. It receives operation information from the operation unit 16 or the operating element 21 (not shown in FIG. 2), or operation information transmitted from the remote controller 17 and received via the light receiving unit 18, and controls each unit so that the operation is reflected.

The control unit 35 uses the memory unit 36. The memory unit 36 mainly comprises a ROM (read only memory) that stores the control program executed by the CPU, a RAM (random access memory) that provides the CPU with a work area, and a nonvolatile memory that stores various setting and control information.

The control unit 35 is connected, as an example, to an HDD (hard disk drive) unit 20 housed in the stand 13. In this case, the line 37, which supplies power and control signals from the control unit 35 to the HDD unit 20, connects the control unit 35 and the HDD unit 20 via the connection unit 38.

The line 39, which carries digital video and audio signals between the control unit 35 and the HDD unit 20, connects the control unit 35 and the HDD unit 20 via the connection unit 40. That is, the transfer of digital video and audio signals between the control unit 35 and the HDD unit 20 is carried out using the power supply and the control signals.

The television broadcast receiving apparatus can record the digital video and audio signals selected by the selector 26 on the HDD unit 20, and can play back the digital video and audio signals recorded on the HDD unit 20 for viewing.

FIG. 2 is a schematic block diagram of the main part of this embodiment.
The video processing block 101 applies face sensing to the input video signal to detect lip movement, and outputs address coordinates 102 indicating where on the screen the lip movement was detected. The audio processing block 103 changes its filter characteristics according to the value of the address coordinates 102 and outputs the audio from the speaker 104, which consists of four speaker groups; this heightens the sense of presence even for monaural audio. The video processing block 101, the address coordinates 102, and the audio processing block 103 reside in the signal processing unit 34, and the speaker 104 corresponds to the speaker 15.

FIG. 3 is a block diagram showing an example of the audio processing of this embodiment. In this system, if the input audio signal is monaural, the audio is output from the speakers close to the position of the face (lips) shown on the screen; if the input audio signal is stereo, the input audio signal is passed through unchanged, returning to the normal viewing state.

The audio signal processing of FIG. 3 is described below.
The LR audio signal comparison block 202 compares the two channels of the input audio signal 201 and outputs a determination result 203 indicating monaural or stereo. The BPF 204 has a passband matching the band of the human voice. The frequency comparison block 205 measures the frequency of the signal that has passed through the BPF 204 and compares it against the human voice to identify it.
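The monaural/stereo determination made by the LR audio signal comparison block 202 could be sketched as follows. This is a minimal illustration under stated assumptions, not the circuit of the embodiment: the function name `is_monaural`, the relative-difference measure, and the `tolerance` value are all hypothetical, since the description only says that the two channels are compared.

```python
def is_monaural(left, right, tolerance=1e-3):
    """Return True when the two channels carry (nearly) the same signal.

    `tolerance` is a hypothetical threshold on the summed absolute L-R
    difference relative to the average channel level; the description
    gives no value for it.
    """
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    # Summed sample-by-sample difference between the channels.
    diff = sum(abs(l - r) for l, r in zip(left, right))
    # Average absolute level of the two channels, used for normalization.
    level = sum(abs(l) + abs(r) for l, r in zip(left, right)) / 2
    if level == 0.0:
        return True  # silence (or no samples): treat as monaural
    return diff / level <= tolerance
```

The result of such a check would play the role of the determination result 203 that later drives the selector.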

The Q value of the 4ch notch filter 207 is set according to the value of the coordinate information 206 (102 in FIG. 2), which indicates the position of the lips, and the center frequency (f0) of the notch filter is set to the frequency measured by the frequency comparison block 205.
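One common way to realize a notch filter with a settable center frequency and Q is a second-order (biquad) digital section. The sketch below uses the widely published audio-EQ biquad notch formulas. The description does not say which filter structure is used, so the structure itself, the names `notch_coefficients` and `gain_at`, and the sample rate `fs` are assumptions made for illustration.

```python
import cmath
import math


def notch_coefficients(f0, q, fs):
    """Biquad notch coefficients (b, a), normalized so that a[0] == 1.

    f0: center frequency in Hz, q: quality factor, fs: sample rate in Hz.
    """
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a


def gain_at(b, a, f, fs):
    """Magnitude of the filter's frequency response at f Hz."""
    z1 = cmath.exp(-2j * math.pi * f / fs)  # z^-1 on the unit circle
    num = b[0] + b[1] * z1 + b[2] * z1 * z1
    den = a[0] + a[1] * z1 + a[2] * z1 * z1
    return abs(num / den)
```

In this form, f0 places the null of the response and Q sets how narrow the notch is around it.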

The selector 208 switches between the audio signal input as the audio signal 201 and the audio signal that has passed through the 4ch notch filter 207, according to the monaural determination signal 203. The selected signal is amplified by the AMP 209 and output to the speaker 210 (104 in FIG. 2).

FIG. 4 is a conceptual illustration of a television fitted with four speakers, two on each side, and a viewer watching a TV news program. News programs are sometimes broadcast bilingually, and a viewer who watches in Japanese hears monaural audio even on a TV with stereo (two-channel) capability. When such a program is viewed in Japanese, the audio signal therefore follows the path through the 4ch notch filter 207 of FIG. 3. In the news video the person on the right (B) is talking, so the lips are located at the upper right of the screen. The filtering raises the Q value of the notch filter as the distance between the lips and a speaker grows, so the frequency characteristics output to the speakers are those shown in FIG. 5. The farther a speaker is from the lips, the quieter the talker's voice from it, and the voice is therefore heard as coming from the position of the talker's mouth.
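The rule that Q rises with the distance between the lips and each speaker can be illustrated with a short sketch. Everything numeric here is an assumption: the speaker positions (corners of a 16 x 9 area with y increasing downward), the linear mapping, and the limits `q_min` and `q_max` are hypothetical, and the name `q_per_speaker` is illustrative only. The description states only the direction of the relationship.

```python
import math

# Hypothetical speaker positions: (x, y) with y increasing downward.
SPEAKERS = {
    "SP-A": (0.0, 0.0),   # upper left
    "SP-B": (16.0, 0.0),  # upper right
    "SP-C": (0.0, 9.0),   # lower left
    "SP-D": (16.0, 9.0),  # lower right
}


def q_per_speaker(lip_xy, speakers=SPEAKERS, q_min=0.5, q_max=8.0):
    """Map the lip position to one notch-filter Q value per speaker.

    Q grows linearly with the lip-to-speaker distance, normalized by the
    diagonal of the rectangle spanned by the speakers (an assumed rule).
    """
    xs = [p[0] for p in speakers.values()]
    ys = [p[1] for p in speakers.values()]
    d_max = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
    if d_max == 0.0:
        raise ValueError("speakers must not all share one position")
    result = {}
    for name, (sx, sy) in speakers.items():
        d = math.hypot(lip_xy[0] - sx, lip_xy[1] - sy)
        result[name] = q_min + (q_max - q_min) * min(d / d_max, 1.0)
    return result
```

With the lips at the upper right, for example `q_per_speaker((12.0, 2.0))`, the resulting Q values increase in the order SP-B, SP-D, SP-A, SP-C.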

FIG. 5 is a characteristic diagram showing the output characteristic curve for each speaker used in the embodiment. SP-B is closest to the lips and so has no attenuation; the attenuation increases with distance in the order SP-D, SP-A, SP-C.

FIG. 6 shows speech waveforms used to explain the embodiment. FIG. 6(a) shows the amplitude over time of the speech waveform when a person utters the Japanese phrase "sou hanasu" ("to say so"), together with its labeling (soohanasu). FIG. 6(b) is an enlarged waveform of the steady-state part of the vowel a in FIG. 6(a). In FIG. 6(b), p is the fundamental pitch period of the vocal cord vibration, and its reciprocal is the fundamental frequency F0. F0 ranges from several tens of Hz to several hundred Hz and varies with emotion and other factors.
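The relationship F0 = 1/p can be made concrete by estimating the pitch period from a sampled waveform. The sketch below uses a plain autocorrelation search, which is a standard technique swapped in for illustration; the description does not say how the frequency is measured. The name `estimate_f0` and the search limits `f_min` and `f_max` are assumptions, chosen to cover the stated range of tens to hundreds of Hz.

```python
def estimate_f0(samples, fs, f_min=50.0, f_max=500.0):
    """Estimate the fundamental frequency in Hz, or None if none is found.

    The pitch period p is taken as the lag (in samples) that maximizes
    the autocorrelation within [fs / f_max, fs / f_min]; F0 = fs / lag.
    """
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), len(samples) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Correlate the signal with a copy of itself delayed by `lag`.
        score = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    if best_lag == 0:
        return None
    return fs / best_lag
```

A simple search like this can lock onto a multiple of the true period on real speech, so a practical measurement would need additional safeguards.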

FIG. 7 is an example of a speech waveform as a representation of a speech signal. FIG. 7(a) shows part of the amplitude over time of the speech signal when "sa" is uttered. It consists of the fricative consonant part /s/, a following transitional part, and the steady-state vowel part /a/. FIG. 7(b) is the frequency spectrum obtained by Fourier-transforming the steady-state vowel part /a/ in units of 10 ms. The vertical axis is intensity and the horizontal axis is frequency (in kHz).
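A spectrum like that of FIG. 7(b) can be approximated by transforming a single 10 ms frame. The following is a minimal sketch using a direct discrete Fourier transform. The Hann window, the function name `frame_spectrum`, and the sample rate are assumptions; the description states only the 10 ms unit and the axes.

```python
import cmath
import math


def frame_spectrum(samples, fs, frame_ms=10.0):
    """Return [(frequency_kHz, magnitude), ...] for the first frame.

    The frame is Hann-windowed and transformed with a direct DFT, which
    is adequate for the short frames considered here.
    """
    n = int(fs * frame_ms / 1000.0)
    if n < 2 or len(samples) < n:
        raise ValueError("not enough samples for one frame")
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(samples[:n])]
    spectrum = []
    for k in range(n // 2 + 1):  # bins from 0 Hz up to fs / 2
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append((k * fs / n / 1000.0, abs(acc)))
    return spectrum
```

A 10 ms frame gives a bin spacing of 100 Hz, whatever the sample rate.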

This figure shows a male voice sample. FIG. 7(b) shows the harmonic line spectrum w, drawn as the fine waveform, and its spectral envelope e, obtained by linear predictive analysis or the like. Three formants (vocal tract resonances) can be seen in the spectral envelope e, and the third formant F3 lies near 2.5 kHz. For other vowels such as /i/ and /u/, the positions of the third and other formants change, but the intensity near 2.5 kHz remains high. The more dominant first formant F1 and second formant F2 lie around 1 kHz.

The output characteristic for each speaker in FIG. 5 may also be made to track changes in the fundamental frequency F0, in addition to the formants.
(Embodiment 2)
Embodiment 2 of the present invention is described. Parts common to Embodiment 1 are not described again.
Embodiment 1 described an example with one talker, but there may be several. Suppose that in FIG. 4 the person on the left (A) is also moving their lips. The relatively high-frequency components of the spectrum, including consonant and transitional parts, are likely to carry information about the talker. With the help of speaker recognition (identifying who is talking), the notch filter characteristics set per loudspeaker for talker A and for talker B can therefore be superimposed and output. As a means of speaker recognition, the consonants and vowels the talker is pronouncing may be identified from the lip movement and the result used as an aid.
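Superimposing the notch characteristics set for two talkers on one loudspeaker could be modeled as cascading the notches, so that their magnitude responses multiply. This reading of "superimpose" is an assumption, as are the analog second-order notch model and the names `notch_gain` and `combined_gain`. The per-talker center frequencies in the usage note are hypothetical values.

```python
import math


def notch_gain(f, f0, q):
    """Magnitude at f Hz of an analog second-order notch centered on f0."""
    detune = f * f - f0 * f0
    damping = f * f0 / q
    if detune == 0.0 and damping == 0.0:
        return 0.0  # degenerate case f == f0 == 0
    return abs(detune) / math.hypot(detune, damping)


def combined_gain(f, notches):
    """Gain at f Hz of several notches in cascade.

    `notches` is a list of (f0, q) pairs, one per talker, for a single
    loudspeaker.
    """
    gain = 1.0
    for f0, q in notches:
        gain *= notch_gain(f, f0, q)
    return gain
```

For instance, `combined_gain(f, [(120.0, 4.0), (220.0, 4.0)])` gives the response of one loudspeaker when talker A's notch sits at 120 Hz and talker B's at 220 Hz.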

Not only TV news programs but also foreign films and the like are sometimes broadcast bilingually, and a viewer who watches in Japanese hears monaural audio, which gives a weaker sense of presence than a program broadcast in stereo. The video/audio processing of this embodiment heightens the sense of presence of the television audio signal even for monaural broadcasts. Even in a program with BGM inserted, the BGM is heard with a flat localization, while the talker's voice is heard from the talker's position.

In summary, the characteristics of the audio filters of a plurality of arranged speakers are controlled according to the coordinates at which a face or lips are detected on the screen, and the monaural audio signal is processed so that the sense of audio presence is heightened to match the picture. Furthermore, as the audio filter characteristic, the Q of a notch filter is controlled.

As a result, with the audio processing of this embodiment, the BGM in an audio signal that includes BGM is output with a uniform localization, while a person's speaking voice is output according to the talker's position on the screen. The sense of presence can therefore be heightened without sounding unnatural, even for an audio signal that includes BGM.

The present invention is not limited to the embodiments described above, and can be implemented with various modifications that do not depart from its gist. For example, although the speaker groups are shown arranged in a plane, a so-called surround arrangement may also be used.

Various inventions can also be formed by appropriately combining the constituent elements disclosed in the embodiments above. For example, some constituent elements may be removed from the full set shown in an embodiment. Constituent elements from different embodiments may also be combined as appropriate.

FIG. 1 is a diagram schematically showing the signal processing system of one embodiment of the invention.
FIG. 2 is a schematic block diagram of the main part of the embodiment.
FIG. 3 is a block diagram showing an example of the audio processing of the embodiment.
FIG. 4 is an illustration of a viewer watching a TV news program in the embodiment.
FIG. 5 is a characteristic diagram showing the output characteristic curve for each speaker used in the embodiment.
FIG. 6 is a first speech waveform diagram for explaining the embodiment.
FIG. 7 is a second speech waveform diagram for explaining the embodiment.

Explanation of symbols

12 ... cabinet, 14 ... video display (video display unit), 15 ... speaker, 16 ... operation unit, 18 ... light receiving unit, 20 ... HDD unit (recording means), 22 ... antenna, 23 ... input terminal, 24 ... tuner unit, 25 ... decoder unit, 26 ... selector, 27 ... antenna, 28 ... input terminal, 29 ... tuner unit, 30 ... A/D conversion unit, 31 ... input terminal, 32 ... A/D conversion unit, 33 ... input terminal, 34 ... signal processing unit, 35 ... control unit, 35a ... HDD control unit, 36 ... memory unit, 37 ... line, 38 ... connection unit, 39 ... line, 40 ... connection unit, 101 ... video processing block, 102 ... address coordinates, 103 ... audio processing block, 104 ... speaker, 202 ... audio signal comparison block, 204 ... BPF, 207 ... notch filter.

Claims

1. A video/audio output device comprising:
a video display unit;
detecting means for detecting a person's face and lip movement in the video displayed on the video display unit;
an audio processing unit that controls characteristics of an audio filter based on address coordinates indicating where in the video the detecting means detected the person's face and lip movement; and
determining means for determining whether an audio signal accompanying the video displayed on the video display unit is a monaural audio signal.

2. The video/audio output device according to claim 1, wherein the audio processing unit controls a Q value of a notch filter as the characteristic of the audio filter based on the address coordinates.

3. The video/audio output device according to claim 2, further comprising a BPF that passes the band of a person's voice,
wherein the audio processing unit controls a center frequency of the notch filter by comparing frequencies of the audio signal that has passed through the BPF.

4. The video/audio output device according to claim 1, further comprising a plurality of speakers that, when the determining means determines that the audio signal is a monaural audio signal, receive as input the output of the audio filter controlled by the audio processing unit based on that determination result.