JP2006339869A

JP2006339869A - Apparatus for integrating video signal and voice signal

Info

Publication number: JP2006339869A
Application number: JP2005160216A
Authority: JP
Inventors: Kozo Okuda; 浩三奥田; Hitoshi Hongo; 仁志本郷
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 2005-05-31
Filing date: 2005-05-31
Publication date: 2006-12-14

Abstract

PROBLEM TO BE SOLVED: To provide an apparatus for integrally controlling a video signal and an audio signal that creates a realistic space in which the receiving side perceives no mismatch between the image of a talker on the transmitting side and the voice or sound the talker utters, so that the scene of the talker speaking is reproduced as it is.

SOLUTION: A face region detection section 108 calculates face region position information of a person from video information. A sound receiving direction determining section 105 identifies the direction in which a conference participant on the transmitting side is located on the basis of the face region position information, the zoom magnification of a camera 103, and the direction of the camera. A sound receiving section 104 acquires sound information from the identified direction. A received sound signal reproducing section 109 localizes the sound image of the sound information near the face region of the transmitting-side conference participant displayed on a display apparatus 107 of the receiving-side terminal, on the basis of the face region position information.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to an apparatus that integrally controls video signals and acoustic signals. The apparatus transmits video signals and acoustic signals such as sounds and voices between spatially separated locations, as in a videophone, and uses these signals to create a realistic space.

In recent years, with the spread of broadband networks, videophones using VoIP technology have been spreading rapidly. Specifically, videophones that run on personal computers and videophones that connect to a television set have begun to gain ground.

Because such videophones are used for conversation while watching a screen, calls are hands-free. However, an echo canceller that supports stereo calls is difficult to realize and expensive, so the call audio is monaural. As long as the conversation is one-to-one, this works well. In a one-to-many or many-to-many conversation, however, the receiving side has difficulty telling from the video and call audio alone which talker on the transmitting side is speaking. Furthermore, when several people on the transmitting side speak at the same time, their voices are transmitted to the receiving side as a monaural signal, so the intelligibility of what is said degrades markedly on the receiving side.

Meanwhile, some commercial videophone systems have begun to support stereo calls by implementing a stereo echo canceller, achieving a more realistic videophone. Stereo call audio conveys the atmosphere of the transmitting-side space better than monaural audio, and makes each talker easier to understand even when several talkers on the transmitting side speak at the same time. Widening the bandwidth of the call audio makes an even more realistic videophone possible.

However, a mismatch easily arises between the video shown on the television screen and the sound reproduced from the speakers, which can make the call feel less natural rather than more. For example, when the video is zoomed, the screen shows an enlarged view of a specific region, but the sound reproduced from the speakers does not change. On the receiving side the mismatch between video and sound then grows, and the call feels unnatural.

As a method of eliminating this unnatural feeling, the method described in the following patent document has been proposed. In Patent Document 1, the video is analyzed to estimate what kind of space the other party is talking in, and the parameters for processing the acoustic signal are changed according to the estimation result. For example, when the other party is talking in a large room, parameters that add reverberation are selected, which yields a realistic videophone.

Patent Document 1: JP-A-7-131770

However, even with such processing, the problem of the mismatch between video and sound remains unsolved.

Accordingly, an object of the present invention is to provide an apparatus that integrally controls video signals and acoustic signals and that, on the receiving side, eliminates the mismatch between the image of a talker on the transmitting side and the voice or sound the talker utters, reproduces the scene of the talker speaking as faithfully as possible, and creates a more realistic space.

An apparatus for integrating a video signal and an acoustic signal according to the present invention comprises: video acquisition means for acquiring a video signal; sound acquisition means for acquiring an acoustic signal; transmission means for transmitting the acoustic signal acquired by the sound acquisition means and the video signal acquired by the video acquisition means; reception means for receiving the video signal and the acoustic signal transmitted by the transmission means; video display means for displaying the video signal received by the reception means; sound control means for controlling the acoustic signal received by the reception means; and sound output means for outputting the acoustic signal controlled by the sound control means. The apparatus further comprises face position detection means for detecting, from the video signal acquired by the video acquisition means, the face positions of one or more persons in the video signal, and person direction specifying means for specifying the direction in which each person is located on the basis of the face position of each person detected by the face position detection means and the video acquisition conditions of the video acquisition means. The sound acquisition means acquires an acoustic signal from the direction of each person specified by the person direction specifying means, and the sound control means localizes the acoustic signal corresponding to each person near that person's face position in the video signal displayed on the video display means.

In another aspect, an apparatus for integrating a video signal and an acoustic signal according to the present invention comprises: video acquisition means for acquiring a video signal; sound acquisition means for acquiring an acoustic signal; transmission means for transmitting the acoustic signal acquired by the sound acquisition means and the video signal acquired by the video acquisition means; reception means for receiving the video signal and the acoustic signal transmitted by the transmission means; video display means for displaying the video signal received by the reception means; sound control means for controlling the acoustic signal received by the reception means; and sound output means for outputting the acoustic signal controlled by the sound control means. The apparatus further comprises face position detection means for detecting, from the video signal acquired by the video acquisition means, the face positions of one or more persons in the video signal, and person direction specifying means for specifying the direction in which each person is located on the basis of the face position of each person detected by the face position detection means and the video acquisition conditions of the video acquisition means. The sound acquisition means acquires acoustic signals from all directions. The sound control means generates, from those omnidirectional acoustic signals and on the basis of the information on the direction of each person specified by the person direction specifying means, an acoustic signal from the direction of each person, and localizes each generated acoustic signal near the corresponding person's face position in the video signal displayed on the video display means.

According to these inventions, it is possible to create a realistic space in which the talker shown on the display device seems to be actually speaking on the spot.

In the apparatus for integrating a video signal and an acoustic signal according to the present invention, the video acquisition conditions of the video acquisition means include the zoom magnification of the video acquisition means.

In the apparatus according to the present invention, the video acquisition conditions of the video acquisition means also include information on the direction in which the video acquisition means is pointing.

In the apparatus according to the present invention, the sound acquisition means increases or decreases the volume level of the acoustic signal it acquires according to the zoom magnification among the video acquisition conditions.

In the apparatus according to the present invention, the sound control means increases or decreases the volume level of the acoustic signal output by the sound output means according to the zoom magnification among the video acquisition conditions.

According to the apparatuses of the two inventions above, increasing or decreasing the zoom magnification of the video acquisition means increases or decreases the size of a person shown on the video display means, and the volume level of that person's acoustic signal output by the sound output means increases or decreases accordingly. A more realistic space is therefore created.

According to the present invention, it is possible to provide an apparatus that integrally controls video signals and acoustic signals and that, on the receiving side, eliminates the mismatch between the image of a talker on the transmitting side and the voice or sound the talker utters, reproduces the scene of the talker speaking as faithfully as possible, and creates a more realistic space.

The significance and effects of the present invention will become clearer from the following description of the embodiment.

However, the following embodiment is merely one embodiment of the present invention, and the meanings of the terms of the present invention and of its constituent elements are not limited to those described in the following embodiment.

An embodiment in which the present invention is applied to a hands-free videophone device is described below with reference to the drawings.

FIG. 1 is a diagram outlining the configuration when a hands-free videophone, which is one embodiment of the present invention, is used for a video conference.

In FIG. 1, when a talker in front of the transmitting-side hands-free videophone device (hereinafter, the transmitting-side terminal) utters a voice or sound (hereinafter, sound information), the video of the talker and the sound information are transmitted, directly or via a network, to the receiving-side hands-free videophone device (hereinafter, the receiving-side terminal).

The receiving-side terminal reproduces the video of the transmitting-side talker on the screen of its display device and outputs the talker's sound information so that it appears to come from near the talker's face region on the screen. When the transmitting-side video is displayed zoomed in, the volume level of the output sound information is also increased. Through this processing the receiving-side terminal creates a realistic space in which the transmitting-side talker seems to be actually speaking on the screen. Conversely, the same kind of realistic space is created on the transmitting side. This makes conversation easier and the video conference more realistic.

FIG. 2 is a diagram showing the configuration of the hands-free videophone device, which is one embodiment of the present invention.

In FIG. 2, the hands-free videophone device 100 comprises a transmission processing unit 101, which transmits video information and sound information to the receiving-side terminal, and a reception processing unit 102, which reproduces the video information and sound information received from the remote terminal.

The transmission processing unit 101 comprises a camera 103, which acquires and transmits video information and also transmits camera information such as the zoom magnification of the camera and the direction in which it is pointing; a sound receiving unit 104, which acquires and transmits sound information; and a sound receiving direction determination unit 105, which determines from which direction and at what volume level the sound receiving unit 104 acquires sound information. The sound receiving unit 104 consists, for example, of an array of small, highly directional microphones and can acquire sound information from a desired direction at a desired volume level.

The reception processing unit 102 comprises a speaker 106, which outputs the sound information transmitted from the transmitting-side terminal; a display device 107, which displays the video information transmitted from the transmitting-side terminal; a face region detection unit 108, which detects the face regions of persons in that video information; and a received sound signal reproduction unit 109, which, on the basis of the position information of the face regions detected by the face region detection unit 108, controls the sound information output from the speaker 106 so that it appears to come from near the face region of the person shown on the display device 107.

FIG. 3 is a diagram showing the configuration and data flow when a video conference is held using two hands-free videophone devices 100.

In FIG. 3, the camera 103 of the transmitting-side terminal captures the conference scene, including the persons attending the video conference on the transmitting side, and transmits the captured video information via the network to the face region detection unit 108 and the display device 107 of the receiving-side terminal. The camera 103 also transmits camera information, such as its zoom magnification and the direction in which it is pointing, to the sound receiving direction determination unit 105 described later.

The face region detection unit 108 of the receiving-side terminal detects the face regions of persons in the video information transmitted from the camera 103, calculates the position information of each detected face region, and transmits it to the received sound signal reproduction unit 109 and to the sound receiving direction determination unit 105 of the transmitting-side terminal.

FIG. 4 is a diagram showing the face region position information calculated by the face region detection unit 108.

As shown in FIG. 4, when the face region detection unit 108 acquires video information of a predetermined size from the camera 103, it sets up a coordinate system whose origin is the lower-left corner of the video information and whose vertical and horizontal axes each run from a minimum of 0° to a maximum of 180°, and calculates the center coordinates of each detected face region as the face region position information. In FIG. 4, the face region detection unit 108 detects the face regions of two persons in the video information and calculates (Θpx1°, Θpy1°) and (Θpx2°, Θpy2°) as their face region position information.
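The coordinate convention of FIG. 4 can be illustrated with a minimal sketch. It assumes that the face detector returns a pixel bounding box with its origin at the top-left corner of the frame and that pixels map linearly onto the 0° to 180° axes; neither detail is stated in the source, and the names `FaceBox` and `face_position_deg` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FaceBox:
    """Face bounding box in pixels; origin at the top-left of the frame (assumed)."""
    left: float
    top: float
    width: float
    height: float


def face_position_deg(box: FaceBox, frame_w: int, frame_h: int) -> tuple[float, float]:
    """Map the centre of a face box onto the 0-180 degree axes of FIG. 4.

    The origin of the result is the lower-left corner of the frame.
    The linear pixel-to-degree mapping is an assumption of this sketch.
    """
    cx = box.left + box.width / 2.0
    cy = box.top + box.height / 2.0
    theta_px = 180.0 * cx / frame_w
    # Pixel rows grow downwards, so flip the vertical axis.
    theta_py = 180.0 * (frame_h - cy) / frame_h
    return theta_px, theta_py


# A face centred in a 640x480 frame sits at the middle of both axes.
print(face_position_deg(FaceBox(280, 200, 80, 80), 640, 480))
```

Only the horizontal component is needed later for the sound receiving direction; the vertical component is used when the sound image is placed on the screen.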

The face region of a person can be detected from the video information by, for example, the method disclosed in Japanese Patent No. 3490910, "Face Area Detection Device".

The sound receiving direction determination unit 105 identifies the direction in which a video conference attendee on the transmitting side is located on the basis of the face region position information transmitted from the face region detection unit 108 and the camera information transmitted from the camera 103, and causes the sound receiving unit 104 to acquire sound information from the identified direction. When there are several attendees on the transmitting side, the sound receiving direction determination unit 105 identifies the direction of each attendee and causes the sound receiving unit 104 to acquire sound information separately for each of those directions.

Of the face region position information transmitted from the face region detection unit 108, the sound receiving direction determination unit 105 uses the horizontal position information in particular to identify the sound receiving direction. To do so, it must also take the camera information into account, for example the zoom magnification of the camera and the direction in which the camera is pointing.

FIG. 5 is a diagram showing that, even for the same face region position information calculated by the face region detection unit 108, the direction in which the person is located changes with the zoom magnification of the camera 103.

As shown in FIG. 5, suppose the horizontal position of a person's face region detected by the face region detection unit 108 divides the width of the video frame in the ratio a:b. Even for this same position, the direction of the person relative to the front direction of the camera 103 is calculated as Θz° when the camera 103 is zoomed in and as Θw° when it is shooting wide, and in this case clearly Θz° > Θw°. That is, the direction in which the person is located differs depending on whether the camera 103 is shooting zoomed in or wide.

Therefore, when the direction of a person is identified from the face region position information detected in the video information, that position information must be corrected according to the zoom magnification of the camera 103.

FIG. 6 is a diagram showing the flow by which the sound receiving direction determination unit 105 identifies the sound receiving direction from the face region position information transmitted from the face region detection unit 108, taking into account the zoom magnification and the orientation of the camera 103.

In FIG. 6, suppose the horizontal components of the face region position information of the two persons transmitted from the face region detection unit 108 are Θpx1° and Θpx2°. When the zoom magnification of the camera 103 is λ, the sound receiving direction determination unit 105 calculates the first-corrected sound receiving directions Θc1° and Θc2° by equations (1) and (2):

Θc1 = Θpx1 · f(λ) ... (1)
Θc2 = Θpx2 · f(λ) ... (2)

where f(λ) is a correction function that takes the zoom magnification λ of the camera 103 as input.

Next, when the camera 103 points in the direction Θd°, measured as in FIG. 6 from the front direction (0°) of the sound receiving unit 104, which consists for example of a plurality of microphones, the sound receiving direction determination unit 105 calculates the second-corrected sound receiving directions Θt1° and Θt2° by equations (3) and (4):

Θt1 = Θc1 + Θd ... (3)
Θt2 = Θc2 + Θd ... (4)

The sound receiving direction determination unit 105 takes the second-corrected sound receiving directions Θt1° and Θt2° as the directions of the persons corresponding to the detected face regions, and causes the sound receiving unit 104 to acquire sound information from those directions.
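Equations (1) to (4) can be sketched as a single function. The source does not define the correction function f(λ), so the sketch takes it as a parameter; the function `inverse_zoom` below is a purely hypothetical placeholder used only to make the example runnable, and the names `sound_receiving_directions`, `zoom`, and `camera_dir_deg` are likewise illustrative.

```python
from typing import Callable, Sequence


def sound_receiving_directions(
    theta_px: Sequence[float],
    zoom: float,
    camera_dir_deg: float,
    f: Callable[[float], float],
) -> list[float]:
    """Return Θt = Θpx · f(λ) + Θd for each detected face.

    The multiplication is the first correction (equations (1), (2));
    the addition of the camera direction is the second (equations (3), (4)).
    """
    scale = f(zoom)
    return [px * scale + camera_dir_deg for px in theta_px]


def inverse_zoom(zoom: float) -> float:
    """Hypothetical correction function; the source leaves f(λ) unspecified."""
    return 1.0 / zoom


# Two faces at Θpx1 = 60° and Θpx2 = 120°, zoom λ = 2, camera direction Θd = 10°.
print(sound_receiving_directions([60.0, 120.0], 2.0, 10.0, inverse_zoom))
```

Passing f in as an argument keeps the sketch independent of any particular lens model, since the actual shape of f(λ) would depend on the camera.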

The sound receiving direction determination unit 105 also determines the volume level of the sound information to be acquired according to the zoom magnification of the camera, and causes the sound receiving unit 104 to acquire sound information at that volume level.
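The source does not say how the volume level depends on the zoom magnification. As one possible illustration only, the sketch below assumes a linear gain equal to the zoom magnification, clamped to a fixed range; the function names and the clamp limits are hypothetical.

```python
def zoom_gain(zoom: float, min_gain: float = 0.25, max_gain: float = 4.0) -> float:
    """Return a linear gain that grows with the zoom magnification.

    The gain = zoom rule and the clamp limits are assumptions of this sketch.
    """
    if zoom <= 0:
        raise ValueError("zoom magnification must be positive")
    return max(min_gain, min(max_gain, zoom))


def apply_gain(samples: list[float], gain: float) -> list[float]:
    """Scale every sample of a mono signal by the given linear gain."""
    return [s * gain for s in samples]


# Zooming in by a factor of 2 doubles the amplitude under this assumed rule.
print(apply_gain([0.5, -0.25], zoom_gain(2.0)))
```

The clamp keeps an extreme zoom setting from driving the output into clipping or silence.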

The sound receiving unit 104 acquires the sound information of each video conference attendee on the transmitting side from the respective direction identified by the sound receiving direction determination unit 105, at the volume level determined by that unit, and transmits it to the received sound signal reproduction unit 109 of the receiving-side terminal.

On the basis of the face region position information transmitted from the face region detection unit 108 and the sound information transmitted from the sound receiving unit 104, the received sound signal reproduction unit 109 controls the sound information so that the sound information corresponding to each attendee appears to come from near that attendee's face region on the display device 107, and outputs it from the speaker 106.

FIG. 7 is a diagram showing how the received sound signal reproduction unit 109 outputs sound information.

The received sound signal reproduction unit 109 comprises a signal processing unit 109a, which controls the received sound information, and a transfer function database 109b, which stores a plurality of transfer functions to be convolved with the sound information. The transfer function database 109b stores a plurality of transfer functions (for example, transfer function 1, transfer function 2, transfer function 3, ...) for controlling the position at which the sound image of the received sound information is localized. As shown in FIG. 7, the screen of the display device 107 is divided in advance into predetermined regions (1, 2, 3, ... in FIG. 7), and each transfer function corresponds to one of these regions. By convolving a transfer function with the received sound information, the signal processing unit 109a can localize the sound information in the region to which that transfer function corresponds.

From the received face region position information, the signal processing unit 109a determines the region in which the received sound information should be localized and selects the transfer function corresponding to that region. It then convolves the selected transfer function with the received sound information and reproduces the result from the speaker 106. In this way the sound information of each attendee can be localized near that attendee's face region on the display device 107.
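The select-and-convolve step can be sketched as follows. The sketch assumes that the screen is divided into a grid numbered row by row from the top-left (the actual numbering of FIG. 7 is not recoverable from the text), that each database entry is a pair of impulse responses for a left and a right speaker, and that the signal is mono. All names (`region_index`, `convolve`, `localize`, `transfer_db`) are hypothetical.

```python
def region_index(theta_px: float, theta_py: float, cols: int, rows: int) -> int:
    """Return the 0-based index of the screen region containing the face centre.

    Positions use the 0-180 degree axes with the origin at the lower left;
    regions are assumed to be numbered row-major from the top-left.
    """
    col = min(int(theta_px / 180.0 * cols), cols - 1)
    row = min(int((180.0 - theta_py) / 180.0 * rows), rows - 1)
    return row * cols + col


def convolve(signal: list[float], impulse_response: list[float]) -> list[float]:
    """Direct-form linear convolution in pure Python."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def localize(signal, theta_px, theta_py, transfer_db, cols, rows):
    """Convolve a mono signal with the (left, right) responses of its region."""
    h_left, h_right = transfer_db[region_index(theta_px, theta_py, cols, rows)]
    return convolve(signal, h_left), convolve(signal, h_right)


# Toy database for a 3x3 grid: the centre region plays louder on the left.
transfer_db = {4: ([1.0], [0.5])}
print(localize([1.0, 2.0], 90.0, 90.0, transfer_db, 3, 3))
```

A real database would hold measured or designed responses for every region; the single-tap responses here only show the data flow.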

As a result, video conference participants on the receiving side can feel as if each transmitting-side attendee were actually speaking from the display device 107 of the receiving-side terminal.

Because humans have a pair of ears, one on the left and one on the right, it is relatively easy for them to tell from which direction sound information comes in the horizontal (left-right) direction, but not in the vertical (up-down) direction.

Accordingly, the received sound signal reproduction unit 109 may instead identify only the horizontal position of each attendee's face region on the display device 107, on the basis of the horizontal component of the face region position information transmitted from the face region detection unit 108, and control the sound information transmitted from the sound receiving unit 104 so that the sound information corresponding to each attendee appears to come from near that horizontal position. In this case the vertical position of the face region is fixed in advance at a suitable position.

Specifically, as shown in FIG. 8, the transfer function database 109b of the received sound signal reproduction unit 109 stores a plurality of transfer functions (for example, transfer function 1, transfer function 2, transfer function 3, ...) for controlling the position at which the received sound information is localized. The screen of the display device 107 is divided in advance into predetermined regions in the horizontal direction (1, 2, 3, ... in FIG. 8), and each transfer function corresponds to one of these regions. In this case, by convolving a transfer function with the received sound information, the signal processing unit 109a can localize the sound information in the region to which that transfer function corresponds, that is, near the horizontal position of the face region.

In the above embodiment, sound information is controlled using two speakers 106 as shown in FIG. 2 or FIG. 3. Instead of the speakers 106, however, a flat panel speaker 110 may be adopted and arranged on the back of the display device 107, as shown in FIG. 9. With this configuration, sound information can actually be output from the vicinity of the face area of each transmitting-side attendee displayed on the display device 107, so that a more realistic space can be created.

Further, the sound receiving direction determination unit 105 of the transmission processing unit 101 may be omitted from the configuration of the hands-free videophone device 100 shown in FIG. 2, giving the configuration shown in FIG. 10.

In such a configuration, the sound receiving unit 104 is, for example, a microphone array in which a plurality of so-called omnidirectional microphones, which have almost no directivity, are arranged, and each microphone acquires sound information from the entire range. When a video conference is held using two hands-free videophone devices 100 of this configuration, as shown in FIG. 11, the sound receiving unit 104 of the transmitting-side terminal transmits the sound information acquired by each microphone to the received sound signal reproduction unit 109 of the receiving-side terminal. The camera 103 of the transmitting-side terminal transmits camera information to the received sound signal reproduction unit 109 of the receiving-side terminal.

The received sound signal reproduction unit 109 of the receiving-side terminal identifies the direction in which each video conference attendee on the transmitting side is present, based on the face area position information transmitted from the face area detection unit 108 and the camera information transmitted from the camera 103, and extracts or generates the sound information corresponding to the identified direction from the per-microphone sound information transmitted from the sound receiving unit 104.
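The description does not say how the sound information for a given direction is extracted or generated from the per-microphone signals. One common technique for this is delay-and-sum beamforming, sketched below purely as an assumption. The function names, the linear array geometry, the far-field model and the rounding to whole-sample delays are all illustrative choices and are not taken from the source.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def steering_delays(mic_positions, angle_deg, sample_rate):
    """Whole-sample delays that align a far-field source at angle_deg.

    mic_positions are x coordinates (metres) of a linear array. The angle is
    measured from the array normal, positive toward increasing x.
    """
    s = math.sin(math.radians(angle_deg))
    # A microphone closer to the source (larger x for positive angles)
    # receives the wavefront earlier, hence the minus sign.
    arrival = [-x * s / SPEED_OF_SOUND for x in mic_positions]
    latest = max(arrival)
    # Delay every channel so that it lines up with the latest arrival.
    return [round((latest - t) * sample_rate) for t in arrival]


def delay_and_sum(channels, delays):
    """Delay each channel by its whole-sample delay and average them."""
    n = len(channels[0])
    out = [0.0] * n
    for channel, d in zip(channels, delays):
        for i in range(d, n):
            out[i] += channel[i - d]
    return [v / len(channels) for v in out]


# Two microphones 0.343 m apart, sampled at 1 kHz, source at +90 degrees:
# the second microphone hears the impulse one sample before the first.
delays = steering_delays([0.0, 0.343], 90.0, 1000)
steered = delay_and_sum([[0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]], delays)
```

Signals arriving from the steered direction add coherently, while signals from other directions are attenuated, which is one way of obtaining "sound information corresponding to the identified direction" from omnidirectional microphones.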

The received sound signal reproduction unit 109 identifies the direction of presence in the same way as the sound receiving direction determination unit 105 of FIG. 2 or FIG. 3 described above.

Based on the face area position information transmitted from the face area detection unit 108, the received sound signal reproduction unit 109 controls the extracted or generated sound information so that the sound information corresponding to each attendee appears to be emitted from the vicinity of that attendee's face area displayed on the display device 107, and outputs the sound information from the speakers 106.

The configuration of each part according to the embodiment of the present invention is not limited to the embodiment described above, and various modifications are possible within the technical scope described in the claims. For example, in the embodiment described above, the sound receiving direction determination unit 105 is provided in the transmission processing unit 101, and the face area detection unit 108 and the received sound signal reproduction unit 109 are provided in the reception processing unit 102, but whether each of these is provided in the transmission processing unit 101 or in the reception processing unit 102 can be chosen as appropriate.

Furthermore, the sound receiving direction determination unit 105, the face area detection unit 108, and the received sound signal reproduction unit 109 that constitute the hands-free videophone device 100 in the embodiment described above can be realized in hardware by the CPU, memory, and other LSIs of any computer, and in software by a program loaded into memory. Needless to say, they can also be realized by a combination of hardware and software.

As described above, in the hands-free videophone device 100, the face area detection unit 108 calculates the face area position information of the persons participating in the conference from the video information of the transmitting-side conference scene captured by the camera 103.

When the hands-free videophone device 100 has the configuration shown in FIG. 2, the sound receiving direction determination unit 105 identifies the direction in which each conference participant on the transmitting side is present, based on the face area position information, the zoom magnification of the camera 103, and the direction of the camera 103, and the sound receiving unit 104 acquires sound information from that direction. At this time, the volume level of the acquired sound information is increased or decreased according to the zoom magnification of the camera 103. Based on the face area position information, the received sound signal reproduction unit 109 forms the sound image of the acquired sound information near the face position of the transmitting-side conference participant displayed on the display device 107.
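As a rough illustration of how a direction of presence could be derived from the face position, the zoom magnification and the camera direction, the sketch below assumes a simple pinhole camera model. The function name `presence_direction`, the 60-degree field of view at zoom 1, and the assumption that zoom scales the focal length linearly are all hypothetical and are not taken from this description.

```python
import math


def presence_direction(face_x, image_width, zoom, pan_deg, base_fov_deg=60.0):
    """Estimate the horizontal direction (degrees) of a face.

    face_x       horizontal centre of the face area in pixels
    image_width  width of the captured image in pixels
    zoom         zoom magnification (1.0 means no zoom)
    pan_deg      horizontal direction of the camera's optical axis
    base_fov_deg assumed horizontal field of view at zoom 1.0
    """
    # Offset of the face from the image centre, normalised to [-1, 1].
    u = (face_x - image_width / 2) / (image_width / 2)
    # Zooming in by a factor z narrows the half-width of the view by 1/z.
    half_width = math.tan(math.radians(base_fov_deg / 2)) / zoom
    return pan_deg + math.degrees(math.atan(u * half_width))


# A face at the right edge of the image lies about 30 degrees off axis at
# zoom 1, but closer to the axis once the camera zooms in.
wide = presence_direction(640, 640, 1.0, 0.0)
tele = presence_direction(640, 640, 2.0, 0.0)
```

The same pixel offset therefore corresponds to a smaller angle at higher zoom, which is why the zoom magnification has to be taken into account alongside the face position and the camera direction.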

When the hands-free videophone device 100 has the configuration shown in FIG. 10, the sound receiving unit 104 acquires sound information from all directions using a plurality of omnidirectional microphones. The received sound signal reproduction unit 109 identifies the direction in which each conference participant on the transmitting side is present, based on the face area position information, the zoom magnification of the camera 103, and the direction of the camera 103. Next, the received sound signal reproduction unit 109 processes the acquired sound information to generate sound information from the direction of each conference participant. The received sound signal reproduction unit 109 then forms the sound image of the generated sound information corresponding to each participant near that participant's face position displayed on the display device 107. At this time, the volume level of the sound information is increased or decreased according to the zoom magnification of the camera 103.
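The description states only that the volume level is increased or decreased according to the zoom magnification and gives no formula. The sketch below assumes, purely for illustration, a gain equal to the zoom magnification and clamped to a safe range. The names `zoom_gain` and `apply_gain` and the clamp limits are hypothetical.

```python
def zoom_gain(zoom, min_gain=0.5, max_gain=2.0):
    """Assumed mapping from zoom magnification to a clamped linear gain."""
    return min(max(zoom, min_gain), max_gain)


def apply_gain(samples, zoom):
    """Scale audio samples by the gain derived from the zoom magnification."""
    gain = zoom_gain(zoom)
    return [s * gain for s in samples]


# Zooming in to 1.5x makes the speaker louder in this sketch.
louder = apply_gain([1.0, -1.0], 1.5)
```

Under this assumption a zoomed-in, larger face sounds closer and a zoomed-out face sounds farther away. The clamp keeps extreme zoom settings from muting or clipping the output.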

As a result, attendees on the receiving side can feel as if the transmitting-side attendees were emitting sound from the receiving-side display device. In other words, the hands-free videophone device can reproduce on the receiving side, as faithfully as possible, the situation in which the transmitting-side conference participants are producing voice and sound, and can thereby create a more realistic space.

FIG. 1 is a diagram showing an outline of the configuration when the hands-free videophone according to the embodiment is used for a video conference.
FIG. 2 is a diagram showing the configuration of the hands-free videophone device according to the embodiment.
FIG. 3 is a diagram showing the configuration and data flow when a video conference is held using the hands-free videophone device according to the embodiment.
FIG. 4 is a diagram showing the face area position information calculated by the face area detection unit of the hands-free videophone device according to the embodiment.
FIG. 5 is a diagram explaining the relationship between a person's face area position information, the zoom magnification of the camera, and the direction in which the person is present, in the hands-free videophone device according to the embodiment.
FIG. 6 is a diagram showing the flow from the face area position information to the identification of the sound receiving direction, taking the zoom magnification and direction of the camera into account, in the hands-free videophone device according to the embodiment.
FIG. 7 is a diagram for explaining the method of controlling sound information in the hands-free videophone device according to the embodiment.
FIG. 8 is a diagram for explaining the method of controlling sound information in the hands-free videophone device according to the embodiment.
FIG. 9 is a diagram showing a second configuration of the hands-free videophone device according to the embodiment.
FIG. 10 is a diagram showing a third configuration of the hands-free videophone device according to the embodiment.
FIG. 11 is a diagram showing the configuration and data flow when a video conference is held using the hands-free videophone device of the third configuration according to the embodiment.

Explanation of symbols

100 Hands-free videophone device
101 Transmission processing unit
102 Reception processing unit
103 Camera
104 Sound receiving unit
105 Sound receiving direction determination unit
106 Speaker
107 Display device
108 Face area detection unit
109 Received sound signal reproduction unit

Claims

1. An apparatus for integrating a video signal and an acoustic signal, comprising:
video acquisition means for acquiring a video signal;
sound acquisition means for acquiring an acoustic signal;
transmitting means for transmitting the acoustic signal acquired by the sound acquisition means and the video signal acquired by the video acquisition means;
receiving means for receiving the video signal and the acoustic signal transmitted by the transmitting means;
video display means for displaying the video signal received by the receiving means;
acoustic control means for controlling the acoustic signal received by the receiving means; and
acoustic output means for outputting the acoustic signal controlled by the acoustic control means,
the apparatus further comprising:
face position detection means for detecting, from the video signal acquired by the video acquisition means, the face position of one or more persons in the video signal; and
person direction specifying means for specifying the direction in which each person is present, based on the face position of each person detected by the face position detection means and the video acquisition condition of the video acquisition means,
wherein the sound acquisition means acquires an acoustic signal from the direction of each person specified by the person direction specifying means, and
the acoustic control means forms a sound image of the acoustic signal corresponding to each person near the face position of that person in the video signal displayed on the video display means.

2. An apparatus for integrating a video signal and an acoustic signal, comprising:
video acquisition means for acquiring a video signal;
sound acquisition means for acquiring an acoustic signal;
transmitting means for transmitting the acoustic signal acquired by the sound acquisition means and the video signal acquired by the video acquisition means;
receiving means for receiving the video signal and the acoustic signal transmitted by the transmitting means;
video display means for displaying the video signal received by the receiving means;
acoustic control means for controlling the acoustic signal received by the receiving means; and
acoustic output means for outputting the acoustic signal controlled by the acoustic control means,
the apparatus further comprising:
face position detection means for detecting, from the video signal acquired by the video acquisition means, the face position of one or more persons in the video signal; and
person direction specifying means for specifying the direction in which each person is present, based on the face position of each person detected by the face position detection means and the video acquisition condition of the video acquisition means,
wherein the sound acquisition means acquires acoustic signals from all directions, and
the acoustic control means generates, from the acoustic signals from all directions, an acoustic signal from the direction of each person based on the information on the direction of each person specified by the person direction specifying means, and forms a sound image of the generated acoustic signal corresponding to each person near the face position of that person in the video signal displayed on the video display means.

3. The apparatus for integrating a video signal and an acoustic signal according to claim 1, wherein the video acquisition condition of the video acquisition means includes a zoom magnification of the video acquisition means.

4. The apparatus for integrating a video signal and an acoustic signal according to any one of claims 1 to 3, wherein the video acquisition condition of the video acquisition means includes direction information of the video acquisition means.

5. The apparatus for integrating a video signal and an acoustic signal according to claim 3, wherein the sound acquisition means increases or decreases the volume level of the acquired acoustic signal according to the zoom magnification included in the video acquisition condition.

6. The apparatus for integrating a video signal and an acoustic signal according to claim 3, wherein the acoustic control means increases or decreases the volume level of the acoustic signal output from the acoustic output means according to the zoom magnification included in the video acquisition condition.