JP2010041485A

JP2010041485A - Video/voice output device

Info

Publication number: JP2010041485A
Application number: JP2008203138A
Authority: JP
Inventors: Hiroto Kawachi; 洋人河内; Kazusane Sugaya; 和実菅谷; Teiji Suzuki; 禎司鈴木
Original assignee: Pioneer Electronic Corp
Current assignee: Pioneer Corp
Priority date: 2008-08-06
Filing date: 2008-08-06
Publication date: 2010-02-18

Abstract

PROBLEM TO BE SOLVED: To provide a voice localization technology without causing an uncomfortable feeling even when a scene change occurs while the same person is speaking.

SOLUTION: This video/voice output device 1 includes: a video analysis part 11 analyzing video to identify the position of a speaker, and detecting presence of a scene change to determine whether the identified speaker is the same person before and after the scene change; a speaker voice localization parameter setting part 12 setting the value of a speaker voice localization parameter to localize the voice at the position of the identified speaker; a speaker voice localization parameter adjustment part 14 adjusting the value of the speaker voice localization parameter to reduce the change of the localized position with respect to the value of the speaker voice localization parameter set by the speaker voice localization parameter setting part 12 when the identified speaker is determined to be the same person before and after the scene change; a localization process part 15 executing a localization change process of the voice in accordance with the adjusted value of the speaker voice localization parameter; and a voice output part 17 outputting the voice changed in localization.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a video/audio output device that outputs content data including video and audio, and more particularly to a video/audio output device that determines sound localization according to the speaker position in the video and controls audio output accordingly.

When program content such as a television broadcast is received, the video is shown on a display, and the audio is output from loudspeakers, monaural audio makes a person's voice heard from the position of the loudspeaker. With stereo or surround audio, a person's voice is in many cases localized at the center of the screen, so that it is heard from the center of the screen.

However, it is generally known that the sense of presence increases when a person's voice is localized at the speaker's position on the display. Sound localization techniques have therefore been disclosed that identify the speaker position through video analysis and localize the sound at that position.

For example, Patent Document 1 detects the position of a speaker and controls the volume of the audio output from a plurality of loudspeakers according to the detected position. Patent Document 2 identifies the position of the person speaking, applies effects and volume adjustment according to the identified position, and outputs the audio data from the most suitable loudspeaker.

Patent Document 1: JP-A-11-313272
Patent Document 2: JP 2007-110582 A

However, because the prior art described above localizes the sound at the speaker position without considering the content of the scene, some scenes cause stress rather than a heightened sense of presence. For example, in a scene where the camera angle changes in the middle of a line of dialogue and the on-screen position of the same speaker changes abruptly, the sound localization position changes in the middle of that person's line, and the viewer watching the scene ends up feeling stressed.

Thus, because the prior art uniformly localizes the sound at the speaker position without considering the content of the scene, a scene in which a scene change occurs in the middle of one person's line and the speaker position changes abruptly produces a sense of incongruity rather than a heightened sense of presence.

The present invention has been made in view of the above circumstances. One example of its object is to provide, for a sound localization technique that identifies the speaker position and localizes the sound at that position, a video/audio output device that causes no sense of incongruity even when a scene change occurs while the same person is speaking and the speaker position changes abruptly.

To achieve the above object, the video/audio output device according to claim 1 is a video/audio output device that controls sound localization based on a sound localization parameter, comprising: speaker position identifying means for analyzing video to identify the position of a speaker; sound localization parameter setting means for setting the value of the sound localization parameter so that the sound is localized at the speaker position identified by the speaker position identifying means; scene change detecting means for analyzing the video to detect whether a scene change has occurred; same-speaker determining means for analyzing the video or audio to determine whether the speaker identified by the speaker position identifying means is the same person before and after the scene change; sound localization parameter adjusting means for adjusting, when the scene change detecting means detects a scene change and the same-speaker determining means determines that the identified speaker is the same person before and after that scene change, the value of the sound localization parameter so as to reduce the change in localization position relative to the value set by the sound localization parameter setting means; and localization change output means for performing sound localization change processing according to the value of the sound localization parameter adjusted by the sound localization parameter adjusting means and outputting the video and audio.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a schematic configuration diagram of a video/audio output device 1 according to an embodiment of the present invention. The video/audio output device 1 outputs audio with sound localization matched to the speaker position while taking into account scene changes that occur during the same speaker's utterance. Specifically, it comprises a video analysis unit 11, a speaker voice localization parameter setting unit 12, a speaker voice localization parameter adjustment unit 14, a localization processing unit 15, a video display unit 16, and an audio output unit 17.

The video/audio output device 1 may be any device that has a function of reproducing externally input content data including video and audio and outputting it externally; concrete examples include a television (TV), a DVD player or recorder, a BD player or recorder, and a personal computer (PC). A "speaker" is a person who is speaking in the video data (on the screen), and the "speaker position" is the speaker's position on the screen, or more precisely the position near the speaker's face (in particular the mouth). "Outputting audio with sound localization matched to the speaker position" means outputting the audio so that it is heard from the speaker's position, for example by raising the volume of the audio output from a loudspeaker on the left side of the screen when the speaker is on the left side of the screen.

The video analysis unit 11 outputs the input video data to the video display unit 16 (delaying the video data as necessary to keep it synchronized with the audio data) and identifies the speaker position from the input video data. The speaker position is identified using a known technique. For example, the speaker may be identified by detecting human face regions in the video data and detecting mouth movement within each face. In detecting mouth movement, a difference such as the luminance difference of the mouth region is calculated as a feature value over several preceding and following frames of video data, and the person whose mouth region has the largest feature value is determined to be the speaker; this allows the speaker to be identified even when multiple faces are detected.
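The mouth-movement approach described above can be sketched as follows. This is only an illustration under stated assumptions: the frame format (a 2-D grid of luminance values per mouth crop), the face identifiers, and the names `mouth_activity` and `pick_speaker` are hypothetical, and face and mouth detection are assumed to have been done already by some known technique.

```python
from typing import Dict, Sequence

# Hypothetical format: one mouth crop is a 2-D grid of luminance values.
Frame = Sequence[Sequence[int]]


def mouth_activity(frames: Sequence[Frame]) -> float:
    """Feature value: summed absolute luminance difference between
    consecutive mouth crops of one face."""
    total = 0.0
    for prev, cur in zip(frames, frames[1:]):
        for row_prev, row_cur in zip(prev, cur):
            total += sum(abs(a - b) for a, b in zip(row_prev, row_cur))
    return total


def pick_speaker(mouth_crops: Dict[str, Sequence[Frame]]) -> str:
    """Return the id of the face whose mouth region has the largest
    feature value, i.e. the face assumed to be speaking."""
    return max(mouth_crops, key=lambda face_id: mouth_activity(mouth_crops[face_id]))
```

With two detected faces, `pick_speaker({"left": crops_left, "right": crops_right})` returns the key of the face whose mouth crops vary most over the supplied frames.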

The video analysis unit 11 also detects whether a scene change has occurred in the input video data, using a known technique. For example, the luminance difference Vd between the current frame and the previous frame may be calculated for each pixel, the number of pixels Vdcnt whose difference Vd is at or above a threshold counted, and a scene change determined to have occurred when Vdcnt exceeds a predetermined proportion of the total number of pixels.
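A minimal sketch of this frame-difference test, assuming frames are given as 2-D grids of luminance values. The threshold values `diff_threshold` and `ratio_threshold` are placeholders chosen for illustration; the text does not specify them.

```python
from typing import Sequence

Frame = Sequence[Sequence[int]]  # hypothetical format: 2-D grid of luminance values


def is_scene_change(prev: Frame, cur: Frame,
                    diff_threshold: int = 30,
                    ratio_threshold: float = 0.5) -> bool:
    """Count pixels whose luminance difference Vd is at or above the
    threshold (Vdcnt) and report a scene change when Vdcnt exceeds the
    given proportion of all pixels."""
    total_pixels = 0
    vd_cnt = 0
    for row_prev, row_cur in zip(prev, cur):
        for a, b in zip(row_prev, row_cur):
            total_pixels += 1
            if abs(a - b) >= diff_threshold:  # Vd >= threshold
                vd_cnt += 1
    return total_pixels > 0 and vd_cnt / total_pixels > ratio_threshold
```

Because the comparison is a strict "exceeds", a frame pair in which exactly half of the pixels change is not reported as a scene change with the default ratio.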

When the video analysis unit 11 determines that a scene change has occurred, it also determines whether the identified speaker is the same person. Specifically, the video analysis unit 11 calculates the facial feature value of the speaker identified in the current scene, compares it with the facial feature value of the speaker identified in the previous scene (before the scene change), and determines that the identified speaker is the same person when the facial feature values are equal. The facial feature value of the speaker in the previous scene is held in a temporary storage area, as described later. The facial feature value is, for example, a value calculated from the shapes and positional relationships of facial parts, and a known technique is used to calculate it.

The video analysis unit 11 further outputs the identified speaker position to the speaker voice localization parameter setting unit 12, and outputs the determination information on whether a scene change has occurred and on whether the speaker is the same to the speaker voice localization parameter adjustment unit 14.

The speaker voice localization parameter setting unit 12 sets the value of a parameter for localizing the audio data at the speaker position input from the video analysis unit 11 (hereinafter, the speaker voice localization parameter). The "value of a parameter for localizing the audio data at the speaker position" is a parameter value that causes the audio to be output so that it is heard from the speaker position. For example, it means parameter values for volume adjustment (a volume setting for each of a plurality of loudspeakers) that raise the volume of the loudspeaker installed near the speaker position and lower the volume of the other loudspeakers.
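As a rough illustration of such per-loudspeaker volume settings, the sketch below maps a horizontal speaker position to a pair of gains for a left and a right loudspeaker. The two-loudspeaker layout, the linear panning law, and the name `stereo_gains` are assumptions made for this example; the text does not prescribe a particular mapping.

```python
from typing import Tuple


def stereo_gains(speaker_x: float, screen_width: float = 1440.0) -> Tuple[float, float]:
    """Return (left_gain, right_gain) for a speaker at horizontal pixel
    position speaker_x. The loudspeaker nearer the speaker gets the
    larger gain (simple linear panning, assumed for illustration)."""
    pan = min(max(speaker_x / screen_width, 0.0), 1.0)  # 0 = far left, 1 = far right
    return (1.0 - pan, pan)
```

A speaker at the horizontal center of a 1440-pixel-wide screen yields equal gains, and a speaker on the left half yields a larger left gain.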

The speaker voice localization parameter setting unit 12 outputs the set speaker voice localization parameter value to the speaker voice localization parameter adjustment unit 14.

The speaker voice localization parameter adjustment unit 14 receives the speaker voice localization parameter value set by the speaker voice localization parameter setting unit 12, together with the determination information output by the video analysis unit 11 on whether a scene change has occurred and whether the speaker is the same, and adjusts the set value. Specifically, when a scene change has occurred and the identified speaker is the same person before and after it, the unit adjusts (corrects) the speaker voice localization parameter value so as to reduce the amount of sound localization change made in localizing the sound at the current speaker position.

Here, "adjusting the speaker voice localization parameter value so as to reduce the amount of sound localization change made in localizing the sound at the current speaker position" can be explained with an example in which a scene change occurs and the same speaker moves from the left side of the screen to the right side. If the scene change were not considered at all, the speaker voice localization parameter would take a value P1 that sets the right loudspeaker to output at volume A1; the adjustment changes this to a value P2 that sets the right loudspeaker to output at volume A2 (< A1). In other words, when a scene change moves the same speaker from the left to the right and that speaker continues speaking across the scene change, the speaker voice localization parameter value is not made to follow the speaker position with an extreme change. Instead, it is adjusted to change gently, for example by localizing the sound at the center of the screen. As a result, even if a scene change occurs while the same speaker is speaking and the speaker position changes, the viewer feels no incongruity. "Reducing the amount of sound localization change" may mean reducing the change relative to the input audio data (which is usually localized at the center of the screen), or reducing it relative to the speaker voice localization parameter value set immediately before.
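One way to picture this adjustment is as a blend between a reference localization and the new speaker position. The sketch below is an assumption-laden illustration, not the method of the text: the function name `adjust_position`, the blending factor `follow`, and the use of 2-D positions as the parameter are all hypothetical. The reference can be either the previous localization or the screen center, matching the two options mentioned above.

```python
from typing import Tuple

Point = Tuple[float, float]


def adjust_position(target: Point, reference: Point,
                    scene_change: bool, same_speaker: bool,
                    follow: float = 0.5) -> Point:
    """Return the localization position to use.

    target:    position that would be used if the scene change were ignored
    reference: previous localization, or the localization of the input
               audio (for example the screen center)
    follow:    hypothetical fraction (0..1) of the jump that is applied
    """
    if not (scene_change and same_speaker):
        return target  # no adjustment: localize at the speaker position
    # Move only part of the way from the reference toward the target.
    return tuple(r + follow * (t - r) for t, r in zip(target, reference))
```

With `follow=0.0` the sound stays at the reference position, and with `follow=1.0` the adjustment is effectively disabled.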

The above description of the speaker voice localization parameter adjustment used the case where the same speaker's position changes across the scene change, but the position may also remain unchanged across the scene change. In that case the speaker voice localization parameter value does not change across the scene change, so no sound localization change arises. The adjustment described above may therefore cover the case where the same speaker's position does not change, but since the amount of sound localization change is then 0, no adjustment is effectively performed.

The speaker voice localization parameter adjustment unit 14 outputs the adjusted speaker voice localization parameter value to the localization processing unit 15.

The localization processing unit 15 receives the audio data and the speaker voice localization parameter value output from the speaker voice localization parameter adjustment unit 14, and performs localization change processing on the audio data based on the adjusted value. It then outputs the processed audio data to the audio output unit 17.

The video display unit 16 outputs the video data received from the video analysis unit 11 for display on a display or the like.

The audio output unit 17 outputs the audio data that has undergone localization change processing to the loudspeakers.


Next, the function of the speaker voice localization parameter adjustment unit 14 is described concretely with reference to FIG. 2, namely the adjustment of the speaker voice localization parameter when a scene change occurs while the same speaker is speaking and that speaker moves.

The concrete example in FIG. 2 is described using the coordinate system shown in FIG. 3. For an image size of 1440 × 1080, the coordinate system is defined in pixel units with the origin at the upper left of the screen, the X axis horizontal, and the Y axis vertical. The position of the speaker SP identified on the screen is the position of the face; in this embodiment, the coordinates of the four corners of the rectangular face region F are taken as the position of the speaker SP. Specifically, the position of the speaker SP is given by the upper-left vertex S0 (X0, Y0), upper-right vertex S1 (X1, Y1), lower-left vertex S2 (X2, Y2), and lower-right vertex S3 (X3, Y3) of the face region F.

In the concrete example of FIG. 2, the speaker voice localization parameter described above is treated as a speaker voice localization position P (Px, Py), and the audio is adjusted and output so that it is heard from P. The example shows the case where P is normally set to the center of the identified speaker's face region F, and is set to the center of the screen when a scene change occurs during the same speaker's utterance and the speaker position moves.

FIG. 2(a) shows the speaker position in scene 1 before the scene change, where speaker A is on the left side of the screen. As FIG. 2(a) shows, the face region F of speaker A is S0 (200, 220), S1 (580, 220), S2 (200, 600), S3 (580, 600), so the speaker voice localization position P is P1 (390, 410), the center of the face region F.

FIG. 2(b), in contrast, shows the speaker position in scene B after the scene change, where speaker A has moved from the left side of the screen to the right side. As FIG. 2(b) shows, the face region F of speaker A is S0 (860, 220), S1 (1240, 220), S2 (860, 600), S3 (1240, 600), so the center of the face region F is P2 (1050, 410); the speaker voice localization position P, however, is P3 (720, 540), the center of the screen.

Thus, when a scene change occurs while speaker A is speaking and speaker A's position moves across the scene change, the speaker's voice is localized at the center of the screen so that the viewer feels no incongruity. If the scene change were not considered, the speaker's voice would follow the speaker position and be localized there, so the speaker voice localization position P would be P2 (1050, 410).

That is, when the speaker voice localization position P is determined with the scene change taken into account, P changes from P1 (390, 410) to P3 (720, 540); when it is determined without considering the scene change, P changes from P1 (390, 410) to P2 (1050, 410). The change P1 (390, 410) → P3 (720, 540) is smaller than the change P1 (390, 410) → P2 (1050, 410), which illustrates concretely what is meant above by "adjusting the speaker voice localization parameter value so as to reduce the amount of sound localization change".
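The arithmetic of this example can be checked with a short script. The coordinates are those given for FIG. 2; measuring the size of each change as a Euclidean distance in pixels is an assumption made here for illustration, since the text only states that one change is smaller than the other. The helper name `rect_center` is hypothetical.

```python
import math


def rect_center(corners):
    """Center of an axis-aligned rectangle given its four corner points."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)


face_before = [(200, 220), (580, 220), (200, 600), (580, 600)]    # FIG. 2(a)
face_after = [(860, 220), (1240, 220), (860, 600), (1240, 600)]   # FIG. 2(b)
screen = (1440, 1080)

p1 = rect_center(face_before)          # (390.0, 410.0)
p2 = rect_center(face_after)           # (1050.0, 410.0)
p3 = (screen[0] / 2, screen[1] / 2)    # (720.0, 540.0), screen center

shift_following = math.dist(p1, p2)    # 660.0 pixels: follow the speaker
shift_adjusted = math.dist(p1, p3)     # about 354.7 pixels: move to screen center
```

Under this distance measure, moving the localization to the screen center roughly halves the jump compared with following the speaker to the right side of the screen.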

Next, the video/audio output processing of the video/audio output device 1 of this embodiment is described with reference to FIG. 4. FIG. 4 is a flowchart showing the flow of the video/audio output processing in which the video/audio output device 1 performs sound localization control while taking into account scene changes during the same speaker's utterance.

First, the video analysis unit 11 of the video/audio output device 1 analyzes the input video data and identifies the speaker position in the video data (step S10).

Next, the speaker voice localization parameter setting unit 12 of the video/audio output device 1 sets the speaker voice localization parameter value based on the identified speaker position (step S20).

Next, the video analysis unit 11 of the video/audio output device 1 performs scene change detection processing (step S30), in which it analyzes the input video data and determines whether a scene change has occurred.

Next, when the video analysis unit 11 of the video/audio output device 1 determines that a scene change has occurred, it performs same-speaker determination processing to determine whether the identified speaker is the same person before and after the scene change (step S40).

The same-speaker determination processing is described here with reference to FIG. 5. FIG. 5 is a flowchart showing in detail the flow of the same-speaker determination processing in step S40 of FIG. 4.

The video analysis unit 11 of the video/audio output device 1 extracts the facial feature value of the speaker identified in the current scene (the scene after the scene change) (step S41) and compares it with the facial feature value of the speaker identified in the previous scene (the scene before the scene change) (step S42).

Next, the video analysis unit 11 of the video/audio output device 1 determines whether the facial feature value of the speaker identified in the current scene equals that of the speaker identified in the previous scene (step S43). If they are equal (step S43: YES), it determines that the speaker has not changed, that is, that the speaker is the same (step S44); if they are not equal (step S43: NO), it determines that the speaker has changed, that is, that the speaker is not the same (step S45).

Finally, the video analysis unit 11 of the video/audio output device 1 saves the facial feature value of the speaker in the current scene to the temporary storage area (step S46).
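Steps S41 to S46 can be sketched as a small stateful helper. The class name `SameSpeakerJudge`, the representation of a feature value as a list of numbers, and the `tolerance` parameter are assumptions for illustration: the text only says the feature values are compared for equality, which corresponds to `tolerance=0.0`, and feature extraction itself is left to a known technique.

```python
from typing import List, Optional, Sequence


class SameSpeakerJudge:
    """Keeps the previous scene's feature value and compares it with the
    current one each time a scene change is detected."""

    def __init__(self, tolerance: float = 0.0) -> None:
        self.tolerance = tolerance                     # hypothetical matching margin
        self._previous: Optional[List[float]] = None   # temporary storage area

    def judge(self, current: Sequence[float]) -> bool:
        """Return True if the current feature value matches the stored one
        (S42 to S45), then store the current value for next time (S46)."""
        previous = self._previous
        same = (
            previous is not None
            and len(previous) == len(current)
            and all(abs(a - b) <= self.tolerance for a, b in zip(previous, current))
        )
        self._previous = list(current)  # S46: save for the next scene change
        return same
```

The first call always returns `False` because nothing has been stored yet; how the first scene should be treated is not stated in the text, so this is a design choice of the sketch.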

Returning to FIG. 4, the speaker voice localization parameter adjustment unit 14 of the video/audio output device 1 receives from the video analysis unit 11 the determination information on whether a scene change has occurred and whether the speaker is the same, and determines whether a scene change has occurred and the speaker is the same before and after it (step S60).

If a scene change has occurred and the speaker is the same before and after it (step S60: YES), the speaker voice localization parameter adjustment unit 14 of the video/audio output device 1 adjusts the speaker voice localization parameter value so that the amount of sound localization change toward the speaker position becomes smaller (step S70).

Next, the localization processing unit 15 of the video/audio output device 1 changes the sound localization of the audio data according to the speaker voice localization parameter value that has been set (step S80). That is, if a scene change has occurred and the speaker is the same before and after it (step S60: YES), the sound localization of the audio data is changed using the value adjusted to reduce the amount of sound localization change toward the speaker position; otherwise (step S60: NO), it is changed using the value set in step S20.

Next, the video display unit 16 of the video/audio output device 1 outputs the video data, and the audio output unit 17 outputs the audio data whose sound localization has been changed.

In this embodiment, the video analysis unit 11 performs the same-speaker determination processing by analyzing the video data, but the method is not limited to this. For example, when the video analysis unit 11 determines that a scene change has occurred, the audio data may be analyzed to determine whether the identified speaker is the same person before and after the scene change.

FIG. 6 is a schematic configuration diagram of a video/audio output device 2 that performs the same-speaker determination processing based on audio data. The video/audio output device 2 outputs audio with sound localization matched to the speaker position while taking into account scene changes during the same speaker's utterance. Specifically, it comprises a video analysis unit 11, a speaker voice localization parameter setting unit 12, an audio analysis unit 13, a speaker voice localization parameter adjustment unit 14, a localization processing unit 15, a video display unit 16, and an audio output unit 17. That is, the video/audio output device 2 differs from the video/audio output device 1 in having the audio analysis unit 13 and is otherwise substantially the same. Only the configuration and functions that differ from the above embodiment are described below; for the rest, identical parts are given identical reference numerals and their description is omitted.

The voice analysis unit 13 outputs the input audio data to the localization processing unit 15. In addition, when the video analysis unit 11 determines that a scene change has occurred, the voice analysis unit 13 analyzes the input audio data to determine whether the identified speaker is the same person. Specifically, the voice analysis unit 13 calculates the voice feature amount of the speaker identified in the current scene (after the scene change) and compares it with the voice feature amount of the speaker identified in the previous scene (before the scene change). When the voice feature amounts are equal, it determines that the identified speaker is the same person. The voice feature amount of the speaker in the previous scene is held in a temporary storage area, as described later. Here, the voice feature amount is, for example, the frequency intensity obtained by spectrogram analysis of the voice, and a known technique is used to calculate it.
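The comparison of voice feature amounts described above can be illustrated with a minimal sketch. The specification only says that a known technique is used and that the features are compared for equality, so everything concrete here is an assumption: the function names `band_intensities` and `features_match`, the use of a naive discrete Fourier transform pooled into eight bands, and the cosine-similarity threshold that stands in for "equal".

```python
import cmath
import math


def band_intensities(samples, num_bands=8):
    """Return the normalized spectral intensity per band for one audio frame.

    Hypothetical feature: the specification only mentions "frequency
    intensity in spectrogram analysis" without fixing a method.
    """
    n = len(samples)
    half = n // 2
    # Magnitude of each DFT bin below the Nyquist frequency (naive O(n^2) DFT).
    mags = []
    for k in range(half):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc))
    # Pool neighboring bins into equal-width bands.
    band_size = max(1, half // num_bands)
    bands = [sum(mags[i:i + band_size])
             for i in range(0, band_size * num_bands, band_size)]
    total = sum(bands) or 1.0
    return [b / total for b in bands]


def features_match(a, b, threshold=0.9):
    """Treat two feature vectors as "equal" when their cosine similarity
    reaches an assumed threshold (the threshold is not from the source)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return False
    return dot / (norm_a * norm_b) >= threshold


def tone(freq_hz, amplitude=1.0, rate=8000, n=256):
    """Generate a test sine tone standing in for a speaker's voice."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / rate)
            for t in range(n)]


previous = band_intensities(tone(250))
same_voice = band_intensities(tone(250, amplitude=0.5))
other_voice = band_intensities(tone(3000))
print(features_match(previous, same_voice))   # same spectrum shape
print(features_match(previous, other_voice))  # different spectrum shape
```

A tolerance is used rather than strict equality because features computed from real audio rarely coincide exactly; how close counts as "equal" is a design choice left open by the text.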

The voice analysis unit 13 also outputs the result of the same-speaker determination to the speaker voice localization parameter adjustment unit 14.

FIG. 7 is a flowchart showing the flow of the same-speaker determination process of the video/audio output device 2. The process of FIG. 7 corresponds to step S40 of FIG. 4.

The voice analysis unit 13 of the video/audio output device 2 extracts the voice feature amount of the speaker identified in the current scene (the scene after the scene change) (step S51) and compares it with the voice feature amount of the speaker identified in the previous scene (the scene before the scene change) (step S52).

Next, the voice analysis unit 13 determines whether the voice feature amount of the speaker identified in the current scene is equal to that of the speaker identified in the previous scene (step S53). If they are equal (step S53: YES), it determines that the speaker has not changed, that is, that the speaker is the same (step S54). If they are not equal (step S53: NO), it determines that the speaker has changed, that is, that the speaker is not the same (step S55).

Finally, the voice analysis unit 13 stores the voice feature amount of the speaker of the current scene in the temporary storage area (step S56).
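Steps S51 to S56 can be sketched as a small stateful object. This is an illustration under stated assumptions, not the implementation of the voice analysis unit 13: the class name `SameSpeakerJudge`, the per-element tolerance used in place of strict equality, and the convention that the first scene (no stored feature yet) is reported as "not the same speaker" are all hypothetical. The feature vector is assumed to have been extracted already (step S51).

```python
class SameSpeakerJudge:
    """Hypothetical stand-in for the same-speaker determination of FIG. 7."""

    def __init__(self, tolerance=0.05):
        self.tolerance = tolerance      # assumed tolerance for "equal"
        self.previous_feature = None    # plays the role of the temporary storage area

    def judge(self, current_feature):
        """Return True when the speaker is judged to be unchanged."""
        previous = self.previous_feature
        # S52/S53: compare the current feature with the stored one.
        same = (
            previous is not None
            and len(previous) == len(current_feature)
            and all(abs(p - c) <= self.tolerance
                    for p, c in zip(previous, current_feature))
        )
        # S54 (same speaker) or S55 (different speaker) is carried by `same`.
        # S56: keep the current feature for the next scene change.
        self.previous_feature = list(current_feature)
        return same


judge = SameSpeakerJudge()
print(judge.judge([0.50, 0.50]))  # no earlier scene stored yet
print(judge.judge([0.52, 0.48]))  # close to the stored feature
print(judge.judge([0.90, 0.10]))  # clearly different feature
```

Storing the feature unconditionally at the end mirrors the flowchart, where step S56 follows both branches.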

As described above, the video/audio output devices 1 and 2 according to the above embodiments include: the video analysis unit 11, which analyzes video to identify the position of a speaker; the speaker voice localization parameter setting unit 12, which sets the value of the speaker voice localization parameter so that the voice is localized at the position of the speaker identified by the video analysis unit 11; the video analysis unit 11, which also analyzes the video to detect the presence or absence of a scene change; the video analysis unit 11 or the voice analysis unit 13, which analyzes the video or the audio to determine whether the speaker identified by the video analysis unit 11 is the same person before and after the scene change; the speaker voice localization parameter adjustment unit 14, which, when the video analysis unit 11 detects a scene change and the video analysis unit 11 or the voice analysis unit 13 determines that the identified speaker is the same person before and after that scene change, adjusts the value of the speaker voice localization parameter so as to reduce the change in localization position relative to the value set by the speaker voice localization parameter setting unit 12; the localization processing unit 15, which performs the audio localization change process according to the value adjusted by the speaker voice localization parameter adjustment unit 14; and the audio output unit 17, which outputs the audio whose localization has been changed by the localization processing unit 15. Therefore, even if a scene change occurs during the utterance of the same speaker and the speaker position changes abruptly, the viewer does not experience a sense of incongruity.
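The role of the speaker voice localization parameter adjustment unit 14 can be sketched as follows. The text only requires that the change in localization position be made smaller than what the setting unit 12 would produce; the representation of the parameter as a pan value in [-1, 1], the function name `adjust_localization`, and the linear damping factor are assumptions made for this illustration.

```python
def adjust_localization(set_value, previous_value, scene_change,
                        same_speaker, damping=0.25):
    """Return the localization value to use for the current scene.

    set_value      -- value from the setting unit (hypothetical pan, -1 left .. 1 right)
    previous_value -- value in effect before the scene change
    damping        -- assumed fraction of the jump that is allowed through
    """
    if scene_change and same_speaker:
        # Same speaker across a cut: follow only part of the position jump.
        return previous_value + damping * (set_value - previous_value)
    # No cut, or a different speaker: use the set value unchanged.
    return set_value


# Speaker was on the left (-0.8); after the cut the same speaker appears on the right (0.8).
print(adjust_localization(0.8, -0.8, scene_change=True, same_speaker=True))
# A different speaker after the cut is localized at the set position.
print(adjust_localization(0.8, -0.8, scene_change=True, same_speaker=False))
```

With a damping of 0 the sound stays where it was, and with 1 the adjustment is disabled; any value in between reduces the jump without removing it.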

The video analysis unit 11 may also calculate the facial feature amount of the identified speaker from the video data and determine whether the calculated facial feature amount is the same before and after the scene change. In this case, extracting the facial feature amount from the video data makes it easy to determine whether the speaker identified before the scene change and the speaker identified after it are the same person.

Alternatively, the voice analysis unit 13 may calculate, from the audio data, the voice feature amount of the speaker identified by the video analysis unit 11 and determine whether the calculated voice feature amount is the same before and after the scene change. In this case, extracting the voice feature amount from the audio data makes it easy to determine whether the speaker identified before the scene change and the speaker identified after it are the same person.

The speaker voice localization parameter adjustment unit 14 may also adjust the value of the speaker voice localization parameter so that the voice is localized toward the center of the display screen. Even in a scene where a scene change occurs during the utterance of the same speaker and the speaker position changes abruptly, the voice is localized at the center of the screen, so the viewer can watch the content comfortably without a sense of incongruity.
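Localizing the voice at the screen center can be sketched together with a simple localization step. The specification does not state how the localization processing unit 15 renders a position, so the constant-power stereo panning law used here is a swapped-in, commonly used technique, and the names `adjust_to_center` and `stereo_gains` as well as the pan range [-1, 1] are assumptions.

```python
import math

CENTER = 0.0  # assumed pan value for the center of the display screen


def adjust_to_center(set_value, scene_change, same_speaker):
    """Pull the localization to the screen center when the same speaker
    continues across a scene change; otherwise keep the set value."""
    if scene_change and same_speaker:
        return CENTER
    return set_value


def stereo_gains(pan):
    """Constant-power panning: map a pan in [-1, 1] to (left, right) gains."""
    theta = (pan + 1.0) * math.pi / 4.0
    return math.cos(theta), math.sin(theta)


pan = adjust_to_center(0.7, scene_change=True, same_speaker=True)
left, right = stereo_gains(pan)
print(pan, round(left, 4), round(right, 4))
```

At the center both channels receive the same gain, and the sum of the squared gains stays at 1 for every pan value, so the perceived loudness does not change when the position is moved.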

Although embodiments of the present invention have been described above, the present invention is not limited to these embodiments. Various modifications and changes can be made to the embodiments without departing from the gist of the present invention, and embodiments incorporating such modifications and changes are also included in the technical scope of the present invention.

FIG. 1 is a schematic configuration diagram of a video/audio output device according to an embodiment of the present invention.
FIG. 2 is a diagram showing how the speaker position changes in video data input to the video/audio output device according to the embodiment of the present invention.
FIG. 3 is an example of video data input to the video/audio output device according to the embodiment of the present invention.
FIG. 4 is a flowchart showing the flow of the video/audio output process of the video/audio output device according to the embodiment of the present invention, in which audio localization is controlled in consideration of scene changes during the utterance of the same speaker.
FIG. 5 is a flowchart showing in detail the flow of the same-speaker determination process in step S40 of FIG. 4.
FIG. 6 is a schematic configuration diagram of a video/audio output device according to another embodiment of the present invention.
FIG. 7 is a flowchart showing in detail the flow of the same-speaker determination process of the video/audio output device according to the other embodiment of the present invention.

Explanation of symbols

1, 2 Video/audio output device
11 Video analysis unit
12 Speaker voice localization parameter setting unit
13 Voice analysis unit
14 Speaker voice localization parameter adjustment unit
15 Localization processing unit
16 Video display unit
17 Audio output unit

Claims

A video/audio output device that controls audio localization based on an audio localization parameter, comprising:
speaker position specifying means for analyzing video and specifying the position of a speaker;
audio localization parameter setting means for setting the value of the audio localization parameter so that the audio is localized at the position of the speaker specified by the speaker position specifying means;
scene change detecting means for analyzing the video and detecting the presence or absence of a scene change;
same speaker determining means for analyzing the video or the audio and determining whether the speaker specified by the speaker position specifying means is the same person before and after the scene change;
audio localization parameter adjusting means for adjusting, when the scene change detecting means detects a scene change and the same speaker determining means determines that the speaker specified by the speaker position specifying means is the same person before and after the scene change, the value of the audio localization parameter so as to reduce the change in localization position relative to the value set by the audio localization parameter setting means; and
localization change output means for performing audio localization change processing according to the value of the audio localization parameter adjusted by the audio localization parameter adjusting means, and outputting the video and the audio.

The video/audio output device according to claim 1, wherein the same speaker determining means calculates the facial feature amount of the speaker specified by the speaker position specifying means from the video, and determines whether the calculated facial feature amount is the same before and after the scene change.

The video/audio output device according to claim 1, wherein the same speaker determining means calculates the voice feature amount of the speaker specified by the speaker position specifying means from the audio, and determines whether the calculated voice feature amount is the same before and after the scene change.

The video/audio output device according to claim 1, wherein the audio localization parameter adjusting means adjusts the value of the audio localization parameter so that the audio is localized at a position toward the center of the display screen.

The video/audio output device according to claim 1, wherein the speaker position specifying means detects the position of a human face in the video, specifies the speaker from the movement of the mouth of the detected face, and sets the vicinity of the specified speaker's mouth as the speaker position.