JP2015142185A

JP2015142185A - Viewing method, viewing terminal and viewing program

Info

Publication number: JP2015142185A
Application number: JP2014012819A
Authority: JP
Inventors: 好江山口; Yoshie Yamaguchi; 豊國田; Yutaka Kunida; 明小島; Akira Kojima
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-01-27
Filing date: 2014-01-27
Publication date: 2015-08-03

Abstract

PROBLEM TO BE SOLVED: To provide a viewing method, a viewing terminal, and a viewing program that enable viewing suited to the user's preferences and reflecting the user's intention. SOLUTION: The viewing method at the viewing terminal includes the steps of: displaying a video synchronized with audio and acquiring information indicating a selected region or position included in the video; identifying a subject in the video on the basis of the information indicating the selected region or position; and mixing the audio on the basis of the identified subject.

Description

The present invention relates to a viewing method, a viewing terminal, and a viewing program.

A method has been disclosed in which a user's degree of concentration on a program is determined from changes in the user's biometric information measured by a biometric sensor, and video distribution is controlled based on statistics of the degree of concentration across all users (see, for example, Patent Document 1).

Japanese Unexamined Patent Application Publication No. 2003-111106 (JP 2003-111106 A)

However, with the above viewing method, the video and audio themselves reflect only the creator's intention, so there is a problem that the user cannot view the content in the way the user prefers.

In view of the above circumstances, an object of the present invention is to provide a viewing method, a viewing terminal, and a viewing program that enable viewing suited to the user's preferences and reflecting the user's intention.

One aspect of the present invention is a viewing method in a viewing terminal, the method comprising: a step of displaying a video synchronized with audio and acquiring information indicating a selected region or position included in the video; a step of specifying a subject in the video based on the information indicating the selected region or position included in the video; and a step of mixing the audio based on the specified subject.

One aspect of the present invention is the viewing method in which, in the step of specifying a subject in the video, position specifying information indicating the position of the subject, or image identification information for identifying the subject in the video, is acquired from a storage unit, and the subject in the video is specified based on the acquired position specifying information or image identification information.

One aspect of the present invention is the viewing method in which, in the step of specifying a subject in the video, the subject in the video is specified based on position specifying information indicating the position of the subject and subject selection information indicating the position of the selected subject in the video.

One aspect of the present invention is the viewing method in which, in the step of specifying a subject in the video, the specified subject and another subject in the video are associated with each other based on the position specifying information.

One aspect of the present invention is the viewing method in which, in the step of mixing the audio, the audio of the specified subject and the audio of another subject associated with the specified subject are mixed.

One aspect of the present invention is the viewing method in which, in the step of mixing the audio, the audio of the specified moving subject and the audio of another subject associated with the specified moving subject are mixed.

One aspect of the present invention is the viewing method in which, in the step of mixing the audio, the audio of the specified subject and the audio of another subject specified based on a predetermined condition are mixed.

One aspect of the present invention is a viewing terminal comprising: a display unit that displays a video synchronized with audio and acquires information indicating a selected region or position included in the video; a recognition unit that specifies a subject in the video based on the information indicating the selected region or position included in the video; and an audio mixing unit that mixes the audio based on the specified subject.

One aspect of the present invention is a viewing program for causing a computer to execute the viewing method.

According to the present invention, the audio mixing unit mixes the audio based on the specified subject. As a result, the viewing method, the viewing terminal, and the viewing program enable viewing suited to the user's preferences and reflecting the user's intention.

A block diagram showing a configuration example of the viewing terminal in the embodiment of the present invention.
A diagram showing a first example of audio recording in the embodiment of the present invention.
A diagram showing a second example of audio recording in the embodiment of the present invention.
A diagram showing an installation example of the camera in the embodiment of the present invention.
A diagram showing an example of a region containing the image of a subject in the embodiment of the present invention.
A diagram showing a first example of tracking a subject in the embodiment of the present invention.
A diagram showing a first example of directly selecting a subject in the embodiment of the present invention.
A diagram showing a second example of tracking a subject in the embodiment of the present invention.
A diagram showing an example of enlarging the gaze area in the embodiment of the present invention.
A diagram showing an example of selecting a region containing part of the image of a subject in the embodiment of the present invention.
A diagram showing a second example of directly selecting a subject in the embodiment of the present invention.
A diagram showing a third example of tracking a subject in the embodiment of the present invention.
A diagram showing the processing flow of a first example of tracking a subject from the start of playback in the embodiment of the present invention.
A diagram showing the processing flow of a second example of tracking a subject from the start of playback in the embodiment of the present invention.
A diagram showing the processing flow of a third example of tracking a subject from the start of playback in the embodiment of the present invention.

Hereinafter, a viewing method, a viewing terminal, and a viewing program according to an embodiment of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a block diagram showing a configuration example of the viewing terminal in the embodiment of the present invention. The viewing terminal 1 includes a playback operation unit 10, a playback control unit 11, a display unit 12, a sound source recognition unit 13, a storage unit 14, a decoding unit 15, an audio mixing unit 16, a mixing operation unit 17, and a synchronization unit 18.

At least some of the playback control unit 11, the sound source recognition unit 13, the decoding unit 15, the audio mixing unit 16, and the synchronization unit 18 are, for example, a processor such as a CPU (Central Processing Unit). At least some of these units are software function units that function by executing an application program loaded from a storage unit such as a ROM (Read Only Memory) into a storage unit such as a RAM (Random Access Memory). Note that at least some of these units may instead be hardware function units such as an LSI (Large Scale Integration) or an ASIC (Application Specific Integrated Circuit).

The playback operation unit 10 controls input of the name of the content to be played, and the start or stop of playback, according to operations by the user.
The playback control unit 11 controls the display unit 12, the sound source recognition unit 13, and the storage unit 14. The playback control unit 11 can thereby cause the video and audio of the content to be played.

The display unit 12 plays video and audio. The display unit 12 may include an acquisition unit (for example, a touch panel). Using the acquisition unit, the display unit 12 acquires information indicating a selected region or position included in the video. This allows the user to freely select the region that the user wants to view in the video being played on the display unit 12.

Hereinafter, the selected region is referred to as a “gaze area”. The gaze area can be selected freely, regardless of position or size. Hereinafter, information indicating the gaze area is referred to as “gaze area information”. The gaze area information includes information on the position and size of the gaze area and on the video. The user can also directly select a subject (for example, a person) that the user wants to view in the video being played on the display unit 12.

The sound source recognition unit 13 specifies a subject as a sound source. For example, the sound source recognition unit 13 specifies the position of the subject in the video. In this embodiment, the subjects as sound sources are persons wearing wireless microphones that collect their voices, and sound collecting microphones that collect sound around the stage. Note that a subject as a sound source does not necessarily have to be a person or a sound collecting microphone.

For each content, the storage unit 14 stores audio data from the wireless microphones that collect the persons' voices (per-person audio data), audio data from the sound collecting microphones that collect sound around the stage (per-microphone audio data), video data, image identification information, position specifying information, gaze area information, automatic gaze area information, and mixing adjustment information. This information may be encoded for each content. In FIG. 1, the storage unit 14 stores content 70 and content 71 as an example.

The image identification information is identification information for identifying a subject in the video. The position specifying information includes information for specifying a subject and information for specifying the position of the subject in the video. The position specifying information is expressed, for example, as coordinates. This information is transmitted from position sensors attached to the subjects as sound sources. The gaze area information is used at the start of playback and indicates the position, size, and video of the region that the user wants to view. The automatic gaze area information is the gaze area information used when the user directly selects a subject to view. The mixing adjustment information is adjustment information for the audio mixing elements of the subjects as sound sources.
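The per-content records held by the storage unit could be modeled roughly as follows. This is only an illustrative sketch: the class names, field names, and the default gaze size are assumptions introduced here, and the publication does not prescribe any particular data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class GazeArea:
    """Rectangular gaze area in video pixel coordinates (assumed layout)."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


@dataclass
class Content:
    """One content entry in the storage unit (all field names are hypothetical)."""
    name: str
    # subject id -> decoded audio samples (per-person and per-microphone audio)
    person_audio: Dict[str, List[float]] = field(default_factory=dict)
    mic_audio: Dict[str, List[float]] = field(default_factory=dict)
    # subject id -> time-stamped positions (time, x, y) from the position sensors
    positions: Dict[str, List[Tuple[float, float, float]]] = field(default_factory=dict)
    # gaze area used at the start of playback, if one was stored
    gaze_area: Optional[GazeArea] = None
    # preset width and height used when a subject is selected directly
    auto_gaze_size: Tuple[float, float] = (320.0, 240.0)
    # subject id -> mixing adjustment value (for example, a volume gain)
    mixing_adjustment: Dict[str, float] = field(default_factory=dict)
```

Keeping positions keyed by subject id lets the recognition step look up both persons and sound collecting microphones through the same table.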

The storage unit 14 searches for the requested content and outputs its audio data and video data to the decoding unit 15. The storage unit 14 outputs the image identification information, position specifying information, gaze area information, and automatic gaze area information of the requested content to the sound source recognition unit 13. The storage unit 14 outputs the mixing adjustment information to the audio mixing unit 16.

The decoding unit 15 decodes the audio data and video data received from the storage unit 14. The decoding unit 15 outputs the audio data to the audio mixing unit 16 and outputs the video data to the sound source recognition unit 13.
The audio mixing unit 16 performs mixing processing on the audio. Any mixing method may be used; the processing is not limited to a specific method.
The mixing operation unit 17 accepts setting operations for the mixing processing.
The synchronization unit 18 synchronizes the mixed audio and the video frame by frame, based on the playback time information that each of them holds. The synchronization unit 18 outputs the synchronized audio and video to the display unit 12.
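One way to picture frame-by-frame synchronization is to map each video frame to the range of audio samples that share its playback time. The sketch below assumes a constant frame rate and a constant audio sample rate; the function name and parameters are hypothetical, and the publication does not specify how the playback time information is encoded.

```python
def audio_span_for_frame(frame_index: int, fps: float, sample_rate: int) -> tuple:
    """Return (start, end) audio sample indices covering one video frame.

    Assumes frame n is shown during [n / fps, (n + 1) / fps) seconds.
    """
    start = round(frame_index * sample_rate / fps)
    end = round((frame_index + 1) * sample_rate / fps)
    return start, end
```

Computing both ends from the frame index, rather than accumulating a per-frame sample count, keeps rounding error from drifting over long playback.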

FIG. 2 is a diagram showing a first example of audio recording in the embodiment of the present invention. The wireless microphone 60 is attached near the mouth of person A (for example, on the chest or at the back of the neck) and collects the voice of person A. The wireless microphone 61 is attached near the mouth of person B and collects the voice of person B. The wireless microphone 62 is attached near the mouth of person C and collects the voice of person C. The wireless microphone 63 is attached near the mouth of person D and collects the voice of person D.

Sound collecting microphones 30, 31, 32, 33, and 34 (hereinafter referred to as “sound collecting microphones 30 to 34”) are installed at the front of the stage 80 and around the live band. When a song is performed live, the sound collecting microphones 30 to 34 collect the melody of the live performance and the sound around the stage 80. Position sensors 40, 41, 42, 43, and 44 (hereinafter referred to as “position sensors 40 to 44”) are sensors that can output information specifying their current position (position specifying information). The position sensors 40 to 44 are attached to the sound collecting microphones 30 to 34, respectively. Position sensors 50, 51, 52, and 53 (hereinafter referred to as “position sensors 50 to 53”) are attached to persons A to D, respectively. The position specifying information is transmitted from the position sensors 40 to 44 and the position sensors 50 to 53, which are attached to the subjects as sound sources.

FIG. 3 is a diagram showing a second example of audio recording in the embodiment of the present invention. FIG. 3 differs from FIG. 2 in that the sound collecting microphones 30 to 34 are installed above the stage 80 and above the live band. When a song is performed live, the sound collecting microphones 30 to 34 collect the melody of the live performance and the sound around the stage 80. The wireless microphones 60, 61, 62, and 63 (hereinafter referred to as “wireless microphones 60 to 63”), the position sensors 40 to 44, and the position sensors 50 to 53 are the same as in FIG. 2.

FIG. 4 is a diagram showing an installation example of the camera in the embodiment of the present invention. The camera 20 is installed behind the audience seats, at a position from which it can shoot so that the entire stage 80 fits within the video frame.
FIG. 5 is a diagram showing an example of a region containing the image of a subject in the embodiment of the present invention. FIG. 5 shows a gaze area 100 containing the image of person D, the selected subject in the video being viewed. To select the range of the gaze area 100, which is the region the user wants to view, the user may specify the range directly by touching the screen of the display unit 12 of the viewing terminal 1, or may specify it with a mouse operation.

FIG. 6 is a diagram showing a first example of tracking a subject in the embodiment of the present invention. In FIG. 6, the user manually follows the position of person D on the screen of the display unit 12 by touching the screen. FIG. 6 thus shows manual follow-up playback of person D selected by the user. The playback control unit 11 plays the video and audio while the movement of the subject is followed manually on the screen of the display unit 12 of the viewing terminal 1. In FIG. 6, because the gaze area 100 has moved along with person D, the video shows persons B and C in the gaze area 100 in addition to person D. The gaze area 100 may be moved by specifying its range directly by touching the screen of the display unit 12 of the viewing terminal 1, or by specifying its range with a mouse operation.

FIG. 7 is a diagram showing a first example of directly selecting a subject in the embodiment of the present invention. In FIG. 7, person D is selected directly when the user touches, by hand or with a mouse operation, one point on the image of the subject on the screen of the display unit 12 of the viewing terminal 1. In this case, the height and width of the gaze area 100 are set in advance. Hereinafter, information indicating the gaze area set in advance is referred to as “automatic gaze area information”. Also, information indicating the position at which the user touched the image of the subject in the video is hereinafter referred to as “subject selection information”.
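A gaze area with a preset size can be derived from a single touched point. The following sketch centers the preset rectangle on the touch position and clamps it to the video frame; the centering and clamping rules are assumptions made for illustration, since the publication only states that the height and width are set in advance.

```python
def auto_gaze_area(touch_x, touch_y, size, frame_size):
    """Return (x, y, width, height) of a preset-size gaze area around a touch.

    size:       preset (width, height) of the gaze area
    frame_size: (width, height) of the video frame
    """
    w, h = size
    frame_w, frame_h = frame_size
    # Center on the touch, then keep the rectangle inside the frame.
    x = min(max(touch_x - w / 2, 0), frame_w - w)
    y = min(max(touch_y - h / 2, 0), frame_h - h)
    return (x, y, w, h)
```

For example, `auto_gaze_area(960, 540, (400, 300), (1920, 1080))` yields a 400 by 300 rectangle centered in a full HD frame.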

FIG. 8 is a diagram showing a second example of tracking a subject in the embodiment of the present invention. In FIG. 8, the playback control unit 11 follows the position of person D, selected by the user in FIG. 7, on the screen of the display unit 12 based on the position specifying information. That is, FIG. 8 shows automatic follow-up playback of person D selected by the user in FIG. 7. When a subject is selected directly, the playback control unit 11 plays the video and audio while automatically following the movement of the subject on the screen of the display unit 12 of the viewing terminal 1. In FIG. 8, because the gaze area 100 has moved along with person D, the video shows persons B and C in the gaze area 100 in addition to person D.
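Automatic follow-up can be sketched as looking up the subject's position for the current playback time and re-centering the gaze area on it. The code below assumes the position specifying information is available as time-stamped (time, x, y) samples sorted by time; that representation and the function names are hypothetical.

```python
from bisect import bisect_right


def position_at(track, t):
    """Return (x, y) of the latest sample at or before time t.

    track: list of (time, x, y) tuples sorted by time (assumed format).
    Falls back to the first sample when t precedes all samples.
    """
    times = [sample[0] for sample in track]
    i = max(bisect_right(times, t) - 1, 0)
    return track[i][1], track[i][2]


def follow(track, t, size):
    """Return the gaze area (x, y, w, h) centered on the subject at time t."""
    x, y = position_at(track, t)
    w, h = size
    return (x - w / 2, y - h / 2, w, h)
```

Calling `follow` once per displayed frame moves the gaze area along with the subject, which is how other persons can enter the area as the selected person moves.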

FIG. 9 is a diagram showing an example of enlarging the gaze area 100 in the embodiment of the present invention. FIG. 9 shows the case where the gaze area 100 selected in FIG. 6 has been enlarged. The user can specify enlargement of the gaze area 100 by hand or with a mouse operation on the screen of the display unit 12 of the viewing terminal 1.

Next, a method for specifying a subject (person) as a sound source in the gaze area 100 will be described. The image identification information is information for identifying the image of a subject. The position specifying information includes information for specifying a subject and information indicating the position of the subject on the screen. The gaze area information includes the position of the gaze area, the height and width of the gaze area, and information about the video. The subject selection information is information indicating the position of a directly selected subject in the video displayed on the screen.

In the case shown in FIG. 5, the sound source recognition unit 13 specifies the subject (person D) in the gaze area 100 based on the image identification information or the position specifying information, together with the gaze area information. As shown in FIG. 6, when there are multiple subjects (person D, person B, person C) in the gaze area 100, the sound source recognition unit 13 specifies each person individually based on the image identification information or the position specifying information, together with the gaze area information.

In the case shown in FIG. 7, the sound source recognition unit 13 specifies the directly designated subject (person D) based on the subject selection information and the position specifying information. As shown in FIG. 8, when there are multiple subjects (person D, person B, person C) in the gaze area 100, the sound source recognition unit 13 specifies each person individually based on the image identification information or the position specifying information, together with the gaze area information.

FIG. 10 is a diagram showing an example of selecting a region containing part of the image of a subject in the embodiment of the present invention. When even a part of the image of a subject as a sound source (persons A to D, sound collecting microphones 30 to 34) is within the gaze area 100, that subject is treated as being present in the gaze area 100. In FIG. 10, the sound source recognition unit 13 specifies the subjects (person A, person B) in the gaze area 100. The sound source recognition unit 13 also specifies the subjects (sound collecting microphone 31, sound collecting microphone 33) in the gaze area 100.
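The rule that a subject counts as present when any part of its image lies in the gaze area corresponds to a rectangle overlap test. The sketch below assumes each subject's on-screen image is approximated by an axis-aligned bounding box (x, y, width, height); the bounding-box representation and the function names are assumptions.

```python
def overlaps(a, b):
    """Return True when rectangles a and b share any interior area.

    Each rectangle is (x, y, width, height) in the same coordinate system.
    """
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def subjects_in_gaze_area(gaze, boxes):
    """Return ids of subjects whose box is at least partly inside the gaze area.

    boxes: dict mapping subject id -> bounding box of the subject's image.
    """
    return [sid for sid, box in boxes.items() if overlaps(gaze, box)]
```

A containment test would instead require the whole box to lie inside the gaze area, which would exclude subjects that are only partly visible at its edge.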

FIG. 11 is a diagram showing a second example of directly selecting a subject in the embodiment of the present invention. In FIG. 11, the sound source recognition unit 13 specifies person D, the directly designated subject, based on the subject selection information and the position specifying information.
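Specifying a directly designated subject from the subject selection information and the position specifying information can be sketched as a nearest-position lookup. The distance threshold and the function name below are assumptions for illustration; the publication does not state how the touched position is matched to a subject.

```python
import math


def select_subject(touch, positions, max_distance=50.0):
    """Return the id of the subject nearest to the touched point, or None.

    touch:        (x, y) from the subject selection information
    positions:    dict mapping subject id -> (x, y) on-screen position
    max_distance: hypothetical cutoff beyond which no subject is selected
    """
    best = min(positions, key=lambda sid: math.dist(touch, positions[sid]), default=None)
    if best is None or math.dist(touch, positions[best]) > max_distance:
        return None
    return best
```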

FIG. 12 is a diagram showing a third example of tracking a subject in the embodiment of the present invention. The sound source recognition unit 13 specifies persons A to C around the moving person D based on the subject selection information and the position specifying information. In FIG. 12, the sound source recognition unit 13 determines that the persons closest to person D, who is located in the middle of the video, are, in order, person B, person C, and person A.

Next, a method for specifying a subject (sound collecting microphone) as a sound source in the gaze area 100, and for associating a subject (person) with a subject (sound collecting microphone), will be described.
The sound source recognition unit 13 specifies all of the sound collecting microphones 30 to 34 based on the image identification information or the position specifying information. Based on the positional relationship between a person in the gaze area 100 and the sound collecting microphones, the sound source recognition unit 13 ranks the sound collecting microphones in order of proximity to that person. Specifically, persons and sound collecting microphones are associated as follows. Note that the sound source recognition unit 13 may associate persons A to D with the sound collecting microphones 30 to 34 based on a criterion other than order of proximity.
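Ranking the sound collecting microphones by proximity to a person amounts to sorting them by distance. The sketch below assumes two-dimensional coordinates and Euclidean distance; the coordinates in the usage note are invented for illustration and are not taken from the figures.

```python
import math


def rank_microphones(person_pos, mic_positions):
    """Return microphone ids sorted from nearest to farthest from the person.

    person_pos:    (x, y) of the person from the position specifying information
    mic_positions: dict mapping microphone id -> (x, y)
    """
    return sorted(mic_positions, key=lambda mic: math.dist(person_pos, mic_positions[mic]))
```

With hypothetical microphones 30 to 34 placed at x = 0, 10, 20, 30, 40 on a line and a person at (22, 5), `rank_microphones` returns `[32, 33, 31, 34, 30]`. A criterion other than distance could be swapped in by changing the sort key.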

In the case shown in FIG. 5, the sound source recognition unit 13 determines, based on the position specifying information, that the sound collecting microphones closest to the subject (person D) in the gaze area 100 are, in order, sound collecting microphones 32, 34, 31, 33, and 30.
In the case shown in FIG. 7, the sound source recognition unit 13 determines, based on the subject selection information and the position specifying information, that the sound collecting microphones closest to person D, the directly designated subject, are, in order, sound collecting microphones 32, 34, 31, 33, and 30.

図６又は図８に示すように、被写体である人物Ｄ、人物Ｂ、人物Ｃの３人の画像が注視領域１００にある場合、音源認識部１３は、位置特定情報に基づいて、人物Ｄに近い集音マイクの順番を、集音マイク３１、３３、３４、３０、３２の順と特定する。また、音源認識部１３は、位置特定情報に基づいて、人物Ｂに近い集音マイクの順番を、集音マイク３１、３３、３０、３４、３２の順と特定する。また、音源認識部１３は、位置特定情報に基づいて、人物Ｃに近い集音マイクの順番を、集音マイク３４、３１、３３、３２、３０の順と特定する。 As shown in FIG. 6 or FIG. 8, when the images of three persons, who are subjects D, B, and C, are in the gaze area 100, the sound source recognition unit 13 determines the person D based on the position specifying information. The order of the sound collecting microphones is specified as the order of the sound collecting microphones 31, 33, 34, 30, 32. Further, the sound source recognition unit 13 specifies the order of the sound collecting microphones close to the person B as the order of the sound collecting microphones 31, 33, 30, 34, and 32 based on the position specifying information. Further, the sound source recognition unit 13 specifies the order of the sound collecting microphones close to the person C as the order of the sound collecting microphones 34, 31, 33, 32, and 30 based on the position specifying information.

As shown in FIG. 10, when the images of two subjects, person A and person B, are in the gaze area 100, the sound source recognition unit 13 specifies, based on the position specifying information, the order of the sound collecting microphones closest to person A as sound collecting microphones 30, 33, 31, 34, 32. It likewise specifies the order of the sound collecting microphones closest to person B as sound collecting microphones 31, 33, 30, 34, 32. When only part of a subject serving as a sound source (a person or a sound collecting microphone) lies within the gaze area 100, the sound source recognition unit 13 determines that the subject is present in the gaze area 100.
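The rule that a partly visible subject counts as being in the gaze area can be illustrated with a rectangle overlap test. This is a sketch under stated assumptions: the description does not say how subjects or the gaze area are represented, so the axis-aligned `Rect` type, the pixel values, and the function names are hypothetical.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle in frame coordinates (assumed representation)."""
    x: float
    y: float
    width: float
    height: float


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the rectangles share any area, even a partial overlap."""
    return (
        a.x < b.x + b.width
        and b.x < a.x + a.width
        and a.y < b.y + b.height
        and b.y < a.y + a.height
    )


def subjects_in_gaze_area(gaze: Rect, subjects: dict) -> list:
    """Return the names of subjects whose bounding box overlaps the gaze area."""
    return [name for name, box in subjects.items() if overlaps(gaze, box)]


gaze_area = Rect(100, 100, 200, 150)
subjects = {
    "person A": Rect(120, 120, 40, 80),  # fully inside the gaze area
    "person B": Rect(280, 200, 40, 80),  # only partly inside
    "person D": Rect(400, 100, 40, 80),  # entirely outside
}
print(subjects_in_gaze_area(gaze_area, subjects))
```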

As shown in FIG. 11, the sound source recognition unit 13 specifies, based on the subject selection information and the position specifying information, the order of the sound collecting microphones closest to person D, the directly designated subject, as sound collecting microphones 32, 34, 31, 33, 30.
As shown in FIG. 12, the sound source recognition unit 13 specifies, based on the subject selection information and the position specifying information, the order of the sound collecting microphones closest to person D, the directly designated subject, as sound collecting microphones 31, 33, 34, 30, 32.

Next, audio mixing adjustment will be described.
The audio mixing unit 16 performs mixing processing on the audio. The audio mixing unit 16 changes the volume, which is one adjustment element of audio mixing. Any mixing method may be used; the processing is not limited to a specific method. For example, through the mixing processing, the audio mixing unit 16 may adjust timbre or another adjustment element of a subject's audio. The audio mixing unit 16 may also change other adjustment elements.

Audio mixing adjustment is not limited to the examples (mix-1) to (mix-6) shown below. In each of the examples (mix-1) to (mix-6), the user can set the adjustment of the mixing elements by operating the mixing operation unit 17.

<First example of audio mixing adjustment (mix-1)>
More than one sound collecting microphone may be used for audio mixing, but in the first example only one is used: the sound collecting microphone 32, which is closest to the subject (person D) in the gaze area 100 shown in FIG. 5. Note that the sound collecting microphone used for audio mixing need not be the one closest to the subject.

The audio mixing unit 16 lowers the volume of the sound collecting microphone 32 and leaves the volume of the audio of the subject (person D) unchanged. With these settings, the user can view the video with the selected person D emphasized even when several subjects appear in it.
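The first example can be pictured as a per-source gain followed by a sample-wise sum. The sketch below is not the mixing method of the embodiment, which the description leaves open; the sample values, the linear gain of 0.5, and the `mix` function are assumptions made for illustration.

```python
def mix(tracks, gains):
    """Scale each track by its gain and sum the tracks sample by sample.

    tracks: dict mapping a source name to a list of float samples
    gains:  dict mapping a source name to a linear gain (1.0 if absent)
    """
    length = min(len(samples) for samples in tracks.values())
    mixed = [0.0] * length
    for name, samples in tracks.items():
        gain = gains.get(name, 1.0)
        for i in range(length):
            mixed[i] += gain * samples[i]
    return mixed


# Hypothetical sample data for the two sources used in mix-1.
tracks = {
    "person D": [0.5, 0.5, 0.5, 0.5],
    "mic 32": [0.2, 0.4, 0.2, 0.4],
}
# mix-1: person D is left unchanged, sound collecting microphone 32 is lowered.
gains = {"mic 32": 0.5}  # the factor 0.5 is an illustrative assumption
output = mix(tracks, gains)
print(output)
```

Passing an empty `gains` dictionary leaves every source at unity gain, which corresponds to applying no adjustment at all.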

＜音声ミキシング調整の第２例（ｍｉｘ−２）＞
追いかけ再生の例について説明する。図５に示す注視領域１００における人物Ｄが移動し、図６に示す注視領域１００に人物Ｄ、人物Ｂ、人物Ｃが含まれる。音声ミキシングに用いられる集音マイクの数は複数でもよいが、音声ミキシング調整の第２例では、音声ミキシングに用いられる集音マイクの数は１台であるものとする。 <Second example of audio mixing adjustment (mix-2)>
An example of chasing playback will be described. The person D in the gaze area 100 illustrated in FIG. 5 moves, and the gaze area 100 illustrated in FIG. 6 includes the person D, the person B, and the person C. Although the number of sound collecting microphones used for sound mixing may be plural, in the second example of sound mixing adjustment, the number of sound collecting microphones used for sound mixing is assumed to be one.

The sound source recognition unit 13 identifies the sound collecting microphone closest to person D as sound collecting microphone 31, the one closest to person B as sound collecting microphone 31, and the one closest to person C as sound collecting microphone 34. Note that the sound collecting microphone used for audio mixing need not be the one closest to the subject.

The audio mixing unit 16 lowers the volume of these sound collecting microphones 31 and 34. That is, the audio mixing unit 16 lowers the volume of person B and person C and leaves the volume of person D's audio unchanged. With these settings, the user can view the video with the selected person D emphasized even during chasing playback.

＜音声ミキシング調整の第３例（ｍｉｘ−３）＞
音声ミキシングに用いられる集音マイクの数は何台でもよいが、音声ミキシング調整の第３例では、音声ミキシングに用いられる集音マイクの数は２台であるものとする。 <Third example of audio mixing adjustment (mix-3)>
The number of sound collecting microphones used for sound mixing may be any number, but in the third example of sound mixing adjustment, the number of sound collecting microphones used for sound mixing is two.

図１０に示す場合、音源認識部１３は、人物Ａに近い集音マイクの順番を、集音マイク３０、３３の順と特定する。また、音源認識部１３は、人物Ｂに近い集音マイクの順番を、集音マイク３１、３３の順と特定する。なお、音声ミキシングに用いられる集音マイクは、必ずしも被写体から近い集音マイクでなくてもよい。 In the case illustrated in FIG. 10, the sound source recognition unit 13 specifies the order of the sound collecting microphones close to the person A as the order of the sound collecting microphones 30 and 33. Further, the sound source recognition unit 13 specifies the order of the sound collecting microphones close to the person B as the order of the sound collecting microphones 31 and 33. Note that the sound collecting microphone used for audio mixing is not necessarily a sound collecting microphone close to the subject.

The audio mixing unit 16 leaves the volume of these sound collecting microphones 30, 31, and 33 unchanged. It also leaves unchanged the volume of the audio of the subjects (person A and person B) in the gaze area 100 shown in FIG. 10. With these settings, the user can focus viewing on the sound collecting microphones the user prefers among the plurality of sound collecting microphones.

<Fourth example of audio mixing adjustment (mix-4)>
Any number of sound collecting microphones may be used for audio mixing, but in the fourth example all of them are used, namely the sound collecting microphones 30 to 34 shown in FIG. 5. No mixing adjustment is applied to any of the sound collecting microphones 30 to 34, nor to the audio of any of the subjects (persons A to D). The mixing element settings of the fourth example are the ones used at the start of reproduction.

<Fifth example of audio mixing adjustment (mix-5)>
More than one sound collecting microphone may be used for audio mixing, but in the fifth example the sound collecting microphone used for audio mixing is the single microphone closest to each subject (person) in the gaze area 100.

In the case shown in FIG. 7, before person D moves, the audio mixing unit 16 lowers the volume of the sound collecting microphone 32, which is closest to the subject (person D) in the gaze area 100. In the case shown in FIG. 8, after person D has moved, the audio mixing unit 16 lowers the volume of the sound collecting microphone 31, which is closest to both person D and person B in the gaze area 100, and the volume of the sound collecting microphone 34, which is closest to person C in the gaze area 100.

図７又は図８に示す場合、音声ミキシング部１６は、注視領域１００における被写体（人物Ｄ）の音量を変更しない。音声ミキシング部１６は、人物Ｄ以外の人物Ａ、Ｂ、Ｃの音量を下げる。以上の設定により、ユーザは、選択した人物Ｄを強調した視聴をすることができる。 In the case illustrated in FIG. 7 or FIG. 8, the audio mixing unit 16 does not change the volume of the subject (person D) in the gaze area 100. The audio mixing unit 16 reduces the volume of the persons A, B, and C other than the person D. With the above settings, the user can view the selected person D with emphasis.
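The fifth example amounts to building a table of gains: the designated person stays unchanged, the other persons are lowered, and the microphone nearest each subject in the gaze area is lowered. The sketch below shows one way such a table might be assembled. The attenuation value, the source naming, and the `build_gains` function are assumptions; only the nearest-microphone assignments come from the description of FIG. 7 and FIG. 8.

```python
REDUCED = 0.5  # illustrative attenuation factor; the source gives no value


def build_gains(selected, persons, in_gaze_area, nearest_mic):
    """Build a per-source gain table in the spirit of mix-5.

    selected:     the person the user designated (volume left unchanged)
    persons:      every person that has an audio track
    in_gaze_area: persons currently inside the gaze area
    nearest_mic:  dict mapping a person to the ID of the closest microphone
    """
    gains = {}
    for person in persons:
        gains[person] = 1.0 if person == selected else REDUCED
    for person in in_gaze_area:
        gains[f"mic {nearest_mic[person]}"] = REDUCED
    return gains


persons = ["A", "B", "C", "D"]
# FIG. 7 (before person D moves): only person D is in the gaze area.
before = build_gains("D", persons, ["D"], {"D": 32})
# FIG. 8 (after the move): persons D, B, and C are in the gaze area.
after = build_gains("D", persons, ["D", "B", "C"], {"D": 31, "B": 31, "C": 34})
print(before)
print(after)
```

Recomputing the table whenever the gaze area contents change is what lets the emphasis follow person D from microphone 32 to microphone 31.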

<Sixth example of audio mixing adjustment (mix-6)>
Any number of sound collecting microphones may be used for audio mixing, but in the sixth example all of the sound collecting microphones 30 to 34 are used. The audio mixing unit 16 lowers the volume of all the sound collecting microphones 30 to 34 shown in FIG. 11 or FIG. 12. It does not change the volume of the subject (person D) shown in FIG. 11, nor the volume of the subject (person D) shown in FIG. 12 after person D has moved. It lowers the volume of the two persons near person D (person B and person C). With these settings, during chasing playback the user can view the video with only the selected person D emphasized in a way that suits the surrounding situation.

Next, the processing flow of the viewing terminal 1 will be described.
FIG. 13 is a diagram showing the flow of a first example of processing for tracking a subject from the start of reproduction in the embodiment of the present invention. FIG. 13 shows a first example of the processing flow that executes audio mixing when the gaze area 100 is selected from the start of reproduction.
The user enters the name of the content to be viewed into the reproduction operation unit 10. The reproduction operation unit 10 outputs information indicating the content name to the reproduction control unit 11 (step S1).
On receiving the information indicating the content name from the reproduction operation unit 10, the reproduction control unit 11 sends the content name and a content reproduction start command to the storage unit 14 (step S2).

Using the information indicating the content name received from the reproduction control unit 11, the storage unit 14 retrieves all the audio data (the audio data of persons A to D and the audio data of the sound collecting microphones 30 to 34) and the video data, and sends the video data and audio data to the decoding unit 15. The storage unit 14 also sends the image identification information, position specifying information, gaze area information, and automatic gaze area information of the requested content to the sound source recognition unit 13.

注視領域情報は、注視領域１００の大きさを示す情報である。再生開始時の注視領域１００の大きさは、映像のフレーム全体の大きさである。保存部１４は、ミキシング調整情報を音声ミキシング部１６へ送る。ミキシング調整情報には、上記のミキシング調整の（ｍｉｘ−４）が記載されている（ステップＳ３）。 The gaze area information is information indicating the size of the gaze area 100. The size of the gaze area 100 at the start of reproduction is the size of the entire frame of the video. The storage unit 14 sends the mixing adjustment information to the audio mixing unit 16. The mixing adjustment information describes (mix-4) of the above-mentioned mixing adjustment (step S3).

The decoding unit 15 decodes all the audio data (the audio data of persons A to D and the audio data of the sound collecting microphones 30 to 34) and outputs the audio to the audio mixing unit 16. The decoding unit 15 decodes the video data and outputs the video to the sound source recognition unit 13 (step S4).
Based on the position specifying information or image identification information together with the gaze area information, the sound source recognition unit 13 identifies the subjects serving as sound sources (persons and sound collecting microphones) and associates the subjects (persons) with the subjects (sound collecting microphones). The sound source recognition unit 13 sends this sound source recognition result to the audio mixing unit 16 and sends the video to the synchronization unit 18 (step S5).

音声ミキシング部１６は、音源認識部１３から音源認識結果を受け取り、この音源認識結果を表示する（ステップＳ６）。
ユーザは、音声ミキシング部１６に表示されている音源認識結果に基づいて、ミキシング操作部１７を介して、音声ミキシングを調整させる。なお、ユーザは、ミキシング操作部１７を介した音声ミキシングの設定を、必ずしも毎回行う必要はない。本実施例では、再生開始時には、ユーザは、音声ミキシングの設定を行わないものとする（ステップＳ７）。 The audio mixing unit 16 receives the sound source recognition result from the sound source recognition unit 13 and displays the sound source recognition result (step S6).
The user adjusts the audio mixing via the mixing operation unit 17 based on the sound source recognition result displayed on the audio mixing unit 16. It should be noted that the user does not necessarily have to set the audio mixing via the mixing operation unit 17 every time. In this embodiment, it is assumed that the user does not set audio mixing at the start of reproduction (step S7).

Based on the mixing adjustment information (the fourth example of audio mixing adjustment, mix-4) received from the storage unit 14 at the start of reproduction, the audio mixing unit 16 applies no audio mixing adjustment, combines all the audio (the audio of persons A to D and the audio of the sound collecting microphones 30 to 34) into a single audio signal, and sends it to the synchronization unit 18 (step S8).
Based on the reproduction times held by the mixed audio and by the video, the synchronization unit 18 synchronizes the audio and video frame by frame and sends them to the display unit 12 (step S9).
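Frame-by-frame synchronization based on reproduction times can be sketched by grouping audio chunks under the video frame whose time span contains them. This is an illustration only: the description does not state the timestamp format or frame rate, so integer millisecond timestamps, a 40 ms frame duration, and the function names are assumptions.

```python
def frame_index(timestamp_ms, frame_duration_ms):
    """Map a reproduction time to the index of the frame that contains it."""
    return timestamp_ms // frame_duration_ms


def synchronize(video_frames, audio_chunks, frame_duration_ms):
    """Pair each video frame with the audio chunks that fall within it.

    video_frames: list of (timestamp_ms, frame) tuples
    audio_chunks: list of (timestamp_ms, chunk) tuples
    """
    by_frame = {}
    for timestamp, chunk in audio_chunks:
        index = frame_index(timestamp, frame_duration_ms)
        by_frame.setdefault(index, []).append(chunk)
    return [
        (frame, by_frame.get(frame_index(timestamp, frame_duration_ms), []))
        for timestamp, frame in video_frames
    ]


# Hypothetical streams: 40 ms video frames and 20 ms audio chunks.
video = [(0, "frame0"), (40, "frame1"), (80, "frame2")]
audio = [(0, "a0"), (20, "a1"), (40, "a2"), (60, "a3"), (80, "a4")]
paired = synchronize(video, audio, 40)
print(paired)
```

Integer timestamps are used here so that the frame index is not affected by floating-point rounding.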

The display unit 12 reproduces the video and audio of the requested content (step S10).
In this example, the user selects, within the video, the gaze area 100 the user wants to view so that it contains the image of person D (one person) shown in FIG. 5. The gaze area 100, which is the area the user wants to view, can be selected freely regardless of position or size. It is selected on the screen of the display unit 12 of the viewing terminal 1 by direct designation with the user's hand or by mouse operation.

The display unit 12 notifies the reproduction control unit 11 of the gaze area information. The gaze area information includes information indicating the position and size of the gaze area 100 newly selected by the user on the screen of the display unit 12 of the viewing terminal 1, and information about the video (step S11).
The reproduction control unit 11 notifies the sound source recognition unit 13 of the gaze area information (step S12).

Based on the gaze area information notified from the reproduction control unit 11 and on the position specifying information or image identification information, the sound source recognition unit 13 identifies the subjects serving as sound sources (persons and sound collecting microphones) and associates the subjects (persons) with the subjects (sound collecting microphones). The sound source recognition unit 13 identifies the subjects each time it newly receives gaze area information from the reproduction control unit 11.

The sound source recognition unit 13 identifies the subject in the gaze area 100 as person D. The sound source recognition unit 13 identifies all the sound collecting microphones as sound collecting microphones 30 to 34. Furthermore, the sound source recognition unit 13 specifies the order of the sound collecting microphones closest to the subject (person D) as sound collecting microphones 32, 34, 31, 33, 30. The sound source recognition unit 13 sends this sound source recognition result to the audio mixing unit 16 (step S13).

音声ミキシング部１６は、音源認識部１３から音源認識結果を受け取り、音源認識結果を表示する（ステップＳ１４）。
ユーザは、音声ミキシング部１６に表示されている音源認識結果に基づいて、ミキシング操作部１７を介して、音声ミキシング調整の第１例ｍｉｘ−１により音声ミキシングのパラメータを設定する。ミキシング操作部１７は、この音声ミキシングの設定結果を、音声ミキシング部１６に通知する（ステップＳ１５）。 The audio mixing unit 16 receives the sound source recognition result from the sound source recognition unit 13 and displays the sound source recognition result (step S14).
Based on the sound source recognition result displayed on the audio mixing unit 16, the user sets the audio mixing parameters using the first example mix-1 of the audio mixing adjustment via the mixing operation unit 17. The mixing operation unit 17 notifies the audio mixing unit 16 of the audio mixing setting result (step S15).

The audio mixing unit 16 executes mixing based on the setting result (the first example of audio mixing adjustment, mix-1) received from the mixing operation unit 17. In this mixing, the audio mixing unit 16 leaves the audio of person D in the gaze area 100 unchanged and lowers the volume of the sound collecting microphone 32 closest to person D. The audio mixing unit 16 combines these two audio signals into one and sends it to the synchronization unit 18 (step S16).

Based on the reproduction times held by the mixed audio and by the video, the synchronization unit 18 sends the audio and video, synchronized frame by frame, to the display unit 12 (step S17).
The display unit 12 reproduces the audio mixed in step S16 together with the video (step S18).

To stop reproduction, the user operates the reproduction operation unit 10 to input information indicating reproduction stop, which is sent to the reproduction control unit 11 (step S19). If the user selects a different gaze area 100 at step S19, the display unit 12 proceeds to step S11.
The reproduction control unit 11 sends information indicating execution stop to the display unit 12, the storage unit 14, and the sound source recognition unit 13. On receiving it, the storage unit 14 forwards the information indicating execution stop in turn to the decoding unit 15, the audio mixing unit 16, the mixing operation unit 17, and the synchronization unit 18. Reproduction thereby stops (step S20).
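The stop sequence of step S20 is a two-level fan-out, which can be sketched as units forwarding a stop notification downstream. The `Unit` class and the synchronous, depth-first delivery are assumptions made for illustration; the description states only which unit notifies which, not the mechanism or relative timing.

```python
class Unit:
    """A processing unit that stops itself, then forwards the stop downstream."""

    def __init__(self, name, downstream=()):
        self.name = name
        self.downstream = list(downstream)
        self.running = True

    def stop(self, log):
        self.running = False
        log.append(self.name)
        for unit in self.downstream:
            unit.stop(log)


# Units the storage unit 14 forwards the stop to, in the order the text lists.
decoding = Unit("decoding unit 15")
mixing = Unit("audio mixing unit 16")
mixing_op = Unit("mixing operation unit 17")
sync = Unit("synchronization unit 18")
storage = Unit("storage unit 14", [decoding, mixing, mixing_op, sync])

display = Unit("display unit 12")
recognition = Unit("sound source recognition unit 13")
# The reproduction control unit 11 notifies these three units directly.
control = Unit("reproduction control unit 11", [display, storage, recognition])

stop_log = []
control.stop(stop_log)
print(stop_log)
```

In this sketch the storage unit 14 finishes forwarding before the sound source recognition unit 13 is notified; that ordering is a side effect of the synchronous calls, not something the description specifies.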

Next, a second example of the audio mixing processing flow (the manual subject tracking flow) will be described. It continues from the first example of the processing flow that executes audio mixing, described above.
The reproduction control unit 11 performs reproduction that follows the movement of person D. The user can move the gaze area 100 on the screen of the display unit 12 of the viewing terminal 1 by direct designation with the hand or by mouse operation. The display unit 12 notifies the reproduction control unit 11 of the gaze area information at regular intervals. This gaze area information includes information indicating the position and size of the gaze area 100 and information about the video (step S11-2).

The reproduction control unit 11 notifies the sound source recognition unit 13 of the gaze area information (step S12-2).
Based on the gaze area information notified from the reproduction control unit 11 and on the position specifying information or image identification information, the sound source recognition unit 13 identifies the subjects serving as sound sources (persons and sound collecting microphones) and associates the subjects (persons) with the subjects (sound collecting microphones). The sound source recognition unit 13 identifies the subjects each time it newly receives gaze area information from the reproduction control unit 11.

音源認識部１３は、図６に示す注視領域１００に被写体を、人物Ｄ、人物Ｂ、人物Ｃと特定する。音源認識部１３は、全ての集音マイクを、集音マイク３０〜３４と特定する。被写体（人物Ｄ）に近い集音マイクの順番を、集音マイク３１、３３、３４、３０、３２の順と特定する。音源認識部１３は、被写体（人物Ｂ）に近い集音マイクの順番を、集音マイク３１、３３、３０、３４、３２の順と特定する。音源認識部１３は、被写体（人物Ｃ）に近い集音マイクの順番を、集音マイク３４、３１、３２、３３、３０の順と特定する。音源認識部１３は、この音源認識結果を音声ミキシング部１６に送る（ステップＳ１３−２）。 The sound source recognition unit 13 identifies the subjects as the person D, the person B, and the person C in the gaze area 100 shown in FIG. The sound source recognition unit 13 identifies all the sound collecting microphones as sound collecting microphones 30 to 34. The order of the sound collecting microphones close to the subject (person D) is specified as the order of the sound collecting microphones 31, 33, 34, 30, 32. The sound source recognition unit 13 specifies the order of the sound collecting microphones close to the subject (person B) as the order of the sound collecting microphones 31, 33, 30, 34, and 32. The sound source recognition unit 13 specifies the order of the sound collecting microphones close to the subject (person C) as the order of the sound collecting microphones 34, 31, 32, 33, and 30. The sound source recognition unit 13 sends the sound source recognition result to the sound mixing unit 16 (step S13-2).

音声ミキシング部１６は、音源認識結果を音源認識部１３から受け取り、この音源認識結果を表示する（ステップＳ１４−２）。
ユーザは、音声ミキシング部１６に表示されている音源認識結果に基づいて、ミキシング操作部１７を介して、音声ミキシング調整の第２例ｍｉｘ−２により、音声ミキシングを調整する。ミキシング操作部１７は、この設定結果を音声ミキシング部１６に通知する（ステップＳ１５−２）。 The audio mixing unit 16 receives the sound source recognition result from the sound source recognition unit 13, and displays the sound source recognition result (step S14-2).
Based on the sound source recognition result displayed on the audio mixing unit 16, the user adjusts the audio mixing by the second example mix-2 of the audio mixing adjustment via the mixing operation unit 17. The mixing operation unit 17 notifies the audio mixing unit 16 of the setting result (step S15-2).

Based on the setting result (mix-2) received from the mixing operation unit 17, the audio mixing unit 16 executes mixing: it leaves the audio of person D in the gaze area 100 unchanged, lowers the volume of person B and person C, and lowers the volume of the sound collecting microphones 31 and 34. The audio mixing unit 16 combines these five audio signals into one and sends it to the synchronization unit 18 (step S16-2).

同期部１８は、ミキシング後の音声と映像が各々に保持している再生時間に基づいて、フレーム単位で同期させた音声と映像を、表示部１２に送る（ステップＳ１７−２）。
再生制御部１１は、ステップＳ１６−２でミキシングさせた音声と映像を再生させる（ステップＳ１８−２）。
再生を停止する場合、ユーザは、再生操作部１０を介して再生停止を示す情報を、再生制御部１１に送る（ステップＳ１９−２）。ユーザが別の領域の注視領域１００を選択する場合、表示部１２は、ステップＳ１１に進む。
再生制御部１１は、表示部１２、保存部１４、音源認識部１３に、実行停止を示す情報を送る。保存部１４は、実行停止を示す情報を受信した場合、復号部１５、音声ミキシング部１６、ミキシング操作部１７、同期部１８に、実行停止を示す情報を順次送る。これにより、再生が停止する（ステップＳ２０−２）。 The synchronization unit 18 sends the audio and video synchronized in units of frames to the display unit 12 based on the reproduction time held in the mixed audio and video (step S17-2).
The playback control unit 11 plays back the audio and video mixed in step S16-2 (step S18-2).
When stopping the reproduction, the user sends information indicating the reproduction stop to the reproduction control unit 11 via the reproduction operation unit 10 (step S19-2). When the user selects another gaze area 100, the display unit 12 proceeds to step S11.
The reproduction control unit 11 sends information indicating execution stop to the display unit 12, the storage unit 14, and the sound source recognition unit 13. When the storage unit 14 receives information indicating execution stop, the storage unit 14 sequentially sends information indicating execution stop to the decoding unit 15, the audio mixing unit 16, the mixing operation unit 17, and the synchronization unit 18. Thereby, the reproduction stops (step S20-2).

Next, a third example of the audio mixing processing flow will be described.
The user selects the area of the video that the user wants to view so that it contains the images of the two persons, person A and person B, shown in FIG. 10. The display unit 12 notifies the reproduction control unit 11 of information indicating the selected gaze area 100 in the video. This information is the gaze area information, which includes information indicating the position and size of the gaze area 100 and information about the video (step S11-3).

再生制御部１１は、注視領域情報を音源認識部１３に通知する（ステップＳ１２−３）。
音源認識部１３は、再生制御部１１から通知された注視領域情報と、位置特定情報又は画像識別情報とに基づいて、音源としての被写体（人物と集音マイク）の特定と、被写体（人物）及び被写体（集音マイク）の関連付けとを実行する。音源認識部１３は、再生制御部から新しく注視領域情報を受け取ったタイミングで、被写体を特定する。 The reproduction control unit 11 notifies the sound source recognition unit 13 of gaze area information (step S12-3).
The sound source recognition unit 13 specifies the subject (person and sound collecting microphone) as the sound source and the subject (person) based on the gaze area information notified from the reproduction control unit 11 and the position specifying information or the image identification information. And associating the subject (sound collecting microphone). The sound source recognizing unit 13 identifies the subject at the timing when the gaze area information is newly received from the reproduction control unit.

The sound source recognition unit 13 identifies the subjects in the gaze area 100 shown in FIG. 10 as person A and person B. The sound source recognition unit 13 identifies all the sound collecting microphones as sound collecting microphones 30 to 34. The sound source recognition unit 13 specifies the order of the sound collecting microphones closest to person A as sound collecting microphones 30, 33, 31, 34, 32, and the order of those closest to person B as sound collecting microphones 31, 33, 30, 34, 32. The sound source recognition unit 13 sends this sound source recognition result to the audio mixing unit 16 (step S13-3).

The audio mixing unit 16 receives the sound source recognition result from the sound source recognition unit 13 and displays it (step S14-3).
Based on the sound source recognition result displayed on the audio mixing unit 16, the user operates the mixing operation unit 17 to adjust the audio mixing according to the third example of audio mixing adjustment (mix-3). The mixing operation unit 17 notifies the audio mixing unit 16 of this setting result (step S15-3).

Based on the setting result (mix-3) received from the mixing operation unit 17, the audio mixing unit 16 executes mixing: it leaves unchanged the volume of the audio of the two persons, person A and person B, in the gaze area 100, and leaves unchanged the volume of the sound collecting microphones 30, 31, and 33. The audio mixing unit 16 combines these five audio signals into one and sends it to the synchronization unit 18 (step S16-3).

Based on the playback times held by the mixed audio and by the video, the synchronization unit 18 synchronizes the audio and the video frame by frame and sends them to the display unit 12 (step S17-3).
The playback control unit 11 causes the audio mixed in step S16-3 to be played back together with the video (step S18-3).
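Frame-by-frame synchronization based on playback times can be sketched as follows. The sketch assumes that both streams carry a playback time in seconds and that the frame rate is known; the `synchronize` function and the data layout are hypothetical, not the patent's implementation.

```python
def synchronize(video_frames, audio_chunks, fps):
    """Pair each video frame with the audio chunk that falls on the same frame index."""
    audio_by_frame = {round(t * fps): chunk for t, chunk in audio_chunks}
    paired = []
    for t, frame in video_frames:
        index = round(t * fps)  # playback time converted to a frame index
        if index in audio_by_frame:
            paired.append((index, frame, audio_by_frame[index]))
    return paired

fps = 30
video_frames = [(0.0, "v0"), (1 / 30, "v1"), (2 / 30, "v2")]
audio_chunks = [(0.0, "a0"), (1 / 30, "a1")]  # no audio yet for the third frame

pairs = synchronize(video_frames, audio_chunks, fps)
print(pairs)
```

Frames without matching audio are simply skipped in this sketch; a real player would more likely buffer or wait.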

To stop playback, the user operates the playback operation unit 10 to input information indicating a playback stop to the playback control unit 11. The playback operation unit 10 sends the information indicating the playback stop to the playback control unit 11 (step S19-3). If the user selects a different gaze area 100, the display unit 12 proceeds to step S11.

The playback control unit 11 sends information indicating an execution stop to the display unit 12, the storage unit 14, and the sound source recognition unit 13. On receiving this information, the storage unit 14 sends information indicating an execution stop, in turn, to the decoding unit 15, the audio mixing unit 16, the mixing operation unit 17, and the synchronization unit 18. Playback thereby stops (step S20-3).

FIG. 14 is a diagram showing the processing flow of a second example of tracking a subject from the start of playback in the embodiment of the present invention. The second example of tracking a subject from the start of playback is a flow in which a subject is selected directly and is then tracked automatically.

The processing executed at the start of playback (steps S1-4 to S10-4) is the same as steps S1 to S10 described for the first example of tracking a subject from the start of playback.
On the screen of the display unit 12 of the viewing terminal 1, the user selects a subject directly by touching one point on person D, the subject, with a hand or through a mouse operation. The display unit 12 sends subject selection information, which indicates the position of the subject touched at that one point in the video on the screen, to the playback control unit 11 (step S11-4).

The playback control unit 11 notifies the sound source recognition unit 13 of the subject selection information (step S12-4).
On receiving the subject selection information, the sound source recognition unit 13 sends automatic gaze area information to the playback control unit 11 (step S13-4).
The playback control unit 11 sends the automatic gaze area information to the display unit 12 (step S14-4).

Based on the subject selection information and the automatic gaze area information, the display unit 12 determines a gaze area 100 centered on the image of the subject (step S15-4).
Based on the subject selection information notified from the playback control unit 11 and on the position specifying information, the sound source recognition unit 13 identifies the subjects serving as sound sources (persons and sound collecting microphones) and associates the subjects (persons) with the subjects (sound collecting microphones). The sound source recognition unit 13 identifies the subjects at the time it receives the subject selection information from the playback control unit 11.
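The determination of a gaze area centered on the selected subject (step S15-4) could be sketched as below. The rectangle size, the pixel coordinate system, and the clamping to the frame edges are assumptions made for illustration; the patent only states that the area is centered on the image of the subject.

```python
def gaze_area_around(point, area_size, frame_size):
    """Return (left, top, width, height) of a rectangle centered on point.

    The rectangle is shifted if needed so that it stays inside the frame.
    This assumes the rectangle is no larger than the frame.
    """
    (px, py), (width, height), (frame_w, frame_h) = point, area_size, frame_size
    left = min(max(px - width // 2, 0), frame_w - width)
    top = min(max(py - height // 2, 0), frame_h - height)
    return (left, top, width, height)

# Hypothetical touch position near the left edge of a 1920x1080 frame.
area = gaze_area_around((100, 100), (200, 100), (1920, 1080))
print(area)
```

A touch in the middle of the frame yields a rectangle truly centered on the touched point; near an edge, the rectangle is moved inward rather than cropped.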

The sound source recognition unit 13 identifies the subject as person D (see FIG. 7). It identifies all of the sound collecting microphones as sound collecting microphones 30 to 34. It further determines that the sound collecting microphones closest to the subject (person D) are, in order, sound collecting microphones 32, 34, 31, 33, and 30. The sound source recognition unit 13 sends this sound source recognition result to the audio mixing unit 16 (step S16-4).

The audio mixing unit 16 receives the sound source recognition result from the sound source recognition unit 13 and displays it (step S17-4).
Based on the sound source recognition result displayed on the audio mixing unit 16, the user adjusts the audio mixing through the mixing operation unit 17 according to the fifth example of audio mixing adjustment (mix-5). The mixing operation unit 17 notifies the audio mixing unit 16 of this setting result (step S18-4).

Based on the setting result (mix-5) received from the mixing operation unit 17, the audio mixing unit 16 executes mixing in which the voice of person D in the gaze area 100 is left unchanged and the volume of sound collecting microphone 32, the microphone closest to person D, is lowered. The audio mixing unit 16 combines these two audio signals into a single audio signal and sends it to the synchronization unit 18 (step S19-4).
Based on the playback times held by the mixed audio and by the video, the synchronization unit 18 synchronizes the audio and the video frame by frame and sends them to the display unit 12 (step S20-4).

The playback control unit 11 causes the audio mixed in step S19-4 to be played back together with the video. When the user designates a subject by touching one point on the screen of the display unit 12 on which the video is displayed, the playback control unit 11 executes tracking playback of the subject (see FIG. 8).
The display unit 12 notifies the playback control unit 11 of information indicating the gaze area 100 that follows the movement of person D. The display unit 12 sends this information to the playback control unit 11 at regular intervals. The gaze area information includes information indicating the position and size of the gaze area 100 and information about the video (step S21-4).
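The periodic notification of a gaze area that follows a moving subject can be sketched as follows. The `GazeAreaInfo` fields, the millisecond time base, and the track of person D are hypothetical; `frame_time_ms` stands in for the unspecified "information about the video".

```python
from dataclasses import dataclass

@dataclass
class GazeAreaInfo:
    frame_time_ms: int  # stand-in for the information about the video
    left: int
    top: int
    width: int
    height: int

def gaze_area_notifications(track, area_size, interval_ms):
    """Emit one GazeAreaInfo per interval, centered on the tracked position."""
    width, height = area_size
    notifications = []
    next_due = None
    for time_ms, (x, y) in track:
        if next_due is None or time_ms >= next_due:
            notifications.append(
                GazeAreaInfo(time_ms, x - width // 2, y - height // 2, width, height)
            )
            next_due = time_ms + interval_ms
    return notifications

# Hypothetical positions of person D sampled every 50 ms.
track_of_person_d = [
    (0, (100, 100)), (50, (110, 100)), (100, (120, 100)),
    (150, (130, 100)), (200, (140, 100)),
]
notes = gaze_area_notifications(track_of_person_d, (80, 60), 100)
print([n.frame_time_ms for n in notes])
```

Each emitted record would then be forwarded to the sound source recognition step, which re-identifies the subjects whenever new gaze area information arrives.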

The playback control unit 11 notifies the sound source recognition unit 13 of the gaze area information (step S22-4).
Based on the gaze area information notified from the playback control unit 11 and on the position specifying information or the image identification information, the sound source recognition unit 13 identifies the subjects serving as sound sources (persons and sound collecting microphones) and associates the subjects (persons) with the subjects (sound collecting microphones). The sound source recognition unit 13 identifies the subjects each time it receives new gaze area information from the playback control unit 11.

The sound source recognition unit 13 identifies the subjects in the gaze area 100 shown in FIG. 8 as person D, person B, and person C. It identifies all of the sound collecting microphones as sound collecting microphones 30 to 34. It determines that the sound collecting microphones closest to the subject (person D) are, in order, sound collecting microphones 31, 33, 34, 30, and 32; that those closest to the subject (person B) are, in order, sound collecting microphones 31, 33, 30, 34, and 32; and that those closest to the subject (person C) are, in order, sound collecting microphones 34, 31, 33, 32, and 30. The sound source recognition unit 13 sends this sound source recognition result to the audio mixing unit 16 (step S23-4).
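Identifying which persons fall inside the gaze area can be sketched as a point-in-rectangle test. The sketch assumes that the position specifying information gives one 2D point per person in the same coordinate system as the gaze area; the function name and all coordinates are hypothetical.

```python
def subjects_in_gaze_area(area, person_positions):
    """Return the names of persons whose position lies inside the gaze area."""
    left, top, width, height = area
    return [
        name
        for name, (x, y) in person_positions.items()
        if left <= x < left + width and top <= y < top + height
    ]

# Hypothetical gaze area (left, top, width, height) and person positions.
gaze_area = (100, 100, 400, 300)
person_positions = {
    "A": (50, 200),   # outside the gaze area
    "B": (200, 200),
    "C": (450, 350),
    "D": (300, 150),
}

inside = subjects_in_gaze_area(gaze_area, person_positions)
print(inside)
```

With these made-up positions, persons B, C, and D are inside the area and person A is not, which mirrors the recognition result described for FIG. 8.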

The audio mixing unit 16 receives the sound source recognition result from the sound source recognition unit 13 and displays it (step S24-4).
The user does not necessarily have to set the audio mixing through the mixing operation unit 17 every time. The audio mixing unit 16 may execute audio mixing based on the previous condition (the fifth example of audio mixing adjustment, mix-5). The mixing operation unit 17 notifies the audio mixing unit 16 of the previous condition (mix-5) (step S25-4).
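Reusing the previous mixing condition when the user makes no new setting can be sketched with a small stateful helper. The class name `MixingOperation`, the method `notify`, and the use of plain strings for the presets are hypothetical.

```python
class MixingOperation:
    """Remembers the last mixing preset and reports it when no new one is set."""

    def __init__(self):
        self._last_preset = None

    def notify(self, new_preset=None):
        # A new setting replaces the stored one; otherwise the previous one is reused.
        if new_preset is not None:
            self._last_preset = new_preset
        return self._last_preset

operation = MixingOperation()
first = operation.notify("mix-5")  # user sets mix-5 explicitly
second = operation.notify()        # gaze area moved, user set nothing
print(first, second)
```

The second call returns the preset chosen earlier, so mixing can continue under the same condition while the gaze area follows the subject.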

Based on the previous condition (mix-5) received from the mixing operation unit 17, the audio mixing unit 16 executes mixing in which the voice of person D in the gaze area 100 is left unchanged, the volume of person B and person C is lowered, and the volume of sound collecting microphones 31 and 34 is lowered. The audio mixing unit 16 combines these five audio signals into a single audio signal and sends it to the synchronization unit 18 (step S26-4).

Based on the playback times held by the mixed audio and by the video, the synchronization unit 18 synchronizes the audio and the video frame by frame and sends them to the display unit 12 (step S27-4).
The playback control unit 11 causes the audio mixed in step S26-4 to be played back together with the video (step S28-4).
To stop playback, the user inputs information indicating a playback stop to the playback operation unit 10. The playback operation unit 10 sends the information indicating the playback stop to the playback control unit 11 (step S29-4). If the user selects a different gaze area 100, the display unit 12 proceeds to step S11.

The playback control unit 11 sends information indicating an execution stop to the display unit 12, the storage unit 14, and the sound source recognition unit 13. On receiving this information, the storage unit 14 sends information indicating an execution stop, in turn, to the decoding unit 15, the audio mixing unit 16, the mixing operation unit 17, and the synchronization unit 18. Playback thereby stops (step S30-4).

FIG. 15 is a diagram showing the processing flow of a third example of tracking a subject from the start of playback in the embodiment of the present invention. The third example of tracking a subject from the start of playback is a flow in which a subject is selected directly and is then tracked automatically.

The processing executed at the start of playback (steps S1-5 to S10-5) is the same as steps S1 to S10 described for the first example of tracking a subject from the start of playback.
On the screen of the display unit 12 of the viewing terminal 1, the user selects a subject directly by touching one point on person D, the subject, with a hand or through a mouse operation. The display unit 12 sends subject selection information, which indicates the position of the subject touched at that one point in the video on the screen, to the playback control unit 11 (step S11-5).

The playback control unit 11 notifies the sound source recognition unit 13 of the subject selection information (step S12-5).
Based on the subject selection information notified from the playback control unit 11 and on the position specifying information, the sound source recognition unit 13 identifies the subjects serving as sound sources (persons and sound collecting microphones) and associates the subjects (persons) with the subjects (sound collecting microphones). The sound source recognition unit 13 identifies the subjects each time it receives new subject selection information from the playback control unit 11.

The sound source recognition unit 13 identifies the directly selected subject shown in FIG. 11 as person D. It identifies all of the sound collecting microphones as sound collecting microphones 30 to 34. It determines that the sound collecting microphones closest to the subject (person D) are, in order, sound collecting microphones 32, 34, 31, 33, and 30. The sound source recognition unit 13 sends this sound source recognition result to the audio mixing unit 16 (step S13-5).

The audio mixing unit 16 receives the sound source recognition result from the sound source recognition unit 13 and displays it (step S14-5).
Based on the sound source recognition result displayed on the audio mixing unit 16, the user adjusts the audio mixing through the mixing operation unit 17 according to the sixth example of audio mixing adjustment (mix-6). The mixing operation unit 17 notifies the audio mixing unit 16 of this setting result (step S15-5).

Based on the setting result (mix-6) received from the mixing operation unit 17, the audio mixing unit 16 executes mixing in which the voice of person D is left unchanged and the volume of all of the sound collecting microphones is lowered. The audio mixing unit 16 combines these six audio signals into a single audio signal and sends it to the synchronization unit 18 (step S16-5).
Based on the playback times held by the mixed audio and by the video, the synchronization unit 18 synchronizes the audio and the video frame by frame and sends them to the display unit 12 (step S17-5).

The playback control unit 11 causes the audio mixed in step S16-5 to be played back together with the video. When the user designates a subject by touching one point in the video on the screen of the display unit 12, the playback control unit 11 executes tracking playback of the subject automatically, that is, without any explicit instruction from the user. FIG. 12 shows this automatic tracking playback.
The display unit 12 notifies the playback control unit 11 of subject selection information that follows the movement of person D. The display unit 12 sends the subject selection information to the playback control unit 11 at regular intervals (step S18-5).

The playback control unit 11 notifies the sound source recognition unit 13 of the subject selection information (step S19-5).
Based on the subject selection information notified from the playback control unit 11 and on the position specifying information, the sound source recognition unit 13 identifies the subjects serving as sound sources (persons and sound collecting microphones) and associates the subjects (persons) with the subjects (sound collecting microphones). The sound source recognition unit 13 identifies the subjects each time it receives new subject selection information from the playback control unit 11.

The sound source recognition unit 13 identifies all of the sound collecting microphones shown in FIG. 12 as sound collecting microphones 30 to 34. It further determines that the sound collecting microphones closest to the moving subject (person D) are, in order, sound collecting microphones 31, 33, 34, 30, and 32. It determines that the persons closest to person D, who is in the middle of the video, are, in order, person B, person C, and person A. The sound source recognition unit 13 sends this sound source recognition result to the audio mixing unit 16 (step S20-5).

The audio mixing unit 16 receives the sound source recognition result from the sound source recognition unit 13 and displays it (step S21-5).
The user does not necessarily have to set the audio mixing through the mixing operation unit 17 every time. The audio mixing unit 16 may execute audio mixing based on the previous condition (the sixth example of audio mixing adjustment, mix-6). The mixing operation unit 17 notifies the audio mixing unit 16 of the previous condition (mix-6) (step S22-5).

Based on the previous condition (mix-6) received from the mixing operation unit 17, the audio mixing unit 16 executes mixing in which the voice of person D is left unchanged, the volume of person B and person C is lowered, and the volume of all of the sound collecting microphones is lowered. The audio mixing unit 16 combines these eight audio signals into a single audio signal and sends it to the synchronization unit 18 (step S23-5).
Based on the playback times held by the mixed audio and by the video, the synchronization unit 18 synchronizes the audio and the video frame by frame and sends them to the display unit 12 (step S24-5).

The playback control unit 11 causes the audio mixed in step S23-5 to be played back together with the video (step S25-5).
To stop playback, the user inputs information indicating a playback stop to the playback operation unit 10. The playback operation unit 10 sends the information indicating the playback stop to the playback control unit 11 (step S26-5). If the user selects a different gaze area 100, the display unit 12 proceeds to step S11.

The playback control unit 11 sends information indicating an execution stop to the display unit 12, the storage unit 14, and the sound source recognition unit 13. On receiving this information, the storage unit 14 sends information indicating an execution stop, in turn, to the decoding unit 15, the audio mixing unit 16, the mixing operation unit 17, and the synchronization unit 18. Playback thereby stops (step S27-5).

As described above, the viewing method according to the present embodiment is a viewing method performed in the viewing terminal 1 and includes: a step of displaying video synchronized with audio and acquiring information indicating a gaze area 100 or a position included in the video; a step of identifying a subject (for example, a person) in the video based on the information indicating the gaze area 100 or position included in the video; and a step of mixing the audio based on the identified subject.

The viewing terminal 1 according to the present embodiment includes: a display unit 12 that displays video synchronized with audio and acquires information indicating a selected region or position included in the video; a sound source recognition unit 13 that identifies a subject (for example, a person) in the video based on the information indicating the gaze area 100 or position included in the video; and an audio mixing unit 16 that mixes the audio based on the identified subject.

The viewing program according to the present embodiment causes a computer to execute: a procedure of displaying video synchronized with audio and acquiring information indicating a selected region or position included in the video; a procedure of identifying a subject in the video based on the information indicating the selected region or position included in the video; and a procedure of mixing the audio based on the identified subject.

With this configuration, the audio mixing unit 16 mixes the audio based on the identified subject. The viewing method, the viewing terminal 1, and the viewing program according to the present embodiment thereby enable viewing that suits the user's preference and reflects the user's intention.

The viewing method, the viewing terminal 1, and the viewing program according to the present embodiment mix, according to the user's preference, the voice of a person whom the user has selected from the entire video frame with the audio of sound collecting microphones that are chosen based on the result of associating the microphones with the selected person in order of proximity. The viewing method, the viewing terminal 1, and the viewing program according to the present embodiment thereby enable viewing that emphasizes the person the user prefers and reflects the user's intention.

The viewing method, the viewing terminal 1, and the viewing program according to the present embodiment mix, according to the user's preference, the voice of a person whom the user has selected from the entire video frame with the audio of a sound collecting microphone that the user has selected from the entire video frame. The viewing method, the viewing terminal 1, and the viewing program according to the present embodiment thereby enable viewing that emphasizes the person the user prefers and reflects the user's intention.

The viewing method, the viewing terminal 1, and the viewing program according to the present embodiment select, according to the user's preference, the voice of a moving person and the voices of persons or the audio of sound collecting microphones that follow the moving person, and mix the selected audio. The viewing method, the viewing terminal 1, and the viewing program according to the present embodiment thereby enable viewing that emphasizes the person the user prefers and reflects the user's intention.

The viewing method, the viewing terminal 1, and the viewing program according to the present embodiment select a subject (a sound collecting microphone) from the entire video frame and mix the audio of the selected subject (sound collecting microphone) according to the user's preference. The viewing method, the viewing terminal 1, and the viewing program according to the present embodiment thereby enable viewing that emphasizes the person the user prefers and reflects the user's intention.

The viewing method, the viewing terminal 1, and the viewing program according to the present embodiment individually detect the persons and the sound collecting microphones that serve as sound sources in the video, based on the image identification information or the position specifying information. Furthermore, based on the position specifying information, they associate the person whom the user has selected from the entire video frame with the sound collecting microphones in order of proximity. They select sound collecting microphones based on this result and mix the voice of the selected person with the audio of the selected sound collecting microphones. The viewing method, the viewing terminal 1, and the viewing program according to the present embodiment can thereby realize video viewing that matches the user's preference.

When a person serving as a sound source moves, the viewing method, the viewing terminal 1, and the viewing program according to the present embodiment associate the other persons with the moving person in order of proximity, based on the position specifying information, so that the surrounding persons serving as sound sources follow the moving person. Based on the result of this proximity-ordered association, they select the surrounding persons whose voices are to be mixed.

When a person serving as a sound source moves, the viewing method, the viewing terminal 1, and the viewing program according to the present embodiment associate the sound collecting microphones with the moving person in order of proximity, based on the position specifying information, so that the surrounding sound collecting microphones serving as sound sources follow the moving person. They mix the voice of the moving person with the audio of the sound collecting microphones selected based on the result of this proximity-ordered association.

In the viewing method according to the present embodiment, in the step of identifying a subject (for example, a person or a sound collecting microphone) in the video, the sound source recognition unit 13 acquires, from the storage unit 14, position specifying information indicating the position of the subject or image identification information for identifying the subject in the video, and identifies the subject in the video based on the acquired position specifying information or image identification information.

In the viewing method according to the present embodiment, in the step of identifying a subject (for example, a person or a sound collecting microphone) in the video, the sound source recognition unit 13 identifies the subject in the video based on position specifying information indicating the position of the subject and on subject selection information indicating the position of the selected subject in the video.

In the viewing method according to the present embodiment, in the step of identifying a subject (for example, a person or a sound collecting microphone) in the video, the sound source recognition unit 13 associates the identified subject (for example, a person) with another subject (for example, a sound collecting microphone) in the video based on the position specifying information.

In the viewing method according to the present embodiment, in the step of mixing audio, the audio mixing unit 16 mixes the audio of the specified subject with the audio of another subject associated with the specified subject.

In the viewing method according to the present embodiment, in the step of mixing audio, the audio mixing unit 16 mixes the audio of the specified moving subject with the audio of another subject associated with the specified moving subject (for example, see FIGS. 6, 7, 8, and 12).

In the viewing method according to the present embodiment, in the step of mixing audio, the audio mixing unit 16 mixes the audio of the specified subject with the audio of another subject (for example, a wireless microphone or a sound collecting microphone) specified based on a predetermined condition.
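
The mixing step can be pictured with the short Python sketch below. It is an assumption-laden illustration: each audio track is taken to be a list of floating-point samples in the range -1.0 to 1.0 at a common sampling rate, mixing is modeled as a per-track gain followed by summation, and the sum is clipped to the valid range. The function name `mix_tracks`, the gain values, and the track labels are hypothetical and are not taken from the embodiment.

```python
def mix_tracks(tracks, gains):
    """Mix several audio tracks into one by weighted summation.

    tracks: dict mapping a source label to a list of samples in [-1.0, 1.0].
    gains: dict mapping a source label to its gain; missing labels are muted.
    Returns a list as long as the longest track, clipped to [-1.0, 1.0].
    """
    length = max((len(samples) for samples in tracks.values()), default=0)
    mixed = [0.0] * length
    for label, samples in tracks.items():
        gain = gains.get(label, 0.0)
        for i, sample in enumerate(samples):
            mixed[i] += gain * sample
    # Clip so that the summed signal stays within the valid sample range.
    return [max(-1.0, min(1.0, value)) for value in mixed]

# Audio of the specified subject (person A) at full gain, mixed with an
# associated sound collecting microphone at half gain; "mic32" is muted.
tracks = {
    "A": [0.5, 0.5],
    "mic31": [0.2, -0.2, 0.4],
    "mic32": [0.9, 0.9, 0.9],
}
mixed = mix_tracks(tracks, {"A": 1.0, "mic31": 0.5})
```

Giving the specified subject a higher gain than the associated subjects is only one possible policy; the gains could also be set by the user through a mixing operation.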

The viewing terminal 1 in the embodiment described above may be realized by a computer. In that case, a program for realizing this function may be recorded on a computer-readable recording medium, and the program recorded on the recording medium may be read into a computer system and executed. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. Furthermore, the "computer-readable recording medium" may also include a medium that dynamically holds the program for a short time, such as a communication line used when the program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as a volatile memory inside a computer system serving as a server or a client in that case. The program may realize only a part of the functions described above, may realize those functions in combination with a program already recorded in the computer system, or may be realized using a programmable logic device such as an FPGA (Field Programmable Gate Array).

An embodiment of the present invention has been described above in detail with reference to the drawings. However, the specific configuration is not limited to this embodiment, and designs and the like within a scope that does not depart from the gist of the present invention are also included.

DESCRIPTION OF SYMBOLS: 1 ... viewing terminal, 10 ... reproduction operation unit, 11 ... reproduction control unit, 12 ... display unit, 13 ... sound source recognition unit, 14 ... storage unit, 15 ... decoding unit, 16 ... audio mixing unit, 17 ... mixing operation unit, 18 ... synchronization unit, 20 ... camera, 30 ... sound collecting microphone, 31 ... sound collecting microphone, 32 ... sound collecting microphone, 33 ... sound collecting microphone, 34 ... sound collecting microphone, 40 ... position sensor, 41 ... position sensor, 42 ... position sensor, 43 ... position sensor, 44 ... position sensor, 50 ... position sensor, 51 ... position sensor, 52 ... position sensor, 53 ... position sensor, 60 ... wireless microphone, 61 ... wireless microphone, 62 ... wireless microphone, 63 ... wireless microphone, 70 ... content, 71 ... content, 80 ... stage, 100 ... gaze area, A ... person, B ... person, C ... person, D ... person

Claims

1. A viewing method in a viewing terminal, comprising:
a step of displaying a video synchronized with audio and acquiring information indicating a selected region or position included in the video;
a step of specifying a subject in the video based on the information indicating the selected region or position included in the video; and
a step of mixing the audio based on the specified subject.

2. The viewing method according to claim 1, wherein in the step of specifying a subject in the video, position specifying information indicating the position of the subject or image identification information for identifying the subject in the video is acquired from a storage unit, and the subject in the video is specified based on the acquired position specifying information or image identification information.

3. The viewing method according to claim 1, wherein in the step of specifying a subject in the video, the subject in the video is specified based on position specifying information indicating the position of the subject and subject selection information indicating the position of the selected subject in the video.

4. The viewing method according to claim 3, wherein in the step of specifying a subject in the video, the specified subject and another subject in the video are associated based on the position specifying information.

5. The viewing method according to claim 4, wherein in the step of mixing the audio, the audio of the specified subject and the audio of the other subject associated with the specified subject are mixed.

6. The viewing method according to claim 5, wherein in the step of mixing the audio, the audio of the specified moving subject is mixed with the audio of the other subject associated with the specified moving subject.

7. The viewing method according to claim 1, wherein in the step of mixing the audio, the audio of the specified subject is mixed with the audio of another subject specified based on a predetermined condition.

8. A viewing terminal comprising:
a display unit that displays a video synchronized with audio and acquires information indicating a selected region or position included in the video;
a recognition unit that specifies a subject in the video based on the information indicating the selected region or position included in the video; and
an audio mixing unit that mixes the audio based on the specified subject.

9. A viewing program for causing a computer to execute the viewing method according to any one of claims 1 to 7.