JP2012027340A

JP2012027340A - Karaoke apparatus and method of outputting still picture of karaoke singer

Info

Publication number: JP2012027340A
Application number: JP2010167541A
Authority: JP
Inventors: Katsumi Toda; 勝巳戸田
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2010-07-26
Filing date: 2010-07-26
Publication date: 2012-02-09
Anticipated expiration: 2030-07-26
Also published as: JP5201540B2

Abstract

PROBLEM TO BE SOLVED: To improve entertainment value by reliably extracting a still picture at the moment when the users of a karaoke apparatus are most excited. SOLUTION: A karaoke apparatus 10 includes: a speaker 108 which reproduces music data for a user A to sing to; a display section 109 capable of displaying picture data as the music data is reproduced by music reproduction means; and at least one omnidirectional camera 400 which captures a view of a prescribed range including the user A and the other users B and C, and generates picture data of the prescribed range. The karaoke apparatus calculates, from the picture data of the prescribed range generated by the omnidirectional camera 400, the number of other users B and C whose faces are turned toward the user A, and determines the timing of the excitement of the users A, B and C from the change over time in the calculated number.

Description

The present invention relates to a karaoke apparatus that provides a playback service for karaoke songs, and to a method of outputting a still image of a karaoke singer on such a karaoke apparatus.

For karaoke apparatuses, various accompanying services have already been proposed in addition to the playback of the karaoke songs themselves. For example, a karaoke apparatus that shoots the singer with a single video camera and shows the picture on a display has already been proposed (see, for example, Patent Document 1). This conventional karaoke apparatus includes a single video camera, a servo pan head for freely changing the orientation of the video camera within a predetermined range, and drive control means that receives a radio beacon signal from the wireless microphone held by the singer and drives the servo pan head so that the video camera points at the position of the source of that radio beacon signal.

In recent years, to improve entertainment value, services have also been offered in which the video of the singing shot as described above is recorded and uploaded to a server, so that the singer and other users can view and enjoy the singing video on a personal computer or mobile terminal. In such services, on the web page for selecting the video to watch, a still image of one scene from each video is normally extracted and displayed as a thumbnail.

The still image selected as the thumbnail from an uploaded video serves as the representative image of the video's content, so it should ideally be a still image from the moment at which the group of users, including the singer, is most excited. As techniques for selecting and extracting a particular still image from a video, the methods described in Patent Document 2 and Patent Document 3, for example, have already been proposed. These conventional methods treat a portion where the sound volume is high as the climax timing and extract the still image at that climax timing.

JP-A-10-240276; Japanese Patent No. 4198331; Japanese Patent No. 4435130

However, if the techniques described in Patent Documents 2 and 3 are applied to selecting a still image from such a karaoke video, the climax timing is judged by sound volume, so a moment in the recorded video when the viewers are chatting loudly without listening to the singer may also be judged to be a climax timing. It has therefore been difficult to determine the climax timing of the users of the karaoke apparatus accurately and to reliably obtain the best still image at that timing.

An object of the present invention is to provide a karaoke apparatus and a method of outputting a still image of a karaoke singer that can improve entertainment value by reliably extracting a still image at the climax timing of the users of the karaoke apparatus.

To achieve the above object, the first invention is a karaoke apparatus that provides a playback service for karaoke songs using music data and video data, characterized by comprising: music reproduction means for reproducing the music data for a singer to sing to; display means capable of displaying the video data as the music data is reproduced by the music reproduction means; at least one video camera that captures a field of view of a predetermined range including the singer and viewers other than the singer, and generates video data of the predetermined range; calculation means for calculating, from the video data of the predetermined range generated by the video camera, the number of viewers whose faces are turned toward the singer; and determination means for determining the climax timing of the group including the singer and the viewers from the change over time in the number calculated by the calculation means.

In the first invention, the excitement of the group consisting of the singer and the viewers is detected in order to improve entertainment value. First, at least one video camera captures a field of view of a predetermined range including the singer and the viewers, and generates video data of that range. Then, on the basis of the generated video data, the calculation means calculates the number of viewers whose faces are turned toward the singer.

The group is excited when the viewers are genuinely enjoying the singer's performance, and at such a moment the viewers' eyes are normally on the singer. In the first invention, therefore, the determination means determines the climax timing from the change over time in the number of viewers calculated by the calculation means. The moment at which the largest number of viewers have their faces turned toward the singer can thus be determined to be the climax timing of the group. As a result, various services become possible: for example, the still image at the climax timing can be extracted from the generated video data and uploaded to a server as the representative image for the service provision period, or shown on the display means of the karaoke apparatus as the representative image. This further improves the entertainment value of the karaoke apparatus.
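The determination step described above can be sketched in a few lines. This is a minimal illustration only, assuming the per-sample counts of viewers facing the singer are already available as a list; the function name `find_climax_index` and the tie-breaking rule (earliest sample wins) are assumptions, since the text states only that the moment with the largest count is taken as the climax timing.

```python
from typing import Sequence


def find_climax_index(facing_counts: Sequence[int]) -> int:
    """Return the index of the sample with the most viewers facing the singer.

    facing_counts[i] is the number of viewers whose faces were turned
    toward the singer at sample i (hypothetical input format).
    """
    if not facing_counts:
        raise ValueError("facing_counts must not be empty")
    # max() keeps the first maximal element, so ties go to the earliest sample.
    return max(range(len(facing_counts)), key=lambda i: facing_counts[i])


# Example time series: two viewers, sampled six times.
counts = [0, 1, 2, 2, 1, 0]
climax = find_climax_index(counts)
```

The frame recorded at the returned index would then be the candidate for the representative still image.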

The second invention is characterized in that, in the first invention, the apparatus further comprises a microphone that is held by the singer and inputs the audio signal of the singer's karaoke singing, and marker signal generating means that is provided on the microphone and generates a marker signal; a single video camera is provided, which captures the field of view of the predetermined range, including the microphone and the singer, over the full circumference around itself, and generates video data of the predetermined range that includes the marker signal generated by the marker signal generating means; and the calculation means identifies the positions of the microphone and the singer on the basis of the marker signal included in the video data of the predetermined range generated by the single video camera, and determines the orientation of the viewers' faces by applying predetermined face recognition processing to the video data of the part of the predetermined range other than the singer's position, thereby calculating the number of viewers whose faces are turned toward the singer.

In the second invention, during a karaoke performance the music reproduction means reproduces the music data and the display means displays the video data, and the singer sings into the microphone in step with that reproduction and display. Only one video camera is provided. While the singer sings, that one video camera captures a field of view of a predetermined range including the singer and the microphone, and generates video data of that range. By using a video camera with a wide field of view that can capture the full circumference around itself (for example, a camera with a fisheye lens capable of 360° capture), the singer and all the viewers are always included in the video data generated by the video camera.

The microphone is provided with marker signal generating means for generating a marker signal. The generated video data of the predetermined range therefore always contains, together with the figure of the singer, the marker signal corresponding to the position of the microphone held by the singer. Accordingly, the calculation means identifies the positions of the microphone and the singer using the marker signal contained in the video data, and determines the orientation of the viewers' faces by applying face recognition processing to the video data outside the singer's position. The number of viewers whose faces are turned toward the singer can thus be reliably calculated without using multiple cameras.
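The counting step can be illustrated with simple plane geometry. The sketch below assumes that an upstream face recognition stage has already produced, for each viewer, a position and a facing direction in image coordinates, and that the singer's position has been obtained from the marker signal. The `Viewer` class, the `count_facing` function, and the 30 degree tolerance are all hypothetical; the text does not specify how "facing the singer" is decided.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Viewer:
    x: float           # viewer position in image coordinates
    y: float
    facing_deg: float  # direction the face points, in degrees from the +x axis


def angle_diff_deg(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in the range 0..180."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def count_facing(viewers: Iterable[Viewer],
                 singer_xy: Tuple[float, float],
                 tolerance_deg: float = 30.0) -> int:
    """Count viewers whose facing direction is within tolerance of the bearing to the singer."""
    sx, sy = singer_xy
    total = 0
    for v in viewers:
        # Bearing from the viewer to the singer, in the same angular convention.
        bearing = math.degrees(math.atan2(sy - v.y, sx - v.x))
        if angle_diff_deg(v.facing_deg, bearing) <= tolerance_deg:
            total += 1
    return total


viewers = [Viewer(0.0, 0.0, 0.0), Viewer(0.0, 1.0, 180.0)]
n_facing = count_facing(viewers, singer_xy=(1.0, 0.0))
```

In this example the first viewer looks straight at the singer and is counted; the second looks away and is not.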

The third invention is characterized in that, in the second invention, the apparatus further comprises: video processing means that cuts out, from the video data of the predetermined range, at least one partial still image including the singer at the climax timing determined by the determination means, in the shape of a sector subtending a minor arc about the center of the full circumference, and performs correction processing that straightens the minor arc of the cut-out sector so that the sector becomes a quadrilateral, thereby producing one still image; and still image output means for outputting the still image to a server connected to the karaoke apparatus via a network.

When a still image at the climax timing determined by the determination means is generated, a still image taken from the video data of a wide-field video camera such as the one described above may be increasingly distorted toward the edge of the field of view. In the third invention, therefore, the video processing means cuts out from the video data of the predetermined range a partial still image including the singer at the climax timing, in the shape of a sector subtending a minor arc about the center of the full circumference, and applies predetermined correction processing to the cut-out partial still image. The correction processing straightens the minor arc of the cut-out sector so that the sector becomes a quadrilateral. The still image output means can thus output to the server, as the representative image, a normal still image in which the distortion has been corrected. As a result, each user who accesses the server can view and enjoy the still image.
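One common way to turn a sector of a circular image into a rectangle is polar resampling, sketched below in plain Python. This is an illustrative stand-in, not the correction processing of the invention: the text says only that the arc of the sector is straightened. The function `unwarp_sector`, its parameters, the use of an annular sector (inner and outer radius), nearest-neighbour sampling, and the choice to place the outer radius at the top of the output are all assumptions.

```python
import math


def unwarp_sector(image, center, r_inner, r_outer, start_deg, end_deg, out_w, out_h):
    """Resample a sector of `image` into a grid of out_h rows by out_w columns.

    image is a list of rows of pixels; center is (cx, cy) in pixel coordinates.
    Output columns sweep the angle from start_deg to end_deg; output rows run
    from r_outer (top row) to r_inner (bottom row).
    """
    cx, cy = center
    height, width = len(image), len(image[0])
    out = []
    for row in range(out_h):
        frac_r = row / (out_h - 1) if out_h > 1 else 0.0
        radius = r_outer + (r_inner - r_outer) * frac_r
        line = []
        for col in range(out_w):
            frac_a = col / (out_w - 1) if out_w > 1 else 0.0
            theta = math.radians(start_deg + (end_deg - start_deg) * frac_a)
            # Nearest-neighbour lookup of the source pixel on the circle.
            x = int(round(cx + radius * math.cos(theta)))
            y = int(round(cy + radius * math.sin(theta)))
            inside = 0 <= x < width and 0 <= y < height
            line.append(image[y][x] if inside else None)
        out.append(line)
    return out


# Synthetic 11x11 image whose pixel value is its own (x, y) coordinate.
source = [[(x, y) for x in range(11)] for y in range(11)]
patch = unwarp_sector(source, (5, 5), r_inner=2, r_outer=4,
                      start_deg=0, end_deg=90, out_w=2, out_h=2)
```

Each arc of constant radius in the source becomes one straight row in the output, which is the sense in which the sector's arc is "corrected to a straight line". A production implementation would normally interpolate between pixels rather than take the nearest one.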

The fourth invention is characterized in that, in the third invention, the apparatus further comprises judging means that applies predetermined face recognition processing to the at least one still image corrected by the video processing means and judges whether the singer's face can be recognized, and the still image output means outputs to the server, from among the at least one still image corrected by the video processing means, a still image in which the judging means has judged that the singer's face can be recognized.

At a climax timing, when the group of the singer and the viewers is excited, the singer may happen to be looking down or to the side. A still image taken at such a moment does not record the singer's face properly and is therefore not necessarily suitable as the representative image. In the fourth invention, therefore, the judging means judges, for each of the at least one still image after correction processing, whether the singer's face can be recognized. The still image output means then outputs to the server a still image in which the singer's face has been judged recognizable. Only still images in which the singer's face is properly recorded are thus reliably output to the server as the representative image.

The fifth invention is characterized in that, in the fourth invention, the apparatus further comprises singer ID input means for inputting a singer ID from outside; the video processing means further cuts out, from the video data of the predetermined range, partial video data including the identified position of the singer and applies predetermined correction processing to it; and the still image output means associates the singer ID input via the singer ID input means, the still image in which the judging means has judged that the singer's face can be recognized, and the partial video data corrected by the video processing means with one another, and outputs them to the server connected to the karaoke apparatus via the network.

As a result, the still image showing the singer's face, the singer ID of that singer, and the video of that singer singing are uploaded to the server. The singer and other users can then use the singer ID to search for and view the still image, and can go on to view and enjoy the corresponding singing video.

To achieve the above object, the sixth invention is a method of outputting a still image of a karaoke singer, executed by a computer provided in a karaoke apparatus that plays back karaoke songs, for generating and outputting a still image including the singer of the karaoke song, characterized by comprising: an acquisition procedure of acquiring video data of a predetermined range, including the singer and viewers other than the singer, captured and generated by at least one video camera; a calculation procedure of calculating, from the video data of the predetermined range acquired in the acquisition procedure, the number of viewers whose faces are turned toward the singer; a determination procedure of determining the climax timing of the group including the singer and the viewers from the change over time in the number calculated in the calculation procedure; an extraction procedure of extracting, from the video data of the predetermined range acquired in the acquisition procedure, at least one still image including the singer at the climax timing determined in the determination procedure; and an output procedure of outputting the still image extracted in the extraction procedure, or a still image obtained by correcting that still image, to a server connected to the karaoke apparatus.

In the sixth invention, at least one video camera captures a field of view of a predetermined range including the singer and the viewers, and generates video data of that range. On the basis of the generated video data, the calculation means calculates the number of viewers whose faces are turned toward the singer. The group is excited when the viewers are genuinely enjoying the singer's performance, so at such a moment the viewers' eyes are on the singer. The determination means therefore determines the climax timing from the change over time in the number of viewers calculated in the calculation procedure. The moment at which the largest number of viewers have their faces turned toward the singer can thus be determined to be the climax timing of the group. The still image at that climax timing is then extracted, in the extraction procedure, from the video data acquired in the acquisition procedure, and the extracted still image (or a corrected version of it) can be uploaded to the server in the output procedure as the representative image for the service provision period. As a result, the entertainment value of the karaoke apparatus can be further improved.

According to the present invention, a still image at the climax timing of the users of a karaoke apparatus can be reliably extracted, and entertainment value can be improved.

A diagram schematically showing a karaoke room in which a karaoke apparatus according to an embodiment of the present invention is installed.
A side view showing the appearance of the microphone.
A functional block diagram showing the overall configuration of a karaoke system including the karaoke apparatus.
An explanatory diagram showing the process of obtaining video data of the singer by image processing of the video input from the all-around camera.
A diagram schematically showing the image acquired when the inside of the karaoke room is shot with the all-around camera.
Another diagram schematically showing the image acquired when the inside of the karaoke room is shot with the all-around camera while one of the viewers is not looking at the singer.
A diagram schematically showing the karaoke room while one of the viewers is not looking at the singer.
An explanatory diagram showing an example of a timing log.
A flowchart showing details of the processing procedure executed by the control unit of the apparatus main body.
A flowchart showing details of the log creation processing procedure.
A diagram showing a display example of thumbnails uploaded to the host server.
A flowchart showing details of the processing procedure executed by the control unit of the apparatus main body in a modification that extracts a still image showing the singer's face.
A flowchart showing details of the processing procedure executed by the control unit of the apparatus main body in a modification that takes the audio level into account when extracting.

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a diagram schematically showing a karaoke room in which the karaoke apparatus of the present embodiment is installed.

In FIG. 1, a karaoke apparatus 10 is installed in a karaoke room KR of a karaoke establishment or the like. The karaoke apparatus 10 provides a playback service for karaoke songs using Musical Instrument Digital Interface (MIDI; registered trademark) data as music data, together with video data. In the example shown in FIG. 1, users A to C are singing karaoke. The karaoke apparatus 10 includes an apparatus main body 100, referred to as a commander, a remote controller 200, a microphone 300, and a single all-around camera 400 (video camera). The apparatus main body 100, the remote controller 200, and the microphone 300 are described in detail later.

In this example, the all-around camera 400 is mounted facing downward at, for example, the center of the ceiling of the karaoke room KR. The all-around camera 400 captures a fixed field of view of a predetermined range including the microphone 300 and the singers A to C, and generates video data of that range. Specifically, the all-around camera 400 has a fisheye lens and obtains video data that fits into a single image a hemispherical field of view centered on the camera 400, covering 360 degrees horizontally and 90 degrees vertically. Because of the fisheye lens, nearer objects in the field of view of the all-around camera 400 appear closer to the center of the image circle, and more distant objects appear closer to its periphery. In exchange for the wide viewing angle that the fisheye lens provides, every object is imaged with a fan-shaped distortion. Consequently, when the all-around camera 400 shoots the whole of the karaoke room KR, the image obtained is distorted in a fan shape overall.
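The relationship between distance and position in the image circle can be illustrated with a simple projection model. The sketch below assumes an equidistant fisheye projection (image radius proportional to the angle from the optical axis) and a camera looking straight down from the ceiling; the projection model, the function `image_radius`, and the example numbers are assumptions for illustration and are not specified in the text.

```python
import math


def image_radius(horizontal_dist: float, camera_height: float,
                 radius_at_90deg: float) -> float:
    """Distance from the image center at which a floor-level point appears.

    horizontal_dist: distance of the point from the spot directly below the camera.
    camera_height:   height of the camera above that point's level.
    radius_at_90deg: image radius (in pixels) of a ray 90 degrees off the optical axis.
    """
    # Angle between the downward optical axis and the ray to the point.
    theta = math.atan2(horizontal_dist, camera_height)
    # Equidistant model: radius grows linearly with the angle.
    return radius_at_90deg * theta / (math.pi / 2)


near = image_radius(1.0, 2.5, 500.0)
far = image_radius(3.0, 2.5, 500.0)
```

Under this model a point directly below the camera maps to the image center, and points farther from that spot map to larger radii, consistent with the behaviour described above.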

FIG. 2 is a diagram showing the appearance of the microphone 300. In FIG. 2, the microphone 300 converts the voice of a user's karaoke singing into an audio signal and inputs it.

The microphone 300 has a microphone housing 302 containing a microphone element 301. A power switch 303 is provided on the upper part of the microphone housing 302. The lower part of the microphone housing 302 is provided with a light emitting diode (LED) 304, serving as marker signal generating means, which emits marker light of a predetermined color (green in this example) as the marker signal, and a translucent light diffusion sphere 305 that uniformly diffuses the green marker light emitted from the LED 304.
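A colored marker of this kind can be located in a frame by simple color thresholding. The sketch below is a minimal illustration, assuming frames are available as rows of (R, G, B) tuples with 8-bit channels; the function `find_marker` and its threshold values are hypothetical, and the text does not describe how the marker light is actually detected.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]


def find_marker(image: List[List[Pixel]],
                min_green: int = 200,
                max_other: int = 100) -> Optional[Tuple[float, float]]:
    """Return the (x, y) centroid of strongly green pixels, or None if there are none."""
    sum_x = sum_y = count = 0
    for y, row in enumerate(image):
        for x, (r, g, b) in enumerate(row):
            # A pixel counts as marker light if green dominates both other channels.
            if g >= min_green and r <= max_other and b <= max_other:
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None
    return (sum_x / count, sum_y / count)


# 4x4 black frame with two green marker pixels on row 2.
frame = [[(0, 0, 0) for _ in range(4)] for _ in range(4)]
frame[2][1] = (10, 255, 10)
frame[2][2] = (10, 255, 10)
marker_xy = find_marker(frame)
```

The centroid gives an estimate of the microphone position in the image, from which the position of the singer holding it can be inferred.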

FIG. 3 is a functional block diagram showing the overall configuration of a karaoke system including the karaoke apparatus 10 described above.

In FIG. 3, a karaoke system 1 includes the karaoke apparatus 10 installed in the karaoke room KR and a host server 20. The karaoke apparatus 10 and the host server 20 are connected via a network NW, such as a communication network, so that they can exchange information with each other.

The karaoke apparatus 10 includes the apparatus main body 100, the remote controller 200, the microphone 300, and the all-around camera 400 described above. The apparatus main body 100 and the remote controller 200 are connected via a network such as a wireless or wired LAN so that they can exchange information with each other. The apparatus main body 100 and the microphone 300 are connected by a wireless or wired link.

The apparatus main body 100 includes a control unit 101, a mass storage device 103, an operation unit 104, a receiving unit 105, a sound source 106, an audio control unit 107, a speaker 108, a display unit 109, and a communication control unit 110.

The control unit 101 includes a CPU and memory such as RAM and ROM (not shown). Using the temporary storage function of the RAM, the control unit 101 executes various programs stored in advance in the ROM or the mass storage device 103, and thereby controls the apparatus main body 100 as a whole.

In particular, the control unit 101 performs predetermined image processing on the captured images of the inside of the karaoke room KR obtained by the all-around camera 400, generates a plurality of images of the singer holding the microphone 300 (a plurality of still images that together make up a video; the same applies below), stores those images in the mass storage device 103, and displays them on the display unit 109 (details are described later).

The mass storage device 103 is composed of, for example, a hard disk drive (HDD). The mass storage device 103 stores various kinds of information such as MIDI data, background video data, and lyrics data. It also stores, in sequence, the video data of users as they sing.

The operation unit 104 is composed of, for example, a plurality of keys and switches. Using the operation unit 104 or the operation unit 204 of the remote controller 200 described later, a user can perform various operations such as reserving a karaoke song.

The receiving unit 105 receives the singer's audio signal output from the microphone 300.

The sound source 106 reproduces the MIDI data read from the mass storage device 103 by the control unit 101 and outputs it to the audio control unit 107. The audio control unit 107 amplifies the MIDI data output from the sound source 106 and the audio signal input from the microphone 300 via the receiving unit 105, and outputs them to the speaker 108. The speaker 108 outputs the MIDI data and the audio signal from the audio control unit 107 as sound.

In the following, the sound source 106, the audio control unit 107, and the speaker 108 are referred to collectively, where appropriate, as "the sound source 106 etc." The sound source 106 etc. constitute the music reproduction means for reproducing music data.

表示部１０９は、例えば液晶ディスプレイなどから構成され、各種映像を表示する表示手段として機能する。特に、表示部１０９は、上記音源１０６等によるＭＩＤＩデータの再生に同期して、言い換えれば、音源１０６等によりＭＩＤＩデータの再生が行われるのに従い、大容量記憶装置１０３から読み出された背景映像データ、及び歌詞データに対応したテロップ等を表示することができる。 The display unit 109 is composed of a liquid crystal display, for example, and functions as display means for displaying various videos. In particular, the display unit 109 synchronizes with the reproduction of the MIDI data by the sound source 106 or the like, in other words, the background video read from the mass storage device 103 as the MIDI data is reproduced by the sound source 106 or the like. Data and telops corresponding to the lyrics data can be displayed.

通信制御部１１０は、リモコン２００やホストサーバ２０との間で情報通信の制御を行う。 The communication control unit 110 controls information communication with the remote controller 200 and the host server 20.

リモコン２００は、利用者がカラオケ演奏曲の予約操作等の各種操作を行うための操作端末である。このリモコン２００は、制御部２０１と、記憶装置２０３と、操作部２０４と、表示部２０９と、通信制御部２１０とを有している。 The remote controller 200 is an operation terminal for a user to perform various operations such as a reservation operation for a karaoke performance song. The remote controller 200 includes a control unit 201, a storage device 203, an operation unit 204, a display unit 209, and a communication control unit 210.

制御部２０１は、図示しないＣＰＵやＲＡＭ及びＲＯＭ等のメモリを備えている。この制御部２０１は、ＲＡＭの一時記憶機能を利用しつつ、ＲＯＭや上記記憶装置２０３に予め記憶された各種プログラムを実行する。これにより、リモコン２００全体の制御を行う。 The control unit 201 includes a memory such as a CPU, RAM, and ROM (not shown). The control unit 201 executes various programs stored in advance in the ROM or the storage device 203 while using the temporary storage function of the RAM. As a result, the entire remote controller 200 is controlled.

記憶装置２０３は、例えば不揮発性メモリなどから構成され、各種情報を記憶する。操作部２０４は、例えば複数のキーやスイッチなどから構成される。利用者は、この操作部２０４又は上記カラオケ装置１００の操作部１０４を用いて、カラオケ演奏曲の予約操作等の各種操作を行うことができる。表示部２０９は、例えば液晶ディスプレイなどから構成され、各種表示を行う。 The storage device 203 is composed of, for example, a nonvolatile memory and stores various types of information. The operation unit 204 is composed of, for example, a plurality of keys and switches. Using the operation unit 204 or the operation unit 104 of the karaoke apparatus 100, the user can perform various operations such as a reservation operation for karaoke performance songs. The display unit 209 is composed of, for example, a liquid crystal display and performs various displays.

通信制御部２１０は、装置本体１００やホストサーバ２０との間で情報通信の制御を行う。 The communication control unit 210 controls information communication with the apparatus main body 100 and the host server 20.

ホストサーバ２０には、利用者の歌唱中の姿の動画データが圧縮動画ファイルとしてアップロード可能である（詳細は後述）。このホストサーバ２０にアップロードされた動画データは、所定のＷｅｂページにおいて特定の利用者の端末より閲覧可能となっている（後述の図１１も参照）。 The host server 20 can upload the moving image data of the user's singing as a compressed moving image file (details will be described later). The moving image data uploaded to the host server 20 can be viewed from a specific user terminal on a predetermined Web page (see also FIG. 11 described later).

ここで、本実施形態の特徴の１つとして、全周カメラ４００により得られたカラオケルームＫＲ内の映像画像に含まれるマーカ光に基づいてマイク３００の位置が特定され、そのマイク３００の位置を含む部分映像が切り出され、マイク３００を持った歌唱者の動画（カラオケ投稿動画）データが取得される。このとき、全周カメラ４００で撮像して得られた映像信号は、人間の通常の視野とは大きく異なるので、カラオケ投稿動画の用途としてそのまま使うことはできない。このため、全周カメラ４００で撮像して得られた映像信号に対して所定の処理を施す必要がある。 Here, as one of the features of the present embodiment, the position of the microphone 300 is specified based on the marker light included in the video image in the karaoke room KR obtained by the all-around camera 400, and the position of the microphone 300 is determined. The included partial video is cut out and the video (karaoke post video) data of the singer with the microphone 300 is acquired. At this time, the video signal obtained by imaging with the omnidirectional camera 400 is significantly different from the normal visual field of human beings, and therefore cannot be used as it is for the karaoke post movie. For this reason, it is necessary to perform a predetermined process on the video signal obtained by imaging with the omnidirectional camera 400.

図４（ａ）〜（ｆ）は、全周カメラ４００より入力された映像を画像処理して歌唱者の動画データを得るプロセスを表す説明図である。本処理は、装置本体１００の制御部１０１によって実行される。 FIGS. 4A to 4F are explanatory diagrams showing a process of obtaining video data of a singer by performing image processing on the video input from the all-around camera 400. This process is executed by the control unit 101 of the apparatus main body 100.

図４（ａ）に示すように、まず全周カメラ４００で取得したカラオケルームＫＲ内の映像を入力する。ここでは、利用者Ａ〜Ｃ及びマイク３００の映像のみを表し、テーブルや装置本体１００等の映像は省略してある。 As shown in FIG. 4A, first, an image in the karaoke room KR acquired by the all-around camera 400 is input. Here, only the images of the users A to C and the microphone 300 are shown, and the images of the table, the apparatus main body 100, and the like are omitted.

その後、図４（ｂ）に示すように、全周カメラ４００より入力された映像において、所定の色（この例では緑）の成分以外の成分を除去するカラーフィルタ処理を行う。具体的には、色フィルタ（ここでは緑フィルタ）を通して、全周カメラ４００より入力された映像から緑色の成分のみを抽出する。色フィルタは、ＲＧＢのＧ値のみを通過させるか、又は、ＹＵＶのＵＶが一定範囲内にある画素値のみを通過させる、ＣＰＵ演算処理による画素データファイルである。 Thereafter, as shown in FIG. 4B, color filter processing is performed to remove components other than a component of a predetermined color (in this example, green) in the video input from the all-around camera 400. Specifically, only the green component is extracted from the video input from the all-around camera 400 through a color filter (here, a green filter). The color filter is a pixel data file obtained by CPU calculation processing that passes only the G values of RGB or passes only the pixel values of YUV UV within a certain range.

その後、図４（ｃ）に示すように、カラーフィルタ処理が行われた映像データを輝度フィルタに通し、輝度が一定以上の値を示す画素値のみを通過させることで、画像データの２値化を行う。これにより、画像の中の「純粋な緑色に近く、一定以上の明るさがある」画素のみが「１」を示し、それ以外の画素は「０」を示すビットマップが得られる。 Thereafter, as shown in FIG. 4C, the video data that has undergone the color filter processing is passed through a luminance filter that passes only pixel values whose luminance is at or above a certain value, thereby binarizing the image data. This yields a bitmap in which only the pixels that are “close to pure green and at least a certain brightness” are “1” and all other pixels are “0”.
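The color filter and luminance filter steps can be sketched as a single pass over the pixels. This is only an illustrative sketch, not the patented implementation: the function name `binarize_green`, the green-dominance test, both threshold values, and the use of the ITU-R BT.601 luma weights are assumptions, since the text only says that green components are passed and that luminance must exceed a certain value.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0-255


def binarize_green(frame: List[List[Pixel]],
                   dominance: int = 60,
                   min_luma: float = 128.0) -> List[List[int]]:
    """Return a 0/1 bitmap marking bright, nearly pure green pixels.

    `dominance` and `min_luma` are hypothetical thresholds chosen for
    illustration; the source does not give concrete values.
    """
    bitmap = []
    for row in frame:
        out_row = []
        for r, g, b in row:
            # Color filter (assumed form): green must clearly exceed red and blue.
            is_green = g - max(r, b) >= dominance
            # Luminance filter: ITU-R BT.601 luma approximation.
            luma = 0.299 * r + 0.587 * g + 0.114 * b
            out_row.append(1 if is_green and luma >= min_luma else 0)
        bitmap.append(out_row)
    return bitmap
```

With these assumed thresholds, a saturated green pixel such as `(0, 255, 0)` maps to 1, while white, dark green, and red pixels map to 0.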

その後、図４（ｄ）に示すように、全周カメラ４００より入力された映像についてエリア判定を行う。具体的には、予め蜘蛛の巣状に定義されたマップに従い、角度方向（人間の視覚での左右に相当）に対して８分解（Ａ〜Ｈ）、距離方向（人間の視覚での奥行きに相当）に対して３分解（１〜３）又は４分解（１〜４）の計２８分解された各エリアについて、エリアごとにビットマップの画素値を全て加算する。この加算値が最も大きい値（図中ではエリアＧ２）がマーカ光を検知しており、撮影すべき歌唱者がいるエリア（方向）であると判定される。 Thereafter, as shown in FIG. 4D, area determination is performed on the video input from the omnidirectional camera 400. Specifically, according to a map defined in advance in the shape of a spider web, the image is divided into 8 sectors (A to H) in the angular direction (corresponding to left and right in human vision) and into 3 rings (1 to 3) or 4 rings (1 to 4) in the distance direction (corresponding to depth in human vision), for a total of 28 areas, and all the pixel values of the bitmap are summed for each area. The area with the largest sum (area G2 in the figure) is the one in which the marker light has been detected, and it is determined to be the area (direction) where the singer to be photographed is located.
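The area determination can be illustrated as a polar binning of the binarized bitmap. The sketch below rests on several assumptions: the sector order and starting angle, equal-width rings of `ring_width` pixels, and the function names `area_of` and `find_marker_area` are all hypothetical. Only the 8 sectors, the 3-or-4 ring split, and the total of 28 areas come from the text (the 4-ring sectors C, D, G, H follow from the area list given later in the description).

```python
import math

SECTORS = "ABCDEFGH"                      # 8 angular divisions
RINGS = {"A": 3, "B": 3, "C": 4, "D": 4,  # radial divisions per sector
         "E": 3, "F": 3, "G": 4, "H": 4}  # 4*3 + 4*4 = 28 areas


def area_of(x, y, cx, cy, ring_width):
    """Map a pixel to an area label such as 'G2', or None if outside the map."""
    dx, dy = x - cx, y - cy
    angle = math.atan2(dy, dx) % (2 * math.pi)
    sector = SECTORS[int(angle // (math.pi / 4)) % 8]
    ring = int(math.hypot(dx, dy) // ring_width) + 1
    return f"{sector}{ring}" if ring <= RINGS[sector] else None


def find_marker_area(bitmap, cx, cy, ring_width):
    """Sum the bitmap per area and return the label with the largest sum."""
    totals = {}
    for y, row in enumerate(bitmap):
        for x, value in enumerate(row):
            if value:
                label = area_of(x, y, cx, cy, ring_width)
                if label is not None:
                    totals[label] = totals.get(label, 0) + value
    return max(totals, key=totals.get) if totals else None
```

Returning `None` when no pixel is set is a design choice of this sketch; the source does not say what happens when no marker light is visible.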

その後、図４（ｅ）に示すように、図４（ｄ）に示す処理で選択されたエリアについて、扇状スキャンによる画像の形状補正を行う。具体的には、エリア内にある画素を同図に示した走査線に従って並べなおす処理を行う。これによって、扇形状の直径方向はＹ軸、円周方向はＸ軸の矩形状に変形し矯正される。ここで、中心部に近い走査線は短く、円周部に近い走査線は長いが、同一値の画素で補完して拡大するか、画素を省略して縮小し、一定長の線データを得る。走査線の座標パターンは、エリアＡ１〜Ｈ４ごとに予め用意されている。 Thereafter, as shown in FIG. 4E, the shape of the image is corrected by a fan-shaped scan of the area selected in the process shown in FIG. 4D. Specifically, the pixels within the area are rearranged along the scanning lines shown in FIG. 4E. As a result, the fan shape is deformed and corrected into a rectangle whose Y axis is the radial direction of the fan and whose X axis is its circumferential direction. Scanning lines near the center are short and scanning lines near the circumference are long, so line data of a fixed length is obtained either by enlarging a line with duplicated pixels of the same value or by reducing it through omitting pixels. A coordinate pattern of the scanning lines is prepared in advance for each of the areas A1 to H4.

全ての走査線について変換処理が終了すると、図４（ｆ）に示すような、最終的な出力画像（図中ではエリアＧ２の画像）が得られる。この画像は、全周カメラ４００に写ったマーカ光の周辺領域だけを切り取った上で、湾曲した魚眼レンズの円形視野角を通常の矩形視野角へと変換補正したものである。言い換えれば、全周の中心に対して劣弧をとる扇形が切り出されると共に、その切り出された扇形の劣弧を直線に補正することで、扇形が四角形に補正されたものである。このような補正であることから、結果的にマイク３００を持っている歌唱者（この例では利用者Ａ）に対して通常のカメラを向けたのと同等の結果が得られる。本実施形態では、歌唱者（この例では利用者Ａ）が歌唱している間の映像に対し上記変換補正が行われることで、歌唱者である利用者Ａを含む連続的な映像が取得され、記憶される。 When the conversion process has been completed for all the scanning lines, a final output image (the image of area G2 in the figure) is obtained as shown in FIG. 4F. This image is produced by cutting out only the region around the marker light captured by the omnidirectional camera 400 and then converting the circular field of view of the curved fisheye lens into a normal rectangular field of view. In other words, a sector subtending a minor arc about the center of the full circle is cut out, and the sector is corrected into a rectangle by straightening its minor arc. As a result of this correction, the output is equivalent to that obtained by pointing an ordinary camera at the singer holding the microphone 300 (user A in this example). In this embodiment, the conversion and correction are applied to the video captured while the singer (user A in this example) is singing, so that continuous video including user A, the singer, is acquired and stored.
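The fan-to-rectangle correction can be sketched as a polar resampling. This is an illustration only: the source uses scanning-line coordinate patterns prepared in advance for each area, whereas the sketch below computes sample positions on the fly. The function name `unwarp_sector`, its parameters, and the choice of placing the inner radius in the first output row are assumptions.

```python
import math


def unwarp_sector(image, cx, cy, r_inner, r_outer, theta_start, theta_end,
                  out_w, out_h):
    """Resample an annular sector into an out_h x out_w rectangle.

    Rows follow the radial direction (Y axis) and columns follow the
    circumferential direction (X axis). Nearest-neighbour sampling duplicates
    pixels on the short inner arcs and skips pixels on the long outer arcs,
    so every output row has the same fixed length.
    """
    height, width = len(image), len(image[0])
    out = []
    for j in range(out_h):
        # Sample at the centre of each output cell.
        r = r_inner + (r_outer - r_inner) * (j + 0.5) / out_h
        row = []
        for i in range(out_w):
            theta = theta_start + (theta_end - theta_start) * (i + 0.5) / out_w
            x = int(round(cx + r * math.cos(theta)))
            y = int(round(cy + r * math.sin(theta)))
            inside = 0 <= x < width and 0 <= y < height
            row.append(image[y][x] if inside else 0)  # 0 pads out-of-frame samples
        out.append(row)
    return out
```

Precomputing the `(x, y)` pairs once per area, as the description suggests, would avoid repeating the trigonometry for every frame.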

なお、これらの処理は、動画を構成する各画像（静止画）に対して行われるので、毎秒３０フレームの速度で処理されるが、演算能力の関係上、例えば１０フレームにつき１フレームの頻度で処理を行うなど、間引きを行ってもよい。 These processes are performed on each image (still image) constituting the moving image, and are therefore executed at a rate of 30 frames per second. Depending on the available computing power, however, frames may be thinned out, for example by processing only one frame in every ten.
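Frame thinning of this kind can be sketched as a small generator. The name `thin_frames` and the idea of attaching a millisecond timestamp to each retained frame are assumptions made for illustration; the 30 frames per second and the one-in-ten ratio are the example values given in the text.

```python
def thin_frames(frames, fps=30, step=10):
    """Yield (time_ms, frame) for one frame out of every `step` frames."""
    for index, frame in enumerate(frames):
        if index % step == 0:
            # Timestamp derived from the frame index at the nominal frame rate.
            yield round(index * 1000 / fps), frame
```

At the example settings, one second of video yields three processed frames instead of thirty.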

本実施形態では、上記のようにして歌唱者（この例では利用者Ａ）の位置を特定した後、その歌唱者に対し、他の利用者（この例では利用者Ｂ，Ｃ）が視線を向けているかどうかで、これら利用者Ａ，Ｂ，Ｃからなるグループの盛り上がりタイミングを検出する。 In this embodiment, after the position of the singer (user A in this example) has been specified as described above, the excitement timing of the group consisting of users A, B, and C is detected according to whether the other users (users B and C in this example) are directing their gaze at the singer.

すなわち、上述においては、エリアＧ２の画像についての処理を例にとって説明したが、それ以外の２７個のエリア、すなわち、エリアＡ１〜Ａ３，Ｂ１〜Ｂ３，Ｃ１〜Ｃ４，Ｄ１〜Ｄ４，Ｅ１〜Ｅ３，Ｆ１〜Ｆ３，Ｇ１，Ｇ３，Ｇ４，Ｈ１〜Ｈ４についても、同様の手法で各エリアに対して通常のカメラを向けた場合と同等の映像を得ることができる。これにより、歌唱者である利用者Ａのまわりの所定範囲（例えば通常は人の存在が考えにくいカラオケルームＫＲの四隅に相当するエリアＣ４，Ｄ４，Ｇ４，Ｈ４を除く範囲）の映像を得ることができ、当該所定範囲に他の利用者（視聴者）がいるかどうかを、公知の顔認識技術等により検知することができる。この例で言えば、利用者Ｂ及び利用者Ｃの存在が、上記の顔認識技術等を用いて認識される。図５は、このようにして歌唱者（利用者Ａ）まわりの所定範囲の画像が取得された状態を概念的に表している。 That is, in the above description, the processing for the image of the area G2 has been described as an example, but the other 27 areas, that is, the areas A1 to A3, B1 to B3, C1 to C4, D1 to D4, E1 to E3. , F1 to F3, G1, G3, G4, and H1 to H4, it is possible to obtain an image equivalent to a case where a normal camera is directed to each area by the same method. As a result, an image of a predetermined range around the user A who is a singer (for example, a range excluding areas C4, D4, G4, and H4 corresponding to the four corners of the karaoke room KR, where it is difficult for humans to exist) is obtained. It is possible to detect whether there is another user (viewer) within the predetermined range by using a known face recognition technique or the like. In this example, the presence of the user B and the user C is recognized using the face recognition technology described above. FIG. 5 conceptually shows a state where an image of a predetermined range around the singer (user A) is acquired in this way.

そして、さらに、本実施形態では、図５のように歌唱者（利用者Ａ）まわりの所定範囲について取得された画像を用いて、歌唱者以外の利用者（この例では利用者Ｂ，Ｃ）が歌唱者（利用者Ａ）に対し視線を向けているかどうかを検出する。図５の例は、破線矢印で示すように、２名の利用者Ｂ，Ｃ全員の顔が利用者Ａに向き、利用者Ａに対し視線を向けている状態を表しており、現実の空間における上記図１に示した状態に対応している。 Furthermore, in this embodiment, the images acquired for the predetermined range around the singer (user A) as shown in FIG. 5 are used to detect whether the users other than the singer (users B and C in this example) are directing their gaze at the singer (user A). As the broken-line arrows indicate, the example of FIG. 5 shows a state in which the faces of both users B and C are turned toward user A and their gaze is directed at user A, which corresponds to the real-space state shown in FIG. 1.

一方、例えば、利用者Ｂ，Ｃのうち利用者Ｃの顔が歌唱者である利用者Ａを向いておらず、利用者Ａに対し視線を向けているのは利用者Ｂの１名のみである状態もありうる。この場合も、上記の顔認識技術等を用いて図６に示すような歌唱者（利用者Ａ）まわりの所定範囲の画像が取得されることで、上記のような状態であることが認識される。現実の空間では図７に示されるような状態となる。 On the other hand, there may be a state in which, for example, the face of user C is not turned toward user A, the singer, and only one person, user B, is directing their gaze at user A. In this case as well, the state is recognized by acquiring images of the predetermined range around the singer (user A), as shown in FIG. 6, using the face recognition technology described above. In real space, the state is as shown in FIG. 7.

以上のようにして、本実施形態では、利用者Ａが歌唱者として歌唱しているとき、所定周期（例えば数十ｍｓｅｃ等）ごとの各タイミングにおいて利用者Ａの方に顔が向き視線を向けている他の利用者の人数を検出する。その検出した人数は、時系列に沿ったタイミングログに記録される。そして、歌唱者へ顔を向けている他の利用者（視聴者）が最も多いタイミングを、この集団の盛り上がりタイミングである、と決定する。 As described above, in this embodiment, while user A is singing as the singer, the number of other users whose faces are turned toward user A and who are directing their gaze at user A is detected at each timing of a predetermined period (for example, every several tens of msec). The detected number is recorded in a time-series timing log. The timing at which the largest number of other users (viewers) have their faces turned toward the singer is then determined to be the excitement timing of the group.

図８は、上記盛り上がりタイミングを決定するために用いられる、上記タイミングログの例を表す説明図である。図８に示すように、タイミングログには、各データが取得された時刻（言い換えれば録画時刻）を例えばｍｓｅｃ単位で表す「時刻」欄と、前述のようにしてマーカ光に基づき識別された歌唱者を表す「歌唱者」欄と、上記顔認識により識別された、歌唱者以外の在室利用者すなわち視聴者を表す「視聴者」欄と、その視聴者のうち歌唱者の方を顔が向いている視聴者を表す「歌唱者の方を向いている視聴者」の欄とが、記録欄としてそれぞれ設けられている。図示のように、この例では、左から右に向かって時系列的に各データが記録されている。 FIG. 8 is an explanatory diagram showing an example of the timing log used to determine the excitement timing. As shown in FIG. 8, the timing log has the following record columns: a “time” column giving the time at which each item of data was acquired (in other words, the recording time), for example in msec; a “singer” column giving the singer identified from the marker light as described above; a “viewer” column giving the users present in the room other than the singer, that is, the viewers, identified by the face recognition; and a “viewer facing the singer” column giving those viewers whose faces are turned toward the singer. As illustrated, in this example the data are recorded in time series from left to right.

例えば利用者Ａが歌唱している間は、「歌唱者」欄には当該時刻範囲の全タイミングにおいて「Ａ」が記録される。利用者Ａが歌唱しているカラオケルームＫＲ内に利用者Ｂ，Ｃの両方が在室しているタイミングでは「視聴者」欄に「Ｂ，Ｃ」が記録される。例えば利用者Ｂがトイレに行くために退室し視聴者として利用者Ｃのみが在室しているタイミングでは「Ｃ」のみが記録される。 For example, while user A is singing, “A” is recorded in the “singer” column at all timings of the time range. At the timing when both the users B and C are present in the karaoke room KR where the user A sings, “B, C” is recorded in the “viewer” column. For example, only “C” is recorded when user B leaves the room to go to the toilet and only user C is present as a viewer.

また、利用者Ａが歌唱しているときに利用者Ｂ，Ｃの両方が利用者Ａの方を向いているタイミングでは「歌唱者の方を向いている視聴者」欄に「Ｂ，Ｃ」が記録される（図５及び図１の状態に相当）。一方、利用者Ａが歌唱しているときに利用者Ｃは利用者Ａの方を向いているが、利用者Ｂは例えば壁の方を向いており利用者Ａの方を向いていないタイミングでは「歌唱者の方を向いている視聴者」欄に「Ｃ」のみが記録され（図６及び図７の状態に相当）、誰も利用者Ａの方を向いていないタイミングでは「なし」と記録される。 In addition, at a timing when user A is singing and both users B and C are facing user A, “B, C” is recorded in the “viewer facing the singer” column (corresponding to the states of FIGS. 5 and 1). On the other hand, at a timing when user A is singing and user C is facing user A but user B is facing, for example, the wall rather than user A, only “C” is recorded in the “viewer facing the singer” column (corresponding to the states of FIGS. 6 and 7), and at a timing when no one is facing user A, “none” is recorded.

図８に示す例では、図中左右方向の中央に位置するタイミングで、歌唱者である利用者Ａの方を、２人の利用者Ｂ，Ｃが向いている。したがって、このタイミングが、「盛り上がりタイミング」である、として決定される。 In the example shown in FIG. 8, at the timing located at the center in the left-right direction of the figure, the two users B and C are both facing user A, the singer. This timing is therefore determined to be the “excitement timing”.
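The timing log and the selection of the excitement timing can be sketched as follows. The `LogEntry` class, its field names, and the function `excitement_timing` are hypothetical names introduced for illustration; the four columns mirror those described for FIG. 8. The source does not say how ties are broken, so taking the earliest entry among equal counts is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LogEntry:
    time_ms: int                      # "time" column
    singer: str                       # "singer" column
    viewers: List[str]                # "viewer" column (users present, excluding the singer)
    facing: List[str] = field(default_factory=list)  # "viewer facing the singer" column


def excitement_timing(log: List[LogEntry]) -> Optional[int]:
    """Return the time of the entry with the most viewers facing the singer."""
    if not log:
        return None
    # max() returns the first maximal element, so ties resolve to the earliest entry.
    best = max(log, key=lambda entry: len(entry.facing))
    return best.time_ms
```

For a log in which only the middle entry has both B and C facing the singer, the function returns the time of that middle entry, matching the FIG. 8 example.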

そして、先に述べたように、本実施形態では、歌唱者が歌唱している間の当該歌唱者（上記の例では利用者Ａ）を含む映像が連続的に取得され、記憶されている。このように記憶された映像すなわち動画は多数の静止画の集合体であるが、本実施形態では、上記盛り上がりタイミングとして決定されたタイミングの静止画が、上記歌唱者を撮影した映像を代表する、サムネイル画像として抽出される。 As described earlier, in this embodiment, video including the singer (user A in the above example) is continuously acquired and stored while the singer is singing. The video stored in this way, that is, the moving image, is a collection of many still images; in this embodiment, the still image at the timing determined as the excitement timing is extracted as a thumbnail image that represents the video of the singer.

図９は、上記の手法を実行するために、制御部１０１により実行される処理手順の詳細を表すフローチャートである。 FIG. 9 is a flowchart showing details of a processing procedure executed by the control unit 101 in order to execute the above method.

図９において、カラオケ演奏曲に対応したＭＩＤＩデータの再生が開始されると、このフローが開始される。すなわち、音源１０６等によるＭＩＤＩデータの再生と同期して、背景映像データ及び歌詞データが表示部１０９に表示される。すると、歌唱者によるカラオケ演奏曲の歌唱が行われ、マイク３００よりカラオケ歌唱の音声が入力される。 In FIG. 9, when the reproduction of the MIDI data corresponding to the karaoke performance music is started, this flow is started. That is, the background video data and the lyrics data are displayed on the display unit 109 in synchronization with the reproduction of the MIDI data by the sound source 106 or the like. Then, the singing of the karaoke performance music by the singer is performed, and the voice of the karaoke singing is input from the microphone 300.

まずステップＳ１０において、全周カメラ４００により撮影された、カラオケルームＫＲ内の映像データを取得する。この手順が、各請求項記載の取得手順を構成する。 First, in step S10, video data in the karaoke room KR taken by the all-around camera 400 is acquired. This procedure constitutes the acquisition procedure described in each claim.

その後、ステップＳ２５で、カラオケルームＫＲ内の映像データに含まれるマーカ光に基づいて、マイク３００の位置を特定する。このステップＳ２５の処理は、前述の図４（ａ）〜図４（ｃ）に示した画像処理に対応するものである。そして、ステップＳ３０において、カラオケルームＫＲ内の映像データからマイク３００の位置を含む部分映像データを切り出す。このステップＳ３０の処理は、図４（ｄ）に示したエリア判定処理に対応するものである。 Thereafter, in step S25, the position of the microphone 300 is specified based on the marker light included in the video data of the karaoke room KR. The processing in step S25 corresponds to the image processing shown in FIGS. 4(a) to 4(c). Then, in step S30, partial video data including the position of the microphone 300 is cut out from the video data of the karaoke room KR. The processing in step S30 corresponds to the area determination process shown in FIG. 4(d).

その後、ステップＳ３５において、マイク３００の位置を含む部分映像データの補正処理を行い、マイク３００を持った歌唱者の姿が写った画像を得る。このステップＳ３５の処理は、図４（ｅ），図４（ｆ）に示した画像の形状矯正に対応する。そして、ステップＳ４０において、補正処理後の画像を表示部１０９の一部領域に表示させるとともに、撮影時刻と関連づけて大容量記憶装置１０３に保存する。 Thereafter, in step S35, the partial video data including the position of the microphone 300 is corrected to obtain an image showing the singer holding the microphone 300. The processing in step S35 corresponds to the shape correction of the image shown in FIGS. 4(e) and 4(f). Then, in step S40, the corrected image is displayed in a partial area of the display unit 109 and is stored in the mass storage device 103 in association with the shooting time.

その後、ステップＳ１００において、上記図８を用いて説明したタイミングログを作成するログ作成処理が実行される。図１０は、このステップＳ１００の詳細手順を表すフローチャートである。 Thereafter, in step S100, a log creation process that creates the timing log described with reference to FIG. 8 is executed. FIG. 10 is a flowchart showing the detailed procedure of step S100.

図１０において、まずステップＳ１１０で、歌唱者の位置から所定範囲（前述の例ではカラオケルームＫＲの四隅を除く範囲）内の映像データを補正する。なお、この補正処理は、上記ステップＳ３５での補正処理と同等のものを実行すれば足りるので、詳細な説明を省略する。 In FIG. 10, first, in step S110, video data within a predetermined range (a range excluding the four corners of the karaoke room KR in the above example) from the position of the singer is corrected. Note that this correction process only needs to execute the same correction process as the above-described step S35, and thus detailed description thereof is omitted.

その後、ステップＳ１２０において、上記ステップＳ１１０で補正された所定範囲の映像データに対し、公知の顔認識処理を実行し、視聴者の姿を検出する。 Thereafter, in step S120, a known face recognition process is performed on the video data in the predetermined range corrected in step S110 to detect the viewer's appearance.

そして、ステップＳ１３０に移り、上記ステップＳ１２０において検出した視聴者について、各視聴者の顔の向きを検出する。この検出には、上記同様、公知の適宜の顔認識処理やその他の画像解析処理により、各視聴者の顔の輪郭線や鼻・口の位置を特定し、顔の向きがカラオケルームＫＲ内のいずれの方向を向いているかを算出すればよい。 Then, the process proceeds to step S130, and the orientation of each viewer's face is detected for the viewer detected in step S120. For this detection, as described above, the face contour of each viewer and the position of the nose / mouth are specified by a known appropriate face recognition process and other image analysis processes, and the face orientation is determined in the karaoke room KR. It may be calculated which direction it is facing.

その後、ステップＳ１４０に移り、上記ステップＳ１３０での検出結果に基づき、歌唱者の方を向いている視聴者人数をカウントする。そして、ステップＳ１２０で検出されたカラオケルームＫＲ内に在室する視聴者の数と、上記カウントされた歌唱者の方を向いている視聴者の人数と、撮影時刻とを上記タイミングログにデータとして記録する。なお、この作成されたタイミングログは、例えば上記大容量記憶装置１０３内に、参照可能に蓄積され格納される。なお、これらステップＳ１２０、ステップＳ１３０、及びステップＳ１４０が各請求項記載の算出手順を構成すると共に、算出手段として機能する。その後、ステップＳ４５（図９参照）に移る。 Thereafter, the process proceeds to step S140, and the number of viewers facing the singer is counted based on the detection result in step S130. Then, the number of viewers present in the karaoke room KR detected in step S120, the number of viewers facing the counted singers, and the shooting time as data in the timing log. Record. Note that the created timing log is accumulated and stored in the mass storage device 103 so that it can be referred to. These steps S120, S130, and S140 constitute the calculation procedure described in each claim and function as calculation means. Thereafter, the process proceeds to step S45 (see FIG. 9).
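Steps S130 and S140 can be illustrated with a simple geometric test. The source leaves the face-direction method to known techniques, so this sketch assumes that each viewer's room position and a 2D face-direction vector are already available; the function names, the data layout, and the 30-degree tolerance are all hypothetical.

```python
import math


def is_facing(viewer_pos, face_dir, singer_pos, tolerance_deg=30.0):
    """True if the face direction points at the singer within the tolerance."""
    to_singer = (singer_pos[0] - viewer_pos[0], singer_pos[1] - viewer_pos[1])
    dist = math.hypot(*to_singer)
    norm = math.hypot(*face_dir)
    if dist == 0 or norm == 0:
        return False  # undefined direction; treat as not facing
    cos_angle = (to_singer[0] * face_dir[0] + to_singer[1] * face_dir[1]) / (dist * norm)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return math.degrees(math.acos(cos_angle)) <= tolerance_deg


def viewers_facing(singer_pos, viewers):
    """viewers maps a name to (position, face_direction); returns names facing the singer."""
    return [name for name, (pos, direction) in viewers.items()
            if is_facing(pos, direction, singer_pos)]
```

The length of the returned list is the count recorded in the timing log for that timing, and the list itself corresponds to the “viewer facing the singer” column.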

図９に戻り、ステップＳ４５では、カラオケ演奏曲に対応したＭＩＤＩデータの再生が終了したかどうかを判定する。カラオケ演奏曲に対応したＭＩＤＩデータの再生が終了したときは、ステップＳ４５の判定が満たされてステップＳ５０に移る。一方、カラオケ演奏曲に対応したＭＩＤＩデータの再生が終了していないときは、ステップＳ４５の判定が満たされず、ステップＳ１０に戻り、同様の手順を繰り返す。これにより、カラオケ演奏曲の再生が終了しない間は、ステップＳ１０〜ステップＳ４０及びステップＳ１００が繰り返され、ステップＳ４０を経るたびにステップＳ４０において大容量記憶装置１０３に補正処理後の映像（動画）が順次保存されていき、また、動画を構成する各画像（静止画）に対しステップＳ１００のログ作成処理が実行される。すなわち、各画像（静止画）ごとにステップＳ１０〜ステップＳ４０及びステップＳ１００が繰り返されるのである。 Returning to FIG. 9, in step S45, it is determined whether or not the reproduction of the MIDI data corresponding to the karaoke performance song has been completed. When the reproduction of the MIDI data corresponding to the karaoke performance song is completed, the determination at step S45 is satisfied, and the routine goes to step S50. On the other hand, when the reproduction of the MIDI data corresponding to the karaoke performance song is not finished, the determination in step S45 is not satisfied, and the process returns to step S10 and the same procedure is repeated. Thus, steps S10 to S40 and step S100 are repeated while the reproduction of the karaoke performance song is not completed, and the corrected video (moving image) is stored in the large capacity storage device 103 in step S40 every time step S40 is passed. The log creation processing in step S100 is executed for each image (still image) constituting the moving image. That is, step S10 to step S40 and step S100 are repeated for each image (still image).

ステップＳ５０では、動画投稿指示操作画面を表示部１０９の一部領域に表示させる。なお、この動画投稿指示操作画面には、操作者（歌唱者である利用者Ａ。但し、利用者Ｂ，Ｃが操作してもよい）のＩＤ、すなわち歌唱者ＩＤの入力を促す表示が含まれている。その後、ステップＳ５５に移る。 In step S50, a video posting instruction operation screen is displayed in a partial area of the display unit 109. This screen includes a prompt for entering the ID of the operator (user A, the singer, although user B or C may operate it instead), that is, the singer ID. Thereafter, the process proceeds to step S55.

ステップＳ５５では、表示部１０９の動画投稿指示操作画面によって、操作者よりカラオケ動画の投稿が指示されたかどうかを判定する。すなわち、上記歌唱者ＩＤの入力を促す表示に対応して（例えば装置本体１００の操作部１０４又はリモコン２００の操作部２０４により）歌唱者ＩＤが入力されると共に、適宜の投稿指示ボタン等の操作がなされたかどうかが判定される。なお、上記操作部１０４又は操作部２０４が各請求項記載の歌唱者ＩＤ入力手段として機能する。カラオケ動画の投稿が指示されたときは、ステップＳ５５の判定が満たされてステップＳ２００に移り、カラオケ動画の投稿が指示されないときは、ステップＳ５５の判定が満たされず、このフローを終了する。 In step S55, it is determined whether the operator has instructed posting of the karaoke video via the video posting instruction operation screen on the display unit 109. That is, it is determined whether a singer ID has been entered in response to the prompt (for example, with the operation unit 104 of the apparatus main body 100 or the operation unit 204 of the remote controller 200) and an appropriate posting instruction button or the like has been operated. The operation unit 104 or the operation unit 204 functions as the singer ID input means described in each claim. When posting of the karaoke video is instructed, the determination in step S55 is satisfied and the process proceeds to step S200; when posting is not instructed, the determination in step S55 is not satisfied and this flow ends.

ステップＳ２００では、上記ステップＳ１００のログ作成処理により作成されたタイミングログ（図８参照）の「歌唱者の方を向いている視聴者」の欄を参照し、各タイミングにおける歌唱者の方に向いている視聴者の人数を取得する。そして、当該人数が最大となっているタイミングを、盛り上がりタイミングとして決定する。このステップＳ２００が、各請求項記載の決定手順を構成すると共に、決定手段として機能する。 In step S200, the “viewer facing the singer” column of the timing log (see FIG. 8) created by the log creation process in step S100 is referred to, and the number of viewers facing the singer at each timing is acquired. The timing at which that number is largest is then determined as the excitement timing. Step S200 constitutes the determination procedure described in each claim and functions as the determination means.

その後、ステップＳ２１０に移り、上記ステップＳ４０において撮影時刻と関連づけて大容量記憶装置１０３に記憶されていた歌唱者の姿を含む複数の画像データの中から、上記ステップＳ２００で決定した盛り上がりタイミングに対応した画像（静止画）を取得し、その画像をサムネイル画像（代表画像）とする。このステップＳ２１０と前述のステップＳ３０及びステップＳ３５とが、各請求項記載の映像処理手段として機能する。その後、ステップＳ６０に移る。 Thereafter, the process proceeds to step S210. From among the plurality of image data showing the singer that were stored in the mass storage device 103 in association with the shooting time in step S40, the image (still image) corresponding to the excitement timing determined in step S200 is acquired and used as the thumbnail image (representative image). Step S210, together with steps S30 and S35 described above, functions as the video processing means described in each claim. Thereafter, the process proceeds to step S60.
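The lookup in step S210 can be sketched as a search over the stored frames by shooting time. The function name `pick_thumbnail` and the `(time_ms, image)` pair layout are assumptions. Choosing the frame nearest in time, rather than requiring an exact match, is also an assumption, made here because stored frames and log entries may not share identical timestamps.

```python
def pick_thumbnail(stored_frames, timing_ms):
    """Return the stored image whose shooting time is closest to timing_ms.

    stored_frames is a list of (time_ms, image) pairs, as saved in step S40.
    """
    if not stored_frames:
        return None
    return min(stored_frames, key=lambda item: abs(item[0] - timing_ms))[1]
```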

ステップＳ６０では、前述のステップＳ１０〜ステップＳ４０及びステップＳ１００の繰り返し時にステップＳ４０で大容量記憶装置１０３に順次保存された補正処理後の複数の出力映像を用いた動画データと、ステップＳ２１０で取得されたサムネイル画像と、ステップＳ５５で入力された歌唱者ＩＤとを、互いに関連づけた態様でホストサーバ２０にアップロードする。なお、このステップＳ６０が、各請求項記載の出力手順を構成するとともに、静止画出力手段として機能する。ステップＳ６０が完了すると、このフローを終了する。 In step S60, moving image data using a plurality of output videos after the correction processing sequentially stored in the large-capacity storage device 103 in step S40 when the above-described steps S10 to S40 and step S100 are repeated, and acquired in step S210. The thumbnail image and the singer ID input in step S55 are uploaded to the host server 20 in a manner associated with each other. This step S60 constitutes the output procedure described in each claim and functions as a still image output means. When step S60 is completed, this flow ends.

ホストサーバ２０にアップロードされた上記サムネイルの、前述の所定のｗｅｂページでの表示例を図１１に示す。図１１に示す表示例では、歌唱者（例えば利用者Ａ〜Ｃのいずれか。上記の例では利用者Ａ）が歌唱しているサムネイル画像が、カラオケ演奏曲の曲名、歌唱日時（上記盛り上がりタイミングの日時を含む）、カラオケ動画の再生回数、評価等とともに表示されている。ｗｅｂページの画面上で例えばこのサムネイル画像（又は対応する操作部等）をクリックすることにより、上記ステップＳ６０でアップロードされた歌唱者による歌唱時の動画をすべて再生し、閲覧することができる。この例では、このようにして行った閲覧の後の、当該カラオケ動画に対する閲覧した利用者による評価（「うまい」「おもしろい」「かわいい」「泣ける」）が併せて記入され、表示される（詳細な図示は省略）。 FIG. 11 shows an example of how the thumbnail uploaded to the host server 20 is displayed on the predetermined web page mentioned above. In the display example of FIG. 11, a thumbnail image of the singer singing (for example, any of users A to C; user A in the above example) is displayed together with the title of the karaoke song, the singing date and time (including the date and time of the excitement timing), the number of times the karaoke video has been played, ratings, and so on. By clicking, for example, this thumbnail image (or a corresponding control) on the web page, the entire video of the singer's performance uploaded in step S60 can be played and viewed. In this example, ratings of the karaoke video given by users after viewing it in this way (“good”, “funny”, “cute”, “moving”) are also entered and displayed (detailed illustration omitted).

以上説明したように、本実施形態においては、娯楽性の向上のために、歌唱者（前述の例では利用者Ａ）及び視聴者（前述の例では利用者Ｂ，Ｃ）の集団の盛り上がりを検出する。前述したように、集団が盛り上がっているときとは、歌唱者の歌唱によって視聴者が心より楽しんでいるときであり、その瞬間には、視聴者の視線が歌唱者のほうへ向いているのが通常である。そこで、全周カメラ４００の撮影結果に基づき生成されたカラオケルームＫＲ内の映像データに対し顔認識処理を行い、その認識結果に基づき、各タイミングにおける「歌唱者の方を向いている視聴者」の人数をタイミングログとして記録する（図８、図９のステップＳ１００参照）。そして、当該人数の時間的推移により、盛り上がりタイミングを決定する（ステップＳ２００参照）。これにより、顔が歌唱者へ向いている視聴者の人数が最も多い瞬間を、当該集団の盛り上がりタイミングと決定することができる。この結果、全周カメラ４００での撮影により生成された映像データのうち、当該盛り上がりタイミングにおける静止画を抽出し（ステップＳ２１０参照）、当該サービス提供時間の代表画像としてホストサーバ２０へアップロードすることができる（ステップＳ６０）。この結果、ホストサーバ２０へアクセスした各ユーザ等が、当該静止画を閲覧し、楽しむことができる（図１１参照）。なおこのサムネイルは、カラオケ装置１０の表示部１０９に映し出すこともできる。このような種々のサービスを行うことにより、カラオケ装置１０の娯楽性をさらに向上することができる。 As described above, in this embodiment, the excitement of the group consisting of the singer (user A in the above example) and the viewers (users B and C in the above example) is detected in order to improve entertainment value. As noted earlier, the group is excited when the viewers are truly enjoying the singer's performance, and at that moment the viewers' gaze is normally directed toward the singer. Accordingly, face recognition processing is performed on the video data of the karaoke room KR generated from the output of the omnidirectional camera 400, and, based on the recognition result, the number of “viewers facing the singer” at each timing is recorded in the timing log (see FIG. 8 and step S100 in FIG. 9). The excitement timing is then determined from the change in that number over time (see step S200). In this way, the moment at which the largest number of viewers have their faces turned toward the singer can be determined to be the excitement timing of the group. As a result, the still image at the excitement timing can be extracted from the video data generated by the omnidirectional camera 400 (see step S210) and uploaded to the host server 20 as the representative image of the service period (step S60), so that users who access the host server 20 can view and enjoy the still image (see FIG. 11). The thumbnail can also be shown on the display unit 109 of the karaoke apparatus 10. Providing these various services further improves the entertainment value of the karaoke apparatus 10.

また、本実施形態では特に、１台の全周カメラ４００が、歌唱者及びマイク３００を含む所定範囲の視野を撮影し、当該所定範囲の映像データを生成する。このとき、カメラを中心とした全周を撮影可能な視野が広い全周カメラ４００を用いることにより、歌唱者（上記の例では利用者Ａ）及び視聴者（上記の利用者Ｂ，Ｃ）の全員が、当該カメラが生成した映像データの中に常に含まれる。そして生成された所定範囲の映像データには、歌唱者の所持したマイク３００の位置に対応したマーカ光が、歌唱者の姿と共に必ず記録されている。そこで、映像データに含まれるマーカ光を用いてマイク３００及び歌唱者の位置を特定する（図９のステップＳ２５参照）とともに、歌唱者の位置以外の映像データに対し顔認識処理を行って視聴者の顔の向きを決定する（図１０のステップＳ１２０、ステップＳ１３０参照）。これにより、複数台のカメラを用いなくても、全周カメラ４００の１台だけで、顔が歌唱者の方向を向いている視聴者の人数を確実に算出することができる。 In this embodiment in particular, a single omnidirectional camera 400 captures a predetermined range of view including the singer and the microphone 300 and generates video data of that range. Because the omnidirectional camera 400 has a wide field of view covering the full circumference around the camera, the singer (user A in the above example) and all the viewers (users B and C above) are always included in the video data it generates. The generated video data of the predetermined range always contains the marker light corresponding to the position of the microphone 300 held by the singer, together with the singer's figure. The marker light in the video data is therefore used to specify the positions of the microphone 300 and the singer (see step S25 in FIG. 9), and face recognition processing is performed on the video data other than the singer's position to determine the orientation of each viewer's face (see steps S120 and S130 in FIG. 10). Thus, the number of viewers whose faces are turned toward the singer can be reliably calculated with just the one omnidirectional camera 400, without using multiple cameras.

When the still image is generated, a still image obtained from the video data of the wide-field omnidirectional camera 400 is increasingly distorted toward the edges of the field of view. In this embodiment in particular, in step S35 of FIG. 9, a partial still image including the singer is cut out from the video data of the predetermined range as a sector bounded by a minor arc about the center of the full circle, and a predetermined correction process is applied to the cut-out partial still image (as described above, the minor arc of the cut-out sector is corrected to a straight line, turning the sector into a quadrilateral). A climax-timing thumbnail consisting of a normal still image in which the distortion has been corrected can thus be output to the host server 20 as the representative image in step S60.
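
The sector-to-quadrilateral correction of step S35 can be illustrated with a simple polar resampling. The sketch below is an assumption about one possible realization, not the correction actually used: the function name, the nearest-neighbour sampling, the inner and outer radii, and the representation of an image as a list of pixel rows are all hypothetical.

```python
import math


def unwarp_sector(image, center, r_inner, r_outer, start_deg, end_deg, out_w, out_h):
    """Resample a sector of `image` (a list of pixel rows) into out_h rows of out_w pixels.

    Each output row follows one arc of the sector, so arcs become straight lines.
    """
    if not 0 < end_deg - start_deg < 180:
        raise ValueError("the sector must span a minor arc (under 180 degrees)")
    cx, cy = center
    rows = []
    for j in range(out_h):
        # Row 0 samples the outer arc, the last row samples the inner arc.
        r = r_outer - (r_outer - r_inner) * j / max(out_h - 1, 1)
        row = []
        for i in range(out_w):
            theta = math.radians(start_deg + (end_deg - start_deg) * i / max(out_w - 1, 1))
            x = int(round(cx + r * math.cos(theta)))
            y = int(round(cy + r * math.sin(theta)))
            # Clamp to the image bounds (nearest-neighbour sampling).
            y = min(max(y, 0), len(image) - 1)
            x = min(max(x, 0), len(image[0]) - 1)
            row.append(image[y][x])
        rows.append(row)
    return rows


# Tiny 5x5 test image whose pixels store their own (x, y) coordinates.
img = [[(x, y) for x in range(5)] for y in range(5)]
print(unwarp_sector(img, (2, 2), 1, 2, 0, 90, 2, 2))
# -> [[(4, 2), (2, 4)], [(3, 2), (2, 3)]]
```

A production version would interpolate between pixels instead of taking the nearest one; the nearest-neighbour form is used here only to keep the mapping easy to follow.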

The present invention is not limited to the above embodiment, and various modifications are possible without departing from its spirit and technical idea. Such modifications are described below.

(1) Extracting a still image in which the singer's face appears
At a climax timing when the group of the singer and viewers is excited, the singer may happen to be looking down or to the side. A still image taken at such a timing does not record the singer's face properly and is therefore not necessarily desirable as a representative image. This modification addresses such cases.

FIG. 12 is a flowchart showing the details of the processing procedure executed by the control unit 101 in this modification, and corresponds to FIG. 9 above. Steps equivalent to those in FIG. 9 are given the same reference numerals, and their description is omitted or simplified.

In FIG. 12, this modification omits step S210 of FIG. 9 and newly provides steps S220, S230, S240, S250, and S260.

The processing in steps S10 to S40 and step S200 is the same as in FIG. 9, and a detailed description is omitted. When step S200 ends, the process proceeds to the newly provided step S220.

In step S220, it is determined, based on the result of the climax timing decision in step S200, whether there were a plurality of climax timings. That is, it is determined whether, in the timing log, the same number of viewers were facing the singer at a plurality of timings.

If there is only one climax timing, the determination in step S220 is not satisfied and the process proceeds to step S250. In step S250, the image (still image) corresponding to that single climax timing is acquired from the many pieces of image data including the singer's figure that were stored in the mass storage device 103 in association with their capture times in step S40, and that image is set as the thumbnail image. The process then proceeds to step S60.

On the other hand, if there are a plurality of climax timings, the determination in step S220 is satisfied and the process proceeds to step S230. In step S230, the images (still images) corresponding to the plurality of climax timings are acquired from the many pieces of image data including the singer's figure that were stored in the mass storage device 103 in association with their capture times in step S40. A known face recognition technique is then applied to these still images to determine whether any of them contains a recognizable face of the singer. Note that step S220 functions as the determination means recited in the claims.

If there is a still image in which the singer's face can be recognized, the determination in step S230 is satisfied and the process proceeds to step S240, where that still image is set as the thumbnail image. If the singer's face can be recognized in more than one still image, any suitable one of them may be used as the thumbnail. The process then proceeds to step S60.

On the other hand, if in step S230 there is no still image in which the singer's face can be recognized, the process proceeds to step S260. In step S260, any suitable one of the still images acquired in step S230 is set as the thumbnail, and the process then proceeds to step S60. Alternatively, since the singer's face could not be recognized, step S260 may leave the thumbnail unset, output a display signal to the display unit 109 to show an appropriate error message, and end the flow. Likewise, when there is only one climax timing and the determination in step S220 is not satisfied, the same determination as in step S230 may be performed, and if it is not satisfied, the thumbnail may be left unset and an error displayed in the same way.

The processing in step S60 is the same as in FIG. 9, and a detailed description is omitted. Steps S250, S240, and S260, together with steps S30 and S35 described above, function as the video processing means recited in the claims.

In this modification, when a thumbnail is set and output to the host server 20, a still image in which the singer's face is properly recorded can be used as the thumbnail wherever possible.
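
The branching of steps S220 to S260 can be summarized in a short sketch. It is illustrative only: the function name `select_thumbnail` and the `singer_face_visible` callback (standing in for the known face recognition technique) are hypothetical, and "any suitable one" is simplified to "the first one".

```python
from typing import Callable, Optional, Sequence, TypeVar

Still = TypeVar("Still")


def select_thumbnail(climax_stills: Sequence[Still],
                     singer_face_visible: Callable[[Still], bool]) -> Optional[Still]:
    """Pick a thumbnail from the stills captured at the climax timing(s)."""
    if not climax_stills:
        return None
    if len(climax_stills) == 1:      # S220 not satisfied -> S250
        return climax_stills[0]
    for still in climax_stills:      # S230: look for a recognizable singer face
        if singer_face_visible(still):
            return still             # S240
    return climax_stills[0]          # S260: fall back to any one of the stills


# Hypothetical stills represented by labels; the callback fakes face recognition.
stills = ["t10_profile", "t20_front"]
print(select_thumbnail(stills, lambda s: s.endswith("front")))  # -> t20_front
```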

(2) Extracting a still image with the audio level taken into account
At a climax timing when the group of the singer and viewers is excited, the singer may happen not to be singing. This occurs when the excitement is caused by something the singer does other than singing, for example when the viewers all turn their attention to a gesture, mannerism, or facial expression of the singer. A still image taken at such a timing does not show the singer singing and is therefore not necessarily desirable as a representative image of a karaoke singing video. This modification addresses such cases.

FIG. 13 is a flowchart showing the details of the processing procedure executed by the control unit 101 in this modification, and corresponds to FIGS. 9 and 12 above. Steps equivalent to those in FIG. 12 are given the same reference numerals, and their description is omitted or simplified.

In FIG. 13, this modification provides steps S300 and S310 in place of steps S230 and S240 of FIG. 12.

That is, as shown in FIG. 13, when the determination in step S220 is satisfied, the process proceeds to step S300. In step S300, the images (still images) corresponding to the plurality of climax timings are acquired from the many pieces of image data including the singer's figure that were stored in the mass storage device 103 in association with their capture times in step S40. It is then determined whether, for any of these still images, the audio level from the microphone 300 at the corresponding timing is below a predetermined value. For this purpose, in this modification each piece of image data stored in the mass storage device 103 is also stored in advance in association with the audio level of the microphone 300 at that time.

If any of the still images has an audio level below the predetermined value, the determination in step S300 is satisfied and the process proceeds to step S310, where a still image captured when the audio level was at or above the predetermined value is set as the thumbnail image. If more than one still image has an audio level at or above the predetermined value, any suitable one of them may be used as the thumbnail. The process then proceeds to step S60.

On the other hand, if in step S300 none of the still images has an audio level below the predetermined value, the process proceeds to step S260, where, as before, any suitable one of the still images acquired in step S230 is set as the thumbnail. The process then proceeds to step S60.

The processing in step S60 is the same as in FIG. 9, and a detailed description is omitted. Steps S250, S310, and S260, together with steps S30 and S35 described above, function as the video processing means recited in the claims.

In this modification, when a thumbnail is set and output to the host server 20, a still image that records the singer actually singing can be used as the thumbnail wherever possible.
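
The audio-level variant of steps S300 and S310 can be sketched in the same style. This is a hedged illustration: the function name, the `(still, mic_level)` pair layout, and the threshold value are assumptions, and the behaviour when every still is below the threshold is not stated in the text, so the fallback to the first still in that case is purely an assumption of this sketch.

```python
from typing import Optional, Sequence, Tuple, TypeVar

Still = TypeVar("Still")


def select_by_audio_level(climax_stills: Sequence[Tuple[Still, float]],
                          threshold: float) -> Optional[Still]:
    """Pick a thumbnail from (still, mic_level) pairs captured at the climax timing(s)."""
    if not climax_stills:
        return None
    if len(climax_stills) == 1:                      # S220 not satisfied -> S250
        return climax_stills[0][0]
    # S300: is any still associated with a microphone level below the threshold?
    if any(level < threshold for _, level in climax_stills):
        loud = [still for still, level in climax_stills if level >= threshold]
        if loud:
            return loud[0]                           # S310: singer was actually singing
    # S260, and (as an assumption) also the case where every still is too quiet.
    return climax_stills[0][0]


# Hypothetical normalized microphone levels.
print(select_by_audio_level([("t10", 0.1), ("t20", 0.8)], threshold=0.5))  # -> t20
```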

(3) Others
The above description takes as an example the case where images are captured with a single omnidirectional camera 400, but the invention is not limited to this. A plurality of cameras may be used, with a singer camera capturing the singer and a separate viewer camera, positioned for that purpose, capturing the viewers in the room. In either case, it suffices that the number of viewers facing toward the singer can be detected.

The flowcharts shown in FIGS. 9, 10, 12, 13, and so on do not limit the present invention to the procedures shown in those flows; steps may be added or deleted, or their order changed, without departing from the spirit and technical idea of the invention.

In addition to what has already been described, the techniques of the above embodiment and of the modifications may be used in any suitable combination.

Although not illustrated individually, the present invention may also be practiced with various other changes within a scope that does not depart from its gist.

10 Karaoke apparatus
106 Sound source (music reproducing means)
107 Audio control unit (music reproducing means)
108 Speaker (music reproducing means)
109 Display unit (display means)
200 Remote control
300 Microphone
304 LED (marker signal generating means)
400 Omnidirectional camera (video camera)
A User (singer)
B User (viewer)
C User (viewer)

Claims

A karaoke apparatus that provides a reproduction service of karaoke performance music using music data and video data, the karaoke apparatus comprising:
music reproducing means for reproducing the music data for a singer to sing;
display means capable of displaying the video data as the music data is reproduced by the music reproducing means;
at least one video camera that captures a field of view of a predetermined range including the singer and viewers other than the singer, and generates video data of the predetermined range;
calculating means for calculating, from the video data of the predetermined range generated by the video camera, the number of viewers whose faces are turned toward the singer; and
determining means for determining a climax timing of a group including the singer and the viewers from the temporal transition of the number calculated by the calculating means.

The karaoke apparatus according to claim 1, further comprising:
a microphone that is held by the singer and receives the audio signal of the singer's karaoke singing; and
marker signal generating means, provided on the microphone, for generating a marker signal,
wherein a single video camera is provided, which captures its entire circumference as the field of view of the predetermined range including the microphone and the singer, and generates video data of the predetermined range including the marker signal generated by the marker signal generating means, and
wherein the calculating means specifies the positions of the microphone and the singer based on the marker signal included in the video data of the predetermined range generated by the single video camera, performs predetermined face recognition processing on the video data of the range other than the position of the singer within the predetermined range to determine the orientation of each viewer's face, and thereby calculates the number of viewers whose faces are turned toward the singer.

The karaoke apparatus according to claim 2, further comprising:
video processing means for cutting out, from the video data of the predetermined range, at least one partial still image including the singer at the climax timing determined by the determining means, as a sector bounded by a minor arc about the center of the full circle, and for correcting the minor arc of the cut-out sector to a straight line so that the sector becomes a quadrilateral; and
still image output means for outputting the still image to a server connected to the karaoke apparatus via a network.

The karaoke apparatus according to claim 3, further comprising:
determination means for performing predetermined face recognition processing on the at least one still image corrected by the video processing means and determining whether the singer's face can be recognized,
wherein the still image output means outputs to the server, from among the at least one still image corrected by the video processing means, a still image in which the determination means has determined that the singer's face can be recognized.

The karaoke apparatus according to claim 4, further comprising:
singer ID input means for inputting a singer ID from outside,
wherein the video processing means further cuts out, from the video data of the predetermined range, partial video data including the specified position of the singer and applies a predetermined correction process to it, and
wherein the still image output means associates with one another the singer ID input through the singer ID input means, the still image in which the determination means has determined that the singer's face can be recognized, and the partial video data corrected by the video processing means, and outputs them to the server connected to the karaoke apparatus via the network.

A karaoke singer still image output method, executed by a computer provided in a karaoke apparatus that reproduces karaoke performance music, for generating and outputting a still image including a singer of the karaoke performance music, the method comprising:
an acquisition procedure of acquiring video data of a predetermined range including the singer and viewers other than the singer, the video data being captured and generated by at least one video camera;
a calculation procedure of calculating, from the video data of the predetermined range acquired in the acquisition procedure, the number of viewers whose faces are turned toward the singer;
a determination procedure of determining a climax timing of a group including the singer and the viewers from the temporal transition of the number calculated in the calculation procedure;
an extraction procedure of extracting, from the video data of the predetermined range acquired in the acquisition procedure, at least one still image including the singer at the climax timing determined in the determination procedure; and
an output procedure of outputting, to a server connected to the karaoke apparatus, the still image extracted in the extraction procedure or a still image obtained by correcting that still image.