JP5034516B2

JP5034516B2 - Highlight scene detection device

Info

Publication number: JP5034516B2
Application number: JP2007016636A
Authority: JP
Inventors: 千加志杉浦; 公生三関
Original assignee: Fujitsu Mobile Communications Ltd
Current assignee: Fujitsu Mobile Communications Ltd
Priority date: 2007-01-26
Filing date: 2007-01-26
Publication date: 2012-09-26
Anticipated expiration: 2027-01-26
Also published as: JP2008185626A

Abstract

PROBLEM TO BE SOLVED: To detect highlight scenes without using reference information and without being greatly affected by encoding distortion.

SOLUTION: A spectrum calculation module 22 divides the audio data contained in multimedia content into fixed-length sections and detects the spectrum of each section. A cheering feature extraction module 23 then detects a cheering feature amount consisting of the frequency (peak frequency) that is both the maximum and a local maximum within a specified band of the detected spectrum in which the characteristics of cheering appear, together with the power value at that frequency. A cheering section determination module 24 judges as a cheering section any section in which, over a preset time length, the detected cheering feature amount stays above a threshold for a proportion of that time length equal to or greater than a determination threshold rate. An output control module 25 creates and outputs display data representing the result of this judgment.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a device that detects highlight scenes in the content of live programs such as concerts and sports broadcasts.

Recorders that record and play back multimedia content such as broadcast programs are expected to support a viewing style in which only the highlight scenes are watched in a short time. In sports programs in particular, highlights such as scoring scenes occupy only a small part of the whole content, so the demand for short-time viewing is high.

As a solution, techniques have been developed, for example for home recorders, that detect sections in which loud cheering erupts (hereinafter referred to as cheering sections) as highlight scenes, either while the video signal is being recorded or after recording. With such a technique, the user can watch only the highlight scenes of a sports program in a short time.

One known technique for detecting cheering sections prepares the characteristics of a cheering section in advance as reference information, calculates the similarity between each input signal and the reference information, and detects sections whose similarity exceeds a threshold as cheering sections (see, for example, Patent Document 1). The reference information may be a spectrum itself or statistical information derived from a plurality of spectra. Patent Document 1 uses a similarity calculation based on vector quantization, which falls into the statistical category.
Patent Document 1: Japanese Patent No. 3475317

However, when cheering sections are detected by calculating similarity against reference information prepared in advance, the detection accuracy depends on that reference information, and the detection performance becomes unstable as a result. For example, if the audio signal used to create the reference information and the input signal to be examined were recorded in different environments, the expected detection performance may not be obtained. In addition, when an encoded audio signal is decoded and the decoded signal is used as the input, the input contains encoding distortion that the reference information does not, so the detection performance may deteriorate significantly. This is especially pronounced for low-bit-rate encoding.

The present invention has been made in view of the above circumstances. Its object is to provide a highlight scene detection device with high detection accuracy by making it possible to detect highlight scenes without using reference information and without being greatly affected by encoding distortion.

To achieve the above object, one aspect of the present invention receives content data including an audio signal, divides the audio signal into fixed sections, and detects the spectrum of each section. From the part of the detected spectrum lying within a preset band, it takes the frequency that is both the maximum and a local maximum as the peak frequency, and detects a cheering feature amount consisting of this peak frequency and the power value at that frequency. A section in which the detected cheering feature amount remains above a determination threshold for a proportion of a preset determination time length equal to or greater than a determination threshold rate is judged to be a cheering section.

In general, a section of an audio signal that contains cheering, which serves as an indicator of a highlight scene, has a frequency peak in a specific frequency band that is stable over time. This peak is an intrinsic characteristic of cheering and lies at relatively low frequencies, so it is not easily affected by the recording environment or by encoding distortion of the audio signal. Therefore, by first restricting attention to the target band of the input audio signal, detecting within that band the frequency peak and power value that represent the cheering feature amount, and judging cheering sections from this information, relatively stable and high detection performance can be obtained despite differences in recording environment and distortion due to audio encoding.

That is, according to the present invention, highlight scenes can be detected without using reference information and without being greatly affected by encoding distortion, so that a highlight scene detection device with high detection accuracy can be provided.

Embodiments of the present invention are described below with reference to the drawings.

(First embodiment)
FIG. 1 is a block diagram showing the configuration of a highlight scene detection device according to the first embodiment of the present invention.
The highlight scene detection device 1A of this embodiment is used either connected to, or built into, a broadcast recording and playback device such as a video recorder, a video camera, a personal computer with a television recording and playback function, or a portable terminal with a television recording and playback function.

The highlight scene detection device 1A includes a control unit 2A composed of, for example, a central processing unit (CPU). A storage unit 6 and a group of interfaces are connected to the control unit 2A via a bus 7. The interface group consists of an operation information input interface (operation information input I/F) 3, a content input interface (content input I/F) 4, and an output interface (output I/F) 5.

The operation information input I/F 3 is connected to a group of key switches and detects the user's operation of those switches. The content input I/F 4 receives multimedia content data output from a broadcast program recording and playback device (not shown) and includes a recording medium I/F and a line input terminal. It also includes an A/D converter that converts a received analog signal into a digital signal in case the multimedia content is an analog signal. The output I/F 5 outputs information representing the cheering sections that the control unit 2A has detected in the content to a content playback device or the like.
When the highlight scene detection device 1A is built into a broadcast program recording and playback device, the content input I/F 4 serves to receive data read from the multimedia content recording unit inside that device.

As control functions relating to the present invention, the control unit 2A includes an input control module 21, a spectrum calculation module 22, a cheering feature extraction module 23, a cheering section determination module 24, and an output control module 25. Each of the modules 21 to 25 is realized by having the CPU execute an application program.

The input control module 21 receives an operation input signal from the operation information input I/F 3 and determines its type. For example, it distinguishes a signal requesting highlight scene detection, a signal selecting a detection mode, and a signal selecting an output mode.
The input control module 21 also takes in multimedia content data via the content input I/F 4. It extracts the audio data from the multimedia content data, divides the extracted audio data at fixed time intervals, and temporarily stores the result in the storage unit 6 as audio frame data. If the extracted audio data is compressed data encoded with an audio coding scheme such as AAC (Advanced Audio Coding), the compressed data is decoded first and then divided at fixed time intervals into audio frame data.

When passing the audio frame data to the spectrum calculation module 22, the input control module 21 further downsamples it to 8 kHz and converts it into a general-purpose PCM (Pulse Code Modulation) signal. The relatively low sampling frequency of 8 kHz is chosen for three reasons: a frequency band of 0 to 2 kHz is sufficient for the subsequent processing; frequency components up to about 4 kHz rarely lose information through compression, even with low-bit-rate audio coding such as AAC; and the sampling frequency before downsampling is often an integer multiple of 8 kHz, so the downsampling can be done with comparatively simple processing. The downsampling frequency may therefore also be 6 kHz or 12 kHz in some cases; the value of 8 kHz is not essential.
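The integer-multiple case mentioned above can be sketched as follows. This is a minimal illustration, not the device's actual resampler: the function name `downsample` and the use of a block average as a crude low-pass filter are assumptions of this sketch, since the description does not name a filter.

```python
def downsample(samples, src_rate, dst_rate=8000):
    """Decimate by an integer factor, averaging each group of `factor` samples.

    The block average acts as a crude anti-aliasing filter chosen for brevity;
    it is an assumption of this sketch, not a filter named in the description.
    """
    if src_rate % dst_rate != 0:
        raise ValueError("source rate must be an integer multiple of the target rate")
    factor = src_rate // dst_rate
    # Drop any trailing samples that do not fill a complete group.
    usable = len(samples) - len(samples) % factor
    return [sum(samples[i:i + factor]) / factor for i in range(0, usable, factor)]


# 48 kHz input is six times 8 kHz, so every six samples collapse into one.
frame_8k = downsample([1.0] * 48, src_rate=48000)
```

A rate such as 44.1 kHz is not an integer multiple of 8 kHz and would need a rational resampler instead, which is why this sketch rejects it.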

The spectrum calculation module 22 calculates the logarithmic power spectrum of the audio frame data passed from the input control module 21. The spectrum may be calculated by a method based on the Fourier transform such as the DFT (Discrete Fourier Transform) or FFT (Fast Fourier Transform), by a method based on linear prediction analysis such as the LPC (Linear Predictive Coding) spectrum, or by a signal-processing method that combines band-pass filters with power calculation.

From the spectrum calculated by the spectrum calculation module 22, the cheering feature extraction module 23 detects the frequency value (peak frequency) that is both the maximum and a local maximum within the specific band in which the characteristics of cheering appear, together with its power value. It then passes the detected peak frequency and power value to the cheering section determination module 24 as information representing the cheering feature amount.

Based on the information representing the cheering feature amount passed from the cheering feature extraction module 23, the cheering section determination module 24 detects sections in which, over a preset time length, the cheering feature amount is above the cheering feature threshold for a proportion of that time length equal to or greater than the determination threshold rate. It judges each section detected in this way to be a cheering section.

The output control module 25 generates output data that shows, on a time bar, the sections judged to be cheering sections by the cheering section determination module 24, and outputs this data to the output I/F 5. The form of the time bar is determined by the output mode specified in advance via the operation information input I/F 3.

The storage unit 6 stores the audio content data input via the content input I/F 4. It is also used to temporarily hold the information representing the cheering feature amounts and the information representing the cheering section determination results that are produced when the modules of the control unit 2A execute the series of processes for determining cheering sections.

Next, the operation of the highlight scene detection device 1A configured as described above is explained.
The explanation takes as an example the case in which the audio content data of the multimedia content of a sports program is taken in from an external broadcast program recording device, cheering sections are detected from that audio content data, and information representing the result is output to the broadcast program recording device.

In the highlight scene detection device 1A, the detection mode for cheering sections and the output mode for the detection results are set first, as follows. When the user selects the detection mode and output mode on an input device (not shown), the corresponding mode selection signals are taken into the input control module 21 of the control unit 2A via the operation information input I/F 3, where they are identified and stored. Specific examples of how the detection mode and output mode are selected are described later.

When the detection mode and output mode have been set, the highlight scene detection device 1A enters the cheering section detection mode and first executes the process of calculating the spectrum of the audio data and the process of extracting the cheering feature amount, as follows. FIG. 2 is a flowchart showing the procedure and content of this processing by the control unit 2A.

In step S21, the control unit 2A monitors for input of a highlight scene detection request. Suppose that, in this state, the user performs a highlight scene detection request operation on the input device. The control unit 2A detects the request operation with the input control module 21 via the operation information input I/F 3. The control unit 2A then uses the input control module 21 to monitor, via the content input I/F 4, for input of multimedia content data from the broadcast program recording and playback device (not shown).

When multimedia content data sent from the broadcast program recording and playback device is received at the content input I/F 4 in this state, the control unit 2A uses the input control module 21 to take in the multimedia content data via the content input I/F 4. It extracts the audio data from the multimedia content data, divides the extracted audio data at fixed time intervals into audio frame data, downsamples it to 8 kHz, and passes it to the spectrum calculation module 22.

In step S22, the spectrum calculation module 22 takes in the downsampled audio data frame by frame. In step S23, it calculates the logarithmic power spectrum by a method based on the Fourier transform or a method based on linear prediction analysis. Whichever calculation method is used, the frequency resolution should preferably be about 30 Hz or finer. Calculating the spectrum at this resolution improves the accuracy with which the cheering feature extraction module 23 in the following stage can extract the cheering feature amount.
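As a rough illustration of the Fourier-transform route, the sketch below computes a logarithmic power spectrum with a direct DFT and works out the frame length needed for a given bin spacing. The function names, the power-of-two frame length, the absence of a window function, and the small `floor` added to avoid taking the logarithm of zero are all assumptions of this sketch; a practical implementation would normally use an FFT routine.

```python
import cmath
import math


def min_frame_length(sample_rate, resolution_hz):
    """Smallest power-of-two frame length whose bin spacing is <= resolution_hz."""
    n = 1
    while sample_rate / n > resolution_hz:
        n *= 2
    return n


def log_power_spectrum(frame, floor=1e-12):
    """Return 10*log10(|X[k]|^2) for bins 0..N/2, using a direct O(N^2) DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spectrum.append(10.0 * math.log10(abs(acc) ** 2 + floor))
    return spectrum


# At 8 kHz, 512 samples per frame give a bin spacing of 15.625 Hz.
frame_len = min_frame_length(8000, 30)

# A pure tone placed exactly on bin 4 of a 32-sample frame.
tone = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
spec = log_power_spectrum(tone)
```

Under these assumptions a 256-sample frame (31.25 Hz per bin) would fall just short of the 30 Hz guideline, which is why the helper returns 512.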

The control unit 2A then uses the cheering feature extraction module 23 to extract the cheering feature amount, as follows.
Cheering is generally a collection of a very large number of individual shouts, so the frequency characteristic of cheering can be regarded as the smoothed superposition of the frequency characteristics of the individual shouts. For male voices, the sustained vowel portions "a" and "o" of typical shouts such as "waa" and "woh" have a first formant frequency of 600 to 800 Hz and a second formant frequency near 1000 Hz, so the frequency peak appears roughly in the range of 600 to 1000 Hz. In addition, the pitch frequency and sound pressure of a shout tend to rise when the degree of excitement is strong and fall when it is weak. Consequently, the higher the peak frequency and the greater the sound pressure in the 600 to 1000 Hz range, the stronger the excitement; conversely, the lower the peak frequency and the sound pressure, the weaker the excitement. The spectrum of cheering, being a collection of such shouts, therefore has its peak frequency in the range of 600 to 1000 Hz.

Accordingly, in step S24 the cheering feature extraction module 23 first detects, from the spectrum calculated by the spectrum calculation module 22, the frequency values at which the spectrum has a local maximum within the specific band of 600 to 1000 Hz in which the characteristics of cheering appear. In step S25 it then detects the frequency at which the spectrum takes its maximum value within the same 600 to 1000 Hz band. Based on the local maximum and maximum detection results, step S26 determines whether there is a frequency value that is both the maximum and a local maximum; if a frequency value satisfying this condition is found, it is taken as the peak frequency.

For example, when a frequency value that is both the maximum and a local maximum in the 600 to 1000 Hz range is found, as shown in FIG. 4(a), that frequency value becomes the peak frequency. In contrast, when a local maximum is detected but it is not the maximum, as shown in FIG. 4(b), it is determined that there is no peak frequency (= 0). Steps S24 to S26 together form the process for detecting the peak frequency, and steps S24 and S25 may be performed in either order.
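Steps S24 to S26 can be sketched as a single pass over the bins in the band. This is a minimal illustration under stated assumptions: the function name, the `bin_hz` parameter, the use of strict inequalities for the local-maximum test, and the `(0.0, None)` return value for "no peak" are choices of this sketch rather than details given in the description.

```python
import math


def find_peak_frequency(spectrum, bin_hz, low_hz=600.0, high_hz=1000.0):
    """Return (peak_freq_hz, power), or (0.0, None) when no peak qualifies.

    A bin qualifies only if it holds the largest value inside the band AND is
    a local maximum, i.e. strictly greater than both neighbouring bins.
    """
    lo = max(1, math.ceil(low_hz / bin_hz))
    hi = min(len(spectrum) - 2, math.floor(high_hz / bin_hz))
    if lo > hi:
        return 0.0, None
    # Maximum within the band (corresponds to step S25).
    k_max = max(range(lo, hi + 1), key=lambda k: spectrum[k])
    # Local-maximum check on that bin (corresponds to steps S24 and S26).
    if spectrum[k_max] > spectrum[k_max - 1] and spectrum[k_max] > spectrum[k_max + 1]:
        return k_max * bin_hz, spectrum[k_max]
    return 0.0, None


# Like FIG. 4(a): the band maximum at 800 Hz is also a local maximum.
case_a = find_peak_frequency([0, 0, 0, 0, 0, 0, 1, 3, 5, 3, 1, 0], bin_hz=100.0)
# Like FIG. 4(b): a local maximum exists at 700 Hz, but the band maximum sits on
# a slope that keeps rising past 1000 Hz, so no peak frequency is reported.
case_b = find_peak_frequency([0, 0, 0, 0, 0, 0, 2, 3, 2, 4, 6, 8], bin_hz=100.0)
```

Checking the neighbours of the band maximum, including the one just outside the band edge, is what rejects spectra whose energy merely passes through the band.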

When the peak frequency has been detected, the cheering feature extraction module 23 proceeds to step S27 and detects the power value at the detected peak frequency. It then generates information representing the cheering feature amount from this power value and the peak frequency value, and stores this information in the storage unit 6 in step S28.

For example, the weighted sum of the peak frequency PeakFreq and its power value PeakPow may be used as the cheering feature amount Feat, as shown in [Equation 1]; or, as shown in [Equation 2], a bonus term Bns may be added to the peak frequency PeakFreq when the power value PeakPow is at or above a threshold ThPowA, and the result used as Feat. Alternatively, rather than using PeakPow directly as part of the feature amount, the feature amount may be set to 0 when PeakPow is below a threshold ThPowB, as shown in [Equation 3], on the judgment that a peak in the 600 to 1000 Hz range is then not caused by cheering. Furthermore, as shown in [Equation 4], this threshold may be made a variable threshold ThPowV that increases as the peak frequency PeakFreq increases. Here α and β are the weights applied to PeakFreq and PeakPow in the weighted sum, and γ is the coefficient that makes the threshold ThPowV depend on PeakFreq.
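The equations themselves are not reproduced in this text, so the sketch below is only one reading of the prose above. The function names are hypothetical. Two details are assumptions not stated in the description: that the gated variants fall back to the weighted sum when the power clears the threshold, and that ThPowV takes the linear form γ·PeakFreq.

```python
def feat_weighted_sum(peak_freq, peak_pow, alpha, beta):
    """Reading of [Equation 1]: weighted sum of peak frequency and power."""
    return alpha * peak_freq + beta * peak_pow


def feat_with_bonus(peak_freq, peak_pow, th_pow_a, bns):
    """Reading of [Equation 2]: add the bonus Bns when power is at or above ThPowA."""
    return peak_freq + bns if peak_pow >= th_pow_a else peak_freq


def feat_gated(peak_freq, peak_pow, th_pow_b, alpha, beta):
    """Reading of [Equation 3]: zero the feature when power is below ThPowB.

    Returning the weighted sum otherwise is an assumption of this sketch.
    """
    if peak_pow < th_pow_b:
        return 0.0
    return feat_weighted_sum(peak_freq, peak_pow, alpha, beta)


def feat_gated_variable(peak_freq, peak_pow, gamma, alpha, beta):
    """Reading of [Equation 4]: the power threshold grows with the peak frequency.

    The linear form ThPowV = gamma * PeakFreq is an assumption of this sketch.
    """
    th_pow_v = gamma * peak_freq
    return feat_gated(peak_freq, peak_pow, th_pow_v, alpha, beta)
```

All four variants preserve the property the description relies on: a higher peak frequency or a larger power never lowers the feature amount, except where a gate sets it to zero.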

The methods of generating the cheering feature amount described above exploit the property that, in cheering, the stronger the excitement, the higher the peak frequency and the greater the power. Provided this principle is respected, the generation method is not limited to those shown in [Equation 1] to [Equation 4]; various modifications, such as combinations of them, are possible.

When extraction of the cheering feature amount has been completed for one frame of the audio data in this way, the control unit 2A determines in step S29 whether there is a next frame of audio data. If there is, the process returns to step S22, and the spectrum calculation and cheering feature extraction of steps S22 to S28 are repeated.

When extraction of the cheering feature amount has been completed for all frames of the audio data, the control unit 2A next executes the cheering section determination process and the process of outputting the determination results, as follows. FIG. 3 is a flowchart showing the procedure and content of this processing.

In step S31, for each preset determination time length L, the cheering section determination module 24 of the control unit 2A first calculates the total length of time during which the cheering feature amount extracted by the cheering feature extraction module 23 exceeds the threshold, and then calculates the ratio of this total to the determination time length. In step S32 it compares this ratio with the determination threshold rate. If the ratio exceeds the determination threshold rate, step S33 judges the determination time L to be a cheering section. If the ratio is equal to or less than the determination threshold rate, step S34 judges the determination time L to be a non-cheering section. The determination results are then stored in the storage unit 6.

For example, suppose that the cheering feature amounts shown in FIG. 5(a) are obtained over a given determination time length L. The cheering section determination module 24 calculates the total time length Σl (= l1 + l2 + l3) of the intervals l1, l2, and l3 during which the cheering feature amount exceeds the threshold. It then calculates the ratio of Σl to the determination time length L and compares this ratio with the determination threshold rate as in [Equation 5].

If the ratio of the total time length Σl (= l1 + l2 + l3) to the determination time L is greater than the determination threshold rate, that is, if [Equation 5] is satisfied, the section is judged to be a cheering section. For example, with a determination threshold rate of 0.7, Σl/L in the example of FIG. 5(a) exceeds 0.7, so the section is judged to be a cheering section. In the example of FIG. 5(b), Σl/L is 0.7 or less, so the section is not judged to be a cheering section.
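The ratio test can be sketched per determination window as below. The sketch assumes that every frame has the same duration, so that Σl/L equals the fraction of frames above the feature threshold; the function name and the list-of-frames representation are hypothetical. It uses the strict comparison Σl/L > rate described in this passage.

```python
def is_cheer_section(features, feat_threshold, rate_threshold=0.7):
    """Judge one determination window of length L.

    `features` holds one cheering feature amount per equal-length frame, so the
    fraction of frames above `feat_threshold` equals the ratio (sum of l) / L.
    """
    if not features:
        return False
    above = sum(1 for f in features if f > feat_threshold)
    return above / len(features) > rate_threshold


# 8 of 10 frames above the feature threshold: ratio 0.8 exceeds 0.7.
like_fig_5a = is_cheer_section([1, 1, 1, 1, 0, 1, 1, 1, 1, 0], feat_threshold=0.5)
# 7 of 10 frames above: ratio 0.7 does not exceed 0.7.
like_fig_5b = is_cheer_section([1, 1, 1, 0, 0, 1, 1, 1, 1, 0], feat_threshold=0.5)
```

Because isolated frames below the threshold only reduce the ratio slightly, a brief dropout inside an otherwise sustained cheer does not split the section.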

The determination time length L controls the length of the cheering sections to be detected. The longer L is, the more short cheering sections are excluded from detection, but if it is too long, only very long cheering sections will be detected. The shorter L is, the shorter the cheering sections that can be detected, but if it is too short, sections that are not cheering, in which the cheering feature threshold is exceeded only momentarily because of analysis errors or the like, will also be detected as cheering. It is therefore reasonable to set L in advance to a few seconds. The value of L can also be set to a value the user prefers, in response to a setting request entered on the input device.

The determination threshold rate is a value for controlling the certainty of the cheering that is determined as a cheering section. The higher this value, the higher the detection accuracy, but the more likely detection omissions become. Conversely, the lower the determination threshold rate, the lower the detection accuracy, but the less likely detection omissions become. It is therefore desirable to set the determination threshold rate in advance to an appropriate value of about 0.7 to 0.9. Like the determination time length L, the determination threshold rate can also be set to any value greater than 0.0 and not more than 1.0 in response to a setting request entered by the user on the input device.

Using the determination threshold rate to determine cheering sections gives stable results even when momentary dropouts occur within a cheering section because of analysis error or the like. However, the threshold rate does not necessarily have to be used. Other methods that give a similar effect include taking a moving average of the cheering feature amount and applying median filtering. In other words, the determination threshold rate is not essential, and various improvements and modifications are possible without departing from this gist.
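The two alternatives named above can be sketched as follows. This is an illustrative Python sketch, not the embodiment's implementation: the window width, the shrinking of the window at the sequence edges, and the function names `moving_average` and `median_filter` are all assumptions.

```python
from statistics import median


def moving_average(values, width):
    """Centered moving average; the window shrinks at both ends (assumption)."""
    half = width // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def median_filter(values, width):
    """Centered median filter with the same edge handling as above."""
    half = width // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


# A cheering feature sequence with two momentary dropouts (hypothetical data)
features = [8, 8, 0, 8, 8, 8, 0, 8, 8]
print(median_filter(features, 3))
print([round(v, 2) for v in moving_average(features, 3)])
```

In this example the median filter removes the isolated dropouts entirely, whereas the moving average only attenuates them, so a simple per-frame threshold applied after either filter is less sensitive to momentary analysis errors.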

The detection threshold of the cheering feature amount is a value for controlling the detection level of the loudness (excitement) of cheering. The higher this value, the more detection is limited to highly excited cheering. If the detection threshold is low, on the other hand, cheering that is not very excited is also detected. It is therefore desirable to set the detection threshold of the cheering feature amount to an appropriate value in advance as well. However, like the determination time length L and the determination threshold rate, it may be made settable to any value greater than 0 in response to a setting request entered by the user on the input device.

In this way, the cheering section determination module 24 allows a plurality of determination conditions to be set selectively and freely, so that the cheering section determination result can be controlled as desired and cheering detection can be tailored to more specific needs. However, controlling these determination conditions appropriately requires experience or troublesome operations on the part of the user.

It is therefore advisable to prepare a plurality of highlight detection modes in advance so that, when the user selects one of these modes, the determination conditions are variably set to appropriate values accordingly. For example, three highlight detection modes are prepared as shown in FIG. 6, and when the user selects one of them and enters a condition value, the determination conditions corresponding to that condition value are set.

In FIG. 6, the first detection mode from the top detects the top X highlights ranked by degree of excitement. The second detection mode detects highlights in descending order of excitement so that the total time of the detected cheering sections amounts to X minutes. The third detection mode detects highlights in descending order of the degree of highlight so that the total time of the detected sections amounts to X% of the time length of the entire content. Preparing such highlight detection modes in advance means that the user does not need to enter the values of the determination conditions used by the cheering section determination module 24 directly, and can therefore always set appropriate determination conditions with a simple operation, regardless of experience.

Specifically, cheering sections are first detected with settings that yield a generous number of sections: the determination time length L is set to a fairly short value of about 3 seconds, the determination threshold rate to a fairly small value of about 0.7, and the detection threshold of the cheering feature amount to a fairly low value. Next, the degree of highlight of each of these cheering sections is calculated as a cheering score. The degree of highlight is an index of how excited the scene is. The cheering score therefore takes a larger value the longer the cheering lasts, the larger the cheering feature amount, and the fewer the dropouts within the cheering section.

One example of the cheering score is the area of the figure traced by the cheering feature amount, as shown in FIG. 7. This allows the cheering score to be calculated by a relatively simple method. The cheering section determination module 24 selects cheering sections in descending order of cheering score according to the highlight detection mode chosen by the user, and outputs them as the final cheering sections. Because cheering sections are first detected with settings that yield a generous number of sections and are then selected and output according to the user's requirements, the detection process does not have to be run again even when the user changes the desired conditions. The amount of processing required to detect cheering sections can therefore be greatly reduced.
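The area-based cheering score and the selection by highlight detection mode can be sketched as follows in Python. Several points are assumptions made only for illustration: the area is approximated by a rectangle sum over equal-length frames, sections are represented as dictionaries with `start`, `end` (seconds) and `score` keys, the mode names `top_n`, `total_minutes` and `percent` are hypothetical labels for the three modes of FIG. 6, and selection stops as soon as the next section would exceed the time budget.

```python
def cheering_score(features, frame_seconds=1.0):
    """Area under the cheering feature curve, approximated by a rectangle sum."""
    return sum(features) * frame_seconds


def select_highlights(sections, mode, x, content_seconds=None):
    """Pick cheering sections in descending order of score for a given mode."""
    ranked = sorted(sections, key=lambda s: s["score"], reverse=True)
    if mode == "top_n":                      # top X sections
        return ranked[:x]
    if mode == "total_minutes":              # total length of about X minutes
        budget = x * 60.0
    elif mode == "percent":                  # X% of the whole content
        if content_seconds is None:
            raise ValueError("content_seconds is required for percent mode")
        budget = content_seconds * x / 100.0
    else:
        raise ValueError("unknown mode: " + str(mode))
    chosen, used = [], 0.0
    for s in ranked:
        length = s["end"] - s["start"]
        if used + length > budget:           # stopping rule is an assumption
            break
        chosen.append(s)
        used += length
    return chosen


# Hypothetical, generously detected sections
sections = [
    {"start": 0, "end": 30, "score": 5.0},
    {"start": 100, "end": 160, "score": 9.0},
    {"start": 300, "end": 320, "score": 7.0},
]
print(select_highlights(sections, "top_n", 2))
```

Because `select_highlights` works only on the already detected sections, changing the mode or the value of X does not require running detection again, which is the processing saving described above.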

If there are too few cheering sections to satisfy the user's condition, for example if the user specifies detection of the top 60 minutes but the total length of the detected cheering sections is less than 60 minutes, this can be handled by adjusting the determination time length L, the determination threshold rate, and the detection threshold of the cheering feature amount so that more cheering sections are detected. In such a case, lowering the determination time length L and the determination threshold rate too far increases the likelihood of detecting sections that are not cheering, for the reasons described above, so lowering the detection threshold of the cheering feature amount is the most effective measure.
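One way to picture this adjustment is a loop that lowers only the detection threshold of the cheering feature amount until enough cheering time has been found. The Python sketch below is an assumption-laden illustration: `detect` is a hypothetical callable standing in for the detection process, and `step` and `floor` are illustrative tuning values that the embodiment does not specify.

```python
def total_length(sections):
    """Sum of section lengths, with sections given as (start, end) in seconds."""
    return sum(end - start for start, end in sections)


def relax_until_enough(detect, target_seconds, threshold, step, floor):
    """Lower the feature detection threshold until the target time is reached.

    detect(threshold) is a hypothetical callable returning (start, end) pairs.
    The loop stops once the target is met or the threshold would drop below
    floor.
    """
    sections = detect(threshold)
    while total_length(sections) < target_seconds and threshold - step >= floor:
        threshold -= step
        sections = detect(threshold)
    return sections, threshold
```

The determination time length L and the determination threshold rate are deliberately left untouched in this sketch, in line with the remark that lowering them too far tends to admit sections that are not cheering.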

In this way, a plurality of highlight detection modes are prepared in advance as the method of detecting cheering sections, and when the user selects and specifies a desired mode, the determination time length L, the determination threshold rate, and the detection threshold of the cheering feature amount are automatically adjusted according to the specified mode. The user can therefore have highlight scenes detected under the desired conditions without troublesome operations.

When the cheering section determination result has been obtained as described above, the control unit 2A then has the output control module 25 execute output processing of the determination result. That is, the output control module 25 first determines the preset output form in step S36. It then generates output data in one of steps S37 to S40 according to the determined output form, and in step S41 outputs the generated output data from the output I/F 5 to a recording/playback apparatus (not shown) or the like.

For example, the output forms include a first form that displays the positions of the highlight scenes, a second form that displays only the highlight scenes in compressed form, a third form that displays the highlight scenes color-coded, and a fourth form that displays the highlight scenes ranked.
When the first form has been selected by the user, the output control module 25 generates in step S37 a time bar representing the time positions T1 to Tn of the highlight scenes in the content, for example as shown in FIG. 8(a), and outputs the display data of this time bar from the output I/F 5 in step S41.

When the second form has been selected by the user, the output control module 25 generates in step S38 a time bar in which sections other than the highlight scenes are skipped and only the highlight scenes are lined up, for example as shown in FIG. 8(b), and outputs the display data of this time bar from the output I/F 5 in step S41.

When the third form has been selected, the output control module 25 generates in step S39 a time bar that shows the time positions of the highlight scenes in the content and displays each highlight scene color-coded according to how high its cheering score is, for example as shown in FIG. 8(c), and outputs the display data of the generated time bar from the output I/F 5 in step S41.

When the fourth form has been selected, the output control module 25 generates in step S40 a time bar that shows the time positions of the highlight scenes in the content and attaches to each highlight scene information indicating its cheering score and its rank, for example as shown in FIG. 8(d), and outputs the display data of the generated time bar from the output I/F 5 in step S41.

By generating and outputting a time bar representing the cheering section determination result according to the output form selected by the user in this way, the user can check, in the form he or she prefers, at what time positions the detected highlight scenes are located, how long they are, and how excited they are. Very useful information is thus obtained for viewing in a short time. When multimedia content is edited as well, the highlight scene information is very useful, so editing can be carried out efficiently and in a short time.

The following content playback control methods using the output time bar display data are conceivable. A time bar indicating the positions of the highlight scenes of multimedia content is supplied to the recording/playback apparatus that is playing back that content and is displayed on its display. When the user selects a highlight scene displayed on the time bar on the recording/playback apparatus, only that highlight scene is played back. Alternatively, an automatic skip mode is set on the recording/playback apparatus, and scenes other than the highlight scenes are skipped according to the time bar so that only the highlight scenes are played back in sequence.

Besides being used for content playback control in the recording/playback apparatus, the time bar information can be used for content distribution control by a content distribution server on the Internet, or for control that selectively records only the sections corresponding to highlight scenes when content is recorded on a recording medium. Information representing the attributes of the highlight scene sections may also be displayed as text data.

As described above, in the first embodiment, the audio data contained in the multimedia content input from the recording/playback apparatus is divided into fixed sections by the input control module 21, and the spectrum calculation module 22 detects a spectrum for each of these sections. Next, the cheering feature amount extraction module 23 detects a cheering feature amount consisting of the frequency value (peak frequency) that is both the maximum and a local maximum within a specific band of the detected spectrum in which the characteristics of cheering appear, together with its power value. The cheering section determination module 24 then determines as a cheering section any section in which, over a preset time length, the state where the detected cheering feature amount is higher than the threshold is present for at least the determination threshold rate of that time length. The output control module 25 generates display data representing the cheering section determination result and outputs it to the recording/playback apparatus.

Accordingly, because no reference information is used to detect cheering sections, cheering sections can always be detected with stable performance without being affected by differences in the recording environment. In addition, the feature amount for determining cheering sections is detected by exploiting the property that the spectral peak in the 600 to 1000 Hz band characteristic of cheering has a higher frequency and greater power the stronger the excitement of the cheering, and a lower frequency and smaller power the weaker the excitement. Cheering sections can therefore be determined stably and accurately even in the presence of coding distortion of the audio signal.

In this embodiment, moreover, a plurality of highlight detection modes are prepared in advance and displayed, and when the user selects and specifies a desired mode, the determination time length L, the determination threshold rate, and the detection threshold of the cheering feature amount are automatically adjusted according to the specified detection mode. The user can therefore have highlight scenes detected under the desired conditions without troublesome input and setting operations.

Furthermore, a time bar representing the cheering section determination result is generated and output according to the output form selected by the user. The user can therefore check, in the form he or she prefers, at what time positions in the content the detected highlight scenes are located, how long they are, and how excited they are. Very useful information is thus obtained for viewing in a short time. When multimedia content is edited as well, the highlight scene information is very useful, so editing can be carried out efficiently and in a short time.

(Second Embodiment)
FIG. 9 is a block diagram showing the configuration of a highlight scene detection apparatus 1B according to a second embodiment of the present invention. In the figure, parts identical to those in FIG. 1 are given the same reference numerals, and detailed description of them is omitted.
In addition to the input control module 21, the spectrum calculation module 22, the cheering feature amount extraction module 23, the cheering section determination module 24, and the output control module 25 described in the first embodiment, the control unit 2B is newly provided with a cheering section information normalization module 26 and a cheering pattern similarity calculation module 27.

The cheering section information normalization module 26 converts the cheering section determination result obtained by the cheering section determination module 24 into a pattern by dividing the cheering section into a plurality of subsections and normalizing it using a threshold of the cheering feature amount.
The cheering pattern similarity calculation module 27 calculates the similarity between the cheering section patterned by the normalization processing and reference cheering patterns prepared in advance, and classifies the cheering section into one of a plurality of cheering patterns according to the calculated similarity.

Next, the operations of the cheering section normalization processing and the pattern classification processing performed by the apparatus configured as above will be described.
First, the normalization of a cheering section is performed as follows. FIG. 10 is a flowchart showing the processing procedure and contents. In step S101, the cheering section information normalization module 26 of the control unit 2B first calculates the average value and the standard deviation of the cheering feature amounts of the cheering section detected by the cheering section determination module 24. Next, it calculates a threshold for the cheering feature amount from the calculated average value and standard deviation, and divides the cheering section to be normalized into three parts: a first-half part, a middle part, and a second-half part. It then determines in step S102 whether a cheering feature amount equal to or greater than the threshold exists in the first-half part. If one exists, it determines in step S103 whether a cheering feature amount equal to or greater than the threshold exists in the second-half part. If one exists there as well, it determines in step S104 whether a cheering feature amount equal to or greater than the threshold exists in the middle part.

If, as a result of the above determinations, a cheering feature amount equal to or greater than the threshold exists in all of the first-half, second-half, and middle parts of the cheering section, the cheering section information normalization module 26 determines in step S105 that the normalization pattern of the cheering section is pattern A. If a cheering feature amount equal to or greater than the threshold exists in the first-half and second-half parts but not in the middle part, the normalization pattern of the cheering section is determined to be pattern B in step S106. If a cheering feature amount equal to or greater than the threshold exists in the first-half part but not in the second-half part, the normalization pattern of the cheering section is determined to be pattern D in step S107.

On the other hand, if it is determined in step S102 that no cheering feature amount equal to or greater than the threshold exists in the first-half part, the cheering section information normalization module 26 determines in step S108 whether one exists in the second-half part, and if so, determines in step S109 that the normalization pattern of the cheering section is pattern C. If it is determined in step S108 that no cheering feature amount equal to or greater than the threshold exists, the module determines in step S110 whether one exists in the middle part. If one exists, the normalization pattern of the cheering section is determined to be pattern E in step S111. If none exists, it is determined to be pattern A in step S112.
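The branching of steps S102 to S112 can be summarized by the following Python sketch. It is an illustration under assumptions: the three parts are passed in as separate lists of cheering feature amounts (how the section is split is not specified here), and the function name `classify_pattern` is hypothetical.

```python
def classify_pattern(first, middle, last, threshold):
    """Return the normalization pattern 'A' to 'E' for one cheering section.

    first, middle, last: cheering feature amounts of the first-half, middle,
    and second-half parts; a part counts as 'high' if any value in it is at
    or above the threshold.
    """
    has_first = any(v >= threshold for v in first)
    has_middle = any(v >= threshold for v in middle)
    has_last = any(v >= threshold for v in last)

    if has_first:                              # step S102: yes
        if has_last:                           # step S103: yes
            return "A" if has_middle else "B"  # S104, then S105 or S106
        return "D"                             # step S107
    if has_last:                               # step S108: yes
        return "C"                             # step S109
    return "E" if has_middle else "A"          # S110, then S111 or S112


print(classify_pattern([9, 2], [1, 1], [2, 1], threshold=5))
```

Note that pattern A is reached by two routes: when all three parts contain a value at or above the threshold (step S105) and when none of them does (step S112).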

FIGS. 11(a) to 11(e) schematically show the normalization patterns A to E, with time on the horizontal axis and the cheering feature amount on the vertical axis. Pattern A is a pattern in which the cheering feature amount is flat. It corresponds, for example, to cheering without inflection, such as a general roar with no particular peak of excitement. Pattern B is a pattern in which the cheering feature amount starts high, drops once, and then rises again. Taking baseball as an example, it corresponds to a case where, with a runner in scoring position, the excitement is small immediately after a ball is struck that may or may not become a hit, but rises greatly when the ball drops for a hit and a run scores. Pattern C is a pattern in which the cheering feature amount rises from a low state. In soccer, for example, it corresponds to a case where the crowd, already stirred by a scoring chance, erupts in loud cheering when the shot goes in. Pattern D is a pattern in which the cheering feature amount falls from a high state. In soccer, for example, it corresponds to a case where a sudden burst of cheering, such as for a long-range shot, is followed by a lingering period of quieter cheering. Pattern E is a pattern in which the cheering feature amount goes from a low state to a high state and then becomes low again. In baseball, for example, it corresponds to a case where a batter hits a home run: the cheering swells immediately after the hit, subsides a little, and swells again the moment the ball lands in the stands.

In this way, the sections detected as cheering sections are classified according to where the degree of cheering becomes large: nowhere, as in pattern A; in the first and second halves, as in pattern B; only in the second half, as in pattern C; only in the first half, as in pattern D; or near the center, as in pattern E. This makes it possible to grasp to some extent what kind of cheering has been detected.

The threshold used for the normalization processing is calculated as follows. FIG. 12 is a diagram for explaining the calculation method. The threshold is calculated by adding the standard deviation of the cheering feature amounts in the cheering section to their average value. Calculating the threshold from statistics such as the average value and the standard deviation in this way makes it possible to cope with variations in the cheering feature amount.

Alternatively, as shown in [Equation 6] below, the threshold may be the sum of the average value and the standard deviation multiplied by a positive real number α, so that the threshold can be variably controlled through this coefficient α.

Threshold = (average value) + α × (standard deviation)   [Equation 6]

With this arrangement, the larger the coefficient α, the fewer the cheering feature amounts that exceed the threshold, so that almost all sections end up classified as pattern A. Experience shows that it is appropriate to set the coefficient α in the range of 1.0 to 2.0.
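The effect of the coefficient α can be checked with a short Python sketch. The choice of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which form of standard deviation is meant, and the function name `pattern_threshold` and the sample data are hypothetical.

```python
from statistics import mean, pstdev


def pattern_threshold(features, alpha=1.0):
    """Threshold = average + alpha * standard deviation (alpha > 0)."""
    if alpha <= 0:
        raise ValueError("alpha must be a positive real number")
    return mean(features) + alpha * pstdev(features)


# Hypothetical cheering feature amounts: average 5, standard deviation 2
features = [2, 4, 4, 4, 5, 5, 7, 9]
for alpha in (1.0, 2.0):
    th = pattern_threshold(features, alpha)
    count = sum(1 for v in features if v >= th)
    print(alpha, th, count)
```

For this data the number of values at or above the threshold falls as α grows, which is the tendency toward pattern A noted above.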

Furthermore, the standard deviation need not be calculated for each cheering section separately, and may instead be calculated using information from a plurality of cheering sections. This makes it possible to take into account not only the variation within a cheering section but also the variation from one cheering section to another. In the example of FIG. 12, a cheering feature amount equal to or greater than the threshold exists in the first-half part of the cheering section, so step S102 determines that one exists, and none exists in the second-half part, so step S103 determines that none exists. The normalization pattern of the cheering section is therefore determined to be pattern D.

As described above, when a cheering section is normalized, dividing it into a first-half part, a middle part, and a second-half part and quantizing it normalizes the time information. Binarizing the cheering feature amount according to whether it is equal to or greater than the threshold also normalizes the cheering feature amount. However, because the cheering feature amount is itself the information that indicates the excitement of the cheering, it is preferable to retain its value rather than binarize it. For example, for each of the first-half, middle, and second-half parts, the average value of the cheering feature amount may be retained when the part is below the threshold, and when the part is at or above the threshold, either the average of only those cheering feature amounts that are at or above the threshold or the average of the cheering feature amounts of that part alone may be retained. Retaining the value of the cheering feature amount in this way enables more detailed classification in the subsequent cheering pattern classification module 27.
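One of the value-retaining options mentioned above can be sketched in Python as follows. The sketch assumes the variant in which a part at or above the threshold keeps the average of only its values at or above the threshold, while a part below the threshold keeps the plain average of all its values. The function name `retained_value` is hypothetical, and each part is assumed to be non-empty.

```python
def retained_value(part, threshold):
    """Representative cheering feature amount kept for one part of a section."""
    strong = [v for v in part if v >= threshold]
    chosen = strong if strong else part   # fall back to the whole part
    return sum(chosen) / len(chosen)


# Hypothetical first-half, middle, and second-half parts with threshold 5
parts = ([1, 2, 9], [1, 2, 3], [6, 8, 1])
print([retained_value(p, 5) for p in parts])
```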

FIG. 13 shows an example of how the normalized cheering information obtained by the cheering section information normalization module 26 can be represented. In the figure, 131 indicates the ordinal number of the cheering section counted from the start of the content, and 132 indicates the start time and end time of the cheering section. Further, 133 indicates which cheering pattern the section was normalized to by the cheering section information normalization module 26, and 134 indicates the normalized values of the cheering feature amount. With this output form, information on the time, pattern, and loudness (degree) of the cheering in each cheering section is available, which enables more detailed classification in the subsequent cheering pattern classification module 27.

As described above, the cheering section information normalization module 26 normalizes each section determined to be a cheering section by the cheering section determination module 24 into a cheering pattern prepared in advance and a cheering feature amount. As a result, the determination information for the cheering section becomes robust against variation in the time length of cheering sections and in the cheering feature amount, so cheering sections can be classified with higher accuracy.

Although the above description uses five cheering patterns as an example, the number of cheering patterns is not limited to five and may be set to any number of two or more. However, if there are too many patterns, the robustness against variation that normalization provides is lost, and if there are too few, the number of classes into which cheering sections can be classified decreases. It is therefore important to choose an appropriate number.

Next, the cheering pattern classification module 27 classifies the determination information of the cheering sections normalized by the cheering section information normalization module 26. Basically, the classification is the cheering pattern itself, as normalized by the cheering section information normalization module 26. For example, when the cheering pattern is pattern C shown in FIG. 11, the cheering feature amount goes from a low state to a high state, as described earlier. In soccer, this corresponds to a case where the crowd, already stirred by a scoring chance, erupts into loud cheering when the shot is scored.

On the other hand, for finer classification, sections with the same cheering pattern may be ranked against each other using the value of the cheering feature amount. For example, FIG. 14 shows two sections with the same cheering pattern C in which the value of the cheering feature amount during the excited part is larger in (b) than in (a). In this example, (b) represents the greater surge of cheering, so the two can be ranked. In this way, sections with the same cheering pattern can be ranked by comparing the cheering feature amounts of their excited parts, or by comparing a linear sum of the feature amounts of the excited and non-excited parts.
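As a rough illustration of ranking sections that share a cheering pattern, the sketch below scores each section by a linear sum of the feature amounts of its excited ("high") and non-excited ("low") parts. The names `pattern_score` and `rank_same_pattern`, the dictionary keys, and the default weights are hypothetical; the specification does not give concrete weights.

```python
def pattern_score(low_value, high_value, w_high=1.0, w_low=0.0):
    """Linear sum of the excited and non-excited feature amounts.

    With the default weights the score is simply the excited-part value.
    """
    return w_high * high_value + w_low * low_value

def rank_same_pattern(sections, w_high=1.0, w_low=0.0):
    """Order sections of one cheering pattern from strongest to weakest.

    Each section is a dict with "low" and "high" feature amounts
    (an assumed representation chosen for this example).
    """
    return sorted(
        sections,
        key=lambda s: pattern_score(s["low"], s["high"], w_high, w_low),
        reverse=True,
    )
```

With the default weights this reduces to comparing only the excited parts, which is the first of the two comparisons mentioned in the text; setting `w_low` to a nonzero value gives the linear-sum variant.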

This provides the following effect. In the first embodiment, when a constraint such as a total playback time is imposed in response to the various user requests entered through the input device, cheering sections are ranked by a cheer score that depends on the length of the section and the magnitude of the feature amount, and are selected in descending order of rank. In the second embodiment, by contrast, priorities are assigned to the cheering patterns in advance and sections are further ranked within each cheering pattern, so that all cheering sections can be ordered.
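The two-level ordering described here (pattern priority first, then rank within the pattern) combined with a total playback time constraint might look like the following sketch. The section representation, the function names, and the greedy selection rule are assumptions made for illustration only.

```python
def rank_sections(sections, pattern_priority):
    """Order all cheering sections by pattern priority, then by value.

    pattern_priority : list of pattern labels, most important first
                       (a hypothetical, user- or designer-supplied order)
    """
    order = {p: i for i, p in enumerate(pattern_priority)}
    return sorted(sections, key=lambda s: (order[s["pattern"]], -s["value"]))

def select_for_playback(ranked, max_total_seconds):
    """Greedily pick sections from the top until the time budget is used."""
    chosen, total = [], 0.0
    for s in ranked:
        length = s["end"] - s["start"]
        if total + length <= max_total_seconds:
            chosen.append(s)
            total += length
    # Return in playback order rather than rank order.
    return sorted(chosen, key=lambda s: s["start"])
```

A section that does not fit in the remaining budget is skipped and lower-ranked, shorter sections may still be chosen; whether the device behaves this way is not stated in the source.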

The second embodiment can be modified in various ways, for example as follows. A highlight pattern selection menu may be displayed as shown in FIG. 15 so that the user can select a desired cheering pattern. Furthermore, by displaying the cheering patterns classified by color or shape as shown in FIG. 8(c), the user can instantly and easily grasp the cheering patterns, their positions, and the arrangement of cheering sections across the entire content, which allows a wide variety of viewing modes.

As described above, in the second embodiment the cheering pattern classification module 27 can classify cheering sections using the normalized determination information obtained by the cheering section information normalization module 26. The user can therefore not only learn where the cheering sections are but also grasp instantly and easily what kind of cheering each one contains. Expected effects include being able to follow the flow of an entire sports program and obtaining more useful information when editing recorded content of events that involve cheering, such as a school sports day.

The second embodiment has been described using an example in which the normalization and pattern classification performed by the cheering section information normalization module 26 and the cheering pattern classification module 27 classify cheering sections into a plurality of cheering patterns, using information expressed as a pattern and a value, and then rank them. However, the invention is not limited to this, and various pattern classification methods for cheering sections may be adopted without departing from the gist of the invention. For example, target scenes such as goal scenes or home run scenes may be learned statistically in advance, and cheering sections may be normalized and their similarity calculated on the basis of a probabilistic distance to these target patterns. Specifically, a statistical model such as vector quantization, clustering, or a GMM is trained in advance for each type of scene, such as goal scenes and home run scenes. When cheering sections are classified, each section is assigned to the pattern under which its determination information has the highest probability given the trained models. If this probability is taken as the degree of similarity, the sections can also be prioritized.
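The statistical variant can be illustrated with a deliberately simplified model. The sketch below fits one diagonal-covariance Gaussian per scene type, which is a single-component stand-in for the GMM mentioned in the text and not the model the specification prescribes. The feature vectors (for instance, the three per-part values of a section), the scene labels, and all function names are assumptions.

```python
import math

def fit_gaussian(vectors, min_var=1e-6):
    """Fit per-dimension mean and variance to training vectors of one scene type."""
    n = len(vectors)
    dims = len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    variances = [
        max(sum((v[d] - means[d]) ** 2 for v in vectors) / n, min_var)
        for d in range(dims)
    ]
    return means, variances

def log_likelihood(x, model):
    """Log-density of x under a diagonal Gaussian (means, variances)."""
    means, variances = model
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (xi - mu) ** 2 / var)
        for xi, mu, var in zip(x, means, variances)
    )

def classify(x, models):
    """Return the most likely scene label and its log-likelihood.

    The log-likelihood can serve as the similarity used for prioritizing.
    """
    scores = {label: log_likelihood(x, m) for label, m in models.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

A usage example with made-up training data: `models = {"goal": fit_gaussian(goal_vectors), "chance": fit_gaussian(chance_vectors)}` followed by `classify(section_vector, models)`.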

(Third embodiment)
In the second embodiment, cheering sections are classified by normalizing and classifying the temporal transition pattern of the cheering feature amount. With that approach, however, it is difficult to correctly identify organized support scenes, such as scenes in which a group sings in unison at the same pitch or cheers with musical instruments. Organized support often has little to do with how exciting the sports content is; rather, organized support tends to break down when the audience becomes wildly excited. Therefore, by deliberately excluding organized support sections from the cheering sections, or by distinguishing them from cheering sections as organized support sections, the detection accuracy and the classification accuracy of cheering sections can be expected to improve.

In the third embodiment, when the cheering feature amount is extracted, it is also determined whether the section is organized support. For example, the cheering feature amount extraction module 23 shown in FIG. 1 detects, from the spectrum calculated by the spectrum calculation module 22, the frequency (peak frequency) at which the power takes the maximum value and also a local maximum within the partial band where the characteristics of cheering appear, together with its power value, and additionally calculates the power value at frequencies around the peak frequency. When the difference between the power value at the peak frequency and the power value at the surrounding frequencies is larger than a threshold value, the module outputs an organized support flag indicating an organized support state.
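A minimal sketch of the peak detection and the peak-to-surroundings comparison follows. It assumes the spectrum is a list of power values in dB indexed by frequency bin, that "surrounding frequencies" means the bins a fixed offset on either side of the peak, and that their average is used for the comparison. None of these details, nor the function names, come from the specification.

```python
def find_peak(power_db, lo, hi):
    """Return the bin of the band maximum if it is also a local maximum, else None."""
    first = max(lo, 1)
    last = min(hi, len(power_db) - 2)
    if first > last:
        return None
    k = max(range(first, last + 1), key=lambda i: power_db[i])
    if power_db[k] > power_db[k - 1] and power_db[k] > power_db[k + 1]:
        return k
    return None

def organized_support_flag(power_db, lo, hi, offset, diff_threshold_db):
    """True when the peak stands out from its surroundings by more than the threshold."""
    k = find_peak(power_db, lo, hi)
    if k is None:
        return False
    neighbours = [
        power_db[i] for i in (k - offset, k + offset) if 0 <= i < len(power_db)
    ]
    if not neighbours:
        return False
    surrounding = sum(neighbours) / len(neighbours)
    return power_db[k] - surrounding > diff_threshold_db
```

A sharp, isolated peak yields a large difference and sets the flag, while a broad hump of ordinary cheering yields a small difference and leaves it unset.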

The operation of determining cheering sections and organized support sections according to the third embodiment is described below. FIG. 16 is a flowchart showing the processing procedure and its contents. The device configuration is described with reference to FIG. 1.

First, in step S121, the cheering feature amount extraction module 23 detects the peak frequency from the spectrum calculated by the spectrum calculation module 22. Next, in step S122, it determines whether the detected peak frequency is 0; if so, it determines in step S123 that the section is a non-cheering section.

Suppose instead that the peak frequency is determined to be nonzero in step S122. In this case, the cheering feature amount extraction module 23 calculates, in step S124, the bandwidth of the cheering feature amount that contains the peak frequency. For example, as shown in FIG. 17, the frequency width at the level 3 dB below the power value at the peak frequency is taken as the bandwidth. In step S125, the module determines whether the calculated bandwidth is smaller than a preset threshold value. If the bandwidth is equal to or greater than the threshold value, the section is determined in step S126 to be a normal cheering section. If the bandwidth is smaller than the threshold value, the cheering feature amount extraction module 23 determines in step S127 that the section is organized support.
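Steps S122 to S127 can be sketched as below. The code assumes a spectrum given as power values in dB per frequency bin, a peak bin of 0 standing for "no peak found" (mirroring the peak frequency 0 test in S122), and a bandwidth measured by counting contiguous bins within 3 dB of the peak without interpolation. The function names, return labels, and this way of measuring the width are illustrative assumptions.

```python
def bandwidth_3db(power_db, peak_bin, bin_hz):
    """Width in Hz of the contiguous region within 3 dB of the peak power."""
    level = power_db[peak_bin] - 3.0
    left = peak_bin
    while left > 0 and power_db[left - 1] >= level:
        left -= 1
    right = peak_bin
    while right < len(power_db) - 1 and power_db[right + 1] >= level:
        right += 1
    return (right - left + 1) * bin_hz

def classify_section(power_db, peak_bin, bin_hz, bw_threshold_hz):
    """Classify one section following the S122 to S127 branches."""
    if peak_bin == 0:                      # S122 -> S123: no peak detected
        return "non-cheering"
    bw = bandwidth_3db(power_db, peak_bin, bin_hz)   # S124
    if bw < bw_threshold_hz:               # S125 -> S127: narrow peak
        return "organized support"
    return "normal cheering"               # S125 -> S126: broad peak
```

In practice the threshold would have to be tuned to the spectral resolution, since the measured width here is quantized to whole bins.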

The reason a section can be regarded as organized support when the bandwidth around the peak frequency is narrower than the threshold value is as follows. Organized support means that many people sing the same song or play the same instrument. The individual pitch frequencies, which are scattered in ordinary cheering, therefore become almost identical when everyone sings the same song. As a result, the harmonic structure produced by the pitch, that is, the peaks and valleys in frequency, appears prominently in the spectrum, as in FIGS. 17(a) and 17(b), for example. Likewise, when instruments used for cheering are being played, the harmonic structure produced by the fundamental frequency of the instrument makes the peaks and valleys appear prominently in the spectrum. In organized support, therefore, the frequency peak is sharp and the bandwidth is narrow.

As described above, according to the third embodiment, cheering sections and organized support sections can be separated according to the bandwidth around the peak frequency. This improves the accuracy of detecting and classifying cheering sections and makes possible a wide variety of viewing modes that satisfy the user. The organized support sections themselves are also useful information for understanding the content. The user can therefore view or edit the content in even less time.

(Other embodiments)
In each of the above embodiments, the highlight scene detection device is described as being provided separately from the recording and playback device, but it may instead be provided inside the recording and playback device. The highlight scene detection device may also be added to or built into a device other than a recording and playback device, such as an imaging device or a content distribution server.
In addition, the configuration of the highlight scene detection device and the processing procedures and processing contents of the control unit can be modified in various ways without departing from the gist of the invention.

In short, the invention is not limited to the above embodiments as they stand; at the implementation stage, the constituent elements may be modified without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the above embodiments. For example, some constituent elements may be removed from the full set shown in an embodiment, and constituent elements from different embodiments may be combined as appropriate.

FIG. 1 is a block diagram showing the configuration of a highlight scene detection device according to the first embodiment of the invention.
FIG. 2 is a flowchart showing the procedure and contents of the spectrum calculation processing and the cheering feature amount detection processing performed by the control unit of the highlight scene detection device shown in FIG. 1.
FIG. 3 is a flowchart showing the procedure and contents of the cheering section determination processing and its output processing performed by the control unit of the highlight scene detection device shown in FIG. 1.
FIG. 4 is a diagram for explaining the peak frequency detection operation performed by the control unit of the highlight scene detection device shown in FIG. 1.
FIG. 5 is a diagram for explaining the cheering section determination operation performed by the control unit of the highlight scene detection device shown in FIG. 1.
FIG. 6 is a diagram showing an example of the highlight detection mode input menu displayed by the control unit of the highlight scene detection device shown in FIG. 1.
FIG. 7 is a diagram showing an example of the cheer score obtained by the control unit of the highlight scene detection device shown in FIG. 1.
FIG. 8 is a diagram showing several examples of the time bar generated by the control unit of the highlight scene detection device shown in FIG. 1.
FIG. 9 is a block diagram showing the configuration of a highlight scene detection device according to the second embodiment of the invention.
FIG. 10 is a flowchart showing the procedure and contents of the cheering section normalization processing performed by the control unit of the highlight scene detection device shown in FIG. 9.
FIG. 11 is a diagram schematically showing the normalization patterns obtained by the control unit of the highlight scene detection device shown in FIG. 9.
FIG. 12 is a diagram for explaining the calculation method used in the control unit of the highlight scene detection device shown in FIG. 9.
FIG. 13 is a diagram showing an example of the expression of the normalized cheering information obtained by the control unit of the highlight scene detection device shown in FIG. 9.
FIG. 14 is a diagram showing an example in which the control unit of the highlight scene detection device shown in FIG. 9 ranks cheering patterns using the value of the cheering feature amount.
FIG. 15 is a diagram showing a display example of the highlight pattern selection menu, which is a modification of the second embodiment.
FIG. 16 is a flowchart showing the procedure and contents of the processing for determining cheering sections and organized support sections in the third embodiment of the invention.
FIG. 17 is a diagram for explaining the method of calculating the bandwidth of the cheering feature amount containing the peak frequency in the third embodiment of the invention.

Explanation of symbols

1A, 1B ... highlight scene detection device, 2A, 2B ... control unit, 3 ... operation information input interface (operation information input I/F), 4 ... content input interface (content input I/F), 5 ... output interface (output I/F), 6 ... storage unit, 7 ... bus, 21 ... input control module, 22 ... spectrum calculation module, 23 ... cheering feature amount extraction module, 24 ... cheering section determination module, 25 ... output control module, 26 ... cheering section information normalization module, 27 ... cheering pattern classification module.

Claims

A highlight scene detection device comprising:
means for receiving content data including an audio signal;
spectrum detecting means for dividing the audio signal included in the received content data into sections of a predetermined length and detecting a spectrum for each section;
feature amount detecting means for detecting, from the spectrum within a preset band of the detected spectrum, a peak frequency at which the power takes the maximum value and also a local maximum, and the power value at the peak frequency, and for detecting a cheering feature amount from a predetermined relational expression based on the detected peak frequency and power value; and
determining means for determining, as a cheering section, a section in which, within a preset determination time length, the state where the detected cheering feature amount is higher than a determination threshold value lasts for a proportion of that time length equal to or greater than a determination threshold rate.

The highlight scene detection device according to claim 1, further comprising:
means for receiving designation information when a user designates and inputs an output form for the determination result of the cheering section;
means for adjusting at least one of the determination time length, the determination threshold value, and the determination threshold rate in accordance with the received designation information; and
means for editing the determination result of the cheering section, obtained by the determining means on the basis of the adjusted determination time length, determination threshold value, or determination threshold rate, into the output form indicated by the designation information and outputting it.

The highlight scene detection device according to claim 1, further comprising:
normalizing means for normalizing the determination result of the cheering section obtained by the determining means into information representing the length of the cheering and the magnitude of the cheering; and
classifying means for classifying the normalized determination result of the cheering section into a plurality of preset cheering patterns.

The highlight scene detection device according to claim 1, wherein the feature amount detecting means includes:
means for detecting the peak frequency and the power value at the peak frequency from the spectrum within the preset band of the spectrum detected by the spectrum detecting means;
means for detecting, from the spectrum within the preset band, the power value at a frequency near the peak frequency; and
means for calculating the difference between the detected power value at the peak frequency and the power value at the nearby frequency and, when the calculated difference is larger than a preset value, outputting an organized support flag indicating that the section in which the cheering feature amount was detected is in an organized support state.