JP4884163B2

JP4884163B2 - Voice classification device

Info

Publication number: JP4884163B2
Application number: JP2006293055A
Authority: JP
Inventors: 悟松本; 友二山本; 達雄古賀; 良輔大槻
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 2006-10-27
Filing date: 2006-10-27
Publication date: 2012-02-29
Anticipated expiration: 2026-10-27
Also published as: JP2008111866A

Description

The present invention relates to an audio classification device that analyzes and classifies audio signals, and to a computer program that implements the functions of the audio classification device on a computer system.

Methods that use sound recognition to extract program summaries, such as sports highlights, from multimedia content containing an audio signal are known (see, for example, Patent Document 1). In a video of a sporting event, for example, the audio content is classified by identifying audience applause, cheering, the sound of a bat hitting a ball, excited speech, background noise, or music. This makes it possible to find interesting highlights in the sporting event.

In the conventional method of classifying audio content, a set of features is extracted from the input audio signal, and the set of features is classified according to audio classes such as applause, cheering, ball hits, speech, music, and speech with music. Groups of feature segments classified as applause or cheering are then selected as a program summary, such as sports highlights.

Patent Document 1: JP 2004-258659 A

However, the conventional method of classifying audio content requires a storage unit that persistently stores audio classes (acoustic models), such as applause, cheering, ball hits, speech, music, and speech with music, created in advance from training data. There are many kinds of audio, and storing the acoustic model parameters for all of them requires a large storage capacity.

Moreover, if multiple acoustic models are merged into one to reduce the storage capacity, the classification accuracy is lower than with individual acoustic models.

The present invention has been made to solve the above problems, and its object is to provide an audio classification device and a computer program that improve audio classification accuracy by creating individual acoustic models.

The gist of a first feature of the present invention is an audio classification device comprising acoustic model means that creates, from an audio signal, an acoustic model of a specific sound contained in that audio signal, and audio classification means that classifies the audio signal using the acoustic model.

According to the first feature, acoustic models are created from the audio signal as needed, and the audio signal is classified using the created acoustic models. This eliminates the need to create acoustic models in advance and store them persistently, so the data capacity needed for the acoustic model parameters can be reduced. In addition, because an acoustic model adapted to the environment of the audio signal can be created, the classification accuracy of the audio signal improves.

In the first feature, the acoustic model means may comprise filter means that passes a specific frequency band of the audio signal, power value detection means that obtains the power value of the audio signal in the specific frequency band, peak time period detection means that obtains the time periods in which the power value of the audio signal exceeds a predetermined threshold, frequency conversion means that converts the audio signal into a frequency-domain signal, and acoustic model creation means that creates an acoustic model from the frequency-domain signal in the time periods exceeding the threshold.

Acoustic model parameters occupy a large amount of data, so many acoustic models cannot be held persistently. Therefore, instead of preparing acoustic models in advance, filter means that passes the specific frequency band corresponding to each acoustic model is prepared, and the acoustic model is created from the audio signal using the filter means. Because the data size of the filter means is very small compared with the acoustic model parameters, storage capacity is reduced. In addition, the time periods in which peaks of the audio signal's power value exceed the threshold are detected, and the acoustic model is created using the frequency-domain signal in those periods. This removes unnecessary data so that the acoustic model is created only from the data that is needed, resulting in a high-quality acoustic model.

In the first feature, the filter means may include two or more filters that each pass a different frequency band, the acoustic model creation means may create two or more acoustic models using the two or more filters, and the audio classification means may classify the audio signal using the two or more acoustic models.

Because the filter means includes two or more filters, two or more acoustic models are created. The audio signal can then be classified using these two or more acoustic models, so the acoustic models no longer need to be held persistently as in the conventional approach, and the data capacity can be reduced.

In the first feature, the two or more filters may include a band-pass filter that passes a specific frequency band and a band-stop filter that passes everything other than the specific frequency band, and the acoustic model creation means may create acoustic models using the band-pass filter and the band-stop filter.

By providing, as the two or more filters, a filter that passes a specific frequency band and a filter that passes everything other than that band, an acoustic model corresponding to the specific frequency band and an acoustic model corresponding to the rest of the spectrum can both be created. Classifying the audio signal with these two acoustic models further improves the classification accuracy.
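The complementary filter pair described here can be illustrated with a minimal Python sketch. The source does not say how the filters are implemented; the second-order (biquad) design below follows the widely used audio-EQ cookbook formulas and is only an assumption, as are the function names and the centre frequency, Q, and sample rate chosen by the caller.

```python
import math

def biquad_coeffs(kind, center_hz, q, sample_rate):
    """Return normalized (b0, b1, b2, a1, a2) for a band-pass or notch biquad.

    Assumed design (audio-EQ cookbook style); the patent does not specify one.
    """
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    if kind == "bandpass":        # passes the band around center_hz
        b = (alpha, 0.0, -alpha)
    elif kind == "notch":         # passes everything except that band
        b = (1.0, -2.0 * cos_w0, 1.0)
    else:
        raise ValueError("kind must be 'bandpass' or 'notch'")
    a0 = 1.0 + alpha
    return (b[0] / a0, b[1] / a0, b[2] / a0,
            -2.0 * cos_w0 / a0, (1.0 - alpha) / a0)

def apply_biquad(coeffs, samples):
    """Filter a sequence of samples with the direct-form I difference equation."""
    b0, b1, b2, a1, a2 = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def mean_power(samples):
    """Mean-square power of a block of samples."""
    return sum(s * s for s in samples) / len(samples)
```

Running the same input through both filters gives two signals, one dominated by the target sound and one by everything else, from which the two acoustic models could be trained.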

In the first feature, the audio classification device may further comprise acoustic model storage means that holds acoustic models created in advance.

Providing the acoustic model storage means makes general-purpose acoustic models created in advance available and reduces the amount of computation in the acoustic model means.

In the first feature, the audio classification device may further comprise content genre extraction means that determines the genre of content containing an audio signal, and filter determination means that determines, based on the genre of the content, the filter to be used by the filter means.

Determining the filter based on the genre of the content allows different filters to be used for different genres, which reduces the amount of computation in the acoustic model means and further improves the classification accuracy of the audio signal.

Here, the "genre" of content denotes a type of content that groups together items similar in form and substance; examples include news, drama, music, soccer, and baseball.

The gist of a second feature of the present invention is a computer program that causes a computer to function as acoustic model means that creates, from an audio signal, an acoustic model of a specific sound contained in that audio signal, and as audio classification means that classifies the audio signal using the acoustic model.

In the second feature, the acoustic model means may comprise filter means that passes a specific frequency band of the audio signal, power value detection means that obtains the power value of the audio signal in the specific frequency band, peak time period detection means that obtains the time periods in which the power value of the audio signal exceeds a predetermined threshold, frequency conversion means that converts the audio signal into a frequency-domain signal, and acoustic model creation means that creates an acoustic model from the frequency-domain signal in the time periods exceeding the threshold.

According to the present invention, it is possible to provide an audio classification device and a computer program that improve audio classification accuracy by creating individual acoustic models.

Embodiments of the present invention are described below with reference to the drawings. In the drawings, identical parts are given identical reference numerals.

(First embodiment)
With reference to FIG. 1, the overall configuration of a video recording/playback apparatus 20 that includes an audio classification device (DSP sound analysis unit 4) according to the first embodiment of the present invention will be described. The video recording/playback apparatus 20 comprises: an audio separation unit 2 that separates the video data and the audio data contained in the stream corresponding to the multimedia content received by a digital tuner 1; an audio decoder 3 that decodes the audio data; the DSP sound analysis unit 4, which identifies and classifies the types of sound contained in the decoded audio signal; a RAM 12 that temporarily stores the acoustic models created by the DSP sound analysis unit 4; a ROM 13 that persistently stores pre-trained acoustic models, each with a classification number; a CPU (central processing unit) 5 that reads the results obtained by the DSP sound analysis unit 4 through an HDD-IF (hard disk drive interface) 7 and creates a playlist; an A/V decoder 6 that decodes the encoded video data and audio data and outputs the video signal to a monitor 9 and the audio signal to a speaker 10; an HDD (hard disk drive) 8 that stores the stream corresponding to the multimedia content received through the HDD-IF 7; and the HDD-IF 7. The digital tuner 1 receives broadcast waves. The monitor 9 displays the input video signal, and the speaker 10 plays the input audio signal.

The audio separation unit 2, the audio decoder 3, the DSP sound analysis unit 4, the RAM 12, the ROM 13, the CPU 5, the A/V decoder 6, and the HDD-IF 7 are connected via a bus 11.

Note that the audio classification device according to the first embodiment of the present invention corresponds to the DSP sound analysis unit 4.

The digital tuner 1 receives broadcast waves and outputs an encoded stream for each channel. The output stream is transferred to the audio separation unit 2 and, at the same time, recorded on the HDD 8 through the HDD-IF 7. The audio separation unit 2 separates the input stream into video data and audio data. The separated audio data is transferred to the audio decoder 3. The audio decoder 3 decodes the input audio data and transfers the decoded audio signal to the DSP sound analysis unit 4.

With reference to FIG. 2, the detailed configuration of the DSP sound analysis unit 4 (audio classification device) of FIG. 1 will be described. The DSP sound analysis unit 4 comprises an acoustic model unit (acoustic model means) 21 that creates, from the audio signal, an acoustic model of a specific sound contained in that audio signal, and an audio classification unit (audio classification means) 22 that classifies the audio signal using the created acoustic model.

The acoustic model unit 21 comprises a filter unit (filter means) 26 that passes a specific frequency band of the audio signal, a power value detection unit (power value detection means) 27 that obtains the power value of the audio signal in the specific frequency band, a peak time period detection unit (peak time period detection means) 28 that obtains the time periods in which the power value of the audio signal exceeds a predetermined threshold, a frequency conversion unit (frequency conversion means) 29 that converts the audio signal into a frequency-domain signal, and an acoustic model creation unit (acoustic model creation means) 30 that creates an acoustic model from the frequency-domain signal in the time periods exceeding the threshold.

With reference to FIG. 4, how the audio signal changes up to the point where the peak time periods are obtained will be described. Consider the case where the amplitude of the audio signal P1 output by the filter unit 26 follows the profile over time shown in FIG. 4. The power value detection unit 27 obtains the power value p of the audio signal P1 according to equation (1), which yields the audio signal P2 shown in FIG. 4.

As equation (1) shows, the power value p is the mean square of the amplitude of the audio signal over a given period. x[k] is the amplitude of the audio signal, and k is a variable that indexes the samples taken at the sampling frequency. The power value p is calculated by squaring the amplitude x[k] for every integer k = 0 to N-1, summing the N squared amplitudes, and dividing the total by N. N is an arbitrary natural number.
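Equation (1) itself is not reproduced in this text. From the verbal description above (square each of the N amplitudes, sum them, divide by N), it can be written as follows; this is a reconstruction, not a copy of the original drawing.

```latex
p = \frac{1}{N} \sum_{k=0}^{N-1} x[k]^{2} \qquad (1)
```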

The peak time period detection unit 28 sets a predetermined threshold for the audio signal P2 and obtains the time periods in which the power value of the audio signal exceeds that threshold. The periods from t1 to t2, from t3 to t4, and so on are the peak time periods 35.
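The power calculation and threshold test described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the frame-based processing, the function names, and the `frame_len`, `threshold`, and `frame_seconds` parameters are all assumptions introduced here.

```python
def frame_power(samples, frame_len):
    """Mean-square power of consecutive non-overlapping frames.

    Each frame applies the mean-square definition of equation (1)
    with N = frame_len (framing is an assumption of this sketch).
    """
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers

def peak_periods(powers, threshold, frame_seconds):
    """Return (start_time, end_time) pairs where power stays above threshold."""
    periods = []
    start = None
    for i, p in enumerate(powers):
        if p > threshold and start is None:
            start = i                      # a peak period opens (t1, t3, ...)
        elif p <= threshold and start is not None:
            periods.append((start * frame_seconds, i * frame_seconds))
            start = None                   # the period closes (t2, t4, ...)
    if start is not None:                  # signal ended while above threshold
        periods.append((start * frame_seconds, len(powers) * frame_seconds))
    return periods
```

Each returned pair plays the role of one peak time period 35, that is, one (t1, t2) or (t3, t4) interval in FIG. 4.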

With reference to FIGS. 3, 5 to 7, and 14, the operation of the video recording/playback apparatus 20 of FIG. 1 when creating an acoustic model, when classifying an audio signal, and when playing back content will be described.

[Operation when creating an acoustic model: FIGS. 3, 5 and 14(a)]
The stream corresponding to the content obtained from the broadcast wave received by the digital tuner 1 is separated into audio data and video data by the audio separation unit 2 (S101a). The audio decoder 3 decodes the audio data and generates an audio signal (S101b). The filter unit 26 uses a band-pass filter that passes a specific frequency band to extract only the audio signal in that band (S102), and the power value detection unit 27 obtains the power value of the audio signal in that band, that is, converts the audio into a power signal (S103). The peak time period detection unit 28 detects the time periods (start time and end time) in which the local maxima of the power signal are at or above a certain threshold (S105). Meanwhile, the frequency conversion unit 29 converts the raw (unfiltered) audio signal into a frequency-domain signal, and the feature extraction unit 31 extracts the frequency-domain signal in the detected time periods (S106). The extracted frequency-domain signal (features) is temporarily stored in the feature storage unit (RAM 12) (S107). As the features, for example, mel-frequency cepstral coefficients (MFCC), which are widely used in speech recognition, can be used. Steps S105 to S107 are repeated until the content ends. When the content end detection unit 32 detects the end of the content (YES in S104), the process proceeds to S108: the feature reading unit 45 reads the features from the feature storage unit (RAM 12), and the acoustic model creation unit 30 creates an acoustic model using the features as training data (S108). The created acoustic model is stored in the acoustic model storage unit (RAM 12) (S109).
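Steps S106 to S108 can be sketched in Python as below. The source does not state what form the acoustic model takes, so the single diagonal-covariance Gaussian used here is purely an assumed stand-in; the function names, the `(time, vector)` frame layout, and the variance floor are likewise hypothetical.

```python
def select_features(frames, periods):
    """Keep feature vectors whose timestamp falls inside a detected peak period.

    frames:  list of (time_seconds, feature_vector) pairs, e.g. MFCC vectors.
    periods: list of (start_seconds, end_seconds) pairs from peak detection.
    """
    return [vec for t, vec in frames
            if any(start <= t < end for start, end in periods)]

def fit_diagonal_gaussian(vectors, var_floor=1e-6):
    """Fit a per-dimension mean and variance to the selected features.

    Assumed stand-in for the acoustic model created in S108.
    """
    if not vectors:
        raise ValueError("no training features were collected")
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, var_floor)
           for d in range(dim)]
    return {"mean": mean, "var": var}
```

Because only features from the peak periods reach the fitting step, the model is trained on the frames where the filtered band was loud, which mirrors the idea of building the model from necessary data only.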

FIG. 14(a) is a table listing examples of the filters that the filter unit 26 uses in S102 to extract only the audio signal in a specific frequency band. For example, the whistle filter extracts only the audio signal in the specific frequency band corresponding to the sound of a whistle. By extracting only that band, the acoustic model creation unit 30 can create an acoustic model of the whistle from the extracted audio signal. Similarly, using the filters for music, the gyoji (sumo referee), and so on, acoustic models of music, the gyoji, and so on can be created. In the first embodiment, the filters used when creating acoustic models are set in advance.

[Operation when classifying an audio signal: FIGS. 3, 6 and 14(b)]
The content reading unit 46 reads the stream corresponding to the content from the HDD (content storage unit) 8 (S201), and the audio separation unit 2 extracts the audio data (S202a). After the audio decoder 3 decodes the audio data to generate an audio signal (S202b), the frequency conversion unit 29 converts the audio signal into a frequency-domain signal (S203). The acoustic model reading unit 47 then reads the acoustic models from the ROM 13 and the RAM 12, and the audio classification unit 22 uses a likelihood function to calculate the likelihood of the input frequency-domain signal (features) under each acoustic model and records the label of the model with the highest likelihood, together with the time, in the classification result storage unit (HDD 8) (S204). The "label" is the classification number. The "time" is the time that indicates the start position of the input signal in the comparison between the input frequency-domain signal and the acoustic model. The same applies to the second and third embodiments.
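Step S204 amounts to a maximum-likelihood decision per input segment. The sketch below assumes each acoustic model is a diagonal-covariance Gaussian stored as a dict with `mean` and `var` lists; the source names neither the model family nor the likelihood function, so both are illustrative assumptions, as are the function names.

```python
import math

def log_likelihood(model, vec):
    """Log-likelihood of one feature vector under a diagonal Gaussian model."""
    total = 0.0
    for x, m, v in zip(vec, model["mean"], model["var"]):
        total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total

def classify(frames, models):
    """Label each (time, feature_vector) pair with the most likely model.

    models maps a label (classification number) to a model dict.
    Returns a list of (label, time) records, as stored in S204.
    """
    results = []
    for t, vec in frames:
        best = max(models, key=lambda label: log_likelihood(models[label], vec))
        results.append((best, t))
    return results
```

In the embodiment, the `models` dict would hold both the dynamically created models read from the RAM 12 and the static models read from the ROM 13.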

FIG. 14(b) is a table listing examples of the acoustic models that the acoustic model reading unit 47 reads from the ROM 13 and the RAM 12 in S204. In FIG. 14(b), a "dynamically created acoustic model" is an acoustic model created in the embodiment of the present invention, and a "static acoustic model" is an acoustic model created in advance. The audio classification unit 22 uses, for example, the whistle acoustic model read from the RAM 12 and the cheering acoustic model read from the ROM 13 to classify the audio signal into whistle segments and cheering segments. In the first embodiment, the acoustic models used when classifying the audio signal are set in advance.

[Operation when playing back content: FIGS. 3 and 7]
When the user input unit 48 receives input from the user instructing playback of content, the classification result and time detection unit 33 reads from the HDD 8 the labels and times corresponding to the content to be played (S301), and the playlist creation unit 34 uses the labels and times to create a playlist for digest playback according to a predetermined rule (S302). The content reading unit 46 then reads the stream corresponding to the content, and the playback control unit 49 plays the relevant data (stream) according to the created playlist (S303). The predetermined rule specifies in advance which features are used for digest playback, that is, which labels are adopted to mark the portions to be played in the digest.
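One possible form of the "predetermined rule" in S302 is sketched below. The source only says that the rule fixes which labels mark digest portions; the pre-roll and post-roll margins, the merging of overlapping segments, and every name in the code are assumptions made for illustration.

```python
def build_playlist(results, wanted_labels, pre_roll=5.0, post_roll=10.0):
    """Turn (label, time) classification records into digest segments.

    results:       list of (label, time_seconds) records.
    wanted_labels: set of labels that the rule adopts for the digest.
    Returns a list of (start_seconds, end_seconds), merged where they overlap.
    """
    segments = []
    for label, t in sorted(results, key=lambda r: r[1]):
        if label not in wanted_labels:
            continue
        start, end = max(0.0, t - pre_roll), t + post_roll
        if segments and start <= segments[-1][1]:
            # Overlaps the previous segment: extend it instead of adding a new one.
            segments[-1] = (segments[-1][0], max(segments[-1][1], end))
        else:
            segments.append((start, end))
    return segments
```

A playback controller would then seek to each segment in turn, which corresponds to S303.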

As described above, the first embodiment of the present invention provides the following effects.

The acoustic model unit 21 of FIG. 2 creates, from the audio signal, an acoustic model of a specific sound contained in that audio signal, and the audio classification unit 22 classifies the audio signal using this acoustic model. In other words, acoustic models are created from the audio signal as needed, and the audio signal is classified using the created acoustic models. This eliminates the need to create acoustic models in advance and store them persistently, so the data capacity needed for the acoustic model parameters can be reduced. That is, reducing the number of acoustic models prepared in advance reduces the storage capacity needed to hold them. In addition, because an acoustic model adapted to the environment of the audio signal can be created, the classification accuracy of the audio signal improves.

The filter unit 26 passes a specific frequency band of the audio signal, the power value detection unit 27 obtains the power value of the audio signal that has passed through the filter unit 26, the peak time period detection unit 28 obtains the time periods in which the power value of the audio signal exceeds a predetermined threshold, the frequency conversion unit 29 converts the audio signal into a frequency-domain signal, and the acoustic model creation unit 30 creates an acoustic model from the frequency-domain signal (features) in the time periods exceeding the threshold. Acoustic model parameters require a large storage capacity, so many acoustic models cannot be held persistently. Therefore, instead of preparing acoustic models in advance, the filter unit 26 that passes the specific frequency band corresponding to each acoustic model is prepared, and the acoustic model is created from the audio signal using the filter unit 26. Because the storage required by the filter unit 26 is very small compared with the acoustic model parameters, storage capacity is reduced. In addition, the time periods in which peaks of the audio signal's power value exceed the threshold are detected, and the acoustic model is created using the frequency-domain signal in those periods. This removes unnecessary data so that the acoustic model is created only from the data that is needed, resulting in a high-quality acoustic model.

Normally, the audio classification unit 22 identifies and classifies the audio signal by comparing it against at least two acoustic models, so two or more acoustic models must be available. The filter unit 26 therefore includes two or more filters that each pass a different frequency band, and the acoustic model creation unit 30 creates two or more acoustic models using these filters, which the audio classification unit 22 can then use to classify the audio signal. Because the acoustic models needed for classification are created in this way, they no longer need to be held persistently as in the conventional approach, and the data capacity can be reduced.

Because the audio is classified using the frequency-domain signal produced by the frequency conversion unit 29, a digest that does not depend on audio power can be created, yielding a higher-quality digest.

As described above, the conventional technique holds acoustic models in advance; however, even when a human would recognize an input audio signal as the same kind of sound, the recognition rate was poor unless a similar audio signal had been included in the training data. In addition, the more acoustic models there were, the more storage was needed to record their parameters. According to the first embodiment of the present invention, for a stream corresponding to content stored on the HDD 8, an acoustic model is created using specific sounds from the audio signal of a single piece of content as training data, and the audio signal is then classified using that acoustic model, so that an apparatus can be realized that produces video or audio containing only the main scenes (a digest). Creating an acoustic model of a specific sound from a single piece of content improves classification accuracy over the conventional approach and reduces the storage area used for acoustic models.

(Second Embodiment)
As shown in FIG. 8, the speech classification device (DSP sound analysis unit 4) further includes a content genre extraction unit 36 (content genre extraction means) that determines the genre of content from the content, a filter determination unit 38 that determines the filter used by the filter unit 26 based on the genre of the content, and an acoustic model storage unit (acoustic model storage means) that holds acoustic models created in advance. The content genre extraction unit 36 acquires the genre of the content from the EPG or from manual input by the user, and the content and its genre are stored in association with each other. The filter determination unit 38 internally holds information indicating the relationship between each genre and its corresponding band-pass filter, and determines the filter (band-pass filter) used by the filter unit 26.

With reference to FIGS. 8 to 10 and FIG. 15, the operation of the video recording/reproducing apparatus 20 of FIG. 1 when creating an acoustic model and when classifying an audio signal will be described. The operation when reproducing content is the same as in FIG. 7, so its illustration and description are omitted.

[Operation when creating an acoustic model: FIGS. 8, 9, and 15]
The content genre extraction unit 36 acquires the genre information of the content from the EPG or the like, and the filter determination unit 38 determines the band-pass filter used by the filter unit 26 based on that genre information (S401). For example, if the content is a soccer match, that is, if the genre of the content is soccer, the whistle band-pass filter is selected. The stream corresponding to the content, obtained from the broadcast wave received by the digital tuner 1, is separated into audio data and video data by the audio separation unit 2 (S402a). After the audio decoder 3 decodes the audio data to generate an audio signal (S402b), the filter unit 26 extracts only the audio signal in a specific frequency band using the band-pass filter determined by the filter determination unit 38 (S403). Information on the band-pass filter used is recorded in the RAM 12 (S404). In addition, when the stream corresponding to the content is recorded on the HDD 8, the genre information is recorded on the HDD 8 in association with that stream.

The power value detection unit 27 then obtains the power value of the audio signal in the specific frequency band; that is, it converts the audio signal into a power signal (S405). The peak time zone detection unit 28 detects the time zones (start time and end time) in which local maxima of the audio power signal are at or above a certain threshold (S407). Meanwhile, the frequency conversion unit 29 converts the raw audio signal into a frequency-domain signal, and the feature amount extraction unit 31 extracts the frequency-domain signal in the detected time zones (S408). The extracted frequency-domain signal (feature amount) is temporarily stored in the feature amount storage unit (RAM 12) (S409). Steps S407 to S409 are repeated until the stream corresponding to the content ends. When the content end detection unit 32 detects the end of the stream (YES in S406), processing proceeds to S410: the feature amount reading unit 45 reads the feature amounts from the feature amount storage unit (RAM 12), and the acoustic model creation unit 30 creates an acoustic model using the feature amounts as learning data (S410). The created acoustic model is stored in the acoustic model storage unit (RAM 12) (S411).
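The power conversion and peak time zone detection steps (S405 and S407) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes the input samples have already passed through the band-pass filter, uses non-overlapping frames, and simplifies "local maxima at or above a threshold" to contiguous runs of frames whose power is at or above the threshold. The function names, frame length, and threshold are hypothetical.

```python
from typing import List, Tuple


def frame_power(samples: List[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers


def peak_time_zones(powers: List[float], threshold: float,
                    frame_sec: float) -> List[Tuple[float, float]]:
    """(start, end) times of runs of frames whose power is >= threshold."""
    zones = []
    start = None
    for i, p in enumerate(powers):
        if p >= threshold and start is None:
            start = i  # a zone opens at this frame
        elif p < threshold and start is not None:
            zones.append((start * frame_sec, i * frame_sec))
            start = None
    if start is not None:
        # The signal ended while still above the threshold.
        zones.append((start * frame_sec, len(powers) * frame_sec))
    return zones
```

The returned start and end times would then select which portions of the frequency-domain signal are kept as feature amounts for model creation.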

The table in FIG. 15(a) shows an example of the relationship, used by the filter determination unit 38 in S401, between the genre and the filter used when creating an acoustic model. The filter determination unit 38 determines the band-pass filter used by the filter unit 26 based on the genre information extracted by the content genre extraction unit 36. For example, when the genre is soccer, a whistle acoustic model is created using the whistle filter, and the music and gyoji (sumo referee) filters are not used. Similarly, when the genre is sumo or music, the gyoji or music acoustic model is created using the gyoji or music filter, respectively. Thus, in the second embodiment, the filter used when creating an acoustic model is determined by the genre.
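A genre-to-filter table of the kind shown in FIG. 15(a) could be represented as a simple lookup. The sketch below is an assumption-laden illustration: the pass-band frequencies are placeholder values (the patent gives no numeric bands), and returning no filter for an unknown genre is a choice made here, not behavior described in the source.

```python
from typing import Dict, List, Tuple

# Hypothetical pass bands in Hz; the patent does not state numeric values.
BANDPASS_FILTERS: Dict[str, Tuple[float, float]] = {
    "whistle": (2000.0, 4000.0),
    "gyoji": (300.0, 1500.0),
    "music": (100.0, 8000.0),
}

# Genre -> names of the filters used when creating acoustic models.
GENRE_TO_FILTERS: Dict[str, List[str]] = {
    "soccer": ["whistle"],
    "sumo": ["gyoji"],
    "music": ["music"],
}


def select_filters(genre: str) -> Dict[str, Tuple[float, float]]:
    """Return the filters to apply for a genre (empty if the genre is unknown)."""
    names = GENRE_TO_FILTERS.get(genre, [])
    return {name: BANDPASS_FILTERS[name] for name in names}
```

Keeping the genre table separate from the filter definitions means a new genre can reuse an existing filter without duplicating its parameters.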

[Operation when classifying an audio signal: FIGS. 8, 10, and 15]
The content reading unit 46 reads the stream corresponding to the content from the HDD (content storage unit) 8 (S501), and the audio separation unit 2 extracts the audio data (S502a). After the audio decoder 3 decodes the audio data to generate an audio signal (S502b), the frequency conversion unit 29 converts the audio signal into a frequency-domain signal (S503). The content reading unit 46 then reads the genre information recorded on the HDD 8 (S504); this genre information was recorded in association with the content when the content was recorded. The acoustic model reading unit 47 selects the acoustic models corresponding to the genre from the acoustic model storage unit (ROM 13) and the acoustic model storage unit (RAM 12) (S505). For example, when the genre is soccer, the acoustic models needed for a soccer match (cheering, announcer, and so on) are selected. The speech classification unit 22 calculates, from a likelihood function, the likelihood between each acoustic model read out and the input frequency-domain signal (feature amount), and records the label of the model with the maximum likelihood, together with the time, in the classification result storage unit (HDD 8) (S506).
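The maximum-likelihood step (S506) can be sketched as follows. The patent only says that a likelihood function is used; the diagonal-covariance Gaussian model, the variance floor, and all names below are assumptions chosen to keep the example small.

```python
import math
from typing import Dict, List, Sequence, Tuple

# (per-dimension mean, per-dimension variance); an assumed model form.
Model = Tuple[List[float], List[float]]


def fit_model(features: Sequence[Sequence[float]], var_floor: float = 1e-3) -> Model:
    """Estimate a diagonal Gaussian from feature vectors used as learning data."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [max(sum((f[d] - mean[d]) ** 2 for f in features) / n, var_floor)
           for d in range(dim)]
    return mean, var


def log_likelihood(model: Model, x: Sequence[float]) -> float:
    """Log density of x under the diagonal Gaussian model."""
    mean, var = model
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def classify(models: Dict[str, Model],
             frames: Sequence[Tuple[float, Sequence[float]]]) -> List[Tuple[float, str]]:
    """For each (time, feature) pair, record the time and the best-scoring label."""
    results = []
    for time_sec, feat in frames:
        label = max(models, key=lambda name: log_likelihood(models[name], feat))
        results.append((time_sec, label))
    return results
```

Because the decision is relative (the highest-scoring model wins), at least two models are needed for the result to carry information.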

FIG. 15(b) is a table showing an example of the relationship, used by the acoustic model reading unit 47 in S505, between the genre and the acoustic models used when classifying an audio signal. In FIG. 15(b), a "dynamically created acoustic model" is an acoustic model created in the embodiments of the present invention, and a "static acoustic model" is an acoustic model created in advance. The speech classification unit 22 determines the acoustic models to use based on the genre information extracted by the content genre extraction unit 36. For example, when the genre is soccer, the audio signal is classified using the whistle acoustic model and the cheering acoustic model, and the music and gyoji (sumo referee) acoustic models are not used. Similarly, when the genre is sumo, the audio signal is classified using the gyoji acoustic model and the cheering acoustic model, and when the genre is music, it is classified using the music acoustic model. Thus, in the second embodiment, the acoustic models used when classifying an audio signal are determined by the genre.
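The model selection in S505 draws on two stores: dynamically created models (RAM 12) and static models created in advance (ROM 13). A minimal sketch is given below. Treating the cheering model as static and the whistle, gyoji, and music models as dynamic is an assumption made for this example, and the dictionaries stand in for the two storage units.

```python
from typing import Any, Dict, List, NamedTuple


class ModelSelection(NamedTuple):
    dynamic: List[str]  # created from the content itself, held in RAM
    static: List[str]   # created in advance, held in ROM


# Assumed split between dynamic and static models per genre.
GENRE_TO_MODELS: Dict[str, ModelSelection] = {
    "soccer": ModelSelection(dynamic=["whistle"], static=["cheer"]),
    "sumo": ModelSelection(dynamic=["gyoji"], static=["cheer"]),
    "music": ModelSelection(dynamic=["music"], static=[]),
}


def select_models(genre: str, ram_store: Dict[str, Any],
                  rom_store: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the models for a genre from the dynamic and static stores."""
    selection = GENRE_TO_MODELS[genre]
    models = {name: ram_store[name] for name in selection.dynamic}
    models.update({name: rom_store[name] for name in selection.static})
    return models
```

Only the selected models are evaluated at classification time, which is how restricting the set by genre reduces computation.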

As described above, the second embodiment of the present invention provides the following effects.

The content genre extraction unit 36 determines the genre of the content, and the filter determination unit 38 determines the filter used by the filter unit 26 based on that genre. Determining the filter from the genre allows different filters to be used for different genres, which reduces the amount of computation in the acoustic model unit 21 and further improves the classification accuracy of the audio signal.

Providing the acoustic model storage unit (ROM 13), which holds acoustic models created in advance, makes general-purpose acoustic models created in advance available and reduces the amount of computation in the acoustic model unit 21.

(Third Embodiment)
As shown in FIG. 11, this embodiment describes the case where the two or more filters of the filter unit 26 include a band-pass filter 42 that passes a specific frequency band and a band elimination filter 41 that passes everything other than that specific frequency band.

With reference to FIGS. 11 to 13 and FIG. 16, the operation of the video recording/reproducing apparatus 20 of FIG. 1 when creating acoustic models and when classifying an audio signal will be described. The operation when reproducing content is the same as in FIG. 7, so its illustration and description are omitted.

[Operation when creating acoustic models: FIGS. 11, 12, and 16]
The stream corresponding to the content, obtained from the broadcast wave received by the digital tuner 1, is separated into audio data and video data by the audio separation unit 2 (S601a). The audio decoder 3 decodes the audio data to generate an audio signal (S601b). Using the band-pass filter 42 and the band elimination filter 41, the filter unit 26 extracts the audio signal in a specific frequency band and the audio signal outside that frequency band, respectively (S602), and the power value detection unit 27 obtains the power value of each of these two signals; that is, it converts each audio signal into a power signal (S603). The peak time zone detection unit 28 detects the time zones (start time and end time) in which local maxima of the audio power signal are at or above a certain threshold (S605). Meanwhile, the frequency conversion unit 29 converts the raw audio signal into a frequency-domain signal, and the feature amount extraction unit 31 extracts the frequency-domain signal in the detected time zones (S606). The extracted frequency-domain signal (feature amount) is temporarily stored in the feature amount storage unit (RAM 12) (S607). Steps S605 to S607 are repeated until the stream corresponding to the content ends. When the content end detection unit 32 detects the end of the stream (YES in S604), processing proceeds to S608: the feature amount reading unit 45 reads the feature amounts from the feature amount storage unit (RAM 12), and the acoustic model creation unit 30 creates acoustic models using the feature amounts as learning data (S608). One acoustic model is created for the band-pass filter 42 and one for the band elimination filter 41, and both are stored in the acoustic model storage unit (RAM 12) (S609).

Thus, in addition to the operation in the flowchart of FIG. 5, an acoustic model is also created using the band elimination filter 41; the signal passed by the band elimination filter 41 is processed in the same way as the signal passed by the band-pass filter 42. For example, when a soccer whistle band-pass filter is used as the band-pass filter 42, the band elimination filter 41 is a filter that removes only the frequency band of the whistle. With these, an acoustic model of the soccer whistle and an acoustic model of everything else can each be created.
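The complementary relationship between the two filters can be illustrated in the frequency domain. The sketch below assumes ideal (brick-wall) filters applied to a magnitude spectrum with evenly spaced bins; a real device would more likely use time-domain filters with a finite roll-off, and the function name and parameters are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_spectrum(spectrum: Sequence[float], bin_hz: float,
                   band: Tuple[float, float]) -> Tuple[List[float], List[float]]:
    """Split a magnitude spectrum into in-band and out-of-band parts.

    Returns (band_pass_output, band_elimination_output). Bin k is taken to
    represent frequency k * bin_hz, and the band limits are inclusive.
    """
    low, high = band
    passed, rejected = [], []
    for k, mag in enumerate(spectrum):
        in_band = low <= k * bin_hz <= high
        passed.append(mag if in_band else 0.0)
        rejected.append(0.0 if in_band else mag)
    return passed, rejected
```

With ideal filters the two outputs add back up to the original spectrum, which is why one pair of filters yields a "target sound" model and an "everything else" model.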

[Operation when classifying an audio signal: FIGS. 11, 13, and 16]
The content reading unit 46 reads the stream corresponding to the content from the HDD (content storage unit) 8 (S701), and the audio separation unit 2 extracts the audio data (S702a). After the audio decoder 3 decodes the audio data to generate an audio signal (S702b), the frequency conversion unit 29 converts the audio signal into a frequency-domain signal (S703). The acoustic model reading unit 47 then reads the acoustic models corresponding to the band-pass filter 42 and the band elimination filter 41 from the acoustic model storage unit (RAM 12), and the speech classification unit 22 calculates, from a likelihood function, the likelihood between each acoustic model read out and the input frequency-domain signal (feature amount), and records the label of the model with the maximum likelihood, together with the time, in the classification result storage unit (HDD 8) (S704). Here, a "label" is a classification number.

Thus, this operation differs from the flowchart of FIG. 6 only in that the acoustic model reading unit 47 reads acoustic models from the acoustic model storage unit (RAM 12) alone; the other points are the same.

As described above, the third embodiment of the present invention provides the following effects.

The two or more filters of the filter unit 26 include the band-pass filter 42, which passes a specific frequency band, and the band elimination filter 41, which passes everything outside that frequency band, and the acoustic model creation unit 30 creates one acoustic model with each. Providing both filters makes it possible to create an acoustic model corresponding to the specific frequency band and an acoustic model corresponding to everything outside it. Classifying the audio signal with these two acoustic models further improves the classification accuracy.

(Other Embodiments)
Although the present invention has been described above by way of the first to third embodiments, the statements and drawings that form part of this disclosure should not be understood as limiting the invention. Various alternative embodiments, examples, and operational techniques will be apparent to those skilled in the art from this disclosure.

In the third embodiment, as in the first embodiment, the band-pass filter and band elimination filter used when creating acoustic models, and the acoustic models used when classifying an audio signal, are set in advance. However, the present invention is not limited to this; in the third embodiment as well, the filters and acoustic models may be determined by the genre in the same manner as in the second embodiment. The tables used when determining the filters and acoustic models by genre are described below with reference to FIGS. 16(a) to 16(c).

FIG. 16(b) shows an example of the relationship between the genre and the filters used when creating acoustic models. FIG. 16(a) shows the characteristics of the band-pass filter and the band elimination filter: the band-pass filter passes signals in a specific frequency band, and the band elimination filter passes signals outside that frequency band.

The filter determination unit 38 determines the band-pass filter and band elimination filter used by the filter unit 26 based on the genre information extracted by the content genre extraction unit 36. For example, when the genre is soccer, acoustic models are created using the whistle band-pass filter and the whistle band elimination filter, and the music and gyoji (sumo referee) band-pass and band elimination filters are not used. Similarly, when the genre is sumo or music, acoustic models are created using the gyoji or music band-pass filter and band elimination filter, respectively. Thus, in the third embodiment, the band-pass filter and band elimination filter used when creating acoustic models may be determined by the genre. In this case, because two acoustic models are created using two filters (a band-pass filter and a band elimination filter), two tables are needed when determining the filters. In FIG. 16(b), the "whistle band elimination filter" is written as whistle' to distinguish it from the "whistle band-pass filter".

FIG. 16(c) shows an example of the relationship between the genre and the acoustic models used when classifying an audio signal. In FIG. 16(c), a "dynamically created acoustic model" is an acoustic model created in the embodiments of the present invention. The speech classification unit 22 determines the acoustic models to use based on the genre information extracted by the content genre extraction unit 36. For example, when the genre is soccer, the audio signal is classified using the whistle acoustic model and the whistle' acoustic model, and the music and gyoji acoustic models are not used. Here, the "whistle' acoustic model" is the acoustic model created with the whistle band elimination filter. Similarly, when the genre is sumo, the audio signal is classified using the gyoji and gyoji' acoustic models, and when the genre is music, using the music and music' acoustic models. Thus, in the third embodiment, the acoustic models used when classifying an audio signal are determined by the genre. Because acoustic models are created with both the band-pass filter and the band elimination filter, two tables are also needed at classification time: one for the acoustic models created with the band-pass filter and one for those created with the band elimination filter.

In the embodiments, the storage medium that stores the acoustic models created by the DSP sound analysis unit 4 is a RAM. However, the storage medium for acoustic models is not limited to one whose stored information is cleared when the power is turned off; it may also be a storage medium that retains the stored information even after the power is turned off.
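If the created acoustic models are to survive a power cycle, their parameters have to be written to non-volatile storage. The sketch below shows one way this could look on a general-purpose computer: parameters are serialized as JSON and written through a temporary file so that an interrupted write does not corrupt an existing model file. The file format, parameter layout, and function names are all assumptions; the patent does not describe how models are serialized.

```python
import json
import os
import tempfile
from typing import Dict, List

# model name -> parameter name -> values (an assumed layout)
ModelParams = Dict[str, Dict[str, List[float]]]


def save_acoustic_models(models: ModelParams, path: str) -> None:
    """Write model parameters to disk, replacing any previous file atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False,
                                     encoding="utf-8") as f:
        json.dump(models, f)
        tmp_path = f.name
    # os.replace swaps the finished file into place in one step.
    os.replace(tmp_path, path)


def restore_acoustic_models(path: str) -> ModelParams:
    """Read model parameters back, for example after a restart."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

Persisting models trades some of the storage saving for not having to recreate them the next time the same content is classified.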

The operation of the speech classification device described in the first to third embodiments can also be expressed as a series of processes or operations connected in time series, that is, as a speech classification method. Accordingly, in order to execute this speech classification method on a computer system, it can be configured as a computer program that specifies the functions performed by a processor or the like in the computer system. This computer program can be stored on a computer-readable recording medium. By having a computer system read the recording medium and execute the program, the speech classification method described above can be realized while controlling the computer. Here, the recording medium includes memory devices, magnetic disk devices, optical disk devices, and other devices capable of recording a program.

Thus, it should be understood that the present invention encompasses various embodiments and the like not described here. Accordingly, the present invention is limited only by the matters specifying the invention in the claims that are reasonable in light of this disclosure.

FIG. 1 is a block diagram showing the overall configuration of a video recording/reproducing apparatus including a speech classification device according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing the detailed configuration of the DSP sound analysis unit 4 (speech classification device) of FIG. 1.
FIG. 3 is a block diagram showing detailed functional blocks of the video recording/reproducing apparatus 20 of FIG. 1.
FIG. 4 is a schematic diagram showing how the audio signal changes in the course of obtaining the peak time zones from the audio signal.
FIG. 5 is a flowchart showing the flow of operation of the video recording/reproducing apparatus 20 of FIG. 1 when creating an acoustic model.
FIG. 6 is a flowchart showing the flow of operation of the video recording/reproducing apparatus 20 of FIG. 1 when classifying an audio signal.
FIG. 7 is a flowchart showing the flow of operation of the video recording/reproducing apparatus 20 of FIG. 1 when reproducing content.
FIG. 8 is a block diagram showing detailed functional blocks of the video recording/reproducing apparatus 20 of FIG. 1 according to the second embodiment.
FIG. 9 is a flowchart showing the flow of operation of the video recording/reproducing apparatus 20 of FIG. 8 when creating an acoustic model.
FIG. 10 is a flowchart showing the flow of operation of the video recording/reproducing apparatus 20 of FIG. 8 when classifying an audio signal.
FIG. 11 is a block diagram showing detailed functional blocks of the video recording/reproducing apparatus 20 of FIG. 1 according to the third embodiment.
FIG. 12 is a flowchart showing the flow of operation of the video recording/reproducing apparatus 20 of FIG. 11 when creating acoustic models.
FIG. 13 is a flowchart showing the flow of operation of the video recording/reproducing apparatus 20 of FIG. 11 when classifying an audio signal.
FIG. 14(a) is a table illustrating the filters used in S102 when the filter unit 26 extracts only the audio signal in a specific frequency band, and FIG. 14(b) is a table illustrating the acoustic models that the acoustic model reading unit 47 reads from the ROM 13 and the RAM 12 in S204.
FIG. 15(a) is a table showing an example of the relationship between genre and filter used by the filter determination unit 38 in S401, and FIG. 15(b) is a table showing an example of the relationship between genre and acoustic model used by the speech classification unit 22 in S505.
FIG. 16(a) is a diagram showing the frequency characteristics of the band-pass filter and the band elimination filter, FIG. 16(b) is a table showing an example of the relationship between the genre and the band-pass and band elimination filters used when creating acoustic models, and FIG. 16(c) is a table showing an example of the relationship between the genre and the acoustic models used when classifying an audio signal.

Explanation of symbols

1 Digital tuner
2 Audio separation unit
3 Audio decoder
4 DSP sound analysis unit
5 CPU
6 A/V decoder
7 Hard disk drive interface
8 Hard disk drive
9 Monitor
10 Speaker
11 Bus
12 RAM
13 ROM
21 Acoustic model unit (acoustic model means)
22 Speech classification unit (speech classification means)
26 Filter unit (filter means)
27 Power value detection unit (power value detection means)
28 Peak time zone detection unit (peak time zone means)
29 Frequency conversion unit (frequency conversion means)
30 Acoustic model creation unit (acoustic model creation means)
31 Feature amount extraction unit
32 Content end detection unit
33 Classification result/time detection unit
34 Playlist creation unit
35 Peak time zone
36 Content genre extraction unit (content genre extraction means)
37 Use model determination unit
38 Filter determination unit
41 Band elimination filter
42 Band-pass filter
45 Feature amount reading unit
46 Content reading unit
47 Acoustic model reading unit
48 User input unit
49 Playback control unit

Claims

An audio classification device having a function of recording content, comprising:
a dynamic acoustic model creation unit that creates a dynamic acoustic model, the dynamic acoustic model being an acoustic model that is created sequentially from the part of the audio signal contained in the content that corresponds to the frequency band extracted by the filter used for acoustic model creation, and that is used to classify the audio signal; and
an audio classification unit that classifies the audio signal using the dynamic acoustic model created by the dynamic acoustic model creation unit,
wherein the dynamic acoustic model creation unit comprises:
a filter unit that passes, out of the audio signal contained in the content, a specific audio signal component having the frequency band extracted by the filter used for acoustic model creation;
a power detection unit that detects a power value of the specific audio signal component that has passed through the filter unit;
a time detection unit that detects a time period in which the power value detected by the power detection unit exceeds a predetermined threshold; and
a frequency conversion unit that converts the audio signal contained in the content, within the time period detected by the time detection unit, from the time domain to the frequency domain,
and wherein the dynamic acoustic model creation unit creates the dynamic acoustic model using, as a feature amount, the audio signal converted into the frequency domain by the frequency conversion unit.
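
As a non-limiting illustration, the processing chain recited above (band selection, power detection, threshold-based time detection, frequency conversion, model creation) could be sketched as follows. Everything specific here is an assumption made for illustration and is not taken from the claim: the frame length, the function names, the use of a naive DFT, and the mean magnitude spectrum standing in for an acoustic model. The band-pass filtering is approximated by summing spectral power inside the pass band rather than by a time-domain filter, so the DFT is computed for every frame before the threshold test.

```python
import cmath
import math


def dft(frame):
    """Naive DFT of a real frame; returns bins 0..N/2 as complex numbers."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def band_power(spectrum, frame_len, sample_rate, low_hz, high_hz):
    """Power of the components a band-pass filter [low_hz, high_hz] would keep."""
    hz_per_bin = sample_rate / frame_len
    in_band = (
        abs(c) ** 2
        for k, c in enumerate(spectrum)
        if low_hz <= k * hz_per_bin <= high_hz
    )
    return sum(in_band) / frame_len


def peak_frame_features(signal, sample_rate, low_hz, high_hz, threshold, frame_len=64):
    """Return (frame_index, magnitude_spectrum) for frames whose in-band
    power exceeds the threshold (the detected "time period")."""
    features = []
    for i in range(len(signal) // frame_len):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        spectrum = dft(frame)  # time domain -> frequency domain
        if band_power(spectrum, frame_len, sample_rate, low_hz, high_hz) > threshold:
            features.append((i, [abs(c) for c in spectrum]))
    return features


def create_dynamic_model(features):
    """Hypothetical stand-in for an acoustic model: the mean magnitude
    spectrum of the detected frames, or None if no frame was detected."""
    if not features:
        return None
    n_bins = len(features[0][1])
    return [sum(f[1][k] for f in features) / len(features) for k in range(n_bins)]
```

In this sketch the model is built only from frames of the content being processed, which mirrors the idea of creating the model on the fly instead of storing pre-trained parameters for every kind of sound.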

The audio classification device according to claim 1, wherein
the dynamic acoustic model creation unit has, as the filter unit, a plurality of types of filters that use only band-pass filters, each filter having the function of extracting only the audio signal in the frequency band corresponding to one of the sounds in the frequency bands extracted by the filters used when creating a plurality of types of acoustic models,
the dynamic acoustic model creation unit creates, as the dynamic acoustic model and based on the plurality of types of specific audio signal components, a plurality of dynamic acoustic models, each corresponding to one of the sounds in the frequency bands extracted by the filters used when creating the plurality of types of acoustic models, and
the audio classification unit classifies the audio signal contained in the second content using the plurality of dynamic acoustic models.

The audio classification device according to claim 1, wherein the filter unit comprises a band-pass filter that passes the specific audio signal having the frequency band extracted by the filter used for acoustic model creation, and a band elimination filter that passes signals in bands other than the frequency band extracted by the filter used for acoustic model creation.

The audio classification device according to claim 1, further comprising static acoustic model storage means for holding a static acoustic model, which is an acoustic model created in advance,
wherein the audio classification unit classifies the audio signal contained in the content using the dynamic acoustic model and the static acoustic model.
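
By way of a hedged example, classification that draws on both a stored static model set and a dynamically created model set could look like the sketch below. The nearest-template rule, the Euclidean distance, and the dictionary layout (label mapped to a template feature vector) are assumptions chosen for brevity; the claim does not specify a model type or a decision rule.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(feature, dynamic_models, static_models):
    """Return the label of the closest template among both model sets.

    Assumption: if the same label appears in both sets, the dynamic
    model (built from the current content) replaces the static one.
    """
    candidates = {**static_models, **dynamic_models}
    return min(candidates, key=lambda label: distance(feature, candidates[label]))
```

Merging the two dictionaries keeps the decision rule identical for both kinds of model, so a device could ship with a small static set and extend it per content.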

The audio classification device according to claim 1, wherein
the dynamic acoustic model creation unit has, as the filter unit, a plurality of types of filters that pass a plurality of types of specific audio signal components, each corresponding to one of the sounds in the frequency bands extracted by the filters used when creating a plurality of types of acoustic models, and
the dynamic acoustic model creation unit further comprises:
a genre extraction unit that extracts, from the content, a genre determined by the EPG and by manual input from the user; and
a filter selection unit that selects, from the plurality of types of filters, a filter corresponding to the first content based on the genre extracted by the genre extraction unit.
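
For illustration only, genre-based filter selection could be sketched as a lookup from genre to a set of band-pass filter specifications. The genre names, filter labels, and cut-off frequencies below are hypothetical placeholders, not values from this publication, and the rule that a manually entered genre takes precedence over the EPG genre is likewise an assumption.

```python
# Hypothetical filter bank: genre -> list of (label, low_hz, high_hz).
# All names and frequencies are placeholders for illustration.
FILTER_BANK = {
    "sports": [("cheer", 1000.0, 4000.0)],
    "music": [("vocals", 300.0, 3400.0)],
}
DEFAULT_FILTERS = [("generic", 100.0, 8000.0)]


def extract_genre(epg_genre, user_genre=None):
    """Pick the genre; assumption: manual user input overrides the EPG."""
    return (user_genre or epg_genre or "").strip().lower()


def select_filters(genre):
    """Return the band-pass filter specs for a genre, or a default set."""
    return FILTER_BANK.get(genre, DEFAULT_FILTERS)
```

A call such as `select_filters(extract_genre("Sports"))` would then yield the filter specifications to be used when creating the dynamic acoustic models for that content.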