JP2009103945A

JP2009103945A - Video content processing device and program

Info

Publication number: JP2009103945A
Application number: JP2007275937A
Authority: JP
Inventors: Takashi Watanabe; 剛史渡辺
Original assignee: NEC Electronics Corp
Current assignee: NEC Electronics Corp
Priority date: 2007-10-24
Filing date: 2007-10-24
Publication date: 2009-05-14

Abstract

PROBLEM TO BE SOLVED: To classify video content accurately and efficiently. SOLUTION: A feature amount extracting section 130 in a server 110 extracts feature amounts of the audio data included in uploaded video content. A classification processing section 140 provides a service that classifies the uploaded video content and stores it in a content storage section 170, and a search service for the stored video content. When video content is stored and searched, the classification processing section 140 uses the feature amounts extracted by the feature amount extracting section 130 as an index for determining the classification of the video content. COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to video content processing, and more specifically to a technique for classifying video content.

With the spread of the Internet and faster communication networks, even individual users can easily upload video content to servers on the Internet. As a result, an enormous amount of video content has accumulated on the Internet, and this content is actively used, for example by being viewed by users.

To reach the desired video content, a user normally performs a search. A search engine is required to select the target video content accurately and without omission in response to the user's search operation.

Search has conventionally relied on keywords. The keyword technique was originally designed for text content. To apply it to video content, text information indicating the type of the video content is, for example, written into the thumbnail of that content. At search time, the keyword entered by the user is compared with the text information written in the thumbnails, and video content whose text information matches the keyword is extracted as the target video content.

In recent years, improvements in image analysis have also led to techniques that can retrieve images of the type indicated by a keyword. With such a technique, for the keyword "flower", for example, images that image analysis has determined to be flowers are returned as search results.
Patent Documents 1 to 4: JP 2002-91481 A; JP 2001-215982 A; JP 2006-501502 A (published Japanese translation of a PCT application); JP 2005-250472 A

There are two conceivable ways of writing text information indicating the type of video content into its thumbnail. In the first, an administrator on the server side views the uploaded video content, judges its type, and writes text information into the thumbnail according to that judgment. This requires manual work by the administrator and is impractical given the enormous amount of video content now being uploaded.

In the second, the types of video content are defined in advance, the user uploading video content is asked to specify its type, and text information corresponding to the specified type is written into the thumbnail and stored. This requires input from the user and is inefficient. Moreover, because the user's specification is not necessarily correct, later searches are prone to inaccurate results or omissions.

Furthermore, because the images in video content are usually moving images, it is difficult to identify the type of video content by image analysis.

One aspect of the present invention is a video content processing apparatus. The apparatus includes a feature amount acquisition unit that acquires a feature amount of the audio data included in video content, and a classification processing unit that uses the feature amount acquired by the feature amount acquisition unit as an index for determining the type of the video content.

The above aspect expressed as a method, a system, or a program that causes a computer to operate as the above apparatus is also effective as an aspect of the present invention.

According to the technique of the present invention, video content can be classified accurately and efficiently.

Before describing embodiments of the present invention, the principle of the invention is explained first.
Video content normally consists of moving image data and audio data. Through research, the inventor found that the audio represented by the audio data in video content exhibits different characteristics depending on the type of the video content. This is explained with reference to FIGS. 1 to 3.

FIG. 1 shows an example of the audio represented by audio data included in video content. For such audio, the following can be obtained: the main frequency band, the ratio of the power spectrum of the main frequency band to the sum of the power spectra of all frequency bands, the maximum volume, the average volume, the ratio of the average volume to the maximum volume, and the low-volume section density.

The maximum volume is the largest of the level values of all samples of the audio data included in the video content.

The low-volume section is the total length, over the whole duration of the video content, of the sections in which the volume is low, for example sections in which the absolute value of the level is at or below a predetermined threshold. The low-volume section density is the proportion of the whole duration occupied by the low-volume section.

The average volume is the average of the volume over the whole duration of the video content, excluding the low-volume section.
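
The three volume definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `volume_features`, the dictionary keys, and the choice to measure volume as the absolute sample level are not from the source, and a sample is treated as low volume whenever its absolute level is at or below the threshold.

```python
from typing import Dict, Sequence


def volume_features(samples: Sequence[float], low_threshold: float) -> Dict[str, float]:
    """Compute maximum volume, average volume, their ratio, and low-volume density.

    Assumption: volume is the absolute sample level, and a sample is "low volume"
    when that level is at or below low_threshold (no minimum run length here).
    """
    if not samples:
        raise ValueError("samples must not be empty")
    levels = [abs(s) for s in samples]
    max_volume = max(levels)
    # Samples outside the low-volume section.
    loud = [v for v in levels if v > low_threshold]
    # The average volume excludes the low-volume section.
    average_volume = sum(loud) / len(loud) if loud else 0.0
    low_density = (len(levels) - len(loud)) / len(levels)
    ratio = average_volume / max_volume if max_volume > 0 else 0.0
    return {
        "max_volume": max_volume,
        "average_volume": average_volume,
        "average_to_max": ratio,
        "low_volume_density": low_density,
    }
```

For example, `volume_features([0.0, 0.5, -1.0, 0.05], 0.1)` treats two of the four samples as low volume and averages only the remaining two.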

In the following description, the main frequency band, the ratio of the power spectrum of the main frequency band to the sum of the power spectra of all frequency bands, the maximum volume, the average volume, the ratio of the average volume to the maximum volume, and the low-volume section density are referred to as feature amounts.

FIG. 2 shows the results of obtaining these six feature amounts for several types of video content. Among the types, "News", "Vocal", "Piano", "Classic", and "CM" denote video of news broadcast on television, vocal performances, solo piano performances, classical music performed by an orchestra, and commercials broadcast on television, respectively. Within "Vocal", "F-Vocal" and "M-Vocal" denote female and male vocals. "Tone" and "Noise", included for comparison with these types, denote a monotonous sine wave and noise, respectively.

The results in FIG. 2 reveal tendencies in the audio of each type of video content. As shown in FIG. 3, each type tends, for one or more of the feature amounts, to take values within a particular range.

Having found that the feature amounts of the audio data of video content are related to the type of the video content, the inventor arrived at a technique that uses those feature amounts as an index for determining the type of the video content. Two or more of the following three kinds of feature amounts are used: frequency feature amounts, represented by the main frequency band and the ratio "power spectrum of the main frequency band / power spectrum of all frequency bands"; volume feature amounts, represented by the maximum volume and the ratio "average volume / maximum volume"; and the low-volume section density.

This technique can be applied in various ways.
For example, when video content is recorded or stored, the above feature amounts are extracted from its audio data and the type of the video content is determined from them. Text information indicating the determined type is then stored in attached information such as the thumbnail of the video content. In this way, type information can be attached to video content and used later for keyword searches and the like, without anyone having to specify the type or to view the content and enter text describing it.

The technique also applies beyond keyword search, to searching for video content similar to given video content, for example content specified by the user. In this case, for example, the feature amounts are extracted from the audio data when video content is recorded or stored, and the extracted feature amounts themselves are stored in the attached information. To search for video content similar to the given content, the feature amounts of the given content are extracted and compared with those of the stored video content to select the target content.

Of course, the technique of the present invention is not limited to these two examples. It can be applied wherever classification of video content is needed, and in each case it allows video content to be classified accurately and efficiently.

Various techniques for extracting and using feature amounts of audio data have been proposed (Patent Documents 1 to 4). None of them, however, uses the feature amounts of audio data as an index for determining the type of the video content containing that audio data, and the nature of the feature amounts they extract is also different. For example, Patent Document 1 discloses a technique that extracts feature amounts from standard speech and input speech, calculates the similarity between the two from the extracted feature amounts, and thereby performs speech recognition. To detect the human utterance "ga" in input speech, for instance, this technique extracts a feature amount (reference feature amount) from the standard speech for "ga" in advance. It then extracts a feature amount from each local part of the input speech and looks for a local part whose feature amount resembles the reference feature amount. In other words, this technique identifies the specific content of local parts of the audio, and its feature amounts represent local characteristics of the audio.

In contrast, the technique of the present invention extracts, as feature amounts, quantities that can represent the characteristics of the audio data as a whole, regardless of what any local part specifically contains. This is explained with reference to FIG. 4. In the figure, the reference audio is the audio of a news program broadcast on television, comparison audio 1 is the audio of another news broadcast, and comparison audio 2 is the audio of a concert. According to the technique of the present invention, comparison audio 1 is determined to have characteristics similar to those of the reference audio, regardless of whether the two resemble each other locally, and the video content containing each of them is determined to be of a similar type. Likewise, comparison audio 2 is determined to have characteristics different from those of the reference audio, regardless of local resemblance, and the video content containing each of them is determined to be of different types.

Embodiments of the present invention are now described on the basis of the above.
FIG. 5 shows a network system 100 according to an embodiment of the present invention. The network system 100 includes a server 110 and a plurality of terminals 190 that can connect to the server 110 via a network such as the Internet 180.

FIG. 6 shows the configuration of the server 110. In the present embodiment, the server 110 provides a storage service for video content uploaded from any terminal 190 and a search service for the stored video content. As shown in FIG. 6, the server 110 includes a user interface 120, a feature amount extraction unit 130, a classification processing unit 140, a playback unit 150, a reference data storage unit 160, and a content storage unit 170.

The user interface 120 is the interface between the terminals 190 and the server 110 and mediates transmission and reception between them. It also displays the menu of services available to the terminal 190. In the present embodiment, the user interface 120 displays the following three items as the main menu.

1. Upload video content
2. Search for video content
3. Thumbnail display

When "Upload video content" is selected from the main menu on a terminal 190, the user interface 120 receives the video content uploaded from that terminal and outputs it to the feature amount extraction unit 130.

When "Search for video content" is selected from the main menu on a terminal 190, the user interface 120 displays a list of video content types as a submenu and prompts the terminal 190 to select a type. It then outputs information indicating the type selected by the user via the terminal 190 to the classification processing unit 140, which performs a search based on that type.

When "Thumbnail display" is selected from the main menu on a terminal 190, the user interface 120 instructs the playback unit 150 to display the thumbnails of all video content stored in the content storage unit 170.

When one of the thumbnails displayed by the playback unit 150 in response to this instruction is selected, the user interface 120 instructs the playback unit 150 to play the video content corresponding to the selected thumbnail.

When the playback unit 150 plays that video content, the user interface 120 displays "Search for related content" as a submenu. If this submenu is selected on the terminal 190, the user interface 120 instructs the classification processing unit 140 to search for content related to the video content being played.

That is, in the present embodiment the server 110 provides two kinds of search service: a search for video content of a type specified by the user, and a search for content related to video content specified by the user.

When video content is uploaded from a terminal 190 via the user interface 120, the feature amount extraction unit 130 extracts feature amounts from the audio data of the video content and outputs them to the classification processing unit 140. As frequency feature amounts it extracts the main frequency band and the ratio "power spectrum of the main frequency band / sum of the power spectra of all frequency bands". As volume feature amounts it extracts the maximum volume, the average volume, and the ratio "average volume / maximum volume". It also extracts the low-volume section density. That is, six feature amounts are used in the present embodiment.

An example of how the main frequency band is extracted is now described with reference to the flowchart of FIG. 3. The main frequency band can be defined as the frequency band with the strongest power spectrum among the frequency bands of the audio included in the video content. To obtain it, the feature amount extraction unit 130 first divides the audio data into a number (divNum) of frames. One frame is the processing unit of the discrete Fourier transform, and the frame length is the number of samples in one frame.

The feature amount extraction unit 130 applies a discrete Fourier transform to the first frame (S12, S14: No) and obtains the power spectrum result for each frequency band of that frame (S16). It then applies the Fourier transform of step S16 to each subsequent frame in turn, accumulating the power spectrum per frequency band (S14: No, S16, S18, S20). This continues until all frames have been transformed (S14: Yes) and the per-band cumulative power spectrum accm(divNum) has been obtained.

The discrete Fourier transform in step S16 is a fast Fourier transform (FFT), an example of which can be expressed by equations (1) to (3).

The feature amount extraction unit 130 takes, as the main frequency band, the frequency band corresponding to the maximum of the per-band cumulative power spectrum values accm(divNum) obtained in this way.

The ratio "power spectrum of the main frequency band / sum of the power spectra of all frequency bands" is obtained as the ratio of the maximum of the cumulative values accm(divNum) to the sum of the cumulative values accm(divNum) over all frequency bands.
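
The frame-by-frame accumulation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it uses a naive discrete Fourier transform from the Python standard library in place of the FFT mentioned in the text, and the function names and the choice to drop any trailing partial frame are assumptions. The variable names `div_num`, `result`, and `accm` mirror the identifiers in the description.

```python
import cmath
from typing import List, Sequence, Tuple


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """Power spectrum of one frame for bins 0..N/2, via a naive O(N^2) DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def main_frequency_band(samples: Sequence[float], frame_len: int) -> Tuple[int, float]:
    """Return (index of the strongest band, its share of the total power)."""
    div_num = len(samples) // frame_len  # a trailing partial frame is ignored
    if div_num == 0:
        raise ValueError("need at least one full frame")
    accm = [0.0] * (frame_len // 2 + 1)
    for i in range(div_num):
        result = power_spectrum(samples[i * frame_len:(i + 1) * frame_len])
        for k, power in enumerate(result):
            accm[k] += power  # accumulate per frequency band
    peak = max(range(len(accm)), key=accm.__getitem__)
    total = sum(accm)
    ratio = accm[peak] / total if total > 0 else 0.0
    return peak, ratio
```

The returned band index `k` corresponds to a frequency of `k * sample_rate / frame_len` for a given sampling rate. A real system would likely replace `power_spectrum` with an FFT routine for speed.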

Next, an example of how the low-volume density is extracted is described. As shown in the flowchart of FIG. 8, the feature amount extraction unit 130 treats as a low-volume section any run of consecutive samples whose volume, that is, level value, is at or below the low-volume determination threshold th, provided the run is at least as long as the number of samples given by the low-volume section detection threshold minTh. It then obtains the low-volume density as the ratio of the total number of samples in the low-volume sections of the audio data to the total number of samples in the audio data.
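
The run-length rule just described can be sketched as a single pass over the samples. The function name is hypothetical, and comparing the absolute level against `th` is an assumption based on the earlier definition of the low-volume section. The parameters `th` and `min_th` correspond to th and minTh in the text.

```python
from typing import Sequence


def low_volume_density(samples: Sequence[float], th: float, min_th: int) -> float:
    """Fraction of samples lying in quiet runs of at least min_th samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    low_total = 0  # samples counted as belonging to low-volume sections
    run = 0        # length of the current run of quiet samples
    for s in samples:
        if abs(s) <= th:
            run += 1
        else:
            if run >= min_th:
                low_total += run
            run = 0
    if run >= min_th:  # a quiet run may extend to the end of the data
        low_total += run
    return low_total / len(samples)
```

Short quiet gaps, such as the pause between two words, fall below `min_th` and are therefore not counted.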

In this way, the feature amount extraction unit 130 extracts the various feature amounts from the audio data of the video content and outputs them to the classification processing unit 140.

The classification processing unit 140 performs two kinds of processing. When video content is uploaded from a terminal 190, it generates attached information for the video content and stores it in the content storage unit 170 together with the video content. When a search instruction is issued from the user interface 120, it searches the content storage unit 170 for the target content according to the instruction and has the playback unit 150 display the thumbnails.

The processing performed when video content is uploaded is described first. In this case, the classification processing unit 140 uses the feature amounts from the feature amount extraction unit 130 and the reference data stored in the reference data storage unit 160 to calculate similarities, which become part of the attached information.

The reference data storage unit 160 stores, for each of a plurality of types of video content, the feature amounts of video content serving as a reference. For example, for news it stores the feature amounts of "news" shown in FIG. 2, and for female vocalist B it stores the feature amounts of both "F-vocalB1" and "F-vocalB2" shown in FIG. 2. By way of example, the contents of FIG. 2 are stored in the reference data storage unit 160 as reference data. In the following description, the feature amounts of each type of video content stored in the reference data storage unit 160 are referred to as the "reference feature amounts" of that type.

For each type of video content, the classification processing unit 140 obtains, according to equation (4), the similarity between the feature amount from the feature amount extraction unit 130 and the reference feature amount stored in the reference data storage unit 160, separately for each kind of feature amount.

This yields, for each type of video content, six similarities: one each for the main frequency band, the ratio of the power spectrum of the main frequency band to the sum of the power spectra of all frequency bands, the maximum volume, the average volume, the ratio of the average volume to the maximum volume, and the low-volume section density.

The classification processing unit 140 then obtains a similarity score for each type of video content according to equation (5) below.

$$
\begin{aligned}
\text{similarity score} ={}& k_1 \times (\text{main frequency band similarity}) \\
&+ k_2 \times (\text{``power spectrum of the main frequency band / sum of the power spectra of all frequency bands'' similarity}) \\
&+ k_3 \times (\text{maximum volume similarity}) \\
&+ k_4 \times (\text{average volume similarity}) \\
&+ k_5 \times (\text{``average volume / maximum volume'' similarity}) \\
&+ k_6 \times (\text{low-volume section density similarity})
\end{aligned}
\tag{5}
$$

Here k1 to k6 are weighting coefficients. By adjusting them, the similarity score can be made to reflect the degree of similarity more precisely.
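
The weighted sum in equation (5) can be sketched as below. Equation (4), which defines the per-feature similarity, is not reproduced in this text, so the relative absolute difference used in `feature_distance` is purely an assumption, chosen so that a lower value means a closer match, consistent with the scores described for FIGS. 9 and 10. The feature names in `FEATURES` are also hypothetical labels.

```python
from typing import Mapping

# Hypothetical keys for the six feature amounts.
FEATURES = (
    "main_band",
    "main_band_power_ratio",
    "max_volume",
    "average_volume",
    "average_to_max",
    "low_volume_density",
)


def feature_distance(value: float, reference: float) -> float:
    """Assumed stand-in for equation (4): relative absolute difference."""
    scale = max(abs(reference), 1e-12)  # avoid division by zero
    return abs(value - reference) / scale


def similarity_score(features: Mapping[str, float],
                     reference: Mapping[str, float],
                     weights: Mapping[str, float]) -> float:
    """Weighted sum over the six per-feature similarities, as in equation (5)."""
    return sum(
        weights[name] * feature_distance(features[name], reference[name])
        for name in FEATURES
    )
```

With this stand-in, content whose feature amounts equal the reference feature amounts scores 0, and raising a weight makes mismatches in that feature count more heavily.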

FIG. 9 shows an example of the similarity scores obtained by the classification processing unit 140 when the uploaded video content is a vocal performance. As shown, the more closely the uploaded content resembles the video content it is compared with, the lower the similarity score.

FIG. 10 shows an example of the similarity scores when the uploaded video content is a classical music concert. Here too, the similarity score is lower the more closely the uploaded content resembles the video content it is compared with.

In other words, the lower the similarity score obtained by the classification processing unit 140 for a type, the more closely the uploaded video content resembles video content of that type.

The classification processing unit 140 stores the feature amounts and similarity scores of the uploaded video content as attached information in the content storage unit 170, together with the video content.

FIG. 11 shows how video content is stored in the content storage unit 170. As shown in FIG. 11, the content storage unit 170 stores each video content body in association with its attached information. The attached information consists of the six feature amounts (main frequency band, "power spectrum of the main frequency band / sum of the power spectra of all frequency bands", maximum volume, average volume, "average volume / maximum volume", and low-volume section density) and the similarity scores for each type of video content, such as "News", "F-vocalA", and "F-vocalB1".

次に、ユーザインタフェース１２０から検索指示がなされた際に、分類処理部１４０が行う処理を説明する。前述したように、本実施の形態において、サーバ１１０は、ユーザが指定した種類の映像コンテンツの検索と、ユーザが指定した映像コンテンツの関連コンテンツの検索の２種類の検索サービスを提供する。 Next, processing performed by the classification processing unit 140 when a search instruction is issued from the user interface 120 will be described. As described above, in the present embodiment, the server 110 provides two types of search services: a search for video content of a type specified by the user and a search for related content of the video content specified by the user.

分類処理部１４０は、ユーザが指定した種類の映像コンテンツの検索が指示された際に、コンテンツ記憶部１７０に格納された各映像コンテンツに対して、ユーザが指定した種類についての類似スコアをソーティングし、類似スコアが所定の閾値以下である映像コンテンツを選出する。このように選出された映像コンテンツは、類似スコアが閾値以下すなわちユーザが指定種類の映像コンテンツと閾値が示す程度以上に相似するものである。 The classification processing unit 140 sorts the similarity score for the type specified by the user for each video content stored in the content storage unit 170 when an instruction to search for the video content of the type specified by the user is given. Video content whose similarity score is below a predetermined threshold is selected. The video content selected in this way has a similarity score that is similar to the threshold value or less, that is, the user indicates the specified type of video content or more than the threshold value indicates.

The classification processing unit 140 causes the playback unit 150 to play back the thumbnails of the selected video contents in ascending order of similarity score, that is, starting with the lowest score.
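The type search described above amounts to a threshold filter followed by an ascending sort. The following is a minimal sketch under stated assumptions: the function name `search_by_type`, the nested-dictionary layout, and the scores and threshold are all hypothetical.

```python
from typing import Dict, List, Tuple


def search_by_type(
    scores_by_content: Dict[str, Dict[str, float]],
    content_type: str,
    threshold: float,
) -> List[Tuple[str, float]]:
    """Return (content id, score) pairs whose score for `content_type` is at or
    below `threshold`, ordered from the lowest score (closest match) upward."""
    hits = [
        (content_id, scores[content_type])
        for content_id, scores in scores_by_content.items()
        if content_type in scores and scores[content_type] <= threshold
    ]
    hits.sort(key=lambda pair: pair[1])  # ascending: lowest score first
    return hits


# Illustrative stored scores only; these values do not come from the embodiment.
catalog = {
    "content1": {"News": 0.2, "F-vocalA": 1.5},
    "content2": {"News": 0.9, "F-vocalA": 0.1},
    "content3": {"News": 0.05, "F-vocalA": 2.0},
}

results = search_by_type(catalog, "News", threshold=0.5)
print(results)  # [('content3', 0.05), ('content1', 0.2)]
```

The order of `results` is the order in which thumbnails would be handed to the playback unit.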

On the other hand, when a search for content related to a video content specified by the user is instructed, the classification processing unit 140 takes the feature amounts of the specified video content as the reference feature amounts and obtains a similarity score between that video content and each of the other video contents stored in the content storage unit 170. The similarity score is obtained in the same way as for an uploaded video content, except that the reference feature amounts differ.

FIG. 12 shows an example of the similarity scores obtained by the classification processing unit 140 when a search for content related to "content 2", one of the video contents stored in the content storage unit 170, is instructed. As shown in the figure, a similarity score with respect to content 2 is obtained for each video content other than content 2, such as content 1, content 3, and content 4.

The classification processing unit 140 sorts the similarity scores obtained in this way and selects the video contents whose similarity score is equal to or less than a predetermined threshold. The video content selected in this way resembles the video content specified by the user at least to the degree indicated by the threshold.
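The related-content search can be sketched as follows. The claims state that per-feature similarities are integrated by weighted addition; the per-feature measure used here (absolute difference relative to the reference value), the weights, the feature names, and all numbers are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

Features = Dict[str, float]


def similarity_score(reference: Features, candidate: Features,
                     weights: Dict[str, float]) -> float:
    """Weighted sum of per-feature differences; a lower score means more similar.
    The relative absolute difference is an assumed per-feature measure."""
    total = 0.0
    for name, weight in weights.items():
        ref = reference[name]
        scale = abs(ref) if ref != 0 else 1.0  # avoid dividing by zero
        total += weight * abs(candidate[name] - ref) / scale
    return total


def related_contents(target_id: str, library: Dict[str, Features],
                     weights: Dict[str, float],
                     threshold: float) -> List[Tuple[str, float]]:
    """Score every other content against the target and keep those at or
    below the threshold, lowest score first."""
    reference = library[target_id]
    scored = [
        (content_id, similarity_score(reference, features, weights))
        for content_id, features in library.items()
        if content_id != target_id
    ]
    return sorted((s for s in scored if s[1] <= threshold), key=lambda s: s[1])


# Two features only, with illustrative values, to keep the arithmetic visible.
library = {
    "content1": {"max_volume": 80.0, "average_volume": 40.0},
    "content2": {"max_volume": 100.0, "average_volume": 50.0},
    "content3": {"max_volume": 100.0, "average_volume": 55.0},
    "content4": {"max_volume": 50.0, "average_volume": 10.0},
}
weights = {"max_volume": 1.0, "average_volume": 1.0}

related = related_contents("content2", library, weights, threshold=0.5)
print([content_id for content_id, _ in related])  # ['content3', 'content1']
```

With content 2 as the reference, content 3 scores about 0.1 and content 1 about 0.4, so both pass the threshold of 0.5, while content 4 (about 1.3) is excluded.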

As described above, the server 110 of the present embodiment can classify video content automatically and can also search for content of a type similar to that of any specified video content.

The technique according to the present invention has been described above on the basis of an embodiment. The embodiment is illustrative, and various changes, additions, and omissions may be made without departing from the gist of the present invention. Those skilled in the art will understand that variations incorporating such changes also fall within the scope of the present invention.

For example, the embodiment described above uses five kinds of feature amounts, but the number of kinds may be increased or decreased according to the types of video content to be handled.

In the embodiment described above, the technique according to the present invention is applied to search. The technique can also be applied where the type of video content needs to be determined, or where information for determining the type of video content is to be provided. For example, a recording device may be configured so that, when recording video content, it extracts the feature amounts described above and records them together with the video content.

FIG. 1 is a diagram showing an example of audio data included in video content.
FIG. 2 is a diagram showing examples of feature amounts of various kinds of video content.
FIG. 3 is a diagram for explaining the relationship between the type of video content and its feature amounts.
FIG. 4 is a diagram for explaining how the technique according to the present invention captures the features of audio data.
FIG. 5 is a diagram showing a network system according to an embodiment of the present invention.
FIG. 6 is a diagram showing the server in the network system shown in FIG. 5.
FIG. 7 is a flowchart showing an example of the method by which the feature amount extraction unit in the server shown in FIG. 6 obtains the main frequency band.
FIG. 8 is a flowchart showing an example of the method by which the feature amount extraction unit in the server shown in FIG. 6 obtains the low-volume section density.
FIG. 9 is a diagram showing an example of the similarity scores that the classification processing unit in the server shown in FIG. 6 obtains, for each type of video content, for an uploaded video content (part 1).
FIG. 10 is a diagram showing an example of the similarity scores that the classification processing unit in the server shown in FIG. 6 obtains, for each type of video content, for an uploaded video content (part 2).
FIG. 11 is a diagram showing how video content is stored in the content storage unit of the server shown in FIG. 6.
FIG. 12 is a diagram showing an example of the similarity scores, obtained by the classification processing unit in the server shown in FIG. 6, between a video content specified by the user and the other video contents stored in the content storage unit.

Explanation of symbols

100 Network system
110 Server
120 User interface
130 Feature amount extraction unit
140 Classification processing unit
150 Playback unit
160 Reference data storage unit
170 Content storage unit
180 Internet
190 Terminal

Claims

A video content processing apparatus comprising: a feature amount acquisition unit that acquires feature amounts of audio data included in video content; and
a classification processing unit that uses the feature amounts acquired by the feature amount acquisition unit as an index for determining the type of the video content.

The video content processing apparatus according to claim 1, wherein the feature amounts include two or more of three kinds of features: a frequency feature amount, a volume feature amount, and the proportion of the sound represented by the audio data that is at or below a predetermined volume.

The video content processing apparatus according to claim 2, wherein the frequency feature amount includes a main frequency band and a ratio of a power spectrum of the frequency band to a sum of power spectra of all frequency bands.

The video content processing apparatus according to claim 2, wherein the volume feature amount includes a maximum volume, and a ratio between an average volume and the maximum volume.

The video content processing apparatus according to claim 1, wherein the classification processing unit uses the feature amounts acquired by the feature amount acquisition unit for the audio data included in a first video content and for the audio data included in a second video content to obtain a degree of similarity between the type of the first video content and the type of the second video content.

The video content processing apparatus according to claim 5, wherein the classification processing unit obtains, for each kind of feature amount, a degree of similarity between the plurality of kinds of feature amounts of the audio data included in the first video content and the plurality of kinds of feature amounts of the audio data included in the second video content, and calculates the degree of similarity between the type of the first video content and the type of the second video content by integrating the degrees of similarity obtained for the respective kinds of feature amounts.

The video content processing apparatus according to claim 6, wherein the classification processing unit integrates the degree of similarity of the types of feature amounts by weighted addition.

A program for causing a computer to execute a process of: acquiring feature amounts of audio data included in video content; and
using the acquired feature amounts as an index for determining the type of the video content.