JP5182892B2 - Voice search method, voice search device, and voice search program

Info

Publication number
JP5182892B2
Authority
JP
Japan
Prior art keywords
voice
speech
prosodic
fundamental frequency
feature
Prior art date
Legal status
Active
Application number
JP2009218455A
Other languages
Japanese (ja)
Other versions
JP2011069845A (en)
Inventor
浩太 日高
明 小島
豪 入江
信弥 中嶌
Current Assignee
EDUCATIONAL FOUNDATION OF KOKUSHIKAN
Nippon Telegraph and Telephone Corp
Original Assignee
EDUCATIONAL FOUNDATION OF KOKUSHIKAN
Nippon Telegraph and Telephone Corp
Priority date
Filing date
Publication date
Application filed by EDUCATIONAL FOUNDATION OF KOKUSHIKAN and Nippon Telegraph and Telephone Corp
Priority to JP2009218455A
Publication of JP2011069845A
Application granted
Publication of JP5182892B2
Legal status: Active (current)
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

PROBLEM TO BE SOLVED: To search an arbitrary voice section of video, audio, or music content using a reference voice section, and thereby obtain, as search results, scenes whose psychological impressions are similar to that of the reference section.
SOLUTION: The method includes the steps of: dividing the audio into sections at silent intervals (S1); extracting prosodic features related to the fundamental frequency or the degree of emphasis of each divided voice section (S2); and calculating the distance between the prosodic features of the reference voice section and the prosodic features of one or more voice sections under investigation (S3). A voice section whose similarity is equal to or higher than a prescribed value is estimated from the calculated distance (S4).
COPYRIGHT: (C) 2011, JPO & INPIT

Description

The present invention relates to a voice search method, a voice search device, and a voice search program for retrieving voice sections.

If scenes that are similar in impression to an arbitrary voice section can be detected, the technique can be applied to searching for a user's favorite scenes, which is highly convenient. For example, Non-Patent Document 1 describes a known technique for retrieving similar scenes from the time history of the fundamental frequency.

As a method for acquiring sensibility (kansei) information from speech, Patent Document 1 discloses a technique for extracting emphasized states of speech.

Non-Patent Document 2 describes a technique for generating a shortened video by focusing on laughter in the audio.

Further, Patent Document 2 describes a technique in which, when a video/audio database is searched for items highly similar to a query, the emotion (word) and emotion level evoked by each video/audio item in the database are estimated, the emotion (word) or emotion level is used as the query, and items whose emotion (word) or emotion level is highly similar are retrieved.

Patent Document 1: Japanese Patent No. 3803311
Patent Document 2: JP 2008-204193 A

Non-Patent Document 1: Isamu Saito and Nobuya Nakajo, "An Attempt at a Kansei Information Retrieval Interface: Content Retrieval Using Speech Prosodic Information," Proceedings of the 36th Annual Conference of the Institute of Image Electronics Engineers of Japan, R.3-5, 2008.
Non-Patent Document 2: Go Irie, Kota Hidaka, Naoya Miyashita, Takashi Sato, and Yukinobu Taniguchi, "'Laughter' Scene Detection Method for Rapid Browsing of Personal Videos," Journal of the Institute of Image Information and Television Engineers, vol. 62, no. 2, pp. 227-233, 2008.

However, because the method of Non-Patent Document 1 searches on the time history of the fundamental frequency, the influence of fundamental-frequency extraction errors such as double pitch and half pitch cannot be ignored; even when averaging is applied, these errors prevented the desired voice section from being detected.

In addition, the methods of Patent Document 1 and Non-Patent Document 2 concern techniques for summarizing video, audio, and music, and do not describe their use for search.

Patent Document 2 discloses an invention for searching video and audio using an emotion category (emotion word) such as "fun" or "sad" and an emotion level as the query. In that invention, the emotion words and emotion levels evoked by the video and audio in the database are estimated. By giving as a query the emotion word and emotion level of the video or audio the user wants to watch, the user can retrieve video and audio with a similar impression. However, since emotions are highly subjective and the boundaries between emotion categories are ambiguous, the user cannot always express the query as an appropriate emotion word or emotion level. Moreover, because similarity is judged on emotion words and emotion levels, which fluctuate under subjective influence, the search results the user expects cannot always be presented.

The present invention has been made in view of the above circumstances, and its object is to provide a technique for searching an arbitrary voice section of video, audio, or music content using a reference voice section, and obtaining as search results voice sections whose psychological impressions are similar to that of the reference.

To achieve the above object, the present invention focuses on prosodic features as feature quantities that indicate the impression a voice gives to a listener, and extracts voice sections that give a similar impression by calculating the distance between the prosodic features of voice sections. Speech is input as the query, its prosodic features are extracted, and video/audio voice sections with similar prosodic features are judged to be voice sections that give people a similar impression. Using speech as the query saves the user the effort of expressing an impression in words. Moreover, by using speech as the query and judging similarity with feature quantities, the subjective fluctuation caused by expressing impressions in words can be avoided.

That is, the present invention comprises the following means.

- The first means extracts the audio from the video, divides it into voice sections, computes the distance to the reference voice section using prosodic features, and estimates similar voice sections. In addition, the fundamental frequency is computed as a weighted average fundamental frequency and converted to semitones to obtain the prosodic features. With this first means, voice sections similar in psychological impression to the reference voice section can be retrieved, and the fundamental frequency can be obtained while reducing the influence of double-pitch and half-pitch errors.

- The second means computes the distance between two voice sections by dynamic time warping. With this second means, the similarity between voice sections can be obtained even when the two or more sections being compared have different lengths.

To solve the problem of Patent Document 2, the present invention uses the speech signal itself, rather than an emotion (word), directly as the query, and judges similarity at the feature level.

According to the present invention, voice sections similar in psychological impression to the reference voice section can be retrieved. Because the speech signal is used directly as the query, cases in which the user cannot express the query are avoided. Furthermore, because similarity is judged at the feature level, the results are less affected by subjective fluctuation and less likely to deviate from the user's intent.

FIG. 1 is a processing flowchart showing an example of the voice search method according to the present invention.
FIG. 2 is a diagram explaining an example of the weighted average of the fundamental frequency.
FIG. 3 is a diagram explaining an example of converting the fundamental frequency to semitones.
FIG. 4 is a diagram showing a configuration example of the voice search device according to the present invention.
FIG. 5 is a diagram showing a data configuration example of the content storage unit and the survey-target voice section prosodic feature storage unit.
FIG. 6 is a diagram showing a usage example of the voice search device.
FIG. 7 is a diagram showing an example of the user display screen during automatic search.

Embodiments of the present invention will be described below with reference to the drawings.

FIG. 1 shows the processing flow of a voice search method according to an embodiment of the present invention. Step S1 is a voice division step, in which the input voice is divided into voice sections. Step S2 is a prosodic feature extraction step, in which prosodic features are extracted. Step S3 is a distance calculation step, in which the similarity between the prosodic features of a voice section and the prosodic features of a given voice section is obtained as a distance. Step S4 is a voice section estimation step, in which it is determined whether the sections are similar.

First, in the voice division step S1, the input voice is divided into voice sections by voiced/unvoiced judgment. For example, frames from which no fundamental frequency can be extracted may be treated as unvoiced, or a threshold may be set on the correlation coefficient of the LPC prediction residual and frames at or below the threshold may be treated as unvoiced.
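The following is a minimal sketch of this division step in Python, assuming that a per-frame fundamental-frequency estimate is already available and that frames with no detectable F0 are marked 0; the frame shift and the minimum run length are illustrative assumptions rather than values fixed by this description.

```python
from typing import List, Tuple

def split_into_voice_sections(f0_per_frame: List[float],
                              frame_shift_ms: float = 10.0,
                              min_voiced_frames: int = 5) -> List[Tuple[float, float]]:
    """Group consecutive voiced frames (F0 > 0) into voice sections.

    Returns a list of (start_ms, end_ms) pairs. Frames whose F0 could not
    be extracted (value 0) are treated as unvoiced, as suggested in the text.
    """
    sections = []
    start = None
    for i, f0 in enumerate(f0_per_frame + [0.0]):    # sentinel to flush the last run
        if f0 > 0 and start is None:
            start = i
        elif f0 <= 0 and start is not None:
            if i - start >= min_voiced_frames:        # drop very short voiced runs
                sections.append((start * frame_shift_ms, i * frame_shift_ms))
            start = None
    return sections

# Example: three voiced runs, the middle one too short to keep
print(split_into_voice_sections([0, 120, 118, 119, 121, 122, 0, 0, 130, 0,
                                 90, 92, 95, 96, 97, 99, 0]))
```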

In the prosodic feature extraction step S2, the fundamental frequency is extracted by waveform processing, correlation processing, or spectral processing. Prosodic features such as the fundamental frequency are normally extracted at short intervals of several tens of milliseconds (each interval is called a window). This window is the analysis frame, and the prosodic feature extraction in step S2 may be performed at analysis-frame intervals of, for example, 10 ms or 30 ms. The LPC analysis and fundamental-frequency extraction methods are described in Reference 1 below and are known techniques, so a detailed description is omitted here.
[Reference 1] Sadaoki Furui, Digital Signal Processing, Tokai University Press, pp. 57-59.
The accent of a Japanese word can be represented by one value per mora, that is, this point pitch may be used as the prosodic feature. Alternatively, for simplicity, an amplitude-weighted average F0 over, for example, 50 ms may be used as the point pitch. The 50 ms span is not limiting; it may be 100 ms, for example.

FIG. 2 illustrates an example of the weighted average of the fundamental frequency. As shown in FIG. 2, assume that both the fundamental frequency (F0) and the amplitude are extracted at 10 ms intervals. Then, within 50 ms, five pairs of fundamental frequency and amplitude, (F1, p1) to (F5, p5), are obtained. The ordinary average of the fundamental frequency is (F1 + F2 + F3 + F4 + F5) / 5, whereas the amplitude-weighted average is (α1 × F1 + α2 × F2 + α3 × F3 + α4 × F4 + α5 × F5) / 5, where αi is the value obtained by normalizing pi to the range 0 to 1.
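As a sketch, the weighted average described above can be computed as follows. The division by the number of frames follows the formula in the text; dividing each amplitude by the maximum amplitude is one plausible normalization to the range 0 to 1 and is an assumption, since the text does not specify how pi is normalized.

```python
def weighted_average_f0(f0_values, amplitudes):
    """Amplitude-weighted average F0 over one 50 ms window (e.g. five 10 ms frames).

    Each amplitude p_i is normalized to alpha_i in [0, 1]; the weighted F0 values
    are then averaged over the number of frames, as in
    (a1*F1 + a2*F2 + a3*F3 + a4*F4 + a5*F5) / 5.
    """
    p_max = max(amplitudes)
    if p_max == 0:
        return 0.0
    alphas = [p / p_max for p in amplitudes]          # normalize amplitudes to 0..1
    return sum(a * f for a, f in zip(alphas, f0_values)) / len(f0_values)

# Example with five (F_i, p_i) pairs as in Fig. 2 (values are illustrative)
print(weighted_average_f0([110, 112, 115, 113, 111], [0.2, 0.6, 1.0, 0.8, 0.4]))
```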

Further, the fundamental frequency obtained in this way may be converted to semitones and used as the prosodic feature. FIG. 3 illustrates an example of converting the fundamental frequency to semitones.

Semitone conversion is the process of mapping a fundamental frequency to a note of the musical scale (semitone) C, C#, ..., B. It is introduced to eliminate the problems caused by double pitch (a case where twice the true fundamental frequency is extracted) and half pitch (a case where half the true frequency is extracted). One octave corresponds to exactly a doubling of the fundamental frequency, so double-pitch and half-pitch values fall on the same scale note, and semitone conversion therefore avoids the influence of these errors.

As a concrete example of the semitone conversion process, a semitone/fundamental-frequency correspondence table such as the one shown in FIG. 3 is prepared in advance in a storage device. When a fundamental frequency has been extracted, it is converted to a semitone by referring to this table; for example, it may be assigned to the semitone with the smallest squared distance. The fundamental frequency of human speech is usually said to be about 75 Hz to 450 Hz, so if the extracted value is lower or higher than the fundamental frequencies in the table of FIG. 3, the table frequencies are doubled or halved according to the octave principle, the distances are measured, and the corresponding semitone is selected.
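A minimal sketch of the conversion follows. It rounds in the log-frequency domain and folds by octaves instead of using an explicit lookup table with squared distances; the result is the same pitch class except very close to class boundaries, and the A4 = 440 Hz tuning reference is an assumption.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
C4_HZ = 261.626   # reference frequency of C4, assuming A4 = 440 Hz tuning

def f0_to_semitone(f0_hz: float) -> str:
    """Map a fundamental frequency to its pitch class C, C#, ..., B.

    Folding by octaves (mod 12) means a double-pitch or half-pitch error
    lands on the same semitone, which is exactly the robustness the text aims for.
    """
    if f0_hz <= 0:
        raise ValueError("F0 must be positive")
    semitones_from_c4 = round(12 * math.log2(f0_hz / C4_HZ))
    return NOTE_NAMES[semitones_from_c4 % 12]

# 220 Hz (A3), 440 Hz (A4) and a double-pitch error of 880 Hz all map to "A"
print(f0_to_semitone(220.0), f0_to_semitone(440.0), f0_to_semitone(880.0))
```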

The degree of emphasis may also be obtained from the emphasis probability and the calm probability of the speech and used as a prosodic feature. Specifically, the probability ratio Ps(e) / Ps(n) between the emphasis probability Ps(e) and the calm probability Ps(n) of each speech sub-paragraph may be used. The emphasis probability and the calm probability may be obtained by the method described in Patent Document 1.

In the distance calculation step S3, comparing two or more voice sections is not straightforward because the sections have different lengths. The distance is therefore obtained by dynamic time warping. Dynamic time warping is described in the aforementioned Reference 1 (Sadaoki Furui, Digital Signal Processing, Tokai University Press, pp. 162-164).
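A minimal sketch of dynamic time warping over two one-dimensional prosodic-feature sequences is shown below. The absolute-difference local cost is an assumption (the text does not fix the local distance), and no slope or band constraints are applied.

```python
def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two 1-D feature sequences
    (e.g. per-frame semitone-converted F0 values) of different lengths."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # dp[i][j] = cost of aligning seq_a[:i] with seq_b[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(seq_a[i - 1] - seq_b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],       # insertion
                                  dp[i][j - 1],       # deletion
                                  dp[i - 1][j - 1])   # match
    return dp[n][m]

# Two sections with different lengths but a similar pitch contour
print(dtw_distance([60, 62, 64, 62, 60], [60, 61, 62, 63, 64, 63, 62, 60]))
```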

In the voice section estimation step S4, the voice section for which the distance d is smallest is output. In the present invention, the input voice section is called the reference voice section, and the voice sections to be searched are called the survey-target voice sections. In the above processing, the prosodic features of one or more survey-target voice sections are read from the storage unit in which they are stored, and the distance between them and the prosodic features of the reference voice section obtained in the prosodic feature extraction step S2 is computed in the distance calculation step S3. The voice section estimation step S4 then outputs the voice section with the smallest distance, that is, the highest similarity.
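A sketch of this retrieval loop follows; the function and parameter names, the dictionary layout of the candidate sections, and the toy distance in the example are assumptions for illustration. In practice the dtw_distance function sketched above would be passed as distance_fn.

```python
def find_most_similar_section(reference_features, candidate_sections, distance_fn):
    """Return the (content_id, section_id) of the survey-target section whose
    prosodic features are closest to the reference section.

    candidate_sections: dict mapping (content_id, section_id) -> feature sequence
    distance_fn: a sequence-to-sequence distance, e.g. DTW
    """
    best_key, best_d = None, float("inf")
    for key, features in candidate_sections.items():
        d = distance_fn(reference_features, features)
        if d < best_d:
            best_key, best_d = key, d
    return best_key, best_d

# Toy example with a trivial distance on equal-length sequences
candidates = {("0001", "s0001"): [60, 62, 64], ("0001", "s0002"): [55, 55, 56]}
equal_len_dist = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
print(find_most_similar_section([60, 63, 64], candidates, equal_len_dist))
```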

According to the present invention, two or more voice sections whose prosodic features are similar are also close on a psychological-impression scale. For example, a user who is interested in a particular section of a movie or television program can search for sections similar to it.

FIG. 4 shows a configuration example of a voice search device that executes the above processing. As shown in FIG. 4, the voice search device 10 includes a central processing unit (CPU) 11. A program memory 13, a data memory 14, and a communication interface (communication I/F) 15 are connected to the CPU 11 via a bus 12.

The program memory 13 stores programs that implement the voice division unit 13a, the prosodic feature extraction unit 13b, the distance calculation unit 13c, and the voice section estimation unit 13d. The data memory 14 contains the content storage unit 14a and the survey-target voice section prosodic feature storage unit 14b. The voice division unit 13a, prosodic feature extraction unit 13b, distance calculation unit 13c, and voice section estimation unit 13d correspond, respectively, to the processing functions of the voice division step S1, prosodic feature extraction step S2, distance calculation step S3, and voice section estimation step S4 in FIG. 1.

The voice search device 10 shown in FIG. 4 searches a large amount of content (video, audio, music) for items that are close, in terms of prosodic features, to the content the user inputs as a query. The content storage unit 14a is a content database in which a large amount of content is stored. The survey-target voice section prosodic feature storage unit 14b stores the time history of the prosodic features for each voice section of the content.

FIG. 5 shows a configuration example of the data stored in the content storage unit 14a and the survey-target voice section prosodic feature storage unit 14b shown in FIG. 4.

In FIG. 5, a201 is the content table of the content storage unit 14a. It stores, in association with each other, the file path indicating the location of each content item (file) and a content ID that uniquely identifies it.

a202 is the voice section table for content ID = 0001 in the content storage unit 14a. In this example, a voice section table is prepared for each content ID. Since one content item has a plurality of voice sections, the voice section ID that uniquely identifies each section, the file path of the audio file for that section, and the time position (start and end times) of the section within the content are stored in association with each other in the voice section table a202.

a203 is an example of the prosodic feature data stored in the survey-target voice section prosodic feature storage unit 14b, showing the prosodic features for content ID = 0001 and voice section ID = s0001. As described for the prosodic feature extraction step S2 in FIG. 1, prosodic features are extracted, for example, every 50 ms, so a plurality of prosodic features are obtained for each voice section. Accordingly, as shown at a203 in FIG. 5, a fundamental frequency is stored as a prosodic feature every 50 ms for voice section ID = s0001.

Since these tables can be cross-referenced like a relational database, after the similarity of a voice section has been determined from the prosodic features in a203, it is possible to trace which voice section (ID) of which content (ID) that section belongs to.
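The following sketch shows, with in-memory Python dictionaries standing in for the relational tables of FIG. 5, how a matched feature entry can be traced back to its content and time position; all IDs, file paths, and values are illustrative placeholders.

```python
# Content table a201: content ID -> file path
content_table = {
    "0001": "/contents/movie_0001.mpg",
}

# Voice section table a202 (one per content ID): section ID -> path and time position
voice_section_table = {
    "0001": {
        "s0001": {"path": "/contents/0001/s0001.wav",
                  "start": "00:01:23", "end": "00:01:41"},
    },
}

# Prosodic feature table a203: (content ID, section ID) -> F0 every 50 ms
prosodic_feature_table = {
    ("0001", "s0001"): [182.0, 190.5, 201.3, 188.7],
}

# Cross-reference: from a matched feature entry back to the scene it came from
content_id, section_id = ("0001", "s0001")
section = voice_section_table[content_id][section_id]
print(content_table[content_id], section["start"], section["end"])
```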

A case in which a user is interested in a scene and searches for similar scenes will now be described. When the medium is video, the voice division unit 13a in FIG. 4 extracts the audio and then applies the same processing as in the voice division step S1. Since music is processed in the same way as speech, the term "voice" in the present invention includes musical sounds such as music in general.

The designated scene is stored in the content storage unit 14a; the prosodic feature extraction unit 13b extracts the prosodic features of the voice information and voice sections divided by the voice division unit 13a; the distance calculation unit 13c computes the distance to the prosodic features stored in the survey-target voice section prosodic feature storage unit 14b; and the voice section estimation unit 13d identifies the voice section with the smallest distance. A scene synchronized with that voice section is then extracted from the content storage unit 14a and presented to the user.

For example, as shown in FIG. 6, a search request may be sent from a personal computer (PC) at the user terminal 20 to the voice search device 10 via a network or the like. In that case, the search request may consist of:
- the media (video, audio, or the like), and
- the corresponding time information within that media.

Alternatively, when the user terminal is itself the voice search device 10 shown in FIG. 4, a configuration that does not require a network may be used.

The similar-scene search may also be performed automatically, without waiting for a search by the user. For example, the prosodic features of the survey-target voice sections are stored in advance in the survey-target voice section prosodic feature storage unit 14b in association with affective states, for example emotions such as fun, happiness, anger, disgust, resignation, sadness, calm, and surprise. Furthermore, if a "fun" level is assigned, for example, a scene can be automatically estimated to be a more enjoyable one from the prosodic features of more enjoyable scenes. For example, while a "quite fun" scene is displayed on the user's terminal, a "quite fun" scene from other content may be automatically inserted, as shown on the user display screen in FIG. 7. If the inserted scene arouses the user's interest, the user can respond by, for example, switching to viewing it. In this way, a device that automatically recommends similar scenes is realized.

When the prosodic features of the survey-target voice sections are stored in the survey-target voice section prosodic feature storage unit 14b in association with affective states, it is not necessary to associate an affective state with every single voice section of every content item; it suffices to prepare manually, for example by input from an input device, a model (table) that associates a small set of prosodic-feature time-history patterns with affective states. Once this model has been input, affective states can be assigned automatically to the remaining voice sections by referring to it. An HMM (Hidden Markov Model), for example, can be used as the model.
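As one possible realization of such a model, the sketch below trains one Gaussian HMM per affective state on a small labeled set of prosodic-feature time histories and labels a new voice section with the state whose model scores it highest. The use of the third-party hmmlearn library, the number of hidden states, and the toy training data are all assumptions for illustration; the text above only names the HMM as an example.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_affect_models(labeled_sequences, n_states=2):
    """labeled_sequences: dict mapping an affect label (e.g. 'fun', 'calm')
    to a list of F0 sequences, one per manually labeled voice section."""
    models = {}
    for label, seqs in labeled_sequences.items():
        X = np.concatenate([np.asarray(s, dtype=float).reshape(-1, 1) for s in seqs])
        lengths = [len(s) for s in seqs]
        m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify_section(models, f0_sequence):
    """Assign the affect label whose HMM gives the highest log-likelihood."""
    X = np.asarray(f0_sequence, dtype=float).reshape(-1, 1)
    return max(models, key=lambda label: models[label].score(X))

# Toy example: rising contours labeled 'fun', flat contours labeled 'calm'
models = train_affect_models({
    "fun":  [[180, 200, 230, 260], [170, 210, 240, 270, 300]],
    "calm": [[150, 152, 149, 151], [148, 150, 151, 149, 150]],
})
print(classify_section(models, [175, 205, 235, 265]))
```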

Assuming that the above processing has produced a model that associates prosodic-feature time-history patterns with affective states, the usage shown in FIG. 7 can then be realized (as sketched below) by:
(1) extracting the prosodic features of the voice section of the content the user is viewing,
(2) obtaining the affective state of the extracted prosodic features by means of the model, and
(3) retrieving content corresponding to the obtained affective state from the content storage unit 14a.

The above voice search processing can be implemented by a computer and a software program, and the program can be recorded on a computer-readable recording medium or provided over a network.

10 Voice search device
11 Central processing unit (CPU)
12 Bus
13 Program memory
13a Voice division unit
13b Prosodic feature extraction unit
13c Distance calculation unit
13d Voice section estimation unit
14 Data memory
14a Content storage unit
14b Survey-target voice section prosodic feature storage unit
15 Communication interface

Claims (5)

1. A voice search method in which a computer searches for voice sections, the method comprising:
a voice division step of dividing voice into a plurality of voice sections at silent sections;
a prosodic feature extraction step of extracting prosodic features from each voice section in units of analysis frames;
a distance calculation step of calculating a distance between the prosodic features of a reference voice section extracted in the prosodic feature extraction step and the prosodic features of one or more survey-target voice sections stored in advance, for each voice section of the survey-target voice, in a survey-target voice section prosodic feature storage unit; and
a voice section estimation step of estimating, from the calculated distance, a voice section whose similarity is equal to or greater than a predetermined value,
wherein the prosodic feature extraction step includes a weighted average fundamental frequency extraction step of extracting a fundamental frequency by a weighted average or by a point pitch, and a semitone conversion step of converting the fundamental frequency to semitones, the semitone-converted fundamental frequency being used as the prosodic feature.
2. The voice search method according to claim 1, wherein, in the distance calculation step, the distance is calculated based on dynamic time warping.
3. A voice search device that searches for voice sections, comprising:
a voice division unit that divides voice into a plurality of voice sections at silent sections;
a prosodic feature extraction unit that extracts prosodic features from each voice section in units of analysis frames;
a survey-target voice section prosodic feature storage unit that stores, in advance, prosodic features for each voice section of the survey-target voice;
a distance calculation unit that calculates a distance between the prosodic features of a reference voice section extracted by the prosodic feature extraction unit and the prosodic features of one or more survey-target voice sections stored in the survey-target voice section prosodic feature storage unit; and
a voice section estimation unit that estimates, from the calculated distance, a voice section whose similarity is equal to or greater than a predetermined value,
wherein the prosodic feature extraction unit includes a weighted average fundamental frequency extraction unit that extracts a fundamental frequency by a weighted average or by a point pitch, and a semitone conversion unit that converts the fundamental frequency to semitones, the semitone-converted fundamental frequency being used as the prosodic feature.
4. The voice search device according to claim 3, wherein the distance calculation unit calculates the distance based on dynamic time warping.
5. A voice search program for causing a computer to execute the voice search method according to claim 1 or claim 2.
JP2009218455A (filed 2009-09-24, priority 2009-09-24): Voice search method, voice search device, and voice search program. Granted as JP5182892B2, status Active.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
JP2009218455A | 2009-09-24 | 2009-09-24 | Voice search method, voice search device, and voice search program (JP5182892B2)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
JP2009218455A | 2009-09-24 | 2009-09-24 | Voice search method, voice search device, and voice search program (JP5182892B2)

Publications (2)

Publication Number | Publication Date
JP2011069845A (en) | 2011-04-07
JP5182892B2 | 2013-04-17

Family

ID=44015220

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
JP2009218455A (Active, granted as JP5182892B2) | Voice search method, voice search device, and voice search program | 2009-09-24 | 2009-09-24

Country Status (1)

Country Link
JP (1) JP5182892B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109087669A (en) * 2018-10-23 2018-12-25 腾讯科技(深圳)有限公司 Audio similarity detection method, device, storage medium and computer equipment

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598502A (en) * 2014-04-22 2015-05-06 腾讯科技(北京)有限公司 Method, device and system for obtaining background music information in played video
CN105430494A (en) * 2015-12-02 2016-03-23 百度在线网络技术(北京)有限公司 Method and device for identifying audio from video in video playback equipment
CN106601259B (en) * 2016-12-13 2021-04-06 北京奇虎科技有限公司 Information recommendation method and device based on voiceprint search
CN108198076A (en) * 2018-02-05 2018-06-22 深圳市资本在线金融信息服务有限公司 A kind of financial investment method, apparatus, terminal device and storage medium
WO2024176327A1 (en) * 2023-02-21 2024-08-29 ハイラブル株式会社 Information processing device, information processing method, and program

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB9918611D0 (en) * 1999-08-07 1999-10-13 Sibelius Software Ltd Music database searching
JP3730144B2 (en) * 2001-08-03 2005-12-21 日本電信電話株式会社 Similar music search device and method, similar music search program and recording medium thereof
JP4698606B2 (en) * 2004-12-10 2011-06-08 パナソニック株式会社 Music processing device
JP2007041234A (en) * 2005-08-02 2007-02-15 Univ Of Tokyo Method for deducing key of music sound signal, and apparatus for deducing key
JP2009058548A (en) * 2007-08-30 2009-03-19 Oki Electric Ind Co Ltd Speech retrieval device


Also Published As

Publication number | Publication date
JP2011069845A (en) | 2011-04-07

Similar Documents

Publication Publication Date Title
CN110557589B (en) System and method for integrating recorded content
JP5142769B2 (en) Voice data search system and voice data search method
JP5182892B2 (en) Voice search method, voice search device, and voice search program
JP5024154B2 (en) Association apparatus, association method, and computer program
CN102956230B (en) The method and apparatus that song detection is carried out to audio signal
JPWO2005069171A1 (en) Document association apparatus and document association method
JP2006501502A (en) System and method for generating audio thumbnails of audio tracks
CN1463419A (en) Synchronizing text/visual information with audio playback
US11354754B2 (en) Generating self-support metrics based on paralinguistic information
JP5218766B2 (en) Rights information extraction device, rights information extraction method and program
Zewoudie et al. The use of long-term features for GMM-and i-vector-based speaker diarization systems
CN108986843B (en) Audio data processing method and device, medium and computing equipment
Baumeister et al. The influence of alcoholic intoxication on the fundamental frequency of female and male speakers
JPH10307580A (en) Music searching method and device
JP2014119716A (en) Interaction control method and computer program for interaction control
JP2010117528A (en) Vocal quality change decision device, vocal quality change decision method and vocal quality change decision program
JP4601643B2 (en) Signal feature extraction method, signal search method, signal feature extraction device, computer program, and recording medium
JP5196114B2 (en) Speech recognition apparatus and program
CN112397048B (en) Speech synthesis pronunciation stability evaluation method, device and system and storage medium
JP2011170622A (en) Content providing system, content providing method, and content providing program
Lee et al. Detecting music in ambient audio by long-window autocorrelation
JP3934556B2 (en) Method and apparatus for extracting signal identifier, method and apparatus for creating database from signal identifier, and method and apparatus for referring to search time domain signal
Shih et al. Multidimensional humming transcription using a statistical approach for query by humming systems
JP2006323008A (en) Musical piece search system and musical piece search method
JP2004145161A (en) Speech database registration processing method, speech generation source recognizing method, speech generation section retrieving method, speech database registration processing device, speech generation source recognizing device, speech generation section retrieving device, program therefor, and recording medium for same program

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20110829

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A821

Effective date: 20110829

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20111125

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20111118

RD02 Notification of acceptance of power of attorney

Free format text: JAPANESE INTERMEDIATE CODE: A7422

Effective date: 20111221

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20121011

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20121016

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20121213

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20130108

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20130110

R150 Certificate of patent or registration of utility model

Ref document number: 5182892

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150


FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20160125

Year of fee payment: 3

S531 Written request for registration of change of domicile

Free format text: JAPANESE INTERMEDIATE CODE: R313531

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

S111 Request for change of ownership or part of ownership

Free format text: JAPANESE INTERMEDIATE CODE: R313117

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350