JP5182892B2 - Voice search method, voice search device, and voice search program

Info

Publication number
JP5182892B2
Authority
JP
Japan
Prior art keywords
voice
speech
prosodic
fundamental frequency
feature
Prior art date
Legal status
Active
Application number
JP2009218455A
Other languages
Japanese (ja)
Other versions
JP2011069845A (en)
Inventor
浩太 日高
明 小島
豪 入江
信弥 中嶌
Current Assignee
EDUCATIONAL FOUNDATION OF KOKUSHIKAN
Nippon Telegraph and Telephone Corp
Original Assignee
EDUCATIONAL FOUNDATION OF KOKUSHIKAN
Nippon Telegraph and Telephone Corp
Priority date
Filing date
Publication date
Application filed by EDUCATIONAL FOUNDATION OF KOKUSHIKAN and Nippon Telegraph and Telephone Corp
Priority to JP2009218455A
Publication of JP2011069845A
Application granted
Publication of JP5182892B2
Legal status: Active (current)
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

PROBLEM TO BE SOLVED: To search an arbitrary voice section of video, audio, or music content using a reference voice section, and thereby obtain, as search results, scenes whose psychological impressions are similar to that of the reference section.
SOLUTION: The method includes the steps of: dividing the audio into sections at silent intervals (S1); extracting prosodic features related to the fundamental frequency or the degree of emphasis of each divided voice section (S2); and calculating the distance between the prosodic features of the reference voice section and the prosodic features of one or more voice sections under investigation (S3). A voice section whose similarity is equal to or higher than a prescribed value is estimated from the calculated distance (S4).
COPYRIGHT: (C) 2011, JPO & INPIT

Description

The present invention relates to a voice search method, a voice search device, and a voice search program for retrieving voice sections.

If scenes that are similar in impression to an arbitrary voice section can be detected, the technique can be applied to searching for a user's favorite scenes, which is highly convenient. For example, Non-Patent Document 1 describes a known technique for retrieving similar scenes from the time history of the fundamental frequency.

As a method for acquiring sensibility (kansei) information from speech, Patent Document 1 discloses a technique for extracting emphasized states of speech.

Non-Patent Document 2 describes a technique for generating a shortened video by focusing on laughter in the audio.

Further, Patent Document 2 describes a technique in which, when a video/audio database is searched for items highly similar to a query, the emotion (word) and emotion level evoked by each video/audio item in the database are estimated, the emotion (word) or emotion level is used as the query, and items whose emotion (word) or emotion level is highly similar are retrieved.

Patent Document 1: Japanese Patent No. 3803311
Patent Document 2: JP 2008-204193 A

Non-Patent Document 1: Isamu Saito and Nobuya Nakajo, "An Attempt at a Kansei Information Retrieval Interface: Content Retrieval Using Speech Prosodic Information," Proceedings of the 36th Annual Conference of the Institute of Image Electronics Engineers of Japan, R.3-5, 2008.
Non-Patent Document 2: Go Irie, Kota Hidaka, Naoya Miyashita, Takashi Sato, and Yukinobu Taniguchi, "'Laughter' Scene Detection Method for Rapid Browsing of Personal Videos," Journal of the Institute of Image Information and Television Engineers, vol. 62, no. 2, pp. 227-233, 2008.

However, because the method of Non-Patent Document 1 searches on the time history of the fundamental frequency, the influence of fundamental-frequency extraction errors such as double pitch and half pitch cannot be ignored; even when averaging is applied, these errors prevented the desired voice section from being detected.

In addition, the methods of Patent Document 1 and Non-Patent Document 2 concern techniques for summarizing video, audio, and music, and do not describe their use for search.

Patent Document 2 discloses an invention for searching video and audio using an emotion category (emotion word) such as "fun" or "sad" and an emotion level as the query. In that invention, the emotion words and emotion levels evoked by the video and audio in the database are estimated. By giving as a query the emotion word and emotion level of the video or audio the user wants to watch, the user can retrieve video and audio with a similar impression. However, since emotions are highly subjective and the boundaries between emotion categories are ambiguous, the user cannot always express the query as an appropriate emotion word or emotion level. Moreover, because similarity is judged on emotion words and emotion levels, which fluctuate under subjective influence, the search results the user expects cannot always be presented.

The present invention has been made in view of the above circumstances, and its object is to provide a technique for searching an arbitrary voice section of video, audio, or music content using a reference voice section, and obtaining as search results voice sections whose psychological impressions are similar to that of the reference.

To achieve the above object, the present invention focuses on prosodic features as feature quantities that indicate the impression a voice gives to a listener, and extracts voice sections that give a similar impression by calculating the distance between the prosodic features of voice sections. Speech is input as the query, its prosodic features are extracted, and video/audio voice sections with similar prosodic features are judged to be voice sections that give people a similar impression. Using speech as the query saves the user the effort of expressing an impression in words. Moreover, by using speech as the query and judging similarity with feature quantities, the subjective fluctuation caused by expressing impressions in words can be avoided.

That is, the present invention comprises the following means.

- The first means extracts the audio from the video, divides it into voice sections, computes the distance to the reference voice section using prosodic features, and estimates similar voice sections. In addition, the fundamental frequency is computed as a weighted average fundamental frequency and converted to semitones to obtain the prosodic features. With this first means, voice sections similar in psychological impression to the reference voice section can be retrieved, and the fundamental frequency can be obtained while reducing the influence of double-pitch and half-pitch errors.

- The second means computes the distance between two voice sections by dynamic time warping. With this second means, the similarity between voice sections can be obtained even when the two or more sections being compared have different lengths.

To solve the problem of Patent Document 2, the present invention uses the speech signal itself, rather than an emotion (word), directly as the query, and judges similarity at the feature level.

According to the present invention, voice sections similar in psychological impression to the reference voice section can be retrieved. Because the speech signal is used directly as the query, cases in which the user cannot express the query are avoided. Furthermore, because similarity is judged at the feature level, the results are less affected by subjective fluctuation and less likely to deviate from the user's intent.

FIG. 1 is a processing flowchart showing an example of the voice search method according to the present invention.
FIG. 2 is a diagram explaining an example of the weighted average of the fundamental frequency.
FIG. 3 is a diagram explaining an example of converting the fundamental frequency to semitones.
FIG. 4 is a diagram showing a configuration example of the voice search device according to the present invention.
FIG. 5 is a diagram showing a data configuration example of the content storage unit and the survey-target voice section prosodic feature storage unit.
FIG. 6 is a diagram showing a usage example of the voice search device.
FIG. 7 is a diagram showing an example of the user display screen during automatic search.

Embodiments of the present invention will be described below with reference to the drawings.

FIG. 1 shows the processing flow of a voice search method according to an embodiment of the present invention. Step S1 is a voice division step, in which the input voice is divided into voice sections. Step S2 is a prosodic feature extraction step, in which prosodic features are extracted. Step S3 is a distance calculation step, in which the similarity between the prosodic features of a voice section and the prosodic features of a given voice section is obtained as a distance. Step S4 is a voice section estimation step, in which it is determined whether the sections are similar.

First, in the voice division step S1, the input voice is divided into voice sections by voiced/unvoiced judgment. For example, frames from which no fundamental frequency can be extracted may be treated as unvoiced, or a threshold may be set on the correlation coefficient of the LPC prediction residual and frames at or below the threshold may be treated as unvoiced.
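The following is a minimal sketch of this division step in Python, assuming that a per-frame fundamental-frequency estimate is already available and that frames with no detectable F0 are marked 0; the frame shift and the minimum run length are illustrative assumptions rather than values fixed by this description.

```python
from typing import List, Tuple

def split_into_voice_sections(f0_per_frame: List[float],
                              frame_shift_ms: float = 10.0,
                              min_voiced_frames: int = 5) -> List[Tuple[float, float]]:
    """Group consecutive voiced frames (F0 > 0) into voice sections.

    Returns a list of (start_ms, end_ms) pairs. Frames whose F0 could not
    be extracted (value 0) are treated as unvoiced, as suggested in the text.
    """
    sections = []
    start = None
    for i, f0 in enumerate(f0_per_frame + [0.0]):    # sentinel to flush the last run
        if f0 > 0 and start is None:
            start = i
        elif f0 <= 0 and start is not None:
            if i - start >= min_voiced_frames:        # drop very short voiced runs
                sections.append((start * frame_shift_ms, i * frame_shift_ms))
            start = None
    return sections

# Example: three voiced runs, the middle one too short to keep
print(split_into_voice_sections([0, 120, 118, 119, 121, 122, 0, 0, 130, 0,
                                 90, 92, 95, 96, 97, 99, 0]))
```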

In the prosodic feature extraction step S2, the fundamental frequency is extracted by waveform processing, correlation processing, or spectral processing. Prosodic features such as the fundamental frequency are normally extracted at short intervals of several tens of milliseconds (each interval is called a window). This window is the analysis frame, and the prosodic feature extraction in step S2 may be performed at analysis-frame intervals of, for example, 10 ms or 30 ms. The LPC analysis and fundamental-frequency extraction methods are described in Reference 1 below and are known techniques, so a detailed description is omitted here.
[Reference 1] Sadaoki Furui, Digital Signal Processing, Tokai University Press, pp. 57-59.
The accent of a Japanese word can be represented by one value per mora, that is, this point pitch may be used as the prosodic feature. Alternatively, for simplicity, an amplitude-weighted average F0 over, for example, 50 ms may be used as the point pitch. The 50 ms span is not limiting; it may be 100 ms, for example.

FIG. 2 illustrates an example of the weighted average of the fundamental frequency. As shown in FIG. 2, assume that both the fundamental frequency (F0) and the amplitude are extracted at 10 ms intervals. Then, within 50 ms, five pairs of fundamental frequency and amplitude, (F1, p1) to (F5, p5), are obtained. The ordinary average of the fundamental frequency is (F1 + F2 + F3 + F4 + F5) / 5, whereas the amplitude-weighted average is (α1 × F1 + α2 × F2 + α3 × F3 + α4 × F4 + α5 × F5) / 5, where αi is the value obtained by normalizing pi to the range 0 to 1.
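As a sketch, the weighted average described above can be computed as follows. The division by the number of frames follows the formula in the text; dividing each amplitude by the maximum amplitude is one plausible normalization to the range 0 to 1 and is an assumption, since the text does not specify how pi is normalized.

```python
def weighted_average_f0(f0_values, amplitudes):
    """Amplitude-weighted average F0 over one 50 ms window (e.g. five 10 ms frames).

    Each amplitude p_i is normalized to alpha_i in [0, 1]; the weighted F0 values
    are then averaged over the number of frames, as in
    (a1*F1 + a2*F2 + a3*F3 + a4*F4 + a5*F5) / 5.
    """
    p_max = max(amplitudes)
    if p_max == 0:
        return 0.0
    alphas = [p / p_max for p in amplitudes]          # normalize amplitudes to 0..1
    return sum(a * f for a, f in zip(alphas, f0_values)) / len(f0_values)

# Example with five (F_i, p_i) pairs as in Fig. 2 (values are illustrative)
print(weighted_average_f0([110, 112, 115, 113, 111], [0.2, 0.6, 1.0, 0.8, 0.4]))
```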

Further, the fundamental frequency obtained in this way may be converted to semitones and used as the prosodic feature. FIG. 3 illustrates an example of converting the fundamental frequency to semitones.

Semitone conversion is the process of mapping a fundamental frequency to a note of the musical scale (semitone) C, C#, ..., B. It is introduced to eliminate the problems caused by double pitch (a case where twice the true fundamental frequency is extracted) and half pitch (a case where half the true frequency is extracted). One octave corresponds to exactly a doubling of the fundamental frequency, so double-pitch and half-pitch values fall on the same scale note, and semitone conversion therefore avoids the influence of these errors.

As a concrete example of the semitone conversion process, a semitone/fundamental-frequency correspondence table such as the one shown in FIG. 3 is prepared in advance in a storage device. When a fundamental frequency has been extracted, it is converted to a semitone by referring to this table; for example, it may be assigned to the semitone with the smallest squared distance. The fundamental frequency of human speech is usually said to be about 75 Hz to 450 Hz, so if the extracted value is lower or higher than the fundamental frequencies in the table of FIG. 3, the table frequencies are doubled or halved according to the octave principle, the distances are measured, and the corresponding semitone is selected.
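A minimal sketch of the conversion follows. It rounds in the log-frequency domain and folds by octaves instead of using an explicit lookup table with squared distances; the result is the same pitch class except very close to class boundaries, and the A4 = 440 Hz tuning reference is an assumption.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
C4_HZ = 261.626   # reference frequency of C4, assuming A4 = 440 Hz tuning

def f0_to_semitone(f0_hz: float) -> str:
    """Map a fundamental frequency to its pitch class C, C#, ..., B.

    Folding by octaves (mod 12) means a double-pitch or half-pitch error
    lands on the same semitone, which is exactly the robustness the text aims for.
    """
    if f0_hz <= 0:
        raise ValueError("F0 must be positive")
    semitones_from_c4 = round(12 * math.log2(f0_hz / C4_HZ))
    return NOTE_NAMES[semitones_from_c4 % 12]

# 220 Hz (A3), 440 Hz (A4) and a double-pitch error of 880 Hz all map to "A"
print(f0_to_semitone(220.0), f0_to_semitone(440.0), f0_to_semitone(880.0))
```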

The degree of emphasis may also be obtained from the emphasis probability and the calm probability of the speech and used as a prosodic feature. Specifically, the probability ratio Ps(e) / Ps(n) between the emphasis probability Ps(e) and the calm probability Ps(n) of each speech sub-paragraph may be used. The emphasis probability and the calm probability may be obtained by the method described in Patent Document 1.

In the distance calculation step S3, comparing two or more voice sections is not straightforward because the sections have different lengths. The distance is therefore obtained by dynamic time warping. Dynamic time warping is described in the aforementioned Reference 1 (Sadaoki Furui, Digital Signal Processing, Tokai University Press, pp. 162-164).
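A minimal sketch of dynamic time warping over two one-dimensional prosodic-feature sequences is shown below. The absolute-difference local cost is an assumption (the text does not fix the local distance), and no slope or band constraints are applied.

```python
def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two 1-D feature sequences
    (e.g. per-frame semitone-converted F0 values) of different lengths."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # dp[i][j] = cost of aligning seq_a[:i] with seq_b[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(seq_a[i - 1] - seq_b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],       # insertion
                                  dp[i][j - 1],       # deletion
                                  dp[i - 1][j - 1])   # match
    return dp[n][m]

# Two sections with different lengths but a similar pitch contour
print(dtw_distance([60, 62, 64, 62, 60], [60, 61, 62, 63, 64, 63, 62, 60]))
```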

In the voice section estimation step S4, the voice section for which the distance d is smallest is output. In the present invention, the input voice section is called the reference voice section, and the voice sections to be searched are called the survey-target voice sections. In the above processing, the prosodic features of one or more survey-target voice sections are read from the storage unit in which they are stored, and the distance between them and the prosodic features of the reference voice section obtained in the prosodic feature extraction step S2 is computed in the distance calculation step S3. The voice section estimation step S4 then outputs the voice section with the smallest distance, that is, the highest similarity.
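A sketch of this retrieval loop follows; the function and parameter names, the dictionary layout of the candidate sections, and the toy distance in the example are assumptions for illustration. In practice the dtw_distance function sketched above would be passed as distance_fn.

```python
def find_most_similar_section(reference_features, candidate_sections, distance_fn):
    """Return the (content_id, section_id) of the survey-target section whose
    prosodic features are closest to the reference section.

    candidate_sections: dict mapping (content_id, section_id) -> feature sequence
    distance_fn: a sequence-to-sequence distance, e.g. DTW
    """
    best_key, best_d = None, float("inf")
    for key, features in candidate_sections.items():
        d = distance_fn(reference_features, features)
        if d < best_d:
            best_key, best_d = key, d
    return best_key, best_d

# Toy example with a trivial distance on equal-length sequences
candidates = {("0001", "s0001"): [60, 62, 64], ("0001", "s0002"): [55, 55, 56]}
equal_len_dist = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
print(find_most_similar_section([60, 63, 64], candidates, equal_len_dist))
```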

According to the present invention, two or more voice sections whose prosodic features are similar are also close on a psychological-impression scale. For example, a user who is interested in a particular section of a movie or television program can search for sections similar to it.

FIG. 4 shows a configuration example of a voice search device that executes the above processing. As shown in FIG. 4, the voice search device 10 includes a central processing unit (CPU) 11. A program memory 13, a data memory 14, and a communication interface (communication I/F) 15 are connected to the CPU 11 via a bus 12.

The program memory 13 stores programs that implement the voice division unit 13a, the prosodic feature extraction unit 13b, the distance calculation unit 13c, and the voice section estimation unit 13d. The data memory 14 contains the content storage unit 14a and the survey-target voice section prosodic feature storage unit 14b. The voice division unit 13a, prosodic feature extraction unit 13b, distance calculation unit 13c, and voice section estimation unit 13d correspond, respectively, to the processing functions of the voice division step S1, prosodic feature extraction step S2, distance calculation step S3, and voice section estimation step S4 in FIG. 1.

The voice search device 10 shown in FIG. 4 searches a large amount of content (video, audio, music) for items that are close, in terms of prosodic features, to the content the user inputs as a query. The content storage unit 14a is a content database in which a large amount of content is stored. The survey-target voice section prosodic feature storage unit 14b stores the time history of the prosodic features for each voice section of the content.

FIG. 5 shows a configuration example of the data stored in the content storage unit 14a and the survey-target voice section prosodic feature storage unit 14b shown in FIG. 4.

In FIG. 5, a201 is the content table of the content storage unit 14a. It stores, in association with each other, the file path indicating the location of each content item (file) and a content ID that uniquely identifies it.

a202 is the voice section table for content ID = 0001 in the content storage unit 14a. In this example, a voice section table is prepared for each content ID. Since one content item has a plurality of voice sections, the voice section ID that uniquely identifies each section, the file path of the audio file for that section, and the time position (start and end times) of the section within the content are stored in association with each other in the voice section table a202.

a203 is an example of the prosodic feature data stored in the survey-target voice section prosodic feature storage unit 14b, showing the prosodic features for content ID = 0001 and voice section ID = s0001. As described for the prosodic feature extraction step S2 in FIG. 1, prosodic features are extracted, for example, every 50 ms, so a plurality of prosodic features are obtained for each voice section. Accordingly, as shown at a203 in FIG. 5, a fundamental frequency is stored as a prosodic feature every 50 ms for voice section ID = s0001.

Since these tables can be cross-referenced like a relational database, after the similarity of a voice section has been determined from the prosodic features in a203, it is possible to trace which voice section (ID) of which content (ID) that section belongs to.
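The following sketch shows, with in-memory Python dictionaries standing in for the relational tables of FIG. 5, how a matched feature entry can be traced back to its content and time position; all IDs, file paths, and values are illustrative placeholders.

```python
# Content table a201: content ID -> file path
content_table = {
    "0001": "/contents/movie_0001.mpg",
}

# Voice section table a202 (one per content ID): section ID -> path and time position
voice_section_table = {
    "0001": {
        "s0001": {"path": "/contents/0001/s0001.wav",
                  "start": "00:01:23", "end": "00:01:41"},
    },
}

# Prosodic feature table a203: (content ID, section ID) -> F0 every 50 ms
prosodic_feature_table = {
    ("0001", "s0001"): [182.0, 190.5, 201.3, 188.7],
}

# Cross-reference: from a matched feature entry back to the scene it came from
content_id, section_id = ("0001", "s0001")
section = voice_section_table[content_id][section_id]
print(content_table[content_id], section["start"], section["end"])
```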

A case in which a user is interested in a scene and searches for similar scenes will now be described. When the medium is video, the voice division unit 13a in FIG. 4 extracts the audio and then applies the same processing as in the voice division step S1. Since music is processed in the same way as speech, the term "voice" in the present invention includes musical sounds such as music in general.

The designated scene is stored in the content storage unit 14a; the prosodic feature extraction unit 13b extracts the prosodic features of the voice information and voice sections divided by the voice division unit 13a; the distance calculation unit 13c computes the distance to the prosodic features stored in the survey-target voice section prosodic feature storage unit 14b; and the voice section estimation unit 13d identifies the voice section with the smallest distance. A scene synchronized with that voice section is then extracted from the content storage unit 14a and presented to the user.

For example, as shown in FIG. 6, a search request may be sent from a personal computer (PC) at the user terminal 20 to the voice search device 10 via a network or the like. In that case, the search request may consist of:
- the media (video, audio, or the like), and
- the corresponding time information within that media.

Alternatively, when the user terminal is itself the voice search device 10 shown in FIG. 4, a configuration that does not require a network may be used.

The similar-scene search may also be performed automatically, without waiting for a search by the user. For example, the prosodic features of the survey-target voice sections are stored in advance in the survey-target voice section prosodic feature storage unit 14b in association with affective states, for example emotions such as fun, happiness, anger, disgust, resignation, sadness, calm, and surprise. Furthermore, if a "fun" level is assigned, for example, a scene can be automatically estimated to be a more enjoyable one from the prosodic features of more enjoyable scenes. For example, while a "quite fun" scene is displayed on the user's terminal, a "quite fun" scene from other content may be automatically inserted, as shown on the user display screen in FIG. 7. If the inserted scene arouses the user's interest, the user can respond by, for example, switching to viewing it. In this way, a device that automatically recommends similar scenes is realized.

When the prosodic features of the survey-target voice sections are stored in the survey-target voice section prosodic feature storage unit 14b in association with affective states, it is not necessary to associate an affective state with every single voice section of every content item; it suffices to prepare manually, for example by input from an input device, a model (table) that associates a small set of prosodic-feature time-history patterns with affective states. Once this model has been input, affective states can be assigned automatically to the remaining voice sections by referring to it. An HMM (Hidden Markov Model), for example, can be used as the model.
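As one possible realization of such a model, the sketch below trains one Gaussian HMM per affective state on a small labeled set of prosodic-feature time histories and labels a new voice section with the state whose model scores it highest. The use of the third-party hmmlearn library, the number of hidden states, and the toy training data are all assumptions for illustration; the text above only names the HMM as an example.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_affect_models(labeled_sequences, n_states=2):
    """labeled_sequences: dict mapping an affect label (e.g. 'fun', 'calm')
    to a list of F0 sequences, one per manually labeled voice section."""
    models = {}
    for label, seqs in labeled_sequences.items():
        X = np.concatenate([np.asarray(s, dtype=float).reshape(-1, 1) for s in seqs])
        lengths = [len(s) for s in seqs]
        m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify_section(models, f0_sequence):
    """Assign the affect label whose HMM gives the highest log-likelihood."""
    X = np.asarray(f0_sequence, dtype=float).reshape(-1, 1)
    return max(models, key=lambda label: models[label].score(X))

# Toy example: rising contours labeled 'fun', flat contours labeled 'calm'
models = train_affect_models({
    "fun":  [[180, 200, 230, 260], [170, 210, 240, 270, 300]],
    "calm": [[150, 152, 149, 151], [148, 150, 151, 149, 150]],
})
print(classify_section(models, [175, 205, 235, 265]))
```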

Assuming that the above processing has produced a model that associates prosodic-feature time-history patterns with affective states, the usage shown in FIG. 7 can then be realized (as sketched below) by:
(1) extracting the prosodic features of the voice section of the content the user is viewing,
(2) obtaining the affective state of the extracted prosodic features by means of the model, and
(3) retrieving content corresponding to the obtained affective state from the content storage unit 14a.

The above voice search processing can be implemented by a computer and a software program, and the program can be recorded on a computer-readable recording medium or provided over a network.

10 Voice search device
11 Central processing unit (CPU)
12 Bus
13 Program memory
13a Voice division unit
13b Prosodic feature extraction unit
13c Distance calculation unit
13d Voice section estimation unit
14 Data memory
14a Content storage unit
14b Survey-target voice section prosodic feature storage unit
15 Communication interface

Claims (5)

1. A voice search method in which a computer searches for voice sections, the method comprising:
a voice division step of dividing voice into a plurality of voice sections at silent sections;
a prosodic feature extraction step of extracting prosodic features from each voice section in units of analysis frames;
a distance calculation step of calculating a distance between the prosodic features of a reference voice section extracted in the prosodic feature extraction step and the prosodic features of one or more survey-target voice sections stored in advance, for each voice section of the survey-target voice, in a survey-target voice section prosodic feature storage unit; and
a voice section estimation step of estimating, from the calculated distance, a voice section whose similarity is equal to or greater than a predetermined value,
wherein the prosodic feature extraction step includes a weighted average fundamental frequency extraction step of extracting a fundamental frequency by a weighted average or by a point pitch, and a semitone conversion step of converting the fundamental frequency to semitones, the semitone-converted fundamental frequency being used as the prosodic feature.
2. The voice search method according to claim 1, wherein, in the distance calculation step, the distance is calculated based on dynamic time warping.
3. A voice search device that searches for voice sections, comprising:
a voice division unit that divides voice into a plurality of voice sections at silent sections;
a prosodic feature extraction unit that extracts prosodic features from each voice section in units of analysis frames;
a survey-target voice section prosodic feature storage unit that stores, in advance, prosodic features for each voice section of the survey-target voice;
a distance calculation unit that calculates a distance between the prosodic features of a reference voice section extracted by the prosodic feature extraction unit and the prosodic features of one or more survey-target voice sections stored in the survey-target voice section prosodic feature storage unit; and
a voice section estimation unit that estimates, from the calculated distance, a voice section whose similarity is equal to or greater than a predetermined value,
wherein the prosodic feature extraction unit includes a weighted average fundamental frequency extraction unit that extracts a fundamental frequency by a weighted average or by a point pitch, and a semitone conversion unit that converts the fundamental frequency to semitones, the semitone-converted fundamental frequency being used as the prosodic feature.
4. The voice search device according to claim 3, wherein the distance calculation unit calculates the distance based on dynamic time warping.
5. A voice search program for causing a computer to execute the voice search method according to claim 1 or claim 2.
JP2009218455A (filed 2009-09-24, priority 2009-09-24): Voice search method, voice search device, and voice search program. Granted as JP5182892B2, status Active.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
JP2009218455A | 2009-09-24 | 2009-09-24 | Voice search method, voice search device, and voice search program (JP5182892B2)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
JP2009218455A | 2009-09-24 | 2009-09-24 | Voice search method, voice search device, and voice search program (JP5182892B2)

Publications (2)

Publication Number | Publication Date
JP2011069845A (en) | 2011-04-07
JP5182892B2 | 2013-04-17

Family

ID=44015220

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
JP2009218455A (Active, granted as JP5182892B2) | Voice search method, voice search device, and voice search program | 2009-09-24 | 2009-09-24

Country Status (1)

Country Link
JP (1) JP5182892B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109087669A (en) * 2018-10-23 2018-12-25 腾讯科技(深圳)有限公司 Audio similarity detection method, device, storage medium and computer equipment

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598502A (en) * 2014-04-22 2015-05-06 腾讯科技(北京)有限公司 Method, device and system for obtaining background music information in played video
CN105430494A (en) * 2015-12-02 2016-03-23 百度在线网络技术(北京)有限公司 Method and device for identifying audio from video in video playback equipment
CN106601259B (en) * 2016-12-13 2021-04-06 北京奇虎科技有限公司 Information recommendation method and device based on voiceprint search
CN108198076A (en) * 2018-02-05 2018-06-22 深圳市资本在线金融信息服务有限公司 A kind of financial investment method, apparatus, terminal device and storage medium
WO2024176327A1 (en) * 2023-02-21 2024-08-29 ハイラブル株式会社 Information processing device, information processing method, and program

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB9918611D0 (en) * 1999-08-07 1999-10-13 Sibelius Software Ltd Music database searching
JP3730144B2 (en) * 2001-08-03 2005-12-21 日本電信電話株式会社 Similar music search device and method, similar music search program and recording medium thereof
JP4698606B2 (en) * 2004-12-10 2011-06-08 パナソニック株式会社 Music processing device
JP2007041234A (en) * 2005-08-02 2007-02-15 Univ Of Tokyo Method for deducing key of music sound signal, and apparatus for deducing key
JP2009058548A (en) * 2007-08-30 2009-03-19 Oki Electric Ind Co Ltd Speech retrieval device


Also Published As

Publication number | Publication date
JP2011069845A (en) | 2011-04-07

Similar Documents

Publication Publication Date Title
CN110557589B (en) System and method for integrating recorded content
JP5142769B2 (en) Voice data search system and voice data search method
JP5182892B2 (en) Voice search method, voice search device, and voice search program
JP5024154B2 (en) Association apparatus, association method, and computer program
CN102956230B (en) The method and apparatus that song detection is carried out to audio signal
JPWO2005069171A1 (en) Document association apparatus and document association method
JP2006501502A (en) System and method for generating audio thumbnails of audio tracks
CN1463419A (en) Synchronizing text/visual information with audio playback
US11354754B2 (en) Generating self-support metrics based on paralinguistic information
JP5218766B2 (en) Rights information extraction device, rights information extraction method and program
Zewoudie et al. The use of long-term features for GMM-and i-vector-based speaker diarization systems
CN108986843B (en) Audio data processing method and device, medium and computing equipment
Baumeister et al. The influence of alcoholic intoxication on the fundamental frequency of female and male speakers
JPH10307580A (en) Music searching method and device
JP2014119716A (en) Interaction control method and computer program for interaction control
JP2010117528A (en) Vocal quality change decision device, vocal quality change decision method and vocal quality change decision program
JP4601643B2 (en) Signal feature extraction method, signal search method, signal feature extraction device, computer program, and recording medium
JP5196114B2 (en) Speech recognition apparatus and program
CN112397048B (en) Speech synthesis pronunciation stability evaluation method, device and system and storage medium
JP2011170622A (en) Content providing system, content providing method, and content providing program
Lee et al. Detecting music in ambient audio by long-window autocorrelation
JP3934556B2 (en) Method and apparatus for extracting signal identifier, method and apparatus for creating database from signal identifier, and method and apparatus for referring to search time domain signal
Shih et al. Multidimensional humming transcription using a statistical approach for query by humming systems
JP2006323008A (en) Musical piece search system and musical piece search method
JP2004145161A (en) Speech database registration processing method, speech generation source recognizing method, speech generation section retrieving method, speech database registration processing device, speech generation source recognizing device, speech generation section retrieving device, program therefor, and recording medium for same program

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20110829

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A821

Effective date: 20110829

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20111125

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20111118

RD02 Notification of acceptance of power of attorney

Free format text: JAPANESE INTERMEDIATE CODE: A7422

Effective date: 20111221

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20121011

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20121016

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20121213

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20130108

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20130110

R150 Certificate of patent or registration of utility model

Ref document number: 5182892

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150


FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20160125

Year of fee payment: 3

S531 Written request for registration of change of domicile

Free format text: JAPANESE INTERMEDIATE CODE: R313531

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

S111 Request for change of ownership or part of ownership

Free format text: JAPANESE INTERMEDIATE CODE: R313117

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350