JP4986301B2

JP4986301B2 - Content search apparatus, program, and method using voice recognition processing function

Info

Publication number: JP4986301B2
Application number: JP2008252219A
Authority: JP
Inventors: ▲シン▼ 徐; 正樹内藤; 恒河井
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2008-09-30
Filing date: 2008-09-30
Publication date: 2012-07-25
Anticipated expiration: 2028-09-30
Also published as: JP2010085522A

Description

The present invention relates to a content search apparatus, program, and method that use a speech recognition processing function.

Music distribution services have conventionally been provided that let a user search for music as content from a terminal such as a mobile phone or a personal computer. In such a service, the user enters a keyword, such as a song title or an artist name, into the terminal. The terminal sends the keyword over a network to a content search server, which makes it possible to find the matching songs.

The song titles and artist names used as search keywords may mix Japanese, English, numerals, and so on. Entering such a keyword on a mobile phone, for example, is laborious for the user. To remove this burden, some terminals are equipped with a voice input function.

FIG. 1 is a functional configuration diagram of a content search apparatus in the prior art.

According to FIG. 1, the content search apparatus 1 has a voice input unit 101, an acoustic feature extraction unit 102, an acoustic model storage unit 103, a language model storage unit 104, a speech recognition decoder 105, and a content search unit 106. The functional units other than the voice input unit 101 are realized by executing a program that causes a computer installed in the apparatus to function.

The voice input unit 101 receives the voice uttered by the user and converts it into an electric signal (a sound waveform). The converted waveform is output to the acoustic feature extraction unit 102.

The acoustic feature extraction unit 102 extracts an acoustic feature $x$ from the input speech waveform. For example, mel-frequency cepstral coefficients (MFCC), which are weighted according to the frequency-dependent sensitivity of human hearing, may be used.

The acoustic model storage unit 103 stores an acoustic model and outputs, for a recognition candidate word string $\omega$, the acoustic probability $P(x \mid \omega)$ that the acoustic feature $x$ extracted from the input speech is observed. Here, $\omega = \omega_1, \omega_2, \ldots, \omega_m$ denotes a word string, and $\omega_m$ denotes a word.

The language model storage unit 104 stores a language model and outputs a statistical or grammatical language probability, $P_{\text{n-gram}}(\omega)$ or $P_{\text{cfg}}(\omega)$. A statistical language model or a descriptive grammar is generally used as the language model. The language model storage unit 104 normally also stores the dictionary of words to be recognized.

$P_{\text{n-gram}}(\omega)$ is the language probability of a statistical language model called an n-gram. It is an estimate of the language probability $P(\omega)$ that the word string $\omega$ occurs, based on language statistics over a large amount of text. It is used mainly for dictation and conversational speech recognition.

$P_{\text{cfg}}(\omega)$ is the language probability of a grammar-rule-based model called a context-free grammar, in which the syntax rules are written by hand based on knowledge of the language and an analysis of the recognition task.

Based on the acoustic feature $x$, the acoustic probability $P(x \mid \omega)$, and the language probability $P(\omega)$ derived from the statistical or grammatical language probability $P_{\text{n-gram}}(\omega)$ or $P_{\text{cfg}}(\omega)$, the speech recognition decoder 105 outputs the recognition result word string $\hat{\omega}$ that maximizes the following evaluation function, or the top N word strings in descending order of it.
$$P(\omega) \times P(x \mid \omega) \quad (\omega \in W,\ x \in X)$$
The recognition result word string $\hat{\omega}$ found by the search is output to the content search unit 106. For example, the recognition result word string $\hat{\omega}$ with the maximum probability is expressed by the following equation.
$$\hat{\omega} = \arg\max \{ P(\omega) \times P(x \mid \omega) \} \quad (\omega \in W,\ x \in X)$$
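The selection rule above can be illustrated with a minimal Python sketch. It assumes the candidate word strings and their two probabilities are already available as plain numbers; the function name `n_best` and the tuple layout are hypothetical and do not come from the patent.

```python
import heapq


def n_best(candidates, n=1):
    """Return the n candidates with the largest P(w) * P(x|w).

    candidates: iterable of (word_string, language_prob, acoustic_prob).
    The tuple layout is an assumption made for this illustration.
    """
    scored = [(lang_p * ac_p, words) for words, lang_p, ac_p in candidates]
    # nlargest sorts by the first tuple element, i.e. the evaluation function.
    return heapq.nlargest(n, scored)


candidates = [
    ("artist A song 1", 0.5, 0.2),
    ("artist B song 2", 0.3, 0.5),
    ("artist C song 3", 0.2, 0.1),
]
top_two = n_best(candidates, n=2)
```

With `n=1` this corresponds to the arg max case; a larger `n` gives the top N list. A practical decoder would usually work with log probabilities to avoid underflow, which this sketch omits for readability.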

The search for the recognition result word string $\hat{\omega}$ uses a so-called beam search algorithm. Beam search is a search process that, using a predetermined search beam width, keeps only the word string candidates with a high value of the evaluation function $P(\omega) \times P(x \mid \omega)$ and prunes those with a low value. The beam width setting controls the trade-off between computation time and recognition accuracy.
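As an illustration of the pruning idea, the following Python sketch shows one common beam search variant that keeps a fixed number of hypotheses per step. It is not the decoder of the patent: the function `beam_search`, the toy `step_score` callback, and the use of log probabilities are all assumptions made for the example.

```python
import math


def beam_search(vocab, step_score, length, beam_width):
    """Grow word strings one word at a time, keeping the best `beam_width`.

    step_score(prefix, word) returns the log-probability gained by
    appending `word` to `prefix` (a hypothetical scoring callback).
    """
    beam = [((), 0.0)]  # (partial word string, accumulated log score)
    for _ in range(length):
        expanded = [
            (prefix + (word,), score + step_score(prefix, word))
            for prefix, score in beam
            for word in vocab
        ]
        expanded.sort(key=lambda hyp: hyp[1], reverse=True)
        beam = expanded[:beam_width]  # prune the low-scoring hypotheses
    return beam


# Toy scores that ignore the prefix, purely for demonstration.
toy_probs = {"a": 0.6, "b": 0.4}
score = lambda prefix, word: math.log(toy_probs[word])
narrow = beam_search(["a", "b"], score, length=2, beam_width=1)
wide = beam_search(["a", "b"], score, length=2, beam_width=4)
```

A small `beam_width` evaluates fewer hypotheses and runs faster, at the risk of pruning the string that would have scored best overall; a large one approaches exhaustive search.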

The content search unit 106 uses the recognition result word string $\hat{\omega}$ as a search key to retrieve, from the content database, either the content itself or a download server address.

As a technique using such a content search apparatus, there is a music search system in which the user utters an artist name and a song title joined by the Japanese particle "no" (の), and the system recognizes the keywords from the sound waveform and searches for the song (see, for example, Patent Document 1).

There is also a program designation device that searches for programs using a word dictionary (see, for example, Patent Document 2). The word dictionary is obtained by deleting, from a pre-registered word dictionary, the words that do not match the user's preference information. Shrinking the search space of word strings in this way can improve recognition accuracy.

Patent Document 1: Japanese Patent Application Laid-Open No. 2002-189483
Patent Document 2: Japanese Patent Application Laid-Open No. 2004-120767

According to the technique described in Patent Document 1, the word string is extracted directly from the uttered sound waveform, so it does not reflect the user's intention or preferences. In addition, all song title words registered in the word dictionary are searched with the same priority. Consequently, when the search target is a commercial music database holding tens of thousands to hundreds of thousands of song titles, the number of song title words registered in the word dictionary grows and the search space becomes enormous. Searching all songs with the same priority therefore not only takes a very long time but also often outputs, as the recognition result, a song title that differs from what the user intended or prefers.

According to the technique described in Patent Document 2, a reduced word dictionary is created by deleting the words that do not match the user's preference information. Programs that fall outside the user's preferences, and whose words are therefore absent from the reduced dictionary, cannot be recognized at all. Moreover, when the preference level is not estimated accurately, recognition performance can degrade significantly. Note that the user has to enter the preference level manually, for example as an evaluation score, on every use.

Accordingly, an object of the present invention is to provide a content search apparatus, program, and method that take the user's preferences into account and can retrieve content from speech in a way that each user perceives as having high recognition accuracy.

According to the present invention, a content search apparatus has:
acoustic feature extraction means for extracting an acoustic feature $x$ from an input speech waveform;
acoustic model storage means for storing an acoustic model and outputting the acoustic probability $P(x \mid \omega)$ that the acoustic feature $x$ is observed for a recognition candidate word string $\omega$ consisting of one or more words $\omega_m$;
language model storage means for storing a language model and outputting a statistical or grammatical language probability $P_{\text{n-gram}}(\omega)$ or $P_{\text{cfg}}(\omega)$;
a speech recognition decoder that outputs a recognition result word string $\hat{\omega}$ based on the acoustic feature $x$, the acoustic probability $P(x \mid \omega)$, and the statistical or grammatical language probability $P_{\text{n-gram}}(\omega)$ or $P_{\text{cfg}}(\omega)$; and
content search means for searching a content database for content using the recognition result word string $\hat{\omega}$ as a search key,
and is characterized by further having:
user information storage means for storing the inter-content similarity $S_{i,j}$ between every two items of content and, for each user $U$, the group of words $\omega_m$ (= content names $M_n$ ($M_1, M_2, \ldots, M_v$)) of the content searched for in the past;
preference probability calculation means for calculating a first weight $\alpha(M_q, U)$ of the user $U$ for a content name $M_q$ from the sum of the inter-content similarities $S_{M_q, M_n}$ between the content name $M_q$ and each content name $M_n$ searched for in the past by the user $U$, calculating a preference probability $P^{*}(\omega_m)$ for each word $\omega_m$ to which the first weight $\alpha(M_q, U)$ is applied, and calculating the preference probability $P^{*}(\omega)$ of each word string $\omega$ containing one or more words $\omega_m$ as $P^{*}(\omega) = P^{*}(\omega_1, \omega_2, \ldots, \omega_m) = P^{*}(\omega_1) \times P^{*}(\omega_2) \times \cdots \times P^{*}(\omega_m)$; and
language probability calculation means for outputting a language probability $P(\omega)$ obtained by weighting the statistical or grammatical language probability $P_{\text{n-gram}}(\omega)$ or $P_{\text{cfg}}(\omega)$ with the preference probability $P^{*}(\omega)$.

According to another embodiment of the content search apparatus of the present invention, it is also preferable that:
the user information storage means stores content categories $k$ and, for the content included in each category $k$, the search frequency across all users; and
the preference probability calculation means calculates a second weight $\beta_k$ based on the ratio between the search frequency of the user $U$ in the category $k$ corresponding to the content name $M_q$ and all the other search frequencies, and calculates the preference probability $P^{*}(\omega_m)$ of each word $\omega_m$ by adding the second weight $\beta_k$ to the first weight $\alpha(M_q, U)$.

According to another embodiment of the content search apparatus of the present invention, it is also preferable that:
the user information storage means divides the users $U$ into categories $C$ and, for each of the users $U_n$ included in each category $C$, stores the first weight $\alpha(M_q, U_n)$ of that user for the content name $M_q$, calculated from the sum of the inter-content similarities $S_{M_q, M_n}$ between the content name $M_q$ and each content name $M_n$ searched for in the past by that user; and
the preference probability calculation means calculates, for the category $C$ to which the user $U$ belongs, a third weight $\gamma(M_q, U)$ based on the sum of the first weights $\alpha(M_q, U_n)$ of the users $U_n$ for the content name $M_q$, and calculates the preference probability $P^{*}(\omega_m)$ of each word $\omega_m$ by adding the third weight $\gamma(M_q, U)$ to the first weight $\alpha(M_q, U)$ and/or the second weight $\beta_k$.

According to another embodiment of the content search apparatus of the present invention, it is also preferable that the apparatus further has user evaluation means that displays the recognition result word string $\hat{\omega}$ to the user, accepts the user's correct/incorrect evaluation input for that word string, and, when the input indicates an error, causes the preference probability calculation means to recalculate the preference probability $P^{*}(\omega)$.

According to another embodiment of the content search apparatus of the present invention, it is also preferable that the speech recognition decoder uses a beam search method that prunes the recognition candidate word strings for which the acoustic probability $P(x \mid \omega)$ weighted by the language probability $P(\omega)$ is at or below a predetermined threshold, and finally outputs only the recognition result word string $\hat{\omega}$ with the maximum weighted probability, or the top N in descending order.

According to another embodiment of the content search apparatus of the present invention, it is also preferable that the content is music.

According to the present invention, a content search program causes a computer installed in an apparatus that searches a content database for content to function as:
acoustic feature extraction means for extracting an acoustic feature $x$ from an input speech waveform;
acoustic model storage means for storing an acoustic model and outputting the acoustic probability $P(x \mid \omega)$ that the acoustic feature $x$ is observed for a recognition candidate word string $\omega$ consisting of one or more words $\omega_m$;
language model storage means for storing a language model and outputting a statistical or grammatical language probability $P_{\text{n-gram}}(\omega)$ or $P_{\text{cfg}}(\omega)$;
a speech recognition decoder that outputs a recognition result word string $\hat{\omega}$ based on the acoustic feature $x$, the acoustic probability $P(x \mid \omega)$, and the statistical or grammatical language probability $P_{\text{n-gram}}(\omega)$ or $P_{\text{cfg}}(\omega)$; and
content search means for searching the content database for content using the recognition result word string $\hat{\omega}$ as a search key,
and is characterized by causing the computer to further function as:
user information storage means for storing the inter-content similarity $S_{i,j}$ between every two items of content and, for each user $U$, the group of words $\omega_m$ (= content names $M_n$ ($M_1, M_2, \ldots, M_v$)) of the content searched for in the past;
preference probability calculation means for calculating a first weight $\alpha(M_q, U)$ of the user $U$ for a content name $M_q$ from the sum of the inter-content similarities $S_{M_q, M_n}$ between the content name $M_q$ and each content name $M_n$ searched for in the past by the user $U$, calculating a preference probability $P^{*}(\omega_m)$ for each word $\omega_m$ to which the first weight $\alpha(M_q, U)$ is applied, and calculating the preference probability $P^{*}(\omega)$ of each word string $\omega$ containing one or more words $\omega_m$ as $P^{*}(\omega) = P^{*}(\omega_1, \omega_2, \ldots, \omega_m) = P^{*}(\omega_1) \times P^{*}(\omega_2) \times \cdots \times P^{*}(\omega_m)$; and
language probability calculation means for outputting a language probability $P(\omega)$ obtained by weighting the statistical or grammatical language probability $P_{\text{n-gram}}(\omega)$ or $P_{\text{cfg}}(\omega)$ with the preference probability $P^{*}(\omega)$.

According to the present invention, there is provided a content search method for an apparatus that extracts an acoustic feature $x$ from an input speech waveform, outputs a recognition result word string $\hat{\omega}$ based on the acoustic feature $x$, the acoustic probability $P(x \mid \omega)$ that the acoustic feature $x$ is observed for a recognition candidate word string $\omega$ consisting of one or more words $\omega_m$, and the statistical or grammatical language probability $P_{\text{n-gram}}(\omega)$ or $P_{\text{cfg}}(\omega)$, and searches a content database for content using the recognition result word string $\hat{\omega}$ as a key.
The apparatus has a user information storage unit that stores the inter-content similarity $S_{i,j}$ between every two items of content and, for each user $U$, the group of words $\omega_m$ (= content names $M_n$ ($M_1, M_2, \ldots, M_v$)) of the content searched for in the past, and the method is characterized by having:
a first step of calculating a first weight $\alpha(M_q, U)$ of the user $U$ for a content name $M_q$ from the sum of the inter-content similarities $S_{M_q, M_n}$ between the content name $M_q$ and each content name $M_n$ searched for in the past by the user $U$, calculating a preference probability $P^{*}(\omega_m)$ for each word $\omega_m$ to which the first weight $\alpha(M_q, U)$ is applied, and calculating the preference probability $P^{*}(\omega)$ of each word string $\omega$ containing one or more words $\omega_m$ as $P^{*}(\omega) = P^{*}(\omega_1, \omega_2, \ldots, \omega_m) = P^{*}(\omega_1) \times P^{*}(\omega_2) \times \cdots \times P^{*}(\omega_m)$; and
a second step of outputting a language probability $P(\omega)$ obtained by weighting the statistical or grammatical language probability $P_{\text{n-gram}}(\omega)$ or $P_{\text{cfg}}(\omega)$ with the preference probability $P^{*}(\omega)$.

According to the content search apparatus, program, and method of the present invention, a word string is extracted during speech recognition using a language probability weighted by the user's preference level, and content is searched for using that word string, so that each user perceives the recognition accuracy as high.

The best mode for carrying out the present invention is described in detail below with reference to the drawings.

FIG. 2 is a functional configuration diagram of the content search apparatus according to the present invention.

According to FIG. 2, compared with FIG. 1, the content search apparatus 1 additionally has a language probability calculation unit 111, a preference probability calculation unit 112, a user information storage unit 113, and a user evaluation unit 114. These functional units are realized by executing a program that causes a computer installed in the apparatus to function.

The user information storage unit 113 stores user information. The user information includes user search history information, inter-content similarity, content access level, content freshness, and/or user attribute information. It is also preferable to update this information with information from the content database 2.

Based on the user information, the preference probability calculation unit 112 calculates the preference probability of the word string $\omega$ from the preference probability $P^{*}(\omega_m)$ of each word $\omega_m$, which represents the degree of the user's preference for the content: $P^{*}(\omega) = P^{*}(\omega_1, \omega_2, \ldots, \omega_m) = P^{*}(\omega_1) \times P^{*}(\omega_2) \times \cdots \times P^{*}(\omega_m)$. The preference probability calculation unit 112 also stores the calculated preference probability $P^{*}(\omega)$ and outputs it to the language probability calculation unit 111.

The language probability calculation unit 111 outputs the language probability $P(\omega)$ obtained by weighting the statistical or grammatical language probability $P_{\text{n-gram}}(\omega)$ or $P_{\text{cfg}}(\omega)$ with the preference probability $P^{*}(\omega)$. The characteristic feature of the present invention is that the language probability $P(\omega)$ weighted by the preference probability $P^{*}(\omega)$ is used as the language probability. Specifically, when a statistical language model is used, for example, the preference-weighted language probability $P(\omega)$ is calculated by the following equation, in which $P_{\text{n-gram}}(\omega)$ is weighted by the preference probability $P^{*}(\omega)$.
$$P(\omega) = P^{*}(\omega) \times P_{\text{n-gram}}(\omega)$$

For recognition processing that uses a descriptive grammar, the preference-weighted language probability $P(\omega)$ is calculated by the following equation, where $P_{\text{cfg}}(\omega)$ is 1 or 0.
$$P(\omega) = P^{*}(\omega) \times P_{\text{cfg}}(\omega)$$

Here, the preference-weighted probability $P(\omega)$ is normalized so that the sum $\sum P(\omega)$ over all word strings $\omega$ equals 1.
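The weighting and normalization described above can be sketched in a few lines of Python. The sketch assumes both probabilities are given as dictionaries keyed by word string; the function name `weighted_language_probs` and the dictionary representation are illustrative choices, not part of the patent.

```python
def weighted_language_probs(base_probs, preference_probs):
    """Weight each base language probability by the preference probability
    and renormalize so the results sum to 1.

    base_probs:       {word_string: P_ngram or P_cfg}   (assumed layout)
    preference_probs: {word_string: P*}                 (assumed layout)
    """
    raw = {w: preference_probs[w] * p for w, p in base_probs.items()}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("all weighted probabilities are zero")
    return {w: value / total for w, value in raw.items()}


base = {"song x": 0.5, "song y": 0.5}          # equal under the base model
preference = {"song x": 0.75, "song y": 0.25}  # the user favours "song x"
weighted = weighted_language_probs(base, preference)
```

In this toy case the base model cannot distinguish the two strings, so after weighting the ranking is decided entirely by the preference probability. With a descriptive grammar, a base probability of 0 keeps the string at 0 regardless of preference.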

The user evaluation unit 114 displays to the user the recognition result word string $\hat{\omega}$ output by the speech recognition decoder, and accepts the user's input evaluating whether that word string is correct. When the input indicates that the result is correct, the content search unit 106 is instructed to search using the recognition result word string $\hat{\omega}$ as the search key. When the input indicates an error, the user evaluation unit 114 causes the preference probability calculation unit 112 to recalculate the preference probability $P^{*}(\omega)$.

The speech recognition decoder 105 according to the present invention uses a beam search method that prunes the recognition candidate word strings for which the acoustic probability $P(x \mid \omega)$ weighted by the language probability $P(\omega)$ is at or below a predetermined threshold, and finally outputs the recognition result word string $\hat{\omega}$ with the maximum weighted probability, or the top N in descending order.
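The threshold-based pruning and top N output can be sketched as follows. This is a simplified, single-pass illustration over complete hypotheses rather than a full decoder; the function name `prune_and_rank` and the tuple layout are assumptions introduced for the example.

```python
def prune_and_rank(hypotheses, threshold, n):
    """Drop hypotheses whose weighted probability is at or below `threshold`,
    then return the top `n` survivors in descending order.

    hypotheses: iterable of (word_string, acoustic_prob, language_prob),
    where language_prob is assumed to be already preference-weighted.
    """
    survivors = [
        (words, ac_p * lang_p)
        for words, ac_p, lang_p in hypotheses
        if ac_p * lang_p > threshold
    ]
    survivors.sort(key=lambda hyp: hyp[1], reverse=True)
    return survivors[:n]


hypotheses = [
    ("song a", 0.5, 0.5),
    ("song b", 0.5, 0.25),
    ("song c", 0.25, 0.125),
]
ranked = prune_and_rank(hypotheses, threshold=0.1, n=2)
```

Because the language probability is weighted by the user's preference, a string the user is unlikely to want falls below the threshold sooner and is removed from the search earlier.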

FIG. 3 is an explanatory diagram showing the steps by which the preference probability calculation unit calculates the preference probability $P^{*}(\omega)$.

According to FIG. 3, the user information storage unit 113 stores user search history information, content information, and user attribute information. The user search history information is search history such as the songs the user has searched for and how often. The content information comprises the inter-content similarity, the content access level, and the content freshness. The inter-content (inter-song) similarity is a similarity between items of content calculated in advance from song information such as genre and artist. The content access level is, for example, the search frequency of each song across all users. The content freshness is, for example, the release date. The user attribute information is user profile information based on the user's age group, gender, occupation, and so on.

Using this user information, the preference probability calculation unit 112 calculates the weighting coefficients $\alpha$, $\beta$, and $\gamma$ and, from them, calculates the preference probability $P^{*}(\omega)$ for each word of a content name (song title, artist name, and so on).

The preference probability calculation unit 112 calculates a history weight $\alpha$ based on the inter-content similarity, using the user search history information. The history weight $\alpha$ reflects the user's search preferences across all song titles to be recognized, and it varies in direct proportion to the amount of user history information.

A calculation example for the history weight $\alpha$ follows. First, the similarity $S_{i,j}$ between song $i$ and song $j$ is calculated over all songs, where $i$ and $j$ are song numbers. The value of $S_{i,j}$ is directly proportional to the correlation between the genre information of songs $i$ and $j$. The similarity $S_{i,j}$ can also be calculated using the correlation in the songs' artist information, lyrics information, and melody information. For the search history of the user $U$ (the songs $M_i$ searched for), $R(U) = \{M_1, M_2, \ldots, M_V\}$, the history preference weight $\alpha(M_q, U)$ for the song $M_q$ is calculated by the following equation. When the user's search history is empty, $\alpha$ is 0.
$$\alpha(M_q, U) = \sum_{M_n \in R(U)} S_{M_q, M_n}$$
In addition, the songs in the database can be classified into categories so that, when the user $U$ has searched for many songs belonging to a category $k$, the value of $\alpha(M, U)$ ($M \in k$) for the songs in category $k$ is increased for that user.
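The history weight equation can be illustrated with a short Python sketch. The similarity table is assumed to be a dictionary keyed by unordered song pairs, and the name `history_weight` is hypothetical; how the similarities themselves are computed is outside the scope of the sketch.

```python
def history_weight(query, history, similarity):
    """alpha(M_q, U): sum of the similarities between the query song and
    every song in the user's search history R(U).

    similarity: {frozenset({song_i, song_j}): S_ij}, an assumed layout;
    pairs missing from the table are treated as similarity 0.
    """
    return sum(similarity.get(frozenset((query, m)), 0.0) for m in history)


similarity = {
    frozenset(("M_q", "M_1")): 0.5,
    frozenset(("M_q", "M_2")): 0.25,
}
alpha = history_weight("M_q", ["M_1", "M_2"], similarity)
alpha_empty = history_weight("M_q", [], similarity)
```

Because the sum runs over the whole history, the weight grows as the user accumulates searches for similar songs, and it is 0 for an empty history, matching the behaviour described in the text.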

The preference probability calculation unit 112 also calculates a trend weight $\beta$ based on the content access level and the content freshness.

An example of calculating the trend weight β is as follows. Because the trend weight β represents how popular a music piece currently is, it uses the search frequency of that piece across all users. First, for all music pieces in a given category k, statistics are gathered with the elapsed time t since each piece's release date on the horizontal axis and the daily search frequency of each piece in that period on the vertical axis. From these statistics, the relation $f_k = F_k(t)$ between the search frequency $f_k$ of the pieces belonging to category k and t is estimated. Based on this relation, the trend weight $\beta_k$ of a piece in category k is calculated by the following equation.
$$\beta_k = \frac{F_k(t)}{\max\{F_k(t)\}}$$
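A rough sketch of the trend weight follows. The text does not say how $F_k(t)$ is estimated, so this example assumes the simplest option: the mean daily search count observed at each elapsed day t. The function names and the fallback value 0.0 for unseen days are likewise assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def estimate_curve(samples: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Estimate F_k(t) as the mean daily search count at each elapsed day t.

    `samples` holds (elapsed_days, daily_search_count) pairs gathered from
    all pieces in one category k.
    """
    totals: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for t, freq in samples:
        totals[t] += freq
        counts[t] += 1
    return {t: totals[t] / counts[t] for t in totals}


def trend_weight(curve: Dict[int, float], t: int) -> float:
    """beta_k = F_k(t) / max{F_k(t)}; returns 0.0 when no data is available."""
    peak = max(curve.values(), default=0.0)
    if peak <= 0.0:
        return 0.0
    return curve.get(t, 0.0) / peak
```

Dividing by the peak keeps $\beta_k$ in the range 0 to 1, so a piece at the most-searched point of its category's life cycle receives the weight 1.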

Further, the preference probability calculation unit 112 calculates an attribute weight γ based on inter-content similarity using user attribute information. The attribute weight γ is set high for content that has been searched relatively often within the category to which the user belongs, and for content that is highly similar to that content.

An example of calculating the attribute weight γ is as follows. The attribute weight γ estimates user U's preference for a music piece by referring to the search histories (searched music information) of other users whose profile information is similar to that of user U. First, assume that clustering all user profile information places user U in profile category C. Using the users $U_1, U_2, \ldots, U_N$ other than user U that belong to category C, the attribute weight $\gamma(M_q, U)$ of music piece $M_q$ is calculated by the following equation, where N is the number of users in category C excluding user U.
$$\gamma(M_q, U) = \frac{1}{N} \sum_{n=1}^{N} \alpha(M_q, U_n), \quad U_n, U \in C$$
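The averaging step above can be illustrated as follows. The `alpha` callback stands in for the history weight of another user and is a hypothetical interface; returning 0.0 when the user is alone in the cluster is an assumption, since the text does not cover that case.

```python
from typing import Callable, Iterable


def attribute_weight(
    target: str,
    user: str,
    cluster: Iterable[str],
    alpha: Callable[[str, str], float],
) -> float:
    """gamma(M_q, U): mean of alpha(M_q, U_n) over the other users in category C."""
    others = [member for member in cluster if member != user]
    if not others:
        # No similar users to learn from (assumed behaviour, not from the text).
        return 0.0
    return sum(alpha(target, member) for member in others) / len(others)
```

For example, if two other users in the cluster have history weights 0.5 and 1.5 for a piece, the attribute weight of that piece for user U is 1.0.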

The history weight α, trend weight β, and attribute weight γ obtained in this way take higher values the better the content matches the user's preferences. Using the history weight α, the trend weight β, and/or the attribute weight γ, the preference probability calculation unit 112 then calculates the preference probability of a word string ω from the preference probability $P^*(\omega_m)$ of each word $\omega_m$ as $P^*(\omega) = P^*(\omega_1, \omega_2, \ldots, \omega_m) = P^*(\omega_1) \times P^*(\omega_2) \times \cdots \times P^*(\omega_m)$. A word $\omega_m$ is, for example, a content name (music title, artist name, etc.).

The preference probability $P^*(\omega)$ can be calculated from the history weight α, trend weight β, and attribute weight γ as $P^*(\omega) = F(\alpha + \beta + \gamma)$, using a general function F of the argument (α + β + γ). As a specific example, a calculation using the following two formulas is shown.
$$\hat{P}^*(\omega) = (\alpha + \beta + \gamma)^p$$
Here, the exponent p is a constant set for each user. It is determined in advance by a speech recognition experiment based on prerecorded speech waveforms and user information.
$$P^*(\omega) = \frac{\hat{P}^*(\omega)}{\sum \hat{P}^*(\omega)} \quad \text{(normalization of } P^*(\omega)\text{)}$$
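The two formulas, together with the per-word product used for word strings, can be sketched as below. The dictionary layout, the function names, and the handling of an all-zero total are assumptions for illustration; the exponent p is assumed to be non-negative.

```python
import math
from typing import Dict, Iterable, Tuple

Weights = Tuple[float, float, float]  # (alpha, beta, gamma) for one word


def preference_probabilities(weights: Dict[str, Weights], p: float) -> Dict[str, float]:
    """Raise (alpha + beta + gamma) to the power p, then normalize over all words."""
    raw = {word: (a + b + g) ** p for word, (a, b, g) in weights.items()}
    total = sum(raw.values())
    if total == 0.0:
        # Nothing to normalize; assumed fallback, not specified in the text.
        return {word: 0.0 for word in raw}
    return {word: value / total for word, value in raw.items()}


def word_string_probability(words: Iterable[str], probs: Dict[str, float]) -> float:
    """P*(omega) for a word string: product of the per-word P*(omega_m)."""
    return math.prod(probs[word] for word in words)
```

A larger p sharpens the distribution toward the user's preferred content, while p close to 0 flattens it toward uniform, which is why p controls how strongly preference influences recognition.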

The exponent p may also be adjusted according to the speech recognition results. The adjustment method is described next.

The user evaluation unit 114 shown in FIG. 2 presents the recognition result word string $\hat{\omega}$, which is the output of the speech recognition decoder 105, to the user. If the recognition result word string $\hat{\omega}$ is correct, the user presses the "search button" in the user evaluation unit 114 (Yes), and a search is performed using $\hat{\omega}$ as the search key.

If the recognition result word string $\hat{\omega}$ is incorrect, the user presses the "retry button" in the user evaluation unit 114 (No), and this button-press information is reported to the preference probability calculation unit 112. The preference probability calculation unit 112 then adjusts the value of the exponent p so that the balance between the linguistic probability and the preference probability in the language model P(ω) weighted by the preference probability $P^*(\omega)$ becomes appropriate and misrecognition decreases.

An example of automatically updating the exponent p is as follows. When an evaluation operation indicating that the recognition result is incorrect is received, the speech recognition decoder 105 executes the recognition process again without using the preference probability $P^*(\omega)$. If the re-recognition result differs from the previous recognition result and the recognition score S of the re-recognition result is higher than a preset threshold R, the preference probability $P^*(\omega)$ is presumed to be inappropriate for that user (S is a normalized score with a value between 0 and 1). Therefore, to bring the value of $P^*(\omega)$ closer to 1, the exponent p is adjusted by repeating the following calculation until the same result as the re-recognition result is obtained.
n: number of repetitions
$p_n$: value of the exponent p after the n-th adjustment
$p_0$: initial value obtained from the speech recognition experiment
$$p_n = (1 - S)\, p_{n-1}$$
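The update loop can be sketched as follows. The `recognize` callback is a hypothetical stand-in for running the speech recognition decoder 105 with the preference probability computed from a given exponent, and `max_iter` is an added safety bound that the text does not mention.

```python
from typing import Callable


def adjust_exponent(
    p0: float,
    score: float,
    threshold: float,
    recognize: Callable[[float], str],
    target: str,
    max_iter: int = 50,
) -> float:
    """Repeat p_n = (1 - S) * p_{n-1} until recognize(p) yields the target.

    `target` is the result of re-recognition without the preference
    probability, and `score` is its normalized recognition score S.
    """
    if score <= threshold:
        # Re-recognition is not trusted enough; leave the exponent unchanged.
        return p0
    p = p0
    for _ in range(max_iter):
        if recognize(p) == target:
            break
        p = (1.0 - score) * p
    return p
```

Because S lies between 0 and 1, each step shrinks p, and a more confident re-recognition (S closer to 1) shrinks it faster.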

Through this operation, the speech recognition rate can be raised by gradually correcting the preference probability $P^*(\omega)$, which enables adaptive operation.

FIG. 4 is a system configuration diagram according to another embodiment of the present invention.

According to FIG. 4, a terminal 3 operated by a user, a recognition server 4, an information management server 5, and a content server 6 are connected via the Internet.

The terminal 3 includes a voice input unit 101, an acoustic feature amount extraction unit 102, and a user evaluation unit 114. The acoustic feature quantity x output from the acoustic feature amount extraction unit 102 is transmitted to the recognition server 4 via the network.

The recognition server 4 includes an acoustic model storage unit 103, a language model storage unit 104, a language probability calculation unit 111, and a speech recognition decoder 105. The speech recognition decoder 105 receives the acoustic feature quantity x from the terminal 3 and receives the preference probability $P^*(\omega)$ from the information management server 5. The speech recognition decoder 105 then transmits the recognition result word string $\hat{\omega}$ to the content server 6 via the network.

The information management server 5 includes a preference probability calculation unit 112 and a user information storage unit 113. The preference probability calculation unit 112 stores the preference probability $P^*(\omega)$, calculated in advance, for each user. When the user evaluation unit 114 of the terminal 3 indicates that the recognition result word string $\hat{\omega}$ was incorrect, the preference probability calculation unit 112 may also recalculate the preference probability $P^*(\omega)$ for that user.

In the terminal 3, the user's utterance is converted into a speech waveform by the voice input unit 101, the acoustic feature quantity x is extracted by the acoustic feature amount extraction unit 102, and the acoustic feature quantity x is transmitted to the recognition server 4. At the same time, the user identification number is transmitted from the terminal 3 to the information management server 5. The information management server 5 transmits the preference probability $P^*(\omega)$ corresponding to the user identification number to the recognition server 4.

The language probability calculation unit 111 of the recognition server 4 uses the received preference probability $P^*(\omega)$ to calculate the language probability P(ω) weighted by $P^*(\omega)$. The speech recognition decoder 105 then executes speech recognition processing based on the acoustic feature quantity x received from the terminal 3, the acoustic probability P(x|ω), and the weighted language probability P(ω).
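How the weighting changes the decoder's choice can be illustrated with a toy rescoring example. This passage does not state the exact weighting formula, so the sketch assumes the weighted language probability is the product of the base language probability and the preference probability, and it scores a fixed candidate list instead of running a real beam search. All probabilities are invented and must be greater than 0 for the logarithms to be defined.

```python
import math
from typing import Dict, Tuple

# word string -> (acoustic P(x|w), base language probability, preference P*(w))
Candidate = Tuple[float, float, float]


def log_score(candidate: Candidate, use_preference: bool = True) -> float:
    """Log of P(x|w) times the (optionally preference-weighted) language probability."""
    acoustic, language, preference = candidate
    score = math.log(acoustic) + math.log(language)
    if use_preference:
        # Assumed weighting: multiply by the preference probability.
        score += math.log(preference)
    return score


def decode(candidates: Dict[str, Candidate], use_preference: bool = True) -> str:
    """Return the candidate word string with the highest combined score."""
    return max(candidates, key=lambda w: log_score(candidates[w], use_preference))
```

In a case where two titles sound alike, the acoustically stronger candidate wins without the preference term, while the title the user is more likely to want wins once the preference probability is included.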

The recognition result word string $\hat{\omega}$ output from the speech recognition decoder 105 is transmitted to the user's terminal 3 via the network. The terminal 3 displays the recognition result word string $\hat{\omega}$ to the user through the user evaluation unit 114. If $\hat{\omega}$ matches the content the user wants (a correct answer), the user's operation causes $\hat{\omega}$ to be transmitted as a search key to the content search unit 106 in the content server or a web search server. The search result is returned to the terminal 3.

If the recognition result word string $\hat{\omega}$ does not match the content the user wants (an incorrect answer), the user either re-inputs the speech or abandons the search. In the case of re-input, the re-input operation information is fed back to the preference probability calculation unit 112 of the information management server 5, which uses it to automatically update the calculation model of $P^*(\omega)$.

With such a distributed configuration, functions can be divided flexibly between the terminal and the servers, making it easy to adapt to the required search performance and the number of users.

As described above in detail, according to the content search apparatus, program, and method of the present invention, the speech recognition process uses the acoustic probability and the language probability weighted by the user's degree of preference to extract the word string with the highest recognition score from among the candidate word strings, and content is searched using that word string, so that each user perceives the recognition accuracy as high. Moreover, even when the speech recognition decoder searches with a beam search of narrow beam width, recognition accuracy does not deteriorate and the computation time can be shortened. Furthermore, because the vocabulary of the word dictionary is not reduced based on preference, music pieces outside the user's preferences can still be retrieved.

In the various embodiments of the present invention described above, various changes, modifications, and omissions within the scope of the technical idea and viewpoint of the present invention can be easily made by those skilled in the art. The above description is merely an example and is not intended to be restrictive. The invention is limited only as defined in the claims and their equivalents.

FIG. 1 is a functional configuration diagram of a content search apparatus in the prior art.
FIG. 2 is a functional configuration diagram of the content search apparatus according to the present invention.
FIG. 3 is an explanatory diagram showing the steps for calculating the preference probability $P^*(\omega)$ in the preference probability calculation unit.
FIG. 4 is a system configuration diagram according to another embodiment of the present invention.

Explanation of symbols

1 Content search apparatus
101 Voice input unit
102 Acoustic feature amount extraction unit
103 Acoustic model storage unit
104 Language model storage unit
105 Speech recognition decoder
106 Content search unit
111 Language probability calculation unit
112 Preference probability calculation unit
113 User information storage unit
114 User evaluation unit
2 Content database
3 Terminal
4 Recognition server
5 Information management server
6 Content server

Claims

In a content search apparatus having:
acoustic feature quantity extraction means for extracting an acoustic feature quantity x from an input speech waveform;
acoustic model storage means for storing an acoustic model and outputting an acoustic probability P(x|ω) that the acoustic feature quantity x is observed for a recognition result candidate word string ω composed of one or more words $\omega_m$;
language model storage means for storing a language model and outputting a statistical/grammatical language probability $P_{\text{n-gram}}(\omega)$ / $P_{\text{cfg}}(\omega)$;
a speech recognition decoder that outputs a recognition result word string $\hat{\omega}$ based on the acoustic feature quantity x, the acoustic probability P(x|ω), and the statistical/grammatical language probability $P_{\text{n-gram}}(\omega)$ / $P_{\text{cfg}}(\omega)$; and
content search means for searching a content database for content using the recognition result word string $\hat{\omega}$ as a search key,
the content search apparatus is characterized by comprising:
user information storage means for storing the similarity $S_{i,j}$ between every two pieces of content among all content and, for each user $U_v$, the words $\omega_m$ (= content names $M_n$ ($M_1, M_2, \ldots, M$)) of the content searched in the past;
preference probability calculation means for calculating a first weight $\alpha(M_q, U)$ of user U for content name $M_q$ from the sum of the inter-content similarities $S_{M_q, M_n}$ between content name $M_q$ and each content name $M_n$ searched in the past by user U, calculating a preference probability $P^*(\omega_m)$ for each word $\omega_m$ to which the first weight $\alpha(M_q, U)$ is applied, and calculating the preference probability $P^*(\omega)$ for each word string ω containing one or more words $\omega_m$ as $P^*(\omega) = P^*(\omega_1, \omega_2, \ldots, \omega_m) = P^*(\omega_1) \times P^*(\omega_2) \times \cdots \times P^*(\omega_m)$; and
language probability calculation means for outputting a language probability P(ω) obtained by weighting the statistical/grammatical language probability $P_{\text{n-gram}}(\omega)$ / $P_{\text{cfg}}(\omega)$ with the preference probability $P^*(\omega)$.

The content search apparatus according to claim 1, wherein
the user information storage means stores content categories k and the search frequencies of all users for the content included in each category k, and
the preference probability calculation means calculates a second weight $\beta_k$ based on the ratio of all other search frequencies to the search frequency of user U in the category k corresponding to content name $M_q$, and calculates the preference probability $P^*(\omega_m)$ for each word $\omega_m$ by adding the second weight $\beta_k$ to the first weight $\alpha(M_q, U)$.

The content search apparatus according to claim 1, wherein
in the user information storage means, a plurality of users U are classified into categories C, and for each of the users $U_n$ included in each category C, the first weight $\alpha(M_q, U_n)$ of user $U_n$ for content name $M_q$, calculated from the sum of the inter-content similarities $S_{M_q, M_n}$ between content name $M_q$ and each content name $M_n$ searched in the past by that user, is stored, and
the preference probability calculation means calculates, for the category C to which user U belongs, a third weight $\gamma(M_q, U)$ based on the sum of the first weights $\alpha(M_q, U_n)$ of the users $U_n$ for content name $M_q$, and calculates the preference probability $P^*(\omega_m)$ for each word $\omega_m$ by adding the third weight $\gamma(M_q, U)$ to the first weight $\alpha(M_q, U)$ and/or the second weight $\beta_k$.

The content search apparatus according to claim 2, further comprising user evaluation means for displaying the recognition result word string $\hat{\omega}$ to the user and causing the preference probability $P^*(\omega)$ to be recalculated in response to the user's input operation evaluating whether the recognition result word string $\hat{\omega}$ is correct or incorrect.

The content search apparatus according to claim 1, wherein the speech recognition decoder uses a beam search method that prunes recognition candidate word strings for which the probability obtained by weighting the acoustic probability P(x|ω) with the language probability P(ω) is at or below a predetermined threshold, and outputs only the recognition result word string $\hat{\omega}$ with the highest weighted probability or the top N recognition result word strings in descending order of weighted probability.

The content search apparatus according to claim 1, wherein the content is music.

A content search program for causing a computer installed in an apparatus that searches a content database for content to function as:
acoustic feature quantity extraction means for extracting an acoustic feature quantity x from an input speech waveform;
acoustic model storage means for storing an acoustic model and outputting an acoustic probability P(x|ω) that the acoustic feature quantity x is observed for a recognition result candidate word string ω composed of one or more words $\omega_m$;
language model storage means for storing a language model and outputting a statistical/grammatical language probability $P_{\text{n-gram}}(\omega)$ / $P_{\text{cfg}}(\omega)$;
a speech recognition decoder that outputs a recognition result word string $\hat{\omega}$ based on the acoustic feature quantity x, the acoustic probability P(x|ω), and the statistical/grammatical language probability $P_{\text{n-gram}}(\omega)$ / $P_{\text{cfg}}(\omega)$; and
content search means for searching the content database for content using the recognition result word string $\hat{\omega}$ as a search key,
the program being characterized by further causing the computer to function as:
user information storage means for storing the similarity $S_{i,j}$ between every two pieces of content among all content and, for each user $U_v$, the words $\omega_m$ (= content names $M_n$ ($M_1, M_2, \ldots, M$)) of the content searched in the past;
preference probability calculation means for calculating a first weight $\alpha(M_q, U)$ of user U for content name $M_q$ from the sum of the inter-content similarities $S_{M_q, M_n}$ between content name $M_q$ and each content name $M_n$ searched in the past by user U, calculating a preference probability $P^*(\omega_m)$ for each word $\omega_m$ to which the first weight $\alpha(M_q, U)$ is applied, and calculating the preference probability $P^*(\omega)$ for each word string ω containing one or more words $\omega_m$ as $P^*(\omega) = P^*(\omega_1, \omega_2, \ldots, \omega_m) = P^*(\omega_1) \times P^*(\omega_2) \times \cdots \times P^*(\omega_m)$; and
language probability calculation means for outputting a language probability P(ω) obtained by weighting the statistical/grammatical language probability $P_{\text{n-gram}}(\omega)$ / $P_{\text{cfg}}(\omega)$ with the preference probability $P^*(\omega)$.

A content search method in an apparatus that extracts an acoustic feature quantity x from an input speech waveform, outputs a recognition result word string $\hat{\omega}$ based on the acoustic feature quantity x, an acoustic probability P(x|ω) that the acoustic feature quantity x is observed for a recognition result candidate word string ω composed of one or more words $\omega_m$, and a statistical/grammatical language probability $P_{\text{n-gram}}(\omega)$ / $P_{\text{cfg}}(\omega)$, and searches a content database for content using the recognition result word string $\hat{\omega}$ as a search key,
the apparatus having a user information storage unit for storing the similarity $S_{i,j}$ between every two pieces of content among all content and, for each user $U_v$, the words $\omega_m$ (= content names $M_n$ ($M_1, M_2, \ldots, M$)) of the content searched in the past,
the method being characterized by comprising:
a first step of calculating a first weight $\alpha(M_q, U)$ of user U for content name $M_q$ from the sum of the inter-content similarities $S_{M_q, M_n}$ between content name $M_q$ and each content name $M_n$ searched in the past by user U, calculating a preference probability $P^*(\omega_m)$ for each word $\omega_m$ to which the first weight $\alpha(M_q, U)$ is applied, and calculating the preference probability $P^*(\omega)$ for each word string ω containing one or more words $\omega_m$ as $P^*(\omega) = P^*(\omega_1, \omega_2, \ldots, \omega_m) = P^*(\omega_1) \times P^*(\omega_2) \times \cdots \times P^*(\omega_m)$; and
a second step of outputting a language probability P(ω) obtained by weighting the statistical/grammatical language probability $P_{\text{n-gram}}(\omega)$ / $P_{\text{cfg}}(\omega)$ with the preference probability $P^*(\omega)$.