JP6680009B2

JP6680009B2 - Search index generation device, search index generation method, voice search device, voice search method and program

Info

Publication number: JP6680009B2
Application number: JP2016051031A
Authority: JP
Inventors: 寛基富田
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2016-03-15
Filing date: 2016-03-15
Publication date: 2020-04-15
Anticipated expiration: 2036-03-15
Also published as: JP2017167265A

Description

The present invention relates to a search index generation device, a search index generation method, a voice search device, a voice search method, and a program.

Voice search uses a search technique that identifies, within a voice signal, the portion where speech corresponding to a target search word (query) is uttered. For this technique, it is important to achieve fast and accurate search.

As one such voice search technique, Non-Patent Document 1 discloses a technique for quickly comparing a search target voice signal with a query voice signal. In the technique disclosed in Non-Patent Document 1, the feature amount of the search target voice signal is compared with the feature amount of the query voice signal.

Y. Zhang and J. Glass, "An inner-product lower-bound estimate for dynamic time warping," in Proc. ICASSP, 2011, pp. 5660-5663.

When searching for a query voice signal, the technique disclosed in Non-Patent Document 1 sets a plurality of frames in the search target voice signal and creates a search index, which is a table of the probabilities that the speech feature amount of each frame matches the feature amount of each state of the phonemes of an acoustic model. The search is sped up by using this search index to locate the query voice signal. In this technique, the frame length, which is the time unit for analyzing speech features, is the time length of a state constituting a phoneme. To improve search accuracy, it is desirable to increase the number of states constituting a phoneme and to compare the features of the voice signal over shorter time segments. However, as the number of states per phoneme increases, the amount of search processing becomes enormous and the search time becomes long. The data size of the search index also grows. Conversely, if the number of states per phoneme is reduced, each extracted feature amount becomes an average over a long time, so the instantaneous features of the speech are lost and the accuracy of voice search may decrease. That is, in the technique disclosed in Non-Patent Document 1, the data size of the search index and the search accuracy are in a trade-off relationship.

The present invention has been made in view of the above circumstances, and an object thereof is to provide a search index generation device, a search index generation method, a voice search device, a voice search method, and a program capable of reducing the data size of a search index while maintaining the accuracy of voice search.

In order to achieve the above object, a search index generation device according to the present invention comprises:
acquisition means for acquiring a voice signal to be searched;
section setting means for setting frame sections, each being a unit for analyzing the feature amount of the acquired voice signal;
feature amount acquisition means for acquiring the feature amount of the voice signal to be searched for each frame section;
output probability acquisition means for acquiring, for each frame section, output probabilities, each being the probability that the feature amount of the voice signal to be searched matches the feature amount of a state constituting a phoneme of an acoustic model;
representative probability setting means for setting, as the representative output probability of each phoneme, the highest of the output probabilities acquired by the output probability acquisition means for the states constituting that phoneme; and
search index generation means for generating a search index in which, for each frame of the voice signal to be searched, the representative output probability is associated with each phoneme.

According to the present invention, the data size of the search index can be reduced while maintaining the accuracy of voice search.

A diagram showing the physical configuration of the voice search device according to Embodiment 1 of the present invention.
A diagram showing the functional configuration of the voice search device according to Embodiment 1 of the present invention.
A diagram showing the functional configuration of the search index generation unit according to Embodiment 1 of the present invention.
A diagram for explaining the states of a phoneme.
A diagram for explaining the search index before the representative probability replacement process.
(a) is a waveform diagram of the voice signal to be searched. (b) is a diagram showing the frames set in the voice signal to be searched. (c) is a diagram showing the likelihood acquisition sections designated in the voice signal to be searched.
A diagram for explaining the search index after the representative probability replacement process.
A diagram showing the functional configuration of the voice search unit according to Embodiment 1 of the present invention.
A diagram for explaining the frames set in a query phoneme string. (a) is a diagram showing the query phoneme string. (b) is a diagram showing the frames set in the query phoneme string.
A diagram for explaining the output probabilities of a query phoneme string.
A diagram for explaining the Lower-Bound process.
A flowchart showing the flow of the search index generation process executed by the voice search device according to Embodiment 1 of the present invention.
A flowchart showing the flow of the voice search process executed by the voice search device according to Embodiment 1 of the present invention.
A flowchart showing the flow of the voice search process executed by the voice search device according to Embodiment 1 of the present invention.
A diagram showing the functional configuration of the voice search unit according to Embodiment 2 of the present invention.
A diagram for explaining the case where a query is acquired as a voice signal. (a) is a waveform diagram of the query voice signal. (b) is a diagram showing the frames set in the query voice signal.
A diagram showing the functional configuration of the voice search unit according to Embodiment 3 of the present invention.
A diagram for explaining the query output probabilities after the representative probability replacement process.

Hereinafter, a search index generation device, a search index generation method, a voice search device, a voice search method, and a program according to embodiments of the present invention will be described with reference to the drawings. The same or corresponding parts in the drawings are given the same reference numerals.

(Embodiment 1)
As shown in FIG. 1, the voice search device 100 according to Embodiment 1 physically includes a ROM (Read Only Memory) 1, a RAM (Random Access Memory) 2, an external storage device 3, an input device 4, an output device 5, a CPU (Central Processing Unit) 6, and a bus 7.

The ROM 1 stores a search index generation program and a voice search program. The RAM 2 is used as a work area for the CPU 6.

The external storage device 3 is composed of, for example, a hard disk, and stores data such as the voice signal to be analyzed and the acoustic model. It also stores the search index that the voice search device 100 generates from the voice signal to be analyzed.

The input device 4 includes a keyboard for entering a query as text, a microphone for entering a query as a voice signal, and the like. The output device 5 includes, for example, a liquid crystal display screen, a speaker, and the like. The output device 5 outputs the voice data output by the CPU 6 from the speaker and displays on the screen information such as the position of the retrieved search word in the voice signal.

The bus 7 connects the ROM 1, the RAM 2, the external storage device 3, the input device 4, the output device 5, and the CPU 6. The CPU 6 reads the search index generation program and the voice search program stored in the ROM 1 into the RAM 2 and executes them, thereby realizing the functions described below.

As shown in FIG. 2, the voice search device 100 functionally includes a search index generation unit 110 and a voice search unit 130.

First, the configuration of the search index generation unit 110 will be described. As shown in FIG. 3, the search index generation unit 110 includes a voice signal storage unit 101, an acoustic model storage unit 102, an output probability storage unit 103, a voice signal acquisition unit 111, a frame setting unit 112, a feature amount acquisition unit 113, an output probability acquisition unit 114, and a representative probability setting unit 120. The representative probability setting unit 120 includes a compressed index generation unit 121. The voice signal storage unit 101, the acoustic model storage unit 102, and the output probability storage unit 103 are constructed in the storage area of the external storage device 3.

The voice signal storage unit 101 stores the voice signal to be searched. The voice signal to be searched is, for example, the audio of a news broadcast, a recorded meeting, a recorded lecture, or a movie.

The acoustic model storage unit 102 stores a monophone acoustic model. A monophone model is an acoustic model generated for each single phoneme and does not depend on adjacent phonemes. The voice search device 100 trains the monophone model by a general method and stores it in the acoustic model storage unit 102 in advance.

As the monophone model, for example, an HMM (Hidden Markov Model), which is an acoustic model used in general speech recognition, can be used. The HMM is a model for probabilistically estimating, by a statistical method, the phonemes that constitute a voice signal from that voice signal. The HMM uses standard patterns parameterized by transition probabilities, which represent the temporal fluctuation of states, and by the probability (output probability) that an input feature amount matches each state.

A phoneme is a unit of the components that make up speech uttered by a speaker. For example, the word "kizokuseido" is composed of the 11 phonemes "k, i, z, o, k, u, s, e, i, d, o". Each phoneme is further divided into a plurality of states.

A state is the smallest time unit constituting a phoneme. The case where the number of states defined for each phoneme is 3 will be described as an example. For example, as shown in FIG. 4, the phoneme "a" of the sound "a" is divided into three states: a first state "a1" that includes the start of utterance of the phoneme, a second state "a2" that is the intermediate state, and a third state "a3" that includes the end of utterance. That is, one phoneme is composed of three states. When every phoneme is composed of three states and m is the number of phonemes used in the acoustic model, there are (m × 3) states.

The feature amount of each state of a phoneme is a numerical value representing the features of the speech extracted from the voice signal for that state. The feature amount is obtained by combining frequency-axis feature parameters, obtained by transforming the voice data onto the frequency axis, with power feature parameters, obtained by calculating the sum of squares of the energy of the voice data or its logarithm.

For example, as is well known, the feature amount is configured as a 38-dimensional vector with 38 components in total: 12 frequency-axis feature parameter components (12 dimensions) and 1 power feature parameter component (1 dimension); 12 frequency-axis components (12 dimensions) and 1 power component (1 dimension) obtained by taking the difference from each component of the immediately preceding time window; and 12 frequency-axis components (12 dimensions) obtained by taking the difference of those differences.
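The 38-component layout described above can be sketched as follows. This is a minimal illustration, assuming the 12 frequency-axis components and the 1 power component of each time window have already been computed; the function name `build_feature_vector` and its argument layout are hypothetical and do not come from the patent.

```python
from typing import List, Sequence

N_FREQ = 12  # frequency-axis components per time window


def build_feature_vector(
    cur: Sequence[float], prev: Sequence[float], prev2: Sequence[float]
) -> List[float]:
    """Assemble a 38-dimensional feature vector (hypothetical helper).

    Each argument is a 13-element static vector for one time window:
    12 frequency-axis components followed by 1 power component.
    `prev` is the window just before `cur`, `prev2` the one before `prev`.
    """
    if not (len(cur) == len(prev) == len(prev2) == N_FREQ + 1):
        raise ValueError("each window needs 12 frequency + 1 power component")
    # First-order differences against the preceding window (12 + 1 components).
    delta = [c - p for c, p in zip(cur, prev)]
    delta_prev = [p - q for p, q in zip(prev, prev2)]
    # Difference of the differences, frequency-axis part only (12 components).
    delta2 = [d - dp for d, dp in zip(delta[:N_FREQ], delta_prev[:N_FREQ])]
    return list(cur) + delta + delta2  # 13 + 13 + 12 = 38
```

The power component contributes to the static and first-difference parts only, which is why the total is 38 rather than 39.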

Returning to FIG. 3, the output probability storage unit 103 stores the search index generated by the search index generation unit 110 before the representative probability replacement process, as shown in FIG. 5. It also stores the search index after the representative probability replacement process described later. The search index is a table of output probabilities: a plurality of frames are set in the voice signal to be searched, and for each frame the table holds the probability that the speech feature amount of the frame matches the feature amount of each state of the phonemes of the acoustic model.

The voice signal acquisition unit 111 acquires the voice signal to be searched from the voice signal storage unit 101.

The frame setting unit 112 sets frames, each of which is a unit section of the voice signal from which the feature amount is acquired. A frame is a time window for comparing the search target voice signal with the query voice signal. In the present embodiment, speech is detected by comparing the search target voice signal with the query voice signal for each phoneme state. The time length of a frame is, for example, 40 ms.

A method of setting frame sections in the voice signal to be searched will be described with reference to FIG. 6. FIG. 6(a) is a waveform diagram of a voice signal to be searched having a time length T from beginning to end. The vertical axis represents the strength of the voice signal, and the horizontal axis represents time. FIG. 6(b) shows the frames set in the voice signal shown in FIG. 6(a). As shown in FIG. 6(b), the frame setting unit 112 shifts a section of frame length t by one shift length S at a time and sets sections with frame numbers f_1 to f_N in the voice signal to be searched. The section of frame number f_1 is a section of time length t starting from the beginning of the voice signal. The section of frame number f_2 is a section of time length t starting from a position shifted by one shift length S from the beginning of the voice signal. In the same manner, the frame setting unit 112 shifts by the shift length S each time and sets sections up to frame number f_N.

The shift length S is a length that determines the accuracy of the search. It is a fixed value set shorter than the frame length t. For example, when the frame length is t = 40 ms, the shift length is set to S = 10 ms.
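The frame layout with t = 40 ms and S = 10 ms can be sketched as below. The function name `set_frames` is hypothetical, and the handling of the signal tail (only frames that fit entirely within the signal are kept) is an assumption, since the text does not state how the final partial frame is treated.

```python
from typing import List, Tuple


def set_frames(
    total_ms: int, frame_ms: int = 40, shift_ms: int = 10
) -> List[Tuple[int, int]]:
    """Return (start, end) times in ms for frames f_1 .. f_N.

    Assumption: a frame is emitted only if it fits entirely in the signal.
    """
    if not 0 < shift_ms < frame_ms:
        raise ValueError("shift length S must be positive and shorter than t")
    frames = []
    start = 0
    while start + frame_ms <= total_ms:
        frames.append((start, start + frame_ms))
        start += shift_ms  # consecutive frames overlap by t - S
    return frames


if __name__ == "__main__":
    for number, (start, end) in enumerate(set_frames(100), start=1):
        print(f"f_{number}: {start} ms to {end} ms")
```

Because S is shorter than t, adjacent frames overlap, so each instant of the signal is covered by several frames.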

The feature amount acquisition unit 113 acquires the feature amount of the voice signal to be searched for each frame section. Specifically, the feature amount acquisition unit 113 acquires the feature amount of the voice signal to be searched for each of the frames with frame numbers f_1 to f_N.

The output probability acquisition unit 114 acquires, for each frame section, the output probability, which is the probability that the feature amount of the voice signal to be searched matches the feature amount of each state of the phonemes included in the acoustic model, and stores it in association with each state of the phonemes of the acoustic model.

Specifically, the output probability acquisition unit 114 compares the acquired feature amounts with the feature amount of each state of the phonemes of the acoustic model. In this way it acquires, for each frame, the output probability that the feature amount of the voice signal contained in each of the frames f_1 to f_N matches the feature amount of each state of the phonemes of the acoustic model, and stores the result in the output probability storage unit 103 as a search index associated with each phoneme state. The table that stores these output probabilities is called the search index. The search index shown in FIG. 5 is the search index before the representative probability replacement process described later.

FIG. 5 is an example of a search index in which there are m types of phonemes and each phoneme has 3 states. The first column of FIG. 5 shows the frame numbers of the frames created by shifting by the shift length S each time. The probability that the feature amount of a frame matches the feature amount of a phoneme state is denoted f(x, y, z), where x (x = 1 to N) is the frame number, y (y = 1 to m) is the phoneme number, and z (z = 1 to 3) is the state number. For example, f(1,1,1) is the probability that the feature amount of the voice signal contained in the frame with frame number f_1 matches the feature amount of state 1 of phoneme 1 in the acoustic model. In general, f(x, y, z) is the probability that the feature amount of the voice signal contained in the frame with frame number f_x matches the feature amount of state z of phoneme y in the acoustic model.
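One way to picture the table f(x, y, z) is as a nested list indexed by frame, phoneme, and state. The sketch below assumes the acoustic model is available as a callable `output_prob(feature, phoneme, state)`; that callable, the function name `build_search_index`, and the zero-based indexing are all hypothetical choices for illustration.

```python
from typing import Callable, List, Sequence

# index[x][y][z] holds f(x+1, y+1, z+1); indices here are zero-based.
Index = List[List[List[float]]]


def build_search_index(
    frame_features: Sequence[Sequence[float]],
    n_phonemes: int,
    n_states: int,
    output_prob: Callable[[Sequence[float], int, int], float],
) -> Index:
    """Build the pre-replacement search index (one row per frame)."""
    return [
        [
            [output_prob(feature, y, z) for z in range(n_states)]
            for y in range(n_phonemes)
        ]
        for feature in frame_features
    ]
```

With N frames, m phonemes, and 3 states, the table holds N × m × 3 output probabilities.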

Returning to FIG. 3, for the output probabilities acquired by the output probability acquisition unit 114, the representative probability setting unit 120 sets the output probability of the state with the highest output probability among the states constituting each phoneme as the representative output probability of that phoneme. For example, the representative probability setting unit 120 compares the output probability f(1,1,1) of state 1, the output probability f(1,1,2) of state 2, and the output probability f(1,1,3) of state 3 of phoneme 1 in frame number f_1 of FIG. 5, and extracts the largest. If, for example, the output probability f(1,1,2) of state 2 is the largest, the representative probability setting unit 120 replaces the values of f(1,1,1), f(1,1,2), and f(1,1,3) with the value of f(1,1,2). That is, the value of f(1,1,2) is set as the output probability representing phoneme 1.

The representative probability setting unit 120 performs the same replacement process for phoneme 2 through phoneme m of frame f_1, setting the output probability of the state with the highest output probability as the representative output probability of each phoneme. The representative probability setting unit 120 performs the same replacement process for all frames.

The compressed index generation unit 121 replaces the output probability of each phoneme in the search index of FIG. 5 for the voice signal to be searched with the representative output probability, and generates the compressed search index shown in FIG. 7. Because the output probabilities of the three states constituting one phoneme have been replaced with the same value, only one value per phoneme needs to be stored, and the data size of the search index can be compressed to 1/(number of states per phoneme). The compressed index generation unit 121 stores the generated compressed search index after the replacement process, shown in FIG. 7, in the output probability storage unit 103.
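The replacement and compression steps can be sketched together as taking the maximum over the states of each phoneme. This is a minimal sketch assuming the pre-replacement index is a nested list `index[frame][phoneme][state]`; the function names and the storage layout are hypothetical and not specified by the patent.

```python
from typing import List


def compress_index(index: List[List[List[float]]]) -> List[List[float]]:
    """Keep, per frame and phoneme, only the highest state output probability."""
    return [[max(states) for states in frame] for frame in index]


def count_entries(table) -> int:
    """Count the stored probabilities in a nested list of any depth."""
    if isinstance(table, list):
        return sum(count_entries(item) for item in table)
    return 1


if __name__ == "__main__":
    # Illustrative values only: 1 frame, 2 phonemes, 3 states each.
    before = [[[0.1, 0.7, 0.2], [0.3, 0.2, 0.1]]]
    after = compress_index(before)
    print(after, count_entries(before), count_entries(after))
```

With 3 states per phoneme, the compressed table holds one third as many values as the original.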

Suppose instead that the output probabilities of the search index of FIG. 5, before the replacement process, were averaged over states 1 to 3. Even if the output probability of state 2 were extremely large, for example, the averaging would lose the information that the phoneme contains a state with an extremely large output probability whenever the output probabilities of state 1 and state 3 are small.

In contrast, the search index of FIG. 7, after the replacement process by the representative probability setting unit 120, retains the extremely large output probability values that were contained in the search index of FIG. 5 before the replacement process. The information that a phoneme contains a state with an extremely large output probability is therefore not lost.
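The difference between averaging and taking the maximum can be seen with a small numeric example. The probability values below are invented purely for illustration, and `summarize` is a hypothetical helper.

```python
from statistics import mean
from typing import Sequence, Tuple


def summarize(states: Sequence[float]) -> Tuple[float, float]:
    """Return (average, maximum) of one phoneme's state output probabilities."""
    return mean(states), max(states)


peaked = [0.05, 0.90, 0.05]  # state 2 matches strongly, states 1 and 3 barely
flat = [0.30, 0.35, 0.35]    # no state stands out

avg_peaked, max_peaked = summarize(peaked)
avg_flat, max_flat = summarize(flat)
print(f"average: peaked={avg_peaked:.3f} flat={avg_flat:.3f}")
print(f"maximum: peaked={max_peaked:.3f} flat={max_flat:.3f}")
```

Both phonemes average to roughly one third, so averaging cannot tell them apart, whereas the maximum (0.90 versus 0.35) preserves the strong match in state 2.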

Next, the configuration of the voice search unit 130 will be described. As shown in FIG. 8, the voice search unit 130 includes the acoustic model storage unit 102, the output probability storage unit 103, a time length storage unit 104, a query output probability storage unit 105, a triphone model storage unit 106, a search character string acquisition unit 131, a conversion unit 132, a frame string creation unit 133, a query output probability acquisition unit 134, a section designation unit 135, a second output probability acquisition unit 136, a replacement unit 137, a likelihood acquisition unit 138, a repeating unit 139, and a specifying unit 140. The acoustic model storage unit 102, the output probability storage unit 103, the time length storage unit 104, the query output probability storage unit 105, and the triphone model storage unit 106 are constructed in the storage area of the external storage device 3.

The acoustic model storage unit 102 stores the same monophone acoustic model that was used when the search index was generated. The output probability storage unit 103 stores the search index after the replacement process, shown in FIG. 7, generated by the search index generation unit 110. The time length storage unit 104 stores, for each state constituting a phoneme, the average duration calculated from a large amount of voice data. The query output probability storage unit 105 stores the probabilities (second probabilities) that the phonemes included in the query phoneme string generated by the voice search unit 130 match the feature amount of each state of the phonemes included in the acoustic model. The triphone model storage unit 106 stores a triphone acoustic model.

The search character string acquisition unit 131 acquires a search character string, for example one entered by the user via the input device 4. That is, the user enters the search word (query) into the voice search device 100 as a text character string.

The conversion unit 132 arranges the phonemes of the monophone model stored in the acoustic model storage unit 102 according to the search character string acquired by the search character string acquisition unit 131, thereby converting the search character string into a phoneme string. That is, the conversion unit 132 converts the search character string into a monophone phoneme string by arranging the phonemes (monophones) produced when each character is uttered, in the same order as the characters in the search character string.

For example, when the Japanese word "kizokuseido" is entered as the search character string, the conversion unit 132 converts it into a monophone phoneme string composed of the 11 monophone phonemes "k, i, z, o, k, u, s, e, i, d, o". Here, each phoneme is composed of three states. The conversion unit 132 therefore converts the search character string "kizokuseido" into a state string composed of 33 states.
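The expansion from 11 phonemes to 33 states can be sketched as below. The phoneme list is taken as given, since converting text to phonemes requires a pronunciation lexicon that the text does not detail; the function name `to_state_sequence` and the label format (phoneme followed by state number, as in "a1", "a2", "a3") are illustrative assumptions.

```python
from typing import List, Sequence


def to_state_sequence(phonemes: Sequence[str], n_states: int = 3) -> List[str]:
    """Expand each phoneme into its states, e.g. 'k' -> 'k1', 'k2', 'k3'."""
    return [
        f"{phoneme}{state}"
        for phoneme in phonemes
        for state in range(1, n_states + 1)
    ]


# Monophone phoneme string for "kizokuseido", as listed in the text.
QUERY_PHONEMES = ["k", "i", "z", "o", "k", "u", "s", "e", "i", "d", "o"]

if __name__ == "__main__":
    states = to_state_sequence(QUERY_PHONEMES)
    print(len(states), states[:6])
```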

Furthermore, the conversion unit 132 acquires from the time length storage unit 104 the time length of each of the 33 states that make up the converted state string. The conversion unit 132 then creates the query phoneme string, using the time length acquired for each state as the time length of that state.

The conversion unit 132 derives the sum of the time lengths of the 33 states acquired from the time length storage unit 104 as the utterance time length L over which the search character string "kizokuseido" is uttered. This utterance time length L is used as the time length of the likelihood acquisition section in the likelihood acquisition described later.
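The conversion from phonemes to a timed state sequence can be sketched as follows. This is a minimal illustration, not the device's implementation: the function names and the uniform 30 ms state duration are assumptions made for the example, whereas the real values would come from the time length storage unit 104.

```python
STATES_PER_PHONEME = 3  # each phoneme consists of three states


def build_state_sequence(phonemes, durations_ms):
    """Expand each phoneme into its states and attach a duration to each state."""
    sequence = []
    for phoneme in phonemes:
        for state in range(1, STATES_PER_PHONEME + 1):
            sequence.append((phoneme, state, durations_ms[(phoneme, state)]))
    return sequence


def utterance_length_ms(sequence):
    """Utterance time length L: the sum of all state durations."""
    return sum(duration for _, _, duration in sequence)


phonemes = ["k", "i", "z", "o", "k", "u", "s", "e", "i", "d", "o"]
# Hypothetical table: every state lasts 30 ms (illustrative values only).
durations_ms = {(p, s): 30 for p in set(phonemes) for s in (1, 2, 3)}

query_states = build_state_sequence(phonemes, durations_ms)
L_ms = utterance_length_ms(query_states)
```

With these illustrative durations, the 11 phonemes expand to 33 states and L is simply the sum of the 33 durations.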

The voice signal to be searched is not necessarily uttered at an average speed; voice signals uttered at various speeds can be search targets. The time lengths stored in the time length storage unit 104, however, are the average time lengths of each phoneme state calculated from a large amount of voice data. It is therefore desirable that the conversion unit 132 correct the time lengths acquired from the time length storage unit 104 before using them. For example, the user may input, from the input device 4, a correction coefficient that reflects the utterance speed of the voice signal to be searched, and the conversion unit 132 may correct the time lengths acquired from the time length storage unit 104 based on that coefficient before arranging the phonemes of the monophone model. Alternatively, the voice search device 100 may measure the utterance speed of the voice signal to be searched by counting the number of phonemes per unit time in the voice signal, and may set the correction coefficient itself.
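One way to picture the correction is as a single scale factor applied to every stored state duration. The sketch below is an assumption: the description does not say how the coefficient is derived from the measured speech rate, so the ratio used in `rate_coefficient` and both function names are hypothetical.

```python
def rate_coefficient(reference_phonemes_per_s, measured_phonemes_per_s):
    """Hypothetical rule: faster speech (higher measured rate) shortens the durations."""
    return reference_phonemes_per_s / measured_phonemes_per_s


def correct_durations(durations_ms, coefficient):
    """Scale every stored average state duration by the correction coefficient."""
    return [duration * coefficient for duration in durations_ms]


# Speech twice as fast as the reference halves each state duration.
coefficient = rate_coefficient(10.0, 20.0)
corrected = correct_durations([30, 40], coefficient)
```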

The frame sequence creation unit 133 creates a frame sequence by dividing the query phoneme string created by the conversion unit 132 into sections of the frame length. The frames set on the query phoneme string are described with reference to FIG. 9. FIG. 9(a) shows the query phoneme string, in which the phoneme states are arranged with lengths corresponding to the acquired time lengths. That is, the acoustic model of state 1 of the first phoneme "k" of the query phoneme string "k, i, z, o, k, u, s, e, i, d, o" is laid out for the time length of state 1 of phoneme "k" stored in the time length storage unit 104. Next, the acoustic model of state 2 of phoneme "k" is laid out for the time length of state 2 of phoneme "k" stored in the time length storage unit 104. The same is done for each following state, up to the acoustic model of state 3 of the final phoneme "o", which is laid out for the time length of state 3 of phoneme "o" stored in the time length storage unit 104. The total time length of the query phoneme string arranged in this way is the utterance time length L.

FIG. 9(b) shows the frames set on the query phoneme string of FIG. 9(a). As shown in FIG. 9(b), the frame sequence creation unit 133 shifts a section of frame length t by one shift length S at a time and sets sections with frame numbers g_1 to g_k on the query phoneme string. The frame length t is the same as the frame length t used when the search index was generated (for example, 40 ms), and the shift length S is likewise the same as at search index generation (for example, 10 ms). The section of frame number g_1 is a section of time length t starting at the beginning of the query phoneme string. The section of frame number g_2 is a section of time length t starting at a position shifted by one shift length S from the beginning of the query phoneme string. In the same way, the frame sequence creation unit 133 shifts by the shift length S each time and sets frames up to frame number g_k.

Returning to FIG. 8, the query output probability acquisition unit 134 acquires, for each frame (g_1 to g_k), the probability (second probability) that each state of the query phoneme string matches the feature amount of each state of the phonemes included in the acoustic model, and stores it in the query output probability storage unit 105 in association with each phoneme state. FIG. 10 shows an example in which there are m types of phonemes and each phoneme has three states. The number of phoneme types m and the number of states 3 are the same as when the search index was created. The first column of FIG. 10 shows the frame numbers of the frames in the frame sequence created by the frame sequence creation unit 133. The probability that the feature amount of a frame (g_1 to g_k) in the frame sequence matches the feature amount of each phoneme state is written g(a, y, z), where a (a = 1 to k) is the frame number in the query phoneme string, y (y = 1 to m) is the phoneme number, and z (z = 1 to 3) is the state number.

The number of frames k of the query phoneme string is the natural number obtained by computing k = L / S from the utterance time length L of the query phoneme string and the shift length S, and discarding the fractional part.
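The frame layout and the frame count k can be sketched together. The function name and the millisecond units are assumptions for illustration. Note that with k = L / S the last few frames nominally extend past the end of the query phoneme string; the text does not say how this is handled, so the sketch only reports the nominal intervals.

```python
def query_frames(utterance_ms, frame_ms=40, shift_ms=10):
    """Return nominal (start, end) times in ms for frames g_1..g_k, with k = floor(L / S)."""
    k = utterance_ms // shift_ms
    return [(i * shift_ms, i * shift_ms + frame_ms) for i in range(k)]


# Illustrative utterance time length L = 990 ms, frame length t = 40 ms, shift S = 10 ms.
frames = query_frames(990)
```

Here `frames[0]` corresponds to g_1 and `frames[1]` to g_2, each starting one shift length later than the previous frame.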

Returning to FIG. 8, the section designating unit 135 designates, in the voice signal, a plurality of sections of the utterance time length L of the query phoneme string as likelihood acquisition sections. A likelihood acquisition section is a section for which the likelihood that the query phoneme string is uttered in that section is acquired. The likelihood is an index of the degree of similarity between the voice to be searched and the query phoneme string. This is described with reference to FIG. 6(c). The section designating unit 135 first designates, as the first likelihood acquisition section, the section of utterance time length L starting at the first frame f_1 of the voice signal to be searched. In the present embodiment the query phoneme string consists of k frames, so the section from the first frame f_1 to the k-th frame f_k is designated as the first likelihood acquisition section.

Next, the section designating unit 135 designates the section from the second frame f_2 to the (k+1)-th frame f_{k+1} of the voice signal as the second likelihood acquisition section. Sections are designated in the same way up to the P-th likelihood acquisition section. The number P of likelihood acquisition sections that can be designated in the voice signal to be searched is the natural number obtained by computing P = (T − L + S) / S from the time length T of the voice signal, the time length L of the likelihood acquisition section (the utterance time length of the query phoneme string), and the shift length S, and discarding the fractional part.
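As a rough check of the formula for P, the sketch below computes the section count and the frame range of each section. The function names are hypothetical, times are assumed to be in milliseconds, and the clamp to zero for a signal shorter than the query is an added guard that the text does not mention.

```python
def likelihood_section_count(signal_ms, utterance_ms, shift_ms):
    """P = floor((T - L + S) / S); clamped to 0 when the signal is shorter than L."""
    return max(0, (signal_ms - utterance_ms + shift_ms) // shift_ms)


def section_frame_range(s, k):
    """1-based first and last voice-signal frame numbers of the s-th section."""
    return (s, s + k - 1)


# Illustrative values: T = 60 s, L = 990 ms, S = 10 ms, k = 99 query frames.
P = likelihood_section_count(60_000, 990, 10)
second_section = section_frame_range(2, 99)
```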

Returning to FIG. 8, the second output probability acquisition unit 136 acquires the probability (third probability) that each frame of the query phoneme string matches each frame of the voice signal to be searched. Specifically, it multiplies the probability that each frame of the query phoneme string is in each phoneme state (second probability) by the probability stored in the search index of the voice signal to be searched (first probability), thereby obtaining the probability (third probability) that each frame (g_1 to g_k) of the query phoneme string matches each frame (f_1 to f_N) of the voice signal to be searched.

This is described concretely with reference to FIGS. 7 and 10. When the section designating unit 135 designates the first likelihood acquisition section starting at the first frame f_1 of the voice signal, the second output probability acquisition unit 136 multiplies together the output probabilities of each phoneme state for the first frame g_1 of the query phoneme string and the first frame f_1 of the voice signal, thereby acquiring the probability that the first frame g_1 of the query phoneme string matches the first frame f_1 of the voice signal to be searched.

Specifically, the second output probability acquisition unit 136 obtains from Equation (1) the probability P(1,1,1) that state 1 of the first frame g_1 of the query phoneme string is phoneme 1 of the first frame f_1 of the voice signal. It obtains from Equation (2) the probability P(1,2,1) that state 1 of the first frame g_1 of the query phoneme string is phoneme 2 of the first frame f_1 of the voice signal. Continuing in the same way, it obtains from Equation (3) the probability P(1,m,3) that state 3 of the first frame g_1 of the query phoneme string is phoneme m of the first frame f_1 of the voice signal.
P(1,1,1) = f(1,1) × g(1,1,1) ... Equation (1)
P(1,2,1) = f(1,2) × g(1,2,1) ... Equation (2)
P(1,m,3) = f(1,m) × g(1,m,3) ... Equation (3)

In this way, the second output probability acquisition unit 136 acquires (m × 3) probabilities (third probabilities) for the first frame g_1 of the query phoneme string. By multiplying these (m × 3) probabilities together, it acquires, by Equation (4), the output probability P(1,1), which is the probability that the first frame g_1 of the query phoneme string matches the first frame f_1 of the voice signal to be searched.

Next, the second output probability acquisition unit 136 multiplies together the output probabilities of each phoneme state for the second frame g_2 of the query phoneme string and the second frame f_2 of the voice signal, thereby acquiring the probability that the second frame g_2 of the query phoneme string matches the second frame f_2 of the voice signal to be searched. Specifically, it acquires (m × 3) output probabilities for the second frame g_2 of the query phoneme string. By multiplying these (m × 3) output probabilities together, it acquires, by Equation (5), the output probability P(1,2), which is the probability that the second frame g_2 of the query phoneme string matches the second frame f_2 of the voice signal to be searched.

In the same way, the second output probability acquisition unit 136 acquires, by Equation (6), the output probabilities up to P(1,k) for the k-th frame g_k of the query phoneme string.

When the output probabilities have been acquired for the case where the query phoneme string starts at the first frame f_1 of the voice signal to be searched, the section designating unit 135 designates the second likelihood acquisition section, which starts at the second frame f_2 of the voice signal. The second output probability acquisition unit 136 aligns the first frame g_1 of the query phoneme string with the second frame f_2 of the voice signal to be searched and performs the same calculation.

In the same way, the second output probability acquisition unit 136 obtains the output probabilities up to the P-th likelihood acquisition section. For the case where the first frame g_1 of the query phoneme string is aligned with the s-th frame f_s of the voice signal to be searched (the s-th likelihood acquisition section), it obtains the output probability of the j-th frame g_j of the query phoneme string by Equation (8).
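Equations (4) to (8) are not reproduced in this text, so the following is only a sketch of what the surrounding description suggests: each of the (m × 3) third probabilities is f(frame, y) × g(j, y, z) as in Equations (1) to (3), and the output probability of a frame pair is their product. The function names, the 0-based indexing, and the list layouts are assumptions for illustration.

```python
import math


def frame_match_probability(f_row, g_frame):
    """Product of f(y) * g(y, z) over all phonemes y and states z for one frame pair.

    f_row[y]      : first probability (search index) for phoneme y in a signal frame
    g_frame[y][z] : second probability for phoneme y, state z in a query frame
    """
    return math.prod(
        f_row[y] * g_frame[y][z]
        for y in range(len(f_row))
        for z in range(len(g_frame[y]))
    )


def section_output_probabilities(index, query, s):
    """Output probabilities of query frames when g_1 is aligned with signal frame s (0-based)."""
    return [frame_match_probability(index[s + j], query[j]) for j in range(len(query))]


# Tiny illustrative case: m = 2 phonemes, 3 states, 2 signal frames, 1 query frame.
index = [[0.5, 1.0], [1.0, 1.0]]
query = [[[1.0, 0.5, 1.0], [1.0, 1.0, 1.0]]]
probs_at_first = section_output_probabilities(index, query, 0)
```

In practice such products of many probabilities become very small, so an implementation would likely work with logarithms; that choice is not stated in the text.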

Returning to FIG. 8, the replacing unit 137 replaces each output probability acquired by the second output probability acquisition unit 136 with the maximum output probability among that frame and several adjacent frames before and after it. This replacement is called Lower-Bound processing.

The Lower-Bound processing is described concretely with reference to FIG. 11. In FIG. 11, the solid line shows the output probability acquired for each frame. The vertical axis shows the output probability, drawn so that the probability is higher toward the bottom, and the horizontal axis shows time. The replacing unit 137 replaces the output probability of each frame with the maximum output probability among that frame, the N1 frames before it, and the N2 frames after it. N1 and N2 are natural numbers including 0, but at least one of them is nonzero.

The case where the first frame g_1 of the query phoneme string is aligned with the first frame f_1 of the voice signal is described with N1 = 2 and N2 = 2. Because no frame precedes the first frame g_1 of the query phoneme string, the replacing unit 137 replaces its output probability P(1,1) with the maximum among P(1,1) of the first frame g_1 itself, P(1,2) of the following second frame g_2, and P(1,3) of the third frame g_3. The replacing unit 137 replaces the output probability P(1,2) of the second frame g_2 with the maximum among P(1,1) of the preceding first frame g_1, P(1,2) of the second frame g_2 itself, P(1,3) of the following third frame g_3, and P(1,4) of the fourth frame g_4. The replacing unit 137 replaces the output probability P(1,3) of the third frame g_3 with the maximum among P(1,1) of the preceding first frame g_1 and P(1,2) of the second frame g_2, P(1,3) of the third frame g_3 itself, and P(1,4) of the following fourth frame g_4 and P(1,5) of the fifth frame g_5. The replacing unit 137 performs this replacement up to the k-th frame. As a result, the output probability shown by the solid line in FIG. 11 is converted into an output probability whose value changes less in the time direction, as shown by the broken line for the output probability after Lower-Bound processing.
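The replacement described above is a sliding-window maximum that is truncated at the edges of the frame sequence. A minimal sketch, with a hypothetical function name and illustrative probabilities:

```python
def lower_bound(probs, n1, n2):
    """Replace each value with the maximum over itself, the n1 values before it, and the n2 after it."""
    return [
        max(probs[max(0, i - n1): i + n2 + 1])  # the window is clipped at both ends of the list
        for i in range(len(probs))
    ]


# Illustrative output probabilities for five query frames, with N1 = N2 = 2.
smoothed = lower_bound([0.1, 0.5, 0.2, 0.05, 0.3], 2, 2)
```

The single high value at the second frame spreads to its neighbors, which is the flattening effect shown by the broken line in FIG. 11.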

Returning to FIG. 8, the likelihood acquisition unit 138 acquires, based on the output probabilities after replacement by the replacing unit 137, a likelihood indicating how plausible it is that the likelihood acquisition section designated by the section designating unit 135 is a section in which the query phoneme string is uttered. Specifically, it takes the logarithm of each output probability after replacement and sums these values over all frames from the beginning to the end of the likelihood acquisition section (k frames in this example) to obtain the likelihood of the section. Thus, the more frames with high output probability a likelihood acquisition section contains, the higher the likelihood acquired by the likelihood acquisition unit 138.
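The likelihood of one section is therefore a sum of log probabilities. The sketch below uses a hypothetical function name and assumes every probability is strictly positive; how zero probabilities are treated is not stated in the text.

```python
import math


def section_likelihood(probs):
    """Sum of log output probabilities over all frames of one likelihood acquisition section."""
    return sum(math.log(p) for p in probs)  # assumes every p > 0


high = section_likelihood([0.9, 0.9, 0.9])
low = section_likelihood([0.5, 0.5, 0.5])
```

Because the logarithm is monotonic, a section with more high-probability frames receives a higher (less negative) likelihood.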

The repeating unit 139 changes the position in the voice signal of the likelihood acquisition section designated by the section designating unit 135, and controls each unit so that the processing of the section designating unit 135, the second output probability acquisition unit 136, the replacing unit 137, and the likelihood acquisition unit 138 is repeated. The first pass obtains the likelihood of the first likelihood acquisition section, which starts at the first frame f_1 of the voice signal to be searched, so the second pass obtains the likelihood of the second likelihood acquisition section, which starts at the second frame f_2. The section is then shifted one frame at a time until the likelihood of the P-th likelihood acquisition section has been obtained.

The identifying unit 140 identifies, based on the P likelihoods acquired by the likelihood acquisition unit 138, the estimated sections of the voice signal to be searched in which the query phoneme string is estimated to be uttered. To do so, the identifying unit 140 first makes a preliminary selection: based on the likelihoods acquired by the likelihood acquisition unit 138, it selects x sections in descending order of likelihood from the likelihood acquisition sections designated by the section designating unit 135 as candidates for the estimated sections in which the voice corresponding to the search character string is estimated to be uttered, and excludes the remaining likelihood acquisition sections from the candidates.

Because the likelihood acquisition sections designated by the section designating unit 135 overlap heavily, sections with high likelihood often occur consecutively in time. If the identifying unit 140 simply selected candidates in descending order of likelihood, the selected sections would therefore be likely to concentrate in one part of the voice signal to be searched. To avoid this, the identifying unit 140 defines a predetermined selection time length and, for each span of that length, selects the one likelihood acquisition section with the maximum likelihood among the likelihood acquisition sections that start within the span. The selection time length is set shorter than the utterance time length L of the likelihood acquisition section, for example to 1/m of L (for example, m = 2). For example, assuming that the utterance time length of the search word "category" (カテゴリ) is 2 seconds or more (L ≥ 2 seconds), m = 2 and the selection time length is set to 1 second. One likelihood acquisition section is selected as a candidate for each selection time length (L/m), and the rest are excluded from the candidates. This allows the identifying unit 140 to select candidates for the estimated sections evenly across the whole voice signal to be searched. From the likelihood acquisition sections selected for each selection time length (L/m), the identifying unit 140 then selects the x sections with the highest likelihood.
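The two-stage preliminary selection (best section per selection window, then the top x overall) can be sketched as below. The function name, the use of frame indices instead of times, and the sample likelihoods are assumptions for illustration.

```python
def select_candidates(likelihoods, window, x):
    """Pick the best section start in each window, then keep the x best of those.

    likelihoods[i] : likelihood of the section starting at frame i (0-based)
    window         : selection time length expressed as a number of start frames
    """
    best_per_window = []
    for start in range(0, len(likelihoods), window):
        span = range(start, min(start + window, len(likelihoods)))
        best_per_window.append(max(span, key=lambda i: likelihoods[i]))
    best_per_window.sort(key=lambda i: likelihoods[i], reverse=True)
    return best_per_window[:x]


# Illustrative log-likelihoods for eight section start positions.
candidates = select_candidates([-5, -1, -2, -9, -8, -3, -4, -7], window=3, x=2)
```

Without the windowing step, positions 1 and 2 would both be chosen even though they describe nearly the same stretch of audio.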

Next, for the x selected sections, the identifying unit 140 executes a more accurate likelihood acquisition process based on the triphone model and dynamic programming (DP matching). DP matching is a technique that selects state transitions so that the likelihood of the analysis section is maximized. Because the triphone model requires state transitions to and from the preceding and following phonemes to be taken into account, DP matching is used to determine those state transitions so that the likelihood of the likelihood acquisition section is maximized.

The identifying unit 140 searches, by DP matching, for the correspondence between the feature amounts of the voice signal and the triphone models included in the triphone phoneme string. Then, based on the likelihood under the triphone model, the identifying unit 140 identifies, from among the x sections it preliminarily selected, the estimated sections of the voice signal to be searched in which the voice corresponding to the search character string is estimated to be uttered. For example, the identifying unit 140 identifies a predetermined number of sections as estimated sections in descending order of triphone-model likelihood. Alternatively, it identifies sections whose likelihood is greater than or equal to a predetermined value as estimated sections. The position information of the sections identified by the identifying unit 140 is displayed externally as the final search result via the screen of the output device 5.

The search index generation processing executed by the voice search device 100 having the physical and functional configuration described above is described with reference to the flowchart shown in FIG. 12.

It is assumed that the voice data to be searched is stored in advance in the voice signal storage unit 101 and that the acoustic model is stored in the acoustic model storage unit 102. The flowchart shown in FIG. 12 starts when the CPU 6 reads the search index generation program from the ROM 1 and executes it.

When the search index generation program is executed, the voice signal acquisition unit 111 reads the voice signal to be searched from the voice signal storage unit 101 (step S11). Next, the frame setting unit 112 sets frames that divide the voice signal by frame length, as described with reference to FIG. 6 (step S12). Next, the feature amount acquisition unit 113 acquires the feature amount of the voice signal to be searched for each of the frames numbered f_1 to f_N (step S13).

Next, the output probability acquisition unit 114 acquires the output probabilities, that is, the probabilities that the sections with frame numbers f_1 to f_N set on the voice signal to be searched match each state of the phonemes of the acoustic model, generates the search index before replacement processing as shown in FIG. 5, and stores it in the output probability storage unit 103 (step S14).

Next, for the search index before replacement processing shown in FIG. 5 and acquired by the output probability acquisition unit 114, the representative probability setting unit 120 sets, for each phoneme, the output probability of the state with the highest output probability among the states that make up the phoneme as the representative output probability of that phoneme. The representative probability setting unit 120 then replaces the output probabilities of the phoneme with the extracted representative output probability, thereby creating the search index after replacement with representative output probabilities, as shown in FIG. 7 (step S15).
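Step S15 amounts to taking, for each frame and each phoneme, the maximum over that phoneme's state probabilities. A minimal sketch, assuming a hypothetical nested-list layout `index[frame][phoneme][state]` and that the replaced index stores one value per phoneme:

```python
def to_representative_index(index):
    """Collapse per-state output probabilities into one representative value per phoneme.

    index[frame][phoneme] is a list of output probabilities, one per state.
    The result holds, per frame and phoneme, the maximum over that phoneme's states.
    """
    return [[max(states) for states in frame] for frame in index]


# One frame, two phonemes, three states each (illustrative values).
raw_index = [[[0.1, 0.4, 0.2], [0.3, 0.3, 0.9]]]
representative = to_representative_index(raw_index)
```

Storing one value per phoneme instead of one per state is what reduces the index size relative to the index before replacement.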

Next, the voice search processing executed by the voice search device 100 is described with reference to the flowcharts shown in FIGS. 13 and 14.

In advance, the user stores the monophone acoustic model in the acoustic model storage unit 102, the average time length of each phoneme state in the time length storage unit 104, and the triphone acoustic model in the triphone model storage unit 106. In addition, the search index after replacement with representative output probabilities shown in FIG. 7 (first probability) is created in advance from the voice signal to be searched and stored in the output probability storage unit 103.

The flowchart shown in FIG. 13 starts when the CPU 6 reads the voice search program from the ROM 1 and executes it, and the user inputs a search word (query) as text data from the input device 4.

First, the processing by which the voice search device 100 obtains the output probabilities of the search word (query) is described with reference to FIG. 13.

When the user inputs a search word (query) from the input device 4, the search character string acquisition unit 131 acquires the query. The conversion unit 132 then converts the query, acquired as text data, into a monophone phoneme string (step S31). For example, when the Japanese word "kizokuseido" (キゾクセイド) is input as the search character string, the conversion unit 132 converts it into a monophone phoneme string consisting of the 11 monophone phonemes "k, i, z, o, k, u, s, e, i, d, o". Because each phoneme consists of three states, the conversion unit 132 converts the search character string "kizokuseido" into a state sequence consisting of 33 states.

Next, the conversion unit 132 arranges the phonemes of the monophone model stored in the acoustic model storage unit 102 according to the search character string acquired by the search character string acquisition unit 131 (step S32).

Furthermore, the conversion unit 132 acquires the time length of each of the 33 converted states from the time length storage unit 104 (step S33). The conversion unit 132 then creates the query phoneme string, in which the monophone models of the 33 states are arranged with the acquired time lengths.

If the user has input a correction coefficient for adjusting the time lengths to the speech rate of the voice signal to be searched, the conversion unit 132 corrects the time lengths acquired from the time length storage unit 104 before creating the query phoneme string.

Next, the conversion unit 132 derives the sum of the time lengths of the 33 states acquired from the time length storage unit 104 as the utterance time length L (the length of the likelihood acquisition section) over which the search character string "kizokuseido" is uttered (step S34).

次に、フレーム列作成部１３３は、図９に示すように、変換部１３２が作成したクエリ音素列にフレームｇ_１からｇ_ｋを設定する（ステップＳ３５）。 Next, as shown in FIG. 9, the frame sequence creation unit 133 sets frames g_1 to g_k in the query phoneme string created by the conversion unit 132 (step S35).

次に、クエリ出力確率取得部１３４は、クエリ音素列の各状態が音響モデルの音素の各状態と一致するクエリ音素列の出力確率（第２の確率）を取得し、図１０に示すように、取得した出力確率を音素の各状態と対応付けてクエリ出力確率記憶部１０５に記憶する（ステップＳ３６）。以上の処理により、クエリ音素列の出力確率の生成処理は完了する。 Next, the query output probability acquisition unit 134 acquires the output probability of the query phoneme string (second probability), that is, the probability that each state of the query phoneme string matches each state of the phonemes of the acoustic model, and, as shown in FIG. 10, stores the acquired output probabilities in the query output probability storage unit 105 in association with each state of the phonemes (step S36). This completes the process of generating the output probabilities of the query phoneme string.

次に、図１４を参照しながら、クエリの検索処理について説明する。クエリ音素列の出力確率（第２の確率）の取得が終わると、区間指定部１３５は、クエリ音素列が検索対象の音声信号と一致する確率（第３の確率）を取得する尤度取得区間を複数設定し、尤度取得部１３８は、それぞれの尤度取得区間からクエリ音素列が発せられている尤度を取得する。 Next, the query search process will be described with reference to FIG. 14. Once the output probabilities of the query phoneme string (second probability) have been acquired, the section designating unit 135 sets a plurality of likelihood acquisition sections, in each of which the probability that the query phoneme string matches the search target voice signal (third probability) is acquired, and the likelihood acquisition unit 138 acquires, for each likelihood acquisition section, the likelihood that the query phoneme string is uttered in that section.

そのために、区間指定部１３５は、まず、検索インデックスの先頭フレームｆ_１から始まる第１尤度取得区間を指定する（ステップＳ４１）。そして、第２出力確率取得部１３６は、式（４）によりクエリ音声信号の第１フレームｇ_１が検索対象の音声信号の第１フレームｆ_１と一致する確率（第３の確率）を求める。同様にして、第２出力確率取得部１３６は、第１尤度取得区間に含まれるクエリ音素列の第ｋフレームｇ_ｋまでの出力確率（第３の確率）を式（６）により求める（ステップＳ４２）。 To that end, the section designating unit 135 first designates the first likelihood acquisition section, which starts from the first frame f_1 of the search index (step S41). The second output probability acquisition unit 136 then obtains, by equation (4), the probability (third probability) that the first frame g_1 of the query voice signal matches the first frame f_1 of the search target voice signal. In the same way, the second output probability acquisition unit 136 obtains, by equation (6), the output probabilities (third probability) up to the k-th frame g_k of the query phoneme string included in the first likelihood acquisition section (step S42).

第２出力確率取得部１３６が出力確率を取得すると、置換部１３７は、フレーム毎に取得した出力確率を、そのフレームとそのフレーム前のＮ１個のフレームとそのフレーム後のＮ２個のフレームの、合計（１＋Ｎ１＋Ｎ２）個のフレームの中で最大の出力確率に置き換えることにより、Ｌｏｗｅｒ−Ｂｏｕｎｄ化処理を実行する（ステップＳ４３）。 When the second output probability acquisition unit 136 has acquired the output probabilities, the replacement unit 137 executes the Lower-Bound process by replacing the output probability acquired for each frame with the maximum output probability among a total of (1 + N1 + N2) frames: that frame, the N1 frames before it, and the N2 frames after it (step S43).
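The Lower-Bound process of step S43 is a sliding-window maximum. A minimal Python sketch follows; the function name `lower_bound` and the handling of frames near the section boundaries (the window is simply clipped) are assumptions, since the description does not say what happens when fewer than N1 or N2 neighbours exist.

```python
def lower_bound(probs, n1, n2):
    """Replace each frame's probability with the maximum over that frame,
    the n1 frames before it and the n2 frames after it.

    Windows are clipped at the ends of the list (an assumption).
    """
    out = []
    for i in range(len(probs)):
        lo = max(0, i - n1)
        hi = min(len(probs), i + n2 + 1)
        out.append(max(probs[lo:hi]))
    return out


# Illustrative per-frame output probabilities, not taken from the patent.
frame_probs = [0.1, 0.5, 0.2, 0.05, 0.3]
smoothed = lower_bound(frame_probs, n1=1, n2=1)
print(smoothed)  # [0.5, 0.5, 0.5, 0.3, 0.3]
```

Because every value can only stay the same or increase, a brief timing mismatch between the query and the search target is less likely to pull the section likelihood down.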

尤度取得部１３８は、Ｌｏｗｅｒ−Ｂｏｕｎｄ化処理後の出力確率をフレームごとに対数をとって加算することにより、区間指定部１３５が指定した第１尤度取得区間の尤度を取得する（ステップＳ４４）。尤度取得部１３８が尤度を取得すると、繰り返し部１３９は、検索対象の音声信号における全区間の尤度取得が終了したか否かを判別する（ステップＳ４５）。 The likelihood acquisition unit 138 acquires the likelihood of the first likelihood acquisition section designated by the section designating unit 135 by taking the logarithm of the output probability of each frame after the Lower-Bound process and summing the results (step S44). When the likelihood acquisition unit 138 has acquired the likelihood, the repeating unit 139 determines whether likelihood acquisition has been completed for all sections of the search target voice signal (step S45).
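The likelihood of step S44 is a sum of per-frame log probabilities. In the sketch below, the small `floor` value that guards against taking the logarithm of zero is an added assumption and is not part of the description.

```python
import math


def section_log_likelihood(frame_probs, floor=1e-300):
    """Sum the logarithms of the per-frame output probabilities.

    `floor` avoids log(0); it is a hypothetical safeguard, not something
    the patent specifies.
    """
    return sum(math.log(max(p, floor)) for p in frame_probs)


# Illustrative values after the Lower-Bound process.
likelihood = section_log_likelihood([0.5, 0.5, 0.5, 0.3, 0.3])
```

Summing logarithms is equivalent to taking the logarithm of the product of the frame probabilities, while avoiding numerical underflow for long sections.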

全区間の尤度取得が終了していない場合（ステップＳ４５：Ｎｏ）、繰り返し部１３９は、検索インデックスの位置を１フレーム進めた次の尤度取得区間を指定する（ステップＳ４６）。そして、区間指定部１３５が新たに指定した尤度取得区間に対して上述したステップＳ４２〜Ｓ４５の処理を繰り返す。 When the likelihood acquisition for all sections has not been completed (step S45: No), the repeating unit 139 specifies the next likelihood acquisition section that is one frame ahead of the search index position (step S46). Then, the processing of steps S42 to S45 described above is repeated for the likelihood acquisition section newly specified by the section specifying unit 135.

区間指定部１３５が第ｓ尤度取得区間を指定すると、第２出力確率取得部１３６は、第ｓ尤度取得区間に含まれるｋ個のフレームのそれぞれについて、式（８）により出力確率を求める（ステップＳ４２）。そして、求めたフレーム毎の出力確率をＬｏｗｅｒ−Ｂｏｕｎｄ化処理を実行する（ステップＳ４３）。尤度取得部１３８は、Ｌｏｗｅｒ−Ｂｏｕｎｄ化処理後の出力確率をフレームごとに対数をとって加算することにより、区間指定部１３５が指定した尤度取得区間の尤度を取得する（ステップＳ４４）。 When the section designating unit 135 designates the s-th likelihood acquisition section, the second output probability acquisition unit 136 obtains the output probability of each of the k frames included in the s-th likelihood acquisition section by equation (8) (step S42). The Lower-Bound process is then executed on the output probability obtained for each frame (step S43). The likelihood acquisition unit 138 acquires the likelihood of the likelihood acquisition section designated by the section designating unit 135 by taking the logarithm of the output probability of each frame after the Lower-Bound process and summing the results (step S44).

このように、繰り返し部１３９は、第Ｐ尤度取得区間までの尤度を順次取得するように、区間指定部１３５、第２出力確率取得部１３６、置換部１３７、尤度取得部１３８を制御する。最終的に、全区間の尤度取得が終了すると（ステップＳ４５：ＹＥＳ）、音声検索装置１００は、取得した尤度に基づいてクエリ音声信号に対応する区間を特定する処理に移行する。 In this way, the repeating unit 139 controls the section designating unit 135, the second output probability acquisition unit 136, the replacement unit 137, and the likelihood acquisition unit 138 so that the likelihoods up to the P-th likelihood acquisition section are acquired in sequence. When likelihood acquisition has finally been completed for all sections (step S45: YES), the voice search device 100 proceeds to the process of identifying the section corresponding to the query voice signal based on the acquired likelihoods.
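The loop over steps S41 to S46 can be sketched as follows. Equations (4), (6) and (8) are not reproduced in this part of the description, so the per-frame third probability is supplied through a hypothetical callback `frame_prob(index_frame, query_frame)`; the count of P = (number of index frames) - k + 1 sections is likewise an assumption.

```python
import math


def scan_sections(num_index_frames, k, frame_prob, n1=0, n2=0):
    """Slide a k-frame likelihood acquisition section one frame at a time
    and return the log-likelihood of every section.

    `frame_prob(index_frame, query_frame)` is a hypothetical stand-in for
    the third probability of equation (8).
    """
    likelihoods = []
    for s in range(num_index_frames - k + 1):          # S41 / S46
        probs = [frame_prob(s + j, j) for j in range(k)]            # S42
        lb = [max(probs[max(0, j - n1): j + n2 + 1]) for j in range(k)]  # S43
        likelihoods.append(sum(math.log(p) for p in lb))             # S44
    return likelihoods


# Toy model: the query "occurs" starting at index frame 2.
def toy_prob(index_frame, query_frame):
    return 0.9 if index_frame - query_frame == 2 else 0.1


scores = scan_sections(num_index_frames=6, k=3, frame_prob=toy_prob)
best_section = scores.index(max(scores))
```

In this toy setup `best_section` is 2, the section whose start coincides with the planted match.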

特定部１４０は、区間指定部１３５が指定したＰ個の尤度取得区間の中から、所定の選択時間長（例えば、１秒）毎に最も尤度が高い尤度取得区間を選択する（ステップＳ４７）。すなわち、特定部１４０は、最終的な検索結果として特定する区間の候補を、検索対象の音声信号の全体から満遍なく候補が残るように、予備選択する。 The specifying unit 140 selects the likelihood acquisition section having the highest likelihood for each predetermined selection time length (for example, 1 second) from the P likelihood acquisition sections specified by the section specifying unit 135 (step S47). That is, the specifying unit 140 preliminarily selects the candidates of the section to be specified as the final search result so that the candidates remain evenly in the entire audio signal to be searched.
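The preliminary selection of step S47 keeps one best candidate per selection time length. The sketch below assumes that consecutive sections are grouped into fixed-size windows; `sections_per_window` (the number of section start positions that fall within one selection time length, for example 1 second divided by the shift length) is a name introduced for illustration.

```python
def preselect(likelihoods, sections_per_window):
    """Return the index of the highest-likelihood section in each window
    of `sections_per_window` consecutive sections."""
    picks = []
    for start in range(0, len(likelihoods), sections_per_window):
        window = likelihoods[start:start + sections_per_window]
        best = max(range(len(window)), key=window.__getitem__)
        picks.append(start + best)
    return picks


# Illustrative log-likelihoods for seven sections, three sections per window.
candidates = preselect([-9, -3, -7, -8, -2, -6, -5], sections_per_window=3)
print(candidates)  # [1, 4, 6]
```

Selecting per window rather than globally keeps candidates spread across the whole recording, which matches the stated aim of leaving candidates evenly distributed.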

次に、特定部１４０は、トライフォン音響モデルを用いた詳細な音声検索処理を行う（ステップＳ４８）。すなわち、特定部１４０が予備選択した尤度取得区間について、トライフォンモデル及びＤＰマッチングに基づいて、第２出力確率取得部１３６及び尤度取得部１３８に比べて精度の高い第２の尤度取得処理を実行する。 Next, the identifying unit 140 performs a detailed voice search process using the triphone acoustic model (step S48). That is, with respect to the likelihood acquisition section preselected by the identifying unit 140, the second likelihood acquisition having higher accuracy than the second output probability acquisition unit 136 and the likelihood acquisition unit 138 is performed based on the triphone model and the DP matching. Execute the process.

そして、特定部１４０は、第２の尤度取得処理で取得した尤度に基づいて、検索文字列に対応する区間を特定する（ステップＳ４９）。例えば、特定部１４０は、第２の尤度取得処理で取得した第２の尤度が大きい順にソートし、上位の所定の数の区間を、検索文字列に対応する音声が発せられていることが推定される区間として特定する。そして、特定部１４０がクエリに対応する区間を特定すると、特定部１４０は、出力装置５を介して特定結果を出力する。以上で、音声検索処理の説明を終了する。 Then, the identifying unit 140 identifies the section corresponding to the search character string based on the likelihood acquired in the second likelihood acquisition process (step S49). For example, the identifying unit 140 sorts the sections in descending order of the second likelihood acquired in the second likelihood acquisition process and identifies a predetermined number of top-ranked sections as sections in which the voice corresponding to the search character string is estimated to be uttered. Having identified the sections corresponding to the query, the identifying unit 140 outputs the identification result via the output device 5. This concludes the description of the voice search process.
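The final ranking of step S49 is a sort followed by truncation. In this sketch each candidate is a hypothetical (section index, second likelihood) pair, and `n` plays the role of the predetermined number of sections.

```python
def top_sections(scored, n):
    """Sort (section_index, likelihood) pairs by likelihood, highest first,
    and keep the top n."""
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]


# Illustrative second likelihoods for three preselected sections.
results = top_sections([(1, -12.0), (4, -3.5), (6, -7.25)], n=2)
print(results)  # [(4, -3.5), (6, -7.25)]
```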

以上に説明したように、実施形態１に係る音声検索装置１００は、解析対象の音声信号をフレーム区間に分割し、分割したフレーム区間ごとに音声の特徴量を取得する。そして、フレーム区間ごとに音響モデルの特徴量と一致する出力確率を求め、代表出力確率に置換処理前の検索インデックスを生成する。音声検索装置１００は、この置換処理前の検索インデックスにおいて、音素を構成する複数の状態の中で最も高い出力確率を、その音素の代表出力確率とする置換処理を行う。この置換処理により、置換処理後の検索インデックスは、置換処理前の検索インデックスに比べると、「１／音素の状態の数」にデータサイズを小さくすることができる。この置換処理後の検索インデックスには、置換処理前の検索インデックスの各音素に存在していた、最も高い出力確率がそのままの値で残っている。つまり、置換処理後の検索インデックスは、音声の瞬時的な特徴を喪失していない。この置換処理後の検索インデックスを用いて音声検索を行うことにより、検索精度の低下を低減することができる。 As described above, the voice search device 100 according to the first embodiment divides the voice signal to be analyzed into frame sections and acquires the feature amount of the voice for each frame section. It then obtains, for each frame section, the output probability of matching the feature amounts of the acoustic model, and generates the search index as it is before replacement with the representative output probabilities. On this pre-replacement search index, the voice search device 100 performs a replacement process in which the highest output probability among the plurality of states constituting a phoneme is taken as the representative output probability of that phoneme. Through this replacement process, the data size of the search index after replacement is reduced to "1 / (number of states per phoneme)" of that of the search index before replacement. The search index after replacement retains, unchanged, the highest output probability that existed for each phoneme in the search index before replacement. In other words, the search index after replacement does not lose the instantaneous characteristics of the voice. Performing a voice search using the search index after replacement therefore limits the loss of search accuracy.
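The replacement process summarized above keeps one value per phoneme instead of one per state. A minimal sketch for a single frame follows; the dictionary layout and the probability values are illustrative assumptions, not the storage format of the output probability storage unit 103.

```python
def to_representative(frame_probs):
    """Map each phoneme to the highest output probability among its states.

    `frame_probs` is a hypothetical {phoneme: [p_state1, p_state2, ...]}
    table for one frame of the search index.
    """
    return {phoneme: max(states) for phoneme, states in frame_probs.items()}


# One frame of a pre-replacement index with three states per phoneme.
before = {"a": [0.01, 0.6, 0.02], "k": [0.1, 0.05, 0.2]}
after = to_representative(before)

values_before = sum(len(states) for states in before.values())
values_after = len(after)
print(after, values_before, values_after)  # {'a': 0.6, 'k': 0.2} 6 2
```

The number of stored values drops from 6 to 2, that is, to 1/3 for three states per phoneme, while the peak value of each phoneme is kept unchanged.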

（実施形態２） (Embodiment 2)
上記の説明では、音声検索装置１００が、検索語（クエリ）をテキストデータとして入力する場合について説明した。しかし、クエリの入力方法はこれに限定する必要は無い。例えば、クエリを音声データとして入力することもできる。実施形態２に係る音声検索装置１００は、図２に示すように、検索インデックス生成部１１０と音声検索部１３０とから構成される。検索インデックス生成部１１０の構成は、実施形態１と同じである。音声検索部１３０の構成について、図１５を参照して説明する。 The above description covered the case where the search word (query) is input to the voice search device 100 as text data. However, the query input method need not be limited to this. For example, the query can also be input as voice data. As shown in FIG. 2, the voice search device 100 according to the second embodiment includes a search index generation unit 110 and a voice search unit 130. The configuration of the search index generation unit 110 is the same as in the first embodiment. The configuration of the voice search unit 130 will be described with reference to FIG. 15.

実施形態２に係る音声検索部１３０は、図１５に示すように、音響モデル記憶部１０２と、出力確率記憶部１０３と、クエリ出力確率記憶部１０５と、トライフォンモデル記憶部１０６と、クエリ音声信号取得部１５１と、フレーム列作成部１５２と、クエリ特徴量取得部１５３と、クエリ出力確率取得部１３４と、区間指定部１３５と、第２出力確率取得部１３６と、置換部１３７と、尤度取得部１３８と、繰り返し部１３９と、特定部１４０と、を備える。 As shown in FIG. 15, the voice search unit 130 according to the second embodiment includes an acoustic model storage unit 102, an output probability storage unit 103, a query output probability storage unit 105, a triphone model storage unit 106, a query voice signal acquisition unit 151, a frame sequence creation unit 152, a query feature amount acquisition unit 153, a query output probability acquisition unit 134, a section designating unit 135, a second output probability acquisition unit 136, a replacement unit 137, a likelihood acquisition unit 138, a repeating unit 139, and a specifying unit 140.

クエリ音声信号取得部１５１は、入力装置４を介してユーザが入力したクエリ音声信号を音声データとして取得する。 The query voice signal acquisition unit 151 acquires the query voice signal input by the user via the input device 4 as voice data.

フレーム列作成部１５２は、取得したクエリ音声信号について、フレーム長ごとの区間に分割したフレーム列を作成する。クエリ音声信号のフレーム列について図１６を参照して説明する。図１６（ａ）は、先頭から末尾までの時間長Ｌのクエリ音声信号の波形図である。時間長Ｌはクエリ音声信号が発話される時間長（発話時間長）である。縦軸はクエリ音声信号の強度を示し、横軸は時間を示す。図１６（ｂ）は、図１６（ａ）に示すクエリ音声信号において設定されるフレームを示す。フレーム列作成部１５２は、図１６（ｂ）に示すように、フレーム長ｔの区間を１シフト長Ｓずつシフトして、クエリ音声信号にフレーム番号ｇ_１からｇ_ｋの区間を設定する。フレームの設定方法は、実施形態１の説明と同じである。 The frame sequence creation unit 152 creates, for the acquired query voice signal, a frame sequence divided into sections of the frame length. The frame sequence of the query voice signal will be described with reference to FIG. 16. FIG. 16(a) is a waveform diagram of a query voice signal of time length L from beginning to end. The time length L is the length of time over which the query voice signal is uttered (utterance time length). The vertical axis represents the intensity of the query voice signal, and the horizontal axis represents time. FIG. 16(b) shows the frames set in the query voice signal shown in FIG. 16(a). As shown in FIG. 16(b), the frame sequence creation unit 152 shifts a section of frame length t by one shift length S at a time and sets sections with frame numbers g_1 to g_k in the query voice signal. The frame setting method is the same as described in the first embodiment.
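Setting frames of length t at every shift length S can be sketched as below. The numeric values (a 100 ms signal, 40 ms frames, 10 ms shift) are illustrative assumptions, as is the choice to drop a trailing partial frame; the description does not state how the end of the signal is handled.

```python
def make_frames(total_len, frame_len, shift):
    """Return (start, end) pairs for frames g_1 .. g_k of length `frame_len`,
    each shifted by `shift` from the previous one.

    Frames that would extend past `total_len` are omitted (an assumption).
    """
    if frame_len <= 0 or shift <= 0:
        raise ValueError("frame_len and shift must be positive")
    return [(s, s + frame_len)
            for s in range(0, total_len - frame_len + 1, shift)]


# Hypothetical values in milliseconds: L = 100, t = 40, S = 10.
frames = make_frames(total_len=100, frame_len=40, shift=10)
print(len(frames), frames[0], frames[-1])  # 7 (0, 40) (60, 100)
```

Because S is smaller than t in this example, consecutive frames overlap, which is the usual arrangement when a frame is shifted by less than its own length.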

図１５に戻って、クエリ特徴量取得部１５３は、フレーム列作成部１５２が作成したフレーム列を構成するフレーム（ｇ_１〜ｇ_ｋ）ごとにクエリ音声信号の特徴量を取得する。 Returning to FIG. 15, the query feature amount acquisition unit 153 acquires the feature amount of the query voice signal for each of the frames (g_1 to g_k) forming the frame sequence created by the frame sequence creation unit 152.

クエリ出力確率取得部１３４は、クエリ特徴量取得部１５３が取得した特徴量に基づいて、この特徴量が音響モデルに含まれる音素の各状態の特徴量と一致する確率（第２の確率）をフレーム（ｇ_１〜ｇ_ｋ）ごとに取得し、音素の各状態と対応付けてクエリ出力確率記憶部１０５に記憶する。このクエリ音声信号について作成した出力確率のテーブルは、図１０に示すような出力確率のテーブルとなる。他の構成及び代表確率設定処理と音声検索処理については、実施形態１の説明と同じである。 Based on the feature amount acquired by the query feature amount acquisition unit 153, the query output probability acquisition unit 134 acquires, for each frame (g_1 to g_k), the probability (second probability) that this feature amount matches the feature amount of each state of the phonemes included in the acoustic model, and stores it in the query output probability storage unit 105 in association with each state of the phonemes. The output probability table created for this query voice signal has the form shown in FIG. 10. The other configurations, the representative probability setting process, and the voice search process are the same as described in the first embodiment.

以上説明したように、実施形態２に係る音声検索装置１００は、クエリを音声信号として入力した場合でも、音声検索をすることができる。 As described above, the voice search device 100 according to the second embodiment can perform voice search even when a query is input as a voice signal.

（実施形態３） (Embodiment 3)
実施形態１と２では、検索対象の音声信号の検索インデックスのデータサイズを縮小する場合について説明した。実施形態３では、クエリの出力確率についてもデータサイズを縮小し、検索時の処理負荷を低減する場合について説明する。 The first and second embodiments described the case where the data size of the search index of the search target voice signal is reduced. The third embodiment describes the case where the data size of the output probabilities of the query is also reduced, lowering the processing load at search time.

実施形態３に係る音声検索装置１００は、図２に示すように、検索インデックス生成部１１０と音声検索部１３０とから構成される。検索インデックス生成部１１０の構成は、実施形態１と同じである。音声検索部１３０の構成は、図１７に示すように、クエリの出力確率についても代表確率を設定する代表確率設定部１２０を設ける。他の構成については、実施形態１の構成と同じである。 As shown in FIG. 2, the voice search device 100 according to the third embodiment includes a search index generation unit 110 and a voice search unit 130. The configuration of the search index generation unit 110 is the same as in the first embodiment. As shown in FIG. 17, the voice search unit 130 is provided with a representative probability setting unit 120 that sets representative probabilities for the output probabilities of the query as well. The other configurations are the same as in the first embodiment.

音声検索部１３０内に備えられた代表確率設定部１２０は、クエリの出力確率についても、音素を構成する状態の中で最も高い出力確率をその音素の代表確率で置換する処理をする。この置換処理により、図１０に示すクエリの出力確率は、図１８に示すような出力確率と成り、データサイズが縮小される。 For the output probabilities of the query as well, the representative probability setting unit 120 provided in the voice search unit 130 performs the replacement process in which the highest output probability among the states constituting a phoneme is taken as the representative probability of that phoneme. Through this replacement process, the query output probabilities shown in FIG. 10 become those shown in FIG. 18, and the data size is reduced.

このように検索対象の音声信号の検索インデックスに加えて、クエリの出力確率についても縮小処理を行うことにより、音声検索時の出力確率の計算式が、実施形態１で説明した式（８）から下記に示す式（９）とすることができる。つまり状態数ｚに関する計算処理を削減でき、状態数が３であれば計算量を１／３に、状態数が５であれば計算量を１／５に削減できる。 By thus applying the reduction process to the output probabilities of the query in addition to the search index of the search target voice signal, the formula for calculating the output probability during voice search can be changed from equation (8) described in the first embodiment to equation (9) shown below. In other words, the computation over the number of states z can be eliminated: with three states per phoneme the amount of computation is reduced to 1/3, and with five states to 1/5.

以上に説明したように、実施形態３に係る音声検索装置１００は、検索対象の検索インデックスに加えて、クエリの出力確率についても圧縮処理を行うので、データサイズを小さくすることができる。また、音声検索時の計算処理を１／「音素の状態の数」に軽減することができる。この置換処理後のクエリの出力確率には、置換処理前のその音素ごとの大きな出力確率の値が残っているので、その音素の中に極めて大きい出力確率を有する状態があるという情報が喪失されることはない。つまり、置換処理後の検索インデックスは、音声の瞬時的な特徴を喪失していない。したがって、この置換処理後のクエリの出力確率を用いて音声検索を行うことにより、検索精度の低下を低減しつつ、データサイズを縮小し、計算処理を軽くすることができる。 As described above, the voice search device 100 according to the third embodiment compresses the output probabilities of the query in addition to the search index of the search target, so the data size can be reduced. The computation during voice search can also be reduced to 1 / (number of states per phoneme). Because the output probabilities of the query after replacement retain the large output probability value that each phoneme had before replacement, the information that a phoneme contains a state with a very high output probability is not lost. In other words, the search index after replacement does not lose the instantaneous characteristics of the voice. Performing a voice search using the query output probabilities after replacement therefore reduces the data size and lightens the computation while limiting the loss of search accuracy.

なお、実施形態２の音声検索部１３０の構成に、代表確率設定部１２０を設けるようにしてもよい。 The representative probability setting unit 120 may be provided in the configuration of the voice search unit 130 of the second embodiment.

なお、上記の説明では、音声検索装置１００が、検索インデックス生成部１１０を備える場合について説明したが、検索インデックス生成部１１０と音声検索部１３０とが別々の装置に実装されていてもよい。 In the above description, the case where the voice search device 100 includes the search index generation unit 110 has been described, but the search index generation unit 110 and the voice search unit 130 may be implemented in different devices.

また、上記の説明では、特定部１４０が、トライフォンモデルを用いた精度の高い検索を行う説明をした。トライフォンモデルを用いた検索を行うことにより検索精度は向上するが、処理時間が長くなる。したがって、トライフォンモデルを用いた検索を行うか否かは任意である。 Further, in the above description, the specification unit 140 has been described as performing a highly accurate search using the triphone model. Although the search accuracy is improved by performing the search using the triphone model, the processing time becomes long. Therefore, it is arbitrary whether or not the search using the triphone model is performed.

また、本発明に係る機能を実現するための構成を予め備えた音声検索装置として提供できることはもとより、プログラムの適用により、既存のパーソナルコンピュータや情報端末機器等を、本発明に係る音声検索装置として機能させることもできる。すなわち、上記実施形態で例示した音声検索装置１００による各機能構成を実現させるためのプログラムを、既存のパーソナルコンピュータや情報端末機器等を制御するＣＰＵ等が実行できるように適用することで、本発明に係る音声検索装置１００として機能させることができる。また、本発明に係る音声検索方法は、音声検索装置を用いて実施できる。 The present invention can of course be provided as a voice search device equipped in advance with the configuration for realizing the functions according to the present invention; in addition, by applying a program, an existing personal computer, information terminal device, or the like can be made to function as the voice search device according to the present invention. That is, by applying a program for realizing each functional configuration of the voice search device 100 exemplified in the above embodiments so that it can be executed by a CPU or the like that controls an existing personal computer, information terminal device, or the like, that device can be made to function as the voice search device 100 according to the present invention. The voice search method according to the present invention can be carried out using a voice search device.

また、このようなプログラムの適用方法は任意である。プログラムを、例えば、コンピュータが読取可能な記録媒体（ＣＤ−ＲＯＭ（Compact Disc Read-Only Memory）、ＤＶＤ（Digital Versatile Disc）、ＭＯ（Magneto Optical disc）等）に格納して適用できる他、インターネット等のネットワーク上のストレージにプログラムを格納しておき、これをダウンロードさせることにより適用することもできる。 The method of applying such a program is arbitrary. For example, the program can be applied by storing it on a computer-readable recording medium (CD-ROM (Compact Disc Read-Only Memory), DVD (Digital Versatile Disc), MO (Magneto Optical disc), etc.), or by storing it in storage on a network such as the Internet and having it downloaded.

以上、本発明の好ましい実施形態について説明したが、本発明は係る特定の実施形態に限定されるものではなく、本発明には、特許請求の範囲に記載された発明とその均等の範囲とが含まれる。以下に、本願出願の当初の特許請求の範囲に記載された発明を付記する。 Although preferred embodiments of the present invention have been described above, the present invention is not limited to these specific embodiments, and includes the invention described in the claims and its equivalents. The inventions described in the claims as originally filed with the present application are appended below.

（付記１）
検索対象の音声信号を取得する取得手段と、
取得した音声信号の特徴量を解析する単位であるフレーム区間を設定する区間設定手段と、
前記フレーム区間ごとに前記検索対象の音声信号の特徴量を取得する特徴量取得手段と、
前記検索対象の音声信号の特徴量が音響モデルの音素を構成する各状態の特徴量と一致する確率である出力確率を前記フレーム区間ごとに取得する出力確率取得手段と、
前記出力確率取得手段が取得したそれぞれの音素を構成する各状態の出力確率の中で最も高い出力確率を、その音素の代表出力確率として設定する代表確率設定手段と、
前記検索対象とする音声信号のフレームごとに、前記それぞれの音素に前記代表出力確率を対応付けた検索インデックスを生成する検索インデックス生成手段と、
を備える検索インデックス生成装置。
(Appendix 1)
A search index generation device comprising:
acquisition means for acquiring a voice signal to be searched;
section setting means for setting frame sections, each being a unit in which a feature amount of the acquired voice signal is analyzed;
feature amount acquisition means for acquiring a feature amount of the voice signal to be searched for each frame section;
output probability acquisition means for acquiring, for each frame section, an output probability that is the probability that the feature amount of the voice signal to be searched matches the feature amount of each state constituting a phoneme of an acoustic model;
representative probability setting means for setting, as the representative output probability of each phoneme, the highest of the output probabilities, acquired by the output probability acquisition means, of the states constituting that phoneme; and
search index generation means for generating a search index in which, for each frame of the voice signal to be searched, the representative output probability is associated with each of the phonemes.

（付記２）
検索対象の音声信号を取得する取得工程と、
取得した音声信号の特徴量を解析する単位であるフレーム区間を設定する区間設定工程と、
前記フレーム区間ごとに前記検索対象の音声信号の特徴量を取得する特徴量取得工程と、
前記検索対象の音声信号の特徴量が音響モデルの音素を構成する各状態の特徴量と一致する確率である出力確率を前記フレーム区間ごとに取得する出力確率取得工程と、
前記出力確率取得工程で取得したそれぞれの音素を構成する各状態の出力確率の中で最も高い出力確率を、その音素の代表出力確率として設定する代表確率設定工程と、
前記検索対象とする音声信号のフレームごとに、前記それぞれの音素に前記代表出力確率を対応付けた検索インデックスを生成する検索インデックス生成工程と、
を含む検索インデックス生成方法。
(Appendix 2)
A search index generation method including:
an acquisition step of acquiring a voice signal to be searched;
a section setting step of setting frame sections, each being a unit in which a feature amount of the acquired voice signal is analyzed;
a feature amount acquisition step of acquiring a feature amount of the voice signal to be searched for each frame section;
an output probability acquisition step of acquiring, for each frame section, an output probability that is the probability that the feature amount of the voice signal to be searched matches the feature amount of each state constituting a phoneme of an acoustic model;
a representative probability setting step of setting, as the representative output probability of each phoneme, the highest of the output probabilities, acquired in the output probability acquisition step, of the states constituting that phoneme; and
a search index generation step of generating a search index in which, for each frame of the voice signal to be searched, the representative output probability is associated with each of the phonemes.

（付記３）
コンピュータを、
検索対象の音声信号を取得する取得手段、
取得した音声信号の特徴量を解析する単位であるフレーム区間を設定する区間設定手段、
前記フレーム区間ごとに前記検索対象の音声信号の特徴量を取得する特徴量取得手段、
前記検索対象の音声信号の特徴量が音響モデルの音素を構成する各状態の特徴量と一致する確率である出力確率を前記フレーム区間ごとに取得する出力確率取得手段、
前記出力確率取得手段が取得したそれぞれの音素を構成する各状態の出力確率の中で最も高い出力確率を、その音素の代表出力確率として設定する代表確率設定手段、
前記検索対象とする音声信号のフレームごとに、前記それぞれの音素に前記代表出力確率を対応付けた検索インデックスを生成する検索インデックス生成手段、
として機能させるためのプログラム。
(Appendix 3)
A program for causing a computer to function as:
acquisition means for acquiring a voice signal to be searched;
section setting means for setting frame sections, each being a unit in which a feature amount of the acquired voice signal is analyzed;
feature amount acquisition means for acquiring a feature amount of the voice signal to be searched for each frame section;
output probability acquisition means for acquiring, for each frame section, an output probability that is the probability that the feature amount of the voice signal to be searched matches the feature amount of each state constituting a phoneme of an acoustic model;
representative probability setting means for setting, as the representative output probability of each phoneme, the highest of the output probabilities, acquired by the output probability acquisition means, of the states constituting that phoneme; and
search index generation means for generating a search index in which, for each frame of the voice signal to be searched, the representative output probability is associated with each of the phonemes.

（付記４）
検索インデックス生成部と、音声検索部と、を備える音声検索装置であって、
前記検索インデックス生成部は、
検索対象の音声信号を取得する取得手段と、
取得した音声信号の特徴量を解析する単位であるフレーム区間を設定する区間設定手段と、
前記フレーム区間ごとに前記検索対象の音声信号の特徴量を取得する特徴量取得手段と、
前記検索対象の音声信号の特徴量が音響モデルの音素を構成する各状態の特徴量と一致する確率である出力確率を前記フレーム区間ごとに取得する出力確率取得手段と、
前記出力確率取得手段が取得したそれぞれの音素を構成する各状態の出力確率の中で最も高い出力確率を、その音素の代表出力確率として設定する代表確率設定手段と、
前記検索対象とする音声信号のフレームごとに、前記それぞれの音素に前記代表出力確率を第１の確率として対応付けた検索インデックスを生成する検索インデックス生成手段と、
を備え、
前記音声検索部は、
前記第１の確率を記憶する出力確率記憶手段と、
クエリ音声信号に含まれるフレーム毎に取得され、前記クエリ音声信号の特徴量が前記音響モデルに含まれる音素の各状態の特徴量と一致する確率であって、前記音響モデルの音素の各状態と対応付けられた第２の確率と、前記出力確率記憶手段が記憶する前記第１の確率とに基づいて、前記検索対象の音声信号の中から前記クエリ音声信号が発せられていると推定される推定区間を特定する特定手段と、
を備えることを特徴とする音声検索装置。
(Appendix 4)
A voice search device comprising a search index generation unit and a voice search unit, wherein
the search index generation unit comprises:
acquisition means for acquiring a voice signal to be searched;
section setting means for setting frame sections, each being a unit in which a feature amount of the acquired voice signal is analyzed;
feature amount acquisition means for acquiring a feature amount of the voice signal to be searched for each frame section;
output probability acquisition means for acquiring, for each frame section, an output probability that is the probability that the feature amount of the voice signal to be searched matches the feature amount of each state constituting a phoneme of an acoustic model;
representative probability setting means for setting, as the representative output probability of each phoneme, the highest of the output probabilities, acquired by the output probability acquisition means, of the states constituting that phoneme; and
search index generation means for generating a search index in which, for each frame of the voice signal to be searched, the representative output probability is associated with each of the phonemes as a first probability,
and the voice search unit comprises:
output probability storage means for storing the first probability; and
specifying means for specifying, from within the voice signal to be searched, an estimated section in which a query voice signal is estimated to be uttered, based on a second probability and the first probability stored in the output probability storage means, the second probability being acquired for each frame included in the query voice signal, being the probability that the feature amount of the query voice signal matches the feature amount of each state of the phonemes included in the acoustic model, and being associated with each state of the phonemes of the acoustic model.

（付記５）
前記検索対象の音声信号と前記クエリ音声信号とを比較する区間であるフレーム毎に、前記クエリ音声信号の特徴量を取得するクエリ特徴量取得手段と、
前記クエリ特徴量取得手段が取得したクエリ音声信号の特徴量に基づき、前記第２の確率を、音響モデルの音素の各状態と対応付けてフレーム毎に取得するクエリ出力確率取得手段と、
をさらに備えることを特徴とする付記４に記載の音声検索装置。
(Appendix 5)
The voice search device according to appendix 4, further comprising:
query feature amount acquisition means for acquiring a feature amount of the query voice signal for each frame, a frame being a section in which the voice signal to be searched and the query voice signal are compared; and
query output probability acquisition means for acquiring, based on the feature amount of the query voice signal acquired by the query feature amount acquisition means, the second probability for each frame in association with each state of the phonemes of the acoustic model.

（付記６）
前記検索対象の音声信号におけるクエリ音声信号の発話時間長を有する区間である尤度取得区間を複数指定する区間指定手段と、
前記区間指定手段が指定した尤度取得区間が前記クエリ音声信号が発せられている区間であることの尤もらしさを示す尤度を、前記第１の確率と前記第２の確率とに基づいて取得する尤度取得手段と、
をさらに備え、
前記区間指定手段は、前記検索対象の音声信号における前記尤度取得区間の先頭位置を変えて複数の尤度取得区間を指定し、
前記尤度取得手段は、前記複数の尤度取得区間のそれぞれについて尤度を取得し、
前記特定手段は、前記区間指定手段が指定した尤度取得区間のそれぞれについて前記尤度取得手段が取得した尤度に基づいて、前記検索対象の音声信号の中から前記クエリ音声信号が発せられていると推定される推定区間を特定する、
ことを特徴とする付記４または５に記載の音声検索装置。
(Appendix 6)
The voice search device according to appendix 4 or 5, further comprising:
section designating means for designating a plurality of likelihood acquisition sections, each being a section of the voice signal to be searched that has the utterance time length of the query voice signal; and
likelihood acquisition means for acquiring, based on the first probability and the second probability, a likelihood indicating how likely it is that a likelihood acquisition section designated by the section designating means is a section in which the query voice signal is uttered,
wherein the section designating means designates the plurality of likelihood acquisition sections by changing the start position of the likelihood acquisition section in the voice signal to be searched,
the likelihood acquisition means acquires a likelihood for each of the plurality of likelihood acquisition sections, and
the specifying means specifies, from within the voice signal to be searched, the estimated section in which the query voice signal is estimated to be uttered, based on the likelihood acquired by the likelihood acquisition means for each of the likelihood acquisition sections designated by the section designating means.

（付記７）
前記複数の尤度取得区間のそれぞれについて、前記第１の確率と前記第２の確率とを前記尤度取得区間に含まれるフレーム毎に掛け合わせた第３の確率を取得する第２出力確率取得手段をさらに設け、
前記尤度取得手段は、前記第２出力確率取得手段がフレーム毎に取得した第３の確率の対数をとった値を加算して前記尤度取得区間の尤度を取得する、
ことを特徴とする付記６に記載の音声検索装置。
(Appendix 7)
The voice search device according to appendix 6, further comprising second output probability acquisition means for acquiring, for each of the plurality of likelihood acquisition sections, a third probability obtained by multiplying the first probability by the second probability for each frame included in the likelihood acquisition section,
wherein the likelihood acquisition means acquires the likelihood of the likelihood acquisition section by summing the logarithms of the third probabilities acquired for each frame by the second output probability acquisition means.

（付記８）
前記クエリ出力確率取得手段が取得した第２の確率について、音素を構成する状態の中で最も出力確率が高い状態の出力確率を、その音素の代表出力確率として抽出し、抽出した出力確率をその音素の代表出力確率として設定する第２の代表確率設定手段をさらに設けたことを特徴とする付記５に記載の音声検索装置。
(Appendix 8)
The voice search device according to appendix 5, further comprising second representative probability setting means for extracting, from the second probabilities acquired by the query output probability acquisition means, the output probability of the state having the highest output probability among the states constituting a phoneme, and setting the extracted output probability as the representative output probability of that phoneme.

(Appendix 9)
A voice search method comprising:
an acquisition step of acquiring a voice signal to be searched;
a section setting step of setting frame sections, each being a unit for analyzing the feature amount of the acquired voice signal;
a feature amount acquisition step of acquiring the feature amount of the search target voice signal for each frame section;
an output probability acquisition step of acquiring, for each frame section, an output probability that is the probability that the feature amount of the search target voice signal matches the feature amount of each state constituting a phoneme of an acoustic model;
a representative probability setting step of setting, as the representative output probability of each phoneme, the highest of the output probabilities of the states constituting that phoneme acquired in the output probability acquisition step;
a search index generation step of generating a search index in which, for each frame of the search target voice signal, the representative output probability is associated with each phoneme as a first probability; and
a specifying step of specifying, from the search target voice signal, an estimated section in which a query voice signal is estimated to be uttered, based on the first probability and a second probability, the second probability being acquired for each frame included in the query voice signal, being the probability that the feature amount of the query voice signal matches the feature amount of each state of the phonemes included in the acoustic model, and being associated with each state of the phonemes of the acoustic model.
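The representative probability setting step and the search index generation step can be sketched as follows. This is a minimal sketch under assumptions: the nested list and dictionary layout, the function name, and the phoneme labels in the usage note are hypothetical, and the per-state output probabilities are taken as already computed from an acoustic model.

```python
from typing import Dict, List

# Hypothetical layout: one dict per frame, mapping a phoneme label to the
# output probabilities of the states that constitute that phoneme.
StateProbs = List[Dict[str, List[float]]]


def build_search_index(state_probs: StateProbs) -> List[Dict[str, float]]:
    """Keep, per frame and per phoneme, only the highest state output probability."""
    index: List[Dict[str, float]] = []
    for frame in state_probs:
        # The representative output probability is the maximum over the states.
        index.append({phoneme: max(states) for phoneme, states in frame.items()})
    return index
```

For example, a frame `{"a": [0.1, 0.6, 0.3], "k": [0.05, 0.02, 0.01]}` is reduced to `{"a": 0.6, "k": 0.05}`, so a three-state phoneme model stores one value per phoneme instead of three.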

(Appendix 10)
A program for causing a computer to function as:
acquisition means for acquiring a voice signal to be searched;
section setting means for setting frame sections, each being a unit for analyzing the feature amount of the acquired voice signal;
feature amount acquisition means for acquiring the feature amount of the search target voice signal for each frame section;
output probability acquisition means for acquiring, for each frame section, an output probability that is the probability that the feature amount of the search target voice signal matches the feature amount of each state constituting a phoneme of an acoustic model;
representative probability setting means for setting, as the representative output probability of each phoneme, the highest of the output probabilities of the states constituting that phoneme acquired in the output probability acquisition step;
search index generation means for generating a search index in which, for each frame of the search target voice signal, the representative output probability is associated with each phoneme as a first probability; and
specifying means for specifying, from the search target voice signal, an estimated section in which a query voice signal is estimated to be uttered, based on the first probability and a second probability, the second probability being acquired for each frame included in the query voice signal, being the probability that the feature amount of the query voice signal matches the feature amount of each state of the phonemes included in the acoustic model, and being associated with each state of the phonemes of the acoustic model.

(Appendix 11)
A voice search device comprising a search index generation unit and a voice search unit,
wherein the search index generation unit comprises:
acquisition means for acquiring a voice signal to be searched;
section setting means for setting frame sections, each being a unit for analyzing the feature amount of the acquired voice signal;
feature amount acquisition means for acquiring the feature amount of the search target voice signal for each frame section;
output probability acquisition means for acquiring, for each frame section, an output probability that is the probability that the feature amount of the search target voice signal matches the feature amount of each state constituting a phoneme of an acoustic model;
representative probability setting means for setting, as the representative output probability of each phoneme, the highest of the output probabilities of the states constituting that phoneme acquired by the output probability acquisition means; and
search index generation means for generating a search index in which, for each frame of the search target voice signal, the representative output probability is associated with each phoneme as a first probability,
and wherein the voice search unit comprises:
output probability storage means for storing the first probability;
search character string acquisition means for acquiring a search character string;
conversion means for converting the search character string acquired by the search character string acquisition means into a phoneme string and creating a query phoneme string in which acoustic models are arranged according to the phoneme time lengths acquired from a time length storage unit; and
specifying means for specifying, from the search target voice signal, an estimated section in which a query voice signal is estimated to be uttered, based on the first probability stored in the output probability storage means and a second probability, the second probability being acquired for each frame included in the entire query phoneme string, being the probability that the feature amount of the query phoneme string matches the feature amount of each state of the phonemes included in the acoustic model, and being associated with each state of the phonemes of the acoustic model.
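The arrangement of a query phoneme string according to stored phoneme time lengths can be sketched as below. This is a sketch under assumptions only: the conversion from the search character string to phonemes is taken as already done, the duration table in milliseconds, the 10 ms frame shift, and the rounding rule are all hypothetical, and the sketch emits phoneme labels per frame rather than acoustic model states.

```python
from typing import Dict, List


def build_query_frames(phonemes: List[str],
                       durations_ms: Dict[str, float],
                       frame_shift_ms: float = 10.0) -> List[str]:
    """Repeat each phoneme label for as many frames as its stored time length covers."""
    if frame_shift_ms <= 0:
        raise ValueError("frame_shift_ms must be positive")
    frames: List[str] = []
    for phoneme in phonemes:
        # Assumed rule: at least one frame per phoneme, otherwise nearest frame count.
        count = max(1, round(durations_ms[phoneme] / frame_shift_ms))
        frames.extend([phoneme] * count)
    return frames
```

With a hypothetical table `{"k": 30.0, "a": 80.0}`, the phoneme string `["k", "a"]` expands to three frames of `k` followed by eight frames of `a`, which fixes the utterance time length used for the likelihood acquisition sections.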

1 ... ROM, 2 ... RAM, 3 ... external storage device, 4 ... input device, 5 ... output device, 6 ... CPU, 7 ... bus, 100 ... voice search device, 101 ... voice signal storage unit, 102 ... acoustic model storage unit, 103 ... output probability storage unit, 104 ... time length storage unit, 105 ... query output probability storage unit, 106 ... triphone model storage unit, 110 ... search index generation unit, 111 ... voice signal acquisition unit, 112 ... frame setting unit, 113 ... feature amount acquisition unit, 114 ... output probability acquisition unit, 120 ... representative probability setting unit, 121 ... compressed index generation unit, 130 ... voice search unit, 131 ... search character string acquisition unit, 132 ... conversion unit, 133 ... frame sequence creation unit, 134 ... query output probability acquisition unit, 135 ... section designating unit, 136 ... second output probability acquisition unit, 137 ... replacement unit, 138 ... likelihood acquisition unit, 139 ... repetition unit, 140 ... specifying unit, 151 ... query voice signal acquisition unit, 152 ... frame sequence creation unit, 153 ... query feature amount acquisition unit

Claims

A search index generation device comprising:
acquisition means for acquiring a voice signal to be searched;
section setting means for setting frame sections, each being a unit for analyzing the feature amount of the acquired voice signal;
feature amount acquisition means for acquiring the feature amount of the search target voice signal for each frame section;
output probability acquisition means for acquiring, for each frame section, an output probability that is the probability that the feature amount of the search target voice signal matches the feature amount of each state constituting a phoneme of an acoustic model;
representative probability setting means for setting, as the representative output probability of each phoneme, the highest of the output probabilities of the states constituting that phoneme acquired by the output probability acquisition means; and
search index generation means for generating a search index in which, for each frame of the search target voice signal, the representative output probability is associated with each phoneme.

A search index generation method comprising:
an acquisition step of acquiring a voice signal to be searched;
a section setting step of setting frame sections, each being a unit for analyzing the feature amount of the acquired voice signal;
a feature amount acquisition step of acquiring the feature amount of the search target voice signal for each frame section;
an output probability acquisition step of acquiring, for each frame section, an output probability that is the probability that the feature amount of the search target voice signal matches the feature amount of each state constituting a phoneme of an acoustic model;
a representative probability setting step of setting, as the representative output probability of each phoneme, the highest of the output probabilities of the states constituting that phoneme acquired in the output probability acquisition step; and
a search index generation step of generating a search index in which, for each frame of the search target voice signal, the representative output probability is associated with each phoneme.

A program for causing a computer to function as:
acquisition means for acquiring a voice signal to be searched;
section setting means for setting frame sections, each being a unit for analyzing the feature amount of the acquired voice signal;
feature amount acquisition means for acquiring the feature amount of the search target voice signal for each frame section;
output probability acquisition means for acquiring, for each frame section, an output probability that is the probability that the feature amount of the search target voice signal matches the feature amount of each state constituting a phoneme of an acoustic model;
representative probability setting means for setting, as the representative output probability of each phoneme, the highest of the output probabilities of the states constituting that phoneme acquired by the output probability acquisition means; and
search index generation means for generating a search index in which, for each frame of the search target voice signal, the representative output probability is associated with each phoneme.

A voice search device comprising a search index generation unit and a voice search unit,
wherein the search index generation unit comprises:
acquisition means for acquiring a voice signal to be searched;
section setting means for setting frame sections, each being a unit for analyzing the feature amount of the acquired voice signal;
feature amount acquisition means for acquiring the feature amount of the search target voice signal for each frame section;
output probability acquisition means for acquiring, for each frame section, an output probability that is the probability that the feature amount of the search target voice signal matches the feature amount of each state constituting a phoneme of an acoustic model;
representative probability setting means for setting, as the representative output probability of each phoneme, the highest of the output probabilities of the states constituting that phoneme acquired by the output probability acquisition means; and
search index generation means for generating a search index in which, for each frame of the search target voice signal, the representative output probability is associated with each phoneme as a first probability,
and wherein the voice search unit comprises:
output probability storage means for storing the first probability; and
specifying means for specifying, from the search target voice signal, an estimated section in which a query voice signal is estimated to be uttered, based on the first probability stored in the output probability storage means and a second probability, the second probability being acquired for each frame included in the query voice signal, being the probability that the feature amount of the query voice signal matches the feature amount of each state of the phonemes included in the acoustic model, and being associated with each state of the phonemes of the acoustic model.

The voice search device according to claim 4, further comprising:
query feature amount acquisition means for acquiring the feature amount of the query voice signal for each frame, the frame being a section over which the search target voice signal and the query voice signal are compared; and
query output probability acquisition means for acquiring, based on the feature amount of the query voice signal acquired by the query feature amount acquisition means, the second probability for each frame in association with each state of the phonemes of the acoustic model.

The voice search device according to claim 4 or 5, further comprising:
section designating means for designating a plurality of likelihood acquisition sections, each being a section of the search target voice signal that has the utterance time length of the query voice signal; and
likelihood acquisition means for acquiring, based on the first probability and the second probability, a likelihood indicating how likely it is that a likelihood acquisition section designated by the section designating means is a section in which the query voice signal is uttered,
wherein the section designating means designates the plurality of likelihood acquisition sections by changing the start position of the likelihood acquisition section in the search target voice signal,
the likelihood acquisition means acquires the likelihood for each of the plurality of likelihood acquisition sections, and
the specifying means specifies, based on the likelihood acquired by the likelihood acquisition means for each of the likelihood acquisition sections designated by the section designating means, the estimated section of the search target voice signal in which the query voice signal is estimated to be uttered.

The voice search device according to claim 6, further comprising second output probability acquisition means for acquiring, for each of the plurality of likelihood acquisition sections, a third probability obtained by multiplying the first probability by the second probability for each frame included in the likelihood acquisition section,
wherein the likelihood acquisition means acquires the likelihood of the likelihood acquisition section by summing the logarithms of the third probabilities acquired for each frame by the second output probability acquisition means.

The voice search device according to claim 5, further comprising second representative probability setting means for extracting, from the second probabilities acquired by the query output probability acquisition means, the output probability of the state having the highest output probability among the states constituting a phoneme, and setting the extracted output probability as the representative output probability of that phoneme.

A voice search method comprising:
an acquisition step of acquiring a voice signal to be searched;
a section setting step of setting frame sections, each being a unit for analyzing the feature amount of the acquired voice signal;
a feature amount acquisition step of acquiring the feature amount of the search target voice signal for each frame section;
an output probability acquisition step of acquiring, for each frame section, an output probability that is the probability that the feature amount of the search target voice signal matches the feature amount of each state constituting a phoneme of an acoustic model;
a representative probability setting step of setting, as the representative output probability of each phoneme, the highest of the output probabilities of the states constituting that phoneme acquired in the output probability acquisition step;
a search index generation step of generating a search index in which, for each frame of the search target voice signal, the representative output probability is associated with each phoneme as a first probability; and
a specifying step of specifying, from the search target voice signal, an estimated section in which a query voice signal is estimated to be uttered, based on the first probability and a second probability, the second probability being acquired for each frame included in the query voice signal, being the probability that the feature amount of the query voice signal matches the feature amount of each state of the phonemes included in the acoustic model, and being associated with each state of the phonemes of the acoustic model.

A program for causing a computer to function as:
acquisition means for acquiring a voice signal to be searched;
section setting means for setting frame sections, each being a unit for analyzing the feature amount of the acquired voice signal;
feature amount acquisition means for acquiring the feature amount of the search target voice signal for each frame section;
output probability acquisition means for acquiring, for each frame section, an output probability that is the probability that the feature amount of the search target voice signal matches the feature amount of each state constituting a phoneme of an acoustic model;
representative probability setting means for setting, as the representative output probability of each phoneme, the highest of the output probabilities of the states constituting that phoneme acquired in the output probability acquisition step;
search index generation means for generating a search index in which, for each frame of the search target voice signal, the representative output probability is associated with each phoneme as a first probability; and
specifying means for specifying, from the search target voice signal, an estimated section in which a query voice signal is estimated to be uttered, based on the first probability and a second probability, the second probability being acquired for each frame included in the query voice signal, being the probability that the feature amount of the query voice signal matches the feature amount of each state of the phonemes included in the acoustic model, and being associated with each state of the phonemes of the acoustic model.

A voice search device comprising a search index generation unit and a voice search unit,
wherein the search index generation unit comprises:
acquisition means for acquiring a voice signal to be searched;
section setting means for setting frame sections, each being a unit for analyzing the feature amount of the acquired voice signal;
feature amount acquisition means for acquiring the feature amount of the search target voice signal for each frame section;
output probability acquisition means for acquiring, for each frame section, an output probability that is the probability that the feature amount of the search target voice signal matches the feature amount of each state constituting a phoneme of an acoustic model;
representative probability setting means for setting, as the representative output probability of each phoneme, the highest of the output probabilities of the states constituting that phoneme acquired by the output probability acquisition means; and
search index generation means for generating a search index in which, for each frame of the search target voice signal, the representative output probability is associated with each phoneme as a first probability,
and wherein the voice search unit comprises:
output probability storage means for storing the first probability;
search character string acquisition means for acquiring a search character string;
conversion means for converting the search character string acquired by the search character string acquisition means into a phoneme string and creating a query phoneme string in which acoustic models are arranged according to the phoneme time lengths acquired from a time length storage unit; and
specifying means for specifying, from the search target voice signal, an estimated section in which a query voice signal is estimated to be uttered, based on the first probability stored in the output probability storage means and a second probability, the second probability being acquired for each frame included in the entire query phoneme string, being the probability that the feature amount of the query phoneme string matches the feature amount of each state of the phonemes included in the acoustic model, and being associated with each state of the phonemes of the acoustic model.