JP6569343B2 - Voice search device, voice search method and program

Info

Publication number: JP6569343B2
Application number: JP2015138662A
Authority: JP
Inventors: 孝浩田中
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2015-07-10
Filing date: 2015-07-10
Publication date: 2019-09-04
Anticipated expiration: 2035-07-10
Also published as: JP2017021196A

Description

The present invention relates to a voice search device, a voice search method, and a program.

A voice search technique called the two-pass search method, which searches a recorded voice for a voice representing a query, is known.

The two-pass search method performs the voice search on the assumption that the speech speed of the recorded voice substantially matches a predetermined average speech speed. The method therefore has the problem that search accuracy decreases as the difference between the speech speed of the recorded voice and the average speech speed grows.

One conceivable way to solve this problem is to estimate the speech speed of the recorded voice and perform the voice search using the estimated speech speed. As a technique for estimating speech speed, the technique described in Patent Document 1, for example, is known.

Patent Document 1: JP 2010-26323 A

With the technique of Patent Document 1, the speech speed of the recorded voice may be estimated incorrectly because of noise or the like.

When a voice search is performed based on an erroneously estimated speech speed of the recorded voice and an inappropriate search result is presented, the user must manually set an appropriate speech speed for the voice search, that is, a speech speed that matches the speech speed of the recorded voice closely enough to yield an appropriate search result.

However, the technique of Patent Document 1 gives the user no way to intuitively grasp the speech speed being used for the voice search. The user therefore could not intuitively set an appropriate speech speed for the voice search.

The present invention has been made in view of the above problem, and an object thereof is to provide a voice search device, a voice search method, and a program that perform an accurate voice search using an appropriate speech speed intuitively designated by the user.

To achieve the above object, a voice search device according to the present invention comprises:
a query acquisition unit that acquires a query;
a speech speed designation unit that accepts a speech speed designated by a user from speech speeds within a predetermined variable range;
a demo voice output unit that outputs, at the speech speed accepted by the speech speed designation unit, a demo voice representing the acquired query; and
a search unit that, based on the speech speed accepted by the speech speed designation unit, searches a recorded voice that is the target of the voice search for a voice representing the query.

According to the present invention, it is possible to provide a voice search device, a voice search method, and a program that perform an accurate voice search using an appropriate speech speed intuitively designated by the user.

A diagram showing a configuration example of the voice search device according to an embodiment of the present invention.
(a) to (c) are diagrams each showing an example of a search screen displayed by the voice search device according to the embodiment of the present invention. (a) shows one example of the search screen, (b) shows another example, and (c) shows yet another example.
A diagram for explaining the relationship among the waveform of the recorded voice, the likelihood acquisition sections, and the frames.
A flowchart for explaining the voice search process executed by the voice search device according to the embodiment of the present invention.
A flowchart for explaining the speech speed designation process executed by the voice search device according to the embodiment of the present invention.
A flowchart for explaining the voice output process executed by the voice search device according to the embodiment of the present invention.
A diagram showing a configuration example of a voice search device according to a modification of the embodiment of the present invention.

The functions and operations of the voice search device according to an embodiment of the present invention are described below with reference to the drawings. In the drawings, identical or equivalent parts are denoted by the same reference numerals.

The voice search device uses the two-pass search method to search a recorded voice for a voice representing a query. Details of the two-pass search method are described later.

As shown in FIG. 1, the voice search device 100 includes a ROM (Read-Only Memory) 10, a RAM (Random Access Memory) 20, an external storage unit 30, an input unit 40, an output unit 50, and a CPU (Central Processing Unit) 60.

The ROM 10 permanently stores various programs, including a voice search program. Details of the voice search program are described later.

The RAM 20 temporarily stores data and programs. The RAM 20 functions as the work memory of the CPU 60.

The external storage unit 30 stores data persistently. The external storage unit 30 includes, for example, a hard disk. Specifically, the external storage unit 30 of this embodiment acquires in advance, from outside the device, the recorded voice that is the target of the voice search, and stores it. The external storage unit 30 also includes a monophone model DB (database) 31, a triphone model DB 32, a time length DB 33, a prosody DB 34, and a speech DB 35.

The monophone model DB 31 stores a monophone model. The monophone model is an acoustic model generated for each single phoneme (monophone). An acoustic model models the waveform features (for example, the frequency characteristics) of each phoneme. The monophone model does not depend on adjacent phonemes; that is, it is an acoustic model in which the state transitions to and from the preceding and following phoneme states are fixed.

The triphone model DB 32 stores a triphone model. The triphone model is an acoustic model generated for each sequence of three phonemes (triphone). It depends on adjacent phonemes; that is, it is an acoustic model that takes into account the state transitions to and from the preceding and following phoneme states.

The external storage unit 30 acquires in advance, from outside the device, a monophone model and a triphone model generated using known techniques, and stores them in the monophone model DB 31 and the triphone model DB 32, respectively.

The time length DB 33 stores the average length of each phoneme. The average length of a phoneme is the time a speaker needs to utter that phoneme at the average speech speed. The external storage unit 30 acquires in advance, from outside the device, the average length of each phoneme obtained using a known technique, and stores it in the time length DB 33.

The prosody DB 34 stores the prosody of each phoneme (pitch transition, phoneme length, and power) in association with a phoneme string that contains the phoneme. The external storage unit 30 acquires in advance, from outside the device, the prosody of each phoneme obtained using any known technique, and stores it in the prosody DB 34.

The speech DB 35 stores speech segments representing each phoneme. The external storage unit 30 acquires in advance, from outside the device, speech segments obtained using a known technique, and stores them in the speech DB 35.

The input unit 40 accepts input of various data from the user. Specifically, the input unit 40 accepts queries and operation instructions entered by the user. The input unit 40 supplies the input data to the CPU 60. The input unit 40 includes, for example, a keyboard and a microphone.

The output unit 50 outputs various data to the outside. The output unit 50 includes, for example, a display and a speaker; it outputs data as sound through the speaker and as images through the display.

The sounds output by the output unit 50 include the recorded voice and the demo voice described later.

The images output by the display of the output unit 50 include the search screens Wn shown in FIGS. 2(a) to 2(c). When executing the voice search process described later, the voice search device 100 sequentially displays the search screens Wn shown in FIGS. 2(a) to 2(c) on the display of the output unit 50. FIG. 2(a) shows the search screen Wn displayed when the user powers on the voice search device 100. FIG. 2(b) shows the search screen Wn displayed when the result of the first-pass search described later is presented to the user. FIG. 2(c) shows the search screen Wn displayed when the result of the second-pass search described later is presented to the user. Details of the voice search process, the first-pass search, and the second-pass search are described later.

The CPU 60 executes the various programs stored in the ROM 10. Specifically, by executing the voice search program stored in the ROM 10, the CPU 60 functionally realizes the speech speed designation unit 101, the voice output unit 102, the query acquisition unit 103, the search unit 104, and the presentation unit 105 shown in FIG. 1.

The speech speed designation unit 101 accepts the user's designation of the speech speed to be used for the voice search. Specifically, the speech speed designation unit 101 accepts a speech speed designated by the user from speech speeds within a predetermined variable range.

In this embodiment, the user designates the speech speed by designating a speech speed parameter, which is a multiplier applied to the average speech speed. For example, when the user designates "2" as the speech speed parameter, the speech speed designation unit 101 accepts twice the average speech speed as the designated speech speed. In this embodiment, the speech speed designation unit 101 accepts speech speed parameters in the range from "0.5" to "2.0". In other words, the user can designate, via the speech speed designation unit 101, any speech speed in the range (the predetermined variable range) from 0.5 times to 2.0 times the average speech speed.

Specifically, the speech speed designation unit 101 accepts the designation of the speech speed while the display of the output unit 50 is showing the search screen Wn shown in FIGS. 2(a) to 2(c). As shown in FIGS. 2(a) to 2(c), the search screen Wn includes a speech speed seek bar SB. The user designates the speech speed parameter by moving the speech speed slider SS along the speech speed seek bar SB via the input unit 40. As shown in FIGS. 2(a) to 2(c), the upper limit of the speech speed parameter that can be designated on the speech speed seek bar SB is "2.0" and the lower limit is "0.5". After placing the speech speed slider SS at the position corresponding to the desired speech speed parameter, the user selects the search icon RS or the demo voice playback icon P2 on the search screen Wn. On detecting that the user has selected the search icon RS or the demo voice playback icon P2, the speech speed designation unit 101 acquires the speech speed parameter corresponding to the position of the speech speed slider SS at the time of detection. The speech speed designation unit 101 accepts the speech speed obtained by multiplying the average speech speed by the acquired speech speed parameter as the speech speed designated by the user.
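
The conversion from slider position to designated speech speed can be sketched as follows. This is only an illustration under stated assumptions: the description gives the limits 0.5 and 2.0 and the multiplication by the average speech speed, but the linear mapping from a normalized slider position and the function names `slider_to_rate_parameter` and `designated_speech_speed` are hypothetical.

```python
MIN_RATE_PARAMETER = 0.5  # lower limit of the speech speed parameter (from the description)
MAX_RATE_PARAMETER = 2.0  # upper limit of the speech speed parameter (from the description)


def slider_to_rate_parameter(position: float) -> float:
    """Map a slider position in [0, 1] to a speech speed parameter.

    The linear mapping is an assumption; the description only fixes the limits.
    """
    position = min(max(position, 0.0), 1.0)  # keep the slider inside the seek bar
    return MIN_RATE_PARAMETER + position * (MAX_RATE_PARAMETER - MIN_RATE_PARAMETER)


def designated_speech_speed(position: float, average_speed: float) -> float:
    """Multiply the average speech speed by the parameter read from the slider."""
    return slider_to_rate_parameter(position) * average_speed
```

With this mapping, the left end of the seek bar yields half the average speech speed and the right end yields twice the average speech speed.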

The CPU 60 realizes the speech speed designation unit 101 by controlling the input unit 40.

The voice output unit 102 outputs a demo voice having the speech speed designated by the user via the speech speed designation unit 101 (the speech speed used for the voice search). In this embodiment, as shown in FIG. 1, the voice output unit 102 includes a demo voice output unit 102a and a recorded voice output unit 102b, and outputs the demo voice and the recorded voice in a manner that allows the user to compare their speech speeds. The CPU 60 realizes the voice output unit 102 by controlling the speaker of the output unit 50.

Specifically, the demo voice output unit 102a outputs, at the speech speed accepted by the speech speed designation unit 101, a demo voice representing the query acquired by the query acquisition unit 103 described later. The demo voice output unit 102a outputs the demo voice in a manner that allows the user to compare the speech speed of the demo voice with the speech speed of the recorded voice that is the target of the voice search.

More specifically, the demo voice output unit 102a generates a demo voice that represents the query at the speech speed designated by the user by executing the following processes (A1) to (A4) in order.
(A1) Acquire from the prosody DB 34 the prosody (pitch transition, phoneme length, and power) of each phoneme in the phoneme string representing the query.
(A2) Of the prosody acquired in (A1), correct the phoneme length of each phoneme to the time length obtained by dividing the average length of the phoneme by the speech speed parameter designated by the user via the speech speed designation unit 101.
(A3) Acquire from the speech DB 35 the speech segment of each phoneme in the phoneme string representing the query.
(A4) Generate the demo voice by adjusting the pitch transition, phoneme length, and power of the speech segments acquired in (A3) so that they match the prosody corrected in (A2).
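
The phoneme length correction in (A2) amounts to a single division per phoneme. The sketch below illustrates only that step; the function name `corrected_phoneme_lengths` and the use of seconds as the unit are assumptions, and the signal processing of (A4) is not shown.

```python
from typing import List, Sequence


def corrected_phoneme_lengths(average_lengths: Sequence[float],
                              rate_parameter: float) -> List[float]:
    """Step (A2): divide each phoneme's average length by the speech speed parameter.

    A parameter above 1 shortens every phoneme (faster speech);
    a parameter below 1 lengthens it (slower speech).
    """
    if rate_parameter <= 0:
        raise ValueError("the speech speed parameter must be positive")
    return [length / rate_parameter for length in average_lengths]
```

For example, with a parameter of 2.0, phonemes whose average lengths are 0.10 s and 0.08 s are corrected to 0.05 s and 0.04 s.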

The recorded voice output unit 102b outputs the recorded voice that is the target of the voice search. Specifically, the recorded voice output unit 102b outputs the recorded voice by either of the following methods (B1) and (B2), which allows the user to compare the speech speed of the demo voice with the speech speed of the recorded voice.
(B1) The recorded voice output unit 102b starts outputting the recorded voice from an arbitrary playback position designated by the user via the input unit 40 (it outputs the part of the recorded voice designated by the user).
(B2) The recorded voice output unit 102b outputs the recorded voice corresponding to the result designated by the user from among the results of the voice search executed by the search unit 104 described later (it outputs the part of the recorded voice that the search unit 104 obtained by the voice search).

The methods (B1) and (B2) are described in detail below.

In method (B1), the user designates the playback position of the recorded voice by moving the recorded voice slider VS, which is included in the search screen Wn shown in FIGS. 2(a) to 2(c), along the recorded voice seek bar VB via the input unit 40. After placing the recorded voice slider VS at the position on the recorded voice seek bar VB corresponding to the desired playback position, the user selects the recorded voice playback icon P1 on the search screen Wn. On detecting that the user has selected the recorded voice playback icon P1, the recorded voice output unit 102b starts outputting the recorded voice from the position in the recorded voice corresponding to the position of the recorded voice slider VS at the time of detection. By listening to the demo voice and to the recorded voice at the designated playback position (the part of the recorded voice designated by the user), the user can compare their speech speeds.

In method (B2), the user designates a result of the voice search executed by the search unit 104 by selecting, via the input unit 40, one of the first-pass result icons IA1 to IA5 included in FIGS. 2(b) and 2(c) or one of the second-pass result icons IB1 to IB5 included in FIG. 2(c). By executing the voice search, the search unit 104 identifies sections of the recorded voice that are estimated to contain the query. The first-pass result icons IA1 to IA5 and the second-pass result icons IB1 to IB5 are all displayed on the search screen Wn by the presentation unit 105 described later. Each of these icons represents a section of the recorded voice identified by the search unit 104 through the voice search. On detecting that the user has selected one of the first-pass result icons IA1 to IA5 or the second-pass result icons IB1 to IB5, the recorded voice output unit 102b outputs the recorded voice of the section represented by the selected icon (the part of the recorded voice that the search unit 104 obtained by the voice search). Details of the first-pass result icons IA1 to IA5 and the second-pass result icons IB1 to IB5 are described later. By listening to the demo voice and to the recorded voice corresponding to the result of the voice search, the user can compare their speech speeds.

With method (B2), the user can compare the demo voice representing the query with the recorded voice of a section that was obtained as a result of the voice search and is estimated to contain a voice representing the query. The user can therefore compare the speech speeds more precisely than with method (B1), in which the demo voice is compared with the recorded voice at an arbitrary playback position. With method (B1), on the other hand, unlike method (B2), the user can compare the speech speeds of the demo voice and the recorded voice even at a stage where the voice search process has not been executed and no search result has been obtained yet.

The query acquisition unit 103 acquires a query.

Based on the speech speed designated by the user via the speech speed designation unit 101 (the speech speed accepted by the speech speed designation unit 101), the search unit 104 searches the recorded voice that is the target of the voice search for a voice representing the query. Specifically, the search unit 104 includes a first search unit 104a and a second search unit 104b, and uses the two-pass search method to identify the section of the recorded voice that is estimated to contain a voice representing the query (the estimated section).

In the two-pass search method, the search unit 104 executes the voice search in two stages. In the first stage, the first search unit 104a selects candidates for the estimated section by performing a voice search based on the monophone model (an acoustic model that does not depend on adjacent phonemes). In the second stage, the second search unit 104b identifies the estimated section from among the candidates selected in the first stage by performing a voice search based on the triphone model (an acoustic model that depends on adjacent phonemes). In the following description, the voice search executed by the first search unit 104a is called the first-pass search, and the voice search executed by the second search unit 104b is called the second-pass search.

Compared with the second-pass search, which is based on the triphone model and its larger amount of information, the first-pass search based on the monophone model is less accurate but requires less computation and takes less time to produce a search result (its search speed is high). Compared with the first-pass search, the second-pass search requires more computation and takes longer to produce a search result (its search speed is low) but is more accurate. By executing the accurate second-pass search only on the estimated section candidates selected by the fast first-pass search, the search unit 104 keeps the amount of computation down and achieves a fast and accurate voice search.
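
The division of labor between the two passes can be sketched generically as follows. This is a minimal illustration of coarse-to-fine candidate selection, not the device's implementation: `cheap_score` and `precise_score` are hypothetical stand-ins for the monophone-based and triphone-based likelihoods, and `num_candidates` is an assumed parameter.

```python
from typing import Callable, List, Sequence, TypeVar

Section = TypeVar("Section")


def two_pass_search(sections: Sequence[Section],
                    cheap_score: Callable[[Section], float],
                    precise_score: Callable[[Section], float],
                    num_candidates: int) -> List[Section]:
    """Rank sections with a cheap score, then re-rank only the best few precisely."""
    # First pass: score every section with the inexpensive model and keep the top candidates.
    candidates = sorted(sections, key=cheap_score, reverse=True)[:num_candidates]
    # Second pass: apply the expensive model to the surviving candidates only.
    return sorted(candidates, key=precise_score, reverse=True)
```

Because `precise_score` is called only on `num_candidates` sections rather than on every section, the total cost stays close to that of the first pass while the final ordering reflects the more accurate model.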

Specifically, in the first-pass search, the first search unit 104a executes the following processes (C1) to (C7) in order.
(C1) Convert the query into a monophone phoneme string.
(C2) Acquire the average length of each phoneme in the monophone phoneme string.
(C3) Based on the average length of each phoneme, acquire the utterance time length of a voice representing the query at the average speech speed.
(C4) Acquire the utterance time length of a voice representing the query at the speech speed designated by the user.
(C5) Set likelihood acquisition sections.
(C6) Based on the monophone phoneme string, acquire a first likelihood for each likelihood acquisition section.
(C7) Based on the first likelihoods, select candidates for the estimated section from among the likelihood acquisition sections.

The processes (C1) to (C7) are described in detail below.

In process (C1), the first search unit 104a converts the query into a monophone phoneme string by arranging the phonemes of the monophone model according to the query, that is, in the same order as the characters corresponding to the phonemes appear in the query.

In process (C2), the first search unit 104a acquires from the time length DB 33 the average length of each phoneme in the monophone phoneme string obtained in process (C1).

In process (C3), the first search unit 104a acquires the utterance time length of a voice representing the query at the average speech speed by summing the average lengths of the phonemes acquired in process (C2).

In process (C4), the first search unit 104a acquires the utterance time length of a voice representing the query at the speech speed designated by the user by dividing the utterance time length at the average speech speed, acquired in process (C3), by the speech speed parameter designated by the user via the speech speed designation unit 101.
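
Processes (C2) to (C4) reduce to a table lookup, a sum, and a division. The sketch below assumes that the time length DB 33 can be represented as a plain dictionary from phoneme to average length in seconds; the function name `utterance_time_length` and the sample values in the usage note are hypothetical.

```python
from typing import Mapping, Sequence


def utterance_time_length(phonemes: Sequence[str],
                          average_length_db: Mapping[str, float],
                          rate_parameter: float) -> float:
    """Return the utterance time length of the query at the designated speech speed."""
    if rate_parameter <= 0:
        raise ValueError("the speech speed parameter must be positive")
    # (C2) look up the average length of each phoneme, (C3) sum the lengths.
    average_total = sum(average_length_db[p] for p in phonemes)
    # (C4) divide by the speech speed parameter designated by the user.
    return average_total / rate_parameter
```

For a query whose phonemes are `["k", "a", "k", "a"]`, with made-up average lengths of 0.06 s for `k` and 0.10 s for `a`, the utterance time length is 0.32 s at the average speech speed and 0.16 s when the parameter is 2.0.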

In process (C5), as shown in FIG. 3, the first search unit 104a sets P likelihood acquisition sections, from the 0th likelihood acquisition section to the (P-1)th likelihood acquisition section, on the time axis of the recorded voice. The time length L of each likelihood acquisition section is the utterance time length, acquired in process (C4), of a voice representing the query at the speech speed designated by the user. The start positions of successive likelihood acquisition sections are spaced apart by a predetermined shift length S. Details of the shift length S are described later. In FIG. 3, the horizontal axis indicates time and the vertical axis indicates the magnitude of the waveform WV of the recorded voice. In FIG. 3, the waveform WV of the recorded voice has a time length T.
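
The layout of the likelihood acquisition sections can be sketched as follows. Two points are assumptions rather than statements from the description: times are held as integer milliseconds, and a section is generated only if it fits entirely within the recorded voice, which gives P = floor((T - L) / S) + 1. The function name is hypothetical.

```python
from typing import List, Tuple


def likelihood_acquisition_sections(total_ms: int, section_ms: int,
                                    shift_ms: int) -> List[Tuple[int, int]]:
    """Return (start, end) pairs for sections of length L shifted by S over a recording of length T."""
    if section_ms <= 0 or shift_ms <= 0:
        raise ValueError("section length and shift length must be positive")
    if section_ms > total_ms:
        return []  # the query would be longer than the recording
    # Assumed rule: only sections that end at or before T are kept.
    count = (total_ms - section_ms) // shift_ms + 1
    return [(p * shift_ms, p * shift_ms + section_ms) for p in range(count)]
```

With T = 1000 ms, L = 400 ms, and S = 100 ms, this yields P = 7 sections, from (0, 400) to (600, 1000). Designating a faster speech speed shortens L and therefore increases P for the same recording.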

In process (C6), the first search unit 104a acquires a first likelihood for each of the likelihood acquisition sections set in process (C5). The first likelihood is an index, obtained based on the monophone model, that indicates how likely it is that a likelihood acquisition section contains a voice representing the query.

Specifically, the first search unit 104a acquires the first likelihood of each likelihood acquisition section by executing the following processes (C6-1) to (C6-5) in order.
(C6-1) Set frames.
(C6-2) Acquire the feature quantity of the recorded voice for each frame.
(C6-3) Calculate an output probability for each frame based on the monophone model.
(C6-4) Apply lower bounding to the output probabilities.
(C6-5) Calculate the first likelihood.

以下、（Ｃ６−１）〜（Ｃ６−５）の各処理の詳細について説明する。 Hereinafter, the details of the processes (C6-1) to (C6-5) will be described.

In the process of (C6-1), as shown in FIG. 3, the first search unit 104a sets N frames (time windows), from the 0th frame to the (N-1)th frame, on the time axis of the recorded voice. Each frame has a predetermined frame length F as its time length. The start positions of the frames are spaced apart by the shift length S. The frame length F and the shift length S are set when the acoustic model is created. The frame length F is set longer than the shift length S so that each frame overlaps its adjacent frames.

Each likelihood acquisition section set in the process of (C5) contains a plurality of the frames set in (C6-1). For example, the 0th likelihood acquisition section contains M frames, from the 0th frame to the (M-1)th frame. By executing the processes of (C6-2) to (C6-5) described below, the first search unit 104a compares, frame by frame, the recorded voice in one likelihood acquisition section with the monophone phoneme string acquired in the process of (C1), and acquires the first likelihood of that likelihood acquisition section.

In the process of (C6-2), the first search unit 104a acquires the feature amount of the recorded voice in each frame contained in one likelihood acquisition section. This feature amount is obtained by combining a frequency-axis feature parameter, obtained by transforming voice data onto the frequency axis (for example, a cepstrum or mel-cepstrum), with a power feature parameter, obtained by calculating the energy sum of squares of the voice data or its logarithm. For example, the feature amount is configured as a 38-dimensional vector with 38 components in total: 12 components (12 dimensions) of the frequency-axis feature parameter and 1 component (1 dimension) of the power feature parameter; the differences between these components and those of the immediately preceding time window, namely 12 components (12 dimensions) of the Δ frequency-axis feature parameter and 1 component (1 dimension) of the Δ power feature parameter; and the differences of those differences with respect to the immediately preceding time window, namely 12 components (12 dimensions) of the ΔΔ frequency-axis feature parameter.
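The composition of the 38-dimensional vector (12 + 1 + 12 + 1 + 12 components) can be sketched as below. This is an illustrative sketch only: the function name `build_feature`, the ordering of the components inside the vector, and the way the static values are obtained are assumptions, and the ΔΔ terms are computed here as the difference between two consecutive Δ values.

```python
def build_feature(prev2, prev1, cur):
    """Assemble one 38-dimensional feature vector.

    Each argument is a list of 13 static values for three consecutive
    time windows: 12 frequency-axis components followed by 1 power
    component. (Hypothetical layout; the description fixes only the counts.)
    """
    assert len(prev2) == len(prev1) == len(cur) == 13
    # Delta: difference from the immediately preceding time window (13 values).
    delta_cur = [c - p for c, p in zip(cur, prev1)]
    delta_prev = [p - q for p, q in zip(prev1, prev2)]
    # Delta-delta: difference of the differences, frequency-axis part only (12 values).
    delta2_freq = [dc - dp for dc, dp in zip(delta_cur[:12], delta_prev[:12])]
    return list(cur) + delta_cur + delta2_freq  # 13 + 13 + 12 = 38
```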

In the process of (C6-3), based on the feature amount of the recorded voice acquired in the process of (C6-2), the first search unit 104a calculates, for each frame, the output probability that this feature amount is output from each phoneme contained in the monophone phoneme string. Specifically, the first search unit 104a compares the feature amount of the recorded voice in each frame with the monophone model of the state that corresponds to this frame among the states of the phonemes contained in the monophone phoneme string. It then calculates the probability (output probability) that the feature amount in each frame is output from each phoneme contained in the monophone phoneme string. This output probability is expressed by a Gaussian mixture continuous distribution obtained as a weighted sum of a plurality of Gaussian distributions.
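A weighted sum of Gaussian distributions of this kind can be evaluated as in the following sketch. The description does not give the covariance structure or the number of mixture components, so the diagonal covariance, the log-domain evaluation, and the name `log_gmm` are assumptions made for illustration.

```python
import math

def log_gmm(x, weights, means, variances):
    """Log output probability of feature vector x under a Gaussian mixture.

    weights   -- mixture weights (summing to 1)
    means     -- one mean vector per mixture component
    variances -- one diagonal variance vector per component (assumed diagonal)
    """
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            ll += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(ll)
    # Log-sum-exp keeps the weighted sum numerically stable.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))
```

Working in the log domain here is a design choice of the sketch; it matches the later steps, which add output probabilities on the logarithmic axis.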

In the process of (C6-4), the first search unit 104a replaces the output probability of each frame calculated in the process of (C6-3) with the largest output probability among those of the n frames before and after that frame. This process is called Lower-Bound conversion. In the present embodiment, n is set to the number of frames corresponding to 100 msec. Lower-Bound conversion reduces the variation of the output probability in the time direction. Accordingly, Lower-Bound conversion makes it possible to absorb, within the range of the n preceding and following frames, the error between the average length of each phoneme stored in the time length DB 33 and the actual duration of each phoneme. It also makes it possible to absorb, within the same range, the error between the utterance time length of the voice representing the query at the user-designated speech speed, acquired in the process of (C4), and the utterance time length of the actual recorded voice representing the query.
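Lower-Bound conversion amounts to a sliding-window maximum over time. The sketch below applies it to the sequence of log output probabilities of a single phoneme state; the name `lower_bound`, the inclusion of the frame itself in the window, and the truncation of the window at both ends of the sequence are assumptions made for illustration.

```python
def lower_bound(log_probs, n):
    """Replace each value with the maximum over the n frames before and after it.

    log_probs -- per-frame log output probabilities for one phoneme state
    n         -- half-width of the window in frames (the number of frames
                 corresponding to 100 msec in the embodiment)
    """
    size = len(log_probs)
    return [
        max(log_probs[max(0, i - n):min(size, i + n + 1)])
        for i in range(size)
    ]
```

If the shift length S were 10 msec (a hypothetical value, since the description leaves S to the acoustic model), n would be 10 frames.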

In the process of (C6-5), the first search unit 104a sums, on the logarithmic axis, the output probabilities of the frames that were Lower-Bound converted in (C6-4), thereby acquiring the first likelihood of the likelihood acquisition section containing these frames.

By executing the processes of (C6-2) to (C6-5), the first search unit 104a acquires the first likelihood of one likelihood acquisition section. The first search unit 104a acquires the first likelihood of every likelihood acquisition section by executing the processes of (C6-2) to (C6-5) for each of the likelihood acquisition sections set in the process of (C5).
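The summation on the logarithmic axis can be sketched as follows. The data layout is hypothetical: `log_prob[state][frame]` holds the Lower-Bound converted log output probabilities, and `state_of_offset[m]` names the phoneme state that is assumed to correspond to the m-th frame of a section (derived, for example, from the average phoneme lengths scaled by the speech speed parameter).

```python
def first_likelihood(log_prob, state_of_offset, start):
    """First likelihood of the section whose first frame is `start`.

    log_prob        -- log_prob[state][frame], Lower-Bound converted
    state_of_offset -- state index assigned to each frame offset in a section
    start           -- index of the first frame of the section
    """
    return sum(
        log_prob[state][start + offset]
        for offset, state in enumerate(state_of_offset)
    )
```

Calling this for the start frame of every likelihood acquisition section gives one first likelihood per section, without any alignment search.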

In the process of (C7), the first search unit 104a sets a plurality of selection sections on the time axis of the recorded voice. Each selection section has a predetermined selection time length as its time length. The first search unit 104a first selects, for each selection section, the likelihood acquisition section with the largest first likelihood acquired in the process of (C6) from among the likelihood acquisition sections whose start positions fall within that selection section. Next, from among the likelihood acquisition sections thus selected, the first search unit 104a selects a predetermined number (five in the present embodiment) of likelihood acquisition sections with the largest likelihoods as candidates for the estimated section.

As shown in FIG. 3, the likelihood acquisition sections set in the process of (C5) overlap one another. For this reason, a single high-likelihood section of the recorded voice may belong to a plurality of likelihood acquisition sections. In this case, if the first search unit 104a selected the predetermined number of likelihood acquisition sections with the largest likelihoods from the entire recorded voice as candidates for the estimated section, the candidates might be selected from only one part of the recorded voice, while high-likelihood sections in other parts of the recorded voice might be overlooked. Therefore, the first search unit 104a first selects one likelihood acquisition section from each selection section, and then selects, from among the selected likelihood acquisition sections, the predetermined number with the largest likelihoods as candidates for the estimated section. This allows the first search unit 104a to select candidates for the estimated section from the entire recorded voice without bias.

選択時間長（選択区間の時間長）は、尤度取得区間の時間長Ｌよりも短い時間に設定される。例えば、尤度取得区間の時間長Ｌを所定の定数ｋにより除算した値（Ｌ／ｋ）を、選択時間長として設定する。 The selection time length (the time length of the selection section) is set to a time shorter than the time length L of the likelihood acquisition section. For example, a value (L / k) obtained by dividing the time length L of the likelihood acquisition section by a predetermined constant k is set as the selected time length.
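The two-stage selection of (C7) can be sketched as follows. The function name `select_candidates` and the assumption that the selection sections are contiguous, non-overlapping, and start at time 0 are hypothetical; the description fixes only the selection time length and the number of candidates.

```python
def select_candidates(sections, selection_len, top_k=5):
    """Pick candidates for the estimated section.

    sections      -- list of (start_position, first_likelihood) pairs
    selection_len -- selection time length, e.g. L / k
    top_k         -- number of candidates to keep (five in the embodiment)
    """
    best_per_selection = {}
    for start, score in sections:
        # Index of the selection section that contains this start position.
        index = int(start // selection_len)
        current = best_per_selection.get(index)
        if current is None or score > current[1]:
            best_per_selection[index] = (start, score)
    # Second stage: keep the top_k winners across all selection sections.
    ranked = sorted(best_per_selection.values(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]
```

Because each selection section contributes at most one winner, neighboring overlapping sections around one strong match cannot occupy all of the candidate slots.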

As described above, by executing the processes of (C1) to (C7), the first search unit 104a selects, from the recorded voice, candidates for the estimated section (a section of the recorded voice estimated to contain the voice representing the query) based on the speech speed designated by the user via the speech speed designating unit 101 and on the monophone model (an acoustic model that does not depend on adjacent phonemes).

第１検索部１０４ａは、（Ｃ７）の処理で選択した推定区間の候補の録音音声中における位置を示す情報を、１パス目検索の結果を示す情報として、提示部１０５へ供給する。 The first search unit 104a supplies information indicating the position of the estimated section candidate selected in the process of (C7) in the recorded voice to the presentation unit 105 as information indicating the result of the first pass search.

In the second-pass search, the second search unit 104b executes the following processes (D1) to (D3) in order on the candidates for the estimated section selected by the first search unit 104a.
(D1) The query is converted into a triphone phoneme string.
(D2) Based on the triphone phoneme string, a second likelihood is acquired for each candidate for the estimated section.
(D3) Based on the second likelihood, the estimated sections are specified from among the candidates.

（Ｄ１）の処理において、第２検索部１０４ｂは、クエリに従ってトライフォンモデルの音素を並べることにより、クエリをトライフォン音素列に変換する。 In the process of (D1), the second search unit 104b converts the query into a triphone phoneme string by arranging phonemes of the triphone model according to the query.

In the process of (D2), the second search unit 104b acquires a second likelihood for each of the likelihood acquisition sections selected as candidates for the estimated section in (C7). The second likelihood is an index, acquired based on the triphone model, indicating how likely it is that a likelihood acquisition section is a section containing the voice representing the query. Because the second likelihood is acquired based on the triphone model, which carries more information than the monophone model, it is a more accurate index (likelihood) than the first likelihood acquired based on the monophone model.

Specifically, the second search unit 104b acquires the second likelihood of one candidate for the estimated section (likelihood acquisition section) by executing the following processes (D2-1) and (D2-2). The second search unit 104b acquires the second likelihood of every such likelihood acquisition section by executing the processes of (D2-1) and (D2-2) in order for each of the likelihood acquisition sections selected in the process of (C7).
(D2-1) An output probability is calculated for each frame based on the triphone model.
(D2-2) The second likelihood is acquired based on the output probabilities.

以下、（Ｄ２−１）及び（Ｄ２−２）の各処理の詳細について説明する。 Hereinafter, details of each processing of (D2-1) and (D2-2) will be described.

In the process of (D2-1), the second search unit 104b calculates, for each frame, the output probability that the feature amount of the recorded voice in the likelihood acquisition section is output from each phoneme contained in the triphone phoneme string acquired in the process of (D1). Specifically, the second search unit 104b compares the feature amount of the recorded voice in each frame, acquired in the process of (C6-2), with the model of each triphone contained in the triphone phoneme string. It then calculates the probability that the feature amount in each frame is output from each triphone.

In the process of (D2-2), based on the output probabilities calculated in the process of (D2-1), the second search unit 104b searches, by DP (Dynamic Programming) matching, for the correspondence between each frame in the likelihood acquisition section and each triphone contained in the triphone phoneme string. It then acquires the second likelihood of this likelihood acquisition section by summing the logarithms of the output probabilities acquired for the triphones associated with the respective frames in the likelihood acquisition section.
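One common form of DP matching for this kind of frame-to-unit alignment is sketched below. The description does not state the allowed transitions, so the constraints used here (the first frame is matched to the first triphone, the last frame to the last triphone, and each subsequent frame either stays on the same triphone or advances to the next one) are assumptions, as is the name `second_likelihood`.

```python
def second_likelihood(log_prob):
    """Best total log output probability over monotonic alignments.

    log_prob -- log_prob[k][t]: log output probability of frame t under
                the k-th triphone of the triphone phoneme string
    Returns float('-inf') if no alignment exists (fewer frames than triphones).
    """
    num_units = len(log_prob)
    num_frames = len(log_prob[0])
    neg_inf = float("-inf")
    best = [neg_inf] * num_units
    best[0] = log_prob[0][0]  # the first frame must match the first triphone
    for t in range(1, num_frames):
        updated = [neg_inf] * num_units
        for k in range(num_units):
            stay = best[k]
            advance = best[k - 1] if k > 0 else neg_inf
            previous = max(stay, advance)
            if previous > neg_inf:
                updated[k] = previous + log_prob[k][t]
        best = updated
    return best[-1]  # the last frame must match the last triphone
```

Unlike the fixed frame-to-state assignment of the first pass, this search lets each triphone stretch or shrink in time, which is one reason the second pass is more expensive and is applied only to the candidates.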

In the process of (D3), the second search unit 104b specifies, as estimated sections, a predetermined number (five in the present embodiment) of likelihood acquisition sections with the largest second likelihoods acquired in the process of (D2).

As described above, by executing the processes of (D1) to (D3), the second search unit 104b specifies the estimated sections from among the candidates selected by the first search unit 104a, based on the triphone model (an acoustic model that depends on adjacent phonemes).

第２検索部１０４ｂは、（Ｄ３）の処理で特定した推定区間の録音音声中における位置情報を、２パス目検索の結果を示す情報として、提示部１０５へ供給する。 The second search unit 104b supplies the position information in the recorded voice of the estimated section specified in the process (D3) to the presentation unit 105 as information indicating the result of the second pass search.

図１に戻って、提示部１０５は、検索部１０４が実行した音声検索の結果をユーザに提示する。 Returning to FIG. 1, the presentation unit 105 presents the result of the voice search performed by the search unit 104 to the user.

In the present embodiment, the presentation unit 105 presents these search results to the user by displaying, on the display of the output unit 50, icons indicating the results of the first-pass search and the second-pass search executed by the search unit 104.

Specifically, as shown in FIGS. 2B and 2C, the presentation unit 105 displays first-pass result icons IA1 to IA5 in the search screen Wn. The first-pass result icons IA1 to IA5 are icons that respectively represent the candidates for the estimated section selected by the first search unit 104a in the first-pass search. The presentation unit 105 displays the first-pass result icons IA1 to IA5 based on the information indicating the result of the first-pass search (information indicating the positions of the candidates in the recorded voice) supplied from the first search unit 104a. That is, the presentation unit 105 presents the result of the first-pass search to the user by displaying the first-pass result icons IA1 to IA5 in the search screen Wn.

In addition, as shown in FIG. 2C, the presentation unit 105 displays second-pass result icons IB1 to IB5 in the search screen Wn. The second-pass result icons IB1 to IB5 are icons that respectively represent the estimated sections specified by the second search unit 104b in the second-pass search. The presentation unit 105 displays the second-pass result icons IB1 to IB5 based on the information indicating the result of the second-pass search (information indicating the positions of the estimated sections in the recorded voice) supplied from the second search unit 104b. That is, the presentation unit 105 presents the result of the second-pass search to the user by displaying the second-pass result icons IB1 to IB5 in the search screen Wn.

ＣＰＵ６０は、出力部５０を制御することにより、提示部１０５を実現する。 The CPU 60 realizes the presentation unit 105 by controlling the output unit 50.

以下、上記の物理的・機能的構成を有する音声検索装置１００が実行する音声検索処理について、図４に示すフローチャートを参照して説明する。 Hereinafter, the voice search processing executed by the voice search device 100 having the above-described physical / functional configuration will be described with reference to the flowchart shown in FIG.

音声検索装置１００は、音声検索の対象である録音音声を予め外部から取得し、外部記憶部３０に記憶している。 The voice search device 100 acquires a recorded voice that is a target of voice search from the outside in advance and stores it in the external storage unit 30.

The voice search device 100 also acquires from the outside in advance a monophone model, a triphone model, the average length of each phoneme, the prosody of each phoneme, and the speech segment of each phoneme, all obtained by any known technique, and stores them in the external storage unit 30.

ユーザが音声検索装置１００の電源を投入すると、音声検索装置１００は、出力部５０が備えるディスプレイに、図２（ａ）の検索画面Ｗｎを表示する。 When the user turns on the power of the voice search device 100, the voice search device 100 displays the search screen Wn of FIG. 2A on the display provided in the output unit 50.

In the search screen Wn of FIG. 2A, the user uses the input unit 40 to place the speech speed slider SS at the position corresponding to the speech speed (speech speed parameter) that the user wishes to use for the voice search. In FIG. 2A, the speech speed slider SS is placed at the position corresponding to a speech speed 1.3 times the average speech speed (the position corresponding to the speech speed parameter "1.3").

ユーザは、入力部４０が備えるキーボードを介して、クエリをテキスト入力する。図２（ａ）では、文字列「ラーメン」がクエリとして入力されている。 The user inputs the query text via the keyboard provided in the input unit 40. In FIG. 2A, the character string “ramen” is input as a query.

ユーザが、入力部４０を介し、検索画面Ｗｎ中の検索アイコンＲＳを選択すると、これに応答して、音声検索装置１００のＣＰＵ６０が、図４のフローチャートに示す音声検索処理を開始する。 When the user selects the search icon RS in the search screen Wn via the input unit 40, in response to this, the CPU 60 of the voice search device 100 starts the voice search process shown in the flowchart of FIG.

When the voice search process starts, the speech speed designating unit 101 first acquires the speech speed parameter corresponding to the position of the speech speed slider SS at the time the search icon RS was selected (that is, it acquires the speech speed) (step S101). Next, the query acquisition unit 103 acquires the query entered as text by the user (step S102).

The first search unit 104a converts the query acquired in step S102 into a monophone phoneme string (step S103). The first search unit 104a acquires the average lengths of the phonemes in the monophone phoneme string acquired in step S103 from the time length DB 33, thereby acquiring the utterance time length of the average-speech-speed voice representing the query (step S104).

第１検索部１０４ａは、ステップＳ１０４で取得した発話時間長をステップＳ１０１で取得した話速パラメータによって除算することにより、クエリを表すユーザが指定した話速の音声の発話時間長を取得する（ステップＳ１０５）。 The first search unit 104a divides the utterance time length acquired in step S104 by the speech speed parameter acquired in step S101, thereby acquiring the utterance time length of the speech with the speech speed specified by the user representing the query (step S105).
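Steps S104 and S105 reduce to a sum followed by a division, as in the sketch below. The function name, the phoneme symbols, and the average lengths in the usage note are hypothetical values chosen for illustration; the actual contents of the time length DB 33 are not given in the description, and summing the per-phoneme averages is an assumption about how the average-speed length is obtained.

```python
def query_duration_ms(phonemes, average_ms, rate):
    """Utterance time length of the query at the user-designated speech speed.

    phonemes   -- monophone phoneme string of the query (step S103)
    average_ms -- average length of each phoneme, as a dict (stands in for
                  the time length DB 33)
    rate       -- speech speed parameter (1.0 = average speech speed)
    """
    if rate <= 0:
        raise ValueError("speech speed parameter must be positive")
    average_total = sum(average_ms[p] for p in phonemes)  # step S104
    return average_total / rate                           # step S105
```

For example, if the five phonemes of a query had hypothetical average lengths of 60, 160, 70, 100 and 130 msec (520 msec in total), a speech speed parameter of 1.3 would shorten the time length L to about 400 msec, while a parameter below 1.0 would lengthen it.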

第１検索部１０４ａは、ステップＳ１０５で取得した発話時間長を時間長Ｌとして有する尤度取得区間を、録音音声の時間軸上に複数設定する（ステップＳ１０６）。 The first search unit 104a sets a plurality of likelihood acquisition sections having the utterance time length acquired in step S105 as the time length L on the time axis of the recorded voice (step S106).

次に、第１検索部１０４ａは、録音音声の時間軸上に所定のフレーム長Ｆを時間長として有するフレームを複数設定する（ステップＳ１０７）。 Next, the first search unit 104a sets a plurality of frames having a predetermined frame length F as the time length on the time axis of the recorded voice (step S107).

第１検索部１０４ａは、ステップＳ１０７で設定したフレームごとに、録音音声の特徴量を取得する（ステップＳ１０８）。 The first search unit 104a acquires the feature amount of the recorded voice for each frame set in step S107 (step S108).

Based on the feature amount acquired in step S108 and the corresponding monophone model, the first search unit 104a calculates, for each frame, the output probability that this feature amount is output from each phoneme contained in the phoneme string acquired in step S103 (step S109).

第１検索部１０４ａは、ステップＳ１０９でフレームごとに算出した出力確率をＬｏｗｅｒ−Ｂｏｕｎｄ化する（各フレームの前後ｎフレームの中で最大の出力確率に置き換える）（ステップＳ１１０）。 The first search unit 104a converts the output probability calculated for each frame in Step S109 into Lower-Bound (replaces it with the maximum output probability in n frames before and after each frame) (Step S110).

For each of the likelihood acquisition sections set in step S106, the first search unit 104a acquires the first likelihood of that section by summing, on the logarithmic axis, the output probabilities, Lower-Bound converted in step S110, of all the frames contained in that section (step S111).

第１検索部１０４ａは、録音音声の時間軸上に、所定の選択時間長を時間長として有する選択区間を複数設定する（ステップＳ１１２）。 The first search unit 104a sets a plurality of selection sections having a predetermined selection time length as the time length on the time axis of the recorded voice (step S112).

第１検索部１０４ａは、ステップＳ１１２で設定した選択区間ごとに、各選択区間に開始位置が含まれる複数の尤度取得区間のうち、ステップＳ１１１で取得した第１尤度が最も大きい尤度取得区間を選択する（ステップＳ１１３）。 For each selection section set in step S112, the first search unit 104a acquires the likelihood having the largest first likelihood acquired in step S111 among a plurality of likelihood acquisition sections in which the start position is included in each selection section. A section is selected (step S113).

From among the likelihood acquisition sections selected in step S113, the first search unit 104a selects a predetermined number (five in the present embodiment) of likelihood acquisition sections with the largest first likelihoods acquired in step S111 as candidates for the estimated section (step S114). The first search unit 104a supplies information indicating the positions of the selected candidates in the recorded voice to the presentation unit 105.

Based on the information supplied from the first search unit 104a, the presentation unit 105 displays, on the display of the output unit 50, the search screen Wn shown in FIG. 2B, which contains the first-pass result icons IA1 to IA5 representing the candidates for the estimated section selected in step S114 (that is, it presents the result of the first-pass search to the user) (step S115).

第２検索部１０４ｂは、ステップＳ１０２で取得したクエリを、トライフォン音素列へ変換する（ステップＳ１１６）。 The second search unit 104b converts the query acquired in step S102 into a triphone phoneme string (step S116).

第２検索部１０４ｂは、ステップＳ１０７で設定したフレームごとに、ステップＳ１１６で取得したトライフォン音素列に基づいて、出力確率を算出する（ステップＳ１１７）。 The second search unit 104b calculates an output probability for each frame set in step S107, based on the triphone phoneme sequence acquired in step S116 (step S117).

第２検索部１０４ｂは、ステップＳ１１４で選択された推定区間の候補それぞれについて、ステップＳ１１７で算出した出力確率に基づいてＤＰマッチングを行うことにより、各推定区間の候補の第２尤度を取得する（ステップＳ１１８）。 The second search unit 104b obtains the second likelihood of each estimation section candidate by performing DP matching based on the output probability calculated in step S117 for each of the estimation section candidates selected in step S114. (Step S118).

From among the candidates for the estimated section selected in step S114, the second search unit 104b specifies, as estimated sections, a predetermined number (five in the present embodiment) of candidates with the largest second likelihoods acquired in step S118 (step S119). The second search unit 104b supplies information indicating the positions of the specified estimated sections in the recorded voice to the presentation unit 105.

提示部１０５は、第２検索部１０４ｂから供給された情報に基づき、ステップＳ１１９で特定した推定区間を表す２パス目結果アイコンＩＢ１〜ＩＢ５を含む、図２（ｃ）に示す検索画面Ｗｎを表示する（２パス目検索の結果をユーザへ提示する）（ステップＳ１２０）。 The presenting unit 105 displays the search screen Wn shown in FIG. 2C including the second pass result icons IB1 to IB5 representing the estimated section identified in step S119 based on the information supplied from the second search unit 104b. (The result of the second pass search is presented to the user) (step S120).

次に、ＣＰＵ６０は、検索画面Ｗｎ中の話速変更アイコンＣＳが選択されたか否か判別する（ステップＳ１２１）。話速変更アイコンＣＳが選択されていないと判別すると（ステップＳ１２１；ＮＯ）、ＣＰＵ６０は、検索画面Ｗｎ中の終了アイコンＥＳが選択されたか否か判別する（ステップＳ１２２）。話速変更アイコンＣＳが選択されておらず（ステップＳ１２１；ＮＯ）、終了アイコンＥＳも選択されていないと判別すると（ステップＳ１２２；ＮＯ）、処理はステップＳ１２１へ戻る。ＣＰＵ６０は、話速変更アイコンＣＳまたは終了アイコンＥＳの何れかが選択されるまで、ステップＳ１２１〜Ｓ１２２の処理を繰り返す。 Next, the CPU 60 determines whether or not the speech speed change icon CS on the search screen Wn has been selected (step S121). If it is determined that the speech speed change icon CS is not selected (step S121; NO), the CPU 60 determines whether or not the end icon ES in the search screen Wn is selected (step S122). If it is determined that the speech speed change icon CS is not selected (step S121; NO) and the end icon ES is not selected (step S122; NO), the process returns to step S121. The CPU 60 repeats the processes of steps S121 to S122 until either the speech speed change icon CS or the end icon ES is selected.

After the result of the second-pass search (the estimated sections) is presented in step S120, the user judges whether the presented search result is accurate by listening to the recorded voice. Specifically, by listening to the entire recorded voice, the user judges whether every section of the recorded voice that contains the voice representing the query has been specified as an estimated section (whether there are any search omissions). In addition, by listening to the recorded voice of every estimated section, the user judges whether the recorded voice of every section specified as an estimated section actually represents the query (whether there are any false detections).

If, after listening to the recorded voice, the user judges that the search result is inaccurate (there are omissions or false detections), it is highly likely that the speech speed used for the voice search in the immediately preceding steps S101 to S120 was inappropriate. In this case, to manually correct the speech speed used for the voice search, the user selects the speech speed change icon CS in the search screen Wn via the input unit 40. In response, the CPU 60 determines that the speech speed change icon CS has been selected (step S121; YES) and starts the speech speed designation process (step S123).

The speech speed designation process of step S123 is described in detail below with reference to the flowchart of FIG. 5. During the speech speed designation process of the flowchart of FIG. 5, the search screen Wn shown in FIG. 2C is displayed on the display of the output unit 50.

When the speech speed designation process starts, the recorded voice output unit 102b first determines whether the recorded voice playback icon P1 has been selected (step S201). When the user wishes to designate the playback position of the recorded voice, the user designates it by moving the recorded voice slider VS on the recorded voice seek bar VB and then selects the recorded voice playback icon P1. In response, the recorded voice output unit 102b determines that the recorded voice playback icon P1 has been selected (step S201; YES) and starts audio output (playback) of the recorded voice from the position corresponding to the position of the recorded voice slider VS at that time (step S203).

録音音声再生アイコンＰ１が選択されていないと判別すると（ステップＳ２０１；ＮＯ）、録音音声出力部１０２ｂは、１パス目結果アイコンＩＡ１〜ＩＡ５または２パス目結果アイコンＩＢ１〜ＩＢ５の何れかが選択されたか否か判別する（ステップＳ２０２）。ユーザは、検索結果に対応する録音音声を音声出力させることを所望する場合、何れかのアイコンを選択する。これに応答して、録音音声出力部１０２ｂは、１パス目結果アイコンＩＡ１〜ＩＡ５または２パス目結果アイコンＩＢ１〜ＩＢ５の何れかが選択されたと判別し（ステップＳ２０２；ＹＥＳ）、選択されたアイコンが表す検索結果に対応する区間の録音音声を音声出力する（ステップＳ２０３）。 If it determines that the recorded voice playback icon P1 has not been selected (step S201; NO), the recorded voice output unit 102b determines whether any of the first pass result icons IA1 to IA5 or the second pass result icons IB1 to IB5 has been selected (step S202). A user who wishes to hear the recorded voice corresponding to a search result selects one of these icons. In response, the recorded voice output unit 102b determines that one of the first pass result icons IA1 to IA5 or the second pass result icons IB1 to IB5 has been selected (step S202; YES) and outputs the recorded voice of the section corresponding to the search result represented by the selected icon (step S203).

何れのアイコンも選択されていないと判別すると（ステップＳ２０２；ＮＯ）、処理はステップＳ２０１へ戻る。録音音声出力部１０２ｂは、録音音声の再生位置が指定されるか、検索結果が指定されるまで（ステップＳ２０１またはステップＳ２０２の何れかでＹＥＳと判別されるまで）、ステップＳ２０１及びＳ２０２の処理を繰り返す。 If it determines that none of these icons has been selected (step S202; NO), the process returns to step S201. The recorded voice output unit 102b repeats steps S201 and S202 until a playback position of the recorded voice or a search result is designated (that is, until the determination in either step S201 or step S202 is YES).

録音音声の再生（ステップＳ２０３）が終わった後、録音音声出力部１０２ｂは、録音音声の再度の音声出力が指示されたか否か判別する（ステップＳ２０４）。ユーザは、録音音声を聞き直す必要があると判断した場合、録音音声再生アイコンＰ１、１パス目結果アイコンＩＡ１〜ＩＡ５または２パス目結果アイコンＩＢ１〜ＩＢ５のうち何れかを選択することにより、録音音声の再度の音声出力を指示する。これに応答して、録音音声出力部１０２ｂは、再度の音声出力が指示されたと判別し（ステップＳ２０４；ＹＥＳ）、処理はステップＳ２０１へ戻る。 After playback of the recorded voice (step S203) has finished, the recorded voice output unit 102b determines whether the recorded voice has been requested again (step S204). A user who decides that the recorded voice needs to be heard again requests it by selecting the recorded voice playback icon P1, one of the first pass result icons IA1 to IA5, or one of the second pass result icons IB1 to IB5. In response, the recorded voice output unit 102b determines that the recorded voice has been requested again (step S204; YES), and the process returns to step S201.

録音音声の再度の音声出力が指示されていないと判別すると（ステップＳ２０４；ＮＯ）、デモ音声出力部１０２ａが、デモ音声再生アイコンＰ２が選択されたか否か判別する（ステップＳ２０５）。デモ音声再生アイコンＰ２が選択されていないと判別すると（ステップＳ２０５；ＮＯ）、処理はステップＳ２０４へ戻る。音声出力部１０２は、録音音声の再度の音声出力が指示されるか、デモ音声の音声出力が指示されるまで（ステップＳ２０４またはステップＳ２０５でＹＥＳと判別されるまで）、ステップＳ２０４及びステップＳ２０５の処理を繰り返す。 If it is determined that the recorded voice has not been requested again (step S204; NO), the demo voice output unit 102a determines whether the demo voice playback icon P2 has been selected (step S205). If it determines that the demo voice playback icon P2 has not been selected (step S205; NO), the process returns to step S204. The voice output unit 102 repeats steps S204 and S205 until either the recorded voice is requested again or output of the demo voice is requested (that is, until the determination in step S204 or step S205 is YES).

ステップＳ２０１〜Ｓ２０４の処理で録音音声を聞いたユーザは、話速スライダーＳＳを移動させることにより、音声検索に用いる話速を、録音音声の話速に一致させるべく変更する。変更が完了すると（話速スライダーＳＳを話速シークバーＳＢ上の所望の位置に配置すると）、ユーザは、変更後の話速と録音音声の話速とを比較するために、デモ音声再生アイコンＰ２を選択することにより、デモ音声の音声出力を指示する。これに応答して、デモ音声出力部１０２ａは、デモ音声の音声出力が指示されたと判別し（ステップＳ２０５；ＹＥＳ）、音声出力処理を実行することによりデモ音声を音声出力する（ステップＳ２０６）。 The user who has listened to the recorded voice in steps S201 to S204 moves the speech speed slider SS to change the speech speed used for the voice search so that it matches the speech speed of the recorded voice. Once the change is complete (once the speech speed slider SS has been placed at the desired position on the speech speed seek bar SB), the user requests output of the demo voice by selecting the demo voice playback icon P2, in order to compare the changed speech speed with the speech speed of the recorded voice. In response, the demo voice output unit 102a determines that output of the demo voice has been requested (step S205; YES) and outputs the demo voice by executing the voice output process (step S206).

以下、ステップＳ２０６の音声出力指示の詳細について、図６のフローチャートを参照して説明する。 The details of the voice output process in step S206 are described below with reference to the flowchart of FIG. 6.

音声出力処理を開始すると、デモ音声出力部１０２ａは、まず、ステップＳ２０５でＹＥＳと判別された時点（デモ音声再生アイコンＰ２が選択された時点）における話速スライダーＳＳの位置に対応する話速パラメータを取得する（ステップＳ３０１）。 When the voice output process starts, the demo voice output unit 102a first acquires the speech speed parameter corresponding to the position of the speech speed slider SS at the time the determination in step S205 was YES (the time the demo voice playback icon P2 was selected) (step S301).

デモ音声出力部１０２ａは、図４のフローチャートのステップＳ１０３で取得したモノフォン音素列が含む各音素の韻律を、韻律ＤＢ３４から取得する（ステップＳ３０２）。 The demo voice output unit 102a acquires the prosody of each phoneme included in the monophone phoneme sequence acquired in step S103 of the flowchart of FIG. 4 from the prosody DB 34 (step S302).

デモ音声出力部１０２ａは、ステップＳ３０２で取得した韻律に対し、これらの韻律が含む音素長が、これらの音素長をステップＳ３０１で取得した話速パラメータで除算して得られる値になるような補正を施す（ステップＳ３０３）。 The demo voice output unit 102a corrects the prosody acquired in step S302 so that each phoneme length it contains becomes the value obtained by dividing that phoneme length by the speech speed parameter acquired in step S301 (step S303).
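The correction in step S303 can be illustrated with a minimal sketch. It assumes that a prosody entry carries a phoneme length in milliseconds and a pitch value; the `Prosody` class, its field names, and the `scale_prosody` function are hypothetical names introduced only for this illustration and do not appear in the embodiment. Only the division of each phoneme length by the speech speed parameter is taken from the description above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Prosody:
    """Hypothetical prosody record for one phoneme (field names are assumptions)."""
    phoneme: str
    duration_ms: float  # phoneme length
    pitch_hz: float     # left untouched by the speech speed correction


def scale_prosody(prosodies, speech_speed):
    """Divide every phoneme length by the speech speed parameter (cf. step S303)."""
    if speech_speed <= 0:
        raise ValueError("speech speed parameter must be positive")
    return [replace(p, duration_ms=p.duration_ms / speech_speed) for p in prosodies]


# Example: a speech speed parameter of 1.5 shortens every phoneme to 2/3 of its length.
base = [Prosody("k", 60.0, 120.0), Prosody("a", 90.0, 125.0)]
fast = scale_prosody(base, 1.5)
print([p.duration_ms for p in fast])
```

Under this reading, a parameter greater than 1 yields a faster demo voice and a parameter smaller than 1 yields a slower one.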

次に、デモ音声出力部１０２ａは、ステップＳ１０３で取得したモノフォン音素列が含む各音素の音声素片を、音声ＤＢ３５から取得する（ステップＳ３０４）。 Next, the demo speech output unit 102a acquires the speech segment of each phoneme included in the monophone phoneme sequence acquired in step S103 from the speech DB 35 (step S304).

デモ音声出力部１０２ａは、ステップＳ３０４で取得した音声素片の韻律を、ステップＳ３０３で補正を施した韻律に一致するように調整することにより、デモ音声を生成する（ステップＳ３０５）。 The demo voice output unit 102a generates a demo voice by adjusting the prosody of the speech unit acquired in step S304 to match the prosody corrected in step S303 (step S305).

デモ音声出力部１０２ａは、ステップＳ３０５の調整を施した音声素片で構成されるデモ音声を音声出力し（ステップＳ３０６）、音声出力処理を終了する。 The demo voice output unit 102a outputs the demo voice composed of the speech units adjusted in step S305 (step S306), and ends the voice output process.

図５に戻って、ステップＳ２０６でデモ音声を音声出力した後、デモ音声出力部１０２ａは、デモ音声再生アイコンＰ２が選択されたか否か判別する（ステップＳ２０７）。ユーザは、ステップＳ２０６でデモ音声を聞いた結果、音声検索に用いる話速を変更する必要があると判断する場合がある。この場合、ユーザは、話速スライダーＳＳを移動することにより話速の再調整を行う。話速の再調整が完了した後（話速シークバーＳＢ上の所望の位置に話速スライダーＳＳを配置した後）、ユーザは、再調整後の話速と録音音声の話速とを比較するため、デモ音声再生アイコンＰ２を選択することによりデモ音声の音声出力を指示する。これに応答し、デモ音声出力部１０２ａは、デモ音声の再度の音声出力が指示されたと判別し（ステップＳ２０７；ＹＥＳ）、再びステップＳ２０６の音声出力処理を実行する。 Returning to FIG. 5, after outputting the demo voice in step S206, the demo voice output unit 102a determines whether the demo voice playback icon P2 has been selected (step S207). Having heard the demo voice in step S206, the user may decide that the speech speed used for the voice search needs to be changed. In this case, the user readjusts the speech speed by moving the speech speed slider SS. After the readjustment is complete (after the speech speed slider SS has been placed at the desired position on the speech speed seek bar SB), the user requests output of the demo voice by selecting the demo voice playback icon P2, in order to compare the readjusted speech speed with the speech speed of the recorded voice. In response, the demo voice output unit 102a determines that the demo voice has been requested again (step S207; YES) and executes the voice output process of step S206 again.

また、ユーザは、デモ音声を聞き直す必要があると判断した場合、話速を変更することなく、デモ音声再生アイコンＰ２を選択することによってデモ音声の再度の音声出力を指示する場合がある。この場合、デモ音声出力部１０２ａは、デモ音声の再度の音声出力が指示されたと判別し（ステップＳ２０７；ＹＥＳ）、再びステップＳ２０６の音声出力処理を実行する。この場合、デモ音声出力部１０２ａは、図６のフローチャートのステップＳ３０１〜Ｓ３０５の処理は省略し、ステップＳ３０６の処理（デモ音声の音声出力）のみを実行する。ユーザが話速を変更していないため、直前の音声出力処理で取得したデモ音声をそのまま音声出力すればよいからである。 The user may also decide that the demo voice needs to be heard again and request it by selecting the demo voice playback icon P2 without changing the speech speed. In this case as well, the demo voice output unit 102a determines that the demo voice has been requested again (step S207; YES) and executes the voice output process of step S206 again. Here, however, the demo voice output unit 102a skips steps S301 to S305 in the flowchart of FIG. 6 and executes only step S306 (output of the demo voice). Because the user has not changed the speech speed, the demo voice generated in the immediately preceding voice output process can be output as is.
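The shortcut described above (skipping steps S301 to S305 when the speech speed is unchanged) amounts to caching the most recently generated demo voice. The following is a minimal sketch under that interpretation; the class `DemoVoiceCache`, the `synthesize` and `play` callbacks, and the string stand-ins for audio data are all hypothetical and are not part of the embodiment.

```python
class DemoVoiceCache:
    """Regenerates the demo voice only when the speech speed has changed."""

    def __init__(self, synthesize, play):
        self._synthesize = synthesize  # stands in for steps S301 to S305
        self._play = play              # stands in for step S306
        self._cached_speed = None
        self._cached_audio = None
        self.synth_calls = 0           # counter kept only for illustration

    def on_demo_icon(self, speech_speed):
        """Handle a selection of the demo voice playback icon P2."""
        if self._cached_audio is None or speech_speed != self._cached_speed:
            self._cached_audio = self._synthesize(speech_speed)
            self._cached_speed = speech_speed
            self.synth_calls += 1
        self._play(self._cached_audio)


# Example with stub callbacks: the second request reuses the cached demo voice.
played = []
player = DemoVoiceCache(synthesize=lambda speed: f"demo@{speed}", play=played.append)
player.on_demo_icon(1.0)
player.on_demo_icon(1.0)  # speech speed unchanged: output only
player.on_demo_icon(1.2)  # speech speed changed: generate again
print(player.synth_calls, played)
```

The cache holds a single entry because only the immediately preceding demo voice is reused in the description; whether a real device would keep more than one is not stated.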

デモ音声再生アイコンＰ２が選択されていないと判別すると（ステップＳ２０７；ＮＯ）、録音音声出力部１０２ｂが、録音音声再生アイコンＰ１、１パス目結果アイコンＩＡ１〜ＩＡ５または２パス目結果ＩＢ１〜ＩＢ５のうち何れかが選択されたか否か判別する（ステップＳ２０８）。ユーザは、デモ音声を聞いた後、改めて録音音声を聞き直してデモ音声と比較する必要があると判断した場合、これらのアイコンのうち何れかを選択することにより、録音音声の音声出力を指示する。これに応答し、録音音声出力部１０２ｂは、録音音声の再度の音声出力が指示されたと判別し（ステップＳ２０８；ＹＥＳ）、処理はステップＳ２０１へ戻って録音音声がもう一度音声出力される。 If it is determined that the demo voice playback icon P2 has not been selected (step S207; NO), the recorded voice output unit 102b determines whether the recorded voice playback icon P1, one of the first pass result icons IA1 to IA5, or one of the second pass result icons IB1 to IB5 has been selected (step S208). A user who, after hearing the demo voice, decides that the recorded voice needs to be heard again for comparison with the demo voice requests output of the recorded voice by selecting one of these icons. In response, the recorded voice output unit 102b determines that the recorded voice has been requested again (step S208; YES), the process returns to step S201, and the recorded voice is output once more.

録音音声の再度の音声出力が指示されていないと判別すると（ステップＳ２０８；ＮＯ）、話速指定部１０１が、検索画面Ｗｎ中の検索アイコンＲＳが選択されたか否か判別する（ステップＳ２０９）。検索アイコンＲＳが選択されていないと判別すると（ステップＳ２０９；ＮＯ）、処理はステップＳ２０７へ戻る。 If it is determined that the recorded voice has not been requested again (step S208; NO), the speech speed designation unit 101 determines whether the search icon RS in the search screen Wn has been selected (step S209). If it determines that the search icon RS has not been selected (step S209; NO), the process returns to step S207.

ユーザは、録音音声とデモ音声とを聞き比べた結果、音声検索に用いる話速をこれ以上調整する必要がないと判断すると、検索アイコンＲＳを選択することにより、その時点で指定している話速で音声検索を実行するように音声検索装置１００に指示する。話速指定部１０１は、これに応答し、検索アイコンＲＳが選択されたと判別して（ステップＳ２０９；ＹＥＳ）、その時点における話速スライダーＳＳの位置に対応する話速（話速パラメータ）をユーザが指定した話速として取得し（ステップＳ２１０）、話速指定処理を終了する。 When the user, having compared the recorded voice with the demo voice, decides that the speech speed used for the voice search needs no further adjustment, the user selects the search icon RS to instruct the voice search device 100 to execute the voice search at the speech speed designated at that time. In response, the speech speed designation unit 101 determines that the search icon RS has been selected (step S209; YES), acquires the speech speed (speech speed parameter) corresponding to the position of the speech speed slider SS at that time as the speech speed designated by the user (step S210), and ends the speech speed designation process.
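The flow of FIG. 5 (steps S201 to S210) can be read as a small state machine driven by icon selections. The sketch below is one possible reading, not the embodiment's implementation: the icon names P1, P2, RS, IA1 to IA5 and IB1 to IB5 come from the description, while the `State` enum, the `(icon, slider_speed)` event tuples, and the callback names are assumptions introduced for illustration.

```python
from enum import Enum, auto


class State(Enum):
    WAIT_RECORDED = auto()   # looping over S201/S202
    AFTER_RECORDED = auto()  # looping over S204/S205
    AFTER_DEMO = auto()      # looping over S207 to S209


# Icons that trigger output of the recorded voice (P1 and the result icons).
RECORDED_ICONS = {"P1"} | {f"IA{i}" for i in range(1, 6)} | {f"IB{i}" for i in range(1, 6)}


def designate_speech_speed(events, play_recorded, play_demo):
    """Consume (icon, slider_speed) events; return the designated speed or None."""
    state = State.WAIT_RECORDED
    for icon, slider_speed in events:
        if icon in RECORDED_ICONS:
            play_recorded(icon)                  # S203
            state = State.AFTER_RECORDED
        elif icon == "P2" and state is not State.WAIT_RECORDED:
            play_demo(slider_speed)              # S206
            state = State.AFTER_DEMO
        elif icon == "RS" and state is State.AFTER_DEMO:
            return slider_speed                  # S210
        # Any other selection is ignored in the current state.
    return None


# Example: the search icon is accepted only after a demo voice has been heard.
log = []
events = [("RS", 1.0), ("P2", 1.0), ("IA2", 1.0), ("P2", 1.2),
          ("P1", 1.2), ("RS", 1.2), ("P2", 1.1), ("RS", 1.1)]
result = designate_speech_speed(
    events,
    play_recorded=lambda icon: log.append(("recorded", icon)),
    play_demo=lambda speed: log.append(("demo", speed)),
)
print(result, log)
```

In this reading, the search icon RS takes effect only in the state reached after the demo voice has been output, which mirrors the position of step S209 in the flowchart.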

図４に戻って、ステップＳ１２３の話速指定処理が終了した後、処理はステップＳ１０５へ戻る。録音音声を聞いた結果、検索結果が正確である（検出漏れも誤検出もない）とユーザが判断した場合、これ以上検索を行う必要はない。この場合、ユーザは、入力部４０を介し、終了アイコンＥＳを選択することにより音声検索処理の終了を指示する。これに応答し、ＣＰＵ６０は、終了アイコンＥＳが選択されたと判別し（ステップＳ１２２；ＹＥＳ）、音声検索処理を終了する。 Returning to FIG. 4, after the speech speed designation processing in step S123 is completed, the processing returns to step S105. As a result of listening to the recorded voice, if the user determines that the search result is accurate (no omissions or false detections), no further search is necessary. In this case, the user instructs the end of the voice search process by selecting the end icon ES via the input unit 40. In response to this, the CPU 60 determines that the end icon ES has been selected (step S122; YES), and ends the voice search process.

以上説明したように、音声検索装置１００は、録音音声と、ユーザが指定した音声検索に用いる話速のデモ音声と、を音声出力する。ユーザは、音声出力された２つの音声を聞き比べることにより、音声検索に用いる話速として適切な話速（すなわち、録音音声の話速に概ね一致した話速）を直感的に指定できる。音声検索装置１００は、ユーザが直感的に指定した適切な話速を用いて音声検索を実行することにより、正確な検索結果を取得する。すなわち、音声検索装置１００は、ユーザが直感的に指定した適切な話速を用いて、正確な音声検索を実行することができる。 As described above, the voice search device 100 outputs both the recorded voice and a demo voice at the speech speed that the user has designated for the voice search. By comparing the two voices, the user can intuitively designate an appropriate speech speed for the voice search (that is, a speech speed that roughly matches that of the recorded voice). The voice search device 100 obtains an accurate search result by executing the voice search with the appropriate speech speed that the user has intuitively designated. In other words, the voice search device 100 can execute an accurate voice search using an appropriate speech speed intuitively designated by the user.

以上、本発明の実施形態について説明したが、これらの実施形態は一例であり、本発明の適用範囲はこれに限られない。すなわち、本発明の実施形態は種々の応用が可能であり、あらゆる実施の形態が本発明の範囲に含まれる。 As mentioned above, although embodiment of this invention was described, these embodiment is an example and the application range of this invention is not restricted to this. That is, the embodiments of the present invention can be applied in various ways, and all the embodiments are included in the scope of the present invention.

上述の実施形態において、音声検索装置１００は、音声検索の対象である録音音声を、予め外部から取得し、記憶していた。しかし、これは一例に過ぎず、音声検索装置１００は、任意の方法で録音音声を取得できる。例えば、音声検索装置１００は、入力部４０が備えるマイクを介して音声を録音することにより、録音音声を自ら生成してもよい。 In the above-described embodiment, the voice search device 100 previously acquires and stores the recorded voice that is the target of the voice search from the outside. However, this is only an example, and the voice search device 100 can acquire the recorded voice by an arbitrary method. For example, the voice search device 100 may generate a recorded voice by recording voice through a microphone included in the input unit 40.

また、上述の実施形態において、ユーザは、クエリを、入力部４０が備えるキーボードを介してテキスト入力した。しかし、これは一例に過ぎず、ユーザは、任意の方法でクエリを入力できる。例えば、ユーザは、入力部４０が備えるマイクを介して、クエリを音声入力することができる。 In the above-described embodiment, the user entered the query as text via a keyboard included in the input unit 40. However, this is only an example, and the user can enter the query by any method. For example, the user can enter the query by voice via a microphone included in the input unit 40.

また、上述の実施形態において、第２検索部１０４ｂは、トライフォンモデルに基づいて２パス目検索を行った。しかし、これは一例に過ぎず、第２検索部１０４ｂは、隣接する音素に依存する任意の音響モデルを用いて２パス目検索を行うことができる。例えば、第２検索部１０４ｂは、バイフォンモデルに基づいて２パス目検索を行うことができる。バイフォンモデルは、２音素（バイフォン）毎に生成された音響モデルであり、隣接する音素に依存する、すなわち前後の音素状態との状態遷移を考慮した音響モデルである。 In the above-described embodiment, the second search unit 104b performs the second pass search based on the triphone model. However, this is only an example, and the second search unit 104b can perform a second-pass search using an arbitrary acoustic model that depends on adjacent phonemes. For example, the second search unit 104b can perform a second pass search based on the biphone model. The biphone model is an acoustic model generated for every two phonemes (biphone), and is an acoustic model that depends on adjacent phonemes, that is, takes into account the state transition between the previous and subsequent phoneme states.

また、上述の実施形態において、第１検索部１０４ａは、各選択区間から選択した尤度取得区間のうち、第１尤度が最も大きい所定個数の尤度取得区間を、推定区間の候補として選択した。しかし、これは一例に過ぎず、第１検索部１０４ａは、第１尤度に基づき、任意の方法で推定区間の候補を選択できる。例えば、第１検索部１０４ａは、第１尤度が所定の閾値以上である全ての尤度取得区間を、推定区間の候補として選択できる。 In the above-described embodiment, the first search unit 104a selected, from among the likelihood acquisition sections chosen from the respective selection sections, a predetermined number of likelihood acquisition sections with the largest first likelihood as candidates for the estimation section. However, this is only an example, and the first search unit 104a can select candidates for the estimation section by any method based on the first likelihood. For example, the first search unit 104a can select all likelihood acquisition sections whose first likelihood is equal to or greater than a predetermined threshold as candidates for the estimation section.

また、上述の実施形態において、第２検索部１０４ｂは、推定区間の候補のうち、第２尤度が最も大きい所定個数の推定区間の候補を、推定区間として特定した。しかし、これは一例に過ぎず、第２検索部１０４ｂは、第２尤度に基づき、任意の方法で推定区間を特定できる。例えば、第２検索部１０４ｂは、第２尤度が所定の閾値以上である全ての推定区間の候補を、推定区間として特定できる。 In the above-described embodiment, the second search unit 104b identifies a predetermined number of estimation section candidates having the largest second likelihood as estimation sections among the estimation section candidates. However, this is only an example, and the second search unit 104b can specify the estimation interval by an arbitrary method based on the second likelihood. For example, the second search unit 104b can identify all estimation section candidates whose second likelihood is equal to or greater than a predetermined threshold as estimation sections.
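The two selection strategies mentioned for both passes (a predetermined number of sections with the largest likelihood, or all sections at or above a threshold) can be contrasted in a short sketch. The representation of a section as a `(start_frame, likelihood)` tuple, the function names, and the sample values are assumptions made for illustration only.

```python
import heapq


def select_top_n(scored_sections, n):
    """Keep the n sections with the largest likelihood (as in the embodiment)."""
    return heapq.nlargest(n, scored_sections, key=lambda item: item[1])


def select_above_threshold(scored_sections, threshold):
    """Keep every section whose likelihood is at or above the threshold (the variation)."""
    return [item for item in scored_sections if item[1] >= threshold]


# Hypothetical (start_frame, likelihood) pairs; the values are placeholders.
scored = [(0, -120.5), (40, -98.2), (80, -101.7), (120, -150.0)]
top_two = select_top_n(scored, 2)
above = select_above_threshold(scored, -125.0)
print(top_two, above)
```

The fixed-count strategy bounds the work passed to the next stage, whereas the threshold strategy lets the number of candidates vary with how well the recorded voice matches the query.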

また、上述の実施形態において、音声検索装置１００は、２パス目検索の結果がユーザへ提示された後に、話速の変更を受け付けた（図４のフローチャートにおいて、ステップＳ１２０の処理の後に、ステップＳ１２１の処理を実行した）。しかし、これは一例に過ぎず、音声検索装置１００は、任意のタイミングで話速の変更を受け付けることができる。 In the above-described embodiment, the voice search device 100 accepted a change of speech speed after the result of the second pass search had been presented to the user (in the flowchart of FIG. 4, step S121 was executed after step S120). However, this is only an example, and the voice search device 100 can accept a change of speech speed at any timing.

例えば、音声検索装置１００は、図４のフローチャートに示す音声検索処理を実行する前に、図５のフローチャートに示す話速指定処理を行ってもよい。この場合、音声検索装置１００は、図２（ａ）に示すように、ユーザが選択可能な検索結果が検索画面Ｗｎ中に提示されていない状態で話速の変更を受け付ける。ユーザは、録音音声スライダーＶＳによって指定した任意の再生位置の録音音声と、デモ音声と、を聞き比べて話速を指定する。 For example, the voice search device 100 may perform the speech speed designation process shown in the flowchart of FIG. 5 before executing the voice search process shown in the flowchart of FIG. In this case, as shown in FIG. 2A, the voice search device 100 accepts a change in speech speed in a state where search results that can be selected by the user are not presented in the search screen Wn. The user designates the speech speed by comparing the recorded voice at an arbitrary reproduction position designated by the recorded voice slider VS with the demo voice.

この態様によれば、音声検索を開始する前に適切な話速を設定できる。このため、不正確な音声検索を実行してしまうことを防止し、効率良く正確な音声検索を実行できる。 According to this aspect, an appropriate speech speed can be set before starting the voice search. For this reason, it is possible to prevent an inaccurate voice search from being performed, and to perform an accurate voice search efficiently.

なお、この態様において、録音音声の表す文字列をユーザが聞き取り、入力部４０を介してこの文字列を入力し、この文字列を表すデモ音声を音声出力部１０２に出力させてもよい。この態様によれば、録音音声と、録音音声と同じ文字列を表すデモ音声と、を聞き比べることができる。このため、ユーザは、録音音声と、録音音声とは異なる文字列を表すデモ音声と、を聞き比べる場合に比べ、より直感的で、より正確に話速を指定できる。 In this aspect, the user may listen to the character string represented by the recorded voice, input the character string via the input unit 40, and cause the voice output unit 102 to output the demo voice representing the character string. According to this aspect, it is possible to hear and compare the recorded voice and the demo voice that represents the same character string as the recorded voice. For this reason, the user can specify the speech speed more intuitively and more accurately than when comparing the recorded voice and the demo voice representing a character string different from the recorded voice.

また、音声検索装置１００は、図４のフローチャートのステップＳ１１５の直後（１パス目検索の結果をユーザへ提示した直後）に、話速の変更を受け付けてもよい。この場合、音声検索装置１００は、図２（ｂ）に示すように、検索画面Ｗｎ中において、１パス目検索の結果が選択可能に提示された状態で話速の変更を受け付ける。ユーザは、１パス目検索の結果に対応する録音音声と、クエリを表すデモ音声と、を聞き比べる。すなわち、この態様によれば、ユーザは、クエリを表すデモ音声と、クエリを表す音声を含んでいる可能性が高い区間の録音音声と、を聞き比べ（同じ文字列を表す音声同士を聞き比べ）、より直感的に、より正確に話速を指定できる。 The voice search device 100 may also accept a change of speech speed immediately after step S115 in the flowchart of FIG. 4 (immediately after the result of the first pass search is presented to the user). In this case, as shown in FIG. 2(b), the voice search device 100 accepts the change while the results of the first pass search are presented in a selectable form in the search screen Wn. The user compares the recorded voice corresponding to a first pass search result with the demo voice representing the query. That is, according to this aspect, the user compares the demo voice representing the query with the recorded voice of a section that is likely to contain a voice representing the query (in other words, compares voices representing the same character string), and can therefore designate the speech speed more intuitively and more accurately.

また、この態様によれば、２パス目検索の結果に対応する録音音声と、クエリを表すデモ音声と、を聞き比べる上述の実施形態に比べて、音声検索の早い段階で適切な話速を指定できる。このため、不適切な話速に基づく１パス目検索で選択された不適切な推定区間の候補を対象として２パス目検索を行うことを防止し、効率良く正確な音声検索を実行できる。また、この態様によれば、１パス目検索は２パス目検索よりも検索時間が短いため、上述の実施形態に比べて、話速の変更、話速の比較及び再検索のサイクルに要する時間が短く、話速を素早く調整できる。一方、１パス目検索よりも精度の高い２パス目検索の結果に対応する録音音声を用いて比較を行う上述の実施形態は、この態様に比べ、クエリを表している可能性のより高い録音音声を用いて比較ができるため、より効率良く適切な話速を設定できる。 In addition, according to this aspect, an appropriate speech speed can be designated at an earlier stage of the voice search than in the above-described embodiment, in which the recorded voice corresponding to the result of the second pass search is compared with the demo voice representing the query. This prevents the second pass search from being performed on inappropriate estimation section candidates selected by a first pass search based on an inappropriate speech speed, so that an accurate voice search can be executed efficiently. Furthermore, because the first pass search takes less time than the second pass search, the cycle of changing the speech speed, comparing speech speeds, and searching again takes less time than in the above-described embodiment, and the speech speed can be adjusted quickly. On the other hand, the above-described embodiment, which performs the comparison using the recorded voice corresponding to the result of the second pass search (which is more accurate than the first pass search), allows the comparison to use recorded voice that is more likely to represent the query than in this aspect, and can therefore set an appropriate speech speed more efficiently.

また、音声検索装置１００は、図４のフローチャートの任意の処理に並行して、話速の変更を受け付けることもできる。この態様によれば、音声検索中の任意のタイミングで話速を変更できるため、よりユーザの意図に沿った音声検索を実行できる。 In addition, the voice search device 100 can accept a change in speech speed in parallel with an arbitrary process in the flowchart of FIG. According to this aspect, since the speech speed can be changed at an arbitrary timing during the voice search, it is possible to execute the voice search more in line with the user's intention.

また、上述の実施形態において、音声検索装置１００は、音声検索処理の一環として、モノフォン音素列に基づいて出力確率を算出した。しかし、これは一例に過ぎず、モノフォン音素に基づいて出力確率を算出することなく、音声検索を実行することができる。 In the above-described embodiment, the speech search apparatus 100 calculates the output probability based on the monophone phoneme string as part of the speech search process. However, this is only an example, and a voice search can be executed without calculating an output probability based on a monophone phoneme.

以下、モノフォン音素に基づいて出力確率を算出することなく音声検索を実行する音声検索装置１００’の機能及び動作について説明する。 Hereinafter, the function and operation of the speech search apparatus 100 ′ that performs speech search without calculating the output probability based on monophone phonemes will be described.

音声検索装置１００’は、上述の実施形態に係る音声検索装置１００と概ね共通の構成を有するものの、一部の構成が異なる。具体的に、音声検索装置１００’は、図７に示すように、音声検索装置１００とは異なり、モノフォンモデルＤＢ３１を備えていない。その代わり、音声検索装置１００’は、音声検索装置１００とは異なり、出力確率ＤＢ３６を外部記憶部３０に備えている。 The voice search device 100 ′ has a configuration that is generally common to the voice search device 100 according to the above-described embodiment, but a part of the configuration is different. Specifically, as shown in FIG. 7, the voice search device 100 ′ does not include the monophone model DB 31 unlike the voice search device 100. Instead, the voice search device 100 ′ includes an output probability DB 36 in the external storage unit 30, unlike the voice search device 100.

出力確率ＤＢ３６は、録音音声の時間軸上に設定されたフレームごとに、モノフォンモデルの各音素と、外部装置によって予め算出された、各音素から録音音声の特徴量が出力される出力確率と、を対応付けて記憶している。すなわち、音声検索装置１００’は、外部装置によってモノフォンモデルに基づいて算出された各フレームの出力確率を、この外部装置から予め取得し、出力確率ＤＢ３６に格納している。なお、モノフォンモデルに基づく出力確率は、外部装置ではなく、音声検索装置１００’自身が音声検索処理に先立って算出し、出力確率ＤＢ３６に記憶しておいてもよい。 For each frame set on the time axis of the recorded voice, the output probability DB 36 stores each phoneme of the monophone model in association with the output probability, calculated in advance by an external device, that the feature quantity of the recorded voice is output from that phoneme. That is, the voice search device 100′ acquires in advance, from the external device, the output probability of each frame calculated by that device based on the monophone model, and stores it in the output probability DB 36. Alternatively, the output probabilities based on the monophone model may be calculated by the voice search device 100′ itself, rather than by an external device, prior to the voice search process and stored in the output probability DB 36.

図７に示す物理的・機能的構成を備える音声検索装置１００’は、図４のフローチャートに示す音声検索処理と概ね同様の処理を実行する。ただし、ステップＳ１０９の処理が異なる。 The voice search device 100 ′ having the physical and functional configuration shown in FIG. 7 executes processing that is substantially the same as the voice search processing shown in the flowchart of FIG. 4. However, the processing in step S109 is different.

具体的に、ステップＳ１０９において、音声検索装置１００’の第１検索部１０４ａは、ステップＳ１０３で取得したモノフォン音素列に含まれる各音素に対応付けて出力確率ＤＢ３６に記憶された出力確率を取得する。すなわち、第１検索部１０４ａは、出力確率ＤＢ３６がフレームごとに記憶しているモノフォンモデルの全音素の出力確率の中から、モノフォン音素列が含む音素の出力確率を、録音音声の全フレームについて取得する。 Specifically, in step S109, the first search unit 104a of the voice search device 100′ acquires the output probabilities stored in the output probability DB 36 in association with each phoneme included in the monophone phoneme sequence acquired in step S103. That is, from the output probabilities of all phonemes of the monophone model that the output probability DB 36 stores for each frame, the first search unit 104a acquires the output probabilities of the phonemes included in the monophone phoneme sequence, for all frames of the recorded voice.

以上説明したように、音声検索装置１００’は、録音音声全体の出力確率を予め尤度インデックスとして記憶しておき、音声検索時には、その尤度インデックスに基づいて１パス目検索を実行する。すなわち、音声検索装置１００’は、出力確率を算出することなく音声検索を実行できる。音声検索装置１００’は、モノフォンモデルに基づく出力確率の算出を行う音声検索装置１００に比べて、演算負荷を抑制でき、より高速に音声検索を実行できる。 As described above, the voice search device 100 ′ stores the output probability of the entire recorded voice as a likelihood index in advance, and performs a first pass search based on the likelihood index during the voice search. That is, the voice search device 100 ′ can perform a voice search without calculating the output probability. The voice search device 100 ′ can suppress the calculation load and can perform voice search at a higher speed than the voice search device 100 that calculates the output probability based on the monophone model.
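The lookup that replaces the calculation in step S109 can be sketched as follows. The layout of the output probability DB 36 is not specified beyond a per-frame association of phonemes with output probabilities, so the list-of-dictionaries structure, the function name `lookup_output_probabilities`, and the sample values are assumptions made for illustration.

```python
def lookup_output_probabilities(output_prob_db, query_phonemes):
    """Pick, for every frame, only the precomputed entries for the query's phonemes.

    output_prob_db: list indexed by frame; each element maps every phoneme of the
    monophone model to its precomputed output probability (hypothetical layout).
    """
    unique = list(dict.fromkeys(query_phonemes))  # drop repeats, keep first-seen order
    return [{ph: frame[ph] for ph in unique} for frame in output_prob_db]


# Two frames of placeholder values covering a three-phoneme model.
db = [
    {"a": -1.0, "k": -2.0, "s": -3.0},
    {"a": -1.5, "k": -0.5, "s": -4.0},
]
per_frame = lookup_output_probabilities(db, ["k", "a", "k"])
print(per_frame)
```

Because the values are read rather than computed, the cost at search time is a lookup per frame and per distinct query phoneme, which is consistent with the reduced calculation load described above.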

本発明に係る音声検索装置は、スマートフォンやコンピュータ、デジタルカメラ、ＰＤＡ（ＰｅｒｓｏｎａｌＤｉｇｉｔａｌＡｓｓｉｓｔａｎｃｅ）等の任意の電子機器によって実現できる。 The voice search device according to the present invention can be realized by any electronic device such as a smartphone, a computer, a digital camera, or a PDA (Personal Digital Assistant).

具体的には、スマートフォン、コンピュータ、デジタルカメラ、ＰＤＡ等の電子機器を本発明に係る音声検索装置として動作させるためのプログラムを、これらの電子機器が読み取り可能な記録媒体（例えば、メモリカードやＣＤ−ＲＯＭ（ＣｏｍｐａｃｔＤｉｓｃＲｅａｄ−ＯｎｌｙＭｅｍｏｒｙ）、ＤＶＤ−ＲＯＭ（ＤｉｇｉｔａｌＶｅｒｓａｔｉｌｅＤｉｓｃＲｅａｄ−ＯｎｌｙＭｅｍｏｒｙ）等）に格納して配布し、インストールすることにより本発明に係る音声検索装置を実現することができる。 Specifically, a program for causing an electronic device such as a smartphone, a computer, a digital camera, or a PDA to operate as the voice search device according to the present invention can be stored on a recording medium readable by these electronic devices (for example, a memory card, a CD-ROM (Compact Disc Read-Only Memory), or a DVD-ROM (Digital Versatile Disc Read-Only Memory)), distributed, and installed, thereby realizing the voice search device according to the present invention.

あるいは、上記プログラムを、インターネット等の通信ネットワーク上のサーバ装置が有する記憶装置（例えば、ディスク装置等）に格納しておき、スマートフォン、コンピュータ、デジタルカメラ、ＰＤＡ等がこのプログラムをダウンロードすることによって本発明に係る音声検索装置を実現してもよい。 Alternatively, the above program is stored in a storage device (for example, a disk device) of a server device on a communication network such as the Internet, and a smartphone, a computer, a digital camera, a PDA, or the like downloads the program. The voice search device according to the invention may be realized.

また、本発明に係る音声検索装置の機能を、オペレーティングシステム（ＯＳ：ＯｐｅｒａｔｉｎｇＳｙｓｔｅｍ）とアプリケーションプログラムとの協働又は分担により実現する場合には、アプリケーションプログラム部分のみを記録媒体や記憶装置に格納してもよい。 Further, when the function of the voice search device according to the present invention is realized by the cooperation or sharing of an operating system (OS) and an application program, only the application program portion is stored in a recording medium or a storage device. May be.

また、アプリケーションプログラムを搬送波に重畳し、通信ネットワークを介して配信してもよい。例えば、通信ネットワーク上の掲示板（ＢＢＳ：ＢｕｌｌｅｔｉｎＢｏａｒｄＳｙｓｔｅｍ）にアプリケーションプログラムを掲示し、ネットワークを介してアプリケーションプログラムを配信してもよい。そして、このアプリケーションプログラムをコンピュータにインストールして起動し、ＯＳの制御下で、他のアプリケーションプログラムと同様に実行することにより、本発明に係る音声検索装置を実現してもよい。 Further, the application program may be superimposed on a carrier wave and distributed via a communication network. For example, an application program may be posted on a bulletin board (BBS: Bulletin Board System) on a communication network, and the application program may be distributed via the network. Then, the voice search device according to the present invention may be realized by installing and starting this application program on a computer and executing it in the same manner as other application programs under the control of the OS.

以上、本発明の好ましい実施形態について説明したが、本発明は係る特定の実施形態に限定されるものではなく、本発明には、特許請求の範囲に記載された発明とその均等の範囲とが含まれる。以下に、本願出願の当初の特許請求の範囲に記載された発明を付記する。 The preferred embodiments of the present invention have been described above. However, the present invention is not limited to the specific embodiments, and the present invention includes the invention described in the claims and the equivalent scope thereof. included. Hereinafter, the invention described in the scope of claims of the present application will be appended.

（付記１）
クエリを取得するクエリ取得部と、
所定の可変範囲の話速からユーザにより指定された話速を受け付ける話速指定部と、
前記話速指定部が受け付けた話速で、前記取得したクエリを表すデモ音声を音声出力するデモ音声出力部と、
前記話速指定部が受け付けた話速に基づいて、前記クエリを表す音声を、音声検索の対象である録音音声から音声検索する検索部と、
を備えることを特徴とする音声検索装置。 (Appendix 1)
A voice search device comprising:
a query acquisition unit that acquires a query;
a speech speed designation unit that accepts a speech speed designated by a user from a predetermined variable range of speech speeds;
a demo voice output unit that outputs, at the speech speed accepted by the speech speed designation unit, a demo voice representing the acquired query; and
a search unit that searches the recorded voice targeted by the voice search for a voice representing the query, based on the speech speed accepted by the speech speed designation unit.

（付記２）
前記録音音声のうちユーザが指定した録音部分あるいは前記検索部が音声検索して得られた録音部分を音声出力する録音音声出力部を備え、
前記デモ音声出力部が音声出力するデモ音声と前記録音音声出力部が音声出力する録音音声とをユーザにとって比較可能にした
ことを特徴とする付記１に記載の音声検索装置。 (Appendix 2)
The voice search device according to Appendix 1, further comprising a recorded voice output unit that outputs a portion of the recorded voice designated by the user or a portion of the recorded voice obtained through the voice search by the search unit,
wherein the demo voice output by the demo voice output unit and the recorded voice output by the recorded voice output unit are made available for comparison by the user.

（付記３）
前記検索部は、
前記話速指定部を介してユーザが指定した話速と、隣接する音素に依存しない音響モデルと、に基づいて、前記クエリを表す音声を含んでいると推定される前記録音音声中の区間である推定区間の候補を前記録音音声から選択する第１検索部を含み、
前記録音音声出力部は、前記第１検索部が選択した前記推定区間の候補における前記録音音声を音声出力する
ことを特徴とする付記２に記載の音声検索装置。 (Appendix 3)
The voice search device according to Appendix 2, wherein
the search unit includes a first search unit that selects, from the recorded voice, candidates for an estimated section, which is a section of the recorded voice estimated to contain a voice representing the query, based on the speech speed designated by the user via the speech speed designation unit and an acoustic model that does not depend on adjacent phonemes, and
the recorded voice output unit outputs the recorded voice in the candidates for the estimated section selected by the first search unit.

（付記４）
前記検索部は、
隣接する音素に依存する音響モデルに基づいて、前記第１検索部が選択した前記推定区間の候補の中から、前記推定区間を特定する第２検索部を含み、
前記録音音声出力部は、前記第２検索部が特定した前記推定区間における前記録音音声を音声出力する
ことを特徴とする付記３に記載の音声検索装置。
(Appendix 4)
The voice search device according to appendix 3, wherein
the search unit includes a second search unit that identifies the estimated section from among the candidates for the estimated section selected by the first search unit, on the basis of an acoustic model that depends on adjacent phonemes, and
the recorded voice output unit outputs the recorded voice in the estimated section identified by the second search unit.
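
The two-stage structure of Appendices 3 and 4 can be sketched as a coarse pass that narrows the recording down to a few candidate sections, followed by a finer pass that ranks only those candidates. This is a minimal sketch under stated assumptions, not the claimed implementation: `coarse_score` and `fine_score` are hypothetical stand-ins for likelihoods computed with the phoneme-context-independent and phoneme-context-dependent acoustic models, and the sliding-window layout using a section length and shift length is an illustrative choice.

```python
from typing import Callable, List, Tuple

Section = Tuple[float, float]  # (start_seconds, end_seconds)


def sliding_sections(total: float, length: float, shift: float) -> List[Section]:
    """List every section of the given length, stepping by the shift length."""
    if length <= 0 or shift <= 0:
        raise ValueError("length and shift must be positive")
    sections = []
    i = 0
    # Small tolerance so a section ending exactly at the end is kept.
    while i * shift + length <= total + 1e-9:
        sections.append((i * shift, i * shift + length))
        i += 1
    return sections


def two_pass_search(
    total: float,
    length: float,
    shift: float,
    coarse_score: Callable[[Section], float],
    fine_score: Callable[[Section], float],
    keep: int,
) -> List[Section]:
    """Return the kept candidates, best first according to the fine score."""
    sections = sliding_sections(total, length, shift)
    # First pass: cheap score on every section, keep only the top candidates.
    candidates = sorted(sections, key=coarse_score, reverse=True)[:keep]
    # Second pass: costlier score on the candidates only.
    return sorted(candidates, key=fine_score, reverse=True)
```

In this sketch the section length would come from the query's estimated utterance length at the designated speech speed, so a badly chosen speed makes every section the wrong size before either scoring pass runs.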

（付記５）
クエリを取得するクエリ取得ステップと、
所定の可変範囲の話速からユーザにより指定された話速を受け付ける話速指定ステップと、
前記話速指定ステップで受け付けた話速で、前記取得したクエリを表すデモ音声を音声出力するデモ音声出力ステップと、
前記話速指定ステップで受け付けた話速に基づいて、前記クエリを表す音声を、音声検索の対象である録音音声から音声検索する検索ステップと、
を含むことを特徴とする音声検索方法。
(Appendix 5)
A voice search method comprising:
a query acquisition step of acquiring a query;
a speech speed designation step of accepting a speech speed designated by a user from within a predetermined variable range of speech speeds;
a demo voice output step of outputting, at the speech speed accepted in the speech speed designation step, a demo voice representing the acquired query; and
a search step of searching recorded voice, which is the target of the voice search, for voice representing the query on the basis of the speech speed accepted in the speech speed designation step.

（付記６）
コンピュータを、
クエリを取得するクエリ取得部、
所定の可変範囲の話速からユーザにより指定された話速を受け付ける話速指定部、
前記話速指定部が受け付けた話速で、前記取得したクエリを表すデモ音声を音声出力するデモ音声出力部、
前記話速指定部が受け付けた話速に基づいて、前記クエリを表す音声を、音声検索の対象である録音音声から音声検索する検索部、
として機能させることを特徴とするプログラム。
(Appendix 6)
A program causing a computer to function as:
a query acquisition unit that acquires a query;
a speech speed designation unit that accepts a speech speed designated by a user from within a predetermined variable range of speech speeds;
a demo voice output unit that outputs, at the speech speed accepted by the speech speed designation unit, a demo voice representing the acquired query; and
a search unit that searches recorded voice, which is the target of the voice search, for voice representing the query on the basis of the speech speed accepted by the speech speed designation unit.

１０…ＲＯＭ、２０…ＲＡＭ、３０…外部記憶部、３１…モノフォンモデルＤＢ、３２…トライフォンモデルＤＢ、３３…時間長ＤＢ、３４…韻律ＤＢ、３５…音声ＤＢ、３６…出力確率ＤＢ、４０…入力部、５０…出力部、６０…ＣＰＵ、１００、１００’…音声検索装置、１０１…話速指定部、１０２…音声出力部、１０２ａ…デモ音声出力部、１０２ｂ…録音音声出力部、１０３…クエリ取得部、１０４…検索部、１０４ａ…第１検索部、１０４ｂ…第２検索部、１０５…提示部、Ｗｎ…検索画面、ＳＢ…話速シークバー、ＳＳ…話速スライダー、ＲＳ…検索アイコン、Ｐ１…録音音声再生アイコン、Ｐ２…デモ音声再生アイコン、ＶＳ…録音音声スライダー、ＶＢ…録音音声シークバー、Ｓ…シフト長、ＷＶ…録音音声の波形、Ｌ…尤度取得区間の時間長、Ｔ…録音音声の波形の時間長、Ｆ…フレーム長、ＣＳ…話速変更アイコン、ＥＳ…終了アイコン DESCRIPTION OF SYMBOLS: 10 ... ROM, 20 ... RAM, 30 ... external storage unit, 31 ... monophone model DB, 32 ... triphone model DB, 33 ... time length DB, 34 ... prosody DB, 35 ... speech DB, 36 ... output probability DB, 40 ... input unit, 50 ... output unit, 60 ... CPU, 100, 100' ... voice search device, 101 ... speech speed designation unit, 102 ... voice output unit, 102a ... demo voice output unit, 102b ... recorded voice output unit, 103 ... query acquisition unit, 104 ... search unit, 104a ... first search unit, 104b ... second search unit, 105 ... presentation unit, Wn ... search screen, SB ... speech speed seek bar, SS ... speech speed slider, RS ... search icon, P1 ... recorded voice playback icon, P2 ... demo voice playback icon, VS ... recorded voice slider, VB ... recorded voice seek bar, S ... shift length, WV ... waveform of recorded voice, L ... time length of likelihood acquisition section, T ... time length of waveform of recorded voice, F ... frame length, CS ... speech speed change icon, ES ... end icon

Claims

1. A voice search device comprising:
a query acquisition unit that acquires a query;
a speech speed designation unit that accepts a speech speed designated by a user from within a predetermined variable range of speech speeds;
a demo voice output unit that outputs, at the speech speed accepted by the speech speed designation unit, a demo voice representing the acquired query; and
a search unit that searches recorded voice, which is the target of the voice search, for voice representing the query on the basis of the speech speed accepted by the speech speed designation unit.

2. The voice search device according to claim 1, further comprising a recorded voice output unit that outputs, as sound, a portion of the recorded voice designated by the user or a portion of the recorded voice obtained by the voice search performed by the search unit,
wherein the demo voice output by the demo voice output unit and the recorded voice output by the recorded voice output unit are made available for comparison by the user.

3. The voice search device according to claim 2, wherein
the search unit includes a first search unit that selects, from the recorded voice, candidates for an estimated section, which is a section of the recorded voice estimated to contain voice representing the query, on the basis of the speech speed designated by the user via the speech speed designation unit and an acoustic model that does not depend on adjacent phonemes, and
the recorded voice output unit outputs the recorded voice in the candidates for the estimated section selected by the first search unit.

4. The voice search device according to claim 3, wherein
the search unit includes a second search unit that identifies the estimated section from among the candidates for the estimated section selected by the first search unit, on the basis of an acoustic model that depends on adjacent phonemes, and
the recorded voice output unit outputs the recorded voice in the estimated section identified by the second search unit.

5. A voice search method comprising:
a query acquisition step of acquiring a query;
a speech speed designation step of accepting a speech speed designated by a user from within a predetermined variable range of speech speeds;
a demo voice output step of outputting, at the speech speed accepted in the speech speed designation step, a demo voice representing the acquired query; and
a search step of searching recorded voice, which is the target of the voice search, for voice representing the query on the basis of the speech speed accepted in the speech speed designation step.

6. A program causing a computer to function as:
a query acquisition unit that acquires a query;
a speech speed designation unit that accepts a speech speed designated by a user from within a predetermined variable range of speech speeds;
a demo voice output unit that outputs, at the speech speed accepted by the speech speed designation unit, a demo voice representing the acquired query; and
a search unit that searches recorded voice, which is the target of the voice search, for voice representing the query on the basis of the speech speed accepted by the speech speed designation unit.