JP2011090483A

JP2011090483A - Information processing apparatus and program

Info

Publication number: JP2011090483A
Application number: JP2009243144A
Authority: JP
Inventors: Ryosuke Hamazaki; 良介濱崎; Takashi Ota; 恭士大田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2009-10-22
Filing date: 2009-10-22
Publication date: 2011-05-06
Anticipated expiration: 2029-10-22
Also published as: JP5223843B2

Abstract

PROBLEM TO BE SOLVED: To provide an information processing apparatus capable of properly controlling the range to go back from the point, on the output time series of audio data, that corresponds to the time at which a voice was input.

SOLUTION: The information processing apparatus, which reproduces audio data, includes: an input unit that accepts a voice uttered by a user; a calculation unit that calculates the speech rate of the voice uttered by the user; and a control unit that determines, according to that speech rate, the range to go back from the point on the output time series of the audio data that corresponds to the input time of the voice.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to an information processing apparatus that reproduces audio data.

Because speech vanishes as soon as it is produced, a person can retain spoken information only within the limits of human memory. Consequently, when a listener misses something during playback of audio data and wants to hear it again, the listener usually rewinds the audio data and searches for the portion that contains that information.

In one known technique, an acquisition range over which audio data is retrieved retroactively is determined by a predetermined method in response to an instruction from an instruction unit that indicates a point in the audio data to be reconfirmed, and the words within the acquisition range are extracted and displayed (for example, Patent Document 1).

In another known technique, when a listen-again instruction is given, a break in the speech is detected by searching backward, and the audio is reproduced again from the detected point (for example, Patent Documents 2 and 3).

Patent Document 1: JP 2007-257341 A
Patent Document 2: Japanese Patent Laid-Open No. 62-40577
Patent Document 3: JP 2000-267687 A

However, in the technique described in Patent Document 1, the acquisition range of the audio data never changes from its initial setting. The designated acquisition range therefore does not necessarily contain the information to be reconfirmed. Conversely, when the initial acquisition range is too large, audio data is acquired from far before the position of the information to be reconfirmed, and the efficiency of the word extraction process decreases.

In the techniques described in Patent Documents 2 and 3, playback goes back to a phrase boundary (a break in the speech), but whether a specific word is included cannot be controlled, so the listener cannot always hear again the portion that contains the desired audio.

An object of one aspect of the present invention is to provide an information processing apparatus capable of appropriately controlling the range to go back (the look-back range) from the point, on the output time series of audio data, that corresponds to the time at which a voice was input.

One aspect of the present invention is an information processing apparatus. This information processing apparatus reproduces audio data and includes:
an input unit that receives a voice uttered by a user;
a calculation unit that calculates the speech rate of the voice; and
a control unit that determines, according to the speech rate, the range to go back from the point on the output time series of the audio data that corresponds to the input time of the voice.

Another aspect of the present invention is a method for determining the range to go back from the voice input time described above. Further aspects of the present invention can include a program that causes an information processing apparatus to function as a device for determining the range to go back from the voice input time, and a computer-readable recording medium on which the program is recorded.

According to the disclosed information processing apparatus, the range to go back from the point on the output time series of audio data that corresponds to the input time of a voice can be controlled appropriately.

FIG. 1 is a diagram showing a hardware configuration example of the information processing apparatus.
FIG. 2 is an explanatory diagram of the functions realized when the processor of the information processing apparatus executes the program for determining the look-back range.
FIG. 3 is a diagram showing a configuration example of the calculation unit.
FIGS. 4A to 4H are diagrams each showing an example of the relationship between the speech rate and the look-back range.
The remaining drawings, in order, are:
a diagram showing an example of the processing flow of the information processing apparatus;
an explanatory diagram of the functions realized when the processor of the information processing apparatus executes the audio data reproduction program;
a diagram showing an example in which the search unit searches for a keyword using a word spotting technique;
three diagrams each showing an example of the search-range resetting process;
four diagrams each showing an example of the processing performed when a plurality of keywords are detected in the partial audio data;
two diagrams each showing an example of the process of determining the head position used when the playback audio data is read from the storage unit; and
a diagram showing an example of the processing flow of the information processing apparatus.

Hereinafter, embodiments of the present invention are described with reference to the drawings. The configurations of the following embodiments are examples, and the present invention is not limited to them.

<Example of hardware configuration of information processing apparatus>
FIG. 1 is a diagram showing a hardware configuration example of the information processing apparatus. The information processing apparatus 1 includes a processor 101, a main storage device 102, a microphone 103, an output device 104, an auxiliary storage device 105, a network interface 107, and a tuner 108. These components are connected to one another by a bus 109.

The microphone 103 picks up the voice uttered by the user and outputs an electrical signal corresponding to the collected voice to the processor 101. Hereinafter, an electrical signal corresponding to sound is referred to as an "audio signal".

The network interface 107 is an interface that inputs information from and outputs information to a network. The network interface 107 connects to wired and wireless networks. The network interface 107 is, for example, a NIC (Network Interface Card) or a wireless LAN (Local Area Network) card. The network interface 107 receives audio signals, such as those of Internet radio and Internet television, from the connected network. The audio signals received by the network interface 107 are output to the processor 101.

The tuner 108 selects a channel by selecting a reception frequency and receives broadcast radio waves such as radio and television broadcasts. The tuner 108 outputs the audio signal of the received broadcast to the processor 101.

The main storage device 102 provides the processor 101 with a storage area into which programs stored in the auxiliary storage device 105 are loaded and with a work area, and is also used as a buffer. The main storage device 102 is, for example, a semiconductor memory such as a RAM (Random Access Memory).

The auxiliary storage device 105 stores various programs and the data that the processor 101 uses when executing each program. The auxiliary storage device 105 is, for example, an EPROM (Erasable Programmable ROM) or a hard disk drive (Hard Disk Drive). The auxiliary storage device 105 can include removable media, that is, portable recording media. A removable medium is, for example, a recording medium such as a USB (Universal Serial Bus) flash memory, a CD (Compact Disc), or a DVD. The auxiliary storage device 105 holds, for example, an operating system (OS), a program for determining the range to go back from the user's voice input time, an audio data reproduction program, and various other application programs.

The processor 101 is, for example, a CPU (Central Processing Unit) or a DSP (Digital Signal Processor). The processor 101 loads the OS and the various application programs held in the auxiliary storage device 105 into the main storage device 102 and executes them, thereby performing various kinds of audio-related processing.

For example, by executing a program, the processor 101 digitizes the audio signal input from the microphone 103 to obtain audio data. The audio data is stored in the main storage device 102 and/or the auxiliary storage device 105.

By executing a program, the processor 101 also generates audio data from the audio signals received by the network interface 107 or the tuner 108 and records the audio data in the main storage device 102 and/or the auxiliary storage device 105.

Audio data generated from audio signals that are received by the network interface 107 or the tuner 108 and reproduced in real time is accumulated for a predetermined time in a buffer in the main storage device 102. The processor 101 sequentially reads the audio data accumulated in the buffer, decodes it into an audio signal, and outputs the audio signal to the output device 104.

By executing the program for determining the range to go back from the user's voice input time, the processor 101 also determines the look-back range, that is, the range to go back from the point, on the output time series of the audio data input from the network interface 107, the tuner 108, or the like, at which the user's voice was input. Details of this processing are described later.

By executing the audio data reproduction program, the processor 101 also performs the following processing: when a voice is input during playback of audio data that is sequentially input from the network interface 107, the tuner 108, or the like, playback resumes from a point a predetermined range before the voice input time. Details of this processing are described later.

The output device 104 outputs the results of processing by the processor 101. The output device 104 includes a display, a speaker, a speaker interface circuit, and the like.

The information processing apparatus 1 is, for example, a general-purpose computer such as a personal computer. Alternatively, the information processing apparatus 1 is a device that reproduces audio data, or video containing audio data, and provides audio information, such as a mobile phone, a car navigation system, a receiver for the one-segment partial reception service (commonly called "One-Seg broadcasting"), or a radio. It may also be an IC chip or the like built into such a device.

<First Embodiment>
The information processing apparatus of the first embodiment is an apparatus that, when the user inputs a voice during playback of audio data, uses that voice input as a trigger to reproduce the audio from a point a predetermined range before the user's voice input time. For example, when the user misses part of the audio being played, the user utters some voice, and the information processing apparatus then reproduces the audio data from a point a predetermined range before the time at which the user's voice was input.

FIG. 2 is an explanatory diagram of the functions realized when the processor 101 of the information processing apparatus 1 executes the program for determining the range to go back from the user's voice input time. By having the processor 101 execute this determination program, the information processing apparatus 1 realizes an input unit 11, a calculation unit 12, a control unit 13, and an extraction unit 14. In other words, by executing the determination program, the information processing apparatus 1 functions as an apparatus that includes the input unit 11, the calculation unit 12, the control unit 13, and the extraction unit 14.

The input unit 11 includes the microphone 103 and receives as input the audio signal of the voice that the user utters into the microphone 103. The input unit 11 digitizes the audio signal, converting it into audio data. The input unit 11 attaches to the audio data, as a time stamp, the time at which the audio data was input. The time stamp is, for example, the elapsed time since startup of the information processing apparatus 1 based on a clock (not shown) provided in the information processing apparatus 1, or the time of day managed by the information processing apparatus 1. The input unit 11 outputs the audio data of the voice uttered by the user to the calculation unit 12 and the extraction unit 14. Hereinafter, the audio data of the voice uttered by the user is referred to as input voice data.

The calculation unit 12 receives the input voice data from the input unit 11. The calculation unit 12 calculates the user's speech rate from the input voice data.

FIG. 3 is a diagram showing a configuration example of the calculation unit 12. The calculation unit 12 includes a section detection unit 121, a speech recognition unit 122, a mora count calculation unit 123, and a speech rate calculation unit 124.

The section detection unit 121 receives the input voice data from the input unit 11. Using the time stamps attached to the input voice data, the section detection unit 121 measures the time from the start to the end of the input voice data. The section detection unit 121 outputs the measured time to the speech rate calculation unit 124 as the voice section length of the input voice data, and outputs the input voice data to the speech recognition unit 122.

The speech recognition unit 122 receives the input voice data from the section detection unit 121 and performs speech recognition on it. For example, the speech recognition unit 122 obtains, as the recognition result, that the content of the input voice data is "neko" (cat). Any existing speech recognition method can be used for the speech recognition performed by the speech recognition unit 122. The speech recognition unit 122 outputs the recognition result to the mora count calculation unit 123.

The mora count calculation unit 123 receives the speech recognition result for the input voice data from the speech recognition unit 122 and calculates the number of morae from that result.

In phonology, a mora is a segmental unit of sound that has a fixed length in time. In Japanese linguistics, a mora is also commonly called a "haku" (beat). For example, the word "neko" (cat) has 2 morae: "ne" and "ko". The word "kappa" (raincoat) has 3 morae: "ka", the geminate consonant written with a small "tsu", and "pa". The word "chokoreeto" (chocolate) has 5 morae: "cho", "ko", "re", the long-vowel mark, and "to". The mora count calculation unit 123 outputs the number of morae in the input voice to the speech rate calculation unit 124.
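The counting rule in the examples above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of the mora count calculation unit 123: it assumes the recognition result is available as a string written entirely in kana, and the names `SMALL_KANA` and `count_morae` are hypothetical. Small kana that combine with the preceding character are skipped, while the small "tsu", the long-vowel mark, and the moraic nasal each count as one mora.

```python
import unicodedata

# Small kana that attach to the preceding kana and do not form a mora on their own.
# The small tsu, the long-vowel mark and the moraic nasal are deliberately NOT in
# this set, because each of them counts as one mora.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def count_morae(kana: str) -> int:
    """Count morae in a string assumed to be written entirely in kana."""
    # Normalize so that voiced kana such as "ぱ" are single code points.
    normalized = unicodedata.normalize("NFC", kana)
    return sum(1 for ch in normalized if ch not in SMALL_KANA)


print(count_morae("ねこ"))          # neko
print(count_morae("かっぱ"))        # kappa
print(count_morae("チョコレート"))  # chokoreeto
```

For the three words discussed above, the function returns 2, 3, and 5 respectively.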

The speech rate calculation unit 124 receives the voice section length and the number of morae of the input voice data. From these two values it calculates the user's speech rate at the time of the voice input, for example as speech rate = number of morae / voice section length. For example, when the voice section length of the input voice data is 0.5 seconds and the number of morae is 4, the speech rate is 4 morae / 0.5 seconds = 8 morae per second. The speech rate calculation unit 124 outputs the calculated speech rate to the control unit 13.
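A minimal sketch of this calculation follows. The function name `speech_rate` and its parameters are assumptions made for illustration; the voice section length is taken as the difference between the time stamps at the start and end of the input voice data, as the section detection unit 121 is described as measuring it.

```python
def speech_rate(mora_count: int, start_s: float, end_s: float) -> float:
    """Return the speech rate in morae per second.

    start_s and end_s are the time stamps (in seconds) of the start and end
    of the input voice data; their difference is the voice section length.
    """
    section_length_s = end_s - start_s
    if section_length_s <= 0:
        raise ValueError("voice section length must be positive")
    return mora_count / section_length_s


# The worked example from the text: 4 morae spoken over 0.5 seconds.
rate = speech_rate(mora_count=4, start_s=10.0, end_s=10.5)
print(rate)  # 8.0
```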

The control unit 13 receives the speech rate as input. Based on the speech rate, the control unit 13 determines the look-back range, that is, the range to go back from the point, on the output time series of the audio data (hereinafter also called playback audio data or audio data for playback), that corresponds to the input time of the user's voice (the input voice).

FIGS. 4A, 4B, 4C, 4D, 4E, 4F, 4G, and 4H are diagrams showing examples of the relationship between the speech rate and the range to go back from the user's voice input time. One or more of the correspondence tables between speech rate and look-back range shown in FIGS. 4A to 4H are stored in the auxiliary storage device 105. When the speech rate of the input voice data is input, the control unit 13 determines the look-back range based on, for example, at least one of the correspondence tables shown in FIGS. 4A to 4H. The look-back range may be defined as a time length, a number of syllables, a number of morae, or the like. The first embodiment describes the case in which the control unit 13 defines the look-back range as a time length.

People tend to speak quickly when flustered; that is, their speech rate increases. When a user misses part of the audio being played and quickly utters a voice about the missed information, the user is likely to be flustered or impatient. Therefore, when the speech rate of the user's voice data (input voice data) input to the information processing apparatus 1 is high, the information the user wants to hear again is likely to lie within a short range before the user's voice input time. Conversely, people speak slowly when they are hesitating or thinking; that is, their speech rate decreases. When the user speaks slowly, the information the user wants to hear again is likely to lie farther back from the user's voice input time.

In view of this tendency, the example shown in FIG. 4A is a correspondence table in which the look-back range becomes shorter as the speech rate becomes higher. FIG. 4A shows a case in which the relationship between the speech rate and the look-back range is linear.

For example, human speech rate is about 15 morae per second when speaking very fast and about 8 morae per second in normal speech. Because the number of morae a person can utter in one second is limited, a minimum value and a maximum value can be set for the speech rate. In FIG. 4A, for example, the minimum speech rate is set to 0 morae per second and the maximum to 15 morae per second. FIG. 4A is also set so that the look-back range takes its maximum value at the minimum speech rate and its minimum value at the maximum speech rate: the maximum look-back range, corresponding to the minimum speech rate of 0 morae per second, is set to 15 seconds, and the minimum look-back range, corresponding to the maximum speech rate of 15 morae per second, is set to 5 seconds.
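The linear correspondence of FIG. 4A can be sketched as follows, using the endpoint values given above (0 to 15 morae per second mapped to 15 down to 5 seconds). The function name `lookback_seconds_linear` and the clamping of out-of-range speech rates are assumptions made for illustration; the source describes a stored correspondence table rather than a formula.

```python
def lookback_seconds_linear(
    rate: float,
    rate_min: float = 0.0,    # minimum speech rate (morae per second)
    rate_max: float = 15.0,   # maximum speech rate (morae per second)
    range_max: float = 15.0,  # look-back range at rate_min (seconds)
    range_min: float = 5.0,   # look-back range at rate_max (seconds)
) -> float:
    """Map a speech rate to a look-back range that shrinks linearly as the rate grows."""
    # Assumption: rates outside [rate_min, rate_max] are clamped to the nearest bound.
    clamped = min(max(rate, rate_min), rate_max)
    fraction = (clamped - rate_min) / (rate_max - rate_min)
    return range_max - fraction * (range_max - range_min)


print(lookback_seconds_linear(0.0))   # 15.0
print(lookback_seconds_linear(7.5))   # 10.0
print(lookback_seconds_linear(15.0))  # 5.0
```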

As in FIG. 4A, the correspondence tables shown in FIGS. 4B, 4C, and 4D associate the speech rate with the look-back range so that the look-back range becomes shorter as the speech rate becomes higher. Also as in FIG. 4A, a minimum and a maximum speech rate are set in FIGS. 4B, 4C, and 4D; the look-back range takes its maximum value at the minimum speech rate and its minimum value at the maximum speech rate.

FIG. 4B shows a case in which the relationship between the speech rate and the look-back range is stepwise. FIG. 4C shows a case in which the relationship is nonlinear. People find it easier to control their speech rate in the low-rate region; that is, speaking slowly is easier to control than speaking fast. The example in FIG. 4C realizes the relationship between speech rate and look-back range that best matches this tendency. FIG. 4D also shows a nonlinear relationship, but its curvature is the reverse of that in FIG. 4C. The example in FIG. 4D realizes control in which the look-back range can be set over a wide span even for fast speech, which is hard for people to control, that is, even in the high-rate region.
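As a rough illustration of a stepwise relationship like that of FIG. 4B, the sketch below looks up the look-back range from a small table. The source gives no numbers for FIG. 4B, so the thresholds (5 and 10 morae per second), the three range values, and the names `STEP_UPPER_BOUNDS`, `STEP_RANGES_S`, and `lookback_seconds_stepped` are all hypothetical.

```python
import bisect

# Hypothetical table: speech-rate thresholds (morae per second) and the
# look-back range (seconds) assigned to each band. Higher rate, shorter range.
STEP_UPPER_BOUNDS = [5.0, 10.0]
STEP_RANGES_S = [15.0, 10.0, 5.0]


def lookback_seconds_stepped(rate: float) -> float:
    """Return the look-back range of the band that contains the given speech rate."""
    # bisect_right yields 0 for rate < 5, 1 for 5 <= rate < 10, and 2 for rate >= 10.
    band = bisect.bisect_right(STEP_UPPER_BOUNDS, rate)
    return STEP_RANGES_S[band]


print(lookback_seconds_stepped(3.0))   # 15.0
print(lookback_seconds_stepped(8.0))   # 10.0
print(lookback_seconds_stepped(12.0))  # 5.0
```

A table lookup of this kind stays close to the source's description of a stored correspondence table, and other shapes can be approximated by adding more bands.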

FIGS. 4E to 4H show examples of correspondence tables in which the look-back range becomes longer as the speech rate becomes higher. As in the examples of FIGS. 4A to 4D, a minimum value and a maximum value can be set for the speech rate in the examples of FIGS. 4E to 4H. In these tables, the minimum look-back range is set for the minimum speech rate and the maximum look-back range for the maximum speech rate. That is, in FIGS. 4E to 4H the look-back range takes its minimum value at the minimum speech rate and its maximum value at the maximum speech rate.

FIG. 4E shows a case in which the relationship between the speech rate and the look-back range is linear. FIG. 4F shows a case in which the relationship is stepwise. FIG. 4G shows a case in which the relationship is nonlinear. People find it easier to control their speech rate in the low-rate region; that is, speaking slowly is easier to control than speaking fast. The example in FIG. 4G realizes the relationship between speech rate and look-back range that best matches this tendency. FIG. 4H also shows a nonlinear relationship, but its curvature is the reverse of that in FIG. 4G. The example in FIG. 4H realizes control in which the look-back range can be set over a wide span even for fast speech, which is hard for people to control, that is, even in the high-rate region.
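For the increasing linear case of FIG. 4E, a sketch under stated assumptions might look like the following. The source does not give endpoint values for FIGS. 4E to 4H, so the defaults here (0 to 15 morae per second mapped to 5 up to 15 seconds) are hypothetical, as is the function name `lookback_seconds_increasing`.

```python
def lookback_seconds_increasing(
    rate: float,
    rate_min: float = 0.0,    # hypothetical minimum speech rate (morae per second)
    rate_max: float = 15.0,   # hypothetical maximum speech rate (morae per second)
    range_min: float = 5.0,   # hypothetical look-back range at rate_min (seconds)
    range_max: float = 15.0,  # hypothetical look-back range at rate_max (seconds)
) -> float:
    """Map a speech rate to a look-back range that grows linearly with the rate."""
    # Assumption: rates outside [rate_min, rate_max] are clamped to the nearest bound.
    clamped = min(max(rate, rate_min), rate_max)
    fraction = (clamped - rate_min) / (rate_max - rate_min)
    return range_min + fraction * (range_max - range_min)


print(lookback_seconds_increasing(0.0))   # 5.0
print(lookback_seconds_increasing(7.5))   # 10.0
print(lookback_seconds_increasing(15.0))  # 15.0
```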

制御部１３は、例えば、図４Ａから図４Ｈの発話速度と遡る範囲との対応表を用いて、遡る範囲を決定し、抽出部１４に出力する。 For example, the control unit 13 determines the range to be traced using the correspondence table between the speech rate and the range to be traced in FIGS. 4A to 4H and outputs the range to the extraction unit 14.
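The table lookup performed by the control unit 13 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the clamping behavior outside the minimum and maximum speech rates, and all numeric values are assumptions. The endpoint values are chosen only so that a rate of 6 mora/second yields the 9 seconds used in the processing-flow example later in the description.

```python
from bisect import bisect_right


def linear_range(rate, rate_min, rate_max, range_at_min, range_at_max):
    """Linear speech-rate -> retroactive-range mapping (seconds).

    The rate is clamped to [rate_min, rate_max] (an assumption). Whether the
    range widens or narrows with the rate depends only on the two endpoint
    values passed in.
    """
    rate = min(max(rate, rate_min), rate_max)
    frac = (rate - rate_min) / (rate_max - rate_min)
    return range_at_min + frac * (range_at_max - range_at_min)


def stepped_range(rate, thresholds, ranges):
    """Stepped mapping: ranges[i] applies to rates below thresholds[i];
    the last entry applies to rates at or above the final threshold."""
    return ranges[bisect_right(thresholds, rate)]


# Hypothetical table: 15 s at 2 mora/s, narrowing to 3 s at 10 mora/s.
print(linear_range(6, 2, 10, 15, 3))           # 9.0
print(stepped_range(6, [4, 8], [15, 9, 3]))    # 9
```

Swapping the two endpoint values gives a table in which the range widens as the speech rate increases, as in FIGS. 4E to 4H.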

抽出部１４は、入力部１１から入力音声データと、制御部１３から遡る範囲とを入力として得る。抽出部１４は、入力部１１に利用者の音声が入力された時点から遡る範囲に相当する時間を遡った時点から、主記憶装置１０２内にバッファされている再生音声データを抽出する。 The extraction unit 14 obtains, as inputs, the input voice data from the input unit 11 and the retroactive range from the control unit 13. The extraction unit 14 extracts the reproduced audio data buffered in the main storage device 102, starting from the time point reached by going back, from the time the user's voice was input to the input unit 11, by the time corresponding to the retroactive range.

再生音声データは、ネットワークインタフェース１０７（図１）、又は、チューナー１０８（図１）から入力された音声信号をディジタル信号の場合はそのまま、あるいはアナログ信号の場合はディジタル変換したものであって、主記憶装置１０２（図１）にバッファされている音声データである。主記憶装置１０２内のバッファには、現時点から、利用者の音声が入力された時点から遡る範囲よりも充分長い所定時間遡った時点までの、出力装置１０４から出力された音声データと同じ内容の音声データが蓄積されている。再生音声データには、再生音声データと同じ音声データが情報処理装置１から出力される時間がタイムスタンプとして付与されている。例えば、再生音声データが主記憶装置１０２のバッファに格納される時間を再生音声データと同じ内容の音声データが情報処理装置１から出力される時間とみなしてもよい。また、再生音声データに付与されるタイムスタンプは、情報処理装置１が備えるクロック（図示せず）に基づいた情報処理装置１の起動からの経過時間，時刻，再生音声データの先頭を始点（０：ゼロ）とした場合の出力時間の何れであってもよい。第１実施形態では、再生音声データに付与されたタイムスタンプが示す時系列は、再生音声データの出力時系列と呼ばれる。 The reproduced audio data is the audio signal input from the network interface 107 (FIG. 1) or the tuner 108 (FIG. 1), used as is when it is a digital signal or converted to digital form when it is an analog signal, and is the audio data buffered in the main storage device 102 (FIG. 1). The buffer in the main storage device 102 accumulates audio data with the same content as the audio data output from the output device 104, covering the period from the present back to a point a predetermined time earlier, where the predetermined time is sufficiently longer than the retroactive range applied from the time the user's voice is input. The reproduced audio data is given, as a time stamp, the time at which the same audio data is output from the information processing apparatus 1. For example, the time at which the reproduced audio data is stored in the buffer of the main storage device 102 may be regarded as the time at which audio data with the same content is output from the information processing apparatus 1. The time stamp given to the reproduced audio data may be any of the following: the elapsed time since start-up of the information processing apparatus 1 based on a clock (not shown) provided in the apparatus, the time of day, or the output time measured with the beginning of the reproduced audio data as the origin (0: zero). In the first embodiment, the time series indicated by the time stamps given to the reproduced audio data is called the output time series of the reproduced audio data.

入力音声データにも情報処理装置１に入力された時点のタイムスタンプが付与されているので、抽出部１４は、入力音声データの入力時点に対応する、再生音声データの出力時系列上での時点を求めることができる。例えば、再生音声データと入力音声データともに、情報処理装置１のクロックによる起動からの経過時間がタイムスタンプとして付与されている場合には、入力音声データのタイムスタンプが示す時間が、入力音声データの入力時点に対応する、再生音声データの出力時系列上での時点となる。また、再生音声データのタイムスタンプと、入力音声データのタイムスタンプとが異なる時系列である場合には、抽出部１４は、入力音声データのタイムスタンプが示す時間を、再生音声データの出力時系列上の時間に変換して、入力音声データの入力時点に対応する再生音声データの出力時系列上での時点を得る。 Since the input voice data is also given a time stamp indicating when it was input to the information processing apparatus 1, the extraction unit 14 can obtain the time point on the output time series of the reproduced audio data that corresponds to the input time point of the input voice data. For example, when both the reproduced audio data and the input voice data are time-stamped with the elapsed time since start-up according to the clock of the information processing apparatus 1, the time indicated by the time stamp of the input voice data is itself the time point on the output time series of the reproduced audio data that corresponds to the input time point of the input voice data. When the time stamps of the reproduced audio data and those of the input voice data belong to different time series, the extraction unit 14 converts the time indicated by the time stamp of the input voice data into a time on the output time series of the reproduced audio data, thereby obtaining the time point on the output time series that corresponds to the input time point of the input voice data.

抽出部１４は、求められた入力音声データの入力時点に対応する、再生音声データの出力時系列上での時点から、制御部１３から入力された遡る範囲に相当する時間を遡った時点を先頭として順次再生音声データを抽出する。抽出部１４は、抽出された再生音声データを出力する。なお、第１実施形態では、入力部１１から入力された入力音声データは、用いられないが、入力音声データを用いてもよい。入力音声データを用いて処理を行う情報処理装置の実施形態については、後述される。 The extraction unit 14 sequentially extracts the reproduced audio data, starting from the time point reached by going back, from the obtained time point on the output time series of the reproduced audio data that corresponds to the input time point of the input voice data, by the time corresponding to the retroactive range input from the control unit 13. The extraction unit 14 outputs the extracted reproduced audio data. In the first embodiment, the input voice data input from the input unit 11 is not used, although it may be used. An embodiment of an information processing apparatus that performs processing using the input voice data is described later.
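The buffering and extraction described above can be sketched as follows. This is a minimal sketch under stated assumptions: the class and field names, the per-chunk time stamps in seconds, and the `clock_offset` parameter (a stand-in for converting the input time stamp onto the output time series when the two differ) are hypothetical and do not appear in the source.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Chunk:
    timestamp: float  # output time on the playback time series, in seconds
    data: bytes       # encoded audio for this chunk


class PlaybackBuffer:
    """Keeps the most recent `keep_seconds` of reproduced audio data."""

    def __init__(self, keep_seconds: float):
        self.keep_seconds = keep_seconds
        self.chunks = deque()

    def append(self, chunk: Chunk) -> None:
        self.chunks.append(chunk)
        # Drop chunks older than the retention period.
        while chunk.timestamp - self.chunks[0].timestamp > self.keep_seconds:
            self.chunks.popleft()

    def extract_from(self, input_time: float, lookback: float,
                     clock_offset: float = 0.0) -> list:
        """Return chunks starting `lookback` seconds before the input time.

        `clock_offset` maps the input time stamp onto the output time series;
        it is 0 when both use the same clock.
        """
        start = input_time + clock_offset - lookback
        return [c for c in self.chunks if c.timestamp >= start]


buf = PlaybackBuffer(keep_seconds=60.0)
for t in range(20):
    buf.append(Chunk(float(t), b""))
replay = buf.extract_from(input_time=15.0, lookback=9.0)
print(replay[0].timestamp)  # 6.0
```

The retention period is set well above any retroactive range the table can return, mirroring the "sufficiently longer" condition in the description.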

抽出部１４から出力された再生音声データは、プロセッサ１０１によって、復号処理により音声信号に復号され、出力装置１０４から再生出力される。 The reproduced audio data output from the extraction unit 14 is decoded into an audio signal by the decoding process by the processor 101, and is reproduced and output from the output device 104.

図５は、情報処理装置１の処理フローの例を示す図である。図５に示される例は、情報処理装置１がインターネットラジオなどの再生音声データをリアルタイム処理によって再生中に、利用者の音声がマイクロフォン１０３から入力された場合を示す。 FIG. 5 is a diagram illustrating an example of a processing flow of the information processing apparatus 1. The example shown in FIG. 5 shows a case where the user's voice is input from the microphone 103 while the information processing apparatus 1 is reproducing audio data such as Internet radio by real-time processing.

例えば、情報処理装置１から出力される音声を聴取する利用者は、音声の聞き逃しなどによって、出力された音声をすぐに聞き直したい場合などに、再度聴取を希望する情報に関連する文言を発する。例えば、情報処理装置１から「本日の電機関連株価終値は、Ａ社Ｘ円、Ｂ社Ｙ円・・・」という音声情報が出力されている場合に、Ａ社の株価の聴き直しを希望する利用者が「Ａ社」と発声する。 For example, a user listening to the audio output from the information processing apparatus 1 utters words related to the information he or she wishes to hear again, for instance when wanting to re-listen immediately to the output audio after missing part of it. For example, when the information processing apparatus 1 is outputting the audio information “Today's closing prices for electronics-related stocks are Company A, X yen; Company B, Y yen ...”, a user who wishes to hear the stock price of Company A again utters “Company A”.

利用者の音声信号「Ａ社」は、マイクロフォン１０３を通じて入力部１１に入力される。入力部１１は、利用者の音声信号「Ａ社」が入力されると、入力音声を検出する（ＯＰ１）。入力部１１は、利用者の音声信号「Ａ社」を入力音声データ「Ａ社」に変換して、抽出部１４と、区間検出部１２１とに出力する。 The user's voice signal “Company A” is input to the input unit 11 through the microphone 103. When the user's voice signal “Company A” is input, the input unit 11 detects the input voice (OP1). The input unit 11 converts the user's voice signal “Company A” into input voice data “Company A”, and outputs it to the extraction unit 14 and the section detection unit 121.

区間検出部１２１は、入力音声データ「Ａ社」が入力されると、入力音声「Ａ社」の音声区間長を測定する（ＯＰ２）。例えば、入力音声データ「Ａ社」の音声区間長が０．５秒であったとする。区間検出部１２１は、入力音声データ「Ａ社」の音声区間長である０．５秒を発話速度算出部１２４に出力する。区間検出部１２１は、入力音声データ「Ａ社」を音声認識部１２２に出力する。 When the input voice data “Company A” is input, the section detection unit 121 measures the voice section length of the input voice “Company A” (OP2). For example, it is assumed that the voice section length of the input voice data “Company A” is 0.5 seconds. The section detection unit 121 outputs 0.5 seconds which is the voice section length of the input voice data “Company A” to the utterance speed calculation unit 124. The section detection unit 121 outputs the input voice data “Company A” to the voice recognition unit 122.

音声認識部１２２は、入力音声データ「Ａ社」が入力されると、音声認識処理を実行する（ＯＰ３）。音声認識部１２２の音声認識処理により、入力音声データの文言が「Ａ社」であることが判明する。音声認識部１２２は、音声認識処理の結果である入力音声データの文言「Ａ社」をモーラ数算出部１２３に出力する。 When the input voice data “Company A” is input, the voice recognition unit 122 executes voice recognition processing (OP3). The speech recognition process of the speech recognition unit 122 reveals that the wording of the input speech data is “Company A”. The voice recognition unit 122 outputs the word “Company A” of the input voice data, which is the result of the voice recognition process, to the mora number calculation unit 123.

モーラ数算出部１２３は、音声認識処理の結果である入力音声データの文言「Ａ社」が入力されると、入力音声データの文言「Ａ社」のモーラ数を算出する（ＯＰ４）。音声認識処理の結果が「Ａ社」である場合には、モーラ数算出部１２３は、「Ａ社」のモーラ数を「エ」と「ー」と「シャ」とで３モーラと算出する。モーラ数算出部１２３は、「Ａ社」は３モーラであることを発話速度算出部１２４に出力する。 When the wording “Company A” of the input voice data, which is the result of the voice recognition process, is input, the mora number calculation unit 123 calculates the number of morae in that wording (OP4). When the result of the voice recognition process is “Company A” (read “e-sha” in Japanese), the mora number calculation unit 123 counts three morae: “e”, the long-vowel mark “ー”, and “sha”. The mora number calculation unit 123 outputs to the speech rate calculation unit 124 that “Company A” has 3 morae.

発話速度算出部１２４は、入力音声データ「Ａ社」の音声区間長０．５秒とモーラ数３モーラとが入力されると、入力音声データ「Ａ社」の発話速度を算出する（ＯＰ５）。発話速度算出部１２４は、入力音声データ「Ａ社」の発話速度を発話速度＝入力音声データ「Ａ社」のモーラ数÷入力音声データ「Ａ社」の音声区間長＝３モーラ÷０．５秒＝６モーラ／秒と算出する。発話速度算出部１２４は、入力音声データ「Ａ社」の発話速度６モーラ／秒を制御部１３に出力する。 When the voice section length of 0.5 seconds and the mora count of 3 morae for the input voice data “Company A” are input, the speech rate calculation unit 124 calculates the speech rate of the input voice data “Company A” (OP5). The speech rate calculation unit 124 calculates the speech rate as: speech rate = number of morae of the input voice data “Company A” ÷ voice section length of the input voice data “Company A” = 3 morae ÷ 0.5 seconds = 6 mora/second. The speech rate calculation unit 124 outputs the speech rate of 6 mora/second for the input voice data “Company A” to the control unit 13.
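The mora count and speech rate calculation in OP4 and OP5 can be illustrated with a short sketch. It assumes the speech recognizer supplies the reading in kana (here the hypothetical string "エーシャ" for "Company A"), and uses the common rule that small glide and vowel kana merge with the preceding kana while the long-vowel mark, the small っ/ッ, and ん/ン each count as one mora. The function names are illustrative only.

```python
# Small kana that combine with the preceding kana into a single mora.
SMALL_KANA = set("ァィゥェォャュョヮぁぃぅぇぉゃゅょゎ")


def count_mora(kana: str) -> int:
    """Count morae in a kana reading (assumes the input contains only kana)."""
    return sum(1 for ch in kana if ch not in SMALL_KANA)


def speech_rate(kana: str, duration_s: float) -> float:
    """Speech rate in mora/second = number of morae / voice section length."""
    if duration_s <= 0:
        raise ValueError("voice section length must be positive")
    return count_mora(kana) / duration_s


print(count_mora("エーシャ"))         # 3
print(speech_rate("エーシャ", 0.5))   # 6.0
```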

制御部１３は、入力音声データ「Ａ社」の発話速度（６モーラ／秒）が入力されると、発話速度に基づいて入力音声データの入力時点から遡る範囲を決定する（ＯＰ６）。制御部１３は、補助記憶装置１０５に記憶された発話速度と遡る範囲との対応表（図４Ａから図４Ｈ参照）を参照して、入力音声データの入力時点から遡る範囲を決定する。例えば、入力音声データ「Ａ社」の発話速度が６モーラ／秒である場合には、制御部１３は、入力音声データの入力時点から遡る範囲を９秒と決定する。制御部１３は、入力音声データの入力時点から遡る範囲「９秒」を抽出部１４に出力する。 When the speech rate (6 mora / second) of the input speech data “Company A” is input, the control unit 13 determines a range that goes back from the input speech data input time based on the speech rate (OP6). The control unit 13 refers to a correspondence table (see FIGS. 4A to 4H) between the speech rate stored in the auxiliary storage device 105 and the range that goes back, and determines the range that goes back from the input time point of the input voice data. For example, when the speech rate of the input voice data “Company A” is 6 mora / second, the control unit 13 determines the range that goes back from the input time point of the input voice data as 9 seconds. The control unit 13 outputs a range “9 seconds” that goes back from the input time point of the input voice data to the extraction unit 14.

抽出部１４は、入力音声データ「Ａ社」と、入力音声データの入力時点から遡る範囲「９秒」とが入力されると、主記憶装置１０２内のバッファに蓄積されている再生音声データの出力時系列上の入力音声データの入力時点に対応する時点を求める。抽出部１４は、主記憶装置１０２のバッファ内に蓄積された再生音声データから、入力音声「Ａ社」が入力された時点に対応する再生音声データの出力時系列上の時点から遡る範囲（９秒）を遡った時点を先頭として、順次再生音声データを抽出する（ＯＰ７）。抽出部１４は、抽出された再生音声データを出力する。抽出部１４から出力された再生音声データは、出力装置１０４から再生出力される。 When the input voice data “Company A” and the retroactive range “9 seconds” from the input time point of the input voice data are input, the extraction unit 14 obtains the time point, on the output time series of the reproduced audio data accumulated in the buffer in the main storage device 102, that corresponds to the input time point of the input voice data. From the reproduced audio data accumulated in the buffer of the main storage device 102, the extraction unit 14 sequentially extracts reproduced audio data, starting from the time point reached by going back the retroactive range (9 seconds) from the time point on the output time series that corresponds to when the input voice “Company A” was input (OP7). The extraction unit 14 outputs the extracted reproduced audio data. The reproduced audio data output from the extraction unit 14 is reproduced and output from the output device 104.

情報処理装置１は、利用者が発する音声の発話速度に応じて、利用者の音声の入力時点に対応する再生音声データの出力時系列上の時点から遡る範囲を設定する。例えば、発話速度が大きくなるにつれて遡る範囲が小さくなるように設定する。例えば、発話速度が大きくなるにつれて遡る範囲が大きくなるように設定する。このように、情報処理装置１によれば、発話速度に応じて、利用者の音声入力時点に対応する再生音声データの出力時系列上の時点から遡る範囲の設定を制御することが可能である。 The information processing apparatus 1 sets, according to the speech rate of the voice uttered by the user, the range that goes back from the time point on the output time series of the reproduced audio data corresponding to the input time point of the user's voice. For example, the retroactive range may be set to become smaller as the speech rate increases. Alternatively, the retroactive range may be set to become larger as the speech rate increases. In this way, the information processing apparatus 1 makes it possible to control, according to the speech rate, the setting of the range that goes back from the time point on the output time series of the reproduced audio data corresponding to the user's voice input time point.

また、情報処理装置１が、発話速度が小さいときに、すなわち、利用者がゆっくりと発話したときに、遡る範囲を大きく設定する場合には、聞き逃した情報のような再確認したい情報を再度情報処理装置１から利用者が聴取する可能性が高くなる。 In addition, when the information processing apparatus 1 sets a long range when the utterance speed is low, that is, when the user speaks slowly, information to be reconfirmed such as missed information is again displayed. The possibility that the user listens from the information processing apparatus 1 is increased.

また、発話速度が大きい場合、すなわち、利用者が早口で発話した場合には、利用者が再確認したい情報が利用者の音声入力時点から遡って近い時点に存在する可能性が高い。情報処理装置１が発話速度が大きい場合に遡る範囲を小さく設定することによって、情報処理装置１の処理量を低減することができ、音声データの再生処理の効率の向上が期待できる。 In addition, when the speaking rate is high, that is, when the user speaks quickly, there is a high possibility that the information that the user wants to reconfirm is present at a point near the user's voice input point. By setting the range that goes back when the information processing apparatus 1 has a high utterance speed to be small, the processing amount of the information processing apparatus 1 can be reduced, and improvement in the efficiency of reproduction processing of audio data can be expected.

＜第２実施形態＞ Second Embodiment
第２実施形態の情報処理装置は、音声データの再生中に、利用者から聞き逃した情報などの再確認したい情報に関する音声が入力された場合に、再確認したい情報に関する音声の文言（キーワード）を認識し、その文言をバッファされている音声データから検索する。情報処理装置は検索の結果、ピンポイントでキーワードを含む音声データを再生する。また、第２実施形態の情報処理装置は、利用者が発した音声の発声速度によって、音声データの検索範囲を制御する。第２実施形態の情報処理装置の構成は、第１実施形態の情報処理装置の構成と一部共通する。第２実施形態では、第１実施形態と共通する箇所の説明は省略される。 When, during reproduction of audio data, the user inputs voice relating to information to be reconfirmed, such as information the user missed, the information processing apparatus of the second embodiment recognizes the wording (keyword) of that voice and searches the buffered audio data for it. As a result of the search, the information processing apparatus reproduces the audio data containing the keyword at the pinpointed position. The information processing apparatus of the second embodiment also controls the search range of the audio data according to the speech rate of the voice uttered by the user. The configuration of the information processing apparatus of the second embodiment is partially in common with that of the first embodiment. In the second embodiment, descriptions of parts common to the first embodiment are omitted.

＜＜情報処理装置の構成例＞＞ << Configuration example of information processing apparatus >>
図６は、情報処理装置のプロセッサが音声データ再生プログラムを実行することによって実現される機能の説明図である。図６に示される情報処理装置２のハードウェア構成は図１に示される情報処理装置１と同様である。情報処理装置２は、プロセッサ１０１が音声データ再生プログラムを実行することによって、データ入力部２１，記録部２２，音声入力部２４，算出部２５，制御部２６，検索部２７，出力範囲決定部２８，及び出力部２９を実現することができる。すなわち、情報処理装置２は、音声データ再生プログラムの実行によって、データ入力部２１，記録部２２，記憶部２３，音声入力部２４，算出部２５，制御部２６，検索部２７，出力範囲決定部２８，及び出力部２９を備えた装置として機能する。 FIG. 6 is an explanatory diagram of functions realized when the processor of the information processing apparatus executes the audio data reproduction program. The hardware configuration of the information processing apparatus 2 shown in FIG. 6 is the same as that of the information processing apparatus 1 shown in FIG. 1. In the information processing apparatus 2, the processor 101 executes the audio data reproduction program to realize the data input unit 21, the recording unit 22, the voice input unit 24, the calculation unit 25, the control unit 26, the search unit 27, the output range determination unit 28, and the output unit 29. That is, by executing the audio data reproduction program, the information processing apparatus 2 functions as an apparatus including the data input unit 21, the recording unit 22, the storage unit 23, the voice input unit 24, the calculation unit 25, the control unit 26, the search unit 27, the output range determination unit 28, and the output unit 29.

データ入力部２１は、図１に示されるネットワークインタフェース１０７，又は、チューナー１０８と接続し、無線通信または有線による通信により他の装置から音声信号を入力として得る。データ入力部２１は、例えば、ラジオ放送電波、ワンセグ放送等の音声信号を入力として得る。データ入力部２１は、音声信号を情報処理装置２で扱える音声データに変換し、記録部２２と出力部２９とに出力する。データ入力部２１は、例えば、アナログ信号からディジタル信号へ変換し、ディジタル信号を符号化して音声データを得る。以降、データ入力部２１を介して情報処理装置２に入力される音声データを再生用音声データと呼ぶ。 The data input unit 21 is connected to the network interface 107 or the tuner 108 shown in FIG. 1 and obtains an audio signal as an input from another device by wireless or wired communication. The data input unit 21 receives, for example, an audio signal such as a radio broadcast wave or a one-segment broadcast as an input. The data input unit 21 converts the audio signal into audio data that can be handled by the information processing apparatus 2, and outputs the audio data to the recording unit 22 and the output unit 29. The data input unit 21 converts, for example, an analog signal into a digital signal and encodes the digital signal to obtain audio data. Hereinafter, the audio data input to the information processing apparatus 2 via the data input unit 21 is referred to as reproduction audio data.

また、データ入力部２１は、再生用音声データが出力される時間をタイムスタンプとして再生用音声データに付与する。タイムスタンプは、例えば、情報処理装置２が備えるクロック（図示せず）に基づいた情報処理装置２の起動からの経過時間，情報処理装置２が管理する時刻，再生音声データの先頭を始点（０：ゼロ）とした場合の経過時間の何れであってもよい。また、第２実施形態では、再生用音声データがデータ入力部２１に入力された時点の時間に、再生用音声データが情報処理装置２に入力されて出力されるまでに要すると予測される時間を加算した時点を、再生用音声データの出力時点とみなしている。第２実施形態では、再生用音声データに付与されたタイムスタンプが示す時系列は、再生用音声データの出力時系列と呼ばれる。 The data input unit 21 also gives the reproduction audio data a time stamp indicating the time at which that data is output. The time stamp may be, for example, any of the following: the elapsed time since start-up of the information processing apparatus 2 based on a clock (not shown) provided in the apparatus, the time of day managed by the information processing apparatus 2, or the elapsed time measured with the beginning of the reproduced audio data as the origin (0: zero). In the second embodiment, the output time point of the reproduction audio data is taken to be the time at which the data is input to the data input unit 21 plus the time predicted to be required from its input to the information processing apparatus 2 until its output. In the second embodiment, the time series indicated by the time stamps given to the reproduction audio data is called the output time series of the reproduction audio data.

記録部２２は、データ入力部２１から再生用音声データを入力として得る。記録部２２は、再生用音声データを記憶部２３に格納する。 The recording unit 22 obtains reproduction audio data from the data input unit 21 as an input. The recording unit 22 stores the reproduction audio data in the storage unit 23.

記憶部２３は、図１に示される主記憶装置１０２の記憶領域の一部である。記憶部２３は、記録部２２によって記録される再生用音声データを所定時間保持する。 The storage unit 23 is a part of the storage area of the main storage device 102 shown in FIG. The storage unit 23 holds the playback audio data recorded by the recording unit 22 for a predetermined time.

音声入力部２４は、図１に示されるマイクロフォン１０３と接続し、マイクロフォン１０３を通じて利用者の発した音声信号を入力として得る。音声入力部２４は、入力された音声信号を情報処理装置２が扱える音声データに変換する。例えば、音声入力部２４は、入力された音声信号をアナログ音声信号からディジタル音声信号に変換し、ディジタル音声信号を符号化し音声データを得る。音声入力部２４は、利用者の音声データを算出部２５と検索部２７とに出力する。なお、利用者が発する音声の発話内容は、利用者が再確認を希望する情報を示す語句である。音声入力部２４に入力された利用者の音声及び音声データは、以降、キーワード音声及びキーワード音声データと呼ばれる。 The voice input unit 24 is connected to the microphone 103 shown in FIG. 1 and receives a voice signal emitted by the user through the microphone 103 as an input. The voice input unit 24 converts the input voice signal into voice data that can be handled by the information processing apparatus 2. For example, the voice input unit 24 converts an input voice signal from an analog voice signal to a digital voice signal, and encodes the digital voice signal to obtain voice data. The voice input unit 24 outputs user voice data to the calculation unit 25 and the search unit 27. The utterance content of the voice uttered by the user is a phrase indicating information that the user desires to reconfirm. The user's voice and voice data input to the voice input unit 24 are hereinafter referred to as keyword voice and keyword voice data.

また、音声入力部２４は、キーワード音声データが入力された時点の時間をタイムスタンプとしてキーワード音声データに付与する。タイムスタンプは、例えば、情報処理装置２が備えるクロック（図示せず）に基づいた情報処理装置２の起動からの経過時間，情報処理装置２が管理する時刻の何れであってもよい。 Further, the voice input unit 24 assigns the time when the keyword voice data is input as a time stamp to the keyword voice data. The time stamp may be, for example, an elapsed time from the activation of the information processing apparatus 2 based on a clock (not shown) provided in the information processing apparatus 2 or a time managed by the information processing apparatus 2.

算出部２５は、音声入力部２４からキーワード音声データを入力として得る。算出部２５は、キーワード音声データの発話速度を算出する。算出部２５は、区間検出部２５１，音声認識部２５２，モーラ数算出部２５３，及び発話速度算出部２５４を含む。算出部２５，及び算出部２５に含まれる区間検出部２５１，音声認識部２５２，モーラ数算出部２５３，及び発話速度算出部２５４は、第１実施形態における算出部１５及び算出部１５に含まれる区間検出部１２１，音声認識部１２２，モーラ数算出部１２３，及び発話速度算出部１２４とそれぞれ同様である。算出部２５は、算出されたキーワード音声の発話速度を制御部２６に出力する。 The calculation unit 25 obtains the keyword voice data from the voice input unit 24 as an input. The calculation unit 25 calculates the speech rate of the keyword voice data. The calculation unit 25 includes a section detection unit 251, a voice recognition unit 252, a mora number calculation unit 253, and a speech rate calculation unit 254. The calculation unit 25 and the section detection unit 251, voice recognition unit 252, mora number calculation unit 253, and speech rate calculation unit 254 included in it are the same as, respectively, the calculation unit 15 of the first embodiment and the section detection unit 121, voice recognition unit 122, mora number calculation unit 123, and speech rate calculation unit 124 included in that calculation unit. The calculation unit 25 outputs the calculated speech rate of the keyword voice to the control unit 26.

制御部２６は、キーワード音声データの発話速度を入力として得る。制御部２６は、例えば、発話速度に基づいて、記憶部２３に保持されている再生用音声データ内でキーワード音声の語句（キーワード）を検索するための検索範囲を決定する。制御部２６は、対応表記憶部２６１を含む。対応表記憶部２６１は、図１に示される補助記憶装置１０５のデータ格納領域の一部であり、発話速度と検索範囲の対応表（図４Ａから図４Ｈ参照）を保持する。制御部２６は、第１実施形態の制御部１３の遡る範囲の決定処理と同様にして、対応表記憶部２６１に保持された発話速度と検索範囲との対応表に基づいて検索範囲を決定する。制御部２６は、決定された検索範囲を検索部２７に出力する。検索範囲は、時間長，音節数，モーラ数などいずれで定義されてもよい。第２実施形態においては、制御部２６は、検索範囲を時間で定義する場合について説明される。 The control unit 26 obtains the speech rate of the keyword voice data as an input. Based on the speech rate, for example, the control unit 26 determines a search range for searching for the phrase (keyword) of the keyword voice within the reproduction audio data held in the storage unit 23. The control unit 26 includes a correspondence table storage unit 261. The correspondence table storage unit 261 is a part of the data storage area of the auxiliary storage device 105 shown in FIG. 1, and holds a correspondence table between the speech rate and the search range (see FIGS. 4A to 4H). The control unit 26 determines the search range based on the correspondence table held in the correspondence table storage unit 261, in the same manner as the retroactive range determination process of the control unit 13 of the first embodiment. The control unit 26 outputs the determined search range to the search unit 27. The search range may be defined by any of time length, number of syllables, number of morae, and the like. The second embodiment is described for the case in which the control unit 26 defines the search range in terms of time.

検索部２７は、キーワード音声データと検索範囲とを入力として得る。検索部２７は、キーワード音声データの入力時点に対応する、記憶部２３に保持される再生用音声データの出力時系列上の時点を求める。検索部２７は、キーワード音声データに付与されたタイムスタンプと、再生用音声データに付与されたタイムスタンプとから、キーワード音声データの入力時点に対応する、再生用音声データの出力時系列上の時点を求めることができる。例えば、再生用音声データとキーワード音声データともに、情報処理装置２のクロックによる起動からの経過時間がタイムスタンプとして付与されている場合には、キーワード音声データのタイムスタンプが示す時間が、キーワード音声データの入力時点に対応する、再生用音声データの出力時系列上の時点となる。また、再生用音声データのタイムスタンプと、キーワード音声データのタイムスタンプとが異なる時系列である場合には、検索部２７は、キーワード音声データのタイムスタンプが示す時間を、再生音声データの出力時系列上の時間に変換して、入力音声データの入力時点に対応する再生音声データの出力時系列上での時点を得る。 The search unit 27 receives the keyword voice data and the search range as inputs. The search unit 27 obtains the time point, on the output time series of the reproduction audio data held in the storage unit 23, that corresponds to the input time point of the keyword voice data. The search unit 27 can obtain this time point from the time stamp given to the keyword voice data and the time stamps given to the reproduction audio data. For example, when both the reproduction audio data and the keyword voice data are time-stamped with the elapsed time since start-up according to the clock of the information processing apparatus 2, the time indicated by the time stamp of the keyword voice data is itself the time point on the output time series of the reproduction audio data that corresponds to the input time point of the keyword voice data. When the time stamps of the reproduction audio data and those of the keyword voice data belong to different time series, the search unit 27 converts the time indicated by the time stamp of the keyword voice data into a time on the output time series of the reproduced audio data, thereby obtaining the time point on the output time series of the reproduced audio data that corresponds to the input time point of the input voice data.

検索部２７は、記憶部２３に保持される再生用音声データから、再生用音声データの出力時系列上の利用者のキーワード音声の入力時点に対応する時点から、検索範囲に相当する時間を遡った時点までの再生用音声データを部分音声データとして読み出す。検索部２７は、読み出された部分音声データ内にキーワードが含まれるか否かの検索を行う。また、検索部２７は、部分音声データとして、利用者からのキーワード音声の入力時点の直前若しくは直後の無音箇所の時点から検索範囲に相当する時間を遡った時点までの再生用音声データを記憶部２３から読み出してもよい。無音箇所とは、息継ぎ時の呼気の箇所や、文章と文章の間の一定時間以上の無音箇所を指す。以降、キーワード音声の入力時点という場合には、キーワード音声の入力時点に対応する、再生用音声データの出力時系列上の時点が示されることとする。 From the reproduction audio data held in the storage unit 23, the search unit 27 reads out, as partial audio data, the reproduction audio data from the time point on the output time series that corresponds to the input time point of the user's keyword voice back to the time point reached by going back the time corresponding to the search range. The search unit 27 searches whether the keyword is included in the read partial audio data. Alternatively, the search unit 27 may read from the storage unit 23, as partial audio data, the reproduction audio data from the silent point immediately before or immediately after the input time point of the keyword voice back to the time point reached by going back the time corresponding to the search range. A silent point refers to a point of breath during a breathing pause or a silent interval of at least a certain duration between sentences. Hereinafter, the input time point of the keyword voice denotes the time point on the output time series of the reproduction audio data that corresponds to the input time point of the keyword voice.

検索部２７は、記憶部２３から読み出された部分音声データ内の利用者が発したキーワードの検索に、例えば、ワードスポッティングのような音声認識の技術を用いる。 The search unit 27 uses, for example, a speech recognition technique such as word spotting to search for a keyword issued by a user in the partial voice data read from the storage unit 23.

図７は、検索部２７がワードスポッティング技術を用いて、キーワードの検索を行う場合の例を示す。ワードスポッティングとは、音声データの周波数成分のような特徴となるパラメータを用いて、検出したい特定の単語を抽出する方法である。 FIG. 7 shows an example in which the search unit 27 searches for a keyword using the word spotting technique. Word spotting is a method for extracting a specific word to be detected using a parameter having a characteristic such as a frequency component of audio data.

図７に示される例は、情報処理装置２から再生用音声の出力中に、利用者がキーワード「Ａ社」を発した場合に、再生用音声データから抽出された部分音声データ「今日の株価はＡ社５００円、Ｂ社・・・」にキーワード「Ａ社」が含まれるか否かを検索する例である。図７には、「今日の株価はＡ社５００円、Ｂ社・・・」という内容の部分音声データの音声波形と「Ａ社」というキーワード音声データの音声波形とが示されている。検索部２７は、キーワード音声の入力時点から、検索範囲に相当する時間を遡った時点までに含まれる再生用音声データを部分音声データとして記憶部２３から読み出す。検索部２７は、部分音声データと、キーワード音声データ「Ａ社」とを比較することで、部分音声データからキーワード「Ａ社」を検出する。検索部２７は、部分音声データを、例えば、音節または単語ごとに区切る。音節または単語で区切られた部分音声データごとにパラメータを算出し、このパラメータとキーワード音声データ「Ａ社」のパラメータとをそれぞれ比較する。検索部２７は、音節または単語で区切られた部分音声データのうちの１つのパラメータとキーワード音声データ「Ａ社」のパラメータとが合致する場合に、部分音声データからキーワード「Ａ社」を検出し、検出成功を判定する。読み出された部分音声データ内にキーワードが検出された場合には、検索部２７は、検出結果を出力範囲決定部２８に出力する。 The example shown in FIG. 7 is one in which, when the user utters the keyword “Company A” while the information processing apparatus 2 is outputting the reproduction audio, a search is made as to whether the keyword “Company A” is included in the partial audio data “Today's stock prices are Company A 500 yen, Company B ...” extracted from the reproduction audio data. FIG. 7 shows the speech waveform of the partial audio data with the content “Today's stock prices are Company A 500 yen, Company B ...” and the speech waveform of the keyword voice data “Company A”. The search unit 27 reads from the storage unit 23, as partial audio data, the reproduction audio data contained between the input time point of the keyword voice and the time point reached by going back the time corresponding to the search range. The search unit 27 detects the keyword “Company A” in the partial audio data by comparing the partial audio data with the keyword voice data “Company A”. The search unit 27 divides the partial audio data into, for example, syllables or words. It calculates a parameter for each piece of partial audio data divided in this way and compares each of these parameters with the parameter of the keyword voice data “Company A”. When the parameter of one of the pieces matches the parameter of the keyword voice data “Company A”, the search unit 27 detects the keyword “Company A” in the partial audio data and determines that detection has succeeded. When the keyword is detected in the read partial audio data, the search unit 27 outputs the detection result to the output range determination unit 28.

音節または単語で区切られた部分音声データのパラメータとキーワード音声データ「Ａ社」のパラメータとが合致しない場合には、検索部２７は、部分音声データからキーワード「Ａ社」を検出できず、検出失敗を判定する。読み出された部分音声データからキーワードが検出されない場合には、検索部２７は制御部２６に検索範囲の再設定要求を出力する。 When none of the parameters of the partial audio data divided by syllables or words matches the parameter of the keyword voice data “Company A”, the search unit 27 cannot detect the keyword “Company A” in the partial audio data and determines that detection has failed. When no keyword is detected in the read partial audio data, the search unit 27 outputs a search range reset request to the control unit 26.
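The success and failure decisions described above can be sketched as a comparison of per-segment parameters against the keyword's parameter. This is only a stand-in for a real word-spotting system: representing each parameter as a short feature vector, using Euclidean distance, and the `threshold` value are all assumptions, since the source says only that the parameters "match".

```python
import math


def find_keyword(segment_params, keyword_param, threshold):
    """Return the index of the first segment whose parameter vector lies
    within `threshold` of the keyword's vector, or None on detection failure."""
    for index, param in enumerate(segment_params):
        if math.dist(param, keyword_param) <= threshold:
            return index
    return None


# Hypothetical 2-dimensional parameters for three segments of partial audio data.
segments = [[0.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
print(find_keyword(segments, [2.1, 2.0], threshold=0.5))  # 1
print(find_keyword(segments, [9.0, 9.0], threshold=0.5))  # None
```

A return value of `None` corresponds to the detection-failure case, in which the search unit 27 would issue a search range reset request to the control unit 26.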

制御部２６は、検索部２７から検索範囲の再設定要求を入力として得ると、検索範囲を設定し直す。 When receiving a search range reset request from the search unit 27 as an input, the control unit 26 resets the search range.

図８Ａ，図８Ｂ，及び図８Ｃは、検索範囲の再設定処理の例である処理１から処理３を示す図である。制御部２６は、検索範囲の再設定処理として、図８Ａ，図８Ｂ，及び図８Ｃにそれぞれ示される処理１，処理２，処理３の何れを行ってもよい。図８Ａ，図８Ｂ，及び図８Ｃは、いずれも１回目の検索時の検索範囲が制御部２６によって５秒に設定される場合を示す。また、図８Ａ，図８Ｂ，及び図８Ｃは、キーワードが「Ａ社」である場合を示す。 FIG. 8A, FIG. 8B, and FIG. 8C are diagrams showing processing 1 to processing 3, which are examples of search range resetting processing. The control unit 26 may perform any one of the processing 1, processing 2, and processing 3 shown in FIGS. 8A, 8B, and 8C, respectively, as the search range resetting processing. 8A, 8B, and 8C show cases where the search range at the time of the first search is set to 5 seconds by the control unit 26. FIG. 8A, 8B, and 8C show a case where the keyword is “Company A”.

図８Ａは、検索範囲の再設定処理の一例である処理１を示す図である。制御部２６は、検索部２７から検索範囲の再設定要求が入力されると、１回目と同じサイズで２回目の検索範囲を決定する。制御部２６は、決定された２回目の検索範囲を検索部２７に出力する。 FIG. 8A is a diagram illustrating a process 1 which is an example of a search range resetting process. When a search range reset request is input from the search unit 27, the control unit 26 determines the second search range with the same size as the first time. The control unit 26 outputs the determined second search range to the search unit 27.

When the second search range, which has the same size as the first search range, is input, the search unit 27 reads the partial voice data between the time point reached by going back the first search range from the input time point of the keyword voice and the time point reached by going back the second search range from there, and searches for the keyword. For example, in FIG. 8A, the search unit 27 reads, as partial voice data, the playback audio data between the time point 5 seconds (the first search range) before the input time point of the keyword voice "Company A" and the time point a further 5 seconds (the second search range) before that. That is, it reads, as partial voice data, the playback audio data contained in the range from 5 seconds to 10 seconds before the input time point of the keyword voice "Company A". FIG. 8A shows a case in which the keyword "Company A" is detected in the second search, which searches the partial voice data contained in the range from 5 seconds to 10 seconds before the input time point of the keyword voice. When the keyword "Company A" is detected in the second search, the search unit 27 outputs the detection result to the output range determination unit 28.

If the keyword "Company A" is not detected in the second search either, the search unit 27 again outputs a search range reset request to the control unit 26. When the search range reset request is input again, the control unit 26 sets the third search range to the same size as the first and second search ranges. The control unit 26 outputs the third search range to the search unit 27. When the third search range is input, the search unit 27 reads the partial voice data between the time point reached by going back the first and second search ranges from the input time point of the keyword voice and the time point reached by going back the third search range from there, and searches for the keyword. In FIG. 8A, the playback audio data between the time point reached by going back the first search range (5 seconds) and the second search range (5 seconds) from the input time point of the keyword voice data "Company A" and the time point a further third search range (5 seconds) before that is read as partial voice data. That is, the search unit 27 reads, as partial voice data, the playback audio data contained in the range from 10 seconds to 15 seconds before the input time point of the keyword voice data "Company A". The search unit 27 executes the third search for the keyword "Company A" in the read partial voice data.

The search unit 27 and the control unit 26 repeat the above search process up to a preset n + 1 times (n is a natural number excluding 0) until the keyword "Company A" is detected in the partial voice data read from the storage unit 23. If the keyword "Company A" is not detected even after n + 1 repetitions, the search unit 27 outputs "detection failure" to the output range determination unit 28. When "detection failure" is input, the user is notified that the keyword detection has failed, via the output range determination unit 28, the output unit 29, and the output device 104 connected to the output unit 29. One conceivable cause of a keyword detection failure is, for example, that the user's utterance was unclear and could not be recognized correctly.

FIG. 8B is a diagram illustrating Process 2, which is an example of the search range resetting process. When a search range reset request is input from the search unit 27, the control unit 26 determines α times the first search range (α > 1) as the second search range. The determined second search range is output to the search unit 27. For example, in the example shown in FIG. 8B, α = 2, and the control unit 26 specifies, as the second search range, 10 seconds, which is α times (twice) the first search range (5 seconds).

When the second search range, which is α times the first search range, is input, the search unit 27 reads from the storage unit 23 the partial voice data contained between the time point reached by going back the first search range from the input time point of the keyword voice and the time point reached by going back the second search range from there. In FIG. 8B, the search unit 27 reads, as partial voice data, the playback audio data contained between the time point 5 seconds (the first search range) before the input time point of the keyword voice data "Company A" and the time point a further 10 seconds (the second search range) before that. That is, the search unit 27 reads, as partial voice data, the playback audio data contained in the range from 5 seconds to 15 seconds before the input time point of the keyword voice data "Company A". FIG. 8B shows a case in which the keyword "Company A" is detected in the second search, which searches the partial voice data contained in the range from 5 seconds to 15 seconds before the input time point of the keyword voice data "Company A".

If the keyword "Company A" is not detected in the second search either, the search unit 27 and the control unit 26 repeat the search process up to n + 1 times until the keyword "Company A" is detected, as in Process 1 shown in FIG. 8A. If the (n + 1)th search process fails, the search unit 27 outputs "search failure" to the output range determination unit 28.

FIG. 8C is a diagram illustrating Process 3, which is an example of the search range resetting process. When a search range reset request is input from the search unit 27, the control unit 26 determines, as the second search range, the range from the time point reached by going back the first search range from the input time point of the keyword voice to the beginning of the playback audio data stored in the storage unit 23, and outputs the second search range to the search unit 27.

When the second search range is input, the search unit 27 reads from the storage unit 23 the partial voice data from the time point reached by going back the first search range from the input time point of the keyword voice to the beginning of the playback audio data stored in the storage unit 23. The search unit 27 executes the second search for the keyword "Company A" in the read partial voice data. In Process 3 of FIG. 8C, the second search covers the playback audio data all the way back to its beginning, so the keyword "Company A" is more likely to be detected in the second search than in Process 1 or Process 2.

The search range resetting processes described above with reference to FIGS. 8A, 8B, and 8C are summarized as follows.
(Process 1) As the search range for the second and subsequent searches, the control unit 26 sets the same range as the first search range.
(Process 2) As the search range for the second and subsequent searches, the control unit 26 sets a search range that is α times (α > 1) the previous search range.
(Process 3) As the second search range, the control unit 26 sets the range from the beginning of the playback audio data to the start point of the partial voice data of the first search.
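The three resetting processes can be compared by listing the successive search windows each one produces. The sketch below is a minimal illustration under stated assumptions: the function name `search_windows`, the parameters `elapsed` (seconds of playback audio stored before the keyword input time point) and `max_searches` (the n + 1 limit), and the clamping of windows at the beginning of the data are not taken from the description.

```python
def search_windows(first_range, elapsed, process, alpha=2.0, max_searches=3):
    """List the (near, far) search windows, in seconds before the keyword input time.

    process: 1 = same size every time, 2 = alpha times the previous size,
             3 = second search runs back to the beginning of the stored data.
    """
    if process not in (1, 2, 3):
        raise ValueError("process must be 1, 2, or 3")
    windows = []
    near = 0.0
    size = first_range
    for attempt in range(max_searches):
        if near >= elapsed:
            break  # nothing older is stored (assumed stopping rule)
        if process == 3 and attempt >= 1:
            far = elapsed  # Process 3: back to the beginning of the data
        else:
            far = min(near + size, elapsed)
        windows.append((near, far))
        near = far
        if process == 2:
            size *= alpha  # Process 2: next range is alpha times the previous one
    return windows
```

With a first search range of 5 seconds, Process 1 yields the windows 0 to 5, 5 to 10, and 10 to 15 seconds, and Process 2 with α = 2 yields 0 to 5 and 5 to 15 seconds for the first two searches, matching the examples of FIGS. 8A and 8B.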

図９Ａから図９Ｄは、部分音声データ内でキーワードが複数検出される場合の処理の例である処理Ａから処理Ｄを示す図である。検索部２７は、部分音声データ内にキーワードが複数検出される場合には、処理Ａから処理Ｄの何れを実行してもよい。図９Ａから図９Ｄに示される例は、検索部２７がキーワードとして「Ａ社」の検索処理を実行する例を示す。 FIGS. 9A to 9D are diagrams showing process A to process D, which are examples of processes when a plurality of keywords are detected in the partial voice data. The search unit 27 may execute any of the process A to the process D when a plurality of keywords are detected in the partial voice data. The examples shown in FIGS. 9A to 9D show an example in which the search unit 27 executes a search process of “Company A” as a keyword.

図９Ａは、部分音声データ内にキーワードが複数検出される場合の処理の一例である処理Ａを示す図である。処理Ａでは、検索部２７は、検出されたキーワードの中から、キーワードの入力時点から遡って時間的に最も近いキーワードを検索結果として出力範囲決定部２８に出力する。 FIG. 9A is a diagram illustrating a process A that is an example of a process when a plurality of keywords are detected in partial audio data. In the process A, the search unit 27 outputs, to the output range determination unit 28, the keyword closest in time from the keyword input point in time as the search result.

図９Ｂは、部分音声データ内にキーワードが複数検出される場合の処理の一例である処理Ｂを示す図である。処理Ｂでは、検索部２７は、検出されたキーワードの中から、キーワードの入力時点から遡って時間的に最も遠いキーワードを検索結果として出力範囲決定部２８に出力する。 FIG. 9B is a diagram illustrating a process B which is an example of a process when a plurality of keywords are detected in the partial voice data. In the process B, the search unit 27 outputs the keyword farthest in time from the keyword input time point to the output range determination unit 28 as a search result.

図９Ｃは、部分音声データ内にキーワードが複数検出される場合の処理の一例である処理Ｃを示す図である。処理Ｃでは、検索部２７は、検出されたキーワードの中の任意のキーワードを検索結果として出力範囲決定部２８に出力する。 FIG. 9C is a diagram illustrating a process C which is an example of a process when a plurality of keywords are detected in the partial voice data. In the process C, the search unit 27 outputs an arbitrary keyword among the detected keywords to the output range determination unit 28 as a search result.

図９Ｄは、部分音声データ内にキーワードが複数検出される場合の処理の一例である処理Ｄを示す図である。処理Ｄでは、検索部２７は、検出されたすべてのキーワードを検索結果として出力範囲決定部２８に出力する。 FIG. 9D is a diagram illustrating a process D which is an example of a process when a plurality of keywords are detected in the partial voice data. In the process D, the search unit 27 outputs all detected keywords to the output range determination unit 28 as search results.
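Processes A to D differ only in which of the detected occurrences is passed on. A minimal sketch follows, assuming each detection is represented by its offset in seconds before the keyword input time point; the function name, the `policy` labels, and the `rng` parameter are illustrative assumptions.

```python
import random

def select_detections(offsets, policy, rng=random):
    """Choose which detected keyword occurrences to output.

    offsets: seconds before the keyword input time point at which the keyword was found.
    policy: "A" = nearest in time, "B" = farthest in time,
            "C" = an arbitrary one, "D" = all of them.
    """
    if not offsets:
        return []
    if policy == "A":
        return [min(offsets)]
    if policy == "B":
        return [max(offsets)]
    if policy == "C":
        return [rng.choice(offsets)]
    if policy == "D":
        return sorted(offsets)
    raise ValueError("policy must be one of 'A', 'B', 'C', 'D'")
```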

出力範囲決定部２８は、検索部２７からキーワードの検索結果が入力されると、再生用音声データを記憶部２３から読み出す際の先頭位置を決定する。再生用音声データを記憶部２３から読み出す際の先頭位置は、再生用音声データの再生の開始位置である。 When the keyword search result is input from the search unit 27, the output range determination unit 28 determines the head position when reading the reproduction audio data from the storage unit 23. The head position when the reproduction audio data is read from the storage unit 23 is the reproduction start position of the reproduction audio data.

FIG. 10A and FIG. 10B are diagrams illustrating examples of the process of determining the head position used when reading the playback audio data from the storage unit 23. FIGS. 10A and 10B show, for example, a case in which a user who wants to listen to the stock price of Company A again utters the keyword "Company A" while the voice information "Today's closing prices of electrical-equipment-related stocks are X yen for Company A, Y yen for Company B ..." is being output. In the examples shown in FIGS. 10A and 10B, when the keyword "Company A" has been detected by the keyword search process of the search unit 27, the output range determination unit 28 determines the head position from which the playback audio data is read from the storage unit 23.

In the example shown in FIG. 10A, the output range determination unit 28 determines the detected keyword "Company A" as the head position for reading the playback audio data from the storage unit 23. Starting from the keyword "Company A", the output range determination unit 28 sequentially reads the playback audio data from the storage unit 23 and outputs it to the output unit 29. The output device 104 then outputs, through the output unit 29, the playback audio data starting from the detected keyword "Company A", for example "X yen for Company A, Y yen for Company B. Steel-related stock prices are ...".

In the example shown in FIG. 10B, the output range determination unit 28 determines the silent part immediately preceding the detected keyword "Company A" in time as the head position for reading the playback audio data from the storage unit 23. Starting from that silent part, the output range determination unit 28 reads the playback audio data from the storage unit 23 and outputs it to the output unit 29. A silent part is, for example, a silence of at least a certain duration between sentences, or a point where the person reading out the information takes a breath. The output device 104 then outputs, through the output unit 29, the playback audio data starting from the silent part immediately preceding the detected keyword "Company A", for example "Today's closing prices of electrical-equipment-related stocks are X yen for Company A, Y yen for Company B. ...".

情報処理装置２は、図１０Ａに示される処理と図１０Ｂに示される処理とのいずれかを実行する。 The information processing device 2 executes either the process shown in FIG. 10A or the process shown in FIG. 10B.
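The two ways of choosing the playback start position can be sketched as a single function. This is an illustrative sketch only: it assumes that the silent parts have already been located and are given as a list of time stamps, and the function and parameter names are hypothetical.

```python
def playback_start(keyword_time, silence_times, use_preceding_silence):
    """Return the time (seconds from the start of the data) at which replay begins.

    keyword_time: position of the detected keyword in the playback audio data.
    silence_times: positions of silent parts (assumed to be detected beforehand).
    use_preceding_silence: False = start at the keyword itself (FIG. 10A),
                           True = start at the silent part just before it (FIG. 10B).
    """
    if not use_preceding_silence:
        return keyword_time
    earlier = [t for t in silence_times if t <= keyword_time]
    # Fall back to the keyword position when no earlier silent part is known (assumption).
    return max(earlier) if earlier else keyword_time
```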

出力部２９は、データ入力部２１から再生用音声データを入力として得る。例えば、出力部２９は、再生用音声データをディジタル信号に復号する。出力部２９は、スピーカ等の出力装置１０４（図１）に接続しており、出力装置１０４に復号化されたディジタル信号を出力する。出力部２９から出力されたディジタル信号は、アナログ信号に変換され出力装置１０４から音声信号として出力される。 The output unit 29 receives the playback audio data from the data input unit 21 as an input. For example, the output unit 29 decodes the reproduction audio data into a digital signal. The output unit 29 is connected to an output device 104 (FIG. 1) such as a speaker, and outputs the decoded digital signal to the output device 104. The digital signal output from the output unit 29 is converted into an analog signal and output from the output device 104 as an audio signal.

The output unit 29 also switches its source of playback audio between the data input unit 21 and the output range determination unit 28. When playback audio data is input from the output range determination unit 28, the output unit 29 switches the source of the playback audio data from the data input unit 21 to the output range determination unit 28. Thereafter, for example, when no playback audio data has been input from the output range determination unit 28 for a predetermined time, the output unit 29 switches the source of the playback audio data from the output range determination unit 28 back to the data input unit 21.

The output unit 29 may reproduce the playback audio data sequentially input from the output range determination unit 28 at normal (1x) speed. Alternatively, the output unit 29 may reproduce the playback audio data sequentially input from the output range determination unit 28 at, for example, double speed. When the playback audio data input from the output range determination unit 28 is reproduced at double speed, it catches up at some point with the real-time playback audio data input from the data input unit 21. After that, playback audio data is no longer input from the output range determination unit 28, so the output unit 29 switches the source of the playback audio data from the output range determination unit 28 to the data input unit 21 and reproduces the playback audio data input from the data input unit 21 at 1x speed.
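How long the faster-than-normal replay lasts before it catches up with the real-time input can be estimated with simple arithmetic. The formula below is not stated in the description; it follows from the assumption that the real-time input keeps advancing at 1x speed while the replay advances at `speed`, so the backlog shrinks by `speed - 1` seconds of audio per second.

```python
def catch_up_time(delay_sec, speed):
    """Seconds of wall-clock time until the replay catches up with the live input.

    delay_sec: how far behind the live position the replay starts.
    speed: replay speed factor; must exceed 1, otherwise the replay never catches up.
    """
    if speed <= 1.0:
        raise ValueError("speed must be greater than 1 to catch up")
    return delay_sec / (speed - 1.0)
```

For example, a replay that starts 10 seconds behind and runs at double speed would catch up after 10 seconds under this assumption.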

<< Processing flow of the information processing apparatus >>
FIG. 11 is a diagram illustrating an example of the processing flow of the information processing apparatus 2. The example shown in FIG. 11 is one in which, during reproduction of audio data, the user utters a keyword related to information that the user wants to listen to again.

情報処理装置２は、出力部２９からの音声データの再生開始（出力開始）とともに、図１１の処理フローを開始する。 The information processing apparatus 2 starts the processing flow of FIG. 11 together with the start of reproduction (output start) of the audio data from the output unit 29.

For example, while the output unit 29 is outputting the voice information "Today's closing prices of electrical-equipment-related stocks are X yen for Company A, Y yen for Company B ...", a user who wants to listen to the stock price of Company A again utters the keyword "Company A".

音声入力部２４は、この利用者の「Ａ社」というキーワードの発声を入力音声として検出する（ＯＰ２１）。音声入力部２４は、キーワード音声データ「Ａ社」を算出部２５と検索部２７とに出力する。 The voice input unit 24 detects the utterance of the keyword “Company A” by the user as the input voice (OP21). The voice input unit 24 outputs the keyword voice data “Company A” to the calculation unit 25 and the search unit 27.

算出部２５の区間検出部２５１は、キーワード音声データ「Ａ社」が入力されると、キーワード音声データ「Ａ社」の音声区間長を測定する（ＯＰ２２）。例えば、キーワード音声データ「Ａ社」の音声区間長が０．５秒であったとする。区間検出部２５１は、キーワード音声データ「Ａ社」の音声区間長を発話速度算出部２５４に出力する。区間検出部２５１は、キーワード音声データ「Ａ社」を音声認識部２５２に出力する。 When the keyword voice data “Company A” is input, the section detection unit 251 of the calculation unit 25 measures the voice section length of the keyword voice data “Company A” (OP22). For example, it is assumed that the voice segment length of the keyword voice data “Company A” is 0.5 seconds. The section detection unit 251 outputs the voice section length of the keyword voice data “Company A” to the utterance speed calculation unit 254. The section detection unit 251 outputs the keyword voice data “Company A” to the voice recognition unit 252.

When the keyword voice data "Company A" is input, the voice recognition unit 252 executes voice recognition processing (OP23). The voice recognition processing of the voice recognition unit 252 reveals that the keyword is "Company A". The voice recognition unit 252 outputs the result of the voice recognition processing, "Company A", to the mora number calculation unit 253.

When the voice recognition result "Company A" is input, the mora number calculation unit 253 calculates the number of morae (OP24). When the voice recognition result is "Company A" (Ａ社, pronounced "ē-sha"), the mora number calculation unit 253 counts the morae "e" (エ), the long-vowel mark (ー), and "sha" (シャ), giving 3 morae. The mora number calculation unit 253 outputs to the utterance speed calculation unit 254 that "Company A" is 3 morae.

When the voice section length of 0.5 seconds and the mora count of 3 morae of the keyword voice data "Company A" are input, the utterance speed calculation unit 254 calculates the speech rate of the keyword voice data "Company A" (OP25). The utterance speed calculation unit 254 calculates it as: speech rate = mora count of the keyword voice data "Company A" ÷ voice section length of the keyword voice data "Company A" = 3 morae ÷ 0.5 seconds = 6 morae/second. The utterance speed calculation unit 254 outputs the speech rate of the keyword voice data "Company A", 6 morae/second, to the control unit 26.
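The mora count and speech rate calculation of OP24 and OP25 can be reproduced with a short sketch. The counting rule used here is a simplification and an assumption: the recognition result is taken to be available as a katakana reading, and every character counts as one mora except small kana that combine with the preceding character. The function names are hypothetical.

```python
# Small kana that merge with the preceding character into a single mora (simplified rule).
SMALL_KANA = set("ァィゥェォャュョヮ")

def count_mora(katakana_reading):
    """Count morae in a katakana reading; the long-vowel mark counts as one mora."""
    return sum(1 for ch in katakana_reading if ch not in SMALL_KANA)

def speech_rate(mora_count, section_length_sec):
    """Speech rate in morae per second = mora count / voice section length."""
    if section_length_sec <= 0:
        raise ValueError("voice section length must be positive")
    return mora_count / section_length_sec
```

For the reading "エーシャ" uttered over 0.5 seconds, this gives 3 morae and 6 morae/second, in agreement with the figures in the description.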

制御部２６は、キーワード音声データ「Ａ社」の発話速度６モーラ／秒が入力されると、発話速度に基づいて検索範囲を決定する（ＯＰ２６）。制御部２６は、対応表記憶部２６１内に保持された発話速度と検索範囲との対応表（図４Ａから図４Ｈ参照）を参照して検索範囲を決定する。例えば、入力音声データ「Ａ社」の発話速度が６モーラ／秒の場合には、制御部２６は、検索範囲を３秒と決定する。制御部２６は、決定された検索範囲「３秒」を検索部２７に出力する。 When the utterance speed 6 mora / second of the keyword voice data “Company A” is input, the control unit 26 determines a search range based on the utterance speed (OP26). The control unit 26 determines the search range with reference to a correspondence table (see FIGS. 4A to 4H) between the speech rate and the search range held in the correspondence table storage unit 261. For example, when the speech rate of the input voice data “Company A” is 6 mora / second, the control unit 26 determines the search range as 3 seconds. The control unit 26 outputs the determined search range “3 seconds” to the search unit 27.

検索部２７は、キーワード音声データ「Ａ社」と検索範囲「３秒」とが入力されると、記憶部２３から、キーワード音声データの入力時点から検索範囲遡った範囲に含まれる部分音声データを読み出し、部分音声データ内でキーワードを検索する（ＯＰ２７）。検索部２７は、例えば、図７で示されるワードスポッティングなどを用いて検索処理を実行する。 When the keyword voice data “Company A” and the search range “3 seconds” are input, the search unit 27 retrieves the partial voice data included in the range retroactive to the search range from the input point of the keyword voice data from the storage unit 23. Read and search for keywords in the partial audio data (OP27). The search unit 27 executes the search process using, for example, word spotting shown in FIG.

検索部２７の検索処理が失敗した場合、すなわち、部分音声データ内でキーワード「Ａ社」が検出されない場合には（ＯＰ２８：Ｎｏ）、検索部２７は検索範囲の再設定要求を制御部２６に出力する。制御部２６は、検索範囲の再設定要求が入力されると、検索範囲の再設定を行う（ＯＰ２９）。制御部２６は、再設定された検索範囲を検索部２７に出力する。検索部２７は、前回の検索の検索範囲から再設定された検索範囲遡った範囲に含まれる部分音声データを記憶部２３から読み出して、キーワード「Ａ社」を再度検索する（ＯＰ２７）。 When the search process of the search unit 27 fails, that is, when the keyword “Company A” is not detected in the partial voice data (OP28: No), the search unit 27 sends a search range reset request to the control unit 26. Output. When the search range reset request is input, the control unit 26 resets the search range (OP29). The control unit 26 outputs the reset search range to the search unit 27. The search unit 27 reads the partial voice data included in the range retroactive to the search range reset from the search range of the previous search from the storage unit 23 and searches for the keyword “Company A” again (OP27).

検索部２７の検索処理が成功した場合、すなわち、部分音声データ内にキーワード「Ａ社」が検出された場合には（ＯＰ２８：Ｙｅｓ）、検索部２７は、検出結果を出力範囲決定部２８に出力する。出力範囲決定部２８は、再生用音声データの再生の開始点を決定し、記憶部２３から順次再生用音声データを読み出して出力部２９に出力する。出力部２９は、出力範囲決定部２８から再生用音声データが入力され始めると、データ入力部２１から入力される再生用音声データの出力処理を中断し、出力範囲決定部２８から入力される再生用音声データを出力する（ＯＰ３０）。 When the search process of the search unit 27 is successful, that is, when the keyword “Company A” is detected in the partial voice data (OP28: Yes), the search unit 27 sends the detection result to the output range determination unit 28. Output. The output range determination unit 28 determines a playback start point of the playback audio data, sequentially reads the playback audio data from the storage unit 23, and outputs the playback audio data to the output unit 29. When the reproduction audio data starts to be input from the output range determination unit 28, the output unit 29 interrupts the output processing of the reproduction audio data input from the data input unit 21, and the reproduction input from the output range determination unit 28 Audio data is output (OP30).

The processing in OP27, OP28, and OP29 is any one of Process 1, Process 2, and Process 3, which are the examples of the search range resetting process shown in FIGS. 8A, 8B, and 8C.
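The loop formed by OP27, OP28, and OP29 can be sketched as follows for the case where the search range is kept at the same size on every retry. For brevity the sketch replaces word spotting on audio with a plain string comparison against a time-stamped transcript, which is an assumption made only for illustration; the function name, the parameter names, and the early stop at the beginning of the data are likewise hypothetical.

```python
def search_with_retries(transcript, keyword, input_time, search_range, max_searches):
    """Search backwards from the keyword input time in fixed-size steps.

    transcript: list of (time_sec, word) for the playback audio data (assumed form).
    Returns (time of the detected keyword or None, number of searches performed).
    """
    far_edge = input_time
    for attempt in range(1, max_searches + 1):
        near_edge = far_edge                               # end of the previous window
        far_edge = max(near_edge - search_range, 0.0)      # OP29: reset the range
        hits = [t for t, word in transcript
                if far_edge <= t < near_edge and word == keyword]  # OP27: search
        if hits:                                           # OP28: Yes
            return max(hits), attempt                      # occurrence nearest the input time
        if far_edge == 0.0:                                # reached the beginning of the data
            return None, attempt
    return None, max_searches                              # detection failure
```

A `None` result corresponds to the "detection failure" that is reported to the user.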

<< Effects of the second embodiment >>
The information processing apparatus 2 of the second embodiment recognizes a keyword, uttered by the user, that relates to the information the user wants to listen to again, and searches for the keyword in the partial voice data contained in the range going back the search range from the keyword input time point. When the keyword is not detected in the partial voice data, the information processing apparatus 2 resets the search range. By searching for the keyword in this way, and by changing the search range according to the search result and searching again, the information processing apparatus 2 can accurately output the information that the user wants to listen to again.

また、利用者は、聞き逃しのような再確認したい情報に関するキーワードを発声するのみで、再確認したい情報を再度再生することができるので、操作が容易である。また、再確認したい情報に関連するキーワードを検索することにより、利用者は再確認したい情報をピンポイントで得ることができる。 Further, since the user can reproduce the information to be reconfirmed only by speaking a keyword related to the information to be reconfirmed such as missed listening, the operation is easy. Further, by searching for a keyword related to information to be reconfirmed, the user can pinpoint the information to be reconfirmed.

Like the information processing apparatus 1 of the first embodiment, the information processing apparatus 2 sets the search range according to the speech rate of the user's utterance. For example, the search range is set to become smaller as the speech rate increases. Alternatively, the search range may be set to become larger as the speech rate increases. Thus, the information processing apparatus 2 can control the setting of the keyword search range according to the speech rate.

また、情報処理装置２が、発話速度が小さいときに、すなわち、利用者がゆっくりと発話したときに、検索範囲を大きく設定する場合には、キーワードが検索範囲に含まれる可能性が高くなり、１回の検索でキーワードが検出される精度が向上する。 Further, when the information processing device 2 sets the search range to be large when the utterance speed is low, that is, when the user speaks slowly, the possibility that the keyword is included in the search range increases. The accuracy with which keywords are detected in one search is improved.

また、発話速度が大きい場合、すなわち、利用者が早口で発話した場合には、利用者が再確認したい情報が利用者の音声入力時点から遡って近い時点に存在する可能性が高い。情報処理装置２が発話速度が大きい場合に検索範囲を小さく設定することによって、情報処理装置のキーワード検索の処理量を低減することができ、効率の向上が期待できる。 In addition, when the speaking rate is high, that is, when the user speaks quickly, there is a high possibility that the information that the user wants to reconfirm is present at a point near the user's voice input point. By setting the search range to be small when the information processing device 2 has a high utterance speed, it is possible to reduce the amount of keyword search processing performed by the information processing device and to expect improvement in efficiency.

情報処理装置２は、キーワードの検索結果に従って、記憶部２３に保持される再生用音声データの読み出し開始位置（再生開始位置）を制御することができる。 The information processing apparatus 2 can control the reading start position (playback start position) of the playback audio data held in the storage unit 23 according to the keyword search result.

<Modification>
In the first and second embodiments, the utterance speed calculation unit 124 and the utterance speed calculation unit 254 calculate the speech rate from the mora count and the time length of the input voice. Instead of the mora count of the input voice, the utterance speed calculation unit 124 and the utterance speed calculation unit 254 may calculate the speech rate using, for example, the spectral characteristics of the input voice. The utterance speed calculation unit 124 and the utterance speed calculation unit 254 can use any commonly used method of calculating the speech rate.

In the first and second embodiments, the control unit 13 and the control unit 26 determine the retroactive range and the search range, respectively, in units of time. Instead of determining the retroactive range or the search range in units of time, the control unit 13 and the control unit 26 may determine it using the number of syllables, the number of words, breath groups, silent sections, or the like.

In the first and second embodiments, a process was described in which the information processing apparatus 1 and the information processing apparatus 2, while reproducing in real time the audio signals sequentially input from the network interface 107 or the tuner 108, are triggered by the user's voice input to reproduce the audio data from a time point a predetermined range before the input time point of the user's voice. The information processing apparatus 1 and the information processing apparatus 2 can also execute the processes described in the first and second embodiments while reproducing audio data held in advance in the auxiliary storage device 105 or the like.

In the second embodiment, the search unit 27 uses the word spotting technique as the method of searching for the keyword within the search range. Instead of the word spotting technique, the search unit 27 may use another voice recognition technique. For example, the playback audio data may be converted into text and stored in the storage unit 23, and the search unit 27 may convert the keyword voice data into text and search for the keyword in the text of the playback audio data contained in the search range.

In the second embodiment, after the speech recognition unit 252 finishes the speech recognition process for the keyword voice data, the recognition result may be confirmed with the user. For example, the speech recognition unit 252 outputs the recognition result of the keyword voice data to the output unit 29. On receiving the recognition result, the output unit 29 outputs a voice prompt asking the user whether the recognition result is correct. The voice prompt is stored in the auxiliary storage device 105 (FIG. 1). For example, when the recognition result of the keyword voice data is "Company A", the output unit 29 outputs the voice prompt "Is 'Company A' correct?". Confirming the recognition result with the user in this way prevents the search unit 27 from failing to detect the keyword because the keyword was misrecognized.

The second embodiment described an example in which the user explicitly utters a keyword (a word) related to the information to be heard again. When specifying that information during output of the playback audio data, the user may instead utter a sentence such as "I wonder what the stock price of Company A is." When the keyword voice input from the user is a sentence or a plurality of words, the information processing apparatus 2 performs, for example, the following process.

Take as an example the keyword voice input "I wonder what the stock price of Company A is." The auxiliary storage device 105 (FIG. 1) holds a list of keyword candidates. When this keyword voice data is input, the speech recognition unit 252 detects, from the keyword voice data, the keyword candidates in the list held in the auxiliary storage device 105, using, for example, a word spotting technique. Using the keyword candidates in the list, the speech recognition unit 252 can detect the keywords "Company A" and "stock price" from the keyword voice data. The speech recognition unit 252 may output both detected keywords, "Company A" and "stock price", to the search unit 27. Alternatively, priorities may be assigned within the keyword candidate list according to word type (for example, proper noun or common noun), and the speech recognition unit 252 may output to the search unit 27 whichever of the detected keywords "Company A" and "stock price" is selected according to those priorities. When a keyword is input, the search unit 27 executes the process described in the second embodiment. By holding a list of keyword candidates and extracting keywords from the keyword voice data in this way, the information processing apparatus 2 can properly output the information the user wants to hear even when the user's utterance is a sentence or the like.
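The candidate-list approach above can be sketched as follows. This is only an illustration under stated assumptions: a case-insensitive substring match on already-recognized text stands in for word spotting on audio, and the candidate list `CANDIDATES`, its priority values, and the function names are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical candidate list: keyword -> priority (smaller value = higher priority).
# Here proper nouns are given priority 0 and common nouns priority 1.
CANDIDATES: Dict[str, int] = {
    "Company A": 0,
    "Company B": 0,
    "stock price": 1,
    "weather": 1,
}


def extract_keywords(utterance_text: str, candidates: Dict[str, int]) -> List[str]:
    """Return every candidate found in the utterance, highest priority first."""
    lowered = utterance_text.lower()
    found = [k for k in candidates if k.lower() in lowered]
    return sorted(found, key=lambda k: candidates[k])


def pick_keyword(utterance_text: str, candidates: Dict[str, int]) -> Optional[str]:
    """Return only the highest-priority candidate, or None if nothing matched."""
    found = extract_keywords(utterance_text, candidates)
    return found[0] if found else None


utterance = "I wonder what the stock price of Company A is."
both = extract_keywords(utterance, CANDIDATES)
best = pick_keyword(utterance, CANDIDATES)
```

`extract_keywords` corresponds to passing both detected keywords to the search unit 27, while `pick_keyword` corresponds to passing only the one chosen by priority.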

<Others>
The following is further disclosed regarding the above embodiments.
(Appendix 1)
An information processing apparatus that reproduces audio data, comprising:
an input unit that accepts a voice uttered by a user;
a calculation unit that calculates a speech rate of the voice; and
a control unit that determines, according to the speech rate, a retroactive range going back from the point on the output time series of the audio data that corresponds to the input time of the voice.
(Appendix 2)
The information processing apparatus according to appendix 1, wherein the control unit determines the retroactive range so that the retroactive range becomes smaller as the speech rate increases.
(Appendix 3)
The information processing apparatus according to appendix 1, wherein the control unit sets a maximum value of the retroactive range corresponding to a lower limit value of the speech rate, sets a minimum value of the retroactive range corresponding to an upper limit value of the speech rate, and sets the retroactive range so that it narrows from the maximum value toward the minimum value as the speech rate increases from the lower limit value toward the upper limit value.
(Appendix 4)
The information processing apparatus according to appendix 1, wherein the control unit determines the retroactive range so that the retroactive range becomes larger as the speech rate of the voice increases.
(Appendix 5)
The information processing apparatus according to any one of appendices 1 to 4, further comprising an extraction unit that extracts partial audio data spanning from the point on the output time series of the audio data that corresponds to the input time of the voice back to the point reached by going back the retroactive range.
(Appendix 6)
The information processing apparatus according to any one of appendices 1 to 4, further comprising a search unit that searches whether the utterance content of the voice is included in partial audio data spanning from the point on the output time series of the audio data that corresponds to the input time of the voice back to the point reached by going back the retroactive range.
(Appendix 7)
The information processing apparatus according to appendix 6, wherein
the control unit expands the retroactive range when the utterance content of the voice is not included in the partial audio data, and
the search unit searches whether the utterance content of the voice is included in partial audio data spanning from the point on the output time series of the audio data that corresponds to the input time of the voice back to the point reached by going back the retroactive range expanded by the control unit.
(Appendix 8)
The information processing apparatus according to appendix 6 or 7, wherein, when a plurality of occurrences of the utterance content of the voice are detected in the partial audio data, the search unit takes at least one of the detected occurrences as the search result.
(Appendix 9)
The information processing apparatus according to any one of appendices 6 to 8, further comprising a determination unit that determines an output start point of the audio data based on the search result of the search unit.
(Appendix 10)
The information processing apparatus according to appendix 9, wherein the determination unit determines the utterance content of the voice detected by the search unit as the output start point of the audio data.
(Appendix 11)
The information processing apparatus according to appendix 10, wherein the determination unit determines, as the output start point of the audio data, a silent portion located earlier than the utterance content of the voice detected by the search unit.
(Appendix 12)
A program for causing an information processing apparatus that reproduces audio data to execute:
a step of accepting a voice uttered by a user;
a step of calculating a speech rate of the voice; and
a step of determining, according to the speech rate, a retroactive range going back from the point on the output time series of the audio data that corresponds to the input time of the voice.
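The mapping from speech rate to retroactive range, and the expansion of that range when the search fails, might be sketched as below. This is a minimal sketch under several assumptions: the appendices only require the range to narrow monotonically between the two limits, so the linear interpolation is one possible choice; the numeric limits, the unit (morae per second), the step size, and the function names are all hypothetical.

```python
from typing import Callable, Optional


def lookback_range(
    rate: float,
    rate_min: float = 4.0,    # hypothetical lower limit of speech rate (morae/s)
    rate_max: float = 10.0,   # hypothetical upper limit of speech rate (morae/s)
    range_max: float = 20.0,  # hypothetical range (s) at the lower rate limit
    range_min: float = 5.0,   # hypothetical range (s) at the upper rate limit
) -> float:
    """Map a speech rate to a retroactive range; faster speech gives a shorter range."""
    clamped = min(max(rate, rate_min), rate_max)
    fraction = (clamped - rate_min) / (rate_max - rate_min)
    # Linear interpolation is an assumption; any monotonically decreasing map would fit.
    return range_max - fraction * (range_max - range_min)


def search_with_expansion(
    search: Callable[[float], Optional[float]],
    initial_range: float,
    step: float,
    limit: float,
) -> Optional[float]:
    """Call search(range) with a growing range until it returns a hit or limit is passed."""
    current = initial_range
    while current <= limit:
        hit = search(current)
        if hit is not None:
            return hit
        current += step  # expand the retroactive range and search again
    return None


# Stand-in search: pretend the keyword is only found once the range reaches 15 s.
def fake_search(r: float) -> Optional[float]:
    return 42.0 if r >= 15.0 else None


found_at = search_with_expansion(fake_search, lookback_range(10.0), step=5.0, limit=30.0)
```

Swapping the sign of the interpolation would give the opposite behavior, in which the range grows with the speech rate.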

DESCRIPTION OF SYMBOLS
1, 2 Information processing apparatus
11 Input unit
12, 25 Calculation unit
13, 26 Control unit
14 Extraction unit
21 Data input unit
22 Recording unit
23 Storage unit
24 Voice input unit
27 Search unit
28 Output range determination unit
29 Output unit
101 Processor
102 Main storage device
103 Microphone
104 Output device
105 Auxiliary storage device
107 Network interface
108 Tuner
109 Bus
121, 251 Section detection unit
122, 252 Speech recognition unit
123, 253 Mora count calculation unit
124, 254 Speech rate calculation unit
261 Correspondence table storage unit

Claims

An information processing apparatus that reproduces audio data, comprising:
an input unit that accepts a voice uttered by a user;
a calculation unit that calculates a speech rate of the voice; and
a control unit that determines, according to the speech rate, a retroactive range going back from the point on the output time series of the audio data that corresponds to the input time of the voice.

The information processing apparatus according to claim 1, wherein the control unit determines the retroactive range so that the retroactive range becomes smaller as the speech rate increases.

The information processing apparatus according to claim 1 or 2, further comprising a search unit that searches whether the utterance content of the voice is included in partial audio data spanning from the point on the output time series of the audio data that corresponds to the input time of the voice back to the point reached by going back the retroactive range.

The information processing apparatus according to claim 3, wherein
the control unit expands the retroactive range when the utterance content of the voice is not included in the partial audio data, and
the search unit searches whether the utterance content of the voice is included in partial audio data spanning from the point on the output time series of the audio data that corresponds to the input time of the voice back to the point reached by going back the retroactive range expanded by the control unit.

A program for causing an information processing apparatus that reproduces audio data to execute:
a step of accepting a voice uttered by a user;
a step of calculating a speech rate of the voice; and
a step of determining, according to the speech rate, a retroactive range going back from the point on the output time series of the audio data that corresponds to the input time of the voice.