JP2010044241A - Voice recognition device and control program of same

Info

Publication number: JP2010044241A
Application number: JP2008208546A
Authority: JP
Inventors: Toshiki Endo; 俊樹遠藤; Koji Sase; 孝司佐瀬
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2008-08-13
Filing date: 2008-08-13
Publication date: 2010-02-25

Abstract

PROBLEM TO BE SOLVED: To lessen the processing load and reduce the response time in voice recognition processing.

SOLUTION: This voice recognition device includes: a recognition processing request receiving means 10 to which a voice recognition request and a voice signal are input; a similarity measuring means 31 for measuring the similarity between the voice signal input together with the voice recognition request and voice signals stored in a database in the past; and a voice recognition processing means 20 for performing voice recognition processing on the input voice signal using an acoustic model 21 and a language model 22. Before recognition processing is performed, the recognition processing request receiving means 10 measures the similarity to the voice signals of the relevant user stored in the database in the past. When a voice signal having high similarity exists, it reads the voice recognition result for that voice signal from the database 33 and outputs it as the voice recognition result for the voice signal input together with the voice recognition request.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to speech recognition technology that uses past speech recognition results.

Conventionally, a speech recognition device is known that, when speech is input, measures its similarity to speech input in the past and, when the similarity is at or above a certain threshold, outputs the recognition result corresponding to that speech (Patent Document 1). Among the standard patterns of the recognition target words, this device stores the partial standard patterns that are subject to similarity calculation against the input pattern obtained when the speech analysis unit analyzes the speech waveform, together with the cumulative similarities of these patterns. It compares each cumulative similarity with the current threshold and counts the partial standard patterns whose similarity is high. It then calculates a pruning threshold from the correspondence between the current threshold and that count and, based on the pruning threshold, calculates the similarity to the feature quantity only for the partial standard patterns whose similarity is high. Finally, the word corresponding to the standard pattern with the highest similarity among the partial standard patterns is determined as the recognition result.
Patent Document 1: JP-A-10-153999

However, the technique described in Patent Document 1 calculates the similarity to all speech input in the past, so the number of speech data items to compare grows and the processing time becomes long. Moreover, because similarity is also calculated against the speech of other users, without taking user information into account, the processing time becomes long and an incorrect recognition result may be returned. Furthermore, because similarity is calculated regardless of whether the recognition result of the comparison speech was correct, the processing time becomes long and an erroneous recognition result may be output. In addition, the input is also compared with speech that the user uttered repeatedly without realizing it was an out-of-vocabulary utterance, so a recognition result the user does not want may be output again and again.

The present invention has been made in view of these circumstances, and its object is to provide a speech recognition device and a program that can shorten the processing time required for similarity measurement and avoid errors in recognition results.

(1) To achieve the above object, the present invention takes the following measures. That is, the speech recognition device of the present invention is a speech recognition device that, when the similarity between a speech signal input together with a speech recognition request and a speech signal accumulated in the past is high, outputs the past speech recognition result with the high similarity as the speech recognition result for the speech signal input together with the speech recognition request. The device comprises: recognition processing request receiving means for inputting a speech recognition request and a speech signal; similarity measuring means for measuring the similarity between the speech signal input together with the speech recognition request and the speech signals already accumulated in a database; and speech recognition processing means for performing speech recognition processing on the input speech signal using an acoustic model and a language model. If the measurement shows that the speech signals already accumulated in the database include a speech signal with high similarity, the recognition processing request receiving means reads the speech recognition result for that speech signal from the database and outputs it as the speech recognition result for the speech signal input together with the speech recognition request. If the accumulated speech signals include no speech signal with high similarity, it outputs the speech signal input together with the speech recognition request to the speech recognition processing means and causes speech recognition processing to be executed.

In this way, when the speech signals accumulated in the database in the past include a speech signal with high similarity, the speech recognition result for that signal is read from the database and output as the speech recognition result for the speech signal input together with the speech recognition request. Speech recognition processing, which has a large processing load, therefore need not be performed, so the processing load can be reduced and the response time shortened.

(2) In the speech recognition device of the present invention, the recognition processing request receiving means inputs a speech recognition request, a speech signal, and a user ID. The similarity measuring means acquires the user ID and the speech signal input together with the speech recognition request, and measures the similarity between the speech data input together with the speech recognition request and, among the speech signals already accumulated in the database, the speech signals identified by the user ID.

In this way, the user ID and the speech signal input together with the speech recognition request are acquired, and similarity is measured between the input speech data and, among the speech signals already accumulated in the database, the speech signals identified by the user ID. Similarity is therefore measured only against the speech signals identified by the user ID, which shortens the processing time and avoids errors in the recognition result.

(3) In the speech recognition device of the present invention, the similarity measuring means measures the similarity between the speech data input together with the speech recognition request and, among the speech signals already accumulated in the database, those speech signals that are identified by the user ID and whose recognition score, included in the recognition result of the speech recognition processing means, is higher than a predetermined threshold.

In this way, similarity is measured between the speech data input together with the speech recognition request and, among the speech signals already accumulated in the database, those that are identified by the user ID and whose recognition score, included in the recognition result of the speech recognition processing means, is higher than a predetermined threshold. This avoids outputting a past erroneous recognition result and shortens the processing time required for similarity measurement.
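The candidate narrowing described in (2) and (3) can be sketched as a simple filter. This is a minimal illustration, not the patented implementation: the `StoredUtterance` record, its field names, and the sample scores are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoredUtterance:
    """Hypothetical record of one past utterance and its recognition result."""
    user_id: str
    audio_name: str
    recognized_text: str
    recognition_score: float


def select_candidates(history: List[StoredUtterance], user_id: str,
                      score_threshold: float) -> List[StoredUtterance]:
    """Keep only the requesting user's utterances whose score exceeds the threshold."""
    return [
        u for u in history
        if u.user_id == user_id and u.recognition_score > score_threshold
    ]


history = [
    StoredUtterance("alice", "a1", "weather", 0.92),
    StoredUtterance("alice", "a2", "whether", 0.41),  # low score: likely misrecognized
    StoredUtterance("bob", "b1", "weather", 0.95),    # other user: never compared
]
candidates = select_candidates(history, "alice", score_threshold=0.8)
print([u.audio_name for u in candidates])
```

Only the surviving candidates would be passed on to the (more expensive) distance calculation.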

(4) In the speech recognition device of the present invention, when a plurality of speech signals corresponding to the user ID were input within a predetermined period before the time at which the speech signal accompanying the speech recognition request was input, and these signals are similar to one another, the similarity measuring means excludes them from the similarity measurement targets. Among the speech signals that correspond to the user ID and remain measurement targets, the speech recognition result for the speech signal whose similarity is at or above a threshold and is the highest is read from the database and output as the speech recognition result for the speech signal input together with the speech recognition request. If, among those measurement targets, no speech signal whose similarity is at or above the threshold was input within the predetermined period, the speech signal input together with the speech recognition request is output to the speech recognition processing means, which is caused to execute speech recognition processing.

In this way, when a plurality of speech signals corresponding to the user ID were input within a predetermined period before the time at which the speech signal accompanying the speech recognition request was input, and these signals are similar to one another, they are excluded from the similarity measurement targets. This reduces the processing load, shortens the processing time, and avoids outputting an erroneous recognition result again and again. In addition, among the speech signals that correspond to the user ID and are measurement targets, the speech recognition result for the speech signal whose similarity is at or above the threshold and is the highest is read from the database and output as the speech recognition result for the speech signal input together with the speech recognition request, which raises the probability of outputting a correct speech recognition result.

(5) In the speech recognition device of the present invention, when a plurality of speech signals corresponding to the user ID were input at least a predetermined period before the time at which the speech signal accompanying the speech recognition request was input, and these signals are similar to the speech signal input together with the speech recognition request, the recognition processing request receiving means reads from the database the speech recognition result for the speech signal whose similarity is at or above a threshold and is the highest, and outputs it as the speech recognition result for the speech signal input together with the speech recognition request. If no speech signal whose similarity is at or above the threshold was input at least the predetermined period before that time, it outputs the speech signal input together with the speech recognition request to the speech recognition processing means and causes speech recognition processing to be executed.

In this way, when a plurality of speech signals corresponding to the user ID were input at least a predetermined period before the time at which the speech signal accompanying the speech recognition request was input, and these signals are similar to that speech signal, the speech recognition result for the speech signal whose similarity is at or above the threshold and is the highest is read from the database and output as the speech recognition result for the speech signal input together with the speech recognition request. This raises the probability of outputting a correct speech recognition result.

(6) In the speech recognition device of the present invention, when a plurality of speech signals corresponding to the user ID were input at least a predetermined period before the time at which the speech signal accompanying the speech recognition request was input, and these signals are similar to the speech signal input together with the speech recognition request, the recognition processing request receiving means reads from the database the speech recognition result for the most recent speech signal whose similarity is at or above a threshold, and outputs it as the speech recognition result for the speech signal input together with the speech recognition request. If no speech signal whose similarity is at or above the threshold was input at least the predetermined period before that time, it outputs the speech signal input together with the speech recognition request to the speech recognition processing means and causes speech recognition processing to be executed.

In this way, when a plurality of speech signals corresponding to the user ID were input at least a predetermined period before the time at which the speech signal accompanying the speech recognition request was input, and these signals are similar to that speech signal, the speech recognition result for the most recent speech signal whose similarity is at or above the threshold is read from the database and output as the speech recognition result for the speech signal input together with the speech recognition request. This raises the probability of outputting a correct speech recognition result.
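Variants (5) and (6) differ only in which of several sufficiently similar past utterances supplies the result: the closest one or the most recent one. The sketch below illustrates that choice under stated assumptions: the `Match` tuple, the `policy` parameter, and the use of a plain number for the access time are hypothetical, and `matches` is assumed to contain only utterances that already passed the similarity threshold.

```python
from typing import NamedTuple, Optional, Sequence


class Match(NamedTuple):
    """Hypothetical past utterance that already passed the similarity threshold."""
    distance: float      # smaller distance means higher similarity
    access_time: float   # e.g. seconds since some epoch
    text: str            # stored recognition result


def pick_match(matches: Sequence[Match], policy: str = "closest") -> Optional[str]:
    """Return the stored result chosen by the given policy, or None if no match."""
    if not matches:
        return None  # caller falls back to full speech recognition
    if policy == "closest":   # variant (5): highest similarity
        return min(matches, key=lambda m: m.distance).text
    if policy == "latest":    # variant (6): most recent utterance
        return max(matches, key=lambda m: m.access_time).text
    raise ValueError(f"unknown policy: {policy}")


matches = [
    Match(distance=0.30, access_time=100.0, text="call home"),
    Match(distance=0.10, access_time=50.0, text="call Tom"),
]
print(pick_match(matches, "closest"), "|", pick_match(matches, "latest"))
```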

(7) The control program of the speech recognition device of the present invention is a control program for a speech recognition device that, when the similarity between a speech signal input together with a speech recognition request and a speech signal accumulated in the past is high, outputs the past speech recognition result with the high similarity as the speech recognition result for the speech signal input together with the speech recognition request. The program is characterized in that a series of processes is coded as commands readable and executable by a computer, the series including: a process of inputting a speech recognition request and a speech signal; a process of measuring the similarity between the speech signal input together with the speech recognition request and the speech signals already accumulated in a database; a process of, if the measurement shows that the accumulated speech signals include a speech signal with high similarity, reading the speech recognition result for that speech signal from the database and outputting it as the speech recognition result for the speech signal input together with the speech recognition request; and a process of, if the accumulated speech signals include no speech signal with high similarity, performing speech recognition processing on the speech signal input together with the speech recognition request using an acoustic model and a language model.

According to the present invention, the user ID is used to select the accumulated speech whose similarity to the input speech is measured, so the processing time required for similarity measurement can be shortened and output of an erroneous recognition result avoided. In addition, only accumulated speech whose recognition score is at or above a certain threshold is targeted for similarity measurement, so output of a past erroneous recognition result can be avoided and the processing time required for similarity measurement shortened. Furthermore, access information is used to judge that, among multiple utterances made in the recent past, those that are similar to one another are out-of-vocabulary or misrecognized utterances, and these are excluded from the similarity measurement targets. The processing load and the processing time can therefore be reduced, and repeated output of an erroneous recognition result avoided.

In the speech recognition device according to this embodiment, when speech similar to accumulated speech uttered by the user in the past is input, the past speech recognition result is returned without performing speech recognition processing. This makes it possible to output the speech recognition result with a short response time and to reduce the processing load. Speech recognition processing imposes a large processing load and takes time, whereas similarity determination imposes a smaller load and takes less time. Accordingly, when speech similar to accumulated speech uttered in the past is input, returning the past speech recognition result instead of performing speech recognition processing reduces the processing load and shortens the response time. The embodiment is described below with reference to the drawings.

FIG. 1 is a diagram showing the schematic configuration of the speech recognition device according to this embodiment. The recognition processing request receiving means 10 receives a speech recognition request and speech data from a user terminal and instructs the speech recognition processing means 20 to perform recognition processing. It also returns the recognition result to the user terminal and stores the input speech, the recognition result, and the access information in their respective DBs. On receiving a speech recognition request, the processing control means 13 sends the input speech to the similarity measuring means 31 and instructs it to measure the similarity to past accumulated speech. If similar past speech exists, the speech recognition result for that speech is acquired from the past history DB 33 and output. If no similar past speech exists, the speech recognition processing means 20 is caused to execute speech recognition processing.

The speech recognition processing means 20 performs recognition processing using the acoustic model 21 and the language model 22 in accordance with the recognition request received from the recognition processing request receiving means 10, and returns the recognition result to the recognition processing request receiving means 10.

The similarity measuring means 31 determines the similarity between the input speech and accumulated speech data uttered in the past. When the calculated distance between the speech data is at or below a threshold, the two are judged to be similar. Similarity is measured only against speech of the corresponding user ID. The past history DB 33 comprises an accumulated speech DB 33a, a recognition result DB 33b, and an access information DB 33c, which are described in detail below.

FIG. 2 is a diagram showing the schematic configuration of the past history DB 33 shown in FIG. 1. The accumulated speech DB 33a stores input speech. The stored speech data may be speech data in PCM or a similar format, or spectral-domain data, cepstral-domain data, VQ (vector quantization) data, and so on. The recognition result DB 33b stores recognition results, each consisting of the recognized text and a recognition score. The recognition score may also be held separately as an acoustic likelihood and a language probability. The access information DB 33c stores access information: the access time, the accessing user ID, the name of the corresponding speech data stored in the accumulated speech DB, and the name of the corresponding recognition result file stored in the recognition result DB.
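The three stores of the past history DB 33 can be pictured as linked records. The following is a minimal in-memory sketch, assuming dictionaries keyed by data name; the class names, field names, and the `store` and `records_for_user` helpers are hypothetical and only mirror the fields listed above.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional, Sequence

Frames = Sequence[Sequence[float]]  # one feature vector per frame (assumed layout)


@dataclass
class RecognitionResult:
    text: str                                    # recognized characters
    score: float                                 # overall recognition score
    acoustic_likelihood: Optional[float] = None  # optional split of the score
    language_probability: Optional[float] = None


@dataclass
class AccessRecord:
    access_time: datetime
    user_id: str
    audio_name: str    # key into the accumulated speech store
    result_name: str   # key into the recognition result store


@dataclass
class PastHistoryDB:
    stored_audio: Dict[str, Frames] = field(default_factory=dict)
    results: Dict[str, RecognitionResult] = field(default_factory=dict)
    access_log: List[AccessRecord] = field(default_factory=list)

    def store(self, record: AccessRecord, frames: Frames,
              result: RecognitionResult) -> None:
        """Save one utterance, its result, and the access record linking them."""
        self.stored_audio[record.audio_name] = frames
        self.results[record.result_name] = result
        self.access_log.append(record)

    def records_for_user(self, user_id: str) -> List[AccessRecord]:
        """Return the access records belonging to one user ID."""
        return [r for r in self.access_log if r.user_id == user_id]


db = PastHistoryDB()
rec = AccessRecord(datetime(2008, 8, 13, 9, 0), "user-1", "utt-0001", "res-0001")
db.store(rec, [[0.1, 0.2]], RecognitionResult("hello", 0.9))
print(db.results[db.records_for_user("user-1")[0].result_name].text)
```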

FIG. 3 is a block diagram showing the schematic configuration of the similarity measuring means 31 shown in FIG. 1. In the similarity measuring means 31, the similarity determination control means 31a receives the input speech data and the user ID from the recognition processing request receiving means 10 and then acquires the speech data with the same user ID from the accumulated speech DB 33a. The data processing means 31c converts the input speech and the speech data acquired from the accumulated speech DB 33a into the same data format. For example, if both are speech data such as PCM or spectral-domain data, they are converted into spectral-domain data, cepstral-domain data, VQ data, or the like. If both are cepstral-domain data, they are converted into cepstral-domain data, VQ data, or the like. If both are VQ data, they are left as they are.
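The format rules above can be read as an ordering of representations from least to most processed (PCM, spectrum, cepstrum, VQ), where data can only move forward along that order. The sketch below encodes that reading. It is an assumption-laden illustration: the format labels and the `common_targets` helper are hypothetical, and the handling of two inputs in different formats is an extrapolation, since the text only describes the case where both inputs have the same kind of format.

```python
from typing import List

# Assumed processing order; each format can be derived from those before it.
FORMAT_LEVEL = {"pcm": 0, "spectrum": 1, "cepstrum": 2, "vq": 3}


def common_targets(fmt_a: str, fmt_b: str) -> List[str]:
    """List the formats into which both inputs can be converted for comparison."""
    # Comparison happens in the spectral domain or later, never on raw PCM,
    # and never in a format less processed than either input.
    lowest = max(FORMAT_LEVEL[fmt_a], FORMAT_LEVEL[fmt_b], FORMAT_LEVEL["spectrum"])
    return [fmt for fmt, level in FORMAT_LEVEL.items() if level >= lowest]


print(common_targets("pcm", "pcm"))
print(common_targets("cepstrum", "cepstrum"))
print(common_targets("vq", "vq"))
```

The actual conversions (spectral analysis, cepstral analysis, vector quantization) are outside the scope of this sketch.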

The distance calculation means 31d calculates the distance between speech data or the distance between recognition results. The calculated distance is output to the inter-speech distance information table, the inter-recognition-result distance information table, and the similarity determination means 31e. The similarity determination means 31e judges the data to be similar when the distance calculated by the distance calculation means 31d is at or below a threshold. It outputs the judgment result and information on the corresponding data to the recognition processing request receiving means 10.

The similarity measuring means 31 shown in FIG. 1 may also take another configuration. FIG. 4 is a diagram showing another configuration of the similarity measuring means 31. In FIG. 4, the access information analysis means 31b receives the input speech data and the user ID from the processing control means 13 of the recognition processing request receiving means 10, then acquires the access information for that user ID from the access information DB 33c and selects the speech data to be used for similarity determination. The access information analysis means 31b selects, as the speech data whose similarity is measured, data that satisfies either one of the following two conditions. The other components in FIG. 4 are the same as in FIG. 3.
(Condition 1) Among multiple utterances made within T of the current time, those that are similar to one another are excluded from the similarity measurement targets. Then, among the speech signals that correspond to the user ID and remain measurement targets, only the one with the highest similarity (smallest distance) is selected.
(Condition 2) Speech data separated from the current time by an interval of T' or more is selected.
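One possible reading of the two conditions is sketched below: recent utterances that resemble one another are treated as repeated attempts and dropped, other recent utterances stay, and utterances older than T' are always kept. This is an interpretation, not the patented procedure. The function name, the `(access_time, audio_name)` tuples, the `are_similar` callback, and the decision to drop utterances that fall between T and T' are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Callable, List, Tuple

Record = Tuple[datetime, str]  # (access_time, audio_name), a hypothetical layout


def select_by_access_time(records: List[Record], now: datetime,
                          recent_window: timedelta, old_gap: timedelta,
                          are_similar: Callable[[str, str], bool]) -> List[str]:
    """Choose which stored utterances remain targets for similarity measurement."""
    recent = [r for r in records if now - r[0] <= recent_window]
    # Recent utterances that resemble another recent utterance are treated as
    # repeated out-of-vocabulary or misrecognized attempts (Condition 1).
    repeated = {
        a[1] for a in recent for b in recent
        if a is not b and are_similar(a[1], b[1])
    }
    selected = []
    for access_time, name in records:
        age = now - access_time
        if age <= recent_window:
            if name not in repeated:
                selected.append(name)
        elif age >= old_gap:  # Condition 2
            selected.append(name)
    return selected


now = datetime(2008, 8, 13, 12, 0, 0)
records = [
    (now - timedelta(seconds=10), "a"),
    (now - timedelta(seconds=20), "b"),
    (now - timedelta(seconds=30), "c"),
    (now - timedelta(days=2), "d"),
    (now - timedelta(hours=1), "e"),
]
similar_pairs = {frozenset(("a", "b"))}  # "a" and "b" are repeated attempts
chosen = select_by_access_time(
    records, now, recent_window=timedelta(minutes=1), old_gap=timedelta(days=1),
    are_similar=lambda x, y: frozenset((x, y)) in similar_pairs,
)
print(chosen)
```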

The similarity measuring means 31 shown in FIG. 1 may take yet another configuration. FIG. 5 is a diagram showing another configuration of the similarity measuring means 31. In FIG. 5, the access information/recognition result analysis means 31f receives the input speech data and the user ID from the processing control means 13 of the recognition processing request receiving means 10, then acquires the access information for that user ID from the access information DB 33c, acquires only the recognition results with a high recognition score from the recognition result DB, and selects the speech data to be used for similarity determination. The other components in FIG. 5 are the same as in FIGS. 3 and 4.

FIG. 6 is a diagram showing the concept of the distance calculation method performed by the distance calculation means 31d shown in FIGS. 3 and 4. In this distance calculation, the distance between two utterances with different numbers of frames is obtained using DTW (dynamic time warping). Examples of the distance between frames include the following distance measures.
(1) Euclidean distance between log spectra, LPC spectra, cepstra, or VQ data.
(2) Maximum-likelihood spectral distance using LPC spectra.
(3) The cosh measure.
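As an illustration of the DTW-based distance, the following is a textbook dynamic time warping sketch using the Euclidean frame distance from item (1). It is not taken from the patent: the step pattern (match, insertion, deletion with equal weights), the absence of path-length normalization, and the function names are assumptions.

```python
import math
from typing import Callable, Sequence

Frame = Sequence[float]


def euclidean(a: Frame, b: Frame) -> float:
    """Euclidean distance between two feature frames of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a: Sequence[Frame], seq_b: Sequence[Frame],
                 frame_distance: Callable[[Frame, Frame], float] = euclidean) -> float:
    """Accumulated frame distance along the best time alignment of two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best accumulated distance aligning the first i frames of
    # seq_a with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b's frame
                                 cost[i][j - 1],      # stretch seq_a's frame
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# The second utterance holds the middle frame one frame longer; DTW absorbs it.
a = [[0.0], [1.0], [2.0]]
b = [[0.0], [1.0], [1.0], [2.0]]
print(dtw_distance(a, b))
```

The resulting distance would then be compared with the similarity threshold; a smaller distance corresponds to a higher similarity.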

FIG. 7 is a diagram showing the concept of the operation of the similarity determination means 31e shown in FIGS. 3 and 4. For multiple utterances whose access times are within T of the current time, the similarity determination means 31e excludes those that are similar to one another from the similarity measurement targets. Then, among the speech signals that correspond to the user ID and remain measurement targets, only the one with the highest similarity (smallest distance) is made a determination target. Speech whose access time is separated from the current time by an interval of T' or more is also made a determination target.

FIG. 8 is a flowchart showing the operation of the speech recognition device according to this embodiment. First, a speech recognition request is received (step S1), and the input speech and the user ID are sent to the similarity measuring means 31 (step S2). Next, the speech of that user ID, against which similarity is to be measured, is acquired from the accumulated speech DB 33a (step S3), and the data formats of the input speech and the past speech are unified (step S4). The distance between the input speech and the past speech is then calculated (step S5).

Next, similarity is determined from the distance between the voices (step S6). If they are not similar, that is, if the distance between the voices is greater than or equal to a threshold, voice recognition processing is performed using the input voice (step S7), and the recognition result is returned (step S8). On the other hand, if they are similar in step S6, that is, if the distance between the voices is less than or equal to the threshold, the voice recognition result of the past voice similar to the input voice is acquired from the recognition result DB 33b (step S9), and the acquired past voice recognition result is returned (step S10).
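The flow of steps S1 to S10 can be sketched as follows. This is a minimal illustration and not the patented implementation: `handle_request`, the in-memory `stored` dictionary standing in for the accumulated voice DB 33a and recognition result DB 33b, and the caller-supplied `distance_fn` and `recognize_fn` are all hypothetical. Step S4 (data format unification) is omitted, and a distance exactly equal to the threshold is treated as similar, which is an assumption because the text assigns the equality case to both branches.

```python
def handle_request(input_voice, user_id, stored, distance_fn, recognize_fn, threshold):
    """Return (recognition_result, reused_flag) for one recognition request.

    stored: dict mapping user_id -> list of (past_voice, past_result) pairs.
    distance_fn(a, b): distance between two voices (smaller = more similar).
    recognize_fn(voice): full recognition with acoustic and language models.
    """
    best = None  # (distance, past_result) of the closest past voice
    for past_voice, past_result in stored.get(user_id, []):   # S3
        d = distance_fn(input_voice, past_voice)               # S5
        if best is None or d < best[0]:
            best = (d, past_result)

    if best is not None and best[0] <= threshold:              # S6: similar
        return best[1], True                                   # S9, S10
    return recognize_fn(input_voice), False                    # S7, S8
```

The second element of the returned tuple only indicates whether a past result was reused. It is included so the two branches can be told apart in a test.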

The characteristic operations of the present embodiment described above can be performed by causing a computer to execute a program. That is, the control program according to the present embodiment is a control program for a speech recognition apparatus that, when the similarity between a voice signal input together with a voice recognition request and a voice signal accumulated in the past is high, outputs the past voice recognition result having the high similarity as the voice recognition result for the voice signal input together with the voice recognition request. The program is characterized in that a series of processes is converted into commands that a computer can read and execute, the series of processes including: a process of inputting a voice recognition request and a voice signal; a process of measuring the similarity between the voice signal input together with the voice recognition request and the voice signals already stored in a database; a process of, when the measurement shows that a voice signal having a high similarity exists among the voice signals already stored in the database, reading the voice recognition result for that voice signal from the database and outputting it as the voice recognition result for the voice signal input together with the voice recognition request; and a process of, when no voice signal having a high similarity exists among the voice signals already stored in the database, performing voice recognition processing on the voice signal input together with the voice recognition request using an acoustic model and a language model.

FIG. 1 is a diagram showing a schematic configuration of the speech recognition apparatus according to this embodiment.
FIG. 2 is a diagram showing a schematic configuration of the past history DB 33 shown in FIG. 1.
FIG. 3 is a block diagram showing a schematic configuration of the similarity measuring means 31 shown in FIG. 1.
FIG. 4 is a diagram showing another configuration of the similarity measuring means 31.
FIG. 5 is a diagram showing another configuration of the similarity measuring means 31.
FIG. 6 is a diagram showing the concept of the distance calculation method performed by the distance calculation means 31d shown in FIGS. 3 and 4.
FIG. 7 is a diagram showing the concept of the operation of the similarity determination means 31e shown in FIGS. 3 and 4.
FIG. 8 is a flowchart showing the operation of the speech recognition apparatus according to this embodiment.

Explanation of symbols

10 Recognition processing request receiving means
13 Processing control means
20 Speech recognition processing means
21 Acoustic model
22 Language model
31 Similarity measuring means
31a Similarity determination control means
31b Access information analysis means
31c Data processing means
31d Distance calculation means
31e Similarity determination means
31f Access information / recognition result analysis means
33 Past history DB
33a Accumulated voice DB
33b Recognition result DB
33c Access information DB

Claims

A speech recognition apparatus that, when the similarity between a speech signal input together with a speech recognition request and a speech signal accumulated in the past is high, outputs the past speech recognition result having the high similarity as the speech recognition result for the speech signal input together with the speech recognition request, the apparatus comprising:
recognition processing request receiving means for inputting a voice recognition request and a voice signal;
similarity measuring means for measuring the similarity between the voice signal input together with the voice recognition request and the voice signals already stored in a database; and
voice recognition processing means for performing voice recognition processing of an input voice signal using an acoustic model and a language model,
wherein the recognition processing request receiving means, when the measurement shows that a voice signal having a high similarity exists among the voice signals already stored in the database, reads the voice recognition result for that voice signal from the database and outputs it as the voice recognition result for the voice signal input together with the voice recognition request, and, when no voice signal having a high similarity exists among the voice signals already stored in the database, outputs the voice signal input together with the voice recognition request to the voice recognition processing means to execute voice recognition processing.

The speech recognition apparatus according to claim 1, wherein the recognition processing request receiving means inputs a voice recognition request, a voice signal, and a user ID, and
the similarity measuring means acquires the user ID and the voice signal input together with the voice recognition request, and measures the similarity between the voice signal input together with the voice recognition request and, among the voice signals already stored in the database, the voice signals specified by the user ID.

The speech recognition apparatus according to claim 2, wherein the similarity measuring means measures the similarity between the voice signal input together with the voice recognition request and, among the voice signals already stored in the database, a voice signal that is specified by the user ID and whose recognition score, included in the recognition result of the voice recognition processing means, is higher than a predetermined threshold.

The speech recognition apparatus according to claim 2 or claim 3, wherein, when a plurality of voice signals corresponding to the user ID were input within a predetermined period before the time at which the voice signal input together with the voice recognition request was input, and these voice signals are similar to one another, the similarity measuring means excludes them from the similarity measurement targets; reads from the database the voice recognition result for the voice signal having the highest similarity, the similarity being equal to or higher than a threshold, among the voice signals that correspond to the user ID and are similarity measurement targets, and outputs it as the voice recognition result for the voice signal input together with the voice recognition request; and, when no voice signal whose similarity is equal to or higher than the threshold, among the voice signals that correspond to the user ID and are similarity measurement targets, was input within the predetermined period, outputs the voice signal input together with the voice recognition request to the voice recognition processing means to execute voice recognition processing.

The speech recognition apparatus according to claim 2, wherein, when a plurality of voice signals corresponding to the user ID were input at least a predetermined period before the time at which the voice signal input together with the voice recognition request was input, and these voice signals are similar to the voice signal input together with the voice recognition request, the recognition processing request receiving means reads from the database the voice recognition result for the voice signal having the highest similarity among those whose similarity is equal to or higher than a threshold, and outputs it as the voice recognition result for the voice signal input together with the voice recognition request; and, when no voice signal whose similarity is equal to or higher than the threshold was input at least the predetermined period before that time, outputs the voice signal input together with the voice recognition request to the voice recognition processing means to execute voice recognition processing.

The speech recognition apparatus according to claim 2 or claim 3, wherein, when a plurality of voice signals corresponding to the user ID were input at least a predetermined period before the time at which the voice signal input together with the voice recognition request was input, and these voice signals are similar to the voice signal input together with the voice recognition request, the recognition processing request receiving means reads from the database the voice recognition result for the most recent voice signal among those whose similarity is equal to or higher than a threshold, and outputs it as the voice recognition result for the voice signal input together with the voice recognition request; and, when no voice signal whose similarity is equal to or higher than the threshold was input at least the predetermined period before that time, outputs the voice signal input together with the voice recognition request to the voice recognition processing means to execute voice recognition processing.

A control program for a speech recognition apparatus that, when the similarity between a speech signal input together with a speech recognition request and a speech signal accumulated in the past is high, outputs the past speech recognition result having the high similarity as the speech recognition result for the speech signal input together with the speech recognition request, characterized in that a series of processes is converted into commands that a computer can read and execute, the series of processes including:
a process of inputting a voice recognition request and a voice signal;
a process of measuring the similarity between the voice signal input together with the voice recognition request and the voice signals already stored in a database;
a process of, when the measurement shows that a voice signal having a high similarity exists among the voice signals already stored in the database, reading the voice recognition result for that voice signal from the database and outputting it as the voice recognition result for the voice signal input together with the voice recognition request; and
a process of, when no voice signal having a high similarity exists among the voice signals already stored in the database, performing voice recognition processing on the voice signal input together with the voice recognition request using an acoustic model and a language model.