JP2005062572A

JP2005062572A - Speech recognition apparatus

Info

Publication number: JP2005062572A
Application number: JP2003293836A
Authority: JP
Inventors: Toshiki Endo; 俊樹遠藤; Satoru Nakamura; 哲中村
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2003-08-15
Filing date: 2003-08-15
Publication date: 2005-03-10
Anticipated expiration: 2023-08-15
Also published as: JP3965141B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition apparatus that recognizes speech with relatively high accuracy even when a large loss is present in a sequence of speech feature parameters.

SOLUTION: The speech recognition apparatus includes an input buffer 30 which receives speech data frames; a frame buffer 36 which stores frames in order of frame number; a frame loss detection part 32 which detects the occurrence of frame loss; a feature parameter estimation part 34 which estimates feature data for as many frames as were lost, on the basis of the feature data in the frame buffer 36 and frame order information, and inserts the estimated feature data at the prescribed position in the frame buffer 36; and a speech recognition part 38 which reads frames out of the frame buffer 36 in order and performs speech recognition.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to speech recognition technology, and more particularly to a speech recognition apparatus capable of recognizing speech with high accuracy even when packet loss occurs in a speech signal that has been converted into feature vectors and is transmitted in packet form.

With advances in speech recognition technology and the spread of mobile terminals such as mobile phones and PDAs (Personal Digital Assistants), speech recognition services using mobile terminals are expected to come into wide use. However, the resources available on a mobile terminal (processing capacity, power supply, and so on) are limited. It is therefore desirable to limit the power consumption and processing load on the mobile terminal and to eliminate any influence on speech codec processing. For this reason, Distributed Speech Recognition (DSR) has been standardized by the European Telecommunications Standards Institute (ETSI).

In the DSR scheme, the mobile terminal performs acoustic analysis and transmits the analysis data to a speech recognition server. The server then performs speech recognition on the basis of this analysis data.

FIG. 8 shows the functional configuration of a DSR system in block diagram form. Referring to FIG. 8, the system includes a client terminal 180, which is a mobile terminal that performs acoustic analysis on an input speech signal and transmits the encoded analysis data in packet form, and a speech recognition server 182, which receives and decodes the packetized analysis data and performs speech recognition on the decoded analysis data. The output of the speech recognition server 182 is supplied to other services (for example, a translation service or an automatic response service).

The client terminal 180 includes a feature parameter extraction unit 190 for performing acoustic analysis on the speech signal and extracting feature parameters (feature data) in a predetermined format; a compression unit 192 for compressing the feature parameters output from the feature parameter extraction unit 190; and an encoding processing unit 194 for encoding the feature parameters compressed by the compression unit 192 (hereinafter "compressed feature parameters") by attaching an error correction code or the like, storing them in a packet payload, and transmitting the packet.

The speech recognition server 182 includes a decoding processing unit 200 for restoring the compressed feature parameters by decoding the error correction code contained in the payload of each received packet; a decompression processing unit 202 for restoring the feature parameters of the acoustic analysis result by decompressing the compressed feature parameters restored by the decoding processing unit 200; and a recognition processing unit 204 for receiving the feature parameters restored by the decompression processing unit 202 as input and performing speech recognition.

With the recent spread of Internet use, communication from the client terminal 180 to the speech recognition server 182 is increasingly carried out over an IP (Internet Protocol) network built on the Internet, and this is expected to become even more common.

In a DSR service, a short propagation delay is desirable when spoken dialogue between a user and a server is assumed. RTP (Real-time Transport Protocol) / UDP (User Datagram Protocol) / IP, which provides real-time delivery, is therefore considered suitable for transmitting and receiving speech data in DSR. The AVT (Audio/Video Transport) working group of the IETF (Internet Engineering Task Force), an Internet standards body, has issued a recommendation on the RTP packet format for DSR.

However, to preserve real-time behavior, transmission over RTP/UDP/IP does not retransmit a packet even when it fails to reach its destination for some reason. For example, a router may discard packets when the network is congested, and RTP/UDP/IP does not retransmit them. Packet loss therefore occurs, and when RTP/UDP/IP is used for DSR, packet loss results in loss of speech data. Such packet loss is also known to occur in bursts.

Proposals aimed at solving this problem have been made in Non-Patent Documents 1 to 3 listed below.

Non-Patent Document 1: Endo and Nakamura, "A Study on Data Complementation in a Distributed Recognition System", Proc. of Acoustical Society, 1-4-9, pp. 17-18, March 2003.
Non-Patent Document 2: Milner, B. and Semnani, S., "Robust speech recognition over IP networks", Proc. IEEE ICASSP, pp. 1791-1794, June 2000.
Non-Patent Document 3: Milner, B., "Robust speech recognition in burst-like packet loss", Proc. IEEE ICASSP, pp. 261-264, May 2001.

Non-Patent Documents 1 to 3 report studies and experiments on data imputation methods, in which speech recognition is performed on data whose packet-loss sections have been filled in with substitute values. However, when the packet loss rate is high or the run of lost packets is long, these methods cannot sufficiently compensate for the degradation in recognition.

Therefore, an object of the present invention is to provide a speech recognition apparatus capable of performing speech recognition with relatively high accuracy even when a large loss exists in a sequence of speech feature parameters.

Another object of the present invention is to provide a server-type speech recognition apparatus capable of performing speech recognition with relatively high accuracy even when burst packet loss occurs.

A speech recognition apparatus according to a first aspect of the present invention includes: frame receiving means for receiving frames each containing speech feature data and frame order information indicating the temporal order of the frame; frame storage means for storing the frames received by the frame receiving means in association with the frame order information; and frame loss detection means, connected to the frame receiving means, for detecting on the basis of the frame order information that a frame loss has occurred and for detecting the frame positions of the frames lost in that frame loss. The speech recognition apparatus further includes: feature data estimation means, responsive to detection of a frame loss by the frame loss detection means, for individually estimating feature data for as many frames as were detected as lost, on the basis of the feature data and frame order information contained in the frames stored in the frame storage means, generating frames containing the estimated feature data, and inserting each generated frame into the frame storage means at the frame position determined by its frame order information; and speech recognition means for reading the frames from the frame storage means in the order given by the frame order information and performing speech recognition on the feature data contained in each frame.

Preferably, the feature data estimation means includes: first frame count storage means for storing a first frame count; preceding frame reading means, connected to the first frame count storage means, for reading from the frame storage means the feature data of the first frame count of frames that precede the frame loss detected by the frame loss detection means; estimation means for estimating the feature data contained in each frame within the detected frame loss on the basis of the feature data of the first frame count of frames read by the preceding frame reading means; and frame insertion means for generating frames containing the estimated feature data and inserting each generated frame into the frame storage means at the frame position determined by its frame order information.

The feature data estimation means may further include: second frame count storage means for storing a second frame count; and succeeding frame reading means, connected to the second frame count storage means, for reading from the frame storage means the feature data of the second frame count of frames that follow the frame loss detected by the frame loss detection means. The estimation means may include means for estimating the feature data contained in each frame within the detected frame loss on the basis of both the feature data of the first frame count of frames read by the preceding frame reading means and the feature data of the second frame count of frames read by the succeeding frame reading means.

More preferably, the apparatus includes update means for comparing the number of lost frames detected by the frame loss detection means with a predetermined threshold, and for updating the first frame count stored in the first frame count storage means, the second frame count stored in the second frame count storage means, or both, by a predetermined update method determined by the comparison result.

The update means may include means for comparing the number of lost frames detected by the frame loss detection means with the predetermined threshold, and for updating the first frame count stored in the first frame count storage means, the second frame count stored in the second frame count storage means, or both, by adding a predetermined constant determined by the comparison result.

Preferably, the predetermined constant is a positive constant when the number of lost frames exceeds the threshold, and a negative constant otherwise. Alternatively, the predetermined constant is a negative constant when the number of lost frames exceeds the threshold, and a positive constant otherwise.

Preferably, the means for estimating calculates the feature data of a lost frame by the following equation:

[Equation not reproduced in this text.]

where N_f and N_b are the first frame count and the second frame count, respectively; X_t'f and X_t'b are feature data formed by averaging the N_f items of feature data before, and the N_b items of feature data after, the frame loss detected by the frame loss detection means; and t'_f and t'_b denote the frame order information corresponding to X_t'f and X_t'b. X_t'f and X_t'b are calculated as follows:

[Equation not reproduced in this text.]

where t_f and t_b denote the times corresponding to the frames immediately before and immediately after the frame loss, respectively.

More preferably, the feature data estimation means includes: a frame count table for storing the first frame count and the second frame count used in the estimation in association with the number of frames contained in a frame loss; frame reading means, connected to the frame count table, for reading from the frame count table the first frame count and the second frame count that correspond to the number of frames contained in the frame loss detected by the frame loss detection means, and for reading from the frame storage means the feature data of the first frame count of frames preceding that frame loss and the feature data of the second frame count of frames following it; and estimation means for estimating the feature data contained in each frame within the detected frame loss on the basis of the feature data of the frames read by the frame reading means.
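The table-driven variant above can be illustrated with a minimal sketch. The table contents, the layout (an upper bound on the loss length paired with N_f and N_b), and the function name `frame_counts_for_loss` are all assumptions made for illustration; the text only states that the two frame counts are stored in association with the number of lost frames.

```python
import bisect

# HYPOTHETICAL table: (largest loss length covered, N_f, N_b).
# The source gives no concrete values; these are placeholders.
FRAME_COUNT_TABLE = [(2, 1, 1), (5, 2, 2), (10, 3, 3)]


def frame_counts_for_loss(n_lost, table=FRAME_COUNT_TABLE):
    """Return (N_f, N_b) for a frame loss of n_lost frames.

    The first row whose bound is >= n_lost is used; losses longer than
    the last bound fall back to the last row (an assumed policy).
    """
    bounds = [row[0] for row in table]
    index = min(bisect.bisect_left(bounds, n_lost), len(table) - 1)
    _, n_f, n_b = table[index]
    return n_f, n_b
```

A lookup table of this kind lets the averaging window be tuned per loss length without the incremental update step described earlier.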

A speech recognition apparatus according to a second aspect of the present invention includes: frame receiving means for receiving frames each containing speech feature data and frame order information indicating the temporal order of the frame; frame storage means for storing the frames received by the frame receiving means in association with the frame order information; frame loss detection means, connected to the frame receiving means, for detecting on the basis of the frame order information that a frame loss has occurred and for detecting the frame positions of the frames lost in that frame loss; and speech recognition means for reading the frames from the frame storage means in the order given by the frame order information and performing speech recognition on the feature data contained in each frame. The speech recognition means includes means for recognizing speech by a hidden Markov model (HMM) in which the method of calculating the output likelihood of each state is selected according to whether or not a frame loss has been detected by the frame loss detection means.

Preferably, when no frame loss is detected by the frame loss detection means, the means for recognizing speech by the HMM calculates the output likelihood p(X_t | S_t) in each state S_t of the HMM by

p(X_t | S_t) = \sum_{j=1}^{M} w_j N(X_t; \mu_j, \sigma_j^2)

where M is the number of mixture elements in the Gaussian mixture distribution constituting each node of the HMM, w_j is the mixture weight of mixture element j of that distribution, t is the order information, and N(X_t; μ_j, σ_j²) is the univariate Gaussian distribution function for the input feature data of the t-th frame X_t, mixture element j having variance σ_j² and mean μ_j. When a frame loss is detected by the frame loss detection means, the output likelihood p(X_t | S_t) in each state S_t of the HMM is calculated by

[Equation not reproduced in this text.]

where C is a predetermined constant.
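The switching of the likelihood calculation can be sketched as follows. The normal case is the Gaussian mixture sum described above. For the frame-loss case the equation is not available in this text, so the sketch simply assumes that the constant C itself is used as the output likelihood; that assumption, and the function names, are illustrative only.

```python
import math


def gaussian(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def output_likelihood(x, weights, means, variances, frame_lost, c=1.0):
    """Output likelihood p(X_t | S_t) of one HMM state.

    frame_lost=False: weighted sum over the M Gaussian mixture elements.
    frame_lost=True:  ASSUMPTION, the predetermined constant C is returned
                      unchanged, so a lost frame favors no state over another.
    """
    if frame_lost:
        return c
    return sum(w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances))
```

Under this assumption every state receives the same score for a lost frame, so the decoder's choice over that span is driven by the transition probabilities and the surrounding received frames.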

A first embodiment of the present invention and a modification thereof, and a second embodiment, are described below. For each embodiment, the configuration is described first and the operation second. Both the first and second embodiments perform speech recognition on the basis of a theory called Missing Feature Theory (MFT). In the following description, the feature parameters (feature data) required for speech recognition are assumed to be transmitted by RTP/UDP. The feature parameters are organized into frames of a predetermined length (for example, 50 bytes), and a plurality of frames are stored in the UDP payload. Each RTP packet carries a sequence number. The header of each UDP datagram stores the payload size of that packet.

[First Embodiment]
-Configuration-
FIG. 1 is a block diagram of a speech recognition server 20 used in a server-client speech recognition system according to the first embodiment of the present invention. The speech recognition server 20 according to the first embodiment uses the data interpolation method, one of the MFT approaches, to perform speech recognition when packet loss has occurred.

Referring to FIG. 1, the speech recognition server 20 is connected to the Internet and includes an input buffer 30 for receiving and temporarily storing packets addressed to the speech recognition server 20, and a frame buffer 36 for storing the feature parameter frames extracted from the UDP datagrams in the input buffer 30 in association with their frame numbers. A frame number is order information indicating the temporal order of a frame. In the present embodiment, the frame buffer 36 stores frames in frame number order.
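A frame buffer of this kind can be sketched as a small container keyed by frame number. The class and method names below are hypothetical, and the `missing` helper is an illustrative addition rather than something the text describes; the sketch only shows frames being kept in frame number order regardless of arrival order, with later insertion of estimated frames handled by the same `insert` call.

```python
import bisect


class FrameBuffer:
    """Stores feature frames in frame number order (illustrative sketch)."""

    def __init__(self):
        self._numbers = []  # frame numbers, kept sorted
        self._frames = {}   # frame number -> feature vector

    def insert(self, frame_number, features):
        """Store a received or estimated frame at its frame number."""
        if frame_number not in self._frames:
            bisect.insort(self._numbers, frame_number)
        self._frames[frame_number] = list(features)

    def missing(self):
        """Frame numbers absent between the first and last stored frame."""
        if not self._numbers:
            return []
        first, last = self._numbers[0], self._numbers[-1]
        return [n for n in range(first, last + 1) if n not in self._frames]

    def in_order(self):
        """Frames as (frame_number, features) pairs in frame number order."""
        return [(n, self._frames[n]) for n in self._numbers]
```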

The speech recognition server 20 further includes: a frame loss detection unit 32 for detecting whether packet loss has occurred in the series of packets received by the input buffer 30, calculating the positions and number of the frames lost through that packet loss, and outputting a frame loss detection signal indicating that a frame loss has occurred and which frames were lost; a feature parameter estimation unit 34, responsive to detection of a frame loss by the frame loss detection unit 32, for estimating the feature parameters of each lost frame by a data interpolation method using the feature parameters contained in the frames stored in the frame buffer 36, and for inserting frames made up of the interpolated feature data at the positions in the frame buffer 36 that correspond to the lost frames; and a speech recognition unit 38 for reading the feature parameters stored in the frame buffer 36 in order and performing speech recognition. The speech recognition unit 38 may be the same as that used in the prior art.

FIG. 2 is a detailed block diagram of the frame loss detection unit 32. Referring to FIG. 2, the frame loss detection unit 32 includes a lost packet count detection unit 50 for extracting the RTP header contained in each UDP datagram temporarily stored in the input buffer 30 and examining the RTP sequence number to determine whether packet loss has occurred and how many packets were lost, and a payload size reading unit 52 for reading the payload size from the UDP header of each UDP datagram temporarily stored in the input buffer 30.

The frame loss detection unit 32 further includes a lost frame count calculation unit 54 for calculating how many frames were lost through packet loss from the number of lost packets detected by the lost packet count detection unit 50, the payload size of the UDP datagram read by the payload size reading unit 52, and the predetermined frame length. The lost frame count calculation unit 54 outputs the frame loss detection signal described above in accordance with the result of this calculation.
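The calculation performed by units 50 and 54 can be sketched as below. The sketch assumes that `payload_size` is the number of payload bytes occupied by feature frames, that every lost packet carried the same number of frames as the received ones, and that packets arrive in order. The function names are hypothetical. RTP sequence numbers are 16 bits wide, so the gap is taken modulo 65536.

```python
RTP_SEQ_MODULUS = 1 << 16  # RTP sequence numbers are 16-bit and wrap around


def lost_packet_count(prev_seq, curr_seq):
    """Packets missing between two consecutively received RTP packets.

    Assumes in-order arrival; a reordered packet would be misread as a
    very long loss.
    """
    return (curr_seq - prev_seq - 1) % RTP_SEQ_MODULUS


def lost_frame_count(n_lost_packets, payload_size, frame_length):
    """Frames lost = lost packets x frames carried per packet."""
    frames_per_packet = payload_size // frame_length
    return n_lost_packets * frames_per_packet
```

For example, with the 50-byte frame length mentioned earlier and a 100-byte payload, three lost packets correspond to six lost frames.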

FIG. 3 is a detailed block diagram of the feature parameter estimation unit 34 shown in FIG. 1. Referring to FIG. 3, the feature parameter estimation unit 34 includes first and second frame count storage units 80 and 82, which store the numbers of frames before and after a frame loss that are used in the interpolation calculation. The first frame count storage unit 80 stores the number of frames preceding the frame loss that are used in the interpolation calculation. The second frame count storage unit 82 stores the number of frames following the frame loss that are used in the interpolation calculation.

The feature parameter estimation unit 34 further includes: a preceding frame reading unit 70 which receives the frame loss detection signal and the output of the first frame count storage unit 80 and, when a frame loss occurs, reads from the frame buffer 36 as many frames immediately preceding the frame loss as the number stored in the first frame count storage unit 80; and a succeeding frame reading unit 72 which likewise receives the frame loss detection signal and the output of the second frame count storage unit 82 and, when a frame loss occurs, reads from the frame buffer 36 as many frames immediately following the frame loss as the number stored in the second frame count storage unit 82.

The feature parameter estimation unit 34 further includes: an interpolation calculation unit 74 which, in response to detection of a frame loss, receives the outputs of the first frame count storage unit 80 and the second frame count storage unit 82 together with the frames read from the frame buffer 36 by the preceding frame reading unit 70 and the succeeding frame reading unit 72, and estimates the feature parameters of the lost frames by the calculation method described later; and an interpolated frame insertion processing unit 76 which inserts interpolated frames made up of the feature parameters estimated by the interpolation calculation unit 74 at the appropriate positions in the frame buffer 36.

Let N_f be the frame count stored in the first frame count storage unit 80 and N_b the frame count stored in the second frame count storage unit 82. In the present embodiment, both N_f and N_b are updated according to the communication conditions as follows. Let N_L be the number of lost frames. If N_L exceeds a threshold S, a constant is added to both N_f and N_b; in the present embodiment this constant is the positive constant 1. If N_L is less than or equal to the threshold S, 1 is subtracted from both N_f and N_b, that is, the negative constant -1 is added. The minimum value of both N_f and N_b is 0.
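The update rule just described can be written as a short sketch. The function name and the `step` parameter are illustrative; the embodiment uses a step of 1 and a floor of 0.

```python
def update_frame_counts(n_f, n_b, n_lost, threshold, step=1):
    """Update N_f and N_b from the number of lost frames N_L.

    N_L >  S: add `step` to both counts (longer averaging windows).
    N_L <= S: subtract `step` from both counts, never going below 0.
    """
    if n_lost > threshold:
        return n_f + step, n_b + step
    return max(0, n_f - step), max(0, n_b - step)
```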

For this purpose, the feature parameter estimation unit 34 includes: a threshold storage unit 84 for storing the threshold S; an update processing unit 78 for updating N_f and N_b in accordance with the threshold S stored in the threshold storage unit 84, the values N_f and N_b stored in the first frame count storage unit 80 and the second frame count storage unit 82, and the number of lost frames indicated by the frame loss detection signal; and a threshold input unit 86 for manually updating the threshold S stored in the threshold storage unit 84.

The interpolation calculation performed by the interpolation calculation unit 74 is described next. This frame interpolation is performed on each element of the feature vectors transmitted in the packets. In the following description, the vector x in the feature vector stream at time t_N is written x = {X_t1, X_t2, ..., X_tN}, and the m-th frame is assumed to have been lost (1 ≦ m ≦ N). The index m may take a plurality of consecutive values.

補間方法は多数存在するが、受信したデータに基づいてデータ補間を行なうことが有効である。本実施の形態では、図４に示す方法によってデータ補間を行なう。図４を参照して、失われたフレームの特徴ベクトル＾Ｘtmは、ｔ'_f＜ｔm＜ｔ'_bを満足するｔ_mを用いて以下の式に従い推定される。 There are many interpolation methods, but it is effective to perform data interpolation based on the received data. In this embodiment, data interpolation is performed by the method shown in FIG. Referring to FIG. 4, feature vector ^ Xtm of the lost frame by using the t _m which satisfies _{t 'f <tm <t'} b are estimated according to the following equation.

ただしＸ_t'f及びＸ_t'bは、失われた特徴ベクトルのそれぞれ前のＮ_f個及び後のＮ_b個の特徴ベクトルの平均ベクトルであり、t'_f及びｔ'_bはこれらＸ_t'f及びＸ_t'bに対応する時刻を示す。

However X _T'f and X _T'b is the mean vector of the N _b-number of feature vectors of the previous N _f-number and after each missing feature vector, t _'f and t' _b These X _{t The} time corresponding to _'f and X _t'b is shown.

X_t'f and X_t'b are calculated as follows.

Here t_f and t_b denote the times corresponding to the frames immediately before and immediately after the frame loss, respectively.
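The defining equations are not reproduced in this text. Following the description, the two mean vectors can be written as below, with time counted in frame units. The placement of t'_f and t'_b at the centre of each averaged window is an assumption made for illustration; the text only states that they are the times corresponding to the two means.

```latex
X_{t'_f} = \frac{1}{N_f}\sum_{i=0}^{N_f-1} X_{t_f - i}, \qquad
X_{t'_b} = \frac{1}{N_b}\sum_{i=0}^{N_b-1} X_{t_b + i}

% Assumed time positions of the two means (centre of each averaged window):
t'_f = t_f - \frac{N_f - 1}{2}, \qquad t'_b = t_b + \frac{N_b - 1}{2}
```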

FIG. 4 shows an example with N_f = N_b = 3. In FIG. 4, the solid line indicates the value of one element of the feature vector, and each cross indicates the average of the three frames before or after the lost frames. Circles indicate the estimated values calculated using equations (2) to (4).
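The averaging and interpolation described above can be sketched in Python as follows. The function names, the use of frame indices as time, and the placement of each mean at the centre of its window are assumptions for illustration, not details taken from the text.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(frames: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length feature vectors."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def interpolate_lost_frames(
    before: Sequence[Sequence[float]],
    after: Sequence[Sequence[float]],
    n_lost: int,
) -> List[Vector]:
    """Estimate n_lost feature vectors lying between `before` and `after`.

    `before` holds the N_f frames preceding the gap (oldest first) and
    `after` the N_b frames following it (oldest first). Frame indices serve
    as time, with the last frame before the gap at index 0.
    """
    x_f, x_b = mean_vector(before), mean_vector(after)
    t_f, t_b = 0, n_lost + 1
    # Assumption: each mean sits at the centre of the frames it averages.
    tp_f = t_f - (len(before) - 1) / 2
    tp_b = t_b + (len(after) - 1) / 2
    estimates = []
    for t_m in range(t_f + 1, t_b):
        alpha = (t_m - tp_f) / (tp_b - tp_f)  # position of t_m between the means
        estimates.append([a + alpha * (b - a) for a, b in zip(x_f, x_b)])
    return estimates
```

Calling `interpolate_lost_frames(before, after, n_lost)` with three frames on each side corresponds to the N_f = N_b = 3 case of FIG. 4; every element of the feature vector is interpolated independently.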

-Operation-
The speech recognition system 10 shown in FIGS. 1 to 3 operates as follows. A predetermined threshold is assumed to have been set in the threshold storage unit 84 in advance, and predetermined values are likewise assumed to have been set in the first frame number storage unit 80 and the second frame number storage unit 82. In many cases the values updated during the previous communication remain set in the first frame number storage unit 80 and the second frame number storage unit 82, but predetermined initial values may instead be set in them, for example each time the power is turned on.

Incoming packets are temporarily stored in the input buffer 30. The payload size reading unit 52 of the frame loss detection unit 32 reads the payload size information from the UDP header and passes it to the lost frame number calculation unit 54. The payload size is normally a fixed value.

The lost packet number detection unit 50 reads the packet numbers from the RTP headers in the series of UDP payloads and determines whether packet loss has occurred based on whether the numbers are consecutive. When packet loss has occurred, the lost packet number detection unit 50 calculates the number of lost packets and passes it to the lost frame number calculation unit 54.
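The consecutiveness check on packet numbers can be sketched as follows. The function name is hypothetical, and the sketch assumes packets are examined in arrival order with no reordering. RTP sequence numbers are 16 bits wide, so the difference is taken modulo 65536 to handle wrap-around.

```python
def count_lost_packets(prev_seq: int, curr_seq: int) -> int:
    """Number of packets missing between two consecutively received RTP packets.

    Assumes in-order arrival. RTP sequence numbers wrap around at 65536,
    so the gap is computed modulo 65536.
    """
    return (curr_seq - prev_seq - 1) % 65536
```

A result of 0 means the two packet numbers are consecutive; any positive result is the number of lost packets passed on to the next stage.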

The lost frame number calculation unit 54 calculates the number of frames contained in the payload of one UDP datagram, based on the payload size supplied by the payload size reading unit 52 and a preset frame size. It then multiplies this number of frames by the number of lost packets to obtain the number of lost frames, and passes the result to the feature parameter estimation unit 34 as the frame loss detection signal.
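As a small worked sketch of this calculation, assuming sizes are given in bytes and the payload holds a whole number of frames (the function and parameter names are hypothetical):

```python
def count_lost_frames(payload_size: int, frame_size: int, lost_packets: int) -> int:
    """Lost frames = (frames per UDP payload) x (number of lost packets)."""
    if frame_size <= 0:
        raise ValueError("frame_size must be positive")
    # Assumption: the payload contains only whole frames of a fixed size.
    frames_per_packet = payload_size // frame_size
    return frames_per_packet * lost_packets
```

For example, with an illustrative payload of 88 bytes and a frame size of 44 bytes, each packet carries two frames, so three lost packets correspond to six lost frames.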

Referring to FIG. 3, the update processing unit 78 of the feature parameter estimation unit 34 compares the number of lost frames N_L specified by the frame loss detection signal with the threshold S stored in the threshold storage unit 84. If N_L > S, the update processing unit 78 adds 1 to each of the values N_f and N_b stored in the first frame number storage unit 80 and the second frame number storage unit 82. Otherwise, it subtracts 1 from each of the values N_f and N_b.
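The update rule can be sketched as follows. The function name and the `step` parameter are illustrative. The lower bound `minimum` is an assumption: the text does not say how the counts are kept from falling too far after repeated subtraction.

```python
from typing import Tuple


def update_frame_counts(
    n_f: int,
    n_b: int,
    n_lost: int,
    threshold: int,
    step: int = 1,
    minimum: int = 1,
) -> Tuple[int, int]:
    """Add `step` to N_f and N_b when N_L > S, otherwise subtract it.

    `minimum` is an assumed floor, not something stated in the description.
    """
    delta = step if n_lost > threshold else -step
    return max(minimum, n_f + delta), max(minimum, n_b + delta)
```

Keeping the step small means a single large burst changes the averaging window only slightly, so N_f and N_b drift gradually rather than jumping with each loss event.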

The previous frame reading unit 70 and the post-frame reading unit 72 of the feature parameter estimation unit 34 read from the frame buffer 36 (FIG. 1) the frames immediately before and immediately after the lost frames, in the numbers specified by the values N_f and N_b stored in the first frame number storage unit 80 and the second frame number storage unit 82, respectively. The frames read are passed to the interpolation calculation unit 74. Based on the values N_f and N_b and on the frames immediately before and after the lost frames supplied by the previous frame reading unit 70 and the post-frame reading unit 72, the interpolation calculation unit 74 calculates each element of the feature vector of each lost frame according to equation (1). It then passes the estimated feature vector of the lost frame, made up of the calculated elements, to the interpolation frame insertion processing unit 76.

The interpolation frame insertion processing unit 76 updates the contents of the frame buffer 36 so that the feature vector of the lost frame calculated by the interpolation calculation unit 74 is inserted at the frame position specified by the frame loss detection signal.

The speech recognition unit 38 shown in FIG. 1 reads the feature vectors contained in the frames from the frame buffer 36 in order and feeds them to the HMM to perform speech recognition.

Assume that the speech recognition unit 38 uses a continuous density HMM (CDHMM). The state S_t for a lost frame is computed using the feature vector ^X_t estimated by the interpolation calculation unit 74. The likelihood function of node S_t of the HMM is therefore given by the following equation.

Here M is the number of mixture components of the Gaussian mixture distribution, w_j is the mixture weight of component j, and N(X_t; μ_j, σ_j²) is the univariate Gaussian distribution function for the input feature of the t-th frame X_t, where component j has variance σ_j² and mean μ_j.
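The likelihood equation is not reproduced in this text. From the symbol definitions, it would plausibly be the standard Gaussian mixture form below, with the estimated vector ^X_t taking the place of X_t for a lost frame. This is a reconstruction from those definitions, not the published formula.

```latex
p(X_t \mid S_t) = \sum_{j=1}^{M} w_j \, N\!\left(X_t;\ \mu_j,\ \sigma_j^{2}\right),
\qquad
N\!\left(x;\ \mu,\ \sigma^{2}\right) = \frac{1}{\sqrt{2\pi\sigma^{2}}}
\exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)
```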

According to the apparatus of the first embodiment, even when packet loss occurs and several frames are lost, the feature parameter estimation unit 34 estimates the feature vectors of the lost frames and inserts the estimated frames at the positions of the lost frames in the frame buffer 36. The speech recognition unit 38 then only has to read the frames from the frame buffer 36 in order and perform speech recognition. Speech recognition can therefore be carried out under packet loss without changing the configuration of the speech recognition unit 38 from the conventional one. As described later, the accuracy is high, and speech recognition more robust than before can be achieved.

In the system of the embodiment described above, the amount added to or subtracted from N_f and N_b in each update is limited to 1. This prevents N_f and N_b from fluctuating sharply with changes in the number of lost packets and making speech recognition unstable. The amount is not restricted to 1, however, and an appropriate value may be chosen for the application. N_f and N_b may also be left at fixed values (for example N_f = N_b = 1) without being updated.

In the system of the embodiment described above, lost frames are estimated by internal division using the frames before and after them. Because this requires information from frames that follow the lost frames, estimation takes time and delays speech recognition. When speech recognition must be as fast as possible, the value of N_b can be fixed at 0. By extrapolating the values of the lost frames from the data of several frames preceding them, a result similar to that obtained with equation (1) can be achieved.

In the description above, 1 is added to both N_f and N_b when the number of lost frames N_L exceeds the threshold S, and 1 is subtracted from both when N_L is equal to or less than S. This approach gives priority to estimation accuracy. The way N_f and N_b are determined is not limited to this, however. For example, when real-time processing matters more than estimation accuracy, 1 may instead be subtracted from both N_f and N_b when N_L exceeds the threshold S, and added to both when N_L is equal to or less than S.

[Modification of the First Embodiment]
In the system of the first embodiment, the values N_f and N_b stored in the first frame number storage unit 80 and the second frame number storage unit 82 are updated according to the number of lost packets. However, the present invention is not limited to such an embodiment; the values of N_f and N_b may instead be determined in advance for each packet loss count N_L. To do so, the values of N_f and N_b for each N_L are stored in a table beforehand. FIG. 5 shows a block diagram of a feature parameter estimation unit 120 used in such a system. The feature parameter estimation unit 120 can be used in place of the feature parameter estimation unit 34 shown in FIG. 1.

Referring to FIG. 5, the feature parameter estimation unit 120 includes: a table 130 storing the numbers of frames N_f and N_b before and after a frame loss for each packet loss count N_L described above; a previous frame reading unit 132 that receives the frame loss detection signal and, when a frame loss occurs, reads from the table 130 the number N_f corresponding to the number of lost frames N_L and then reads from the frame buffer 36 that number of frames immediately before the frame loss; and a post-frame reading unit 134 that likewise receives the frame loss detection signal and, when a frame loss occurs, reads from the table 130 the number N_b corresponding to the number of lost frames N_L and then reads from the frame buffer 36 that number of frames immediately after the frame loss.

The feature parameter estimation unit 120 further includes: an interpolation calculation unit 136 that, in response to detection of a frame loss, reads from the table 130 the numbers N_f and N_b corresponding to the number of lost frames N_L, receives the frames read from the frame buffer 36 by the previous frame reading unit 132 and the post-frame reading unit 134, and estimates the feature parameters of the lost frames by a calculation similar to equation (1); and an interpolation frame insertion processing unit 76 that inserts an interpolation frame composed of the feature parameters estimated by the interpolation calculation unit 136 at the prescribed position in the frame buffer 36.

The feature parameter estimation unit 120 operates in the same way as in the first embodiment except for how the values of N_f and N_b are determined.
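The table lookup of this modification can be sketched as follows. The table contents are hypothetical values chosen only for illustration, and treating each key as the lower edge of a band of N_L values is an assumption; the text says only that N_f and N_b are tabulated against N_L.

```python
from typing import Dict, Tuple

# Hypothetical table: smallest N_L of each band -> (N_f, N_b).
FRAME_COUNT_TABLE: Dict[int, Tuple[int, int]] = {1: (1, 1), 3: (2, 2), 6: (3, 3)}


def lookup_frame_counts(
    n_lost: int, table: Dict[int, Tuple[int, int]] = FRAME_COUNT_TABLE
) -> Tuple[int, int]:
    """Return (N_f, N_b) for the band containing the loss count N_L."""
    candidates = [key for key in table if key <= n_lost]
    if not candidates:
        raise ValueError(f"no table entry for N_L = {n_lost}")
    return table[max(candidates)]
```

Because the mapping is fixed, the same loss count always yields the same window sizes, unlike the running update of the first embodiment.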

In this modification, the relationship between the number of lost packets N_L and the values N_f and N_b used for the internal division is fixed. The relationship itself therefore cannot be changed dynamically as in the first embodiment. The modification is nevertheless effective when the relationship between packet loss conditions and the values N_f and N_b can be predicted in advance. In addition, because the response time is constant in this modification, it is effective for performing speech recognition stably with consistent accuracy.

[Second Embodiment]
The system of the first embodiment estimates the feature vectors contained in lost frames from the feature vectors of the frames before and after the group of lost frames. Another approach to speech recognition under frame loss is the so-called marginalization method. The system according to the second embodiment of the present invention uses the marginalization method.

In the marginalization method, when part of the speech data is lost, recognition is performed by manipulating the output likelihood in the HMM without using the lost data. To achieve this, the speech recognition server must have a function for detecting frame loss, as in the system of the first embodiment.

FIG. 6 shows a block diagram of a speech recognition server 140 used in the server-client speech recognition system according to the second embodiment. Referring to FIG. 6, the speech recognition server 140 includes an input buffer 30, a frame loss detection unit 32, and a frame buffer 36, which are the same as those of the speech recognition server 20 of the first embodiment. Unlike the speech recognition server 20 of the first embodiment, the speech recognition server 140 further includes a speech recognition unit 150 that directly receives the frame loss detection signal output by the frame loss detection unit 32 and performs speech recognition by the marginalization method without estimating the feature vectors of the lost frames.

In FIG. 6, components identical to those in FIG. 1 carry the same reference numerals. Their names and functions are also the same, so their detailed description is not repeated here.

In speech recognition by the marginalization method, the output likelihood p(X_t | S_t) of node S_t of the HMM is obtained by the following equation.

When there is no frame loss, the output likelihood of each state of the HMM is calculated using the first (upper) expression of equation (5). When there is a frame loss, X_t in the first expression does not exist, so the output likelihood of every state is set to the same value "C", as given by the second expression of equation (5). As a result, when there is a frame loss, state transitions depend only on the state transition probabilities learned in advance.
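Equation (5) is not reproduced in this text. From the description of its two expressions, it can be reconstructed along the following lines. The mixture form of the first expression is taken from the symbol definitions given for the first embodiment, and the exact typesetting is an assumption.

```latex
p(X_t \mid S_t) =
\begin{cases}
\displaystyle\sum_{j=1}^{M} w_j \, N\!\left(X_t;\ \mu_j,\ \sigma_j^{2}\right)
  & \text{if frame } t \text{ was received} \\[2ex]
C & \text{if frame } t \text{ was lost}
\end{cases}
\tag{5}
```

Since every state receives the same value C for a lost frame, the output likelihood no longer discriminates between states at that time step, which is why only the transition probabilities affect the path.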

FIG. 7 schematically shows the configuration of the speech recognition unit 150. Referring to FIG. 7, the speech recognition unit 150 includes: an HMM 160; an output likelihood calculation unit 164 that calculates the output likelihood of each state of the HMM 160 using the first expression of equation (5) above; a constant storage unit 166 that stores the constant C; and a selection unit 162 that controls the HMM 160 so that the output likelihood of each state is calculated using the output likelihood calculation unit 164 when there is no packet loss, and using the output C of the constant storage unit 166 when there is packet loss.

When the frame loss detection signal has a value indicating that a frame loss was detected, the selection unit 162 uses the output of the constant storage unit 166 as the output likelihood of each state of the HMM 160. When no frame loss is detected, the selection unit 162 uses the result calculated by the output likelihood calculation unit 164 as the output likelihood of each state of the HMM 160.

The speech recognition unit 150 thus enables speech recognition by the marginalization method described above.
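The role of the selection unit can be sketched for a single feature element as follows. The function names and the default value of C are hypothetical, and the Gaussian mixture form of the likelihood is assumed from the symbol definitions in the text.

```python
import math
from typing import Optional, Sequence


def gaussian_pdf(x: float, mean: float, variance: float) -> float:
    """Univariate Gaussian density N(x; mean, variance)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )


def output_likelihood(
    x: Optional[float],
    weights: Sequence[float],
    means: Sequence[float],
    variances: Sequence[float],
    frame_lost: bool,
    constant_c: float = 1.0,
) -> float:
    """Output likelihood of one HMM state for one frame.

    A lost frame yields the same constant C for every state; a received
    frame yields the Gaussian mixture likelihood of its feature value.
    """
    if frame_lost:
        return constant_c
    return sum(
        w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)
    )
```

During decoding, a loss flag derived from the frame loss detection signal would be passed as `frame_lost` for each frame, so the recognizer itself needs no estimated feature vector.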

[Experimental Results]
Experiments were conducted to examine speech recognition results under frame loss using the systems of the first and second embodiments. The experiments examined how the word recognition rate varies with the packet loss rate and the mean burst loss length under two models: a random loss model, which assumes that packet loss occurs at random, and a Gilbert loss model, obtained by defining transition probabilities between two states, a normal state and a loss state. To simplify the experiments, one frame was assumed to be stored in each packet.
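A two-state loss model of the kind described can be simulated as in the sketch below. The parameter names, the initial normal state, and the seeding are assumptions for illustration; the text does not give the transition probabilities used in the experiments.

```python
import random
from typing import List


def gilbert_loss_pattern(
    n_packets: int, p_loss: float, p_recover: float, seed: int = 0
) -> List[bool]:
    """Simulate a two-state (normal / loss) Gilbert model.

    p_loss is the probability of moving from the normal state to the loss
    state, p_recover the probability of moving back. Returns one flag per
    packet, True meaning the packet is lost.
    """
    rng = random.Random(seed)
    lost = False  # assumption: the channel starts in the normal state
    pattern = []
    for _ in range(n_packets):
        if lost:
            lost = rng.random() >= p_recover  # stay in the loss state
        else:
            lost = rng.random() < p_loss  # enter the loss state
        pattern.append(lost)
    return pattern
```

Under this parameterisation the long-run packet loss rate is p_loss / (p_loss + p_recover) and the mean burst length is 1 / p_recover, so the two quantities varied in the experiments can be set independently.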

In the experiments on the first embodiment, for simplicity, one experiment was run with N_f = 1 and N_b = 0 fixed, and another with N_f = 1 and N_b = 1 fixed.

For comparison, an experiment was also conducted in which, instead of calculating feature vectors as in the first embodiment, the mean of the data used to train the HMM was computed in advance and this mean vector was supplied to the HMM as the data for the lost frames. The experimental results are discussed with this as the baseline.

As a result, the word recognition rate fell in every experiment as the mean burst length increased. However, the word recognition rate obtained with the present invention greatly exceeded the baseline in every case, and the gap widened as the packet loss rate increased. The marginalization method (second embodiment) achieved a higher word recognition rate than any of the others, and is therefore considered more robust against burst packet loss than the other methods.

The embodiments disclosed herein are merely illustrative, and the present invention is not limited to the embodiments described above. The scope of the present invention is indicated by each claim in the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a block diagram of the speech recognition server according to the first embodiment of the present invention.
FIG. 2 is a detailed block diagram of the frame loss detection unit 32 shown in FIG. 1.
FIG. 3 is a detailed block diagram of the feature parameter estimation unit 34 shown in FIG. 1.
FIG. 4 is a diagram for explaining the feature vector estimation process performed by the feature parameter estimation unit 34.
FIG. 5 is a block diagram of a modification of the speech recognition server of the first embodiment.
FIG. 6 is a block diagram of the speech recognition server according to the second embodiment of the present invention.
FIG. 7 is a detailed block diagram of the speech recognition unit 150 shown in FIG. 6.
FIG. 8 is a block diagram showing the configuration of a conventional server-client speech recognition system.

Explanation of symbols

20, 140, 182: speech recognition server; 30: input buffer; 32: frame loss detection unit; 34, 120: feature parameter estimation unit; 36: frame buffer; 38, 150: speech recognition unit; 50: lost packet number detection unit; 52: payload size reading unit; 54: lost frame number calculation unit; 70, 132: previous frame reading unit; 72, 134: post-frame reading unit; 74, 136: interpolation calculation unit; 76: interpolation frame insertion processing unit; 78: update processing unit; 80: first frame number storage unit; 82: second frame number storage unit; 130: table

Claims

Frame receiving means for receiving a frame including voice feature data and frame order information indicating a temporal order of frames;
Frame storage means for storing the frame received by the frame receiving means in association with frame order information;
Frame loss detecting means connected to the frame receiving means, detecting that a frame loss has occurred based on the frame order information, and detecting a frame position of a frame lost due to the frame loss;
Feature data estimation means for, in response to the frame loss detection means detecting that a frame loss has occurred, estimating feature data for as many frames as the number of lost frames detected by the frame loss detection means, based on the feature data and the frame order information included in the frames stored in the frame storage means, generating frames including the estimated feature data, and inserting the generated frames at positions in the frame storage means determined based on their frame order information;
A speech recognition apparatus comprising: speech recognition means for reading out frames from the frame storage means in an order according to frame order information and performing speech recognition on feature data included in each frame.

The feature data estimation means includes:
First frame number storage means for storing the first frame number;
Previous frame reading means, connected to the first frame number storage means, for reading from the frame storage means the feature data of the first number of frames, among the frames stored in the frame storage means, that precede the frame loss detected by the frame loss detection means;
Estimation means for estimating the feature data included in each frame in the frame loss detected by the frame loss detection means, based on the feature data of the first number of frames read by the previous frame reading means; and
The speech recognition apparatus according to claim 1, further comprising: a frame insertion unit for inserting the estimated frame at a frame position determined by frame order information in the frame storage unit.

The feature data estimation means further includes
Second frame number storage means for storing a second frame number;
Post-frame reading means, connected to the second frame number storage means, for reading from the frame storage means the feature data of the second number of frames, among the frames stored in the frame storage means, that follow the frame loss detected by the frame loss detection means,
The speech recognition apparatus according to claim 2, wherein the estimation means includes means for estimating the feature data included in each frame in the frame loss detected by the frame loss detection means, based on the feature data of the first number of frames read by the previous frame reading means and the feature data of the second number of frames read by the post-frame reading means.

The speech recognition apparatus according to claim 3, further comprising update means for comparing the number of lost frames detected by the frame loss detection means with a predetermined threshold value and updating the first frame number stored in the first frame number storage means, the second frame number stored in the second frame number storage means, or both, by a predetermined update method determined according to the comparison result.

The update means includes means for comparing the number of lost frames detected by the frame loss detection means with a predetermined threshold value and updating the first frame number stored in the first frame number storage means, the second frame number stored in the second frame number storage means, or both, by adding a predetermined constant determined according to the comparison result. The speech recognition apparatus according to the description.

The speech recognition apparatus according to claim 5, wherein the predetermined constant is a positive constant when the number of lost frames exceeds the threshold value, and is a negative constant otherwise.

The speech recognition apparatus according to claim 5, wherein the predetermined constant is a negative constant when the number of lost frames exceeds the threshold value, and is a positive constant otherwise.

The means for estimating calculates feature data of a lost frame by the following equation:

Here N_f and N_b are the first frame number and the second frame number, respectively; X_t'f and X_t'b are the averages of the N_f feature data before and the N_b feature data after the frame loss detected by the frame loss detection means, respectively; t'_f and t'_b denote the frame order information corresponding to X_t'f and X_t'b; and X_t'f and X_t'b are calculated as follows:

Here t_f and t_b denote the times corresponding to the frames immediately before and immediately after the frame loss, respectively. The speech recognition apparatus according to claim 3.

The feature data estimation means includes:
A frame number table for storing the first frame number and the second frame number used for estimation by the feature data estimation unit in association with the number of frames included in a frame loss;
Frame reading means, connected to the frame number table, for reading from the frame number table the first frame number and the second frame number corresponding to the number of frames included in the frame loss detected by the frame loss detection means, and for reading from the frame storage means, among the frames stored therein, the feature data of the first number of frames before the frame loss and the feature data of the second number of frames after the frame loss; and
The speech recognition apparatus according to claim 1, comprising estimation means for estimating the feature data included in each frame in the frame loss detected by the frame loss detection means, based on the feature data of the first number of frames and the second number of frames read by the frame reading means.

Frame receiving means for receiving a frame including voice feature data and frame order information indicating a temporal order of frames;
Frame storage means for storing the frame received by the frame receiving means in association with frame order information;
Frame loss detecting means connected to the frame receiving means, detecting that a frame loss has occurred based on the frame order information, and detecting a frame position of a frame lost due to the frame loss;
Voice recognition means for reading out frames from the frame storage means in an order according to frame order information and performing voice recognition on feature data included in each frame;
A speech recognition apparatus wherein the speech recognition means includes means for recognizing speech with a hidden Markov model (HMM), the means selecting a method for calculating the output likelihood of each state according to whether a frame loss has been detected by the frame loss detection means, and calculating the output likelihood by the selected method.

The means for recognizing speech by the HMM:
when no frame loss is detected by the frame loss detection means,

calculates the output likelihood p(X_t | S_t) in each state S_t of the HMM by the above expression, where M is the number of mixture components of the Gaussian mixture distribution constituting each node of the HMM, w_j is the mixture weight of component j of the Gaussian mixture distribution, t is the order information, N(X_t; μ_j, σ_j²) is the univariate Gaussian distribution function for the input feature data of the t-th frame X_t, and component j has variance σ_j² and mean μ_j; and
when a frame loss is detected by the frame loss detection means,

calculates the output likelihood p(X_t | S_t) in each state S_t of the HMM by the above expression, where C is a predetermined constant. The speech recognition apparatus according to claim 10.