JP2005331616A

JP2005331616A - Client to server speech recognition method, device for use in same, and its program and recording medium

Info

Publication number: JP2005331616A
Application number: JP2004148298A
Authority: JP
Inventors: Yoshikazu Yamaguchi; 義和山口
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-05-18
Filing date: 2004-05-18
Publication date: 2005-12-02
Anticipated expiration: 2024-05-18
Also published as: JP4425055B2

Abstract

PROBLEM TO BE SOLVED: To reduce the traffic between a client device and a server device, and to prevent the two devices from drifting out of step in their processing, in a method in which the input signal of speech sections detected in the client device is transmitted to the server device, the server device performs speech recognition on that signal, and the results are returned to the client device.

SOLUTION: The client device sequentially stores feature amounts for detecting speech sections in a storage part 130, detects speech sections by reading out (140) the feature amounts from the storage part 130, and transmits the input signal of each detected speech section to the server device together with the first sample position of the section, expressed relative to the first sample position of the input signal. The server device extracts (320) feature amounts from the input signal of the speech sections, stores them in association with their sample positions, performs (350) speech recognition by reading out the feature amounts from a storage part, detects the completion of an utterance during recognition, and transmits the corresponding sample position to the client device. The client device then resumes detection of speech sections by reading out the stored detection feature amounts from the sample position at which the utterance was completed.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a client-server speech recognition method in which an input signal supplied to a client device is transmitted to a server device connected via a network, speech recognition is performed at the server device, and the recognition result is transmitted to the client device. It also relates to a device used for the method, a program therefor, and a recording medium therefor.

In a client-server speech recognition method, speech is generally sent from the client device to the server device as follows: the client device detects speech sections in the input signal and transmits only the signal of those speech sections to the server device, which reduces the amount of communication, and the server device performs speech recognition processing on all of the received signal.
In such a method, processing at the client device often runs well ahead of processing at the server device. The reasons include that speech recognition at the server device requires more processing than the steps from input of the speech signal at the client device to transmission of the speech section signal; that communication may be delayed depending on the communication load between the client device and the server device; and that one server device may process requests from multiple client devices.

As shown in Non-Patent Document 1, there is distributed speech recognition (hereinafter DSR), in which part of the extraction of speech recognition features is performed at the client device, the result is transmitted to the server device, and the remaining feature extraction is performed at the server device.
Because client devices generally have low computing power, some functions are difficult to implement on the client device but easy to implement on a server device with high computing power. Examples are a speech section detection function that is highly accurate but computationally expensive, and a function that detects the end of an utterance from a specified grammar in the course of the speech recognition processing performed at the server device. Such a function can be implemented on the server device so that speech sections are detected only at the server device. In that case, however, non-speech signals that are not needed for speech recognition are also transmitted from the client device to the server device, and the amount of communication increases.
ETSI publication "ETSI ES 202 212 V1.1.1"

One conceivable approach is to implement a speech section detection function that requires little processing on the client device and one that requires a large amount of processing on the server device. Client-server speech recognition performed this way would reduce the amount of communication between the two devices while enabling stricter speech section detection and, with it, more accurate speech recognition.
In this case, however, the progress of the two devices drifts apart. Suppose speech sections and non-speech sections are input repeatedly. Because the client device detects speech sections with low accuracy, it may fail to detect that a speech section has actually ended and keep transmitting the following non-speech signal to the server device as if it were speech. Three problems follow. First, the client device transmits signals that are really non-speech and need not be sent, so the amount of communication between the client device and the server device increases. Second, the server device performs recognition processing on non-speech sections that need no recognition, so the processing required at the server device also increases. Third, because the client device is treating a non-speech signal as a speech section, it has difficulty detecting the exact start position of the next speech section; the server device may then perform speech recognition on a speech section whose start position is inaccurate, which can degrade the speech recognition rate.

An object of the present invention is to provide a client-server speech recognition method, a device therefor, a program therefor, and a recording medium therefor that reduce unnecessary communication, accurately detect the start position of speech when speech is input continuously, and thereby improve the speech recognition rate.

According to the present invention, the client device extracts from the input signal a detection feature amount used for speech section detection, detects a speech section using this detection feature amount, and transmits the signal of that speech section in the input signal to the server device.
The server device extracts a recognition feature amount used for speech recognition from the received speech section signal, performs speech recognition using this recognition feature amount, detects the end position of the speech section using either information obtained in the speech recognition processing or the received speech section signal, and transmits the end position to the client device.
On receiving the speech end position, the client device interrupts its speech section detection processing and starts detection feature amount extraction anew from the received end position.

With this configuration, the client device transmits only the signal of speech sections to the server device, so the amount of communication is greatly reduced. Moreover, because the end of a speech section is detected at the server device, its position can be detected accurately. The end of the speech section is reported to the client device, and on receiving it the client device interrupts speech section detection and begins detecting the next speech section from the received end position. The start position of each speech section is therefore always detected accurately, and the recognition rate of speech recognition at the server device improves. In addition, between the client device's reception of the speech section end from the server device and the start of the next speech section, there is no risk that a non-speech signal will be mistakenly transmitted to the server device as a speech section signal, and the amount of communication falls by that much.

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows the configuration of a system to which the method of the invention is applied, together with the functional configurations of an embodiment of the client device and an embodiment of the server device of the invention. FIG. 2 is a flowchart of an embodiment of the client device processing method, and FIG. 3 is a flowchart of an embodiment of the server device processing method. In this embodiment the client device and the server device are each realized on an electronic computer, and they are hereinafter called the client computer and the server computer. In this embodiment the server computer has no function for detecting the start of a speech section but does have a function for detecting the end of a speech section in the course of speech recognition. The invention is also applicable when the server computer has a function for detecting the start or the end of a speech section before or inside the speech recognition unit, and such detection may be performed using the received speech section signal.
The client computer 100 is connected to the server computer 300 via a network 200 such as a LAN (Local Area Network). In this embodiment, the client computer 100 keeps past detection feature amounts needed for speech section detection in a detection feature amount storage unit, the server computer 300 transmits to the client computer 100 the sample position at which a detected speech section ended, and the client computer 100 re-executes speech section detection from the received end sample position onward.

Functional configuration and processing procedure
In the client computer 100, an input signal digitized by an upstream A/D converter or the like (not shown) is supplied from a speech signal input device (not shown) through an input terminal 101 to a detection feature amount extraction unit 110, which extracts from the input signal the detection feature amounts used for detecting speech sections (step S1). For example, speech power or pitch computed over a group of consecutive samples of the input signal (called a frame) serves as the detection feature amount. In this example, the extracted detection feature amounts are stored sequentially in a detection feature amount storage unit 130 via a detection feature amount management unit 120 (step S1). Each detection feature amount is stored in the storage unit 130 in association with its position in the input signal. In this example, the sample position s0 at which processing of the input signal, that is, extraction of the detection feature amounts, begins is set as the base point (step S2), and each detection feature amount is stored in the storage unit 130 in association with its sample position.
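The storage scheme in this paragraph can be illustrated with a minimal Python sketch. It assumes frame log power as the only detection feature amount and a frame length of 80 samples; the function name, the frame length, and the dictionary layout are illustrative assumptions, not details from the specification.

```python
import math

FRAME_LEN = 80  # samples per frame (assumed value, e.g. 10 ms at 8 kHz)


def extract_detection_features(samples, base_position=0, frame_len=FRAME_LEN):
    """Return {frame start sample position: log power in dB} for each full frame.

    base_position plays the role of the base point s0: every key is a sample
    position counted from the start of the input signal.
    """
    features = {}
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Small constant avoids log10(0) on silent frames.
        features[base_position + start] = 10.0 * math.log10(power + 1e-10)
    return features


# Two silent frames followed by two louder frames.
signal = [0.0] * 160 + [0.5] * 160
store = extract_detection_features(signal)
```

Keying the store by sample position is what later lets the client discard old entries or restart detection from a position reported by the server.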

A speech detection unit 140 reads the detection feature amounts in time order, that is, in the order in which they were stored, from the detection feature amount storage unit 130 via the detection feature amount management unit 120, and detects speech on the basis of them; in other words, it determines whether the corresponding input signal is speech or non-speech (step S3). In this example, a signal transmission management unit 150 checks, for each sample or each frame of the input signal, whether that input signal has already been transmitted to the server computer 300 (step S4). If it has not been transmitted and the speech detection unit 140 has judged it to be speech, that is, part of a speech section (step S5), the input signal of that speech section is transmitted to the server computer 300 as packets of one or more frames from an input signal transmission unit 161 of a client transmission unit 160 (step S10). If it has already been transmitted, a detection result information transmission unit 162 of the client transmission unit 160 transmits to the server computer 300 information on the judgment of the speech detection unit 140 (hereinafter, the detection result), for example detection result information indicating "speech" or "non-speech" for the already transmitted input signal (step S11).
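The basic decision made in steps S4, S5, S10, and S11 might be sketched as follows. This is a simplified illustration: the function name, the tuple message format, and the use of a set to track transmitted positions are assumptions, and the handling of untransmitted non-speech frames (steps S7 and S9) is deliberately left out.

```python
def choose_transmission(position, is_speech, already_sent):
    """Decide what the client sends for the frame at `position`.

    already_sent is a set of sample positions whose signal has been transmitted.
    Returns ("signal", position), ("result", position, label), or None.
    """
    if position not in already_sent:           # step S4: not yet transmitted
        if is_speech:                          # step S5: judged to be speech
            already_sent.add(position)
            return ("signal", position)        # step S10: send the frame itself
        return None                            # simplified: nothing is sent
    label = "speech" if is_speech else "non-speech"
    return ("result", position, label)         # step S11: send only the label


sent = set()
first = choose_transmission(0, True, sent)     # frame is sent as signal
second = choose_transmission(0, False, sent)   # re-detection: label only
third = choose_transmission(80, False, sent)   # unsent non-speech frame
```

The second call models re-detection after a restart: the frame was already sent, so only the revised judgment travels to the server.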

Further, in this example, if the untransmitted signal is judged in step S5 to belong to a speech section, it is determined whether it is the start of that speech section (step S6). If so, it is checked whether the speech section is the first one in a single utterance (step S7). If it is the first speech section, a start position indicating the position in the input signal of the first frame of that speech section is transmitted to the server computer 300 (step S8), and the input signal corresponding to that start portion (frame) is also transmitted to the server computer 300. In this example, the start position sent to the server computer 300 is the sample position of the first frame of the first speech section, measured from the base point (base sample position), which is the point at which processing of the input signal began, that is, the position at which the input signal first arrived at the input terminal 101. The signal transmission management unit 150 decides whether the start position is to be transmitted.

This example is also one in which, as described later, the client computer 100 receives an end signal, interrupts the speech/non-speech judgment in the speech detection unit 140, restarts speech section detection from the position indicated by the end signal, detects the start of a speech section, and transmits its signal; at that point, if there is an untransmitted section between the newly detected speech section start position and the last sample position of the detection result information sent so far, detection result information is transmitted for that non-speech section as well. That is, if the judgment in step S5 is not a speech section, the process moves to step S7; if the speech section is not the first one in step S7, it is checked whether the position is before the start of the next speech section (step S9); if it is, the process moves to step S11 and detection result information, that is, information indicating non-speech, is transmitted.

In this way, in this embodiment, taking the speech detection start position sent from the client computer 100 as the base point, the server computer 300 receives some signal from the client computer 100 for every predetermined number of samples of the input signal, at least once per packet, and speech detection is restarted from the position indicated by the received end signal. This avoids the problem of the correspondence between the speech section signal or detection result information and the sample positions slipping because of errors in detecting the end of a speech section or losses occurring during transmission of the detection result information. The problem may also be avoided by sending input sample position information from the client computer 100 to the server computer 300 at regular intervals (step S12).

In the server computer 300, when the speech section signal transmitted from the client computer 100 is received by an input signal reception unit 311 of a server reception unit 310, a recognition feature amount extraction unit 320 extracts from the speech section signal, frame by frame, the recognition feature amounts used for speech recognition, for example a set consisting of cepstrum, delta cepstrum, power, and delta power, and stores them in a recognition feature amount storage unit 340 via a recognition feature amount management unit 330.
When detection result information transmitted from the client computer 100 is received by a detection result information reception unit 312 in the server reception unit 310, the detection result information is attached, via the recognition feature amount management unit 330, to the already extracted recognition feature amount in the recognition feature amount storage unit 340 that has the same sample position as the received detection result information. If the detection result information indicates non-speech, the recognition feature amount at that sample position may instead be deleted without attaching the information.

That is, as shown in FIG. 3, when the server reception unit 310 of the server computer 300 receives a transmission from the client computer 100 (step S31) and it is not detection result information, that is, it is a speech section signal (step S32), recognition feature amounts are extracted from the signal and stored in the recognition feature amount storage unit 340 (step S33). If the speech section signal is the start of the first speech section of an utterance, the speech section start position is received along with it, and the recognition feature amounts are stored in association with that start position (sample position). The recognition feature amounts of the other speech section signals are likewise associated, for each packet and each frame, with a position (sample position) measured from that speech section start position. If the received signal is detection result information, it is stored in the recognition feature amount storage unit 340 in association with its sample position (step S34). Detection result information is received when, as described later, the server computer 300 has detected the end of a speech section and transmitted the end sample position to the client computer 100, and the client computer 100 has restarted speech detection from that end sample position onward. The server computer 300 can attach the received detection result information to the recognition feature amount stored in the recognition feature amount storage unit 340 for the corresponding sample position or, if the detection result information indicates non-speech, delete that recognition feature amount.
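As a rough illustration of steps S33 and S34, the server-side store could be modeled as a mapping from sample position to a feature entry. The class name, method names, and the `drop_non_speech` switch are hypothetical; the specification only states that the information may be attached or the entry deleted.

```python
class RecognitionFeatureStore:
    """Toy model of a recognition feature store keyed by sample position."""

    def __init__(self):
        self.features = {}  # sample position -> {"vector": ..., "label": ...}

    def store(self, position, vector):
        # Step S33: keep the extracted feature with its sample position.
        self.features[position] = {"vector": vector, "label": None}

    def apply_detection_result(self, position, label, drop_non_speech=True):
        # Step S34: attach the client's judgment, or delete a non-speech entry.
        entry = self.features.get(position)
        if entry is None:
            return
        if label == "non-speech" and drop_non_speech:
            del self.features[position]
        else:
            entry["label"] = label


server_store = RecognitionFeatureStore()
server_store.store(160, [1.2, 0.3])
server_store.store(240, [0.1, 0.0])
server_store.apply_detection_result(160, "speech")
server_store.apply_detection_result(240, "non-speech")
```

Deleting non-speech entries keeps the recognizer from spending work on frames the client has since reclassified.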

A speech recognition unit 350 reads the recognition feature amounts of the speech section from the recognition feature amount storage unit 340 via the recognition feature amount management unit 330, frame by frame and in time order, that is, in the order in which they were stored, and performs speech recognition (step S35).
In this embodiment, to keep the required capacity of the detection feature amount storage unit 130 in the client computer 100 from growing, the server computer 300 transmits to the client computer 100, at regular intervals, the sample position up to which speech recognition has been processed. The client computer 100 treats speech detection earlier than that sample position as unnecessary and deletes the corresponding detection feature amounts from the detection feature amount storage unit 130. To this end, a recognition progress management unit 360 checks the progress of speech recognition at a fixed interval of 20 to 50 frames (one frame is the unit of recognition processing, for example 10 milliseconds), for example every 300 milliseconds (step S36), and obtains from the speech recognition unit 350 the sample position that recognition has reached at that time (step S37). The recognition progress management unit 360 notifies the recognition feature amount management unit 330 that the recognition feature amounts in the recognition feature amount storage unit 340 before that position are to be deleted, and the recognition feature amount management unit 330 deletes them accordingly (step S38). The recognition progress management unit 360 also notifies a position signal transmission unit 371 of a server transmission unit 370 to transmit that position as a progress position signal, and the position signal transmission unit 371 transmits the progress position signal to the client computer 100 (step S39).

In the client computer 100, when the progress position signal is received by a position signal reception unit 171 of a client reception unit 170 (step S13), the detection feature amount management unit 120 is notified that the detection feature amounts in the detection feature amount storage unit 130 before the progress position are to be deleted, and it deletes the corresponding detection feature amounts accordingly (step S14).
The progress check by the recognition progress management unit 360 is performed at regular intervals, and each time the feature amounts stored before the progress position are deleted from the storage units of the server computer 300 and the client computer 100. The storage units 130 and 340 are thus used efficiently and need only a relatively small capacity.
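The pruning performed in steps S14 and S38 amounts to dropping every stored entry that precedes the reported progress position. A minimal sketch, assuming the store is a dictionary keyed by sample position and that the entry at the progress position itself is kept (the boundary handling is an assumption):

```python
def prune_before(store, progress_position):
    """Delete, in place, entries whose sample position precedes progress_position."""
    for position in [p for p in store if p < progress_position]:
        del store[position]
    return store


# Detection features for five frames; recognition has reached sample 160.
detection_store = {0: -60.0, 80: -58.0, 160: -12.0, 240: -10.0, 320: -55.0}
prune_before(detection_store, 160)
```

Because the same rule is applied on both machines each time the progress position is reported, the stored history stays bounded by roughly one reporting interval plus the lag between client and server.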

When a section end detection unit 351 in the speech recognition unit 350 detects the end of the speech section and the recognition progress management unit 360 detects that speech recognition processing has finished (step S40), the position at which the speech section ended is obtained in the same way as in the periodic progress check described above (step S41). The recognition feature amount management unit 330 is notified that the recognition feature amounts in the recognition feature amount storage unit 340 before the end sample position of the speech section are to be deleted, and it deletes them accordingly (step S42). This deletion allows the storage unit 340 to be used efficiently.

The recognition progress management unit 360 also notifies the position signal transmission unit 371 to transmit the sample position at which the speech section ended, and the position signal transmission unit 371 transmits that sample position to the client computer 100 as a speech section end position signal (speech end) (step S43).
In the client computer 100, when the speech section end position signal is received by the position signal reception unit 171 (step S15), the detection feature amount management unit 120 is notified that the detection feature amounts in the detection feature amount storage unit 130 before the speech section end position are to be deleted, and it deletes them accordingly (step S16). At the same time, the speech detection unit 140 is notified to interrupt the speech/non-speech judgment currently in progress and to restart it from the end position of the speech section; the speech detection unit 140 accordingly returns to step S3 and restarts the speech/non-speech judgment (step S18).
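The client's reaction to the end position signal (steps S15, S16, and the return to step S3) could be sketched as below. The class, its threshold-based speech test, and the threshold value are illustrative assumptions; the specification does not prescribe how the client judges speech.

```python
class ClientDetector:
    """Toy client-side detector that can be rewound to a server-reported position."""

    def __init__(self, features, threshold=-40.0):
        self.features = dict(features)  # sample position -> log power (dB)
        self.threshold = threshold      # assumed speech/non-speech threshold
        self.cursor = min(self.features) if self.features else 0

    def on_segment_end(self, end_position):
        # Step S16: discard detection features before the end position.
        self.features = {p: v for p, v in self.features.items() if p >= end_position}
        # Interrupt the current judgment and restart from the end position.
        self.cursor = end_position

    def next_speech_start(self):
        # Step S3 onward: scan stored features from the cursor in time order.
        for position in sorted(self.features):
            if position >= self.cursor and self.features[position] > self.threshold:
                return position
        return None


detector = ClientDetector({0: -60.0, 80: -10.0, 160: -12.0, 240: -55.0, 320: -8.0})
detector.cursor = 400            # the client has run ahead of the server
detector.on_segment_end(240)     # server reports the section ended at sample 240
restart = detector.next_speech_start()
```

Rewinding the cursor is only possible because past detection feature amounts were kept with their sample positions, which is the reason the client stores them rather than discarding each one after use.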

When the recognition progress management unit 360 of the server computer 300 detects the end of the speech section in the speech recognition unit 350, the recognition result output by the speech recognition unit 350 is transmitted to the client computer 100 from a recognition result transmission unit 372 of the server transmission unit 370 (step S43). In the client computer 100, the recognition result is received by a recognition result reception unit 172 of the client reception unit 170 and, before the processing of step S17, is output from an output terminal 102 to a speech recognition result output device (not shown); the process then returns to step S3 (step S18).
The subsequent operation repeats what has been described above. Extraction of the detection feature amounts in step S1 is performed continuously for every frame, and FIG. 2 mainly shows the procedure from the point at which the detection feature amounts are read out of the detection feature amount storage unit 130 and processed.

The server computer 300 is generally an expensive machine with large-scale hardware and software, so even if an expensive unit with high detection capability is used as the section end detection unit 351, the server computer 300 as a whole does not become much more expensive. The client computer 100, on the other hand, is generally an inexpensive machine with relatively small-scale hardware and software. Accordingly, an inexpensive detector with relatively low capability for detecting the end of a voice section is used in the client computer 100, and a detector with high detection capability is used as the section end detection unit 351 of the server computer 300. As described above, when the server computer 300 detects the end of a voice section, it transmits an end signal indicating that position to the client computer 100, and the client computer 100 performs voice detection anew from the position indicated by the end signal. As a result, even if the client computer 100 fails to detect the end of a voice section and continues to transmit the signal as a voice section, and even if there is a processing lag between the client computer 100 and the server computer 300 (the latter lagging behind), the client computer 100 can reliably detect the start of a voice section, and the speech recognition rate improves accordingly. In addition, the non-voice signal between reception of the end signal and the start of the next voice section is not transmitted, which reduces the communication amount by that much.
Furthermore, when detection result information is transmitted as in the above example, a single bit indicating voice or non-voice is sufficient. Detection result information sent for a voice section signal requires far less communication than the voice section signal itself, and even when detection result information is sent for non-voice sections, processing synchronization between the client computer 100 and the server computer 300 can be maintained with a small communication amount.
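To give a rough sense of scale for the saving described above, the following Python sketch compares the payload of one frame of raw input signal with a one-bit detection result. The sampling rate (16 kHz), sample width (16 bits), and frame length (10 ms) are assumed values chosen for illustration; the specification does not state them, and packet headers are ignored.

```python
def frame_payload_bits(sample_rate_hz, bits_per_sample, frame_ms):
    """Payload size in bits of one frame of uncompressed input signal."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * bits_per_sample


# Assumed parameters (not from the specification).
speech_bits = frame_payload_bits(16000, 16, 10)  # one frame of input signal
flag_bits = 1                                    # voice / non-voice flag
ratio = speech_bits // flag_bits
```

Under these assumptions one frame of input signal carries 2560 bits, so replacing it with a one-bit flag shrinks the payload for that frame by the same factor.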

Specific Processing Example
Next, with reference to FIGS. 4 to 7, the transmission and reception of position signals performed in the present invention, the storage and erasure of feature quantities in the feature quantity storage units 130 and 340 of the client computer 100 and the server computer 300, and the flow of restarting voice section detection are described specifically.
In FIGS. 4 to 7, the graph labeled A represents the input signal; the horizontal axis is time (sample position relative to the start of voice input), the vertical axis is voice power (volume), and the signal contains both voice sections and non-voice sections. The row of squares labeled B in each figure shows, along the input signal, the per-frame storage state of detected feature quantities in the detected feature quantity storage unit 130 of the client computer 100. C in each figure shows the signals transmitted between the client computer 100 and the server computer 300. The row of squares labeled D in each figure shows, along the received signal, the per-frame storage state of recognition feature quantities in the recognition feature quantity storage unit 340 of the server computer 300.

FIG. 4 shows the state in which voice detection is started in the client computer 100, the start of a voice section is detected, and the signal of the voice section is transmitted to the server computer 300. When an input signal is input to the client computer 100, detection of voice sections starts from its first sample position s0: the detected feature quantity extracted for each frame is stored in the detected feature quantity storage unit 130 (shown as one solid square per frame) and read out while the start position of a voice section is searched for. At this time, the first sample position s0 serves as the base point for positions on the input signal. When the start of a voice section is detected in the frame at sample position s1, the input signal Sp of the first frame of that voice section is transmitted to the server computer 300, together with a signal Ps indicating sample position s1 as the section start position. Thereafter, only the input signal of each frame of that voice section is transmitted sequentially to the server computer 300. As described above, depending on communication conditions between the client computer 100 and the server computer 300, the transmission and reception of this voice section signal may be delayed.

When the server computer 300 receives the signal from the client computer 100, it stores the recognition feature quantities extracted from each frame of the voice section signal sequentially in the recognition feature quantity storage unit 340, beginning at sample position s1, and reads them out sequentially to start recognition. At this point, neither the feature quantities stored in the detected feature quantity storage unit 130 of the client computer 100 nor those stored in the recognition feature quantity storage unit 340 of the server computer 300 are erased.
FIG. 5 shows how, as voice section detection proceeds in the client computer 100 and voice recognition proceeds in the server computer 300, a progress position signal Pp is generated at regular intervals and the feature quantities stored up to that position are erased. The server computer 300 generates a progress position signal Pp at sample position s2, and the recognition feature quantities up to sample position s2 stored in the recognition feature quantity storage unit 340 are erased. The erased recognition feature quantities are shown as dotted squares. In the client computer 100, in accordance with the progress position signal Pp indicating position s2 received from the server computer 300, the detected feature quantities up to position s2 stored in the detected feature quantity storage unit 130 are erased, as shown by the dotted squares.
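The periodic erasure driven by the progress position signal Pp can be pictured with a small Python sketch. The function names, the frame shift of 160 samples, and the interval of 5 frames are all assumptions made for illustration; the specification says only that the signal is generated at regular intervals.

```python
def progress_positions(start_pos, frame_shift, interval_frames, frames_done):
    """Sample positions at which a progress position signal Pp would be
    issued, given how many frames recognition has consumed so far."""
    step = frame_shift * interval_frames
    last = start_pos + frame_shift * frames_done
    return list(range(start_pos + step, last + 1, step))


def prune(stored_positions, progress_pos):
    """Keep only the frames located after the reported progress position.
    The same rule would apply to storage unit 340 on the server and to
    storage unit 130 on the client."""
    return [p for p in stored_positions if p > progress_pos]


# Voice section starts at sample 1000; 12 frames have been recognized.
signals = progress_positions(1000, 160, 5, 12)
stored = list(range(1000, 1000 + 160 * 12, 160))
after_first_signal = prune(stored, signals[0])
```

Applying the same pruning rule on both machines is what keeps the two stores bounded in size without either side having to track the other's memory use.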

Similarly, at sample position s3, after a further fixed number of frames has elapsed, a progress position signal Pp is generated; in the server computer 300 the recognition feature quantities up to position s3 stored in the recognition feature quantity storage unit 340 are erased, and in the client computer 100 the detected feature quantities up to position s3 stored in the detected feature quantity storage unit 130 are erased.
FIG. 6 shows the state in which voice section detection has proceeded further in the client computer 100 and voice recognition has proceeded further in the server computer 300: the server computer 300 has detected the end of the voice section (the voice end position), whereas the client computer 100 has failed to detect the end of the voice section and continues to treat the subsequent signal as a voice section.

The server computer 300 detects the end of the voice section at sample position s4; the recognition feature quantities up to sample position s4 stored in the recognition feature quantity storage unit 340 are erased, as shown by the dotted squares, and a voice section end position signal Pe indicating the end position s4 of the voice section is transmitted to the client computer 100.
In the client computer 100, in accordance with the voice section end position signal Pe indicating sample position s4 received from the server computer 300, the detected feature quantities up to sample position s4 stored in the detected feature quantity storage unit 130 are erased, as shown by the dotted squares. At the same time, voice section detection in the voice detection unit 140 is interrupted. The detected voice section signal Sp is transmitted to the server computer 300 only up to the point at which the voice section end position signal Pe is received, which is sample position s5 in FIG. 6.

FIG. 7 shows the state in which the client computer 100 subsequently restarts voice section detection and detects the start position of a voice section, transmitting to the server computer 300 the detection result information of the voice detection unit 140 for sections whose input signal has already been transmitted, and the voice section signal of the input signal for sections not yet transmitted.
In the client computer 100, voice section detection is started by reading the detected feature quantities from the frame of the sample following sample position s4, at which the previous voice section ended. Even after the voice section end position signal Pe is received, the detected feature quantities extracted for each frame by the detected feature quantity extraction unit 110 continue to be stored sequentially in the detected feature quantity storage unit 130. In this example, the input signal up to sample position s5 has already been transmitted as a voice section signal. Therefore, until the next voice section is detected, detection result information UV (UnVoice), indicating a non-voice section as detected by the voice detection unit 140, is transmitted to the server computer 300 for each frame.

The detection result information of the voice detection unit 140 is therefore transmitted; in the illustrated example this detection result is non-voice, so detection result information UV (UnVoice) indicating a non-voice section is transmitted to the server computer 300. In the illustrated example, a non-voice section also exists beyond the already transmitted section, that is, between sample position s5 and sample position s6, the start position of the next voice section. In this example, so that the server computer 300 can know the start sample position of the next voice section without the section start position being transmitted, the detection result of the voice detection unit 140, that is, detection result information UV indicating non-voice, is transmitted to the server computer 300 for each section from sample position s5 up to the start sample position s6 of the next voice section.

That is, in FIG. 2, if the input signal has not been transmitted in step S4, no voice section is detected in step S5, and it is not the first voice section of the utterance in step S7, then step S9 checks whether the position is before the start of a voice section. If it is before a voice section, that is, in a non-voice section, the process moves to step S11 and the detection result information UV of the voice detection unit 140 is transmitted to the server computer 300. In this way, for every frame of the input signal from the detection start position, that is, sample position s0, either detection result information or a voice section signal is transmitted to the server computer 300, so the client computer 100 and the server computer 300 can keep their sample positions synchronized.
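The per-frame choice between sending a voice section signal and sending one-bit detection result information, as it applies after detection has restarted, can be sketched in Python as follows. This is a simplified illustration, not the flow of FIG. 2 itself: the function name, the message labels `"SIGNAL"`, `"V"`, and `"UV"`, and the sample positions are hypothetical.

```python
def messages_after_restart(frames, last_sent_pos):
    """Decide what to send for each frame once detection has restarted.

    frames        -- list of (sample_pos, is_voice) in time order
    last_sent_pos -- last sample position whose input signal was already
                     transmitted before the end position signal arrived
    """
    out = []
    for pos, is_voice in frames:
        if pos <= last_sent_pos:
            # Input signal already sent: only the 1-bit result is needed.
            out.append((pos, "V" if is_voice else "UV"))
        elif is_voice:
            # Untransmitted voice frame: send the input signal itself.
            out.append((pos, "SIGNAL"))
        else:
            # Untransmitted non-voice frame: send the 1-bit UV flag.
            out.append((pos, "UV"))
    return out


# Frames after the end position; the signal was already sent up to 500.
frames = [(400, False), (500, False), (600, False), (700, True)]
msgs = messages_after_restart(frames, last_sent_pos=500)
```

Because exactly one message is produced per frame, the receiving side can recover each frame's sample position by counting messages, which is the synchronization property the paragraph above relies on.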

When the server computer 300 receives detection result information UV, the recognition feature quantities in the recognition feature quantity storage unit 340 that correspond to the section in question are erased in this example. That is, in the illustrated example, the recognition feature quantities stored from the position following sample position s4 up to sample position s5 in the recognition feature quantity storage unit 340 are erased, as shown by the dotted squares. Thereafter, the detection result information UV received from the client computer 100 is not stored, and nothing is stored in the corresponding storage area of the storage unit 340.
Next, when the client computer 100 detects the start of a voice section at sample position s6, it transmits the input signal Sp of each frame of that voice section to the server computer 300 one after another, starting from sample position s6.

When the server computer 300 receives the voice section signal, it starts voice recognition again from sample position s6.
In the example of FIG. 6, there is a section in which the input signal is not transmitted, namely the non-voice section from the sample following sample position s5 to the sample preceding sample position s6, and the communication amount can be reduced accordingly. Detection result information is transmitted during this section, but it is, for example, information (1 bit) that distinguishes "voice" from "non-voice", so the communication amount is far smaller than for the input signal of a voice section.

Alternatively, after voice section detection is restarted (as shown in FIG. 7), the communication amount can be reduced further by not transmitting even the detection result information in non-voice sections. In that case, as shown by the broken lines in FIG. 2, step S5 waits for a voice section; if in step S7 that voice section is not the first of the utterance, the process moves to step S19, where it is determined whether there is an untransmitted section immediately before that voice section, that is, whether the immediately preceding section is a non-voice section. If there is an untransmitted section, the process moves to step S8, and when the input signal of the start frame of that voice section is transmitted, a start position indicating that start frame position (sample position s6 in the example of FIG. 7) is also transmitted to the server computer 300, so that the client computer 100 and the server computer 300 synchronize their input sample positions. If step S4 determines that the signal has already been transmitted, the process moves to step S20; if the frame is in a voice section the process moves to step S11, and otherwise it moves to step S5. In this way, no signal of any kind is transmitted to the server computer 300 for non-voice detected after voice section detection is restarted.
At the beginning of each voice section the server computer 300 receives the position of its start frame, and it can synchronize with the client computer 100 on that basis. In this case, the start position of a voice section may instead be a value indicating which voice section it is, counted from the first voice section. When a start position is sent for every voice section, the server computer 300 need not erase the corresponding recognition feature quantities in step S34.
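The variant in which nothing is sent for non-voice frames and each voice section is announced by its start position can be sketched in Python as below. The function name and the message labels `"START"` and `"SIGNAL"` are hypothetical, and the sketch assumes for simplicity that none of the frames has been transmitted before.

```python
def messages_with_start_positions(frames):
    """Send nothing for non-voice frames; prefix every voice section
    with its start sample position so the receiver can resynchronize.

    frames -- list of (sample_pos, is_voice) in time order
    """
    out = []
    prev_voice = False
    for pos, is_voice in frames:
        if is_voice:
            if not prev_voice:
                # First frame of a new voice section: announce its position.
                out.append(("START", pos))
            out.append(("SIGNAL", pos))
        prev_voice = is_voice
    return out


frames = [(0, False), (160, True), (320, True), (480, False), (640, True)]
msgs = messages_with_start_positions(frames)
```

The trade-off relative to sending a UV flag for every frame is that synchronization now depends on the explicit start position rather than on counting messages, in exchange for sending nothing at all during non-voice sections.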

With either method, the server computer 300 does not need to perform voice recognition processing for this section. The processing load of voice recognition is reduced accordingly, and misrecognition caused by performing voice recognition on superfluous sections, for example producing a spurious meaningful recognition result from noise, can be prevented.
Furthermore, to obtain the above effects, the client computer 100 and the server computer 300 must retain storage areas for data that has already been processed, but by periodically releasing the storage areas that are no longer needed, the method can be executed without increasing the storage capacity used on either computer.

Modifications
The description so far has covered a configuration in which recognition feature quantities are extracted and voice recognition is performed only in the server computer 300. However, the present invention can also be applied to a distributed speech recognition method, such as that shown in Non-Patent Document 1, in which at least part of the extraction of recognition feature quantities is performed in the client computer 100. Only the points that differ from the embodiment described above are mainly explained below. In this case as well, the server computer 300 has no function for detecting the start of a voice section; it detects the end of a voice section in the course of voice recognition and transmits that position to the client computer 100. The invention is nevertheless also applicable when a voice section start detection or voice section end detection function is implemented before or inside the voice recognition unit 350 of the server computer 300. Because FIGS. 1 to 3 also serve to illustrate this modification, the parts that differ in the modification are shown in parentheses or by broken lines.

In the client computer 100, the voice detection unit 140 reads the detected feature quantities and detects voice sections, and the signal transmission management unit 150 checks, for each sample or each frame of the input signal, whether the recognition feature quantity A extracted from the input signal has or has not yet been transmitted to the server computer 300 (FIG. 2, step S4). If it has not been transmitted, the recognition feature quantity A extraction unit 180, shown by the broken line in FIG. 1, extracts recognition feature quantity A, for example cepstrum and power, from the input signal of the voice section detected by the voice detection unit 140 (step S51, between steps S8 and S10 in FIG. 2), and the recognition feature quantity transmission unit 161 transmits recognition feature quantity A to the server computer 300. If recognition feature quantity A has already been transmitted, the detection result information transmission unit 160 transmits the detection result information to the server computer 300. The start position of the voice section is transmitted in the same way as in the preceding case.

In the server computer 300, when the recognition feature quantity receiving unit 311 receives recognition feature quantity A transmitted from the client computer 100, the recognition feature quantity B extraction unit 320 extracts the group of recognition feature quantities B finally used for voice recognition, such as cepstrum, delta cepstrum, power, and delta power, and stores it in the recognition feature quantity storage unit 340 via the recognition feature quantity management unit 330. For example, if recognition feature quantity A consists of cepstrum and power, the recognition feature quantity B extraction unit 320 extracts delta cepstrum and delta power from them to obtain the group of recognition feature quantities B. Recognition feature quantity A may also be used for voice recognition as it is; in that case, the recognition feature quantity A received from the client computer 100 is stored sequentially in the recognition feature quantity storage unit 340 via the recognition feature quantity management unit 330. That is, in FIG. 3, if the received signal is not detection result information in step S32, the process moves immediately to step S35, as shown by the broken line.
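The server-side step of deriving delta features from the received recognition feature quantity A can be illustrated with the following Python sketch. The specification does not say how delta cepstrum and delta power are computed; the simple frame-to-frame difference used here is an assumption chosen for brevity (practical front ends often use a regression over several neighboring frames instead), and the function name `append_deltas` is hypothetical.

```python
def append_deltas(frames_a):
    """Build feature vectors B = A + delta(A) from a sequence of
    per-frame vectors A (e.g. cepstral coefficients followed by power).

    The delta is taken as the difference from the previous frame;
    the first frame, which has no predecessor, gets zero deltas.
    """
    out = []
    prev = None
    for feat in frames_a:
        if prev is None:
            delta = [0.0] * len(feat)
        else:
            delta = [cur - old for cur, old in zip(feat, prev)]
        out.append(list(feat) + delta)
        prev = feat
    return out


# Two frames of a toy feature A: [cepstral coefficient, power].
features_b = append_deltas([[1.0, 2.0], [1.5, 1.0]])
```

Computing the deltas on the server means the client transmits only the static features, which fits the aim of keeping client-side processing and traffic small.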

The end of a voice section in the server computer 300 may also be detected from the received voice section signal of the input signal receiving unit 311 by providing a section end detection unit 38, as shown by the broken line in FIG. 1.
The client device and the server device shown in FIG. 1 can also be configured without using a computer. When they are implemented by a computer, a client device processing program for causing the computer to execute each step of the processing method shown in FIG. 2, for example, or a server device processing program for causing the computer to execute each step of the processing method shown in FIG. 3, may be installed on the computer from a recording medium such as a CD-ROM, a magnetic disk, or a semiconductor storage device, or downloaded via a communication line, and then executed by the computer.

FIG. 1 is a block diagram showing an example system configuration to which the client-server speech recognition method of the present invention is applied, and example functional configurations of its client device and server device.
FIG. 2 is a flowchart showing an example of the processing procedure of the client device.
FIG. 3 is a flowchart showing an example of the processing procedure of the server device.
FIG. 4 is a diagram for explaining the state in which voice detection has just started in the client computer in an embodiment of the present invention.
FIG. 5 is a diagram for explaining the state in which voice recognition proceeds and the recognition feature quantities in the storage unit are erased at regular intervals in an embodiment of the present invention.
FIG. 6 is a diagram for explaining the state in which the server computer has detected the end of voice in an embodiment of the present invention.
FIG. 7 is a diagram for explaining the state in which the client computer has restarted voice detection in an embodiment of the present invention.

Claims

In a client-server speech recognition method for transmitting an input signal input to a client device to a server device connected to the client device via a network, performing speech recognition on the server device, and transmitting the result to the client device.
The client device extracts a detection feature amount used for detection of a voice section from an input signal,
Detecting a voice section using the detected feature amount,
Transmitting the voice interval signal on the input signal to the server device;
The server device extracts a recognition feature amount used for speech recognition from the received signal of the speech section,
Speech recognition is performed using the above recognition features,
Based on the voice recognition process or detecting the end of the voice section from the received voice section signal,
When the end of the voice section is detected, a voice end indicating the end position is transmitted to the client device,
The client / server speech recognition method according to claim 1, wherein when the end of speech is received, the client device stops detecting the speech interval, and moves from the end position indicated by the end of speech to detection of the next speech interval.

The above-described client-server speech recognition method for transmitting an input signal input to a client device to a server device connected to the client device via a network, performing speech recognition at the server device, and transmitting the result to the client device. A processing method of a client device,
Extracting detection features used to detect speech segments from the input signal,
Detects whether it is a speech segment or a non-speech segment using the detected feature amount
Transmitting the voice interval signal on the input signal to the server device;
A client apparatus processing method comprising: receiving an end signal indicating a voice section end position from the server apparatus, temporarily stopping detection of the voice section or the non-voice section, and then restarting from the end position.

Storing the extracted detected feature quantity in the detected feature quantity storage unit so that the position on the input signal can be known;
The detection of the voice section or the non-voice section is performed by reading the detected feature quantity from the detected feature quantity storage unit,
The signal of the voice section is transmitted to the server device so that the position on the input signal can be known,
The detection feature amount is read from the position after the position on the input signal indicated by the voice section end signal,
After the restart, it is determined whether or not the signal of the voice section of the corresponding input signal has been transmitted to the server apparatus, and if it is transmitted, information indicating the detection result of the voice section is transmitted to the server apparatus. The client device processing method according to claim 1.

The detection feature value is stored in the feature value storage unit based on the voice detection start position on the input signal,
When transmitting the speech section signal to the server device, the start position of at least the first speech section in the input signal,
Send to the server device based on the voice detection start position,
4. The client device processing method according to claim 3, wherein the voice section end signal is a voice section end position with the voice detection start position as a base point.

4. The client apparatus processing method according to claim 3, wherein transmission of the start position of the voice section is performed for each voice section, and information indicating the detection result of the voice section is transmitted only for the voice section.

6. The client apparatus processing method according to claim 3, wherein when the voice section end signal is received, the detected feature quantity before the voice section end signal in the detected feature quantity storage unit is deleted.

The received feature amount before the speech recognition progress position in the detected feature amount storage unit is deleted when the speech recognition progress position is received from the server device. Client device processing method.

8. The client according to claim 2, wherein a feature amount used for speech recognition is extracted from the speech section signal, and the recognition feature amount is transmitted to the server device as the speech section signal. Device processing method.

A processing method for a server device in a client-server speech recognition method in which an input signal input to a client device is transmitted to a server device connected to the client device via a network, speech recognition is performed at the server device, and the result is transmitted to the client device, the processing method comprising:
extracting a recognition feature amount used for speech recognition from the voice section signal received from the client device;
performing speech recognition using the recognition feature amount;
detecting the end of the voice section; and
when the end of the voice section is detected, transmitting a voice end indicating the end position to the client device.
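
The server-side steps above can be sketched as follows. This is a toy stand-in under stated assumptions, not the claimed recognizer: per-frame energy replaces the real recognition feature amount, and a run of low-energy frames replaces end detection inside the recognition process. All function names, the message format, and the parameter values are hypothetical.

```python
def frame_energy(samples, frame_len):
    """Toy feature extraction: mean squared amplitude of each full frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_voice_end(features, threshold, min_silent_frames):
    """Return the first frame of a silent run that follows voice, or None."""
    seen_voice = False
    silent_run = 0
    for i, energy in enumerate(features):
        if energy >= threshold:
            seen_voice = True
            silent_run = 0
        elif seen_voice:
            silent_run += 1
            if silent_run == min_silent_frames:
                return i - min_silent_frames + 1
    return None


def process_section(samples, section_start, frame_len=4,
                    threshold=0.1, min_silent_frames=2):
    """Extract features, look for the voice end, and build the reply message."""
    features = frame_energy(samples, frame_len)
    end_frame = find_voice_end(features, threshold, min_silent_frames)
    if end_frame is None:
        return None  # no end detected yet; nothing to send
    # Position is expressed in samples, relative to the client's base point.
    return {"type": "voice_end",
            "position": section_start + end_frame * frame_len}


# Hypothetical section: 8 loud samples followed by 12 silent ones,
# starting at sample position 100.
message = process_section([1.0] * 8 + [0.0] * 12, section_start=100)
```

The reply carries only a position, which is what allows the client to resume voice section detection from its own stored detection feature amounts.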

10. The server device processing method according to claim 9, wherein
the start position of the voice section is received with the signal of the first voice section, and the recognition feature amount is stored in the recognition feature amount storage unit with that start position as a base point,
the recognition feature amount is read from the recognition feature amount storage unit to perform the speech recognition,
it is checked whether a signal received from the client device is detection result information, and
if it is detection result information, the detection result information is added to the corresponding recognition feature amount in the recognition feature amount storage unit, or alternatively the detection result information is added if it indicates voice and the corresponding recognition feature amount is deleted if it does not.
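
The handling of detection result information in this claim can be sketched as a small position-keyed store. The sketch reads the claim as offering two variants: always attach the result, or attach it for voice and delete the stored feature amount otherwise. That reading of the deletion condition is an interpretation of the translated text, and the class and method names are hypothetical.

```python
class RecognitionFeatureStore:
    """Recognition feature amounts keyed by position (illustrative only)."""

    def __init__(self):
        # position -> {"feature": ..., "detection": None, True, or False}
        self._entries = {}

    def store(self, position, feature):
        self._entries[position] = {"feature": feature, "detection": None}

    def apply_detection_result(self, position, is_voice, delete_non_voice=False):
        """Attach a detection result, or delete the entry if it is non-voice."""
        entry = self._entries.get(position)
        if entry is None:
            return  # nothing stored at this position
        if delete_non_voice and not is_voice:
            del self._entries[position]  # assumed variant: drop non-voice frames
        else:
            entry["detection"] = is_voice

    def positions(self):
        return sorted(self._entries)

    def detection_at(self, position):
        return self._entries[position]["detection"]


features = RecognitionFeatureStore()
for pos in (0, 160, 320):
    features.store(pos, [0.0])  # placeholder feature vector

features.apply_detection_result(0, is_voice=True)
features.apply_detection_result(160, is_voice=False, delete_non_voice=True)
features.apply_detection_result(320, is_voice=False)
```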

10. The server device processing method according to claim 9, wherein
the start position of the voice section is received with the signal of each voice section, and the recognition feature amount is stored in the recognition feature amount storage unit with each start position as a base point,
the recognition feature amount is read from the recognition feature amount storage unit to perform the speech recognition,
it is checked whether a signal received from the client device is detection result information, and
if it is detection result information, the detection result information is added to the corresponding recognition feature amount in the recognition feature amount storage unit.

12. The server device processing method according to claim 10, wherein the voice end is a position with the start position as a base point, and simultaneously with transmission of the voice end position to the client device, the recognition feature amount before the voice end position stored in the recognition feature amount storage unit is deleted.

11. The server device processing method in any one of -12, wherein a recognition progress position, with the start position as a base point, is transmitted to the client device at regular intervals, and the recognition feature amount before the recognition progress position in the recognition feature amount storage unit is deleted.

The server device processing method according to any one of claims 9 to 13, wherein the received voice section signal is a recognition feature amount extracted from that signal, and either another recognition feature amount is extracted using the received recognition feature amount or processing proceeds to the next step without extraction.

A client device in a client-server speech recognition system in which a voice signal input to the client device is transmitted to a server device connected to the client device via a network, speech recognition is performed at the server device, and the result is transmitted to the client device, the client device comprising:
a detection feature amount extraction unit that extracts, from an input signal, a detection feature amount used for detection of a voice section;
a detection feature amount storage unit that stores the detection feature amount extracted by the detection feature amount extraction unit;
a detection feature amount management unit that manages storage and reading of the detection feature amount with respect to the detection feature amount storage unit;
a voice detection unit that detects a voice section using the detection feature amount read from the detection feature amount storage unit via the detection feature amount management unit;
an input signal transmission unit that transmits, to the server device, the signal of the voice section detected by the voice detection unit in the input signal;
a position signal receiving unit that receives the voice end position transmitted from the server device, causes the voice detection unit to interrupt voice detection, and then notifies the voice detection unit of the position at which detection of voice sections is to restart; and
a recognition result receiving unit that receives the recognition result transmitted from the server device and outputs the recognition result to a speech recognition result output device.
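
The interaction between the position signal receiving unit and the voice detection unit can be sketched as a detector with a movable read cursor. This is a simplified illustration under assumptions: the class `VoiceDetector`, its threshold rule, and the frame-index positions are hypothetical, and the stored detection feature amounts are represented as a plain list.

```python
class VoiceDetector:
    """Scans stored detection feature amounts for the next voice section start."""

    def __init__(self, features, threshold):
        self.features = features    # detection feature amounts by frame position
        self.threshold = threshold  # assumed rule: voice if feature >= threshold
        self.cursor = 0             # next position to examine

    def next_section_start(self):
        """Return the next position classified as voice, or None if none remain."""
        while self.cursor < len(self.features):
            position = self.cursor
            self.cursor += 1
            if self.features[position] >= self.threshold:
                return position
        return None

    def restart_at(self, position):
        """Interrupt the current scan and resume from the notified position."""
        self.cursor = position


# Hypothetical stored features: two utterances separated by silence.
detector = VoiceDetector([0, 0, 5, 5, 5, 0, 0, 6, 6], threshold=3)
first_start = detector.next_section_start()   # start of the first voice section
detector.restart_at(5)                        # server reported voice end at position 5
second_start = detector.next_section_start()  # detection resumes from stored features
```

Because the detector re-reads stored feature amounts from the notified position, no input that arrived while the server was still recognizing the first utterance is skipped.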

The client device according to claim 15, further comprising a recognition feature amount extraction unit that extracts a recognition feature amount from the signal of a voice section detected by the voice detection unit,
wherein the input signal transmission unit is a transmission unit that transmits the recognition feature amount as the signal of the voice section.

A server device in a client-server speech recognition system in which a voice signal input to a client device is transmitted to the server device connected to the client device via a network, speech recognition is performed at the server device, and the result is transmitted to the client device, the server device comprising:
an input signal receiving unit that receives the signal of a voice section transmitted from the client device;
a recognition feature amount extraction unit that extracts a recognition feature amount used for speech recognition from the signal of the voice section received by the input signal receiving unit;
a recognition feature amount storage unit that stores the recognition feature amount extracted by the recognition feature amount extraction unit;
a recognition feature amount management unit that manages storage and reading of the recognition feature amount with respect to the recognition feature amount storage unit;
a speech recognition unit that performs speech recognition using the recognition feature amount read from the recognition feature amount storage unit via the recognition feature amount management unit;
a section end detection unit that detects the end of a voice section in the speech recognition process of the speech recognition unit, or detects the end position of the voice section from the signal of the voice section; and
a position signal transmission unit that transmits the end position of the voice section detected by the section end detection unit to the client device.

18. The server device according to claim 17, wherein the input signal receiving unit is a receiving unit that receives a recognition feature amount as the signal of the voice section, and
the recognition feature amount extraction unit extracts another recognition feature amount based on the received recognition feature amount.

A program for causing a computer to execute each process of the client device processing method according to any one of claims 2 to 8.

A program for causing a computer to execute each process of the server device processing method according to any one of claims 9 to 14.

A computer-readable recording medium on which the program according to claim 19 or 20 is recorded.