JP2017068061A

JP2017068061A - Communication terminal and voice recognition system

Info

Publication number: JP2017068061A
Application number: JP2015193953A
Authority: JP
Inventors: 隆行崎田; Takayuki Sakita
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2015-09-30
Filing date: 2015-09-30
Publication date: 2017-04-06
Anticipated expiration: 2035-09-30
Also published as: JP6549009B2

Abstract

PROBLEM TO BE SOLVED: To provide a communication terminal and a voice recognition system that reduce the communication load on a voice recognition processing server apparatus that performs voice recognition processing on collected voice data.

SOLUTION: A communication terminal transmits voice data uttered by a user to a voice recognition processing server apparatus that performs voice recognition processing, and receives a voice recognition processing result for the voice data. The communication terminal comprises: a volume measuring unit 112 that measures the volume of the voice data acquired by a voice input unit; and a voice data output control unit 113 that transmits the voice data to the voice recognition processing server apparatus. When the volume of sequentially input voice data is lower than a predetermined threshold and the recognition processing state of the voice recognition processing on the voice data, as received from the voice recognition processing server apparatus, is an unrecognized state indicating standby, the voice data output control unit 113 performs control so that the silent voice data is not transmitted to the voice recognition processing server apparatus.

SELECTED DRAWING: Figure 2

Description

Embodiments of the present invention relate to a voice recognition system in which a voice recognition processing server apparatus performs voice recognition processing on a user's voice collected by a communication terminal and provides the voice recognition result to the communication terminal.

Technologies that recognize a voice uttered by a user and convert it into text data are known. Because voice recognition processing imposes a high processing load, there are server/client voice recognition systems in which the client transmits voice data and a server apparatus performs the voice recognition processing.

Japanese Patent No. 4197271

Provided are a communication terminal and a voice recognition system capable of reducing the communication load on a voice recognition processing server apparatus that performs voice recognition processing on voice data collected by the communication terminal.

The communication terminal of the embodiment transmits voice data uttered by a user to a voice recognition processing server apparatus that performs voice recognition processing, and receives a voice recognition processing result for the voice data from the voice recognition processing server apparatus. The communication terminal includes a volume measuring unit that measures the volume of the voice data acquired by a voice input unit, and a voice data output control unit that transmits the voice data to the voice recognition processing server apparatus. When the volume of the sequentially input voice data is less than a predetermined threshold indicating silence, and the recognition processing state of the voice recognition processing for the voice data, as received from the voice recognition processing server apparatus, is the unrecognized state indicating standby, the voice data output control unit performs control so that the silent voice data is not transmitted to the voice recognition processing server apparatus.

FIG. 1 is a diagram showing the configuration of the voice recognition system of the first embodiment.
FIG. 2 is a diagram showing the functional blocks of the communication terminal of the first embodiment.
FIG. 3 is a diagram for explaining the voice recognition processing of the first embodiment.
FIG. 4 is a diagram showing the processing flow of the voice recognition processing server apparatus of the first embodiment.
FIG. 5 is a diagram for explaining the voice data output control of the communication terminal of the first embodiment.
FIG. 6 is a diagram showing the processing flow of the communication terminal of the first embodiment.
FIG. 7 is a diagram for explaining a modification of the voice data output control of the communication terminal of the first embodiment.
FIG. 8 is a diagram showing the processing flow of the communication terminal according to the modification shown in FIG. 7.

Embodiments are described below with reference to the drawings.

(First embodiment)
FIGS. 1 to 8 are diagrams showing the voice recognition system of the first embodiment. FIG. 1 is an overall configuration diagram of the voice recognition system. The voice recognition system includes a communication terminal 100 on the user side and a voice recognition processing server apparatus 300 (hereinafter referred to as the server apparatus 300) that performs voice recognition processing on the voice uttered by the user and collected (acquired) by the communication terminal.

The communication terminal 100 and the server apparatus 300 are connected via a wireless or wired communication network. Examples include communication networks such as the Internet (IP network) and networks for mobile devices such as PHS, 3G, 4G, and LTE. The network may also be a PSTN (public switched telephone network).

The communication terminal 100 is an information terminal device having a communication function. Examples include mobile terminals with call and communication functions, such as mobile phones and multi-function mobile phones, and mobile communication terminal devices such as a PDA (Personal Digital Assistant) with a communication function. The communication terminal 100 may also be an information processing terminal device with a communication function, such as a personal computer.

As shown in FIG. 1, the communication terminal 100 includes a CPU 110 that performs overall control, a storage unit 120, a communication unit 130 that controls communication with the server apparatus 300, a microphone (sound collecting device) 140, a speaker (audio output device) 150, a display unit 160 such as a liquid crystal display, and an operation unit 170 such as a touch panel or operation keys.

FIG. 2 is a functional block diagram of the communication terminal 100. The communication terminal 100 includes an A/D conversion unit 111 connected to the microphone 140, a volume measuring unit 112, a voice data output control unit 113, a recognition state confirmation unit 114, and a display control unit 115.

The A/D conversion unit 111 converts the analog voice signal output from the microphone 140 into digital data and generates voice data. The volume measuring unit 112 receives the voice data from the A/D conversion unit 111 and measures, from the voice data, the volume of the voice uttered by the user. The voice data output control unit 113 receives the voice data from the A/D conversion unit 111 together with the volume measurement result, and controls the output (transmission) of the generated voice data to the server apparatus 300. The recognition state confirmation unit 114 confirms (sets) the recognition state (processing state) of the voice recognition processing of the server apparatus 300. The display control unit 115 controls the display, on the display unit 160, of voice recognition result information received from the server apparatus 300, for example text data.

As shown in FIG. 1, the server apparatus 300 includes a CPU 310 that performs overall control, a storage unit 320, a communication unit 330 that controls communication with the communication terminal 100, and a voice recognition unit 340 that performs voice recognition processing and outputs a voice recognition result. The voice recognition unit 340 may be implemented in software, with the CPU 310 performing the voice recognition processing, or in hardware as a voice recognition control device (control circuit).

The voice recognition unit 340 performs voice recognition processing on the voice data transmitted from the communication terminal 100. Voice recognition processing performs acoustic analysis of the input voice data, matches it against an acoustic model and a language model, and converts it into text (character) data.

The acoustic model includes phoneme waveform samples and the text data corresponding to those waveform samples. The language model is data that expresses, as probabilities, how likely words are to occur in connection with one another. The acoustic model, the language model, and the other information required for voice recognition processing are stored in the storage unit 320.

The voice recognition processing of the voice recognition unit 340 can also include voice activity detection (VAD; hereinafter referred to as VAD processing), which distinguishes voice (voiced) from non-voice (silent) input and detects voiced sections. The voice recognition unit 340 can apply the acoustic model and related models to the voiced sections extracted by the VAD processing. Known methods can be applied as appropriate to the voice recognition processing of the present embodiment.

In the voice recognition system of the present embodiment, the resources for voice recognition processing of voice data are concentrated on the server apparatus 300 side. The communication terminal 100 therefore basically only collects and generates the voice data required for voice recognition and transmits it to the server apparatus 300; voice recognition processing, including VAD processing, is not performed on the communication terminal 100 side. This configuration reduces the processing load on the communication terminal 100.

FIG. 3 is a diagram for explaining the voice recognition processing that the server apparatus 300 performs on voice data collected by the communication terminal 100 of the present embodiment. As shown in FIG. 3, when an operation for starting voice recognition is performed (for example, launching a voice recognition application), the communication terminal 100 activates the microphone 140 and starts collecting the voice uttered by the user and generating voice data.

The voice collected by the microphone 140 is sequentially input to the A/D conversion unit 111 of the communication terminal 100. The A/D conversion unit 111 performs A/D conversion in real time at predetermined time intervals to generate voice packet data. The voice data output control unit 113 transmits the voice packet data to the server apparatus 300 sequentially and continuously in time series.
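As an illustration of the packetization described above, the following Python sketch splits a run of digitized samples into fixed-duration frames. The function name `packetize`, the 16 kHz sample rate, and the 20 ms interval are assumptions made for this example only; the embodiment specifies nothing beyond a predetermined time interval.

```python
from typing import List, Sequence


def packetize(samples: Sequence[int], sample_rate_hz: int, interval_ms: int) -> List[List[int]]:
    """Split digitized samples into consecutive frames of interval_ms each.

    The last frame may be shorter when the samples do not fill it.
    """
    frame_size = sample_rate_hz * interval_ms // 1000  # samples per frame
    if frame_size <= 0:
        raise ValueError("interval too short for the given sample rate")
    return [list(samples[i:i + frame_size]) for i in range(0, len(samples), frame_size)]


# Hypothetical input: 1000 samples at 16 kHz cut into 20 ms frames (320 samples each).
frames = packetize(list(range(1000)), sample_rate_hz=16000, interval_ms=20)
```

Each element of `frames` stands in for one voice packet that would be handed to the voice data output control unit 113 in time order.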

Starting from the timing at which the operation for starting voice recognition is performed, or the timing at which the microphone 140 starts collecting sound, the communication terminal 100 keeps transmitting the voice data collected through the microphone 140 sequentially, so that the server apparatus 300 receives it as streamed voice data until a condition for ending voice recognition is satisfied. The condition for ending voice recognition can be triggered by, for example, a user operation for ending voice recognition, or the absence of any voice recognition result from the server apparatus 300 for a predetermined time or longer.

When the server apparatus 300 receives voice data, it performs VAD processing, distinguishes voiced from silent input to detect voiced sections, and performs voice recognition processing on the voiced sections using the acoustic model and related models. The server apparatus 300 receives the voice data for "Today ... nice weather, isn't it" as voice packet data that is continuous in time series, in the order in which the user uttered it, performs voice recognition processing on each voice packet as it arrives, and sequentially converts the voice into text data.

The server apparatus 300 can start voice recognition processing, including VAD processing, when it receives the first voice packet data from the communication terminal 100, regardless of whether that data is voiced or silent. The voice recognition processing, once started, can be configured to end temporarily when silent voice data has been input continuously for a certain time. For example, when no voiced section is detected for a certain time (T), in other words when silence is detected continuously for the time (T), the voice recognition processing of the voice data continuously input from the communication terminal 100 is temporarily ended and the server apparatus 300 shifts to a standby state. When voiced voice data is detected after the continued silent section, the voice recognition processing can be started again.

FIG. 4 is a diagram showing the processing flow of the voice recognition processing of the server apparatus 300 of the present embodiment. As shown in FIG. 4, when voice data is received (YES in S301), the voice recognition unit 340 starts voice recognition processing and transmits (outputs) an SOS (Start of Speech) signal to the communication terminal 100 (S302). The SOS signal is recognition state information indicating the recognition state of the voice recognition processing, and indicates that the recognition state is "recognition in progress".

The voice recognition unit 340 performs the voice recognition processing described above (S303) and sequentially transmits the voice recognition processing results for the voice data to the communication terminal 100. While the voice recognition processing is running after the SOS signal has been output, the voice recognition unit 340 determines whether a recognition processing end condition is satisfied (S304). When it determines that the end condition is satisfied (YES in S304), it ends the running voice recognition processing (shifts to standby) and transmits (outputs) to the communication terminal 100 an EOS (End of Speech) signal indicating the end of the one cycle of voice recognition processing that began with the SOS signal (S305). The EOS signal is recognition state information indicating the recognition state of the voice recognition processing, and indicates that the recognition state is "unrecognized" (standby). The recognition processing end condition in step S304 can be whether the duration of a silent section during the voice recognition processing has exceeded the predetermined time T.
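The SOS/EOS cycle of steps S301 to S305 can be pictured as a small state machine. The Python sketch below is an illustration under assumptions, not the server implementation: the class `RecognizerSession`, its field names, and the millisecond bookkeeping are hypothetical, the actual recognition step (S303) is omitted, and a VAD decision per packet is supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RecognizerSession:
    """Hypothetical server-side session that emits SOS/EOS signals."""
    silence_timeout_ms: int          # the predetermined time T
    frame_ms: int                    # audio duration carried by one packet
    active: bool = False             # True while a recognition cycle is running
    started_once: bool = False       # True after the first packet has been seen
    silence_run_ms: int = 0          # length of the current run of silence
    signals: List[str] = field(default_factory=list)

    def on_packet(self, is_voiced: bool) -> None:
        if not self.active:
            # The first packet starts a cycle whether voiced or silent;
            # after standby, only voiced data restarts recognition.
            if not (is_voiced or not self.started_once):
                return
            self.active = True
            self.started_once = True
            self.silence_run_ms = 0
            self.signals.append("SOS")           # S302
        # S303 (recognition of the packet) would run here.
        self.silence_run_ms = 0 if is_voiced else self.silence_run_ms + self.frame_ms
        if self.silence_run_ms > self.silence_timeout_ms:   # S304
            self.active = False
            self.signals.append("EOS")           # S305


session = RecognizerSession(silence_timeout_ms=300, frame_ms=100)
for voiced in [True, False, False, True, False, False, False, False, True]:
    session.on_packet(voiced)
```

In this run the two-packet pause stays below T and does not end the cycle, while the later four-packet pause exceeds T and produces an EOS before the final voiced packet opens a new cycle.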

The utterance "Today ... nice weather, isn't it" in FIG. 3 contains silence, indicated by "...". Because the duration t1 of that silence is shorter than the predetermined time T used to decide whether to end the started voice recognition processing, the voice recognition unit 340 does not end the processing and continues the same cycle. In other words, so that "Today ... nice weather, isn't it" is handled in one cycle of voice recognition processing, the silent period t1 between phrases can be sampled in advance and the predetermined time T set longer than t1. The converted text data can be transmitted to the communication terminal 100 several times during one cycle of voice recognition processing, for example per converted character or phrase, or transmitted in a batch at the end of the cycle.

The voice recognition processing of the present embodiment thus has two statuses, "recognition in progress" and "unrecognized". The section between a paired SOS signal and EOS signal indicates that voice recognition processing is running, and the section from an EOS signal to the SOS signal of the next cycle indicates that voice recognition processing is on standby (see FIG. 3). When the recognition state confirmation unit 114 of the communication terminal 100 has received an SOS signal but not yet the following EOS signal, it updates the status of the voice recognition processing of the server apparatus 300 to "recognition in progress"; when it has received an EOS signal but not yet the following SOS signal, it updates the status to "unrecognized". The recognition state confirmation unit 114 outputs this status update information to the voice data output control unit 113.
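A minimal Python sketch of this status tracking follows. The names `RecognitionStatus` and `RecognitionStateChecker` are hypothetical, signals are modeled as the plain strings "SOS" and "EOS", and the initial status is left to the caller because this passage does not fix it.

```python
from enum import Enum


class RecognitionStatus(Enum):
    IN_PROGRESS = "recognition in progress"
    UNRECOGNIZED = "unrecognized"  # server is on standby


class RecognitionStateChecker:
    """Hypothetical stand-in for the recognition state confirmation unit 114."""

    def __init__(self, initial: RecognitionStatus) -> None:
        self.status = initial

    def on_signal(self, signal: str) -> RecognitionStatus:
        # SOS opens a recognition cycle, EOS closes it.
        if signal == "SOS":
            self.status = RecognitionStatus.IN_PROGRESS
        elif signal == "EOS":
            self.status = RecognitionStatus.UNRECOGNIZED
        else:
            raise ValueError(f"unknown signal: {signal!r}")
        return self.status


checker = RecognitionStateChecker(RecognitionStatus.UNRECOGNIZED)
history = [checker.on_signal(s) for s in ["SOS", "EOS", "SOS"]]
```

The value held in `checker.status` corresponds to the status update information that would be passed on to the voice data output control unit 113.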

The voice recognition unit 340 of this example performs voice recognition processing on the voice data transmitted continuously and sequentially from the communication terminal 100. It starts the processing when voice data is received and, when silence continues for the predetermined time T during the processing, temporarily ends the running voice recognition processing, waits until the next voiced input arrives, and starts voice recognition processing again when voiced data is input. This configuration suppresses needless voice recognition processing and reduces the processing load on the server apparatus 300.

As shown in FIG. 3, the user's voice collected by the microphone 140 contains both voiced and silent portions, and the communication terminal 100 continuously transmits voice packet data, divided at predetermined time intervals, to the server apparatus 300 even when the voice data contains silence. In the example of FIG. 3, suppose the user utters "Today ... nice weather, isn't it", where "..." indicates silence. The communication terminal 100 does not split this voice data at the silence; the silence indicated by "..." is transmitted to the server apparatus 300 as voice data, following the voiced data. This is because, in order to concentrate voice recognition resources on the server apparatus 300 side and reduce the processing load on the communication terminal 100, processing such as VAD is not performed on the voice data on the communication terminal 100 side.

As a result, as shown in FIG. 3, the communication terminal 100 would keep transmitting silent voice data to the server apparatus 300 even after one cycle of voice recognition processing on the server apparatus 300 side has ended, increasing the communication traffic (amount of communication data) to the server apparatus 300 and burdening the network. In the present embodiment, therefore, the processing state of the voice recognition processing of the server apparatus 300 is confirmed on the basis of the SOS and EOS signals, and when the voice recognition processing is on standby, control is performed so that silent voice data is not transmitted to the server apparatus 300.

FIG. 5 is a diagram for explaining the voice data output control of the communication terminal 100 of the present embodiment. As shown in FIG. 5, the volume measuring unit 112 measures the volume of the voice data and performs a volume check that determines whether the voice input through the microphone 140 is silent or voiced. For example, the input can be judged voiced when the measured volume is equal to or higher than a predetermined threshold, and silent when the volume is below the threshold. The volume check result is output to the voice data output control unit 113.
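The volume check can be sketched as follows. The embodiment does not say how volume is computed, so the root-mean-square (RMS) amplitude used here is an assumption, as are the function names `rms_volume` and `is_voiced` and the example threshold.

```python
import math
from typing import Sequence


def rms_volume(samples: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame (assumed volume measure)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_voiced(samples: Sequence[int], threshold: float) -> bool:
    """Voiced when the volume is at or above the threshold, silent below it."""
    return rms_volume(samples) >= threshold


# Hypothetical frames and threshold for illustration.
loud_frame = [100, -100, 100, -100]
quiet_frame = [1, -1, 0, 1]
results = (is_voiced(loud_frame, threshold=50.0), is_voiced(quiet_frame, threshold=50.0))
```

Comparing one number per frame against a threshold is far cheaper than VAD processing, which fits the stated aim of keeping the terminal's processing load low.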

When the volume check judges the input to be silent, the voice data output control unit 113 determines, on the basis of the status update information input from the recognition state confirmation unit 114, whether the state of the voice recognition processing on the server apparatus 300 side is "unrecognized". When the state is "unrecognized", the voice data output control unit 113 performs control so that the silent voice data is not transmitted.

In other words, on the basis of an EOS signal received from the server apparatus 300 after an SOS signal, the voice data output control unit 113 prohibits the generation of voice data and its transmission to the server apparatus 300 until the voice data becomes voiced, that is, from reception of the EOS signal until voice data with a volume equal to or higher than the predetermined threshold is input. It thereby performs voice data output control so that no voice data is transmitted to the server apparatus 300 during that period.

FIG. 6 is a diagram showing the processing flow of the voice data output control of the communication terminal 100 of the present embodiment. When an operation for starting voice recognition is performed (S101), the communication terminal 100 activates the microphone 140 and performs the voice data generation process and the volume check process (S102). In step S101, communication processing for establishing a communication session with the server apparatus 300 can also be performed.

Following the operation for starting voice recognition, the communication terminal 100 starts the process of updating the recognition state information received from the server apparatus 300 (S103). The update process runs separately from, and in parallel with, other processes such as the voice data generation process, and is performed each time an SOS signal or EOS signal is received, until the condition for ending voice recognition on the communication terminal 100 side is satisfied.

The communication terminal 100 measures the volume of the generated voice data and determines whether the voice input through the microphone 140 is silent or voiced (S104). When the measured volume is determined to be equal to or higher than the predetermined threshold (voiced), the communication terminal 100 performs the voice data transmission process of transmitting the voice data to the server apparatus 300 (S105).

When, on the other hand, the volume is determined in step S104 to be below the threshold (silent), the communication terminal 100 proceeds to step S106 and determines, on the basis of the recognition state information, whether the voice recognition processing on the server apparatus 300 side is in the "recognition in progress" state. If it is, the communication terminal 100 proceeds to step S105 and performs the voice data transmission process of transmitting the voice data to the server apparatus 300. If it is not (that is, the state is "unrecognized"), the communication terminal 100 skips step S105 so that the silent voice data is not transmitted.
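The branch through steps S104 to S106 reduces to a single decision per packet. The following Python sketch states it directly; the function name `should_transmit` and its parameters are hypothetical, and the volume and server status are assumed to be supplied by the volume measuring unit 112 and the recognition state confirmation unit 114.

```python
def should_transmit(volume: float, threshold: float, server_recognizing: bool) -> bool:
    """Decide whether one voice packet is sent to the server (S104 to S106)."""
    if volume >= threshold:
        # S104: voiced data is always transmitted (S105).
        return True
    # S106: silent data is transmitted only while recognition is in progress.
    return server_recognizing


# All four combinations of (voiced?, recognition in progress?).
decisions = {
    ("voiced", "in progress"): should_transmit(80.0, 50.0, True),
    ("voiced", "unrecognized"): should_transmit(80.0, 50.0, False),
    ("silent", "in progress"): should_transmit(10.0, 50.0, True),
    ("silent", "unrecognized"): should_transmit(10.0, 50.0, False),
}
```

Only the last combination suppresses transmission, which is why short pauses inside an utterance still reach the server and let it measure the silence against T.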

When the communication terminal 100 receives a voice recognition result for the voice data transmitted to the server apparatus 300 (YES in S107), it performs display control to display the voice recognition result on the display unit 160 (S108). The communication terminal 100 repeats steps S104 to S108 until the condition for ending voice recognition is satisfied (NO in S109). When the condition is satisfied, for example when an operation for closing the launched voice recognition application is performed (YES in S109), the communication terminal 100 ends the process shown in FIG. 6.

According to the present embodiment, the processing capacity of the communication terminal 100 is not consumed by voice recognition processing, including VAD processing, so the processing load on the communication terminal 100 can be reduced. In addition, because unnecessary voice data is not transmitted to the server apparatus 300, the communication traffic (amount of communication data) to the server apparatus 300 can be reduced.

Next, a modification of the present embodiment is described. FIG. 7 is a diagram for explaining a modification of the voice data output control of the communication terminal 100, and FIG. 8 is a diagram showing the processing flow of the communication terminal 100 according to this modification.

In this modification, as shown in FIG. 7, control is performed so that silent voice data arising between the operation for starting voice recognition and the first voiced input is not transmitted to the server device 300. In the voice data output control shown in FIGS. 5 and 6, voice data was transmitted to the server device 300 at the timing when the operation for starting voice recognition was performed, or at the timing when the microphone 140 started collecting sound.

Consequently, for example, once an SOS signal has been received from the server device 300 after the operation for starting voice recognition, voice data is transmitted to the server device 300 even if it is silent (NO in step S104 to YES in step S106 of FIG. 6).

Therefore, in this modification, after the operation for starting voice recognition, that is, from when the microphone 140 starts the voice data acquisition processing until voice data with a volume equal to or higher than the predetermined threshold (voiced voice data) is first input, control is performed so that silent voice data collected by the microphone 140 is not transmitted to the server device 300. In addition to the voice data output control shown in FIGS. 5 and 6 described above, this further reduces the communication traffic (amount of communication data) with the server device 300.

First, at the start of the recognition state information update processing in step S103 of FIG. 8, the recognition state information is initialized to "unrecognized". The period from the operation for starting voice recognition until the first SOS signal is received is treated as "unrecognized". With this configuration, as shown in FIG. 7, silent voice data can be kept from being transmitted to the server device 300 regardless of whether an SOS signal has been received.
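One way to hold the recognition state information described above is a small tracker that starts in "unrecognized" and changes only when a signal arrives from the server. This is a minimal sketch: the class name, the string values `"SOS"` and `"EOS"`, and the representation of the state as an enum are assumptions. The EOS handling follows the claims, which describe the EOS signal as indicating that recognition ends after silence has continued for a predetermined time.

```python
from enum import Enum


class RecognitionState(Enum):
    UNRECOGNIZED = "unrecognized"   # standby
    RECOGNIZING = "recognizing"     # recognition processing in progress


class RecognitionStateTracker:
    """Hypothetical holder for the recognition state information."""

    def __init__(self) -> None:
        # Modification of FIG. 8: initialize to "unrecognized" so that the
        # period before the first SOS signal counts as standby.
        self.state = RecognitionState.UNRECOGNIZED

    def on_signal(self, signal: str) -> None:
        """Update the state from a signal name received from the server."""
        if signal == "SOS":
            self.state = RecognitionState.RECOGNIZING
        elif signal == "EOS":
            self.state = RecognitionState.UNRECOGNIZED
        # Any other signal leaves the state unchanged in this sketch.
```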

Next, in the example of FIG. 8, unlike steps S104 and S106 of FIG. 6, when voice data is to be transmitted for the first time after the operation for starting voice recognition, the communication terminal 100 determines, based on the recognition state information, whether the voice recognition processing on the server device 300 side is "recognition processing in progress" (S104A). When the state is determined to be "unrecognized", the communication terminal 100 measures the volume of the generated voice data and determines whether the voice input through the microphone 140 is silent or voiced (S106A). If the measured volume is determined to be less than the predetermined threshold (silence), the communication terminal 100 skips step S105 and performs control so that the silent voice data is not transmitted to the server device 300.
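This passage does not state how the volume in S106A is computed. As one plausible illustration, the sketch below takes the root-mean-square amplitude of a frame and compares it with the threshold. The 16-bit little-endian PCM frame format and the function names `rms_volume` and `is_silent` are assumptions made for the example.

```python
import math
import struct


def rms_volume(pcm16: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian PCM frame."""
    count = len(pcm16) // 2
    if count == 0:
        return 0.0
    # Ignore a trailing odd byte, if any, and unpack signed 16-bit samples.
    samples = struct.unpack(f"<{count}h", pcm16[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def is_silent(pcm16: bytes, threshold: float) -> bool:
    """True when the frame's volume is below the threshold (silence)."""
    return rms_volume(pcm16) < threshold
```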

In terms of the example of FIG. 7, when voice data is to be transmitted for the first time after the operation for starting voice recognition, the status of the voice recognition processing has been initialized to "unrecognized", so the voice data output control unit 113 does not transmit the voice data to the server device 300. As a result, the server device 300 does not output an SOS signal.

Then, when voiced voice data is input while no voice data has yet been transmitted since the operation for starting voice recognition, the voice data output control unit 113 transmits the voice data to the server device 300 even though the status of the voice recognition processing is "unrecognized" (NO in S104A to YES in S106A). On receiving the voiced voice data, the server device 300 transmits an SOS signal to the communication terminal 100, and the status of the voice recognition processing is updated to "recognition processing in progress".

On the other hand, if it is determined in step S104A that the voice recognition processing on the server device 300 side is "recognition processing in progress", the voice data output control unit 113 performs the voice data transmission processing, transmitting the voice data to the server device 300 as is even if it is silent (S105). The remaining processing is the same as that described with reference to FIG. 6, so the same reference signs are used and the description is omitted.
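The flow of FIG. 8 can be traced frame by frame with a toy simulation. The sketch below is not the embodiment's implementation: the function name `simulate_fig8`, the idea that the server answers before the next frame arrives, and the rule that the toy server issues EOS after a fixed number of consecutive silent frames (standing in for the "predetermined time" of the claims) are all assumptions.

```python
from typing import List, Sequence


def simulate_fig8(volumes: Sequence[float], threshold: float,
                  eos_after_silent_frames: int = 2) -> List[bool]:
    """Return, per frame, whether the terminal would transmit it."""
    recognizing = False   # S103: state initialized to "unrecognized"
    silent_run = 0        # toy server: consecutive silent frames received
    sent: List[bool] = []
    for volume in volumes:
        voiced = volume >= threshold
        if recognizing:
            transmit = True       # S104A YES -> S105, even for silence
        else:
            transmit = voiced     # S104A NO -> S106A: send only if voiced
        sent.append(transmit)
        if not transmit:
            continue
        # Toy server behavior (assumed): SOS on the first frame received
        # while idle, EOS after enough consecutive silent frames.
        if not recognizing:
            recognizing = True    # terminal receives SOS
            silent_run = 0
        elif voiced:
            silent_run = 0
        else:
            silent_run += 1
            if silent_run >= eos_after_silent_frames:
                recognizing = False   # terminal receives EOS
    return sent
```

For instance, with volumes `[0.0, 0.0, 0.6, 0.7, 0.0, 0.0, 0.0, 0.5]` and a threshold of `0.1`, the leading silence is withheld, the frames from the first voiced frame through the toy server's EOS are sent, the following silent frame is withheld, and transmission resumes with the next voiced frame.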

As described above, in the voice recognition system of the present embodiment, the communication terminal 100 can compress the voice data and transmit the compressed voice data to the voice recognition processing server device 300. In this case, the voice recognition processing server device 300 can decompress the compressed voice data and then perform voice recognition processing.
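The embodiment does not name a compression scheme. Purely as an illustration of the compress-then-decompress round trip, the sketch below uses lossless zlib compression from the Python standard library; the choice of zlib and the function names are assumptions, and a real system might well use a speech codec instead.

```python
import zlib


def compress_audio(pcm: bytes) -> bytes:
    """Terminal side: compress a frame before transmission (assumed zlib)."""
    return zlib.compress(pcm)


def decompress_audio(payload: bytes) -> bytes:
    """Server side: restore the original frame before recognition."""
    return zlib.decompress(payload)
```

Because zlib is lossless, the server recovers exactly the bytes the terminal captured; how much the payload shrinks depends on the audio content.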

Each function of the communication terminal 100 and the voice recognition processing server device 300 can also be implemented as a program. For example, a program for each function is stored in an auxiliary storage device (not shown) of a computer; a control unit such as a CPU reads the program into a main storage device and executes it, thereby causing the computer to perform the functions of the respective units of the present embodiment.

The program can also be provided to a computer recorded on a computer-readable recording medium. Examples of computer-readable recording media include optical discs such as CD-ROMs; phase-change optical discs such as DVD-ROMs; magneto-optical discs such as MO (magneto-optical) discs and MD (MiniDisc); magnetic disks such as floppy (registered trademark) disks and removable hard disks; and memory cards such as CompactFlash (registered trademark), SmartMedia, SD memory cards, and Memory Stick. Hardware devices such as integrated circuits (IC chips and the like) specially designed and configured for the purposes of the present invention are also included as recording media.

Although an embodiment of the present invention has been described, the embodiment is presented as an example and is not intended to limit the scope of the invention. This novel embodiment can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. The embodiment and its modifications are included in the scope and gist of the invention, and are included in the invention described in the claims and its equivalents.

100 Communication terminal
110 Control unit (CPU)
111 A/D conversion unit
112 Volume measuring unit
113 Voice data output control unit
114 Recognition state confirmation unit
115 Display control unit
120 Storage unit
130 Communication unit
140 Microphone
150 Speaker
160 Display unit
170 Operation unit
300 Voice recognition processing server device
310 Control unit (CPU)
320 Storage unit
330 Communication unit
340 Voice recognition unit

Claims

A communication terminal that transmits voice data uttered by a user to a voice recognition processing server device that performs voice recognition processing, and receives a voice recognition processing result for the voice data from the voice recognition processing server device, the communication terminal comprising:
a volume measuring unit that measures the volume of voice data acquired by a voice input unit; and
a voice data output control unit that transmits the voice data to the voice recognition processing server device,
wherein the voice data output control unit performs control so as not to transmit silent voice data to the voice recognition processing server device when the volume of the sequentially input voice data is less than a predetermined threshold indicating silence and the recognition processing state of the voice recognition processing for the voice data, received from the voice recognition processing server device, is an unrecognized state indicating standby.

The communication terminal according to claim 1, wherein the signal indicating the recognition processing state comprises an SOS signal indicating that voice recognition processing has started in response to reception of the voice data, and an EOS signal, paired with the SOS signal, indicating that the started voice recognition processing is to be terminated when silent voice data has continued for a predetermined time, and
the voice data output control unit performs control so as not to transmit the voice data to the voice recognition processing server device from reception of the EOS signal until voice data having a volume equal to or higher than the predetermined threshold is input.

The communication terminal according to claim 1, wherein the voice data output control unit performs control so as not to transmit voice data indicating silence to the voice recognition processing server device during the period from when the voice input unit starts the voice data acquisition processing until voice data having a volume equal to or higher than the predetermined threshold is input.

A program executed by a communication terminal that transmits voice data uttered by a user to a voice recognition processing server device that performs voice recognition processing, and receives a voice recognition processing result for the voice data from the voice recognition processing server device, the program causing the communication terminal to implement:
a first function of measuring the volume of voice data acquired by a voice input unit; and
a second function of transmitting the voice data to the voice recognition processing server device,
wherein the second function performs control so as not to transmit silent voice data to the voice recognition processing server device when the volume of the sequentially input voice data is less than a predetermined threshold indicating silence and the recognition processing state of the voice recognition processing for the voice data, received from the voice recognition processing server device, is an unrecognized state indicating standby.

A voice recognition system comprising: a voice recognition processing server device that performs voice recognition processing; and a communication terminal that transmits voice data uttered by a user to the voice recognition processing server device and receives a voice recognition processing result for the voice data from the voice recognition processing server device, wherein
the voice recognition processing server device transmits, to the communication terminal, a signal indicating the recognition processing state of the voice recognition processing for the received voice data,
the communication terminal comprises:
a volume measuring unit that measures the volume of voice data acquired by a voice input unit; and
a voice data output control unit that transmits the voice data to the voice recognition processing server device, and
the voice data output control unit performs control so as not to transmit silent voice data to the voice recognition processing server device when the volume of the sequentially input voice data is less than a predetermined threshold indicating silence and the recognition processing state is an unrecognized state indicating standby for voice recognition processing.