JP2006011066A - Voice recognition/synthesis system, synchronous control method, synchronous control program and synchronous controller

Info

Publication number: JP2006011066A
Application number: JP2004188408A
Authority: JP
Inventors: Kentaro Nagatomo; 健太郎長友
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2004-06-25
Filing date: 2004-06-25
Publication date: 2006-01-12
Anticipated expiration: 2024-06-25
Also published as: JP4483428B2

Abstract

PROBLEM TO BE SOLVED: To synchronously control two pieces of information, i.e., voice data and a voice recognition/synthesis control command.

SOLUTION: In a voice recognition process, a voice input means 101 starts inputting voice data at time A0, successively divides the input data into packets, adds identifiers, and transmits them to a voice recognition means 102. The voice recognition means 102 starts receiving the packets from the voice input means 101 at time A1. Then, at time A2, a control means 103 sets an identifier "AI0" in a control command "AC1", which instructs the start of the voice recognition process, and transmits the command to the voice recognition means 102. When the voice recognition means 102 receives the control command "AC1" including the identifier "AI0" at time A4, it reads the voice data to which the specified identifier "AI0" is added from a packet holding means 102d and starts the voice recognition process from the read packets. Thus, the voice recognition process is conducted on the appropriate voice data to be processed.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a voice recognition/synthesis system and the like that performs voice recognition processing for analyzing input voice data and/or voice synthesis processing for generating voice data, and more particularly to a voice recognition/synthesis system, a synchronization control method, a synchronization control program, and a synchronization control apparatus that synchronize voice data with control commands for voice recognition/synthesis processing.

A voice application system that applies voice recognition technology for analyzing input voice data and voice synthesis technology for generating voice data comprises a voice data input/output unit that inputs and outputs voice data, a voice recognition/synthesis processing unit that performs voice recognition processing and/or voice synthesis processing, and a control unit that controls the voice data input/output unit and the voice recognition/synthesis processing unit.

Conventionally, the units of such a voice application system have been implemented as program components or hardware blocks. With recent advances in communication network technology, however, the units constituting a voice application system are increasingly being implemented as separate servers.

However, in a voice dialogue system made up of loosely coupled components, each implemented as a server, data is exchanged between components over a communication network in which transmission delays can occur, so some kind of time synchronization control mechanism is required.

Without a time synchronization control mechanism, the units cannot cooperate properly, and various problems may arise, such as dropped voice data, false detection of noise, degraded recognition performance due to the application of inappropriate acoustic/language models or parameters, and degraded output voice quality.

For example, the following is a common problem in the field of multimedia data communication: a transmitting terminal separates voice data and other media data (images and the like) that were laid out on the same time axis and transmits them through separate systems, and the receiving terminal must then receive each kind of media data and reassemble them in synchronization on a single time axis.

Patent Document 1 discloses a technique in which multimedia data laid out on the same time axis is demultiplexed by medium and transmitted separately, and the receiving device outputs the data in synchronization.

Patent Document 2 discloses a voice application system that generates, from given text, voice data and moving image data to which time synchronization control information has been added.

Patent Document 1: JP 2003-204492 A
Patent Document 2: JP 2003-216173 A

However, the conventional techniques described above cannot perform synchronization control on information that exists on essentially different time axes, such as voice data and voice recognition/synthesis control commands.

That is, because it is not self-evident at which timing voice data and a voice recognition/synthesis control command should be synchronized, the two could not be controlled in synchronization.

Patent Documents 1 and 2 disclose techniques for synchronizing multiple kinds of media data. However, the media data subject to synchronization control there is inherently meant to be laid out on the same time axis, so the timing at which it should be synchronized is self-evident.

In contrast, no self-evident synchronization timing exists between voice data and a voice recognition/synthesis control command, so applying the techniques described in those patent documents does not make it possible to synchronize voice data with voice recognition/synthesis control commands.

In this case, if the system operates by assuming a suitable synchronization timing (for example, by treating the voice data nearest to the time at which the voice recognition/synthesis processing unit received the recognition start command as the first data to be recognized), the problems described above may not arise, depending on the system environment and other conditions.

However, an approach that assumes a synchronization timing in this way breaks down in environments with very large transmission delays and in demanding situations where control commands corresponding to voice data are issued in rapid succession. Synchronization control between voice data and voice recognition/synthesis control commands therefore still could not be performed reliably.

An object of the present invention is to solve the problems described above and to enable two kinds of information, voice data and voice recognition/synthesis control commands, to be controlled in synchronization.

The voice recognition/synthesis system of the present invention is a voice recognition/synthesis system (for example, voice recognition/synthesis systems 100, 300, 500, 700, 900, and 1000) that performs voice recognition processing for analyzing input voice data and/or voice synthesis processing for generating voice data, and is characterized by comprising: control command means (for example, control means 103, 303, 504, 703, and 905, and voice dialogue management server 1001) that issues a control command in which identification information (for example, an identifier) is set for specifying, among the pieces of divided voice data (for example, voice data packets) obtained by dividing voice data into a plurality of sections, the divided voice data to be processed; and voice processing means (for example, voice recognition means 102, 302, 502, 503, and 902, and voice generation means 701 and 903) that, in accordance with the control command from the control command means, performs voice recognition processing and/or voice synthesis processing on the divided voice data specified by the identification information set in that control command.

With this configuration, even when the transmission system for voice data and the transmission system for control commands are independent, the divided voice data to be subjected to voice recognition processing and/or voice synthesis processing can be specified exactly, and the voice data and control commands can be controlled in synchronization.

The system may comprise identification information adding means (for example, identifier assigning means 101d) that adds, to each piece of divided voice data obtained by dividing the input voice data into a plurality of sections, identification information that is unique within the system.

The system may comprise voice input processing means (for example, voice input processing means 101b) that performs voice data input processing, and voice data dividing means (for example, packet dividing means 101c) that generates divided voice data (for example, voice data packets) by dividing the voice data input by the voice input processing means into a plurality of sections.

The control command means may issue a control command in which an execution time for the voice recognition processing and/or voice synthesis processing is set, and the voice processing means may, in accordance with the control command from the control command means, perform the voice recognition processing and/or voice synthesis processing on the divided voice data specified by the identification information set in the control command when the execution time set in the control command arrives.
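The timed control command described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the patent's implementation: the class names `ControlCommand` and `CommandScheduler`, the field names, the `action` strings, and the identifier "AI3" are all hypothetical, and time is passed in as a plain number rather than read from a synchronized clock.

```python
import heapq
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class ControlCommand:
    # Commands are ordered by execution time only.
    execute_at: float
    # Identifier of the divided voice data (packet) the command applies to.
    identifier: str = field(compare=False)
    # Hypothetical action name, e.g. "start_recognition".
    action: str = field(compare=False)


class CommandScheduler:
    """Holds control commands until their execution time arrives."""

    def __init__(self) -> None:
        self._queue: List[ControlCommand] = []

    def submit(self, command: ControlCommand) -> None:
        heapq.heappush(self._queue, command)

    def due(self, now: float) -> List[ControlCommand]:
        """Pop and return every command whose execution time is <= now."""
        ready = []
        while self._queue and self._queue[0].execute_at <= now:
            ready.append(heapq.heappop(self._queue))
        return ready
```

A voice processing component would call `due()` with its current (synchronized) time and then apply each returned command to the packet named by `identifier`.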

The system may comprise a plurality of voice processing means (for example, voice recognition means 302a and 302b) together with processing result integration means (for example, result integration means 303c) that integrates the results of the voice recognition processing and/or voice synthesis processing performed by each of the plurality of voice processing means.

The control command means may issue a control command (for example, control command "CC1" shown in FIG. 6) to one of a plurality of voice processing means (for example, first voice recognition means 502 and second voice recognition means 503), such as the first voice recognition means, and that one voice processing means may have control command transfer means (for example, voice recognition control means 502a) that transfers part or all of the control command from the control command means (for example, control command "CC2" shown in FIG. 6) to another voice processing means.

That one voice processing means may have voice data transfer means (for example, packet transmission/reception means 502e) that transfers one section or all sections of the voice data to be processed, as designated by the control command from the control command means (for example, the voice data packet to which the identifier "CI10" shown in FIG. 6 is added), to another voice processing means.

The identification information adding means adds to each piece of divided voice data, as identification information, for example a time stamp, a serial number, a processing sequence number of the voice dialogue carried out by the voice recognition processing and/or voice synthesis processing, or a combination of these.

The system may comprise identification information management means (for example, control means 103) that provides a function for managing the temporal order of identification information. The identification information management means manages the temporal order of the pieces of identification information (for example, the order in which they were assigned) by synchronizing the absolute time used by each component of the system (for example, using time information from an NTP server) and by associating each piece of identification information with a specific absolute time (for example, the time at which it was assigned).

The synchronization control method of the present invention is a synchronization control method for a voice recognition/synthesis system (for example, voice recognition/synthesis systems 100, 300, 500, 700, 900, and 1000) that performs voice recognition processing for analyzing input voice data and/or voice synthesis processing for generating voice data, and is characterized by: issuing a control command (for example, command "AC1" in FIG. 2) in which identification information (for example, an identifier) is set for specifying, among the pieces of divided voice data (for example, voice data packets) obtained by dividing voice data into a plurality of sections, the divided voice data to be processed; and performing, in accordance with the control command, voice recognition processing and/or voice synthesis processing on the divided voice data specified by the identification information set in that control command (for example, the voice data packet to which the identifier "AI0" is added).

The method may add, to each piece of divided voice data obtained by dividing the input voice data into a plurality of sections, identification information that is unique within the system.

The method may perform voice data input processing and generate divided voice data by dividing the voice data input by that processing into a plurality of sections.

The method may issue a control command in which an execution time for the voice recognition processing and/or voice synthesis processing is set and, in accordance with the control command, perform the voice recognition processing and/or voice synthesis processing on the divided voice data specified by the identification information set in the control command when the execution time set in the control command arrives.

The method may integrate the results of a plurality of voice recognition and/or voice synthesis processes performed by different processing means in accordance with control commands.

After voice recognition processing and/or voice synthesis processing has been performed in accordance with the control command, the method may transfer part or all of the control command to another processing means, which then performs voice recognition processing and/or voice synthesis processing in accordance with the transferred control command.

The method may transfer one section or all sections of the voice data to be processed, as designated by the control command from the control command means, to another processing means.

As the identification information added to each piece of divided voice data, for example a time stamp, a serial number, a processing sequence number of the voice dialogue carried out by the voice recognition processing and/or voice synthesis processing, or a combination of these is used.

The synchronization control program of the present invention causes a voice recognition/synthesis system (for example, voice recognition/synthesis systems 100, 300, 500, 700, 900, and 1000) that performs voice recognition processing for analyzing input voice data and/or voice synthesis processing for generating voice data to execute synchronization control. It causes the computers constituting the voice recognition/synthesis system (for example, voice input means 901, voice recognition means 902, voice generation means 903, voice output means 904, and control means 905) to execute: a step of issuing a control command in which identification information (for example, an identifier) is set for specifying, among the pieces of divided voice data obtained by dividing voice data into a plurality of sections, the divided voice data to be processed; and a step of performing, in accordance with the control command, voice recognition processing and/or voice synthesis processing on the divided voice data specified by the identification information set in that control command.

With this configuration, even when the transmission system for voice data and the transmission system for control commands in the voice recognition/synthesis system are independent, the computers constituting the system can be made to specify exactly the divided voice data to be subjected to voice recognition processing and/or voice synthesis processing, and to control the voice data and control commands in synchronization.

The program may further cause the computer to execute a step of adding, to each piece of divided voice data obtained by dividing the input voice data into a plurality of sections, identification information that is unique within the system.

The program may further cause the computer to execute a step of performing voice data input processing and a step of generating divided voice data by dividing the voice data input by that processing into a plurality of sections.

Furthermore, the synchronization control apparatus of the present invention is characterized by comprising: voice input processing means (for example, voice input processing means 101b) that performs voice data input processing; voice data dividing means (for example, packet dividing means 101c) that generates divided voice data by dividing the voice data input by the voice input processing means into a plurality of sections; identification information adding means (for example, identifier assigning means 101d) that adds, to each piece of divided voice data produced by the voice data dividing means, identification information (for example, an identifier) that is unique within the system; and control command means (for example, control means 103) that issues, to voice processing means (for example, voice recognition means 102) that performs voice recognition processing for analyzing the voice data input by the voice input processing means and/or voice synthesis processing for generating voice data, a control command in which identification information for specifying the divided voice data to be processed is set.

With this configuration, even when the transmission system for voice data and the transmission system for control commands are independent, the divided voice data to be subjected to voice recognition processing and/or voice synthesis processing can be specified exactly, and the voice processing means can be made to control the voice data and control commands in synchronization.

According to the present invention, even when the transmission system for voice data and the transmission system for control commands are independent, the divided voice data to be subjected to voice recognition processing and/or voice synthesis processing can be specified exactly, and the voice data and control commands can be controlled in synchronization.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
Embodiment 1.
FIG. 1 is a block diagram showing a configuration example of the voice recognition/synthesis system 100 according to the first embodiment of the present invention.

As shown in FIG. 1, the voice recognition/synthesis system 100 of this example includes voice input means 101, voice recognition means 102, and control means 103, which are connected to one another by transmission means 104.

The voice input means 101 includes voice input control means 101a, voice input processing means 101b, packet dividing means 101c, identifier assigning means 101d, packet holding means 101e, and packet transmitting means 101f.

The voice input means 101 executes various kinds of processing, such as inputting voice data based on control commands from the control means 103, and dividing the input voice data into packets and transmitting them to the voice recognition means 102.

The voice input control means 101a receives control commands from other components such as the control means 103 and controls the overall operation of the voice input means 101 based on the received commands. In response to requests from other components, it also sends them information such as the status of voice input processing.

The voice input processing means 101b receives voice data from outside and processes it as needed. For example, it performs A/D conversion if the input is analog audio, and decoding if the input is data that has been encoded in some way. The voice input processing means 101b may also have functions such as noise suppression.

The packet dividing means 101c divides the voice data into units called packets by splitting it into suitable sections and adding to each section the supplementary information needed to reproduce the voice data. The supplementary information includes a serial number, a time stamp, the packet size, quantization parameters, and the like.

The identifier assigning means 101d assigns to each packet generated by the packet dividing means 101c an identifier that uniquely identifies it.
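As a rough illustration of packet division followed by identifier assignment, the sketch below splits a byte string into fixed-size sections and tags each one. Everything specific in it is an assumption made for the example: the `VoicePacket` fields, the 320-byte section size, the 20 ms spacing, and the `"<source>-<serial>"` identifier format are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VoicePacket:
    identifier: str    # hypothetical format: "<source>-<serial>"
    serial: int        # position of the section within the stream
    timestamp_ms: int  # nominal start time of the section
    payload: bytes     # the voice data of this section


def packetize(samples: bytes, source_id: str, chunk_size: int = 320,
              start_ms: int = 0, ms_per_chunk: int = 20) -> List[VoicePacket]:
    """Split voice data into sections and attach supplementary information."""
    packets = []
    for serial, offset in enumerate(range(0, len(samples), chunk_size)):
        packets.append(VoicePacket(
            identifier=f"{source_id}-{serial}",
            serial=serial,
            timestamp_ms=start_ms + serial * ms_per_chunk,
            payload=samples[offset:offset + chunk_size],
        ))
    return packets
```

Prefixing the identifier with a source name is one simple way to keep identifiers distinct when several input components number their packets independently.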

The packet holding means 101e buffers packets as needed.

In practice, the number of packets that the packet holding means 101e can hold at one time is finite, so packets may be lost. Once a packet is lost, processing can no longer be requested for it.

In many cases, however, the packets that need to be kept are limited to some extent by the flow of the dialogue, so this rarely becomes a practical problem as long as a suitable buffer management algorithm is used.

If it does become a problem, the control means 103 can issue a control command instructing that the packets required by the dialogue flow continue to be held. Alternatively, the required packets can be re-acquired from another component that holds them.
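One possible buffer management policy of the kind mentioned above is sketched here: a bounded buffer that discards the oldest packet first but never discards a packet that has been marked as still required. The patent does not specify an algorithm, so the `PacketBuffer` class, its `pin`/`unpin` methods, and the eviction rule are assumptions for illustration only.

```python
from collections import OrderedDict
from typing import Optional


class PacketBuffer:
    """Bounded packet store; pinned identifiers are never evicted."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._packets = OrderedDict()  # identifier -> payload, oldest first
        self._pinned = set()

    def pin(self, identifier: str) -> None:
        # Stands in for a "keep holding this packet" control command.
        self._pinned.add(identifier)

    def unpin(self, identifier: str) -> None:
        self._pinned.discard(identifier)

    def put(self, identifier: str, payload: bytes) -> None:
        self._packets[identifier] = payload
        while len(self._packets) > self.capacity:
            # Evict the oldest packet that is not pinned.
            victim = next(
                (k for k in self._packets if k not in self._pinned), None)
            if victim is None:
                break  # every held packet is pinned; exceed capacity
            del self._packets[victim]

    def get(self, identifier: str) -> Optional[bytes]:
        return self._packets.get(identifier)
```

In this sketch a lost packet simply yields `None` from `get()`, at which point a caller could fall back to re-acquiring it from another component.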

The packet transmitting means 101f transmits the packets held in the packet holding means 101e to other components such as the voice recognition means 102.

The voice recognition means 102 includes voice recognition control means 102a, voice recognition processing means 102b, identifier determining means 102c, packet holding means 102d, and packet receiving means 102e.

The voice recognition means 102 recognizes voice in response to control commands from the control means 103 and outputs the recognition result in response to requests from the control means 103.

The voice recognition control means 102a receives control commands from other components such as the control means 103 and controls the overall operation of the voice recognition means 102 based on the received commands. In response to requests from other components, it also sends information it holds, such as recognition results.

The voice recognition processing means 102b performs voice recognition processing on the voice data selected by the identifier determining means 102c and holds the result.

The identifier determining means 102c selects the specific packets designated by a control command, based on the identifiers assigned to the packets, and passes them to the voice recognition processing means 102b.
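The selection step can be sketched as a lookup over the buffered packets: the command names an identifier, and processing starts from the packet that carries it, regardless of when the command itself arrived. The function name `select_from`, the list-of-tuples buffer layout, and the identifiers other than "AI0" are hypothetical choices for this example.

```python
from typing import List, Tuple


def select_from(held: List[Tuple[str, bytes]],
                start_identifier: str) -> List[bytes]:
    """Return payloads from the packet with start_identifier onward.

    `held` lists (identifier, payload) pairs in arrival order. Packets
    that arrived before the designated one are skipped, so a control
    command that is delayed in transit still selects the intended data.
    """
    for index, (identifier, _payload) in enumerate(held):
        if identifier == start_identifier:
            return [payload for _id, payload in held[index:]]
    return []  # designated packet not held (not yet received, or lost)
```

For example, if packets "P9", "AI0", and "AI1" are already buffered when a start command naming "AI0" arrives, recognition would begin with the "AI0" payload rather than with whatever packet happened to arrive most recently.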

The packet holding means 102d buffers packets as needed.

The packet receiving means 102e receives voice data packets from other components such as the voice input means 101 and stores them in the packet holding means 102d.

The control means 103 includes user interface means 103a and dialogue management means 103b.

The control means 103 controls the voice input means 101 and the voice recognition means 102 and coordinates them so that the system 100 as a whole operates as a single voice dialogue system.

The user interface means 103a is a non-voice user interface device, such as a display, buttons, a mouse, or a keyboard, and interacts with the user to support the voice dialogue.

The dialogue management means 103b manages the progress of the dialogue based on information obtained from the user interface means 103a and the voice recognition means 102, and sends control commands to other components as needed.

The dialogue management means 103b also controls the system as a whole by receiving and interpreting control commands from other components and sending information as needed.

The transmission means 104 is a communication network through which the components shown in FIG. 1 exchange control commands and voice data packets. In FIG. 1 the system that carries control commands and the system that carries voice data packets are independent of each other, but they may share the same system.

識別子としては、シリアル番号、シーケンス番号、タイムスタンプなどが用いられる。以下、それぞれの識別子について説明する。 As the identifier, a serial number, a sequence number, a time stamp, or the like is used. Hereinafter, each identifier will be described.

シリアル番号は、パケットの順序関係を示すために広く用いられている付帯情報である。シリアル番号は、一般的に有限の整数で表現されるため、ある限定された時間内でのみ一意性を持つ。 The serial number is incidental information that is widely used to indicate the order relationship of packets. Since the serial number is generally expressed as a finite integer, it is unique only within a limited time.

シリアル番号を用いる場合、シリアル番号の桁あふれを検出する仕組みが別途必要になるが、実用上は、先行する番号との大小関係を検査するだけでよく、簡単である。 When a serial number is used, a mechanism for detecting overflow of the serial number is separately required. However, in practice, it is only necessary to check the magnitude relationship with the preceding number, and it is simple.
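The overflow check described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the 16-bit field width and the names `SERIAL_BITS`, `wrapped`, and `SerialTracker` are assumptions, and the sketch assumes packets are examined in the order they were sent.

```python
SERIAL_BITS = 16                  # assumed width of the serial-number field
SERIAL_MOD = 1 << SERIAL_BITS


def wrapped(previous: int, current: int) -> bool:
    """Report an overflow when a serial number is smaller than its predecessor."""
    return current < previous


class SerialTracker:
    """Extends wrapping serial numbers into a monotonically increasing count."""

    def __init__(self) -> None:
        self.previous = None      # last serial number seen, if any
        self.epoch = 0            # number of overflows detected so far

    def extend(self, serial: int) -> int:
        if self.previous is not None and wrapped(self.previous, serial):
            self.epoch += 1
        self.previous = serial
        return self.epoch * SERIAL_MOD + serial
```

Comparing only against the preceding number keeps the check cheap, but it would misfire if packets were reordered in transit, which is one reason the text goes on to prefer time stamps.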

シリアル番号は、ある構成要素の組と別の組とでは値に一意性がない。図１に示す音声認識／合成システム１００では、音声入力手段１０１と音声認識手段１０２との組が一組しか存在しないので問題にならないが、後述する第２の実施の形態における音声認識／合成システム３００（図３参照）のように、音声入力手段と音声認識手段との組が複数存在し、それぞれ非同期に動作している場合には、各々の組の間でシリアル番号の一意性はないことになる。 Serial numbers are not unique across different sets of components. In the speech recognition / synthesis system 100 shown in FIG. 1 this is not a problem, because there is only one pair of voice input means 101 and voice recognition means 102. However, when there are a plurality of pairs of voice input means and voice recognition means, each operating asynchronously, as in the speech recognition / synthesis system 300 of the second embodiment described later (see FIG. 3), serial numbers are not unique across the pairs.

シーケンス番号は、ある処理シーケンスに属する音声データを識別するための番号であり、例えば１回目の発話に対しては１番、２回目の発話に対しては２番といった要領で割り当てる。このように、シーケンス番号はそれを割り当てる側（この場合、音声入力手段１０１）が音声データを何らかの単位に区切る必要がある。しかし、実装が比較的容易なため、限定された用途では利用を検討する価値がある。 The sequence number is a number for identifying audio data belonging to a certain processing sequence. For example, the sequence number is assigned in a manner such as No. 1 for the first utterance and No. 2 for the second utterance. Thus, the sequence number assigner (in this case, the voice input means 101) needs to divide the voice data into some units. However, since it is relatively easy to implement, it is worth considering its use in limited applications.

タイムスタンプは、ある音声データを処理した時刻をそのまま識別子として用いる方法である。タイムスタンプは、ほぼ無限の時間内で一意性があり、また、時系列データである音声データとの相性もよい。しかし、システム１００を構成する各構成要素の間で絶対時刻を揃えておく必要がある。各構成要素が単一の計算機の上で動作するのであれば、そのような必要はない。ところが、本例のように、各構成要素が別々の計算機上で動作する場合には、例えば、時刻同期を行うためのＮＴＰ（Network Time Protocol）のような仕組みを併用することで、各構成要素の間で絶対時刻を揃えておく必要がある。 The time stamp method uses the time at which a piece of audio data was processed directly as its identifier. Time stamps are unique over a practically unlimited period and fit well with audio data, which is time-series data. However, the absolute time must be aligned among the components constituting the system 100. This is unnecessary if all components run on a single computer. When the components run on separate computers, as in this example, the absolute time must be aligned among them, for example by also using a time synchronization mechanism such as NTP (Network Time Protocol).

本例で用いる識別子は、システム１００全体で各音声データを一意に識別できるものが望ましい。少なくとも、音声データを直接やり取りする構成要素の組において、ある限定された時間内でならば十分に一意性を保証できる識別子である必要がある。 The identifier used in this example is desirably one that uniquely identifies each piece of audio data across the entire system 100. At a minimum, it must guarantee sufficient uniqueness, within a limited period of time, among the set of components that directly exchange audio data.

上記の識別子の例のうちで、本例で用いる識別子に最も好適なのは、タイムスタンプである。時系列データである音声データとの相性がよく、また、音声対話システム開発者が直感的に把握しやすいためである。また、タイムスタンプは、時刻を意味するデータであるため、スケジューリング機能と容易に組み合わせることができる。よって、本例では、識別子としてタイムスタンプを用いるものとする。 Of the above identifier examples, the most suitable identifier used in this example is a time stamp. This is because it has good compatibility with voice data that is time-series data, and it is easy for a voice dialogue system developer to grasp intuitively. Further, since the time stamp is data indicating time, it can be easily combined with the scheduling function. Therefore, in this example, a time stamp is used as an identifier.

なお、本例のシステム１００は、識別情報の時間順序性を管理するための機能を備えている。この機能は、例えば制御手段１０３に備えられる。この機能により、制御手段１０３は、各識別情報と、各識別情報が付与された時刻とを対応付けして記憶しておく。この時刻は、例えば絶対時刻を管理するＮＴＰサーバから提供されるシステム内で共通に用いられる時刻が用いられる。各識別情報と、各識別情報が付与された時刻に関する情報は、例えば識別情報を付与する機能を有する端末装置（例えば音声入力手段１０１）から取得するようにすればよい。このように構成することで、システム１００に各識別情報の時間順序性を管理する機能を持たせることができる。 The system 100 of this example also has a function for managing the time order of identification information. This function is provided in the control means 103, for example. With this function, the control means 103 stores each piece of identification information in association with the time at which it was assigned. For this time, for example, the time used in common within the system and provided by an NTP server that manages absolute time is used. Each piece of identification information, together with the time at which it was assigned, may be acquired from a terminal device that has the function of assigning identification information (for example, the voice input means 101). With this configuration, the system 100 can manage the time order of each piece of identification information.
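The time-order management function described above could look roughly like the following sketch. The class name `IdentifierRegistry` and its methods are hypothetical; the patent only states that each identifier is stored together with the time at which it was assigned.

```python
from bisect import insort


class IdentifierRegistry:
    """Stores each identifier with its assignment time and answers ordering queries."""

    def __init__(self) -> None:
        self._assigned_at = {}    # identifier -> assignment time (shared system clock)
        self._ordered = []        # (assignment time, identifier), kept sorted

    def record(self, identifier: str, assigned_at: float) -> None:
        self._assigned_at[identifier] = assigned_at
        insort(self._ordered, (assigned_at, identifier))

    def precedes(self, first: str, second: str) -> bool:
        """True if `first` was assigned earlier than `second`."""
        return self._assigned_at[first] < self._assigned_at[second]

    def in_time_order(self) -> list:
        return [identifier for _, identifier in self._ordered]
```

Keeping the sorted list alongside the dictionary lets the registry accept reports in any order, which matters if the terminal that assigns identifiers reports them over a network.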

次に、本例の音声認識／合成システム１００の動作について説明する。 Next, the operation of the speech recognition / synthesis system 100 of this example will be described.
図２は、本例の音声認識／合成システム１００による音声認識処理の例を示すタイムチャートである。 FIG. 2 is a time chart showing an example of speech recognition processing by the speech recognition / synthesis system 100 of this example.

図２に示すタイムチャートにおいて、縦軸は、時刻Ａ０から時刻Ａ１４までの時間の経過を表しており、下方に進むほど未来の事象を表す。また、図２において、実線の矢印は、制御指令の流れを表す。また、破線の矢印は、音声データの流れを表す。 In the time chart shown in FIG. 2, the vertical axis represents the passage of time from time A0 to time A14, and represents a future event as it progresses downward. In FIG. 2, solid arrows indicate the flow of control commands. A broken arrow represents the flow of audio data.

音声認識処理において、先ず、音声入力手段１０１は、音声データの入力を時刻Ａ０から開始し、入力した音声データを順次パケットに分割して識別子を付与しながら、音声認識手段１０２に送信していく。 In the voice recognition process, the voice input unit 101 first starts inputting voice data at time A0, sequentially divides the input voice data into packets, assigns an identifier to each packet, and transmits the packets to the voice recognition unit 102.
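The packetizing step can be sketched as below, using time stamps as identifiers as this example does. The names `AudioPacket` and `packetize`, the 20 ms packet length, and the 32 bytes per millisecond rate (16 kHz, 16-bit mono) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class AudioPacket:
    identifier: int    # time stamp (ms) of the packet's first sample, used as the identifier
    samples: bytes


def packetize(audio: bytes, start_ms: int, bytes_per_ms: int = 32,
              packet_ms: int = 20) -> Iterator[AudioPacket]:
    """Split raw audio into fixed-length packets, each tagged with a time-stamp identifier."""
    size = bytes_per_ms * packet_ms
    for offset in range(0, len(audio), size):
        yield AudioPacket(identifier=start_ms + offset // bytes_per_ms,
                          samples=audio[offset:offset + size])
```

Because the identifier is derived from the position of the audio itself, the sender needs no knowledge of the receivers to assign it.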

なお、入力した音声データから最初に生成されたパケットに付加される識別子は、「ＡＩ０」であるものとする。 It is assumed that the identifier added to the packet first generated from the input voice data is “AI0”.

音声認識手段１０２は、音声入力手段１０１からのパケットの受信を、時刻Ａ１に開始する。ただし、時刻Ａ１には、まだ音声認識処理が開始されていない。このため、音声認識手段１０２は、受信した各パケットをパケット保持手段１０２ｄに格納する。 The voice recognition unit 102 starts receiving the packet from the voice input unit 101 at time A1. However, at time A1, the speech recognition process has not yet started. For this reason, the voice recognition means 102 stores each received packet in the packet holding means 102d.

次いで、制御手段１０３は、時刻Ａ２に、音声認識処理の開始を指示する制御指令「ＡＣ１」を音声認識手段１０２に送信する。 Next, the control means 103 transmits a control command “AC1” instructing the start of the voice recognition process to the voice recognition means 102 at time A2.

制御指令「ＡＣ１」を送信する際には、制御手段１０３は、音声認識処理の対象データを特定するための任意の識別子を指定し、制御指令「ＡＣ１」に含めることができるものとする。ここでは、音声入力手段１０１によって音声データの入力処理が開始された時刻が時刻Ａ０であることから、識別子「ＡＩ０」を制御指令「ＡＣ１」に含めるものとする。 When transmitting the control command “AC1”, the control unit 103 can specify an arbitrary identifier for specifying the target data of the speech recognition process and include it in the control command “AC1”. Here, since the time when the voice data input process is started by the voice input unit 101 is the time A0, the identifier “AI0” is included in the control command “AC1”.

音声認識手段１０２は、識別子「ＡＩ０」を含む制御指令「ＡＣ１」を時刻Ａ４に受信すると、指定された識別子「ＡＩ０」が付加されている音声データをパケット保持手段１０２ｄから読み出し、読み出したパケットから音声認識処理を開始する。 When the voice recognition unit 102 receives the control command “AC1” including the identifier “AI0” at time A4, it reads the voice data to which the designated identifier “AI0” is added from the packet holding unit 102d and starts the voice recognition process from the read packet.

その後、音声認識処理が時刻Ａ６に完了すると、音声認識手段１０２は、その旨を示す音声認識処理完了通知「ＡＣ１’」を、制御手段１０３に送信する。 Thereafter, when the voice recognition process is completed at time A6, the voice recognition unit 102 transmits a voice recognition process completion notification “AC1'” indicating that fact to the control unit 103.

音声認識処理完了通知「ＡＣ１’」を受信すると、制御手段１０３は、音声入力を一時的に停止することとし、時刻Ａ７に、音声入力手段１０１に向けて音声入力処理停止指令「ＡＣ２」を送信する。 Upon receiving the voice recognition process completion notification “AC1'”, the control unit 103 decides to stop voice input temporarily and, at time A7, transmits a voice input processing stop command “AC2” to the voice input unit 101.

音声入力処理停止指令「ＡＣ２」を時刻Ａ８に受信すると、音声入力手段１０１は、その時点で音声認識手段１０２への音声データパケットの送信を停止する。 When the voice input processing stop command “AC2” is received at time A8, the voice input means 101 stops sending voice data packets to the voice recognition means 102 at that time.

なお、音声認識手段１０２が音声認識処理を完了した時刻Ａ６から、音声入力手段１０１が音声データパケットの送信を停止した時刻Ａ８までの間は、音声入力手段１０１から音声認識手段１０２に音声データパケットが送信され続けている。すなわち、音声認識手段１０２が音声認識処理を完了した時刻Ａ６のあとも、時刻Ａ８までは音声データパケットが音声認識手段１０２に送信される。このため、音声認識手段１０２が備えるパケット保持手段１０２ｄには、時刻Ａ６以降に受信され、時刻Ａ８までに音声入力手段１０１から送信された音声データパケットが保持されている。 Note that, from time A6 when the voice recognition means 102 completes the voice recognition processing to time A8 when the voice input means 101 stops sending voice data packets, the voice input means 101 sends a voice data packet to the voice recognition means 102. Continues to be sent. That is, the voice data packet is transmitted to the voice recognition means 102 until the time A8 after the time A6 when the voice recognition means 102 completes the voice recognition processing. For this reason, the packet holding means 102d included in the voice recognition means 102 holds voice data packets received from time A6 and transmitted from the voice input means 101 until time A8.

その後、制御手段１０３は、音声入力手段１０１に音声入力を再開するように、指令「ＡＣ３」を送信する。指令「ＡＣ３」を時刻Ａ９に受信すると、音声入力手段１０１は、音声認識手段１０２へのパケットの送信処理を再開する。 Thereafter, the control means 103 transmits a command “AC3” to the voice input means 101 so as to resume voice input. When the instruction “AC3” is received at time A9, the voice input unit 101 resumes the packet transmission process to the voice recognition unit 102.

なお、この例では、時刻Ａ９にて送信処理の再開後に最初に送信されるパケットに付加される識別子は、識別子「ＡＩ９」であるものとする。 In this example, it is assumed that the identifier added to the packet transmitted first after the restart of transmission processing at time A9 is the identifier “AI9”.

制御手段１０３は、音声入力手段１０１が音声データパケットの送信処理を再開した時刻が時刻Ａ９であることから、識別子「ＡＩ９」が付加されたパケットを処理対象の音声データとして音声認識を開始するように、識別子「ＡＩ９」を含む音声認識開始指令「ＡＣ４」を音声認識手段１０２に向けて送信する。 Since the voice input means 101 resumed the voice data packet transmission processing at time A9, the control means 103 transmits a voice recognition start command “AC4” including the identifier “AI9” to the voice recognition means 102, so that voice recognition starts with the packet carrying the identifier “AI9” as the voice data to be processed.

また、この例では、制御手段１０３は、指令「ＡＣ４」を送信した直後である時刻Ａ１０に、音声認識開始指令「ＡＣ５」を別途送信したものとする。 In this example, it is assumed that the control unit 103 separately transmits a voice recognition start command “AC5” at time A10 immediately after transmitting the command “AC4”.

そして、この例では、音声認識開始指令「ＡＣ４」と音声認識開始指令「ＡＣ５」とが順次送信されたにもかかわらず、例えば伝送路の状態の影響で、指令「ＡＣ４」よりも先に、指令「ＡＣ５」が音声認識手段１０２に受信されたものとする。 In this example, although the voice recognition start commands “AC4” and “AC5” were transmitted in that order, it is assumed that the command “AC5” was received by the voice recognition unit 102 before the command “AC4”, for example because of the state of the transmission path.

このような場合でも、それぞれの指令には処理対象となる音声データを特定するための識別子が含まれているので、音声認識手段１０２は、どの音声データから音声認識処理を開始すればよいのか正確に判断することができる。よって、処理すべき適切な音声データに対して音声認識処理が行われる。 Even in such a case, each command includes an identifier that specifies the voice data to be processed, so the voice recognition unit 102 can determine accurately from which voice data the voice recognition process should start. The voice recognition process is therefore performed on the appropriate voice data.

以上に説明したように、上述した第１の実施の形態では、音声データパケットに識別子を付与するとともに、処理対象とする音声データを特定するための識別子を指定した制御指令を発行する構成としているので、処理対象とする音声データを正確に特定することができる。 As described above, in the first embodiment described above, an identifier is assigned to the voice data packet and a control command specifying an identifier for specifying the voice data to be processed is issued. Therefore, it is possible to accurately specify the audio data to be processed.

このような手法を用いない場合、認識処理の対象とすべき一部の音声データが捨てられたり、あるいは逆に、認識処理の対象とすべきでない音声データも含めて音声認識処理に掛けてしまうおそれがある。 Without such a method, part of the voice data that should be recognized may be discarded or, conversely, voice data that should not be recognized may be passed to the voice recognition process along with the rest.

図２に示した指令「ＡＣ１」のケースでは、時刻Ａ４から始まる音声認識処理の対象として識別子「ＡＩ０」を持つパケット（以降の音声データ）を指示しているが、パケット保持手段１０２ｄによって保持されている範囲であれば任意の過去に到着したパケットを指定することができる。 In the case of the command “AC1” shown in FIG. 2, the packet having the identifier “AI0” (and the voice data that follows it) is designated as the target of the voice recognition process starting at time A4. However, any packet that arrived in the past can be designated, as long as it is still held by the packet holding unit 102d.

また、時刻Ａ４以降に到着するパケットを指定してもよく、その場合は、そのパケットが到着するまで音声認識処理の開始は延期される。 Alternatively, a packet that arrives after time A4 may be specified. In this case, the start of the speech recognition process is postponed until the packet arrives.
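The two cases above (starting from an already buffered packet, or deferring until a designated packet arrives) can be sketched as follows. `PacketHolder` and `Recognizer` are hypothetical stand-ins for the packet holding unit 102d and the voice recognition unit 102; actual recognition is replaced by collecting samples in a list.

```python
from collections import OrderedDict


class PacketHolder:
    """Buffers received packets in arrival order, keyed by identifier."""

    def __init__(self) -> None:
        self._packets = OrderedDict()

    def store(self, identifier, samples) -> None:
        self._packets[identifier] = samples

    def from_identifier(self, identifier):
        """Return buffered samples from `identifier` onward, or None if it has not arrived."""
        if identifier not in self._packets:
            return None
        keys = list(self._packets)
        return [self._packets[k] for k in keys[keys.index(identifier):]]


class Recognizer:
    def __init__(self) -> None:
        self.holder = PacketHolder()
        self.pending_start = None   # identifier awaited by a deferred start command
        self.active = False
        self.processed = []         # samples handed to recognition so far

    def on_start_command(self, identifier) -> None:
        data = self.holder.from_identifier(identifier)
        if data is None:
            self.pending_start = identifier      # defer until that packet arrives
        else:
            self.active = True
            self.processed.extend(data)

    def on_packet(self, identifier, samples) -> None:
        self.holder.store(identifier, samples)
        if self.active:
            self.processed.append(samples)
        elif self.pending_start == identifier:
            self.pending_start = None
            self.active = True
            self.processed.append(samples)
```

In this sketch the buffer is unbounded; a real holder would evict old packets, which is why the text limits past designations to the range still held.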

図２に示した例では、時刻Ａ８からＡ９にかけて音声データの入力が中断しているが、音声認識手段１０２の内部にあるパケット保持手段１０２ｄには古いパケットがまだ残っている。このため、音声データの入力処理の再開後は、パケット保持手段１０２ｄの格納情報の見た目上は、中断直前である時刻Ａ８に送信されたパケットと再開直後に送信された時刻Ａ９のパケットとが連続して保持されているように見える。 In the example shown in FIG. 2, the input of voice data is interrupted from time A8 to time A9, but old packets still remain in the packet holding means 102d inside the voice recognition means 102. Therefore, after the voice data input process resumes, the contents of the packet holding means 102d make it appear as if the packet transmitted at time A8, immediately before the interruption, and the packet transmitted at time A9, immediately after resumption, were held contiguously.

仮に、再開後の音声認識処理でパケット保持手段１０２ｄに保持されているパケットを順番に処理していくこととすると、時刻Ａ８と時刻Ａ９の音声データの非連続性の影響で、認識精度は劣化してしまうことになる。しかし、図２のケースにおいては、再開時に発行された認識開始指令「ＡＣ４」にて識別子「ＡＩ９」を指定するようにしているので、音声データの非連続性の影響を受けることはなく、認識精度の劣化を防止することができる。 If the voice recognition process after resumption simply processed the packets held in the packet holding unit 102d in order, recognition accuracy would deteriorate because of the discontinuity between the voice data at time A8 and at time A9. In the case of FIG. 2, however, the recognition start command “AC4” issued at resumption designates the identifier “AI9”, so the process is not affected by the discontinuity and deterioration of recognition accuracy is prevented.

また、上述した第１の実施の形態では、さらに、音声データパケットに識別子を付与するとともに、ある処理が扱うべき音声データの識別子を指定して制御指令を発行する構成としているので、複数の制御指令の送信時刻と受信時刻が錯綜したとしても、それぞれの制御指令にもとづく処理を適切な音声データに対して行うことができる。 Further, in the first embodiment described above, an identifier is assigned to each voice data packet, and a control command is issued with the identifier of the voice data that a given process should handle. Therefore, even if the transmission times and reception times of a plurality of control commands become tangled, the processing based on each control command can be performed on the appropriate voice data.

図２における制御指令「ＡＣ４」と制御指令「ＡＣ５」は、複数の制御指令の送信時刻と受信時刻が錯綜した場合の例である。 The control commands “AC4” and “AC5” in FIG. 2 are an example of a case in which the transmission times and reception times of a plurality of control commands become tangled.

認識処理指令「ＡＣ４」では処理対象の音声データを特定するために識別子「ＡＩ９」が指定され、認識処理指令「ＡＣ５」では処理対象の音声データを特定するために識別子「ＡＩ１０」が指定されていたとする。 In the recognition processing command “AC4”, the identifier “AI9” is specified to specify the audio data to be processed, and in the recognition processing command “AC5”, the identifier “AI10” is specified to specify the audio data to be processed. Suppose.

この場合、指令「ＡＣ４」と指令「ＡＣ５」が到着した順序に関わらず、音声認識処理手段１０２は、各指令で指定されている処理対象の音声データについて適切に音声認識処理を行う。 In this case, regardless of the order in which the commands “AC4” and “AC5” arrive, the speech recognition processing unit 102 appropriately performs speech recognition processing on the speech data to be processed specified by each command.
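The order independence described above can be illustrated with a small sketch. The function `resolve_targets` and the buffered identifier list are hypothetical; the point is only that each command carries its own starting identifier, so arrival order does not change which data it selects.

```python
def resolve_targets(commands, buffered_ids):
    """Map each command name to the buffered identifiers it should process.

    `commands` is an iterable of (name, start_identifier) in arrival order;
    `buffered_ids` is the list of packet identifiers in time order.
    """
    targets = {}
    for name, start_identifier in commands:
        start = buffered_ids.index(start_identifier)
        targets[name] = buffered_ids[start:]
    return targets


buffered = ["AI8", "AI9", "AI10", "AI11"]
sent_order = resolve_targets([("AC4", "AI9"), ("AC5", "AI10")], buffered)
swapped_order = resolve_targets([("AC5", "AI10"), ("AC4", "AI9")], buffered)
```

Both calls yield the same mapping, with “AC4” starting at “AI9” and “AC5” starting at “AI10”.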

これに対し、従来は、指令「ＡＣ４」と指令「ＡＣ５」とで異なるパラメータが用いられていたので、両者の到達順序が意図した順序と入れ替わってしまうと、それぞれの指令による音声データとパラメータの関係が不適切になり、認識精度の劣化を招くおそれがあった。 On the other hand, conventionally, since different parameters are used for the command “AC4” and the command “AC5”, if the arrival order of the two is changed from the intended order, the voice data and the parameter of each command are changed. There is a possibility that the relationship becomes inappropriate and the recognition accuracy deteriorates.

また、上述した第１の実施の形態では、音声データパケットへの識別子の付与を音声入力手段１０１の中で閉じて行う構成としているので、他の構成要素の種類や状態によらずに、音声データを適切に処理することができる。すなわち、音声入力手段１０１と他の構成要素とは互いに高いレベルで独立性を保ちながら、かつ制御指令と音声データの密な連携を達成することができる。 In the first embodiment described above, the assignment of identifiers to voice data packets is performed entirely within the voice input unit 101, so voice data can be processed appropriately regardless of the type or state of the other components. That is, the voice input unit 101 and the other components remain highly independent of one another while close coordination between control commands and voice data is still achieved.

実施の形態２． Embodiment 2.
次に、本発明の第２の実施の形態について図面を参照して説明する。 Next, a second embodiment of the present invention will be described with reference to the drawings.
図３は、本発明の第２の実施の形態における音声認識／合成システム３００の構成例を示すブロック図である。 FIG. 3 is a block diagram illustrating a configuration example of the speech recognition / synthesis system 300 according to the second embodiment of the present invention.

図３に示すように、本例の音声認識／合成システム３００は、音声入力手段３０１と、音声認識手段３０２ａ〜３０２ｎの集合３０２と、制御手段３０３とを含む。音声入力手段３０１と、複数の音声認識手段３０２ａ〜３０２ｎと、制御手段３０３とは、それぞれ伝送手段３０４によって接続されている。 As shown in FIG. 3, the speech recognition / synthesis system 300 of this example includes speech input means 301, a set 302 of speech recognition means 302 a to 302 n, and control means 303. The voice input unit 301, the plurality of voice recognition units 302a to 302n, and the control unit 303 are connected by a transmission unit 304, respectively.

音声入力手段３０１は、上述した音声入力手段１０１と同様の構成とされる。音声認識手段の集合３０２は、２つ以上の音声認識手段によって構成される。個々の音声認識手段３０２ａ〜３０２ｎは、それぞれ、上述した音声認識手段１０２と同様の構成とされる。伝送手段３０４は、上述した伝送手段１０４と同様の構成とされる。 The voice input unit 301 has the same configuration as the voice input unit 101 described above. The set 302 of speech recognition means is composed of two or more speech recognition means. Each of the voice recognition units 302a to 302n has the same configuration as the voice recognition unit 102 described above. The transmission unit 304 has the same configuration as the transmission unit 104 described above.

制御手段３０３は、ユーザインタフェース手段３０３ａと、対話管理手段３０３ｂと、結果統合手段３０３ｃとを含む。ユーザインタフェース手段３０３ａおよび対話管理手段３０３ｂは、それぞれ、上述した制御手段１０３におけるユーザインタフェース手段１０３ａおよび対話管理手段１０３ｂと同様の構成とされる。 The control means 303 includes user interface means 303a, dialogue management means 303b, and result integration means 303c. The user interface unit 303a and the dialogue management unit 303b have the same configuration as the user interface unit 103a and the dialogue management unit 103b in the control unit 103 described above, respectively.

結果統合手段３０３ｃは、複数の音声認識手段３０２ａ〜３０２ｎからそれぞれ受信した認識結果を何らかの方法で評価し、その結果を統合させて、単一の音声認識手段から取得した音声認識結果と同様に取り扱うことができるようにする。具体的には、例えば、処理対象の音声データにおける各区間の音声認識結果について、それぞれ、信頼度が高い音声認識手段の認識結果を採用し（例えば住所の音声認識については音声認識手段３０２ａが信頼度が最も高く、名前の音声認識については音声認識手段３０２ｂが信頼度が最も高いなどの情報をあらかじめ把握しておく）、採用した認識結果を繋ぎ合わせるようにすればよい。 The result integration unit 303c evaluates the recognition results received from the plurality of speech recognition units 302a to 302n by some method and integrates them so that they can be handled in the same way as a speech recognition result obtained from a single speech recognition unit. Specifically, for example, for the speech recognition result of each section of the speech data to be processed, the result from the speech recognition unit with the highest reliability for that section is adopted (information such as the speech recognition unit 302a being the most reliable for recognizing addresses and the speech recognition unit 302b being the most reliable for recognizing names is obtained in advance), and the adopted recognition results are joined together.
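The per-section selection described above might be sketched as follows. The function `integrate`, the category labels, the reliability scores, and the sample texts are all hypothetical; the patent leaves the evaluation method open.

```python
def integrate(sections, reliability):
    """Pick, for each section, the result from the recognizer most reliable for its category.

    `sections`: list of {"category": str, "results": {recognizer name: recognized text}}
    `reliability`: {category: {recognizer name: score known in advance}}
    """
    chosen = []
    for section in sections:
        scores = reliability[section["category"]]
        best = max(section["results"], key=lambda name: scores.get(name, 0.0))
        chosen.append(section["results"][best])
    return " ".join(chosen)


reliability = {
    "address": {"302a": 0.9, "302b": 0.6},
    "name": {"302a": 0.5, "302b": 0.8},
}
sections = [
    {"category": "name", "results": {"302a": "Sado", "302b": "Sato"}},
    {"category": "address", "results": {"302a": "Tokyo", "302b": "Kyoto"}},
]
combined = integrate(sections, reliability)
```

Here the name section is taken from 302b and the address section from 302a, and the two are joined in section order.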

本例の音声認識／合成システム３００は、複数の音声認識手段３０２ａ〜３０２ｎに、単一の音声入力手段３０１から入力された同一の音声データについて、それぞれ異なる条件の下で、音声認識処理を実行させる。 The speech recognition / synthesis system 300 of this example causes the plurality of speech recognition units 302a to 302n to execute speech recognition processing, each under different conditions, on the same speech data input from the single speech input unit 301.

そして、制御手段３０３に、音声認識手段３０２ａ〜３０２ｎがそれぞれ導き出す少しずつ異なる結果を、結果統合手段３０３ｃによって統合する処理を実行させる。 Then, the control unit 303 causes the result integrating unit 303c to execute a process of integrating the slightly different results derived by the voice recognition units 302a to 302n.

次に、本例の音声認識／合成システム３００の動作について説明する。
図４は、本例の音声認識／合成システム３００による音声認識処理の例を示すタイムチャートである。 Next, the operation of the speech recognition / synthesis system 300 of this example will be described.
FIG. 4 is a time chart showing an example of speech recognition processing by the speech recognition / synthesis system 300 of this example.

図４に示す音声認識処理では、各構成要素の間の音声データと制御指令のやり取りが示されている。 The voice recognition process shown in FIG. 4 shows the exchange of voice data and control commands between the constituent elements.

音声入力手段３０１は、時刻Ｂ０に入力された音声データに識別子「ＢＩ０」を付加し、各音声認識手段３０２ａ〜３０２ｃにそれぞれ送信する。 The voice input unit 301 adds the identifier “BI0” to the voice data input at time B0, and transmits it to each of the voice recognition units 302a to 302c.

ここでは、音声入力手段３０１によって送信された音声データが各音声認識手段３０２ａ〜３０２ｃに到着する時刻が、それぞれ異なるものとする。図４に示すように、この例では、時刻Ｂ１に音声認識手段３０２ｃに到着し、時刻Ｂ２に音声認識手段３０２ｂに到着し、時刻Ｂ３に音声認識手段３０２ａに到着したものとする。 Here, it is assumed that the voice data transmitted by the voice input unit 301 arrives at each of the voice recognition units 302a to 302c at a different time. As shown in FIG. 4, in this example the voice data arrives at the voice recognition means 302c at time B1, at the voice recognition means 302b at time B2, and at the voice recognition means 302a at time B3.

また、図４に示すように、制御手段３０３が時刻Ｂ４に発行した音声認識開始指令「ＢＣ１」が各音声認識手段３０２ａ〜３０２ｃに到着する時刻も、それぞれ異なるものとする。 Also, as shown in FIG. 4, the time at which the speech recognition start command “BC1” issued by the control unit 303 at time B4 arrives at each of the speech recognition units 302a to 302c is also different.

このとき、制御手段３０３は、処理対象データを特定するための識別子として指令「ＢＣ１」に識別子「ＢＩ０」を指定している。このため、各音声認識手段３０２ａ〜３０２ｃにて同一の音声データを処理対象とする音声認識処理が適切に実行される。 At this time, the control unit 303 designates the identifier “BI0” in the command “BC1” as an identifier for specifying the processing target data. For this reason, the voice recognition processing for processing the same voice data is appropriately executed in each of the voice recognition units 302a to 302c.

もちろん、制御手段３０３が、時刻Ｂ４とは異なる時刻Ｂ５に音声認識開始指令「ＢＣ２」を発行したとしても、指令「ＢＣ２」にて識別子「ＢＩ０」が指定されていれば、指令「ＢＣ１」を発行した場合と全く同一の音声データを音声認識処理の対象とさせることができる。 Of course, even if the control means 303 issues the voice recognition start command “BC2” at time B5 different from time B4, if the identifier “BI0” is specified by the command “BC2”, the command “BC1” is issued. The voice data that is exactly the same as that issued can be the target of voice recognition processing.

一方、ある音声データにおける異なる区間のデータを、各音声認識手段３０２ａ〜３０２ｃに別個に処理させるようにしてもよい。 On the other hand, the data of different sections in a certain voice data may be processed separately by each voice recognition means 302a to 302c.

具体的には、例えば、図４に示すように、時刻Ｂ６、時刻Ｂ７、時刻Ｂ８に入力された各音声データに付加された識別子が、それぞれ識別子「ＢＩ６」、識別子「ＢＩ７」、識別子「ＢＩ８」であるとする。そして、制御手段３０３が、音声認識処理開始指令「ＢＣ３」に識別子「ＢＩ６」を設定し、指令「ＢＣ４」に識別子「ＢＩ７」を設定し、「ＢＣ５」に識別子「ＢＩ８」を設定する。このように構成すれば、各音声認識手段３０２ａ〜３０２ｃに、それぞれ異なる音声区間の音声データを処理対象として音声認識処理を実行させることができる。 Specifically, for example, as shown in FIG. 4, the identifiers added to the audio data input at time B6, time B7, and time B8 are the identifier “BI6”, identifier “BI7”, identifier “BI8”, respectively. ”. Then, the control unit 303 sets the identifier “BI6” in the voice recognition process start command “BC3”, sets the identifier “BI7” in the command “BC4”, and sets the identifier “BI8” in “BC5”. If constituted in this way, each voice recognition means 302a-302c can be made to perform voice recognition processing for the voice data of a different voice section, respectively.

結果統合手段３０３ｃにおける認識結果統合処理には、さまざまな手法が考えられる。例えば、各音声認識手段３０２ａ〜３０２ｃでの認識結果の尤度にもとづいて並べ替え、信頼度を用いて再評価する等の手法を取ることができる。また、その他には、例えば、認識結果を純粋に文字列として扱う方法や、入力音声とのアライメントを取って評価する方法などが考えられる。 Various methods can be considered for the recognition result integration processing in the result integration unit 303c. For example, it is possible to take a technique such as rearrangement based on the likelihood of the recognition result in each of the speech recognition units 302a to 302c and re-evaluation using the reliability. In addition, for example, a method of treating the recognition result as a pure character string, a method of evaluating by alignment with the input speech, and the like can be considered.

以上に説明したように、上述した第２の実施の形態では、音声認識処理の対象となる音声データを識別子によって厳密に指定する構成としているので、複数の音声認識処理手段３０２ａ〜３０２ｃによって並列的に音声認識処理を行う際に、各々の音声認識処理手段３０２ａ〜３０２ｃが確実に指定通りに同じ音声データを扱うよう保証することができる。 As described above, in the second embodiment described above, since the voice data to be subjected to the voice recognition processing is strictly specified by the identifier, the plurality of voice recognition processing units 302a to 302c are configured in parallel. When performing voice recognition processing, it is possible to ensure that each voice recognition processing means 302a to 302c handles the same voice data as specified.

また、上述した第２の実施の形態では、音声認識処理の対象となる音声データを識別子によって厳密に指定する構成としているので、各音声認識手段３０２ａ〜３０２ｃにある音声データの異なる区間をそれぞれ音声認識処理させる際に、それぞれが処理した区間における音声データの時刻関係を正確に知ることができる。従って、複数の認識結果の時刻関係を完全に把握した上で、それら複数の認識結果を統合することができる。 In the second embodiment described above, the voice data to be subjected to the voice recognition process is strictly specified by the identifier. Therefore, different sections of the voice data in the voice recognition units 302a to 302c are set to the respective voices. When performing the recognition process, it is possible to accurately know the time relationship of the audio data in each processed section. Therefore, it is possible to integrate the plurality of recognition results after completely grasping the time relationship between the plurality of recognition results.
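Because each partial result is tied to the identifier of the section it covers, restoring the time relationship before integration can be as simple as the sketch below. It assumes, as in this example, that identifiers are numeric time stamps; the function name `merge_by_identifier` is hypothetical.

```python
def merge_by_identifier(results):
    """Order partial recognition results by the time-stamp identifier of their section.

    `results` is a list of (section identifier, recognized text) pairs in arrival order.
    """
    return [text for _, text in sorted(results, key=lambda item: item[0])]
```

For instance, results that arrive as sections 8, 6, 7 would be returned in the order 6, 7, 8.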

さらに、上述した第２の実施の形態では、音声認識処理の対象となる音声データを識別子によって厳密に指定する構成としているので、ある音声認識手段の認識結果や認識処理中の途中経過に応じて、処理対象の音声に適した別の音声認識手段を起動したり、認識処理の精度を向上させるためのパラメータ（処理対象の音声に適したパラメータ。具体的には、例えば氏名用のパラメータ、住所用のパラメータなどがある）を動的に変化させることができ、その際に扱われた音声データの識別子を調べることによって、結果統合手段３０３ｃがそれらをより正確に統合することができる。 Furthermore, in the second embodiment described above, the voice data to be subjected to voice recognition processing is strictly specified by an identifier. Therefore, depending on the recognition result of a certain voice recognition means or on intermediate progress during recognition, another voice recognition means suited to the speech being processed can be started, or parameters for improving recognition accuracy (parameters suited to the speech being processed, for example parameters for names or parameters for addresses) can be changed dynamically. By examining the identifiers of the voice data handled at that time, the result integration means 303c can integrate the results more accurately.

実施の形態３． Embodiment 3.
次に、本発明の第３の実施の形態について図面を参照して説明する。 Next, a third embodiment of the present invention will be described with reference to the drawings.
図５は、本発明の第３の実施の形態における音声認識／合成システム５００の構成例を示すブロック図である。 FIG. 5 is a block diagram illustrating a configuration example of a speech recognition / synthesis system 500 according to the third embodiment of the present invention.

図５に示すように、音声認識／合成システム５００は、音声入力手段５０１と、第１の音声認識手段５０２と、第２の音声認識手段５０３と、制御手段５０４とを含む。音声入力手段５０１と、第１の音声認識手段５０２と、第２の音声認識手段５０３と、制御手段５０４とは、伝送手段５０５によって接続されている。 As shown in FIG. 5, the speech recognition / synthesis system 500 includes a speech input unit 501, a first speech recognition unit 502, a second speech recognition unit 503, and a control unit 504. The speech input unit 501, the first speech recognition unit 502, the second speech recognition unit 503, and the control unit 504 are connected by a transmission unit 505.

第１の音声認識手段５０２は、音声認識制御手段５０２ａと、音声認識処理手段５０２ｂと、結果統合手段５０２ｆと、識別子判別手段５０２ｃと、パケット保持手段５０２ｄと、パケット送受信手段５０２ｅとを含む。 The first voice recognition unit 502 includes a voice recognition control unit 502a, a voice recognition processing unit 502b, a result integration unit 502f, an identifier determination unit 502c, a packet holding unit 502d, and a packet transmission / reception unit 502e.

音声認識制御手段５０２ａと、音声認識処理手段５０２ｂと、識別子判別手段５０２ｃと、パケット保持手段５０２ｄとは、それぞれ、音声認識制御手段１０２ａと、音声認識処理手段１０２ｂと、識別子判別手段１０２ｃと、パケット保持手段１０２ｄと同様に構成される。 The voice recognition control means 502a, the voice recognition processing means 502b, the identifier discrimination means 502c, and the packet holding means 502d are respectively the voice recognition control means 102a, the voice recognition processing means 102b, the identifier discrimination means 102c, and the packet. The same as the holding means 102d.

パケット送受信手段５０２ｅは、音声入力手段５０１からの音声データパケットを受信する処理や、第２の音声認識手段５０３に対して音声データパケットを送信する処理などを実行する。 The packet transmitting / receiving unit 502e executes processing such as receiving voice data packets from the voice input unit 501 and transmitting voice data packets to the second voice recognition unit 503.

結果統合手段５０２ｆは、第１の音声認識手段５０２の認識結果と、第２の音声認識手段５０３の認識結果とを統合する処理などを実行する。 The result integration unit 502f executes processing such as integrating the recognition result of the first speech recognition unit 502 with the recognition result of the second speech recognition unit 503.

第１の音声認識手段５０２は、音声入力手段５０１からの音声データを受信し、制御手段５０４の制御に応じて受信した音声データを認識するための音声認識処理を実行し、認識結果を送信する。 The first voice recognition unit 502 receives the voice data from the voice input unit 501, executes voice recognition processing for recognizing the received voice data according to the control of the control unit 504, and transmits the recognition result. .

この例では、第１の音声認識手段５０２は、音声認識処理の任意のタイミングで第２の音声認識手段５０３を呼び出し、音声認識手段５０３による音声認識処理の処理結果を受け取って、結果統合手段５０２ｆによって自らの認識結果と統合した後、それを最終的な結果として用いる。 In this example, the first speech recognition unit 502 calls the second speech recognition unit 503 at an arbitrary timing of the speech recognition process, receives the processing result of the speech recognition process by the speech recognition unit 503, and integrates the result integration unit 502f. After integrating it with its own recognition result, it is used as the final result.

第２の音声認識手段５０３が用いる音声データは、第１の音声認識手段５０２の内部にあるパケット保持手段５０２ｄから読み出した音声データをパケット送受信手段５０２ｅによって転送することにより第２の音声認識手段５０３に入力される。なお、音声入力手段５０１から第２の音声認識手段５０３に、音声データを直接送信するようにしてもよい。 The voice data used by the second voice recognition means 503 is the second voice recognition means 503 by transferring the voice data read from the packet holding means 502d inside the first voice recognition means 502 by the packet transmitting / receiving means 502e. Is input. Note that voice data may be directly transmitted from the voice input unit 501 to the second voice recognition unit 503.

音声入力手段５０１、制御手段５０４、伝送手段５０５は、それぞれ、上述した音声入力手段１０１、制御手段１０３、伝送手段１０４と同様に構成される。また、第１の音声認識手段５０２は、上述した音声認識手段１０２に結果統合手段５０２ｆを付加し、さらにパケット受信手段１０２ｅをパケット送受信手段５０２ｅに変更した構成とされている。さらに、第２の音声認識手段５０３は、上述した音声認識手段１０２と同様の構成とされる。 The voice input unit 501, the control unit 504, and the transmission unit 505 are configured in the same manner as the voice input unit 101, the control unit 103, and the transmission unit 104 described above, respectively. The first voice recognition unit 502 is the voice recognition unit 102 described above with a result integration unit 502f added and with the packet receiving unit 102e replaced by a packet transmitting / receiving unit 502e. The second voice recognition unit 503 has the same configuration as the voice recognition unit 102 described above.

Third, fourth, and further voice recognition means may also be interposed between the first voice recognition means 502 and the second voice recognition means 503. Furthermore, each of the voice recognition means 502 and 503 may be replaced by the set 302 of the plurality of voice recognition means 302a to 302n of the second embodiment described above.

Next, the operation of the voice recognition/synthesis system 500 of this example will be described.
FIG. 6 is a time chart showing an example of voice recognition processing by the voice recognition/synthesis system 500 of this example.

In this example, while the first voice recognition means 502 is performing a certain recognition process, voice recognition processing is executed under different conditions on part or all of the voice data being processed. To do so, the first voice recognition means 502 only needs to transfer the voice data that is to be recognized under the different conditions to the second voice recognition means 503.

FIG. 6 shows an example in which the first voice recognition means 502 has the second voice recognition means 503 process the same voice data. It shows two ways of supplying the voice data to the second voice recognition means 503.

In the first example (times C0 to C8), the second voice recognition means 503 receives the voice data directly from the voice input means 501. That is, the voice input means 501 adds identifiers to the input voice data and transmits it sequentially to both the first voice recognition means 502 and the second voice recognition means 503.

At time C0, the control means 504 issues a command "CC1" containing the identifier "CI1" to the first voice recognition means 502, instructing it to recognize the voice data with the identifier "CI1".

At time C2, upon receiving the command "CC1", the first voice recognition means 502 starts voice recognition processing and also issues a command "CC2" containing the identifier "CI1" to the second voice recognition means 503, so that it too starts recognition processing on the voice data with the same identifier "CI1".

The second voice recognition means 503 receives the voice data with the identifier "CI1" directly from the voice input means 501 and performs voice recognition processing. When the recognition processing is complete, the second voice recognition means 503 transmits a recognition completion notification "CC2'" to the first voice recognition means 502.

On receiving the recognition completion notification "CC2'", the first voice recognition means 502 integrates its own recognition result with that of the second voice recognition means 503 using the result integration means 502f, and transmits a recognition completion notification "CC1'", indicating completion of the final recognition processing, to the control means 504.

As described above, in the first example, the first voice recognition means 502 starts voice recognition processing in response to the command "CC1" from the control means 504 and also issues the command "CC2" to the second voice recognition means 503. The command "CC1" and the command "CC2" may have identical content as long as both contain the identifier "CI1", or they may differ in part (for example, in information on the processing time or on the destination for the processing result). In this case, the first voice recognition means 502 may create the command "CC2" by modifying the content of the command "CC1". The same applies to the recognition completion notifications "CC2'" and "CC1'".
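The delegation in the first example can be sketched roughly as follows. This is a minimal illustration only: the class and field names (`Command`, `Recognizer`, `reply_to`) are hypothetical, and the "recognition" is a placeholder that merely reports which data was processed. It assumes both recognizers have already received the packet with identifier "CI1" directly from the voice input means.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Command:
    name: str        # e.g. "CC1"
    identifier: str  # identifier of the voice data to process, e.g. "CI1"
    reply_to: str    # hypothetical field: destination of the completion notice


class Recognizer:
    """Hypothetical stand-in for a voice recognition means."""

    def __init__(self, label, delegate=None):
        self.label = label
        self.delegate = delegate  # second recognizer, if any
        self.packets = {}         # packet holding means: identifier -> data

    def receive_packet(self, identifier, data):
        self.packets[identifier] = data

    def handle(self, cmd):
        # Placeholder recognition: report which data this recognizer processed.
        own = f"{self.label}:{self.packets[cmd.identifier]}"
        if self.delegate is None:
            return [own]
        # Derive the sub-command from the received one: same identifier,
        # different name and reply destination.
        sub = replace(cmd, name="CC2", reply_to=self.label)
        # Result integration: own result followed by the delegate's result.
        return [own] + self.delegate.handle(sub)


second = Recognizer("503")
first = Recognizer("502", delegate=second)
for recognizer in (first, second):  # voice input sends to both directly
    recognizer.receive_packet("CI1", "hello")
results = first.handle(Command("CC1", "CI1", reply_to="504"))
```

Because the sub-command carries the same identifier, both recognizers operate on the same voice data without any further coordination.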

In the second example (times C9 to C18), the second voice recognition means 503 receives the voice data to be recognized from the first voice recognition means 502. That is, the first voice recognition means 502 receives the identifier-tagged voice data from the voice input means 501 and forwards it sequentially to the second voice recognition means 503.

At time C9, the control means 504 issues a command "CC3" containing the identifier "CI10" to the first voice recognition means 502, instructing it to recognize the voice data with the identifier "CI10".

At time C11, upon receiving the command "CC3", the first voice recognition means 502 starts voice recognition processing and transfers the voice data with the identifier "CI10" to the second voice recognition means 503. Then, at time C12, it issues a command "CC4" to the second voice recognition means 503 so that it too starts recognition processing on the voice data with the same identifier "CI10".

The second voice recognition means 503 receives the voice data packets tagged with the identifier "CI10" from the first voice recognition means 502 and performs voice recognition processing in accordance with the command "CC4".

When the voice recognition processing is complete, the second voice recognition means 503 transmits a recognition completion notification "CC4'" to the first voice recognition means 502.

On receiving the recognition completion notification "CC4'", the first voice recognition means 502 integrates its own recognition result with that of the second voice recognition means 503 using the result integration means 502f, and transmits a recognition completion notification "CC3'", indicating completion of the final recognition processing, to the control means 504.

It is desirable to decide which component the second voice recognition means 503 receives voice data from according to conditions such as the state of the transmission paths between the components.

For example, if the transmission path from the voice input means 501 to the voice recognition means is relatively congested while the transmission path between the voice recognition means is relatively idle, it is better to exchange the voice data between the voice recognition means.

As described above, in the third embodiment, one voice recognition means calls another voice recognition means to execute voice recognition processing, so voice recognition performance can be improved without any apparent involvement of the other components. In addition, by using control commands that specify an identifier, the voice data to be processed by the second voice recognition means 503 can be specified precisely.

Embodiment 4.
Next, a fourth embodiment of the present invention will be described with reference to the drawings.
FIG. 7 is a block diagram showing a configuration example of a voice recognition/synthesis system 700 according to the fourth embodiment of the present invention.

As shown in FIG. 7, the voice recognition/synthesis system 700 includes voice generation means 701, voice output means 702, and control means 703. The voice generation means 701, the voice output means 702, and the control means 703 are connected by transmission means 704.

The voice generation means 701 includes voice generation control means 701a, voice generation processing means 701b, packet dividing means 701c, identifier assigning means 701d, packet holding means 701e, and packet transmission means 701f.

The voice generation means 701 generates voice in response to a control command from the control means 703, divides it into packets, assigns identifiers to the packets, and transmits them to the voice output means 702.

The voice generation control means 701a receives control commands from other components such as the control means 703 and controls the overall operation of the voice generation means 701 based on the received commands. It also transmits information such as the status of the voice generation processing in response to requests from other components.

The voice generation processing means 701b generates voice data based on control commands from other components. Specifically, it executes processing such as synthesizing a voice waveform from a character string using voice synthesis technology, or reading a voice waveform file designated by a control command.

The identifier assigning means 701d operates in the same way as the identifier assigning means 101d of the voice recognition/synthesis system 100 described above, but additionally has a function of determining the identifier to be assigned to a given packet according to a control command from another component.

The packet dividing means 701c, the packet holding means 701e, and the packet transmission means 701f are configured in the same manner as the packetizing means 101c, the packet holding means 101e, and the packet transmission means 101f of the voice recognition/synthesis system 100 described above, respectively.

The voice output means 702 includes voice output control means 702a, voice output means 702b, identifier determination means 702c, packet holding means 702d, and packet receiving means 702e.

The voice output means 702 outputs voice in response to control commands from the control means 703, and also performs processing in response to requests from the control means 703 and transmits the results to the control means 703.

The control means 703 and the transmission means 704 are configured in the same manner as the control means 103 and the transmission means 104 of the voice recognition/synthesis system 100 described above, respectively.

Next, the operation of the voice recognition/synthesis system 700 of this example will be described.
FIG. 8 is a time chart showing an example of voice synthesis processing by the voice recognition/synthesis system 700 of this example.

The time chart of FIG. 8 shows an example in which two pieces of voice data, generated by the voice generation means 701 on the instruction of the control means 703, are output to the user (user terminal) through the voice output means 702 at the times intended by the control means 703.

In FIG. 8, the vertical solid arrows represent the passage of time in each component, with later events shown further down. The horizontal solid arrows represent the flow of control commands, and the dashed arrows represent the flow of voice data. "User" here means a user terminal such as a personal computer or a portable information terminal.

At time D0, the control means 703 issues a command "DC1" to the voice generation means 701, instructing it to perform voice generation processing and transmit the resulting voice data to the voice output means 702. At the same time, the control means 703 instructs the voice generation means 701 to assign the identifier "DI0" to the first packet of the generated voice data.

On receiving the command "DC1", the voice generation means 701 generates the voice using the voice generation processing means 701b and divides it into packets. It assigns identifiers to the packets in order, starting from "DI0" and continuing with "DI1", "DI2", and so on. Here, the identifier assigned to the last packet is assumed to be "DI4".
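The packet division and identifier assignment described above can be illustrated with a short sketch. It assumes, purely for illustration, that an identifier is a prefix followed by a running integer and that the control command supplies the starting index; the function name `packetize` and its parameters are hypothetical.

```python
def packetize(samples, packet_size, prefix, start_index):
    """Split samples into fixed-size packets tagged with sequential identifiers."""
    packets = []
    for n, offset in enumerate(range(0, len(samples), packet_size)):
        identifier = f"{prefix}{start_index + n}"
        packets.append((identifier, samples[offset:offset + packet_size]))
    return packets


# Ten dummy samples, two per packet, first identifier "DI0" as commanded.
packets = packetize(list(range(10)), packet_size=2, prefix="DI", start_index=0)
identifiers = [identifier for identifier, _ in packets]
```

With these inputs the last packet carries the identifier "DI4", matching the example in the text.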

When generation of the voice data for the command "DC1" is complete, the voice generation means 701 transmits a generation completion notification "DC1'" to the control means 703.

Next, at time D6, the control means 703 issues a voice data output command "DC3" to the voice output means 702, instructing it to output the voice data tagged with the identifiers "DI0" to "DI4" sequentially, starting at time D11.

On receiving the voice data output command "DC3", the voice output means 702 receives the voice data tagged with those identifiers ("DI0" to "DI4") from the voice generation means 701, holds it in the packet holding means 702d by time D11, and then, from time D11, sequentially outputs the voice data tagged with the identifiers "DI0" to "DI4".
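The hold-then-output behaviour can be sketched as below. This is an assumption-laden illustration rather than the patented implementation: the class `VoiceOutput`, its methods, and the use of plain integers as times are all hypothetical. The sketch releases a commanded sequence only when its start time has arrived and every named packet is held.

```python
class VoiceOutput:
    """Hypothetical voice output means with a packet holding store."""

    def __init__(self):
        self.held = {}     # identifier -> voice data
        self.pending = []  # (start_time, [identifiers]) from output commands

    def receive_packet(self, identifier, data):
        self.held[identifier] = data

    def command_output(self, identifiers, start_time):
        self.pending.append((start_time, list(identifiers)))

    def tick(self, now):
        """Return data for commands that are due and fully buffered."""
        out, still_pending = [], []
        for start_time, ids in self.pending:
            if now >= start_time and all(i in self.held for i in ids):
                out.extend(self.held.pop(i) for i in ids)
            else:
                still_pending.append((start_time, ids))
        self.pending = still_pending
        return out


output = VoiceOutput()
ids = [f"DI{i}" for i in range(5)]
output.command_output(ids, start_time=11)  # command arrives first
for identifier in reversed(ids):           # packets arrive later, out of order
    output.receive_packet(identifier, identifier.lower())
early = output.tick(now=10)
played = output.tick(now=11)
```

The output order follows the identifiers named in the command, not the order in which the packets happened to arrive.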

In this example, as shown in FIG. 8, the control means 703 also issues a control command "DC2" so that another piece of voice data is output immediately after the voice data generated and output according to the control command "DC1".

Specifically, in the same way as for the control command "DC1", the control means 703 issues the command "DC2" to the voice generation means 701 at time D3 and instructs it to assign the identifier "DI5" to the first packet of the generated voice data.

On receiving the command "DC2", the voice generation means 701 generates the voice using the voice generation processing means 701b and divides it into packets. It assigns identifiers to the packets in order, starting from "DI5" and continuing with "DI6", "DI7", and so on. Here, the identifier assigned to the last packet is assumed to be "DI10".

When generation of the voice data for the command "DC2" is complete, the voice generation means 701 transmits a generation completion notification "DC2'" to the control means 703.

Next, at time D8, the control means 703 issues a voice data output command "DC4" to the voice output means 702, instructing it to sequentially output the voice data tagged with the identifiers "DI5" to "DI10". In the command "DC4", the time at which output actually starts is specified as time D12, which is calculated by adding the output duration of the preceding voice data (the voice data tagged with the identifiers "DI0" to "DI4") to its start time D11.

The output duration of the voice data may be obtained in advance by any suitable method. For example, depending on how the identifiers are defined, it can be calculated from the identifiers themselves. It may also be obtained by querying the voice generation means 701. Alternatively, information indicating the output duration of the corresponding voice data may be sent together with the response "DC1'" to the first control command "DC1".
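As a small worked example of the start-time calculation, assume (hypothetically) that identifiers are consecutive integers and every packet has the same fixed duration, so that the duration follows from the identifiers alone. The function name and the millisecond values below are illustrative only.

```python
def next_start_time_ms(prev_start_ms, first_index, last_index, packet_ms):
    """Start time of the following voice data: previous start plus its duration."""
    packet_count = last_index - first_index + 1
    return prev_start_ms + packet_count * packet_ms


# Packets DI0..DI4 at an assumed 20 ms each, starting at an assumed D11 = 11000 ms.
d11_ms = 11000
d12_ms = next_start_time_ms(d11_ms, first_index=0, last_index=4, packet_ms=20)
```

Here the five packets last 100 ms in total, so the second piece of voice data is scheduled for 11100 ms.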

As an alternative to the method shown in FIG. 8, continuous voice output could also be achieved by instructing the voice output means 702 to output the voice data with the identifiers "DI5" to "DI10" immediately after it finishes outputting the voice data with the identifier "DI4".

As described above, in the fourth embodiment, a command to actually output voice data specifies not only the output time but also the identifiers of the voice data to be processed, so it can be guaranteed that the appropriate voice data is output at the appropriate time.

Embodiment 5.
Next, a fifth embodiment of the present invention will be described with reference to the drawings.
FIG. 9 is a block diagram showing a configuration example of a voice recognition/synthesis system 900 according to the fifth embodiment of the present invention.

As shown in FIG. 9, the voice recognition/synthesis system 900 includes voice input means 901, voice recognition means 902, voice generation means 903, voice output means 904, and control means 905. The voice input means 901, the voice recognition means 902, the voice generation means 903, the voice output means 904, and the control means 905 are connected by transmission means 906.

A configuration that omits one or more of the components other than the control means 905 is also possible.

The voice input means 901, the voice recognition means 902, the control means 905, and the transmission means 906 are configured in the same manner as the voice input means 101, the voice recognition means 102, the control means 103, and the transmission means 104 of the voice recognition/synthesis system 100 described above, respectively.

The voice generation means 903 and the voice output means 904 are configured in the same manner as the voice generation means 701 and the voice output means 702 of the voice recognition/synthesis system 700 described above, respectively. Details of the individual operations of the components shown in FIG. 9 are therefore omitted.

The operation of the voice recognition/synthesis system 900 of this example combines the voice recognition processing of the first embodiment and the voice synthesis processing of the fourth embodiment described above.

The fifth embodiment therefore provides the effects described for both the first embodiment and the fourth embodiment.

Furthermore, in the fifth embodiment, a barge-in function can be realized by combining the voice input means 901 and the voice output means 904. Compared with the conventional technique, the section of input voice data that is discarded can therefore be kept small.

The voice recognition/synthesis system 900 shown in FIG. 9 may be further combined with the configurations shown in the second and third embodiments described above. Such a configuration also provides the effects described for the second and third embodiments.

As described above, in each of the embodiments, processing is performed with an identifier that is unique within the system attached to each voice data packet. As a result, two inherently independent kinds of information, namely voice data and the control commands for voice recognition or voice synthesis, can be properly synchronized, and the components of the voice recognition/synthesis system can be made to cooperate properly.

In other words, unlike the configurations disclosed in Patent Document 1 and Patent Document 2, in which the data source sets the head of each piece of data to be synchronized as a reference point, the embodiments attach a system-unique identifier to each voice data packet. The components that can set the synchronization timing are therefore not restricted: any component in the system can freely set any synchronization timing, which enables flexible processing.

Also, in each of the embodiments, a system-unique identifier is attached to each voice data packet and control commands are issued with that identifier set. Because any component can freely set the synchronization timing, the inherently independent voice data and control commands can be synchronized. Individual components can therefore operate more intelligently.

In addition, because control commands carry the identifier attached to the voice data packets, the voice data targeted by each voice recognition or voice synthesis process can be identified easily and precisely, which prevents degradation of voice recognition accuracy and synthesized voice quality.

Also, in each of the embodiments, the section of voice data to be recognized is specified precisely by identifier, so loss of input voice, in particular loss of the beginning of an utterance, can be avoided and a drop in voice recognition accuracy prevented. In addition, if the inclusion of noise is kept to a minimum, false detection of utterance sections can be suppressed.

When the voice data transmission system and the command transmission system are independent, as in the embodiments described above, the time at which the voice data section corresponding to a command to start voice recognition arrives at the voice recognition means is indeterminate. In each of the embodiments, the appropriate voice data section to be processed can be identified without knowing the arrival timing of the voice data, and recognition processing can be performed properly.

Furthermore, even when commands to start voice recognition are issued frequently, the voice data targeted by each command can be identified without confusion, and the voice data section targeted by a given recognition process can be specified precisely. It can therefore be guaranteed that voice recognition is performed with parameter settings suited to the voice being processed.

That is, when voice data and control commands travel over different transmission paths, as in the embodiments described above, their relative order is not guaranteed at all. In each of the embodiments, even if voice data arrives at the voice recognition component in an order different from that intended by the component that issued the control command, the appropriate voice data section can be reliably identified and recognition processing performed properly. This prevents, for example, a situation in which an utterance of a full name is immediately followed by an utterance of a telephone number, and recognition is mistakenly performed on the former with the telephone-number parameter settings and on the latter with the full-name parameter settings.
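The order independence described above can be illustrated with a small sketch in which commands and packets are matched purely by identifier, whichever arrives first. Everything here is hypothetical: the class `IdentifierKeyedRecognizer`, the identifiers "I0" and "I1", and the parameter-set names stand in for whatever the real components would use, and the "recognition" just records which parameters were paired with which data.

```python
class IdentifierKeyedRecognizer:
    """Hypothetical recognizer that pairs commands and data by identifier."""

    def __init__(self):
        self.held = {}     # identifier -> voice data already received
        self.waiting = {}  # identifier -> parameters of a command awaiting data
        self.results = []  # (parameters, data) pairs actually processed

    def on_command(self, identifier, params):
        if identifier in self.held:
            self._run(identifier, params)
        else:
            self.waiting[identifier] = params  # data has not arrived yet

    def on_packet(self, identifier, data):
        self.held[identifier] = data
        if identifier in self.waiting:
            self._run(identifier, self.waiting.pop(identifier))

    def _run(self, identifier, params):
        self.results.append((params, self.held[identifier]))


recognizer = IdentifierKeyedRecognizer()
recognizer.on_command("I0", "name-parameters")
recognizer.on_packet("I1", "<phone utterance>")  # arrives before its command
recognizer.on_packet("I0", "<name utterance>")
recognizer.on_command("I1", "phone-parameters")
```

Although the data and commands arrive interleaved, each utterance is paired with the parameter set intended for it.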

In each of the embodiments described above, if the management of the attached identifiers is suitably devised (for example, by storing the combinations of voice data and attached identifiers, or by storing the input voice data itself), a control command issued at any time can be associated with voice data from any other time. With such a configuration, a control command can target voice data that was input at any time before, or will be input at any time after, the command is issued. This makes it possible to minimize the apparent idle time between one process and the next. In practice, some mechanism that ensures the specified voice data can actually be obtained (buffering of past data, scheduling of future control) must be implemented as well.
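One possible form of the "buffering of past data" mentioned above is a bounded history keyed by identifier. The sketch below is only one assumption about how such a buffer might look; the class `PacketHistory`, its capacity policy (drop the oldest packet), and the identifiers are hypothetical.

```python
from collections import OrderedDict


class PacketHistory:
    """Hypothetical bounded store of recent packets, oldest dropped first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._packets = OrderedDict()  # identifier -> data, in arrival order

    def store(self, identifier, data):
        self._packets[identifier] = data
        while len(self._packets) > self.capacity:
            self._packets.popitem(last=False)  # discard the oldest packet

    def fetch_from(self, first_identifier):
        """Return held packets from first_identifier onward, oldest first."""
        if first_identifier not in self._packets:
            return []  # already discarded, or not yet received
        ids = list(self._packets)
        start = ids.index(first_identifier)
        return [(i, self._packets[i]) for i in ids[start:]]


history = PacketHistory(capacity=3)
for n in range(5):
    history.store(f"I{n}", n)
recent = history.fetch_from("I3")   # still buffered
expired = history.fetch_from("I0")  # fell out of the buffer
```

A command that names an identifier still in the buffer can be served retroactively; one that names a discarded identifier cannot, which is why the buffer size has to be chosen with the expected command latency in mind.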

In the embodiments described above, only the case in which the time for executing voice output is set in a control command (for example, the control command "DC3") was described, but the time for executing voice recognition processing or voice synthesis processing may likewise be set in a control command. In that case, the voice recognition processing or voice synthesis processing is executed, at the time set in the control command, on the voice data packets specified in that command.

Although not specifically mentioned in the embodiments above, each process executed in the systems 100, 300, 500, 700, and 900 is executed according to a control program (synchronization control program) installed in the system. This control program is, for example, a synchronization control program that causes a voice recognition/synthesis system, which performs voice recognition processing for analyzing input voice data and/or voice synthesis processing for generating voice data, to execute synchronization control. It causes a computer constituting the voice recognition/synthesis system to execute: a step of issuing a control command in which identification information is set for specifying, among the pieces of divided voice data obtained by dividing voice data into a plurality of sections, the divided voice data to be processed; and a step of performing, in accordance with the control command, voice recognition processing and/or voice synthesis processing on the divided voice data specified by the identification information set in that control command.

Next, a specific example of the present invention will be described.
The example described below corresponds to the fifth embodiment described above.

FIG. 10 is an explanatory diagram showing a speech recognition/synthesis system 1000 according to this example. As shown in FIG. 10, the speech recognition/synthesis system 1000 includes a voice dialogue management server 1001, an input/output terminal device 1002, a voice input/output server 1003, a speech recognition server 1004, and a speech synthesis server 1005.

The voice dialogue management server 1001, the input/output terminal device 1002, the voice input/output server 1003, the speech recognition server 1004, and the speech synthesis server 1005 are connected to one another via a computer network 1006.

Any number of each of the computers 1002, 1003, 1004, and 1005, that is, all of the computers other than the voice dialogue management server 1001, can be connected. A single device can also take on the roles of any two or more of these devices. For example, a single computer can serve as both the voice dialogue management server 1001 and the voice input/output server 1003.

Alternatively, a single computer can serve as all of the components. Furthermore, as in the third embodiment described above, one speech recognition server 1004 or speech synthesis server 1005 may also act as a proxy server that calls another speech recognition server 1004 or speech synthesis server 1005.

The voice dialogue management server 1001 is a control device that controls the entire system; it is a computer running a program with functions corresponding to the control means 905 described above (see FIG. 9). It also has a load distribution function: it selects one server from among a plurality of speech recognition servers or speech synthesis servers and brokers the connection between that server and the voice input/output server.
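The specification does not state how the server is selected. As one possible illustration, the sketch below assumes a least-loaded policy based on a count of active sessions; the function names and the session counter are hypothetical.

```python
from typing import Dict


def select_server(active_sessions: Dict[str, int]) -> str:
    """Pick the server with the fewest active sessions.

    The least-loaded policy is an assumption; ties are broken by server name.
    """
    if not active_sessions:
        raise ValueError("no servers registered")
    return min(sorted(active_sessions), key=lambda name: active_sessions[name])


def assign(active_sessions: Dict[str, int]) -> str:
    """Select a server and record the new session against it."""
    chosen = select_server(active_sessions)
    active_sessions[chosen] += 1
    return chosen
```

The management server would then pass the chosen server's address to the voice input/output server, which is the brokering role described in the paragraph above.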

The input/output terminal device 1002 is an input/output device used directly by the user. In addition to a voice input/output function, it may be equipped with a display device, a keyboard, a mouse, a touch panel, and the like.

Specifically, a PC (personal computer), a telephone (fixed-line telephone or mobile phone), a PDA (Personal Digital Assistant), a network-enabled home appliance, or the like is used as the input/output terminal device 1002.

The input/output terminal device 1002 takes on part of the functions of each of the control means 905, the voice input means 901, and the voice output means 904 of the speech recognition/synthesis system 900.

The voice input/output server 1003 is a server device that divides the voice data input through the input/output terminal device 1002 into voice packets, assigns an identifier to each packet, and transmits the packets to the other components.

Conversely, the voice input/output server 1003 also has a function of combining packets received from other components and sending the result to the input/output terminal device 1002.

That is, the voice input/output server 1003 takes on part of the functions of each of the voice input means 901 and the voice output means 904 of the speech recognition/synthesis system 900. In this example, a time stamp is used as the identifier of the voice data.
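The division, time-stamp assignment, and recombination performed by the voice input/output server can be sketched as follows. This is a simplified illustration: the function names, the fixed packet size, and the derivation of each time stamp from the byte offset are assumptions, since the specification does not define the packet layout.

```python
from typing import List, Tuple

Packet = Tuple[float, bytes]  # (time stamp in seconds, audio bytes)


def split_into_packets(pcm: bytes, start_time: float,
                       bytes_per_second: int, packet_bytes: int) -> List[Packet]:
    """Divide raw audio into fixed-size packets, each tagged with a time stamp.

    The time stamp is the capture time of the first byte in the packet.
    """
    packets: List[Packet] = []
    for offset in range(0, len(pcm), packet_bytes):
        timestamp = start_time + offset / bytes_per_second
        packets.append((timestamp, pcm[offset:offset + packet_bytes]))
    return packets


def combine_packets(packets: List[Packet]) -> bytes:
    """Reassemble audio by ordering packets by time stamp, whatever the arrival order."""
    ordered = sorted(packets, key=lambda packet: packet[0])
    return b"".join(chunk for _, chunk in ordered)
```

Because the time stamp doubles as the identifier, a control command can refer to a point in the audio stream simply by naming a time, and packets that arrive out of order over a high-latency network can still be reassembled correctly.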

The speech recognition server 1004 is a server device that performs speech recognition processing on the voice data obtained from the voice input/output server 1003 and transmits the result to the voice dialogue management server 1001. The speech recognition server 1004 corresponds to the speech recognition means 902 of the speech recognition/synthesis system 900.

The speech synthesis server 1005 is a server device that synthesizes voice data in accordance with instructions from the voice dialogue management server 1001 and transmits the result to the voice input/output server 1003.

Note that speech synthesis processing need not be performed every time; results synthesized in advance may be cached, and the cached synthesized speech may be used. In addition, voice data may be generated not only from synthesized speech but also by playing back an arbitrary waveform file.
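A rough sketch of the caching behaviour described above is shown below. The class and method names are hypothetical, the cache is keyed by the text to be spoken as an assumption, and `preload` models the case where a prepared waveform is used instead of synthesized speech.

```python
from typing import Callable, Dict


class CachingSynthesizer:
    """Wraps a synthesis function and reuses earlier results where possible."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize
        self._cache: Dict[str, bytes] = {}
        self.synthesis_calls = 0  # counts how often real synthesis actually ran

    def preload(self, text: str, waveform: bytes) -> None:
        """Register a prepared waveform (e.g. read from a file) for a prompt."""
        self._cache[text] = waveform

    def get(self, text: str) -> bytes:
        """Return audio for the text, synthesizing only on a cache miss."""
        if text not in self._cache:
            self.synthesis_calls += 1
            self._cache[text] = self._synthesize(text)
        return self._cache[text]
```

Frequently repeated prompts are then synthesized once, and fixed prompts can be served from prepared waveforms without invoking the synthesizer at all.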

The speech synthesis server 1005 corresponds to the voice generation means 903 of the speech recognition/synthesis system 900.

The computer network 1006 is, for example, a commonly used LAN, but a network with a large transmission delay, such as a wireless network, a telephone network, or a WAN, can also be used.

The combination of the two, that is, the input/output terminal device 1002 and the voice input/output server 1003, realizes the functions of the voice input means 901 and the voice output means 904 of the speech recognition/synthesis system 900.

The components of the speech recognition/synthesis system 1000 of this example are partitioned as described above because the voice input/output server 1003 absorbs the differences in voice input/output among input/output terminal devices, so that a variety of existing input/output terminal devices can be used with this dialogue system 1000.

Accordingly, in FIG. 10, each input/output terminal device is connected to a different voice input/output server. However, the example shown in FIG. 10 does not mean that input/output terminal devices and voice input/output servers are always in a one-to-one relationship. If a plurality of input/output terminal devices are compatible with a given voice input/output server, a plurality of input/output terminal devices of multiple types may be connected to that one voice input/output server.

The operation of each part of the speech recognition/synthesis system 1000 is the same as the operation of the corresponding part of the speech recognition/synthesis system 900 in the fifth embodiment described above, so a detailed description is omitted.

The present invention is useful not only for voice dialogue systems such as automatic voice response devices, but also for a variety of applications such as voice remote controls, voice-based Internet browsing devices, voice user interfaces for people with disabilities, and the voice dialogue functions of robots.

The present invention can also be applied to applications that require strict handling of time-series data other than voice, such as moving images and stock prices.

FIG. 1 is a block diagram showing a configuration example of the speech recognition/synthesis system in the first embodiment of the present invention.
FIG. 2 is a time chart showing an example of the operation of the speech recognition/synthesis system in the first embodiment of the present invention.
FIG. 3 is a block diagram showing a configuration example of the speech recognition/synthesis system in the second embodiment of the present invention.
FIG. 4 is a time chart showing an example of the operation of the speech recognition/synthesis system in the second embodiment of the present invention.
FIG. 5 is a block diagram showing a configuration example of the speech recognition/synthesis system in the third embodiment of the present invention.
FIG. 6 is a time chart showing an example of the operation of the speech recognition/synthesis system in the third embodiment of the present invention.
FIG. 7 is a block diagram showing a configuration example of the speech recognition/synthesis system in the fourth embodiment of the present invention.
FIG. 8 is a time chart showing an example of the operation of the speech recognition/synthesis system in the fourth embodiment of the present invention.
FIG. 9 is a block diagram showing a configuration example of the speech recognition/synthesis system in the fifth embodiment of the present invention.
FIG. 10 is a block diagram showing the configuration of the speech recognition/synthesis system in the example of the present invention.

Explanation of symbols

100, 300, 500, 700, 900, 1000 Speech recognition/synthesis system
101, 301, 501 Voice input means
102, 302, 302a, 302b, 302n Speech recognition means
103, 303, 504, 703 Control means
104, 304, 505, 704 Transmission means
101a Voice input control means
101b Voice input processing means
101c, 701c Packet dividing means
101d, 701d Identifier assigning means
101e, 102d, 502d, 701e, 702d Packet holding means
101f, 701f Packet transmitting means
102a, 502a Speech recognition control means
102b, 502b Speech recognition processing means
102c, 502c, 702c, 904b Identifier discriminating means
102e, 502e, 702e Packet receiving means
103a, 303a User interface means
103b, 303b Dialogue management means
303c, 502f Result integration means
502 First speech recognition means
503 Second speech recognition means
701 Voice generation means
702 Voice output means
701a Voice generation control means
701b Voice generation processing means
702a Voice output control means
702b Voice output processing means
1001 Voice dialogue management server
1002 Input/output terminal device
1003 Voice input/output server
1004 Speech recognition server
1005 Speech synthesis server

Claims

A speech recognition/synthesis system that performs speech recognition processing for analyzing input voice data and/or speech synthesis processing for generating voice data, the system comprising:
control command means for issuing a control command in which identification information is set for specifying, among the pieces of voice divided data obtained by dividing voice data into a plurality of sections, the voice divided data to be processed; and
voice processing means for performing, in accordance with the control command from the control command means, speech recognition processing and/or speech synthesis processing on the voice divided data specified by the identification information set in the control command.

The speech recognition/synthesis system according to claim 1, further comprising identification information adding means for adding, to each piece of voice divided data obtained by dividing input voice data into a plurality of sections, identification information that is uniquely identifiable within the system.

The speech recognition/synthesis system according to claim 1, further comprising:
voice input processing means for performing voice data input processing; and voice data dividing means for generating voice divided data by dividing the voice data input by the voice input processing means into a plurality of sections.

The speech recognition/synthesis system according to any one of claims 1 to 3, wherein the control command means issues a control command in which an execution time of the speech recognition processing and/or speech synthesis processing is set, and
the voice processing means, in accordance with the control command from the control command means, performs the speech recognition processing and/or speech synthesis processing on the voice divided data specified by the identification information set in the control command when the execution time set in the control command is reached.

The speech recognition/synthesis system according to any one of claims 1 to 4, comprising a plurality of the voice processing means,
and further comprising processing result integration means for integrating the processing results of the speech recognition processing and/or speech synthesis processing of each of the plurality of voice processing means.

The speech recognition/synthesis system according to claim 5, wherein the control command means issues the control command to one of the plurality of voice processing means, and
the one voice processing means has control command transfer means for transferring part or all of the control command from the control command means to another voice processing means.

The speech recognition/synthesis system according to claim 6, wherein the voice processing means has voice data transfer means for transferring one section or all sections of the voice data to be processed, as instructed by the control command from the control command means, to another voice processing means.

The speech recognition/synthesis system according to claim 7, wherein the identification information adding means adds to each piece of voice divided data, as the identification information, a time stamp, a serial number, a sequence number of voice dialogue processing by speech recognition processing and/or speech synthesis processing, or a combination thereof.

The speech recognition/synthesis system according to any one of claims 1 to 8, further comprising identification information management means that provides a function of managing the time order of the identification information.

The speech recognition/synthesis system according to claim 9, wherein the identification information management means manages the time order of the identification information by synchronizing the absolute times used by the respective components constituting the system and associating each piece of identification information with a specific absolute time.

A synchronization control method in a speech recognition/synthesis system that performs speech recognition processing for analyzing input voice data and/or speech synthesis processing for generating voice data, the method comprising:
issuing a control command in which identification information is set for specifying, among the pieces of voice divided data obtained by dividing voice data into a plurality of sections, the voice divided data to be processed; and
performing, in accordance with the control command, speech recognition processing and/or speech synthesis processing on the voice divided data specified by the identification information set in the control command.

The synchronization control method according to claim 11, wherein identification information that is uniquely identifiable within the system is added to each piece of voice divided data obtained by dividing input voice data into a plurality of sections.

The synchronization control method according to claim 11 or 12, wherein voice data input processing is performed,
and voice divided data is generated by dividing the voice data input by the input processing into a plurality of sections.

The synchronization control method according to any one of claims 11 to 13, wherein a control command is issued in which an execution time of the speech recognition processing and/or speech synthesis processing is set, and,
in accordance with the control command, the speech recognition processing and/or speech synthesis processing is performed on the voice divided data specified by the identification information set in the control command when the execution time set in the control command is reached.

The synchronization control method according to any one of claims 11 to 14, wherein the processing results of a plurality of speech recognition processes and/or speech synthesis processes performed by different processing means in accordance with a control command are integrated.

The synchronization control method according to claim 15, wherein, after speech recognition processing and/or speech synthesis processing is performed in accordance with the control command, part or all of the control command is transferred to another processing means, and
speech recognition processing and/or speech synthesis processing is performed by the other processing means in accordance with the transferred control command.

The synchronization control method according to claim 16, wherein one section or all sections of the voice data to be processed, as instructed by the control command from the control command means, are transferred to another processing means.

A time stamp, a serial number, a sequence number of voice dialogue processing by speech recognition processing and/or speech synthesis processing, or a combination thereof is used as the identification information added to each piece of voice divided data. The synchronization control method according to any one of 17.

A synchronization control program that causes a speech recognition/synthesis system, which performs speech recognition processing for analyzing input voice data and/or speech synthesis processing for generating voice data, to execute synchronization control,
the program causing a computer constituting the speech recognition/synthesis system to execute:
a step of issuing a control command in which identification information is set for specifying, among the pieces of voice divided data obtained by dividing voice data into a plurality of sections, the voice divided data to be processed; and
a step of performing, in accordance with the control command, speech recognition processing and/or speech synthesis processing on the voice divided data specified by the identification information set in the control command.

The synchronization control program according to claim 19, further causing the computer to execute
a step of adding identification information that is uniquely identifiable within the system to each piece of voice divided data obtained by dividing input voice data into a plurality of sections.

The synchronization control program according to claim 19 or 20, further causing the computer to execute:
a step of performing voice data input processing; and
a step of generating voice divided data by dividing the voice data input by the input processing into a plurality of sections.

A time stamp, a serial number, a sequence number of voice dialogue processing by speech recognition processing and/or speech synthesis processing, or a combination thereof is used as the identification information added to each piece of voice divided data. The synchronization control program according to any one of 21.

Voice input processing means for performing voice data input processing;
Voice data dividing means for generating voice divided data obtained by dividing the voice data input by the voice input processing means into a plurality of sections;
Identification information adding means for adding identification information uniquely identified in the system to each voice divided data divided by the voice data dividing means;
Control command means for issuing, to voice processing means that performs speech recognition processing for analyzing the voice data input by the voice input processing means and/or speech synthesis processing for generating voice data, a control command in which identification information for specifying the voice divided data to be processed is set.