JP6636379B2

JP6636379B2 - Classifier construction apparatus, method and program

Info

Publication number: JP6636379B2
Application number: JP2016078589A
Authority: JP
Inventors: 隆朗福冨; 学岡本; 清彰松井
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-04-11
Filing date: 2016-04-11
Publication date: 2020-01-29
Anticipated expiration: 2036-04-11
Also published as: JP2017191119A

Description

The present invention relates to a technique for constructing a classifier that identifies the presence or absence of an utterance intention assumed by a dialogue system.

In a dialogue system, an utterance section detection technique (see, for example, Non-Patent Document 1) is used to cut out only the utterances of the system user, which are then sent to the dialogue system. Cutting out only the user's utterances with high accuracy enables smooth dialogue between the system and the user.

Non-Patent Document 1: 藤本雅清 (Masakiyo Fujimoto), "音声区間検出の基礎と最近の研究動向" ("Basics of Voice Activity Detection and Recent Research Trends"), IEICE Technical Report.

However, even when the utterance section detection technique detects an utterance section correctly, the detected utterance may not be directed to the dialogue system. One example is users chatting with each other in front of the dialogue system. In such a case, the dialogue system may respond to an utterance that was not directed to it.

An object of the present invention is to provide a classifier construction apparatus, method, and program that construct a classifier for identifying the presence or absence of an utterance intention assumed by a dialogue system, so that the dialogue system does not respond to utterances that are not directed to it.

A classifier construction apparatus according to one aspect of the present invention includes: a voice section detection unit that detects a voice section in an input voice signal; a speech recognition unit that performs speech recognition on the detected voice section and obtains a speech recognition result and the reliability of the speech recognition result; an utterance intention determination unit that determines, based on the speech recognition result, whether the voice section contains an utterance intention assumed by a dialogue system; an identification unit that identifies a first voice section, which is a voice section whose reliability is equal to or greater than a predetermined threshold and which contains the utterance intention, and a second voice section, which is a voice section whose reliability is equal to or less than the predetermined threshold and which does not contain the utterance intention; a feature amount calculation unit that calculates a feature amount of each of the first voice section and the second voice section; and a classifier construction unit that uses the calculated feature amounts to construct a classifier for identifying the presence or absence of an utterance intention toward the dialogue system.

A classifier for identifying the presence or absence of an utterance intention assumed by the dialogue system can be constructed. Using this classifier, a dialogue system can be built that does not respond to utterances that are not directed to it.

FIG. 1 is a block diagram illustrating an example of a classifier construction apparatus.
FIG. 2 is a flowchart illustrating an example of a classifier construction method.
FIG. 3 is a flowchart illustrating an example of the processing of the identification unit 23.
FIG. 4 is a diagram illustrating an example of the processing of the voice section detection unit 1.
FIG. 5 is a diagram illustrating an example of the processing of the identification unit 23.

An embodiment of the present invention is described below with reference to the drawings.

[Classifier construction apparatus and method]
As illustrated in FIG. 1, the classifier construction apparatus includes a voice section detection unit 1, a dialogue system unit 2, a feature amount calculation unit 3, a learning data storage unit 4, and a classifier construction unit 5. The dialogue system unit 2 includes, for example, a speech recognition unit 21, an utterance intention determination unit 22, and an identification unit 23. The utterance intention determination unit 22 includes, for example, an utterance intention corpus storage unit 221.

The classifier construction method is realized by the units of the classifier construction apparatus executing the processing of steps S1 to S5 shown in FIG. 2 and described below.

<Voice section detection unit 1>
A voice signal is input to the voice section detection unit 1.

The voice section detection unit 1 detects voice sections in the input voice signal (step S1). The detected voice sections are output to the speech recognition unit 21 of the dialogue system unit 2.

An existing technique, such as that of Non-Patent Document 1, may be used to detect voice sections.

As illustrated in FIG. 4, the input voice signal contains voice sections. The voice section detection unit 1 detects the voice sections in the input voice signal and outputs the voice signal of each detected voice section. In the example of FIG. 4, two voice sections are detected, and the voice signal of each is output to the speech recognition unit 21.
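The text leaves the detection method to existing techniques. Purely as an illustration, the following minimal sketch uses a simple frame-energy threshold, which is a stand-in and not the method of Non-Patent Document 1. The function name `detect_voice_sections` and the parameters `frame_len`, `energy_threshold`, and `min_frames` are assumptions introduced here.

```python
def detect_voice_sections(samples, frame_len=160, energy_threshold=0.01, min_frames=3):
    """Return (start, end) sample indices of runs of high-energy frames.

    Assumption: a frame counts as speech when its mean-square energy
    reaches energy_threshold. Real systems use more robust detectors.
    """
    sections = []
    run_start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames + 1):
        if i < n_frames:
            frame = samples[i * frame_len:(i + 1) * frame_len]
            energy = sum(s * s for s in frame) / frame_len
            active = energy >= energy_threshold
        else:
            active = False  # sentinel frame closes a run that reaches the end
        if active and run_start is None:
            run_start = i
        elif not active and run_start is not None:
            # keep only runs that are long enough to be plausible speech
            if i - run_start >= min_frames:
                sections.append((run_start * frame_len, i * frame_len))
            run_start = None
    return sections


# Synthetic signal with two loud stretches separated by silence,
# mirroring the two voice sections in the FIG. 4 example.
silence = [0.0] * 800
speech = [0.5 if n % 2 == 0 else -0.5 for n in range(800)]
signal = silence + speech + silence + speech + silence
print(detect_voice_sections(signal))
```

Each returned pair delimits one voice section whose samples would be passed on to the speech recognition unit 21.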

<Speech recognition unit 21>
The voice section detected by the voice section detection unit 1 is input to the speech recognition unit 21.

The speech recognition unit 21 performs speech recognition on the voice section detected by the voice section detection unit 1 and obtains a speech recognition result and the reliability of that result (step S21). The speech recognition result is output to the utterance intention determination unit 22, and its reliability is output to the identification unit 23.

An existing technique may be used for speech recognition.

<Utterance intention determination unit 22>
The speech recognition result obtained by the speech recognition unit 21 is input to the utterance intention determination unit 22.

Based on the speech recognition result obtained by the speech recognition unit 21, the utterance intention determination unit 22 determines whether the voice section contains an utterance intention assumed by the dialogue system (step S22). The determination result is output to the identification unit 23.

For example, in a dialogue system for sightseeing guidance, when keywords or expressions concerning the names of tourist attractions, meals, transportation, and the like are detected in the speech recognition result, it is determined that the utterance intention assumed by the dialogue system is included. When no keyword or expression leading to an utterance intention is found, it is determined that the utterance intention is not included, that is, that no utterance intention could be detected. Information on the keywords and expressions used to understand these utterance intentions is generally available as an utterance intention corpus, and the builder of the dialogue system prepares it in advance. For example, it is stored in advance in the utterance intention corpus storage unit 221. In that case, the utterance intention determination unit 22 makes the above determination using the keywords and expressions read from the utterance intention corpus storage unit 221.
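The keyword-based determination can be sketched as follows. This is a minimal illustration only: the corpus contents, the name `INTENT_CORPUS`, and the function `contains_intent` are hypothetical, and a real utterance intention corpus would also hold multi-word expressions.

```python
# Hypothetical stand-in for the contents of the utterance intention corpus
# storage unit 221 in a sightseeing-guidance system.
INTENT_CORPUS = {"castle", "temple", "restaurant", "lunch", "bus", "train", "station"}


def contains_intent(recognized_text, corpus=INTENT_CORPUS):
    """Return True when any corpus keyword appears in the recognition result."""
    words = recognized_text.lower().split()
    # strip trailing punctuation so "castle?" still matches "castle"
    return any(word.strip(".,?!") in corpus for word in words)


print(contains_intent("Which bus goes to the castle?"))
print(contains_intent("I was so tired yesterday."))
```

The Boolean result plays the role of the determination result that is passed to the identification unit 23.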

<Identification unit 23>
The reliability of the speech recognition result obtained by the speech recognition unit 21 and the determination result obtained by the utterance intention determination unit 22 are input to the identification unit 23.

The identification unit 23 identifies a first voice section, which is a voice section whose reliability is equal to or greater than a predetermined threshold and which contains the utterance intention, and a second voice section, which is a voice section whose reliability is equal to or less than the predetermined threshold and which does not contain the utterance intention (step S23). The identification result is output to the feature amount calculation unit 3.

For example, each voice section is given a flag indicating whether it is a first voice section or a second voice section. The flag may be a numerical value, such as "0" for a first voice section and "1" for a second voice section. In FIG. 5, the flags of voice sections U1, U4, U5, and U6 are 0, and the flags of voice sections U2 and U3 are 1.

A first voice section can be assumed to be speech that was correctly input to the dialogue system. A second voice section can be assumed not to be speech that was correctly input to the dialogue system.

For example, as shown in FIG. 3, the identification unit 23 takes the predetermined threshold to be ε and determines, for a voice section, whether the reliability is equal to or greater than ε and the utterance intention is included. If so, the identification unit 23 treats the voice section as a first voice section. Otherwise, the identification unit 23 determines whether the reliability is equal to or less than ε and the utterance intention is not included. If so, the identification unit 23 treats the voice section as a second voice section. In all other cases, the voice signal of the voice section is output to the utterance intention determination unit 22.

Note that the identification unit 23 may also treat a voice section as a second voice section when the reliability is equal to or greater than ε and the utterance intention is not included, and when the reliability is equal to or less than ε and the utterance intention is included.
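The decision procedure of FIG. 3 and the variation just described can be sketched as below. The function name `label_voice_section` and the `strict` switch are assumptions made for illustration. The return values 0 and 1 follow the example flag convention given for first and second voice sections, and `None` marks a section that receives neither label.

```python
def label_voice_section(reliability, has_intent, epsilon, strict=True):
    """Return 0 (first voice section), 1 (second voice section), or None.

    strict=True follows the FIG. 3 flow; strict=False follows the variation
    in which every section that is not a first voice section becomes a
    second voice section.
    """
    if reliability >= epsilon and has_intent:
        return 0
    if reliability <= epsilon and not has_intent:
        return 1
    # Mixed cases: high reliability without intent, or low reliability with intent.
    return None if strict else 1


# Illustrative values only; these are not taken from FIG. 5.
print(label_voice_section(0.9, True, epsilon=0.5))
print(label_voice_section(0.2, False, epsilon=0.5))
print(label_voice_section(0.9, False, epsilon=0.5))
print(label_voice_section(0.9, False, epsilon=0.5, strict=False))
```

Because both comparisons include equality, a section whose reliability equals ε exactly is labeled by the intention result alone.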

<Feature amount calculation unit 3>
The voice signal of the first voice section and the voice signal of the second voice section are input to the feature amount calculation unit 3.

The feature amount calculation unit 3 uses the voice signal of the first voice section and the voice signal of the second voice section to calculate the feature amount of each of the first voice section and the second voice section (step S3). The calculated feature amounts are stored in the learning data storage unit 4.

As the feature amount, at least one of the following can be used: the mean and standard deviation of power and pitch, their Δ components, and the voice section length. These feature amounts may be calculated using existing techniques.
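As a rough sketch of such a calculation, the code below computes the power statistics, their Δ components, and the section length for one voice section. It rests on several assumptions: power is taken as per-frame mean-square amplitude, the Δ component is taken as the frame-to-frame difference, and the names `frame_powers` and `section_features`, the 160-sample frame, and the 16 kHz sample rate are all illustrative. Pitch features are omitted because they require a separate pitch estimator.

```python
import statistics


def frame_powers(samples, frame_len=160):
    """Mean-square power of each full frame in the section."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def section_features(samples, sample_rate=16000, frame_len=160):
    """Feature amounts for one voice section (needs at least one full frame)."""
    powers = frame_powers(samples, frame_len)
    # Assumed definition of the delta component: difference between adjacent frames.
    deltas = [b - a for a, b in zip(powers, powers[1:])]
    return {
        "power_mean": statistics.fmean(powers),
        "power_std": statistics.pstdev(powers),
        "delta_power_mean": statistics.fmean(deltas) if deltas else 0.0,
        "delta_power_std": statistics.pstdev(deltas) if deltas else 0.0,
        "length_sec": len(samples) / sample_rate,
    }


# A section that is quiet for two frames and louder for two frames.
example = [0.5] * 320 + [1.0] * 320
print(section_features(example))
```

The resulting dictionary corresponds to one entry of learning data for the section it was computed from.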

<Learning data storage unit 4>
The learning data storage unit 4 stores the feature amounts of the first voice section and the second voice section.

The processing from step S1 to step S3 is repeated until the feature amounts of the first voice section and the second voice section have been accumulated for a predetermined time T. T is a predetermined length of time and can be set to, for example, about T = 0.5 [h].
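The accumulation loop can be sketched as follows. The text does not say how the time T is measured, so this sketch assumes that T is the total duration of the stored voice sections. The function name `accumulate_training_data` and the tuple layout of its input are also assumptions.

```python
def accumulate_training_data(labeled_sections, target_seconds=0.5 * 3600):
    """Collect (flag, features) pairs until target_seconds of audio is stored.

    labeled_sections: iterable of (flag, features, duration_sec), where flag
    is 0 for a first voice section, 1 for a second voice section, and None
    for a section that received neither label.
    """
    store = []
    total = 0.0
    for flag, features, duration_sec in labeled_sections:
        if flag is None:
            continue  # unlabeled sections are not used as learning data
        store.append((flag, features))
        total += duration_sec
        if total >= target_seconds:
            break
    return store, total


# Toy stream with a small target so the stopping rule is visible.
stream = [
    (0, {"length_sec": 2.0}, 2.0),
    (None, {"length_sec": 5.0}, 5.0),
    (1, {"length_sec": 3.0}, 3.0),
    (0, {"length_sec": 1.0}, 1.0),
]
stored, seconds = accumulate_training_data(stream, target_seconds=5.0)
print(len(stored), seconds)
```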

<Classifier construction unit 5>
The feature amounts of the first voice section and the second voice section are input to the classifier construction unit 5. The classifier construction unit 5 reads these feature amounts from, for example, the learning data storage unit 4.

The classifier construction unit 5 uses the calculated feature amounts to construct a classifier for identifying the presence or absence of an utterance intention toward the dialogue system (step S5).

An existing technique may be used to construct the classifier. For example, a recognition model such as a support vector machine may be used.
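To make the construction step concrete, the sketch below trains a linear support vector machine by stochastic sub-gradient descent on the hinge loss with an L2 penalty. This is one simple way to fit such a model and is not necessarily what the inventors used; in practice an existing library such as scikit-learn would be the usual choice. The function names, the hyperparameters, and the toy feature vectors are assumptions. First voice sections (flag 0) are mapped to label +1 and second voice sections (flag 1) to label -1, and the features are assumed to be already scaled.

```python
import random


def train_linear_svm(features, labels, epochs=200, lr=0.01, lam=0.01, seed=0):
    """Fit a linear SVM (hinge loss + L2 penalty) by stochastic sub-gradient descent.

    features: list of equal-length numeric vectors (assumed already scaled).
    labels: +1 for first voice sections, -1 for second voice sections.
    """
    rng = random.Random(seed)
    w = [0.0] * len(features[0])
    b = 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # The L2 penalty shrinks the weights on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # Hinge loss is active: push the boundary away from this sample.
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def has_intent_toward_system(w, b, x):
    """True when the classifier judges the section to be directed at the system."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b >= 0.0


# Toy two-dimensional feature vectors, not real data.
first_sections = [[2.0, 2.0], [3.0, 2.0], [2.0, 3.0], [3.0, 3.0]]
second_sections = [[-2.0, -2.0], [-3.0, -2.0], [-2.0, -3.0], [-3.0, -3.0]]
X = first_sections + second_sections
y = [1] * len(first_sections) + [-1] * len(second_sections)
w, b = train_linear_svm(X, y)
print(has_intent_toward_system(w, b, [2.5, 2.5]))
print(has_intent_toward_system(w, b, [-2.5, -2.5]))
```

At run time, the dialogue system would compute the same feature amounts for a new voice section and respond only when the classifier returns True.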

By using the classifier constructed by the classifier construction apparatus and method, a system can be obtained that distinguishes utterances intended for it from utterances that are not. This reduces unnatural dialogue responses and improves the quality of the dialogue system.

[Program and recording medium]
When each process in the classifier construction apparatus is realized by a computer, the processing content of the functions that the classifier construction apparatus should have is described by a program. Executing this program on a computer realizes each process on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

Each processing means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be realized in hardware.

[Modifications]
The processes described for the classifier construction apparatus need not be executed in chronological order as described. They may also be executed in parallel or individually, according to the processing capability of the apparatus that executes them or as necessary.

Needless to say, other changes may be made as appropriate without departing from the spirit of the present invention.

Claims

A classifier construction apparatus comprising:
a voice section detection unit that detects a voice section in an input voice signal;
a speech recognition unit that performs speech recognition on the detected voice section and obtains a speech recognition result and the reliability of the speech recognition result;
an utterance intention determination unit that determines, based on the speech recognition result, whether the voice section contains an utterance intention assumed by a dialogue system;
an identification unit that identifies a first voice section, which is a voice section whose reliability is equal to or greater than a predetermined threshold and which contains the utterance intention, and a second voice section, which is a voice section whose reliability is equal to or less than the predetermined threshold and which does not contain the utterance intention;
a feature amount calculation unit that calculates a feature amount of each of the first voice section and the second voice section; and
a classifier construction unit that uses the calculated feature amounts to construct a classifier for identifying the presence or absence of an utterance intention toward the dialogue system.

A classifier construction method comprising:
a voice section detection step of detecting a voice section in an input voice signal;
a speech recognition step of performing speech recognition on the detected voice section to obtain a speech recognition result and the reliability of the speech recognition result;
an utterance intention determination step in which an utterance intention determination unit determines, based on the speech recognition result, whether the voice section contains an utterance intention assumed by a dialogue system;
an identification step in which an identification unit identifies a first voice section, which is a voice section whose reliability is equal to or greater than a predetermined threshold and which contains the utterance intention, and a second voice section, which is a voice section whose reliability is equal to or less than the predetermined threshold and which does not contain the utterance intention;
a feature amount calculation step of calculating a feature amount of each of the first voice section and the second voice section; and
a classifier construction step in which a classifier construction unit uses the calculated feature amounts to construct a classifier for identifying the presence or absence of an utterance intention toward the dialogue system.

A program for causing a computer to function as each unit of the classifier construction apparatus according to claim 1.