JP2018132678A

JP2018132678A - Turn-taking timing identification apparatus, turn-taking timing identification method, program and recording medium

Info

Publication number: JP2018132678A
Application number: JP2017026681A
Authority: JP
Inventors: 亮増村; Akira Masumura; 浩和政瀧; Hirokazu Masataki
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-02-16
Filing date: 2017-02-16
Publication date: 2018-08-23
Anticipated expiration: 2037-02-16
Also published as: JP6612277B2

Abstract

PROBLEM TO BE SOLVED: To provide a turn-taking timing identification technology that uses features extracted from information on utterances earlier than the immediately preceding one, without manually designing feature extraction rules. SOLUTION: A turn-taking timing identification apparatus comprises: a speech segment detection unit that detects an utterance k from input speech; an in-utterance feature sequence generation unit that generates an in-utterance feature sequence k from the utterance k; an utterance feature calculation unit that calculates an utterance feature k from the in-utterance feature sequence k; a turn-taking point feature calculation unit that calculates a turn-taking point feature k from time-series data consisting of the previously calculated utterance features i (i=1,...,k-1) and the utterance feature k; and a turn-taking timing identification unit that generates, from the turn-taking point feature k, an identification result k indicating whether or not the turn-taking point k is a turn-taking timing. The utterance feature calculation unit and the turn-taking point feature calculation unit are each configured using a neural network that takes time-series data as input and outputs a feature. SELECTED DRAWING: Figure 2

Description

The present invention relates to spoken dialogue systems, and more particularly to a technique for identifying the turn-taking timing at which a spoken dialogue system should respond to a user's utterance.

A spoken dialogue system is configured so that the system responds to a user's utterance. The timing at which the system responds to the user's utterance is called the turn-taking timing, and identifying it appropriately enables a smooth dialogue between the user and the system.

The simplest method of turn-taking timing identification is to set a threshold (for example, 0.4 seconds) on the duration of silence in the user's speech. In this method, turn-taking is performed (that is, the system responds) when the duration of the non-speech segment following the user's utterance exceeds the threshold. A speech segment detection technique is used to detect non-speech segments (that is, to distinguish speech segments from non-speech segments).
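The threshold method described above can be sketched as follows. This is only an illustration under stated assumptions: the frame-level speech/non-speech flags, the 10 ms frame length, and the function name `first_response_time_ms` are hypothetical and do not come from the patent; only the 0.4 second threshold is taken from the text.

```python
from typing import Optional, Sequence


def first_response_time_ms(
    is_speech: Sequence[bool],
    frame_ms: int = 10,        # assumed frame length (not from the patent)
    threshold_ms: int = 400,   # the 0.4 s example threshold from the text
) -> Optional[int]:
    """Return the time in ms at which the system would respond, or None.

    The system responds once the silence that follows user speech has
    lasted longer than the threshold.
    """
    heard_speech = False
    silent_frames = 0
    for idx, speech in enumerate(is_speech):
        if speech:
            heard_speech = True
            silent_frames = 0          # any speech resets the silence counter
        elif heard_speech:
            silent_frames += 1
            if silent_frames * frame_ms > threshold_ms:
                return (idx + 1) * frame_ms
    return None


# 500 ms of speech followed by 1 s of silence.
flags = [True] * 50 + [False] * 100
print(first_response_time_ms(flags))
```

Integer milliseconds are used instead of floating-point seconds so that the comparison against the threshold is exact.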

There are also efforts to use machine learning to model a framework that identifies the turn-taking timing from a speech signal sequence, a sequence extracted from the speech signal sequence (for example, a fundamental frequency sequence or a cepstrum sequence), or the word sequence of a speech recognition result. In these machine learning methods, at the moment when speech segment detection finds a non-speech segment after the user's utterance, a trained model generally decides whether or not that moment is a turn-taking timing from the speech signal sequence of the immediately preceding user utterance (that is, the speech segment found by speech segment detection), a sequence extracted from that speech signal sequence, and the word sequence of the speech recognition result. Specifically, Non-Patent Document 1 and Non-Patent Document 2 use machine learning models such as an SVM (Support Vector Machine) or a decision tree to decide whether or not it is a turn-taking timing from the acoustic information of the immediately preceding user utterance and the text obtained by speech recognition. Machine learning models such as SVMs and decision trees are trained using features extracted from, for example, the acoustic information of the user's utterances.

Non-Patent Document 1: L. Ferrer, E. Shriberg, A. Stolcke, “A prosody-based approach to end-of-utterance detection that does not require speech recognition”, In Proc. of ICASSP’03, 2003.
Non-Patent Document 2: R. Sato, R. Higashinaka, M. Tamoto, M. Nakano, K. Aikawa, “Learning decision trees to determine turntaking by spoken dialogue systems”, In Proc. of ICSLP-02, 2002.

Conventional machine learning frameworks such as those of Non-Patent Document 1 and Non-Patent Document 2 have two problems. The first problem is that the design of the rules for extracting features from the speech signal sequence of the immediately preceding user utterance, from sequences extracted from that speech signal sequence, and from the word sequence of the speech recognition result is left to manual work. Examples of manually designed features include "the slope of the fundamental frequency over the 100 ms before the end of the utterance" and "the last two words of the utterance". Effective extraction rules of this kind may be found by carrying out various analyses by hand. However, the analysis has to be repeated for each task, such as a dialogue system for a bank counter or a dialogue system for contact center operation, so analyzing every task in practice is difficult and designing the rules is not easy.

The second problem, which is related to the first, is that it is difficult to design features that use information from utterances earlier than the immediately preceding one. As described above, feature extraction rules are designed by hand. It is therefore hard to judge what information in past utterances is effective for turn-taking timing identification, and features of utterances earlier than the immediately preceding one have not been used.

Accordingly, an object of the present invention is to provide a turn-taking timing identification technique that uses features extracted from information on the immediately preceding utterance and on utterances earlier than it, without manually designing feature extraction rules.

One aspect of the present invention is a turn-taking timing identification device comprising: a speech segment detection unit that detects, from input speech, an utterance k that is the k-th utterance (k is an integer equal to or greater than 1) included in the input speech; an in-utterance feature sequence generation unit that generates, from the utterance k, an in-utterance feature sequence k that is the k-th in-utterance feature sequence; an utterance feature calculation unit that calculates, from the in-utterance feature sequence k, an utterance feature k that is an utterance feature characterizing the utterance k; a turn-taking point feature calculation unit that calculates a turn-taking point feature k, which characterizes a turn-taking point k that is the identification target turn-taking point appearing immediately after the utterance k, from an utterance feature sequence k that is time-series data consisting of the already calculated utterance features i (i=1,…,k-1), each being the i-th utterance feature, and the utterance feature k; and a turn-taking timing identification unit that generates, from the turn-taking point feature k, an identification result k indicating whether or not the turn-taking point k is a turn-taking timing. The utterance feature calculation unit and the turn-taking point feature calculation unit are each configured using a neural network that takes as input time-series data expressed as a sequence of fixed-length vectors and outputs a feature expressed as a fixed-length vector.

Another aspect of the present invention is a turn-taking timing identification device comprising: a speech segment detection unit that detects, from input speech, an utterance k that is the k-th utterance (k is an integer equal to or greater than 1) included in the input speech; with J being the number of types of in-utterance features generated from an utterance and j being an integer satisfying 1≦j≦J, a type-j in-utterance feature sequence generation unit that generates, from the utterance k, a type-j in-utterance feature sequence k that is the k-th type-j in-utterance feature sequence; a type-j utterance feature calculation unit that calculates, from the type-j in-utterance feature sequence k, a type-j utterance feature k that is a type-j utterance feature characterizing the utterance k; an utterance feature combination unit that generates, from the type-j utterance features k (1≦j≦J), a combined utterance feature k that is a combined utterance feature characterizing the utterance k; a turn-taking point feature calculation unit that calculates a turn-taking point feature k, which characterizes a turn-taking point k that is the identification target turn-taking point appearing immediately after the utterance k, from a combined utterance feature sequence k that is time-series data consisting of the already calculated combined utterance features i (i=1,…,k-1), each being the i-th combined utterance feature, and the combined utterance feature k; and a turn-taking timing identification unit that generates, from the turn-taking point feature k, an identification result k indicating whether or not the turn-taking point k is a turn-taking timing. The utterance feature calculation unit and the turn-taking point feature calculation unit are each configured using a neural network that takes as input time-series data expressed as a sequence of fixed-length vectors and outputs a feature expressed as a fixed-length vector.

According to the present invention, turn-taking timing identification that uses features extracted from information on the immediately preceding utterance and on utterances earlier than it can be realized without manually designing feature extraction rules. This makes highly accurate turn-taking timing identification possible.

FIG. 1 is a diagram showing the relationship among input speech, utterances, and in-utterance feature sequences.
FIG. 2 is a diagram showing an example of the configuration of the turn-taking timing identification device 100.
FIG. 3 is a diagram showing an example of the operation of the turn-taking timing identification device 100.
FIG. 4 is a diagram showing the relationship among in-utterance features, utterance features, and turn-taking point features.
FIG. 5 is a diagram showing an example of the configuration of the turn-taking timing identification device 200.
FIG. 6 is a diagram showing an example of the operation of the turn-taking timing identification device 200.
FIG. 7 is a diagram showing the relationship among type-j in-utterance features, type-j utterance features, and combined utterance features.

Hereinafter, embodiments of the present invention will be described in detail. Components having the same function are given the same reference numeral, and duplicate description is omitted.

First, the terms and assumptions used in each embodiment are briefly described.

The input speech is the speech on which turn-taking timing identification is performed. It is assumed that speech segments and non-speech segments in the input speech can be distinguished using a speech segment detection technique. The input speech is divided into utterances by splitting it at each non-speech segment.

An utterance is assumed to be treatable as a speech signal sequence or as a sequence extracted from the speech signal (for example, a fundamental frequency sequence or a cepstrum sequence). By using a speech recognition system, an utterance can also be treated as the word sequence of a speech recognition result. In other words, an utterance is treated as time-series data of features such as the speech signal, the fundamental frequency, the cepstrum, or the words of a speech recognition result. A feature generated from an utterance in this way is called an in-utterance feature. Such features can generally be expressed as vectors. The time-series data of these in-utterance features is called an in-utterance feature sequence.

A turn-taking point is a point at which a speech segment switches to a non-speech segment. Turn-taking timing identification is performed at turn-taking points.

FIG. 1 shows the relationship among the input speech, utterances, and in-utterance feature sequences. As shown in FIG. 1, the input speech is divided, with non-speech segments as boundaries, into the 1st utterance, …, the (k-2)-th utterance, the (k-1)-th utterance, the k-th utterance, and so on. The time-series data of features generated from the i-th utterance (i=1,…,k-1,k,…) is the i-th in-utterance feature sequence. The point in time at which the i-th utterance ends (that is, at which a non-speech segment is detected after the i-th utterance) is the i-th turn-taking point. The i-th turn-taking point is therefore the turn-taking point that appears immediately after the i-th utterance. The i-th utterance, the i-th in-utterance feature sequence, and the i-th turn-taking point are referred to as utterance i, in-utterance feature sequence i, and turn-taking point i, respectively (i=1,…,k-1,k,…).
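The segmentation of input speech into utterances and turn-taking points can be sketched as below. The frame-level boolean flags stand in for the output of some speech segment detector, and the function names `split_utterances` and `turn_taking_points` are hypothetical names chosen for this illustration; the patent does not prescribe a particular representation.

```python
from typing import List, Sequence, Tuple


def split_utterances(is_speech: Sequence[bool]) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices, end exclusive, one pair per utterance.

    Utterances are maximal runs of speech frames separated by non-speech.
    """
    utterances: List[Tuple[int, int]] = []
    start = None
    for idx, speech in enumerate(is_speech):
        if speech and start is None:
            start = idx                      # a new utterance begins
        elif not speech and start is not None:
            utterances.append((start, idx))  # non-speech closes the utterance
            start = None
    if start is not None:                    # speech still running at the end
        utterances.append((start, len(is_speech)))
    return utterances


def turn_taking_points(is_speech: Sequence[bool]) -> List[int]:
    """Frame index of turn-taking point i: where utterance i is followed by non-speech."""
    return [end for _, end in split_utterances(is_speech) if end < len(is_speech)]


flags = [False, True, True, False, False, True, False, True]
print(split_utterances(flags))
print(turn_taking_points(flags))
```

An utterance that is still in progress at the end of the input has no turn-taking point yet, because no non-speech segment has been detected after it.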

The most recently detected utterance is called the immediately preceding utterance, and the turn-taking point at that time (the turn-taking point that appears immediately after the immediately preceding utterance) is called the identification target turn-taking point.

<First embodiment>
The turn-taking timing identification device 100 is described below with reference to FIGS. 2 to 4. FIG. 2 is a block diagram showing the configuration of the turn-taking timing identification device 100. FIG. 3 is a flowchart showing the operation of the turn-taking timing identification device 100. FIG. 4 is a diagram showing the relationship among in-utterance features, utterance features, and turn-taking point features. As shown in FIG. 2, the turn-taking timing identification device 100 includes a speech segment detection unit 110, an in-utterance feature sequence generation unit 120, an utterance feature calculation unit 130, a turn-taking point feature calculation unit 140, a turn-taking timing identification unit 150, and a recording unit 190. The recording unit 190 is a component that records, as appropriate, information necessary for the processing of the turn-taking timing identification device 100.

The turn-taking timing identification device 100 generates an identification result indicating whether or not the identification target turn-taking point, which appears immediately after the immediately preceding utterance detected from the input speech, is a turn-taking timing. Using the in-utterance feature sequences generated from the immediately preceding utterance and from utterances earlier than it, the device generates an identification result (True/False) indicating whether or not the identification target turn-taking point is a turn-taking timing. For example, the word sequence of a speech recognition result is used as the in-utterance feature sequence.

Starting from the 1st utterance, the turn-taking timing identification device 100 identifies in order whether or not the turn-taking point appearing immediately after each utterance is a turn-taking timing.

The operation of the turn-taking timing identification device 100 is described with reference to FIG. 3. The speech segment detection unit 110 detects, from the input speech, an utterance k that is the k-th utterance (k is an integer equal to or greater than 1) included in the input speech (S110). Any speech segment detection technique that can distinguish speech segments from non-speech segments may be used for utterance detection.

The in-utterance feature sequence generation unit 120 generates, from the utterance k detected in S110, an in-utterance feature sequence k that is the k-th in-utterance feature sequence (S120). As described above, the speech signal, fundamental frequency, cepstrum, words of a speech recognition result, and the like generated for each utterance can be used as in-utterance features.

The utterance feature calculation unit 130 calculates, from the in-utterance feature sequence k generated in S120, an utterance feature k that is an utterance feature characterizing the k-th utterance (S130). The utterance feature calculation unit 130 is a component that performs computation with a neural network. Any neural network may be used to configure the utterance feature calculation unit 130 as long as it takes as input time-series data expressed as a sequence of fixed-length vectors and outputs a feature expressed as a fixed-length vector. For example, a recurrent neural network (RNN) can be used. The RNN is a general framework for neural networks that handle time-series data, which is variable-length data.

The model that characterizes the neural network computation is assumed to have been trained in advance, and the utterance feature calculation unit 130 performs its computation using this model.

For the 1st utterance, …, the (k-1)-th utterance, which precede the k-th utterance, the processing of S120 and S130 is assumed to have already been executed, so that the 1st utterance feature (utterance feature 1), …, the (k-1)-th utterance feature (utterance feature k-1) have been calculated, and the utterance features 1, …, k-1 are recorded in the recording unit 190.

An utterance feature corresponds one-to-one to an utterance and is a fixed-length vector that characterizes that utterance.

For example, when the word sequence of a speech recognition result is used as the in-utterance feature sequence, the utterance feature calculation unit 130 converts the word sequence $w_1^k, \ldots, w_{N(k)}^k$ of the speech recognition result of the k-th utterance into the k-th utterance feature $h^k$. The same processing has already been applied, using the neural network of the utterance feature calculation unit 130, to the word sequences of the speech recognition results of the 1st utterance, …, the (k-1)-th utterance; for example, the word sequence $w_1^{k-1}, \ldots, w_{N(k-1)}^{k-1}$ of the speech recognition result of the (k-1)-th utterance has been converted into the (k-1)-th utterance feature $h^{k-1}$. As a result, the 1st utterance feature $h^1$, …, the (k-1)-th utterance feature $h^{k-1}$ are recorded in the recording unit 190.
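As a rough illustration of how a variable-length in-utterance feature sequence can be reduced to one fixed-length utterance feature, the sketch below runs a plain Elman-style recurrent step over the sequence and keeps the final hidden state. Everything specific here is an assumption made for illustration: the class name `SimpleRNNEncoder`, the one-hot word encoding, the tiny dimensions, and the random untrained weights. The patent assumes a model trained in advance and does not fix the network structure.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def one_hot(index: int, size: int) -> Vector:
    """Encode a word id as a fixed-length vector (an assumed input encoding)."""
    return [1.0 if i == index else 0.0 for i in range(size)]


class SimpleRNNEncoder:
    """Maps a sequence of fixed-length vectors to one fixed-length vector."""

    def __init__(self, input_dim: int, hidden_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)  # random weights stand in for a trained model
        self.hidden_dim = hidden_dim
        self.w_in = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim)]
                     for _ in range(hidden_dim)]
        self.w_rec = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
                      for _ in range(hidden_dim)]
        self.bias = [0.0] * hidden_dim

    def step(self, state: Vector, x: Vector) -> Vector:
        """One recurrent update: tanh(W_in x + W_rec state + b)."""
        return [
            math.tanh(
                sum(w * xi for w, xi in zip(self.w_in[r], x))
                + sum(u * s for u, s in zip(self.w_rec[r], state))
                + self.bias[r]
            )
            for r in range(self.hidden_dim)
        ]

    def encode(self, sequence: Sequence[Vector]) -> Vector:
        """Return the final hidden state as the utterance feature."""
        state = [0.0] * self.hidden_dim
        for x in sequence:
            state = self.step(state, x)
        return state


encoder = SimpleRNNEncoder(input_dim=4, hidden_dim=3)
word_ids = [2, 0, 3, 1, 1]                    # toy recognition result
h_k = encoder.encode([one_hot(w, 4) for w in word_ids])
print(len(h_k))                               # always hidden_dim, whatever the word count
```

The output length depends only on `hidden_dim`, which is how utterances with different numbers of words all map to vectors of the same size.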

The dimensionality of the utterance feature is determined by the neural network used to configure the utterance feature calculation unit 130, and is set by hand in advance (before model training starts), for example to 200 dimensions.

A neural network other than the RNN may also be used to configure the utterance feature calculation unit 130. For example, a neural network with a structure such as LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) may be used. Several of these structures may also be stacked; for example, a neural network in which three LSTM layers are stacked may be used.

The turn-taking point feature calculation unit 140 calculates a turn-taking point feature k, which characterizes the turn-taking point k that is the identification target turn-taking point appearing immediately after the utterance k, from an utterance feature sequence k that is time-series data consisting of the utterance features 1, …, k-1 recorded in the recording unit 190 and the utterance feature k calculated in S130 (S140). The turn-taking point feature calculation unit 140 is a component that performs computation with a neural network. Any neural network may be used to configure the turn-taking point feature calculation unit 140 as long as it takes as input time-series data expressed as a sequence of fixed-length vectors and outputs a feature expressed as a fixed-length vector. For example, a recurrent neural network (RNN) can be used.

The model that characterizes the neural network computation is assumed to have been trained in advance, and the turn-taking point feature calculation unit 140 performs its computation using this model.

A turn-taking point feature corresponds one-to-one to a turn-taking point and is a fixed-length vector that characterizes the user's input speech from the 1st utterance through the k-th utterance.

In the example above, the turn-taking point feature calculation unit 140 converts the utterance feature sequence consisting of the utterance features $h^1, \ldots, h^{k-1}, h^k$ into the k-th turn-taking point feature $v^k$. Similarly, when the (k+1)-th utterance is detected, it converts the utterance feature sequence consisting of the utterance features $h^1, \ldots, h^{k-1}, h^k, h^{k+1}$ into the (k+1)-th turn-taking point feature $v^{k+1}$.
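The conversion from $h^1, \ldots, h^k$ to $v^k$ can be illustrated with a second recurrent pass over the recorded utterance features. This is a minimal sketch under assumptions: the class name `TurnTakingPointFeatureCalculator`, the list that plays the role of the recording unit, the plain tanh recurrence, and the random untrained weights are all hypothetical choices, not details given in the patent.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


class TurnTakingPointFeatureCalculator:
    """Turns the utterance features h^1..h^k into one fixed-length vector v^k."""

    def __init__(self, utt_dim: int, point_dim: int, seed: int = 1) -> None:
        rng = random.Random(seed)  # random weights stand in for a trained model
        self.point_dim = point_dim
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(utt_dim)]
                     for _ in range(point_dim)]
        self.w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(point_dim)]
                      for _ in range(point_dim)]
        self.recorded: List[Vector] = []  # plays the role of the recording unit

    def _step(self, state: Vector, h: Vector) -> Vector:
        return [
            math.tanh(
                sum(w * x for w, x in zip(self.w_in[r], h))
                + sum(u * s for u, s in zip(self.w_rec[r], state))
            )
            for r in range(self.point_dim)
        ]

    def add_utterance(self, h_k: Sequence[float]) -> Vector:
        """Record h^k, then compute v^k from the whole sequence h^1..h^k."""
        self.recorded.append(list(h_k))
        state = [0.0] * self.point_dim
        for h in self.recorded:
            state = self._step(state, h)
        return state


calc = TurnTakingPointFeatureCalculator(utt_dim=2, point_dim=3)
v1 = calc.add_utterance([0.2, -0.1])   # v^1 depends on h^1 only
v2 = calc.add_utterance([0.5, 0.4])    # v^2 depends on h^1 and h^2
print(len(v1), len(v2))
```

Because the recurrence starts from the same zero state each time, one could equally keep the running state between calls instead of replaying the recorded features; the replay form is shown only because it mirrors the description of reading past features from the recording unit.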

The dimensionality of the turn-taking point feature is determined by the neural network used to configure the turn-taking point feature calculation unit 140, and is set by hand in advance, for example to 200 dimensions.

A neural network other than the RNN may also be used to configure the turn-taking point feature calculation unit 140. For example, a neural network with a structure such as LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) may be used. Several of these structures may also be stacked; for example, a neural network in which three LSTM layers are stacked may be used.

The turn-taking timing identification unit 150 generates, from the turn-taking point feature k calculated in S140, an identification result k indicating whether or not the turn-taking point k is a turn-taking timing (S150). The turn-taking timing identification unit 150 is a component that performs computation with a neural network. Any neural network may be used to configure the turn-taking timing identification unit 150 as long as it takes as input a feature expressed as a fixed-length vector and outputs a feature expressed as a fixed-length vector (or a scalar). For example, a deep neural network (DNN) can be used. In the case of a DNN consisting of a five-layer fully connected neural network structure and a softmax output layer, the identification result indicating whether or not the point is a turn-taking timing is output as a probability distribution. In that case the output layer consists of two units: one that outputs the probability that the point is a turn-taking timing (a timing at which the system should respond, True) and one that outputs the probability that it is not (not a timing at which the system should respond, False).
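A two-unit softmax output of the kind described above can be sketched as follows. The five fully connected layers and the True/False output units follow the text; the ReLU activation, the hidden width of 8, the class name `TurnTakingClassifier`, the decision rule of picking the larger probability, and the random untrained weights are assumptions made only for this illustration.

```python
import math
import random
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TurnTakingClassifier:
    """Fully connected layers followed by a two-unit softmax output layer."""

    def __init__(self, input_dim: int, hidden_dim: int = 8,
                 hidden_layers: int = 5, seed: int = 2) -> None:
        rng = random.Random(seed)  # random weights stand in for a trained model
        dims = [input_dim] + [hidden_dim] * hidden_layers + [2]
        self.layers = [
            [[rng.uniform(-0.5, 0.5) for _ in range(dims[i])]
             for _ in range(dims[i + 1])]
            for i in range(len(dims) - 1)
        ]

    def predict_proba(self, v: Sequence[float]) -> Tuple[float, float]:
        """Return (P(True), P(False)) for a turn-taking point feature v."""
        act = list(v)
        for idx, weights in enumerate(self.layers):
            act = [sum(w * a for w, a in zip(row, act)) for row in weights]
            if idx < len(self.layers) - 1:
                act = [max(0.0, a) for a in act]  # ReLU (assumed activation)
        p_true, p_false = softmax(act)
        return p_true, p_false

    def is_turn_taking_timing(self, v: Sequence[float]) -> bool:
        """Assumed decision rule: respond when P(True) exceeds P(False)."""
        p_true, p_false = self.predict_proba(v)
        return p_true > p_false


clf = TurnTakingClassifier(input_dim=3)
p_true, p_false = clf.predict_proba([0.1, -0.2, 0.3])
print(round(p_true + p_false, 6))
```

The two probabilities always sum to one, so a deployed system could also compare P(True) against a tuned threshold rather than against P(False).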

The model that characterizes the neural network computation is assumed to have been trained in advance, and the turn-taking timing identification unit 150 performs its computation using this model.

According to the present invention, neural networks that take time-series data as input, such as recurrent neural networks, are used hierarchically to calculate a turn-taking point feature amount from the intra-utterance feature amount sequences of the immediately preceding utterance and of utterances earlier than it. The turn-taking point feature amount is a fixed-length vector used to identify whether or not the identification target turn-taking point is a turn-taking timing. This makes it possible to identify turn-taking timing using feature amounts extracted from the information of the immediately preceding utterance and of earlier utterances, without manually designing feature amount extraction rules.

That is, because the feature amount extraction rules are acquired as a model without human intervention, rule design becomes possible that does not depend on the designer and has reduced variation. In addition, training the model on information from the immediately preceding utterance and earlier utterances makes it possible to learn a highly accurate model. As a result, highly accurate turn-taking timing identification can be achieved.

<Second Embodiment>
In the first embodiment, the turn-taking timing is identified using a single intra-utterance feature amount sequence (for example, the word sequence of a speech recognition result). However, a plurality of types of intra-utterance feature amount sequences, such as a fundamental frequency sequence and a cepstrum sequence, may also be used for the identification.

Therefore, in the second embodiment, the turn-taking point feature amount is calculated from a plurality of types of intra-utterance feature amount sequences.

Let J be the number of types of intra-utterance feature amounts generated from an utterance, and let j be an integer satisfying 1 ≦ j ≦ J.

The turn-taking timing identification device 200 is described below with reference to FIGS. 5 to 7. FIG. 5 is a block diagram showing the configuration of the turn-taking timing identification device 200. FIG. 6 is a flowchart showing the operation of the turn-taking timing identification device 200. FIG. 7 is a diagram showing the relationship among the type-j intra-utterance feature amounts, the type-j utterance feature amounts, and the combined utterance feature amounts. As shown in FIG. 5, the turn-taking timing identification device 200 includes a speech section detection unit 110, a type-1 intra-utterance feature amount sequence generation unit 120_1, ..., a type-J intra-utterance feature amount sequence generation unit 120_J, a type-1 utterance feature amount calculation unit 130_1, ..., a type-J utterance feature amount calculation unit 130_J, an utterance feature amount combining unit 230, a turn-taking point feature amount calculation unit 140, a turn-taking timing identification unit 150, and a recording unit 190. The recording unit 190 is a component that records, as appropriate, information necessary for the processing of the turn-taking timing identification device 200.

The turn-taking timing identification device 200 generates an identification result indicating whether or not the identification target turn-taking point, which appears immediately after the immediately preceding utterance detected from the input speech, is a turn-taking timing. To generate this identification result (True/False), the device 200 uses the J types of intra-utterance feature amount sequences generated from the immediately preceding utterance and from utterances earlier than it.

Starting from the first utterance, the turn-taking timing identification device 200 identifies, in order, whether the turn-taking point that appears immediately after each utterance is a turn-taking timing.

The operation of the turn-taking timing identification device 200 is described with reference to FIG. 6. The speech section detection unit 110 detects, from the input speech, the utterance k, which is the k-th utterance (k is an integer of 1 or more) included in the input speech (S110).

The type-j intra-utterance feature amount sequence generation unit 120_j (1 ≦ j ≦ J) generates, from the utterance k detected in S110, the type-j intra-utterance feature amount sequence k, which is the k-th type-j intra-utterance feature amount sequence (S120).

The type-j utterance feature amount calculation unit 130_j (1 ≦ j ≦ J) calculates, from the type-j intra-utterance feature amount sequence k generated in S120, the type-j utterance feature amount k, which is the type-j utterance feature amount characterizing the k-th utterance (S130). For each j, the type-j utterance feature amount calculation unit 130_j is a component that executes computation with a neural network, and its characteristics are the same as those of the utterance feature amount calculation unit 130 of the first embodiment.

The utterance feature amount combining unit 230 generates, from the type-1 utterance feature amount k, ..., the type-J utterance feature amount k calculated in S130, the combined utterance feature amount k, which is the combined utterance feature amount characterizing the k-th utterance (S230). The combined utterance feature amount is the vector obtained by concatenating the type-1 utterance feature amount, ..., the type-J utterance feature amount, each of which is itself a vector. For example, suppose the fundamental frequency sequence is the type-1 intra-utterance feature amount sequence and the cepstrum sequence is the type-2 intra-utterance feature amount sequence. If the type-1 utterance feature amount generated from the former has 200 dimensions and the type-2 utterance feature amount generated from the latter has 200 dimensions, the combined utterance feature amount has 400 dimensions.
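The combining step S230 amounts to vector concatenation, which the following minimal Python sketch illustrates with the 200 + 200 = 400 dimension example from the text. The function name `combine_utterance_features` and the placeholder vector contents are hypothetical.

```python
def combine_utterance_features(per_type_features):
    """Concatenate the type-1, ..., type-J utterance feature vectors in order."""
    combined = []
    for vector in per_type_features:
        combined.extend(vector)
    return combined

# Placeholder vectors standing in for the outputs of units 130_1 and 130_2.
f0_feature = [0.0] * 200        # from the fundamental frequency sequence
cepstrum_feature = [1.0] * 200  # from the cepstrum sequence

combined_k = combine_utterance_features([f0_feature, cepstrum_feature])
```

The order of concatenation must be the same for every utterance so that each position in the combined vector always carries the same type of information.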

For the first utterance, ..., the (k-1)-th utterance, which precede the k-th utterance, the processing of S120, S130, and S230 has already been executed, and the first combined utterance feature amount (combined utterance feature amount 1), ..., the (k-1)-th combined utterance feature amount (combined utterance feature amount k-1) have been calculated. The combined utterance feature amounts 1, ..., k-1 are assumed to be recorded in the recording unit 190.

The turn-taking point feature amount calculation unit 140 calculates the turn-taking point feature amount k, which characterizes the turn-taking point k, that is, the identification target turn-taking point that appears immediately after the utterance k (S140). It calculates this from the combined utterance feature amount sequence k, which is time-series data consisting of the combined utterance feature amounts 1, ..., k-1 recorded in the recording unit 190 and the combined utterance feature amount k calculated in S230. The turn-taking point feature amount calculation unit 140 is a component that executes computation with a neural network, and its characteristics are the same as those of the turn-taking point feature amount calculation unit 140 of the first embodiment.
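One way to picture S140 is as a recurrent network that reads the stored combined utterance feature amounts in order and returns its final hidden state as the fixed-length turn-taking point feature amount. The sketch below assumes a simple tanh recurrent cell with random placeholder weights and uses a Python list in the role of the recording unit 190; the class name, the cell type, and the use of the final hidden state are assumptions for illustration, not details confirmed by the description.

```python
import math
import random

class TurnTakingPointEncoder:
    """Hypothetical stand-in for the turn-taking point feature amount calculation unit 140."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # Placeholder weights; a real model would be trained in advance.
        self.w_in = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                     for _ in range(out_dim)]
        self.w_rec = [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)]
                      for _ in range(out_dim)]
        self.out_dim = out_dim
        self.history = []  # plays the role of the recording unit 190

    def _step(self, h, x):
        # One recurrent update: h' = tanh(W_in x + W_rec h).
        return [math.tanh(sum(a * b for a, b in zip(wi, x)) +
                          sum(a * b for a, b in zip(wr, h)))
                for wi, wr in zip(self.w_in, self.w_rec)]

    def encode(self, combined_k):
        """Record combined utterance feature k, then encode features 1..k."""
        self.history.append(list(combined_k))
        h = [0.0] * self.out_dim
        for x in self.history:
            h = self._step(h, x)
        return h  # fixed-length vector regardless of k

encoder = TurnTakingPointEncoder(in_dim=4, out_dim=3)
for k in range(1, 4):
    feature_k = encoder.encode([0.1 * k] * 4)
```

The output length stays at `out_dim` however many utterances have been recorded, which is what lets the downstream classifier take a fixed-length input.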

The turn-taking timing identification unit 150 generates, from the turn-taking point feature amount k calculated in S140, the identification result k indicating whether or not the turn-taking point k is a turn-taking timing (S150). The turn-taking timing identification unit 150 is a component that executes computation with a neural network, and its characteristics are the same as those of the turn-taking timing identification unit 150 of the first embodiment.
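The data flow of steps S120 to S150 for one detected utterance can be summarized as the following Python sketch. The function `identify_turn_taking` and all of its parameters are hypothetical names, and the callables used in the usage lines are deliberately trivial stand-ins (averages and a threshold), not neural networks; they exist only to show how the pieces connect.

```python
def identify_turn_taking(utterance, sequence_generators, utterance_encoders,
                         history, point_encoder, classifier):
    """Run S120-S150 for one detected utterance and return the identification result."""
    combined = []
    for generate, encode in zip(sequence_generators, utterance_encoders):
        sequence = generate(utterance)       # S120: type-j intra-utterance sequence
        combined.extend(encode(sequence))    # S130 and S230: encode, then concatenate
    history.append(combined)                 # keep for later utterances (recording unit)
    point_feature = point_encoder(history)   # S140: encode features 1..k
    return classifier(point_feature)         # S150: True / False

# Toy stand-ins, J = 2 (NOT neural networks; for illustration only).
generators = [
    lambda u: [[float(len(w))] for w in u.split()],  # one value per word
    lambda u: [[float(ord(c) % 7)] for c in u],      # one value per character
]
mean_encoder = lambda seq: [sum(v[0] for v in seq) / len(seq)]
encoders = [mean_encoder, mean_encoder]
point_encoder = lambda hist: [sum(col) / len(hist) for col in zip(*hist)]
classifier = lambda feature: feature[0] > 3.0

history = []
result_1 = identify_turn_taking("hello there", generators, encoders,
                                history, point_encoder, classifier)
result_2 = identify_turn_taking("ok", generators, encoders,
                                history, point_encoder, classifier)
```

Passing the components in as callables keeps the flow independent of how each unit is implemented, so trained models could replace the stand-ins without changing the orchestration.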

<Modifications>
The present invention is not limited to the embodiments described above, and it goes without saying that modifications can be made as appropriate without departing from the spirit of the invention. The various processes described in the above embodiments need not be executed only in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as necessary.

<Supplementary Notes>
The apparatus of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include cache memory, registers, and the like), RAM and ROM as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity having such hardware resources.

The external storage device of the hardware entity stores the program necessary for realizing the functions described above and the data necessary for the processing of this program. The storage location is not limited to the external storage device; for example, the program may be stored in ROM, which is a read-only storage device. Data obtained by the processing of these programs is stored as appropriate in RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM or the like) and the data necessary for the processing of each program are read into memory as necessary, and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components described above as "... unit", "... means", and so on).

The present invention is not limited to the embodiments described above, and can be modified as appropriate without departing from the spirit of the invention. The processes described in the above embodiments need not be executed only in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as necessary.

As already described, when the processing functions of the hardware entity (the apparatus of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on a computer realizes the processing functions of the hardware entity on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it. Furthermore, each time a program is transferred from the server computer to the computer, the computer may execute processing according to the received program in sequence. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and result acquisition. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

Claims

A turn-taking timing identification device comprising:
a speech section detection unit that detects, from input speech, an utterance k that is the k-th utterance (k is an integer of 1 or more) included in the input speech;
an intra-utterance feature amount sequence generation unit that generates, from the utterance k, an intra-utterance feature amount sequence k that is the k-th intra-utterance feature amount sequence;
an utterance feature amount calculation unit that calculates, from the intra-utterance feature amount sequence k, an utterance feature amount k that is an utterance feature amount characterizing the utterance k;
a turn-taking point feature amount calculation unit that calculates, from an utterance feature amount sequence k that is time-series data composed of already calculated utterance feature amounts i (i = 1, ..., k-1), each being the i-th utterance feature amount, and the utterance feature amount k, a turn-taking point feature amount k that characterizes a turn-taking point k, the turn-taking point k being the identification target turn-taking point that appears immediately after the utterance k; and
a turn-taking timing identification unit that generates, from the turn-taking point feature amount k, an identification result k indicating whether or not the turn-taking point k is a turn-taking timing,
wherein the utterance feature amount calculation unit and the turn-taking point feature amount calculation unit are each configured using a neural network that takes time-series data expressed as a sequence of fixed-length vectors as input and outputs a feature amount expressed as a fixed-length vector.

A turn-taking timing identification device comprising:
a speech section detection unit that detects, from input speech, an utterance k that is the k-th utterance (k is an integer of 1 or more) included in the input speech;
where J is the number of types of intra-utterance feature amounts generated from an utterance and j is an integer satisfying 1 ≦ j ≦ J,
a type-j intra-utterance feature amount sequence generation unit that generates, from the utterance k, a type-j intra-utterance feature amount sequence k that is the k-th type-j intra-utterance feature amount sequence;
a type-j utterance feature amount calculation unit that calculates, from the type-j intra-utterance feature amount sequence k, a type-j utterance feature amount k that is a type-j utterance feature amount characterizing the utterance k;
an utterance feature amount combining unit that generates, from the type-j utterance feature amounts k (1 ≦ j ≦ J), a combined utterance feature amount k that is a combined utterance feature amount characterizing the utterance k;
a turn-taking point feature amount calculation unit that calculates, from a combined utterance feature amount sequence k that is time-series data composed of already calculated combined utterance feature amounts i (i = 1, ..., k-1), each being the i-th combined utterance feature amount, and the combined utterance feature amount k, a turn-taking point feature amount k that characterizes a turn-taking point k, the turn-taking point k being the identification target turn-taking point that appears immediately after the utterance k; and
a turn-taking timing identification unit that generates, from the turn-taking point feature amount k, an identification result k indicating whether or not the turn-taking point k is a turn-taking timing,
wherein the utterance feature amount calculation unit and the turn-taking point feature amount calculation unit are each configured using a neural network that takes time-series data expressed as a sequence of fixed-length vectors as input and outputs a feature amount expressed as a fixed-length vector.

A turn-taking timing identification method comprising:
a speech section detection step in which a turn-taking timing identification device detects, from input speech, an utterance k that is the k-th utterance (k is an integer of 1 or more) included in the input speech;
an intra-utterance feature amount sequence generation step in which the turn-taking timing identification device generates, from the utterance k, an intra-utterance feature amount sequence k that is the k-th intra-utterance feature amount sequence;
an utterance feature amount calculation step in which the turn-taking timing identification device calculates, from the intra-utterance feature amount sequence k, an utterance feature amount k that is an utterance feature amount characterizing the utterance k;
a turn-taking point feature amount calculation step in which the turn-taking timing identification device calculates, from an utterance feature amount sequence k that is time-series data composed of already calculated utterance feature amounts i (i = 1, ..., k-1), each being the i-th utterance feature amount, and the utterance feature amount k, a turn-taking point feature amount k that characterizes a turn-taking point k, the turn-taking point k being the identification target turn-taking point that appears immediately after the utterance k; and
a turn-taking timing identification step in which the turn-taking timing identification device generates, from the turn-taking point feature amount k, an identification result k indicating whether or not the turn-taking point k is a turn-taking timing,
wherein the utterance feature amount calculation step and the turn-taking point feature amount calculation step are each executed using a neural network that takes time-series data expressed as a sequence of fixed-length vectors as input and outputs a feature amount expressed as a fixed-length vector.

A turn-taking timing identification method comprising:
a speech section detection step in which a turn-taking timing identification device detects, from input speech, an utterance k that is the k-th utterance (k is an integer of 1 or more) included in the input speech;
where J is the number of types of intra-utterance feature amounts generated from an utterance and j is an integer satisfying 1 ≦ j ≦ J,
a type-j intra-utterance feature amount sequence generation step in which the turn-taking timing identification device generates, from the utterance k, a type-j intra-utterance feature amount sequence k that is the k-th type-j intra-utterance feature amount sequence;
a type-j utterance feature amount calculation step in which the turn-taking timing identification device calculates, from the type-j intra-utterance feature amount sequence k, a type-j utterance feature amount k that is a type-j utterance feature amount characterizing the utterance k;
an utterance feature amount combining step in which the turn-taking timing identification device generates, from the type-j utterance feature amounts k (1 ≦ j ≦ J), a combined utterance feature amount k that is a combined utterance feature amount characterizing the utterance k;
a turn-taking point feature amount calculation step in which the turn-taking timing identification device calculates, from a combined utterance feature amount sequence k that is time-series data composed of already calculated combined utterance feature amounts i (i = 1, ..., k-1), each being the i-th combined utterance feature amount, and the combined utterance feature amount k, a turn-taking point feature amount k that characterizes a turn-taking point k, the turn-taking point k being the identification target turn-taking point that appears immediately after the utterance k; and
a turn-taking timing identification step in which the turn-taking timing identification device generates, from the turn-taking point feature amount k, an identification result k indicating whether or not the turn-taking point k is a turn-taking timing,
wherein the utterance feature amount calculation step and the turn-taking point feature amount calculation step are each executed using a neural network that takes time-series data expressed as a sequence of fixed-length vectors as input and outputs a feature amount expressed as a fixed-length vector.

A program for causing a computer to function as the turn-taking timing identification device according to claim 1.

A recording medium on which is recorded a program for causing a computer to function as the turn-taking timing identification device according to claim 1.