JP2008216618A

JP2008216618A - Speech discrimination device

Info

Publication number: JP2008216618A
Application number: JP2007053751A
Authority: JP
Inventors: Kentaro Koga; 健太郎古賀; Yasuo Ariki; 康雄有木; Tetsuya Takiguchi; 哲也滝口; Tomoyuki Yamagata; 知行山形
Original assignee: Denso Ten Ltd; Kobe University NUC
Current assignee: Denso Ten Ltd; Kobe University NUC
Priority date: 2007-03-05
Filing date: 2007-03-05
Publication date: 2008-09-18

Abstract

PROBLEM TO BE SOLVED: To discriminate speech with high precision.

SOLUTION: A speech discrimination device, which discriminates speech, and a car navigation system, which performs various operations, are connected through a network or the like so that they can communicate with each other. The speech discrimination device has a microphone that receives speech. In this configuration, the speech discrimination device specifies, from the power of an input speech signal, an utterance section in which an utterance is made, and calculates acoustic feature quantities characterizing the speech signal from its power and pitch in that section. It also specifies the section before and/or the section after the utterance section and calculates acoustic feature quantities from the power and pitch of the speech signal in the specified section. From the acoustic feature quantities of the utterance section and those of the sections before and after it, the device discriminates whether the input speech signal is a system request, that is, a request for a connected processor that performs various kinds of processing to execute processing.

COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a speech discrimination device that discriminates whether an input speech signal is a request to a system or chat.

Voice interfaces have been coming into practical use in various fields. Their application is particularly notable in communication with robots and in the operation of devices that are difficult to operate by hand, such as car navigation systems.

However, the speech recognition systems (speech discrimination systems) currently in use cannot accurately discriminate whether input speech is an utterance addressed to the system or surrounding chat, so unless a switch or the like is used they trigger unintended operations. In other words, because speech recognition alone is inaccurate, these systems compensate for the inaccuracy by detecting additional information, such as the operation of a switch.

In response, there are methods in which the user consciously changes prosodic or linguistic features, such as the speech spotter (reference: Goto et al., ICSLP, 8, 1522-1536, 2004). With such methods, however, users feel that their own utterances are unnatural, which places a heavy burden on them. As methods that burden the user less, a method that uses the acoustic features of natural utterances (reference: Yamada, SIG-SLP-61(2), 7-12, 2006-5) and a method based on speech recognition results (reference: Sako, SIG-SLP-64, 19-24, 2006-12) have been proposed.

For example, Non-Patent Document 1 discloses a technique that calculates acoustic feature quantities characterizing a speech signal from the power and pitch of the signal in the uttered section, and discriminates whether the speech signal is a request utterance to the system or chat. Specifically, eight dimensions (the mean, standard deviation, maximum, and difference between the maximum and minimum values) are calculated from the input speech signal as acoustic feature quantities and used for speech identification. In addition, speech recognition is performed from the likelihood of the input utterance under an acoustic model based on an HMM (Hidden Markov Model), and the speech recognition result and the speech identification result are used together to discriminate between a request utterance to the system and chat.

Natsuki Sugimoto, Norihide Kitaoka, Seiichi Nakagawa, Department of Information Engineering, Toyohashi University of Technology, "Discrimination between system-directed utterances and human-directed utterances using speech features", 2006 IEICE General Conference

The conventional technique described above, however, has the problem that its speech discrimination accuracy is low. Specifically, the likelihood is calculated and speech recognition is performed using, as acoustic feature quantities, only the eight dimensions obtained from the uttered section (the mean, standard deviation, maximum, and difference between the maximum and minimum values of each of power and pitch). The recognition is therefore easily affected by noise and the like, and its accuracy is poor. Because the discrimination between a request utterance to the system and chat relies on this recognition result, the discrimination accuracy also suffers.

The present invention has been made to solve the above problem of the prior art, and its object is to provide a speech discrimination device capable of highly accurate speech discrimination.

To solve the above problem and achieve the object, the invention according to claim 1 comprises: utterance section specifying means for specifying, from an input speech signal, an utterance section indicating the section in which an utterance is made; first acoustic feature quantity calculating means for calculating acoustic feature quantities characterizing the speech signal from the power and pitch of the speech signal in the utterance section specified by the utterance section specifying means; second acoustic feature quantity calculating means for specifying the section before and/or the section after the utterance section specified by the utterance section specifying means, and calculating the acoustic feature quantities from the power and pitch of the speech signal in the specified section; and speech discriminating means for discriminating, from the acoustic feature quantities calculated by the first acoustic feature quantity calculating means and the acoustic feature quantities calculated by the second acoustic feature quantity calculating means, whether the input speech signal is a system request that requests a connected processing device, which executes various kinds of processing, to execute processing.

In the invention according to claim 2, in the above invention, the first acoustic feature quantity calculating means calculates, as acoustic feature quantities, any one or more of the mean, standard deviation, maximum, and difference between the maximum and minimum values of each of the power and pitch of the speech signal in the utterance section specified by the utterance section specifying means.

In the invention according to claim 3, in the above invention, the second acoustic feature quantity calculating means specifies the section before and/or the section after the utterance section specified by the utterance section specifying means, and calculates, as acoustic feature quantities, any one or more of the mean, standard deviation, maximum, and difference between the maximum and minimum values of each of the power and pitch of the speech signal in the specified section.

The invention according to claim 4, in the above invention, further comprises language feature quantity generating means for performing speech recognition that extracts the content of the utterance from the input speech signal as character data, and generating language feature quantities from the recognized character data. The speech discriminating means discriminates, from the acoustic feature quantities calculated by the first acoustic feature quantity calculating means, the acoustic feature quantities calculated by the second acoustic feature quantity calculating means, and the language feature quantities generated by the language feature quantity generating means, whether the input speech signal is a system request that requests a connected processing device, which executes various kinds of processing, to execute processing.

According to the invention of claim 1, an utterance section indicating the section in which an utterance is made is specified from the input speech signal; acoustic feature quantities characterizing the speech signal are calculated from the power and pitch of the speech signal in the specified utterance section; the section before and/or the section after the specified utterance section is specified, and acoustic feature quantities are calculated from the power and pitch of the speech signal in that section; and, from the acoustic feature quantities of the utterance section and those of the section before and/or after it, it is discriminated whether the input speech signal is a system request that requests a connected processing device, which executes various kinds of processing, to execute processing. Highly accurate speech discrimination can therefore be achieved.

For example, based on the observation that a system request is often preceded and followed by silence, acoustic feature quantities are calculated from both the utterance section and the sections before and after it, and these are used to discriminate whether the input signal is a system request. Compared with discrimination based on acoustic feature quantities from the utterance section alone, more feature quantities are available, so highly accurate speech discrimination can be achieved.

In addition, sufficiently accurate speech discrimination can be achieved at low cost, without requiring an expensive device that performs speech recognition on the input speech signal to calculate language feature quantities, from which highly accurate discrimination could otherwise be expected.

According to the invention of claim 2, any one or more of the mean, standard deviation, maximum, and difference between the maximum and minimum values of each of the power and pitch of the speech signal in the specified utterance section are calculated as acoustic feature quantities, so more accurate speech discrimination can be achieved.

For example, using all of the mean, standard deviation, maximum, and difference between the maximum and minimum values as acoustic features makes the characteristics of the signal clearer, so more accurate speech discrimination can be achieved. Conversely, excluding parameters that do not show the characteristics clearly makes it possible to reduce the processing load of the system and to increase the processing speed.

According to the invention of claim 3, the section before and/or the section after the specified utterance section is specified, and any one or more of the mean, standard deviation, maximum, and difference between the maximum and minimum values of each of the power and pitch of the speech signal in the specified section are calculated as acoustic feature quantities. More accurate speech discrimination can therefore be achieved than when only the acoustic feature quantities of the utterance section are used.

For example, by using all of the mean, standard deviation, maximum, and difference between the maximum and minimum values of each of power and pitch in the sections before and after the utterance section as well, up to 24 dimensions in total can be used as acoustic feature quantities. This makes the characteristics clearer, so more accurate speech discrimination can be achieved. In addition, because acoustic feature quantities can be calculated for only the preceding section or only the following section as needed, parameters that do not show the characteristics clearly can be excluded, which reduces the processing load of the system and increases the processing speed.

According to the invention of claim 4, speech recognition is performed to extract the content of the utterance from the input speech signal as character data, and language feature quantities are generated from the recognized character data. From the acoustic feature quantities of the utterance section, the acoustic feature quantities of the section before and/or after the utterance section, and the generated language feature quantities, it is discriminated whether the input speech signal is a system request that requests a connected processing device, which executes various kinds of processing, to execute processing. More accurate speech discrimination can therefore be achieved.

For example, by using an expensive function or product capable of calculating language feature quantities, more accurate speech discrimination can be achieved than when discrimination uses only the acoustic feature quantities of the utterance section and the sections before and after it.

Embodiments of the speech discrimination device, speech discrimination method, and speech discrimination program according to the present invention are described in detail below with reference to the accompanying drawings. The description covers, in order, the main terms used in this embodiment, the outline and features of the speech discrimination device according to this embodiment, and the configuration and processing flow of the speech discrimination device, and ends with various modifications of this embodiment.

[Explanation of terms]
First, the main terms used in this embodiment are explained. The "speech discrimination device" used in this embodiment (corresponding to the "speech discrimination device" recited in the claims) is a device that discriminates whether an input speech signal is a request to the system or other chat. As a specific example, the speech discrimination device is connected to a car navigation system (hereinafter, "car navigation"), receives speech uttered by the driver, passengers, and so on, and discriminates whether the uttered content is a system request or chat. For example, when the speech discrimination device determines that the input speech signal is a system request, it outputs an instruction to the car navigation system to carry out the uttered content; when it determines that the signal is chat, it discards the utterance without outputting an instruction.

A system request indicates an operation of the car navigation system, such as "start the car navigation", "change the car navigation display to an enlarged view", or "change the destination". Chat includes explanations of how to operate the car navigation system and everyday conversation unrelated to its operation, such as "I'll explain how to operate the car navigation", "Press this button and the display changes", or "This button is the TV volume". This embodiment describes the case in which the speech discrimination device discriminates between a system request to a car navigation system and chat, but the present invention is not limited to this and can be applied to various other fields, such as system requests to a robot.

The "precision" used in this embodiment is calculated as "precision = A / (A + C)", where "A" and "B" are the sets judged to be valid and "C" is the set judged not to be valid; that is, it is an accuracy index with respect to false detections. The "recall" is calculated as "recall = A / (A + B)"; that is, it is an accuracy index with respect to missed detections. The F value is the harmonic mean of precision and recall, and it is high when both precision and recall are high in a balanced way; that is, it is an index that evaluates precision and recall together.
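The three indices above can be sketched in a few lines. This is a minimal illustration, assuming that A counts correctly detected items, B counts missed items, and C counts false detections, which is the reading consistent with the two formulas; the function name and the sample counts are hypothetical and do not come from the patent.

```python
from typing import Tuple


def precision_recall_f(a: int, b: int, c: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F value).

    Assumed reading of the text: a = correctly detected, b = missed,
    c = falsely detected.
    """
    precision = a / (a + c) if (a + c) else 0.0
    recall = a / (a + b) if (a + b) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # The F value is the harmonic mean of precision and recall.
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f(a=80, b=20, c=10)
```

Because the harmonic mean is dominated by the smaller of its two inputs, the F value drops sharply when either precision or recall is low, which is why it serves as a combined index.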

[Outline and features of the speech discrimination device]
Next, the outline and features of the speech discrimination device according to the first embodiment are described with reference to FIG. 1. FIG. 1 is a system configuration diagram showing the overall configuration of a system that includes the speech discrimination device according to the first embodiment.

As shown in FIG. 1, in this system a speech discrimination device that discriminates speech and a car navigation system that executes various operations are connected through a network or the like so that they can communicate with each other. The speech discrimination device includes a microphone that accepts speech. The car navigation system carries out the instructions input from the speech discrimination device. For example, on receiving an instruction indicating "change the car navigation display to an enlarged view", the car navigation system changes the displayed map to an enlarged view as instructed.

In this configuration, the speech discrimination device according to the first embodiment, as described above, discriminates whether an input speech signal is a request to the system or other chat. Its main feature is that it can achieve highly accurate speech discrimination.

To describe this main feature concretely, the speech discrimination device receives uttered speech through the microphone (see (1) in FIG. 1). As a specific example, the speech discrimination device receives through the microphone speech signals uttered by the driver, passengers, and so on, such as "change the car navigation display to an enlarged view", "change the destination", or "Press this button and the display changes".

The speech discrimination device then specifies, from the input speech signal, the utterance section indicating the section in which the utterance is made, and calculates as acoustic feature quantities any one or more of the mean, standard deviation, maximum, and difference between the maximum and minimum values of each of the power and pitch of the speech signal in the specified utterance section (see (2) in FIG. 1). In the example above, when the speech signal "change the car navigation display to an enlarged view" is input, the speech discrimination device applies to the power of the input speech signal a predetermined threshold chosen to be convenient for speech discrimination, adds a margin before and after the portion delimited by this threshold, and specifies the resulting section as the utterance section. The speech discrimination device then calculates the "mean, standard deviation, maximum, and difference between the maximum and minimum values" of the power of the speech signal in the specified utterance section and the "mean, standard deviation, maximum, and difference between the maximum and minimum values" of the pitch of that speech signal, and uses these as acoustic feature quantities.
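The threshold-plus-margin step can be sketched as follows. This is only an illustration under stated assumptions: the power is taken to be available as one value per frame, the section is taken to span from the first to the last frame above the threshold, and the function name, threshold, and margin values are hypothetical. The patent does not give concrete values for the threshold or the margin.

```python
from typing import List, Optional, Tuple


def find_utterance_section(
    frame_power: List[float], threshold: float, margin: int
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the utterance section, end exclusive.

    Assumption: the section spans the first to the last frame whose power
    exceeds the threshold, widened by `margin` frames on each side.
    Returns None when no frame exceeds the threshold.
    """
    above = [i for i, p in enumerate(frame_power) if p > threshold]
    if not above:
        return None
    start = max(0, above[0] - margin)
    end = min(len(frame_power), above[-1] + 1 + margin)
    return start, end


# Hypothetical per-frame power values, for illustration only.
power = [0.0, 0.0, 0.0, 5.0, 6.0, 7.0, 0.0, 0.0, 0.0, 0.0]
section = find_utterance_section(power, threshold=1.0, margin=2)
```

Clamping the margins to the signal boundaries keeps the section valid when an utterance starts or ends close to the edge of the recording.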

Next, the speech discrimination device specifies the section before and/or the section after the specified utterance section, and calculates as acoustic feature quantities any one or more of the mean, standard deviation, maximum, and difference between the maximum and minimum values of each of the power and pitch of the speech signal in the specified section (see (3) and (4) in FIG. 1). In the example above, the speech discrimination device specifies the 0.7 seconds preceding the start of the utterance section as the preceding section (section A) and the 0.7 seconds following the end of the utterance section as the following section (section B). The speech discrimination device then calculates the "mean, standard deviation, maximum, and difference between the maximum and minimum values" of the power of the speech signal in section A and the "mean, standard deviation, maximum, and difference between the maximum and minimum values" of the power of the speech signal in section B, and uses these as acoustic feature quantities.
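Putting the utterance section and the two 0.7-second context sections together, the full feature vector can be sketched as below. This is a minimal sketch, not the patented implementation: it assumes per-frame power and pitch contours at a hypothetical rate of 100 frames per second, uses the population standard deviation, maps an empty section to zeros, and orders the vector as section A, utterance section, section B. All of these choices and all names are assumptions.

```python
import statistics
from typing import List, Sequence


def stats4(values: Sequence[float]) -> List[float]:
    """Mean, standard deviation, maximum, and max-min difference of a contour."""
    if not values:
        return [0.0, 0.0, 0.0, 0.0]  # assumption: an empty section maps to zeros
    return [
        statistics.fmean(values),
        statistics.pstdev(values),  # assumption: population standard deviation
        max(values),
        max(values) - min(values),
    ]


def feature_vector(
    power: Sequence[float],
    pitch: Sequence[float],
    start: int,
    end: int,
    frames_per_sec: int = 100,   # hypothetical frame rate
    context_sec: float = 0.7,    # context length given in the text
) -> List[float]:
    """Build 8 features for each of section A, the utterance, and section B."""
    ctx = round(context_sec * frames_per_sec)
    sections = [
        (max(0, start - ctx), start),        # section A (before the utterance)
        (start, end),                        # utterance section
        (end, min(len(power), end + ctx)),   # section B (after the utterance)
    ]
    vec: List[float] = []
    for s, e in sections:
        vec += stats4(power[s:e]) + stats4(pitch[s:e])
    return vec


# Synthetic contours: silence, an "utterance" of 100 frames, then silence.
power = [0.0] * 100 + [1.0, 3.0] * 50 + [0.0] * 100
pitch = [0.0] * 100 + [120.0] * 100 + [0.0] * 100
vec = feature_vector(power, pitch, start=100, end=200)
```

With silence on both sides, the features for sections A and B are all zero, which illustrates the cue the text relies on: a system request tends to be surrounded by silence, whereas chat usually is not.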

The speech discrimination device then discriminates, from the acoustic feature quantities calculated from the utterance section and the acoustic feature quantities calculated from the sections before and after it, whether the input speech signal is a system request that requests a connected processing device, which executes various kinds of processing, to execute processing (see (5) in FIG. 1). In the example above, the SVM (Support Vector Machine) of the speech discrimination device takes the acoustic feature quantities calculated from the utterance section, from section A, and from section B (in each case the mean, standard deviation, maximum, and difference between the maximum and minimum values of each of power and pitch), scales them based on the order of the tree generated by the decision tree learning tool C4.5 (reference: C.W. Hsu et al., http://www.csie.ntu.edu.tw/~cjlin/papers/guide.pdf), and calculates a weight for each feature quantity (an index of which feature quantities are important). The SVM then discriminates, from the calculated weight of each feature quantity, whether the input speech signal is a system request. If the input speech signal is "change the car navigation display to an enlarged view", the speech discrimination device determines that the speech signal is a system request.
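The scaling and decision steps can be illustrated with a deliberately simplified sketch. It does not reproduce the C4.5-based procedure, which the text does not specify in detail. Instead it assumes min-max scaling of each feature to [0, 1], a per-feature weight standing in for the importance derived from the decision tree, and a linear SVM-style decision rule whose coefficients are taken as already trained. All names and numbers are hypothetical, and a two-dimensional toy vector replaces the 24-dimensional one for brevity.

```python
from typing import List, Sequence, Tuple


def fit_min_max(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-feature minimum and maximum over the training vectors."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]


def scale(
    x: Sequence[float],
    lo: Sequence[float],
    hi: Sequence[float],
    weights: Sequence[float],
) -> List[float]:
    """Scale each feature to [0, 1], then multiply by its importance weight."""
    out = []
    for v, l, h, w in zip(x, lo, hi, weights):
        unit = (v - l) / (h - l) if h > l else 0.0
        out.append(w * unit)
    return out


def is_system_request(x_scaled: Sequence[float], coef: Sequence[float], bias: float) -> bool:
    """Linear decision rule: a positive score means 'system request'."""
    score = sum(c * v for c, v in zip(coef, x_scaled)) + bias
    return score > 0


# Toy data and parameters, all hypothetical.
train = [[0.0, 10.0], [4.0, 30.0]]
lo, hi = fit_min_max(train)
x_scaled = scale([2.0, 20.0], lo, hi, weights=[1.0, 0.5])
decision = is_system_request(x_scaled, coef=[1.0, -1.0], bias=-0.1)
```

Scaling before classification keeps features with large numeric ranges, such as pitch in hertz, from dominating those with small ranges, and the weights let more informative features contribute more to the score.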

After that, the speech discrimination device outputs an instruction to the car navigation system to carry out the operation indicated by the speech signal (see (6) in FIG. 1). In the example above, the speech discrimination device outputs an instruction to the car navigation system to carry out the operation "change the display to an enlarged view" indicated by the speech signal "change the car navigation display to an enlarged view". The car navigation system then processes the instructed content. If the speech signal is determined to be chat rather than a system request, the speech discrimination device discards the speech signal without outputting an instruction to the car navigation system.
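The final branch, forwarding or discarding, can be sketched as below. The function and callback names are hypothetical; in particular, `send_to_navi` stands in for whatever interface carries the instruction to the car navigation system, which the text does not specify.

```python
from typing import Callable, List


def dispatch(
    utterance: str,
    is_system_request: bool,
    send_to_navi: Callable[[str], None],
) -> bool:
    """Forward a system request to the car navigation system; discard chat.

    Returns True when an instruction was sent.
    """
    if is_system_request:
        send_to_navi(utterance)
        return True
    return False  # chat: discarded without sending anything


# A list stands in for the car navigation system in this illustration.
sent: List[str] = []
dispatch("change the car navigation display to an enlarged view", True, sent.append)
dispatch("Press this button and the display changes", False, sent.append)
```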

In this way, the speech discrimination device according to the first embodiment calculates eight dimensions of acoustic feature quantities from each of the utterance section and the sections before and after the utterance, 24 dimensions in total, and uses them to discriminate whether the input signal is a system request. As a result, it can achieve highly accurate speech discrimination, which is the main feature described above.

[Configuration of the speech discrimination device]
Next, the configuration of the speech discrimination device shown in FIG. 1 is described with reference to FIG. 2. FIG. 2 is a block diagram showing the configuration of the speech discrimination device according to the first embodiment. As shown in FIG. 2, the speech discrimination device 10 comprises a communication control I/F unit 11, an input unit 12, a storage unit 13, and a control unit 20.

The communication control I/F unit 11 controls communication of the various kinds of information exchanged with connected devices. In the example above, when the control unit 20 described later determines that the input speech signal is a system request, the communication control I/F unit 11 transmits the corresponding instruction to the car navigation system.

The input unit 12 includes a microphone or the like and accepts speech signals uttered by users. As a specific example, the input unit 12 accepts speech signals of system requests, chat, conversation, and so on uttered by a user, such as "change the car navigation display to an enlarged view", "change the destination", or "Press this button and the display changes", and stores them in the input signal DB 14 via the utterance section specifying unit 21 described later.

The storage unit 13 stores the data and programs required for the various processes performed by the control unit 20 and, as elements particularly relevant to the present invention, includes an input signal DB 14 and a discrimination result DB 15. The storage unit 13 also stores values such as the predetermined threshold used to specify the utterance section (a value chosen to be easy to handle in speech discrimination) and the predetermined number of seconds used to specify the sections before and after the utterance section. Information including the various data and parameters can be changed arbitrarily unless otherwise noted.

The input signal DB 14 temporarily stores the speech signal input through the input unit 12. For example, the input signal DB 14 stores a speech signal such as the one shown in FIG. 3. The discrimination result DB 15 stores the results determined by the SVM 24 described later. For example, as shown in FIG. 4, the discrimination result DB 15 stores each speech signal under "system request" (speech signals determined to be system requests) or "chat" (speech signals determined to be chat), with entries such as "start the car navigation system", "turn up the volume", "I will explain how to operate the car navigation system", and "if you press this button, the display changes". FIG. 3 is a diagram showing an example of an input speech signal stored in the input signal DB, and FIG. 4 is a diagram showing an example of the information stored in the discrimination result DB. Information including the various data and parameters can be changed arbitrarily unless otherwise noted.

The control unit 20 has an internal memory for storing a control program such as an OS (Operating System), programs that define various processing procedures, and the required data. As elements particularly relevant to the present invention, it includes an utterance section specifying unit 21, a first acoustic feature calculation unit 22, a second acoustic feature calculation unit 23, an SVM 24, and an instruction output unit 25, and executes various processes by means of these.

The utterance section specifying unit 21 specifies, from the power of the input speech signal, the utterance section, that is, the section in which an utterance was made. In the example above, when the speech signal "change the car navigation display to the enlarged view" is input, the utterance section specifying unit 21 applies to the power of the input speech signal a predetermined threshold (a value chosen to be easy to handle in speech discrimination), extracts the part selected by the threshold together with a margin before and after it, and specifies the extracted section as the utterance section. The utterance section specifying unit 21 corresponds to the "utterance section specifying means" recited in the claims.
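The thresholding step can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the power has already been computed per frame, that the utterance spans the first to the last frame above the threshold, and that the margin is given in frames. The function name `find_utterance_section` and its parameters are hypothetical.

```python
from typing import List, Optional, Tuple


def find_utterance_section(
    frame_power: List[float], threshold: float, margin: int
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the utterance section, end exclusive.

    Assumption: the section spans the first to the last frame whose power
    exceeds the threshold, widened by `margin` frames on each side.
    """
    above = [i for i, p in enumerate(frame_power) if p > threshold]
    if not above:
        return None  # no frame exceeded the threshold: no utterance found
    start = max(0, above[0] - margin)
    end = min(len(frame_power), above[-1] + 1 + margin)
    return start, end
```

For the power sequence `[0.0, 0.1, 0.0, 0.9, 1.2, 0.8, 0.1, 0.0, 0.0]` with a threshold of 0.5 and a margin of one frame, the function returns `(2, 7)`.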

The first acoustic feature calculation unit 22 calculates, as acoustic feature quantities, one or more of the mean, the standard deviation, the maximum, and the difference between the maximum and the minimum of each of the power and the pitch of the speech signal in the utterance section specified by the utterance section specifying unit 21. In the example above, the first acoustic feature calculation unit 22 calculates the "mean, standard deviation, maximum, and difference between maximum and minimum" of the power of the speech signal in the specified utterance section and the "mean, standard deviation, maximum, and difference between maximum and minimum" of the pitch of that speech signal as acoustic feature quantities, and outputs them to the SVM 24 described later. The first acoustic feature calculation unit 22 corresponds to the "first acoustic feature quantity calculation means" recited in the claims. The "mean, standard deviation, maximum, and difference between maximum and minimum" can be calculated by known methods.
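A minimal sketch of the eight statistics for one section, assuming per-frame power and pitch values are already available as lists. The source does not say whether the standard deviation is the population or the sample form; the population form (`statistics.pstdev`) is used here as an assumption, and the function names are hypothetical.

```python
import statistics
from typing import List, Sequence


def section_stats(values: Sequence[float]) -> List[float]:
    """Mean, standard deviation, maximum, and (maximum - minimum) of one track."""
    if not values:
        return [0.0, 0.0, 0.0, 0.0]  # assumption: an empty section contributes zeros
    return [
        statistics.fmean(values),
        statistics.pstdev(values),  # population standard deviation (assumed)
        max(values),
        max(values) - min(values),
    ]


def acoustic_features(power: Sequence[float], pitch: Sequence[float]) -> List[float]:
    """8-dimensional feature vector: four power statistics, then four pitch statistics."""
    return section_stats(power) + section_stats(pitch)
```

Calling `acoustic_features` once per section (before, utterance, after) and concatenating the results gives the 24 dimensions mentioned in the text.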

The second acoustic feature calculation unit 23 specifies the section before and/or the section after the utterance section specified by the utterance section specifying unit 21, and calculates acoustic feature quantities from the power and the pitch of the speech signal in the specified section. In the example above, the second acoustic feature calculation unit 23 refers to the number of seconds stored in the storage unit 13 and to the input speech signal, specifies the 0.7 seconds immediately preceding the start of the specified utterance section as the section before the utterance (section A), and specifies the 0.7 seconds immediately following the end of the specified utterance section as the section after the utterance (section B). It then calculates the "mean, standard deviation, maximum, and difference between maximum and minimum" of the power of the speech signal in section A and the "mean, standard deviation, maximum, and difference between maximum and minimum" of the power of the speech signal in section B as acoustic feature quantities, and outputs them to the SVM 24 described later. The value of 0.7 seconds used here is only an example and is not limiting. The second acoustic feature calculation unit 23 corresponds to the "second acoustic feature quantity calculation means" recited in the claims.
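Locating sections A and B can be sketched as below. The frame-based indexing, the `frame_shift_s` parameter, and the clipping at the signal boundaries are assumptions made for illustration; only the 0.7 second default comes from the text.

```python
from typing import Tuple

Section = Tuple[int, int]  # (start, end) frame indices, end exclusive


def context_sections(
    start: int, end: int, n_frames: int, frame_shift_s: float, context_s: float = 0.7
) -> Tuple[Section, Section]:
    """Return section A (before the utterance) and section B (after it).

    Both sections are `context_s` seconds long where the signal allows it,
    and are clipped to the signal boundaries otherwise.
    """
    n = round(context_s / frame_shift_s)  # context length in frames
    section_a = (max(0, start - n), start)
    section_b = (end, min(n_frames, end + n))
    return section_a, section_b
```

With a 10 ms frame shift, an utterance occupying frames 100 to 200 of a 1000-frame signal yields section A = `(30, 100)` and section B = `(200, 270)`.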

The SVM 24 determines, from the acoustic feature quantities calculated from the utterance section and those calculated from the sections before and after it, whether the input speech signal is a system request, that is, a request asking a connected processing device that performs various processes to execute a process. Specifically, the SVM 24 takes as reference the support vectors, which are the data points lying closest to the other class, sets the decision boundary at the position where the Euclidean distance to them is largest, and determines on the basis of that boundary whether the input is a system request or chat.
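For reference, the maximum-margin criterion described above is commonly written as the following textbook hard-margin problem. The symbols are not in the source and are introduced only for illustration: $x_i$ is the feature vector of utterance $i$, $y_i \in \{+1, -1\}$ its label (assuming $+1$ for system request and $-1$ for chat), and $w$, $b$ define the boundary $w^\top x + b = 0$.

$$
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,\bigl(w^\top x_i + b\bigr) \ge 1,\qquad i = 1,\dots,N
$$

Under this formulation the support vectors are the points satisfying the constraint with equality, the margin between the two classes is $2/\lVert w\rVert$, and a new utterance $x$ is classified by the sign of $w^\top x + b$.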

For example, the SVM 24 takes the acoustic feature quantity calculated from the utterance section (the mean, standard deviation, maximum, and difference between maximum and minimum of each of the power and the pitch), the acoustic feature quantity calculated from section A (the same statistics of the power and the pitch), and the acoustic feature quantity calculated from section B (the same statistics of the power and the pitch), scales them on the basis of the order of the tree generated by the decision tree learning tool C4.5, and calculates a weight for each feature. Then, as shown in FIG. 5, the SVM 24 plots the weights and sets a boundary such that the margin is maximized. FIG. 5 is a diagram for explaining request discrimination by the SVM.

After that, when more of the calculated per-feature weights lie above the boundary than below it, the SVM 24 determines that the input speech signal is a system request, stores the speech signal in the discrimination result DB 15, and instructs the instruction output unit 25 to output the instruction corresponding to the system request. When more of them lie below the boundary, the SVM 24 determines that the input speech signal is chat and stores the speech signal in the discrimination result DB 15. The SVM 24 corresponds to the "speech discrimination means" recited in the claims.

The instruction output unit 25 outputs operation instructions for the system to a connected device such as a car navigation system. In the example above, on receiving from the SVM 24 a signal indicating that a system request is to be issued, the instruction output unit 25 outputs an instruction to the car navigation system so that the operation "change display to enlarged view", which corresponds to the input speech signal "change the car navigation display to the enlarged view", is carried out. The car navigation system that receives the instruction then carries it out.

[Processing by the speech discrimination device]
Next, the processing performed by the speech discrimination device is described with reference to FIG. 6. FIG. 6 is a flowchart showing the flow of the speech discrimination process in the speech discrimination device according to Embodiment 1.

As shown in FIG. 6, when speech is input (Yes at step S601), the utterance section specifying unit 21 of the speech discrimination device 10 specifies, from the power of the input speech signal, the utterance section in which the utterance was made (step S602).

Next, the first acoustic feature calculation unit 22 calculates, as acoustic feature quantities, one or more of the mean, the standard deviation, the maximum, and the difference between the maximum and the minimum of each of the power and the pitch of the speech signal in the utterance section specified by the utterance section specifying unit 21, and outputs the calculated acoustic feature quantities to the SVM 24 (step S603).

The second acoustic feature calculation unit 23 then specifies the section before and the section after the utterance section specified by the utterance section specifying unit 21 (step S604), calculates acoustic feature quantities from the power and the pitch of the speech signal in the specified sections, and outputs the calculated acoustic feature quantities to the SVM 24 (step S605).

After that, the SVM 24 determines, from the acoustic feature quantities calculated from the utterance section and those calculated from the sections before and after it, whether the input speech signal is a system request asking a connected processing device that performs various processes to execute a process (step S606).

If the input is determined to be a system request (Yes at step S607), the SVM 24 instructs the instruction output unit 25 to output the instruction corresponding to the system request, and the instruction output unit 25 outputs an instruction to the car navigation system so that the operation is carried out (step S608). The SVM 24 then stores the input speech signal and the discrimination result in the discrimination result DB 15 (step S609).

If, on the other hand, the input is determined not to be a system request (No at step S607), the SVM 24 stores the input speech signal and the discrimination result in the discrimination result DB 15 (step S609).
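The branch at steps S607 to S609 can be sketched as follows. The classifier and the instruction sender are passed in as callables because their internals are outside this sketch; `process_utterance`, its parameters, and the use of a plain list as the discrimination result DB are hypothetical stand-ins.

```python
from typing import Callable, List, Tuple


def process_utterance(
    signal: str,
    is_system_request: Callable[[str], bool],
    send_instruction: Callable[[str], None],
    result_db: List[Tuple[str, str]],
) -> bool:
    """Run the decision and the branch of FIG. 6 for one input signal."""
    request = is_system_request(signal)  # steps S602 to S606, condensed into one call
    if request:                          # Yes at step S607
        send_instruction(signal)         # step S608: forward the instruction
    label = "system request" if request else "chat"
    result_db.append((signal, label))    # step S609: stored in both branches
    return request
```

Note that the result is stored on both branches, matching the flowchart description in which step S609 follows either outcome of step S607.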

[Effects of Embodiment 1]
As described above, according to Embodiment 1, the utterance section in which an utterance was made is specified from the power of the input speech signal, and an acoustic feature quantity representing the characteristics of the speech signal is calculated from the power and the pitch of the speech signal in the specified utterance section. The section before and/or the section after the specified utterance section is also specified, and an acoustic feature quantity is calculated from the power and the pitch of the speech signal in that section. From the acoustic feature quantity of the utterance section and the acoustic feature quantities of the sections before and after it, the device determines whether the input speech signal is a system request asking a connected processing device that performs various processes to execute a process. Highly accurate speech discrimination can therefore be achieved.

For example, because acoustic feature quantities are calculated from each of the utterance section and the sections before and after the utterance in order to determine whether the input signal is a system request, more feature quantities are available for discrimination than when they are calculated from the utterance section alone, and as a result highly accurate speech discrimination can be achieved.

In addition, according to Embodiment 1, one or more of the mean, the standard deviation, the maximum, and the difference between the maximum and the minimum of each of the power and the pitch of the speech signal in the specified utterance section are calculated as acoustic feature quantities, so that more accurate speech discrimination can be achieved.

In addition, according to Embodiment 1, the section before and/or the section after the specified utterance section is specified, and one or more of the "mean, standard deviation, maximum, and difference between maximum and minimum" of each of the power and the pitch of the speech signal in the specified section are calculated as acoustic feature quantities, so that more accurate speech discrimination can be achieved than when only the acoustic feature quantities of the utterance section are used.

For example, by using all of the "mean, standard deviation, maximum, and difference between maximum and minimum" as acoustic features in each of the sections before and after the utterance section, up to 24 dimensions in total can be used as acoustic feature quantities, which brings out the characteristics more clearly and thus allows more accurate speech discrimination. Conversely, acoustic feature quantities can be calculated for only the section before or only the section after where appropriate; excluding parameters that do not show the characteristics clearly reduces the processing load on the system and increases the processing speed.

Embodiment 1 described the case in which speech discrimination is performed using the acoustic feature quantities of the utterance section and the sections before and after it. The present invention is not limited to this, however, and speech discrimination may additionally use a language feature quantity of the input speech signal.

Embodiment 2 therefore describes, with reference to FIGS. 7 to 10, experimental results for the case in which speech discrimination uses the language feature quantity of the input speech signal in addition to the acoustic feature quantities of the utterance section and the sections before and after it.

[Experimental environment]
First, Embodiment 2 assumes that two or more people and the system are present at the same time. This is a natural situation, for example when other people are nearby while a robot is being operated, or when a passenger is in the front passenger seat while a car navigation system is being operated. A robot that moves in response to voice commands was used as the system.

Two or more people conversed with each other and, at moments of their own choosing, uttered system requests to the robot such as "take a picture" or "come over here". Utterances such as "follow me", which the current robot cannot accept but which the user uttered expecting the robot to act, were also counted as system request utterances. Recording was done with microphones attached to the chests of the two speakers. There were 330 utterances, of which 49 were system requests.

[Combination of acoustic and language feature quantities]
For the language feature quantity, a vector space is prepared whose dimensions are the distinct words contained in the speech recognition results of all utterances, and the number of occurrences of each word within one utterance is used as the corresponding vector element (reference: Sako, SIG-SLP-64, 19-24, 2006-12). For the acoustic feature quantity, fillers, laughter, and hesitations often occur in the sections before and after an utterance in chat in particular, whereas the sections before and after a system request utterance are thought to be silent in many cases. The signal is therefore divided into three sections (before, during, and after the utterance), the power and the pitch are obtained for each in the same way, and the resulting 24-dimensional acoustic feature quantity is used for discrimination. The acoustic and language feature quantities were combined, as shown in FIG. 7, by concatenating the feature vectors using a scaling coefficient to obtain a new feature quantity. FIG. 7 is a diagram showing an example of the combination of the acoustic and language feature quantities. RMS (Root Mean Square) was used for the power, and F0 for the pitch.
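The word-count vector and the scaled concatenation can be sketched as below. FIG. 7 is not reproduced here, so applying the scaling coefficient to the language part (rather than to the acoustic part) is an assumption, as are all function names.

```python
from collections import Counter
from typing import List, Sequence


def build_vocabulary(utterances: Sequence[Sequence[str]]) -> List[str]:
    """Distinct words over all recognized utterances, in a fixed (sorted) order."""
    return sorted({word for utterance in utterances for word in utterance})


def language_features(words: Sequence[str], vocabulary: Sequence[str]) -> List[float]:
    """Occurrence count of each vocabulary word within one utterance."""
    counts = Counter(words)
    return [float(counts[word]) for word in vocabulary]


def combine(acoustic: Sequence[float], language: Sequence[float], scale: float) -> List[float]:
    """Concatenate both vectors; `scale` weights the language part (assumed side)."""
    return list(acoustic) + [scale * value for value in language]
```

With a 24-dimensional acoustic vector and a 566-word vocabulary, `combine` returns the 590-dimensional vector used in the experiment.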

[Experiment]
An RBF kernel was used as the kernel function of the SVM that performs speech discrimination, and evaluation was performed by leave-one-out. Speech discrimination using the acoustic feature quantities follows the same method as in Embodiment 1, so its detailed description is omitted here. Comparing the cases in which the F-measure was highest, accuracy is better when the three sections are used than when only the utterance section is used (see FIG. 10).
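The three ingredients named above (RBF kernel, leave-one-out evaluation, F-measure) can be sketched with the standard library alone. The classifier itself is passed in as a callable `fit_predict`, a hypothetical interface standing in for SVM training and prediction; the value of `gamma` is not given in the source.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, y: Vector, gamma: float) -> float:
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def leave_one_out(
    samples: List[Vector],
    labels: List[int],
    fit_predict: Callable[[List[Vector], List[int], Vector], int],
) -> List[int]:
    """Predict each sample with a model trained on all the other samples."""
    predictions = []
    for i, held_out in enumerate(samples):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(fit_predict(train_x, train_y, held_out))
    return predictions


def f_measure(labels: Sequence[int], predictions: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because only 49 of the 330 utterances are system requests, an F-measure on the system request class is more informative here than plain accuracy, which a classifier could inflate by always answering "chat".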

As for the speech recognition conditions used to obtain the language feature quantity, the baseline acoustic model was built from the lecture speech of 200 male speakers in the CSJ monitor data. The acoustic analysis conditions and the HMM specifications are shown in FIG. 8. Acoustic model adaptation was performed by MLLR+MAP, with about 10 minutes of adaptation data. Acoustic model adaptation was closed, that is, it included the test set. Language model adaptation was open: the recognition language model for speaker A was created from the utterances of speaker B. FIG. 8 is a diagram showing an example of the acoustic analysis conditions and the HMM specifications.

Under these conditions, speech recognition with Julius gave a word accuracy of 42.1%, and the language feature quantity generated from the recognition results was a 566-dimensional vector. FIGS. 9 and 10 show the results of system request discrimination using it. As FIGS. 9 and 10 show, system requests can be discriminated more accurately than with the acoustic feature quantities. Furthermore, when discrimination uses the acoustic and language feature quantities together, 590 dimensions in total (24 acoustic plus 566 language), system requests are discriminated more accurately than with the acoustic feature quantities alone or the language feature quantity alone, as FIGS. 9 and 10 also show. The scaling between the language and acoustic feature quantities was chosen experimentally as the value giving the highest discrimination rate. FIGS. 9 and 10 are diagrams showing the system request discrimination results.

[Effects of Embodiment 2]
As described above, according to Embodiment 2, speech recognition is performed to extract the content of the utterance from the input speech signal as text data, and a language feature quantity (for example, the importance or weight of each item of text data) is generated from the recognized text. From the acoustic feature quantity of the utterance section, the acoustic feature quantities of the sections before and after it, and the generated language feature quantity, the device determines whether the input speech signal is a system request asking a connected processing device that performs various processes to execute a process. Even more accurate speech discrimination can therefore be achieved.

Embodiments of the present invention have been described so far, but the present invention may be carried out in various forms other than those embodiments. Different embodiments are therefore described below under the headings (1) acoustic feature quantities, (2) speech discrimination, (3) utterance section specifying method, (4) system configuration and the like, and (5) program.

(1) Acoustic feature quantities
For example, Embodiments 1 and 2 described the case in which the "mean, standard deviation, maximum, and difference between maximum and minimum" of each of the power and the pitch are used as acoustic feature quantities. The present invention is not limited to this, and any combination may be used, for example only the "mean and standard deviation".

Embodiments 1 and 2 also described the case in which the acoustic feature quantities of the utterance section and of the sections both before and after it are used. The present invention is not limited to this, and the acoustic feature quantity of only one of the adjacent sections may be used together with that of the utterance section, for example the utterance section and the section after it, or the utterance section and the section before it.

(2) Speech discrimination
Embodiments 1 and 2 described the case in which an SVM is used as the speech discrimination means. The present invention is not limited to this, and various pattern classification methods can be used, such as AdaBoost, mean distribution, mean vector analysis, probability statistics, GMM (Gaussian Mixture Model), and decision trees.

(3) Utterance section specifying method
Embodiments 1 and 2 described the case in which the utterance section is specified from the power of the input speech signal. The present invention is not limited to this, and the utterance section can be specified by various methods. Specifically, it can be specified using a known method such as VAD (Voice Activity Detection).

For example, a method may be used in which speech features are extracted by MFCC (Mel Frequency Cepstrum Coefficient) or the like, the likelihood of speech is calculated by comparing the extracted features with acoustic models that describe the characteristics of speech and of noise, and the utterance section is specified by using this likelihood to decide whether each part is speech or non-speech (noise).
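A minimal sketch of the likelihood comparison, assuming the feature vector of each frame has already been extracted. A single diagonal-covariance Gaussian per class is used here as a simplified stand-in for the acoustic models; the source does not specify the model form, and the names `gaussian_log_likelihood` and `is_speech` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Model = Tuple[Sequence[float], Sequence[float]]  # (per-dimension means, variances)


def gaussian_log_likelihood(
    x: Sequence[float], mean: Sequence[float], var: Sequence[float]
) -> float:
    """Log-likelihood of x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def is_speech(frame: Sequence[float], speech_model: Model, noise_model: Model) -> bool:
    """A frame counts as speech when the speech model explains it better than the noise model."""
    return gaussian_log_likelihood(frame, *speech_model) > gaussian_log_likelihood(
        frame, *noise_model
    )
```

Runs of consecutive frames for which `is_speech` is true would then form the candidate utterance sections; smoothing of isolated frames is left out of this sketch.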

(4) System configuration and the like
Of the processes described in the embodiments, all or part of the processes described as being performed automatically (for example, the feature weighting process and the speech recognition process) may be performed manually, and all or part of the processes described as being performed manually may be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and in the drawings (for example, FIG. 3, FIG. 4, and FIG. 8) can be changed arbitrarily unless otherwise noted.

The components of each illustrated device are functional and conceptual, and need not be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to the illustrated one, and all or part of them may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like (for example, the first acoustic feature calculation unit and the second acoustic feature calculation unit may be integrated). Furthermore, all or any part of the processing functions performed in each device may be implemented by a CPU and a program analyzed and executed by that CPU, or implemented as hardware using wired logic.

(5) Program
The speech discrimination method described in the embodiments can be implemented by executing a program prepared in advance on a computer such as a personal computer or a workstation. This program can be distributed via a network such as the Internet. The program can also be recorded on a computer-readable recording medium such as a hard disk, a flexible disk (FD), a CD-ROM, an MO, or a DVD, and executed by being read from the recording medium by the computer.

As described above, the voice discrimination device according to the present invention is useful for discriminating whether an input voice signal is a request to the system or chat, and is particularly suited to achieving highly accurate voice discrimination.

FIG. 1 is a system configuration diagram showing the overall configuration of a system including the voice discrimination device according to Embodiment 1.
FIG. 2 is a block diagram showing the configuration of the voice discrimination device according to Embodiment 1.
FIG. 3 is a diagram showing an example of an input voice signal stored in the input signal DB.
FIG. 4 is a diagram showing an example of the information stored in the discrimination result DB.
FIG. 5 is a diagram for explaining request discrimination by the SVM.
FIG. 6 is a flowchart showing the flow of the voice discrimination process in the voice discrimination device according to Embodiment 1.
FIG. 7 is a diagram showing an example of combinations of acoustic feature quantities and language feature quantities.
FIG. 8 is a diagram showing an example of the acoustic analysis conditions and the HMM specifications.
FIG. 9 is a diagram showing system request discrimination results.
FIG. 10 is a diagram showing system request discrimination results.

Explanation of symbols

10 Voice discrimination device
11 Communication control I/F unit
12 Input unit
13 Storage unit
14 Input signal DB
15 Discrimination result DB
20 Control unit
21 Utterance section specifying unit
22 First acoustic feature calculation unit
23 Second acoustic feature calculation unit
24 SVM
25 Instruction output unit

Claims

A voice discrimination device comprising:
utterance section specifying means for specifying, from an input voice signal, an utterance section indicating a section in which an utterance is made;
first acoustic feature quantity calculating means for calculating an acoustic feature quantity indicating a feature of the voice signal from the power and pitch of the voice signal in the utterance section specified by the utterance section specifying means;
second acoustic feature quantity calculating means for specifying a section preceding and/or a section following the utterance section specified by the utterance section specifying means, and calculating an acoustic feature quantity from the power and pitch of the voice signal in the specified section; and
voice discrimination means for discriminating, from the acoustic feature quantity calculated by the first acoustic feature quantity calculating means and the acoustic feature quantity calculated by the second acoustic feature quantity calculating means, whether or not the input voice signal is a system request that requests a connected processing device, which executes various processes, to execute a process.
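
As an illustration only, the sectioning in this claim might be sketched as follows. The claim does not state how the utterance section is found or how long the preceding and following sections are, so the fixed power threshold, the first-to-last-active-frame rule, and the `context` length below are all assumptions, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

Section = Tuple[int, int]  # half-open frame range [start, end)


def find_utterance_section(power: List[float], threshold: float) -> Optional[Section]:
    """Return the span from the first to the last frame whose power exceeds
    the threshold, or None if no frame does (assumed detection rule)."""
    active = [i for i, p in enumerate(power) if p > threshold]
    if not active:
        return None
    return active[0], active[-1] + 1


def surrounding_sections(section: Section, n_frames: int,
                         context: int) -> Tuple[Section, Section]:
    """Return the sections immediately before and after the utterance,
    each at most `context` frames long and clipped to the signal."""
    start, end = section
    before = (max(0, start - context), start)
    after = (end, min(n_frames, end + context))
    return before, after


# Hypothetical per-frame power values: quiet, then speech, then quiet.
power = [0.1, 0.2, 0.1, 5.0, 6.5, 4.8, 0.2, 0.1]
utterance = find_utterance_section(power, threshold=1.0)
before, after = surrounding_sections(utterance, len(power), context=2)
```

The power and pitch frames inside `utterance`, `before`, and `after` would then be passed to the first and second acoustic feature quantity calculating means respectively.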

The voice discrimination device according to claim 1, wherein the first acoustic feature quantity calculating means calculates, as the acoustic feature quantity, any one or more of the average value, the standard deviation, the maximum value, and the difference between the maximum value and the minimum value of each of the power and pitch of the voice signal in the utterance section specified by the utterance section specifying means.

The voice discrimination device according to claim 1 or 2, wherein the second acoustic feature quantity calculating means specifies a section preceding and/or a section following the utterance section specified by the utterance section specifying means, and calculates, as the acoustic feature quantity, any one or more of the average value, the standard deviation, the maximum value, and the difference between the maximum value and the minimum value of each of the power and pitch of the voice signal in the specified section.
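
The four statistics named in the two preceding claims can be computed per section as in the minimal sketch below. It assumes power and pitch are already available as per-frame lists for one section; the function name and feature keys are hypothetical, and the use of the population standard deviation is an assumption because the claims do not say which definition is intended.

```python
import statistics
from typing import Dict, List


def section_features(power: List[float], pitch: List[float]) -> Dict[str, float]:
    """Compute mean, standard deviation, maximum, and max-min difference
    for both power and pitch of one section (eight values in total)."""
    features: Dict[str, float] = {}
    for name, values in (("power", power), ("pitch", pitch)):
        features[f"{name}_mean"] = statistics.fmean(values)
        # Population standard deviation; an assumed choice.
        features[f"{name}_std"] = statistics.pstdev(values)
        features[f"{name}_max"] = max(values)
        features[f"{name}_range"] = max(values) - min(values)
    return features


# Hypothetical frame values for a single section.
feats = section_features(power=[1.0, 2.0, 3.0], pitch=[100.0, 120.0, 140.0])
```

The same function can serve both calculating means: it is called once on the utterance section and once on each preceding or following section, and any subset of the resulting values can be used as the acoustic feature quantity.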

The voice discrimination device according to any one of claims 1 to 3, further comprising language feature quantity generating means for performing speech recognition that extracts the content of the utterance as character data from the input voice signal, and for generating a language feature quantity from the recognized character data,
wherein the voice discrimination means discriminates, from the acoustic feature quantity calculated by the first acoustic feature quantity calculating means, the acoustic feature quantity calculated by the second acoustic feature quantity calculating means, and the language feature quantity generated by the language feature quantity generating means, whether or not the input voice signal is a system request that requests a connected processing device, which executes various processes, to execute a process.
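
One possible way to combine the three feature quantities is to concatenate them into a single vector and apply a linear decision rule of the kind a trained linear SVM produces. The sketch below is not the claimed implementation: the word-count language feature, the whitespace tokenization, the vocabulary, and the weights and bias are all hypothetical placeholders rather than trained values.

```python
from typing import List, Sequence


def bag_of_words(text: str, vocabulary: Sequence[str]) -> List[float]:
    """Count occurrences of each vocabulary word in the recognized text
    (assumed language feature; assumes whitespace-separated words)."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]


def combine(utterance: List[float], context: List[float],
            language: List[float]) -> List[float]:
    """Concatenate utterance, surrounding-section, and language features."""
    return utterance + context + language


def is_system_request(x: Sequence[float], weights: Sequence[float],
                      bias: float) -> bool:
    """Linear decision rule: system request if w . x + b is positive."""
    if len(x) != len(weights):
        raise ValueError("feature vector and weight vector differ in length")
    return sum(w * v for w, v in zip(weights, x)) + bias > 0.0


vocabulary = ["play", "music"]          # hypothetical vocabulary
weights = [0.2, -0.3, 1.0, 1.0]         # placeholder, not trained
bias = -1.0                             # placeholder, not trained

request = combine([0.5], [0.1], bag_of_words("please play some music", vocabulary))
chat = combine([0.5], [0.1], bag_of_words("nice weather today", vocabulary))
request_result = is_system_request(request, weights, bias)
chat_result = is_system_request(chat, weights, bias)
```

In practice the weights would come from training on labeled utterances, and a non-linear kernel could replace the linear rule without changing how the feature vector is assembled.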