JP2020067495A

JP2020067495A - Device, method and program which analyze voice

Info

Publication number: JP2020067495A
Application number: JP2018198271A
Authority: JP
Inventors: 嘉山　啓; Hiroshi Kayama; 啓嘉山
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2018-10-22
Filing date: 2018-10-22
Publication date: 2020-04-30
Also published as: WO2020085323A1

Abstract

To appropriately determine a speaker's intention even when it is difficult to determine that intention from the pitch transition at the end of an utterance interval alone. SOLUTION: A device specifies a plurality of partial utterance intervals included in one utterance interval in a voice signal, and analyzes the change of the voice signal for each partial utterance interval. Specifically, the device divides the voice signal into utterance intervals UP1 separated from one another by silent intervals whose duration is longer than a time threshold TH4, and divides each utterance interval into one or more partial utterance intervals PUP1 to PUP3 separated by silent intervals whose duration is shorter than the time threshold TH4. SELECTED DRAWING: Figure 2

Description

The present invention relates to a voice analysis device, a voice analysis method, and a voice analysis program suitable for a dialogue device or the like.

In a dialogue device that provides a response to a user's utterance, realizing a natural dialogue requires the dialogue device to determine the speaker's intention based on aspects of the utterance such as its pitch change, and to provide a response corresponding to that intention. One technique that meets this requirement is disclosed in Patent Document 1, which controls the response based on the pitch change at the end of the utterance section.

Patent Document 1: JP-A-2015-69038

When an utterance section is extracted from a voice signal and the speaker's intention is determined from the pitch transition at the end of that utterance section, the determination can become difficult if a pitch transition expressing a question occurs in the middle of the utterance section.

In the following examples, a punctuation mark (、 or 。) represents a falling pitch transition and a question mark represents a rising pitch transition.
Example 1: 「今日、ラーメンでいい？」 ("Today, is ramen OK?")
Example 2: 「今日、ラーメンでいい？ね。」 ("Today, is ramen OK? Right.")
Example 3: 「今日、ラーメンでいい？ね？」 ("Today, is ramen OK? Right?")

In Example 1, when the dialogue device analyzes the voice per utterance section, it detects the rising pitch transition of 「いい？」 at the end of the utterance section, determines that the utterance is intended as a question, and outputs a response recorded in advance for utterances with question intent. In this case, an appropriate dialogue is realized.

In Example 2, when the dialogue device analyzes the voice per utterance section, it detects the falling pitch transition of 「ね。」 at the end of the utterance section, determines that the utterance is intended as a confirmation, and outputs a response recorded in advance for utterances with confirmation intent. However, the pitch of 「いい？」, which precedes the final 「ね。」, rises and expresses a question. The response therefore does not match the speaker's intention, and the dialogue is inappropriate.

In Example 3, when the dialogue device analyzes the voice per utterance section, it detects the rising pitch transition of 「ね？」 at the end of the utterance section, determines that the utterance is intended as a question, and outputs a response recorded in advance for utterances with question intent. However, in determining the speaker's intention, the dialogue device does not consider the rising pitch transition of 「いい？」, which precedes the final 「ね？」, and therefore fails to judge the strength of the question intent. The dialogue is again inappropriate.

To respond appropriately to the speaker, it is also conceivable to perform voice recognition on the utterance and analyze the speaker's intention. However, voice recognition makes the device larger in scale and lengthens the time from utterance to response.

The present invention has been made in view of the above circumstances, and its object is to provide technical means capable of determining a speaker's intention appropriately and simply, even when it is difficult to determine that intention from the pitch transition at the end of the utterance section alone.

The present invention provides a voice analysis device having a specifying unit that specifies, in a voice signal, a plurality of partial utterance sections included in one utterance section, and an analysis unit that analyzes the change of the voice signal for each partial utterance section.

FIG. 1 is a block diagram showing the configuration of a dialogue device according to an embodiment of the present invention.
FIG. 2 is a time chart explaining the function of the dialogue device as a voice analysis device.
FIG. 3 is a functional block diagram showing the configuration of the functions realized when the control device of the embodiment executes the voice analysis program.
FIG. 4 is a flowchart showing the processing contents of the voice analysis program.
FIG. 5 is a flowchart showing the processing contents of the utterance section processing of the voice analysis program.
FIG. 6 is a time chart showing a first operation example of the embodiment.
FIG. 7 is a time chart showing a second operation example of the embodiment.
FIG. 8 is a time chart showing a third operation example of the embodiment.

Embodiments of the present invention will be described below with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a dialogue device that is an embodiment of the voice analysis device according to the present invention. The dialogue device includes a control device 1, an arithmetic device 2, a storage device 3, a display device 4, an operation device 5, a sound pickup device 6, and a sound emitting device 7.

The control device 1 is the control center of the dialogue device and is constituted by a CPU. The storage device 3 has a volatile storage unit such as a RAM and a non-volatile storage unit such as a ROM or a hard disk. The non-volatile storage unit stores various programs, including a voice analysis program that analyzes the user's uttered voice and a voice synthesis program that synthesizes a response voice to the user's uttered voice based on the analysis result. The control device 1 uses the volatile storage unit as a work area and executes the programs stored in the non-volatile storage unit. The arithmetic device 2 is, for example, a DSP; when the control device 1 executes the voice analysis program or the voice synthesis program, the arithmetic device 2 performs the arithmetic processing for voice analysis and voice synthesis under the control of the control device 1. The display device 4 is, for example, a liquid crystal panel and displays various information to the user. The operation device 5 includes various operators, such as a keyboard and a mouse, for receiving instructions from the user.
The sound pickup device 6 includes a microphone that picks up the user's uttered voice and an A/D converter that A/D-converts the analog voice signal output by the microphone and outputs a sample sequence of the voice signal. The control device 1 takes the sample sequence output by the sound pickup device 6 as its processing target, executes the voice analysis program described above together with the voice synthesis program, and outputs a sample sequence of the response voice. The sound emitting device 7 includes a D/A converter that D/A-converts the sample sequence of the response voice and outputs an analog voice signal, and a speaker that emits the analog voice signal as sound.

In the present embodiment, the control device 1 functions as a voice analysis device by executing the voice analysis program. FIG. 2 is a time chart explaining the function of the control device 1 as a voice analysis device. In FIG. 2, the horizontal axis represents time and the vertical axis represents the volume (sound pressure level) of the voice signal to be processed.

In the present embodiment, the control device 1 divides the sample sequence of the voice signal output from the sound pickup device 6 into frames of a fixed time length, and executes the voice analysis program 10 while monitoring the generation time of each frame. FIG. 3 is a functional block diagram showing the configuration of the functions realized when the control device 1 executes the voice analysis program 10. As shown in FIG. 3, the functional configuration based on the voice analysis program 10 includes a specifying unit 11 and an analysis unit 12. The specifying unit 11 (control device 1) specifies, in the voice signal, a plurality of partial utterance sections PUP1 to PUP3 included in one utterance section UP1. The analysis unit 12 (control device 1) analyzes the change of the voice signal for each partial utterance section.

More specifically, the specifying unit 11 specifies, in the voice signal, utterance sections UP1 each having an end t6 at which the end of utterance is determined by a first end determination criterion (hereinafter, the first criterion). Within the utterance section UP1, it then specifies a plurality of partial utterance sections PUP1 to PUP3 having ends t2, t4, and t6 at which the end of utterance is determined by a second end determination criterion (hereinafter, the second criterion) that allows finer subdivision than the first criterion.

Here, the first and second end determination criteria are, for example, criteria concerning the length of the silent section that lasts from when the volume of the voice signal falls below a threshold TH2 until it exceeds a threshold TH1, which is larger than TH2.

In the example shown in FIG. 2, the volume of the voice signal falls below the threshold TH2 at time t6, and the volume still has not exceeded the threshold TH1 at time t7, by which a time longer than the threshold TH4 has elapsed. That is, the duration of the silent section (the silence time) after time t6 exceeds the threshold TH4. Time t6 is therefore determined to be the end of the utterance section UP1.

On the other hand, in the example shown in FIG. 2, the volume of the voice signal falls below the threshold TH2 at time t2 and exceeds the threshold TH1 at time t3, after a time longer than the threshold TH3 (which is shorter than the threshold TH4) has elapsed. That is, the silence time after time t2 is shorter than the threshold TH4 and longer than the threshold TH3. Time t2 is therefore determined to be the end of the partial utterance section PUP1. The same applies to the partial utterance section PUP2.

Here, the threshold TH3 used in the second criterion, which concerns the end of a partial utterance section, is shorter than the threshold TH4 used in the first criterion, which concerns the end of an utterance section. Using the second criterion therefore makes it possible to subdivide an utterance section detected by the first criterion into shorter partial utterance sections. In other words, as a criterion for silent sections, the second criterion (TH3) can be said to be looser than the first criterion (TH4). "Looser" here means that, within one utterance section delimited by the first criterion, the second criterion can additionally detect short silent sections that delimit partial utterance sections.
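The relationship between the two criteria can be illustrated with a minimal sketch, assuming that a low-volume gap is classified purely by its duration. The names `GapKind` and `classify_gap` are hypothetical and do not appear in the publication, and the handling of a duration exactly equal to a threshold is an assumption.

```python
from enum import Enum


class GapKind(Enum):
    NOT_SILENCE = "not_silence"            # too short to count as a silent section
    PARTIAL_BOUNDARY = "partial_boundary"  # delimits partial utterance sections
    UTTERANCE_END = "utterance_end"        # ends the whole utterance section


def classify_gap(duration: float, th3: float, th4: float) -> GapKind:
    """Classify a low-volume gap by its duration (same time unit as th3 and th4)."""
    if not th3 < th4:
        raise ValueError("TH3 must be shorter than TH4")
    if duration > th4:      # first criterion: the utterance section has ended
        return GapKind.UTTERANCE_END
    if duration > th3:      # second criterion: only a partial utterance section has ended
        return GapKind.PARTIAL_BOUNDARY
    return GapKind.NOT_SILENCE
```

With hypothetical values TH3 = 0.2 s and TH4 = 1.0 s, a gap of 0.5 s would delimit partial utterance sections, while a gap of 1.5 s would end the utterance section.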

In the present embodiment, the end of a partial utterance section or an utterance section is determined based on the length of the silent section. Viewed from this standpoint, the specifying unit 11 of the functional configuration of the voice analysis program 10 specifies, in the voice signal, utterance sections separated from one another by silent sections whose duration is longer than the threshold TH4, and specifies, within each utterance section, one or more partial utterance sections PUP1 to PUP3 separated by short silent sections t2 to t3 and t4 to t5 whose duration is shorter than the time threshold TH4. The analysis unit 12 of the same functional configuration analyzes the change of the voice signal for each partial utterance section.
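As an illustration of this two-level division, the following is a simplified offline sketch rather than the frame-by-frame procedure of the flowcharts. It assumes per-frame volume values, thresholds TH3 and TH4 expressed in frames, and that a gap not longer than TH3 simply extends the current partial utterance section. It ignores the minimum-duration threshold TH5, and the function name `segment` is hypothetical.

```python
def segment(volumes, th1, th2, th3, th4):
    """Split per-frame volumes into utterance sections.

    Returns a list of utterance sections, each a list of
    (start_frame, end_frame) partial utterance sections (end exclusive).
    th3 and th4 are gap lengths in frames, with th3 < th4.
    """
    utterances = []
    partials = []       # partial sections of the utterance section being built
    start = None        # start frame of the currently open partial section
    gap_start = None    # frame at which the volume last fell below th2
    in_speech = False
    for i, v in enumerate(volumes):
        if in_speech:
            if v < th2:                     # speech ends: a gap begins
                in_speech = False
                gap_start = i
        elif v > th1:                       # speech (re)starts
            gap = None if gap_start is None else i - gap_start
            if gap is not None and gap > th3:
                partials.append((start, gap_start))   # gap is a real silence
                if gap > th4:                         # long gap: utterance ended
                    utterances.append(partials)
                    partials = []
                start = i
            elif start is None:
                start = i                   # very first speech frame
            in_speech = True
    if start is not None:                   # flush the open partial section
        partials.append((start, len(volumes) if in_speech else gap_start))
    if partials:
        utterances.append(partials)
    return utterances
```

For example, with hypothetical thresholds `th1=5, th2=2, th3=2, th4=5`, the volume sequence `[0, 0, 6, 6, 1, 1, 1, 6, 6, 1, 6, 1, 1, 1, 1, 1, 1, 6, 6]` yields two utterance sections, the first of which contains two partial utterance sections.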

In the present embodiment, the control device 1 executes the voice synthesis program in parallel with the voice analysis program 10. The voice analysis program 10 analyzes, for each partial utterance section constituting an utterance section, the pitch transition and other changes of the voice signal, and passes the analysis result to the voice synthesis program. Based on this analysis result, the voice synthesis program determines the content of the response to the user's utterance, synthesizes a sample sequence of the response voice, and supplies it to the sound emitting device 7. That is, by executing the voice synthesis program in parallel with the voice analysis program 10, the control device 1 functions as a voice synthesis device that synthesizes a response voice to the voice in the utterance section.
The above is the configuration of the present embodiment.
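The publication does not specify in this passage how the pitch transition of a partial utterance section is computed. As a purely illustrative sketch, the code below assumes each partial utterance section is given as a list of per-frame pitch values in Hz and labels the transition by comparing the first and last of the final few frames. The names `pitch_transition` and `analyze_partials` and the parameters `tail` and `tolerance` are hypothetical.

```python
def pitch_transition(pitches, tail=3, tolerance=1.0):
    """Label the pitch movement over the last `tail` frames of one section."""
    if not pitches:
        raise ValueError("a partial utterance section needs at least one pitch value")
    tail_values = pitches[-tail:]
    delta = tail_values[-1] - tail_values[0]
    if delta > tolerance:
        return "rise"
    if delta < -tolerance:
        return "fall"
    return "flat"


def analyze_partials(partials):
    """Return one transition label per partial utterance section."""
    return [pitch_transition(p) for p in partials]
```

Applied to made-up pitch tracks shaped like Example 2 (a falling 「今日、」, a rising 「ラーメンでいい？」, a falling 「ね。」), the per-section result keeps the rise in the middle section, which an analysis of the final section alone would miss.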

FIG. 4 is a flowchart showing the processing contents of the voice analysis program 10 in the present embodiment. FIG. 5 is a flowchart showing the processing contents of the utterance section processing S4 in the program 10. Of the processes shown in FIG. 5, S42433 is executed by the analysis unit 12 (control device 1) described above, and the other processes are executed by the specifying unit 11 (control device 1) described above. FIGS. 6 to 8 are time charts showing first to third operation examples of the present embodiment. In FIGS. 6 to 8, the horizontal axis represents time and the vertical axis represents the volume of the voice signal to be processed.

First, the first operation example of FIG. 6 will be described with reference to the flowcharts of FIGS. 4 and 5. When a predetermined operation is performed on the operation device 5, the control device 1 starts executing the voice analysis program 10 and the voice synthesis program stored in the storage device 3. Because the feature of the present embodiment lies in the voice analysis program 10, the following description focuses on the processing contents of the voice analysis program 10.

In the following description, a provisional utterance section is a section that starts when the volume of the voice signal exceeds the threshold TH1. In the present embodiment, a section whose duration is longer than the threshold TH5 is treated as a partial utterance section. At the moment the volume exceeds the threshold TH1, it is therefore still unknown whether the section starting at that moment will become a partial utterance section, so the section is first treated as a provisional utterance section. When the duration of the provisional utterance section exceeds the threshold TH5, it becomes a partial utterance section. Similarly, a provisional silent section is a section that starts when the volume of the voice signal falls below the threshold TH2. In the present embodiment, a section whose duration is longer than the threshold TH3 is treated as a silent section. At the moment the volume falls below the threshold TH2, it is unknown whether the section starting at that moment will become a silent section, so the section is first treated as a provisional silent section. When the duration of the provisional silent section exceeds the threshold TH3, it becomes a silent section.

In the voice analysis program 10, the control device 1 first executes an initialization process S1. In the initialization process S1, the control device 1 sets the provisional silence time (the duration of the provisional silent section) to 0, the number of partial utterance sections to 0, the provisional utterance time (the duration of the provisional utterance section) to 0, and the provisional utterance section state flag to OFF.
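The four values set by the initialization process S1 can be pictured as one small state record. The following is a minimal sketch, assuming times are held as floating-point seconds; the class and field names are hypothetical and chosen only to mirror the description.

```python
from dataclasses import dataclass


@dataclass
class AnalyzerState:
    provisional_silence_time: float = 0.0    # duration of the provisional silent section
    partial_section_count: int = 0           # number of partial utterance sections
    provisional_utterance_time: float = 0.0  # duration of the provisional utterance section
    in_provisional_utterance: bool = False   # provisional utterance section state flag (OFF)


def initialize() -> AnalyzerState:
    """S1: return a state with every value reset as described above."""
    return AnalyzerState()
```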

Next, the control device 1 acquires a sample sequence of the input voice signal for one frame from the sound pickup device 6 and stores it in a buffer area in the storage device 3 (S2). The control device 1 then extracts parameters of the input voice, such as pitch and volume, from the sample sequence stored in the buffer area (S3). Next, the control device 1 executes the utterance section processing S4 shown in FIG. 5. The utterance section processing S4 extracts utterance sections and partial utterance sections from the voice signal and analyzes the change of the voice signal for each partial utterance section constituting an utterance section. Next, the control device 1 determines whether an end instruction has been issued, for example by an operation on the operation device 5 (S5). If the determination result is "YES", the control device 1 ends the voice analysis program 10. If the determination result is "NO", the control device 1 returns to S2 and executes the processes S2 to S4 again. In this way, the processes S2 to S5 are repeated until an end instruction is issued.
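The S2 to S5 loop can be sketched as follows. This is an illustrative outline only: the frame source, the parameter extractor, the per-frame processing, and the end check are passed in as hypothetical callables; the example extractor computes only an RMS volume (pitch extraction is omitted); and, unlike the description, the sketch also stops when the frame source is exhausted.

```python
import math


def rms_volume(samples):
    """Example S3 extractor: root-mean-square volume of one frame."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def run_analysis(frames, extract, process, should_stop):
    """Run the S2-S5 loop and return the number of frames processed."""
    count = 0
    for samples in frames:          # S2: acquire one frame of samples
        params = extract(samples)   # S3: extract parameters such as volume
        process(params)             # S4: utterance section processing
        count += 1
        if should_stop():           # S5: end instruction issued?
            break
    return count
```

A caller could, for instance, pass `rms_volume` as `extract` and a function that updates the segmentation state as `process`.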

Next, the processing contents of the utterance section processing S4 of FIG. 5 will be described.
In the utterance section processing S4 of FIG. 5, the control device 1 first determines whether the provisional utterance section state flag is OFF (S41). In the first operation example of FIG. 6, the provisional utterance section state flag remains OFF after the initialization process S1 for as long as the volume of the voice signal is equal to or lower than the threshold TH1. The determination result of S41 is therefore "YES", and the control device 1 proceeds to the provisional silent section processing S42.

In the provisional silent section processing S42, the control device 1 first determines whether the volume of the voice signal is higher than the threshold TH1 (S421). In the first operation example of FIG. 6, the volume of the voice signal is lower than the threshold TH1 in the period before time t1, so the determination result of S421 is "NO". As a result, the control device 1 executes the provisional silent section continuation processing S424.

In the provisional silent section continuation processing S424, the control device 1 first updates the provisional silence time (S4241). Specifically, the time elapsed since the most recent of the execution timings of the initialization process S1, S42434, S4241, and S433 is added to the provisional silence time. The resulting provisional silence time is the time elapsed from the start of the current provisional silent section to the present. Next, the control device 1 determines whether the provisional silence time is longer than the threshold TH4 (S4242). If the determination result is "NO", the control device 1 ends the provisional silent section continuation processing S424, the provisional silent section processing S42, and the utterance section processing S4, and proceeds to S5 of FIG. 4.
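The update and comparison in S4241 and S4242 amount to accumulating elapsed time and testing it against TH4. A minimal sketch, assuming times in integer milliseconds and a hypothetical function name:

```python
def continue_provisional_silence(provisional_silence_ms, elapsed_ms, th4_ms):
    """S4241/S4242: accumulate silence time and test it against TH4.

    Returns (updated provisional silence time, whether it now exceeds TH4).
    """
    updated = provisional_silence_ms + elapsed_ms   # S4241: add the elapsed time
    return updated, updated > th4_ms                # S4242: longer than TH4?
```

With a hypothetical TH4 of 1000 ms, accumulating 200 ms onto 900 ms gives 1100 ms, and the second return value becomes true.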

In the first operation example of FIG. 6, during the silent section before time t1, the determination result of S41 is "YES", that of S421 is "NO", and that of S4242 is "NO", so the update of the provisional silence time (S4241) is repeated. When the provisional silence time exceeds the threshold TH4, the determination result of S4242 becomes "YES" and the provisional silence time is reset to 0 in the processing from S42431 onward; the details are described later.

After this, the volume of the voice signal rises and exceeds the threshold TH1 at time t1. The determination result of S421 then becomes "YES", and the control device 1 determines whether the provisional silence time is longer than the threshold TH3 (S422). If the determination result is "YES", the control device 1 proceeds to S423. If the determination result of S422 is "NO", the control device 1 determines whether the number of partial utterance sections is 0 (S425). If that determination result is "YES", the control device 1 proceeds to S423.

At time t1 in the first operation example of FIG. 6, if the provisional silence time exceeds the threshold TH3, the determination result of S422 is "YES" and the processing proceeds to S423. If the provisional silence time at time t1 is equal to or less than the threshold TH3, the determination result of S422 is "NO" and the processing proceeds to S425; however, at time t1 immediately after the initialization process S1 the number of partial utterance sections is 0, so the determination result of S425 is "YES" and the processing proceeds to S423. Thus, at time t1, the processing proceeds to S423 regardless of whether the provisional silence time exceeds the threshold TH3.
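The branching among S421, S422, and S425 can be summarized in a small decision function. This is a sketch under the assumption that the next step is identified only by its label. The function name is hypothetical, and the branch taken when S425 is "NO" is not described in this passage, so the sketch only reports it.

```python
def next_step_in_provisional_silence(volume, provisional_silence_time,
                                     partial_section_count, th1, th3):
    """Return the label of the step that follows S421/S422/S425."""
    if not volume > th1:                    # S421 "NO": still quiet
        return "S424"                       # provisional silent section continues
    if provisional_silence_time > th3:      # S422 "YES": the gap was a real silence
        return "S423"                       # start a provisional utterance section
    if partial_section_count == 0:          # S425 "YES": no partial section yet
        return "S423"
    return "S425_NO"                        # branch not covered in this passage
```

At time t1 of the first operation example the partial section count is 0, so the function returns `"S423"` whatever the provisional silence time is.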

On proceeding to S423, the control device 1 executes the provisional utterance section start processing. Specifically, the control device 1 sets the provisional utterance section state flag to ON and initializes the provisional utterance time to 0. After finishing the provisional utterance section start processing S423, the control device 1 ends the provisional silent section processing S42 and the utterance section processing S4, and proceeds to S5 of FIG. 4.

Thereafter, in the utterance section processing S4, the provisional utterance section state flag is ON, so the determination result of S41 is "NO" and the control device 1 executes the provisional utterance section processing S43. In the provisional utterance section processing S43, the control device 1 first determines whether the volume of the input voice signal is less than the threshold TH2 (S431). In the first operation example of FIG. 6, the volume of the input voice signal is higher than the threshold TH2 from time t1 until time t2. During this period the determination result of S431 is therefore "NO", and the control device 1 executes the provisional utterance section continuation processing S434, which updates the provisional utterance time. Specifically, the time elapsed since the most recent of the execution timings of S423 and S434 is added to the provisional utterance time. The resulting provisional utterance time is the time elapsed from the start of the current provisional utterance section to the present. When S434 ends, the control device 1 ends the provisional utterance section processing S43 and the utterance section processing S4, and proceeds to S5 of FIG. 4.

After that, the volume of the input voice signal decreases and falls below the threshold TH2 at time t2. In the utterance section processing S4, the determination result of S41 is then "NO", and in the provisional utterance section processing S43 the determination result of S431 is "YES", so the control device 1 determines whether the provisional utterance time is longer than the threshold TH5 (S432). In the first operation example of FIG. 6, the provisional utterance time from time t1 to time t2 exceeds the threshold TH5. The determination result of S432 is therefore "YES", and the control device 1 executes the provisional silent section start processing S433. In the provisional silent section start processing S433, the control device 1 treats the section of the input voice signal from time t1 to time t2 as an unregistered partial utterance section PUP1, sets the provisional utterance section state flag to OFF, and initializes the provisional silence time to 0. At this point, the number of partial utterance sections is 1. When the provisional silent section start processing S433 ends, the control device 1 ends the provisional utterance section processing S43 and the utterance section processing S4, and proceeds to S5 of FIG. 4.

Thereafter, in the utterance section process S4, the provisional utterance section state flag is OFF, so the determination result of S41 is "YES" and the process proceeds to the provisional silent section process S42. In the provisional silent section process S42, when the volume of the input voice signal is less than the threshold TH1, the determination result of S421 is "NO" and the process proceeds to the provisional silent section continuation process S424. In S424, the provisional silence time is updated (S4241), and it is determined whether the provisional silence time is longer than the threshold TH4 (S4242). If the determination result of S4242 is "NO", the provisional silent section continuation process S424, the provisional silent section process S42, and the utterance section process S4 end, and the process proceeds to S5 in FIG. 4. In the first provisional silent section of the first operation example, the provisional silence time never exceeds the threshold TH4, and this processing is repeated until time t3.

The volume of the input voice signal then rises and exceeds the threshold TH1 at time t3. As a result, in the provisional silent section process S42, the determination result of S421 is "YES", and the control device 1 determines whether the provisional silence time is longer than the threshold TH3 (S422). In the first operation example, the provisional silence time t3 - t2 exceeds the threshold TH3, so the determination result of S422 is "YES". The control device 1 executes the provisional utterance section start process S423, ends the provisional silent section process S42 and the utterance section process S4, and proceeds to S5 in FIG. 4. From then until time t4, the control device 1 repeats the processing of S41, S431, and S434.

The volume of the input voice signal then falls and drops below the threshold TH2 at time t4. As a result, in the provisional utterance section process S43, the determination result of S431 is "YES", and the control device 1 determines whether the provisional utterance time t4 - t3 is longer than the threshold TH5 (S432). In the first operation example, the determination result of S432 is "YES". The control device 1 therefore executes the provisional silent section start process S433: it treats the section of the input voice signal from time t3 to time t4 as an unregistered partial utterance section PUP2, sets the provisional utterance section state flag to OFF, and initializes the provisional silence time to 0. At this point the number of partial utterance sections is 2. When the provisional silent section start process S433 ends, the control device 1 ends the provisional utterance section process S43 and the utterance section process S4 and proceeds to S5 in FIG. 4.

Thereafter, in the first operation example, t5 - t4 > TH3 holds at time t5, when the volume of the input voice signal rises above the threshold TH1, and t6 - t5 > TH5 holds at time t6, when the volume of the input voice signal falls below the threshold TH2. The operation in this case is the same as that performed for the partial utterance sections PUP1 and PUP2.

At time t6, in the provisional utterance section process S43 of the utterance section process S4, the determination results of S431 and S432 are both "YES", and the control device 1 executes the provisional silent section start process S433: it treats the section of the input voice signal from time t5 to time t6 as an unregistered partial utterance section PUP3, sets the provisional utterance section state flag to OFF, and initializes the provisional silence time to 0. Thereafter, the control device 1 repeats the processing of S41, S421, S4241, and S4242.

Then, in the first operation example, the provisional silence time exceeds the threshold TH4 at time t7, which confirms that this provisional silent section is a silent section. As a result, in the provisional silent section continuation process S424, the determination result of S4242 is "YES", and the control device 1 executes the partial utterance section process S4243.

In the partial utterance section process S4243, the control device 1 first determines whether the number of partial utterance sections is 1 or more (S42431). In the first operation example, three partial utterance sections, PUP1, PUP2, and PUP3, have been detected by time t7, so the number of partial utterance sections is 3. The determination result of S42431 is therefore "YES", and the control device 1 executes the utterance section configuration process S42432. Specifically, the control device 1 registers the section from time t1 to time t6, which contains the partial utterance sections PUP1, PUP2, and PUP3, as the utterance section UP1. The control device 1 then executes the utterance section analysis process S42433, the details of which are described later. Next, the control device 1 executes the reset S42434, which resets the provisional silence time to "0" and the number of partial utterance sections to "0". After time t7, the provisional silence time continues to be updated in S4241, and each time it exceeds the threshold TH4 the determination in S4242 is "YES". However, because the number of partial utterance sections is "0", the determination in S42431 is "NO", and the provisional silence time is reset to "0" in S42434. This updating of the provisional silence time after the silent section has been confirmed need not necessarily be performed.

The above is the first operation example of the present embodiment. The processing described above contains several branches based on comparison with a threshold. Whether each branch goes to YES or to NO when the compared value equals the threshold has little bearing on the essence of the present invention and may be changed as needed.

Next, the second operation example of FIG. 7 is described with reference to the flowcharts of FIGS. 4 and 5. The second operation example differs from the first operation example (FIG. 6) in the following respect. In the first operation example, the provisional silence time t3 - t2, from time t2 when the volume of the input voice signal falls below the threshold TH2 to time t3 when it exceeds the threshold TH1, is longer than the threshold TH3. In the second operation example, by contrast, this provisional silence time t3 - t2 is equal to or less than the threshold TH3.

In the second operation example, at time t3, the determination result of S41 in the utterance section process S4 is "YES" and the determination result of S421 in the provisional silent section process S42 is "YES", so the process proceeds to S422. Because the provisional silence time is equal to or less than the threshold TH3, the determination result of S422 is "NO". At time t3, the section from time t1 to time t2 is a partial utterance section, so the determination result of S425 is "NO". As a result, the control device 1 executes the provisional utterance section restart process S426. In this process, the (immediately preceding) partial utterance section that lasted from time t1 to time t2 and the provisional utterance section from time t3 onward are joined into one. Specifically, the provisional utterance section state flag is set to ON, and the time elapsed from time t1 to time t3 is set as the provisional utterance time. As a result of the provisional utterance section restart process S426, in the second operation example, time t1 is the start of the partial utterance section PUP1, and time t4, at which the volume of the input voice signal falls below the threshold TH2 after time t3, is the end of the same partial utterance section PUP1. Consequently, two partial utterance sections, PUP1 and PUP2, are detected in the second operation example.

Next, the third operation example of FIG. 8 is described with reference to the flowcharts of FIGS. 4 and 5. The third operation example differs from the first operation example (FIG. 6) in the following respect. In the first operation example, the provisional utterance time t2 - t1, from time t1 when the volume of the input voice signal exceeds the threshold TH1 to time t2 when it falls below the threshold TH2, exceeds the threshold TH5. In the third operation example, by contrast, this provisional utterance time t2 - t1 is equal to or less than the threshold TH5.

In the third operation example, at time t2, the determination result of S41 in the utterance section process S4 is "NO" and the determination result of S431 in the provisional utterance section process S43 is "YES", so the process proceeds to S432. Because the provisional utterance time is equal to or less than the threshold TH5, the determination result of S432 is "NO". As a result, the control device 1 executes the provisional silent section restart process S435. In this process, the provisional silent section within the silent section up to time t1 and the provisional silent section from time t2 onward are joined into a single provisional silent section. Specifically, the provisional utterance section state flag is set to OFF, and the time elapsed from time 0 to time t3 is set as the provisional silence time. As a result of the provisional silent section restart process S435, in the third operation example, the partial utterance section starting at time t3 becomes the first partial utterance section PUP1. In other words, in the present embodiment, a section whose provisional utterance time is equal to or less than the threshold TH5 is not treated as a partial utterance section but as a continuation of the immediately preceding provisional silent section. Although the third operation example shows the first provisional utterance section after a silent section being incorporated into the immediately preceding provisional silent section, the same applies to the second and subsequent provisional utterance sections, such as the partial utterance section PUP2 in FIG. 6: when the duration of such a provisional utterance section is equal to or less than the threshold TH5, it is incorporated into the provisional silent section immediately preceding it. Consequently, two partial utterance sections, PUP1 and PUP2, are detected in the third operation example.
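The three operation examples can be condensed into a small state machine. The following Python sketch is only an illustration under stated assumptions, not the flowchart of FIGS. 4 and 5 itself: the names `Segmenter`, `feed`, and `segment` are hypothetical, the code compares timestamps instead of accumulating the provisional utterance and silence times, the threshold values and integer time steps are arbitrary, and the branch taken when a short gap occurs with no preceding partial utterance section (S425 "YES") is assumed to start a new provisional utterance section.

```python
from dataclasses import dataclass, field


@dataclass
class Segmenter:
    th1: float  # volume above which a provisional utterance section starts
    th2: float  # volume below which it ends (lower than th1)
    th3: float  # a silence no longer than this is bridged (S426)
    th4: float  # a silence longer than this closes the utterance section
    th5: float  # speech no longer than this is treated as silence (S435)
    speaking: bool = False      # provisional utterance section state flag
    start: float = 0.0          # start of the current provisional utterance
    silence_from: float = 0.0   # start of the current provisional silence
    partials: list = field(default_factory=list)    # unregistered PUPs
    utterances: list = field(default_factory=list)  # registered UPs

    def feed(self, t, volume):
        """Process one volume sample taken at time t."""
        if self.speaking:
            if volume < self.th2:                      # S431 YES
                if t - self.start > self.th5:          # S432 YES -> S433
                    self.partials.append((self.start, t))
                    self.silence_from = t
                # S432 NO -> S435: silence_from is kept, so the short
                # burst is absorbed into the preceding silence.
                self.speaking = False
        elif volume > self.th1:                        # S421 YES
            short_gap = t - self.silence_from <= self.th3  # S422 NO
            if short_gap and self.partials:            # S425 NO -> S426
                self.start = self.partials.pop()[0]    # rejoin previous PUP
            else:                                      # S423 (assumed for S425 YES too)
                self.start = t
            self.speaking = True
        elif t - self.silence_from > self.th4 and self.partials:
            self.utterances.append(list(self.partials))  # S42432
            self.partials.clear()                        # S42434


def segment(loud, end, **thresholds):
    """Feed a synthetic volume track: 0.8 inside `loud` ranges, else 0.1."""
    seg = Segmenter(**thresholds)
    for t in range(end):
        seg.feed(t, 0.8 if any(a <= t < b for a, b in loud) else 0.1)
    return seg.utterances


TH = dict(th1=0.5, th2=0.3, th3=2, th4=6, th5=3)
print(segment([(1, 6), (10, 15), (19, 24)], 40, **TH))  # like FIG. 6: three PUPs
print(segment([(1, 6), (8, 13), (17, 22)], 40, **TH))   # like FIG. 7: gap <= TH3
print(segment([(1, 3), (7, 12), (16, 21)], 40, **TH))   # like FIG. 8: burst <= TH5
```

With these synthetic inputs, the second call joins the first two bursts into one partial utterance section and the third call discards the short initial burst, so both yield two partial utterance sections, as in the second and third operation examples.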

Next, the utterance section analysis S42433 executed in the utterance section process S4 is described. The description below takes as examples the cases in which the utterance content of the utterance section is one of Examples 1 to 3 mentioned above.
Example 1: "今日、ラーメンでいい？" ("Is ramen OK today?")
Example 2: "今日、ラーメンでいい？ね。" ("Is ramen OK today? Right.")
Example 3: "今日、ラーメンでいい？ね？" ("Is ramen OK today? Right?")

In the utterance section analysis S42433, the pitch transition of the voice signal is obtained for each of the partial utterance sections making up the utterance section configured in S42432.

In Example 1, the utterance section analysis S42433 obtains the pitch transition of each of the partial utterance sections "今日、" ("Today,") and "ラーメンでいい？" ("Is ramen OK?") that make up the utterance section, and a rising pitch transition is observed at the end of the last partial utterance section, "ラーメンでいい？". The utterance section analysis S42433 therefore determines that the utterance in this utterance section carries a questioning intention.

In Example 2, the utterance section analysis S42433 obtains the pitch transition of each of the partial utterance sections "今日、" ("Today,"), "ラーメンでいい？" ("Is ramen OK?"), and "ね。" ("Right.") that make up the utterance section, and a rising pitch transition is observed at the end of the partial utterance section "ラーメンでいい？", which lies in the middle of the utterance section. The utterance section analysis S42433 therefore determines that the utterance in this utterance section carries a questioning intention.

In Example 3, the utterance section analysis S42433 obtains the pitch transition of each of the partial utterance sections "今日、" ("Today,"), "ラーメンでいい？" ("Is ramen OK?"), and "ね？" ("Right?") that make up the utterance section, and a rising pitch transition is observed both at the end of the second partial utterance section, "ラーメンでいい？", and at the end of the last partial utterance section, "ね？". The utterance section analysis S42433 takes the number of partial utterance sections that end with a rising pitch transition, among the partial utterance sections making up the utterance section, as the strength of the questioning intention of that utterance section (pressing for confirmation). In Example 3, therefore, the utterance section analysis S42433 determines that the speaker is pressing the question.
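The judgment described for Examples 1 to 3 amounts to counting the partial utterance sections that end with a rising pitch. The following Python sketch illustrates that counting under assumptions the source does not specify: pitch contours are given as lists of semitone values, a "rise at the end" is detected by a hypothetical rule (`rises_at_end`, comparing the mean of the last few values with the mean of the values just before them), and a count of two or more is labeled as "pressing".

```python
def rises_at_end(pitch, tail=3, min_rise=1.0):
    """Hypothetical rule: the mean of the last `tail` pitch values exceeds
    the mean of the `tail` values before them by more than `min_rise`."""
    if len(pitch) < 2 * tail:
        return False
    last = sum(pitch[-tail:]) / tail
    prev = sum(pitch[-2 * tail:-tail]) / tail
    return last - prev > min_rise


def analyze_utterance(partial_pitches):
    """Count partial utterance sections that end with a rising pitch.

    The count is used as the strength of the questioning intention;
    treating two or more as "pressing" is an assumption of this sketch.
    """
    rising = sum(rises_at_end(p) for p in partial_pitches)
    return {"question": rising >= 1, "strength": rising, "pressing": rising >= 2}


# Synthetic contours in semitones (illustrative values only).
flat = [60, 60, 60, 60, 60, 60]
rise = [60, 60, 60, 62, 63, 64]
fall = [62, 62, 62, 60, 59, 58]

example1 = analyze_utterance([flat, rise])        # "Today," + question
example2 = analyze_utterance([flat, rise, fall])  # ... + falling "ne."
example3 = analyze_utterance([flat, rise, rise])  # ... + rising "ne?"
print(example1, example2, example3)
```

Because every partial utterance section is inspected, Example 2 is still recognized as a question even though its final partial utterance section falls in pitch.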

The voice analysis program 10 passes information indicating the speaker's intention, as determined by the utterance section analysis S42433, to the voice synthesis program. Based on this information, the voice synthesis program determines the content of the response voice to the speaker. In Examples 1 to 3, a questioning intention is found in every case, so, as in Patent Document 1, the voice of the response to the utterance is controlled to have characteristics specific to a response to a question. For the voice generated in Example 3, the question amounts to "pressing for confirmation", so the response may be controlled to have characteristics that differ accordingly from those of the responses to the questions in Examples 1 and 2.

As described above, according to the present embodiment, the voice signal is divided into utterance sections each containing one or more partial utterance sections, and the change in the voice signal, specifically the pitch transition, is analyzed for each partial utterance section. Therefore, even when it is difficult to determine the speaker's intention from the pitch transition at the end of the utterance section alone (as in Example 2), the speaker's intention can be determined appropriately and simply, and the voice of the response to the utterance can be controlled.

One embodiment of the present invention has been described above, but other embodiments are also possible. Examples are as follows.

(1) In the above embodiment, the input voice signal is divided into partial utterance sections separated by short silent sections, and when a provisional silent section longer than a short silent section (the end of an utterance section) occurs, the one or more partial utterance sections delimited up to that point are combined into one utterance section. However, the scope of the present invention is not limited to this aspect. For example, the following alternative is conceivable. First, silent sections whose duration exceeds a first time threshold are found in the voice signal, and one or more utterance sections delimited by these silent sections are extracted from the voice signal. Next, within each utterance section, short silent sections whose duration exceeds a second time threshold (< the first time threshold) are found, and one or more partial utterance sections delimited by these short silent sections are extracted from that utterance section. This aspect provides the same effect as the above embodiment.
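The two-pass alternative in (1) can be sketched as follows. This is a minimal illustration that assumes the voiced stretches of the signal are already available as a sorted list of `(start, end)` pairs; the function names `split_by_gap` and `two_pass` and the numeric values are hypothetical.

```python
def split_by_gap(intervals, min_gap):
    """Group consecutive (start, end) intervals; a new group starts
    whenever the silence before an interval is longer than min_gap."""
    groups = []
    for iv in intervals:
        if groups and iv[0] - groups[-1][-1][1] <= min_gap:
            groups[-1].append(iv)
        else:
            groups.append([iv])
    return groups


def two_pass(intervals, first_th, second_th):
    """First pass: silences longer than first_th delimit utterance sections.
    Second pass: inside each utterance section, silences longer than
    second_th (second_th < first_th) delimit partial utterance sections."""
    utterances = []
    for utt in split_by_gap(intervals, first_th):
        partials = [(g[0][0], g[-1][1]) for g in split_by_gap(utt, second_th)]
        utterances.append(partials)
    return utterances


voiced = [(1, 6), (8, 13), (17, 22), (40, 45)]
result = two_pass(voiced, first_th=6, second_th=2)
print(result)  # [[(1, 13), (17, 22)], [(40, 45)]]
```

Here the gap of 2 between the first two voiced stretches is too short to count as a short silent section, so they form one partial utterance section, while the gap of 18 before the last stretch starts a new utterance section.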

(2) In the above embodiment, the boundaries between partial utterance sections (short silent sections) and the boundaries between utterance sections (silent sections) are determined from the duration of the provisional silent section detected on the basis of volume (the provisional silence time). However, for at least one of the first determination criterion and the second determination criterion, the provisional silent section or the silent section may be determined using factors other than the provisional silence time, such as the volume, pitch, or spectrum of the section, in addition to or instead of the provisional silence time. For example, voice features that tend to appear at the end of an utterance may be used as the end condition for a partial utterance section or an utterance section. In that case, the end conditions should be set so that the end of an utterance section gives a stronger impression of being "finished" than the end of a partial utterance section.

(3) To analyze the speaker's intention, pitch transition analysis may be combined with a speech recognition engine or an emotion recognition engine. This makes the analysis of the speaker's intention more robust.

(4) The partial utterance section may be used not only as the unit of intention analysis but also as the unit of speech recognition or emotion recognition.

(5) The voice analysis program 10 of the above embodiment may be applied to devices other than dialogue devices, such as a voice control device or a voice dialogue evaluation device.

(6) A cloud server may provide a service that makes the voice analysis program of the above embodiment available for use.

(7) The voice analysis program of the above embodiment may be provided as a PC application or a smartphone application.

(8) The present invention can also be realized as a device that analyzes voice in a toy, a car navigation system, or the like.

(9) To make the dialogue natural, the pitch of the response voice may be controlled so as to have a predetermined relationship, for example a consonant-interval relationship, with the pitch of some of the partial utterance sections making up the utterance section, for example a partial utterance section in which the speaker's intention appears, such as one ending with a rising pitch transition.
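As a rough illustration of (9), a response pitch in a consonant-interval relationship can be derived from a reference pitch by a simple frequency ratio. The source does not say which interval or direction is used; the just-intonation ratios, the function name `response_pitch`, and the choice of a perfect fifth below the reference are assumptions of this sketch.

```python
# Just-intonation frequency ratios of some consonant intervals.
CONSONANT_RATIOS = {
    "octave": 2 / 1,
    "perfect_fifth": 3 / 2,
    "perfect_fourth": 4 / 3,
}


def response_pitch(reference_hz, interval="perfect_fifth", below=True):
    """Return a pitch (Hz) at the given consonant interval from the
    reference pitch, below it by default."""
    ratio = CONSONANT_RATIOS[interval]
    return reference_hz / ratio if below else reference_hz * ratio


# Reference: pitch at the end of the partial utterance section that
# carries the question (illustrative value).
print(response_pitch(300.0))                         # 200.0
print(response_pitch(220.0, "octave", below=False))  # 440.0
```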

1 ... control device, 2 ... arithmetic device, 3 ... storage device, 4 ... display device, 5 ... operation device, 6 ... sound collecting device, 7 ... sound emitting device, UP1 ... utterance section, PUP1 to PUP3 ... partial utterance sections, 10 ... voice analysis program, 11 ... specifying unit, 12 ... analysis unit.

Claims

A voice analysis device comprising:
a specifying unit that specifies a plurality of partial utterance sections included in one utterance section in a voice signal; and
an analysis unit that analyzes a change in the voice signal for each partial utterance section.

The voice analysis device according to claim 1, wherein the specifying unit specifies, in the voice signal, the utterance section whose end is the point at which the end of an utterance is determined by a first determination criterion, and specifies, in the voice signal, the plurality of partial utterance sections whose ends are the points at which the end of an utterance is determined by a second determination criterion different from the first determination criterion.

The voice analysis device according to claim 1, wherein the specifying unit specifies, in the voice signal, utterance sections separated from one another by a silent section whose duration is longer than a time threshold, and specifies, in the voice signal, the plurality of partial utterance sections separated from one another by a short silent section whose duration is shorter than the time threshold.

The voice analysis device according to any one of claims 1 to 3, wherein the specifying unit sets, as the start of the partial utterance section, the timing at which the volume of the voice signal exceeds a first volume threshold, and sets, as the end of the partial utterance section, the timing at which the volume of the voice signal falls below a second volume threshold lower than the first volume threshold.

A voice synthesis device comprising the voice analysis device according to any one of claims 1 to 4, the voice synthesis device synthesizing, for each utterance section, a response voice to the utterance of that utterance section based on the analysis result for that utterance section.

A voice analysis method comprising:
specifying a plurality of partial utterance sections included in one utterance section in a voice signal; and
analyzing a change in the voice signal for each partial utterance section.

A program that causes a computer to function as:
a specifying unit that specifies a plurality of partial utterance sections included in one utterance section in a voice signal; and
an analysis unit that analyzes a change in the voice signal for each partial utterance section.