JP2015087718A

JP2015087718A - Voice interaction system and voice interaction method

Info

Publication number: JP2015087718A
Application number: JP2013228525A
Authority: JP
Inventors: 達朗堀; Tatsuro Hori; 生聖渡部; Seisho Watabe
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2013-11-01
Filing date: 2013-11-01
Publication date: 2015-05-07

Abstract

PROBLEM TO BE SOLVED: To give a proper response to a long speech of a person. SOLUTION: A voice interaction system includes speech recognition means which recognizes speech uttered by a person in externally input speech, and speech output means which outputs a response speech according to a speech recognition result obtained by the speech recognition means. The speech recognition means determines whether the externally input speech is a first speech pattern determined in advance as a comma, or a second speech pattern determined in advance as a period. When the first speech pattern is determined, the speech in a comma section preceding the first speech pattern is recognized. When the second speech pattern is determined, the speech in a period section preceding the second speech pattern is recognized.

Description

The present invention relates to a voice interaction system and a voice interaction method, and more particularly to a technique for recognizing speech uttered by a person and uttering speech in response to the recognized speech.

Patent Document 1 discloses a dialogue device that recognizes a user's speech and conducts a dialogue with the user based on the recognized speech. A robot that converses with people, like the dialogue device disclosed in Patent Document 1, analyzes the sentence structure of a person's utterance and then, based on the analysis result, utters speech that responds to that utterance.

That is, such a robot recognizes the speech in all speech sections contained in one sentence of a person's utterance, and only then responds to that utterance. Consequently, when a sentence spoken by a person is long, the response is slow to come, the dialogue loses its tempo, and the person's willingness to converse is dampened. In particular, when a person speaks one-sidedly, the breaks between speech sections are hard to discern, and it has been difficult to achieve a clean, alternating exchange.

Patent Document 1: JP 2013-113966 A

The present invention has been made based on the above findings, and its object is to provide a voice interaction system and a voice interaction method capable of responding appropriately even when a person's utterance is long.

A voice interaction system according to a first aspect of the present invention includes speech recognition means for recognizing speech uttered by a person in externally input speech, and speech utterance means for uttering a response speech according to the speech recognition result obtained by the speech recognition means. The speech recognition means determines whether the externally input speech is a first speech pattern predetermined as a comma or a second speech pattern predetermined as a period. When it determines that the speech is the first speech pattern, it recognizes the speech in the comma-unit section up to the point of that determination. When it determines that the speech is the second speech pattern, it recognizes the speech in the period-unit section up to the point of that determination.

A voice interaction method according to a second aspect of the present invention includes a speech recognition step of recognizing speech uttered by a person in externally input speech, and a speech utterance step of uttering a response speech according to the speech recognition result obtained in the speech recognition step. In the speech recognition step, it is determined whether the externally input speech is a first speech pattern predetermined as a comma or a second speech pattern predetermined as a period. When the speech is determined to be the first speech pattern, the speech in the comma-unit section up to the point of that determination is recognized. When the speech is determined to be the second speech pattern, the speech in the period-unit section up to the point of that determination is recognized.

According to each aspect of the present invention described above, it is possible to provide a voice interaction system and a voice interaction method capable of responding appropriately even when a person's utterance is long.

A configuration diagram of the voice interaction system according to the embodiment.
A flowchart showing the processing of the voice interaction system according to the embodiment.
A flowchart showing the response sentence selection processing of the voice interaction system according to the embodiment.
A flowchart showing the portion of the processing that realizes the comma-level response according to the embodiment.
A diagram showing the units of comma processing and period processing according to the embodiment.
A diagram showing the processing state of the first stage of the first cycle according to the embodiment.
A diagram showing the processing state of the second stage of the first cycle according to the embodiment.
A diagram showing the processing state of the first stage of the second cycle according to the embodiment.
A diagram showing the processing state of the second stage of the second cycle according to the embodiment.
A diagram showing the processing state of the first stage of the third cycle according to the embodiment.
A diagram showing the processing state of the second stage of the third cycle according to the embodiment.
A diagram showing the processing state of the first stage of the fourth cycle according to the embodiment.
A diagram showing the processing state of the second stage of the fourth cycle according to the embodiment.
A diagram showing the processing state of the first stage of the fifth cycle according to the embodiment.
A diagram showing the processing state of the second stage of the fifth cycle according to the embodiment.
A flowchart showing the portion of the processing that realizes the period-level response according to the embodiment.
A flowchart showing an example of the processing state of the response sentence selection processing according to the embodiment.
A flowchart showing an example of the processing state of the response sentence selection processing according to the embodiment.

Preferred embodiments of the present invention are described below with reference to the drawings. Specific numerical values and the like given in the following embodiments are merely examples to facilitate understanding of the invention, and the invention is not limited to them unless otherwise noted. In the following description and drawings, matters obvious to those skilled in the art are omitted or simplified as appropriate for clarity.

<Embodiment of the Invention>
The configuration of the voice interaction system 1 according to the present embodiment is described with reference to FIG. 1. FIG. 1 is a configuration diagram of the voice interaction system 1 according to the present embodiment. The voice interaction system 1 can be applied to, for example, a robot that converses with people.

The voice interaction system 1 includes a control unit 2, a storage unit 3, a microphone 4, a speaker 5, and an I/O port 6. The control unit 2, the storage unit 3, and the I/O port 6 are connected to one another via a bus. The microphone 4 and the speaker 5 are connected to the I/O port 6.

The control unit 2 performs overall control of the voice interaction system 1. The control unit 2 has a CPU (Central Processing Unit) and carries out the various processes of the voice interaction system 1 according to the present embodiment by executing a program stored in the storage unit 3. That is, the program stored in the storage unit 3 contains code that causes the CPU to execute those processes. The control unit 2 functions as a comma speech recognition unit 11, a noun extraction unit 12, a comma response sentence creation unit 13, an utterance timing determination unit 14, a period speech recognition unit 21, a period response sentence creation unit 22, and a speech synthesis unit 30.

The comma speech recognition unit 11 recognizes speech uttered by a person in comma units. When an utterance section (speech section) in which the person is speaking has ended and the time during which no next utterance section begins reaches a first predetermined time, the comma speech recognition unit 11 recognizes that the utterance section ended at a comma. It then recognizes the speech in the comma-unit section that includes the ended utterance section. More precisely, it recognizes the speech of the ended utterance section. In other words, it recognizes the speech in a single utterance section running from the start of a sentence, or from the preceding comma, to the next comma. The first predetermined time may be set to any suitable value, for example the average gap between utterance sections at commas in human speech.

The noun extraction unit 12 extracts nouns from the comma-unit speech recognized by the comma speech recognition unit 11 and stores noun information indicating the extracted nouns in the noun database 42.

The comma response sentence creation unit 13 creates a response sentence that confirms the noun indicated by the noun information stored in the noun database 42. That is, the comma response sentence creation unit 13 generates response sentence information indicating a response sentence that confirms the noun.

The utterance timing determination unit 14 detects the end of an utterance section at a comma and, if response sentence information has been generated by the comma response sentence creation unit 13 at the time of detection, outputs that response sentence information to the speech synthesis unit 30. As described above, the end of an utterance section at a comma may be detected when an utterance section has ended and the time during which no next utterance section begins reaches the first predetermined time.

The period speech recognition unit 21 recognizes speech uttered by a person in period units. When an utterance section in which the person is speaking has ended and the time during which no next utterance section begins reaches a second predetermined time, which is longer than the first predetermined time, the period speech recognition unit 21 recognizes that the utterance section ended at a period. It then recognizes the speech in the period-unit section that includes the ended utterance section. More precisely, it recognizes the speech in the one or more utterance sections contained in the sentence that ends at that period. In other words, it recognizes the speech in at least one utterance section contained in the section from the start of a sentence to its period. The second predetermined time may be set to any suitable value, for example the average gap between utterance sections at periods in human speech.
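The relationship between the two predetermined times can be illustrated with a minimal Python sketch. The threshold values and the names `Boundary` and `boundaries_reached` are hypothetical; the embodiment only requires that the second predetermined time be longer than the first.

```python
from enum import Enum
from typing import List


class Boundary(Enum):
    COMMA = "comma"
    PERIOD = "period"


# Hypothetical values in seconds. The source only states that the second
# predetermined time is longer than the first.
FIRST_PREDETERMINED_TIME = 0.3
SECOND_PREDETERMINED_TIME = 1.0


def boundaries_reached(silence_seconds: float) -> List[Boundary]:
    """Return the boundaries implied by the silence that has elapsed
    since the last utterance section ended."""
    reached = []
    if silence_seconds >= FIRST_PREDETERMINED_TIME:
        reached.append(Boundary.COMMA)   # handled by the comma-level path
    if silence_seconds >= SECOND_PREDETERMINED_TIME:
        reached.append(Boundary.PERIOD)  # handled by the period-level path
    return reached
```

Because the second threshold is longer, a silence that qualifies as a period has necessarily passed the comma threshold first, so in this sketch the comma-level path always gets a chance to respond before the period-level path does.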

The period response sentence creation unit 22 creates a response sentence that confirms the "verb" and the "case element + case" recognized by the period speech recognition unit 21. That is, the period response sentence creation unit 22 generates response sentence information indicating a response sentence that confirms the "verb" and the "case element + case", and outputs it to the speech synthesis unit 30.

The speech synthesis unit 30 uses speech synthesis to generate speech information representing the spoken form of the response sentence indicated by the response sentence information output from the comma response sentence creation unit 13 and the period response sentence creation unit 22, and outputs it to the speaker 5 via the I/O port 6.

The storage unit 3 stores the above program and the various information needed for processing as the voice interaction system 1. A recognition dictionary database 41 and a noun database 42 are built in the storage unit 3. The recognition dictionary database 41 stores in advance the speech information of a plurality of reference words used for matching, so that words in speech uttered by a person can be recognized. The speech information of these words may be prepared in advance, for example, so as to represent an average voice computed by sampling the voices of multiple people.

That is, the comma speech recognition unit 11 and the period speech recognition unit 21 match the speech information of the words stored in the recognition dictionary database 41 against the speech information of the speech uttered by the person. By recognizing each word in the uttered speech in this way, they recognize the speech content of the section being recognized. The word types provided in the recognition dictionary database 41 include "verb", "case element", and "case".

Here, the comma speech recognition unit 11 and the period speech recognition unit 21 generate, as the speech recognition result, a plurality of candidate patterns for the speech content of the recognized section. If the same "case element" (noun) appears in at least a predetermined proportion of those patterns, the comma speech recognition unit 11 determines that the "case element" (noun) is reliable. Likewise, if the same "verb" or "case element + case" appears in at least a predetermined proportion of those patterns, the period speech recognition unit 21 determines that the "verb" or "case element + case" is reliable. For example, when ten patterns are generated, a "verb" may be determined to be reliable if the same "verb" appears in eight or more patterns, and unreliable if it appears in fewer than eight.

The noun extraction unit 12 extracts the "case element" (noun) determined to be reliable by the comma speech recognition unit 11, and the comma response sentence creation unit 13 generates response sentence information indicating a response sentence that confirms the extracted "case element" (noun). The period response sentence creation unit 22, for its part, generates response sentence information indicating a response sentence that confirms whichever of the "verb" and the "case element + case" the period speech recognition unit 21 has determined to be reliable. When more than one "verb" is determined to be reliable, the response sentence may be created using the most reliable one. The most reliable "verb" may be taken to be the one that appears in the largest number of the generated patterns. The same applies to the "case element" (noun) and the "case element + case".
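The reliability determination and the choice of the most reliable item can be sketched in Python as follows. This is an illustrative sketch only: the function name `most_reliable`, the representation of each candidate pattern as a set of strings, and the tie-breaking rule are assumptions not found in the source; the default ratio of 0.8 mirrors the "eight of ten patterns" example.

```python
from collections import Counter
from typing import Optional, Sequence, Set


def most_reliable(patterns: Sequence[Set[str]],
                  min_ratio: float = 0.8) -> Optional[str]:
    """Return the item (e.g. a verb) that appears in the most candidate
    patterns, provided it appears in at least min_ratio of them."""
    total = len(patterns)
    if total == 0:
        return None
    # Each pattern is a set, so an item is counted at most once per pattern.
    counts = Counter(item for pattern in patterns for item in pattern)
    reliable = [(n, item) for item, n in counts.items()
                if n / total >= min_ratio]
    if not reliable:
        return None
    # Highest count wins. Ties are broken by string order, which is an
    # arbitrary assumption made only to keep the sketch deterministic.
    return max(reliable)[1]
```

With ten patterns of which eight contain the verb "eat", the function returns "eat"; with only seven it returns `None`, corresponding to the case where no response sentence is created for that item.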

The microphone 4 converts externally input speech into speech information representing that speech and outputs it to the I/O port 6. The speaker 5 converts the speech information output from the I/O port 6 into sound and emits it. In this way, the speech synthesized by the speech synthesis unit 30 of the control unit 2 is uttered.

The I/O port 6 performs A/D conversion on the speech information output from the microphone 4 and outputs the result to the control unit 2. The I/O port 6 also performs D/A conversion on the speech information output from the control unit 2 and outputs the result to the speaker 5.

With the configuration described above, the present embodiment can echo back what the speaker said at short, comma-unit intervals based on the noun extraction results, so that the dialogue can continue without interrupting the other party. Moreover, even when a sentence is long and a comprehensive response to it cannot be given for some time, short echoes can be returned frequently, which encourages the other party to keep talking.

Next, the processing of the voice interaction system 1 according to the embodiment of the present invention is described with reference to FIGS. 2 to 4. FIG. 2 is a flowchart showing the processing of the voice interaction system 1 according to the embodiment of the present invention.

The microphone 4 continuously generates speech information representing externally input speech and outputs it to the control unit 2 via the I/O port 6. Accordingly, when a person speaks (S1: Yes), the microphone 4 generates speech information representing the speech in that utterance and outputs it to the control unit 2 via the I/O port 6.

The comma speech recognition unit 11 detects the end of an utterance section at a comma and, based on the speech information of the ended utterance section, recognizes the speech that the information represents (S2). Specifically, the comma speech recognition unit 11 detects the end of an utterance section at a comma when the time during which the microphone 4 has output no speech information representing human speech reaches the first predetermined time. Whether such speech information is no longer being output may be judged, for example, by whether the sound pressure level of the speech represented by the speech information has fallen to or below a predetermined fixed value. The comma speech recognition unit 11 then recognizes the speech represented by the speech information it received during the utterance section.
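The end-of-section detection described here can be sketched as a simple scan over per-frame sound levels. This is a minimal Python sketch under assumptions: the function name `utterance_sections`, the use of fixed-length frames, and the expression of the predetermined time as a number of frames are all hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def utterance_sections(levels: Iterable[float],
                       level_threshold: float,
                       min_silence_frames: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) frame indices of utterance sections, end exclusive.

    A frame counts as silent when its level is at or below level_threshold.
    A section is closed once min_silence_frames consecutive silent frames
    follow it (the predetermined time expressed in frames, an assumption).
    A section still open when the input ends is not yielded.
    """
    start = None        # first voiced frame of the open section
    last_voiced = None  # most recent voiced frame of the open section
    silent = 0          # consecutive silent frames since last_voiced
    for i, level in enumerate(levels):
        if level > level_threshold:
            if start is None:
                start = i
            last_voiced = i
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= min_silence_frames:
                yield (start, last_voiced + 1)
                start = None
                silent = 0
```

A short dip in level that does not last for `min_silence_frames` frames is absorbed into the surrounding section, which corresponds to the requirement that the silence must persist for the predetermined time before the section is treated as ended.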

The noun extraction unit 12 extracts, from the speech content recognized by the comma speech recognition unit 11, the nouns that the comma speech recognition unit 11 has determined to be reliable, and stores noun information indicating the extracted nouns in the noun database 42 (S3).

The comma response sentence creation unit 13 generates response sentence information indicating a response sentence that confirms the noun indicated by the noun information stored in the noun database 42, and stores it in the storage unit 3 (S4).

Meanwhile, the utterance timing determination unit 14 detects the end of an utterance section at a comma (S5). Specifically, as described above, the utterance timing determination unit 14 detects the end of an utterance section when the time during which the microphone 4 has output no speech information representing human speech reaches the first predetermined time.

When it detects the end of an utterance section, the utterance timing determination unit 14 determines whether a response sentence confirming the noun can be uttered (S6). Specifically, if response sentence information has already been created by the comma response sentence creation unit 13 and stored in the storage unit 3, the utterance timing determination unit 14 determines that the response sentence can be uttered. If response sentence information has not yet been created and is not stored in the storage unit 3, it determines that the response sentence cannot be uttered.

When the utterance timing determination unit 14 determines that a response sentence confirming the noun can be uttered (S6: Yes), it outputs the response sentence information stored in the storage unit 3 to the speech synthesis unit 30. When it determines that the response sentence cannot be uttered (S6: No), the response sentence information has not been created, so nothing is output to the speech synthesis unit 30.
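Steps S4 to S6 amount to a gate that releases a prepared response only when a comma-level pause is detected. The following Python sketch illustrates this under assumptions: the class name `UtteranceTimingGate` and its methods are hypothetical, and clearing the stored response after output is an assumption, since the source does not say how long the response sentence information remains in the storage unit 3.

```python
from typing import Callable, Optional


class UtteranceTimingGate:
    """Releases a stored comma-level response when a pause is detected."""

    def __init__(self, synthesize: Callable[[str], None]) -> None:
        self._synthesize = synthesize        # stands in for synthesis unit 30
        self._pending: Optional[str] = None  # stands in for storage unit 3

    def store_response(self, text: str) -> None:
        """S4: keep the response sentence prepared by the creation unit."""
        self._pending = text

    def on_comma_pause(self) -> bool:
        """S5 has detected a pause; perform the S6 check."""
        if self._pending is None:
            return False  # S6: No, nothing is output
        self._synthesize(self._pending)  # S6: Yes
        # Assumption: clear the slot so the same response is not repeated.
        self._pending = None
        return True
```

In use, the recognition path calls `store_response` whenever it finishes preparing a confirmation, and the pause detector calls `on_comma_pause`; if recognition has not finished by the time the pause is detected, the pause passes without a response.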

Meanwhile, the period speech recognition unit 21 detects the end of an utterance section at a period and, based on the speech information of each speech section contained in the sentence that ran up to the ended utterance section, recognizes the speech that the information represents (S7). Specifically, the period speech recognition unit 21 detects the end of an utterance section at a period when the time during which the microphone 4 has output no speech information representing human speech reaches the second predetermined time. Whether such speech information is no longer being output may be judged, for example, by whether the sound pressure level of the speech represented by the speech information has fallen to or below a predetermined fixed value. The period speech recognition unit 21 then recognizes the speech represented by the speech information it received during each utterance section.

The period speech recognition unit 21 determines whether at least one of a "verb" and a "case element + case" is present in the speech content of the recognized sentence (S8).

When the period speech recognition unit 21 determines that at least one of a "verb" and a "case element + case" is present in the speech content of the recognized sentence (S8: Yes), it determines whether each item found to be present is reliable (S9).

When the period speech recognition unit 21 determines that both a "verb" and a "case element + case" are present in the speech content of the recognized sentence and that both are reliable (S9: Yes), the period response sentence creation unit 22 creates response sentence information indicating a response sentence that confirms the "verb" together with the "case element + case", and outputs it to the speech synthesis unit 30 (S10).

On the other hand, when the period speech recognition unit 21 determines that at least a "verb" is present in the speech content of the recognized sentence and that only the "verb" is reliable (S11), the period response sentence creation unit 22 creates response sentence information indicating a response sentence that confirms the "verb", and outputs it to the speech synthesis unit 30 (S12).

Similarly, when the period speech recognition unit 21 determines that at least a "case element + case" is present in the speech content of the recognized sentence and that only the "case element + case" is reliable (S13), the period response sentence creation unit 22 creates response sentence information indicating a response sentence that confirms the "case element + case", and outputs it to the speech synthesis unit 30 (S14).

No response sentence information is created when the period speech recognition unit 21 determines that neither a "verb" nor a "case element + case" is present in the speech content of the recognized sentence (S8: No), or when it determines that none of the items found to be present is reliable (S9: No).
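The branching in steps S8 to S14 can be summarized in a short Python sketch. The function name `select_confirmation` and its argument layout are hypothetical, and how the selected items are rendered into an actual response sentence is not specified in this passage, so the sketch returns only the items to be confirmed.

```python
from typing import Optional, Tuple


def select_confirmation(verb: Optional[str], verb_reliable: bool,
                        case_phrase: Optional[str], case_reliable: bool
                        ) -> Optional[Tuple[str, ...]]:
    """Return the items a period-level response should confirm, or None.

    verb and case_phrase are None when the item is absent from the
    recognized sentence (the S8 check); the flags carry the S9 result.
    """
    items = []
    if verb is not None and verb_reliable:
        items.append(verb)
    if case_phrase is not None and case_reliable:
        items.append(case_phrase)
    # Two items: S10 (verb and case phrase). One item: S12 (verb only)
    # or S14 (case phrase only). No items: S8 No or S9 No, no response.
    return tuple(items) or None
```

For example, a reliable verb with an unreliable case phrase yields a one-element result containing only the verb (S12), while a sentence with neither item yields `None`.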

From the response sentence information output by the comma response sentence creation unit 13 and the period response sentence creation unit 22, the speech synthesis unit 30 selects the response sentence information to be uttered as the response (S15). It then synthesizes speech from the selected response sentence information, generating speech information that represents the speech of the indicated response sentence, and sequentially outputs it to the speaker 5 via the I/O port 6 (S16). As a result, the speaker 5 utters, in response to the person's speech, a response sentence that confirms the content of that speech.

The response sentence selection process of step S15 is now described in more detail with reference to FIG. 3, which is a flowchart of that process.

In FIG. 3, response sentence information created by the comma response sentence creation unit 13 is denoted "A", and response sentence information created by the period response sentence creation unit 22 is denoted "B". When the speech synthesis unit 30 receives response sentence information from at least one of the comma response sentence creation unit 13 and the period response sentence creation unit 22, it starts synthesizing the response sentence indicated by that information.

If the speech synthesis unit 30 receives response sentence information B while synthesizing response sentence information A (that is, while the response sentence indicated by A is being uttered), it interrupts the synthesis of A and starts synthesizing B. Conversely, if it receives response sentence information A while synthesizing response sentence information B (while the response sentence indicated by B is being uttered), it continues synthesizing B without interruption and does not synthesize A. In other words, the speech synthesis unit 30 gives the response sentence indicated by B, created by the period response sentence creation unit 22, priority over the response sentence indicated by A, created by the comma response sentence creation unit 13. The reason is that the response sentence of B responds to the person's sentence as a whole, so it is the more appropriate response at the moment the person has finished speaking a sentence, which is when B has just been created.
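The priority rule of step S15 can be illustrated with a small state holder. The following Python sketch is hypothetical: the class `SpeechSynthesisArbiter`, its method names and its log strings are introduced here for illustration, and no audio is produced. The description does not say what happens when information of the same kind arrives during synthesis, so dropping it is an assumption of the sketch.

```python
from typing import List, Optional, Tuple


class SpeechSynthesisArbiter:
    """Toy model of the S15 selection rule; it records decisions instead of producing audio."""

    def __init__(self) -> None:
        self.current: Optional[Tuple[str, str]] = None  # (kind, text) now being spoken
        self.log: List[str] = []

    def receive(self, kind: str, text: str) -> None:
        """Handle response sentence information of kind "A" (comma) or "B" (period)."""
        if kind not in ("A", "B"):
            raise ValueError("kind must be 'A' or 'B'")
        if self.current is None:
            self._start(kind, text)
        elif self.current[0] == "A" and kind == "B":
            # B arrives while A is being spoken: interrupt A and switch to B.
            self.log.append(f"interrupt {self.current[1]}")
            self._start(kind, text)
        else:
            # A arrives while B is being spoken: B continues and A is not synthesized.
            # Same-kind arrivals are not covered by the description; dropping them
            # is an assumption of this sketch.
            self.log.append(f"drop {text}")

    def finish(self) -> None:
        """Mark the current response as fully spoken."""
        self.current = None

    def _start(self, kind: str, text: str) -> None:
        self.current = (kind, text)
        self.log.append(f"start {text}")
```

Calling `receive("A", ...)` followed by `receive("B", ...)` before `finish()` leaves the B response as the current one, which mirrors the interruption described above.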

When response sentences indicated by response sentence information A would be synthesized consecutively, the response sentence is replaced with a predetermined backchannel interjection (for example, 「うん」, "uh-huh") before synthesis. This gives the conversation more rhythm than always responding by confirming a noun.

As described above, and as shown in FIG. 4, the voice dialogue system 1 according to this embodiment can respond at comma-level breaks through processing centered on utterance section end detection (S5) and utterance timing determination (S6), which enables a well-paced dialogue.

Specifically, comma-level breaks between utterance sections are detected by detecting that the person has not spoken for a time (the first predetermined time) that is shorter than the time used to detect a period (the second predetermined time). As described above, such a break is detected, for example, when the sound pressure level of the person's voice has stayed at or below a certain value for the first predetermined time (for example, 0.3 sec).
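The two-threshold silence detection can be sketched as below. This is a minimal Python sketch under stated assumptions: the input is a sequence of per-frame sound pressure levels, the function name `detect_breaks` and its parameters are hypothetical, and the default of 1.0 sec for the second predetermined time is a placeholder chosen here, since only the 0.3 sec example for the first predetermined time appears above.

```python
import math
from typing import List, Sequence, Tuple


def detect_breaks(
    levels: Sequence[float],
    frame_sec: float,
    level_threshold: float,
    comma_sec: float = 0.3,    # first predetermined time (example value from the text)
    period_sec: float = 1.0,   # second predetermined time (placeholder assumption)
) -> List[Tuple[int, str]]:
    """Return (frame index, "comma" | "period") events for a stream of levels."""
    # Convert the durations to whole frame counts to avoid float accumulation.
    comma_frames = math.ceil(comma_sec / frame_sec - 1e-9)
    period_frames = math.ceil(period_sec / frame_sec - 1e-9)
    events: List[Tuple[int, str]] = []
    silent = 0            # consecutive frames at or below the threshold
    heard_speech = False  # breaks are only meaningful after some speech
    for i, level in enumerate(levels):
        if level > level_threshold:
            silent = 0
            heard_speech = True
            continue
        if not heard_speech:
            continue
        silent += 1
        if silent == comma_frames:
            events.append((i, "comma"))
        if silent == period_frames:
            events.append((i, "period"))
    return events
```

A pause that lasts long enough produces a comma event first and a period event later, consistent with the second predetermined time being longer than the first.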

If a response sentence confirming a noun has already been created when the end of an utterance section is detected, the speech of that response sentence is uttered at the end timing of the utterance section. The response is thus uttered at the timing of a comma in the person's speech, which keeps the dialogue well paced.
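The timing rule above, in which a response is released only when an utterance section ends and a response has already been created, can be sketched as a small gate. The class and method names below are hypothetical, and clearing the pending response once it has been released is an assumption of this Python sketch.

```python
from typing import Optional


class UtteranceTimingGate:
    """Holds a created response sentence until the next utterance section ends."""

    def __init__(self) -> None:
        self._pending: Optional[str] = None

    def response_created(self, text: str) -> None:
        """Called when a response sentence for the latest utterance has been created."""
        self._pending = text

    def on_section_end(self) -> Optional[str]:
        """Called when the end of an utterance section is detected.

        Returns the response to utter now, or None if none has been created yet.
        """
        text, self._pending = self._pending, None
        return text
```

With this gate, the response to one utterance section is released at the end of the following section, and nothing is uttered when no response was created in between.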

Next, the processing flow of the voice dialogue system 1 described above is explained with reference to FIGS. 5 to 18, using an example of a person's speech.

The example assumes that the person says the following.

「主人がね、親戚のところに行くけども、１１日にね、帰ってくるって言ってね、雨の中出掛けたんですけど。。。」 ("My husband, you know, was going to visit relatives, but on the 11th, you know, he said he'd come back, and he went out in the rain...")

In this case, as shown in FIG. 5, the speech recognition processing in the comma speech recognition unit 11 (comma processing) is performed in units of comma-level breaks: 「主人がね」 ("My husband, you know"), 「親戚のところに行くけども」 ("was going to visit relatives, but"), 「１１日にね」 ("on the 11th, you know"), 「帰ってくるって言ってね」 ("he said he'd come back") and 「雨の中出掛けたんですけど」 ("he went out in the rain"). The speech recognition processing in the period speech recognition unit 21 (period processing) is performed in units of period-level breaks, here the whole sentence from 「主人がね」 through 「雨の中出掛けたんですけど」. Comma processing therefore runs five cycles while period processing runs one. The five cycles of comma processing are described below. The description assumes that, in comma processing, all nouns in the speech content of each utterance section are determined to be reliable.

(First cycle, first stage: FIG. 6)
When the person has spoken up to the first utterance 「主人がね」 ("My husband, you know"), the comma speech recognition unit 11 and the utterance timing determination unit 14 detect the end of a comma-level utterance section. At this point the end of the utterance section has been detected, but no response sentence information for the first utterance has been created yet, so the utterance timing determination unit 14 does not cause a response sentence to be uttered via the speech synthesis unit 30. The comma speech recognition unit 11 performs speech recognition on the speech 「主人がね」 in the first utterance section.

(First cycle, second stage: FIG. 7)
The utterance timing determination unit 14 waits for the end of the second utterance section. The noun extraction unit 12 extracts the noun 「ご主人」 ("your husband") from the speech content of the first utterance recognized by the comma speech recognition unit 11. Based on the extracted noun 「ご主人」, the comma response sentence creation unit 13 creates response sentence information for the response sentence 「あーご主人が」 ("Ah, your husband"), which confirms that noun.

(Second cycle, first stage: FIG. 8)
When the person has spoken up to the second utterance 「親戚のところに行くけども」 ("was going to visit relatives, but"), the comma speech recognition unit 11 and the utterance timing determination unit 14 detect the end of a comma-level utterance section. Because response sentence information for the first utterance has been created by this point, the utterance timing determination unit 14 transmits it to the speech synthesis unit 30. The comma speech recognition unit 11 performs speech recognition on the speech 「親戚のところに行くけども」 in the second utterance section.

(Second cycle, second stage: FIG. 9)
The utterance timing determination unit 14 waits for the end of the third utterance section. The noun extraction unit 12 extracts the noun 「親戚」 ("relatives") from the speech content of the second utterance recognized by the comma speech recognition unit 11. Based on the extracted noun 「親戚」, the comma response sentence creation unit 13 creates response sentence information for the response sentence 「親戚ね」 ("Relatives, I see"), which confirms that noun. Meanwhile, the speech synthesis unit 30 synthesizes the speech of the response sentence 「あーご主人が」 ("Ah, your husband") indicated by the response sentence information transmitted from the utterance timing determination unit 14, and utters it via the speaker 5.

(Third cycle, first stage: FIG. 10)
When the person has spoken up to the third utterance 「１１日にね」 ("on the 11th, you know"), the comma speech recognition unit 11 and the utterance timing determination unit 14 detect the end of a comma-level utterance section. Because response sentence information for the second utterance has been created by this point, the utterance timing determination unit 14 transmits it to the speech synthesis unit 30. The comma speech recognition unit 11 performs speech recognition on the speech 「１１日にね」 in the third utterance section.

Here, the speech synthesis unit 30 replaces the response sentence 「１１日」 ("the 11th") indicated by the response sentence information transmitted from the utterance timing determination unit 14 with the speech of the standard backchannel 「うん」 ("uh-huh"). When results of comma processing occur consecutively in this way, every other response sentence is replaced with a simple backchannel interjection that does not confirm a noun, which gives the conversation rhythm.

(Third cycle, second stage: FIG. 11)
The utterance timing determination unit 14 waits for the end of the fourth utterance section. The noun extraction unit 12 extracts the noun 「１１日」 ("the 11th") from the speech content of the third utterance recognized by the comma speech recognition unit 11. Based on the extracted noun 「１１日」, the comma response sentence creation unit 13 creates response sentence information for the response sentence 「１１日にね」 ("The 11th, I see"), which confirms that noun. Meanwhile, the speech synthesis unit 30 synthesizes the speech of the substituted response sentence 「うん」 ("uh-huh") and utters it via the speaker 5. Thus, in the response sentence selection process of the speech synthesis unit 30, when results of comma processing occur consecutively, a response that confirms a noun and the standard backchannel 「うん」 are synthesized alternately.

The standard backchannel can be provided, for example, by storing response sentence information indicating the backchannel response in the storage unit 3 in advance and having the speech synthesis unit 30 synthesize the speech of the response sentence (backchannel) indicated by that information. The backchannel is not limited to the example 「うん」 above. Other backchannels may be prepared, and a plurality of backchannels may be prepared and synthesized in a predetermined order or at random.
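The alternation between noun-confirming responses and a standard backchannel can be sketched as follows. This Python sketch is illustrative only: the function name `apply_backchannel` is hypothetical, `None` stands for an utterance for which no response sentence was created, and restarting the alternation after such a gap is an assumption made here so that the sketch matches the worked example, in which 「雨ね」 is uttered as-is after the fourth utterance produces no response.

```python
from typing import List, Optional, Sequence


def apply_backchannel(
    responses: Sequence[Optional[str]],
    backchannel: str = "うん",
) -> List[Optional[str]]:
    """Replace every other response in a consecutive run with a backchannel."""
    out: List[Optional[str]] = []
    run = 0  # number of consecutive comma-level responses seen so far
    for response in responses:
        if response is None:
            run = 0           # a gap ends the run (assumption of this sketch)
            out.append(None)
            continue
        out.append(backchannel if run % 2 == 1 else response)
        run += 1
    return out
```

A list of several stored backchannels, chosen in a fixed order or at random, could be substituted for the single `backchannel` argument.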

(Fourth cycle, first stage: FIG. 12)
When the person has spoken up to the fourth utterance 「帰ってくるって言ってね」 ("he said he'd come back"), the comma speech recognition unit 11 and the utterance timing determination unit 14 detect the end of a comma-level utterance section. Because response sentence information for the third utterance has been created by this point, the utterance timing determination unit 14 transmits it to the speech synthesis unit 30. The comma speech recognition unit 11 performs speech recognition on the speech 「帰ってくるって言ってね」 in the fourth utterance section.

(Fourth cycle, second stage: FIG. 13)
The utterance timing determination unit 14 waits for the end of the fifth utterance section. Because the speech content of the fourth utterance recognized by the comma speech recognition unit 11 contains no noun, the noun extraction unit 12 cannot extract a noun, and the comma response sentence creation unit 13 accordingly creates no response sentence information for the fourth utterance. Meanwhile, the speech synthesis unit 30 synthesizes the speech of the response sentence 「１１日にね」 ("The 11th, I see") indicated by the response sentence information output from the utterance timing determination unit 14, and utters it via the speaker 5.

(Fifth cycle, first stage: FIG. 14)
When the person has spoken up to the fifth utterance 「雨の中出掛けたんですけど」 ("he went out in the rain"), the comma speech recognition unit 11 and the utterance timing determination unit 14 detect the end of a comma-level utterance section. At this point the end of the utterance section has been detected, but no response sentence information for the fourth utterance has been created, so the utterance timing determination unit 14 does not cause a response sentence to be uttered via the speech synthesis unit 30. The comma speech recognition unit 11 performs speech recognition on the speech 「雨の中出掛けたんですけど」 in the fifth utterance section.

(Fifth cycle, second stage: FIG. 15)
The utterance timing determination unit 14 waits for the end of the sixth utterance section. The noun extraction unit 12 extracts the noun 「雨」 ("rain") from the speech content of the fifth utterance recognized by the comma speech recognition unit 11. Based on the extracted noun 「雨」, the comma response sentence creation unit 13 creates response sentence information for the response sentence 「雨ね」 ("Rain, I see"), which confirms that noun.

Meanwhile, in the fifth cycle the person finishes speaking the whole sentence, from 「主人がね」 through 「雨の中出掛けたんですけど。。。」, so the period speech recognition unit 21 also detects the end of a period-level utterance section. As shown in FIG. 16, the period speech recognition unit 21 therefore performs speech recognition processing (period processing) on the speech of the whole sentence, from the first utterance to the fifth.

Suppose that the period speech recognition unit 21 determines that the reliable "case element + case" is 「主人が」 ("husband" with the nominative marker) and that the reliable verb is 「出掛けた」 ("went out"). In this case, the period response sentence creation unit 22 generates response sentence information for the response sentence 「ご主人が出掛けたんだね。」 ("So your husband went out."), which is the most reliable combination of "verb" and "case element + case", and transmits the generated response sentence information to the speech synthesis unit 30.

If only the "verb" is reliable, the period response sentence creation unit 22 generates response sentence information for the response sentence 「出掛けたんだ。」 ("So he went out."); if only the "case element + case" is reliable, it generates response sentence information for the sentence 「ご主人がね。」 ("Your husband, I see."). If neither the "verb" nor the "case element + case" is reliable, it generates no response sentence information.

(Sixth cycle)
In the sixth cycle, depending on when the comma response sentence creation unit 13 and the period response sentence creation unit 22 create and transmit their response sentence information, the synthesis of the response sentence 「雨ね」 ("Rain, I see") from the comma response sentence creation unit 13 and the synthesis of the response sentence 「ご主人が出掛けたんだね」 ("So your husband went out") from the period response sentence creation unit 22 may conflict in the speech synthesis unit 30.

As described above, when the speech synthesis unit 30 has received only one of the two pieces of response sentence information, it synthesizes the speech of that response sentence; when the two conflict, it gives priority to the response sentence information from the period response sentence creation unit 22. For example, as shown in FIG. 17, when only the response sentence information from the comma response sentence creation unit 13 has been received, the speech synthesis unit 30 synthesizes the response sentence 「雨ね」 ("Rain, I see") indicated by that information.

On the other hand, as shown for example in FIG. 18, suppose the speech synthesis unit 30 first receives only the response sentence information from the comma response sentence creation unit 13 and, while synthesizing the response sentence 「雨ね」 ("Rain, I see") indicated by it, receives the response sentence information from the period response sentence creation unit 22. The speech synthesis unit 30 then interrupts the synthesis of 「雨ね」 and starts synthesizing the response sentence 「ご主人が出掛けたんだね」 ("So your husband went out") indicated by the information from the period response sentence creation unit 22. Thus, even in the middle of an utterance, the output is forcibly switched to the higher-priority response sentence, producing, for example, 「あめ、ご主人が出掛けたんだね」 ("Rain, so your husband went out").

To summarize, the dialogue for the example sentence proceeds as follows. "A" denotes what the person says, and "B" denotes what the voice dialogue system 1 says.

A: My husband, you know, (noun extraction: husband) was going to visit relatives, but,
B: Ah, your husband (comma processing result)
A: on the 11th, you know, (noun extraction: the 11th)
B: Uh-huh (comma processing result)
A: he said he'd come back,
B: The 11th, I see (comma processing result)
A: and he went out in the rain... (noun extraction: rain)
B: So your husband went out (period processing result)
A: And then suddenly...

In this way, this embodiment responds frequently to show the person that the voice dialogue system 1 is listening, which makes the dialogue more likely to continue. At commas, frequent confirmation with short response sentences encourages the person to keep talking without interrupting them. At periods, confirmation with a longer response sentence that sums up the sentence gives the person, as the speaker, a sufficient sense of being in a dialogue.

In contrast, when this embodiment is not applied, the dialogue for the example sentence proceeds as follows.

A: My husband, you know, was going to visit relatives, but on the 11th, you know, he said he'd come back, and he went out in the rain...
B: So your husband went out
(or "Your husband, I see", "So he went out", etc.)

That is, no response at all is given until the whole sentence up to the period has been spoken, which dampens the person's motivation to talk.

As described above, this embodiment comprises speech recognition means (corresponding to the comma speech recognition unit 11 and the period speech recognition unit 21) that recognizes the speech in a speech section in which a person is speaking, and speech utterance means (corresponding to the noun extraction unit 12, the comma response sentence creation unit 13, the utterance timing determination unit 14, the period response sentence creation unit 22 and the speech synthesis unit 30) that utters a response speech according to the speech recognition result of the speech recognition means. When the time during which no next speech section starts after a speech section ends reaches the first predetermined time, the speech recognition means recognizes the speech in the comma-unit section that includes the ended speech section. When that time reaches the second predetermined time, which is longer than the first predetermined time, the speech recognition means recognizes the speech in the period-unit section that includes the ended speech section.

Periods and commas can thus be distinguished by the time that elapses between the end of one speech section and the start of the next, and the system can respond at each comma while the person is still speaking. An appropriate response can therefore be given even when the person's utterance is long.

In addition, in this embodiment, when the utterance timing of a response speech based on a comma-unit speech recognition result conflicts with the utterance timing of a response speech based on a period-unit speech recognition result, the response speech based on the period-unit speech recognition result is uttered preferentially. This allows an appropriate speech to be uttered according to the punctuation.

<Other embodiments of the present invention>
In the above embodiment, commas and periods are determined by whether the time without speech (the time during which the sound pressure level of the voice is at or below a certain value) has reached a predetermined time, but the determination is not limited to this. For example, commas and periods may be determined as in (Modification 1) or (Modification 2) described next.

(Modification 1: Discrimination by frequency)
A comma may be determined when the proportion of the spectrum contained in a specific frequency band is at or below a first proportion, and a period may be determined when that proportion is at or below a second proportion that is lower than the first proportion. For example, a comma is determined when the proportion of the spectrum contained in the specific frequency band is 20% or less of the whole, and a period is determined when it is 10% or less.

Specifically, the comma speech recognition unit 11 and the utterance timing determination unit 14 detect the end of a comma-level utterance section when, in the speech indicated by the speech information output from the microphone 4, the proportion of the spectrum contained in the specific frequency band is at or below the first proportion. The period speech recognition unit 21 detects the end of a period-level utterance section when, in the speech indicated by the speech information output from the microphone 4, the proportion of the spectrum contained in the specific frequency band is at or below the second proportion.

The specific frequency band may be set in advance to any frequency band in which the proportion of the spectrum is expected to be high while a person is speaking.
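The threshold logic of Modification 1 can be sketched as follows. This Python sketch makes several assumptions that are not in the description: the spectrum is given as parallel lists of frequencies and power values, the "proportion" is computed as in-band power divided by total power, and the 300 to 3400 Hz band is only a placeholder for the specific frequency band. The 20% and 10% defaults are the example values given above. Because any proportion at or below the second proportion is also at or below the first, the period test is applied first.

```python
from typing import Sequence, Tuple


def band_ratio(
    freqs_hz: Sequence[float],
    power: Sequence[float],
    band: Tuple[float, float] = (300.0, 3400.0),  # placeholder band (assumption)
) -> float:
    """Fraction of the total spectral power that lies inside `band`."""
    low, high = band
    total = sum(power)
    if total <= 0:
        return 0.0  # no energy at all is treated as "nothing in the band"
    in_band = sum(p for f, p in zip(freqs_hz, power) if low <= f <= high)
    return in_band / total


def classify_break(ratio: float, first_ratio: float = 0.20, second_ratio: float = 0.10) -> str:
    """Map an in-band proportion to "period", "comma" or "speech"."""
    if ratio <= second_ratio:
        return "period"
    if ratio <= first_ratio:
        return "comma"
    return "speech"
```

For instance, a frame whose in-band proportion is 0.15 is classified as a comma, and one at 0.05 as a period.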

(Modification 2: Discrimination by HMM (Hidden Markov Model))
Various HMMs are prepared in advance, such as a comma recognition HMM that calculates the likelihood of a comma from the speech, a period recognition HMM that calculates the likelihood of a period from the speech, and an utterance recognition HMM that calculates the likelihood that speech is in progress. When the likelihood calculated from the speech by the comma recognition HMM is the highest, a comma is determined. When the likelihood calculated by the period recognition HMM is the highest, a period is determined. When the likelihood calculated by the utterance recognition HMM is the highest, it is determined that the person is still speaking and that there is neither a comma nor a period.

Specifically, the comma recognition HMM, the period recognition HMM and the utterance recognition HMM are generated in advance by training, and information on these HMMs is stored in the storage unit 3. The comma speech recognition unit 11 and the utterance timing determination unit 14 input the speech indicated by the speech information output from the microphone 4 to each HMM and, among the likelihoods output by the HMMs, detect the end of a comma-level utterance section when the likelihood calculated by the comma recognition HMM is the highest. Similarly, the period speech recognition unit 21 inputs the speech indicated by the speech information output from the microphone 4 to each HMM and, among the likelihoods output by the HMMs, detects the end of a period-level utterance section when the likelihood calculated by the period recognition HMM is the highest.
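The decision rule of Modification 2 amounts to picking the model with the highest likelihood. The Python sketch below illustrates only that selection step: the trained HMMs are represented by arbitrary scoring callables, and the function name `classify_segment`, the labels and the feature representation are all assumptions introduced here. Training and evaluating real HMMs is outside the sketch.

```python
from typing import Callable, Mapping, Sequence

# A scorer stands in for a trained HMM: it maps features to a (log-)likelihood.
Scorer = Callable[[Sequence[float]], float]


def classify_segment(features: Sequence[float], models: Mapping[str, Scorer]) -> str:
    """Return the label of the model that assigns the highest likelihood."""
    if not models:
        raise ValueError("at least one model is required")
    scores = {label: score(features) for label, score in models.items()}
    return max(scores, key=lambda label: scores[label])
```

With models labelled, for example, "comma", "period" and "speech", a result of "comma" or "period" would correspond to detecting the end of the respective utterance section, and "speech" to the person still speaking.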

As described above for this embodiment and the other embodiments, when the speech uttered by a person is determined to be the first speech pattern determined in advance as a comma (the time during which the next speech section has not started has reached the first predetermined time, the spectrum ratio is equal to or less than the first ratio, or the likelihood calculated by the comma-recognition HMM is the highest), the speech in the comma-delimited section up to the point of that determination is recognized. When the speech uttered by a person is determined to be the second speech pattern determined in advance as a period (the time during which the next speech section has not started has reached the second predetermined time, the spectrum ratio is equal to or less than the second ratio, or the likelihood calculated by the period-recognition HMM is the highest), the speech in the period-delimited section up to the point of that determination is recognized.
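As an illustration of the pause-duration variant of the two speech patterns, the following Python sketch classifies a silence by comparing it against the first and second predetermined times. The function name `classify_pause` and the numeric values in the example are assumptions chosen for illustration only; the sketch relies only on the second predetermined time being longer than the first.

```python
from typing import Optional


def classify_pause(
    silence_sec: float, first_time: float, second_time: float
) -> Optional[str]:
    """Classify the silence that follows the end of a speech section.

    Returns "period" once the silence has reached the second predetermined
    time, "comma" once it has reached only the first predetermined time,
    and None while the silence is still shorter than both.
    """
    if not 0 < first_time < second_time:
        raise ValueError("second_time must be longer than first_time (> 0)")
    if silence_sec >= second_time:
        return "period"
    if silence_sec >= first_time:
        return "comma"
    return None


# Illustrative thresholds only; actual values are not given here.
FIRST_TIME = 0.3   # seconds, hypothetical
SECOND_TIME = 1.0  # seconds, hypothetical

short_pause = classify_pause(0.5, FIRST_TIME, SECOND_TIME)
long_pause = classify_pause(1.2, FIRST_TIME, SECOND_TIME)
```

In a running system the silence grows over time, so the same pause would first be reported as a comma and later, if the person does not resume speaking, as a period.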

With this configuration, commas and periods are recognized separately according to whether the first or the second speech pattern is detected, so the system can respond at each comma while the person is still speaking. Therefore, even when a person's utterance is long, an appropriate response can be given.

Note that the present invention is not limited to the above embodiments and can be modified as appropriate without departing from its spirit.

Description of Symbols

1 Voice dialogue system
2 Control unit
3 Storage unit
4 Microphone
5 Speaker
6 I/O port
11 Comma speech recognition unit
12 Noun extraction unit
13 Comma response sentence creation unit
14 Utterance timing determination unit
21 Period speech recognition unit
22 Period response sentence creation unit
30 Speech synthesis unit

Claims

A voice dialogue system comprising:
speech recognition means for recognizing speech uttered by a person in externally input speech; and
speech utterance means for uttering a response speech according to a speech recognition result obtained by the speech recognition means,
wherein the speech recognition means
determines whether the externally input speech is a first speech pattern determined in advance as a comma or a second speech pattern determined in advance as a period,
when the first speech pattern is determined, recognizes the speech in the comma-delimited section up to the point at which the first speech pattern was determined, and
when the second speech pattern is determined, recognizes the speech in the period-delimited section up to the point at which the second speech pattern was determined.

The voice dialogue system according to claim 1,
wherein the speech recognition means
determines the speech sections in which the person is speaking,
determines, as the first speech pattern, that the time during which the next speech section has not started after the end of a speech section has reached a first predetermined time, and
determines, as the second speech pattern, that the time during which the next speech section has not started after the end of a speech section has reached a second predetermined time longer than the first predetermined time.

The voice dialogue system according to claim 1 or 2,
wherein, when the utterance timing of the response speech according to the comma-unit speech recognition result and the utterance timing of the response speech according to the period-unit speech recognition result conflict, the speech utterance means preferentially utters the response speech according to the result.

The voice dialogue system according to any one of claims 1 to 3,
wherein the speech utterance means utters, as the response speech according to the comma-unit speech recognition result, the speech of a response sentence for confirming the speech content in the comma-delimited section, and
when response speeches according to comma-unit speech recognition results would be uttered consecutively, the speech utterance means utters a backchannel speech as the response speech instead of the speech of the response sentence.

The voice dialogue system according to any one of claims 1 to 4,
wherein the speech utterance means has noun extraction means for extracting, based on the comma-unit speech recognition result, a noun included in the speech content in the comma-delimited section, and
the speech utterance means utters, as the response speech according to the comma-unit speech recognition result, the speech of a response sentence for confirming the noun extracted by the noun extraction means.

The voice dialogue system according to any one of claims 1 to 5,
wherein the speech utterance means has response sentence creation means for creating, as the content of the response speech according to the comma-unit speech recognition result, a response sentence for confirming the speech content in the comma-delimited section, and
the speech utterance means,
if a response sentence has been created by the response sentence creation means when the first speech pattern is determined, utters the speech of the created response sentence as the response speech according to the comma-unit speech recognition result, and
if a response sentence has not been created by the response sentence creation means at that time, and a response sentence has been created by the response sentence creation means when the first speech pattern is next determined, utters the speech of the created response sentence as the response speech according to the comma-unit speech recognition result.

A voice dialogue method comprising:
a speech recognition step of recognizing speech uttered by a person in externally input speech; and
a speech utterance step of uttering a response speech according to a speech recognition result of the speech recognition step,
wherein the speech recognition step includes
determining whether the externally input speech is a first speech pattern determined in advance as a comma or a second speech pattern determined in advance as a period,
when the first speech pattern is determined, recognizing the speech in the comma-delimited section up to the point at which the first speech pattern was determined, and
when the second speech pattern is determined, recognizing the speech in the period-delimited section up to the point at which the second speech pattern was determined.