WO2005124738A1

WO2005124738A1 - Voice dialog system and voice dialog method

Info

Publication number: WO2005124738A1
Application number: PCT/JP2004/008772
Authority: WO
Inventors: Kazuya Nomura; Ryo Mochizuki; Hirofumi Nishimura
Original assignee: Matsushita Electric Industrial Co., Ltd.
Priority date: 2004-06-16
Filing date: 2004-06-16
Publication date: 2005-12-29

Abstract

A voice dialog system (1) has a speaker (20) capable of outputting a system side voice to a user, a microphone (21) for converting a voice produced by the user into a voice signal according to the system side voice outputted from the speaker (20), a section (23) for recognizing the user voice inputted to the microphone (21), a voice production timing detecting section (27) for detecting voice production timing based on the voice signal produced by converting the user voice through the microphone (21) and a response aural signal from a response generating section (26), a familiarization level judging section (50) for judging the level of familiarization of voice dialog of the user by using the voice production timing, and a voice output altering section (60) for altering the output content of the system side voice depending on the level of familiarization judged by the familiarization level judging section (50).

Description

Spoken dialogue system and spoken dialogue method

The present invention relates to a voice dialogue system and a voice dialogue method for performing a dialogue between a system and a user using each other's voices.

Background Art

Conventionally, this type of voice dialogue system has included a microphone that captures the input voice from the user (speaker), a speaker that outputs the system's voice response, a voice response removal unit that removes the voice response superimposed on the user's input voice, a voice recognition unit that takes in the output of the voice response removal unit and recognizes the user's utterance, a dialogue control unit that selects and controls the voice response corresponding to the recognized voice, and a voice response unit that actually outputs the voice response to the speaker and to the voice response removal unit, thereby enabling a voice dialogue between the user and the system. Such a system has a barge-in function, which recognizes the user's voice even while the voice dialogue system is outputting a voice response (see, for example, Japanese Unexamined Patent Publication No. 2001-296890).

However, in such a conventional voice dialogue system, a user who is accustomed to the system can use the barge-in function and speak while the system's voice response is still being output, and the system can still grasp the content of the user's utterance. If the output of the voice response is fixed at a level suited to users who are not accustomed to the system, an accustomed user who starts speaking in the middle of the voice response finds that the response continues to be output even after the utterance has ended, so a waiting time occurs. Conversely, if the level of the voice response is raised in order to reduce the waiting time, the system becomes difficult to use for users who are not accustomed to it.

The present invention has been made in order to solve such a problem, and its object is to provide a voice dialogue system and a voice dialogue method capable of changing the system-side voice output according to the user's proficiency in using the voice dialogue system.

Disclosure of the invention

The voice dialogue system of the first invention has a voice output unit capable of outputting system-side voice to a user, a microphone that converts the user voice, uttered by the user in response to the system-side voice output from the voice output unit, into a voice signal, a voice recognition unit that recognizes the user voice input to the microphone, a proficiency determination unit that determines the user's proficiency in voice dialogue based on the voice signal converted from the user voice by the microphone, and a voice output changing unit that changes the output content of the system-side voice according to the proficiency determined by the proficiency determination unit.

With this configuration, it is possible to provide a dialog system in which the output of the system-side voice can be changed according to the user's proficiency in using the dialog system.

The voice dialogue system of the second invention has a voice output unit capable of outputting system-side voice to a user, a microphone that converts the user voice, uttered by the user in response to the system-side voice output from the voice output unit, into a voice signal, a voice recognition unit that recognizes the user voice input to the microphone, a proficiency determination unit that determines the user's proficiency in voice dialogue based on the voice signal converted from the user voice by the microphone, and a voice output changing unit that changes the output content of the system-side voice according to the proficiency determined by the proficiency determination unit. The voice output changing unit changes the output content of the system-side voice between at least two output contents: a detailed output content and an output content simpler than the detailed one.

With this configuration, for a user who is not familiar with the voice dialogue system, the system outputs system-side voice with detailed content so that the user can easily understand how to use it. For a user who is familiar with the voice dialogue system, the system outputs system-side voice with simple content, which eliminates or reduces the waiting time and allows smooth use.
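The switch between detailed and simple output content described for the second invention can be pictured with a minimal Python sketch. The prompt wording, the `PROMPTS` table and the `select_prompt` function are hypothetical names introduced here for illustration only; the source does not specify any particular prompts or data layout.

```python
# Hypothetical prompt table: the wording is illustrative, not taken from the source.
PROMPTS = {
    "destination": {
        "detailed": "Please say the name of your destination, "
                    "for example a city or a facility name.",
        "simple": "Destination?",
    },
}


def select_prompt(topic: str, proficient: bool) -> str:
    """Return the simple wording for proficient users, the detailed one otherwise."""
    level = "simple" if proficient else "detailed"
    return PROMPTS[topic][level]
```

Keeping both wordings under one topic key means the dialogue logic can stay the same while only the output content changes with the determined proficiency.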

The voice dialogue system of the third invention has a voice output unit capable of outputting system-side voice to a user, a microphone that converts the user voice, uttered by the user in response to the system-side voice output from the voice output unit, into a voice signal, a voice recognition unit that recognizes the user voice input to the microphone, a proficiency determination unit that determines the user's proficiency in voice dialogue based on the voice signal converted from the user voice by the microphone, and a voice output changing unit that changes the output content of the system-side voice according to the proficiency determined by the proficiency determination unit. It further has an utterance timing detection unit that detects the timing of the user's utterance based on the input voice signal, and the proficiency determination unit determines the proficiency using the utterance timing.

With this configuration, the output of the system-side voice can be changed according to the user's proficiency in using the voice dialogue system. In this case, the user's utterance timing is used to determine the proficiency, so once the timing of the voice signal is detected, the proficiency can be determined by simple signal detection and calculation.

The voice dialogue system of the fourth invention has a voice output unit capable of outputting system-side voice to a user, a microphone that converts the user voice, uttered by the user in response to the system-side voice output from the voice output unit, into a voice signal, a voice recognition unit that recognizes the user voice input to the microphone, a proficiency determination unit that determines the user's proficiency in voice dialogue based on the voice signal converted from the user voice by the microphone, and a voice output changing unit that changes the output content of the system-side voice according to the proficiency determined by the proficiency determination unit. It further has an utterance timing detection unit that detects, based on the input voice signal, the utterance start time as the utterance timing, and the proficiency determination unit determines the proficiency using the time difference between the utterance start time and the output start time of the system-side voice.

With this configuration, the output of the system-side voice can be changed according to the user's proficiency in using the voice dialogue system. In this case, the time difference between the start time of the user's utterance and the start time of the system-side voice output by the voice output unit is used to determine the proficiency. It is therefore sufficient to detect the start time of the user's voice signal and the start time of the system's voice output and to calculate the difference between them, so the proficiency can be determined by simple signal detection and calculation.
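The time-difference calculation described for the fourth invention can be sketched as follows. This is only an illustration under stated assumptions: the function names, the 2.0-second threshold, and the rule that a shorter delay indicates higher proficiency are all hypothetical and are not given in this part of the source.

```python
def utterance_delay(system_output_start: float, utterance_start: float) -> float:
    """Time difference (seconds) between the system voice start and the user's utterance start."""
    return utterance_start - system_output_start


def is_proficient_by_timing(system_output_start: float,
                            utterance_start: float,
                            threshold_s: float = 2.0) -> bool:
    """Assumed rule: a user who starts speaking sooner is treated as more proficient.

    threshold_s is a hypothetical value chosen only for this example.
    """
    return utterance_delay(system_output_start, utterance_start) < threshold_s
```

For example, a user who starts speaking 1.5 s after the prompt begins falls under the assumed threshold, while one who waits 4 s does not.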

The voice dialogue system of the fifth invention has a voice output unit capable of outputting system-side voice to a user, a microphone that converts the user voice, uttered by the user in response to the system-side voice output from the voice output unit, into a voice signal, a voice recognition unit that recognizes the user voice input to the microphone, a proficiency determination unit that determines the user's proficiency in voice dialogue based on the voice signal converted from the user voice by the microphone, and a voice output changing unit that changes the output content of the system-side voice according to the proficiency determined by the proficiency determination unit. It further has a use count unit that counts, based on the input voice signal, the cumulative number of times the user has used voice input with voice recognition, and the proficiency determination unit determines the proficiency using the cumulative use count obtained from the use count unit.

With this configuration, the output of the system-side voice can be changed according to the user's proficiency in using the voice dialogue system. In this case, the cumulative use count is used to determine the proficiency, so it is sufficient to detect each input of the user's voice signal and accumulate the count, and the proficiency can be determined by simple signal detection and calculation.

The voice dialogue system of the sixth invention has a voice output unit capable of outputting system-side voice to a user, a microphone that converts the user voice, uttered by the user in response to the system-side voice output from the voice output unit, into a voice signal, a voice recognition unit that recognizes the user voice input to the microphone, a proficiency determination unit that determines the user's proficiency in voice dialogue based on the voice signal converted from the user voice by the microphone, and a voice output changing unit that changes the output content of the system-side voice according to the proficiency determined by the proficiency determination unit. It further has a use frequency calculation unit that calculates, based on the input voice signal, the frequency with which the user uses voice input with voice recognition, and the proficiency determination unit determines the proficiency using the use frequency obtained from the use frequency calculation unit.

With this configuration, the output of the system-side voice can be changed according to the user's proficiency in using the voice dialogue system. In this case, the frequency with which the user uses the voice dialogue system is used to determine the proficiency, so it is sufficient to detect each use of the voice dialogue system from the input of the user's voice signal and calculate the frequency of use, and the proficiency can be determined by simple signal detection and calculation.

The voice dialogue system of the seventh invention has a voice output unit capable of outputting system-side voice to a user, a microphone that converts the user voice, uttered by the user in response to the system-side voice output from the voice output unit, into a voice signal, a voice recognition unit that recognizes the user voice input to the microphone, a proficiency determination unit that determines the user's proficiency in voice dialogue based on the voice signal converted from the user voice by the microphone, and a voice output changing unit that changes the output content of the system-side voice according to the proficiency determined by the proficiency determination unit. It further has an utterance speed calculation unit that calculates the utterance speed of the user voice using voice recognition, and the proficiency determination unit determines the proficiency based on the user's utterance speed obtained from the utterance speed calculation unit.

With this configuration, the output of the system-side voice can be changed according to the user's proficiency in using the voice dialogue system. In this case, the user's utterance speed is used to determine the proficiency, so it is only necessary, for example, to detect the user's utterance start time and utterance end time and calculate the speed from them, and the proficiency can be determined by simple signal detection and calculation.
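A minimal sketch of the utterance speed calculation from the detected start and end times is given below. The choice of unit (for example recognized words or morae) and the function name `utterance_speed` are assumptions made for illustration; the source states only that the start and end times are detected and used in the calculation.

```python
def utterance_speed(unit_count: int, start_s: float, end_s: float) -> float:
    """Speech units per second between the detected utterance start and end times.

    unit_count is assumed to come from the recognizer (e.g. words or morae);
    the source does not say which unit is counted.
    """
    duration = end_s - start_s
    if duration <= 0:
        raise ValueError("utterance end time must be later than the start time")
    return unit_count / duration
```

The resulting value could then be compared with a threshold by the proficiency determination step; the direction and size of that threshold are not specified in this part of the text.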

The voice dialogue system of the eighth invention has a voice output unit capable of outputting system-side voice to a user, a microphone that converts the user voice, uttered by the user in response to the system-side voice output from the voice output unit, into a voice signal, a voice recognition unit that recognizes the user voice input to the microphone, a proficiency determination unit that determines the user's proficiency in voice dialogue based on the voice signal converted from the user voice by the microphone, and a voice output changing unit that changes the output content of the system-side voice according to the proficiency determined by the proficiency determination unit. It further has a cumulative average similarity calculation unit that calculates, based on the input voice signal, a cumulative average similarity using a similarity indicating how closely the content of the user voice uttered in response to the system-side voice resembles the correct response, and the proficiency determination unit determines the proficiency using the cumulative average similarity obtained from the cumulative average similarity calculation unit.

With this configuration, the output of the system-side voice can be changed according to the user's proficiency in using the voice dialogue system. In this case, the cumulative average similarity for the user's use of the voice dialogue system is used to determine the proficiency. It is therefore sufficient to recognize the content of the user voice from the input voice signal, detect, using a threshold or the like, how closely the recognized content resembles the expected response to the system voice's question, and calculate the cumulative average value, so the proficiency can be determined by simple detection and calculation.
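The cumulative average used for the eighth invention (and, in the same way, for the recognition rate of the ninth) can be maintained incrementally, without storing every past value. The sketch below is an assumption about one way to do this; the class name `CumulativeAverage` and the similarity scale of 0.0 to 1.0 are hypothetical.

```python
class CumulativeAverage:
    """Running mean of per-utterance scores (e.g. similarity in the range 0.0 to 1.0)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def update(self, value: float) -> float:
        """Fold one new score into the running mean and return the updated mean."""
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean
```

Each new utterance calls `update` once with its score, and the proficiency determination step reads `mean` whenever it needs the current cumulative average.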

The voice dialogue system of the ninth invention has a voice output unit capable of outputting system-side voice to a user, a microphone that converts the user voice, uttered by the user in response to the system-side voice output from the voice output unit, into a voice signal, a voice recognition unit that recognizes the user voice input to the microphone, a proficiency determination unit that determines the user's proficiency in voice dialogue based on the voice signal converted from the user voice by the microphone, and a voice output changing unit that changes the output content of the system-side voice according to the proficiency determined by the proficiency determination unit. It further has a cumulative average recognition rate calculation unit that calculates, based on the input voice signal, a cumulative average recognition rate using a recognition rate indicating whether the content of the user voice uttered in response to the system-side voice was recognized, and the proficiency determination unit determines the proficiency using the cumulative average recognition rate obtained from the cumulative average recognition rate calculation unit.

With this configuration, the output of the system-side voice can be changed according to the user's proficiency in using the voice dialogue system. In this case, the cumulative average recognition rate for the user's use of the voice dialogue system is used to determine the proficiency, so it is sufficient to detect whether the content of the user voice was recognized and to calculate the cumulative average, and the proficiency can be determined by simple detection and calculation.

The voice dialogue system of the tenth invention has a voice output unit capable of outputting system-side voice to a user, a microphone that converts the user voice, uttered by the user in response to the system-side voice output from the voice output unit, into a voice signal, a voice recognition unit that recognizes the user voice input to the microphone, a proficiency determination unit that determines the user's proficiency in voice dialogue based on the voice signal converted from the user voice by the microphone, and a voice output changing unit that changes the output content of the system-side voice according to the proficiency determined by the proficiency determination unit. It further has a threshold changing unit that, when the proficiency determination unit determines that the proficiency is lower than a predetermined value, changes the threshold used for judging the similarity or the recognition rate so as to lower it.

With this configuration, the output of the system-side voice can be changed according to the user's proficiency in using the voice dialogue system. In this case, the threshold can be changed according to the result of the proficiency determination, and correcting the threshold to an appropriate value when the proficiency is low makes the user's voice easier to recognize.
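The behaviour of the threshold changing unit can be sketched as a small function. Everything numeric here is a hypothetical assumption for illustration (the proficiency floor, the step size and the lower bound); the source says only that the threshold is lowered when the proficiency is below a predetermined value.

```python
def adjust_threshold(threshold: float,
                     proficiency: float,
                     proficiency_floor: float = 0.5,
                     step: float = 0.05,
                     minimum: float = 0.1) -> float:
    """Lower the acceptance threshold when proficiency is below a predetermined value.

    proficiency_floor, step and minimum are illustrative values, not from the source.
    """
    if proficiency < proficiency_floor:
        # Never drop below a lower bound, so recognition does not accept everything.
        return max(minimum, threshold - step)
    return threshold
```

The lower bound is a design choice added in this sketch so that repeated lowering cannot reduce the threshold to zero.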

The voice dialogue system of the eleventh invention has a voice output unit capable of outputting system-side voice to a user, a microphone that converts the user voice, uttered by the user in response to the system-side voice output from the voice output unit, into a voice signal, a voice recognition unit that recognizes the user voice input to the microphone, a proficiency determination unit that determines the user's proficiency in voice dialogue based on the voice signal converted from the user voice by the microphone, and a voice output changing unit that changes the output content of the system-side voice according to the proficiency determined by the proficiency determination unit. The proficiency determination unit determines the proficiency according to the semantic content of the system-side voice output from the voice output unit to the user.

With this configuration, the output of the system-side voice can be changed according to the user's proficiency in using the voice dialogue system. In this case, the user's proficiency can be determined for each kind of question asked by the system-side voice. For example, when the system questions the user by voice, the output content of the system-side voice can be made different, such as simple content or detailed content, depending on whether the question is one the user is unfamiliar with, with the proficiency for that question reflected at the time it is asked.

The voice dialogue system of the twelfth invention has a voice output unit capable of outputting system-side voice to a user, a microphone that converts the user voice, uttered by the user in response to the system-side voice output from the voice output unit, into a voice signal, a voice recognition unit that recognizes the user voice input to the microphone, a proficiency determination unit that determines the user's proficiency in voice dialogue based on the voice signal converted from the user voice by the microphone, and a voice output changing unit that changes the output content of the system-side voice according to the proficiency determined by the proficiency determination unit. It further has a speaker determination unit that recognizes, based on the voice signal, who the speaker is, and the proficiency determination unit uses the speaker determination unit to determine the proficiency for each user.

With this configuration, the output of the system-side voice can be changed according to the user's proficiency in using the voice dialogue system. In this case, the system recognizes which user is using the voice dialogue system and determines the proficiency for the recognized user, so the system-side voice can be changed and output for each user.

The voice dialogue system of the thirteenth invention has a voice output unit capable of outputting system-side voice to a user, a microphone that converts the user voice, uttered by the user in response to the system-side voice output from the voice output unit, into a voice signal, a voice recognition unit that recognizes the user voice input to the microphone, a proficiency determination unit that determines the user's proficiency in voice dialogue based on the voice signal converted from the user voice by the microphone, and a voice output changing unit that changes the output content of the system-side voice according to the proficiency determined by the proficiency determination unit. When the content of the system-side voice output from the voice output unit to the user changes, the voice recognition unit switches to a dictionary of the content that the user is predicted to utter in response to the changed content.

With this configuration, the output of the system-side voice can be changed according to the user's proficiency in using the voice dialogue system. In this case, the dictionary is switched to one containing what the user is predicted to say in response to the changed system-side voice, so compared with a fixed dictionary, false recognition is reduced and the user's response content can be grasped more quickly.
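The dictionary switching of the thirteenth invention can be illustrated with a small lookup sketch. The prompt identifiers, the vocabulary entries and the function names are hypothetical; the source does not describe how dictionaries are stored or keyed.

```python
from typing import Optional

# Hypothetical per-prompt vocabularies: each system prompt activates only the
# words the user is predicted to say in reply.
DICTIONARIES = {
    "ask_destination_type": {"restaurant", "station", "hotel"},
    "confirm": {"yes", "no"},
}


def active_dictionary(prompt_id: str) -> set:
    """Return the vocabulary expected in response to the given system prompt."""
    return DICTIONARIES[prompt_id]


def recognize(candidate: str, prompt_id: str) -> Optional[str]:
    """Accept a recognition candidate only if it is in the currently active dictionary."""
    return candidate if candidate in active_dictionary(prompt_id) else None
```

Restricting matching to the active vocabulary is what reduces false recognition in this sketch: a word that belongs to another prompt's dictionary is rejected rather than misread as a reply.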

The voice dialogue system of the fourteenth invention has a voice output unit capable of outputting system-side voice to a user, a microphone that converts the user voice, uttered by the user in response to the system-side voice output from the voice output unit, into a voice signal, a voice recognition unit that recognizes the user voice input to the microphone, a proficiency determination unit that determines the user's proficiency in voice dialogue based on the voice signal converted from the user voice by the microphone, and a voice output changing unit that changes the output content of the system-side voice according to the proficiency determined by the proficiency determination unit. It further has a response removal unit that removes, from the voice signal input from the microphone, the signal corresponding to the system-side voice output from the voice output unit.

With this configuration, the output of the system-side voice can be changed according to the user's proficiency in using the voice dialogue system. In this case, the signal output from the microphone is a superposition of the system-side voice and the user voice, so removing the part of the signal corresponding to the system-side voice allows the user voice to be recognized more clearly.

In the voice dialogue method of the fifteenth invention, system-side voice is output from a voice output unit to a user, the user voice uttered by the user in response to the system-side voice output from the voice output unit is converted into a voice signal by a microphone, the user voice input from the microphone is recognized, the user's proficiency in voice dialogue is determined based on the voice signal converted by the microphone from the voice of the user responding to the system-side voice, and the output content of the system-side voice is changed according to the determined proficiency.

With this configuration, it is possible to provide a voice dialogue method that can change the output of the system side according to the user's skill in using the voice dialogue system.

The features and advantages of the voice dialogue system and the voice dialogue method according to the present invention will become apparent from the following description with reference to the drawings.

FIG. 1 is a block diagram showing a configuration of a voice interaction system according to a first embodiment of the present invention.

FIG. 2 is a time chart showing the operation of the voice interaction system according to the first and second embodiments of the present invention.

FIG. 3 is a block diagram showing a configuration of a voice interaction system according to a second embodiment of the present invention.

FIG. 4 is a block diagram showing a configuration of a voice interaction system according to a third embodiment of the present invention.

FIG. 5 is a time chart showing the operation of the voice interaction system according to the third embodiment of the present invention.

FIG. 6 is a block diagram showing a configuration of a voice interaction system according to a fourth embodiment of the present invention.

FIG. 7 is a time chart showing the operation of the voice interaction system according to the fourth embodiment of the present invention.

FIG. 8 is a block diagram showing a configuration of a voice interaction system according to a fifth embodiment of the present invention.

FIG. 9 is a time chart showing the operation of the voice interaction system according to the fifth embodiment of the present invention.

FIG. 10 is a block diagram showing a configuration of a voice interaction system according to a sixth embodiment of the present invention.

FIG. 11 is a time chart showing the operation of the voice interaction system according to the sixth embodiment of the present invention.

FIG. 12 is a block diagram showing a configuration of a voice interaction system according to a seventh embodiment of the present invention.

FIG. 13 is a time chart showing the operation of the voice interaction system according to the seventh embodiment of the present invention.

FIG. 14 is a block diagram showing the configuration of the voice interaction system according to the eighth embodiment of the present invention.

FIG. 15 is a time chart showing the operation of the voice interaction system according to the eighth embodiment of the present invention.

FIG. 16 is a block diagram showing the configuration of the voice interaction system according to the ninth embodiment of the present invention.

FIG. 17 is a time chart showing the operation of the voice interaction system according to the ninth embodiment of the present invention.

FIG. 18 is a block diagram showing the configuration of the speech dialogue system according to the tenth embodiment of the present invention.

FIG. 19 is a perspective view of a tenth embodiment of the present invention.

Best mode for carrying out the invention

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the following embodiments, substantially the same components are denoted by the same reference numerals, and repeated description of them is omitted. Here, an example in which the voice dialogue system is applied to a car navigation device is described.

As shown in FIG. 1, a voice dialogue system 1 according to a first embodiment of the present invention includes a speaker (voice output unit) 20 that outputs system-side voice, such as a voice response, to the user, a microphone 21 that converts the voice uttered by the user into a voice signal, a voice response removal unit 22 that removes, from the voice signal output from the microphone 21, the signal corresponding to the voice response output from the speaker 20, a voice recognition unit 23 that recognizes the content of the user's utterance based on the voice signal output from the microphone 21 after the superimposed signal has been removed by the voice response removal unit 22, a dialogue control unit 24 that selects the corresponding response voice based on the content of the user's utterance obtained by the voice recognition unit 23 and controls the dialogue with the user, a response voice database 25, a response generation unit 26 that generates, using the data of the response voice database 25 and based on the output of the dialogue control unit 24, a voice response signal to be output to the speaker 20 and the voice response removal unit 22, an utterance timing detection unit 27 that detects the timing of the user's utterance using the voice response signal and the voice signal from which the superimposed signal has been removed, and a proficiency determination unit 50 that determines the user's proficiency and outputs the result to the response generation unit 26.

Note that the dialogue control unit 24, the response voice database 25 and the response generation unit 26 constitute the voice output changing unit 60 of the present invention, which changes the output content of the system-side voice according to the user's proficiency.

The voice response removal unit 22 has filter coefficient learning means 28 that learns and optimally adjusts the filter coefficients (impulse response) using, for example, the LMS (Least Mean Square)/Newton algorithm, based on the voice signal input from the microphone 21 and the response voice signal input from the response generation unit 26, an adaptive filter 29 that corrects the response voice signal with the impulse response obtained from the filter coefficient learning means 28 and outputs it, and a subtractor 30 that subtracts the output signal of the adaptive filter 29 from the voice signal input from the microphone 21.

The voice recognition unit 23 has acoustic processing means (not shown) that performs acoustic processing on the voice signal input from the microphone 21 after the superimposed voice response has been reduced by the voice response removal unit 22, phoneme identification means (not shown) that selects and identifies the most suitable phoneme candidates based on the minimum units of speech obtained by the acoustic processing means, a dictionary database (not shown) that stores words and other data related to the purpose for which the voice dialogue system 1 is used, and language processing means (not shown) that selects word candidates based on the phonemes obtained by the phoneme identification means and the data of the dictionary database, and executes language processing to obtain correct sentences using linguistic information such as syntax, meaning and context.
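The structure of the voice response removal unit 22 (coefficient learning 28, adaptive filter 29, subtractor 30) corresponds to a standard adaptive echo canceller. The sketch below uses the plain LMS update rather than the LMS/Newton variant named in the text, and the function name, the number of taps and the step size `mu` are hypothetical choices made only to keep the example short.

```python
def lms_echo_cancel(mic, response, taps=4, mu=0.05):
    """Subtract an adaptively filtered copy of `response` from `mic`.

    mic:      samples from the microphone (user voice plus echoed response)
    response: samples of the response voice signal sent to the speaker
    Returns the residual signal and the final filter coefficients.
    """
    weights = [0.0] * taps   # filter coefficients (role of learning means 28)
    history = [0.0] * taps   # most recent response samples, newest first
    residual = []
    for d, x in zip(mic, response):
        history = [x] + history[:-1]
        # Adaptive filter 29: estimate of the echo contained in the mic sample.
        estimate = sum(w * h for w, h in zip(weights, history))
        # Subtractor 30: what remains is passed on to recognition.
        e = d - estimate
        # Plain LMS coefficient update driven by the residual.
        weights = [w + mu * e * h for w, h in zip(weights, history)]
        residual.append(e)
    return residual, weights
```

When the microphone picks up only a scaled copy of the response signal, the coefficients converge toward that scaling and the residual approaches zero; any user speech present would remain in the residual.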

Incidentally, the acoustic processing means uses, for example, the LPC cepstrum (Linear Predictor Coefficient Cepstrum) or the like to convert the voice signal input from the microphone 21 into a time series of vectors called feature vectors, and estimates the outline of the speech spectrum (the spectral envelope).

Standing

The element identification means is, for example, HMM (HiddenMarkoV

Mode 1: Hidden Markov model)

The speech signal is established using the sound parameters extracted by the sound processing means based on the speech.

5 is performed, and the most suitable phoneme candidate is selected by comparing with a standard phoneme model prepared in advance.ο

The processing means uses syntax such as comparing word dictionaries in the dictionary database based on phoneme candidates to select the most likely word, and specifying connection relationships between words using a language model. It performs processing and semantic processing.

On the other hand, the dialogue control section 24 selectively controls the response content based on the content of the voice signal recognized by the voice recognition section 23 and outputs it to the response generation section 26.

Based on the content determined by the dialogue control unit 24, the response generation unit 26 generates a response voice signal using data from the response voice database 25 and outputs it to the proficiency determination unit 50 and the speaker 20. In addition, depending on whether the proficiency level determined by the proficiency determination unit 50 is high or low, the response generation unit 26 outputs the response voice signal with either more detailed or more simplified response content for the content determined by the dialogue control unit 24, as will be described later in more detail.

The utterance timing detection unit 27 detects the utterance start time (utterance timing) of the user based on the voice signal that is input from the microphone 21 and from which the portion corresponding to the guide voice output from the speaker 20 has been removed by the voice response removing unit 22.

The proficiency determination unit 50 calculates the time difference between the output start time of the response voice signal from the response generation unit 26 and the utterance start time of the user input from the utterance timing detection unit 27. If this time difference is less than a set time, it determines that the user is proficient in using the voice dialogue system 1; if the time difference is equal to or greater than the set time, it determines that the user is not proficient in using it.
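The judgment made by the proficiency determination unit 50 can be sketched as follows. This is only an illustration: the function name, the use of seconds as the time unit, and the 2.0 second set time are assumptions, since the description does not give a concrete value.

```python
def judge_proficiency(guide_start_s: float, utterance_start_s: float,
                      set_time_s: float) -> bool:
    """Return True (proficient) when the user starts speaking less than
    set_time_s after the guide voice starts; otherwise False."""
    time_difference = utterance_start_s - guide_start_s
    return time_difference < set_time_s

SET_TIME_S = 2.0  # hypothetical set time; not specified in the description

late_user = judge_proficiency(10.0, 13.1, SET_TIME_S)   # waits for the full guide
early_user = judge_proficiency(10.0, 11.2, SET_TIME_S)  # answers partway through
```

Note that a time difference exactly equal to the set time counts as not proficient, matching the "less than" and "equal to or greater than" wording above.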

Next, the operation of the voice dialogue system 1 according to the first embodiment of the present invention will be described with reference to FIG. 2.

In FIG. 2, the upper row (a) shows the case where the user is not proficient, the middle row (b) shows the case where the user is proficient, and the lower row (c) shows the case in which the response generation unit 26 changes the output of the voice response after the proficiency determination unit 50 has determined that the user is proficient. In each row of FIG. 2, the horizontal axis is the time axis, with time elapsing in the direction of the arrow; the upper half of each row shows the guide voice of the voice dialogue system 1 and the lower half shows the user voice.

First, in the case shown in FIG. 2(a), where the user is not proficient, the voice dialogue system 1 outputs, to a user who wishes to use the navigation device, a guide voice S10 asking for the destination, such as "Where are you going?" When the output of the guide voice S10 ends, the user answers the question by uttering a user voice U10 indicating that the desired destination is "Yokohama City". The user voice U10 is input to the microphone 21 and converted into a voice signal.

In this case, since the guide voice S10 and the user voice U10 do not overlap in time, the voice signal output from the microphone 21 passes through the voice response removing unit 22 without any signal corresponding to the guide voice being subtracted, and is input to the voice recognition unit 23 and the proficiency determination unit 50.

The voice recognition unit 23 recognizes the content of the user voice U10 based on the voice signal, that is, that the destination is Yokohama City, but where in Yokohama City is not yet clear. The dialogue control unit 24 therefore selects the content of the next question (guide voice) to be put to the user. In other words, since the destination has been recognized as Yokohama City, and since in Yokohama City the administrative level one below "city" is "ward" rather than "town", the dialogue control unit 24 decides to ask which ward the destination is in, in order to narrow the destination down further. Based on this decision, the response generation unit 26 outputs a voice response signal asking which ward it is.

That is, the response generation unit 26 generates (by voice synthesis) a voice response signal using data read from the response voice database 25 based on the signal input from the dialogue control unit 24. This voice response signal is input to the filter coefficient learning means 28, the adaptive filter 29, and the utterance timing detection unit 27, and is also input to the speaker 20, which outputs, as shown in FIG. 2, the guide voice S20 asking for the ward name: "Where in Yokohama City? Please tell me the ward name." The guide voice S20 does not simply say "Please tell me the ward name." but begins with "Where in Yokohama City?" so that the user can confirm that the system has correctly recognized the user's request, Yokohama City.

The user who has heard the above guide voice utters a user voice U20, "Tsuzuki Ward", as the desired ward. In this case, since the user is not familiar with the voice dialogue system 1, the user does not know what to say next without hearing the "Please tell me the ward name" part of the guide voice S20. The user therefore starts responding with the user voice U20 only around the time the "Please tell me the ward name" portion of the guide voice S20 is being output, as shown in the figure.

At this time, the tail of the guide voice S20 output from the speaker 20 and the user voice U20 "Tsuzuki Ward" overlap and are input to the microphone 21. However, since the voice response removing unit 22 removes a signal corresponding to the guide voice from the signal input from the microphone 21, the voice recognition unit 23 can correctly recognize the user voice U20.

At this time, the utterance timing detection unit 27 detects the utterance start time (timing) of the user voice U20, "Tsuzuki Ward", and inputs it to the proficiency determination unit 50.

The proficiency determination unit 50 receives the response voice signal of the guide voice S20 asking for the ward name from the response generation unit 26 and the utterance timing signal of the user voice U20 from the utterance timing detection unit 27, and calculates the time difference T between the output start time of the guide voice S20 and the utterance start time of the user voice U20. In this case, the time difference T is larger than the proficiency determination reference value, and the proficiency determination unit 50 determines that the user is not proficient in using the voice dialogue system 1. As a result of this determination, the response generation unit 26 outputs the above guide voice S20 unchanged in the next voice dialogue as well.

On the other hand, as shown in FIG. 2(b), after the guide voice S10 and the user voice U10 are exchanged in the same manner as in FIG. 2(a), the dialogue control unit 24 decides to have the response generation unit 26 output a voice response signal asking "Where in Yokohama City? Please tell me the ward name."

Suppose that at this stage the user already understands what is being asked and utters the user voice U20, "Tsuzuki Ward", when the guide voice S21 has been output only as far as "Where in Yokohama City".

At this time, the "Where in Yokohama City..." portion of the guide voice S21 and the user voice U20 overlap and are input to the microphone 21. However, the voice response removing unit 22 removes the signal corresponding to the guide voice from the signal input to the microphone 21, so the voice recognition unit 23 can correctly recognize the user voice U20.

At the same time, the utterance timing detection unit 27 detects the utterance start time (timing) of the user voice U20, "Tsuzuki Ward", and inputs it to the proficiency determination unit 50.

The proficiency determination unit 50 receives the signal of the guide voice S21 asking for the ward name from the response generation unit 26 and the utterance timing signal of the user voice U20 from the utterance timing detection unit 27, and calculates the time difference t between the output start time of the guide voice S21 and the utterance start time of the user voice U20. In this case, the time difference t is smaller than the proficiency determination reference value, and the proficiency determination unit 50 determines that the user is proficient in using the voice dialogue system 1. As a result of this determination, the response generation unit 26 stops the output partway, at "Where in Yokohama City", as shown by the guide voice S21, and does not output the subsequent "Please tell me the ward name." part.

When the user next uses the voice dialogue system 1, as shown in FIG. 2(c), after the guide voice S10 and the user voice U10 are exchanged in the same manner as in FIG. 2(a), the response generation unit 26 outputs, in place of the guide voice S20, a more abbreviated guide voice S22, namely "Yokohama City!". Since the user is familiar with the voice dialogue system 1, the user utters the user voice U20, "Tsuzuki Ward", after hearing only the guide voice S22. In response, the voice dialogue system 1 outputs a more abbreviated guide voice S30, "Tsuzuki Ward!", and the dialogue proceeds. In this way, after the user has been determined to be proficient in using the voice dialogue system 1, the guide voice output by the system is changed to more abbreviated content.

As described above, the voice dialogue system 1 according to the first embodiment of the present invention detects the timing of the user's utterance relative to system-side voice output such as a guide voice, determines from it the user's proficiency in using the voice dialogue system 1, and changes the subsequent system-side voice output according to that proficiency. Because the proficiency is determined from the detected utterance timing of the user, it can be detected easily.

Next, the configuration of a voice interaction system 2 according to a second embodiment of the present invention will be described with reference to FIG.

As shown in FIG. 3, the voice dialogue system 2 according to the second embodiment of the present invention differs from the voice dialogue system 1 of FIG. 1 in that, instead of the proficiency determination unit 50 of the first embodiment, which determines the proficiency based only on the timing detected by the utterance timing detection unit 27, it further has a time-semantic database 32 and a proficiency determination unit 51 that determines the proficiency by adding data from the time-semantic database 32 to the timing.

The time-semantic database 32 holds the meaning of the guide voice over the period from the start of output of the guide voice from the speaker 20 to the start of the user's utterance. The proficiency determination unit 51 determines the user's proficiency by taking into account, in addition to the user's utterance timing from the utterance timing detection unit 27, the meaning of the guide voice output up to the point at which the user responds, based on the data from the time-semantic database 32. That is, it is configured to grasp the user's proficiency while taking the meaning of the guide voice into account, for example by checking whether the user gives an answer that matches the question.
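One way the proficiency determination unit 51 could combine timing with meaning is sketched below. The description does not specify the database layout or the combination rule, so everything here is an assumption: the database is modeled as a sorted list of (seconds from guide start, expected answer type) pairs, and the user is judged proficient only when the answer is both early and of the type the guide voice had asked for by that moment.

```python
from bisect import bisect_right

def meaning_at(time_semantic_db, elapsed_s):
    """Return the meaning conveyed by the guide voice after elapsed_s seconds."""
    offsets = [offset for offset, _ in time_semantic_db]
    index = bisect_right(offsets, elapsed_s) - 1
    return time_semantic_db[index][1] if index >= 0 else None

def judge_proficiency(time_semantic_db, elapsed_s, answer_type, set_time_s):
    """Proficient only if the answer is early AND matches what was asked so far."""
    early = elapsed_s < set_time_s
    matches = meaning_at(time_semantic_db, elapsed_s) == answer_type
    return early and matches

# Hypothetical entries: nothing is asked until 1.0 s, after which a ward is expected.
GUIDE_DB = [(0.0, None), (1.0, "ward")]
SET_TIME_S = 2.0  # hypothetical set time

result_match = judge_proficiency(GUIDE_DB, 1.3, "ward", SET_TIME_S)
result_too_soon = judge_proficiency(GUIDE_DB, 0.4, "ward", SET_TIME_S)
result_wrong_type = judge_proficiency(GUIDE_DB, 1.3, "town", SET_TIME_S)
```

Under these assumptions, a quick answer that does not match the question is not treated as a sign of proficiency, which is the extra check this embodiment adds to the timing-only judgment.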

The operation of the voice dialogue system 2 according to the second embodiment of the present invention follows the same timing chart as FIG. 2. It differs from the operation of the voice dialogue system 1 of FIG. 1 only in that, in FIG. 2(b), it is also determined whether the content of the user voice U20 matches what is being asked by the guide voice S21.

As described above, the voice dialogue system 2 according to the second embodiment of the present invention can change the system-side voice output according to the user's proficiency. For a user who is unfamiliar with the system, it outputs detailed content so that the method of use is easy to understand. For a proficient user, it outputs simple content that omits part of the above, so that the user does not have to wait for the next system-side voice while listening to a guide voice that is already understood. In addition, when judging the user's proficiency, the content of the question in the system-side voice and the content of the user's voice response are checked in combination with the utterance timing, so the accuracy of the proficiency judgment can be further improved compared with the voice dialogue system 1 of the first embodiment.

Next, with reference to FIG. 4, the configuration of the voice dialogue system 3 according to the third embodiment of the present invention will be described.

As shown in FIG. 4, the voice dialogue system 3 according to the third embodiment of the present invention differs from the voice dialogue system 1 in that, in place of the utterance timing detection unit 27 of the voice dialogue system 1 of the first embodiment shown in FIG. 1 and the proficiency determination unit 50 that determines the proficiency based on the utterance timing obtained from the utterance timing detection unit 27, it has a usage count unit 33 that counts the cumulative number of times the user has used the voice dialogue system 3, a usage count storage unit 34 that stores the cumulative usage count, and a proficiency determination unit 52 that determines the proficiency using the cumulative usage count obtained from the usage count unit 33.

Each time the user uses the voice dialogue system 3, the usage count unit 33 adds one to the previous cumulative usage count stored in the usage count storage unit 34 to obtain a new cumulative usage count, inputs the new cumulative usage count to the usage count storage unit 34, where it is rewritten and stored, and also inputs it to the proficiency determination unit 52.

The proficiency determination unit 52 determines the user's proficiency by comparing the cumulative usage count input from the usage count unit 33 with criterion values. In this case, a first set value and a second set value larger than the first set value are provided as the criterion values for the proficiency.

Next, an operation of the voice interaction system 3 according to the third embodiment of the present invention will be described with reference to FIG.

FIG. 5(a) shows the case of an unskilled user, FIG. 5(b) shows the case of a user who has become somewhat proficient, and FIG. 5(c) shows the case of a sufficiently proficient user.

Each time the user uses the voice dialogue system 3, the usage count unit 33 adds one, for the current use, to the cumulative usage count stored in the usage count storage unit 34 to obtain a new cumulative usage count. The usage count unit 33 stores the new cumulative usage count in the usage count storage unit 34 and inputs it to the proficiency determination unit 52.

The proficiency determination unit 52 compares the input cumulative usage count with the first set value and the second set value, which are the criteria for the proficiency determination. If the cumulative usage count is smaller than the first set value, the proficiency determination unit 52 determines that the user is not proficient in using the voice dialogue system 3, and, as shown in FIG. 5(a), the guide voices S10 and S20 are output with their detailed content. The user voices U10 and U20 in this case are as shown in FIG. 5(a).

If the user's usage count increases and the cumulative usage count input from the usage count unit 33 becomes equal to or greater than the first set value and less than the second set value, the proficiency determination unit 52 determines that the user has become proficient in the voice dialogue system 3 to a certain degree, and inputs the result of this determination to the response generation unit 26. As a result of this input, the response generation unit 26 changes the guide voice S20 to a more abbreviated guide voice S23, as shown in FIG. 5(b): for example, to the question "Where in Yokohama City?", omitting "Please tell me the ward name."

If the user's usage count increases further and the cumulative usage count input from the usage count unit 33 becomes equal to or greater than the second set value, the proficiency determination unit 52 determines that the user is sufficiently proficient in the voice dialogue system 3, and inputs the result of the determination to the response generation unit 26. As a result of this input, the response generation unit 26 changes the guide voice S23 to a guide voice S22 whose content is abbreviated further, as shown in FIG. 5(c): for example, to "Yokohama City!", and outputs it from the speaker 20.
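The three-stage judgment by cumulative usage count can be sketched as follows. The class and function names, the set values 3 and 6, and the exact English wording of the guide texts are assumptions for illustration; the description gives only the ordering of the two set values and the gist of each guide voice.

```python
# Guide texts paraphrased from the description, keyed by proficiency stage.
GUIDES = {
    "unfamiliar": "Where in Yokohama City? Please tell me the ward name.",  # S20
    "somewhat proficient": "Where in Yokohama City?",                       # S23
    "proficient": "Yokohama City!",                                         # S22
}

class UsageCounter:
    """Stands in for the usage count unit 33 plus the usage count storage unit 34."""
    def __init__(self):
        self.stored_count = 0

    def record_use(self):
        self.stored_count += 1   # add one for the current use and store it
        return self.stored_count

def judge_level(cumulative_uses, first_set_value, second_set_value):
    """Three-stage proficiency judgment (first_set_value < second_set_value)."""
    if cumulative_uses < first_set_value:
        return "unfamiliar"
    if cumulative_uses < second_set_value:
        return "somewhat proficient"
    return "proficient"

counter = UsageCounter()
levels = [judge_level(counter.record_use(), 3, 6) for _ in range(6)]
guide_for_sixth_use = GUIDES[levels[-1]]
```

With these hypothetical set values, the first two uses get the full guide S20, uses three to five get S23, and the sixth use onward gets S22.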

As described above, the voice dialogue system 3 according to the third embodiment of the present invention determines the proficiency according to the cumulative usage count and can change the content of the system-side voice, such as the guide voice, according to that proficiency. In this case, the proficiency is determined in three stages (unfamiliar, proficient to some extent, and sufficiently proficient), so a fine-grained response can be made by simplifying the content of the system-side voice step by step.

Next, with reference to FIG. 6, the configuration of the voice dialogue system according to a fourth embodiment of the present invention will be described.

The voice dialogue system 4 according to the fourth embodiment of the present invention shown in FIG. 6 differs from the voice dialogue system 1 according to the first embodiment shown in FIG. 1 in that, in place of the utterance timing detection unit 27 and the proficiency determination unit 50, it has a usage frequency calculation unit 35 that calculates the frequency with which the user uses the voice dialogue system 4, a usage frequency storage unit 36 that stores the usage frequency obtained by the usage frequency calculation unit 35, and a proficiency determination unit 53 that determines the proficiency using the usage frequency obtained from the usage frequency calculation unit 35. The other configuration is the same as that of the voice dialogue system 1.

In other words, each time the user uses the voice dialogue system 4, the usage frequency calculation unit 35 calculates a new usage frequency based on the usage frequency up to that time stored in the usage frequency storage unit 36 and the current use, inputs the newly obtained usage frequency to the usage frequency storage unit 36, where it is rewritten and stored, and also inputs it to the proficiency determination unit 53. The proficiency determination unit 53 determines the user's proficiency by comparing the usage frequency input from the usage frequency calculation unit 35 with the proficiency criterion values. In this case, a third set value and a fourth set value larger than the third set value are set as the criteria for judging the proficiency.

Next, an operation of the voice interaction system 4 according to the fourth embodiment of the present invention will be described with reference to FIG.

FIG. 7(a) shows the case of an unskilled user, FIG. 7(b) shows the case of a user who has become somewhat proficient, and FIG. 7(c) shows the case of a sufficiently proficient user.

Each time the user uses the voice dialogue system 4, the usage frequency calculation unit 35 calculates a new usage frequency based on the usage frequency up to that time stored in the usage frequency storage unit 36 and the current use. The usage frequency calculation unit 35 stores the new usage frequency in the usage frequency storage unit 36 and inputs it to the proficiency determination unit 53.

The proficiency determination unit 53 compares the input usage frequency with the third set value and the fourth set value, which are the criteria for judging the proficiency. If the usage frequency is lower than the third set value, the proficiency determination unit 53 determines that the user is not proficient in using the voice dialogue system 4, and, as shown in FIG. 7(a), the guide voices S10 and S20 are output with the same content as in FIG. 5(a). In this case, the user voices U10 and U20 are also the same as in FIG. 5(a).

If the user's frequency of use increases and the usage frequency input from the usage frequency calculation unit 35 becomes equal to or greater than the third set value and less than the fourth set value, the proficiency determination unit 53 determines that the user has become proficient in the voice dialogue system 4 to some extent, and inputs the determination result to the response generation unit 26. As a result of this input, the response generation unit 26 outputs, as shown in FIG. 7(b), the guide voice S23, in which the content of the guide voice S20 is abbreviated: for example, the content is changed to the question "Where in Yokohama City?", without "Please tell me the ward name."

If the user's frequency of use increases further and the usage frequency input from the usage frequency calculation unit 35 becomes equal to or greater than the fourth set value, the proficiency determination unit 53 determines that the user is sufficiently proficient in the voice dialogue system 4, and inputs the result of the determination to the response generation unit 26. As a result of this input, the response generation unit 26 changes the output, as shown in FIG. 7(c), to the guide voice S22, in which the content of the guide voice S23 is abbreviated further, and outputs it from the speaker 20.

As described above, the voice dialogue system 4 according to the fourth embodiment of the present invention can provide a voice dialogue system capable of changing the output of the system-side voice in accordance with the user's proficiency in using the voice dialogue system. In addition, the usage frequency, which is simple to detect, can be used to judge the proficiency.

Next, with reference to FIG. 8, a description will be given of a configuration of a speech dialog system according to a fifth embodiment of the present invention.

As shown in FIG. 8, the voice dialogue system 5 according to the fifth embodiment of the present invention differs from the voice dialogue system 1 in that, in place of the utterance timing detection unit 27 and the proficiency determination unit 50 of the voice dialogue system 1 according to the first embodiment shown in FIG. 1, it has an utterance speed calculation unit 37 that calculates the utterance speed of the user, an utterance speed storage unit 38 that stores the utterance speed obtained by the utterance speed calculation unit 37, and a proficiency determination unit 54 that determines the proficiency using the utterance speed. The other configuration is the same as that of the voice dialogue system 1.

That is, the utterance speed calculation unit 37 calculates the speed at which the user speaks and inputs it to the utterance speed storage unit 38 and the proficiency determination unit 54. Here, the utterance speed is calculated as, for example, (length of the recognized word) / (utterance time, i.e., the duration of the utterance).

The proficiency determination unit 54 determines whether the user is unfamiliar or proficient depending on whether the utterance speed of the user is lower (slower) or higher (faster) than the criterion value.

Next, with reference to FIG. 9, the operation of the voice dialogue system 5 according to the fifth embodiment of the present invention will be described.

FIG. 9(a) shows the case of a user who is unfamiliar with the voice dialogue system 5, and FIG. 9(b) shows the case of a user who is proficient with the voice dialogue system 5.

The utterance speed calculation unit 37 calculates the utterance speed of the user and inputs it to the utterance speed storage unit 38 and the proficiency determination unit 54. The proficiency determination unit 54 compares the input utterance speed with the criterion value and, if the utterance speed is lower than the criterion value, determines that the user is not proficient in the voice dialogue system 5. As a result, as shown in FIG. 9(a), the response generation unit 26 outputs the polite guide voice S20, with its detailed content, from the speaker 20.

On the other hand, if the utterance speed is equal to or higher than the criterion value, the proficiency determination unit 54 determines that the user is proficient in the voice dialogue system 5. Then, as shown in FIG. 9(b), the response generation unit 26 changes the guide voice S20 to a simple guide voice S23, with part of the guide voice S20 omitted, and outputs it from the speaker 20.
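The utterance speed judgment of the fifth embodiment can be sketched as below. The unit of word length (for example, characters or morae), the criterion value of 3.0, and the function names are assumptions; the description states only the ratio of word length to utterance time and the comparison with a criterion value.

```python
def utterance_speed(word_length: int, start_s: float, end_s: float) -> float:
    """Utterance speed = (length of the recognized word) / (utterance duration)."""
    duration_s = end_s - start_s
    if duration_s <= 0:
        raise ValueError("utterance must have a positive duration")
    return word_length / duration_s

def is_proficient(speed: float, criterion: float) -> bool:
    """Proficient when the speed is equal to or higher than the criterion value."""
    return speed >= criterion

CRITERION = 3.0  # hypothetical criterion, in length units per second

slow = utterance_speed(4, 0.0, 2.0)   # 4 units over 2.0 s
fast = utterance_speed(4, 0.0, 1.0)   # 4 units over 1.0 s
guide = "S23" if is_proficient(fast, CRITERION) else "S20"
```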

As described above, the voice dialogue system 5 according to the fifth embodiment of the present invention can provide a voice dialogue system capable of changing the output of the system-side voice according to the user's proficiency in using the voice dialogue system. In addition, the utterance speed of the user voice, which is easy to detect and calculate, can be used to determine the user's proficiency.

Next, the configuration of a voice interaction system according to a sixth embodiment of the present invention will be described with reference to FIG.

As shown in FIG. 10, the voice dialogue system 6 according to the sixth embodiment of the present invention differs from the voice dialogue system 1 in that, in place of the utterance timing detection unit 27 and the proficiency determination unit 50 of the voice dialogue system 1 of the first embodiment shown in FIG. 1, it has a cumulative average similarity calculation unit 39 that calculates a cumulative average similarity based on the similarity between the content of the user's response, recognized from the user voice signal, and the correct response to the question; a cumulative average similarity storage unit 40 that, each time the cumulative average similarity calculation unit 39 calculates a new cumulative average similarity, rewrites its contents with the new value and stores it; and a proficiency determination unit 55 that determines the proficiency using the cumulative average similarity input from the cumulative average similarity calculation unit 39. In this voice dialogue system 6, the function of the voice recognition unit 23 is enhanced as follows.

In other words, the voice recognition unit 23 recognizes the content of the user voice based on the signal received by the microphone 21 from which the voice response removing unit 22 has removed the superimposed guide voice output from the speaker 20, and inputs it to the dialogue control unit 24. It also receives the content of the guide voice from the dialogue control unit 24 or the response generation unit 26 (this signal line is omitted in the figure), compares the content of the user's response with the correct response to the guide voice, calculates a similarity indicating how close the response content is to the correct response, and inputs it to the cumulative average similarity calculation unit 39.

The cumulative average similarity calculation unit 39 calculates a new cumulative average similarity from the similarity newly input from the voice recognition unit 23 and the cumulative average similarity up to that time stored in the cumulative average similarity storage unit 40, using, for example, the formula (sum of the recognition result similarities) / (number of recognitions), and inputs it to the cumulative average similarity storage unit 40 and the proficiency determination unit 55.

The proficiency level determination unit 55 is configured to compare the input cumulative average similarity with a criterion value and determine the proficiency level based on the level.
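The cumulative average and its comparison with a criterion value can be sketched as follows. The class name, the similarity scale of 0 to 1, and the criterion value of 0.7 are assumptions; only the formula (sum of similarities) / (number of recognitions) comes from the description.

```python
class CumulativeAverage:
    """Keeps the running sum and count so the average can be updated per use."""
    def __init__(self):
        self.total = 0.0   # sum of the recognition result similarities
        self.count = 0     # number of recognitions

    def update(self, similarity: float) -> float:
        self.total += similarity
        self.count += 1
        return self.total / self.count   # new cumulative average similarity

CRITERION = 0.7  # hypothetical criterion value

average = CumulativeAverage()
first = average.update(0.4)    # early, poorly matching answer
second = average.update(0.8)   # later, better matching answer
proficient = second >= CRITERION
```

Storing the sum and count is one simple way to realize the storage unit; storing the previous average together with the count would be equivalent.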

Next, the operation of the voice interaction system 6 according to the sixth embodiment of the present invention will be described with reference to FIG.

FIG. 11(a) shows the case where the user is unfamiliar with the use of the voice dialogue system 6, and FIG. 11(b) shows the case where the user is proficient in using the voice dialogue system 6.

When the user starts using the voice dialogue system 6, the cumulative average similarity calculation unit 39 calculates a new cumulative average similarity based on the similarity obtained by the voice recognition unit 23 for the user's new use of the voice dialogue system 6 and the cumulative average similarity stored in the cumulative average similarity storage unit 40, and inputs it to the cumulative average similarity storage unit 40 and the proficiency determination unit 55.

When the input cumulative average similarity is lower than the criterion value, the proficiency determination unit 55 determines that the user is not proficient, and the response generation unit 26 outputs the polite and detailed guide voice S20 from the speaker 20. On the other hand, when the cumulative average similarity is equal to or greater than the criterion value, the user is determined to be proficient, and the guide voice is changed to the guide voice S23, with simpler content in which part is omitted, and output from the speaker 20.

As described above, the voice dialogue system 6 according to the sixth embodiment of the present invention can provide a voice dialogue system capable of changing the output of the system-side voice according to the user's proficiency in using the voice dialogue system. In addition, the cumulative average similarity, which is easy to detect and calculate, can be used to determine the proficiency.

Next, a configuration of a dialogue system according to a seventh embodiment of the present invention will be described with reference to FIG.

As shown in FIG. 12, the spoken dialogue system 7 according to the seventh embodiment of the present invention differs from the spoken dialogue system 1 according to the first embodiment shown in FIG. 1 in the following respect. In place of the utterance timing detection unit 27 and the proficiency level determination unit 50 of the spoken dialogue system 1, it is provided with a cumulative average recognition rate calculation unit 41 that calculates a cumulative average recognition rate based on a recognition rate indicating how correctly the user responds to the question of the guide voice, a cumulative average recognition rate storage unit 42 that is rewritten with, and stores, the new cumulative average recognition rate each time the cumulative average recognition rate calculation unit 41 calculates one, and a proficiency level determination unit 56 that determines the user's proficiency based on the cumulative average recognition rate. The other components are the same as those of the spoken dialogue system 1.

In this spoken dialogue system 7, the function of the speech recognition unit 23 is enhanced as follows.

That is, the speech recognition unit 23 recognizes the content of the user's voice based on the signal received by the microphone 21, from which the superimposed portion has been removed by the response removal unit 22, and outputs the result to the dialogue control unit 24. At the same time, the content of the guide voice is input to it from the dialogue control unit 24 or the response generation unit 26 (the signal lines are omitted in FIG. 12).

It then compares the user's response with the content of the question asked by the guide voice, calculates a recognition rate indicating how correctly the user responded to the question of the guide voice, and inputs it to the cumulative average recognition rate calculation unit 41.

The cumulative average recognition rate calculation unit 41 is configured to calculate a new cumulative average recognition rate from the recognition rate newly input from the speech recognition unit 23 and the cumulative average recognition rate stored up to that point in the cumulative average recognition rate storage unit 42, using formula (1):

(total of recognition rates) / (number of recognitions)

and to input the result to the cumulative average recognition rate storage unit 42 and the proficiency level determination unit 56.

The proficiency level determination unit 56 is configured to compare the cumulative average recognition rate with the criterion value and to determine the proficiency according to which of the two is larger.
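The running update performed by the cumulative average recognition rate calculation unit 41 and the comparison performed by the proficiency level determination unit 56 can be illustrated with a minimal Python sketch. The class name, the criterion value of 0.7, and the treatment of each recognition rate as a fraction between 0 and 1 are assumptions made for illustration only; the description specifies none of them.

```python
from dataclasses import dataclass


@dataclass
class CumulativeAverage:
    """Hypothetical stand-in for calculation unit 41 plus storage unit 42."""
    total: float = 0.0  # running sum of every recognition rate seen so far
    count: int = 0      # number of recognitions so far

    def update(self, rate: float) -> float:
        """Fold in a new recognition rate and return the new cumulative average."""
        self.total += rate
        self.count += 1
        return self.total / self.count


def is_proficient(cumulative_average: float, criterion: float = 0.7) -> bool:
    """Proficient when the average is equal to or greater than the criterion.

    The criterion value 0.7 is an assumed placeholder.
    """
    return cumulative_average >= criterion


store = CumulativeAverage()
averages = [store.update(rate) for rate in (0.5, 0.75, 1.0)]
```

With these sample rates the user is first judged not proficient and is judged proficient once the average reaches the assumed criterion, at which point the simplified guide voice would be selected.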

Next, with reference to FIG. 13, the operation of the voice interaction system 7 according to the seventh embodiment of the present invention will be described.

FIG. 13(a) shows the case where the user is unfamiliar with the use of the spoken dialogue system 7, and FIG. 13(b) shows the case where the user is proficient in the use of the spoken dialogue system 7.

When the user starts using the spoken dialogue system 7, each time the user newly uses the system, the cumulative average recognition rate calculation unit 41 calculates a new cumulative average recognition rate based on the recognition rate input from the speech recognition unit 23 and the cumulative average recognition rate stored up to that point in the cumulative average recognition rate storage unit 42, and inputs it to the cumulative average recognition rate storage unit 42 and the proficiency level determination unit 56.

The proficiency level determination unit 56 determines that the user is not proficient when the input cumulative average recognition rate is lower than the determination reference value, and the response generation unit 26 outputs a polite and detailed guide voice S20 from the speaker 20. On the other hand, when the cumulative average recognition rate is equal to or greater than the determination reference value, the user is determined to be proficient, and a simplified guide voice, part of whose content is omitted, is output.

As described above, the spoken dialogue system 7 according to the seventh embodiment of the present invention can change the output of the system-side voice according to the user's proficiency in using the spoken dialogue system, and can also use the cumulative average recognition rate, which is easy to detect and calculate, to determine proficiency.

Next, with reference to FIG. 14, the configuration of a speech dialogue system according to an eighth embodiment of the present invention will be described.

As shown in FIG. 14, the spoken dialogue system 8 according to the eighth embodiment of the present invention differs from the spoken dialogue system 1 according to the first embodiment shown in FIG. 1 in the following respect. In place of the utterance timing detection unit 27 and the proficiency level determination unit 50 of the spoken dialogue system 1, it is provided with a proficiency level determination unit 57 that determines the proficiency level using one of the similarity and the recognition rate obtained by the speech recognition unit 23, and a threshold changing unit 43 that, when the proficiency level determined by the proficiency level determination unit 57 is lower than a predetermined value, changes the threshold used to judge the similarity or the recognition rate in the speech recognition unit 23. The other components are the same as those of the spoken dialogue system 1.

The spoken dialogue system 8 contains either the cumulative average similarity calculation unit 39 and the cumulative average similarity storage unit 40 shown in FIG. 10, or the cumulative average recognition rate calculation unit 41 and the cumulative average recognition rate storage unit 42 shown in FIG. 12. The former are used here, and they are omitted from the figure. The function of the speech recognition unit 23 is enhanced as in the spoken dialogue systems 6 and 7 of FIG. 10 and FIG. 12.

Next, the operation of the spoken dialogue system 8 according to the eighth embodiment of the present invention will be described with reference to FIG. 15. When the user starts using the spoken dialogue system 8, as in the case of the spoken dialogue system 6 in FIG. 10, the speech recognition unit 23 performs recognition on the user's voice using the threshold.

The speech recognition unit 23 detects a similarity indicating how similar the user's response is to the correct response to the question of the guide voice. The cumulative average similarity calculation unit calculates the cumulative average similarity based on the input similarity and inputs it to the proficiency level determination unit 57. The proficiency level determination unit 57 determines the proficiency using the cumulative average similarity. When it is determined as a result that the proficiency level is low, the threshold is lowered by the threshold changing unit 43, so that the speech recognition unit 23 can more easily recognize the voice of a user who is not accustomed to using the spoken dialogue system 8.

That is, in the case where the threshold of the similarity is always fixed, as shown in FIG. 15(a), if the proficiency level determination unit 57 judges the user to be proficient in using the spoken dialogue system 8, the similarity is higher than the threshold used for the similarity determination by the speech recognition unit 23, and the user's voice can be recognized. If the user is not proficient in using the spoken dialogue system 8, however, the similarity of the user's voice falls below the threshold used for the similarity determination, and the speech recognition unit 23 cannot recognize the user's voice well.

Therefore, when the proficiency level determination unit 57 determines that the user's proficiency level is low, as shown in FIG. 15(b), the threshold is lowered by the threshold changing unit 43, and thereafter the similarity is determined by the speech recognition unit 23 using the lowered threshold. In this way, even the voice of a user with a low level of proficiency can be recognized more easily by the speech recognition unit 23.
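The effect of the threshold changing unit 43 can be sketched in a few lines of Python. All numeric values here (the fixed threshold 0.7, the step 0.2, the lower bound 0.3, and the sample similarity 0.55) are assumptions chosen for illustration, and the lower bound itself is an added safeguard that the description does not mention.

```python
def recognize(similarity: float, threshold: float) -> bool:
    """The utterance is accepted when its similarity reaches the threshold."""
    return similarity >= threshold


def adjusted_threshold(threshold: float, proficiency_is_low: bool,
                       step: float = 0.2, floor: float = 0.3) -> float:
    """Lower the threshold for a low-proficiency user, never below `floor`.

    `step` and `floor` are hypothetical parameters, not from the description.
    """
    if proficiency_is_low:
        return max(floor, threshold - step)
    return threshold


fixed_threshold = 0.7
novice_similarity = 0.55

# FIG. 15(a) situation: fixed threshold, the novice's utterance is rejected.
before = recognize(novice_similarity, fixed_threshold)

# FIG. 15(b) situation: threshold lowered after a low-proficiency judgment.
lowered = adjusted_threshold(fixed_threshold, proficiency_is_low=True)
after = recognize(novice_similarity, lowered)
```

The same utterance that falls below the fixed threshold is accepted once the threshold has been lowered, which is the behavior the comparison of FIG. 15(a) and FIG. 15(b) describes.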

As described above, the spoken dialogue system 8 according to the eighth embodiment of the present invention can change the output of the system-side voice according to the user's proficiency in using the spoken dialogue system, and can also use the cumulative average similarity, which is easy to detect and calculate, to determine proficiency. In this case, when the user's proficiency is low, the threshold for determining the similarity is lowered, so that the voice of a user who is not proficient can be recognized more easily.

Next, the configuration of the spoken dialogue system according to the ninth embodiment of the present invention will be described with reference to FIG. 16.

As shown in FIG. 16, the spoken dialogue system 9 according to the ninth embodiment of the present invention differs from the spoken dialogue system 1 according to the first embodiment shown in FIG. 1 in the following respect. In place of the proficiency level determination unit 50 of the spoken dialogue system 1, it is provided with a speaker determination unit 44 that determines who the speaker (the user) is, and a proficiency level determination unit 58 that receives the speaker information identified by the speaker determination unit 44 and the utterance timing obtained from the utterance timing detection unit 27 and determines the user's proficiency level. The other components are the same as those of the spoken dialogue system 1.

Although not shown in the figure, a speaker-specific proficiency storage unit is also provided, which receives the speaker information determined by the speaker determination unit 44 and the proficiency level information of that speaker determined by the proficiency level determination unit 58, and stores the proficiency information for each speaker.

Next, the operation of the spoken dialogue system 9 according to the ninth embodiment of the present invention will be described with reference to FIG. 17.

When use of the spoken dialogue system 9 is started, as shown in FIG. 17, the speaker determination unit 44 determines the speaker based on the user voice U10 first uttered by the user, and the proficiency level information of the determined speaker is retrieved from the speaker-specific proficiency storage unit. The guide voice is then output from the speaker 20 according to the proficiency level that was retrieved.

In other words, if the speaker is not proficient, as shown in FIG. 17(a), the guide voice S20 with detailed content is output, and if the speaker is proficient, as shown in FIG. 17(b), the guide voice is changed to a more simplified version and output.

Note that, when the speaker is not proficient, as in the spoken dialogue system 1, the utterance timing detection unit 27 detects the time difference between the output start time of the guide voice S20 and the utterance start time of the user voice U20, and the proficiency level determination unit 58 determines the proficiency level using this utterance timing. If it is determined that the user's proficiency has improved and the user has become accustomed to using the system, the proficiency level of the speaker determined by the speaker determination unit 44 is changed from the unaccustomed level to the accustomed level and stored in the speaker-specific proficiency storage unit. On the other hand, if the proficiency level determined using the utterance timing still remains at the unaccustomed level, the proficiency level of that speaker stored in the speaker-specific proficiency storage unit is not rewritten.
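The per-speaker lookup and the conditional rewrite of the stored proficiency level can be sketched as follows. This is only an illustration under stated assumptions: the two level names, the 1.0 second cutoff, the rule that a shorter delay means higher proficiency, and the choice to treat unknown speakers as unaccustomed are all hypothetical and are not taken from the description.

```python
from dataclasses import dataclass, field

NOVICE, PROFICIENT = "novice", "proficient"
MAX_DELAY_S = 1.0  # assumed cutoff; the description gives no value


def judge_by_timing(guide_start_s: float, utterance_start_s: float) -> str:
    """Assumption: a short delay before the user starts speaking means proficient."""
    delay = utterance_start_s - guide_start_s
    return PROFICIENT if delay <= MAX_DELAY_S else NOVICE


@dataclass
class SpeakerProficiencyStore:
    """Hypothetical speaker-specific proficiency storage unit."""
    levels: dict = field(default_factory=dict)

    def lookup(self, speaker: str) -> str:
        # Speakers never seen before are treated as novices (assumption).
        return self.levels.get(speaker, NOVICE)

    def update(self, speaker: str, guide_start_s: float,
               utterance_start_s: float) -> str:
        # Only a novice is re-judged; the stored level is rewritten
        # only when the timing shows that proficiency has improved.
        if self.lookup(speaker) == NOVICE:
            if judge_by_timing(guide_start_s, utterance_start_s) == PROFICIENT:
                self.levels[speaker] = PROFICIENT
        return self.lookup(speaker)


def guide_for(level: str) -> str:
    """Pick the guide voice variant for a proficiency level."""
    return "detailed guide voice" if level == NOVICE else "simplified guide voice"


store = SpeakerProficiencyStore()
first = guide_for(store.lookup("user_a"))
store.update("user_a", guide_start_s=0.0, utterance_start_s=3.0)
still_novice = store.lookup("user_a")
store.update("user_a", guide_start_s=0.0, utterance_start_s=0.5)
second = guide_for(store.lookup("user_a"))
```

Keying the store by speaker is what lets the guide voice variant be chosen as soon as the speaker has been identified, before any timing measurement for the current session exists.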

As described above, the spoken dialogue system 9 according to the ninth embodiment of the present invention determines the speaker and can output the system-side voice with its content changed according to the proficiency level of each user. Because the speaker can be determined at the stage where the user voice U10 responding to the guide voice S10 is input, the proficiency level can be determined, and the system-side voice output accordingly, at an earlier stage than in the spoken dialogue system 1, which determines the proficiency level from the guide voice S20 and the corresponding user voice U20. In addition, for example, by registering users in advance, it is possible to determine whether or not a user is registered as permitted to use the system, and thus to limit the users of the spoken dialogue system 9.

Next, the configuration of the spoken dialogue system according to the tenth embodiment of the present invention will be described with reference to FIG. 18.

As shown in FIG. 18, the spoken dialogue system 10 according to the tenth embodiment of the present invention differs from the spoken dialogue system 1 according to the first embodiment in that a speech recognition dictionary database 45 holding a plurality of dictionaries with different contents, and a dictionary switching unit 46 that switches the dictionary used in the speech recognition unit 23 based on the response voice signal output by the response generation unit 26, are added to the spoken dialogue system 1.

Next, the operation of the spoken dialogue system 10 according to the tenth embodiment of the present invention will be described with reference to FIG. 19.

When the user starts using the spoken dialogue system 10, the user's voice uttered in response to the guide voice output from the speaker 20 is captured by the microphone 21. As in the case of the spoken dialogue system 1 shown in FIG. 1, the utterance timing is detected from the captured signal by the utterance timing detection unit 27. Using the utterance timing, the proficiency level determination unit 59 determines the user's proficiency, and the response generation unit 26 generates a response voice signal so that the content of the guide voice output from the speaker 20 is changed according to the determined proficiency.

In this case, based on the content of the response voice signal of the response generation unit 26, the dictionary switching unit 46 switches so as to select a dictionary matching that content from the plurality of dictionaries in the speech recognition dictionary database 45.

For example, as shown in FIG. 19(a), suppose that the guide voice S20 "Where in Yokohama? Please tell me the ward name." is output, and that the dictionary D1 of all ward names in Yokohama City is used in anticipation that the response will be a ward name. If the user hears only the first part of the guide voice S20 and does not notice "the ward name" in the second half, or believes that "Shin-Yokohama" is a ward name, and utters the user voice U21 "Shin-Yokohama", it will not be recognized by the speech recognition unit 23, because "Shin-Yokohama" does not appear in the all-ward-name dictionary D1 of Yokohama City.

In contrast, the dictionary switching unit 46 of the spoken dialogue system 10 takes into account that the user may say a town name, which is below the ward level, during the first part of the guide voice S20. As shown in FIG. 19(b), from the start of the output of the guide voice S20 until a predetermined time after the end of the guide voice S20, the dictionary D1 of all ward names in Yokohama City is made available, and, only from the start of the output of the guide voice S20 until partway through the second half, the dictionary D2 of all town names in Yokohama City is also made available to the speech recognition unit 23. As a result, even if the user utters the user voice U21 "Shin-Yokohama", which is not a ward name, the correct place can be recognized using the all-town-name dictionary D2.

Alternatively, as shown in FIG. 19(c), the all-town-name dictionary D2 of Yokohama City may be used from the start of the output of the guide voice S20 until around "ward name", after which the dictionary is switched to the all-ward-name dictionary D1 of Yokohama City. In this way, whether the user says a town name partway through the guide voice S20, or listens to the guide voice S20 to the end, realizes that a ward name is required, and says, for example, "Kanagawa-ku" as the user voice U20, the voice can be recognized by the speech recognition unit 23.
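The time-dependent availability of the two dictionaries in the FIG. 19(b) pattern can be sketched as below. The timing values (the town-name dictionary closing at 2.0 s, the guide voice ending at 3.0 s, a grace period of 2.0 s) and the tiny dictionary contents are hypothetical placeholders, and the function names are introduced here only for illustration.

```python
# Toy dictionary contents; real dictionaries would hold every ward or town name.
DICTIONARIES = {
    "D1": {"Kanagawa-ku", "Naka-ku"},  # ward names
    "D2": {"Shin-Yokohama"},           # town names
}


def active_dictionaries(t: float, d2_until: float = 2.0,
                        guide_end: float = 3.0, grace: float = 2.0) -> list:
    """Dictionaries usable t seconds after output of the guide voice starts.

    D1 stays open until `grace` seconds after the guide voice ends;
    D2 is open only until `d2_until`, partway through the second half.
    All three timing parameters are assumed values.
    """
    active = []
    if 0.0 <= t <= guide_end + grace:
        active.append("D1")
    if 0.0 <= t <= d2_until:
        active.append("D2")
    return active


def recognize(word: str, t: float):
    """Return the name of the dictionary that matched `word`, or None."""
    for name in active_dictionaries(t):
        if word in DICTIONARIES[name]:
            return name
    return None


early_town = recognize("Shin-Yokohama", t=1.0)  # spoken during the first part
late_town = recognize("Shin-Yokohama", t=3.5)   # spoken after D2 has closed
late_ward = recognize("Kanagawa-ku", t=3.5)     # spoken after the guide ends
```

Restricting which dictionaries are open at each moment keeps the active vocabulary small, which is the source of the reduction in false recognition relative to a single fixed dictionary.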

As described above, the spoken dialogue system 10 according to the tenth embodiment of the present invention can change the output of the system-side voice according to the user's proficiency in using the spoken dialogue system. In addition, because the dictionary is switched to one containing the content the user is likely to utter in response to the content of the system-side voice (that is, either a correct response to the system-side voice or a response that is likely to be given in error), false recognition is reduced compared with the case where the dictionary is fixed, and the content of the user's response can be grasped more quickly.

It should be noted that the present invention is not limited to the above embodiment, and some of them may be changed or modified.

That is, in the above-described embodiments, one of the utterance timing, the utterance speed, the cumulative number of uses, the cumulative average similarity, the cumulative average recognition rate, and the like was used as the input to the proficiency level determination unit 100, but the invention is not limited to this.

Also, the proficiency level may be determined and stored for each content of the system-side voice, and the output of the system-side voice may be changed only for the content of the system-side voice for which the user has been determined to be proficient, so that the output is changed only for content the user is really used to.

In addition, the utterance timing detection unit 27 takes the time difference between the output start time of the system-side voice and the utterance start time of the user's voice, but this is not necessarily limited to the start times. It is also possible to detect a time partway through and obtain the time difference from it.

As described above, it is possible to provide an effective spoken dialogue system that can change the output of the system-side voice according to the user's proficiency in using the spoken dialogue system.

Industrial applicability

As described above, the spoken dialogue system and the spoken dialogue method according to the present invention, in which the system and the user interact with each other by voice, have the effect that the output of the system-side voice can be changed according to the user's proficiency, and are useful as a spoken dialogue system.

Claims

1. A spoken dialogue system characterized by having: a voice output unit that can output system-side voice to a user; a microphone for converting the user's voice, uttered by the user in response to the system-side voice output from the voice output unit, into a voice signal; a voice recognition unit for recognizing the user's voice input to the microphone; a proficiency determination unit for determining the user's proficiency in voice dialogue based on the voice signal converted from the user's voice by the microphone; and a voice output changing unit for changing the output of the system-side voice according to the proficiency determined by the proficiency determination unit.

2. The spoken dialogue system according to claim 1, wherein the output of the system-side voice is changed between at least two output contents: detailed output content and output content simpler than the detailed output content.

3. The spoken dialogue system further comprising an utterance timing detection unit for detecting the utterance timing of the user based on the input voice signal, wherein the proficiency determination unit determines the proficiency using the utterance timing.

4. The spoken dialogue system according to claim 3, wherein the utterance timing is the utterance start time of the user, and the proficiency determination unit determines the proficiency using the time difference between the utterance start time and the output start time of the system-side voice.

5. The spoken dialogue system according to claim 1, further comprising a use count unit for counting the cumulative number of times the user has used voice input, using the voice recognition based on the input voice signal, wherein the proficiency determination unit determines the proficiency using the cumulative number of uses obtained from the use count unit.

6. The spoken dialogue system according to claim 1, further comprising a use frequency calculation unit that calculates the use frequency of the user's voice input, using the voice recognition based on the input voice signal, wherein the proficiency determination unit determines the proficiency using the use frequency obtained from the use frequency calculation unit.

7. The spoken dialogue system according to claim 1, further comprising an utterance speed calculation unit that calculates the utterance speed of the user's voice, using the voice recognition based on the input voice signal, wherein the proficiency determination unit determines the proficiency using the utterance speed of the user obtained from the utterance speed calculation unit.

8. The spoken dialogue system according to claim 1, further comprising a cumulative average similarity calculation unit that calculates, based on the input voice signal, a cumulative average similarity using a similarity indicating how similar the content of the user's voice responding to the system-side voice is to the correct response, wherein the proficiency determination unit determines the proficiency using the cumulative average similarity obtained from the cumulative average similarity calculation unit.

9. The spoken dialogue system according to claim 1, further comprising a cumulative average recognition rate calculation unit that calculates, based on the input voice signal, a cumulative average recognition rate using a recognition rate indicating how accurately the content of the user's voice responding to the system-side voice has been recognized with respect to the content of the system-side voice, wherein the proficiency determination unit determines the proficiency using the cumulative average recognition rate obtained from the cumulative average recognition rate calculation unit.

10. The spoken dialogue system according to claim 8, further comprising a threshold changing unit that, when the proficiency determination unit determines that the proficiency is lower than a predetermined value, changes the threshold so as to lower the threshold for determining the similarity or the recognition rate.

11. The spoken dialogue system according to claim 1, wherein the proficiency determination unit determines the proficiency in consideration of the meaning of the content of the system-side voice output from the voice output unit to the user.

12. The spoken dialogue system according to claim 1, further comprising a speaker determination unit that determines who the speaker is based on the input voice signal, wherein the proficiency determination unit determines the proficiency for each user recognized by the speaker determination unit.

13. The spoken dialogue system according to claim 1, wherein, when the content of the system-side voice output to the user from the voice output unit changes, the dictionary is switched to a dictionary of the words the user is expected to utter in response to the changed content.

14. The spoken dialogue system according to claim 1, characterized by having a voice response removal unit that removes, from the voice signal input from the microphone, a signal equivalent to the system-side voice output from the voice output unit.

15. A spoken dialogue method characterized by: outputting system-side voice from a voice output unit to a user; converting, by a microphone, the user's voice uttered by the user in response to the system-side voice output by the voice output unit into a voice signal; recognizing the user's voice input to the microphone; determining the user's proficiency in voice dialogue based on the voice signal converted from the user's voice by the microphone; and changing the output of the system-side voice according to the determined proficiency.