WO2017098940A1 - Voice interaction device and voice interaction method - Google Patents


Info

Publication number
WO2017098940A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
response
utterance
pitch
speech
Prior art date
Application number
PCT/JP2016/085126
Other languages
English (en)
Japanese (ja)
Inventor
嘉山 啓
Original Assignee
ヤマハ株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from JP2015238912A (JP6657888B2)
Priority claimed from JP2015238913A (JP6728660B2)
Priority claimed from JP2015238911A (JP6657887B2)
Priority claimed from JP2016088720A (JP6569588B2)
Application filed by ヤマハ株式会社
Priority to EP16872840.0A (EP3389043A4)
Priority to CN201680071706.4A (CN108369804A)
Publication of WO2017098940A1
Priority to US16/002,208 (US10854219B2)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16: Sound input; Sound output
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10: Prosody rules derived from text; Stress or intonation
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/90: Pitch determination of speech signals

Definitions

  • the present invention relates to a voice dialogue technique for reproducing a response voice to an uttered voice.
  • Patent Document 1 discloses a technique for analyzing utterance contents by voice recognition on a user's uttered voice and synthesizing and reproducing a response voice according to the analysis result.
  • an object of the present invention is to realize a natural voice conversation.
  • a voice interaction method includes a voice acquisition step of acquiring an utterance signal representing an uttered voice, a voice analysis step of specifying a pitch of the uttered voice from the utterance signal, and a response generation step of causing a playback device to reproduce a response voice having a pitch corresponding to the lowest value of the pitch specified in a tail section near the end point of the uttered voice.
  • the voice interaction device includes a voice acquisition unit that acquires an utterance signal representing an uttered voice, a voice analysis unit that specifies a pitch of the uttered voice from the utterance signal, and a response generation unit that causes a playback device to reproduce a response voice having a pitch corresponding to the lowest value of the pitch specified by the voice analysis unit in a tail section near the end point of the uttered voice.
  • the voice interaction method includes a voice acquisition step of acquiring an utterance signal representing an uttered voice, a voice analysis step of specifying a pitch of the uttered voice from the utterance signal, and a response generation step of causing a playback device to reproduce a response voice whose prosody corresponds to the transition of the pitch in a tail section near the end point of the uttered voice.
  • a voice interaction device includes a voice acquisition unit that acquires an utterance signal representing an uttered voice, a voice analysis unit that specifies a pitch of the uttered voice from the utterance signal, and a response generation unit that causes a playback device to reproduce a response voice whose prosody corresponds to the transition of the pitch specified by the voice analysis unit in a tail section near the end point of the uttered voice.
  • the voice interaction method includes a voice acquisition step of acquiring an utterance signal representing an uttered voice, and a response generation step of causing a playback device to selectively reproduce either a first response voice representing an answer to the uttered voice or a second response voice other than the answer.
  • similarly, the voice interaction device includes a voice acquisition unit that acquires an utterance signal representing an uttered voice, and a response generation unit that causes a playback device to selectively reproduce either a first response voice representing an answer to the uttered voice or a second response voice other than the answer.
  • a voice dialogue method is a method of executing a voice dialogue in which a response voice to an uttered voice is reproduced, and includes a voice acquisition step of acquiring an utterance signal representing the uttered voice, a history management step of generating a usage history, and a response generation step of causing a playback device to reproduce a response voice whose prosody corresponds to the usage history.
  • a voice interaction device is a device for executing a voice dialogue in which a response voice to an uttered voice is reproduced, and includes a voice acquisition unit that acquires an utterance signal representing the uttered voice, a history management unit that generates a usage history, and a response generation unit that causes a playback device to reproduce a response voice whose prosody corresponds to the usage history.
  • FIG. 1 is a configuration diagram of a voice interactive apparatus 100A according to the first embodiment of the present invention.
  • the voice interactive apparatus 100A of the first embodiment is a voice dialogue system that reproduces a voice (hereinafter referred to as “response voice”) Vy in response to an input voice (hereinafter referred to as “uttered voice”) Vx produced by the user U.
  • a portable information processing device such as a mobile phone or a smartphone, or an information processing device such as a personal computer can be used as the voice interactive device 100A.
  • the voice interactive apparatus 100A may also be realized in the form of a toy (for example, a doll such as a stuffed animal) or a robot simulating the appearance of an animal or the like.
  • the uttered voice Vx is, for example, a voice (an example of an input voice) containing a question or a statement, and the response voice Vy is a voice containing an answer to the question or a reply to the statement.
  • the response voice Vy includes a voice meaning an interjection, for example.
  • An interjection is an independent word (an exclamation) that is used on its own, independently of other phrases and without conjugation. Specific examples include phrases such as “un” and “ee” (in English, “aha” or “right”), which signal agreement with the partner's utterance, and phrases such as “e” (in English, “well”), which express hesitation (a stalled response).
  • the voice interactive apparatus 100A of the first embodiment generates a prosodic response voice Vy according to the prosody of the speech voice Vx.
  • Prosody is a linguistic and phonetic characteristic that the listener of an utterance can perceive but that cannot be grasped from the ordinary notation of the language alone (that is, excluding any special notation devised to express prosody). Prosody can also be rephrased as a characteristic from which the listener can recall or infer the speaker's intention or emotion.
  • various features can be included in the concept of prosody, for example: inflection (change or intonation of the pitch of the voice), tone (sound level or strength), duration (length of sounds), speech rate, rhythm (the structure of temporal change in tone), and accent (pitch or stress accent). Typical examples of prosody are pitch (fundamental frequency) and volume.
  • the voice interaction device 100A of the first embodiment includes a control device 20, a storage device 22, a voice input device 24, and a playback device 26.
  • the voice input device 24 is an element that generates a voice signal (hereinafter referred to as “utterance signal”) X representing the uttered voice Vx of the user U, and includes a sound collection device 242 and an A/D converter 244.
  • the sound collection device (microphone) 242 collects the utterance voice Vx produced by the user U and generates an analog voice signal representing the sound pressure fluctuation of the utterance voice Vx.
  • the A / D converter 244 converts the audio signal generated by the sound collection device 242 into a digital speech signal X.
  • the control device 20 is an arithmetic processing device (for example, CPU) that comprehensively controls each element of the voice interactive device 100A.
  • the control device 20 of the first embodiment acquires the utterance signal X supplied from the voice input device 24, and generates a response signal Y representing the response voice Vy with respect to the utterance voice Vx.
  • the playback device 26 is an element that plays back the response sound Vy corresponding to the response signal Y generated by the control device 20, and includes a D / A converter 262 and a sound emitting device 264.
  • the D/A converter 262 converts the digital response signal Y generated by the control device 20 into an analog audio signal, and the sound emitting device 264 (for example, a speaker or headphones) emits the response voice Vy as sound waves according to the converted audio signal.
  • the playback device 26 may include a processing circuit such as an amplifier that amplifies the response signal Y.
  • the storage device 22 stores a program executed by the control device 20 and various data used by the control device 20.
  • a known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of a plurality of recording media can be arbitrarily employed as the storage device 22.
  • the storage device 22 of the first embodiment stores a voice signal Z representing a response voice of specific utterance content. The following description exemplifies a case where the storage device 22 stores a voice signal Z of a response voice such as “un” (“yeah”), an interjection expressing agreement.
  • the audio signal Z is recorded in advance and stored in the storage device 22 as an audio file of an arbitrary format such as wav format.
  • the control device 20 executes a program stored in the storage device 22 to realize a plurality of functions (a voice acquisition unit 32, a voice analysis unit 34A, and a response generation unit 36A) for establishing a dialogue with the user U.
  • a configuration in which the function of the control device 20 is realized by a plurality of devices (that is, a system), or a configuration in which a dedicated electronic circuit realizes a part of the function of the control device 20 may be employed.
  • the voice acquisition unit 32 of the first embodiment acquires the speech signal X generated by the voice input device 24 from the voice input device 24.
  • the voice analysis unit 34A specifies the pitch (fundamental frequency) P of the uttered voice Vx from the utterance signal X acquired by the voice acquisition unit 32.
  • the pitch P is specified sequentially in a predetermined cycle. That is, the pitch P is specified for each of a plurality of different time points on the time axis.
  • a known technique can be arbitrarily adopted to specify the pitch P of the speech voice Vx. Note that the pitch P can be specified by extracting an acoustic component in a specific frequency band from the speech signal X.
  • the frequency band to be analyzed by the voice analysis unit 34A can be changed, for example, according to an instruction from the user U (for example, designation of male voice or female voice). It is also possible to dynamically change the frequency band to be analyzed according to the pitch P of the speech voice Vx.
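  • as the surrounding passage notes, any known technique can specify the pitch P, and the analyzed frequency band can be restricted (for example to a male or female voice range). As a minimal sketch only, not the patent's implementation, pitch estimation by autocorrelation over a restricted lag range might look like this; the function name and default band are assumptions:

```python
import numpy as np

def estimate_pitch(frame, sr, fmin=70.0, fmax=400.0):
    """Estimate the fundamental frequency (pitch P) of one analysis
    frame by autocorrelation, searching only the lag range that
    corresponds to the frequency band [fmin, fmax]."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)                      # shortest period searched
    lag_max = min(int(sr / fmin), len(corr) - 1)  # longest period searched
    if lag_max <= lag_min:
        return None
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    if corr[lag] <= 0.0:                          # no significant periodicity
        return None
    return sr / lag
```

Narrowing `fmin`/`fmax` corresponds to restricting the analyzed band per the user's instruction, and returning `None` models the "no significant pitch" case used in the start-of-utterance check.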
  • the response generation unit 36A causes the playback device 26 to reproduce the response voice Vy for the uttered voice Vx of the utterance signal X acquired by the voice acquisition unit 32. Specifically, the response generation unit 36A generates a response signal Y of the response voice Vy, triggered by the pronunciation of the uttered voice Vx by the user U, and supplies the response signal Y to the playback device 26, thereby causing the playback device 26 to reproduce the response voice Vy.
  • the response generation unit 36A of the first embodiment generates the response signal Y of the response voice Vy by adjusting the prosody of the voice signal Z stored in the storage device 22 in accordance with the pitch P of the uttered voice Vx specified by the voice analysis unit 34A. That is, the response voice Vy, obtained by adjusting the initial response voice represented by the voice signal Z in accordance with the prosody of the uttered voice Vx, is reproduced from the playback device 26.
  • in an actual conversation there is a tendency for the conversation partner to produce a response voice at a pitch corresponding to the pitch near the end point of the speaker's uttered voice (that is, the pitch of the response voice tends to depend on the pitch near the end point of the uttered voice).
  • accordingly, the response generation unit 36A of the first embodiment generates the response signal Y of the response voice Vy by adjusting the pitch of the voice signal Z according to the pitch P of the uttered voice Vx specified by the voice analysis unit 34A.
  • FIG. 2 is a flowchart of processing executed by the control device 20 of the first embodiment. For example, the process of FIG. 2 is started in response to an instruction from the user U (for example, an instruction to start a voice conversation program) to the voice interaction apparatus 100A.
  • the voice acquisition unit 32 waits until the user U starts to pronounce the uttered voice Vx (S10: NO). Specifically, the voice acquisition unit 32 sequentially measures the volume of the utterance signal X supplied from the voice input device 24, and determines that the uttered voice Vx has started when the volume remains above a predetermined threshold (for example, a fixed value selected in advance or a variable value set according to an instruction from the user U) for a predetermined duration.
  • any method may be used to detect the start of the uttered voice Vx (that is, the start point of the utterance section). For example, it can be determined that the uttered voice Vx has started when the volume of the utterance signal X exceeds a threshold and the voice analysis unit 34A detects a significant pitch P.
  • the voice acquisition unit 32 acquires the speech signal X from the voice input device 24 and stores it in the storage device 22 (S11).
  • the voice analysis unit 34A specifies the pitch P of the uttered voice Vx from the utterance signal X acquired by the voice acquisition unit 32 and stores it in the storage device 22 (S12).
  • the voice acquisition unit 32 determines whether the user U has finished pronouncing the uttered voice Vx (S13). Specifically, the voice acquisition unit 32 determines that the uttered voice Vx has ended when the volume of the utterance signal X remains below a predetermined threshold (for example, a fixed value selected in advance or a variable value set according to an instruction from the user U) for a predetermined duration.
  • a known technique can be arbitrarily employed for detecting the end of the speech voice Vx (that is, the end point of the speech section).
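  • the start and end decisions of steps S10 and S13 (volume staying above, then below, a threshold for a predetermined duration) can be sketched as follows. This is an illustrative sketch: the frame size, hold lengths, and function names are assumptions, not values from the patent:

```python
import numpy as np

FRAME = 160       # 10-ms frames at 16 kHz (assumed)
START_HOLD = 5    # frames the volume must stay above the threshold
END_HOLD = 30     # frames the volume must stay below the threshold

def frame_volume(frame):
    """Root-mean-square volume of one frame."""
    x = np.asarray(frame, dtype=float)
    return float(np.sqrt(np.mean(x * x)))

def detect_utterance(frames, threshold):
    """Scan successive frames and return (start_index, end_index) of the
    utterance section: start when the volume stays above `threshold`
    for START_HOLD frames, end when it stays below for END_HOLD frames.
    Either index is None if the corresponding event was not observed."""
    start = end = None
    above = below = 0
    for i, f in enumerate(frames):
        loud = frame_volume(f) > threshold
        if start is None:
            above = above + 1 if loud else 0
            if above >= START_HOLD:
                start = i - START_HOLD + 1
        else:
            below = below + 1 if not loud else 0
            if below >= END_HOLD:
                end = i - END_HOLD + 1
                break
    return start, end
```

The hold counters implement the "continues for a predetermined time length" condition, preventing a single loud or quiet frame from toggling the state.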
  • a time series of pitches P of the uttered voice Vx is thus specified over the utterance section from its start point to its end point tB.
  • in FIG. 3, it is assumed that the user U has pronounced the uttered voice Vx of a question sentence, “Is it fun?”, asking about the emotion or intention of the conversation partner.
  • in FIG. 4, it is assumed that the user U has pronounced the uttered voice Vx of a plain (declarative) sentence that expresses the utterer's own emotion or intention or requests the conversation partner's agreement.
  • the response generation unit 36A executes a process (hereinafter referred to as “response generation process”) SA for causing the playback device 26 to reproduce the response voice Vy to the utterance voice Vx.
  • response generation process SA of the first embodiment adjusts the pitch of the voice signal Z in accordance with the pitch P of the uttered voice Vx specified by the voice analysis unit 34A, whereby the response signal of the response voice Vy. This is a process for generating Y.
  • FIG. 5 is a flowchart of a specific example of the response generation process SA.
  • the response generation process SA of FIG. 5 is started when the utterance voice Vx ends (S13: YES).
  • as illustrated in FIG. 3 and FIG. 4, the response generation unit 36A specifies, as the prosody of the uttered voice Vx, the lowest value (hereinafter referred to as the “minimum pitch”) Pmin among the plurality of pitches P specified by the voice analysis unit 34A for a section (hereinafter referred to as the “tail section”) E that includes the end point tB of the uttered voice Vx (SA1).
  • the tail section E is, for example, a part of the utterance voice Vx over a predetermined length before the end point tB of the utterance voice Vx.
  • the time length of the end section E is set to a numerical value (for example, about 180 milliseconds) within a range of about several tens of milliseconds to several seconds.
  • in the uttered voice Vx of a question sentence, the pitch P tends to rise near the end point tB; in that case, the pitch P at the local minimum where the transition of the pitch P turns from falling to rising is specified as the minimum pitch Pmin. In the uttered voice Vx of a plain sentence, the pitch P tends to decrease monotonically toward the end point tB; in that case, the pitch P at the end point tB is specified as the minimum pitch Pmin.
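  • step SA1, selecting the minimum pitch Pmin from the tail section E, then reduces to taking the minimum over the pitch observations in that section, which covers both cases (the turning point of a question and the end point of a plain sentence). A sketch with an assumed interface:

```python
def minimum_pitch(pitches, tail_len):
    """Return the minimum pitch Pmin within the tail section E: the last
    `tail_len` pitch observations before the end point tB. For a question
    (pitch turning from falling to rising near tB) this is the pitch at
    the turning point; for a plain sentence (monotone fall) it is the
    pitch at the end point itself. Unvoiced gaps are given as None."""
    tail = [p for p in pitches[-tail_len:] if p is not None]
    if not tail:
        return None
    return min(tail)
```

`tail_len` corresponds to the length of the tail section E (about 180 ms in the text) expressed in pitch-analysis periods.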
  • the response generation unit 36A generates a response signal Y representing the response voice Vy having a pitch corresponding to the minimum pitch Pmin of the uttered voice Vx (SA2). Specifically, as illustrated in FIG. 3 and FIG. 4, the response generation unit 36A generates the response signal Y of the response voice Vy by adjusting the pitch of the voice signal Z so that the pitch at a specific time point (hereinafter referred to as the “target point”) τ on the time axis of the response voice Vy coincides with the minimum pitch Pmin.
  • a preferred example of the target point τ is the start point of a specific mora (typically the last mora) among the plurality of morae constituting the response voice Vy.
  • the response signal Y of the response voice Vy is generated by uniformly adjusting (pitch-shifting) the pitch over all sections of the voice signal Z so that the pitch at the target point τ matches the minimum pitch Pmin.
  • a well-known technique can be arbitrarily employed for adjusting the pitch of the voice signal Z.
  • the target point τ is not limited to the start point of the last mora of the response voice Vy; for example, the pitch can be adjusted with the start point or the end point of the response voice Vy as the target point τ.
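  • the uniform pitch shift of step SA2 could be sketched as below. This naive resampling version shifts all sections of Z by the same ratio but also rescales the duration, which a practical implementation (for example PSOLA or a phase vocoder, neither of which the patent specifies) would avoid; names and parameters are illustrative:

```python
import numpy as np

def pitch_shift_to_target(z, pitch_at_tau, pmin):
    """Uniformly shift the pitch of the stored response signal Z so that
    its pitch at the target point tau becomes Pmin. Resampling by the
    frequency ratio shifts every section by the same interval, but also
    rescales the duration; a duration-preserving pitch shifter would be
    used in practice."""
    ratio = pmin / pitch_at_tau              # > 1 raises the pitch
    n_out = int(len(z) / ratio)              # resampled length
    idx = np.arange(n_out) * ratio           # fractional read positions in z
    return np.interp(idx, np.arange(len(z)), np.asarray(z, dtype=float))
```

Because the shift is uniform, matching the pitch at τ only requires knowing the original pitch of Z at τ (here passed in as `pitch_at_tau`).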
  • the response generation unit 36A waits until the time point ty at which the playback of the response voice Vy should start (hereinafter referred to as “response start point”) arrives (SA3: NO).
  • the response start point ty is, for example, a point in time when a predetermined time (for example, 150 ms) has elapsed from the end point tB of the speech voice Vx.
  • when the response start point ty arrives (SA3: YES), the response generation unit 36A supplies the response signal Y, adjusted according to the minimum pitch Pmin, to the playback device 26 to reproduce the response voice Vy (SA4). That is, reproduction of the response voice Vy starts at the response start point ty, after the predetermined time has elapsed from the end point tB of the uttered voice Vx.
  • the response generation unit 36A sequentially supplies the response signal Y from the response start point ty to the playback device 26 in real time in parallel with the generation (pitch shift) of the response signal Y to reproduce the response voice Vy. It is also possible.
  • in this way, the response generation unit 36A of the first embodiment functions as an element that causes the playback device 26 to reproduce the response voice Vy having a pitch corresponding to the minimum pitch Pmin in the tail section E of the uttered voice Vx.
  • the control device 20 determines whether the user U has instructed the end of the voice dialogue, as illustrated in FIG. 2 (S14). If the end of the voice dialogue has not been instructed (S14: NO), the process returns to step S10. That is, triggered by the start of the uttered voice Vx (S10: YES), the voice acquisition unit 32 acquires the utterance signal X (S11), the voice analysis unit 34A specifies the pitch P (S12), and the response generation unit 36A executes the response generation process SA.
  • with the above processing, a response voice Vy having a pitch corresponding to the pitch P of the uttered voice Vx is reproduced for each pronunciation of the uttered voice Vx. That is, a voice dialogue is realized in which the pronunciation of an arbitrary uttered voice Vx by the user U and the reproduction of a response voice Vy corresponding to that uttered voice Vx (for example, a response voice “yeah”) alternate.
  • the response voice Vy having a pitch corresponding to the lowest pitch Pmin in the end section E including the end point tB of the speech voice Vx is played from the playback device 26. Therefore, it is possible to realize a natural voice conversation simulating the tendency of an actual conversation in which the conversation partner generates a response voice at a pitch corresponding to the pitch near the end point of the uttered voice.
  • since the response voice Vy is reproduced so that the pitch at the start point of its last mora (the target point τ) matches the minimum pitch Pmin, the effect of realizing a natural voice dialogue close to an actual conversation is particularly pronounced.
  • in the above description, the configuration in which the pitch at the target point τ of the response voice Vy is matched with the minimum pitch Pmin in the tail section E of the uttered voice Vx was exemplified. However, the relationship between the pitch of the response voice Vy at the target point τ and the minimum pitch Pmin of the uttered voice Vx is not limited to this example (a relationship in which the two coincide).
  • for example, the pitch of the response voice Vy at the target point τ can be matched with a pitch obtained by adding or subtracting a predetermined adjustment value (offset) Δp to or from the minimum pitch Pmin.
  • the adjustment value Δp is a fixed value selected in advance (for example, a value corresponding to a musical interval such as a fifth relative to the minimum pitch Pmin) or a variable value set according to an instruction from the user U. With a configuration in which the adjustment value Δp is set to a value corresponding to an integer multiple of an octave, a response voice Vy whose pitch is an octave shift of the minimum pitch Pmin is reproduced. It is also possible to switch whether the adjustment value Δp is applied according to an instruction from the user U.
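  • expressing the adjustment value in cents makes the interval offset (a fifth is about 700 cents) and the octave shift (1200 cents per octave) a single multiplication. The helper below is an illustrative sketch, not part of the patent:

```python
def apply_offset(pmin_hz, offset_cents=0.0):
    """Return the target pitch: Pmin shifted by an adjustment value given
    in cents (1200 cents = one octave; about 700 cents = a fifth). An
    offset of 0 leaves Pmin unchanged, matching the case where the user
    switches the adjustment off."""
    return pmin_hz * (2.0 ** (offset_cents / 1200.0))
```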
  • in the configuration described above, the pitch of the response voice Vy is controlled according to the pitch P of the uttered voice Vx (specifically, the minimum pitch Pmin of the tail section E).
  • however, the type of prosody of the uttered voice Vx used for control, and the type of prosody of the response voice Vy controlled according to the prosody of the uttered voice Vx, are not limited to pitch. For example, a configuration that controls the prosody of the response voice Vy according to the volume of the uttered voice Vx (an example of prosody), or according to the range of variation of the pitch or volume of the uttered voice Vx (another example of prosody), may be employed. Likewise, a configuration that controls the volume of the response voice Vy (an example of prosody), or the range of variation of the pitch or volume of the response voice Vy (another example of prosody), according to the prosody of the uttered voice Vx may be employed.
  • the prosody of the response voice is not necessarily determined uniquely by the prosody of the uttered voice. That is, the prosody of the response voice tends to depend on the prosody of the uttered voice but can vary for each pronunciation of the uttered voice.
  • the response generation unit 36A can change the prosody (for example, the pitch or volume) of the response voice Vy reproduced from the playback device 26 for each uttered voice Vx. Specifically, in the configuration described above in which the pitch of the response voice Vy is adjusted to the pitch obtained by adding or subtracting the adjustment value Δp to or from the minimum pitch Pmin, the response generation unit 36A variably controls the adjustment value Δp for each uttered voice Vx.
  • for example, the response generation unit 36A generates a random number within a predetermined range for each pronunciation of the uttered voice Vx and sets that random number as the adjustment value Δp. With this configuration, it is possible to realize a natural voice dialogue simulating the tendency of an actual conversation, in which the prosody of the response voice can vary each time the uttered voice is pronounced.
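  • drawing a fresh random adjustment value for each detected utterance can be sketched as follows; the symmetric range in cents is an assumption for illustration:

```python
import random

def draw_adjustment(rng=None, max_cents=200.0):
    """Draw a fresh adjustment value (in cents, symmetric around zero)
    for each pronunciation of the uttered voice, so that the prosody of
    the response voice varies between utterances, as it does in real
    dialogue."""
    rng = rng or random.Random()
    return rng.uniform(-max_cents, max_cents)
```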
  • in the first embodiment described above, the response signal Y is generated by adjusting the pitch of a single voice signal Z; however, a plurality of voice signals Z having different pitches may also be used to generate the response signal Y. For example, a configuration can be assumed in which the response signal Y is generated by adjusting the pitch of the voice signal Z whose pitch is closest to the minimum pitch Pmin of the uttered voice Vx among the plurality of voice signals Z. The plurality of voice signals Z are generated, for example, by recording voices pronounced at different pitches or by adjusting the pitch of a voice pronounced at a specific pitch.
  • for example, a plurality of voice signals Z whose pitches differ by a predetermined step (for example, 100 cents, corresponding to a semitone) are stored in the storage device 22 in advance.
  • the response generation unit 36A selects, from among the plurality of voice signals Z stored in the storage device 22, the voice signal Z whose pitch is closest to the minimum pitch Pmin of the uttered voice Vx, supplies it to the playback device 26 as the response signal Y, and thereby causes the playback device 26 to reproduce the response voice Vy. In this configuration the adjustment of the pitch of the voice signal Z by the response generation unit 36A can be omitted, which has the advantage of reducing the processing load of the response generation unit 36A.
  • a configuration is also suitable in which, from among the plurality of voice signals Z stored in the storage device 22, the voice signal Z whose pitch is closest to any of the pitches obtained by shifting the minimum pitch Pmin of the uttered voice Vx in octave units is selected.
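  • selecting the stored voice signal Z whose pitch is closest to Pmin, optionally comparing against octave shifts of Pmin as described here, might be sketched like this; measuring distance in cents is a design choice of the sketch, not stated in the patent:

```python
import math

def select_closest_signal(pmin, stored_pitches, octave_search=False):
    """Return the index of the stored voice signal Z whose pitch is
    closest to Pmin or, when octave_search is set, closest to any octave
    shift of Pmin. Distances are measured in cents so that comparisons
    are symmetric in frequency ratio."""
    def cents(a, b):
        return abs(1200.0 * math.log2(a / b))

    best_i, best_d = None, float("inf")
    for i, zp in enumerate(stored_pitches):
        if octave_search:
            # distance to the nearest octave shift of Pmin
            d = min(cents(zp, pmin * 2.0 ** k) for k in range(-3, 4))
        else:
            d = cents(zp, pmin)
        if d < best_d:
            best_i, best_d = i, d
    return best_i
```

With the signals stored at 100-cent steps, as in the text, the selected signal is always within a quarter tone of the target, so the pitch adjustment can be omitted.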
  • in each embodiment described above, the response voice Vy is reproduced from the playback device 26; it is also possible to reproduce the uttered voice Vx from the playback device 26 by supplying the utterance signal X acquired by the voice acquisition unit 32 to the playback device 26. A configuration that switches, according to an instruction from the user U, whether the uttered voice Vx is reproduced from the playback device 26 may also be employed.
  • Second Embodiment: a second embodiment of the present invention will be described. For elements whose operation or function is the same as in the first embodiment, the reference symbols used in the description of the first embodiment are reused and detailed description of each is omitted as appropriate.
  • FIG. 6 is a configuration diagram of a voice interaction device 100B according to the second embodiment of the present invention.
  • the voice interaction apparatus 100B of the second embodiment reproduces a response voice Vy corresponding to the uttered voice Vx produced by the user U.
  • the voice interaction device 100B of the second embodiment has a configuration in which the response generation unit 36A of the voice interaction device 100A of the first embodiment is replaced with a response generation unit 36B.
  • the configuration and operation of the other elements (voice input device 24, playback device 26, voice acquisition unit 32, and voice analysis unit 34A) of the voice interactive device 100B are the same as those in the first embodiment.
  • in an actual conversation, the conversation partner tends to pronounce the response voice with a prosody that depends on the speaker's utterance content (whether it is a question sentence or a plain sentence). That is, the prosody differs between a response voice to a question sentence and a response voice to a plain sentence. For example, because the speaker needs to clearly recognize the responder's answer (affirmative or negative), the voice answering a question sentence tends to be pronounced with a relatively large volume and with emphasized inflection (temporal fluctuation of volume or pitch) compared with the voice replying to a plain sentence.
  • the response generation unit 36B of the second embodiment causes the playback device 26 to reproduce a response voice Vy whose prosody corresponds to the utterance content (question sentence or plain sentence) of the uttered voice Vx.
  • FIG. 7 illustrates the transition of the pitch P of the utterance voice Vx of the question sentence
  • FIG. 8 illustrates the transition of the pitch P of the utterance voice Vx of the plain sentence.
  • as illustrated in FIG. 7, the pitch P of the uttered voice Vx of a question sentence turns from falling to rising, or rises monotonically, within the tail section E. On the other hand, as illustrated in FIG. 8, in the uttered voice Vx of a plain sentence the pitch P decreases monotonically from the start point tA to the end point tB of the tail section E. Therefore, by analyzing the transition of the pitch P near the end of the uttered voice Vx (the tail section E), it is possible to estimate whether the utterance content of the uttered voice Vx corresponds to a question sentence or a plain sentence.
• the response generation unit 36B of the second embodiment causes the playback device 26 to play back a response voice Vy whose prosody accords with the transition of the pitch P in the end section E of the utterance voice Vx (that is, with whether the utterance is a question sentence or a declarative sentence).
• specifically, as illustrated in FIG. 7, when the pitch P of the utterance voice Vx changes from a decrease to an increase within the end section E, or increases monotonically within the end section E (that is, when the utterance content is estimated to be a question sentence), a response voice Vy with a prosody suitable for a question sentence is reproduced from the playback device 26. Conversely, as illustrated in FIG. 8, when the pitch P of the utterance voice Vx decreases monotonically within the end section E (that is, when the utterance content is estimated to be a declarative sentence), a response voice Vy with a prosody suitable for a declarative sentence is reproduced from the playback device 26.
  • the storage device 22 of the voice interaction device 100B of the second embodiment stores a response signal YA and a response signal YB in which a response voice Vy of a specific utterance content is recorded in advance.
• the response signal YA and the response signal YB have the same utterance content in terms of written characters but differ in prosody.
• the response voice Vy represented by the response signal YA is a voice of "un" ("yes") pronounced with the intention of an affirmative answer to the question-sentence utterance voice Vx, and the response voice Vy represented by the response signal YB is a voice of "un" pronounced with the intention of a backchannel acknowledgment of the utterance voice Vx.
• the response voice Vy of the response signal YA differs prosodically from that of the response signal YB in that its volume is larger and its range of fluctuation in volume and pitch (that is, its inflection) is wider.
• the response generation unit 36B of the second embodiment selectively supplies either the response signal YA or the response signal YB stored in the storage device 22 to the playback device 26, thereby selectively reproducing one of a plurality of response voices Vy with different prosody. Note that the pronounced content may also differ between the response signal YA and the response signal YB. Although a dialogue in Japanese is exemplified above, the same situation can be assumed in languages other than Japanese.
  • FIG. 9 is a flowchart of a response generation process SB for the response generation unit 36B of the second embodiment to cause the playback device 26 to play back the response voice Vy.
• in the second embodiment, the response generation process SA of FIG. 2 illustrated in the first embodiment is replaced with the response generation process SB of FIG. 9; processing other than the response generation process SB is the same as in the first embodiment.
  • the response generation process SB in FIG. 9 is started when the utterance voice Vx ends (S13: YES).
• when the response generation process SB starts, the response generation unit 36B calculates the average of the plural pitches P in a first section E1 within the end section E of the utterance voice Vx (hereinafter referred to as the "first average pitch" Pave1) and the average of the plural pitches P in a second section E2 (hereinafter referred to as the "second average pitch" Pave2) (SB1).
• the first section E1 is a section in the front part of the end section E (for example, a section including the start point tA of the end section E), and the second section E2 is a section of the end section E behind the first section E1 (for example, a section including the end point tB of the end section E).
• in the following description, the first half of the end section E is defined as the first section E1, and the second half as the second section E2.
  • the conditions of the first section E1 and the second section E2 are not limited to the above examples.
• for example, a configuration in which the first section E1 and the second section E2 are separated from each other by a gap, or a configuration in which the time lengths of the first section E1 and the second section E2 differ, may be employed.
• the response generation unit 36B compares the first average pitch Pave1 of the first section E1 with the second average pitch Pave2 of the second section E2, and determines whether the first average pitch Pave1 is lower than the second average pitch Pave2 (SB2).
• as noted above, the pitch P of the utterance voice Vx of a question sentence tends to change from a decrease to an increase, or to increase monotonically, within the end section E; therefore, as illustrated in FIG. 7, the first average pitch Pave1 is likely to fall below the second average pitch Pave2 (Pave1 < Pave2).
• on the other hand, the pitch P of the utterance voice Vx of a declarative sentence tends to decrease monotonically within the end section E; therefore, as illustrated in FIG. 8, the first average pitch Pave1 is likely to exceed the second average pitch Pave2 (Pave1 > Pave2).
• when the first average pitch Pave1 is lower than the second average pitch Pave2 (SB2: YES), that is, when the pitch P of the utterance voice Vx changes from a decrease to an increase within the end section E, or increases monotonically within the end section E, and the utterance is therefore estimated to be a question sentence, the response generation unit 36B selects from the storage device 22 the response signal YA corresponding to the response voice Vy answering the question (SB3).
• on the other hand, when the first average pitch Pave1 exceeds the second average pitch Pave2 (SB2: NO), that is, when the utterance is estimated to be a declarative sentence, the response generation unit 36B selects the response signal YB from the storage device 22 (SB4).
• as in the first embodiment, when the response start point ty arrives (SB5: YES), the response generation unit 36B reproduces the response voice Vy by supplying the selected response signal Y to the playback device 26 (SB6).
• the acquisition of the utterance signal X by the voice acquisition unit 32 (S11), the specification of the pitch P by the voice analysis unit 34A (S12), and the response generation process SB by the response generation unit 36B are repeated until the user U instructs the end of the voice dialogue (S14: NO). Therefore, as in the first embodiment, a voice dialogue is realized in which the pronunciation of an arbitrary utterance voice Vx by the user U and the reproduction of the response voice Vy to that utterance voice Vx are alternately repeated.
• as described above, in the second embodiment, a response voice Vy whose prosody corresponds to the transition of the pitch P in the end section E of the utterance voice Vx is reproduced from the playback device 26. Therefore, it is possible to realize a natural voice dialogue simulating the tendency of an actual conversation, in which the conversation partner pronounces a response voice with a prosody according to the utterance content of the speaker.
• in particular, the prosody of the response voice Vy differs between the case where the transition of the pitch P changes from a decrease to an increase, or increases monotonically, within the end section E, and the case where the pitch P decreases monotonically from the start point tA to the end point tB of the end section E. Therefore, it is possible to realize a natural voice dialogue simulating the tendency of an actual conversation, in which the prosody of the response voice differs between a question sentence and a declarative sentence.
• moreover, the prosody of the response voice Vy is selected according to the result of comparing the first average pitch Pave1 of the first section E1 with the second average pitch Pave2 of the second section E2 within the end section E. There is therefore an advantage that the transition of the pitch P can be evaluated (and thus the prosody of the response voice Vy selected) by the simple processing of averaging and comparing plural pitches P.
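The averaging and comparison of steps SB1 and SB2 can be sketched as follows. This is an illustrative helper, not code from the patent: it assumes the pitches P of the end section E are available as a list of equally spaced samples, and it uses the half-and-half split of E into E1 and E2 described above.

```python
# Sketch of steps SB1-SB2: split the end section E into a first half E1
# and a second half E2, average the pitches in each, and treat the
# utterance as a question when Pave1 < Pave2 (pitch turning upward).
# Function name and sample data are illustrative assumptions.

def is_question(tail_pitches):
    """Return True when the utterance is estimated to be a question sentence."""
    mid = len(tail_pitches) // 2
    pave1 = sum(tail_pitches[:mid]) / mid                         # first average pitch Pave1
    pave2 = sum(tail_pitches[mid:]) / (len(tail_pitches) - mid)   # second average pitch Pave2
    return pave1 < pave2

# A rising tail (question sentence, FIG. 7) versus a falling tail
# (declarative sentence, FIG. 8), in hertz:
rising = [180.0, 175.0, 172.0, 178.0, 190.0, 205.0]
falling = [200.0, 195.0, 188.0, 180.0, 172.0, 165.0]
```

On the rising contour the second-half average exceeds the first-half average, so the utterance is classified as a question; on the falling contour the relation is reversed.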
• in the configuration described above, one of a plurality of response signals Y (YA and YB) stored in advance in the storage device 22 is selectively supplied to the playback device 26.
• alternatively, the response generation unit 36B may itself generate a response signal Y whose prosody corresponds to the transition of the pitch P in the end section E of the utterance voice Vx.
• for example, when the utterance voice Vx is a question sentence, the response generation unit 36B generates the response signal of the answering response voice Vy by increasing the volume of an initial response signal Y and expanding its range of fluctuation in volume and pitch, and supplies it to the playback device 26; when the utterance voice Vx is a declarative sentence, it generates the response signal of the backchannel response voice Vy by decreasing the volume of the initial response signal Y and reducing its range of fluctuation in volume and pitch.
• in a configuration in which response signals Y with different prosody are generated by adjusting a single response signal Y, there is no need to hold a plurality of response signals Y (YA and YB) with different prosody in the storage device 22, which has the advantage of reducing the storage capacity required of the storage device 22.
• on the other hand, in the configuration that selects among a plurality of pre-stored response signals Y, there is no need to adjust the prosody of an initial response signal Y according to the utterance content of the utterance voice Vx, which has the advantage of reducing the processing load of the response generation unit 36B.
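The adjustment of a single initial response signal Y into two prosodies can be sketched conceptually as follows. This is an assumption-laden simplification: the "signal" is reduced to a list of amplitude samples plus a pitch contour, and real audio pitch modification (e.g. PSOLA-style resynthesis) is outside the sketch.

```python
# Conceptual sketch of deriving two prosodies from one stored response:
# for a question sentence, raise the volume (gain > 1) and widen the
# inflection (inflection > 1); for a declarative sentence, do the
# opposite.  All names and values here are illustrative assumptions.

def adjust_prosody(samples, pitches, gain, inflection):
    """Scale amplitude by `gain` and expand/compress the pitch contour
    about its mean by `inflection` (>1 widens, <1 narrows)."""
    mean_p = sum(pitches) / len(pitches)
    new_samples = [s * gain for s in samples]
    new_pitches = [mean_p + (p - mean_p) * inflection for p in pitches]
    return new_samples, new_pitches

# Widen the inflection of a small contour for the question-sentence case:
ns, np_ = adjust_prosody([0.1, -0.2], [100.0, 120.0, 140.0], gain=1.5, inflection=2.0)
```

Because the contour is expanded about its mean, the average pitch is preserved while the fluctuation range (inflection) doubles.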
• in the above description, the first average pitch Pave1 of the first section E1 and the second average pitch Pave2 of the second section E2 within the end section E are compared, but the method of estimating whether the utterance content of the utterance voice Vx corresponds to a question sentence or a declarative sentence is not limited to this example.
• for example, when the pitch P decreases monotonically in the end section E, the pitch P tends to reach its lowest value Pmin at the end point tB of the end section E; therefore, when the lowest pitch Pmin is observed at the end point tB, it is possible to estimate that the utterance content of the utterance voice Vx corresponds to a declarative sentence. It is also possible to estimate whether the utterance content of the utterance voice Vx corresponds to a question sentence or a declarative sentence from the transition of the pitch P before and after the point of the lowest pitch Pmin within the end section E: for example, when the pitch P rises after the point of the lowest pitch Pmin in the end section E, the response generation unit 36B estimates that the utterance content of the utterance voice Vx corresponds to a question sentence.
• FIG. 10 is a configuration diagram of a voice interaction apparatus 100C according to the third embodiment of the present invention. Like the voice interaction apparatus 100A of the first embodiment, the voice interaction apparatus 100C of the third embodiment reproduces a response voice Vy corresponding to the utterance voice Vx pronounced by the user U. In the third embodiment, in addition to a response voice Vy2 representing an answer or a backchannel to the utterance voice Vx (hereinafter referred to as the "second response voice"), a response voice Vy1 asking the speaker back about the utterance voice Vx (hereinafter referred to as the "first response voice") can be reproduced from the playback device 26.
• the first response voice Vy1 is a voice such as "Eh?" or "What?" that asks the speaker to repeat the utterance voice Vx.
• the storage device 22 of the voice interaction apparatus 100C of the third embodiment stores a response signal Y1 in which the first response voice Vy1 for asking back is recorded, and a response signal Y2 in which a second response voice Vy2 other than asking back (for example, a backchannel such as "un" ("yes")) is recorded.
• the voice interaction apparatus 100C of the third embodiment has a configuration in which the voice analysis unit 34A and the response generation unit 36A of the voice interaction apparatus 100A of the first embodiment are replaced with a voice analysis unit 34C and a response generation unit 36C.
  • the configuration and operation of other elements of the voice interactive device 100C (the voice input device 24, the playback device 26, and the voice acquisition unit 32) are the same as those in the first embodiment.
  • the speech analysis unit 34C of the third embodiment identifies the prosodic index value Q from the speech signal X acquired by the speech acquisition unit 32.
• the prosodic index value Q is an index value relating to the prosody of the utterance voice Vx, and is calculated once per utterance voice Vx (taking the series of utterance from the start point to the end point of the utterance voice Vx as one unit). Specifically, the average pitch, the fluctuation range of the pitch, the average volume, or the fluctuation range of the volume over the utterance section of the utterance voice Vx is calculated from the utterance signal X as the prosodic index value Q.
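The derivation of the prosodic index value Q from the per-frame feature quantities q (pitch or volume) of one utterance can be sketched as follows; the function name and `mode` switch are illustrative assumptions, but the two statistics match the ones named above (the average and the fluctuation range).

```python
# Sketch of computing the prosodic index value Q for one utterance from
# its time series of feature quantities q (e.g. pitch per frame):
# either the average value or the fluctuation range (max - min).

def prosodic_index(features, mode="mean"):
    if mode == "mean":
        return sum(features) / len(features)
    if mode == "range":
        return max(features) - min(features)
    raise ValueError(f"unknown mode: {mode}")
```

For a contour of 100, 110, 120 Hz, the mean-based Q is 110 Hz and the range-based Q is 20 Hz.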
• the response generation unit 36C of the third embodiment causes the playback device 26 to selectively play back either the first response voice Vy1 asking back about the utterance voice Vx or the second response voice Vy2 other than asking back.
• specifically, the response generation unit 36C of the third embodiment compares the prosodic index value Q specified by the voice analysis unit 34C with a threshold value QTH, and causes the playback device 26 to play back either the first response voice Vy1 or the second response voice Vy2 according to the comparison result.
  • the threshold value QTH is set to a representative value (for example, an average value) of the prosodic index value Q of a plurality of uttered voices Vx spoken by the user U in the past. That is, the threshold value QTH corresponds to a standard prosody estimated from the user U's past speech.
• when the prosodic index value Q of the utterance voice Vx deviates from the threshold value QTH, the asking-back first response voice Vy1 is reproduced; when the prosodic index value Q approximates the threshold value QTH, the backchannel second response voice Vy2 is reproduced.
  • FIG. 11 is a flowchart of processing executed by the control device 20 of the third embodiment. For example, the process in FIG. 11 is started in response to an instruction from the user U (for example, an instruction to start a program for voice conversation) to the voice conversation apparatus 100C.
  • the voice acquisition unit 32 acquires the utterance signal X from the voice input device 24 and stores it in the storage device 22 (S21).
  • the voice analysis unit 34C specifies the feature quantity q related to the prosody of the uttered voice Vx from the utterance signal X acquired by the voice acquisition unit 32 (S22).
  • the feature quantity q is, for example, the pitch P or volume of the speech voice Vx.
  • the acquisition of the utterance signal X by the voice acquisition unit 32 (S21) and the specification of the feature quantity q by the voice analysis unit 34C (S22) are repeated until the end of the utterance voice Vx (S23: NO). That is, a time series of a plurality of feature quantities q of the utterance voice Vx is specified for the utterance section from the start point to the end point tB of the utterance voice Vx.
• when the utterance voice Vx ends (S23: YES), the voice analysis unit 34C calculates the prosodic index value Q from the time series of the plural feature quantities q specified over the utterance section from the start point to the end point of the utterance voice Vx (S24). Specifically, the voice analysis unit 34C calculates the average value or the fluctuation range (range) of the plural feature quantities q in the utterance section as the prosodic index value Q.
  • the response generation unit 36C executes a response generation process SC for causing the playback apparatus 26 to play back the response voice Vy.
• the response generation process SC of the third embodiment is a process of causing the playback device 26 to selectively play back either the first response voice Vy1 or the second response voice Vy2 according to the prosodic index value Q calculated by the voice analysis unit 34C.
• when the response generation process SC ends, the voice analysis unit 34C updates the threshold value QTH according to the prosodic index value Q of the utterance voice Vx (S25). Specifically, the voice analysis unit 34C calculates a representative value (for example, the average or the median) of the prosodic index values Q of past utterance voices Vx, including the current utterance voice Vx, as the updated threshold value QTH. For example, the weighted average (exponential moving average) of the current prosodic index value Q and the pre-update threshold value QTH is calculated as the post-update threshold value QTH, as expressed by the following formula (1):

QTH ← α·QTH + (1 − α)·Q  …(1)
• the symbol α in formula (1) is a predetermined positive number less than 1 (a forgetting factor).
• the voice analysis unit 34C of the third embodiment thus functions as an element that sets, as the threshold value QTH, a representative value of the prosodic index values Q of a plurality of past utterance voices Vx.
• the threshold value QTH is updated, each time an utterance voice Vx is pronounced, to a value reflecting the prosodic index value Q of that utterance voice Vx, and comes to correspond to the standard prosody estimated from the user U's utterances over multiple occasions.
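The threshold update of step S25 can be sketched as a one-line exponential moving average. The text describes formula (1) as a weighted average of the current prosodic index value Q and the previous threshold QTH with a forgetting factor α < 1; the specific placement of α on the old threshold is the conventional form and is an assumption here.

```python
# Sketch of the step-S25 threshold update:
#     QTH <- alpha * QTH + (1 - alpha) * Q,   0 < alpha < 1 (forgetting factor)
# Larger alpha makes the threshold track the user's standard prosody
# more slowly; the value 0.9 below is illustrative.

def update_threshold(qth, q, alpha=0.9):
    return alpha * qth + (1.0 - alpha) * q
```

For example, with QTH = 100, Q = 110 and α = 0.9, the updated threshold is 101, i.e. the standard prosody shifts only slightly toward the latest utterance.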
  • FIG. 12 is a flowchart of the response generation process SC of the third embodiment.
• when the response generation process SC starts, the response generation unit 36C compares the prosodic index value Q specified by the voice analysis unit 34C with the current threshold value QTH, and determines whether the prosodic index value Q falls within a predetermined range R including the threshold value QTH (hereinafter referred to as the "allowable range") (SC1). FIGS. 13 and 14 illustrate the transition of the feature quantity q specified by the voice analysis unit 34C from the utterance voice Vx. As illustrated in FIGS. 13 and 14, the allowable range R is a range of predetermined width whose median is the threshold value QTH.
• the process of comparing the prosodic index value Q with the threshold value QTH can also be realized as a process of determining whether the absolute value of the difference between the prosodic index value Q and the threshold value QTH exceeds a predetermined value (for example, half the width of the allowable range R).
• in the case of FIG. 13, the prosodic index value Q lies inside the allowable range R.
• the inclusion of the prosodic index value Q within the allowable range R means that the prosody of the current utterance voice Vx approximates the user U's standard prosody (past utterance tendency).
• assuming a dialogue between actual humans, this corresponds to a situation in which the conversation partner finds the utterance easy to hear (a situation in which a question is unlikely to be asked back to the speaker).
• accordingly, when the prosodic index value Q falls within the allowable range R (SC1: YES), the response generation unit 36C selects from the storage device 22 the response signal Y2 of the second response voice Vy2, a backchannel to the utterance voice Vx (SC2).
• in the case of FIG. 14, on the other hand, the prosodic index value Q lies outside the allowable range R (specifically, below its lower limit).
• the exclusion of the prosodic index value Q from the allowable range R means that the prosody of the utterance voice Vx deviates from the user U's standard prosody; in other words, assuming a dialogue between actual humans, it can be evaluated as a situation in which the conversation partner finds the utterance difficult to hear (a situation in which asking the speaker back is likely to be necessary).
• accordingly, when the prosodic index value Q falls outside the allowable range R (SC1: NO), the response generation unit 36C selects from the storage device 22, as the signal to be supplied to the playback device 26, the response signal Y1 of the first response voice Vy1 (for example, "Eh?" or "What?") asking back about the utterance voice Vx (SC3).
• as in the first embodiment, when the response start point ty arrives (SC4: YES), the response generation unit 36C reproduces the response voice Vy (the first response voice Vy1 or the second response voice Vy2) by supplying the selected response signal Y to the playback device 26 (SC5). That is, when the prosodic index value Q is included in the allowable range R, the backchannel second response voice Vy2 is reproduced, and when it is not included, the asking-back first response voice Vy1 is reproduced.
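The allowable-range decision of step SC1 reduces to a comparison of |Q − QTH| against half the width of R, as noted above. A minimal sketch, with illustrative names:

```python
# Sketch of the SC1 decision: choose the asking-back voice Vy1 when the
# prosodic index value Q falls outside the allowable range R centred on
# the threshold QTH, and the backchannel voice Vy2 otherwise.
# `half_width` is half the width of the allowable range R.

def select_response(q, qth, half_width):
    return "Vy1" if abs(q - qth) > half_width else "Vy2"
```

With QTH = 100 and a range half-width of 10, a Q of 95 yields the backchannel Vy2, while a Q of 130 (prosody deviating from the user's standard) yields the asking-back Vy1.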
• as described above, in the third embodiment, either the first response voice Vy1 asking back about the utterance voice Vx or the second response voice Vy2 other than asking back is selectively reproduced from the playback device 26. Therefore, it is possible to realize a natural voice dialogue simulating the tendency of an actual conversation, in which not only backchannels to the speaker's utterances but also questions asked back to the speaker arise as appropriate.
• moreover, since either the first response voice Vy1 or the second response voice Vy2 is selected according to the result of comparing the prosodic index value Q representing the prosody of the utterance voice Vx with the threshold value QTH, it is possible to realize a natural voice dialogue simulating the tendency of an actual conversation, in which an utterance becomes harder to hear, and the need to ask back increases, when its prosody changes unexpectedly.
• in particular, in the third embodiment, the threshold value QTH is set according to the user U's past utterance voices Vx, so it reflects the speaker's standard prosody (that is, the prosody to which the dialogue partner is accustomed).
• the first response voice Vy1 is selected when the prosodic index value Q lies outside the allowable range R including the threshold value QTH, and the second response voice Vy2 is selected when it lies inside the allowable range R.
• therefore, compared with a configuration in which the first response voice Vy1 and the second response voice Vy2 are selected solely according to the magnitude relationship between the prosodic index value Q and the threshold value QTH, the possibility of the first response voice Vy1 being reproduced excessively often can be reduced (the first response voice Vy1 is reproduced at an appropriate frequency).
• in the above description, the reproduction of the first response voice Vy1 and the reproduction of the second response voice Vy2 are selected according to the prosodic index value Q of the utterance voice Vx; however, it is also possible to select them at a predetermined frequency regardless of the characteristics of the utterance voice Vx.
• for example, the response generation unit 36C causes the playback device 26 to play back the asking-back first response voice Vy1 for utterance voices Vx selected at random from the plural utterance voices Vx that the user U pronounces in succession, and to play back the second response voice Vy2 for the remaining utterance voices Vx.
• specifically, the response generation unit 36C generates a random number within a predetermined range for each utterance of the utterance voice Vx, and selects the first response voice Vy1 when the random number exceeds a threshold, while selecting the second response voice Vy2 when the random number falls below the threshold.
• in the configuration in which the asking-back first response voice Vy1 is reproduced for utterance voices Vx selected at random from the plural utterance voices Vx, questions asked back to the speaker occur at random, so it is possible to realize a natural voice dialogue simulating the tendency of actual spoken conversation.
• the response generation unit 36C can also change the ratio of the number of reproductions of the first response voice Vy1 to the number of utterances of the utterance voice Vx (that is, the reproduction frequency of the first response voice Vy1). For example, the response generation unit 36C controls the reproduction frequency of the first response voice Vy1 by adjusting the threshold compared with the random number. For example, when the reproduction frequency of the first response voice Vy1 is set to 30%, the first response voice Vy1 is reproduced for 30% of the total number of utterances of the utterance voice Vx, and the second response voice Vy2 is reproduced for the remaining 70% of the utterances.
• the reproduction frequency of the first response voice Vy1 (for example, the threshold compared with the random number) can be changed, for example, according to an instruction from the user U.
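The random-selection variant with an adjustable reproduction frequency can be sketched as follows; the seeded generator and the 30% setting are illustrative assumptions made only so the sketch is repeatable.

```python
# Sketch of the random-selection variant: the asking-back voice Vy1 is
# played for roughly the fraction of utterances given by `frequency`
# (e.g. 0.3 for a 30% reproduction frequency), and the backchannel
# voice Vy2 otherwise.

import random

def select_randomly(frequency, rng):
    return "Vy1" if rng.random() < frequency else "Vy2"

rng = random.Random(0)  # fixed seed, for repeatability of the sketch only
choices = [select_randomly(0.3, rng) for _ in range(1000)]
```

Over many utterances the share of Vy1 approaches the configured frequency; raising or lowering `frequency` plays the role of the user-adjustable threshold described above.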
  • FIG. 15 is a configuration diagram of a voice interaction device 100D according to the fourth embodiment of the present invention. Similar to the voice interaction apparatus 100A of the first embodiment, the voice interaction apparatus 100D of the fourth embodiment reproduces a response voice Vy corresponding to the uttered voice Vx pronounced by the user U.
• the voice interaction apparatus 100D of the fourth embodiment has a configuration in which the voice analysis unit 34A and the response generation unit 36A of the voice interaction apparatus 100A of the first embodiment are replaced with a history management unit 38 and a response generation unit 36D.
  • the configuration and operation of the other elements (voice input device 24, playback device 26, and voice acquisition unit 32) of the voice interactive device 100D are the same as in the first embodiment.
• the storage device 22 of the fourth embodiment stores a response signal Y representing a response voice Vy of specific utterance content. In the following description, a response voice Vy of "un" ("yes"), serving as a backchannel to the utterance voice Vx, is exemplified.
• the usage history H of the fourth embodiment indicates the number of voice dialogues executed in the past using the voice interaction apparatus 100D (hereinafter referred to as the "usage count" N).
• one voice dialogue spans from the start of the voice dialogue (activation of the voice interaction apparatus 100D) to its end (that is, a single voice dialogue includes a plurality of pairs of an utterance voice Vx and a response voice Vy).
• the history management unit 38 counts the number of such voice dialogues as the usage count N.
  • the usage history H generated by the history management unit 38 is stored in the storage device 22.
• the response generation unit 36D of the fourth embodiment causes the playback device 26 to play back a response voice Vy whose prosody accords with the usage history H generated by the history management unit 38. That is, the prosody of the response voice Vy is variably controlled according to the usage history H.
  • the reproduction standby time W of the response voice Vy is controlled according to the usage history H as the prosody of the response voice Vy.
  • the waiting time W is the length of time from the end point tB of the utterance voice Vx to the response start point ty of the response voice Vy (that is, the interval between the utterance voice Vx and the response voice Vy).
• the response generation unit 36D of the fourth embodiment controls the standby time W according to the usage history H such that the standby time W of the response voice Vy is shorter when the usage count N indicated by the usage history H is large than when it is small.
  • FIG. 16 is a flowchart of processing executed by the control device 20 of the fourth embodiment.
  • the process of FIG. 16 is started in response to an instruction from the user U (instruction to start a voice conversation program) to the voice interaction apparatus 100D.
  • the voice acquisition unit 32 acquires the utterance signal X from the voice input device 24 and stores it in the storage device 22 (S31). The acquisition of the utterance signal X by the voice acquisition unit 32 is repeated until the end of the utterance voice Vx (S32: NO).
• when the utterance voice Vx ends (S32: YES), the response generation unit 36D executes a response generation process SD for causing the playback device 26 to play back a response voice Vy with a prosody according to the usage history H stored in the storage device 22.
• the response generation process SD of the fourth embodiment is a process of controlling, according to the usage history H, the standby time W from the end point tB of the utterance voice Vx to the response start point ty at which reproduction of the response voice Vy starts.
  • the acquisition of the speech signal X by the voice acquisition unit 32 (S31) and the response generation process SD by the response generation unit 36D are repeated until the end of the voice conversation is instructed by the user U (S33: NO). Therefore, as in the first embodiment, a voice conversation is realized in which the sound of an arbitrary uttered voice Vx by the user U and the reproduction of the response voice Vy for the uttered voice Vx are alternately repeated.
• when the end of the voice dialogue is instructed (S33: YES), the history management unit 38 updates the usage history H stored in the storage device 22 so as to take the current voice dialogue into account (S34). Specifically, the history management unit 38 increments the usage count N indicated by the usage history H by one. Accordingly, the usage count N increases by 1 each time a voice dialogue is performed with the voice interaction apparatus 100D. After the usage history H is updated, the process of FIG. 16 ends.
  • FIG. 17 is a flowchart of the response generation process SD of the fourth embodiment
  • FIGS. 18 and 19 are explanatory diagrams of the response generation process SD.
  • the response generation unit 36D variably sets the standby time W according to the usage history H stored in the storage device 22 (SD1 to SD3). Specifically, the response generation unit 36D first determines whether or not the usage count N indicated by the usage history H exceeds a predetermined threshold value NTH (SD1). When the number of uses N exceeds the threshold value NTH (SD1: YES), the response generation unit 36D sets a predetermined basic value w0 (for example, 150 ms) as the standby time W as illustrated in FIG. 18 (SD2).
• on the other hand, when the usage count N is below the threshold value NTH (SD1: NO), the response generation unit 36D sets as the standby time W, as exemplified in FIG. 19, the sum (w0 + δw) of the basic value w0 and a predetermined adjustment value (offset) δw (SD3).
• the adjustment value δw is set to a predetermined positive number.
• in the above description, the standby time W is controlled in a binary manner according to whether the usage count N exceeds the threshold value NTH; however, it is also possible to vary the standby time W over multiple values according to the usage count N.
  • the response generation unit 36D waits until the standby time W set according to the usage history H in the above processing has elapsed from the end point tB of the uttered voice Vx (SD4: NO).
• when the standby time W has elapsed (SD4: YES), the response generation unit 36D reproduces the response voice Vy by supplying the response signal Y stored in the storage device 22 to the playback device 26 (SD5).
• as described above, the response generation unit 36D of the fourth embodiment causes the playback device 26 to play back a response voice Vy with a prosody (in the fourth embodiment, the standby time W) according to the usage history H of the voice interaction apparatus 100D.
• specifically, when the usage count N exceeds the threshold value NTH, the response voice Vy is reproduced after the standby time W of the basic value w0 elapses; when the usage count N is below the threshold value NTH, the response voice Vy is reproduced after the longer standby time W obtained by adding the adjustment value δw elapses. That is, the standby time W is shorter when the usage count N is large.
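The standby-time control of steps SD1 through SD3 can be sketched as follows. The basic value of 150 ms comes from the example above; the threshold NTH and the adjustment value δw are illustrative assumptions, since the patent text does not fix them.

```python
# Sketch of steps SD1-SD3: once the usage count N exceeds a threshold
# NTH, respond after the basic standby time w0; before that, add an
# adjustment delta (a longer, more "first-meeting" pause).
# Values are in milliseconds; nth and delta are illustrative.

def waiting_time_ms(n_uses, nth=10, w0=150, delta=100):
    return w0 if n_uses > nth else w0 + delta
```

A multi-valued variant, also mentioned above, could instead shrink the pause gradually (for example, interpolating `delta` down to zero as N approaches NTH) rather than switching between two values.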
• as described above, in the fourth embodiment, a response voice Vy with a prosody (standby time W) according to the usage history H of voice dialogues with the voice interaction apparatus 100D is reproduced, so it is possible to realize a natural voice dialogue simulating the tendency of an actual conversation, in which the prosody of the response voice changes over time as dialogues with a specific partner are repeated.
  • In particular, the standby time W, which is the interval between the utterance voice Vx and the response voice Vy, is controlled according to the usage history H. A natural voice dialogue is therefore realized that simulates the tendency of an actual dialogue in which the interval between utterance and response is long immediately after a first meeting and shortens as dialogue with the partner is repeated.
  • the voice interaction apparatus 100 (100A, 100B, 100C and 100D) exemplified in the above-described embodiments can be variously modified. Specific modifications are exemplified below. Two or more modes arbitrarily selected from the following examples can be appropriately combined within a range that does not contradict each other.
  • the configuration of the first embodiment that controls the prosody of the response voice Vy according to the prosody of the speech voice Vx (for example, pitch P) is similarly applied to the second to fourth embodiments.
  • the prosody of the response signal Y selected in step SB3 or step SB4 in FIG. 9 is controlled according to the prosody of the uttered voice Vx (for example, pitch P) and reproduced from the playback device 26.
  • a configuration is adopted in which the prosody of the response signal Y selected in step SC2 or step SC3 in FIG. 12 is controlled according to the prosody of the speech voice Vx.
  • In the fourth embodiment, a configuration may be adopted in which the prosody of the response signal Y acquired from the storage device 22 in step SD5 is controlled according to the prosody of the speech voice Vx.
  • When the configuration of the first embodiment is applied to the second to fourth embodiments, the pitch of the response signal Y is adjusted, as in the first embodiment, so that the pitch at the start point of a specific mora (typically the last mora) of the response voice Vy matches the lowest pitch Pmin in the end section E of the speech voice Vx.
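The pitch adjustment just described can be sketched as computing a transposition amount for the response signal Y; expressing the shift in semitones and the helper name are assumptions, not taken from the text:

```python
import math

def semitone_shift(last_mora_start_hz, pmin_hz):
    """Transposition (in semitones) to apply to the response signal Y so that
    the pitch at the start point of its last mora coincides with the lowest
    pitch Pmin observed in the end section E of the utterance voice Vx."""
    return 12.0 * math.log2(pmin_hz / last_mora_start_hz)
```

For example, a response whose last mora starts at 220 Hz would be shifted down a full octave (-12 semitones) to match a Pmin of 110 Hz.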
  • The configuration of the third embodiment, which selectively reproduces either the first response voice Vy1 for answering the uttered voice Vx or the second response voice Vy2 other than the answer, can also be applied to each embodiment other than the third embodiment. Further, the configuration of the fourth embodiment, which controls the prosody of the response voice Vy (for example, the standby time W) according to the voice conversation usage history H, can be applied to the first to third embodiments.
  • Various variables related to the voice conversation of each form described above can be changed according to an instruction from the user U, for example.
  • a configuration in which the type of response voice Vy reproduced from the reproduction device 26 is selected according to an instruction from the user U can also be adopted.
  • For example, it is also possible to set the length of the standby time W from the end point tB of the utterance voice Vx to the start point ty of the response voice Vy according to an instruction from the user U.
  • the configuration in which the reproduction frequency of the first response voice Vy1 for answering the uttered voice Vx can be changed according to the instruction from the user U is exemplified. It is also possible to control the reproduction frequency of the first response voice Vy1 in accordance with elements other than the above instruction. Specifically, a configuration may be employed in which the response generation unit 36D of the third embodiment controls the reproduction frequency of the first response voice Vy1 according to the usage history H exemplified in the fourth embodiment.
  • In an actual dialogue, it is assumed that as the speaker becomes accustomed to the characteristics of the conversation partner's utterances (for example, habits of speech or tone) through repeated dialogue, the frequency of answering back about the uttered speech decreases. In view of this tendency, a configuration is also suitable in which the reproduction frequency of the first response voice Vy1 is reduced as the usage count N indicated by the usage history H increases.
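One way to realize this decreasing tendency is to make the probability of selecting the first response voice Vy1 decay with the usage count N; the probability values and the linear decay below are illustrative assumptions, not from the text:

```python
import random

def choose_response_voice(usage_count, base_prob=0.3, decay=0.02, rng=random):
    """Return 'Vy1' (first response voice) with a probability that shrinks
    as the usage count N grows, otherwise 'Vy2' (second response voice).
    All numeric parameters are illustrative."""
    p = max(0.0, base_prob - decay * usage_count)
    return 'Vy1' if rng.random() < p else 'Vy2'
```

With these illustrative values, a user with a usage count of 15 or more would never hear Vy1, while a first-time user would hear it roughly 30% of the time.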
  • In the fourth embodiment, the usage count N of voice conversation is exemplified as the usage history H, but the usage history H is not limited to the usage count N. For example, it is also possible to apply, as the usage history H used to control the standby time W, the number of times the response voice Vy has been played, the frequency of use of voice conversation (the number of uses per unit time), the duration of use of voice conversation (for example, the time elapsed since the first use of the voice dialogue apparatus 100), or the time elapsed since the last use of the voice dialogue apparatus 100.
  • In each of the above embodiments, the response signal Y is generated from the audio signal Z stored in advance in the storage device 22 and reproduced; however, it is not essential that the audio signal Z be stored in advance in the storage device 22. For example, the response signal Y representing the response voice Vy of specific utterance content can be synthesized by a known voice synthesis technique. For synthesizing the response signal Y, for example, unit-concatenation speech synthesis or speech synthesis using a statistical model such as a hidden Markov model is preferably used.
  • the speech voice Vx and the response voice Vy are not limited to human voices. For example, it is also possible to use animal voices as speech voices Vx and response voices Vy.
  • In each of the above embodiments, a configuration in which the voice interaction device 100 includes the voice input device 24 and the playback device 26 is exemplified; however, it is also possible to install the voice input device 24 and the playback device 26 in a device (voice input/output device) separate from the voice interaction device 100. In that case, the voice interaction device 100 is realized by a terminal device such as a mobile phone or smartphone, for example, and the voice input/output device is realized by an electronic device such as an animal-shaped toy or robot. The voice interaction device 100 and the voice input/output device can communicate with each other wirelessly or by wire: the utterance signal X generated by the voice input device 24 of the voice input/output device is transmitted to the voice interaction device 100, and the response signal Y generated by the voice interaction device 100 is transmitted to the playback device 26 of the voice input/output device.
  • In each of the above embodiments, the voice interaction device 100 is realized by an information processing device such as a mobile phone or personal computer; however, part or all of the functions of the voice interaction device 100 can also be realized by a server device (a so-called cloud server). Specifically, the voice interaction device 100 is realized by a server device that communicates with a terminal device via a communication network such as a mobile communication network or the Internet. For example, the voice interaction apparatus 100 receives the utterance signal X generated by the voice input device 24 of the terminal apparatus from the terminal apparatus, and generates the response signal Y from the utterance signal X with the configuration according to each of the above-described embodiments.
  • the voice interaction device 100 transmits a response signal Y generated from the speech signal X to the terminal device, and causes the playback device 26 of the terminal device to play back the response voice Vy.
  • the voice interactive device 100 is realized by a single device or a set of a plurality of devices (that is, a server system).
  • It is also possible to realize at least part of the functions of the voice interaction device 100 according to each of the above embodiments (for example, the voice acquisition unit 32, the voice analysis units 34A and 34C, the response generation units 36A, 36B, 36C and 36D, and the history management unit 38) by a server device, and the other functions by a terminal device. How each function realized by the voice interaction apparatus 100 is divided between the server apparatus and the terminal apparatus (function sharing) is arbitrary.
  • In each of the above embodiments, the response voice Vy having specific utterance content (for example, "Yes") is reproduced in response to the utterance voice Vx, but the utterance content of the response voice Vy is not limited to the above examples.
  • For example, it is also possible to analyze the utterance content of the utterance voice Vx by voice recognition and morphological analysis of the utterance signal X, select a response voice Vy with content appropriate to that utterance content from a plurality of candidates, and reproduce it with the playback device 26.
  • On the other hand, in each of the above embodiments, the response voice Vy of utterance content prepared in advance is reproduced regardless of the utterance voice Vx. Considered simplistically, one might infer that no natural dialogue is established; in reality, however, by controlling the prosody of the response voice Vy in the various ways illustrated above, the user U can perceive a feeling like a natural dialogue between humans. Moreover, since voice recognition and morphological analysis are not executed, there is the advantage that the processing delay and processing load due to these processes are reduced or eliminated.
  • any one of a plurality of audio signals Z having different utterance contents can be selectively used for reproducing the response audio Vy.
  • Specifically, the response generation unit 36A of the first embodiment selects any one of the plurality of audio signals Z having different utterance contents from the storage device 22 and supplies the response signal Y corresponding to that audio signal Z to the playback device 26, whereby the response voice Vy is reproduced.
  • the method of selecting the audio signal Z is arbitrary, for example, a method of randomly selecting any one of the plurality of audio signals Z is assumed.
  • In a configuration in which the audio signal Z is selected at random, the voice analysis unit 34 (34A, 34C, or 34D) can be omitted, so that the processing load of the control device 20 is reduced.
  • A configuration may also be employed in which a response signal Y in which the prosody (for example, pitch or volume) of the audio signal Z has been adjusted is supplied to the playback device 26. For example, as illustrated in the first embodiment, a configuration in which the prosody (typically the pitch) of the audio signal Z is adjusted according to the lowest pitch Pmin of the uttered voice Vx, or a configuration in which the prosody of the audio signal Z is adjusted at random, is preferable.
  • The voice interaction apparatus 100 (100A, 100B, 100C, or 100D) exemplified in each of the above embodiments can also be used for evaluation of an actual dialogue between humans, for example by evaluating the prosody of response speech observed in the dialogue against the prosody of the response voice Vy.
  • the apparatus (dialog evaluation apparatus) that performs the evaluation exemplified above can be used for training of dialogue between humans.
  • the section extending over a predetermined length before the end point tB of the utterance voice Vx in the utterance voice Vx is exemplified as the end section E.
  • the condition of the end section E is not limited to the above examples.
  • It is also possible to define the end section E with a time point in the vicinity of the end point tB (a time point before the end point tB) of the speech voice Vx as its end point (that is, to specify the end section E by excluding the section in the immediate vicinity of the end point tB of the speech voice Vx).
  • the end section E is comprehensively expressed as a section near the end point tB in the speech voice Vx.
  • the voice interaction device 100 (100A, 100B, 100C, or 100D) exemplified in the above-described embodiments can be realized by the cooperation of the control device 20 and the program for voice interaction as described above.
  • A program according to a first aspect of the present invention causes a computer to execute: a voice acquisition process for acquiring an utterance signal representing an utterance voice; a voice analysis process for specifying the pitch of the utterance voice from the utterance signal; and a response generation process for causing a playback device to play back a response voice having a pitch corresponding to the lowest value of the pitch specified in the voice analysis process for the end section near the end point of the utterance voice.
  • A program according to a second aspect of the present invention causes a computer to execute: a voice acquisition process for acquiring an utterance signal representing an utterance voice; a voice analysis process for specifying the pitch of the utterance voice from the utterance signal; and a response generation process for causing a playback device to play back a response voice with a prosody according to the transition of the pitch specified in the voice analysis process for the end section near the end point of the utterance voice.
  • A program according to a third aspect of the present invention causes a computer to execute: a voice acquisition process for acquiring an utterance signal representing an utterance voice; and a response generation process for causing a playback device to selectively play back either a first response voice representing an answer to the utterance voice or a second response voice other than the answer.
  • A program according to a fourth aspect of the present invention is a program for voice dialogue that reproduces a response voice to an utterance voice, and causes a computer to execute: a voice acquisition process for acquiring an utterance signal representing the utterance voice; a history management process for generating a usage history of the voice dialogue; and a response generation process for causing a playback device to play back the response voice with a prosody according to the usage history.
  • the program according to each of the above aspects can be provided in a form stored in a computer-readable recording medium and installed in the computer.
  • The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but any known type of recording medium, such as a semiconductor recording medium or a magnetic recording medium, can be included.
  • A non-transitory recording medium includes any computer-readable recording medium except a transitory propagating signal, and does not exclude volatile recording media. It is also possible to distribute the program to a computer in the form of distribution via a communication network.
  • A voice interaction method according to a preferred aspect of the present invention includes: a voice acquisition step of acquiring an utterance signal representing an utterance voice; a voice analysis step of identifying the pitch of the utterance voice (for example, a time series of a plurality of numerical values representing the temporal change of the pitch) from the utterance signal; and a response generation step of causing a playback device to play back a response voice having a pitch corresponding to the lowest value of the pitch identified in the voice analysis step (for example, the minimum of the pitch in the end section) for the end section near the end point of the utterance voice.
  • the response sound having a pitch corresponding to the minimum value of the pitch in the end section near the end point of the uttered voice is played from the playback device. Therefore, it is possible to realize a natural voice conversation simulating the tendency of an actual conversation in which the conversation partner generates a response voice at a pitch corresponding to the pitch near the end point of the uttered voice.
  • In a preferred example, in the response generation step, the response voice is played back by the playback device with the pitch of the start point of the last mora in the response voice set to coincide with the minimum value of the pitch in the end section of the speech voice.
  • a voice interaction device includes a voice acquisition unit that acquires an utterance signal representing an utterance voice, a voice analysis unit that specifies a pitch of the utterance voice from the utterance signal, A response generation unit that causes the playback device to play back a response voice having a pitch corresponding to the lowest pitch value specified by the voice analysis unit for the end section near the end point of the uttered voice.
  • the response sound having a pitch corresponding to the minimum value of the pitch in the end section near the end point of the uttered voice is played from the playback device. Therefore, it is possible to realize a natural voice conversation simulating the tendency of an actual conversation in which the conversation partner generates a response voice at a pitch corresponding to the pitch near the end point of the uttered voice.
  • A voice interaction method according to another preferred aspect of the present invention includes: a voice acquisition step of acquiring an utterance signal representing an uttered voice; a voice analysis step of identifying the pitch of the uttered voice (for example, a time series of a plurality of numerical values representing the temporal change of the pitch) from the utterance signal; and a response generation step of causing a playback device to play back a response voice with a prosody according to the transition of the pitch identified in the voice analysis step for the end section near the end point of the uttered voice.
  • According to the above aspect, it is possible to realize a natural voice dialogue that simulates the tendency of an actual dialogue in which the conversation partner pronounces the response voice with a prosody according to the speaker's utterance.
  • In a preferred example of aspect 5 (aspect 6), in the response generation step, the playback device reproduces response voices having different prosody between a case in which the pitch changes from decreasing to increasing within the end section and a case in which the pitch decreases from the start point to the end point of the end section.
  • In the above aspect, the prosody of the response voice differs between these two cases. It is therefore possible to realize a natural voice dialogue that simulates the tendency of an actual dialogue in which the prosody of the response differs according to the pitch transition at the end of the utterance.
  • In a preferred example, in the response generation step, a first average pitch is obtained as the average of the pitch in a first section of the end section (for example, the average of a plurality of numerical values representing the temporal change of the pitch in the first section), and a second average pitch is obtained as the average of the pitch in a second section behind the first section (for example, the average of a plurality of numerical values representing the temporal change of the pitch in the second section); the playback device then reproduces response voices having different prosody between a case in which the first average pitch is below the second average pitch and a case in which the first average pitch exceeds the second average pitch.
  • the prosody of the response speech is made different according to the result of the comparison between the first average pitch in the first section ahead of the last section and the second average pitch in the second section behind.
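The comparison of the two average pitches described above can be sketched as follows; splitting the end section at its midpoint and the 'rising'/'falling' labels are illustrative assumptions (the text fixes neither the boundary between the first and second sections nor the prosody assigned to each case):

```python
def compare_end_section(pitches):
    """Compare the first average pitch (front half of the end section) with
    the second average pitch (rear half). Returns 'rising' when the first
    average is below the second, 'falling' otherwise."""
    mid = len(pitches) // 2
    first_avg = sum(pitches[:mid]) / mid
    second_avg = sum(pitches[mid:]) / (len(pitches) - mid)
    return 'rising' if first_avg < second_avg else 'falling'
```

An end section whose pitch turns upward toward the end point (for example, a question) yields 'rising', and the response generation step would then reproduce a response voice with a correspondingly different prosody.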
  • In a preferred example, in the response generation step, a response signal of the response voice corresponding to the transition of the pitch in the end section is acquired from a storage device that stores a plurality of response signals representing response voices of different prosody, and the playback device is caused to reproduce the response voice by output of that response signal.
  • In another preferred example, in the response generation step, a response signal representing a response voice with a prosody corresponding to the transition of the pitch in the end section is generated by adjusting the prosody of a response signal representing a response voice of a predetermined prosody, and the playback device is caused to reproduce the response voice by output of that response signal.
  • a voice interaction device includes a voice acquisition unit that acquires an utterance signal representing an utterance voice, a voice analysis unit that specifies a pitch of the utterance voice from the utterance signal, A response generating unit that causes the playback device to play back the response sound of the prosody according to the transition of the pitch specified by the voice analysis unit for the end section near the end point of the uttered voice.
  • According to the above aspect, it is possible to realize a natural voice dialogue that simulates the tendency of an actual dialogue in which the conversation partner pronounces the response voice with a prosody according to the speaker's utterance.
  • A voice interaction method according to another preferred aspect of the present invention includes: a voice acquisition step of acquiring an utterance signal representing an utterance voice; and a response generation step of causing a playback device to selectively play back either a first response voice representing an answer to the utterance voice or a second response voice other than the answer.
  • the first response sound that indicates the answer to the uttered voice and the second response sound other than the answer are selectively played back from the playback device.
  • the method includes a speech analysis step of identifying a prosodic index value representing the prosody of the uttered speech from the utterance signal, and the response generating step includes a prosodic index value and a threshold value of the uttered speech And either the first response voice or the second response voice is selected according to the result of the comparison.
  • In the above aspect, it is possible to realize a natural voice dialogue that simulates the tendency of an actual dialogue in which a question is more likely to be returned from the conversation partner when the prosody of the speaker's utterance deviates from the speaker's standard prosody (that is, the prosody assumed by the conversation partner).
  • In a preferred example, in the response generation step, the first response voice is selected when the prosodic index value is a numerical value outside a predetermined range including the threshold, and the second response voice is selected when it is a numerical value inside the predetermined range.
  • Since the first response voice is selected only when the prosodic index value falls outside the predetermined range, it is possible to reduce the possibility that the first response voice is reproduced with excessive frequency.
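A minimal sketch of this selection rule, assuming the predetermined range is symmetric about the threshold with half-width `margin` (the text does not specify the shape of the range):

```python
def select_response(prosody_index, threshold, margin):
    """Select the first response voice (played when the utterance prosody
    deviates from the speaker's standard) or the second response voice
    (played when it stays within the expected range). `margin` is an
    illustrative half-width for the predetermined range."""
    if abs(prosody_index - threshold) > margin:
        return 'first'   # outside the predetermined range
    return 'second'      # inside the predetermined range
```

For instance, with a pitch-based prosodic index, a threshold of 100 and a margin of 10, an utterance scoring 120 would trigger the first response voice, while one scoring 105 would receive an ordinary back-channel.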
  • In another preferred example, the first response voice is reproduced as a response to an utterance voice selected at random from a plurality of utterance voices. In the above aspect, it is possible to realize a natural voice dialogue that simulates the tendency of an actual voice dialogue in which answers to utterance voices occur at random.
  • A voice interaction apparatus according to another preferred aspect of the present invention includes: a voice acquisition unit that acquires an utterance signal representing an utterance voice; and a response generation unit that causes a playback device to selectively play back either a first response voice representing an answer to the utterance voice or a second response voice other than the answer.
  • In the above aspect, either the first response voice representing an answer to the spoken voice or the second response voice other than the answer is selectively reproduced from the playback device. Therefore, it is possible to realize a natural voice dialogue that simulates the tendency of an actual dialogue in which not only responses to the talker's utterances but also questions back to the talker occur as appropriate.
  • a voice interaction method is a method for executing a voice conversation for reproducing a response voice to a spoken voice, wherein a voice acquisition step of obtaining a speech signal representing the spoken voice; A history management step of generating a usage history of the voice dialogue, and a response generation step of causing a playback device to play back the response voice of the prosody according to the usage history.
  • the prosodic response voice corresponding to the usage history of the voice conversation is reproduced, the actual conversation tendency that the prosody of the utterance voice changes with time as the conversation with the specific conversation partner is repeated. It is possible to realize a simulated natural voice conversation.
  • a standby time that is an interval between the uttered voice and the response voice is controlled according to the usage history.
  • the conversation interval is long immediately after starting the dialogue at the first meeting, and the dialogue with the conversation partner is A natural voice dialogue simulating the tendency of an actual dialogue in which the interval of the dialogue is shortened as it is repeated is realized.
  • a voice interaction apparatus is an apparatus for executing a voice conversation for reproducing a response voice to an utterance voice, and a voice acquisition unit for acquiring an utterance signal representing the utterance voice; A history management unit that generates a usage history of the voice conversation; and a response generation unit that causes a playback device to reproduce prosodic response speech according to the usage history.
  • the prosodic response voice corresponding to the usage history of the voice conversation is reproduced, the actual conversation tendency that the prosody of the utterance voice changes with time as the conversation with the specific conversation partner is repeated. It is possible to realize a simulated natural voice conversation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention concerns a voice interaction device that acquires an utterance signal representing an utterance voice, identifies the pitch of the utterance voice from the utterance signal, and causes a playback device to play back a response voice with a pitch corresponding to the lowest value of the pitch identified in the end segment near the end point of the utterance voice.
PCT/JP2016/085126 2015-12-07 2016-11-28 Dispositif d'interaction vocale et procédé d'interaction vocale WO2017098940A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP16872840.0A EP3389043A4 (fr) 2015-12-07 2016-11-28 Dispositif d'interaction vocale et procédé d'interaction vocale
CN201680071706.4A CN108369804A (zh) 2015-12-07 2016-11-28 语音交互设备和语音交互方法
US16/002,208 US10854219B2 (en) 2015-12-07 2018-06-07 Voice interaction apparatus and voice interaction method

Applications Claiming Priority (10)

Application Number Priority Date Filing Date Title
JP2015238912A JP6657888B2 (ja) 2015-12-07 2015-12-07 音声対話方法、音声対話装置およびプログラム
JP2015-238911 2015-12-07
JP2015-238914 2015-12-07
JP2015238913A JP6728660B2 (ja) 2015-12-07 2015-12-07 音声対話方法、音声対話装置およびプログラム
JP2015-238913 2015-12-07
JP2015238911A JP6657887B2 (ja) 2015-12-07 2015-12-07 音声対話方法、音声対話装置およびプログラム
JP2015238914 2015-12-07
JP2015-238912 2015-12-07
JP2016088720A JP6569588B2 (ja) 2015-12-07 2016-04-27 音声対話装置およびプログラム
JP2016-088720 2016-04-27

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/002,208 Continuation US10854219B2 (en) 2015-12-07 2018-06-07 Voice interaction apparatus and voice interaction method

Publications (1)

Publication Number Publication Date
WO2017098940A1 true WO2017098940A1 (fr) 2017-06-15

Family

ID=59013065

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2016/085126 WO2017098940A1 (fr) 2015-12-07 2016-11-28 Dispositif d'interaction vocale et procédé d'interaction vocale

Country Status (1)

Country Link
WO (1) WO2017098940A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS62115199A (ja) * 1985-11-14 1987-05-26 日本電気株式会社 音声応答装置
JP2008151840A (ja) * 2006-12-14 2008-07-03 Nippon Telegr & Teleph Corp <Ntt> 仮音声区間決定装置、方法、プログラム及びその記録媒体、音声区間決定装置
JP2012128440A (ja) 2012-02-06 2012-07-05 Denso Corp 音声対話装置
JP2014191029A (ja) * 2013-03-26 2014-10-06 Fuji Soft Inc 音声認識システムおよび音声認識システムの制御方法
WO2014192959A1 (fr) * 2013-05-31 2014-12-04 ヤマハ株式会社 Procédé permettant de répondre à des remarques au moyen d'une synthèse de la parole


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3389043A4 *

Similar Documents

Publication Publication Date Title
US10854219B2 (en) Voice interaction apparatus and voice interaction method
WO2017006766A1 (fr) Procédé et dispositif d&#39;interaction vocale
US10789937B2 (en) Speech synthesis device and method
EP3065130B1 (fr) Synthèse de la parole
JP2006113546A (ja) 情報伝達装置
JP2012159540A (ja) 話速変換倍率決定装置、話速変換装置、プログラム、及び記録媒体
JP2016105142A (ja) 会話評価装置およびプログラム
JP6270661B2 (ja) 音声対話方法、及び音声対話システム
WO2019172397A1 (fr) Procédé de traitement de la voix, dispositif de traitement de la voix et support d&#39;enregistrement
JP6569588B2 (ja) 音声対話装置およびプログラム
WO2019181767A1 (fr) Dispositif de traitement de son, procédé de traitement de son et programme
JP6728660B2 (ja) 音声対話方法、音声対話装置およびプログラム
JP6657887B2 (ja) 音声対話方法、音声対話装置およびプログラム
JP6657888B2 (ja) 音声対話方法、音声対話装置およびプログラム
WO2017098940A1 (fr) Dispositif d&#39;interaction vocale et procédé d&#39;interaction vocale
WO2018164278A1 (fr) Procédé et dispositif de conversation vocale
JP2019060941A (ja) 音声処理方法
JP2018146907A (ja) 音声対話方法および音声対話装置
JP6182894B2 (ja) 音響処理装置および音響処理方法
JP2019159014A (ja) 音声処理方法および音声処理装置
JP2018159777A (ja) 音声再生装置、および音声再生プログラム

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16872840

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2016872840

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2016872840

Country of ref document: EP

Effective date: 20180709