WO2016002879A1 - Voice synthesis device, voice synthesis method, and program - Google Patents

Voice synthesis device, voice synthesis method, and program

Info

Publication number
WO2016002879A1
Authority
WO
WIPO (PCT)
Prior art keywords
unit
answer
primary
voice
interjection
Prior art date
Application number
PCT/JP2015/069126
Other languages
French (fr)
Japanese (ja)
Inventor
松原 弘明 (Hiroaki Matsubara)
Original Assignee
ヤマハ株式会社 (Yamaha Corporation)
Priority date
Filing date
Publication date
Priority to JP2014-136812
Application filed by Yamaha Corporation (ヤマハ株式会社)
Publication of WO2016002879A1 publication Critical patent/WO2016002879A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L 13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047 Architecture of speech synthesisers
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Abstract

In order to synthesize an answer to a user's question so that it feels natural, a voice synthesis device is provided with: a voice input unit (102) that inputs voice; an acquisition unit (22) that acquires a primary answer to the voice input by the voice input unit (102); an analysis unit (112) that analyzes whether the primary answer includes an object to be repeated; and a voice synthesis unit (24) that, when the primary answer is analyzed to include the object to be repeated, synthesizes and outputs a secondary answer in which the object is repeated twice or more.

Description

Speech synthesis device, speech synthesis method, and program

The present invention relates to a speech synthesis device, a speech synthesis method, and a program.

Recently, various speech synthesis techniques have been proposed, as follows. There is a technique that makes synthesized output voice more human-like by matching it to the speech tone and voice quality of the user (see, for example, Patent Document 1), and a technique that analyzes the user's voice to diagnose the user's psychological and health condition (see, for example, Patent Document 2). Further, a speech dialogue system has been proposed that recognizes voice input by a user and outputs content specified by a scenario via speech synthesis, thereby realizing voice interaction with the user (see, for example, Patent Document 3).

Patent Document 1: JP 2003-271194 A. Patent Document 2: Japanese Patent No. 4495907. Patent Document 3: Japanese Patent No. 4832097.

However, consider a speech synthesis device that combines the speech synthesis technology and the speech dialogue system described above so that, in response to a user's spoken question, it retrieves data and outputs an answer by speech synthesis. In such a device, the voice output by speech synthesis can feel unnatural to the user; specifically, it can give the distinct impression that a machine is talking, and this has been pointed out as a problem.
The present invention has been made in view of such circumstances, and an object thereof is to provide a speech synthesis device, a speech synthesis method, and a program capable of giving the user a natural feeling.

To achieve the above object, a speech synthesis device according to one aspect of the present invention includes: a voice input unit that inputs voice; an acquisition unit that acquires a primary answer to the voice input by the voice input unit; an analysis unit that analyzes whether the primary answer includes an object of repetition; and a voice synthesis unit that, when the primary answer is analyzed to include the object of repetition, synthesizes and outputs a secondary answer in which the object is repeated two or more times. According to this aspect, when the primary answer contains an object of repetition, the object is repeated in the synthesized output, which can give the user a natural impression, as if interacting with a person.

In the speech synthesis device according to the above aspect, the object of repetition may be an interjection whose number of syllables is two or less, and the analysis unit may determine whether the primary answer contains an interjection and, if it is determined that an interjection is contained, analyze whether the number of syllables of the interjection is two or less.

In the speech synthesis device according to the above aspect, the acquisition unit may include a language analysis unit that analyzes the meaning of the voice input by the voice input unit, and a primary answer creation unit that creates a primary answer to the meaning analyzed by the language analysis unit. With this configuration, an answer whose content is appropriate to the input voice is output by speech synthesis.

In the speech synthesis device according to the above aspect, a repetition unit may be included that, when the primary answer is analyzed to include the object of repetition, outputs the object repeated two or more times. With this configuration, the answer can convey nuances such as emphatic confirmation or an attempt to express sympathy, giving the impression of actually interacting with a person.
When the repetition unit is provided, the voice synthesis unit may synthesize, as the secondary answer, the interjection repeated by the repetition unit when the number of syllables of the interjection contained in the primary answer is two or less, and may synthesize the primary answer as the secondary answer when the number of syllables of the interjection contained in the primary answer is three or more. That is, an interjection of two or fewer syllables is repeated and synthesized as the secondary answer, while a primary answer containing an interjection of three or more syllables is synthesized as the secondary answer unchanged.
The voice synthesis unit may include a voice sequence creation unit that creates a voice sequence from the answer, and a synthesis unit that synthesizes and outputs a voice signal based on the voice sequence.

In the speech synthesis device according to the above aspect, the device may have a first mode and a second mode, and the primary answer creation unit may create, in the first mode, a primary answer in which specific content is added to an interjection, and, in the second mode, a primary answer consisting of an interjection alone or a primary answer consisting of specific content alone. In the first mode, a secondary answer with specific content added to the interjection is created, and if the number of syllables of the interjection is two or less, the interjection is repeated, which can further enhance the user's sense of empathy.

In the speech synthesis device according to the above aspect, a prohibition mode may be provided; in the prohibition mode, the voice synthesis unit may synthesize the interjection of two or fewer syllables without repetition. Mere repetition of an interjection may, in some circumstances, cause discomfort to the user, and the prohibition mode can prevent such discomfort. In this configuration, the repetition unit may cancel its repetition function for interjections of two or fewer syllables when the prohibition mode is set.
The present invention can be conceived not only as a speech synthesis device, but also as a speech synthesis method, as a program for causing a computer to function as the speech synthesis device, and as a computer-readable recording medium storing such a program.

FIG. 1 is a diagram showing the configuration of the speech synthesis device according to the first embodiment. FIG. 2 is a block diagram showing the functional arrangement of the speech synthesis device. FIG. 3 is a flowchart showing the operation of the speech synthesis device according to the first embodiment. FIG. 4A is a diagram illustrating an example of a question by the user, and FIG. 4B is a diagram illustrating an example of an answer synthesized by the speech synthesis device. FIG. 5A is a diagram illustrating another example of a question by the user, and FIG. 5B is a diagram illustrating an example of an answer synthesized by the speech synthesis device. FIG. 6 is a diagram showing the configuration of the speech synthesis device according to the second embodiment. FIG. 7 is a flowchart showing the operation of the speech synthesis device according to the second embodiment. FIGS. 8 and 9 are diagrams illustrating examples of answers synthesized by the speech synthesis device.

First, an outline of the speech synthesis device according to the embodiments will be given.
In interaction between people, consider the case where one person (person a) asks a question and another person (person b) answers it. In this case, person b often does not simply state the answer to person a's question, but repeats part or all of it. For example, when person b answers a question from person a affirmatively with the Japanese "hai" ("yes"; expressed in roman letters separated by syllables, [ha-i]; in the following, the notation in [ ] shows a word separated into syllables), person b often does not answer simply "hai" ([ha-i]) but repeats it, as in "hai hai" ([ha-i-ha-i]).
On the other hand, when person b answers a question from person a, there are also cases where such repetition does not occur. For example, when person b answers negatively with the Japanese "iie" ("no", [i-i-e]), repetition such as "iie iie" ([i-i-e-i-i-e]) is rarely observed.

Even when the meaning of the answer is the same, the tendency to repeat can be exactly reversed in a different language. For example, when person b gives the affirmative answer "Yes" in English, it is rarely repeated as "Yes, yes". Conversely, when person b gives the negative answer "No" in English, it is often repeated as "No, no".

Further, for example, when person a asks in Japanese a question meaning "Will it be sunny tomorrow?", and person b tries to answer affirmatively in Japanese with "hai, hare desu" ([ha-i ha-re-de-su], "Yes, it will be sunny"), person b sometimes answers by repeating the "hai", as in "hai hai, hare desu" ([ha-i-ha-i ha-re-de-su]). However, when person b tries to answer with the same meaning in English ("Yes, it will be sunny tomorrow."), answers that repeat the "Yes", as in "Yes, yes, it will be sunny tomorrow.", are rare.

Here, the present inventor considered whether the boundary between answers in which part or all is repeated and answers in which it is not is whether the answer contains an interjection whose number of syllables is two or less.
In the above examples, the Japanese "hai" ([ha-i]) and the English "No" are interjections whose number of syllables is two or less, and they are repeated. By contrast, the Japanese "iie" ([i-i-e]), although an interjection, has three or more syllables, and repetition is likewise rare for the English "Yes".
Thus, regardless of semantic content such as affirmative or negative, and regardless of language, attention can first be focused on the number of syllables of the interjection.

The reason an interjection whose number of syllables is two or less is repeated can be thought of as follows: because it is short, it is repeated unconsciously for emphasis, in an attempt to obtain sympathy, to convey resonance with or a wish to go along with the other party's remark, or so as not to give the other party a cold impression.

Note that an interjection is a word that expresses emotion, response, or address, has no inflection, and can stand alone as an utterance. Examples besides the above include, in Japanese, the back-channel "fumu" ([fu-mu], "hmm") and "sou" ([so-u], "right"); in English, "Ah" and "Oh"; and in Chinese, "shi" ([shi]) and "mingbai" ([ming-ba-i], "understood"). As described later, animal sounds may also be included.

A syllable is a segment into which sound is divided when uttered; it is typically composed around one vowel, as in types 1 to 4 below, consisting of the vowel alone or of the vowel together with one or more consonants before and/or after it, and refers to a unit perceived as a single sound when heard.
1. Vowel (V)
2. Consonant + vowel (CV)
3. Vowel + consonant (VC)
4. Consonant + vowel + consonant (CVC)
In Japanese, syllables of types 1 and 2 above exist, but syllables of types 3 and 4 do not.
Note that types 1 to 4 above are one example of classifying syllables; in some languages, the unit perceived when hearing sound is centered on consonants rather than vowels. Also, in a tone language such as Chinese, a syllable may be constituted by further adding to the combination of vowels and consonants a tone, that is, a variation in the pitch of the vowel.
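As a rough illustration of the syllable count used throughout this description, the following Python sketch counts syllables in the hyphen-separated romanization adopted above (e.g. [ha-i]); it is a minimal sketch that leans on that notation and does not syllabify raw orthography.

```python
def count_syllables(romanized: str) -> int:
    """Count syllables in the hyphen-separated romanization used in this
    description, e.g. "ha-i" -> 2, "i-i-e" -> 3."""
    return len([s for s in romanized.split("-") if s])

assert count_syllables("ha-i") == 2          # Japanese "hai" (yes)
assert count_syllables("i-i-e") == 3         # Japanese "iie" (no)
assert count_syllables("ha-re-de-su") == 4   # Japanese "hare desu"
```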

Thus, in actual interaction between people, in response to a question by person a, person b is often found to answer by repeating an interjection of two or fewer syllables. In summary, the speech synthesis device according to the embodiments of the present invention is configured so that, in order to give the user the feeling of interacting with a person, when the answer to a question contains an interjection whose number of syllables is two or less, the interjection is repeated two or more times and output by speech synthesis.
The reason an interjection whose number of syllables is two or less is repeated is, as described above, that its very shortness invites unconscious repetition; simple words of two or fewer syllables tend to be repeated even when they are not interjections. Further, even an interjection of three or more syllables may be repeated unconsciously, for example as confirmation. Configurations that take these tendencies into account are described later.
The speech synthesis device according to the embodiments will now be described in detail with reference to the drawings.

<First Embodiment>
FIG. 1 is a diagram showing the hardware configuration of the speech synthesis device 10 according to the first embodiment.
The speech synthesis device 10 is a terminal device such as a mobile phone and, as shown in the figure, includes a CPU (Central Processing Unit) 11, a memory 12, a display unit 13, a touch panel 14, a voice input unit 102, a communication unit 126, and a voice output unit 142.

The CPU 11 controls the entire speech synthesis device 10. The memory 12 is used as the main storage of the CPU 11 and stores an application program for speech synthesis and various data. The display unit 13 is, for example, a liquid crystal display device and displays various screens for settings and operations. The touch panel 14 detects a touch position on the screen displayed by the display unit 13 and outputs information indicating the detected touch position.

The voice input unit 102, the details of which are omitted, is composed of a microphone that converts voice into an electrical signal, an LPF (low-pass filter) that cuts the high-frequency components of the converted voice signal, and an A/D converter that converts the filtered voice signal into a digital signal. The communication unit 126 communicates with an external server via the Internet. The voice output unit 142 is composed of a D/A converter that converts the synthesized voice signal into an analog signal, an amplifier that amplifies the converted signal, and a speaker that converts the amplified signal into sound and outputs it.
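The microphone, LPF, and A/D converter of the voice input unit 102 are hardware, but their effect can be mimicked in software. The following is a minimal sketch assuming the third-party `sounddevice` and `scipy` packages; the sample rate and cutoff frequency are arbitrary choices, not values taken from this description.

```python
import sounddevice as sd                   # assumed third-party package
from scipy.signal import butter, lfilter   # assumed third-party package

RATE = 16000     # sample rate in Hz (arbitrary choice)
CUTOFF = 4000    # low-pass cutoff in Hz (arbitrary choice)

def record_and_filter(seconds: float):
    """Record from the default microphone and cut high-frequency
    components, mimicking the mic -> LPF -> A/D chain of unit 102."""
    audio = sd.rec(int(seconds * RATE), samplerate=RATE, channels=1)
    sd.wait()                                 # block until recording ends
    b, a = butter(4, CUTOFF / (RATE / 2))     # 4th-order Butterworth LPF
    return lfilter(b, a, audio[:, 0])         # filtered digital signal
```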

The speech synthesis device 10 realizes the function of outputting a speech-synthesized answer to a question by the user through execution of the application program; that is, the function is realized by cooperation between a processing device such as the CPU and the application program.
The application program may, for example, be downloaded from a specific site via the Internet and installed on the terminal device, or may be provided and installed in a form stored in a computer-readable recording medium such as a memory card.

In the speech synthesis device 10, the CPU 11 executing the above application program constructs the following functional blocks.

FIG. 2 is a block diagram showing the functional arrangement of the speech synthesis device 10.
As shown in this figure, in the speech synthesis device 10, an acquisition unit 22, a voice synthesis unit 24, an analysis unit 112, a repetition unit 114, a language database 122, an answer database 124, and a voice library 128 are constructed. Of these, the acquisition unit 22 includes a language analysis unit 108 and a primary answer creation unit 110, and the voice synthesis unit 24 includes a voice sequence creation unit 116 and a synthesis unit 118. The voice input unit 102, communication unit 126, and voice output unit 142 are as described above.

The language analysis unit 108 analyzes (identifies) the semantic content of the voice input to the voice input unit 102 by processing the voice signal. Specifically, the language analysis unit 108 determines which phonemes the voice signal is close to by referring to phoneme models prepared in advance in the language database 122, and thereby analyzes the semantic content of the words. Hidden Markov models, for example, can be used as the phoneme models.

The primary answer creation unit 110 creates a primary answer, as text, corresponding to the semantic content of the voice analyzed by the language analysis unit 108, referring to the answer database 124 and the communication unit 126.
For example, for the question "What time is it now?", the unit acquires time information from a built-in real-time clock (not shown), obtains the information other than the time (for example, a fixed phrase) from the answer database 124, and creates a primary answer such as "It is ○○:○○."
On the other hand, for the question "What will the weather be tomorrow?", a primary answer cannot be created by the speech synthesis device 10 alone without accessing an external server to acquire weather information. Thus, when a primary answer cannot be created from the answer database 124 alone, the communication unit 126 accesses an external server via the Internet and obtains the information necessary for the primary answer. In other words, the primary answer creation unit 110 is configured to obtain the primary answer to the question from the answer database 124 or from an external server.
When the necessary information is obtained, the primary answer creation unit 110 creates a primary answer such as "It will be ○○" for the question, using a fixed phrase. The primary answer creation unit 110 may also create as the primary answer, rather than specific content, a simple yes/no answer such as "hai" ("yes") or "iie" ("no"), or an interjection such as the back-channel "sou" ("right").
Here, the primary answer means the answer created by the primary answer creation unit 110, at the stage before the interjection is repeated; the term distinguishes it from the final secondary answer that is actually synthesized.
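The lookup order just described (answer locally from the answer database when possible, otherwise fetch the missing information from an external server) might look like the following sketch. All names here (`ANSWER_DB`, `fetch_from_server`, `create_primary_answer`) are illustrative assumptions, not interfaces defined by this description.

```python
import datetime

# Fixed phrases keyed by question type (a stand-in for answer database 124).
ANSWER_DB = {
    "time":    "It is {hour:02d}:{minute:02d}.",
    "weather": "It will be {weather}.",
}

def fetch_from_server(topic: str) -> str:
    """Hypothetical stand-in for communication unit 126 querying an
    external server via the Internet."""
    return {"weather": "sunny"}.get(topic, "")

def create_primary_answer(question_type: str) -> str:
    if question_type == "time":
        now = datetime.datetime.now()    # built-in real-time clock
        return ANSWER_DB["time"].format(hour=now.hour, minute=now.minute)
    if question_type == "weather":
        # Not answerable locally: obtain the information from the server.
        return ANSWER_DB["weather"].format(weather=fetch_from_server("weather"))
    return "hai"    # simple interjection answer, e.g. yes/no or back-channel
```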

The analysis unit 112 first determines whether the primary answer created by the primary answer creation unit 110 contains an interjection and, second, if one is contained, analyzes the number of syllables of the interjection.
When the primary answer contains an interjection and the number of syllables of the interjection is analyzed to be two or less, the analysis unit 112 supplies the primary answer to the repetition unit 114. When the analysis unit 112 determines that the primary answer contains no interjection, or when an interjection is contained but its number of syllables is analyzed to be three or more, the analysis unit 112 outputs the primary answer as-is to the voice sequence creation unit 116 as the secondary answer.

For example, if the primary answer created by the primary answer creation unit 110 is "hare desu" ("It will be sunny"), the analysis unit 112 determines that the primary answer contains no interjection. If the primary answer is the Japanese "hai" ([ha-i]), the analysis unit 112 determines that the primary answer contains an interjection and analyzes the number of syllables of the interjection to be two or less. If the primary answer is the Japanese "iie" ([i-i-e]), the analysis unit 112 determines that the primary answer contains an interjection and analyzes the number of syllables of the interjection to be three or more.
Further, if the created primary answer is the English "Yes", the analysis unit 112 determines that the primary answer contains an interjection and analyzes the number of syllables of the interjection to be three or more; if the primary answer is the English "No", the analysis unit 112 determines that the primary answer contains an interjection and analyzes the number of syllables of the interjection to be two or less.

The determination of whether the primary answer contains an interjection of two or fewer syllables may be performed by the analysis unit 112 analyzing the text of the primary answer, for example as follows. Specifically, the primary answer creation unit 110 creates the primary answer in a form that allows the interjection to be distinguished from the other portions, and the analysis unit 112 holds interjections of two or fewer syllables registered in advance; if an identifiable interjection exists in the created primary answer and matches a registered interjection, the analysis unit 112 may determine that the primary answer contains an interjection of two or fewer syllables. If no identifiable interjection exists in the primary answer, or if one exists but does not match any registered interjection, the analysis unit 112 may determine that the primary answer does not contain an interjection of two or fewer syllables.
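The registered-interjection matching described in this paragraph might be sketched as follows. The per-language registries are illustrative, seeded only with examples from this description, and the marked-interjection argument assumes that the primary answer creation unit 110 marks the interjection as suggested above.

```python
from typing import Optional

# Interjections of two or fewer syllables, registered in advance
# (seeded with examples from this description; illustrative only).
SHORT_INTERJECTIONS = {
    "ja": {"hai", "fumu", "sou"},
    "en": {"no", "ah", "oh"},
    "zh": {"shi"},
}

def contains_short_interjection(interjection: Optional[str], lang: str) -> bool:
    """Return True when the interjection marked in the primary answer
    matches a registered interjection of two or fewer syllables."""
    if not interjection:    # no identifiable interjection in the answer
        return False
    return interjection.lower() in SHORT_INTERJECTIONS.get(lang, set())

# Japanese "hai" is registered and will be repeated; "iie" (three
# syllables) is not registered and passes through unchanged.
assert contains_short_interjection("hai", "ja") is True
assert contains_short_interjection("iie", "ja") is False
```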

The repetition unit 114 repeats the interjection of two or fewer syllables a plurality of times (twice in the first embodiment) and outputs it as the secondary answer to be synthesized. In the above example, if the primary answer created by the primary answer creation unit 110 is the Japanese "hai", the repetition unit 114 repeats it twice and outputs the secondary answer "hai hai". If the primary answer is the English "No", the repetition unit 114 repeats it twice and outputs the secondary answer "No, no".

The voice sequence creation unit 116 creates a voice sequence from the secondary answer (either the answer whose interjection has been repeated by the repetition unit 114, or the secondary answer output as-is from the analysis unit 112) and supplies it to the synthesis unit 118.
Here, the voice sequence is data for speech synthesis of the secondary answer; in particular, it is data defining how the secondary answer should be uttered: at what timing, at what pitch, at what volume, and so on.
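A voice sequence of this kind (text plus the timing, pitch, and volume at which it should be uttered) can be pictured as a simple data structure. The following dataclass is a sketch; the field set and units are assumptions, not a format defined by this description.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VoiceEvent:
    """One uttered unit of the secondary answer."""
    text: str         # speech unit, e.g. "ha"
    onset_sec: float  # at which timing it should be uttered
    pitch_hz: float   # at what pitch
    volume: float     # at what volume (0.0 to 1.0)

# A voice sequence is an ordered list of such events, e.g. for "hai hai":
voice_sequence: List[VoiceEvent] = [
    VoiceEvent("ha", 0.00, 220.0, 0.8),
    VoiceEvent("i",  0.15, 200.0, 0.8),
    VoiceEvent("ha", 0.40, 220.0, 0.8),
    VoiceEvent("i",  0.55, 190.0, 0.7),
]
```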

The synthesis unit 118 synthesizes speech based on the voice sequence and outputs the synthesized voice signal as a digital signal.
The synthesis unit 118 synthesizes the speech, for example, in the following manner. That is, the synthesis unit 118 converts the answer content defined by the voice sequence into a sequence of speech units, selects the speech unit data corresponding to each speech unit from the voice library 128, connects them while modifying them so that the joined portions are continuous, and converts the pitch and volume of each piece of connected speech unit data to match the pitch and volume defined by the voice sequence, thereby synthesizing the speech.
The voice library 128 referred to here is a database, compiled in advance, of speech unit data defining the waveforms of various speech units used as voice material, such as single phonemes and transitions from phoneme to phoneme.
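The select-and-concatenate procedure of the synthesis unit 118 can be sketched as below, assuming a hypothetical voice library mapping speech-unit names to waveform arrays at a fixed sample rate. Real concatenative synthesizers also adjust pitch and volume with signal-processing techniques beyond this sketch; only the continuity of the joined portions is illustrated, via a short crossfade.

```python
import numpy as np

CROSSFADE = 160   # samples blended at each join (about 10 ms at 16 kHz)

def concatenate_units(units, voice_library):
    """Concatenate waveform snippets for each speech unit, crossfading
    at the joins so the connected portions are continuous.

    `units` is a list of speech-unit names; `voice_library` maps each
    name to a 1-D numpy waveform (a stand-in for voice library 128)."""
    out = np.zeros(0)
    for name in units:
        seg = voice_library[name].astype(float)
        if len(out) >= CROSSFADE and len(seg) >= CROSSFADE:
            fade = np.linspace(0.0, 1.0, CROSSFADE)
            # Blend the tail of the output so far with the head of the
            # new unit, so the joined portions are continuous.
            out[-CROSSFADE:] = out[-CROSSFADE:] * (1 - fade) + seg[:CROSSFADE] * fade
            seg = seg[CROSSFADE:]
        out = np.concatenate([out, seg])
    return out
```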

Next, the operation of the speech synthesis device 10 will be described. FIG. 3 is a flowchart showing the processing operation of the speech synthesis device 10.
When the user performs a predetermined operation, for example selecting the icon corresponding to the dialogue process on a main menu screen displayed on the display unit 13, the CPU 11 starts the application program corresponding to the process. By executing this application program, the functional blocks shown in FIG. 2 are constructed.

First, the user inputs a question by voice to the voice input unit 102; the voice input unit 102 converts the voice into a digital voice signal and supplies it to the language analysis unit 108 (step Sa11).
Next, the language analysis unit 108 analyzes the voice signal of the question and supplies its meaning (as text) to the primary answer creation unit 110 (step Sa12).
The primary answer creation unit 110 creates a primary answer corresponding to the analyzed voice, referring to the answer database 124 and, as necessary, to information obtained from the external server via the communication unit 126, and supplies it to the analysis unit 112 (step Sa13).

The analysis unit 112 determines whether the created primary answer contains an interjection and whether the number of syllables of the interjection is two or less (step Sa14). If the determination result is "Yes", the analysis unit 112 supplies the primary answer to the repetition unit 114; the repetition unit 114 repeats the interjection of two or fewer syllables twice and supplies it to the voice sequence creation unit 116 as the secondary answer to be synthesized (step Sa15).
On the other hand, if the created primary answer contains no interjection, or contains an interjection whose number of syllables is three or more (if the determination result in step Sa14 is "No"), the analysis unit 112 supplies the primary answer as-is to the voice sequence creation unit 116 as the secondary answer to be synthesized.

The voice sequence creation unit 116 creates a voice sequence corresponding to the secondary answer output from either the analysis unit 112 or the repetition unit 114, and supplies it to the synthesis unit 118 (step Sa16). The utterance timing, pitch, volume, and the like of the answer defined by the voice sequence may be obtained as model data from a database (not shown).
Then, the synthesis unit 118 synthesizes the secondary answer according to the voice sequence created by the voice sequence creation unit 116 (step Sa17). After the secondary answer is output by speech synthesis, although not specifically shown, the CPU 11 terminates execution of the application program and returns to the menu screen.
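Putting steps Sa11 to Sa17 together, the flow of FIG. 3 can be sketched as a single function. The stubs below are hypothetical stand-ins for the units described above, included only so the sketch is self-contained.

```python
# Hypothetical stubs standing in for the units described above.
def recognize(voice_signal) -> str:
    return "will it be sunny tomorrow"     # Sa12 result (meaning as text)

def create_primary_answer(question: str) -> str:
    return "hai"                           # Sa13 result, as in FIG. 4

SHORT_INTERJECTIONS = {"hai", "no"}        # registered, two or fewer syllables

def respond(voice_signal) -> str:
    """Steps Sa11 to Sa17 of FIG. 3, in outline."""
    question = recognize(voice_signal)              # Sa12: analyze the question
    primary = create_primary_answer(question)       # Sa13: create the primary answer
    if primary.lower() in SHORT_INTERJECTIONS:      # Sa14: short interjection?
        secondary = f"{primary} {primary}"          # Sa15: repeat it twice
    else:
        secondary = primary                         # pass through as secondary answer
    return secondary                                # Sa16/Sa17: sequence + synthesis

print(respond(b"..."))                              # -> "hai hai"
```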

FIG. 4A is a diagram illustrating an example of a question by the user, and FIG. 4B is a diagram showing an example of the answer synthesized by the speech synthesis device 10 for that question.
As shown in FIG. 4A, suppose the user W inputs to the speech synthesis device 10, which is a terminal device, the question "Will it be sunny tomorrow?", in other words, voice asking for confirmation that tomorrow's weather will be sunny. In this case, since the weather information acquired from the external server is "sunny", the primary answer created by the primary answer creation unit 110 is the affirmative Japanese "hai" ([ha-i]); the primary answer contains an interjection, and the number of syllables of the interjection is two, so, as shown in FIG. 4B, "hai" ([ha-i]) is repeated and "hai hai" ([ha-i-ha-i]) is synthesized. The user W therefore receives, in response to the question, not a mechanical answer but a natural feeling, as if interacting with a person. Moreover, because the interjection is repeated, the user W's sense of empathy is enhanced.

FIG. 5A is a diagram showing another example of a question by the user, and FIG. 5B is a diagram showing an example of the answer synthesized by the speech synthesis device 10 for that question.
In FIG. 5A, suppose the user W inputs to the speech synthesis device 10 the question "What will the weather be tomorrow?", in other words, a question asking for the specific content of tomorrow's weather. In this case, since the weather information obtained via the external server is "sunny", the primary answer creation unit 110 creates the Japanese "hare desu" ([ha-re-de-su]) as the primary answer; it contains no interjection, so, as shown in FIG. 5B, "hare desu" ([ha-re-de-su]) is synthesized as-is.
Note that the Japanese "hare desu" corresponds to the English "It will be sunny."

<Second Embodiment>
Next, a second embodiment will be described. In the following, elements that are the same as in the first embodiment reuse the reference numerals used in the description of the first embodiment, and detailed description of them is omitted as appropriate.

FIG. 6 is a block diagram showing the functional arrangement of the speech synthesis device 10 in the second embodiment. FIG. 6 differs from FIG. 2 in that the primary answer creation unit 110 creates the text of the primary answer to the question analyzed by the language analysis unit 108 in accordance with the mode set by a mode setting unit 130. In the second embodiment, the mode setting unit 130 outputs the mode set by the user, based on information that is output from the touch panel 14 (see FIG. 1) and processed by the CPU 11.

In the second embodiment, two modes can be set in the primary answer creation unit 110. The first mode creates a primary answer in which specific content is deliberately added after the interjection, even when an answer consisting of an interjection alone would suffice for the user's question. The second mode creates a primary answer consisting of the interjection alone when an interjection alone suffices for the user's question, and a primary answer consisting of specific content alone when an interjection would be an unsatisfactory answer to the user's question.

If, for example, the question is "Will it be sunny tomorrow?", then in the first mode the primary answer created by the primary answer creation unit 110 is, in the affirmative case, the Japanese "hai, hare desu" ([ha-i ha-re-de-su], "Yes, it will be sunny"). In other words, the primary answer creation unit 110 creates a primary answer by following the Japanese interjection "hai" ([ha-i]) with the specific content for the question, "hare desu" ([ha-re-de-su]).
In the second mode, for the same question, the primary answer created by the primary answer creation unit 110 is, in the affirmative case, the interjection alone: the Japanese "hai" ([ha-i]). Also in the second mode, if the question is "What will the weather be tomorrow?" and the weather information obtained via the external server is "sunny", the primary answer created by the primary answer creation unit 110 is the Japanese "hare desu" ([ha-re-de-su]), as in the first embodiment.
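The two modes can be sketched as a branch in the primary answer creation step. The mode constants and helper below are illustrative; the example strings are the ones used in this description.

```python
FIRST_MODE, SECOND_MODE = 1, 2

def create_primary_answer_with_mode(mode: int, interjection: str, content: str) -> str:
    """Sketch of the mode-dependent behavior of the primary answer
    creation unit 110 in the second embodiment."""
    if mode == FIRST_MODE:
        # Interjection deliberately followed by specific content.
        return f"{interjection}, {content}"        # e.g. "hai, hare desu"
    # Second mode: interjection alone when it suffices, otherwise
    # specific content alone.
    return interjection if interjection else content

assert create_primary_answer_with_mode(FIRST_MODE, "hai", "hare desu") == "hai, hare desu"
assert create_primary_answer_with_mode(SECOND_MODE, "hai", "") == "hai"
assert create_primary_answer_with_mode(SECOND_MODE, "", "hare desu") == "hare desu"
```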

Next, the operation of the second embodiment will be described.
FIG. 7 is a flowchart showing the processing operation of the speech synthesis device 10.
FIG. 7 differs from FIG. 3 in that, in a step Sa10 preceding step Sa11, the primary answer creation unit 110 acquires the set mode, and in that, in step Sa13, the primary answer creation unit 110 creates the text of the primary answer to the meaning of the voice analyzed by the language analysis unit 108 in accordance with the set mode.
In the second embodiment, in step Sa14, the analysis unit 112 analyzes the primary answer created in the first mode or the second mode; the point that the analysis unit 112 analyzes whether the primary answer created by the primary answer creation unit 110 contains an interjection and, if so, whether the number of syllables of the interjection is two or less is the same as in the first embodiment.

FIG. 8 is a diagram showing an example of an answer synthesized by the speech synthesis device 10 according to the second embodiment. In this example, as in FIG. 4A, it is assumed that the user W inputs the question "Will it be sunny tomorrow?".
In this case, if the weather information obtained via the external server is "sunny" and the first mode is set, the Japanese "hai, hare desu" ([ha-i ha-re-de-su]) is created as the primary answer, as described above. In the second embodiment, the interjection "hai" ([ha-i]) contained in the primary answer is repeated a plurality of times (here, twice), and the specific content for the question follows the repeated interjection, so that "hai hai, hare desu" ([ha-i-ha-i ha-re-de-su]) is synthesized.
According to the second embodiment, in response to the user W's question, the interjection of two or fewer syllables is repeated, and then specific content confirming the question is added and synthesized as the answer, so the user W's sense of empathy can be further enhanced.

In the second embodiment, if the weather information acquired via the external server is "sunny" and the second mode is set, only the interjection is created, for example the Japanese "hai" ([ha-i]). Accordingly, the interjection "hai" ([ha-i]) is repeated a plurality of times (here, twice), and "hai hai" ([ha-i-ha-i]) is synthesized, as shown in FIG. 4B.

<Application Examples and Modifications>
The present invention is not limited to the embodiments described above, and various applications and modifications such as those described below are possible. One or more of the application and modification aspects described below may be arbitrarily selected and appropriately combined.

<Object of Repetition>
In the embodiments, when the primary answer contains an interjection of two or fewer syllables, the interjection is repeated twice or more and synthesized. As described above, however, simple words of two or fewer syllables tend to be repeated even when they are not interjections, and even interjections of three or more syllables may tend to be repeated.
Therefore, the analysis unit 112 may analyze whether the primary answer includes an object of repetition (a word) such as the following, and, if it is analyzed that such an object exists, supply the primary answer to the repetition unit 114.
As a first object of repetition, as in the embodiments, an interjection of two or fewer syllables can be cited; as a second object, not limited to interjections, a simple word of two or fewer syllables; and, as a third object, an interjection of three or more syllables. The first object is most preferable; the second and third objects are alternatives to the first.
Simple words of two or fewer syllables may be difficult to analyze as such, but the words that are simple and liable to be repeated, including interjections of three or more syllables, are considered limited in number. Therefore, the analysis unit 112 may, for example, analyze whether objects registered in advance are included in the primary answer.
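Such pre-registered objects could be held as word lists per object class, checked in order of preference. The registries below are illustrative assumptions seeded from this description.

```python
# Pre-registered objects of repetition (illustrative word lists).
FIRST_OBJECTS  = {"hai", "no"}       # interjections of <= 2 syllables (preferred)
SECOND_OBJECTS = {"ii"}              # simple non-interjection words of <= 2 syllables
THIRD_OBJECTS  = {"naruhodo"}        # interjections of >= 3 syllables

def find_repetition_object(primary_answer: str):
    """Return the first registered object found in the primary answer,
    checking the first object class, then the second, then the third."""
    words = primary_answer.lower().split()
    for registry in (FIRST_OBJECTS, SECOND_OBJECTS, THIRD_OBJECTS):
        for word in words:
            if word in registry:
                return word
    return None

assert find_repetition_object("hai hare desu") == "hai"
assert find_repetition_object("hare desu") is None
```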

<Voice Input Unit etc.>
In the embodiments, the voice input unit 102 is configured to convert the user's voice, input through a microphone, into a voice signal, but it is not limited to this configuration; it may be configured to input a voice signal processed by another processing unit, or a voice signal supplied (or forwarded) from another device. That is, the voice input unit 102 may have any configuration that acquires voice in some form.
In the embodiments, the primary answer to the semantic content of the voice is created as text, but it may be created as data other than text, for example as voice waveform data. When voice waveform data is used as the primary answer, whether an object of repetition is included may be analyzed by processing the voice waveform data. In that case, it is preferable to use voice waveform data for the secondary answer as well.
Advantages of creating the primary answer as text, as in the embodiments, include improved accuracy in analyzing the object of repetition and, since the secondary answer is also text, convenience at the time of speech synthesis.
In addition, the primary answer creation unit 110 may acquire the primary answer to the input voice directly from an external server, instead of creating it by referring to the fixed phrases of the answer database 124. That is, the primary answer creation unit 110 need only obtain the primary answer to the input voice in some form.

<Prohibition Mode>
In the embodiments, an interjection of two or fewer syllables is repeated twice; for example, "hai" ([ha-i]) is repeated and output by speech synthesis as "hai hai" ([ha-i-ha-i]). This amounts to a so-called double reply, which in some circumstances can give the user an unpleasant feeling.
Therefore, an operation mode that prohibits repetition of the interjection (a prohibition mode) is provided, and when the prohibition mode is set, the repetition function for the interjection is canceled. As configurations for canceling, when the prohibition mode is set, the repetition unit 114 may be configured to suppress its repetition function, or the analysis unit 112 may supply the primary answer directly to the voice sequence creation unit 116 instead of to the repetition unit 114, even when the number of syllables of the interjection contained in the primary answer is two or less. In either case, the interjection of two or fewer syllables is configured not to be repeated.
Thus, even when the primary answer created by the primary answer creation unit 110 is "hai" ([ha-i]), an interjection of two or fewer syllables, it is synthesized without repetition, as the primary answer "hai" ([ha-i]), as shown in FIG. 9.
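A sketch of the prohibition-mode gating, following the first of the two canceling configurations above (the repetition unit suppresses its own function); the names are illustrative.

```python
def repetition_unit(interjection: str, prohibition_mode: bool, times: int = 2) -> str:
    """Repetition unit 114 with the prohibition mode taken into account:
    when the prohibition mode is set, the repetition function is canceled
    and the interjection passes through unchanged."""
    if prohibition_mode:
        return interjection                       # "hai" stays "hai" (FIG. 9)
    return " ".join([interjection] * times)       # "hai" -> "hai hai"

assert repetition_unit("hai", prohibition_mode=False) == "hai hai"
assert repetition_unit("hai", prohibition_mode=True) == "hai"
```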

Note that the prohibition mode, like the first and second modes described above, may be set manually by the user, or may be set automatically by the device according to the result of analyzing the content of the input voice, its volume, the content of the answer, the conversation history (situation), and the like.

<Voice and Answer>
In the embodiments, the answer is synthesized as a human voice, but besides human voice it may be synthesized as an animal sound. In other words, the object of repetition is not limited to human voice and is a concept that includes animal sounds.
For example, if the user inputs to the speech synthesis device 10 voice meaning "Nice weather, isn't it?" in any language, the speech synthesis device 10 may repeat a cat's "nya" ("Meow" in English) and synthesize, for example, "nya nya" ("Meow, meow" in English).
When an animal sound is synthesized and output, the user cannot obtain the information the user wants. In other words, even if the user asks "What will the weather be tomorrow?", the user cannot get tomorrow's weather information. However, when the user asks a question in some form and a cry is returned in reaction to it, the user can feel as though an imaginary animal emitting the cry has a will of its own, and a kind of healing effect can be expected.
The speech synthesis device 10 that synthesizes animal sounds is not limited to a terminal device, and may be applied to a pet robot imitating an animal or to a stuffed toy.

<Others>
In the embodiments, the number of repetitions is two, but it may be three or more.
Further, the CPU 11 may be configured to set the number of repetitions by determining it appropriately based on the content of the input voice, its volume, the conversation history (situation) with the respondent, and the like.
In the embodiments, the configuration for obtaining the primary answer to the question, namely the language analysis unit 108, the language database 122, and the answer database 124, is provided on the speech synthesis device 10 side; however, in consideration of points such as the processing load becoming heavy in a terminal device and storage capacity being limited, it may be provided on the external server side. That is, the speech synthesis device 10 need only be configured to obtain, in some form, the primary answer to the question; whether the primary answer is created on the speech synthesis device 10 side or on the side of a configuration other than the speech synthesis device 10 (for example, an external server) does not matter.
If the speech synthesis device 10 can create the answer to the voice in its intended applications without accessing an external server or the like, the communication unit 126 is unnecessary.

The speech synthesis device 10 according to the embodiments may be realized, besides as a terminal device, by dedicated electronic circuits or by a general-purpose personal computer. When implemented as a personal computer, it is realized by connecting a microphone and a speaker and executing a pre-installed application program. In this case, the application program installed in the personal computer may be downloaded via the Internet, as with the terminal device, or may be provided stored in a computer-readable recording medium. The recording medium is, for example, a non-transitory recording medium; an optical recording medium such as a CD-ROM (optical disc) is a good example, but it may be any medium such as a semiconductor recording medium or a magnetic recording medium.
The speech synthesis device according to the embodiments can also be conceived as a speech synthesis method for synthesizing voice.

10 ... speech synthesis device, 22 ... acquisition unit, 24 ... voice synthesis unit, 102 ... voice input unit, 108 ... language analysis unit, 110 ... primary answer creation unit, 112 ... analysis unit, 114 ... repetition unit, 116 ... voice sequence creation unit, 118 ... synthesis unit, 126 ... communication unit.

Claims (11)

  1. A speech synthesis device comprising:
    a voice input unit that inputs voice;
    an acquisition unit that acquires a primary answer to the voice input by the voice input unit;
    an analysis unit that analyzes whether the primary answer includes an object of repetition; and
    a voice synthesis unit that, when the primary answer is analyzed to include the object of repetition, synthesizes and outputs a secondary answer in which the object is repeated two or more times.
  2. The speech synthesis device according to claim 1, wherein the object of repetition is an interjection whose number of syllables is two or less, and
    the analysis unit determines whether the primary answer contains an interjection and, when it is determined that an interjection is contained, analyzes whether the number of syllables of the interjection is two or less.
  3. The speech synthesis device according to claim 2, wherein the acquisition unit comprises:
    a language analysis unit that analyzes the meaning of the voice input by the voice input unit; and
    a primary answer creation unit that creates a primary answer to the meaning analyzed by the language analysis unit.
  4. The speech synthesis device according to claim 2 or 3, further comprising a repetition unit that, when the number of syllables of the interjection contained in the primary answer is analyzed to be two or less, outputs the interjection repeated two or more times.
  5. The speech synthesis device according to claim 4, wherein the voice synthesis unit:
    synthesizes, as the secondary answer, the interjection repeated by the repetition unit when the number of syllables of the interjection contained in the primary answer is two or less; and
    synthesizes the primary answer as the secondary answer when the number of syllables of the interjection contained in the primary answer is three or more.
  6. The speech synthesis device according to claim 5, wherein the voice synthesis unit comprises:
    a voice sequence creation unit that creates a voice sequence from the primary answer; and
    a synthesis unit that synthesizes and outputs a voice signal based on the voice sequence.
  7. The speech synthesis device according to claim 3, having a first mode and a second mode, wherein the primary answer creation unit:
    creates, in the first mode, a primary answer in which specific content is added to an interjection; and
    creates, in the second mode, a primary answer consisting of an interjection alone or a primary answer consisting of specific content alone.
  8. The speech synthesis device according to claim 5, 6 or 7, having a prohibition mode, wherein, in the prohibition mode, the voice synthesis unit synthesizes the interjection whose number of syllables is two or less without repeating it.
  9. The speech synthesis device according to claim 8, wherein, in the prohibition mode, the repetition unit cancels the repetition function for the interjection whose number of syllables is two or less.
  10. A speech synthesis method comprising:
    inputting voice;
    acquiring a primary answer to the input voice;
    analyzing whether the primary answer includes an object of repetition; and
    when the primary answer is analyzed to include the object of repetition, synthesizing and outputting a secondary answer in which the object is repeated two or more times.
  11. A program causing a computer to function as:
    a voice input unit that inputs voice;
    an acquisition unit that acquires a primary answer to the voice input by the voice input unit;
    an analysis unit that analyzes whether the primary answer includes an object of repetition; and
    a voice synthesis unit that, when the primary answer is analyzed to include the object of repetition, synthesizes and outputs a secondary answer in which the object is repeated two or more times.
PCT/JP2015/069126 2014-07-02 2015-07-02 Voice synthesis device, voice synthesis method, and program WO2016002879A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2014-136812 2014-07-02
JP2014136812 2014-07-02

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
EP15814984.9A EP3166104A4 (en) 2014-07-02 2015-07-02 Voice synthesis device, voice synthesis method, and program
JP2016531443A JP6428774B2 (en) 2014-07-02 2015-07-02 Voice control system, voice control method and program
CN201580035951.5A CN106471569A (en) 2014-07-02 2015-07-02 Voice synthesis device, voice synthesis method, and program
US15/316,850 US10224021B2 (en) 2014-07-02 2015-07-02 Method, apparatus and program capable of outputting response perceivable to a user as natural-sounding

Publications (1)

Publication Number Publication Date
WO2016002879A1 true WO2016002879A1 (en) 2016-01-07

Family

ID=55019406

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2015/069126 WO2016002879A1 (en) 2014-07-02 2015-07-02 Voice synthesis device, voice synthesis method, and program

Country Status (5)

Country Link
US (1) US10224021B2 (en)
EP (1) EP3166104A4 (en)
JP (2) JP6428774B2 (en)
CN (1) CN106471569A (en)
WO (1) WO2016002879A1 (en)

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE69232407T2 (en) 1991-11-18 2002-09-12 Toshiba Kawasaki Kk Speech dialogue system to facilitate computer-human interaction
JPH05216618A (en) * 1991-11-18 1993-08-27 Toshiba Corp Voice interactive system
DE19861167A1 * 1998-08-19 2000-06-15 Christoph Buskies Method and apparatus for co-articulation-appropriate concatenation of audio segments, and means for providing co-articulation-appropriately concatenated audio data
SE517026C2 (en) 2000-11-17 2002-04-02 Forskarpatent I Syd Ab Method and apparatus for speech analysis
GB0113581D0 (en) * 2001-06-04 2001-07-25 Hewlett Packard Co Speech synthesis apparatus
JP2003271194A (en) 2002-03-14 2003-09-25 Canon Inc Voice interaction device and controlling method thereof
JP4038211B2 (en) * 2003-01-20 2008-01-23 富士通株式会社 Speech synthesizer, speech synthesis method and speech synthesis system
US20050154594A1 (en) * 2004-01-09 2005-07-14 Beck Stephen C. Method and apparatus of simulating and stimulating human speech and teaching humans how to talk
JP2006039120A (en) * 2004-07-26 2006-02-09 Sony Corp Interactive device and interactive method, program and recording medium
JP2006157538A (en) * 2004-11-30 2006-06-15 Sony Corp Telephone system and voice outputting method of telephone system
JP4832097B2 (en) 2006-02-13 2011-12-07 富士通テン株式会社 Voice dialogue system
KR100764174B1 (en) * 2006-03-03 2007-10-08 삼성전자주식회사 Apparatus for providing voice dialogue service and method for operating the apparatus
CN100501782C (en) * 2006-09-30 2009-06-17 山东建筑大学 Intelligent voice warning system
US8930192B1 (en) * 2010-07-27 2015-01-06 Colvard Learning Systems, Llc Computer-based grapheme-to-speech conversion using a pointing device
CN102324231A (en) * 2011-08-29 2012-01-18 北京捷通华声语音技术有限公司 Game dialogue voice synthesizing method and system
US9064492B2 (en) * 2012-07-09 2015-06-23 Nuance Communications, Inc. Detecting potential significant errors in speech recognition results
US9799328B2 (en) * 2012-08-03 2017-10-24 Veveo, Inc. Method for using pauses detected in speech input to assist in interpreting the input during conversational interaction for information retrieval
JP5821824B2 * 2012-11-14 2015-11-24 Yamaha Corporation Speech synthesis device
KR101709187B1 (en) * 2012-11-14 2017-02-23 한국전자통신연구원 Spoken Dialog Management System Based on Dual Dialog Management using Hierarchical Dialog Task Library
US9292488B2 (en) * 2014-02-01 2016-03-22 Soundhound, Inc. Method for embedding voice mail in a spoken utterance using a natural language processing computer system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002311981A (en) * 2001-04-17 2002-10-25 Sony Corp Natural language processing system and natural language processing method as well as program and recording medium
WO2002097794A1 (en) * 2001-05-25 2002-12-05 Rhetorical Group Plc Speech synthesis
JP2004110673A (en) * 2002-09-20 2004-04-08 Nippon Telegr & Teleph Corp <Ntt> Text style conversion method, text style conversion device, text style conversion program, and storage medium storing the text style conversion program
JP2006528050A * 2003-05-15 2006-12-14 Zi Corporation Of Canada, Inc. Text input in a video game
JP2010175717A (en) * 2009-01-28 2010-08-12 Mitsubishi Electric Corp Speech synthesizer

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3166104A4 *

Also Published As

Publication number Publication date
JPWO2016002879A1 (en) 2017-04-27
CN106471569A (en) 2017-03-01
JP6428774B2 (en) 2018-11-28
EP3166104A4 (en) 2018-03-07
US10224021B2 (en) 2019-03-05
US20170116978A1 (en) 2017-04-27
JP2019045867A (en) 2019-03-22
EP3166104A1 (en) 2017-05-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15814984

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2016531443

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 15316850

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

REEP Request for entry into the European phase

Ref document number: 2015814984

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2015814984

Country of ref document: EP