WO2010004978A1 - Speech synthesis model generation device, speech synthesis model generation system, communication terminal, and speech synthesis model generation method - Google Patents
Speech synthesis model generation device, speech synthesis model generation system, communication terminal, and speech synthesis model generation method
- Publication number
- WO2010004978A1 (PCT/JP2009/062341)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- synthesis model
- speech synthesis
- voice
- speech
- image information
- Prior art date
Classifications
- G — PHYSICS
- G10 — MUSICAL INSTRUMENTS; ACOUSTICS
- G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00 — Speech synthesis; Text to speech systems
- G10L13/02 — Methods for producing synthetic speech; Speech synthesisers
- G10L13/04 — Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/06 — Elementary speech units used in speech synthesisers; Concatenation rules
Definitions
- the present invention relates to a speech synthesis model generation device, a speech synthesis model generation system, a communication terminal, and a speech synthesis model generation method.
- the speech synthesis model is information used to create speech data corresponding to input text (character string).
- As a speech synthesis method using a speech synthesis model, there is, for example, the method described in Patent Document 1, in which an input character string (text) is analyzed and speech data corresponding to the text is created by referring to the speech synthesis model.
- The present invention has been made to solve the above-described problems, and its object is to provide a speech synthesis model generation device, a speech synthesis model generation system, a communication terminal, and a speech synthesis model generation method that can suitably acquire a user's speech.
- A speech synthesis model generation device according to the present invention comprises: learning information acquisition means for acquiring a feature amount of a user's speech and text data corresponding to the speech; speech synthesis model generation means for generating a speech synthesis model by performing learning based on the feature amount and the text data acquired by the learning information acquisition means; parameter generation means for generating a parameter indicating a learning degree of the speech synthesis model generated by the speech synthesis model generation means; image information generation means for generating, according to the parameter generated by the parameter generation means, image information for displaying an image to the user; and image information output means for outputting the image information generated by the image information generation means.
- In this device, a speech synthesis model is generated based on the feature amount of the speech and the text data, and a parameter indicating the learning degree of the speech synthesis model is generated. Then, image information for displaying an image to the user is generated according to the parameter, and the image information is output.
- Since the user who inputs the speech can thus recognize the learning degree of the speech synthesis model as a visualized image, the user can obtain a sense of achievement with respect to the speech input, and the motivation of the user to input speech increases. As a result, the user's speech can be suitably acquired.
- Preferably, the device further comprises request information generation means for generating and outputting, based on the parameter generated by the parameter generation means, request information for causing the user to input speech in order to acquire the feature amount.
- Preferably, the apparatus further includes word extraction means for extracting words from the text data acquired by the learning information acquisition means, and the parameter generation means generates the parameter indicating the learning degree of the speech synthesis model according to the cumulative number of words extracted by the word extraction means.
- With this configuration, the parameter is generated according to the cumulative number of words, so the user can recognize that the number of words has increased by looking at the image information generated according to the parameter, and can thereby obtain a greater sense of accomplishment with respect to the speech input. As a result, the user's speech can be acquired more suitably.
- the image information is preferably information for displaying a character image.
- With this configuration, the character image shown to the user changes, for example grows, according to the parameter, which is more visually appealing than displaying, for example, a numerical value as the image. The user's sense of achievement is thereby heightened, and the motivation of the user to input speech is further improved. As a result, the user's speech can be acquired more suitably.
- Preferably, the speech synthesis model generation means generates a speech synthesis model for each user. With this configuration, a speech synthesis model corresponding to each user can be generated, and each individual can use his or her own speech synthesis model.
- Preferably, the feature amount of the speech is context data obtained by labeling the speech in speech units, and data relating to a speech waveform indicating features of the speech. With this configuration, a speech synthesis model can be generated reliably.
- A speech synthesis model generation system according to the present invention comprises a communication terminal having a communication function and a speech synthesis model generation device capable of communicating with the communication terminal. The communication terminal comprises: speech input means for inputting a user's speech; learning information transmission means for transmitting, to the speech synthesis model generation device, speech information consisting of the speech input by the speech input means or a feature amount of that speech, together with text data corresponding to the speech; image information reception means for receiving, from the speech synthesis model generation device, image information for displaying an image to the user in response to the speech information and the text data having been transmitted; and display means for displaying the received image information. The speech synthesis model generation device comprises: learning information acquisition means for acquiring the feature amount of the speech by receiving the speech information transmitted from the communication terminal, and for acquiring the text data transmitted from the communication terminal; speech synthesis model generation means for generating a speech synthesis model by performing learning based on the feature amount and the text data acquired by the learning information acquisition means; parameter generation means for generating a parameter indicating the learning degree of the speech synthesis model generated by the speech synthesis model generation means; image information generation means for generating the image information according to the parameter generated by the parameter generation means; and image information output means for transmitting the image information generated by the image information generation means to the communication terminal.
- In this system, when speech is acquired by the communication terminal and speech information consisting of the speech or its feature amount, together with text data corresponding to the speech, is received by the speech synthesis model generation device, a speech synthesis model is generated based on the feature amount and the text data. Then a parameter indicating the learning degree of the speech synthesis model is generated, image information for displaying an image to the user is generated according to the parameter, and the image information is transmitted from the speech synthesis model generation device to the communication terminal.
- Since the user can thus recognize the learning degree of the speech synthesis model as a visualized image, the user can obtain a sense of accomplishment with respect to the speech input, and the motivation of the user to input speech improves. As a result, the user's speech can be suitably acquired.
- Moreover, since the speech is acquired by the communication terminal, it can be acquired easily without requiring facilities such as a studio.
- Preferably, the communication terminal further includes feature amount extraction means for extracting the feature amount of the speech from the speech input by the speech input means.
- Speech transmitted from the communication terminal may be degraded by the codec or the communication channel, and if a speech synthesis model is generated from such speech, the quality of the speech synthesis model may suffer. With the above configuration, the feature amount necessary for generating the speech synthesis model is extracted at the communication terminal and only the feature amount is sent, so a highly accurate speech synthesis model can be generated.
- Preferably, the communication terminal further includes text data acquisition means for acquiring, from the speech input by the speech input means, text data corresponding to the speech.
- The present invention can be described as an invention of a speech synthesis model generation system as above, and can also be described as an invention of a communication terminal included in the speech synthesis model generation system, as follows. The communication terminal is likewise a novel configuration and likewise corresponds to the present invention, exhibiting the same operations and effects as the speech synthesis model generation system.
- The communication terminal according to the present invention is a communication terminal having a communication function, comprising: speech input means for inputting a user's speech; feature amount extraction means for extracting a feature amount of the speech from the speech input by the speech input means; text data acquisition means for acquiring text data corresponding to the speech; learning information transmission means for transmitting the feature amount extracted by the feature amount extraction means and the text data acquired by the text data acquisition means to a speech synthesis model generation device capable of communicating with the communication terminal; image information reception means for receiving, from the speech synthesis model generation device, image information for displaying an image to the user in response to the feature amount and the text data having been transmitted by the learning information transmission means; and display means for displaying the image information received by the image information reception means.
- The present invention can be described as inventions of a speech synthesis model generation device, a speech synthesis model generation system, and a communication terminal as above, and can also be described as inventions of speech synthesis model generation methods as follows. These are substantially the same inventions in different categories, and they have the same operations and effects.
- The speech synthesis model generation method according to the present invention comprises: a learning information acquisition step of acquiring a feature amount of a user's speech and text data corresponding to the speech; a speech synthesis model generation step of generating a speech synthesis model by performing learning based on the feature amount and the text data acquired in the learning information acquisition step; a parameter generation step of generating a parameter indicating a learning degree of the speech synthesis model generated in the speech synthesis model generation step; an image information generation step of generating, according to the parameter generated in the parameter generation step, image information for displaying an image to the user; and an image information output step of outputting the image information generated in the image information generation step.
- The speech synthesis model generation method according to the present invention is a method performed by a speech synthesis model generation system comprising a communication terminal having a communication function and a speech synthesis model generation device capable of communicating with the communication terminal. The communication terminal performs: a speech input step of inputting a user's speech; a learning information transmission step of transmitting, to the speech synthesis model generation device, speech information consisting of the speech input in the speech input step or a feature amount of that speech, together with text data corresponding to the speech; an image information reception step of receiving, from the speech synthesis model generation device, image information for displaying an image to the user; and a display step of displaying the image information received in the image information reception step. The speech synthesis model generation device performs: a learning information acquisition step of acquiring the feature amount of the speech by receiving the speech information transmitted from the communication terminal, and of acquiring the text data transmitted from the communication terminal; a speech synthesis model generation step of generating a speech synthesis model by performing learning based on the feature amount and the text data acquired in the learning information acquisition step; a parameter generation step of generating a parameter indicating the learning degree of the speech synthesis model generated in the speech synthesis model generation step; an image information generation step of generating the image information according to the parameter; and an image information output step of transmitting the image information to the communication terminal.
- The speech synthesis model generation method according to the present invention is a method performed by a communication terminal having a communication function, comprising: a speech input step of inputting a user's speech; a feature amount extraction step of extracting a feature amount of the speech from the speech input in the speech input step; a text data acquisition step of acquiring text data corresponding to the speech; a learning information transmission step of transmitting the feature amount extracted in the feature amount extraction step and the text data acquired in the text data acquisition step to a speech synthesis model generation device capable of communicating with the communication terminal; an image information reception step of receiving, from the speech synthesis model generation device, image information for displaying an image to the user in response to the feature amount and the text data having been transmitted in the learning information transmission step; and a display step of displaying the image information received in the image information reception step.
- According to the present invention, the learning degree of the speech synthesis model generated from the speech input by the user can be recognized as a visualized image, so the decline in motivation that results from merely inputting speech over a long period can be prevented, and the user's speech can be suitably acquired.
- FIG. 1 is a diagram showing the configuration of a speech synthesis model generation system according to an embodiment of the present invention.
- FIG. 1 shows the configuration of a speech synthesis model generation system according to an embodiment of the present invention.
- the speech synthesis model generation system 1 includes a mobile communication terminal (communication terminal) 2 and a speech synthesis model generation device 3.
- the mobile communication terminal 2 and the speech synthesis model generation device 3 can transmit / receive information to / from each other through mobile communication.
- The speech synthesis model generation system 1 usually includes a large number of mobile communication terminals 2.
- the speech synthesis model generation device 3 may be configured by a single device or a plurality of devices.
- the speech synthesis model generation system 1 is a system that can generate a speech synthesis model for the user of the mobile communication terminal 2.
- the speech synthesis model is information used to create user speech data corresponding to input text.
- The speech data synthesized using the speech synthesis model can be used, for example, for reading mail aloud on the mobile communication terminal 2, for playing back messages in the user's absence, and on blogs or the Web.
- The mobile communication terminal 2 is, for example, a mobile phone: a communication terminal that performs wireless communication with the base station covering the wireless area in which it is located and receives a call service or a packet communication service in accordance with user operations. The mobile communication terminal 2 can also run an application using the packet communication service, and the application is updated by data transmitted from the speech synthesis model generation device 3.
- the application management may be performed by a device provided separately from the speech synthesis model generation device 3.
- The application in the present embodiment is, for example, a character-raising game that performs screen display and accepts commands input by the user's voice. More specifically, the character displayed by the application is nurtured (its appearance changes) through the user's voice input.
- the speech synthesis model generation device 3 is a device that generates a speech synthesis model based on information related to the user's speech transmitted from the mobile communication terminal 2.
- the speech synthesis model generation device 3 is in a mobile communication network and is managed by a service provider that provides a speech synthesis model generation service.
- FIG. 2 is a diagram illustrating a hardware configuration of the mobile communication terminal 2.
- As shown in FIG. 2, the mobile communication terminal 2 includes a CPU (Central Processing Unit) 21, a RAM (Random Access Memory) 22, a ROM (Read Only Memory) 23, an operation unit 24, a microphone 25, a wireless communication unit 26, a display 27, a speaker 28, an antenna 29, and the like.
- FIG. 3 is a diagram illustrating a hardware configuration of the speech synthesis model generation device 3.
- As shown in FIG. 3, the speech synthesis model generation device 3 is a computer that includes hardware such as a CPU 31, a RAM 32 and a ROM 33 as main storage devices, a communication module 34 as a data transmission/reception device such as a network card, an auxiliary storage device 35 such as a hard disk, an input device 36 such as a keyboard for inputting information to the speech synthesis model generation device 3, and an output device 37 such as a monitor for outputting information. When these components operate, the functions of the speech synthesis model generation device 3 described later are exhibited.
- The mobile communication terminal 2 includes a voice input unit 200, a feature amount extraction unit 201, a text data acquisition unit 202, a learning information transmission unit 203, a reception unit 204, a display unit 205, a speech synthesis model holding unit 206, and a speech synthesis unit 207.
- The voice input unit 200 is realized by the microphone 25 and serves as voice input means for inputting the user's voice.
- the voice input unit 200 inputs a user's voice as a command input to the above-described application, for example.
- the voice input unit 200 passes the input voice through a filter to remove noise, and outputs the voice input from the user to the feature amount extraction unit 201 and the text data acquisition unit 202 as voice data.
- the feature quantity extraction unit 201 extracts a voice feature quantity from the voice data received from the voice input unit 200.
- The voice feature amount is a numerical representation of vocal characteristics such as pitch, speed, and accent; specifically, it consists of context data obtained by labeling the speech in speech units, and data relating to the speech waveform that indicates the features of the speech.
- the context data is a context label (phoneme string) obtained by dividing (labeling) speech data into speech units such as phonemes.
- the speech unit is a speech unit such as “phonemes”, “words”, and “sentences” divided according to a predetermined rule.
- Factors of the context label include, for example, the preceding, current, and succeeding phonemes; the mora position of the phoneme within the accent phrase; and the part of speech, conjugated form, and conjugation type of the preceding, current, and succeeding words.
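- As an illustration only (the patent prescribes no implementation), the following Python sketch shows the general shape of such context labeling; the minimal triphone form and the HTS-style notation are assumptions, and a real label would carry the additional factors listed above.

```python
# Toy sketch of context labeling: each phoneme of the recognized text
# becomes a label carrying its neighbors.  Real HMM-TTS labels carry
# many more factors (mora position, accent phrase, part of speech).
def context_labels(phonemes):
    labels = []
    for i, ph in enumerate(phonemes):
        prev_ph = phonemes[i - 1] if i > 0 else "sil"
        next_ph = phonemes[i + 1] if i < len(phonemes) - 1 else "sil"
        labels.append(f"{prev_ph}-{ph}+{next_ph}")   # HTS-style triphone
    return labels

# e.g. context_labels(["k", "o", "N", "n", "i", "ch", "i", "w", "a"])
```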
- the data relating to the speech waveform is the logarithmic fundamental frequency and the mel cepstrum.
- The logarithmic fundamental frequency represents the pitch of the speech and is obtained by extracting fundamental frequency parameters from the speech data. The mel cepstrum expresses the voice quality of the speech and is obtained by performing mel cepstrum analysis on the speech data.
- the feature amount extraction unit 201 outputs the extracted feature amount to the learning information transmission unit 203.
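- A minimal sketch of the kind of processing the feature amount extraction unit 201 performs is given below; the patent names the features (logarithmic fundamental frequency and mel cepstrum) but no algorithm or library, so the use of pyworld, pysptk, and soundfile here is an assumption about one conventional toolchain.

```python
import numpy as np
import pyworld    # assumed: WORLD vocoder analysis
import pysptk     # assumed: mel cepstrum conversion
import soundfile as sf

def extract_features(wav_path, order=24, alpha=0.42):
    x, fs = sf.read(wav_path)
    x = np.ascontiguousarray(x, dtype=np.float64)
    # Fundamental frequency contour (pitch), then its logarithm.
    f0, t = pyworld.dio(x, fs)
    f0 = pyworld.stonemask(x, f0, t, fs)       # refine the F0 estimate
    log_f0 = np.log(np.maximum(f0, 1e-10))     # floor unvoiced frames
    # Spectral envelope, then mel cepstrum (voice quality).
    sp = pyworld.cheaptrick(x, f0, t, fs)
    mcep = pysptk.sp2mc(sp, order=order, alpha=alpha)
    return log_f0, mcep
```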
- the text data acquisition unit 202 is a text data acquisition unit that acquires text data corresponding to speech from the speech data received from the speech input unit 200.
- the text data acquisition unit 202 analyzes the input speech data (speech recognition), thereby acquiring text data (character string) whose content matches the speech input by the user.
- the text data acquisition unit 202 outputs the acquired text data to the learning information transmission unit 203.
- the text data may be acquired from the feature amount of the voice extracted by the feature amount extraction unit 201.
- the learning information transmission unit 203 is a learning information transmission unit that transmits the feature amount received from the feature amount extraction unit 201 and the text data received from the text data acquisition unit 202 to the speech synthesis model generation device 3.
- the learning information transmission unit 203 transmits the feature amount and text data to the speech synthesis model generation device 3 using XML over HTTP, SIP, or the like.
- user authentication using, for example, SIP or IMS is performed between the mobile communication terminal 2 and the speech synthesis model generation device 3.
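- The transport is specified only as "XML over HTTP, SIP, or the like", so the following sketch of the learning information transmission unit 203 is illustrative; the endpoint URL, the XML element names, and the use of the requests library are assumptions.

```python
import requests
import xml.etree.ElementTree as ET

def send_learning_info(log_f0, mcep, text, user_id,
                       url="https://model-server.example/learn"):  # hypothetical endpoint
    # Serialize the feature amount and text data as a small XML document.
    root = ET.Element("learningInfo", {"userId": user_id})
    ET.SubElement(root, "text").text = text
    ET.SubElement(root, "logF0").text = " ".join(f"{v:.5f}" for v in log_f0)
    ET.SubElement(root, "melCepstrum").text = "\n".join(
        " ".join(f"{v:.5f}" for v in frame) for frame in mcep)
    resp = requests.post(url, data=ET.tostring(root),
                         headers={"Content-Type": "application/xml"})
    resp.raise_for_status()
    return resp.content   # e.g. image information / request information
```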
- The reception unit 204 is reception means (image information reception means) that receives image information, request information, and a speech synthesis model from the speech synthesis model generation device 3 in response to the feature amount and the text data having been transmitted to the speech synthesis model generation device 3 by the learning information transmission unit 203.
- The image information is information for displaying an image to the user on the display 27.
- The request information is, for example, information that prompts the user to input speech, such as a sentence or word to be uttered, and an image (text) corresponding to the request information is displayed on the display 27.
- Image information and request information are used and output by the above-described application.
- audio data corresponding to the request information may be output from the speaker 28.
- the receiving unit 204 outputs the received image information and request information to the display unit 205 and outputs a speech synthesis model to the speech synthesis model holding unit 206.
- the display unit 205 is a display unit that displays image information and request information received from the receiving unit 204.
- the display unit 205 displays image information and request information on the display 27 of the mobile communication terminal 2 when the application is activated.
- FIG. 4 is a diagram illustrating an example in which image information and request information are displayed on the display 27. As shown in the figure, the image information is displayed in the upper part of the display 27 as an image of a character C, and the request information is displayed as, for example, three selection items S1 to S3 in a message requesting the user to input speech. The user utters any one of the selection items S1 to S3 displayed on the display 27, and the uttered speech is input by the voice input unit 200.
- the speech synthesis model holding unit 206 holds the speech synthesis model received from the receiving unit 204.
- When the speech synthesis model holding unit 206 receives information related to the speech synthesis model from the reception unit 204, it updates the existing speech synthesis model.
- the voice synthesis unit 207 synthesizes voice data with reference to the voice synthesis model held in the voice synthesis model holding unit 206.
- a conventionally known method is used as a method of synthesizing the voice data.
- Specifically, when text (a character string) is input by the operation unit 24 (keypad) of the mobile communication terminal 2 and a synthesis instruction is given by the user, the speech synthesis unit 207 refers to the speech synthesis model held in the speech synthesis model holding unit 206, stochastically predicts the acoustic features (logarithmic fundamental frequency and mel cepstrum) corresponding to the phoneme sequence (context labels) of the input text, and generates speech data corresponding to the input text by synthesis. The speech synthesis unit 207 outputs the synthesized speech data to, for example, the speaker 28. Note that the speech data generated by the speech synthesis unit 207 is also used by the application.
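- As a rough illustration of the lookup performed by the speech synthesis unit 207, the sketch below draws an acoustic feature trajectory from a per-unit HMM (as trained in the sketch given later for the generation unit 301) for each context label of the input text; hmmlearn's sample() is an assumed stand-in for proper maximum-likelihood parameter generation, and the vocoder step that turns features into a waveform is omitted.

```python
import numpy as np

def predict_features(context_labels, models, frames_per_unit=10):
    """models: dict mapping a context label to a trained GaussianHMM."""
    segments = [models[lab].sample(frames_per_unit)[0]   # (frames, features)
                for lab in context_labels if lab in models]
    # Concatenate per-unit trajectories into one feature sequence.
    return np.vstack(segments) if segments else np.empty((0, 0))
```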
- The speech synthesis model generation device 3 includes a learning information acquisition unit 300, a speech synthesis model generation unit 301, a model database 302, a statistical model database 303, a word extraction unit 304, a word database 305, a parameter generation unit 306, an image information generation unit 307, a request information generation unit 308, and an information output unit 309.
- the learning information acquisition unit 300 is a learning information acquisition unit that acquires feature quantities and text data by receiving them from the mobile communication terminal 2.
- the learning information acquisition unit 300 outputs the feature amount and text data received and acquired from the mobile communication terminal 2 to the speech synthesis model generation unit 301 and outputs the text data to the word extraction unit 304.
- the speech synthesis model generation unit 301 is a speech synthesis model generation unit that performs learning based on the feature amount and text data received from the learning information acquisition unit 300 and generates a speech synthesis model.
- the speech synthesis model is generated by a conventionally known method. Specifically, for example, the speech synthesis model generation unit 301 generates a speech synthesis model for each user of the mobile communication terminal 2 by learning based on a hidden Markov model (HMM).
- the speech synthesis model generation unit 301 models acoustic feature quantities (logarithmic fundamental frequency, mel cepstrum) of speech units (context labels) such as phonemes using an HMM that is a kind of probability model.
- the speech synthesis model generation unit 301 repeatedly performs learning on the logarithmic fundamental frequency and the mel cepstrum.
- Further, based on the models generated for the logarithmic fundamental frequency and the mel cepstrum, the speech synthesis model generation unit 301 models the state duration (phoneme duration), which represents the rhythm and tempo of the speech, from the state distributions (Gaussian distributions). The speech synthesis model generation unit 301 then generates a speech synthesis model by combining the HMMs of the logarithmic fundamental frequency and the mel cepstrum with the state duration model.
- the generated speech synthesis model is output to the model database 302 and the statistical model database 303.
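- A minimal sketch of such per-unit HMM learning is shown below; the hmmlearn library, the five-state topology, and the diagonal covariances are assumptions, and real HMM-TTS training additionally involves decision-tree clustering, multi-space F0 modeling, and the duration modeling described above.

```python
import numpy as np
from hmmlearn import hmm  # assumed: generic HMM library

def train_unit_models(samples):
    """samples: dict mapping a context label to a list of 2-D arrays,
    each (n_frames, n_features) of stacked log-F0 + mel cepstrum."""
    models = {}
    for label, segs in samples.items():
        X = np.vstack(segs)                    # all frames for this unit
        lengths = [len(s) for s in segs]       # segment boundaries
        m = hmm.GaussianHMM(n_components=5, covariance_type="diag",
                            n_iter=20)
        m.fit(X, lengths=lengths)              # iterative (repeated) learning
        models[label] = m
    return models
```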
- the model database 302 holds the speech synthesis model received from the speech synthesis model generation unit 301 for each user.
- When the model database 302 receives information on a new speech synthesis model from the speech synthesis model generation unit 301, it updates the existing speech synthesis model.
- the statistical model database 303 collectively stores the speech synthesis models of all mobile communication terminal 2 users received from the speech synthesis model generation unit 301.
- The information stored in the statistical model database 303 is used, for example, by a statistical model generation unit to generate an average model over all users or an average model for each age group, and to interpolate missing portions of a user's speech synthesis model.
- the word extraction unit 304 is a word extraction unit that extracts words from the text data received from the learning information acquisition unit 300.
- More specifically, the word extraction unit 304 refers to a dictionary database (not shown) storing word information for identifying words by a technique such as morphological analysis, and extracts words from the text data based on the degree of matching between the text data and the word information.
- A word here is a minimum unit of sentence structure, and includes, for example, independent words such as "mobile phone" and ancillary words such as the particle "o".
- the word extraction unit 304 outputs word data indicating the extracted word to the word database 305 for each user.
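- The sketch below illustrates the dictionary-matching idea with a toy greedy longest-match tokenizer; this is an assumption standing in for the real morphological analyzer that the patent only alludes to.

```python
def extract_words(text, dictionary):
    """dictionary: set of known words; returns the words found in text."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest match first
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            i += 1                             # skip an unmatched character
    return words
```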
- the word database 305 holds the word data received from the word extraction unit 304 for each user.
- the word database 305 holds a table as shown in FIG.
- FIG. 5 is a diagram illustrating an example of a table in which word data is held.
- In the table, "word data" is stored for each of twelve categories divided according to a predetermined rule, in association with the "word count" of that word data.
- For example, category 1 holds words such as "mobile phone" and "speech", and the cumulative number of words in the category is "50".
- Note that the category in which a word is placed is determined by a conventional method, such as a decision tree for the spectrum part, a decision tree for the fundamental frequency, or a decision tree for the state duration model.
- the parameter generation unit 306 is a parameter generation unit that generates a parameter indicating the learning degree of the speech synthesis model according to the cumulative number of words in the word database 305 in which the words extracted by the word extraction unit 304 are held.
- the above learning degree is a degree (accuracy of the speech synthesis model) indicating how much the speech synthesis model can reproduce the user's speech.
- the parameter generation unit 306 calculates the cumulative number of words from the number of words for each category in the word database 305, and generates a parameter for each user indicating the degree of learning of the speech synthesis model proportional to the cumulative number of words.
- the parameter is indicated by a numerical value such as 0, 1,..., And indicates that the learning degree increases as the numerical value increases.
- the reason why the parameter is calculated according to the cumulative number of words is that the increase in the number of words for each category is directly related to the improvement of the accuracy of the speech synthesis model.
- the parameter generation unit 306 outputs the generated parameters to the image information generation unit 307 and the request information generation unit 308.
- the parameters include information that can specify the number of words for each category.
- the accuracy of the speech synthesis model improves as the input of speech data increases, and the reproducibility of user speech also increases.
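- The following sketch captures the proportionality described above; the patent fixes no scale, so the words-per-step divisor is an illustrative assumption.

```python
def generate_parameter(word_counts_by_category, words_per_step=100):
    """Learning-degree parameter proportional to the cumulative word count."""
    cumulative = sum(word_counts_by_category.values())
    return cumulative // words_per_step        # 0, 1, 2, ...
```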
- the image information generation unit 307 is an image information generation unit that generates image information for displaying an image to the user of the mobile communication terminal 2 in accordance with the parameter output from the parameter generation unit 306.
- the image information generation unit 307 generates image information for displaying a character image used for an application.
- the image information generation unit 307 holds a table as shown in FIG.
- FIG. 6 is a diagram illustrating an example of a table in which parameters and levels indicating the degree of change in images are associated with each other. As shown in FIG. 6, when the parameter is “0”, the level is “1”, and when the parameter is “3”, the level is “4”.
- the image information generation unit 307 generates image information corresponding to a level indicating the degree of change of the image, and outputs the image information to the information output unit 309.
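- The FIG. 6 association can be sketched as a simple lookup; the text gives only two points (parameter 0 maps to level 1, parameter 3 to level 4), and the level = parameter + 1 rule consistent with them is an assumption.

```python
# The two associations the text states, plus the assumed rule between them.
PARAM_TO_LEVEL = {0: 1, 1: 2, 2: 3, 3: 4}

def level_for(parameter):
    return PARAM_TO_LEVEL.get(parameter, parameter + 1)
```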
- FIG. 7 shows an example in which the character image displayed on the display 27 of the mobile communication terminal 2 changes according to the level indicating the degree of change of the image.
- FIG. 7A shows a character image C1 corresponding to level 1
- FIG. 7B shows a character image C2 corresponding to level 3.
- At level 1 the contour of the character image C1 is not clear, whereas at level 3 the contour of the character image C2 is clear. In this way, the character image grows (changes) in accordance with the level associated with the parameter. The words displayed in the balloons of the character images C1 and C2 are also displayed so as to be spoken more fluently as the level increases. That is, as the learning of the speech synthesis model progresses with the user's voice, the character displayed by the application grows along with it.
- The request information generation unit 308 is request information generation means for generating, based on the parameter generated by the parameter generation unit 306, request information for causing the user to input speech in order to acquire the feature amount. Based on the parameter, the request information generation unit 308 compares the word counts of the categories held in the word database 305, identifies a category with fewer words than the others, and selects words corresponding to that category. Specifically, as illustrated in FIG. 5, when category "6" holds fewer words than the other categories, the request information generation unit 308 selects a plurality of words corresponding to category "6". The request information generation unit 308 then generates request information indicating the selected words and outputs it to the information output unit 309.
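- A sketch of this selection follows; that the per-category word lists are available alongside the counts in the word database 305 is an assumption.

```python
def generate_request_info(category_words, n_words=3):
    """category_words: dict mapping a category id to its list of words.
    Picks the sparsest category and proposes words for the user to utter."""
    sparsest = min(category_words, key=lambda c: len(category_words[c]))
    return {"category": sparsest,
            "words": category_words[sparsest][:n_words]}
```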
- The information output unit 309 is information output means (image information output means) that transmits to the mobile communication terminal 2 the speech synthesis model generated by the speech synthesis model generation unit 301, the image information output from the image information generation unit 307, and the request information output from the request information generation unit 308.
- FIG. 8 is a sequence diagram showing processing of the mobile communication terminal 2 and the speech synthesis model generation device 3.
- In the mobile communication terminal 2, first, speech corresponding to the display by the application is input from the user by the voice input unit 200 (S01, voice input step). The feature amount extraction unit 201 then extracts the feature amount of the speech (S02, feature amount extraction step), the text data acquisition unit 202 acquires text data corresponding to the speech (S03, text data acquisition step), and the learning information transmission unit 203 transmits the feature amount and the text data to the speech synthesis model generation device 3 (S04, learning information transmission step).
- In the speech synthesis model generation device 3, the learning information acquisition unit 300 receives the learning information from the mobile communication terminal 2 and thereby acquires the feature amount and the text data (S05, learning information acquisition step).
- the speech synthesis model generation unit 301 generates a speech synthesis model based on the acquired feature amount and text data (S06, speech synthesis model generation step).
- the word extraction unit 304 extracts words based on the acquired text data (S07).
- the parameter generating unit 306 generates a parameter indicating the learning degree of the speech synthesis model based on the cumulative number of extracted words (S08, parameter generating step).
- The image information generation unit 307 generates, based on the generated parameter, image information for displaying an image to the user of the mobile communication terminal 2 (S09, image information generation step). Further, the request information generation unit 308 generates, based on the generated parameter, request information for causing the user of the mobile communication terminal 2 to input speech in order to acquire the feature amount (S10).
- The speech synthesis model, the image information, and the request information generated in this way are transmitted to the mobile communication terminal 2 by the information output unit 309 (S11, information output step).
- In the mobile communication terminal 2, the speech synthesis model, the image information, and the request information are received by the reception unit 204; the speech synthesis model is held in the speech synthesis model holding unit 206, and the image information and the request information are displayed on the display 27 by the display unit 205 (S12, display step).
- the user of the mobile communication terminal 2 inputs voice according to the request information displayed on the display 27.
- Thereafter, the process returns to step S01 and the above processing is repeated.
- the above is the processing executed by the speech synthesis model generation system 1 according to the present embodiment.
- a speech synthesis model is generated based on speech feature values and text data, and a parameter indicating the learning degree of the speech synthesis model is generated. Then, image information for displaying an image to the user according to the parameter is generated, and the image information is output.
- Since the user who inputs the speech can thus recognize the learning degree of the speech synthesis model as a visualized image, the user can obtain a sense of accomplishment with respect to the speech input, and the motivation to input speech improves. As a result, the user's speech can be suitably acquired.
- In addition, request information for causing the user to input speech is generated based on the parameter and transmitted to the mobile communication terminal 2 in order to acquire the feature amount, so the speech input from the user is appropriate for the learning that generates the speech synthesis model.
- the parameter generation unit 306 generates a parameter indicating the learning degree of the speech synthesis model according to the cumulative number of words extracted by the word extraction unit 304.
- Since the parameter is generated according to the cumulative number of words, the user can recognize that the number of words is increasing by looking at the image information generated according to the parameter, and can thereby obtain a greater sense of accomplishment with respect to the speech input. As a result, the user's speech can be acquired more suitably.
- The image information transmitted from the speech synthesis model generation device 3 to the mobile communication terminal 2 is information for displaying a character image, and the character image shown to the user grows, for example, according to the parameter. This is more visually engaging for the user than displaying, for example, a numerical value as the image. The user's sense of achievement is thereby heightened, the motivation to input speech is further improved, and as a result the user's speech can be acquired more suitably.
- Since the speech synthesis model generation unit 301 generates a speech synthesis model for each user, a speech synthesis model corresponding to each user can be generated, and each individual can use his or her own speech synthesis model.
- Since the feature amount of the speech consists of context data obtained by labeling the speech in speech units and data on the speech waveform indicating features of the speech (the logarithmic fundamental frequency and the mel cepstrum), a speech synthesis model can be generated reliably.
- Since the speech is acquired by the mobile communication terminal 2, it can be acquired easily without requiring facilities such as a studio. Further, because the mobile communication terminal 2 extracts and transmits the feature amount necessary for generating the speech synthesis model, a more accurate speech synthesis model can be generated than when the model is generated from speech that has been degraded by the codec or the communication channel in transmission.
- the present invention is not limited to the above embodiment.
- For example, in the above embodiment, learning is performed using an HMM to generate the speech synthesis model, but the speech synthesis model may be generated using another algorithm.
- In the above embodiment, the feature amount extraction unit 201 of the mobile communication terminal 2 extracts the feature amount of the speech and transmits it to the speech synthesis model generation device 3. Alternatively, the speech input to the voice input unit 200 may be transmitted to the speech synthesis model generation device 3 as speech information (for example, speech encoded in AAC, AMR, or the like), in which case the feature amount is extracted in the speech synthesis model generation device 3.
- In the above embodiment, the image information generation unit 307 generates the image information based on the level associated with the parameter corresponding to the cumulative number of words stored in the word database 305, but image information generation is not limited to this method. For example, a database holding data that defines the size, personality, and the like of the character image C may be provided, and when the user inputs, for example, the speech "thank you", image information may be generated by adding 1 to the data indicating size and adding 1 to the data indicating kindness of personality, according to a predetermined rule.
- In the above embodiment, the image information is information for displaying a character image, but it may instead be information for displaying an object such as a graph, a numerical value, or a car. In the case of a graph, it is information that displays, for example, the cumulative number of words; in the case of an object such as a car, the displayed object may change according to the parameter. Moreover, in the above embodiment, the image information is display data for displaying a character image, but any data for generating an image in the mobile communication terminal 2 may be used. That is, image information for generating an image may be generated and transmitted based on the parameter output from the parameter generation unit 306, and the character image may be generated in the mobile communication terminal 2 that receives the image information.
- In this case, the image information created by the speech synthesis model generation device 3 is, for example, a parameter indicating the face size, skin color, and the like of a preset character image.
- Alternatively, the parameter output from the parameter generation unit 306 of the speech synthesis model generation device 3 may be transmitted as the image information, and the mobile communication terminal 2 may generate the character image based on the parameter. In this case, the mobile communication terminal 2 holds information (for example, the information shown in FIG. 6) indicating what kind of character image is to be generated according to the parameter.
- Alternatively, the cumulative word count of the word data held in the word database 305 of the speech synthesis model generation device 3 may be transmitted as the image information, and the mobile communication terminal 2 may generate the character image based on it. In this case, the mobile communication terminal 2 generates the parameter from the cumulative word count and holds information (for example, the information shown in FIG. 6) indicating what kind of character image is to be generated according to the parameter.
- In the above embodiment, the request information generation unit 308 generates the request information based on the word counts of the categories held in the word database 305, but the requested words may instead be drawn from a database in which words to be requested are stored in advance.
- the text data acquisition unit 202 is provided in the mobile communication terminal 2, but may be provided in the speech synthesis model generation device 3.
- The acquisition of text data may also be performed by a server device that can exchange information by mobile communication, rather than by the mobile communication terminal 2 itself. In this case, the mobile communication terminal 2 transmits the feature amount extracted by the feature amount extraction unit 201 to the server device, and text data acquired based on the feature amount is sent back from the server device in response; the text data acquisition unit 202 thereby acquires the text data.
- Alternatively, the user may enter the text data after the voice input, or the text data may be obtained from the text data contained in the request information. Moreover, in the above embodiment the text data acquisition unit 202 acquires the text data without confirmation from the user, but the acquired text data may first be displayed to the user and acquired only when the user presses, for example, a confirmation key.
- In the above embodiment, the speech synthesis model generation system 1 is configured by the mobile communication terminal 2 and the speech synthesis model generation device 3, but the speech synthesis model generation device 3 may also be used alone. In that case, the speech synthesis model generation device 3 is provided with a voice input unit and the like.
Claims (13)
1. A speech synthesis model generation device comprising: learning information acquisition means for acquiring a feature amount of a user's speech and text data corresponding to the speech; speech synthesis model generation means for generating a speech synthesis model by performing learning based on the feature amount and the text data acquired by the learning information acquisition means; parameter generation means for generating a parameter indicating a learning degree of the speech synthesis model generated by the speech synthesis model generation means; image information generation means for generating, according to the parameter generated by the parameter generation means, image information for displaying an image to the user; and image information output means for outputting the image information generated by the image information generation means.

2. The speech synthesis model generation device according to claim 1, further comprising request information generation means for generating and outputting, based on the parameter generated by the parameter generation means, request information for causing the user to input the speech in order to acquire the feature amount.

3. The speech synthesis model generation device according to claim 1 or 2, further comprising word extraction means for extracting words from the text data acquired by the learning information acquisition means, wherein the parameter generation means generates the parameter indicating the learning degree of the speech synthesis model according to a cumulative number of the words extracted by the word extraction means.

4. The speech synthesis model generation device according to any one of claims 1 to 3, wherein the image information is information for displaying a character image.

5. The speech synthesis model generation device according to any one of claims 1 to 4, wherein the speech synthesis model generation means generates the speech synthesis model for each user.

6. The speech synthesis model generation device according to any one of claims 1 to 5, wherein the feature amount is context data obtained by labeling the speech in speech units and data relating to a speech waveform indicating features of the speech.

7. A speech synthesis model generation system comprising a communication terminal having a communication function and a speech synthesis model generation device capable of communicating with the communication terminal, wherein the communication terminal comprises: speech input means for inputting a user's speech; learning information transmission means for transmitting, to the speech synthesis model generation device, speech information consisting of the speech input by the speech input means or a feature amount of the speech, and text data corresponding to the speech; image information reception means for receiving, from the speech synthesis model generation device, image information for displaying an image to the user in response to the speech information and the text data having been transmitted from the speech information transmission means; and display means for displaying the image information received by the image information reception means, and wherein the speech synthesis model generation device comprises: learning information acquisition means for acquiring the feature amount of the speech by receiving the speech information transmitted from the communication terminal, and for acquiring the text data by receiving the text data transmitted from the communication terminal; speech synthesis model generation means for generating a speech synthesis model by performing learning based on the feature amount and the text data acquired by the learning information acquisition means; parameter generation means for generating a parameter indicating a learning degree of the speech synthesis model generated by the speech synthesis model generation means; image information generation means for generating the image information according to the parameter generated by the parameter generation means; and image information output means for transmitting the image information generated by the image information generation means to the communication terminal.

8. The speech synthesis model generation system according to claim 7, wherein the communication terminal further comprises feature amount extraction means for extracting the feature amount of the speech from the speech input by the speech input means.

9. The speech synthesis model generation system according to claim 7 or 8, further comprising text data acquisition means for acquiring, from the speech input by the speech input means, text data corresponding to the speech.

10. A communication terminal having a communication function, comprising: speech input means for inputting a user's speech; feature amount extraction means for extracting a feature amount of the speech from the speech input by the speech input means; text data acquisition means for acquiring text data corresponding to the speech; learning information transmission means for transmitting the feature amount of the speech extracted by the feature amount extraction means and the text data acquired by the text data acquisition means to a speech synthesis model generation device capable of communicating with the communication terminal; image information reception means for receiving, from the speech synthesis model generation device, image information for displaying an image to the user in response to the feature amount and the text data having been transmitted from the learning information transmission means; and display means for displaying the image information received by the image information reception means.

11. A speech synthesis model generation method comprising: a learning information acquisition step of acquiring a feature amount of a user's speech and text data corresponding to the speech; a speech synthesis model generation step of generating a speech synthesis model by performing learning based on the feature amount and the text data acquired in the learning information acquisition step; a parameter generation step of generating a parameter indicating a learning degree of the speech synthesis model generated in the speech synthesis model generation step; an image information generation step of generating, according to the parameter generated in the parameter generation step, image information for displaying an image to the user; and an image information output step of outputting the image information generated in the image information generation step.

12. A speech synthesis model generation method performed by a speech synthesis model generation system comprising a communication terminal having a communication function and a speech synthesis model generation device capable of communicating with the communication terminal, wherein the communication terminal performs: a speech input step of inputting a user's speech; a learning information transmission step of transmitting, to the speech synthesis model generation device, speech information consisting of the speech input in the speech input step or a feature amount of the speech, and text data corresponding to the speech; an image information reception step of receiving, from the speech synthesis model generation device, image information for displaying an image to the user in response to the speech information and the text data having been transmitted in the speech information transmission step; and a display step of displaying the image information received in the image information reception step, and wherein the speech synthesis model generation device performs: a learning information acquisition step of acquiring the feature amount of the speech by receiving the speech information transmitted from the communication terminal, and of acquiring the text data by receiving the text data transmitted from the communication terminal; a speech synthesis model generation step of generating a speech synthesis model by performing learning based on the feature amount and the text data acquired in the learning information acquisition step; a parameter generation step of generating a parameter indicating a learning degree of the speech synthesis model generated in the speech synthesis model generation step; an image information generation step of generating the image information according to the parameter generated in the parameter generation step; and an image information output step of transmitting the image information generated in the image information generation step to the communication terminal.

13. A speech synthesis model generation method performed by a communication terminal having a communication function, comprising: a speech input step of inputting a user's speech; a feature amount extraction step of extracting a feature amount of the speech from the speech input in the speech input step; a text data acquisition step of acquiring text data corresponding to the speech; a learning information transmission step of transmitting the feature amount of the speech extracted in the feature amount extraction step and the text data acquired in the text data acquisition step to a speech synthesis model generation device capable of communicating with the communication terminal; an image information reception step of receiving, from the speech synthesis model generation device, image information for displaying an image to the user in response to the feature amount and the text data having been transmitted in the learning information transmission step; and a display step of displaying the image information received in the image information reception step.
Priority Applications (3)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN2009801268433A | 2008-07-11 | 2009-07-07 | Speech synthesis model generation device, speech synthesis model generation system, communication terminal, and speech synthesis model generation method
EP09794422A | 2008-07-11 | 2009-07-07 | VOICE SYNTHESIZING MODEL GENERATION DEVICE, VOICE SYNTHESIZING MODEL GENERATING SYSTEM, COMMUNICATION TERMINAL DEVICE, AND METHOD FOR GENERATING VOICE SYNTHESIZING MODEL
US13/003,701 | 2008-07-11 | 2009-07-07 | Voice synthesis model generation device, voice synthesis model generation system, communication terminal device and method for generating voice synthesis model
Applications Claiming Priority (2)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
JP2008-181683 | 2008-07-11 | |
JP2008181683A | 2008-07-11 | 2008-07-11 | Speech synthesis model generation device, speech synthesis model generation system, communication terminal, and speech synthesis model generation method
Publications (1)

Publication Number | Publication Date
---|---
WO2010004978A1 | 2010-01-14
Family

ID=41507091

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
PCT/JP2009/062341 (WO2010004978A1) | Speech synthesis model generation device, speech synthesis model generation system, communication terminal, and speech synthesis model generation method | 2008-07-11 | 2009-07-07
Country Status (6)

Country | Link
---|---
US | US20110144997A1
EP | EP2306450A4
JP | JP2010020166A
KR | KR20110021944A
CN | CN102089804B
WO | WO2010004978A1
Citations (4)

Publication Number | Priority Date | Publication Date | Assignee | Title
---|---|---|---|---
JP2002196786A | 2000-12-26 | 2002-07-12 | Mitsubishi Electric Corp | Speech recognition device
JP2003177790A | 2001-09-13 | 2003-06-27 | Matsushita Electric Ind Co Ltd | Terminal device, server device, and speech recognition method
JP2003295880A | 2002-03-28 | 2003-10-15 | Fujitsu Ltd | Speech synthesis system connecting recorded speech and synthesized speech
JP2004012584A | 2002-06-04 | 2004-01-15 | Nippon Telegr & Teleph Corp <NTT> | Methods for creating speech recognition information, acoustic models, and speech synthesis information, speech recognition and speech synthesis methods, apparatuses therefor, programs, and recording media recording the programs
Non-Patent Citations (1)

- See also references of EP2306450A4
Cited By (1)

Publication Number | Priority Date | Publication Date | Assignee | Title
---|---|---|---|---
JP2015125268A | 2013-12-26 | 2015-07-06 | Brother Industries, Ltd. | Karaoke device and karaoke program
Also Published As

Publication Number | Publication Date
---|---
US20110144997A1 | 2011-06-16
EP2306450A1 | 2011-04-06
EP2306450A4 | 2012-09-05
CN102089804A | 2011-06-08
CN102089804B | 2012-07-18
KR20110021944A | 2011-03-04
JP2010020166A | 2010-01-28
Legal Events

Code | Title | Description
---|---|---
WWE | WIPO information: entry into national phase | Ref document number: 200980126843.3; Country of ref document: CN
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 09794422; Country of ref document: EP; Kind code of ref document: A1
ENP | Entry into the national phase | Ref document number: 20107029074; Country of ref document: KR; Kind code of ref document: A
WWE | WIPO information: entry into national phase | Ref document number: 2009794422; Country of ref document: EP
NENP | Non-entry into the national phase | Ref country code: DE
WWE | WIPO information: entry into national phase | Ref document number: 13003701; Country of ref document: US