WO2007091475A1 - Speech synthesis device, speech synthesis method, and program - Google Patents
Speech synthesis device, speech synthesis method, and program
- Publication number
- WO2007091475A1 (application PCT/JP2007/051669)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- music
- speech
- unit
- utterance
- format
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2240/00—Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
- G10H2240/075—Musical metadata derived from musical analysis or for use in electrophonic musical instruments
- G10H2240/081—Genre classification, i.e. descriptive metadata for classification or selection of musical pieces according to style
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2250/00—Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
- G10H2250/315—Sound category-dependent sound synthesis processes [Gensound] for musical use; Sound category-specific synthesis-controlling parameters or control means therefor
- G10H2250/455—Gensound singing voices, i.e. generation of human voices for musical applications, vocal singing sounds or intelligible words at a desired pitch or with desired vocal effects, e.g. by phoneme synthesis
Definitions
- Speech synthesis apparatus, speech synthesis method, and program
- the present invention relates to a speech synthesis technique, and more particularly to a speech synthesis apparatus, speech synthesis method, and program for synthesizing speech from text.
- in a typical text-to-speech system, a prosody (e.g., pitch frequency pattern, amplitude, duration) is generated from a phonetic symbol string (a text analysis result including the reading, syntax/part-of-speech information, accent type, etc.), unit waveforms (for example, waveforms of about a pitch length or syllable length extracted from natural speech) are selected, and a waveform is generated; the result of each of these processes is uniquely determined.
- the speech synthesizer always synthesizes speech in the same utterance format (voice volume, utterance speed, prosody, voice color, etc.) in any situation or environment.
- Patent Document 1 discloses a configuration of a speech synthesis system that selects a phoneme / prosodic control rule according to information indicating the brightness of a user environment, the position of a user, and the like.
- Patent Document 2 discloses the configuration of a speech synthesizer that controls the power, pitch frequency, and sampling frequency of synthesized speech based on the power spectrum and frequency distribution information of ambient noise.
- Patent Document 3 discloses a configuration of a speech synthesizer that controls speech rate, pitch frequency, volume, and voice quality based on various timing information including time, date, and day of the week.
- Non-Patent Document 1 discloses a genre estimation method that estimates the music genre by obtaining musical features (instrument composition, rhythm structure) through analysis of the short-time amplitude spectrum and discrete wavelet transform coefficients of a music signal.
- Non-Patent Document 2 discloses a genre estimation method for estimating a music genre from a mel frequency cepstrum coefficient of a music signal using a tree-structured vector quantization method.
- Non-Patent Document 3 discloses a method of searching for a music signal by calculating similarity using a spectrum histogram.
- Patent Document 1 Japanese Patent No. 3595041
- Patent Document 2 Japanese Patent Laid-Open No. 11-15495
- Patent Document 3 Japanese Patent Laid-Open No. 11-161298
- Non-Patent Document 1 Tzanetakis, Essl, Cook: "Automatic Musical Genre Classification of Audio Signals", Proceedings of ISMIR 2001, pp. 205-210, 2001.
- Non-Patent Document 2 Hoashi, Matsumoto, Inoue: "Personalization of User Profiles for Content-based Music Retrieval Based on Relevance Feedback", Proceedings of ACM Multimedia 2003, pp. 110-119, 2003.
- Non-Patent Document 3 Kimura, et al.: "High-speed search of sound and video with global pruning", IEICE Transactions D-II, Vol. J85-D-II, No. 10, pp. 1552-1562, October 2002.
- BGM (background music)
- BGM is generally played along with natural sound for the purpose of drawing the audience's attention or impressing the audience with a message.
- BGM is played in the background of narration.
- between the BGM (especially the music genre to which the BGM belongs) and the speaker's utterance format, a relationship can often be found. For example, in weather forecasts and traffic information, announcements are commonly made in a calm tone over BGM with a gentle mood, such as easy listening, whereas even for the same content, announcements in certain programs and live broadcasts are often made in a loud voice.
- the environments in which speech synthesizers are used are diverse, and there are increasing opportunities to output synthesized speech in places (user environments) where various music, including BGM, is being played.
- the conventional speech synthesizers, including those described in Patent Document 1 and the like, cannot reflect the music present in the user environment when controlling the utterance format of synthesized speech; therefore, there is a problem that the utterance format cannot be harmonized with the surrounding music.
- the present invention has been made in view of the above circumstances, and an object of the present invention is to provide a speech synthesizer, a speech synthesis method, and a program capable of synthesizing speech in harmony with music present in the user environment.
- a speech synthesizer characterized by automatically selecting an utterance format according to an input music signal; more specifically, the speech synthesizer comprises an utterance format selection unit that analyzes the music signal and determines an utterance format matching the analysis result, and synthesizes speech according to that utterance format.
- a speech synthesis method for generating synthesized speech using a speech synthesizer, wherein the speech synthesizer analyzes an input music signal, determines an utterance format suitable for the analysis result of the music signal, and synthesizes speech according to the utterance format.
- a program executed by a computer constituting a speech synthesizer, which analyzes an input music signal and selects, from utterance formats prepared in advance, an utterance format suitable for the analysis result of the music signal.
- FIG. 1 is a block diagram showing a configuration of a speech synthesizer according to a first embodiment of the present invention.
- FIG. 2 is an example of a table defining a relationship between a music genre, an utterance format, and an utterance format parameter used in the speech synthesizer according to the first embodiment of the present invention.
- FIG. 3 is a flowchart for explaining the operation of the speech synthesizer according to the first embodiment of the present invention.
- FIG. 4 is a block diagram showing a configuration of a speech synthesizer according to the second embodiment of the present invention.
- FIG. 5 is an example of a table of music genres and utterance formats used in the speech synthesizer according to the second embodiment of the present invention.
- FIG. 6 is a flowchart for explaining the operation of the speech synthesizer according to the second embodiment of the present invention.
- FIG. 7 is a block diagram showing the configuration of the speech synthesizer according to the third embodiment of the present invention.
- FIG. 8 is a flowchart for explaining the operation of the speech synthesizer according to the third embodiment of the present invention.
- FIG. 9 is a block diagram showing the configuration of a speech synthesizer according to the fourth embodiment of the present invention.
- FIG. 10 is a flowchart for explaining the operation of the speech synthesizer according to the fourth embodiment of the present invention.
- FIG. 1 is a block diagram showing the configuration of the speech synthesizer according to the first embodiment of the present invention.
- the speech synthesizer according to this embodiment includes a prosody generation unit 11, a unit waveform selection unit 12, a waveform generation unit 13, prosody generation rule storage units 15-1 to 15-N, unit waveform data storage units 16-1 to 16-N, a music genre estimation unit 21, an utterance format selection unit 23, and an utterance format information storage unit 24.
- the prosody generation unit 11 is a processing means for generating prosody information from the prosody generation rule selected based on the utterance format and the phonetic symbol string.
- the unit waveform selection unit 12 is processing means for selecting a unit waveform from unit waveform data selected based on the utterance format, phonetic symbol string, and prosody information.
- the waveform generation unit 13 is a processing means for generating a synthesized speech waveform from the prosody information and the unit waveform data.
- the prosody generation rule storage units 15-1 to 15-N store the prosody generation rules (for example, pitch frequency pattern, amplitude, duration) required to realize synthesized speech in each utterance format.
- the unit waveform data storage units 16-1 to 16-N, like the prosody generation rule storage units, store the unit waveform data required to realize synthesized speech in each utterance format (for example, waveforms of about a pitch length or syllable length extracted from natural speech).
- the prosody generation rules and unit waveform data stored in these storage units are generated from natural speech matching each utterance format. For example, the prosody generation rules and unit waveform data required to realize an energetic voice are generated from energetic natural speech, those required to realize a calm voice are generated from calm natural speech, and those required to realize a quiet voice are generated from hushed natural speech, each pair being stored in its corresponding prosody generation rule storage unit and unit waveform data storage unit.
- the method for generating the prosody generation rules and unit waveform data does not depend on the utterance format, and the same method as that used to generate them from standard speech can be used.
- the music genre estimation unit 21 is processing means for estimating the music genre to which the input music signal belongs.
- the utterance format selection unit 23 is a processing means for determining the utterance format corresponding to the estimated music genre, based on the table stored in the utterance format information storage unit 24.
- the utterance format information storage unit 24 stores a table, exemplified in FIG. 2, that defines the relationship between the music genre, the utterance format, and the utterance format parameters.
- the utterance format parameters are the prosody generation rule storage unit number and the unit waveform data storage unit number; combining the prosody generation rules and unit waveform data corresponding to each number realizes synthesized speech in a specific utterance format. In the example of FIG. 2, both the utterance format and the utterance format parameters are defined for convenience of explanation, but since only the utterance format parameters are used by the utterance format selection unit 23, the utterance format definitions can be omitted.
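To make the mapping concrete, the FIG. 2 lookup can be sketched as a small table keyed by the estimated genre. This is an illustrative reading, not the patent's implementation: the utterance format names follow the text, while the storage unit numbers are hypothetical placeholders.

```python
# Illustrative sketch of the FIG. 2 table: estimated music genre ->
# (utterance format, prosody rule storage No., unit waveform storage No.).
# The storage unit numbers are hypothetical; the patent only specifies
# that each format names one prosody rule store and one waveform store.
UTTERANCE_FORMAT_TABLE = {
    "pops":           ("energetic voice", 1, 1),
    "easy listening": ("calm voice",      2, 2),
    "religious":      ("quiet voice",     3, 3),
    "other":          ("standard voice",  0, 0),
}

def select_utterance_format(estimated_genre):
    """Step A2: unknown genres fall back to the 'other' (standard) format."""
    return UTTERANCE_FORMAT_TABLE.get(estimated_genre,
                                      UTTERANCE_FORMAT_TABLE["other"])
```

With such a table, the selection step hands the two storage unit numbers to the prosody generation unit 11 and the unit waveform selection unit 12.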
- alternatively, the utterance format information storage unit 24 may define only the relationship between the music genre and the utterance format, and the prosody generation unit 11 and the unit waveform selection unit 12 may be configured to select the prosody generation rules and unit waveform data according to the utterance format.
- it is also possible to prepare unit waveform data for only one utterance format and to switch the utterance format by changing only the prosody generation rules; in this case, the storage capacity and processing amount of the speech synthesizer can be further reduced.
- the correspondence between the music genre information and the utterance format defined in the utterance format information storage unit 24 may be changed according to the user's preference, or a plurality of correspondences may be prepared in advance so that the user can select one according to preference.
- FIG. 3 is a flowchart showing the operation of the speech synthesizer according to this embodiment.
- the music genre estimation unit 21 extracts features of the music signal, such as spectrum and cepstrum, from the input music signal, estimates the music genre to which the input music belongs, and outputs it to the utterance format selection unit 23 (step A1).
- for the genre estimation, the publicly known methods described in Non-Patent Document 1, Non-Patent Document 2, etc. listed above can be used.
- the utterance format selection unit 23, based on the estimated music genre transmitted from the music genre estimation unit 21, selects the corresponding utterance format from the table (see FIG. 2) stored in the utterance format information storage unit 24, and transmits the utterance format parameters necessary to realize the selected format to the prosody generation unit 11 and the unit waveform selection unit 12 (step A2).
- if the estimated music genre is pops, the energetic voice is selected as the utterance format; for easy listening, the calm voice is selected; and for religious music, the quiet voice is selected. If the estimated music genre does not exist in the table of FIG. 2, the standard utterance format is selected, as in the case of the "other" music genre.
- the prosody generation unit 11 refers to the utterance format parameters supplied from the utterance format selection unit 23 and selects, from the prosody generation rule storage units 15-1 to 15-N, the storage unit with the number designated by the utterance format selection unit 23. Then, based on the prosody generation rules of the selected storage unit, it generates prosody information from the input phonetic symbol string and transmits it to the unit waveform selection unit 12 and the waveform generation unit 13 (step A3).
- the unit waveform selection unit 12 refers to the utterance format parameters transmitted from the utterance format selection unit 23 and selects, from the unit waveform data storage units 16-1 to 16-N, the storage unit with the designated number. Then, based on the input phonetic symbol string and the prosody information supplied from the prosody generation unit 11, it selects unit waveforms from the selected unit waveform data storage unit and transmits them to the waveform generation unit 13 (step A4).
- the waveform generation unit 13 connects the unit waveforms supplied from the unit waveform selection unit 12 based on the prosody information transmitted from the prosody generation unit 11, and outputs a synthesized speech signal (step A5).
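Steps A1 to A5 above can be summarized in one sketch. Everything here is a caller-supplied stand-in (genre estimator, format selector, per-symbol prosody rules, and unit waveform dictionaries); real unit selection and waveform concatenation are far more involved.

```python
def synthesize(music_signal, phonetic_symbols,
               estimate_genre, select_format,
               prosody_rules, unit_waveforms):
    """Toy sketch of steps A1-A5 with caller-supplied stand-ins."""
    genre = estimate_genre(music_signal)              # A1: genre estimation
    rule_no, wave_no = select_format(genre)           # A2: format parameters
    prosody = [prosody_rules[rule_no](s) for s in phonetic_symbols]  # A3
    units = [unit_waveforms[wave_no][s] for s in phonetic_symbols]   # A4
    speech = []                                       # A5: connect unit waveforms
    for amp, unit in zip(prosody, units):
        speech.extend(sample * amp for sample in unit)  # scale by prosodic amplitude
    return speech
```

The prosody "rule" here degenerates to a single amplitude per symbol purely to keep the sketch short; pitch pattern and duration control would live in the same place.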
- in this embodiment, unit waveform data storage units 16-1 to 16-N are prepared for each utterance format, but a configuration providing only a standard-voice unit waveform data storage unit is also possible; in that case, the utterance format is controlled only by the prosody generation rules. Since unit waveform data is much larger in size than other data, including the prosody generation rules, this has the advantage of greatly reducing the storage capacity of the entire synthesizer.
- in the first embodiment, the power of the synthesized speech is not a control target: the power is the same whether the synthesized speech is output in a quiet voice or in a cheerful voice. Harmony may be lost if the volume of the synthesized speech is too loud compared with the background music, and in some cases it may sound harsh; conversely, if it is too quiet, harmony may also be impaired and the synthesized speech may be difficult to hear.
- FIG. 4 is a block diagram showing the configuration of the speech synthesizer according to the second embodiment of the present invention.
- the speech synthesizer according to the present embodiment differs from the speech synthesizer of the first embodiment (see FIG. 1) in that a synthesized speech power adjustment unit 17, a synthesized speech power calculation unit 18, and a music signal power calculation unit 19 are added, and an utterance format selection unit 27 and an utterance format information storage unit 28 are arranged in place of the utterance format selection unit 23 and the utterance format information storage unit 24.
- the utterance format information storage unit 28 stores a table that defines the relationship between the music genre, the utterance format, and the utterance format parameters exemplified in FIG.
- the difference from the table held in the utterance format information storage unit 24 of the first embodiment (see FIG. 2) is that a power ratio is added.
- the power ratio is the value obtained by dividing the power of the synthesized speech by the power of the music signal; a power ratio greater than 1.0 indicates that the power of the synthesized speech is greater than the power of the music signal. Referring to FIG. 5, for example, if the music genre is estimated to be pops, the utterance format is the energetic voice and the power ratio is set to 1.2, so the synthesized speech is output with power 1.2 times the music signal power. Similarly, the power ratio is set to 1.0 for the calm voice, 0.9 for the quiet voice, and 1.0 for the standard voice.
- FIG. 6 is a flowchart showing the operation of the speech synthesizer according to this embodiment.
- the process from music genre estimation (step A1) to waveform generation (step A5) is substantially the same as in the first embodiment; the difference is that in step A2 the utterance format selection unit 27 also transmits to the synthesized speech power adjustment unit 17 the power ratio stored in the utterance format information storage unit 28, determined from the estimated music genre transmitted from the music genre estimation unit 21.
- the music signal power calculation unit 19 calculates the average power of the input music signal and transmits it to the synthesized speech power adjustment unit 17 (step B1). If n is the sample number and x(n) is the music signal, the average power P(n) of the music signal can be obtained, for example, by first-order leaky integration as in equation (1): P(n) = a·P(n-1) + (1-a)·x²(n).
- here, a is the time constant of the first-order leaky integration. Since the power is calculated to prevent a large difference between the average volume of the synthesized speech and the BGM, it is desirable to set a to a large value such as 0.9 and compute a long-term average. Conversely, if the power is calculated with a set to a small value such as 0.1, the volume of the synthesized speech changes frequently and greatly, and the synthesized speech may become hard to listen to.
- instead of the leaky integration, a moving average or the average over all samples of the input signal can also be used to calculate the power.
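Assuming the standard first-order leaky-integration update P(n) = a·P(n-1) + (1-a)·x(n)² (the patent's equation (1) itself is not reproduced in this text), the running average power can be sketched as:

```python
def average_power(signal, a=0.9):
    """Leaky-integrated average power of a signal.

    A large time constant a (e.g. 0.9) gives the slowly varying long-term
    average the text recommends; a small a (e.g. 0.1) tracks the signal so
    quickly that the synthesized-speech volume would fluctuate audibly.
    """
    p = 0.0
    history = []
    for x in signal:
        p = a * p + (1.0 - a) * x * x  # P(n) = a*P(n-1) + (1-a)*x(n)^2
        history.append(p)
    return history
```

For a constant-amplitude input the estimate rises monotonically toward the true power, which is the slow-tracking behavior the text asks for.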
- the synthesized speech power calculation unit 18 calculates the average power of the synthesized speech supplied from the waveform generation unit 13 and transmits it to the synthesized speech power adjustment unit 17 (step B2).
- the same method as the music signal power can be used for the calculation of the synthesized voice power.
- the synthesized speech power adjustment unit 17 adjusts the power of the synthesized speech signal supplied from the waveform generation unit 13, based on the music signal power supplied from the music signal power calculation unit 19, the synthesized speech power supplied from the synthesized speech power calculation unit 18, and the power ratio in the utterance format parameters supplied from the utterance format selection unit 27, and outputs the result as a power-adjusted synthesized speech signal (step B3). More specifically, the synthesized speech power adjustment unit 17 adjusts the power of the synthesized speech so that the ratio between the power of the finally output synthesized speech signal and the music signal power approaches the power ratio supplied from the utterance format selection unit 27.
- a power adjustment coefficient is calculated such that the ratio of the power-adjusted synthesized speech power to the music signal power substantially matches the power ratio supplied from the utterance format selection unit 27. If the music signal power is P_m, the synthesized speech power is P_s, and the power ratio is r, the power adjustment coefficient c is c = sqrt(r·P_m / P_s), and if the synthesized speech signal before power adjustment is y(n), the power-adjusted synthesized speech signal y'(n) is given by y'(n) = c·y(n).
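Under that reading, step B3 scales each sample by c = sqrt(r·P_m/P_s), so that the output power c²·P_s equals r·P_m. The sketch below, including the symbol names, is our reconstruction rather than the patent's code:

```python
import math

def adjust_power(synth_signal, music_power, synth_power, power_ratio):
    """Step B3 sketch: scale y(n) so that the output speech power is
    power_ratio times the music signal power."""
    c = math.sqrt(power_ratio * music_power / synth_power)
    return [c * y for y in synth_signal]
```

For example, with music power 4.0, speech power 1.0, and power ratio 1.0, each sample is doubled (c = 2), making the output speech power equal to the music power.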
- this enables fine control, such as making the synthesized speech power somewhat higher than the standard voice when the energetic voice is selected and somewhat lower when the quiet voice is selected, realizing an utterance format that harmonizes better with the BGM.
- FIG. 7 is a block diagram showing the configuration of a speech synthesizer according to the third embodiment of the present invention.
- the speech synthesizer according to the present embodiment adds a music attribute information storage unit 32 to the speech synthesizer according to the first embodiment (see FIG. 1). Instead of the music genre estimation unit 21, a music attribute information search unit 31 is provided.
- the music attribute information search unit 31 is a processing means for extracting a feature quantity, such as a spectrum, from the input music signal and searching the music attribute information storage unit 32 with it.
- in the music attribute information storage unit 32, the feature quantities of various music signals and the music genres of those signals are individually recorded; by comparing feature quantities, the piece of music can be identified and its genre determined.
- FIG. 8 is a flowchart showing the operation of the speech synthesizer according to this embodiment. Since only the music genre estimation (step A1) differs from the first embodiment described above and the other steps have already been described, step D1 in FIG. 8 is described in detail below.
- the music attribute information search unit 31 extracts a feature quantity, such as a spectrum, from the input music signal. Subsequently, it calculates the similarity between the feature quantities of all the music stored in the music attribute information storage unit 32 and the feature quantity of the input music signal, and transmits the music genre information of the music with the highest similarity to the utterance format selection unit 23 (step D1).
- in step D1, if the maximum similarity falls below a preset threshold, the music attribute information search unit 31 determines that the music corresponding to the input music signal is not recorded in the music attribute information storage unit 32, and outputs "other" as the music genre.
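Step D1 amounts to a nearest-neighbor search with a rejection threshold. The feature representation and the similarity measure below (inverse Euclidean distance) are assumptions for illustration; the patent only requires some similarity score and a preset threshold:

```python
import math

def search_genre(query_features, database, threshold):
    """Step D1 sketch: database maps song id -> (feature vector, genre).

    Similarity is taken as 1/(1 + Euclidean distance), an assumption; if
    even the best match falls below the threshold, the music is treated
    as unrecorded and the genre "other" is returned.
    """
    best_sim, best_genre = 0.0, "other"
    for features, genre in database.values():
        dist = math.sqrt(sum((q - f) ** 2
                             for q, f in zip(query_features, features)))
        sim = 1.0 / (1.0 + dist)
        if sim > best_sim:
            best_sim, best_genre = sim, genre
    return best_genre if best_sim >= threshold else "other"
```

A real system would use features like the spectrum histograms of Non-Patent Document 3 and an indexed search rather than a linear scan.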
- since the music attribute information storage unit 32 records the music genre individually for each piece of music, the music genre can be specified with higher accuracy than in the first and second embodiments and reflected in the utterance format. Furthermore, if attribute information other than the music genre is recorded, it becomes possible to determine the utterance format from attribute information other than the genre.
- as the number of pieces of music stored in the music attribute information storage unit 32 increases, the genres of more music signals can be specified, but the required storage capacity also increases. If necessary, the music attribute information storage unit 32 may be arranged outside the speech synthesizer and accessed via wired or wireless communication means when calculating the similarity of the feature quantities of the music signal.
- FIG. 9 is a block diagram showing the configuration of a speech synthesizer according to the fourth embodiment of the present invention.
- the speech synthesizer according to the present embodiment adds a music playback unit 35 and a music data storage unit 37 to the speech synthesizer according to the first embodiment (see FIG. 1).
- a reproduction music information acquisition unit 36 is provided in place of the music genre estimation unit 21.
- the music data storage unit 37 stores a music signal, a song number of the music, and a music genre.
- the music playback unit 35 is a means for reading the music signals stored in the music data storage unit 37 according to playback commands, including various commands such as song number, volume, playback, stop, rewind, and fast-forward, and outputting them.
- the music playback unit 35 supplies the music number of the music being played back to the playback music information acquisition unit 36.
- the reproduction music information acquisition unit 36 extracts from the music data storage unit 37 the music genre information corresponding to the song number supplied from the music playback unit 35 and transmits it to the utterance format selection unit 23; it is processing means corresponding to the music genre estimation unit 21 of the first embodiment.
- FIG. 10 is a flowchart showing the operation of the speech synthesizer according to this embodiment. Since only the music genre estimation (step A1) differs from the first embodiment described above and the other steps have already been described, steps D2 and D3 in FIG. 10 are described in detail below.
- when music is played back, the music playback unit 35 supplies the song number of the music being played to the reproduction music information acquisition unit 36 (step D2).
- the reproduction music information acquisition unit 36 extracts the music genre information corresponding to the supplied song number from the music data storage unit 37 and transmits it to the utterance format selection unit 23 (step D3).
- music genre estimation processing and search processing are not required, and it is possible to reliably specify the music genre of the BGM being played.
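Steps D2 and D3 then reduce to a plain metadata lookup keyed by the song number reported by the playback unit; the record layout here is illustrative:

```python
# Hypothetical music data storage: song number -> (music genre, signal file).
MUSIC_DATA = {
    101: ("pops", "signal_101.pcm"),
    102: ("easy listening", "signal_102.pcm"),
}

def genre_of_playing_song(song_number):
    """D2/D3: the genre is read directly from storage, so no estimation
    or similarity search is needed for the BGM being played."""
    entry = MUSIC_DATA.get(song_number)
    return entry[0] if entry is not None else "other"
```

Falling back to "other" for an unknown song number keeps the downstream utterance format selection well-defined, mirroring the "other" genre row of the format table.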
- if the music playback unit 35 can directly acquire the genre information of the music being played from the music data storage unit 37, the reproduction music information acquisition unit 36 can be eliminated and the music playback unit 35 can be configured to supply the music genre directly to the utterance format selection unit 23.
- if no music genre information is recorded in the music data storage unit 37, it is also possible to estimate the music genre by using the music genre estimation unit 21 in place of the reproduction music information acquisition unit 36.
- if attribute information other than the genre is available, the utterance format can be determined using that attribute information by modifying the utterance format selection unit 23 and the utterance format information storage unit 24 accordingly.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2007800048865A CN101379549B (zh) | 2006-02-08 | 2007-02-01 | Speech synthesis device, speech synthesis method |
US12/223,707 US8209180B2 (en) | 2006-02-08 | 2007-02-01 | Speech synthesizing device, speech synthesizing method, and program |
JP2007557805A JP5277634B2 (ja) | 2006-02-08 | 2007-02-01 | Speech synthesis device, speech synthesis method, and program |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2006-031442 | 2006-02-08 | ||
JP2006031442 | 2006-02-08 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2007091475A1 true WO2007091475A1 (ja) | 2007-08-16 |
Family
ID=38345078
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2007/051669 WO2007091475A1 (ja) | 音声合成装置、音声合成方法及びプログラム (Speech synthesis device, speech synthesis method, and program) |
Country Status (4)
Country | Link |
---|---|
US (1) | US8209180B2 (ja) |
JP (1) | JP5277634B2 (ja) |
CN (1) | CN101379549B (ja) |
WO (1) | WO2007091475A1 (ja) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009139022A1 (ja) * | 2008-05-15 | 2009-11-19 | Pioneer Corporation | Audio output device and program |
WO2018211750A1 (ja) | 2017-05-16 | 2018-11-22 | Sony Corporation | Information processing device and information processing method |
JP2021067922A (ja) * | 2019-10-28 | 2021-04-30 | NAVER Corporation | Content editing support method and system based on real-time generation of synthesized sound for video content |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130030789A1 (en) * | 2011-07-29 | 2013-01-31 | Reginald Dalce | Universal Language Translator |
US9959342B2 (en) * | 2016-06-28 | 2018-05-01 | Microsoft Technology Licensing, Llc | Audio augmented reality system |
JPWO2018030149A1 (ja) * | 2016-08-09 | 2019-06-06 | Sony Corporation | Information processing device and information processing method |
EP3506255A1 (en) * | 2017-12-28 | 2019-07-03 | Spotify AB | Voice feedback for user interface of media playback device |
CN112735454A (zh) * | 2020-12-30 | 2021-04-30 | Beijing Dami Technology Co., Ltd. | Audio processing method and device, electronic device, and readable storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0837700A (ja) * | 1994-07-21 | 1996-02-06 | Kenwood Corp | Sound field correction circuit |
JP2003058198A (ja) * | 2001-08-21 | 2003-02-28 | Canon Inc | Speech output device, speech output method, and program |
JP2003524906A (ja) * | 1998-04-14 | 2003-08-19 | Hearing Enhancement Company, LLC | Method and apparatus for providing user adjustability to suit the preferences of hearing-impaired and non-hearing-impaired listeners |
JP2004513445A (ja) * | 2000-10-30 | 2004-04-30 | Koninklijke Philips Electronics N.V. | User interface/entertainment device that simulates personal interaction and responds to the user's emotional state and/or personality |
JP2004361874A (ja) * | 2003-06-09 | 2004-12-24 | Sanyo Electric Co Ltd | Music playback device |
JP2005077663A (ja) * | 2003-08-29 | 2005-03-24 | Brother Ind Ltd | Speech synthesis device, speech synthesis method, and speech synthesis program |
Family Cites Families (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3070127B2 (ja) * | 1991-05-07 | 2000-07-24 | Meidensha Corporation | Accent component control method for a speech synthesizer |
CN1028572C (zh) | 1991-11-05 | 1995-05-24 | Xiangtan New Product Development Institute | Voice-controlled automatic accompaniment machine |
JPH05307395A (ja) * | 1992-04-30 | 1993-11-19 | Sony Corp | Speech synthesizer |
JPH08328576A (ja) * | 1995-05-30 | 1996-12-13 | Nec Corp | Voice guidance device |
JPH1020885A (ja) * | 1996-07-01 | 1998-01-23 | Fujitsu Ltd | Speech synthesizer |
JP3578598B2 (ja) | 1997-06-23 | 2004-10-20 | Ricoh Co., Ltd. | Speech synthesizer |
JPH1115488A (ja) * | 1997-06-24 | 1999-01-22 | Hitachi Ltd | Synthesized speech evaluation and synthesis device |
JPH11161298A (ja) | 1997-11-28 | 1999-06-18 | Toshiba Corp | Speech synthesis method and device |
US6446040B1 (en) * | 1998-06-17 | 2002-09-03 | Yahoo! Inc. | Intelligent text-to-speech synthesis |
JP2000105595A (ja) * | 1998-09-30 | 2000-04-11 | Victor Co Of Japan Ltd | Singing device and recording medium |
JP2001309498A (ja) * | 2000-04-25 | 2001-11-02 | Alpine Electronics Inc | Voice control device |
US6990453B2 (en) * | 2000-07-31 | 2006-01-24 | Landmark Digital Services Llc | System and methods for recognizing sound and music signals in high noise and distortion |
US6915261B2 (en) * | 2001-03-16 | 2005-07-05 | Intel Corporation | Matching a synthetic disc jockey's voice characteristics to the sound characteristics of audio programs |
US7203647B2 (en) * | 2001-08-21 | 2007-04-10 | Canon Kabushiki Kaisha | Speech output apparatus, speech output method, and program |
JP2004205605A (ja) * | 2002-12-24 | 2004-07-22 | Yamaha Corp | Speech and music playback device and sequence data format |
US9042921B2 (en) * | 2005-09-21 | 2015-05-26 | Buckyball Mobile Inc. | Association of context data with a voice-message component |
JP2007086316A (ja) | 2005-09-21 | 2007-04-05 | Mitsubishi Electric Corp | Speech synthesis device, speech synthesis method, speech synthesis program, and computer-readable storage medium storing the speech synthesis program |
US7684991B2 (en) * | 2006-01-05 | 2010-03-23 | Alpine Electronics, Inc. | Digital audio file search method and apparatus using text-to-speech processing |
2007
- 2007-02-01 JP JP2007557805A patent/JP5277634B2/ja not_active Expired - Fee Related
- 2007-02-01 US US12/223,707 patent/US8209180B2/en not_active Expired - Fee Related
- 2007-02-01 CN CN2007800048865A patent/CN101379549B/zh not_active Expired - Fee Related
- 2007-02-01 WO PCT/JP2007/051669 patent/WO2007091475A1/ja active Search and Examination
Non-Patent Citations (1)
Title |
---|
HAN K.-P. ET AL.: "Genre classification system of TV sound signals based on a spectrogram analysis", IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, vol. 44, no. 1, 1998, pages 33 - 42, XP000779248 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009139022A1 (ja) * | 2008-05-15 | 2009-11-19 | Pioneer Corporation | Audio output device and program |
JPWO2009139022A1 (ja) * | 2008-05-15 | 2011-09-08 | Pioneer Corporation | Audio output device and program |
WO2018211750A1 (ja) | 2017-05-16 | 2018-11-22 | Sony Corporation | Information processing device and information processing method |
JP2021067922A (ja) * | 2019-10-28 | 2021-04-30 | NAVER Corporation | Content editing support method and system based on real-time generation of synthesized sound for video content |
JP7128222B2 (ja) | 2019-10-28 | 2022-08-30 | NAVER Corporation | Content editing support method and system based on real-time generation of synthesized sound for video content |
Also Published As
Publication number | Publication date |
---|---|
CN101379549A (zh) | 2009-03-04 |
JP5277634B2 (ja) | 2013-08-28 |
JPWO2007091475A1 (ja) | 2009-07-02 |
CN101379549B (zh) | 2011-11-23 |
US8209180B2 (en) | 2012-06-26 |
US20100145706A1 (en) | 2010-06-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6645956B2 (ja) | System and method for portable speech synthesis | |
KR101274961B1 (ko) | Music content production system using a client terminal | |
US7825321B2 (en) | Methods and apparatus for use in sound modification comparing time alignment data from sampled audio signals | |
JP5143569B2 (ja) | Method and apparatus for synchronized modification of acoustic features | |
JP5277634B2 (ja) | Speech synthesis device, speech synthesis method, and program | |
US7613612B2 (en) | Voice synthesizer of multi sounds | |
BR112013019792B1 (pt) | Semantic audio track mixer |
CN101111884B (zh) | Method and apparatus for synchronized modification of acoustic features | |
JP7363954B2 (ja) | Singing synthesis system and singing synthesis method | |
JP7424359B2 (ja) | Information processing device, singing voice output method, and program | |
US6915261B2 (en) | Matching a synthetic disc jockey's voice characteristics to the sound characteristics of audio programs | |
CN112289300B (zh) | Audio processing method and device, electronic device, and computer-readable storage medium | |
JP2016161919A (ja) | Speech synthesizer | |
CN113936629B (zh) | Music file processing method and device, and music singing equipment | |
CN113781989A (zh) | Audio animation playback and rhythm beat-point recognition method and related device | |
WO2014142200A1 (ja) | Speech processing device | |
JP2023013684A (ja) | Singing voice quality conversion program and singing voice quality conversion device | |
JPH11167388A (ja) | Music performance device | |
Jayasinghe | Machine Singing Generation Through Deep Learning | |
CN118379978A (zh) | Online karaoke method, system, and storage medium based on a smart speaker | |
JP6182894B2 (ja) | Acoustic processing device and acoustic processing method | |
JP2005274790A (ja) | Music playback device, music playback method, music playback program, and electronic album device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
DPE1 | Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101) | ||
ENP | Entry into the national phase |
Ref document number: 2007557805 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 12223707 Country of ref document: US Ref document number: 200780004886.5 Country of ref document: CN |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 07707855 Country of ref document: EP Kind code of ref document: A1 |
|