CN1573921A - Speech and music regeneration device - Google Patents
- Publication number
- CN1573921A CNA2004100474146A CN200410047414A
- Authority
- CN
- China
- Prior art keywords
- data
- speech
- user
- phrase
- script
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/15—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2250/00—Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
- G10H2250/471—General musical sound synthesis principles, i.e. sound category-independent synthesis methods
- G10H2250/481—Formant synthesis, i.e. simulating the human speech production mechanism by exciting formant resonators, e.g. mimicking vocal tract filtering as in LPC synthesis vocoders, wherein musical instruments may be used as excitation signal to the time-varying filter estimated from a singer's speech
Abstract
The invention relates to a device for reproducing speech and music, comprising middleware that generates a synthesized sound signal, a sound source (20) that reproduces the desired speech or music based on the synthesized sound signal, and a speaker. The device is characterized in that the middleware processes script data (11), user tone color parameters (12), and user phrase synthesis dictionary data (13) together with default tone color parameters (18) and default synthesis dictionary data (19) to generate the synthesized sound signal. When an HV script describing various events is used as the script data, the desired waveform data, musical composition phrase data containing note data, and formant frame data are combined as appropriate according to the event type and reproduced.
Description
Technical Field
The present invention relates to a speech and music reproducing device, and more particularly to a speech and music reproducing device that reproduces a specific speech by speech synthesis and converts character information into speech or reproduces music.
Background
Conventionally, a character string/speech conversion apparatus has been designed which converts information on a character string (character string information) such as an electronic mail into speech and outputs the speech. Japanese laid-open patent publication No. 2001-7937 shows an example of a character string speech converting apparatus in which character string information is divided in units of sections, and speech is output and the content thereof is displayed on a display.
There are known methods of reproducing waveform data (or sampled data) obtained by sampling a musical composition phrase, a speech phrase, or the like, and methods of reproducing a musical composition from note information described in a format such as SMF (Standard MIDI File) or SMAF (Synthetic music Mobile Application Format). For example, Japanese laid-open patent publication No. 2001-51688 discloses an electronic mail reading apparatus capable of separating character string information and musical tone information in an electronic mail and reproducing each of them as sound.
However, in the conventional character string/speech conversion device, the character string information is divided into units of sections (sentences or phrases) and output as speech, so the output is a concatenation of speech in pronunciation units (or character units), and at the connection points between pronunciation units the reproduced speech sounds dissonant to a listener compared with normal speech. That is, the conventional character string/speech conversion device cannot reproduce a section as a whole with good sound quality while changing the timbre; in other words, it cannot output natural speech close to human speech.
A method considered for solving the above problem is, for example, to sample the speech of each section (hereinafter referred to as a "phrase") in advance, store it as speech data, and output the corresponding speech waveform at reproduction time. However, this method requires raising the sampling frequency in order to improve the quality of the speech output, so a large amount of speech data must be stored, which is technically difficult for devices with relatively limited storage capacity, such as mobile phones (cellular phones).
In the conventional method of reproducing waveform data created by sampling music or voice, and in the conventional method of reproducing a musical composition from note information described in a format such as SMF or SMAF, the reproduction timing of music or voice is not described in the form of a text file, and it is therefore difficult to combine speech reproduction based on character string information with waveform data reproduction or music data reproduction according to the user's intention.
Disclosure of Invention
In order to solve the above-described problems, it is an object of the present invention to provide a speech reproducing apparatus capable of reproducing and outputting a desired section (or phrase) composed of character string information or the like as speech with good sound quality while changing its timbre.
Another object of the present invention is to provide a voice/music reproduction device that allows a user to easily perform voice reproduction or a combination of waveform data reproduction and music data reproduction, thereby allowing voice and music to be reproduced faithfully to the user's intention.
A speech reproducing device according to the present invention stores, as synthesis dictionary data, a database composed of formant frame data corresponding to predetermined pronunciation units, and performs speech synthesis using the synthesis dictionary data when character string information composed of those pronunciation units is supplied. Here, when formant frame data is replaced with arbitrary user phrase data and character string information is provided, speech synthesis is performed using the synthesis dictionary data in which the user phrase data has been substituted. Tone color parameters for processing the formant frame data are appended to the user phrase data. In addition, a prescribed data exchange format containing the user phrase data is used in speech synthesis. The data exchange format is, for example, the SMAF file format, and contains not only user phrase data but also various pieces of musical composition reproduction information.
Specifically, the speech reproducing apparatus includes default synthesis dictionary data storing formant frame data corresponding to predetermined pronunciation units, and middleware consisting of an application program interface (API), a converter, and a driver, together with a sound source, for replacing the formant frame data with user phrase data. Thus, a desired phrase composed of character string information can be reproduced as speech with good sound quality, and its timbre can be changed as appropriate during reproduction.
The speech/music reproduction device according to the present invention stores script data (i.e., an HV script) describing reproduction instructions for character-based speech and for pronunciation data stored in advance. A speech signal corresponding to the characters is generated based on the script data to produce the desired speech, and a sound generation signal corresponding to the pronunciation data is generated to produce the desired speech or musical sound. Here, the pronunciation data is, for example, waveform data generated by sampling speech or music, and a synthesized pronunciation signal is generated from that waveform data. When musical composition data including note information is used as the pronunciation data, a musical tone signal corresponding to the note information is generated based on the musical composition data. When formant control parameters (or formant frame data) characterizing character-based speech are stored, a speech signal is generated from the formant control parameters. The script data may be created arbitrarily by the user; in this case, it is in a predetermined file format created by text input.
Specifically, the various events described in the HV script are interpreted; when the type of an event indicates waveform data, the waveform data is read out and reproduced, and when the type of an event indicates musical composition phrase data, that phrase data is reproduced. At this time, note data is read out and reproduced based on the time information in the musical composition phrase data. For other events, the input character string is converted into a formant frame string using the synthesis dictionary data, and speech synthesis is performed. Thus, the user can easily combine speech reproduction, waveform data reproduction, and music data reproduction.
Drawings
Fig. 1 is a block diagram showing a configuration of a speech reproducing apparatus according to a first embodiment of the present invention.
FIG. 2 is a diagram showing the assignment relationship between pronunciation units and phrase IDs.
Fig. 3 is an example of contents of phrase synthesis dictionary data.
Fig. 4 is an example of a format of the SMAF file.
Fig. 5 is a functional block diagram showing an example of the HV authoring tool.
Fig. 6 is a block diagram showing a configuration of a mobile communication terminal to which a voice reproducing device is applied.
Fig. 7 is a flowchart showing a process of creating user phrase synthesis dictionary data.
Fig. 8 is a flowchart showing a reproduction process of the user phrase synthesis dictionary data.
Fig. 9 is a flowchart showing the processing for creating the SMAF file.
Fig. 10 is a flowchart showing a reproduction process of the SMAF file.
Fig. 11 is a block diagram showing a configuration of a speech/music reproducing device according to a second embodiment of the present invention.
Fig. 12 is a diagram showing an example of the distribution relationship between events and waveform data and musical composition phrase data.
Fig. 13 is a flowchart showing the voice/music reproduction processing of the second embodiment.
Fig. 14 is a block diagram showing a configuration of a mobile phone including the voice/music reproducing device according to the second embodiment.
Fig. 15 is a block diagram showing a configuration of a speech/music reproducing device according to a third embodiment of the present invention.
Fig. 16 is a flowchart showing the operation of the speech/music reproducing apparatus shown in fig. 15.
Detailed Description
Embodiments of the present invention are described in detail with reference to the accompanying drawings.
Fig. 1 is a block diagram showing a configuration of a speech reproducing apparatus according to a first embodiment of the present invention.
That is, the speech reproducing apparatus 1 shown in fig. 1 includes application software 14, a middleware API (middleware application program interface) 15, a converter 16, a driver 17, default tone color parameters 18, default synthesis dictionary data 19, and a sound source 20, and reproduces speech by receiving script data 11, user tone color parameters 12, and user phrase synthesis dictionary data 13 (of variable length) as input.
The speech reproducing apparatus 1 basically reproduces speech by formant synthesis according to the CSM (Composite Sinusoidal Modeling) speech synthesis method using an FM (frequency modulation) sound source. In the present embodiment, the user phrase synthesis dictionary data 13 is defined, and the speech reproducing apparatus 1 assigns user phrases to the tone color parameters in units of phonemes with reference to the user phrase synthesis dictionary data 13. When the user phrase synthesis dictionary data 13 is assigned to the tone color parameters in this manner, the speech reproducing apparatus 1 replaces the phonemes registered in the default synthesis dictionary data with the user phrases during reproduction and then performs speech synthesis based on the replaced data. A "phoneme" is the minimum unit of pronunciation and, in the case of Japanese, consists of a vowel and a consonant.
The following describes the detailed configuration of the speech reproducing apparatus 1.
In fig. 1, the script data 11 is data in a format defined for reproducing "HV (human voice)", i.e., speech synthesized by the above-described method. That is, the script data 11 is a data format for performing speech synthesis that contains a phonetic character string including prosodic symbols, event data for setting the voice to be produced, and event data for controlling the application software 14, and it is a text-input format so that the user can easily enter it manually.
The definition of the data format of the script data 11 is language-dependent and can be made for various languages; in the present embodiment, it is defined for Japanese.
The user phrase synthesis dictionary data 13 and the default synthesis dictionary data 19 are databases built in units of phonetic characters: actual human speech is sampled and analyzed for each phonetic character unit (for example, "あ" and "い" in Japanese), eight sets of formant frequencies, formant intensities (formant levels), and pitches are extracted as parameters, and the resulting formant frame data are associated with the phonetic character units in advance. The user phrase synthesis dictionary data 13 is a database constructed outside the middleware, and the user can register arbitrary formant frame data in it; its registered contents can completely replace the contents stored in the default synthesis dictionary data 19 via the middleware API 15. That is, the contents of the default synthesis dictionary data 19 can be completely replaced with the contents of the user phrase synthesis dictionary data 13. The default synthesis dictionary data 19, on the other hand, is a database built into the middleware.
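The following is a minimal sketch of how such a synthesis dictionary and its replacement mechanism could be modeled; the class and field names are illustrative assumptions, not the actual data layout of the patent.

```python
# Illustrative sketch only; names and layout are assumptions, not the patent's implementation.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class FormantFrame:
    frequencies: List[float]  # 8 formant (resonance) frequencies
    levels: List[float]       # 8 formant intensities (formant levels)
    pitch: float              # pitch extracted for this frame

# Default synthesis dictionary: pronunciation unit ("あ", "い", ...) -> analysed frame sequence
default_dictionary: Dict[str, List[FormantFrame]] = {}

# User phrase synthesis dictionary: phrase ID -> frame sequence for a whole phrase (e.g. "おはよう")
user_phrase_dictionary: Dict[int, List[FormantFrame]] = {}

def replace_entry(pronunciation_unit: str, phrase_id: int) -> None:
    """Replace a default-dictionary entry with user phrase data, as allowed via the middleware API."""
    default_dictionary[pronunciation_unit] = user_phrase_dictionary[phrase_id]
```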
It is preferable that the user phrase synthesis dictionary data 13 and the default synthesis dictionary data 19 have two types, male voice quality and female voice quality, respectively, and that the voice output of the voice reproducing apparatus 1 is changed in accordance with the period of each frame, but the frame period of formant frame data registered in the user phrase synthesis dictionary data 13 and the default synthesis dictionary data 19 is set to 20ms, for example.
The user tone color parameters 12 and the default tone color parameters 18 are groups of parameters for controlling the sound quality of the speech output of the speech reproducing apparatus 1. In other words, with the user tone color parameters 12 and the default tone color parameters 18, various timbres can be created by specifying change amounts for the eight sets of resonance frequencies and formant intensities (that is, for the resonance frequencies and formant intensities registered in the user phrase synthesis dictionary data 13 and the default synthesis dictionary data 19) and by specifying the basic waveform used for formant synthesis.
The default tone color parameters 18 are tone color parameters set in advance as default values in the middleware, and the user tone color parameters 12 are parameters that can be arbitrarily created by the user, are set and stored outside the middleware, and are parameters for expanding the contents of the default tone color parameters 18 via the middleware API 15.
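As a rough illustration of how tone color parameters could act as change amounts on the dictionary data, a hypothetical sketch follows; the structure and function names are assumptions for explanation only.

```python
# Hypothetical sketch: tone color parameters as change amounts applied to a frame's formants.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ToneColorParameter:
    freq_offsets: List[float]   # change amounts for the 8 resonance frequencies
    level_offsets: List[float]  # change amounts for the 8 formant intensities
    base_waveform: str          # basic waveform selected for formant synthesis (e.g. "sine")

def apply_tone_color(freqs: List[float], levels: List[float],
                     tone: ToneColorParameter) -> Tuple[List[float], List[float]]:
    """Return new (frequencies, levels) with the tone color change amounts applied."""
    new_freqs = [f + d for f, d in zip(freqs, tone.freq_offsets)]
    new_levels = [v + d for v, d in zip(levels, tone.level_offsets)]
    return new_freqs, new_levels
```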
The application software 14 is software for reproducing the script data 11.
The middleware API15 forms an interface between the application software 14 formed of software and the converter 16, driver 17, default tone color parameters 18, and default synthesis dictionary data 19 formed of middleware.
The converter 16 interprets the script data 11, which is finally converted by the driver 17 into a formant frame data string formed of connected frame data.
The driver 17 generates a formant frame data string from the phonetic character string included in the scenario data 11 and the default synthetic dictionary data 19, and processes the formant frame data string by interpreting the tone color parameters.
The sound source 20 generates a synthesized utterance signal corresponding to the output data of the converter 16, and outputs the synthesized utterance signal to a speaker to emit sound.
The technical features of the speech reproducing apparatus 1 of the present embodiment will be described below.
The user tone parameters 12 include parameters for assigning a phrase ID stored in the user phrase synthesis dictionary data 13 to an arbitrary pronunciation unit. Fig. 2 shows an example of assigning pronunciation units and phrase IDs, and here shows assignment relations between syllables (mora) and phrase IDs. In the case of japanese, the syllable means "beat", and corresponds to, for example, kana character unit.
By assigning a phrase ID to each pronunciation unit, the pronunciation unit specified in the user tone color parameter 12 is specified not with reference to the default synthetic dictionary data 19 but with reference to the user phrase synthetic dictionary data 13. In the user tone color parameters 12, it is preferable that an arbitrary number of pronunciation units can be specified from one tone color parameter.
The above-described phrase ID assignment of each pronunciation unit in the user tone color parameters 12 is an example of the present embodiment, and other methods may be adopted as long as they correspond to pronunciation units.
Next, details of the user phrase synthesis dictionary data 13 will be explained; fig. 3 shows an example of its contents. The user phrase synthesis dictionary data 13 stores formant frame data consisting of eight sets of resonance frequencies, formant intensities, and pitches. A "phrase" in fig. 3 is, for example, "おはよう" in Japanese, i.e., a unit of syllables that carries a meaning or forms a sentence; the scale of a "phrase" need not be specified in particular, and it may be of any scale such as a word, a clause, or a sentence.
As a tool for creating the user phrase synthesis dictionary data 13, an analysis tool is required that analyzes an ordinary audio file (a file with an extension such as wav or aif) and generates formant frame data consisting of eight sets of resonance frequencies, formant intensities, and pitches.
The scenario data 11 includes event data indicating a change in sound quality, but the user tone color parameters 12 may be specified by the event data.
For example, as a description example of the script data 11 using Japanese hiragana and alphanumeric characters, "TJK12みなさんX10あか" may be set. In this example, "K" represents event data specifying the default tone color parameters 18, and "X" represents event data specifying the user tone color parameters 12. "K12" is a code for specifying one specific default tone color parameter from among a plurality of default tone color parameters, and "X10" is a code for specifying the user tone color parameter shown in fig. 2 from among a plurality of user tone color parameters.
In the above example, the synthesized speech to be reproduced is "みなさん、こんにちは。鈴木です。". Here, "みなさん" is synthesized speech reproduced with reference to the default tone color parameters 18 and the default synthesis dictionary data 19, while "こんにちは" and "鈴木です" are synthesized speech reproduced with reference to the user tone color parameters 12 and the user phrase synthesis dictionary data 13. That is, "みなさん" is synthesized speech for which formant frame data on the four phonemes "み", "な", "さ", and "ん" are read out from the default synthesis dictionary data 19 and reproduced, whereas "こんにちは" and "鈴木です" are synthesized speech for which the formant frame data of the respective phrase units are read out from the user phrase synthesis dictionary data 13 and reproduced.
In the example of fig. 2, three phonetic units, "あ", "い", and "か", are shown, but any characters or symbols that can be represented as text may be used without particular limitation. In the above example, the phrase "こんにちは" is pronounced for the phonetic symbol "あ" following "X10", and the phrase "鈴木です" is pronounced for the phonetic symbol "か". Therefore, if the original "あ" is to be uttered after the above example, a symbol for returning the reference target to the default synthesis dictionary data 19 (for example, "X" followed by a predetermined number) may be inserted.
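A minimal, hypothetical interpreter sketch of the "K"/"X" switching described above is shown below; the tokenization, the phrase-ID assignment used in the example call, and the function names are assumptions, and the real converter and driver are of course far more involved.

```python
# Minimal sketch (assumed behaviour): "K.." selects a default tone color, "X.." selects a user
# tone color whose phrase assignments redirect pronunciation units to the user phrase dictionary.
import re

def interpret_hv_script(script, default_dict, phrase_assignment):
    """Yield (source, key) pairs saying which dictionary each pronunciation unit is read from."""
    use_user = False
    for tok in re.findall(r"[KX]\d+|.", script):
        if tok.startswith("K"):                      # default tone color parameter event
            use_user = False
        elif tok.startswith("X"):                    # user tone color parameter event
            use_user = True
        elif use_user and tok in phrase_assignment:
            yield ("user", phrase_assignment[tok])   # whole phrase from the user phrase dictionary
        elif tok in default_dict:
            yield ("default", tok)                   # single pronunciation unit from the default dictionary

# Example call corresponding to the description above (phrase IDs are illustrative assumptions):
assignment = {"あ": 0, "か": 2}
list(interpret_hv_script("TJK12みなさんX10あか", {"み": [], "な": [], "さ": [], "ん": []}, assignment))
```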
The data exchange format for music reproduction sequence data (SMAF) used in the speech reproducing apparatus 1 of the present embodiment will be described with reference to fig. 4. Fig. 4 shows the format of an SMAF file; SMAF is one data format specification for expressing music and the like using a sound source, and a data exchange format for mutual use, defined for reproducing multimedia contents on portable terminals (personal digital assistants (PDAs), personal computers, cellular phones, and the like).
The SMAF file 30 of the data exchange format shown in fig. 4 has a basic structure of data units called chunks (chunks). A block is composed of a header of a fixed length (8 bytes) and a body portion of an arbitrary length, and the header is divided into a block ID of 4 bytes and a block size of 4 bytes. The block ID is used as an identifier of the block, and the block size indicates the length of the main body portion. The SMAF file 30 itself and the various data it contains also constitute the entire block structure.
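To make the chunk layout concrete, here is a small parsing sketch; it assumes the 4-byte chunk size is stored big-endian and is only an illustrative reading of the structure described above, not code from the specification.

```python
# Sketch of walking a block of SMAF-style chunks: 4-byte ID + 4-byte size header, then the body.
# Big-endian sizes are assumed here.
import struct

def iter_chunks(data: bytes, offset: int = 0, end=None):
    """Yield (chunk_id, body) pairs from a contiguous run of chunks."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        chunk_id = data[offset:offset + 4]                        # 4-byte identifier of the chunk
        (size,) = struct.unpack(">I", data[offset + 4:offset + 8])
        yield chunk_id, data[offset + 8:offset + 8 + size]
        offset += 8 + size
```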
As shown in fig. 4, the SMAF file 30 is composed of a content information block (contents info chunk)31, a selection data block (optional data chunk)32, a score track chunk 33, and an HV block (HV chunk) 36.
The content information block 31 stores various management information about the SMAF file 30, for example, information such as the level, genre, copyright information, genre name, song title, artist name, and writer/composer name of the content. The selection data block 32 stores information such as copyright information, genre name, song title, artist name, and writer/composer name. The selection data block 32 need not necessarily be provided in the SMAF file 30.
The score track block 33 is a block storing a sequence track of a musical composition transmitted to a sound source, and includes a creation data block (setup data chunk)34 (option) and a sequence data block (sequence data chunk) 35.
The creation data block 34 is a block for storing tone color data and the like for the sound source, and stores it in the form of exclusive messages; this exclusive information includes, for example, tone color parameter registration information.
The sequence data block 35 is a block for storing actual performance data, and stores HV note-on (HV note-on, 'HV' indicating human voice) that determines the reproduction timing of the scenario data 11, mixed with other sequence events. Here, the HV and other musical-piece events are distinguished by the channel designation of the HV.
The HV block 36 contains a HV creation data block (HV setup data chunk)37 (option), a HV user phrase dictionary block (HV user phrase dictionary chunk)38 (option), and a HV-S block 39.
The HV creation data block 37 stores HV user tone parameters and information for specifying a channel for HV, and the HV-S block 39 stores HV script data.
The HV user phrase dictionary block 38 stores the contents of the user phrase synthesis dictionary data 13, and the HV user tone color parameters stored in the HV creation data block 37 must include the parameters that assign syllables to phrase IDs as shown in fig. 2.
By applying the tone color parameters of the present embodiment to the SMAF file 30 shown in fig. 4, it is possible to reproduce the synthesized speech (HV) in synchronization with the music and also to reproduce the contents of the user phrase synthesis dictionary data 13.
Next, the HV authoring tool used as a tool for creating the user phrase synthesis dictionary data 13 shown in fig. 1 and the SMAF file 30 shown in fig. 4 will be described with reference to fig. 5. Fig. 5 is a block diagram showing an example of the functions and specifications of the HV authoring tool.
When the SMAF file 30 is created, the HV authoring tool 42 reads an SMF file (standard MIDI file) 41 (including note-on for determining sound generation timing of HV) created in advance by a MIDI (musical instrument digital interface) sequencer, and performs conversion processing to an SMAF file 43 (corresponding to the SMAF file 30) based on information obtained from an HV script UI (HV script user interface) 44 and an HV voice editor (HV voice editor) 45.
The HV speech editor 45 is an editor having a function of editing HV user tone parameters (corresponding to the user tone parameters 12 described above) included in the HV user tone file 48. The HV speech editor 45 can assign a user phrase to an arbitrary syllable in addition to editing various HV timbre parameters.
The interface of the HV speech editor 45 has a menu for selecting a syllable and has the function of assigning an arbitrary sound file 50 to the syllable. The sound file 50 assigned via the interface of the HV speech editor 45 is analyzed by the waveform analyzer 46, thereby generating formant frame data consisting of 8 sets of resonance frequencies, formant intensities, and pitches. The formant frame data may be output and input as individual files (i.e., the HV user tone file 48 and the HV user synthesis dictionary file 49).
The HV script UI44 may edit HV script data directly, or the HV script data may be input and output as an individual file (i.e., the HV script file 47). The HV authoring tool 40 according to the present embodiment may be configured only by the HV authoring tool 42, the HV script UI44, the HV voice editor 45, and the waveform analyzer 46 described above.
Next, an example in which the voice reproducing device 1 of the present embodiment is applied to a mobile communication terminal will be described with reference to fig. 6, and fig. 6 is a block diagram showing a configuration of a mobile communication terminal 60 provided with the voice reproducing device 1.
The mobile communication terminal 60 is a device corresponding to a mobile phone, for example, and includes a CPU61, a ROM62, a RAM63, a display unit 64, a vibrator 65, an input unit 66, a communication unit 67, an antenna 68, a voice processing unit 69, a sound source 70, a speaker 71, and a bus 72. The CPU61 controls the portable communication terminal 60 as a whole, and the ROM62 stores various communication control programs, control programs such as a program for reproducing music, and various constant data.
The RAM63 is used as a work area, and stores music files and various application programs at the same time. The display unit 64 is constituted by, for example, a Liquid Crystal Display (LCD), and the vibrator 65 vibrates when the mobile phone receives an incoming call. The input unit 66 is configured by a plurality of operators such as keys, and these operators instruct registration processing of the user tone color parameters, the user phrase synthesis dictionary data, and the HV script data in accordance with user operations. The communication unit 67 is constituted by a modem or the like, and is connected to an antenna.
The voice processing unit 69 is connected to a microphone and a receiver speaker (e.g., a microphone and an earphone), and has a function of encoding and decoding a voice signal for a call. The sound source 70 reproduces music from a music file stored in the RAM63 or the like, and reproduces a voice signal and outputs the signal to the speaker 71. The bus 72 is a transmission path for data transmission between the components of the mobile phone, which are composed of the CPU61, the ROM62, the RAM63, the display unit 64, the vibrator 65, the input unit 66, the communication unit 67, the voice processing unit 69, and the sound source 70.
The communication unit 67 may download an HV script file or the SMAF file 30 shown in fig. 4 from a predetermined content server (Contents Server) or the like and store it in the RAM 63. The ROM62 stores the application program 14 of the speech reproducing apparatus 1 shown in fig. 1 and the middleware programs. The CPU61 reads out and starts the application program 14 and the middleware programs. The CPU61 interprets the HV script data stored in the RAM63, generates formant frame data, and sends the formant frame data to the sound source 70.
Next, the operation of the speech reproducing apparatus 1 of the present embodiment will be described. First, a method of creating the user phrase synthesis dictionary 13 will be described. FIG. 7 is a flowchart showing a method of creating the user phrase synthesis dictionary 13.
First, in step S1, an HV tone color that refers to the user phrase synthesis dictionary data 13 is selected with the HV authoring tool 42 shown in fig. 5, and the HV speech editor 45 is started. The syllables to be used are then selected with the HV speech editor 45, and sound files are assigned to them. Thus, in step S2, the HV speech editor 45 generates and outputs user phrase dictionary data (corresponding to the HV user synthesis dictionary file 49).
Then, the HV speech editor 45 edits the HV tone color parameters, and in step S3 the HV speech editor 45 generates and outputs user tone color parameters (corresponding to the HV user tone file 48).
Then, a sound quality change event specifying the corresponding HV tone color is described in the HV script data using the HV script UI44, together with the syllables to be reproduced. Next, in step S4, the HV script UI44 generates and outputs the HV script data (corresponding to the HV script file 47).
Next, a description will be given of a reproduction operation of the user phrase synthesis dictionary data 13 in the speech reproducing apparatus 1 with reference to fig. 8. Fig. 8 is a flowchart showing a reproduction operation of the user phrase synthesis dictionary data 13 in the speech reproducing apparatus 1.
First, in step S11, the user tone color parameters 12 and the user phrase synthesis dictionary data 13 are registered in the middleware of the speech reproducing apparatus 1. Then, the script data 11 is registered in the middleware of the speech reproducing apparatus 1, and the HV script data is reproduced in step S12.
During reproduction, in step S13, it is monitored whether or not a sound quality change event (X event) specifying the user sound color parameters 12 is included in the scenario data 11.
If a voice quality change event is found in step S13, the phrase ID assigned to a syllable is searched for from the user tone color parameters 12, and data corresponding to the phrase ID is read out from the user phrase synthesis dictionary data 13, and then, in step S14, dictionary data of the corresponding syllable in the default synthesis dictionary data 19 managed by the HV driver is replaced with the user phrase synthesis dictionary data 13. The replacement processing of step S14 may also be performed before the HV script data is reproduced.
After step S14 ends, or when no sound quality change event is found in step S13, the flow proceeds to step S15, and the converter 16 interprets the syllables of the script data 11 (or, when the processing in step S14 has been performed, of the script data after that replacement), which are finally converted into formant frame string data by the HV driver.
In step S16, the data converted in step S15 is reproduced from the sound source 20.
Thereafter, the process proceeds to step S17, where it is determined whether or not the reproduction of the script data 11 has ended, and if not, the process returns to step S13, and if so, the reproduction process of the user phrase synthesis dictionary data 13 shown in fig. 8 ends.
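An outline of this reproduction flow (steps S11 to S17), written as a hypothetical sketch with illustrative event and helper names, might look as follows.

```python
# Hypothetical outline of the flow of fig. 8; the event structure and helper names are assumptions.
def reproduce_hv_script(script_events, default_dict, user_phrase_dict, user_tone_params):
    for event in script_events:                                   # S12: reproduce the registered HV script
        if event.kind == "X":                                     # S13: sound quality change event found
            for unit, phrase_id in user_tone_params[event.number].assignments.items():
                default_dict[unit] = user_phrase_dict[phrase_id]  # S14: replace default dictionary entries
        elif event.kind == "text":
            frames = to_formant_frames(event.text, default_dict)  # S15: converter 16 + HV driver
            play(frames)                                          # S16: reproduce via the sound source 20
    # S17: processing ends when the whole script has been interpreted

def to_formant_frames(text, dictionary):                          # placeholder for the converter/driver stage
    return [frame for ch in text for frame in dictionary.get(ch, [])]

def play(frames):                                                 # placeholder for the FM sound source
    pass
```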
Next, a method of creating the SMAF file 30 shown in fig. 4 will be described with reference to fig. 9. Fig. 9 is a flowchart showing a method of creating the SMAF file 30.
First, according to the steps shown in fig. 7, the user phrase synthesis dictionary data 13, the user tone color parameters 12, and the scenario data 11 are created (see step S21).
Then, in step S22, the SMF file 41 including an event for controlling the sound emission of the music data and the HV script data is created.
Next, the SMF file 41 is input to the HV authoring tool 42 shown in fig. 5, and the SMF file 41 is converted into an SMAF file 43 (corresponding to the aforementioned SMAF file 30) by the HV authoring tool 42 (refer to step S23).
Then, the user tone color parameters 12 created in step S21 are entered into the HV creation data block 37 in the HV block 36 of the SMAF file 30 shown in fig. 4, and the user phrase synthesis dictionary data 13 created in step S21 is entered into the HV user phrase dictionary block 38 in the HV block 36 of the SMAF file 30, whereby the resulting SMAF file 30 is created and output (refer to step S24).
Next, the reproduction process of the SMAF file 30 will be described with reference to fig. 10, and fig. 10 is a flowchart of the reproduction process of the SMAF file 30.
First, in step S31, the SMAF file 30 is registered in the middleware of the speech reproducing apparatus 1 shown in fig. 1. At this time, the speech reproducing apparatus 1 registers the music data portion of the SMAF file 30 in the music reproducing section of the middleware and prepares for reproduction.
In step S32, the speech reproducing apparatus 1 determines whether the HV block 36 is included in the SMAF file 30.
If the determination result at step S32 is yes, the flow proceeds to step S33, and the speech reproducing apparatus 1 interprets the contents of the HV block 36.
In step S34, the speech reproducing apparatus 1 registers the user tone color parameters, the user phrase dictionary data, and the HV script data.
If the determination result at step S32 is "no" or if the registration processing at step S34 is completed, the process proceeds to step S35, and the speech reproducing device 1 interprets the blocks in the music reproducing unit.
Then, the speech reproducing device 1 starts interpreting the sequence data (i.e., the actual performance data) in the sequence data block 35 in response to the "start" signal, thereby performing music reproduction (refer to step S36).
In the above-described music reproduction, the speech reproducing apparatus 1 interprets the events included in the sequence data in order, and in the process, determines whether or not each event corresponds to HV note-on (refer to step S37).
If the determination result at step S37 is yes, the flow advances to step S38, and the speech reproducing device 1 starts reproducing HV script data of the HV block specified by the HV note-on.
After step S38 is completed, the speech reproducing apparatus 1 performs the reproduction processing of the user phrase synthesis dictionary data shown in fig. 8, that is, in the reproduction of the HV script data in step S38, the speech reproducing apparatus 1 determines whether or not there is a sound quality change event (X event) specifying the user sound color parameter 12 (refer to step S39).
If the sound quality change event is present, that is, if the determination result at step S39 is yes, the flow proceeds to step S40, where the phrase ID assigned to the syllable is searched for from the user timbre parameters 12, data corresponding to the phrase ID is read from the user phrase synthesis dictionary data 13, and the dictionary data of the corresponding syllable in the default synthesis dictionary data 19 managed by the HV driver is replaced with the user phrase synthesis dictionary data. The replacement processing of step S40 may be performed before the HV script data is reproduced.
After the end of step S40 or when no sound quality change event is found in step S39, the flow proceeds to step S41, and the converter 16 interprets the syllables of the script data 11 and finally converts the syllables into formant frame string data by the HV driver.
Then, the flow proceeds to step S42, and the HV reproduction unit of the sound source 20 reproduces the data converted in step S41.
Thereafter, the flow advances to step S43, and the speech reproducing apparatus 1 determines whether or not the reproduction of the music has ended. When the music reproduction is completed, the reproduction process of the SMAF file 30 is completed, whereas when the music reproduction is not completed, the flow returns to step S37.
In step S37, when the event in the sequence data is not HV note-on, the speech reproducing apparatus 1 recognizes the event as a part of music data and converts the event into sound source reproduction event data (see step S44).
The flow then proceeds to step S45, and the speech reproducing device 1 reproduces the data converted in step S44 in the music reproducing section of the sound source 20.
As described above, the present embodiment adopts the speech reproducing method using formant synthesis of an FM sound source, and has the following 3 advantages.
(1) The phrase preferred by the user can be assigned, that is, the voice can be reproduced with a tone color more similar to the preferred tone color without depending on the fixed dictionary.
(2) Since a part of the default synthetic dictionary data 19 is replaced with the user phrase synthetic dictionary data 13, it is possible to avoid an excessive increase in data capacity in the speech reproducing apparatus 1. Since a part of the default synthetic dictionary data 19 can be replaced with an arbitrary phrase, it is possible to realize pronunciation by phrase unit and eliminate the sense of auditory dissonance caused by the connection point between the pronunciation units in the conventional synthesized speech by pronunciation unit.
(3) Since an arbitrary phrase can be specified in the HV script data, speech synthesis of syllable units and speech pronunciation of phrase units can be used in combination.
In addition, according to this embodiment, it is possible to realize a change in tone color according to the formant intensity, as compared with a method of reproducing waveform data constituted by sampling phrases in advance. Although the data size and quality in the present embodiment depend on the frame rate, high-quality speech reproduction can be realized with much less data capacity than the conventional method using sampled waveform data. Therefore, for example, the speech reproducing apparatus 1 of the present embodiment can be easily incorporated in a mobile communication terminal such as a mobile phone, and therefore, contents of e-mail and the like can be reproduced with high-quality speech.
Fig. 11 is a block diagram showing the configuration of a speech/music reproducing device according to a second embodiment of the present invention. Here, an HV script (i.e., HV script data) is a file in a format defined for reproducing speech: it defines data for performing speech synthesis, including a phonetic character string containing prosodic symbols (i.e., symbols specifying the sound generation form, such as pitch), settings for the sound to be generated, and information for the reproduction application, and it is created by text input so that the user can create it easily.
The HV script is read by application software such as a text editor, and may be described in a file format in which a text can be edited. The HV-script has language dependency and can be defined in various languages, and in the present embodiment, the HV-script is defined in Japanese.
Reference numeral 101 denotes an HV script player (HV-Script player) that controls reproduction, stopping, and the like of an HV script. When an HV script is registered in the HV script player 101 and an instruction for its reproduction is received, the HV script player 101 starts interpreting the HV script. Then, depending on the type of each event described in the HV script, the corresponding processing is requested of the HV driver 102, the waveform playback player 104, or the phrase reproduction player 107.
The HV driver 102 reads and refers to synthesis dictionary data stored in a ROM (read-only memory), not shown. A human voice has predetermined formants (i.e., characteristic spectra) determined by the structure of the human body (for example, the shape of the vocal cords and mouth), and the synthesis dictionary data stores parameters related to the formants of speech in association with pronunciation characters. The synthesis dictionary data corresponds to a database in which parameters obtained by sampling and analyzing actual speech for each phonetic character unit (for example, phoneme units such as "あ" and "い" in Japanese) are stored in advance, for each phonetic character unit, as formant frame data.
For example, in the case of the CSM (Composite Sinusoidal Modeling) speech synthesis method described above, the synthesis dictionary data stores 8 sets of resonance frequencies, formant intensities, pitches, and the like as parameters. Such a speech synthesis method has an advantage that the data amount is very small compared to a reproduction method of waveform data created by sampling speech. In addition, parameters for controlling the sound quality of the reproduced speech (for example, parameters for designating changes of 8 sets of resonance frequencies and formant intensities) may be stored in the synthesis dictionary data.
The HV driver 102 interprets a phonetic character string including a prosodic symbol in the HV script, converts the phonetic character string into a formant frame string using the synthetic dictionary data, and outputs the formant frame string to the HV sound source 103. The HV sound source 103 generates a sound signal from the formant frame string output from the HV driver 102 and outputs it to the adder 110.
The waveform playback player 104 performs reproduction or stopping of waveform data such as sampled voice, music, and analog sound. Reference numeral 105 denotes a waveform data RAM (random-access memory), in which default waveform data is stored in advance. The user can store user waveform data held in the user data RAM112 into the waveform data RAM105 via a registration API (registration application programming interface) 113. When the waveform playback player 104 receives a playback instruction from the HV script player 101, it reads the waveform data from the waveform data RAM105 and outputs it to the waveform reproducer 106. The waveform reproducer 106 generates a speech signal from the waveform data output from the waveform playback player 104 and outputs it to the adder 110. The sampled waveform data is not limited to the PCM (pulse-code modulation) format and may be in a voice compression format such as MP3 (MPEG-1 Audio Layer 3).
The phrase reproduction player 107 reproduces or stops musical composition phrase data (or music data). The musical composition phrase data is in the SMF format and consists of note information representing the pitch, volume, and the like of the sounds to be generated, and time information representing when those sounds are to be generated. Reference numeral 108 denotes a musical composition phrase data RAM in which default musical composition phrase data is stored in advance. The user can store user musical composition phrase data held in the user data RAM112 into the musical composition phrase data RAM108 via the registration API 113.
Upon receiving a reproduction instruction from the HV-script player 101, the phrase reproduction player 107 reads musical composition phrase data from the RAM108 for musical composition phrase data, performs time management of note information in the musical composition phrase data, and outputs the note information to the phrase sound source 109 based on the time information described in the musical composition phrase data. The phrase sound source 109 generates musical tone signals based on the note information output from the phrase reproduction player 107, and outputs the musical tone signals to the adder 110. The phrase sound source 109 may be of FM type or PCM type, but if it has a function of reproducing phrase data, the sound source is not necessarily limited to this type.
The adder 110 sums the utterance signal output from the HV sound source 103, the speech signal output from the waveform reproducer 106, and the musical tone signal output from the phrase sound source 109, and outputs the resulting signal to the speaker 111. The speaker 111 emits voice and/or musical tones according to the summed signal from the adder 110.
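As a trivial illustration of the adder 110, the three signal paths could be mixed sample by sample as in the sketch below (equal-length sample buffers are assumed and clipping handling is omitted).

```python
# Minimal sketch of the adder 110: sample-wise summing of the three signal paths.
def mix(hv_signal, waveform_signal, phrase_signal):
    return [a + b + c for a, b, c in zip(hv_signal, waveform_signal, phrase_signal)]
```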
The HV driver 102, the waveform playback player 104, and the phrase reproduction player 107 may perform their processing simultaneously so that voice and a musical composition are sounded at the same time based on the utterance signal, the speech signal, and the musical tone signal (the speech signal and the musical tone signal may be collectively referred to as the "sound signal"). Alternatively, the HV script player 101 may manage the processing timings of the HV driver 102, the waveform playback player 104, and the phrase reproduction player 107 and reproduce voice and music simultaneously based on the respective processing. In the present embodiment, simultaneous processing by the HV driver 102, the waveform playback player 104, and the phrase reproduction player 107 is prohibited. In fig. 11, for convenience of explanation, the RAMs are shown separately as the waveform data RAM105, the musical composition phrase data RAM108, and the user data RAM112, but these functions may be allocated to different memory areas of a single RAM.
Fig. 12 shows an example of the definition of events, described in an HV script, for reproducing waveform data or musical composition phrase data (hereinafter collectively referred to as "audio data"). The leading letter "D" of an event means a default definition, and "○" means a user definition. Each event is assigned a category of either waveform or phrase. Default waveform data stored in advance in the waveform data RAM105 or default musical composition phrase data stored in advance in the musical composition phrase data RAM108 is assigned to the default definitions (D0 to D63), so that 64 pieces of default waveform data and default musical composition phrase data can be assigned. Sampled waveform data or musical composition phrase data arbitrarily created by the user is assigned to the user definitions (○0 to ○63), so that 64 pieces of sampled waveform data or musical composition phrase data can be assigned.
For an event whose category shown in fig. 12 is waveform, data indicating the correspondence between the event and the waveform data it represents is stored in advance in the waveform data RAM105. Likewise, for an event whose category is phrase, data indicating the correspondence between the event and the musical composition phrase data it represents is stored in advance in the musical composition phrase data RAM108. When the user registers waveform data or musical composition phrase data in the user data RAM112, this correspondence data is updated.
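A hypothetical table for these event definitions, with illustrative Python names, could look like this; the registration call stands in for what the registration API and user data RAM do.

```python
# Illustrative event tables for fig. 12: 64 default ("D") and 64 user ("○") slots,
# each tagged as waveform data or musical composition phrase data once registered.
default_events = {f"D{i}": {"type": None, "data": None} for i in range(64)}
user_events = {f"○{i}": {"type": None, "data": None} for i in range(64)}

def register_user_event(number: int, kind: str, data: bytes) -> None:
    """Register user-created data under a ○ event; kind is 'waveform' or 'phrase'."""
    user_events[f"○{number}"] = {"type": kind, "data": data}
```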
An HV script is described as, for example, "TJK12みなさん○0です。D20". In "TJK12" at the beginning, "T" is a symbol indicating the start of the HV script, and "J" designates the country character code, here indicating that the HV script is written in Japanese. "K12" is a symbol for setting the sound quality and indicates that the 12th sound quality is specified. Furthermore, "みなさん" and "です" are interpreted by the HV driver 102, and the Japanese speech "みなさん" and "です" is emitted from the speaker 111. When pronunciation character strings such as "みなさん" and "です" contain prosodic symbols indicating the pronunciation style, such as intonation (or stress), speech with that intonation (or stress) is uttered.
In the user event "○0", for example, waveform data obtained by sampling a voice saying "Suzuki" is registered. The user event "○0" is interpreted by the waveform playback player 104, whereby the "Suzuki" voice is emitted from the speaker 111. Further, for example, short, up-tempo musical composition phrase data is registered in the user event "○20"; this event is interpreted by the phrase reproduction player 107, whereby a cheerful melody is emitted from the speaker 111. At this time, the reproduced speech becomes "みなさん 鈴木です" (with the musical composition phrase also being reproduced), and waveform data is reproduced only for the "鈴木 (Suzuki)" part. The pronunciation of the voice generated by reproducing the waveform data is more natural than that produced by speech synthesis in pronunciation units such as "みなさん" and "です". In addition, reproducing a waveform that characterizes the pronunciation of the word "Suzuki" allows the reproduced speech to be heard effectively by the user. As described above, describing events that specify the reproduction of waveform data or musical composition phrase data in the HV script allows their reproduction timing to be determined arbitrarily. The manner of describing the HV script is a matter of design and is not limited to the above method.
Next, the operation of the speech/music reproducing device according to the present embodiment will be described with reference to the flowchart of fig. 13. First, the user creates an HV script by a text editor and registers the HV script in the HV script player 101 (see step S101). At this time, if there is waveform data or phrase data generated by the user definition, the registration API113 reads the waveform data or phrase data from the user data RAM 112. The registration API113 stores waveform data in the waveform data RAM105 and musical phrase data in the musical phrase data RAM 108.
Once the user issues a start instruction (step S103), the HV script player 101 starts interpreting the HV script (refer to step S102). The HV script player 101 determines whether an event beginning with "D" or "○" is included in the HV script (step S104); when such an event is found, it determines whether its category is waveform data (step S105). If the category of the event is waveform data, the HV script player 101 instructs the waveform playback player 104 to handle the processing, and the waveform playback player 104 reads the waveform data corresponding to the number following "D" or "○" from the waveform data RAM105 and outputs it to the waveform reproducer 106 (step S106). The waveform reproducer 106 generates a speech signal from the waveform data and outputs it to the speaker 111 via the adder 110 (step S107). In this way, the corresponding speech is emitted from the speaker 111.
If the category of the event is not waveform data in step S105, the flow proceeds to step S108, and the HV script player 101 determines whether the category of the event is musical composition phrase data. If so, the HV script player 101 instructs the phrase reproduction player 107 to handle the processing. The phrase reproduction player 107 reads the musical composition phrase data corresponding to the number following "D" or "○" from the musical composition phrase data RAM108, and outputs the note information in the musical composition phrase data to the phrase sound source 109 based on the time information in the data (see step S109). The phrase sound source 109 generates a musical tone signal based on the note information and outputs it to the speaker 111 via the adder 110 (step S110). In this way, musical tones are emitted from the speaker 111. If it is determined in step S108 that the category of the event is not musical composition phrase data, the speech/music reproducing apparatus of the present embodiment regards the event as one that cannot be processed, and the flow proceeds to step S113.
If an event starting with "D" or "○" is not described in the HV-script in step S104, the HV-script player 101 instructs the HV driver 102 to perform the processing. The HV driver 102 converts the character string into a formant frame string using the synthesis dictionary data, and outputs the formant frame string to the HV sound source 103 (step S111). The HV sound source 103 generates a sound generation signal from the formant frame string and outputs the sound generation signal to the speaker 111 via the adder 110 (step S112). In this way, the speaker 111 emits the corresponding voice.
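The conversion performed by the HV driver can be pictured as a lookup of each pronunciation unit in the synthesis dictionary data, with the stored formant frames concatenated into one frame string. The sketch below uses made-up dictionary entries; real formant control parameters are of course far richer than these placeholder strings.

```python
# Made-up synthesis dictionary: pronunciation unit -> list of formant frames.
SYNTHESIS_DICTIONARY = {
    "み": ["frame(mi)"], "な": ["frame(na)"], "さ": ["frame(sa)"],
    "ん": ["frame(n)"],  "で": ["frame(de)"], "す": ["frame(su)"],
}

def to_formant_frame_string(text):
    """Concatenate the formant frames of every pronunciation unit in the text."""
    frames = []
    for unit in text:
        frames.extend(SYNTHESIS_DICTIONARY.get(unit, []))   # unknown units are skipped here
    return frames

print(to_formant_frame_string("みなさん"))
```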
Each time the processing of an event ends, the HV-script player 101 determines whether interpretation has reached the end of the description of the HV-script (step S113). If a description remaining to be interpreted exists, the flow returns to step S104; when all descriptions of the HV-script have been interpreted, the speech/music reproduction processing shown in fig. 13 ends.
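Putting the steps of fig. 13 together, the interpretation loop can be sketched as below. The three branches stand in for the waveform playback player (steps S106-S107), the phrase playback player (steps S109-S110), and the HV driver with the HV sound source (steps S111-S112); token handling assumes the simple (kind, ...) tuples used in the tokenizer sketch above, and all names are illustrative.

```python
def interpret_hv_script(tokens, waveform_ram, phrase_ram):
    """Dispatch each token of an HV-script, mirroring steps S104-S113 of fig. 13."""
    for token in tokens:
        if token[0] == "event":                       # S104: event found?
            _, _prefix, number = token
            if number in waveform_ram:                # S105: waveform data?
                print("waveform ->", waveform_ram[number])   # S106-S107
            elif number in phrase_ram:                # S108: phrase data?
                print("phrase   ->", phrase_ram[number])     # S109-S110
            # otherwise: an event that cannot be processed is skipped (to S113)
        else:
            print("speech   ->", token[1])            # S111-S112: speech synthesis
    # S113: the loop ends once every description in the script has been interpreted

interpret_hv_script(
    [("text", "みなさん"), ("event", "○", 0), ("text", "です。"), ("event", "D", 20)],
    waveform_ram={0: "suzuki waveform"},
    phrase_ram={20: "happy melody"},
)
```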
"TJK 12 みなさ'/" 0 です "shown as a description example of the HV-script of this embodiment. In the case of D20 ", after the completion of the utterance of the waveform data defined by the event". smallcircle.0 ", the speech of the next sentence" です "must be uttered. For example, when the HV script player 101 interprets an event of waveform data (or phrase data), the reproduction of the next event is temporarily postponed, and when the sound generation by the waveform reproduction player 104 (or phrase reproduction player 107) is finished, a signal indicating the end of the sound generation is output from the waveform reproduction player 104 to the HV script player 101.
When the HV driver 102, the waveform playback player 104, and the phrase playback player 107 are allowed to perform reproduction processing simultaneously, the reproduction timing may be controlled by the description of the HV-script. For example, when the HV-script describes "TJK12 みなさん○0 3です。D20", the space and the "3" following "○0" indicate an event that sets a predetermined silent period, and the sound reproduced by the HV driver 102 is controlled so as to remain silent while the "suzuki" phrase specified by "○0" is sounding. When the HV-script describes "TJK12 こんにちは。D20みなさん○0です。" ("こんにちは" meaning "hello"), the music phrase specified by "D20" and the speech "みなさん suzuki です" are sounded simultaneously.
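The sequential behaviour described above (waiting for the end-of-sound-generation signal before handling the next event) can be pictured with the small sketch below. The durations and helper names are illustrative assumptions, and a real device would be event-driven rather than sleeping.

```python
import time

def play_blocking(label, duration_s):
    """Stand-in for a player that reports back when its sound generation ends."""
    print("start", label)
    time.sleep(duration_s)        # sound generation in progress
    print("end  ", label)         # corresponds to the end-of-sound-generation signal

def reproduce_sequentially(events):
    # The next event is postponed until the previous player reports completion.
    for label, duration_s in events:
        play_blocking(label, duration_s)

reproduce_sequentially([
    ("speech: みなさん", 0.3),
    ("waveform ○0: suzuki", 0.2),
    ("speech: です", 0.3),
    ("phrase D20: melody", 0.5),
])
```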
Fig. 14 is a block diagram showing the configuration of a mobile phone including the speech/music reproducing device according to the present embodiment. Here, reference numeral 141 denotes a CPU that controls each section of the mobile phone. Reference numeral 142 denotes an antenna for transmitting and receiving data. Reference numeral 143 denotes a communication unit that modulates transmission data and outputs the modulated data to the antenna 142, and demodulates reception data received by the antenna 142. Reference numeral 144 denotes a voice processing unit that, during a call, converts the voice data of the call partner output from the communication unit 143 into a voice signal and outputs it to a near-ear speaker (or an earphone, not shown), and converts a voice signal input from a microphone (not shown) into voice data and outputs the voice data to the communication unit 143.
Reference numeral 145 denotes a sound source having the same functions as the HV sound source 103, the waveform reproducer 106, and the phrase sound source 109 shown in fig. 11. Reference numeral 146 denotes a speaker that emits the desired voice or musical tones. Reference numeral 147 denotes an operation unit operated by the user. Reference numeral 148 denotes a RAM that stores HV-scripts, user-defined waveform data and music phrase data, and the like. Reference numeral 149 denotes a ROM that stores the program executed by the CPU 141 as well as the synthesis dictionary data, default waveform data, default music phrase data, and the like. Reference numeral 150 denotes a display unit that displays the result of the user's operation, the state of the mobile phone, and the like on a screen. Reference numeral 151 denotes a vibrator that generates vibration in response to an instruction from the CPU 141 when the mobile phone receives an incoming call. The above functional blocks are connected to one another via a bus B.
The mobile phone also has a function of generating waveform data from voice: voice input from the microphone is sent to the voice processing unit 144, converted into waveform data, and stored in the RAM 148. When music phrase data is downloaded from a WEB server via the communication unit 143, that music phrase data is also stored in the RAM 148.
The CPU 141 performs the same processing as the HV-script player 101, the HV driver 102, the waveform playback player 104, and the phrase playback player 107 shown in fig. 11, in accordance with the program stored in the ROM 149. The CPU 141 interprets the events described in an HV-script read from the RAM 148. When an event indicates that speech is to be uttered by speech synthesis, the CPU 141 reads and refers to the synthesis dictionary data in the ROM 149, converts the character string described in the HV-script into a formant frame string, and outputs the formant frame string to the sound source 145.
When an event indicates reproduction of waveform data, the CPU 141 reads the waveform data of the number following "D" or "○" in the HV-script from the RAM 148 or the ROM 149 and outputs the waveform data to the sound source 145. When an event indicates reproduction of a music phrase, the CPU 141 reads the music phrase data of the number following "D" or "○" in the HV-script from the RAM 148 or the ROM 149, and outputs the note information in the music phrase data to the sound source 145 based on the time information in the music phrase data.
The sound source 145 generates a synthesized sound signal from the formant frame string output from the CPU 141 and outputs the synthesized sound signal to the speaker 146. It also generates a voice signal from the waveform data output from the CPU 141 and outputs it to the speaker 146, and generates a musical tone signal based on the music phrase data output from the CPU 141 and outputs it to the speaker 146. The speaker 146 emits voice or musical tones as appropriate based on the synthesized sound signal, the voice signal, or the musical tone signal.
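Whether the summation is performed by the adder 110 of fig. 11 or inside the sound source 145, the underlying idea is that the synthesized sound signal, the voice signal from waveform data, and the musical tone signal are added sample by sample before reaching the speaker. The toy numeric sketch below illustrates this; the sample values are arbitrary, and a real mixer would also scale or clip the sum.

```python
def mix(*signals):
    """Sum several sample streams into one output stream for the speaker."""
    length = max(len(s) for s in signals)
    mixed = [0.0] * length
    for signal in signals:
        for i, sample in enumerate(signal):
            mixed[i] += sample
    return mixed

synthesized = [0.2, 0.1, 0.0, -0.1]   # from the formant frame string
waveform    = [0.0, 0.3, 0.3,  0.0]   # from the sampled waveform data
melody      = [0.1, 0.1, 0.1,  0.1]   # from the music phrase data
print(mix(synthesized, waveform, melody))
```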
When the user operates the operation unit 147 to start text-editing software, the user can create an HV-script while checking the content displayed on the screen of the display unit 150, and the HV-script created in this way can be stored in the RAM 148.
In addition, an HV-script created by the user may be used as the incoming call ring tone. In this case, setting information indicating which HV-script is to be used when the mobile phone receives an incoming call is stored in the RAM 148. That is, when the communication unit 143 receives call information transmitted from another mobile phone via the antenna 142, the communication unit 143 notifies the CPU 141 of the incoming call. The CPU 141, having received the incoming call notification, reads the setting information from the RAM 148, reads the HV-script indicated by the setting information from the RAM 148, and starts interpreting it. The subsequent processing is as described above; that is, the speaker 146 emits voice or musical tones according to the types of the events described in the HV-script.
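The ring-tone use can be summarized as: the setting information names an HV-script, and the CPU starts interpreting that script when the communication unit reports an incoming call. A minimal sketch with assumed class and key names follows; none of them come from the patent.

```python
class Handset:
    def __init__(self, ram, interpret_script):
        self.ram = ram                            # stand-in for RAM 148
        self.interpret_script = interpret_script  # e.g. the interpretation loop sketched earlier

    def on_incoming_call(self):
        # The communication unit has notified the CPU of an incoming call.
        name = self.ram["setting"]["ringtone_script"]   # read the setting information
        script = self.ram["scripts"][name]              # read the HV-script it points to
        self.interpret_script(script)                   # speech/music is then emitted

handset = Handset(
    ram={"setting": {"ringtone_script": "hello"},
         "scripts": {"hello": "みなさん○0です。D20"}},
    interpret_script=lambda script: print("interpreting:", script),
)
handset.on_incoming_call()
```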
In addition, the user can attach an HV-script to an e-mail to be sent to another terminal. Further, the CPU 141 may interpret the text of an e-mail as an HV-script and, upon receiving an instruction from the user, output an instruction to reproduce the HV-script based on the description in the e-mail to the voice processing unit 144. The functions of the HV-script player 101, the HV driver 102, the waveform playback player 104, and the phrase playback player 107 need not all be borne by the CPU 141; for example, some of these functions may be supported by the sound source 145.
The present embodiment is applicable not only to a cellular phone but also to portable terminals such as a PHS (Personal Handyphone System; a registered trademark in Japan) or a PDA (personal digital assistant), which can then perform the voice and music reproduction described above.
Further, as a flexible application example of the present embodiment, a portable mobile terminal such as a mobile phone can accept an HV-script created by the user; thus, a general user can easily create an HV-script that reproduces not only characters by speech synthesis but also sampled waveform data or music phrase data. In addition, when a portable terminal for transmitting and receiving e-mail is provided with the speech/music reproducing device of the present embodiment, the user can operate the portable mobile terminal to attach an HV-script to an e-mail to be transmitted. The portable mobile terminal on the receiving side can then reproduce from the received e-mail not only the characters by speech synthesis but also the sampled waveform data or music phrase data as appropriate. Reproduction of voice and music using the HV-script can also be utilized as an incoming call ring tone.
Next, a speech/music reproducing device according to a third embodiment of the present invention will be described with reference to figs. 15 and 16. The third embodiment is configured by combining the first embodiment and the second embodiment: the middleware performs HV reproduction, waveform reproduction, and phrase reproduction; the sound source generates sound signals based on these three types of data; and the signals of the three systems are synthesized and output to the speaker.
Fig. 15 shows a configuration in which the configuration of fig. 1 is basically combined with part of the configuration of fig. 11; reference numerals 211 to 219 correspond to reference numerals 11 to 19 of fig. 1, and reference numerals 303 to 313 correspond to reference numerals 103 to 113 of fig. 11. That is, the user data RAM of fig. 11 is connected to the middleware of fig. 1 through a user data API, and the waveform playback player and the phrase playback player of fig. 11 are added to the middleware and are connected to the waveform data RAM and the music phrase data RAM, respectively. The application software also functions as the HV-script player of fig. 11 and, according to the type of event described in the HV-script, instructs processing to any of the HV conversion, the waveform playback player, and the phrase playback player via the middleware API. The sound source also has the three functions of the HV sound source, the waveform reproducer, and the phrase sound source shown in fig. 11, and their output signals are synthesized by an adder and emitted from the speaker. The operation of each component shown in fig. 15 is the same as that of the corresponding component shown in figs. 1 and 11, and therefore detailed description thereof is omitted.
Fig. 16 is a flowchart showing the operation of the speech/music reproducing device shown in fig. 15. It is based on the flowchart shown in fig. 13, with part of the flowchart shown in fig. 8 added; reference numerals S211 to S216 correspond to reference numerals S11 to S16 of fig. 8, and reference numerals S304 to S310, S312, and S313 correspond to reference numerals S104 to S110, S112, and S113 of fig. 13. That is, when the determination result of the step corresponding to step S104 of fig. 13 is "no", the same processing as steps S13, S14, and S15 of fig. 8 is executed, and then the HV sound source reproduction processing is executed in step S312. In this way, by inputting one HV-script, it is possible to reproduce speech from the HV sound source, reproduce waveform data by the waveform playback player, and reproduce a music phrase based on note information by the phrase playback player. The processing of each step shown in fig. 16 is the same as that in figs. 8 and 13, and therefore detailed description thereof is omitted.
Finally, the prosodic symbols used in the foregoing embodiments will be explained. For example, when "は^3ま$りま$5し>10た" is described in an HV-script, the character string to be pronounced, "はまりました", is synthesized with a predetermined intonation (イントネーション) added; here "^", "$", ">", and the like correspond to prosodic symbols. A predetermined pitch (tone) change is added to the character following the prosodic symbol (or, when the prosodic symbol is immediately followed by a numerical value, to the character following the numerical value).
Specifically, "^" indicates pitch-up in the pronunciation, "$" indicates pitch-down in the pronunciation, and ">" indicates volume-down in the pronunciation, and speech synthesis is performed based on these symbols. In the case where the prosodic symbol is followed by a numerical value, the numerical value is used to specify the amount of change in the additional intonation. For example, when the word "は ^3 contacts ま", は "sounds at the standard pitch and the pitch, the next word" ま "sounds at the increased pitch by" increasing "the pitch" 3 "during the sound.
In this way, when a predetermined intonation (pitch change) is to be added to a character included in the speech to be uttered, the prosodic symbol described above (followed, if necessary, by a numerical value indicating the amount of change) is written immediately before that character. The prosodic symbols here control the pitch or the volume of the sound during pronunciation, but the symbols are not limited to these; symbols that control the sound quality or the speaking speed may also be used. By adding such symbols to the HV-script, the manner of sound generation, such as the pitch, can be expressed appropriately.
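A small sketch of how the prosodic symbols could be separated from the characters they modify is given below. Only the meanings of "^", "$", and ">" and the rule that a numerical value gives the amount of change are taken from the description; the default amount of 1, the reset after one character, and all names are assumptions made for illustration.

```python
import re

PROSODY = re.compile(r"([\^$>])(\d*)")

def parse_prosody(text):
    """Return (character, pitch_delta, volume_delta) for each character to be uttered."""
    result, pitch, volume = [], 0, 0
    i = 0
    while i < len(text):
        match = PROSODY.match(text, i)
        if match:
            symbol, amount = match.group(1), int(match.group(2) or 1)
            if symbol == "^":
                pitch = amount          # raise the pitch of the following character
            elif symbol == "$":
                pitch = -amount         # lower the pitch of the following character
            else:
                volume = -amount        # lower the volume of the following character
            i = match.end()
            continue
        result.append((text[i], pitch, volume))
        pitch, volume = 0, 0            # assumed: the change applies to that character only
        i += 1
    return result

print(parse_prosody("は^3ま$りま$5し>10た"))
```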
The present invention is not limited to the above-described embodiments, and modifications within the scope of the present invention are included in the present invention.
Claims (12)
1. A speech reproducing device comprising a storage device storing synthesis dictionary data, an operation device, and a speech synthesis device; wherein,
the storage device stores in advance formant frame data corresponding to predetermined pronunciation units in association with those pronunciation units;
the operation device registers, in accordance with a user operation, user phrase data to be substituted for the formant frame data corresponding to a pronunciation unit stored in the synthesis dictionary data;
the speech synthesis device receives a character string composed of a plurality of pronunciation units, interprets script data including event data that indicates replacement of the formant frame data corresponding to at least a part of the character string, reads formant frame data from the synthesis dictionary data for the portion of the character string other than said part, reads the user phrase data based on the event data and said part of the character string, and generates synthesized speech based on the read formant frame data and the read user phrase data.
2. The speech reproducing device according to claim 1, wherein the user phrase data is data corresponding to desired formant frame data.
3. The speech reproducing device according to claim 1 or 2, wherein an information structure for reproducing prescribed data in synchronization with music and speech is defined, and the speech synthesis is performed using a data exchange format containing the user phrase data.
4. The speech reproducing device according to claim 3, wherein the music reproduction information contained in the data exchange format is reproduced as it is, and the voice reproduction information is reproduced by the speech synthesis device.
5. A portable terminal device comprising the speech reproducing device according to any one of claims 1 to 4.
6. A speech reproducing device comprising a first storage device storing sound data, a second storage device storing script data, a reproduction instructing device, a synthesized speech signal generating device, a sound signal generating device, and a synthesized voice generating device; wherein,
the script data describes a character string to be uttered by speech synthesis and event data indicating reproduction of the sound data;
the reproduction instructing device reads the script data from the second storage device, instructs utterance based on the character string in the script data, and instructs reproduction of the sound data based on the event data in the script data;
the synthesized speech signal generating device performs speech synthesis in accordance with the utterance instruction for the character string from the reproduction instructing device, and generates a synthesized speech signal;
the sound signal generating device reads the sound data from the first storage device in accordance with the instruction to reproduce the sound data from the reproduction instructing device, and generates a sound signal based on the sound data;
the synthesized voice generating device generates synthesized voice from the synthesized speech signal, and generates sound from the sound signal.
7. The speech reproducing device according to claim 6, wherein the sound data is waveform data generated by sampling a predetermined sound.
8. The speech reproducing device according to claim 6, wherein the sound data is music data including note information indicating the pitch and volume of a sound to be generated.
9. The speech reproducing device according to claim 6, wherein the synthesized speech signal generating device stores formant control parameters characterizing the pronunciation of characters, and performs speech synthesis using the formant control parameters corresponding to the character string in the script data.
10. The speech reproducing device according to any one of claims 6 to 9, wherein the script data is described in a file made of text data.
11. The speech reproducing device according to claim 6, wherein synthesis dictionary data is provided in which formant frame data corresponding to a predetermined pronunciation unit is stored in association with the pronunciation unit;
user data to be substituted for the formant frame data corresponding to a pronunciation unit stored in the synthesis dictionary data is registered in association with that pronunciation unit in accordance with a user operation; and
when event data indicating replacement of the formant frame data corresponding to a part of the character string is supplied to the synthesized speech signal generating device, formant frame data is read from the synthesis dictionary data for the portion of the character string other than said part, the user data is read based on the event data and said part of the character string, and synthesized speech is generated based on the read formant frame data and the read user data.
12. A portable terminal device comprising the speech reproducing device according to any one of claims 6 to 11.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2003152895A JP4244706B2 (en) | 2003-05-29 | 2003-05-29 | Audio playback device |
JP2003152895 | 2003-05-29 | ||
JP2003340171A JP2005107136A (en) | 2003-09-30 | 2003-09-30 | Voice and musical piece reproducing device |
JP2003340171 | 2003-09-30 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1573921A true CN1573921A (en) | 2005-02-02 |
CN1310209C CN1310209C (en) | 2007-04-11 |
Family
ID=34525345
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNB2004100474146A Expired - Fee Related CN1310209C (en) | 2003-05-29 | 2004-05-28 | Speech and music regeneration device |
Country Status (4)
Country | Link |
---|---|
KR (1) | KR100612780B1 (en) |
CN (1) | CN1310209C (en) |
HK (1) | HK1069433A1 (en) |
TW (1) | TWI265718B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101102191B1 (en) * | 2005-09-02 | 2012-01-02 | 주식회사 팬택 | Apparatus And Method For Modifying And Playing Of Sound Source In The Mobile Communication Terminal |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100279741B1 (en) * | 1998-08-17 | 2001-02-01 | 정선종 | Operation Control Method of Text / Speech Converter Using Hypertext Markup Language Element |
JP2001051688A (en) * | 1999-08-10 | 2001-02-23 | Hitachi Ltd | Electronic mail reading-aloud device using voice synthesization |
JP2002073507A (en) * | 2000-06-15 | 2002-03-12 | Sharp Corp | Electronic mail system and electronic mail device |
KR100351590B1 (en) * | 2000-12-19 | 2002-09-05 | (주)신종 | A method for voice conversion |
JP2002221980A (en) * | 2001-01-25 | 2002-08-09 | Oki Electric Ind Co Ltd | Text voice converter |
JP2002366186A (en) * | 2001-06-11 | 2002-12-20 | Hitachi Ltd | Method for synthesizing voice and its device for performing it |
JP2003029774A (en) * | 2001-07-19 | 2003-01-31 | Matsushita Electric Ind Co Ltd | Voice waveform dictionary distribution system, voice waveform dictionary preparing device, and voice synthesizing terminal equipment |
JP3589216B2 (en) * | 2001-11-02 | 2004-11-17 | 日本電気株式会社 | Speech synthesis system and speech synthesis method |
2004
- 2004-05-27 TW TW093115132A patent/TWI265718B/en not_active IP Right Cessation
- 2004-05-28 CN CNB2004100474146A patent/CN1310209C/en not_active Expired - Fee Related
- 2004-05-28 KR KR1020040038415A patent/KR100612780B1/en not_active IP Right Cessation
2005
- 2005-03-08 HK HK05101981A patent/HK1069433A1/en not_active IP Right Cessation
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101694772A (en) * | 2009-10-21 | 2010-04-14 | 北京中星微电子有限公司 | Method for converting text into rap music and device thereof |
CN101694772B (en) * | 2009-10-21 | 2014-07-30 | 北京中星微电子有限公司 | Method for converting text into rap music and device thereof |
Also Published As
Publication number | Publication date |
---|---|
HK1069433A1 (en) | 2005-05-20 |
CN1310209C (en) | 2007-04-11 |
TW200427297A (en) | 2004-12-01 |
KR20040103433A (en) | 2004-12-08 |
KR100612780B1 (en) | 2006-08-17 |
TWI265718B (en) | 2006-11-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN1269104C (en) | Text structure for voice synthesis, voice synthesis method, voice synthesis apparatus, and computer program thereof | |
JP3938015B2 (en) | Audio playback device | |
CN1194336C (en) | Waveform generating method and appts. thereof | |
JP2020030418A (en) | Systems and methods for portable audio synthesis | |
US20080156178A1 (en) | Systems and Methods for Portable Audio Synthesis | |
KR20020094988A (en) | Voice synthesizing method and voice synthesizer performing the same | |
JP2014501941A (en) | Music content production system using client terminal | |
CN100342426C (en) | Singing generator and portable communication terminal having singing generation function | |
JP2007086316A (en) | Speech synthesizer, speech synthesizing method, speech synthesizing program, and computer readable recording medium with speech synthesizing program stored therein | |
CN1692402A (en) | Speech synthesis method and speech synthesis device | |
CN1254786C (en) | Method for synthetic output with prompting sound and text sound in speech synthetic system | |
CN1461464A (en) | Language processor | |
CN1310209C (en) | Speech and music regeneration device | |
KR100634142B1 (en) | Potable terminal device | |
CN1436345A (en) | Terminal device, guide voice reproducing method and storage medium | |
JP2009157220A (en) | Voice editing composite system, voice editing composite program, and voice editing composite method | |
CN1251175C (en) | An audio synthesis method | |
CN1273953C (en) | Audio synthesis system capable of synthesizing different types of audio data | |
CN1354569A (en) | System and method for transmission of music | |
JP4244706B2 (en) | Audio playback device | |
JP2003029774A (en) | Voice waveform dictionary distribution system, voice waveform dictionary preparing device, and voice synthesizing terminal equipment | |
JP4366918B2 (en) | Mobile device | |
KR100650071B1 (en) | Musical tone and human speech reproduction apparatus and method | |
JP2007108450A (en) | Voice reproducing device, voice distributing device, voice distribution system, voice reproducing method, voice distributing method, and program | |
CN1629933A (en) | Sound unit for bilingualism connection and speech synthesis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 1069433; Country of ref document: HK |
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C17 | Cessation of patent right | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20070411; Termination date: 20130528 |