CN1310209C - Speech and music regeneration device - Google Patents


Publication number
CN1310209C
CN1310209C (application CN200410047414A)
Authority
CN
China
Prior art keywords
data
regeneration
pronunciation
speech
user
Prior art date
Legal status
Expired - Fee Related
Application number
CNB2004100474146A
Other languages
Chinese (zh)
Other versions
CN1573921A (en)
Inventor
川岛隆宏
Current Assignee
Yamaha Corp
Original Assignee
Yamaha Corp
Priority date
Filing date
Publication date
Priority claimed from JP2003152895A (patent JP4244706B2)
Priority claimed from JP2003340171A (publication JP2005107136A)
Application filed by Yamaha Corp
Publication of CN1573921A
Application granted
Publication of CN1310209C

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/237: Lexical tools
    • G06F 40/242: Dictionaries
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/15: Speech or voice analysis techniques in which the extracted parameters are formant information
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 2250/00: Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H 2250/471: General musical sound synthesis principles, i.e. sound category-independent synthesis methods
    • G10H 2250/481: Formant synthesis, i.e. simulating the human speech production mechanism by exciting formant resonators, e.g. mimicking vocal tract filtering as in LPC synthesis vocoders, wherein musical instruments may be used as excitation signal to the time-varying filter estimated from a singer's speech

Abstract

An object is to provide a voice reproducing apparatus that can reproduce a desired phrase, given as character string information or the like, as a good-quality voice while allowing its timbre to be changed. The voice reproducing apparatus comprises: a default synthesis dictionary 19 that holds formant frame data for each pronunciation unit in advance; replacement means, constituted by a middleware API (application program interface) 15, a converter 16 and a driver 17, for replacing the formant frame data of a pronunciation unit held in the default synthesis dictionary 19 with arbitrary user data; and speech synthesis means, constituted by the middleware API 15, the converter 16, a sound source 20, etc., for synthesizing speech using the default synthesis dictionary 19 whose held data have been replaced by the replacement means when information enumerating pronunciation units is given. (C) 2005, JPO & NCIPI.

Description

Voice and music reproducing apparatus
Technical field
The present invention relates to a voice and music reproducing apparatus, and more particularly to a voice and music reproducing apparatus that, when specific words are to be reproduced, converts character information into speech by speech synthesis and reproduces speech and music.
Background art
Conventionally, character string/speech converters have been designed that convert character string information, such as e-mail, into speech output. Japanese Laid-Open Patent Publication No. 2001-7937 shows one example of such a conventional converter, in which character string information is divided into clause units and, while the speech is output, its content is shown on a display.
Known methods include reproducing waveform data (sampled data) made by sampling a music phrase or a speech phrase, and reproducing a music phrase composed of note information such as SMF (Standard MIDI File) or SMAF (Synthetic Music Mobile Application File) data. For example, Japanese Laid-Open Patent Publication No. 2001-51688 discloses an e-mail reading apparatus that can separate the character string information and the musical tone information in an e-mail and reproduce each separately.
However, in the conventional character string/speech converter, the character string information is divided into clause (or phrase) units for speech output, so the output is a concatenation of speech in pronunciation units (or character units). When the joints between these pronunciation units are reproduced, the result sounds unnatural to the listener compared with normal human utterance. That is, the conventional converter cannot change the timbre of a clause as a whole and output it as a good-quality voice; in other words, it cannot produce output close to natural human speech.
One conceivable solution to the above problem is to sample the speech of each clause (hereinafter "phrase") in advance, store it as speech data, and output the corresponding speech waveform at reproduction time. However, to improve the quality of the speech output this method must raise the sampling frequency, and must therefore store a large amount of speech data, which poses technical difficulties in devices with limited memory capacity such as mobile phones.
Furthermore, in the conventional method of reproducing waveform data made by sampling music or speech, and in the conventional method of reproducing music data composed of note information such as SMF or SMAF, the reproduction timing of the music or speech is not described in text form. It is therefore difficult to combine speech reproduction based on character string information with waveform data reproduction or music data reproduction according to the user's intention.
Summary of the invention
To solve the above problems, an object of the present invention is to provide a voice reproducing apparatus that can reproduce a desired clause (or phrase) given as character string information or the like as a good-quality voice, while allowing its timbre to be changed.
Another object of the present invention is to provide a voice and music reproducing apparatus with which the user can easily combine speech reproduction with waveform data reproduction and music data reproduction, so that speech and music can be reproduced faithfully to the user's intention.
A voice reproducing apparatus according to the present invention stores, as synthesis dictionary data, a database of formant frame data corresponding to predetermined pronunciation units; when information on a character string enumerating pronunciation units is given, speech is synthesized using the synthesis dictionary data. Here, the formant frame data can be replaced with arbitrary user phrase data, and when character string information is given, speech is synthesized using the synthesis dictionary data in which the user phrase data have been substituted. A timbre parameter for processing the formant frame data is attached to the user phrase data. In addition, a predetermined data interchange format containing the user phrase data is used for speech synthesis. This data interchange format is, for example, the SMAF file format, and can contain not only user phrase data but also various speech or music reproduction information.
Specifically, the above voice reproducing apparatus is constituted by a default synthesis dictionary holding formant frame data corresponding to predetermined pronunciation units, and by a middleware application program interface (API), a converter, a driver and a sound source through which the formant frame data are replaced with user phrase data. In this way a desired phrase given as character string information can be reproduced as a good-quality voice, and its timbre can be changed appropriately during reproduction.
A voice and music reproducing apparatus according to the present invention stores script data (an HV script) describing the pronunciation of text, or pronunciation instructions referring to pronunciation data. According to this script data, a voice signal corresponding to the text is generated to produce the desired speech, and at the same time a sound signal corresponding to the pronunciation data is generated to produce the desired speech or musical tone. Here, the pronunciation data consist, for example, of waveform data generated by sampling speech or music, and a synthesized sound signal is generated from the waveform data. When the pronunciation data are music data containing note information, a note signal corresponding to the note information is generated from the music data. When formant control parameters (formant frame data) characterizing the pronunciation of the text are stored, the voice signal is generated from these formant control parameters. The script data can also be created freely by the user; in that case the script data take a prescribed file format created by text input.
Specifically, the various events described in the HV script are interpreted; when the class of an event indicates waveform data, the waveform data are read out and reproduced, while when the class of an event indicates music phrase data, reproduction processing of the music phrase data is performed. In the latter case, the note data are read out and reproduced according to the time information in the music phrase data. For other events, an input character string is converted into a formant frame string using the synthesis dictionary data, and speech is synthesized. In this way the user can easily combine speech reproduction, waveform data reproduction and music data reproduction.
Description of drawings
Fig. 1 is a block diagram showing the configuration of a voice reproducing apparatus according to a first embodiment of the present invention.
Fig. 2 is a diagram showing the assignment of pronunciation units to phrase IDs.
Fig. 3 shows an example of the content of phrase synthesis dictionary data.
Fig. 4 shows an example of the SMAF file format.
Fig. 5 is a functional block diagram showing an example of an HV authoring tool.
Fig. 6 is a block diagram showing the configuration of a mobile terminal incorporating the voice reproducing apparatus.
Fig. 7 is a flowchart showing creation processing of user phrase synthesis dictionary data.
Fig. 8 is a flowchart showing reproduction processing of user phrase synthesis dictionary data.
Fig. 9 is a flowchart showing creation processing of an SMAF file.
Fig. 10 is a flowchart showing reproduction processing of an SMAF file.
Fig. 11 is a block diagram showing the configuration of a voice and music reproducing apparatus according to a second embodiment of the present invention.
Fig. 12 is a diagram showing an example of the assignment between events and waveform data or music phrase data.
Fig. 13 is a flowchart showing voice and music reproduction processing of the second embodiment.
Fig. 14 is a block diagram showing the configuration of a mobile phone provided with the voice and music reproducing apparatus of the second embodiment.
Fig. 15 is a block diagram showing the configuration of a voice and music reproducing apparatus according to a third embodiment of the present invention.
Fig. 16 is a flowchart showing the operation of the voice and music reproducing apparatus shown in Fig. 15.
Embodiment
Embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 is a block diagram showing the configuration of the voice reproducing apparatus according to the first embodiment of the present invention.
That is, the voice reproducing apparatus 1 shown in Fig. 1 comprises application software 14, a middleware API (middleware application program interface) 15, a converter 16, a driver 17, default timbre parameters (default tone color parameters) 18, default synthesis dictionary data 19 and a sound source 20; it takes as input script data 11, user timbre parameters 12 and user phrase synthesis dictionary data 13 (variable length), and reproduces speech.
The voice reproducing apparatus 1 basically reproduces speech by formant synthesis based on the CSM (composite sinusoidal modeling) speech synthesis method, using an FM (frequency modulation) sound source resource. In the present embodiment, user phrase synthesis dictionary data 13 are defined, and the voice reproducing apparatus 1 refers to these user phrase synthesis dictionary data 13 to assign user phrases to timbre parameters in units of phonemes. When the user phrase synthesis dictionary data 13 are assigned to a timbre parameter in this way, at reproduction time the voice reproducing apparatus 1 replaces the phonemes registered in the default synthesis dictionary data with the user phrase, and then synthesizes speech from the replaced data. The "phoneme" mentioned above is the minimum unit of pronunciation; in the case of Japanese it consists of vowels and consonants.
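The CSM-style formant synthesis described above can be illustrated with a minimal sketch. This is not the patent's implementation (the patent drives an FM sound source resource); it simply renders one frame in software as a sum of eight sinusoids, one per formant, each scaled by its formant level. The sampling rate and the vowel-like formant values are assumptions for illustration.

```python
import math

SAMPLE_RATE = 8000          # assumed sampling rate for this sketch
FRAME_MS = 20               # frame period stated in the embodiment

def synthesize_frame(formant_freqs, formant_levels):
    """Render one 20 ms frame as a plain sum of sinusoids (CSM-style).

    formant_freqs  : 8 formant frequencies in Hz
    formant_levels : 8 formant amplitudes (0.0 .. 1.0)
    """
    assert len(formant_freqs) == len(formant_levels) == 8
    n_samples = SAMPLE_RATE * FRAME_MS // 1000
    frame = []
    for n in range(n_samples):
        t = n / SAMPLE_RATE
        sample = sum(level * math.sin(2 * math.pi * freq * t)
                     for freq, level in zip(formant_freqs, formant_levels))
        frame.append(sample)
    return frame

# Rough vowel-like formant set (illustrative values only)
freqs = [700, 1200, 2600, 3300, 4000, 4500, 5000, 5500]
levels = [1.0, 0.6, 0.3, 0.15, 0.1, 0.05, 0.02, 0.01]
frame = synthesize_frame(freqs, levels)
```

In the actual apparatus the frame parameters come from the synthesis dictionary and are updated every frame period, so consecutive frames trace out the formant trajectories of an utterance.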
The detailed configuration of the voice reproducing apparatus 1 is described below.
In Fig. 1, the script data 11 define a data format for carrying out speech synthesis, called the "HV (human voice) data format", for reproducing speech synthesized by the above method. That is, the script data 11 represent a data format containing a pronunciation character string including prosodic symbols, event data for setting the voice used for pronunciation, event data for controlling the application software 14, and so on; it takes the form of text input so that the user can easily enter it by hand.
The definition of the data format of the script data 11 is language-dependent and can be made for various languages; in the present embodiment it is defined for Japanese.
The user phrase synthesis dictionary data 13 and the default synthesis dictionary data 19 are databases in which a real human voice is sampled and analyzed for each pronunciation character unit (for example, Japanese "あ", "い", etc.), eight sets of formant frequencies and formant levels plus a pitch are extracted as parameters, and these parameters are made into formant frame data in advance and stored in association with the pronunciation character unit. The user phrase synthesis dictionary data 13 are built as a database outside the middleware, and the user can freely register formant frame data in this database; the stored content of the default synthesis dictionary data 19 can therefore be completely replaced, through the middleware API 15, with the registered content of the user phrase synthesis dictionary data 13. That is, the content of the default synthesis dictionary data 19 can be entirely replaced with the content of the user phrase synthesis dictionary data 13. The default synthesis dictionary data 19, on the other hand, are a database built inside the middleware.
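The replacement relation between the two dictionaries can be sketched as follows. The data structures are hypothetical (the patent does not prescribe an in-memory layout) and the frame data are abbreviated to opaque strings; the point is only that a user-registered entry overrides the default dictionary's entry for the same pronunciation unit at lookup time.

```python
# Hypothetical in-memory model of the two dictionaries.
# Real entries would hold formant frame data, not strings.
default_dictionary = {
    "あ": "frames:a-default",
    "い": "frames:i-default",
    "か": "frames:ka-default",
}

user_phrase_dictionary = {}   # built outside the middleware, user-writable

def register_user_phrase(unit, frame_data):
    """Register (or overwrite) formant frame data for a pronunciation unit."""
    user_phrase_dictionary[unit] = frame_data

def lookup(unit):
    """A user-registered entry replaces the default dictionary's entry."""
    if unit in user_phrase_dictionary:
        return user_phrase_dictionary[unit]
    return default_dictionary[unit]

register_user_phrase("あ", "frames:user-phrase")
```

After the registration, a lookup of "あ" yields the user phrase data while "い" still falls through to the default dictionary.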
Preferably, the user phrase synthesis dictionary data 13 and the default synthesis dictionary data 19 each hold two classes of data, one for male voice quality and one for female voice quality. The speech output of the voice reproducing apparatus 1 changes with the period of each frame; the frame period of the formant frame data registered in the user phrase synthesis dictionary data 13 and in the default synthesis dictionary data 19 is set to, for example, 20 ms.
The user timbre parameters 12 and the default timbre parameters 18 are parameter groups that control the voice quality of the speech output of the voice reproducing apparatus 1. That is, the user timbre parameters 12 and the default timbre parameters 18 allow, for example, modification of the eight sets of formant frequencies and formant levels (i.e., specification of the amount of change from the formant frequencies and formant levels registered in the user phrase synthesis dictionary data 13 and the default synthesis dictionary data 19) and specification of the basic waveform used for formant synthesis, so that a variety of timbres can be produced.
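Applying a timbre parameter to a formant frame might look like the following sketch. The field names (`freq_delta`, `level_scale`) are assumptions; the patent only states that amounts of change for the eight frequency/level pairs and a basic waveform can be specified.

```python
def apply_timbre(frame, timbre):
    """Apply a timbre parameter to one formant frame.

    frame  : dict with 'freqs' and 'levels', each a list of 8 values
    timbre : dict with 'freq_delta' (Hz offsets) and 'level_scale'
             (amplitude multipliers), each a list of 8 values
    """
    return {
        "freqs": [f + d for f, d in zip(frame["freqs"], timbre["freq_delta"])],
        "levels": [l * s for l, s in zip(frame["levels"], timbre["level_scale"])],
    }

frame = {"freqs": [700, 1200, 2600, 3300, 4000, 4500, 5000, 5500],
         "levels": [1.0, 0.6, 0.3, 0.15, 0.1, 0.05, 0.02, 0.01]}
brighter = {"freq_delta": [50] * 8, "level_scale": [1.2] * 8}
shifted = apply_timbre(frame, brighter)
```

Because the dictionary frames stay untouched and the timbre parameter is applied at reproduction time, one dictionary can serve many timbres, which matches the division of roles between dictionaries 13/19 and parameters 12/18 described above.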
The default timbre parameters 18 are timbre parameters set in advance in the middleware as default values; the user timbre parameters 12 are parameters that can be created freely by the user, are stored outside the middleware, and extend the content of the default timbre parameters 18 through the middleware API 15.
The application software 14 is software for reproducing the script data 11.
The middleware API 15 constitutes the interface between the application software 14, which consists of software, and the converter 16, driver 17, default timbre parameters 18 and default synthesis dictionary data 19, which constitute the middleware.
The converter 16 interprets the script data 11 and, using the driver 17, finally converts them into a formant frame data string formed by concatenating frame data.
The driver 17 generates the formant frame data string from the pronunciation character string contained in the script data 11 and the default synthesis dictionary data 19, and interprets the timbre parameters to process this formant frame data string.
The sound source 20 outputs a synthesized sound signal corresponding to the output data of the converter 16, and sounds this synthesized sound signal through a loudspeaker.
The technical features of the voice reproducing apparatus 1 of the present embodiment are described below.
The user timbre parameters 12 include a parameter that assigns, to an arbitrary pronunciation unit, a phrase ID stored in the user phrase synthesis dictionary data 13. Fig. 2 shows an example of the assignment of pronunciation units to phrase IDs; here the assignment between morae and phrase IDs is shown. In Japanese, a mora is a "beat" and corresponds, for example, to a kana character unit.
By assigning a phrase ID to a pronunciation unit, it is prescribed that the specified pronunciation unit refers not to the default synthesis dictionary data 19 but to the user phrase synthesis dictionary data 13. In the user timbre parameters 12, it is preferable that an arbitrary number of pronunciation units can be specified from one timbre parameter.
The above assignment of a phrase ID to each pronunciation unit in the user timbre parameters 12 is one example of the present embodiment; any other method may be adopted as long as it corresponds to pronunciation units.
Next, the user phrase synthesis dictionary data 13 are described in detail; Fig. 3 shows an example of their content. The user phrase synthesis dictionary data 13 store formant frame data consisting of eight sets of formant frequencies and formant levels plus a pitch. A "phrase" in Fig. 3 means, for example, Japanese "おはよう" ("good morning"), that is, an expression that carries a meaning or is unified by morae; the scale of a "phrase" need not be specially prescribed and may be an expression of arbitrary scale such as a word, a mora or a sentence.
A tool for creating the user phrase synthesis dictionary data 13 must have an analysis function that loads and analyzes a common audio file (a file with an extension such as *.wav or *.aif) and generates formant frame data consisting of eight sets of formant frequencies and formant levels plus a pitch.
The script data 11 include event data that instruct a change of voice quality; the user timbre parameters 12 can be specified with this event data.
For example, as a description example of the script data 11 using Japanese hiragana and alphanumeric characters, one can write "TJK12みなさんX10あか". In this example, "K" represents event data specifying the default timbre parameters 18, and "X" represents event data specifying the user timbre parameters 12. "K12" is a code selecting one specific default timbre parameter from among several, and "X10" is a code selecting the user timbre parameter shown in Fig. 2 from among several user timbre parameters.
In the above example, the reproduced synthesized speech is "みなさんこんにちは鈴木です。" ("Hello everyone, this is Suzuki."). Here, "みなさん" is synthesized speech reproduced with reference to the default timbre parameters 18 and the default synthesis dictionary data 19, while "こんにちは" and "鈴木です" are synthesized speech reproduced with reference to the user timbre parameters 12 and the user phrase synthesis dictionary data 13. That is, the expression "みなさん" is synthesized speech obtained by reading the formant frame data of the four phonemes "み", "な", "さ", "ん" from the default synthesis dictionary data 19 and reproducing them; the expressions "こんにちは" and "鈴木です" are synthesized speech obtained by reading the formant frame data of each phrase unit from the user phrase synthesis dictionary data 13 and reproducing them.
Although Fig. 2 shows three pronunciation units, "あ", "い" and "か", any characters and symbols that can be noted in text may be used without special restriction. In the above example, the pronunciation symbol "あ" following "X10" indicates that the expression "こんにちは" is pronounced, and the pronunciation symbol "か" indicates that the expression "鈴木です" is pronounced. Therefore, to pronounce the original "あ" after the above example, it suffices to insert a symbol that returns the reference target to the default synthesis dictionary data 19 (for example, "X00", with a prescribed number inserted in place of "00").
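The event syntax of the script example, in which "K" plus a number selects a default timbre and "X" plus a number selects a user timbre while the remaining characters are pronounced, can be sketched with a simple tokenizer. This is a hypothetical parser for illustration only; the real HV script grammar also carries prosodic symbols and other event types not handled here, and the leading language tag is omitted.

```python
import re

# Hypothetical tokenizer: 'K<nn>' selects default timbre parameter <nn>,
# 'X<nn>' selects user timbre parameter <nn> ('X00' returns to the default
# dictionary), everything else is treated as pronunciation text.
TOKEN = re.compile(r"(K\d+|X\d+)")

def parse_hv_script(script):
    events = []
    for part in TOKEN.split(script):
        if not part:
            continue
        if part[0] == "K":
            events.append(("default_timbre", int(part[1:])))
        elif part[0] == "X":
            events.append(("user_timbre", int(part[1:])))
        else:
            events.append(("pronounce", part))
    return events

events = parse_hv_script("K12みなさんX10あか")
```

For the script body of the example above, the tokenizer yields a default-timbre event (12), the text "みなさん", a user-timbre event (10), and the pronunciation symbols "あか", which a reproducer would then resolve against the appropriate dictionary.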
The data interchange format of the music reproduction sequence data (SMAF: Synthetic Music Mobile Application Format) used by the voice reproducing apparatus 1 of the present embodiment is described below with reference to Fig. 4. Fig. 4 shows the SMAF file format. SMAF is one of the data interchange formats used for distributing music data and using them together with sound sources in portable terminals (personal digital assistants (PDAs), personal computers, mobile phones, etc.), and is a data format specification for reproducing multimedia content.
The SMAF file 30 of the data interchange format shown in Fig. 4 has, as its basic structure, a data unit called a chunk. A chunk consists of a fixed-length (8-byte) header and a variable-length body; the header is divided into a 4-byte chunk ID and a 4-byte chunk size. The chunk ID is used as the identifier of the chunk, and the chunk size indicates the length of the body. The SMAF file 30 itself, and the various data it contains, all have this chunk structure.
As shown in Fig. 4, the SMAF file 30 consists of a contents info chunk 31, an optional data chunk 32, a score track chunk 33 and an HV chunk 36.
The contents info chunk 31 stores various management information about the SMAF file 30, for example the class and kind of the content, copyright information, genre name, song name, artist name, lyricist/composer name, and so on. The optional data chunk 32 stores, for example, copyright information, genre name, song name, artist name and lyricist/composer name. The optional data chunk 32 need not be provided in the SMAF file 30.
The score track chunk 33 stores the sequence track of the music sent to the sound source, and contains a setup data chunk 34 (optional) and a sequence data chunk 35.
The setup data chunk 34 stores the timbre data of the sound source and the like, together with exclusive messages; timbre parameter registration information is an example of such an exclusive message.
The sequence data chunk 35 stores the actual performance data; HV note-on events ("HV" stands for human voice), which determine the reproduction timing of the script data 11, are stored mixed with other sequence events. Here, HV events and other music events are distinguished by the channel specified for the HV.
The HV chunk 36 contains an HV setup data chunk 37 (optional), an HV user phrase dictionary chunk 38 (optional) and an HV-S chunk 39.
The HV setup data chunk 37 stores HV user timbre parameters and information specifying the channel used by the HV; the HV-S chunk 39 stores the HV script data.
The HV user phrase dictionary chunk 38 stores the content of the user phrase synthesis dictionary data 13; the HV user timbre parameters in the HV setup data chunk 37 must contain the parameter for assigning morae to phrase IDs shown in Fig. 2.
By applying the SMAF file 30 shown in Fig. 4 to the timbre parameters of the present embodiment, synthesized speech (HV) can be reproduced in synchronization with the music, and the content of the user phrase synthesis dictionary data 13 can be reproduced at the same time.
Next, the tool for creating the user phrase synthesis dictionary data 13 shown in Fig. 1 and the SMAF file 30 shown in Fig. 4, namely the HV authoring tool, is described with reference to Fig. 5. Fig. 5 is a block diagram showing the functions and an example configuration of the HV authoring tool.
When creating the SMAF file 30, the HV authoring tool 42 reads an SMF file (Standard MIDI File) 41 created in advance with a MIDI (Musical Instrument Digital Interface) sequencer (including the notes that determine the sounding timing of HV note-on events), and performs conversion processing to an SMAF file 43 (corresponding to the aforementioned SMAF file 30) according to information obtained from an HV script UI (HV script user interface) 44 and an HV voice editor 45.
The HV voice editor 45 is an editor having a function of editing the HV user timbre parameters (corresponding to the aforementioned user timbre parameters 12) contained in an HV user timbre file 48. Besides editing the various HV timbre parameters, the HV voice editor 45 can also assign user phrases to arbitrary morae.
The interface of the HV voice editor 45 has a menu for selecting a mora, and has a function of assigning an arbitrary audio file 50 to that mora. The audio file 50 assigned through the interface of the HV voice editor 45 is analyzed by a waveform analyzer 46, which thereby generates formant frame data consisting of eight sets of formant frequencies and formant levels plus a pitch. These formant frame data can be output to and input from corresponding files (i.e., the HV user timbre file 48 and the HV user synthesis dictionary file 49).
The HV script UI 44 can directly edit the HV script data, and the HV script data can also be input and output as a corresponding file (i.e., the HV script file 47). The HV authoring tool 40 of the present embodiment may also consist only of the above HV authoring tool 42, HV script UI 44, HV voice editor 45 and waveform analyzer 46.
Next, an example in which the voice reproducing apparatus 1 of the present embodiment is applied to a mobile terminal is described with reference to Fig. 6. Fig. 6 is a block diagram showing the configuration of a mobile terminal 60 provided with the voice reproducing apparatus 1.
The mobile terminal 60 is, for example, a device corresponding to a mobile phone, and is provided with a CPU 61, a ROM 62, a RAM 63, a display unit 64, a vibrator 65, an input unit 66, a communication unit 67, an antenna 68, a speech processing unit 69, a sound source 70, a loudspeaker 71 and a bus 72. The CPU 61 performs overall control of the mobile terminal 60; the ROM 62 stores various communication control programs and control programs such as a program for reproducing music, and also stores various constant data and the like.
The RAM 63 is used as a work area, and also stores music files and various application programs. The display unit 64 consists of, for example, a liquid crystal display (LCD), and the vibrator 65 vibrates when the mobile phone receives an incoming call. The input unit 66 consists of operating elements such as a plurality of keys; according to the user's operation, these operating elements instruct registration processing of user timbre parameters, user phrase synthesis dictionary data and HV script data. The communication unit 67 consists of a modem and the like, and is connected to the antenna 68.
The speech processing unit 69 is connected to a microphone and an earphone, and has a function of encoding and decoding speech signals for telephone conversation. The sound source 70 reproduces music according to the music files stored in the RAM 63 and the like, and also reproduces speech signals and outputs them to the loudspeaker 71. The bus 72 is a transfer path for data transfer between the components of the mobile phone, namely the CPU 61, ROM 62, RAM 63, display unit 64, vibrator 65, input unit 66, communication unit 67, speech processing unit 69 and sound source 70.
The communication unit 67 can download an HV script file or the SMAF file 30 shown in Fig. 4 from a prescribed content server or the like and store it in the RAM 63. The ROM 62 stores the program of the application software 14 and the middleware program of the voice reproducing apparatus 1 shown in Fig. 1. The CPU 61 reads and starts the application software 14 and the middleware program. The CPU 61 interprets the HV script data stored in the RAM 63, generates formant frame data, and sends these formant frame data to the sound source 70.
The operation of the speech regeneration device 1 of the present embodiment will now be described, beginning with the method of creating the user phrase synthesis dictionary 13. Fig. 7 is a flowchart showing the method of creating the user phrase synthesis dictionary 13.
First, at step S1, an HV timbre is selected with the HV authoring tool 42 shown in Fig. 5, with reference to the user phrase synthesis dictionary 13, and the HV voice editor 45 is started. The syllable to be used is then selected with the HV voice editor 45, and an audio file is attached to it. At step S2, the HV voice editor 45 thus generates and outputs user phrase dictionary data (corresponding to the HV user synthesis dictionary file 49).
Next, HV timbre parameters are edited with the HV voice editor 45, and at step S3 the HV voice editor 45 generates and outputs user timbre parameters (corresponding to the user timbre file 48).
Then, a voice-quality change event designating the corresponding HV timbre is written into the HV script data with the HV script UI 44, thereby describing the syllables to be played back. At step S4, the HV script UI 44 generates and outputs HV data (corresponding to the HV script file 47).
The playback operation for the user phrase synthesis dictionary data 13 in the speech regeneration device 1 will now be described with reference to Fig. 8, which is a flowchart of that playback operation.
First, at step S11, the user timbre parameters 12 and the user phrase synthesis dictionary data 13 are registered in the middleware of the speech regeneration device 1. The script data 11 is then registered in the middleware, and playback of the HV script data begins at step S12.
During playback, at step S13, the device monitors whether a voice-quality change event (X event) designating the user timbre parameters 12 is included in the script data 11.
If a voice-quality change event is found at step S13, the phrase ID assigned to the syllable is looked up in the user timbre parameters 12, and the data corresponding to that phrase ID is read from the user phrase synthesis dictionary data 13. At step S14, the dictionary data for the corresponding syllable in the default synthesis dictionary data 19 managed by the HV driver is then replaced with the user phrase synthesis dictionary data 13. The replacement of step S14 may also be performed before playback of the HV script data.
After step S14 is completed, or when no voice-quality change event is found at step S13, the flow proceeds to step S15, where the converter 16 interprets the syllables of the script data 11 (when the processing of step S14 has been performed, the script data after the replacement of step S14), and the HV driver finally transforms them into formant frame string data.
At step S16, the data obtained by the conversion of step S15 is played back by the sound source 20.
Thereafter, at step S17, it is judged whether playback of the script data 11 has finished. If it has not, the flow returns to step S13; if it has, the playback processing of the user phrase synthesis dictionary data 13 shown in Fig. 8 ends.
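The Fig. 8 flow above — scan for X events, swap the affected dictionary entries, then convert syllables to formant frames — can be sketched as follows. This is an illustrative Python sketch only: the event representation, function name, and data shapes are assumptions, not the patent's API.

```python
def play_hv_script(events, user_timbre, user_phrase_dict, default_dict):
    """Sketch of the Fig. 8 playback flow. `events` is an assumed list of
    dicts; `default_dict` stands for the default synthesis dictionary
    data 19, `user_phrase_dict` for the user phrase synthesis dictionary
    data 13, and `user_timbre` maps a syllable to a phrase ID as the
    user timbre parameters 12 are described to do."""
    active_dict = dict(default_dict)  # copy so the default stays intact
    output = []
    for ev in events:
        if ev.get("type") == "X":
            # S13/S14: voice-quality change event -> look up the phrase ID
            # assigned to the syllable and swap in the user phrase entry.
            phrase_id = user_timbre[ev["syllable"]]
            active_dict[ev["syllable"]] = user_phrase_dict[phrase_id]
        elif ev.get("type") == "syllable":
            # S15/S16: convert the syllable via the active dictionary
            # (stand-in for the formant frame string sent to the source).
            output.append(active_dict[ev["syllable"]])
    return output
```

Performing the swap before playback, as the text permits, would simply mean applying all X events to `active_dict` in a first pass.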
The method of creating the SMAF file 30 shown in Fig. 4 will now be described with reference to Fig. 9, which is a flowchart of that creation method.
First, the user phrase synthesis dictionary data 13, the user timbre parameters 12, and the script data 11 are created according to the procedure shown in Fig. 7 (see step S21).
Next, at step S22, an SMF file 41 is created containing events that control the sounding of the music data and of the HV script data.
The SMF file 41 is then input to the HV authoring tool 42 shown in Fig. 5, which transforms it into a SMAF file 43 (corresponding to the aforementioned SMAF file 30) (see step S23).
Then, the user timbre parameters 12 created at step S21 are placed in the HV setup data block 37 within the HV block 36 of the SMAF file 30 shown in Fig. 4, and the user phrase synthesis dictionary data 13 created at step S21 is placed in the HV user phrase dictionary block 38 within the HV block 36, thereby generating the SMAF file 30 for input and output (see step S24).
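The packing performed at step S24 amounts to nesting the user data inside the HV block alongside the script. The sketch below mirrors only the block names of Fig. 4 with a nested dict; the actual SMAF chunk layout and encoding are not given in this text, so everything here is illustrative.

```python
def build_smaf_hv_block(hv_script, user_timbre, user_phrase_dict):
    """Sketch of step S24: place the user timbre parameters (12) in the
    HV setup data block (37) and the user phrase synthesis dictionary
    data (13) in the HV user phrase dictionary block (38), both inside
    the HV block (36). Dict keys mirror Fig. 4 block names only."""
    return {
        "HV_block": {                            # HV block 36
            "HV_setup_block": user_timbre,       # HV setup data block 37
            "HV_user_phrase_dict_block": user_phrase_dict,  # block 38
            "HV_script": hv_script,
        }
    }
```

A real implementation would serialize each block to the binary chunk format SMAF requires; the nesting, however, follows the figure.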
Next, the playback processing of the SMAF file 30 will be described with reference to Fig. 10, which is a flowchart of that processing.
First, at step S31, the SMAF file 30 is registered in the middleware of the speech regeneration device 1 shown in Fig. 1. Here, the speech regeneration device 1 normally registers the file with the melody playback unit of the middleware and prepares to play back the music data in the SMAF file 30.
At step S32, the speech regeneration device 1 judges whether an HV block 36 is included in the SMAF file 30.
When the result of step S32 is "Yes", the flow proceeds to step S33, where the speech regeneration device 1 interprets the content of the HV block 36.
At step S34, the speech regeneration device 1 registers the user timbre parameters, the user phrase dictionary data, and the HV script data.
When the result of step S32 is "No", or when the registration processing of step S34 has finished, the flow proceeds to step S35, where the speech regeneration device 1 interprets the blocks in the melody playback unit.
Then, in response to a "start" signal, the speech regeneration device 1 begins interpreting the sequence data (i.e., the actual performance data) in the sequence data block 35, thereby performing melody playback (see step S36).
During this melody playback, the speech regeneration device 1 interprets the events included in the sequence data in order, judging in the process whether each event corresponds to an HV note-on (see step S37).
When the result of step S37 is "Yes", the flow proceeds to step S38, where the speech regeneration device 1 begins playing back the HV script data of the HV block designated by the HV note-on.
After step S38, the speech regeneration device 1 performs the playback processing of the user phrase synthesis dictionary data shown in Fig. 8; that is, during playback of the HV script data in step S38, the speech regeneration device 1 judges whether a voice-quality change event (X event) designating the user timbre parameters 12 exists (see step S39).
When such a voice-quality change event exists, that is, when the result of step S39 is "Yes", the flow proceeds to step S40: the phrase ID assigned to the syllable is looked up in the user timbre parameters 12, the data corresponding to that phrase ID is read from the user phrase synthesis dictionary data 13, and the dictionary data for the corresponding syllable in the default synthesis dictionary data 19 managed by the HV driver is replaced with the user phrase synthesis dictionary data. The replacement of step S40 may also be performed before playback of the HV script data.
After step S40 is completed, or when no voice-quality change event is found at step S39, the flow proceeds to step S41, where the converter 16 interprets the syllables of the script data 11 and the HV driver finally transforms them into formant frame string data.
The flow then proceeds to step S42, where the data transformed at step S41 is played back in the HV playback unit of the sound source 20.
Thereafter, the flow proceeds to step S43, where the speech regeneration device 1 judges whether melody playback has finished. If it has, the playback processing of the SMAF file 30 ends; if not, the flow returns to step S37.
When an event in the sequence data is not an HV note-on at step S37, the speech regeneration device 1 treats the event as part of the music data and transforms it into sound-source playback event data (see step S44).
The flow then proceeds to step S45, where the speech regeneration device 1 plays back the data transformed at step S44 in the melody playback unit of the sound source 20.
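The Fig. 10 main loop (steps S36 to S45) reduces to a dispatch over sequence events: HV note-ons start a script, everything else goes to the melody synthesizer. The sketch below assumes a list-of-dicts event representation and callback functions standing in for the HV and melody playback units; none of these names come from the patent.

```python
def play_smaf_sequence(sequence, hv_blocks, play_hv, play_music):
    """Sketch of the Fig. 10 loop (S36-S45): walk the sequence data in
    order; an HV note-on event (S37 "Yes") starts playback of the HV
    script it designates (S38), while any other event is treated as
    music data and handed to the melody playback unit (S44/S45)."""
    for ev in sequence:
        if ev["type"] == "hv_note_on":
            play_hv(hv_blocks[ev["block"]])  # S38: designated HV block
        else:
            play_music(ev)                   # S44/S45: music data event
```

The loop exits when the sequence is exhausted, corresponding to the end-of-melody judgment at step S43.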
As described above, the present embodiment adopts a speech playback scheme based on formant synthesis using an FM sound source, which offers the following three advantages.
(1) A phrase of the user's liking can be assigned; that is, without depending on a fixed dictionary, speech can be played back with a timbre closer to the one the user prefers.
(2) Because only part of the default synthesis dictionary data 19 is replaced with the user phrase synthesis dictionary data 13, an excessive increase in data capacity within the speech regeneration device 1 can be avoided. Since any part of the default synthesis dictionary data 19 can be replaced with an arbitrary phrase, pronunciation can be performed in phrase units rather than only in pronunciation units, eliminating the acoustic discontinuity at the joints between pronunciation units that arises in conventional synthesized speech.
(3) Since an arbitrary phrase can be designated in the HV script data, syllable-unit speech synthesis and phrase-unit sound pronunciation can be used in combination.
Furthermore, compared with the method of playing back waveform data obtained by sampling phrases in advance, the present embodiment can realize timbre changes based on formant intensity. Although both data size and quality depend on the frame rate in the present embodiment, far higher-quality speech playback can be achieved with a far smaller data capacity than with the conventional sampled-waveform method. The speech regeneration device 1 of the present embodiment can therefore easily be incorporated into a mobile terminal such as a mobile phone, allowing the content of e-mail and the like to be played back as high-quality speech.
Fig. 11 is a block diagram showing the configuration of a voice and music player according to the second embodiment of the present invention. Here, an HV-script (i.e., HV script data) is a file that defines the form used for speech playback: it consists of data that set the manner of pronunciation, including prosodic symbols (i.e., pronunciation character strings designating pitch and the like) and information used for playing back sounds, and it exists to make user-created speech synthesis with text input easier.
The HV-script is read in by application software such as a text editor; it may be written in any document form that can be edited as text, one example being a text file created with a text editor. HV-scripts are language-dependent and can be defined for various languages; in the present embodiment, the HV-script is defined in Japanese.
Reference numeral 101 denotes an HV-script player, which controls the playback, stopping, and so on of an HV-script. When an HV-script has been registered in the HV-script player 101 and a playback instruction for it has been received, the HV-script player 101 begins interpreting the HV-script. Then, depending on the kind of event described in the HV-script, processing according to that event is carried out by one of the HV driver 102, the waveform playback player 104, and the phrase playback player 107.
The HV driver 102 reads in and refers to synthesis dictionary data from a ROM (read-only memory, not shown). Human voices have formants (i.e., characteristic frequency spectra) determined by the structure of the human body (for example, the shape of the vocal cords and oral cavity); the synthesis dictionary data stores the formant parameters of voices in association with pronunciation characters. The synthesis dictionary data corresponds to a database in which parameters, obtained by sampling and analyzing actual sounds per pronunciation character (for example, Japanese phoneme units such as "あ" and "い"), are stored in advance as formant frame data per pronunciation character.
For example, in the case of the aforementioned CSM (Composite Sinusoidal Modeling) speech synthesis scheme, eight sets of formant frequencies and formant intensities, a pitch, and the like are stored as parameters in the synthesis dictionary data. Compared with playing back waveform data produced by sampling speech, such a speech synthesis scheme has the advantage of a very small data volume. The synthesis dictionary data can also store parameters for controlling the voice quality of the reproduced speech (for example, parameters designating changes to the eight sets of formant frequencies and formant intensities).
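The "very small data volume" claim can be made concrete with back-of-the-envelope arithmetic. The sketch below compares a CSM-style formant-frame stream (eight frequency/intensity pairs plus pitch, i.e., 17 parameters per frame) against plain PCM; the frame rate, parameter width, and PCM settings are assumed figures for illustration only — the text fixes only the "eight sets plus pitch" count.

```python
def formant_stream_bytes(seconds, frame_rate=50, params_per_frame=17,
                         bytes_per_param=2):
    """Rough size of a CSM formant-frame stream: 8 formant
    (frequency, intensity) pairs + 1 pitch = 17 parameters per frame.
    frame_rate and bytes_per_param are assumptions, not from the text."""
    return seconds * frame_rate * params_per_frame * bytes_per_param


def pcm_bytes(seconds, sample_rate=8000, bytes_per_sample=2):
    """Size of plain 16-bit mono PCM at telephone-band rate, for comparison."""
    return seconds * sample_rate * bytes_per_sample
```

Under these assumptions one second of speech costs 1,700 bytes of formant frames versus 16,000 bytes of PCM — roughly an order of magnitude, consistent with the data-volume advantage claimed above.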
The HV driver 102 interprets the pronunciation character strings, including prosodic symbols, in the HV-script, transforms them into formant frame strings using the synthesis dictionary data, and outputs the result to the HV sound source 103. The HV sound source 103 generates a pronunciation signal from the formant frames output by the HV driver 102 and outputs it to the adder 110.
The waveform playback player 104 plays back and stops waveform data, such as speech, melodies, and simulated sounds sampled in advance. Reference numeral 105 denotes a waveform-data RAM, which stores default waveform data in advance. The user can store user waveform data from the user-data RAM 112 into the waveform-data RAM 105 via a registration API (registration application program interface) 113. When the waveform playback player 104 accepts a playback instruction from the HV-script player 101, it reads waveform data from the waveform-data RAM 105 and outputs it to the waveform regenerator 106. The waveform regenerator 106 generates a pronunciation signal from the waveform data output by the waveform playback player 104 and outputs it to the adder 110. The sampled waveform data is not limited to the PCM (pulse-code modulation) format; it may also take a compressed audio format such as MP3 (Moving Picture Experts Group layer 3).
The phrase playback player 107 plays back and stops music phrase data (or music data). The music phrase data is of SMF format, consisting of note information representing the pitch and volume of the sounds to be pronounced and time information representing their sounding periods. Reference numeral 108 denotes a music-phrase-data RAM, which stores default music phrase data in advance. The user can store user music phrase data from the user-data RAM 112 into the music-phrase-data RAM 108 via the registration API.
Upon accepting a playback instruction from the HV-script player 101, the phrase playback player 107 reads music phrase data from the music-phrase-data RAM 108, performs time management of the note information in the music phrase data, and outputs the note information to the phrase sound source 109 according to the time information described in the music phrase data. The phrase sound source 109 generates musical-tone signals from the note information output by the phrase playback player 107 and outputs them to the adder 110. The phrase sound source 109 may adopt the FM method, the PCM method, or the like; as long as it has the function of playing back music phrase data, the sound-source method need not be limited.
The adder 110 combines the pronunciation signal output from the HV sound source 103, the voice signal output from the waveform regenerator 106, and the musical-tone signal output from the phrase sound source 109, and outputs the combined signal to the loudspeaker 111. The loudspeaker 111 emits voices and/or musical tones according to the combined signal from the adder 110.
The HV driver 102, the waveform playback player 104, and the phrase playback player 107 may also process simultaneously, so that the voices and melodies based on the pronunciation signal, voice signal, and musical-tone signal each sound at the same time (the voice signal and musical-tone signal may also be referred to collectively as "audio signals"). Alternatively, the HV-script player 101 may manage the processing timing of the HV driver 102, waveform playback player 104, and phrase playback player 107 so that the voices and melodies based on the respective processing are played back simultaneously. In the present embodiment, however, simultaneous processing by the HV driver 102, waveform playback player 104, and phrase playback player 107 is prohibited. In Fig. 11, the waveform-data RAM 105, the music-phrase-data RAM 108, and the user-data RAM 112 are shown as separate RAMs for convenience of description, but these functions may also be allocated to different storage areas of a single RAM.
Fig. 12 shows a definition example of the events, described in an HV-script, for playing back waveform data or music phrase data (hereinafter collectively "audio data"). An event whose leading character is "D" is a default definition, and one whose leading character is "○" is a user definition. Each event is classified as either waveform or phrase. The default definitions (D0 to D63) are assigned default waveform data stored in advance in the waveform-data RAM 105 or default music phrase data stored in advance in the music-phrase-data RAM 108; 64 items of default waveform data and default music phrase data can be assigned. The user definitions (○0 to ○63) are assigned sampled waveform data or music phrase data created arbitrarily by the user; 64 items of such data can likewise be assigned.
For the events classified as waveform in Fig. 12, data representing the relation between each event and the waveform data it designates is stored in advance in the waveform-data RAM 105. Likewise, for the events classified as phrase, data representing the relation between each event and the music phrase data it designates is stored in advance in the music-phrase-data RAM 108. When waveform data or music phrase data from the user-data RAM 112 has been registered, the user updates these data.
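The "D"/"○" plus slot-number scheme of Fig. 12 can be captured by a small parser. The following Python sketch parses one event token into its definition kind and slot number; the token syntax beyond "prefix character followed by a number 0–63" is an assumption for illustration.

```python
def classify_event(token):
    """Parse one audio-data event token per the Fig. 12 scheme:
    'D' marks a default-definition slot and '○' a user-definition slot,
    each followed by a slot number in the range 0-63."""
    if not token or token[0] not in ("D", "○"):
        raise ValueError("not an audio-data event token")
    slot = int(token[1:])
    if not 0 <= slot <= 63:
        raise ValueError("slot number out of range 0-63")
    return ("default" if token[0] == "D" else "user"), slot
```

Whether the slot holds waveform data or a music phrase would then be looked up in the relation data held in the waveform-data RAM 105 or the music-phrase-data RAM 108.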
For example, suppose the HV-script reads "TJK12みなさん○0です。D20". The "T" at the head of "TJK12" is a symbol indicating the start of an HV-script, and "J" designates the national character code, here indicating that the HV-script is written in Japanese. "K12" is a symbol setting the voice quality, designating the twelfth voice quality. "みなさん" and "です" are interpreted by the HV driver 102, and the Japanese speech "みなさん" and "です" is emitted from the loudspeaker 111. When a pronunciation character string such as "みなさん" or "です" includes prosodic symbols expressing the pronunciation state such as intonation (or stress), speech with the added intonation (or stress) is emitted.
In the user event "○0", for example, waveform data obtained by sampling a voice saying "Suzuki" is registered. This user event "○0" is interpreted by the waveform playback player 104, whereby the voice "Suzuki" is emitted from the loudspeaker 111. In the event "D20", music phrase data of, for example, a short cheerful tune is stored. This event "D20" is interpreted by the phrase playback player 107, whereby the lively tune is emitted from the loudspeaker 111. The reproduced speech then becomes "みなさん Suzuki です" (while the music phrase plays), with waveform data played back only for the "Suzuki" part. Compared with the speech synthesized per pronunciation unit for "みなさん" or "です", the sound of the voice produced by waveform playback is stronger, and the playback at the joints between pronunciation units is more natural. Moreover, playing back a waveform characteristic of the utterance of the word "Suzuki" lets the user hear the reproduced speech effectively. As described above, by writing events designating the playback of waveform data or music phrase data in the HV-script, the playback timing of waveform data or music phrase data can be designated arbitrarily. The notation conventions of the HV-script are a matter of design and are not limited to the method described above.
The operation of the voice and music player according to the present embodiment will now be described with the flowchart of Fig. 13. First, the user creates an HV-script with a text editor and registers it in the HV-script player 101 (see step S101). At this time, if user-defined waveform data or music phrase data exists, the registration API 113 reads the waveform data or music phrase data from the user-data RAM 112, stores the waveform data in the waveform-data RAM 105, and stores the music phrase data in the music-phrase-data RAM 108.
When the user issues a start instruction (step S103), the HV-script player 101 begins interpreting the HV-script (see step S102). The HV-script player 101 judges whether the HV-script contains an event beginning with "D" or "○" (step S104); when such an event is present, it judges whether its category is waveform data (step S105). If the category of the event is waveform data, the HV-script player 101 directs the processing to the waveform playback player 104, which reads the waveform data of the number following "D" or "○" from the waveform-data RAM 105 and outputs it to the waveform regenerator 106 (step S106). The waveform regenerator 106 generates a voice signal from this waveform data and outputs it to the loudspeaker 111 through the adder 110 (step S107). The loudspeaker 111 thereby emits the corresponding voice.
When the event category is not waveform data at step S105, the flow proceeds to step S108, where the HV-script player 101 judges whether the category of the event is music phrase data. When it is, the HV-script player 101 directs the processing to the phrase playback player 107. The phrase playback player 107 reads the music phrase data of the number following "D" or "○" from the music-phrase-data RAM 108, and outputs the note information in the music phrase data to the phrase sound source 109 according to the time information in that data (see step S109). The phrase sound source 109 generates musical-tone signals from this note information and outputs them to the loudspeaker 111 through the adder 110 (step S110). The loudspeaker 111 thereby emits the tune. When the event category is judged at step S108 not to be music phrase data, the voice and music player of the present embodiment regards the event as one it cannot handle, and the flow proceeds to step S113.
When no event beginning with "D" or "○" is described in the HV-script at step S104, the HV-script player 101 directs the processing to the HV driver 102. The HV driver 102 transforms the character string into a formant frame string using the synthesis dictionary data and outputs it to the HV sound source 103 (see step S111). The HV sound source 103 generates a pronunciation signal from the formant frames and outputs it to the loudspeaker 111 through the adder 110 (see step S112). The loudspeaker 111 thereby emits the corresponding speech.
When the processing of an event has finished, the HV-script player 101 judges whether interpretation has been completed up to the last description of the HV-script (step S113). When descriptions remain to be interpreted, the flow returns to step S104; when all descriptions of the HV-script have been interpreted, the voice and melody playback processing shown in Fig. 13 ends.
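The Fig. 13 flow is a per-token dispatch: audio-data events go to the waveform or phrase player according to their registered category, and everything else goes through formant synthesis. The sketch below assumes the script is already split into tokens and uses callbacks and slot tables as stand-ins for the players and RAMs; none of these names come from the patent.

```python
def interpret_script(tokens, wave_slots, phrase_slots,
                     synthesize, play_wave, play_phrase):
    """Sketch of the Fig. 13 dispatch loop. Each 'D'/'○' event is
    routed to the waveform player (S105-S107) or the phrase player
    (S108-S110) depending on which table its slot is registered in;
    any other token is a pronunciation string sent through formant
    synthesis (S111-S112). Unregistered events are skipped (S108 'no'),
    and the loop ends when the tokens are exhausted (S113)."""
    for tok in tokens:
        if tok[:1] in ("D", "○"):
            slot = (tok[0], int(tok[1:]))
            if slot in wave_slots:
                play_wave(wave_slots[slot])      # S106-S107
            elif slot in phrase_slots:
                play_phrase(phrase_slots[slot])  # S109-S110
            # else: event cannot be handled -> continue to S113
        else:
            synthesize(tok)                      # S111-S112
```

With the example script of this embodiment, "みなさん" and "です" would reach `synthesize`, "○0" the waveform player, and "D20" the phrase player.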
In the case of "TJK12みなさん○0です。D20" shown as the HV-script notation example of the present embodiment, the speech of the next word "です" must be emitted after the pronunciation of the waveform data defined by the event "○0" has finished. For example, when the HV-script player 101 interprets a waveform-data (or music-phrase-data) event, it temporarily postpones the playback of the next event, and when the pronunciation by the waveform playback player 104 (or the phrase playback player 107) finishes, that player outputs a signal indicating the end of pronunciation to the HV-script player 101.
When simultaneous playback processing by the HV driver 102, the waveform playback player 104, and the phrase playback player 107 is permitted, their playback processing can also be controlled by the notation of the HV-script. For example, when the HV-script reads "TJK12みなさん○0 3です。D20", the " " (space) and "3" following "○0" set an event defining a prescribed silent interval, controlling the playback so that the speech reproduced by the HV driver 102 is silent while the voice of the word "Suzuki" designated by "○0" is being emitted. When the HV-script reads "TJK12こんにちは。D20みなさん○03です。", the melody designated by "D20" and the speech "みなさん Suzuki です" are pronounced simultaneously.
Figure 14 is the formation block diagram of pocket telephone that possesses the voice music player of present embodiment.Here, the CPU of each one of label 141 representative control pocket telephones.The antenna that on behalf of transmitting and receiving data, label 142 use.Label 143 is represented Department of Communication Force, sending with outputing to antenna 142 after the data-modulated, simultaneously the reception that is received by antenna is carried out demodulation with data.Label 144 is represented speech processes portion, when pocket telephone is conversed, the speech data of the partner of exporting from Department of Communication Force 143 is transformed to voice signal and outputs to nearly ear loudspeaker (or earphone, not shown), simultaneously the voice signal from transmitter (not shown) input is transformed to speech data and outputs to Department of Communication Force 143.
Label 145 is represented source of sound, has the function same with HV source of sound 103 shown in Figure 11, waveform regenerator 106 and phrase source of sound 109.Label 146 is represented loudspeaker, sends desirable voice or musical sound.Label 147 representatives are by the operating portion of user's operation.Label 148 representative storages are by the RAM of HV-script or user-defined Wave data and melody short sentence data etc.The ROM of program that label 149 representative storage CPU141 carry out and synthetic dictionary data, default Wave data, default melody short sentence data etc.Label 150 is represented display part, and the operating result that the user is carried out or state of pocket telephone etc. are presented on the picture.Label 151 is represented Vib., accepts the indication from CPU141 when pocket telephone has incoming call, produces vibration.Each above-mentioned functional block is connected with each other through bus B.
This pocket telephone has from the function of speech production Wave data, speech processes portion 144 is delivered in the voice of importing from transmitter, and be transformed to Wave data, and this Wave data is stored in the RAM148.By Department of Communication Force 143 during from WEB downloaded melody short sentence data, just this melody short sentence data storage in RAM148.
CPU141 is according to the program that is stored in the ROM149, carries out the processing same with HV-script player 101 shown in Figure 11, HV driver 102, waveform regeneration player 104 and phrase regeneration player 107.In addition, CPU141 also makes an explanation to the incident of record in the HV-script of reading from RAM148.Under the situation that representations of events is pronounced by phonetic synthesis, CPU141 reads and with reference to synthetic dictionary data, the character string of recording and narrating in the HV-script is transformed to resonance peak frame string and it is outputed to source of sound 145 from ROM149.
Under the situation of the regeneration of representations of events Wave data, CPU141 from RAM148 or ROM149, read continuing in the HV-script " D " or " zero " number Wave data and output to source of sound 145.Under the situation of the regeneration of representations of events melody short sentence, CPU141 reads the melody short sentence data of continuing in the HV-script " D " or " zero's " number from RAM148 or ROM149, and according to the temporal information in these melody short sentence data the note information in the melody short sentence data is outputed to source of sound 145.
The sound source 145 generates a synthesized pronunciation signal from the formant frame string output by the CPU 141 and outputs it to the speaker 146. It also generates a sound signal from waveform data output by the CPU 141 and outputs it to the speaker 146, and further generates a musical tone signal from music phrase data output by the CPU 141 and outputs it to the speaker 146. The speaker 146 accordingly emits speech or musical tones based on the synthesized pronunciation signal, the sound signal, or the musical tone signal.
When the user operates the operating unit 147 to start text-editing software, the user can create an HV-script while checking the content displayed on the screen of the display unit 150, and can store the HV-script thus created in the RAM 148.
In addition, an HV-script created by the user can be applied as an incoming-call ringtone. In that case, the fact that the HV-script is to be used when the mobile phone receives an incoming call is stored in advance in the RAM 148 as setting information. That is, when the communication unit 143 receives, via the antenna 142, call information transmitted from another mobile phone, the communication unit 143 notifies the CPU 141 of the incoming call. Upon receiving this notification, the CPU 141 reads the setting information from the RAM 148, reads from the RAM 148 the HV-script indicated by that setting information, and begins interpreting it. The subsequent processing is as described above; that is, the speaker 146 emits speech or musical tones according to the types of events described in the HV-script.
In addition, the user can attach an HV-script to an e-mail and send it to another terminal. The CPU 141 may also interpret the text of an e-mail itself as being in the HV-script format and, after receiving the user's instruction, output a playback instruction for that HV-script to the speech processing unit 144 in accordance with the description in the e-mail. The CPU 141 need not bear all of the functions of the HV-script player 101, the HV driver 102, the waveform playback player 104, and the phrase playback player 107; for example, the sound source 145 may take over any of these functions.
The present embodiment is applicable not only to mobile phones (cellular phones); it may also be applied to portable terminals such as a PHS (personal handyphone system; a registered trademark in Japan) or a PDA (personal digital assistant), which can then carry out the speech and music playback described above.
As a flexible application of the present embodiment, an HV-script created by the user can be loaded into a portable terminal such as a mobile phone. The general user can thereby easily create not only the text used for speech synthesis but also HV-scripts for playing back recorded sample waveform data or music phrase data. Furthermore, when a portable terminal used for sending and receiving e-mail is provided with the speech and music player of the present embodiment, the user can operate the portable terminal to attach an HV-script to an e-mail for sending and receiving. In this way, the receiving portable terminal can appropriately play back not only the speech-synthesis text contained in the received e-mail but also the recorded sampled data or music phrase data. The playback of speech and music using an HV-script can also be used as an incoming-call ringtone.
Next, the speech and music player of the third embodiment of the present invention will be described with reference to Figures 15 and 16. The third embodiment combines the configurations of the first and second embodiments described above: the middleware carries out HV playback, waveform playback, and music phrase playback, the sound source generates synthesized pronunciation signals from these three kinds of data, and the signals of the three systems are combined and output to the speaker.
Here, Figure 15 takes the structure of Figure 1 as a base and combines it with part of the structure of Figure 11; reference numerals 211 to 219 correspond to reference numerals 11 to 19 of Figure 1, and reference numerals 303 to 313 correspond to reference numerals 103 to 113 of Figure 11. That is, the user-data RAM of Figure 11 is connected to the middleware of Figure 1 through the user-data API, and the waveform playback player and the phrase playback player of Figure 11 are added to the middleware, the waveform playback player and the phrase playback player being connected to the waveform-data RAM and the music-phrase-data RAM, respectively. The application software also serves as the HV-script player of Figure 11 and, through the middleware API, directs the HV-script to HV conversion, to the waveform playback player, or to the music phrase player according to the types of events described in the HV-script. The sound source combines the three functions of the HV sound source, the waveform generator, and the music phrase sound source of Figure 11, and their output signals are summed by an adder and sounded through the speaker. The operation of each constituent element shown in Figure 15 is identical to that of the corresponding constituent elements shown in Figures 1 and 11, so a detailed description thereof is omitted.
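The adder stage described above — summing the outputs of the three sound-source systems into a single signal for the speaker — might be sketched as follows. The 16-bit sample format and the clipping behavior are assumptions made for this illustration, not details given in the patent.

```python
# Hypothetical sketch of the adder summing the HV sound source,
# waveform generator, and music phrase sound source outputs.
# The 16-bit range and clipping are illustrative assumptions.

def mix(hv, wave, phrase):
    """Sum three equal-length sample streams and clip the result to
    the signed 16-bit range before it is sent to the speaker."""
    out = []
    for a, b, c in zip(hv, wave, phrase):
        s = a + b + c
        out.append(max(-32768, min(32767, s)))
    return out

# Two sample frames: the second sum (38500) exceeds the 16-bit range
# and is clipped to 32767.
print(mix([1000, -2000], [500, 500], [0, 40000]))
```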
Figure 16 is a flowchart showing the operation of the speech and music playback apparatus shown in Figure 15. It takes the flowchart of Figure 13 as a base, with part of the flowchart of Figure 8 added; steps S211 to S216 correspond to steps S11 to S16 of Figure 8, and steps S304 to S310, S312, and S313 correspond to steps S104 to S110, S112, and S113 of Figure 13. That is, when the judgment result in step S104 of Figure 13 is "No", the same processing as steps S13, S14, and S15 of Figure 8 is carried out, and HV sound source playback processing is then carried out in step S312. In this way, by loading a single HV-script, speech playback by the HV sound source, waveform data playback by the waveform playback player, and music phrase playback based on the note information handled by the phrase playback player can all be carried out. The processing of each step shown in Figure 16 is identical to that of Figures 8 and 13, so a detailed description is omitted.
Finally, the prosodic symbols used in the embodiments described above will be explained. For example, a description such as 「は^3じま$り^ま$5>10す」 in an HV-script causes the pronounced character string 「はじまります」 to be speech-synthesized with a prescribed intonation (イントネーション) applied to it; here, "^", "$", ">", and the like correspond to the prosodic symbols. The prescribed inflection (intonation) is applied to the character following the prosodic symbol (or, when a numerical value immediately follows the prosodic symbol, to the character following that numerical value).

Specifically, "^" indicates that the pitch is raised during pronunciation, "$" indicates that the pitch is lowered during pronunciation, and ">" indicates that the volume is lowered during pronunciation, and speech synthesis is carried out according to these symbols. When a numerical value immediately follows a prosodic symbol, the numerical value specifies the amount of change of the intonation applied. For example, in the statement 「は^3じま…」, "は" is pronounced at the reference pitch and volume, the pitch is then raised by the amount "3", and the following "じ" and "ま" are pronounced at the raised pitch.

In this way, when a prescribed intonation (or pitch change) is to be applied, the prosodic symbol described above (together with a numerical value representing the amount of pitch change) is described immediately before the relevant character among the characters of the language to be pronounced. The prosodic symbols above control pitch or volume during pronunciation, but the symbols are not limited to these; symbols that control timbre or speed, for example, may also be used. Attaching such symbols to an HV-script makes it possible to express pronunciation states such as intonation very aptly.
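As a rough illustration of how the prosodic symbols described above could be interpreted, the following sketch walks a script and assigns pitch and volume offsets to each pronounced character. ASCII letters stand in for the pronounced characters; treating a symbol with no following numeral as a change of 1 is an assumption of this sketch, as are all names.

```python
# Hypothetical interpreter for the prosodic symbols described above:
# "^" raises pitch, "$" lowers pitch, ">" lowers volume; an optional
# number after a symbol gives the amount of change. A missing number
# defaulting to 1 is an assumption for this sketch.

def parse_prosody(script):
    """Return (char, pitch_offset, volume_offset) for each pronounced
    character, where offsets are relative to the reference values."""
    pitch, volume = 0, 0
    result = []
    i = 0
    while i < len(script):
        ch = script[i]
        if ch in "^$>":
            i += 1
            num = ""
            while i < len(script) and script[i].isdigit():
                num += script[i]
                i += 1
            amount = int(num) if num else 1
            if ch == "^":
                pitch += amount
            elif ch == "$":
                pitch -= amount
            else:  # ">"
                volume -= amount
        else:
            result.append((ch, pitch, volume))
            i += 1
    return result

# Mirrors the "は^3じま" example with ASCII stand-ins:
# "a" at the reference pitch, then "b" and "c" raised by 3.
print(parse_prosody("a^3bc"))
```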
The present invention is not limited to the embodiments described above; modifications within the scope of the invention are all included in the present invention.

Claims (10)

1. A speech playback apparatus comprising a storage device, a registration device, an input device, and a speech synthesis device; wherein
the storage device stores synthesis dictionary data in which formant frame data corresponding to pronunciation characters, each representing a prescribed pronunciation unit, are associated in advance with those pronunciation characters;
the registration device registers, in accordance with the user's operation, user phrase data in user dictionary data, the user phrase data representing other formant frame data to be used in place of the formant frame data stored in the aforementioned synthesis dictionary data in association with a pronunciation character;
the input device inputs script data comprising a character string made up of a plurality of pronunciation characters and event data, the event data indicating replacement of the formant frame data corresponding to at least a part of the pronunciation characters of the character string; and
the speech synthesis device interprets the input script data, reads formant frame data from the synthesis dictionary data for the pronunciation characters other than the aforementioned part of the character string, reads the user phrase data from the user dictionary data in accordance with the event data and the aforementioned part of the character string, and generates synthesized speech from the formant frame data thus read and the user phrase data thus read.
2. The speech playback apparatus according to claim 1, further comprising a music player that plays back music in accordance with music playback information;
wherein a data interchange format is input to the input device, the data interchange format being an information structure that contains music playback information for playing back music and speech playback information containing the script data and the user phrase data, and that synchronizes music playback based on the music playback information with speech playback based on the speech playback information;
the music player plays back the music playback information contained in the data interchange format; and
the speech synthesis device plays back the speech playback information contained in the data interchange format.
3. A mobile terminal apparatus comprising the speech playback apparatus according to claim 1 or 2.
4. A speech playback apparatus comprising a first storage device that stores sound data, a second storage device that stores script data, a playback instruction device, a synthesized pronunciation signal generation device, a sound signal generation device, and a synthesized speech generation device; wherein
the script data describes a character string made up of pronunciation characters each representing a prescribed pronunciation unit, and event data instructing playback of the sound data;
the playback instruction device reads the script data from the second storage device, instructs pronunciation in accordance with the character string in the script data, and instructs playback of the sound data in accordance with the event data in the script data;
the synthesized pronunciation signal generation device carries out speech synthesis in accordance with the pronunciation instruction for the character string from the playback instruction device, and generates a synthesized pronunciation signal;
the sound signal generation device reads the sound data from the first storage device in accordance with the playback instruction for the sound data from the playback instruction device, and generates a sound signal from that sound data; and
the synthesized speech generation device generates synthesized speech in accordance with the synthesized pronunciation signal, and generates sound in accordance with the sound signal.
5. The speech playback apparatus according to claim 4, wherein the sound data is waveform data generated by sampling a prescribed sound.
6. The speech playback apparatus according to claim 4, wherein the sound data is music data containing note information representing the pitch and volume of the sounds to be pronounced.
7. The speech playback apparatus according to claim 4, wherein the synthesized pronunciation signal generation device stores formant control parameters characterizing the sounds with which characters are pronounced, and carries out speech synthesis using the formant control parameters corresponding to the character string in the script data.
8. The speech playback apparatus according to any one of claims 4 to 7, wherein the script data is described using a file made up of text data.
9. The speech playback apparatus according to claim 4, wherein synthesis dictionary data is provided in which formant frame data corresponding to pronunciation characters, each representing a prescribed pronunciation unit, are associated in advance with those pronunciation characters;
user phrase data is registered in user dictionary data in accordance with the user's operation, the user phrase data representing other formant frame data to be used in place of the formant frame data stored in the aforementioned synthesis dictionary data in association with a pronunciation character; and
when the script data contains event data indicating replacement of the formant frame data corresponding to at least a part of the pronunciation characters of the character string, the synthesized pronunciation signal generation device reads formant frame data from the synthesis dictionary data for the pronunciation characters other than the aforementioned part of the pronunciation characters, reads the user phrase data from the user dictionary data in accordance with the event data and the aforementioned part of the character string, and generates the synthesized pronunciation signal from the formant frame data thus read and the user phrase data thus read.
10. A mobile terminal apparatus comprising the speech playback apparatus according to any one of claims 4 to 9.
CNB2004100474146A 2003-05-29 2004-05-28 Speech and music regeneration device Expired - Fee Related CN1310209C (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
JP2003152895 2003-05-29
JP2003152895A JP4244706B2 (en) 2003-05-29 2003-05-29 Audio playback device
JP2003340171A JP2005107136A (en) 2003-09-30 2003-09-30 Voice and musical piece reproducing device
JP2003340171 2003-09-30

Publications (2)

Publication Number Publication Date
CN1573921A CN1573921A (en) 2005-02-02
CN1310209C true CN1310209C (en) 2007-04-11

Family

ID=34525345

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004100474146A Expired - Fee Related CN1310209C (en) 2003-05-29 2004-05-28 Speech and music regeneration device

Country Status (4)

Country Link
KR (1) KR100612780B1 (en)
CN (1) CN1310209C (en)
HK (1) HK1069433A1 (en)
TW (1) TWI265718B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101102191B1 (en) * 2005-09-02 2012-01-02 주식회사 팬택 Apparatus And Method For Modifying And Playing Of Sound Source In The Mobile Communication Terminal
CN101694772B (en) * 2009-10-21 2014-07-30 北京中星微电子有限公司 Method for converting text into rap music and device thereof

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001051688A (en) * 1999-08-10 2001-02-23 Hitachi Ltd Electronic mail reading-aloud device using voice synthesization
JP2002221980A (en) * 2001-01-25 2002-08-09 Oki Electric Ind Co Ltd Text voice converter
JP2002366186A (en) * 2001-06-11 2002-12-20 Hitachi Ltd Method for synthesizing voice and its device for performing it
JP2003029774A (en) * 2001-07-19 2003-01-31 Matsushita Electric Ind Co Ltd Voice waveform dictionary distribution system, voice waveform dictionary preparing device, and voice synthesizing terminal equipment
CN1416053A (en) * 2001-11-02 2003-05-07 日本电气株式会社 Speech synthetic system and speech synthetic method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100279741B1 (en) * 1998-08-17 2001-02-01 정선종 Operation Control Method of Text / Speech Converter Using Hypertext Markup Language Element
JP2002073507A (en) * 2000-06-15 2002-03-12 Sharp Corp Electronic mail system and electronic mail device
KR100351590B1 (en) * 2000-12-19 2002-09-05 (주)신종 A method for voice conversion


Also Published As

Publication number Publication date
TWI265718B (en) 2006-11-01
KR100612780B1 (en) 2006-08-17
CN1573921A (en) 2005-02-02
HK1069433A1 (en) 2005-05-20
TW200427297A (en) 2004-12-01
KR20040103433A (en) 2004-12-08

Similar Documents

Publication Publication Date Title
CN1269104C (en) Text structure for voice synthesis, voice synthesis method, voice synthesis apparatus, and computer program thereof
US9761219B2 (en) System and method for distributed text-to-speech synthesis and intelligibility
JP3938015B2 (en) Audio playback device
US7013282B2 (en) System and method for text-to-speech processing in a portable device
CN1194336C (en) Waveform generating method and appts. thereof
EP2704092A2 (en) System for creating musical content using a client terminal
CN100342426C (en) Singing generator and portable communication terminal having singing generation function
CN1461464A (en) Language processor
CN1235189C (en) Method and equipment and modifying speech sound signal using music
CN1310209C (en) Speech and music regeneration device
KR100634142B1 (en) Potable terminal device
CN1436345A (en) Terminal device, guide voice reproducing method and storage medium
JP3681111B2 (en) Speech synthesis apparatus, speech synthesis method, and speech synthesis program
CN1273953C (en) Audio synthesis system capable of synthesizing different types of audio data
JP2003029774A (en) Voice waveform dictionary distribution system, voice waveform dictionary preparing device, and voice synthesizing terminal equipment
JP4244706B2 (en) Audio playback device
CN2694427Y (en) Sound synthesis system capable of synthesizing different kinds of sound data
KR20100003574A (en) Appratus, system and method for generating phonetic sound-source information
JP4366918B2 (en) Mobile device
CN1622194A (en) Musical tone and speech reproducing device and method
CN1629933A (en) Sound unit for bilingualism connection and speech synthesis
JP2005107136A (en) Voice and musical piece reproducing device
JP2004240333A (en) Method and program for generating voice
JP2005234208A (en) Musical sound reproducing device and mobile terminal device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1069433

Country of ref document: HK

C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20070411

Termination date: 20130528