Specific embodiment
First embodiment
A speech synthesis dictionary creating apparatus according to a first embodiment is described below with reference to the drawings. Fig. 1 is a configuration diagram showing the configuration of a speech synthesis dictionary creating apparatus 1a according to the first embodiment. Here, the speech synthesis dictionary creating apparatus 1a is implemented using, for example, a general-purpose computer. That is, the speech synthesis dictionary creating apparatus 1a has the functions of a computer including, for example, a CPU, a memory device, an input/output device, and a communication interface.
As shown in Fig. 1, the speech synthesis dictionary creating apparatus 1a includes a first voice input unit 10, a first storage unit 11, a control unit 12, a display unit 13, a second voice input unit 14, an analysis determination unit 15, a creating unit 16, and a second storage unit 17. Here, the first voice input unit 10, the control unit 12, the display unit 13, the second voice input unit 14, and the analysis determination unit 15 can be configured using hardware or using software executed by the CPU. The first storage unit 11 and the second storage unit 17 are configured using, for example, an HDD (hard disk drive) or a memory. Thus, the speech synthesis dictionary creating apparatus 1a may be configured such that its functions are realized by executing a speech synthesis dictionary creating program.
The first voice input unit 10 receives voice data of an arbitrary user (first voice data), for example, via a communication interface (not shown), and inputs the voice data to the analysis determination unit 15. The first voice input unit 10 may include hardware such as a communication interface and a microphone.
The first storage unit 11 stores a plurality of texts (i.e., texts to be read aloud), and outputs any one of the stored texts in response to control by the control unit 12. The control unit 12 controls the constituent units of the speech synthesis dictionary creating apparatus 1a. The control unit 12 selects any one of the texts stored in the first storage unit 11, reads the selected text from the first storage unit 11, and outputs the read text to the display unit 13.
The display unit 13 receives, via the control unit 12, one of the texts stored in the first storage unit 11, and presents the received text to the user. Here, the display unit 13 presents the texts stored in the first storage unit 11 in a random manner. Moreover, the display unit 13 presents a text only for a predetermined period (for example, about several seconds to one minute). The display unit 13 can be, for example, a display device, a loudspeaker, or a communication interface. That is, so that the user can recognize and speak the selected text, the display unit 13 presents the text either by displaying it or by outputting a recorded speech of the text.
When an arbitrary user reads aloud the text presented by the display unit 13, the second voice input unit 14 receives the resulting voice data as appropriate voice data (second voice data) and inputs it to the analysis determination unit 15. Here, the second voice input unit 14 may receive the second voice data via, for example, a communication interface (not shown). The second voice input unit 14 may include hardware, such as a communication interface and a microphone, shared with the first voice input unit 10, or may include shared software.
Upon receiving the first voice data via the first voice input unit 10, the analysis determination unit 15 causes the control unit 12 to operate so that the display unit 13 presents a text. Then, upon receiving the second voice data via the second voice input unit 14, the analysis determination unit 15 compares a feature quantity of the first voice data with a feature quantity of the second voice data, and thereby determines whether the speaker of the first voice data is identical to the speaker of the second voice data.
For example, the analysis determination unit 15 performs speech recognition on the first voice data and the second voice data, and generates texts corresponding to the first voice data and the second voice data, respectively. The analysis determination unit 15 may also perform a voice quality check on the second voice data to determine whether the signal-to-noise ratio (SNR) and the amplitude are equal to or greater than predetermined thresholds. Furthermore, the analysis determination unit 15 compares the feature quantities of the first voice data and the second voice data based on at least one of the following properties: the amplitude value; the average or variance of the fundamental frequency (F0); the correlation of spectrum envelope extraction results; and the word accuracy and word recognition rate of the speech recognition. Here, examples of spectrum envelope extraction methods include linear predictive coefficients (LPC), mel-frequency cepstral coefficients, line spectral pairs (LSP), mel-LPC, and mel-LSP.
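As a concrete illustration of the feature quantities listed above, the following sketch extracts the overall amplitude and the average and variance of the fundamental frequency (F0) from a mono signal. It is a minimal example, not the apparatus's actual implementation: the frame size, hop, F0 search range, and the autocorrelation-based pitch estimate are all assumptions chosen for illustration.

```python
import numpy as np

def extract_features(signal, sr=16000, frame=1024, hop=512,
                     f0_min=60.0, f0_max=400.0):
    """Return (rms_amplitude, f0_mean, f0_variance) for a mono signal.

    F0 per frame is estimated from the autocorrelation peak inside the
    lag range corresponding to [f0_min, f0_max] Hz (a crude estimator).
    """
    f0s = []
    for start in range(0, len(signal) - frame, hop):
        x = signal[start:start + frame].astype(np.float64)
        x = x - x.mean()
        if np.max(np.abs(x)) < 1e-4:          # skip near-silent frames
            continue
        ac = np.correlate(x, x, mode="full")[frame - 1:]
        lo, hi = int(sr / f0_max), int(sr / f0_min)
        lag = lo + int(np.argmax(ac[lo:hi]))
        if ac[lag] > 0.3 * ac[0]:             # crude voicing decision
            f0s.append(sr / lag)
    rms = float(np.sqrt(np.mean(signal.astype(np.float64) ** 2)))
    if not f0s:
        return rms, 0.0, 0.0
    return rms, float(np.mean(f0s)), float(np.var(f0s))
```

A production system would instead use a robust pitch tracker and one of the spectrum envelope representations named above (e.g., mel-frequency cepstral coefficients).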
The analysis determination unit 15 then compares the feature quantity of the first voice data with the feature quantity of the second voice data. If the difference between the feature quantity of the first voice data and the feature quantity of the second voice data is equal to or less than a predetermined threshold, or if the correlation between the two feature quantities is equal to or greater than a predetermined threshold, the analysis determination unit 15 determines that the speaker of the first voice data is identical to the speaker of the second voice data. Here, it is assumed that the thresholds used in the determination by the analysis determination unit 15 are set by learning in advance the averages and variances of the feature quantities of the same person, or by learning speech recognition results from a large amount of data in advance.
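The threshold comparison described above can be sketched as follows. This is a schematic, not the claimed determination logic: the per-dimension normalization and the value of `dist_threshold` are assumptions, and in practice the threshold would be learned from same-speaker data as the text describes.

```python
import numpy as np

def same_speaker(feat1, feat2, dist_threshold=1.0):
    """Decide whether two feature vectors likely come from the same speaker.

    feat1, feat2: 1-D feature vectors (e.g. amplitude, F0 mean, F0 variance).
    dist_threshold: hypothetical value; normally tuned on training data.
    """
    f1 = np.asarray(feat1, dtype=np.float64)
    f2 = np.asarray(feat2, dtype=np.float64)
    scale = (np.abs(f1) + np.abs(f2)) / 2.0 + 1e-9   # normalize each dimension
    diff = np.linalg.norm((f1 - f2) / scale)
    return diff <= dist_threshold
```

With this rule, a pair of similar feature vectors is accepted as the same speaker, and a clearly dissimilar pair is rejected.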
When it is determined that the speaker of the first voice data is identical to the speaker of the second voice data, the analysis determination unit 15 determines that the voice is appropriate. The analysis determination unit 15 then outputs to the creating unit 16, as appropriate voice data, the first voice data (and the second voice data) whose speaker has been determined to be identical to the speaker of the second voice data. The analysis determination unit 15 may also be divided into an analysis unit that analyzes the first voice data and the second voice data and a determination unit that performs the determination.
The creating unit 16 implements a speech recognition technique, and creates a text of the spoken content from the first voice data received via the analysis determination unit 15. The creating unit 16 then creates a speech synthesis dictionary using the created text and the first voice data, and outputs the speech synthesis dictionary to the second storage unit 17. The second storage unit 17 thus stores the speech synthesis dictionary received from the creating unit 16.
Variation of the first embodiment
Fig. 2 is a configuration diagram showing the configuration of a variation of the speech synthesis dictionary creating apparatus 1a according to the first embodiment shown in Fig. 1 (a configuration example of a speech synthesis dictionary creating apparatus 1b). As shown in Fig. 2, the speech synthesis dictionary creating apparatus 1b includes the first voice input unit 10, the first storage unit 11, the control unit 12, the display unit 13, the second voice input unit 14, the analysis determination unit 15, the creating unit 16, the second storage unit 17, and a text input unit 18. In the speech synthesis dictionary creating apparatus 1b, constituent units that are substantially identical to those of the speech synthesis dictionary creating apparatus 1a are referred to by the same reference numerals.
The text input unit 18 receives a text corresponding to the first voice data via, for example, a communication interface (not shown), and inputs the text to the analysis determination unit 15. Here, the text input unit 18 can be configured using hardware, such as an input device capable of receiving text input, or can be configured using software.
The analysis determination unit 15 treats the voice data obtained when the user speaks the text input to the text input unit 18 as the first voice data, and determines whether the speaker of the first voice data is identical to the speaker of the second voice data. The creating unit 16 then creates a speech synthesis dictionary using the voice determined to be appropriate by the analysis determination unit 15 and the text input to the text input unit 18. Thus, since the speech synthesis dictionary creating apparatus 1b includes the text input unit 18, there is no need to create the text by performing speech recognition. This makes it possible to reduce the processing load.
The following is an explanation of the operations for creating a speech synthesis dictionary performed in the speech synthesis dictionary creating apparatus 1a according to the first embodiment (or in the speech synthesis dictionary creating apparatus 1b). Fig. 3 is a flowchart for explaining the operations for creating a speech synthesis dictionary performed in the speech synthesis dictionary creating apparatus 1a (or in the speech synthesis dictionary creating apparatus 1b).
As shown in Fig. 3, in step 100 (S100), the first voice input unit 10 receives an input of the first voice data via, for example, a communication interface (not shown), and inputs the first voice data to the analysis determination unit 15 (first voice input).
In step 102 (S102), the display unit 13 presents a text to be read aloud to the user.
In step 104 (S104), the second voice input unit 14 receives the voice data obtained when, for example, the user reads aloud the text presented by the display unit 13, as appropriate voice data (second voice data), and inputs the second voice data to the analysis determination unit 15.
In step 106 (S106), the analysis determination unit 15 extracts the feature quantity of the first voice data and the feature quantity of the second voice data.
In step 108 (S108), the analysis determination unit 15 compares the feature quantity of the first voice data with the feature quantity of the second voice data, and thereby determines whether the speaker of the first voice data is identical to the speaker of the second voice data. In the speech synthesis dictionary creating apparatus 1a (or the speech synthesis dictionary creating apparatus 1b), if the analysis determination unit 15 determines that the speaker of the first voice data is identical to the speaker of the second voice data (Yes at S108), then, on the premise that the voice is appropriate, the system control proceeds to S110. If the analysis determination unit 15 determines that the speaker of the first voice data is different from the speaker of the second voice data (No at S108), the speech synthesis dictionary creating apparatus 1a (or the speech synthesis dictionary creating apparatus 1b) ends the operations.
In step 110 (S110), the creating unit 16 creates a speech synthesis dictionary using the first voice data (and the second voice data) determined to be appropriate by the analysis determination unit 15 and the text corresponding to the first voice data (and the second voice data), and outputs the speech synthesis dictionary to the second storage unit 17.
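The flow of steps S100 through S110 can be sketched as the following orchestration function. All the callables passed in are hypothetical stand-ins for the units of the apparatus, not real APIs: `present_text()` returns the prompt text (display unit 13), `record_second_voice(text)` returns the user's read-aloud recording (second voice input unit 14), and `build_dictionary(voice, text)` stands in for the creating unit 16.

```python
def create_dictionary(first_voice, present_text, record_second_voice,
                      extract_features, same_speaker, build_dictionary):
    """Sketch of the S100-S110 flow; returns the dictionary or None."""
    text = present_text()                        # S102: present a text
    second_voice = record_second_voice(text)     # S104: second voice input
    f1 = extract_features(first_voice)           # S106: feature extraction
    f2 = extract_features(second_voice)
    if not same_speaker(f1, f2):                 # S108: speaker check
        return None                              # speakers differ: reject
    return build_dictionary(first_voice, text)   # S110: create dictionary
```

The point of the structure is that dictionary creation is unreachable unless the speaker check passes, which is the anti-impersonation guarantee of the embodiment.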
Fig. 4 schematically shows an example of the operations performed in a speech synthesis dictionary creating system 100 that includes the speech synthesis dictionary creating apparatus 1a. The speech synthesis dictionary creating system 100 includes the speech synthesis dictionary creating apparatus 1a, and performs input and output of data (voice data and texts) via a network (not shown). That is, the speech synthesis dictionary creating system 100 is a system that creates a speech synthesis dictionary using the voice uploaded by a user of the system and provides the speech synthesis dictionary.
Referring to Fig. 4, first voice data 20 represents the voice data generated when a person A speaks an arbitrary number of texts having arbitrary contents. The first voice data 20 is received by the first voice input unit 10.
A presentation example 22 represents a text, such as "the advanced television is 50 inches in size", that the speech synthesis dictionary creating apparatus 1a presents and prompts the user to speak. Second voice data 24 represents the voice data obtained when the user reads aloud the text presented by the speech synthesis dictionary creating apparatus 1a. The second voice data 24 is input to the second voice input unit 14. In a voice obtained via television or the Internet, it is unlikely that the text randomly presented by the speech synthesis dictionary creating apparatus 1a has been spoken. The second voice input unit 14 therefore treats the received voice data as appropriate voice data, and outputs it to the analysis determination unit 15.
The analysis determination unit 15 compares the feature quantity of the first voice data 20 with the feature quantity of the second voice data 24, and thereby determines whether the speaker of the first voice data 20 is identical to the speaker of the second voice data 24.
If the speaker of the first voice data 20 is identical to the speaker of the second voice data 24, the speech synthesis dictionary creating system 100 creates a speech synthesis dictionary and, for example, shows a display 26 to the user as a notification that the speech synthesis dictionary has been created. On the other hand, if the speaker of the first voice data 20 is different from the speaker of the second voice data 24, the speech synthesis dictionary creating system 100 rejects the first voice data 20 and, for example, shows a display 28 to the user as a notification that no speech synthesis dictionary has been created.
Second embodiment
The following is an explanation of a speech synthesis dictionary creating apparatus according to a second embodiment. Fig. 5 is a configuration diagram showing the configuration of a speech synthesis dictionary creating apparatus 3 according to the second embodiment. Here, the speech synthesis dictionary creating apparatus 3 is implemented using, for example, a general-purpose computer. That is, the speech synthesis dictionary creating apparatus 3 has the functions of a computer including, for example, a CPU, a memory device, an input/output device, and a communication interface.
As shown in Fig. 5, the speech synthesis dictionary creating apparatus 3 includes the first voice input unit 10, a voice input unit 31, a detection unit 32, an analysis unit 33, a determination unit 34, the creating unit 16, and the second storage unit 17. In the speech synthesis dictionary creating apparatus 3 shown in Fig. 5, constituent units that are substantially identical to the constituent units of the speech synthesis dictionary creating apparatus 1a shown in Fig. 1 are referred to by the same reference numerals.
The voice input unit 31, the detection unit 32, the analysis unit 33, and the determination unit 34 can be configured using hardware or using software executed by the CPU. Thus, the speech synthesis dictionary creating apparatus 3 may be configured such that its functions are realized by executing a speech synthesis dictionary creating program.
The voice input unit 31 inputs arbitrary voice data to the detection unit 32, such as voice data recorded by a voice recording device capable of embedding authentication information, and voice data recorded by some other recording device.
Here, the voice recording device capable of embedding authentication information may embed the authentication information in a continuous but random manner over, for example, the entire voice, or in specified text contents or text numbers. Examples of the embedding method include encryption using a public key or a shared key, and digital watermarking. When the authentication information represents encryption, the speech waveform is encrypted (waveform encryption). Digital watermarking applied to voice includes echo hiding, which exploits continuous masking; a spread spectrum method in which the amplitude spectrum is manipulated to embed information; a patchwork method; and a method in which information is embedded by phase modulation.
The detection unit 32 detects the authentication information included in the voice data received by the voice input unit 31. That is, the detection unit 32 extracts the authentication information from the voice data in which the authentication information is embedded. When waveform encryption is implemented as the embedding method, the detection unit 32 can be configured to perform decryption using a private key. When the authentication information represents a digital watermark, the detection unit 32 obtains the bit information according to a decoding procedure.
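To make the watermarking concrete, the following sketch shows one of the methods mentioned above, echo hiding: an embedder adds a faint delayed copy of the signal, with the echo lag encoding one bit, and a detector recovers the bit from the cepstral peak at the echo lag. The lags and echo strength are assumed values for illustration, not parameters of the claimed apparatus, and a real watermark would embed many bits with synchronization and robustness measures.

```python
import numpy as np

DELAY_BIT = {0: 100, 1: 150}   # echo lags in samples (hypothetical values)

def embed_bit(signal, bit, alpha=0.4):
    """Embed one authentication bit by adding a faint delayed echo."""
    d = DELAY_BIT[bit]
    out = signal.astype(np.float64).copy()
    out[d:] += alpha * signal[:-d]
    return out

def detect_bit(signal):
    """Recover the bit: the echo produces a peak in the real cepstrum
    at the quefrency equal to the echo lag."""
    spec = np.fft.rfft(signal)
    cep = np.fft.irfft(np.log(np.abs(spec) + 1e-12))
    return 0 if cep[DELAY_BIT[0]] > cep[DELAY_BIT[1]] else 1
```

The detection side corresponds to the role of the detection unit 32: only voice data in which the expected watermark bits are found is treated as recorded by the specified device.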
When the authentication information is detected, the detection unit 32 regards the input voice data as voice data recorded by the specified voice recording device. In this way, the detection unit 32 sets the voice data in which the authentication information is detected as the second voice data regarded as appropriate, and outputs the second voice data to the analysis unit 33.
Furthermore, for example, the voice input unit 31 and the detection unit 32 may be integrated into a second voice input unit 35, which detects the authentication information included in arbitrary voice data and outputs the voice data in which the authentication information is detected as second voice data regarded as appropriate.
The analysis unit 33 receives the first voice data from the first voice input unit 10, receives the second voice data from the detection unit 32, analyzes the first voice data and the second voice data, and outputs the analysis results to the determination unit 34.
For example, the analysis unit 33 performs speech recognition on the first voice data and the second voice data, and generates a text corresponding to the first voice data and a text corresponding to the second voice data. The analysis unit 33 may also perform a voice quality check on the second voice data to determine whether the signal-to-noise ratio (SNR) and the amplitude are equal to or greater than predetermined thresholds. Furthermore, the analysis unit 33 extracts feature quantities of the first voice data and the second voice data based on at least one of the following properties: the amplitude; the average or variance of the fundamental frequency (F0); the correlation of spectrum envelope extraction results; and the word accuracy and word recognition rate of the speech recognition. The spectrum envelope extraction methods may be the same as those implemented by the analysis determination unit 15 (Fig. 2).
The determination unit 34 receives the feature quantities calculated by the analysis unit 33. The determination unit 34 then compares the feature quantity of the first voice data with the feature quantity of the second voice data, and thereby determines whether the speaker of the first voice data is identical to the speaker of the second voice data. For example, if the difference between the feature quantity of the first voice data and the feature quantity of the second voice data is equal to or less than a predetermined threshold, or if the correlation between the two feature quantities is equal to or greater than a predetermined threshold, the determination unit 34 determines that the speaker of the first voice data is identical to the speaker of the second voice data. Here, it is assumed that the thresholds used in the determination by the determination unit 34 are set by learning in advance the averages and variances of the feature quantities of the same person, or by learning speech recognition results from a large amount of data in advance.
If it is determined that the speaker of the first voice data is identical to the speaker of the second voice data, the determination unit 34 determines that the voice is appropriate. The determination unit 34 then outputs to the creating unit 16, as appropriate voice data, the first voice data (and the second voice data) whose speaker has been determined to be identical to the speaker of the second voice data. The analysis unit 33 and the determination unit 34 may also be configured together as an analysis determination unit 36, which works in the same manner as the analysis determination unit 15 of the speech synthesis dictionary creating apparatus 1a (Fig. 1).
The following is an explanation of the operations for creating a speech synthesis dictionary performed in the speech synthesis dictionary creating apparatus 3 according to the second embodiment. Fig. 6 is a flowchart for explaining the operations for creating a speech synthesis dictionary performed in the speech synthesis dictionary creating apparatus 3 according to the second embodiment.
As shown in Fig. 6, in step 200 (S200), the first voice input unit 10 inputs the first voice data to the analysis unit 33, and the voice input unit 31 inputs arbitrary voice data to the detection unit 32 (voice input).
In step 202 (S202), the detection unit 32 detects the authentication information.
In step 204 (S204), the speech synthesis dictionary creating apparatus 3 determines, for example, whether the detection unit 32 has detected the authentication information in the arbitrary voice data. In the speech synthesis dictionary creating apparatus 3, if the detection unit 32 has detected the authentication information (Yes at S204), the system control proceeds to S206. On the other hand, in the speech synthesis dictionary creating apparatus 3, if the detection unit 32 has not detected the authentication information (No at S204), the operations are ended.
In step 206 (S206), the analysis unit 33 extracts the feature quantity of the first voice data and the feature quantity of the second voice data (analysis).
In step 208 (S208), the determination unit 34 compares the feature quantity of the first voice data with the feature quantity of the second voice data, and thereby determines whether the speaker of the first voice data is identical to the speaker of the second voice data.
In step 210 (S210), in the speech synthesis dictionary creating apparatus 3, if the determination unit 34 has determined at S208 that the speaker of the first voice data is identical to the speaker of the second voice data (Yes at S210), then, on the premise that the voice is appropriate, the system control proceeds to S212. On the other hand, in the speech synthesis dictionary creating apparatus 3, if the determination unit 34 has determined at S208 that the speaker of the first voice data is different from the speaker of the second voice data (No at S210), then, on the premise that the voice is inappropriate, the operations are ended.
In step 212 (S212), the creating unit 16 creates a speech synthesis dictionary corresponding to the first voice data (and the second voice data) determined to be appropriate by the determination unit 34, and outputs the speech synthesis dictionary to the second storage unit 17.
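The flow of steps S200 through S212 can be sketched as the following orchestration function. As before, the callables are hypothetical stand-ins for the units of the apparatus: `detect_auth(voice)` stands in for the detection unit 32 and returns True when the embedded authentication information is found, and `build_dictionary(voice)` stands in for the creating unit 16.

```python
def create_dictionary_v2(first_voice, any_voice, detect_auth,
                         extract_features, same_speaker, build_dictionary):
    """Sketch of the S200-S212 flow; returns the dictionary or None."""
    if not detect_auth(any_voice):               # S202/S204: watermark check
        return None                              # no authentication info
    second_voice = any_voice                     # regarded as appropriate
    f1 = extract_features(first_voice)           # S206: analysis
    f2 = extract_features(second_voice)
    if not same_speaker(f1, f2):                 # S208/S210: speaker check
        return None                              # voice is inappropriate
    return build_dictionary(first_voice)         # S212: create dictionary
```

Compared with the first embodiment, the presented-text step is replaced by the watermark check, so the trusted recording device, rather than a random prompt, establishes that the second voice data is appropriate.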
Fig. 7 is a diagram schematically showing an example of the operations performed in a speech synthesis dictionary creating system 300 that includes the speech synthesis dictionary creating apparatus 3. The speech synthesis dictionary creating system 300 includes the speech synthesis dictionary creating apparatus 3, and performs input and output of data (voice data) via a network (not shown). That is, the speech synthesis dictionary creating system 300 is a system that creates a speech synthesis dictionary using the voice uploaded by a user and provides the speech synthesis dictionary.
Referring to Fig. 7, first voice data 40 represents the voice data generated when a person A or a person B speaks an arbitrary number of texts having arbitrary contents. The first voice data 40 is received by the first voice input unit 10.
For example, the person A reads aloud the text "the advanced television is 50 inches in size" presented by a recording device 42 that includes an authentication information embedding unit, and performs voice recording. The text spoken by the person A, with the authentication information embedded in it, constitutes an authentication-information-embedded voice 44. The authentication-information-embedded voice (second voice data) is thus regarded as voice data recorded by the pre-specified recording device capable of embedding the authentication information in voice data. That is, the authentication-information-embedded voice is regarded as appropriate voice data.
The speech synthesis dictionary creating system 300 compares the feature quantity of the first voice data 40 with the feature quantity of the authentication-information-embedded voice (second voice data) 44, and thereby determines whether the speaker of the first voice data 40 is identical to the speaker of the authentication-information-embedded voice (second voice data) 44.
If the speaker of the first voice data 40 is identical to the speaker of the authentication-information-embedded voice (second voice data) 44, the speech synthesis dictionary creating system 300 creates a speech synthesis dictionary and, for example, shows a display 46 to the user as a notification that the speech synthesis dictionary has been created. On the other hand, if the speaker of the first voice data 40 is different from the speaker of the authentication-information-embedded voice (second voice data) 44, the speech synthesis dictionary creating system 300 rejects the first voice data 40 and, for example, shows a display 48 to the user as a notification that no speech synthesis dictionary has been created.
In this way, in the speech synthesis dictionary creating apparatus according to each embodiment, since it is determined whether the speaker of the first voice data is identical to the speaker of the second voice data, which is regarded as appropriate voice data, it is possible to prevent a speech synthesis dictionary from being created in a fraudulent manner.
While certain embodiments of the present invention have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the invention. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions, and changes in the form of the methods and systems described herein may be made without departing from the spirit of the invention. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the invention.
List of reference numerals
1a, 1b, 3: speech synthesis dictionary creating apparatus
10: first voice input unit
11: first storage unit
12: control unit
13: display unit
14: second voice input unit
15: analysis determination unit
16: creating unit
17: second storage unit
18: text input unit
31: voice input unit
32: detection unit
33: analysis unit
34: determination unit
35: second voice input unit
36: analysis determination unit
100, 300: speech synthesis dictionary creating system