Specific embodiment
First embodiment
A speech synthesis dictionary creating apparatus according to a first embodiment is described below with reference to the drawings. Fig. 1 is a configuration diagram showing the configuration of a speech synthesis dictionary creating apparatus 1a according to the first embodiment. Here, the speech synthesis dictionary creating apparatus 1a is implemented using, for example, a general-purpose computer. That is, the speech synthesis dictionary creating apparatus 1a has the functions of a computer including, for example, a CPU, a memory device, an input/output device, and a communication interface.
As shown in Fig. 1, the speech synthesis dictionary creating apparatus 1a includes a first voice input unit 10, a first storage unit 11, a control unit 12, a display unit 13, a second voice input unit 14, an analysis determination unit 15, a creating unit 16, and a second storage unit 17. Here, the first voice input unit 10, the control unit 12, the display unit 13, the second voice input unit 14, and the analysis determination unit 15 can be configured using hardware or using software executed by the CPU. The first storage unit 11 and the second storage unit 17 are configured using, for example, an HDD (hard disk drive) or a memory. Thus, the speech synthesis dictionary creating apparatus 1a may be configured such that its functions are realized by executing a speech synthesis dictionary creating program.
The first voice input unit 10 receives voice data of an arbitrary user (first voice data), for example, via a communication interface (not shown), and inputs the voice data to the analysis determination unit 15. The first voice input unit 10 may include hardware such as a communication interface and a microphone.
The first storage unit 11 stores a plurality of texts (i.e., texts to be read aloud), and outputs any one of the stored texts in response to control by the control unit 12. The control unit 12 controls the constituent units of the speech synthesis dictionary creating apparatus 1a. The control unit 12 selects any one of the texts stored in the first storage unit 11, reads the selected text from the first storage unit 11, and outputs the read text to the display unit 13.
The display unit 13 receives, via the control unit 12, one of the texts stored in the first storage unit 11, and presents the received text to the user. Here, the display unit 13 presents the texts stored in the first storage unit 11 in a random manner. Moreover, the display unit 13 presents a text only for a predetermined period (for example, about several seconds to one minute). The display unit 13 can be, for example, a display device, a loudspeaker, or a communication interface. That is, so that the user can recognize and speak the selected text, the display unit 13 presents the text either by displaying it or by outputting a recorded speech of the text.
When an arbitrary user reads aloud the text presented by the display unit 13, the second voice input unit 14 receives the resulting voice data as appropriate voice data (second voice data) and inputs it to the analysis determination unit 15. Here, the second voice input unit 14 may receive the second voice data via, for example, a communication interface (not shown). The second voice input unit 14 may include hardware, such as a communication interface and a microphone, shared with the first voice input unit 10, or may include shared software.
Upon receiving the first voice data via the first voice input unit 10, the analysis determination unit 15 causes the control unit 12 to operate so that the display unit 13 presents a text. Then, upon receiving the second voice data via the second voice input unit 14, the analysis determination unit 15 compares a feature quantity of the first voice data with a feature quantity of the second voice data, and thereby determines whether the speaker of the first voice data is identical to the speaker of the second voice data.
For example, the analysis determination unit 15 performs speech recognition on the first voice data and the second voice data, and generates texts corresponding to the first voice data and the second voice data, respectively. The analysis determination unit 15 may also perform a voice quality check on the second voice data to determine whether the signal-to-noise ratio (SNR) and the amplitude are equal to or greater than predetermined thresholds. Furthermore, the analysis determination unit 15 compares the feature quantities of the first voice data and the second voice data based on at least one of the following properties: the amplitude value; the average or variance of the fundamental frequency (F0); the correlation of spectrum envelope extraction results; and the word accuracy and word recognition rate of the speech recognition. Here, examples of spectrum envelope extraction methods include linear predictive coefficients (LPC), mel-frequency cepstral coefficients, line spectral pairs (LSP), mel-LPC, and mel-LSP.
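As a concrete illustration of the feature quantities listed above, the following sketch extracts the overall amplitude and the average and variance of the fundamental frequency (F0) from a mono signal. It is a minimal example, not the apparatus's actual implementation: the frame size, hop, F0 search range, and the autocorrelation-based pitch estimate are all assumptions chosen for illustration.

```python
import numpy as np

def extract_features(signal, sr=16000, frame=1024, hop=512,
                     f0_min=60.0, f0_max=400.0):
    """Return (rms_amplitude, f0_mean, f0_variance) for a mono signal.

    F0 per frame is estimated from the autocorrelation peak inside the
    lag range corresponding to [f0_min, f0_max] Hz (a crude estimator).
    """
    f0s = []
    for start in range(0, len(signal) - frame, hop):
        x = signal[start:start + frame].astype(np.float64)
        x = x - x.mean()
        if np.max(np.abs(x)) < 1e-4:          # skip near-silent frames
            continue
        ac = np.correlate(x, x, mode="full")[frame - 1:]
        lo, hi = int(sr / f0_max), int(sr / f0_min)
        lag = lo + int(np.argmax(ac[lo:hi]))
        if ac[lag] > 0.3 * ac[0]:             # crude voicing decision
            f0s.append(sr / lag)
    rms = float(np.sqrt(np.mean(signal.astype(np.float64) ** 2)))
    if not f0s:
        return rms, 0.0, 0.0
    return rms, float(np.mean(f0s)), float(np.var(f0s))
```

A production system would instead use a robust pitch tracker and one of the spectrum envelope representations named above (e.g., mel-frequency cepstral coefficients).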
The analysis determination unit 15 then compares the feature quantity of the first voice data with the feature quantity of the second voice data. If the difference between the feature quantity of the first voice data and the feature quantity of the second voice data is equal to or less than a predetermined threshold, or if the correlation between the two feature quantities is equal to or greater than a predetermined threshold, the analysis determination unit 15 determines that the speaker of the first voice data is identical to the speaker of the second voice data. Here, it is assumed that the thresholds used in the determination by the analysis determination unit 15 are set by learning in advance the averages and variances of the feature quantities of the same person, or by learning speech recognition results from a large amount of data in advance.
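The threshold comparison described above can be sketched as follows. This is a schematic, not the claimed determination logic: the per-dimension normalization and the value of `dist_threshold` are assumptions, and in practice the threshold would be learned from same-speaker data as the text describes.

```python
import numpy as np

def same_speaker(feat1, feat2, dist_threshold=1.0):
    """Decide whether two feature vectors likely come from the same speaker.

    feat1, feat2: 1-D feature vectors (e.g. amplitude, F0 mean, F0 variance).
    dist_threshold: hypothetical value; normally tuned on training data.
    """
    f1 = np.asarray(feat1, dtype=np.float64)
    f2 = np.asarray(feat2, dtype=np.float64)
    scale = (np.abs(f1) + np.abs(f2)) / 2.0 + 1e-9   # normalize each dimension
    diff = np.linalg.norm((f1 - f2) / scale)
    return diff <= dist_threshold
```

With this rule, a pair of similar feature vectors is accepted as the same speaker, and a clearly dissimilar pair is rejected.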
When it is determined that the speaker of the first voice data is identical to the speaker of the second voice data, the analysis determination unit 15 determines that the voice is appropriate. The analysis determination unit 15 then outputs to the creating unit 16, as appropriate voice data, the first voice data (and the second voice data) whose speaker has been determined to be identical to the speaker of the second voice data. The analysis determination unit 15 may also be divided into an analysis unit that analyzes the first voice data and the second voice data and a determination unit that performs the determination.
The creating unit 16 implements a speech recognition technique, and creates a text of the spoken content from the first voice data received via the analysis determination unit 15. The creating unit 16 then creates a speech synthesis dictionary using the created text and the first voice data, and outputs the speech synthesis dictionary to the second storage unit 17. The second storage unit 17 thus stores the speech synthesis dictionary received from the creating unit 16.
Variation of the first embodiment
Fig. 2 is a configuration diagram showing the configuration of a variation of the speech synthesis dictionary creating apparatus 1a according to the first embodiment shown in Fig. 1 (a configuration example of a speech synthesis dictionary creating apparatus 1b). As shown in Fig. 2, the speech synthesis dictionary creating apparatus 1b includes the first voice input unit 10, the first storage unit 11, the control unit 12, the display unit 13, the second voice input unit 14, the analysis determination unit 15, the creating unit 16, the second storage unit 17, and a text input unit 18. In the speech synthesis dictionary creating apparatus 1b, constituent units that are substantially identical to those of the speech synthesis dictionary creating apparatus 1a are referred to by the same reference numerals.
The text input unit 18 receives a text corresponding to the first voice data via, for example, a communication interface (not shown), and inputs the text to the analysis determination unit 15. Here, the text input unit 18 can be configured using hardware, such as an input device capable of receiving text input, or can be configured using software.
The analysis determination unit 15 treats the voice data obtained when the user speaks the text input to the text input unit 18 as the first voice data, and determines whether the speaker of the first voice data is identical to the speaker of the second voice data. The creating unit 16 then creates a speech synthesis dictionary using the voice determined to be appropriate by the analysis determination unit 15 and the text input to the text input unit 18. Thus, since the speech synthesis dictionary creating apparatus 1b includes the text input unit 18, there is no need to create the text by performing speech recognition. This makes it possible to reduce the processing load.
The following is an explanation of the operations for creating a speech synthesis dictionary performed in the speech synthesis dictionary creating apparatus 1a according to the first embodiment (or in the speech synthesis dictionary creating apparatus 1b). Fig. 3 is a flowchart for explaining the operations for creating a speech synthesis dictionary performed in the speech synthesis dictionary creating apparatus 1a (or in the speech synthesis dictionary creating apparatus 1b).
As shown in Fig. 3, in step 100 (S100), the first voice input unit 10 receives an input of the first voice data via, for example, a communication interface (not shown), and inputs the first voice data to the analysis determination unit 15 (first voice input).
In step 102 (S102), the display unit 13 presents a text to be read aloud to the user.
In step 104 (S104), the second voice input unit 14 receives the voice data obtained when, for example, the user reads aloud the text presented by the display unit 13, as appropriate voice data (second voice data), and inputs the second voice data to the analysis determination unit 15.
In step 106 (S106), the analysis determination unit 15 extracts the feature quantity of the first voice data and the feature quantity of the second voice data.
In step 108 (S108), the analysis determination unit 15 compares the feature quantity of the first voice data with the feature quantity of the second voice data, and thereby determines whether the speaker of the first voice data is identical to the speaker of the second voice data. In the speech synthesis dictionary creating apparatus 1a (or the speech synthesis dictionary creating apparatus 1b), if the analysis determination unit 15 determines that the speaker of the first voice data is identical to the speaker of the second voice data (Yes at S108), then, on the premise that the voice is appropriate, the system control proceeds to S110. If the analysis determination unit 15 determines that the speaker of the first voice data is different from the speaker of the second voice data (No at S108), the speech synthesis dictionary creating apparatus 1a (or the speech synthesis dictionary creating apparatus 1b) ends the operations.
In step 110 (S110), the creating unit 16 creates a speech synthesis dictionary using the first voice data (and the second voice data) determined to be appropriate by the analysis determination unit 15 and the text corresponding to the first voice data (and the second voice data), and outputs the speech synthesis dictionary to the second storage unit 17.
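The flow of steps S100 through S110 can be sketched as the following orchestration function. All the callables passed in are hypothetical stand-ins for the units of the apparatus, not real APIs: `present_text()` returns the prompt text (display unit 13), `record_second_voice(text)` returns the user's read-aloud recording (second voice input unit 14), and `build_dictionary(voice, text)` stands in for the creating unit 16.

```python
def create_dictionary(first_voice, present_text, record_second_voice,
                      extract_features, same_speaker, build_dictionary):
    """Sketch of the S100-S110 flow; returns the dictionary or None."""
    text = present_text()                        # S102: present a text
    second_voice = record_second_voice(text)     # S104: second voice input
    f1 = extract_features(first_voice)           # S106: feature extraction
    f2 = extract_features(second_voice)
    if not same_speaker(f1, f2):                 # S108: speaker check
        return None                              # speakers differ: reject
    return build_dictionary(first_voice, text)   # S110: create dictionary
```

The point of the structure is that dictionary creation is unreachable unless the speaker check passes, which is the anti-impersonation guarantee of the embodiment.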
Fig. 4 schematically shows an example of the operations performed in a speech synthesis dictionary creating system 100 that includes the speech synthesis dictionary creating apparatus 1a. The speech synthesis dictionary creating system 100 includes the speech synthesis dictionary creating apparatus 1a, and performs input and output of data (voice data and texts) via a network (not shown). That is, the speech synthesis dictionary creating system 100 is a system that creates a speech synthesis dictionary using the voice uploaded by a user of the system and provides the speech synthesis dictionary.
Referring to Fig. 4, first voice data 20 represents the voice data generated when a person A speaks an arbitrary number of texts having arbitrary contents. The first voice data 20 is received by the first voice input unit 10.
A presentation example 22 represents a text, such as "the advanced television is 50 inches in size", that the speech synthesis dictionary creating apparatus 1a presents and prompts the user to speak. Second voice data 24 represents the voice data obtained when the user reads aloud the text presented by the speech synthesis dictionary creating apparatus 1a. The second voice data 24 is input to the second voice input unit 14. In a voice obtained via television or the Internet, it is unlikely that the text randomly presented by the speech synthesis dictionary creating apparatus 1a has been spoken. The second voice input unit 14 therefore treats the received voice data as appropriate voice data, and outputs it to the analysis determination unit 15.
The analysis determination unit 15 compares the feature quantity of the first voice data 20 with the feature quantity of the second voice data 24, and thereby determines whether the speaker of the first voice data 20 is identical to the speaker of the second voice data 24.
If the speaker of the first voice data 20 is identical to the speaker of the second voice data 24, the speech synthesis dictionary creating system 100 creates a speech synthesis dictionary and, for example, shows a display 26 to the user as a notification that the speech synthesis dictionary has been created. On the other hand, if the speaker of the first voice data 20 is different from the speaker of the second voice data 24, the speech synthesis dictionary creating system 100 rejects the first voice data 20 and, for example, shows a display 28 to the user as a notification that no speech synthesis dictionary has been created.
Second embodiment
The following is an explanation of a speech synthesis dictionary creating apparatus according to a second embodiment. Fig. 5 is a configuration diagram showing the configuration of a speech synthesis dictionary creating apparatus 3 according to the second embodiment. Here, the speech synthesis dictionary creating apparatus 3 is implemented using, for example, a general-purpose computer. That is, the speech synthesis dictionary creating apparatus 3 has the functions of a computer including, for example, a CPU, a memory device, an input/output device, and a communication interface.
As shown in Fig. 5, the speech synthesis dictionary creating apparatus 3 includes the first voice input unit 10, a voice input unit 31, a detection unit 32, an analysis unit 33, a determination unit 34, the creating unit 16, and the second storage unit 17. In the speech synthesis dictionary creating apparatus 3 shown in Fig. 5, constituent units that are substantially identical to the constituent units of the speech synthesis dictionary creating apparatus 1a shown in Fig. 1 are referred to by the same reference numerals.
The voice input unit 31, the detection unit 32, the analysis unit 33, and the determination unit 34 can be configured using hardware or using software executed by the CPU. Thus, the speech synthesis dictionary creating apparatus 3 may be configured such that its functions are realized by executing a speech synthesis dictionary creating program.
The voice input unit 31 inputs arbitrary voice data to the detection unit 32, such as voice data recorded by a voice recording device capable of embedding authentication information, and voice data recorded by some other recording device.
Here, the voice recording device capable of embedding authentication information may embed the authentication information in a continuous but random manner over, for example, the entire voice, or in specified text contents or text numbers. Examples of the embedding method include encryption using a public key or a shared key, and digital watermarking. When the authentication information represents encryption, the speech waveform is encrypted (waveform encryption). Digital watermarking applied to voice includes echo hiding, which exploits continuous masking; a spread spectrum method in which the amplitude spectrum is manipulated to embed information; a patchwork method; and a method in which information is embedded by phase modulation.
The detection unit 32 detects the authentication information included in the voice data received by the voice input unit 31. That is, the detection unit 32 extracts the authentication information from the voice data in which the authentication information is embedded. When waveform encryption is implemented as the embedding method, the detection unit 32 can be configured to perform decryption using a private key. When the authentication information represents a digital watermark, the detection unit 32 obtains the bit information according to a decoding procedure.
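To make the watermarking concrete, the following sketch shows one of the methods mentioned above, echo hiding: an embedder adds a faint delayed copy of the signal, with the echo lag encoding one bit, and a detector recovers the bit from the cepstral peak at the echo lag. The lags and echo strength are assumed values for illustration, not parameters of the claimed apparatus, and a real watermark would embed many bits with synchronization and robustness measures.

```python
import numpy as np

DELAY_BIT = {0: 100, 1: 150}   # echo lags in samples (hypothetical values)

def embed_bit(signal, bit, alpha=0.4):
    """Embed one authentication bit by adding a faint delayed echo."""
    d = DELAY_BIT[bit]
    out = signal.astype(np.float64).copy()
    out[d:] += alpha * signal[:-d]
    return out

def detect_bit(signal):
    """Recover the bit: the echo produces a peak in the real cepstrum
    at the quefrency equal to the echo lag."""
    spec = np.fft.rfft(signal)
    cep = np.fft.irfft(np.log(np.abs(spec) + 1e-12))
    return 0 if cep[DELAY_BIT[0]] > cep[DELAY_BIT[1]] else 1
```

The detection side corresponds to the role of the detection unit 32: only voice data in which the expected watermark bits are found is treated as recorded by the specified device.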
When the authentication information is detected, the detection unit 32 regards the input voice data as voice data recorded by the specified voice recording device. In this way, the detection unit 32 sets the voice data in which the authentication information is detected as the second voice data regarded as appropriate, and outputs the second voice data to the analysis unit 33.
Furthermore, for example, the voice input unit 31 and the detection unit 32 may be integrated into a second voice input unit 35, which detects the authentication information included in arbitrary voice data and outputs the voice data in which the authentication information is detected as second voice data regarded as appropriate.
The analysis unit 33 receives the first voice data from the first voice input unit 10, receives the second voice data from the detection unit 32, analyzes the first voice data and the second voice data, and outputs the analysis results to the determination unit 34.
For example, the analysis unit 33 performs speech recognition on the first voice data and the second voice data, and generates a text corresponding to the first voice data and a text corresponding to the second voice data. The analysis unit 33 may also perform a voice quality check on the second voice data to determine whether the signal-to-noise ratio (SNR) and the amplitude are equal to or greater than predetermined thresholds. Furthermore, the analysis unit 33 extracts feature quantities of the first voice data and the second voice data based on at least one of the following properties: the amplitude; the average or variance of the fundamental frequency (F0); the correlation of spectrum envelope extraction results; and the word accuracy and word recognition rate of the speech recognition. The spectrum envelope extraction methods may be the same as those implemented by the analysis determination unit 15 (Fig. 2).
The determination unit 34 receives the feature quantities calculated by the analysis unit 33. The determination unit 34 then compares the feature quantity of the first voice data with the feature quantity of the second voice data, and thereby determines whether the speaker of the first voice data is identical to the speaker of the second voice data. For example, if the difference between the feature quantity of the first voice data and the feature quantity of the second voice data is equal to or less than a predetermined threshold, or if the correlation between the two feature quantities is equal to or greater than a predetermined threshold, the determination unit 34 determines that the speaker of the first voice data is identical to the speaker of the second voice data. Here, it is assumed that the thresholds used in the determination by the determination unit 34 are set by learning in advance the averages and variances of the feature quantities of the same person, or by learning speech recognition results from a large amount of data in advance.
If it is determined that the speaker of the first voice data is identical to the speaker of the second voice data, the determination unit 34 determines that the voice is appropriate. The determination unit 34 then outputs to the creating unit 16, as appropriate voice data, the first voice data (and the second voice data) whose speaker has been determined to be identical to the speaker of the second voice data. The analysis unit 33 and the determination unit 34 may also be configured together as an analysis determination unit 36, which works in the same manner as the analysis determination unit 15 of the speech synthesis dictionary creating apparatus 1a (Fig. 1).
The following is an explanation of the operations for creating a speech synthesis dictionary performed in the speech synthesis dictionary creating apparatus 3 according to the second embodiment. Fig. 6 is a flowchart for explaining the operations for creating a speech synthesis dictionary performed in the speech synthesis dictionary creating apparatus 3 according to the second embodiment.
As shown in Fig. 6, in step 200 (S200), the first voice input unit 10 inputs the first voice data to the analysis unit 33, and the voice input unit 31 inputs arbitrary voice data to the detection unit 32 (voice input).
In step 202 (S202), the detection unit 32 detects the authentication information.
In step 204 (S204), the speech synthesis dictionary creating apparatus 3 determines, for example, whether the detection unit 32 has detected the authentication information in the arbitrary voice data. In the speech synthesis dictionary creating apparatus 3, if the detection unit 32 has detected the authentication information (Yes at S204), the system control proceeds to S206. On the other hand, in the speech synthesis dictionary creating apparatus 3, if the detection unit 32 has not detected the authentication information (No at S204), the operations are ended.
In step 206 (S206), the analysis unit 33 extracts the feature quantity of the first voice data and the feature quantity of the second voice data (analysis).
In step 208 (S208), the determination unit 34 compares the feature quantity of the first voice data with the feature quantity of the second voice data, and thereby determines whether the speaker of the first voice data is identical to the speaker of the second voice data.
In step 210 (S210), in the speech synthesis dictionary creating apparatus 3, if the determination unit 34 has determined at S208 that the speaker of the first voice data is identical to the speaker of the second voice data (Yes at S210), then, on the premise that the voice is appropriate, the system control proceeds to S212. On the other hand, in the speech synthesis dictionary creating apparatus 3, if the determination unit 34 has determined at S208 that the speaker of the first voice data is different from the speaker of the second voice data (No at S210), then, on the premise that the voice is inappropriate, the operations are ended.
In step 212 (S212), the creating unit 16 creates a speech synthesis dictionary corresponding to the first voice data (and the second voice data) determined to be appropriate by the determination unit 34, and outputs the speech synthesis dictionary to the second storage unit 17.
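The flow of steps S200 through S212 can be sketched as the following orchestration function. As before, the callables are hypothetical stand-ins for the units of the apparatus: `detect_auth(voice)` stands in for the detection unit 32 and returns True when the embedded authentication information is found, and `build_dictionary(voice)` stands in for the creating unit 16.

```python
def create_dictionary_v2(first_voice, any_voice, detect_auth,
                         extract_features, same_speaker, build_dictionary):
    """Sketch of the S200-S212 flow; returns the dictionary or None."""
    if not detect_auth(any_voice):               # S202/S204: watermark check
        return None                              # no authentication info
    second_voice = any_voice                     # regarded as appropriate
    f1 = extract_features(first_voice)           # S206: analysis
    f2 = extract_features(second_voice)
    if not same_speaker(f1, f2):                 # S208/S210: speaker check
        return None                              # voice is inappropriate
    return build_dictionary(first_voice)         # S212: create dictionary
```

Compared with the first embodiment, the presented-text step is replaced by the watermark check, so the trusted recording device, rather than a random prompt, establishes that the second voice data is appropriate.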
Fig. 7 is a diagram schematically showing an example of the operations performed in a speech synthesis dictionary creating system 300 that includes the speech synthesis dictionary creating apparatus 3. The speech synthesis dictionary creating system 300 includes the speech synthesis dictionary creating apparatus 3, and performs input and output of data (voice data) via a network (not shown). That is, the speech synthesis dictionary creating system 300 is a system that creates a speech synthesis dictionary using the voice uploaded by a user and provides the speech synthesis dictionary.
Referring to Fig. 7, first voice data 40 represents the voice data generated when a person A or a person B speaks an arbitrary number of texts having arbitrary contents. The first voice data 40 is received by the first voice input unit 10.
For example, the person A reads aloud the text "the advanced television is 50 inches in size" presented by a recording device 42 that includes an authentication information embedding unit, and performs voice recording. The text spoken by the person A, with the authentication information embedded in it, constitutes an authentication-information-embedded voice 44. The authentication-information-embedded voice (second voice data) is thus regarded as voice data recorded by the pre-specified recording device capable of embedding the authentication information in voice data. That is, the authentication-information-embedded voice is regarded as appropriate voice data.
The speech synthesis dictionary creating system 300 compares the feature quantity of the first voice data 40 with the feature quantity of the authentication-information-embedded voice (second voice data) 44, and thereby determines whether the speaker of the first voice data 40 is identical to the speaker of the authentication-information-embedded voice (second voice data) 44.
If the speaker of the first voice data 40 is identical to the speaker of the authentication-information-embedded voice (second voice data) 44, the speech synthesis dictionary creating system 300 creates a speech synthesis dictionary and, for example, shows a display 46 to the user as a notification that the speech synthesis dictionary has been created. On the other hand, if the speaker of the first voice data 40 is different from the speaker of the authentication-information-embedded voice (second voice data) 44, the speech synthesis dictionary creating system 300 rejects the first voice data 40 and, for example, shows a display 48 to the user as a notification that no speech synthesis dictionary has been created.
In this way, in the speech synthesis dictionary creating apparatus according to each embodiment, since it is determined whether the speaker of the first voice data is identical to the speaker of the second voice data, which is regarded as appropriate voice data, it is possible to prevent a speech synthesis dictionary from being created in a fraudulent manner.
While certain embodiments of the present invention have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the invention. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions, and changes in the form of the methods and systems described herein may be made without departing from the spirit of the invention. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the invention.
List of reference numerals
1a, 1b, 3: speech synthesis dictionary creating apparatus
10: first voice input unit
11: first storage unit
12: control unit
13: display unit
14: second voice input unit
15: analysis determination unit
16: creating unit
17: second storage unit
18: text input unit
31: voice input unit
32: detection unit
33: analysis unit
34: determination unit
35: second voice input unit
36: analysis determination unit
100, 300: speech synthesis dictionary creating system