CN105340003B - Speech synthesis dictionary creating apparatus and speech synthesis dictionary creating method - Google Patents


Info

Publication number
CN105340003B
CN105340003B
Authority
CN
China
Prior art keywords
voice
data
voice data
speech
unit
Prior art date
Legal status
Active
Application number
CN201380077502.8A
Other languages
Chinese (zh)
Other versions
CN105340003A (en)
Inventor
橘健太郎 (Kentaro Tachibana)
森田真弘 (Masahiro Morita)
笼岛岳彦 (Takehiko Kagoshima)
Current Assignee
Color Sound Station Co ltd
Original Assignee
Toshiba Corp
Priority date
Filing date
Publication date
Application filed by Toshiba Corp
Publication of CN105340003A
Application granted
Publication of CN105340003B


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/06 Elementary speech units used in speech synthesisers; Concatenation rules

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A speech synthesis dictionary creating apparatus according to an embodiment includes a first voice input unit, a second voice input unit, a determination unit, and a creating unit. The first voice input unit receives input of first voice data. The second voice input unit receives input of second voice data that is regarded as appropriate voice data. The determination unit determines whether the speaker of the first voice data is identical to the speaker of the second voice data. When the determination unit determines that the speaker of the first voice data is identical to the speaker of the second voice data, the creating unit creates a speech synthesis dictionary using the first voice data and a text corresponding to the first voice data.

Description

Speech synthesis dictionary creating apparatus and speech synthesis dictionary creating method
Technical field
Embodiments of the present invention relate to a speech synthesis dictionary creating apparatus and a speech synthesis dictionary creating method.
Background Art
In recent years, with improvements in the quality of speech synthesis technology, the range of uses of speech synthesis has expanded rapidly, for example in car navigation systems, in voicemail reading applications for cellular phones, and in voice assistant applications. In addition, services have been provided that create a speech synthesis dictionary from the voice of an ordinary user. With such a service, as long as recorded voice is available, a speech synthesis dictionary can be created from anyone's voice.
Patent Document 1: Japanese Laid-Open Patent Publication No. 2010-117528
Summary of the invention
However, if voice is obtained fraudulently, for example from television or the Internet, a speech synthesis dictionary may be created by impersonating another person, and such a speech synthesis dictionary carries a risk of abuse. An object of the present invention is therefore to provide a speech synthesis dictionary creating apparatus and a speech synthesis dictionary creating method that make it possible to prevent a speech synthesis dictionary from being created in a fraudulent manner.
According to an embodiment, a speech synthesis dictionary creating apparatus includes a first voice input unit, a second voice input unit, a determination unit, and a creating unit. The first voice input unit receives input of first voice data. The second voice input unit receives input of second voice data that is regarded as appropriate voice data. The determination unit determines whether the speaker of the first voice data is identical to the speaker of the second voice data. When the determination unit determines that the speaker of the first voice data is identical to the speaker of the second voice data, the creating unit creates a speech synthesis dictionary using the first voice data and a text corresponding to the first voice data.
Brief Description of the Drawings
Fig. 1 is a block diagram showing the configuration of a speech synthesis dictionary creating apparatus according to a first embodiment;
Fig. 2 is a block diagram showing the configuration of a modification of the speech synthesis dictionary creating apparatus according to the first embodiment;
Fig. 3 is a flowchart for explaining the operation for creating a speech synthesis dictionary that is performed in the speech synthesis dictionary creating apparatus according to the first embodiment;
Fig. 4 is a diagram schematically illustrating an example of operations performed in a speech synthesis dictionary creating system that includes the speech synthesis dictionary creating apparatus according to the first embodiment;
Fig. 5 is a block diagram showing the configuration of a speech synthesis dictionary creating apparatus according to a second embodiment;
Fig. 6 is a flowchart for explaining the operation for creating a speech synthesis dictionary that is performed in the speech synthesis dictionary creating apparatus according to the second embodiment;
Fig. 7 is a diagram schematically illustrating an example of operations performed in a speech synthesis dictionary creating system that includes the speech synthesis dictionary creating apparatus according to the second embodiment.
Description of Embodiments
First embodiment
A speech synthesis dictionary creating apparatus according to a first embodiment is described below with reference to the drawings. Fig. 1 is a block diagram showing the configuration of the speech synthesis dictionary creating apparatus 1a according to the first embodiment. Here, the speech synthesis dictionary creating apparatus 1a is implemented using, for example, a general-purpose computer. That is, the speech synthesis dictionary creating apparatus 1a has the functions of a computer that includes, for example, a CPU, a memory device, an input/output device, and a communication interface.
As shown in Fig. 1, the speech synthesis dictionary creating apparatus 1a includes a first voice input unit 10, a first storage unit 11, a control unit 12, a display unit 13, a second voice input unit 14, an analysis determination unit 15, a creating unit 16, and a second storage unit 17. Here, the first voice input unit 10, the control unit 12, the display unit 13, the second voice input unit 14, and the analysis determination unit 15 may be configured using hardware or may be configured using software executed by the CPU. The first storage unit 11 and the second storage unit 17 are configured using, for example, an HDD (hard disk drive) or a memory. Thus, the speech synthesis dictionary creating apparatus 1a may be configured such that its functions are realized by executing a speech synthesis dictionary creating program.
The first voice input unit 10 receives, for example via a communication interface (not shown), voice data of an arbitrary user (first voice data), and inputs the voice data to the analysis determination unit 15. The first voice input unit 10 may include hardware such as a communication interface and a microphone.
The first storage unit 11 stores a plurality of texts (or recorded texts) therein, and outputs any one of the stored texts in response to control by the control unit 12. The control unit 12 controls the constituent units of the speech synthesis dictionary creating apparatus 1a. In addition, the control unit 12 selects any one of the texts stored in the first storage unit 11, reads the selected text from the first storage unit 11, and outputs the read text to the display unit 13.
The display unit 13 receives, via the control unit 12, any one of the texts stored in the first storage unit 11, and presents the received text to the user. Here, the display unit 13 presents the texts stored in the first storage unit 11 in a random manner. In addition, the display unit 13 presents a text only for a predetermined period of time (for example, about several seconds to one minute). The display unit 13 may be, for example, a display device, a loudspeaker, or a communication interface. That is, the display unit 13 presents a text by displaying it or by outputting the recorded text as speech, so that the user can recognize and speak the selected text.
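Purely as an illustration of this presentation step, and not part of the patented apparatus, a minimal sketch might select a prompt at random and withdraw it after a fixed period; the prompt list and the 30-second timeout below are assumed values.

```python
import random
import time

# Stand-ins for the recorded texts held in the first storage unit 11 (assumed examples).
PROMPTS = [
    "The advanced television is 50 inches in size.",
    "Please confirm the reservation for three people.",
    "The weather station reported light rain this morning.",
]

def present_prompt(timeout_s: float = 30.0) -> str:
    """Select a text at random, present it for a limited period, then withdraw it."""
    prompt = random.choice(PROMPTS)
    print(f"Please read aloud: {prompt}")
    time.sleep(timeout_s)  # the prompt is only valid for this window
    print("Prompt withdrawn.")
    return prompt
```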
When an arbitrary user reads aloud the text presented by the display unit 13, for example, the second voice input unit 14 receives the resulting voice data as appropriate voice data (second voice data) and inputs it to the analysis determination unit 15. Here, the second voice input unit 14 can receive the second voice data via, for example, a communication interface (not shown). The second voice input unit 14 may include hardware, such as a communication interface and a microphone, shared with the first voice input unit 10, or may include shared software.
After the first voice data has been received via the first voice input unit 10, the analysis determination unit 15 causes the control unit 12 to operate so that the display unit 13 presents a text. Further, after the second voice data has been received via the second voice input unit 14, the analysis determination unit 15 compares a feature quantity of the first voice data with a feature quantity of the second voice data to determine whether the speaker of the first voice data is identical to the speaker of the second voice data.
For example, the analysis determination unit 15 performs speech recognition on the first voice data and the second voice data, and generates texts corresponding to the first voice data and the second voice data, respectively. In addition, the analysis determination unit 15 may perform a voice quality check on the second voice data to determine whether the signal-to-noise ratio (SNR) and the amplitude are equal to or greater than predetermined thresholds. Further, the analysis determination unit 15 compares feature quantities based on at least one of the following properties of the first voice data and the second voice data: the amplitude value, the average or variance of the fundamental frequency (F0), the correlation of spectrum envelope extraction results, and the word accuracy and word recognition rate of the speech recognition. Here, examples of spectrum envelope extraction methods include linear predictive coefficients (LPC), mel-frequency cepstral coefficients (MFCC), line spectral pairs (LSP), mel-LPC, and mel-LSP.
The analysis determination unit 15 then compares the feature quantity of the first voice data with the feature quantity of the second voice data. If the difference between the feature quantity of the first voice data and the feature quantity of the second voice data is equal to or smaller than a predetermined threshold, or if the correlation between the feature quantity of the first voice data and the feature quantity of the second voice data is equal to or greater than a predetermined threshold, the analysis determination unit 15 determines that the speaker of the first voice data is identical to the speaker of the second voice data. Here, it is assumed that the thresholds used in the determination by the analysis determination unit 15 are set by learning in advance the averages and variances of feature quantities of the same person, or by learning speech recognition results from a large amount of data in advance.
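The comparison just described can be prototyped as follows. This is a hedged sketch, not the patent's implementation: it assumes the open-source librosa library for MFCC and F0 extraction, and the 0.85 correlation threshold is an arbitrary stand-in for the learned threshold mentioned above.

```python
import numpy as np
import librosa  # assumed: open-source audio analysis library

def utterance_features(wav_path: str) -> np.ndarray:
    """One feature vector per utterance: mean MFCCs plus F0 mean and variance."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)  # spectrum envelope proxy
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)                    # fundamental frequency track
    return np.concatenate([mfcc, [f0.mean(), f0.var()]])

def same_speaker(first_wav: str, second_wav: str, threshold: float = 0.85) -> bool:
    """Declare the speakers identical when the feature correlation reaches the threshold."""
    a = utterance_features(first_wav)
    b = utterance_features(second_wav)
    corr = np.corrcoef(a, b)[0, 1]
    return corr >= threshold
```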
When determining that the speaker of the first voice data is identical to the speaker of the second voice data, the analysis determination unit 15 determines that the voice is appropriate. The analysis determination unit 15 then outputs to the creating unit 16, as appropriate voice data, the first voice data (and the second voice data) whose speaker has been determined to be identical to the speaker of the second voice data. The analysis determination unit 15 may also be divided into an analysis unit that analyzes the first voice data and the second voice data, and a determination unit that performs the determination.
The creating unit 16 implements speech recognition technology and creates a text of the spoken content from the first voice data received via the analysis determination unit 15. The creating unit 16 then creates a speech synthesis dictionary using the created text and the first voice data, and outputs the speech synthesis dictionary to the second storage unit 17. The second storage unit 17 thus stores therein the speech synthesis dictionary received from the creating unit 16.
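For orientation, the data flow of this step can be sketched as below; a real creating unit would hand the verified pair to a voice-building toolkit, which the patent does not name, so the manifest file here is only an illustrative stand-in.

```python
import json
from pathlib import Path

def create_dictionary(first_wav: str, transcript: str,
                      out_dir: Path = Path("dictionaries")) -> Path:
    """Record the verified (audio, text) pair that a dictionary trainer would consume.

    The transcript may come from speech recognition on the first voice data,
    or directly from the text input unit 18 in the modification below.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = out_dir / "manifest.json"
    manifest.write_text(json.dumps({"audio": first_wav, "text": transcript}, indent=2))
    return manifest
```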
Modification of the First Embodiment
Fig. 2 is a block diagram showing the configuration of a modification (a speech synthesis dictionary creating apparatus 1b) of the speech synthesis dictionary creating apparatus 1a according to the first embodiment shown in Fig. 1. As shown in Fig. 2, the speech synthesis dictionary creating apparatus 1b includes the first voice input unit 10, the first storage unit 11, the control unit 12, the display unit 13, the second voice input unit 14, the analysis determination unit 15, the creating unit 16, the second storage unit 17, and a text input unit 18. In the speech synthesis dictionary creating apparatus 1b, constituent units substantially identical to those of the speech synthesis dictionary creating apparatus 1a are denoted by the same reference numerals.
The text input unit 18 receives a text corresponding to the first voice data via, for example, a communication interface (not shown), and inputs the text to the analysis determination unit 15. Here, the text input unit 18 may be configured using hardware, such as an input device capable of receiving text input, or may be configured using software.
The analysis determination unit 15 treats the voice data obtained when the user speaks the text input to the text input unit 18 as the first voice data, and determines whether the speaker of the first voice data is identical to the speaker of the second voice data. The creating unit 16 then creates a speech synthesis dictionary using the first voice data determined to be appropriate by the analysis determination unit 15 and the text input to the text input unit 18. Thus, because the speech synthesis dictionary creating apparatus 1b includes the text input unit 18, there is no need to create the text by performing speech recognition, which makes it possible to reduce the processing load.
An explanation is given below of the operation for creating a speech synthesis dictionary performed in the speech synthesis dictionary creating apparatus 1a according to the first embodiment (or in the speech synthesis dictionary creating apparatus 1b). Fig. 3 is a flowchart for explaining this operation.
As shown in Fig. 3, in step 100 (S100), the first voice input unit 10 receives input of the first voice data via, for example, a communication interface (not shown), and inputs the first voice data to the analysis determination unit 15 (first voice input).
In step 102 (S102), the display unit 13 presents a recorded text (or a text) to the user.
In step 104 (S104), the second voice input unit 14 receives the voice data obtained when the text presented by the display unit 13 is read aloud by the user, for example, as appropriate voice data (second voice data), and inputs the second voice data to the analysis determination unit 15.
In step 106 (S106), the analysis determination unit 15 extracts the feature quantity of the first voice data and the feature quantity of the second voice data.
In step 108 (S108), the analysis determination unit 15 compares the feature quantity of the first voice data with the feature quantity of the second voice data, thereby determining whether the speaker of the first voice data is identical to the speaker of the second voice data. In the speech synthesis dictionary creating apparatus 1a (or the speech synthesis dictionary creating apparatus 1b), if the analysis determination unit 15 determines that the speaker of the first voice data is identical to the speaker of the second voice data (Yes at S108), then, on the premise that the voice is appropriate, the system control proceeds to S110. If the analysis determination unit 15 determines that the speaker of the first voice data is different from the speaker of the second voice data (No at S108), the speech synthesis dictionary creating apparatus 1a (or the speech synthesis dictionary creating apparatus 1b) ends the operation.
In step 110 (S110), the creating unit 16 creates a speech synthesis dictionary using the first voice data (and the second voice data) determined to be appropriate by the analysis determination unit 15 and the text corresponding to the first voice data (and the second voice data), and outputs the speech synthesis dictionary to the second storage unit 17.
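For orientation only, S106 through S110 can be strung together from the earlier sketches; same_speaker and create_dictionary are the illustrative helpers introduced above, not the patent's own interfaces.

```python
def run_creation_flow(first_wav: str, second_wav: str, transcript: str) -> bool:
    """S106-S110 in one pass: compare the two recordings, then build the dictionary."""
    if not same_speaker(first_wav, second_wav):  # S106-S108
        return False                             # No at S108: the operation ends
    create_dictionary(first_wav, transcript)     # S110
    return True
```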
Fig. 4 schematically illustrates an example of operations performed in a speech synthesis dictionary creating system 100 that includes the speech synthesis dictionary creating apparatus 1a. The speech synthesis dictionary creating system 100 includes the speech synthesis dictionary creating apparatus 1a and performs input and output of data (voice data and texts) via a network (not shown). That is, the speech synthesis dictionary creating system 100 is a system that creates a speech synthesis dictionary using voice uploaded by a user of the system and provides the speech synthesis dictionary.
Referring to Fig. 4, first voice data 20 represents voice data generated by a person A speaking an arbitrary number of texts having arbitrary content. The first voice data 20 is received by the first voice input unit 10.
A presentation example 22 is a text presented by the speech synthesis dictionary creating apparatus 1a that prompts the user to speak, for example, 'The advanced television is 50 inches in size.' Second voice data 24 represents the voice data obtained when the text presented by the speech synthesis dictionary creating apparatus 1a is read aloud by the user. The second voice data 24 is input to the second voice input unit 14. With voice obtained via television or the Internet, it is difficult to speak a text that the speech synthesis dictionary creating apparatus 1a presents at random. The second voice input unit 14 outputs the received voice data to the analysis determination unit 15 as appropriate voice data.
The analysis determination unit 15 compares the feature quantity of the first voice data 20 with the feature quantity of the second voice data 24, thereby determining whether the speaker of the first voice data 20 is identical to the speaker of the second voice data 24.
If the speaker of the first voice data 20 is identical to the speaker of the second voice data 24, the speech synthesis dictionary creating system 100 creates a speech synthesis dictionary and, for example, shows the user a display 26 as a notification that the speech synthesis dictionary has been created. On the other hand, if the speaker of the first voice data 20 is different from the speaker of the second voice data 24, the speech synthesis dictionary creating system 100 rejects the first voice data 20 and, for example, shows the user a display 28 as a notification that no speech synthesis dictionary has been created.
Second embodiment
An explanation of a speech synthesis dictionary creating apparatus according to a second embodiment is given below. Fig. 5 is a block diagram showing the configuration of a speech synthesis dictionary creating apparatus 3 according to the second embodiment. Here, the speech synthesis dictionary creating apparatus 3 is implemented using, for example, a general-purpose computer. That is, the speech synthesis dictionary creating apparatus 3 has the functions of a computer that includes, for example, a CPU, a memory device, an input/output device, and a communication interface.
As shown in Fig. 5, the speech synthesis dictionary creating apparatus 3 includes the first voice input unit 10, a voice input unit 31, a detection unit 32, an analysis unit 33, a determination unit 34, the creating unit 16, and the second storage unit 17. In the speech synthesis dictionary creating apparatus 3 shown in Fig. 5, constituent units substantially identical to those of the speech synthesis dictionary creating apparatus 1a shown in Fig. 1 are denoted by the same reference numerals.
The voice input unit 31, the detection unit 32, the analysis unit 33, and the determination unit 34 may be configured using hardware or may be configured using software executed by the CPU. Thus, the speech synthesis dictionary creating apparatus 3 may be configured such that its functions are realized by executing a speech synthesis dictionary creating program.
The voice input unit 31 inputs to the detection unit 32 voice data recorded, for example, by a voice recorder capable of embedding authentication information, as well as arbitrary voice data such as voice data recorded by another recording device.
The voice recorder capable of embedding authentication information embeds the authentication information, for example, throughout the entire voice in a continuous but random manner, or in specified text content or a specified number of texts. Examples of embedding methods include encryption using a public key or a shared key, and digital watermarking. When the authentication information represents encryption, the speech waveform is encrypted (waveform encryption). Digital watermarks applied to voice include an echo hiding method with continuous masking, a spread spectrum method in which the amplitude spectrum is manipulated to embed information, a patchwork method, and a method in which information is embedded by phase modulation.
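To make one of these options concrete, the following toy sketch embeds authentication bits by echo hiding: each segment of audio receives a faint echo whose delay encodes a 0 or a 1. The delays, echo gain, and segment length are assumed values, not the recorder's actual scheme.

```python
import numpy as np

# Assumed parameters: per-bit echo delays (samples), echo gain, samples per bit.
DELAY_0, DELAY_1, ALPHA, SEG = 100, 150, 0.05, 8192

def embed_echo_watermark(signal: np.ndarray, bits: list[int]) -> np.ndarray:
    """Embed one bit per segment by adding a delayed, attenuated copy of the segment."""
    out = signal.astype(np.float64).copy()
    for i, bit in enumerate(bits):
        start, end = i * SEG, (i + 1) * SEG
        if end > len(signal):
            break  # not enough audio left to carry this bit
        delay = DELAY_1 if bit else DELAY_0
        seg = signal[start:end].astype(np.float64)
        echo = np.zeros_like(seg)
        echo[delay:] = seg[:-delay]
        out[start:end] = seg + ALPHA * echo
    return out
```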
The detection unit 32 detects the authentication information contained in the voice data received by the voice input unit 31, and extracts the authentication information from the voice data in which it is embedded. When waveform encryption is implemented as the embedding method, the detection unit 32 can be configured to perform decryption using a private key. When the authentication information represents a digital watermark, the detection unit 32 obtains the information bits according to the decoding order.
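Continuing the same toy example (again an assumption, not the patent's decoder), the bits can be recovered by checking, per segment, which candidate delay produces the stronger cepstral peak.

```python
def detect_echo_watermark(signal: np.ndarray, n_bits: int) -> list[int]:
    """Recover embedded bits by comparing cepstral energy at the two candidate delays."""
    bits = []
    for i in range(n_bits):
        seg = signal[i * SEG:(i + 1) * SEG].astype(np.float64)
        if len(seg) < SEG:
            break
        spectrum = np.abs(np.fft.rfft(seg)) + 1e-12   # avoid log(0)
        cepstrum = np.fft.irfft(np.log(spectrum))     # real cepstrum; an echo peaks at its delay
        bits.append(1 if cepstrum[DELAY_1] > cepstrum[DELAY_0] else 0)
    return bits
```

Running detect_echo_watermark on the output of embed_echo_watermark recovers the bit pattern under low noise; a production recorder would use a keyed and far more robust scheme.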
When detecting the authentication information, the detection unit 32 regards the input voice data as voice data recorded by the specified voice recording device. The detection unit 32 thus sets the voice data in which the authentication information has been detected as the second voice data regarded as appropriate, and outputs the second voice data to the analysis unit 33.
Furthermore, the voice input unit 31 and the detection unit 32 may be integrated, for example, into a second voice input unit 35 that detects the authentication information contained in arbitrary voice data and outputs the voice data in which the authentication information is detected as the second voice data regarded as appropriate.
The analysis unit 33 receives the first voice data from the first voice input unit 10, receives the second voice data from the detection unit 32, analyzes the first voice data and the second voice data, and outputs the analysis results to the determination unit 34.
For example, the analysis unit 33 performs speech recognition on the first voice data and the second voice data, and generates a text corresponding to the first voice data and a text corresponding to the second voice data. In addition, the analysis unit 33 may perform a voice quality check on the second voice data to determine whether the signal-to-noise ratio (SNR) and the amplitude are equal to or greater than predetermined thresholds. Further, the analysis unit 33 extracts feature quantities based on at least one of the following properties of the first voice data and the second voice data: the amplitude, the average or variance of the fundamental frequency (F0), the correlation of spectrum envelope extraction results, and the word accuracy and word recognition rate of the speech recognition. The spectrum envelope extraction methods may be the same as those implemented by the analysis determination unit 15 (Fig. 2).
The determination unit 34 receives the feature quantities calculated by the analysis unit 33. The determination unit 34 then compares the feature quantity of the first voice data with the feature quantity of the second voice data, thereby determining whether the speaker of the first voice data is identical to the speaker of the second voice data. For example, if the difference between the feature quantity of the first voice data and the feature quantity of the second voice data is equal to or smaller than a predetermined threshold, or if the correlation between the feature quantity of the first voice data and the feature quantity of the second voice data is equal to or greater than a predetermined threshold, the determination unit 34 determines that the speaker of the first voice data is identical to the speaker of the second voice data. Here, it is assumed that the thresholds used in the determination by the determination unit 34 are set by learning in advance the averages and variances of feature quantities of the same person, or by learning speech recognition results from a large amount of data in advance.
If it is determined that the speaker of the first voice data is identical to the speaker of the second voice data, the determination unit 34 determines that the voice is appropriate. The determination unit 34 then outputs to the creating unit 16, as appropriate voice data, the first voice data (and the second voice data) whose speaker has been determined to be identical to the speaker of the second voice data. In addition, the analysis unit 33 and the determination unit 34 may together be configured as an analysis determination unit 36 that operates in the same manner as the analysis determination unit 15 of the speech synthesis dictionary creating apparatus 1a (Fig. 1).
An explanation is given below of the operation for creating a speech synthesis dictionary performed in the speech synthesis dictionary creating apparatus 3 according to the second embodiment. Fig. 6 is a flowchart for explaining this operation.
As shown in Fig. 6, in step 200 (S200), the first voice input unit 10 inputs the first voice data to the analysis unit 33, and the voice input unit 31 inputs arbitrary voice data to the detection unit 32 (voice input).
In step 202 (S202), the detection unit 32 detects the authentication information.
In step 204 (S204), the speech synthesis dictionary creating apparatus 3 determines, for example, whether the detection unit 32 has detected authentication information in the arbitrary voice data. In the speech synthesis dictionary creating apparatus 3, if the detection unit 32 has detected the authentication information (Yes at S204), the system control proceeds to S206. On the other hand, in the speech synthesis dictionary creating apparatus 3, if the detection unit 32 has not detected the authentication information (No at S204), the operation ends.
In step 206 (S206), the analysis unit 33 extracts the feature quantity of the first voice data and the feature quantity of the second voice data (analysis).
In step 208 (S208), the determination unit 34 compares the feature quantity of the first voice data with the feature quantity of the second voice data, thereby determining whether the speaker of the first voice data is identical to the speaker of the second voice data.
In step 210 (S210), in the speech synthesis dictionary creating apparatus 3, if the determination unit 34 has determined at S208 that the speaker of the first voice data is identical to the speaker of the second voice data (Yes at S210), then, on the premise that the voice is appropriate, the system control proceeds to S212. On the other hand, in the speech synthesis dictionary creating apparatus 3, if the determination unit 34 has determined at S208 that the speaker of the first voice data is different from the speaker of the second voice data (No at S210), then, on the premise that the voice is inappropriate, the operation ends.
In step 212 (S212), the creating unit 16 creates a speech synthesis dictionary corresponding to the first voice data (and the second voice data) determined to be appropriate by the determination unit 34, and outputs the speech synthesis dictionary to the second storage unit 17.
Fig. 7 is a diagram schematically illustrating an example of operations performed in a speech synthesis dictionary creating system 300 that includes the speech synthesis dictionary creating apparatus 3. The speech synthesis dictionary creating system 300 includes the speech synthesis dictionary creating apparatus 3 and performs input and output of data (voice data) via a network (not shown). That is, the speech synthesis dictionary creating system 300 is a system that creates a speech synthesis dictionary using voice uploaded by a user and provides the speech synthesis dictionary.
Referring to Fig. 7, first voice data 40 represents voice data generated by a person A or a person B speaking an arbitrary number of texts having arbitrary content. The first voice data 40 is received by the first voice input unit 10.
For example, the person A reads aloud the text 'The advanced television is 50 inches in size.' presented by a recording device 42 that includes an authentication information embedding unit, and performs voice recording. The text spoken by the person A, with the authentication information embedded in it, constitutes authentication-information-embedded voice 44. The authentication-information-embedded voice (second voice data) is thus regarded as voice data recorded by a pre-specified recording device capable of embedding authentication information in voice data. That is, the authentication-information-embedded voice is regarded as appropriate voice data.
The speech synthesis dictionary creating system 300 compares the feature quantity of the first voice data 40 with the feature quantity of the authentication-information-embedded voice (second voice data) 44, thereby determining whether the speaker of the first voice data 40 is identical to the speaker of the authentication-information-embedded voice (second voice data) 44.
If the speaker of the first voice data 40 is identical to the speaker of the authentication-information-embedded voice (second voice data) 44, the speech synthesis dictionary creating system 300 creates a speech synthesis dictionary and, for example, shows the user a display 46 as a notification that the speech synthesis dictionary has been created. On the other hand, if the speaker of the first voice data 40 is different from the speaker of the authentication-information-embedded voice (second voice data) 44, the speech synthesis dictionary creating system 300 rejects the first voice data 40 and, for example, shows the user a display 48 as a notification that no speech synthesis dictionary has been created.
In this way, in the speech synthesis dictionary creating apparatus according to the embodiments, it is determined whether the speaker of the first voice data is identical to the speaker of the second voice data regarded as appropriate voice data, so that a speech synthesis dictionary can be prevented from being created in a fraudulent manner.
While certain embodiments of the present invention have been described, these embodiments have been presented by way of example only and are not intended to limit the scope of the invention. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions, and changes in the form of the methods and systems described herein may be made without departing from the spirit of the invention. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the invention.
List of numerals
1a, 1b, 3: speech synthesis dictionary creating apparatus
10: first voice input unit
11: first storage unit
12: control unit
13: display unit
14: second voice input unit
15: analysis determination unit
16: creating unit
17: second storage unit
18: text input unit
31: voice input unit
32: detection unit
33: analysis unit
34: determination unit
35: second voice input unit
36: analysis determination unit
100, 300: speech synthesis dictionary creating system

Claims (10)

1. A speech synthesis dictionary creating apparatus, comprising:
a first voice input unit configured to receive input of first voice data;
a second voice input unit configured to receive input of second voice data, the second voice data being regarded as appropriate voice data;
a determination unit configured to determine whether a speaker of the first voice data is identical to a speaker of the second voice data; and
a creating unit configured to create, when the determination unit determines that the speaker of the first voice data is identical to the speaker of the second voice data, a speech synthesis dictionary using the first voice data and a text corresponding to the first voice data,
wherein the appropriate voice data is voice data obtained by reading aloud a presented text, or voice data in which authentication information is detected.
2. The apparatus according to claim 1, further comprising:
a storage unit configured to store a plurality of texts therein; and
a display unit configured to present any one of the texts stored in the storage unit,
wherein the second voice input unit sets voice data obtained when the text presented by the display unit is spoken as the second voice data regarded as appropriate voice data.
3. The apparatus according to claim 2, wherein the display unit performs at least one of: presenting any one of the texts stored in the storage unit at random, and presenting any one of the texts only for a predetermined period of time.
4. The apparatus according to claim 1, wherein the determination unit determines whether the speaker of the first voice data is identical to the speaker of the second voice data by comparing a feature quantity of the first voice data with a feature quantity of the second voice data.
5. The apparatus according to claim 4, wherein the determination unit compares feature quantities based on at least one of a word recognition rate, a word accuracy, an amplitude, a fundamental frequency, and a spectrum envelope of the first voice data and the second voice data.
6. The apparatus according to claim 5, wherein the determination unit determines that the speaker of the first voice data is identical to the speaker of the second voice data when a difference between the feature quantity of the first voice data and the feature quantity of the second voice data is equal to or smaller than a predetermined threshold, or when a correlation between the feature quantity of the first voice data and the feature quantity of the second voice data is equal to or greater than a predetermined threshold.
7. The apparatus according to claim 1, further comprising a text input unit configured to input a text corresponding to the first voice data,
wherein the determination unit treats voice data obtained when the text received by the text input unit is spoken as the first voice data, and determines whether the speaker of the first voice data is identical to the speaker of the second voice data.
8. The apparatus according to claim 1, wherein the second voice input unit includes:
a voice input unit configured to receive input of voice data; and
a detection unit configured to detect authentication information contained in the voice data received by the voice input unit,
wherein the detection unit sets voice data in which the authentication information is detected as the second voice data regarded as appropriate.
9. The apparatus according to claim 8, wherein the authentication information represents a voice watermark or speech waveform encryption.
10. A speech synthesis dictionary creating method, comprising:
receiving input of first voice data;
receiving input of second voice data, the second voice data being regarded as appropriate voice data;
determining whether a speaker of the first voice data is identical to a speaker of the second voice data; and
creating, when it is determined that the speaker of the first voice data is identical to the speaker of the second voice data, a speech synthesis dictionary using the first voice data and a text corresponding to the first voice data.
CN201380077502.8A 2013-06-20 2013-06-20 Speech synthesis dictionary creating apparatus and speech synthesis dictionary creating method Active CN105340003B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2013/066949 WO2014203370A1 (en) 2013-06-20 2013-06-20 Speech synthesis dictionary creation device and speech synthesis dictionary creation method

Publications (2)

Publication Number Publication Date
CN105340003A CN105340003A (en) 2016-02-17
CN105340003B (en) 2019-04-05

Family

ID=52104132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201380077502.8A Active CN105340003B (en) 2013-06-20 2013-06-20 Speech synthesis dictionary creating apparatus and speech synthesis dictionary creating method

Country Status (4)

Country Link
US (1) US9792894B2 (en)
JP (1) JP6184494B2 (en)
CN (1) CN105340003B (en)
WO (1) WO2014203370A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139857B (en) * 2015-09-02 2019-03-22 中山大学 For the countercheck of voice deception in a kind of automatic Speaker Identification
KR102596430B1 (en) * 2016-08-31 2023-10-31 삼성전자주식회사 Method and apparatus for speech recognition based on speaker recognition
CN108091321B (en) * 2017-11-06 2021-07-16 芋头科技(杭州)有限公司 Speech synthesis method
US11664033B2 (en) * 2020-06-15 2023-05-30 Samsung Electronics Co., Ltd. Electronic apparatus and controlling method thereof

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS5713493A (en) * 1980-06-27 1982-01-23 Hitachi Ltd Speaker recognizing device
JPS6223097A (en) * 1985-07-23 1987-01-31 株式会社トミー Voice recognition equipment
US8005677B2 (en) * 2003-05-09 2011-08-23 Cisco Technology, Inc. Source-dependent text-to-speech system
US7355623B2 (en) * 2004-04-30 2008-04-08 Microsoft Corporation System and process for adding high frame-rate current speaker data to a low frame-rate video using audio watermarking techniques
JP2008224911A (en) * 2007-03-10 2008-09-25 Toyohashi Univ Of Technology Speaker recognition system
ATE456130T1 (en) * 2007-10-29 2010-02-15 Harman Becker Automotive Sys PARTIAL LANGUAGE RECONSTRUCTION
JP5152588B2 (en) * 2008-11-12 2013-02-27 富士通株式会社 Voice quality change determination device, voice quality change determination method, voice quality change determination program
US8719019B2 (en) * 2011-04-25 2014-05-06 Microsoft Corporation Speaker identification
US9245254B2 (en) * 2011-12-01 2016-01-26 Elwha Llc Enhanced voice conferencing with history, language translation and identification

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1369830A (en) * 2001-01-31 2002-09-18 微软公司 Divergence elimination language model
CN1585968A (en) * 2001-11-12 2005-02-23 诺基亚有限公司 Method for compressing dictionary data
CN101057497A (en) * 2004-11-08 2007-10-17 松下电器产业株式会社 Digital video reproduction apparatus
CN101266789A (en) * 2007-03-14 2008-09-17 佳能株式会社 Speech synthesis apparatus and method
CN101989284A (en) * 2009-08-07 2011-03-23 赛微科技股份有限公司 Portable electronic device, and voice input dictionary module and data processing method thereof
CN102469363A (en) * 2010-11-11 2012-05-23 Tcl集团股份有限公司 Television system with speech comment function and speech comment method
CN102332268A (en) * 2011-09-22 2012-01-25 王天荆 Speech signal sparse representation method based on self-adaptive redundant dictionary
CN102881293A (en) * 2012-10-10 2013-01-16 南京邮电大学 Over-complete dictionary constructing method applicable to voice compression sensing

Also Published As

Publication number Publication date
CN105340003A (en) 2016-02-17
JP6184494B2 (en) 2017-08-23
JPWO2014203370A1 (en) 2017-02-23
US9792894B2 (en) 2017-10-17
WO2014203370A1 (en) 2014-12-24
US20160104475A1 (en) 2016-04-14

Similar Documents

Publication Publication Date Title
US10446134B2 (en) Computer-implemented system and method for identifying special information within a voice recording
US10158633B2 (en) Using the ability to speak as a human interactive proof
Saquib et al. A survey on automatic speaker recognition systems
US20210304783A1 (en) Voice conversion and verification
Zhao et al. Audio splicing detection and localization using environmental signature
CN104123115A (en) Audio information processing method and electronic device
CN103678977A (en) Method and electronic device for protecting information security
CN105340003B (en) Speech synthesis dictionary creating apparatus and speech synthesis dictionary creating method
CN105283916B (en) Electronic watermark embedded device, electronic watermark embedding method and computer readable recording medium
US9767787B2 (en) Artificial utterances for speaker verification
KR20160027005A (en) Collaborative audio conversation attestation
CN114627856A (en) Voice recognition method, voice recognition device, storage medium and electronic equipment
CN101630372B (en) Method for verifying IC card, equipment and system thereof
CN108665901B (en) Phoneme/syllable extraction method and device
Seymour et al. Your voice is my passport
KR102291113B1 (en) Apparatus and method for producing conference record
CN111785280A (en) Identity authentication method and device, storage medium and electronic equipment
Khan et al. Speaker verification from partially encrypted compressed speech for forensic investigation
Adamski A speaker recognition solution for identification and authentication
Cardaioli et al. For Your Voice Only: Exploiting Side Channels in Voice Messaging for Environment Detection
FI126129B (en) Audiovisual associative authentication method and equivalent system
CN114842851A (en) Voiceprint recognition method, system, equipment and storage medium
Sahin Robust Speech Hashing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190905

Address after: Kanagawa

Patentee after: TOSHIBA DIGITAL SOLUTIONS Corp.

Address before: Tokyo, Japan

Co-patentee before: TOSHIBA DIGITAL SOLUTIONS Corp.

Patentee before: Toshiba Corp.

Effective date of registration: 20190905

Address after: Tokyo, Japan

Co-patentee after: TOSHIBA DIGITAL SOLUTIONS Corp.

Patentee after: Toshiba Corp.

Address before: Tokyo, Japan

Patentee before: Toshiba Corp.

TR01 Transfer of patent right

Effective date of registration: 20201120

Address after: Tokyo, Japan

Patentee after: Color sound station Co.,Ltd.

Address before: Kanagawa Prefecture, Japan

Patentee before: TOSHIBA DIGITAL SOLUTIONS Corp.
