CN104517605B - Speech segment splicing system and method for speech synthesis - Google Patents
Abstract
The present invention relates to a speech segment splicing system and method for speech synthesis. First, two speech segments to be spliced are extracted from a speech corpus as a first speech segment and a second speech segment, and optimal sampling points are selected from the first and second speech segments. Then, first-order smoothing is applied at the optimal sampling points to generate a splice point. The first-order smoothing method is: compute the slopes ka and kb at the optimal sampling points U1 and U2, and the difference ΔU between the values at U1 and U2; then predict from ka, kb, and ΔU to generate the splice point. Finally, the splice point is inserted between the first and second speech segments to generate a third speech segment. The invention solves the spectral-jump problem of direct splicing in the prior art and avoids the excessive computation of prior smoothing methods based on autocorrelation search followed by accumulation; first-order smoothing gives the spectrum at the junction good continuity and improves the user's listening experience.
Description
Technical field
The present invention relates to the field of speech synthesis, and in particular to a speech segment splicing system and method for speech synthesis.
Background art
Existing speech synthesis methods fall into two categories: parameter-based synthesis and waveform-concatenation synthesis. Compared with parameter-based methods, waveform concatenation produces synthesized speech of higher quality that sounds more natural and closer to the timbre of the original speaker. Mainstream online speech synthesis therefore currently favors waveform-concatenation schemes.
The principle of waveform-concatenation speech synthesis is as follows: suitable speech units are first selected, as segments to be spliced, from a corpus that has been recorded and annotated; the final synthesized speech is then obtained by splicing the segments together. With this approach, if the junction between spliced segments is handled poorly, a jump appears in the spectrum and the result sounds unnatural to the user. A key technical problem is therefore: what splicing method produces spliced speech segments that can be output smoothly?
Existing splicing methods first align the speech segments and then smooth them by accumulation. The smoothing effect of such methods is mediocre, and spectral jumps between segments remain. Moreover, in some cases these methods cannot find a smooth alignment point at all; the user then hears a high-frequency popping sound, which degrades the listening experience. A splicing method that can output smooth speech segments is therefore needed.
Summary of the invention
The technical problem to be solved by the present invention is to provide a speech segment splicing method capable of outputting smooth speech segments.
The technical scheme by which the present invention solves the above problem is as follows: a speech segment splicing system for speech synthesis, comprising a speech corpus, a sampling point selection module, a splice point generation module, and a splicing module;
the speech corpus is used to store recorded and annotated speech segments;
the sampling point selection module is used to extract two speech segments to be spliced from the corpus as a first speech segment and a second speech segment, and to select optimal sampling points from the first and second speech segments;
the splice point generation module is used to apply first-order smoothing at the optimal sampling points, generating a splice point;
the splicing module is used to insert the splice point between the first and second speech segments, generating a third speech segment.
The beneficial effects of the invention are as follows: it solves the spectral jumps produced in the prior art by period-search-and-shift accumulation smoothing, and first-order smoothing gives the spectrum of the speech at the junction good continuity, improving the user's listening experience. In addition, when searching for candidate sampling points at the splice position, first-order smoothing alignment does not need to compute the autocorrelation of the speech signal, so an accurate splice position is found more simply, greatly reducing the amount of computation and increasing running speed.
On the basis of the above technical scheme, the present invention further provides the following improvements.
Further, the sampling point selection module includes a search unit and a screening unit;
the search unit is used to search the first and second speech segments to obtain at least two candidate sampling points;
the screening unit is used to screen out, from the at least two candidate sampling points, the optimal sampling point U1 of the first speech segment and the optimal sampling point U2 of the second speech segment.
Further, the splice point generation module includes a computation unit and a prediction unit;
the computation unit is used to compute the slope ka at the optimal sampling point U1, the slope kb at the optimal sampling point U2, and the difference ΔU between the value at U1 and the value at U2;
the prediction unit is used to predict from ka, kb, and ΔU, generating the splice point.
Further, the search unit searches the first and second speech segments bidirectionally: the first speech segment is searched from back to front, and the second speech segment from front to back.
Further, a candidate sampling point found by the bidirectional search must satisfy the following conditions:
Condition 1: the absolute difference between the slopes of the first and second speech segments at their candidate sampling points is less than a set threshold Tk, i.e. abs(ka - kb) < Tk;
Condition 2: the absolute difference between the sample values of the first and second speech segments at their candidate sampling points is less than the product of an adjustable parameter ratio and the absolute slope of the first speech segment at its candidate point, i.e. abs(Sa - Sb) < ratio*abs(ka).
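As a concrete illustration, the two conditions can be checked with a short sketch. This is a hypothetical helper, not the patent's implementation: the patent does not fix how slopes are estimated, so a first difference is assumed, and the `Tk` and `ratio` values are illustrative.

```python
def slope(seg, i):
    # First-difference slope estimate at sample i (an assumption; the
    # patent does not fix the slope estimator).
    return seg[i] - seg[i - 1]

def is_candidate_pair(seg_a, i, seg_b, j, Tk=0.05, ratio=0.5):
    """Check Condition 1 and Condition 2 for candidate points i (in the
    first segment) and j (in the second segment)."""
    ka, kb = slope(seg_a, i), slope(seg_b, j)
    Sa, Sb = seg_a[i], seg_b[j]
    cond1 = abs(ka - kb) < Tk               # Condition 1: slopes agree
    cond2 = abs(Sa - Sb) < ratio * abs(ka)  # Condition 2: values agree
    return cond1 and cond2
```

Note that Condition 2 scales the allowed value gap by the local slope, so a steep region tolerates a larger amplitude mismatch than a flat one.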
Further, the optimal sampling points are screened by a minimal-error-cost criterion. The minimal error cost is the weighted sum of a slope-difference cost and a value-difference cost, i.e. U* = argmin(w1*Dratio + w2*Dval), where w1 is the weight of the slope-difference cost at the optimal sampling point U*, w2 is the weight of the value-difference cost at U*, Dratio is the slope-difference cost function at U*, and Dval is the value-difference cost function at U*.
To solve the above technical problem, the present invention also provides a speech segment splicing method for speech synthesis, comprising the following steps:
Step 1: extract two speech segments to be spliced from the speech corpus as a first speech segment and a second speech segment, and select optimal sampling points from the first and second speech segments;
Step 2: apply first-order smoothing at the optimal sampling points to generate a splice point;
Step 3: insert the splice point between the first and second speech segments, generating a third speech segment.
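The three steps can be sketched end to end as follows. This is a minimal illustration rather than the patent's implementation: the first-difference slope estimator, the thresholds `Tk` and `ratio`, the equal cost weights, and the mean-based correction for the splice point E are all assumptions.

```python
def slope(seg, i):
    # Assumed first-difference slope estimate at sample i.
    return seg[i] - seg[i - 1]

def splice(seg_a, seg_b, Tk=0.05, ratio=0.5):
    """End-to-end sketch of steps 1-3: search candidate sampling points,
    screen the best pair by minimal error cost, then insert a
    first-order-smoothed splice point."""
    # Step 1: bidirectional candidate search over both segments.
    pairs = [(i, j)
             for i in range(len(seg_a) - 1, 0, -1)  # back to front
             for j in range(1, len(seg_b))          # front to back
             if abs(slope(seg_a, i) - slope(seg_b, j)) < Tk
             and abs(seg_a[i] - seg_b[j]) < ratio * abs(slope(seg_a, i))]
    if not pairs:
        raise ValueError("no smooth splice point found")
    # Screening: minimal-error-cost criterion (equal weights assumed).
    i, j = min(
        pairs,
        key=lambda p: 0.5 * abs(slope(seg_a, p[0]) - slope(seg_b, p[1]))
        + 0.5 * abs(seg_a[p[0]] - seg_b[p[1]]),
    )
    # Steps 2-3: first-order smoothing and insertion of splice point E.
    ka, kb = slope(seg_a, i), slope(seg_b, j)
    E = ((seg_a[i] + ka) + (seg_b[j] - kb)) / 2.0  # assumed correction
    return seg_a[: i + 1] + [E] + seg_b[j:]
```

For two segments that almost meet, e.g. `splice([0.0, 0.1, 0.2, 0.3, 0.4], [0.31, 0.41, 0.51])`, the sketch returns the first segment, one interpolated splice point, and the tail of the second segment.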
Further, step 1 specifically comprises:
101: extract two speech segments to be spliced from the speech corpus as the first speech segment and the second speech segment;
102: search the first and second speech segments to obtain at least two candidate sampling points;
103: screen out, from the at least two candidate sampling points, the optimal sampling point U1 of the first speech segment and the optimal sampling point U2 of the second speech segment.
Further, step 2 specifically comprises:
201: compute the slope ka at the optimal sampling point U1, the slope kb at the optimal sampling point U2, and the difference ΔU between the value at U1 and the value at U2;
202: predict from ka, kb, and ΔU, generating the splice point.
Further, in step 102 the first and second speech segments are searched bidirectionally: the first speech segment is searched from back to front, and the second from front to back. A candidate sampling point found by the bidirectional search must satisfy:
Condition 1: the absolute difference between the slopes of the two segments at their candidate sampling points is less than the set threshold, i.e. abs(ka - kb) < Tk;
Condition 2: the absolute difference between the sample values at the candidate sampling points is less than the product of the adjustable parameter ratio and the absolute slope of the first segment at its candidate point, i.e. abs(Sa - Sb) < ratio*abs(ka).
Brief description of the drawings
Fig. 1 is a schematic diagram of the module structure of a speech segment splicing system for speech synthesis according to the present invention;
Fig. 2 is a schematic diagram of the bidirectional search directions applied to the speech segments by the system;
Fig. 3 is a flow chart of the steps of a speech segment splicing method for speech synthesis according to the present invention.
In the drawings, the parts represented by the reference numerals are as follows:
1. speech corpus; 2. sampling point selection module; 3. splice point generation module; 4. splicing module; 21. search unit; 22. screening unit; 31. computation unit; 32. prediction unit.
Embodiments
The principles and features of the present invention are described below with reference to the drawings. The examples given serve only to explain the invention and are not intended to limit its scope.
Fig. 1 is a schematic diagram of the module structure of a speech segment splicing system for speech synthesis according to the present invention. As shown in Fig. 1, the system includes a speech corpus 1, a sampling point selection module 2, a splice point generation module 3, and a splicing module 4. The corpus 1 stores recorded and annotated speech segments and contains at least two of them. The sampling point selection module extracts two segments to be spliced from corpus 1 as the first and second speech segments and selects optimal sampling points from them. The splice point generation module applies first-order smoothing at the optimal sampling points to generate a splice point; the splicing module inserts the splice point between the first and second segments to generate the third speech segment.
The sampling point selection module 2 includes a search unit 21 and a screening unit 22; the splice point generation module 3 includes a computation unit 31 and a prediction unit 32.
The search unit 21 searches the first and second speech segments for at least two candidate sampling points. Of the two segments to be spliced, the earlier segment is called the first speech segment and the later segment the second speech segment.
As shown in Fig. 2, the search over the first and second speech segments is bidirectional: the first segment is searched from back to front, and the second from front to back. A candidate sampling point found by the bidirectional search must satisfy two conditions:
abs(ka - kb) < Tk    (Condition 1)
abs(Sa - Sb) < ratio*abs(ka)    (Condition 2)
Condition 1: the absolute difference between the slopes of the two segments at their candidate sampling points is below the set threshold Tk, where ka is the slope of the first segment at its candidate point and kb is the slope of the second segment at its candidate point.
Condition 2: the absolute difference between the sample values at the candidate points is below the product of the adjustable parameter ratio and the absolute slope of the first segment at its candidate point, where Sa is the sample value of the first segment at its candidate point, Sb is the sample value of the second segment at its candidate point, and the adjustable parameter ratio controls how large the value difference may be.
A point satisfying both conditions simultaneously is a candidate sampling point for splicing. With the candidate point of the first segment held fixed, the search moves backward through the second segment. When one round of search finishes, the candidate point of the first segment moves forward and the next round begins. The search terminates when alternative candidate splice points have been found and both segments have been traversed up to their limits. When the search ends, multiple (at least two) candidate sampling points have been obtained; their number is even, since candidates are collected in pairs from the first and second segments.
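The search procedure just described — fix a candidate in the first segment, scan the second segment, then advance and repeat — can be sketched as follows. Here `max_shift` is a hypothetical window bound standing in for the traversal limit, and the slope estimator and thresholds are assumptions.

```python
def find_candidates(seg_a, seg_b, Tk=0.05, ratio=0.5, max_shift=50):
    """Bidirectional search: seg_a is scanned back to front, seg_b front
    to back. Returns candidate pairs (i, j); an empty list means no
    smooth splice point was found."""
    def slope(seg, i):
        return seg[i] - seg[i - 1]  # assumed first-difference slope
    pairs = []
    # i walks backward from the tail of the first segment ...
    for i in range(len(seg_a) - 1, max(0, len(seg_a) - 1 - max_shift), -1):
        # ... and, with i fixed, j walks forward from the head of the second.
        for j in range(1, min(len(seg_b), 1 + max_shift)):
            ka, kb = slope(seg_a, i), slope(seg_b, j)
            if abs(ka - kb) < Tk and abs(seg_a[i] - seg_b[j]) < ratio * abs(ka):
                pairs.append((i, j))
    return pairs
```

Each returned pair names one candidate point in each segment, which matches the text's observation that the total number of candidate points is even.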
After the candidate sampling points are obtained, the screening unit 22 screens out the optimal sampling point U1 of the first segment and the optimal sampling point U2 of the second segment from the at least two candidates.
The optimal sampling points U* (i.e. U1, U2, U3, U4, ...) are selected from the candidate sampling points by the minimal-error-cost criterion, and their positions are used for the subsequent smoothing interpolation. The minimal error cost is the weighted sum of the slope-difference cost and the value-difference cost at U*:
U* = argmin(w1*Dratio + w2*Dval)
where w1 is the weight of the slope-difference cost at U*, w2 is the weight of the value-difference cost, Dratio is the slope-difference cost function at U*, and Dval is the value-difference cost function at U*. The optimal sampling points U1 and U2 are finally obtained according to this criterion.
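Under the minimal-error-cost criterion, the screening step can be sketched as follows. The concrete cost functions Dratio and Dval are not spelled out in the text, so plain absolute differences and equal weights are assumed here.

```python
def screen_best(pairs, seg_a, seg_b, w1=0.5, w2=0.5):
    """Pick the candidate pair minimizing w1*Dratio + w2*Dval, with
    Dratio/Dval taken as absolute slope and value differences (an
    assumption)."""
    def slope(seg, i):
        return seg[i] - seg[i - 1]
    def cost(pair):
        i, j = pair
        d_ratio = abs(slope(seg_a, i) - slope(seg_b, j))  # slope-difference cost
        d_val = abs(seg_a[i] - seg_b[j])                  # value-difference cost
        return w1 * d_ratio + w2 * d_val
    return min(pairs, key=cost)  # argmin over the candidate pairs
```

The weights w1 and w2 trade off how much slope agreement matters relative to amplitude agreement when ranking candidates.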
The computation unit 31 computes the slope ka at the optimal sampling point U1, the slope kb at the optimal sampling point U2, and the difference ΔU between the value at U1 and the value at U2.
The prediction unit 32 predicts from ka, kb, and ΔU to generate the splice point. The prediction process is as follows.
Slope prediction: let the optimal splice point U1 of the first segment be the sample at time T, with amplitude S; then the amplitude of the sample at time T-1 is S(T-1) = S - ka, where ka is the slope at U1, and the predicted amplitude of the first segment at time T+1 is S(T+1) = S + ka. Likewise, let the optimal splice point U2 of the second segment be the sample at time N, with amplitude V; then the amplitude of the sample at time N+1 is V(N+1) = V + kb, where kb is the slope at U2, and the predicted amplitude of the second segment at time N-1 is V(N-1) = V - kb.
The slope predictions show that, at the junction of their optimal splice points, the two segments disagree by the prediction difference (S + ka) - (V - kb). This difference means the two segments cannot be spliced together directly, so the sample value must be corrected: the corrected value E is taken between the two predictions, e.g. as their mean, E = ((S + ka) + (V - kb))/2.
The final spliced sequence is
... , S - ka, S, E, V, V + kb, ...
Because this smoothing of the optimal sampling points of the first and second segments uses slope information (first-order information), it is called first-order smoothing.
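A minimal sketch of the first-order smoothing splice: the first segment predicts one sample ahead (S + ka), the second predicts one sample back (V - kb), and a single corrected splice point E is inserted between them. Taking E as the mean of the two predictions is an assumption, since the text does not spell out the exact correction formula.

```python
def first_order_splice(seg_a, seg_b, ka, kb):
    """Join seg_a (ending at its optimal sampling point, amplitude S) and
    seg_b (starting at its optimal sampling point, amplitude V) with one
    first-order-smoothed splice point E."""
    S, V = seg_a[-1], seg_b[0]
    pred_a = S + ka  # first segment's slope prediction one step ahead
    pred_b = V - kb  # second segment's slope prediction one step back
    E = (pred_a + pred_b) / 2.0  # assumed correction: mean of predictions
    # Resulting sequence: ..., S - ka, S, E, V, V + kb, ...
    return seg_a + [E] + seg_b
```

When the two predictions already agree, E coincides with both and the junction is seamless; otherwise E splits their disagreement.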
Fig. 3 is a flow chart of the steps of a speech segment splicing method for speech synthesis according to the present invention. As shown in Fig. 3, the method comprises the following steps:
Step 1: extract two speech segments to be spliced from the speech corpus as a first speech segment and a second speech segment, and select optimal sampling points from the first and second speech segments;
Step 2: apply first-order smoothing at the optimal sampling points to generate a splice point;
Step 3: insert the splice point between the first and second speech segments, generating a third speech segment.
Step 1 specifically comprises:
101: extract two speech segments to be spliced from the speech corpus as the first and second speech segments;
102: search the first and second speech segments to obtain at least two candidate sampling points;
103: screen out, from the at least two candidate sampling points, the optimal sampling point U1 of the first segment and the optimal sampling point U2 of the second segment.
In step 102, the first and second speech segments are searched bidirectionally: the first segment from back to front, the second from front to back. A candidate sampling point found by the bidirectional search must satisfy:
Condition 1: the absolute difference between the slopes of the two segments at their candidate sampling points is less than the set threshold, i.e. abs(ka - kb) < Tk;
Condition 2: the absolute difference between the sample values at the candidate points is less than the product of the adjustable parameter ratio and the absolute slope of the first segment at its candidate point, i.e. abs(Sa - Sb) < ratio*abs(ka).
A point satisfying both conditions simultaneously is a candidate sampling point for splicing. With the candidate point of the first segment held fixed, the search moves backward through the second segment. When one round of search finishes, the candidate point of the first segment moves forward and the next round begins. The search terminates when alternative candidate splice points have been found and both segments have been traversed up to their limits. When the search ends, multiple (at least two) candidate sampling points have been obtained; their number is even, since candidates are collected in pairs from the first and second segments.
In step 103, the optimal sampling points U* (i.e. U1, U2, U3, U4, ...) are selected from the candidate sampling points by the minimal-error-cost criterion, and their positions are used for the subsequent smoothing interpolation. The minimal error cost is the weighted sum of the slope-difference cost and the value-difference cost at U*:
U* = argmin(w1*Dratio + w2*Dval)
where w1 is the weight of the slope-difference cost at U*, w2 is the weight of the value-difference cost, Dratio is the slope-difference cost function at U*, and Dval is the value-difference cost function at U*. The optimal sampling points U1 and U2 are finally obtained according to this criterion.
Step 2 specifically comprises:
201: compute the slope ka at the optimal sampling point U1, the slope kb at the optimal sampling point U2, and the difference ΔU between the value at U1 and the value at U2;
202: predict from ka, kb, and ΔU, generating the splice point.
In step 202, the prediction process is as follows.
Slope prediction: let the optimal splice point U1 of the first segment be the sample at time T, with amplitude S; then the amplitude of the sample at time T-1 is S(T-1) = S - ka, where ka is the slope at U1, and the predicted amplitude of the first segment at time T+1 is S(T+1) = S + ka. Likewise, let the optimal splice point U2 of the second segment be the sample at time N, with amplitude V; then the amplitude at time N+1 is V(N+1) = V + kb, where kb is the slope at U2, and the predicted amplitude of the second segment at time N-1 is V(N-1) = V - kb.
The slope predictions show that, at the junction of their optimal splice points, the two segments disagree by the prediction difference (S + ka) - (V - kb). This difference means the two segments cannot be spliced together directly, so the sample value must be corrected: the corrected value E is taken between the two predictions, e.g. as their mean, E = ((S + ka) + (V - kb))/2.
The final spliced sequence is
... , S - ka, S, E, V, V + kb, ...
The present invention solves the spectral jumps produced in the prior art by period-search-and-shift accumulation smoothing; first-order smoothing gives the spectrum of the speech at the junction good continuity and improves the user's listening experience. In addition, when searching for candidate sampling points at the splice position, first-order smoothing alignment does not need to compute the autocorrelation of the speech signal, so an accurate splice position is found more simply, greatly reducing the amount of computation and increasing the running speed.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall be included within its scope of protection.
Claims (6)
1. A speech segment splicing system for speech synthesis, characterized by including a speech corpus, a sampling point selection module, a splice point generation module, and a splicing module;
the speech corpus is used to store recorded and annotated speech segments;
the sampling point selection module includes a search unit and a screening unit, wherein the search unit is used to search the first speech segment and the second speech segment to obtain at least two candidate sampling points;
the screening unit is used to screen out, from the at least two candidate sampling points, the optimal sampling point U1 of the first speech segment and the optimal sampling point U2 of the second speech segment;
the splice point generation module includes a computation unit and a prediction unit, wherein the computation unit is used to compute the slope ka at the optimal sampling point U1, the slope kb at the optimal sampling point U2, and the difference ΔU between the value at U1 and the value at U2;
the prediction unit is used to predict from ka, kb, and ΔU, generating a splice point;
the splicing module is used to insert the splice point between the first and second speech segments, generating a third speech segment.
2. The speech segment splicing system for speech synthesis according to claim 1, characterized in that the search unit searches the first and second speech segments bidirectionally: the first speech segment is searched from back to front, and the second speech segment from front to back.
3. The speech segment splicing system for speech synthesis according to claim 2, characterized in that a candidate sampling point found by the bidirectional search satisfies:
Condition 1: the absolute difference between the slopes of the first and second speech segments at their candidate sampling points is less than a set threshold Tk, i.e. abs(ka - kb) < Tk;
Condition 2: the absolute difference between the sample values of the first and second speech segments at their candidate sampling points is less than the product of an adjustable parameter ratio and the absolute slope of the first speech segment at its candidate point, i.e. abs(Sa - Sb) < ratio*abs(ka).
4. The speech segment splicing system for speech synthesis according to claim 1, characterized in that the optimal sampling points are screened by a minimal-error-cost criterion; the minimal error cost is the weighted sum of the slope-difference cost and the value-difference cost at the sampling point U*, U* = argmin(w1*Dratio + w2*Dval), where w1 is the weight of the slope-difference cost at the optimal sampling point U*, w2 is the weight of the value-difference cost at U*, Dratio is the slope-difference cost function at U*, and Dval is the value-difference cost function at U*.
5. A speech segment splicing method for speech synthesis, characterized by comprising the following steps:
Step 1: extract two speech segments to be spliced from a speech corpus as a first speech segment and a second speech segment, search the first and second speech segments to obtain at least two candidate sampling points, and screen out from the at least two candidate sampling points the optimal sampling point U1 of the first speech segment and the optimal sampling point U2 of the second speech segment;
Step 2: compute the slope ka at the optimal sampling point U1, the slope kb at the optimal sampling point U2, and the difference ΔU between the value at U1 and the value at U2, and predict from ka, kb, and ΔU to generate a splice point;
Step 3: insert the splice point between the first and second speech segments, generating a third speech segment.
6. The speech segment splicing method for speech synthesis according to claim 5, characterized in that in step 1 the first and second speech segments are searched bidirectionally: the first speech segment from back to front, the second from front to back; and a candidate sampling point found by the bidirectional search satisfies:
Condition 1: the absolute difference between the slopes of the two segments at their candidate sampling points is less than a set threshold, i.e. abs(ka - kb) < Tk;
Condition 2: the absolute difference between the sample values at the candidate sampling points is less than the product of the adjustable parameter ratio and the absolute slope of the first segment at its candidate point, i.e. abs(Sa - Sb) < ratio*abs(ka).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410734257.XA CN104517605B (en) | 2014-12-04 | 2014-12-04 | A kind of sound bite splicing system and method for phonetic synthesis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410734257.XA CN104517605B (en) | 2014-12-04 | 2014-12-04 | A kind of sound bite splicing system and method for phonetic synthesis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104517605A CN104517605A (en) | 2015-04-15 |
CN104517605B true CN104517605B (en) | 2017-11-28 |
Family
ID=52792811
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410734257.XA Active CN104517605B (en) | 2014-12-04 | 2014-12-04 | A kind of sound bite splicing system and method for phonetic synthesis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104517605B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105679306B (en) * | 2016-02-19 | 2019-07-09 | 云知声(上海)智能科技有限公司 | The method and system of fundamental frequency frame are predicted in speech synthesis |
CN108831424B (en) * | 2018-06-15 | 2021-01-08 | 广州酷狗计算机科技有限公司 | Audio splicing method and device and storage medium |
CN109389969B (en) * | 2018-10-29 | 2020-05-26 | 百度在线网络技术(北京)有限公司 | Corpus optimization method and apparatus |
CN109979440B (en) * | 2019-03-13 | 2021-05-11 | 广州市网星信息技术有限公司 | Keyword sample determination method, voice recognition method, device, equipment and medium |
CN112562635B (en) * | 2020-12-03 | 2024-04-09 | 云知声智能科技股份有限公司 | Method, device and system for solving generation of pulse signals at splicing position in speech synthesis |
CN112863530A (en) * | 2021-01-07 | 2021-05-28 | 广州欢城文化传媒有限公司 | Method and device for generating sound works |
CN112971778A (en) * | 2021-02-09 | 2021-06-18 | 北京师范大学 | Brain function imaging signal obtaining method and device and electronic equipment |
CN113421547B (en) * | 2021-06-03 | 2023-03-17 | 华为技术有限公司 | Voice processing method and related equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1333501A (en) * | 2001-07-20 | 2002-01-30 | Beijing Jietong Huasheng Speech Technology Co., Ltd. | Dynamic Chinese speech synthesizing method |
CN1540624A (en) * | 2003-04-25 | 2004-10-27 | Alcatel | Method of generating speech according to text |
CN1731510A (en) * | 2004-08-05 | 2006-02-08 | Motorola Inc. | Text-speech conversion for amalgamated language |
JP2008191334A (en) * | 2007-02-02 | 2008-08-21 | Oki Electric Ind Co Ltd | Speech synthesis method, speech synthesis program, speech synthesis device and speech synthesis system |
JP2008299266A (en) * | 2007-06-04 | 2008-12-11 | Mitsubishi Electric Corp | Speech synthesis device and method |
2014-12-04: Application CN201410734257.XA filed in China; granted as patent CN104517605B, legal status Active.
Non-Patent Citations (1)
Title |
---|
"Chinese Speech Synthesis Based on Prosody Matching Cost and Prosody Concatenation Cost"; Zhang Peng et al.; Journal of Harbin Institute of Technology; 2006-11-30; Vol. 38, No. 11; Section 1, Paragraph 1; Section 4, Paragraph 2 * |
Also Published As
Publication number | Publication date |
---|---|
CN104517605A (en) | 2015-04-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104517605B (en) | A kind of sound bite splicing system and method for phonetic synthesis | |
CN102779508B (en) | Speech library generation apparatus and method, and speech synthesis system and method thereof | |
CN104780388B (en) | The cutting method and device of a kind of video data | |
US20180349495A1 (en) | Audio data processing method and apparatus, and computer storage medium | |
US8890869B2 (en) | Colorization of audio segments | |
CN101178896B (en) | Unit selection voice synthetic method based on acoustics statistical model | |
CN110213670A (en) | Method for processing video frequency, device, electronic equipment and storage medium | |
CN109147758A (en) | A kind of speaker's sound converting method and device | |
CN106157951B (en) | Carry out the automatic method for splitting and system of audio punctuate | |
JP4220449B2 (en) | Indexing device, indexing method, and indexing program | |
CN106021496A (en) | Video search method and video search device | |
CN103700370A (en) | Broadcast television voice recognition method and system | |
CN102723078A (en) | Emotion speech recognition method based on natural language comprehension | |
CN101930747A (en) | Method and device for converting voice into mouth shape image | |
CN110096966A (en) | A kind of audio recognition method merging the multi-modal corpus of depth information Chinese | |
CN109979428B (en) | Audio generation method and device, storage medium and electronic equipment | |
CN106302987A (en) | A kind of audio frequency recommends method and apparatus | |
CN106297765B (en) | Phoneme synthesizing method and system | |
CN108172211B (en) | Adjustable waveform splicing system and method | |
CN103915093A (en) | Method and device for realizing voice singing | |
CN101867742A (en) | Television system based on sound control | |
CN110277087A (en) | A kind of broadcast singal anticipation preprocess method | |
US9666211B2 (en) | Information processing apparatus, information processing method, display control apparatus, and display control method | |
CN107507627B (en) | Voice data heat analysis method and system | |
Felipe et al. | Acoustic scene classification using spectrograms |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP01 | Change in the name or title of a patent holder |
Address after: Floor 5, Block A, Peony Technology Building, No. 2 Huayuan Road, Haidian District, Beijing 100191
Patentee after: Yunzhisheng Intelligent Technology Co., Ltd.
Address before: Floor 5, Block A, Peony Technology Building, No. 2 Huayuan Road, Haidian District, Beijing 100191
Patentee before: Beijing Yunzhisheng Information Technology Co., Ltd.