CN107437412A

CN107437412A - Acoustic model processing method, speech synthesis method, apparatus, and related device

Info

Publication number: CN107437412A
Application number: CN201610353978.5A
Authority: CN
Inventors: 宋阳; 陈伟
Original assignee: Beijing Sogou Technology Development Co Ltd
Current assignee: Beijing Sogou Technology Development Co Ltd
Priority date: 2016-05-25
Filing date: 2016-05-25
Publication date: 2017-12-05
Anticipated expiration: 2036-05-25
Also published as: CN107437412B

Abstract

The present invention relates to the field of data processing and discloses an acoustic model processing method, a speech synthesis method, an apparatus, and a related device, which address the technical problem in the prior art that synthesized speech has poor sound quality. The method includes: obtaining preset parameters of a spectral model in a speech model; converting the preset parameters of the spectral model into an amplitude spectrum; performing adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum; and converting the processed amplitude spectrum into the preset parameters of the spectral model, thereby obtaining the processed spectral model. This achieves the technical effect of improving speech synthesis quality.

Description

Acoustic model processing method, speech synthesis method, apparatus, and related device

Technical field

The present invention relates to the field of data processing, and in particular to an acoustic model processing method, a speech synthesis method, an apparatus, and a related device.

Background art

For offline speech synthesis systems, the mainstream approach at present is parametric speech synthesis based on the HMM (Hidden Markov Model). Synthesizing speech with a speech model first requires training the speech model. Referring to Fig. 1, establishing a speech model comprises the following steps:

Step S101: Obtain a corpus;

Step S102: Extract acoustic parameters from the speech material in the corpus;

Step S103: Perform context-dependent HMM-GMM modeling on the acoustic parameters in the corpus and the corresponding prosodic text to obtain the speech model, where the modeling objects include the spectrum, the fundamental frequency, and the duration.

After the speech model is established, referring to Fig. 2, speech can be synthesized as follows:

Step S201: Obtain the text to be synthesized;

Step S202: Parse the text to be synthesized to obtain its context information;

Step S203: Perform model prediction on the context information with the speech model to obtain the corresponding acoustic parameters, which include the spectrum and fundamental frequency information;

Step S204: Synthesize speech from the acoustic parameters with a vocoder.

Speech synthesized by this scheme suffers from poor sound quality, which degrades the user experience.

Summary of the invention

The present invention provides an acoustic model processing method, a speech synthesis method, an apparatus, and a related device, to solve the technical problem in the prior art that synthesized speech has poor sound quality.

In a first aspect, an embodiment of the present invention provides an acoustic model processing method, including:

Obtaining preset parameters of a spectral model in a speech model;

Converting the preset parameters of the spectral model into an amplitude spectrum;

Performing adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;

Converting the processed amplitude spectrum into the preset parameters of the spectral model, thereby obtaining the processed spectral model.

Optionally, converting the preset parameters of the spectral model into an amplitude spectrum includes:

Converting the static parameters of the mean part of the spectral model into the amplitude spectrum.

Optionally, performing adaptive post-processing on the amplitude spectrum to obtain the processed amplitude spectrum includes: performing adaptive post-processing on the amplitude spectrum by the following formula:

where S_new(z) denotes the amplitude spectrum after processing;

S_ori(z) denotes the amplitude spectrum before processing;

S_ori(z/β) denotes the amplitude spectrum obtained by scaling S_ori(z) on the z-plane to β times its original scale;

S_ori(z/α) denotes the amplitude spectrum obtained by scaling S_ori(z) on the z-plane to α times its original scale.

Optionally, performing adaptive post-processing on the amplitude spectrum to obtain the processed amplitude spectrum further includes:

For each amplitude spectrum before processing, determining whether the processed amplitude spectrum obtained by calculation lies within the range between a preset maximum value and a preset minimum value;

When the processed amplitude spectrum is less than the preset minimum value, using the preset minimum value as the processed amplitude spectrum;

When the processed amplitude spectrum is greater than the preset maximum value, using the preset maximum value as the processed amplitude spectrum.

Optionally, after performing adaptive post-processing on the amplitude spectrum to obtain the processed amplitude spectrum, the method further includes:

Performing spectral energy normalization on the processed amplitude spectrum;

Converting the processed amplitude spectrum into the preset parameters of the spectral model then includes:

Converting the amplitude spectrum that has undergone spectral energy normalization into the preset parameters of the spectral model.

Optionally, the method further includes:

Obtaining text to be synthesized for speech synthesis;

Determining the acoustic parameters of the text to be synthesized based on the speech model;

Synthesizing the speech data of the text to be synthesized from the acoustic parameters.

In a second aspect, an embodiment of the present invention provides a speech synthesis method, including:

Obtaining text to be synthesized for speech synthesis;

Determining the spectral parameters of the text to be synthesized based on the spectral model in a speech model, where the spectral model is a spectral model that has undergone adaptive post-processing, and the adaptive post-processing comprises the following steps: converting the preset parameters of the spectral model into an amplitude spectrum, performing adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum, and converting the processed amplitude spectrum into the preset parameters of the spectral model;

Synthesizing the speech data of the text to be synthesized from the spectral parameters.

Optionally, before determining the spectral parameters of the text to be synthesized based on the spectral model in the speech model, the method further includes:

Performing the adaptive post-processing of the spectral model locally on a client device; and/or

Receiving the adaptively post-processed spectral model from a server.

In a third aspect, an embodiment of the present invention provides an acoustic model processing apparatus, including:

An acquisition module, configured to obtain preset parameters of a spectral model in a speech model;

A first conversion module, configured to convert the preset parameters of the spectral model into an amplitude spectrum;

A first obtaining module, configured to perform adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;

A second conversion module, configured to convert the processed amplitude spectrum into the preset parameters of the spectral model, thereby obtaining the processed spectral model.

In a fourth aspect, an embodiment of the present invention provides a speech synthesis apparatus, including:

A third obtaining module, configured to obtain text to be synthesized for speech synthesis;

A second determining module, configured to determine the spectral parameters of the text to be synthesized based on the spectral model in a speech model, where the spectral model is a spectral model that has undergone adaptive post-processing, and the adaptive post-processing comprises the following steps: converting the preset parameters of the spectral model into an amplitude spectrum, performing adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum, and converting the processed amplitude spectrum into the preset parameters of the spectral model;

A second synthesis module, configured to synthesize the speech data of the text to be synthesized from the spectral parameters.

In a fifth aspect, an embodiment of the present invention provides a processing device including a memory and one or more programs, where the one or more programs are stored in the memory and are configured to be executed by one or more processors, and the one or more programs contain instructions for performing the following operations:

Obtaining preset parameters of a spectral model in a speech model;

In a sixth aspect, an embodiment of the present invention provides a processing device including a memory and one or more programs, where the one or more programs are stored in the memory and are configured to be executed by one or more processors, and the one or more programs contain instructions for performing the following operations:

Obtaining text to be synthesized for speech synthesis;

The beneficial effects of the present invention are as follows:

In the embodiments of the present invention, the speech model is processed as follows: the preset parameters of the spectral model in the speech model are obtained; the preset parameters of the spectral model are then converted into an amplitude spectrum; adaptive post-processing is performed on the amplitude spectrum to obtain a processed amplitude spectrum; and the processed amplitude spectrum is converted into the preset parameters of the spectral model, thereby obtaining the processed spectral model. Because adaptive post-processing is applied to the preset parameters in the spectral model, the desired signal in the spectral model is enhanced and the interfering signal is reduced, so the quality of the synthesized speech improves when speech data is subsequently generated with the speech model.

In addition, the object of the adaptive post-processing in this scheme is the amplitude spectrum. The amplitude spectrum is a general spectral representation into which all kinds of spectral parameters can be converted, so the scheme applies to any spectral parameter and does not need a different adaptive post-processing method for each kind of spectral parameter (such as line spectrum pairs or mel-cepstrum). The scheme is therefore more broadly compatible as an adaptive post-processing method for spectral parameters.

Furthermore, the scheme performs adaptive post-processing on the spectral model in the speech model in advance, rather than on the acoustic parameters after they are subsequently generated, which reduces the time consumed when synthesizing speech data with the speech model.

Brief description of the drawings

Fig. 1 is a flowchart of establishing a speech model in the prior art;

Fig. 2 is a flowchart of synthesizing speech data in the prior art;

Fig. 3 is a flowchart of the acoustic model processing method according to the first aspect of the embodiments of the present invention;

Fig. 4 is a flowchart of synthesizing speech data in the acoustic model processing method according to the first aspect of the embodiments of the present invention;

Fig. 5 is a flowchart of the speech synthesis method according to the second aspect of the embodiments of the present invention;

Fig. 6 is a structural diagram of the acoustic model processing apparatus according to the third aspect of the embodiments of the present invention;

Fig. 7 is a structural diagram of the speech synthesis apparatus according to the fourth aspect of the embodiments of the present invention;

Fig. 8 is a block diagram of an electronic device according to an exemplary embodiment;

Fig. 9 is a schematic structural diagram of a server in an embodiment of the present invention.

Detailed description

To solve the above technical problem, the general idea of the technical solutions in the embodiments of the present application is as follows:

To make the above technical solutions easier to understand, the technical solutions of the present invention are described in detail below with reference to the drawings and specific embodiments. It should be understood that the embodiments of the present invention and the specific features therein are detailed explanations of the technical solutions of the present invention rather than limitations on them, and that, where no conflict arises, the embodiments and the technical features therein may be combined with one another.

The speech model is processed as follows: the preset parameters of the spectral model in the speech model are obtained; the preset parameters of the spectral model are then converted into an amplitude spectrum; adaptive post-processing is performed on the amplitude spectrum to obtain a processed amplitude spectrum; and the processed amplitude spectrum is converted into the preset parameters of the spectral model, thereby obtaining the processed spectral model. Because adaptive post-processing is applied to the preset parameters in the spectral model, the desired signal in the spectral model is enhanced and the interfering signal is reduced, so the quality of the synthesized speech improves when speech data is subsequently generated with the speech model.

Although preferred embodiments of the present invention have been described, those skilled in the art can make other changes and modifications to these embodiments once they grasp the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.

Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

In a first aspect, an embodiment of the present invention provides an acoustic model processing method. Referring to Fig. 3, the method includes:

Step S301: Obtain the preset parameters of the spectral model in a speech model;

Step S302: Convert the preset parameters of the spectral model into an amplitude spectrum;

Step S303: Perform adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;

Step S304: Convert the processed amplitude spectrum into the preset parameters of the spectral model, thereby obtaining the processed spectral model.
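The control flow of steps S301 to S304 can be sketched as a small driver that applies the three transformations to each preset-parameter vector in turn. This is only an illustrative sketch: the function name `post_process_spectral_model` and its three callable arguments are hypothetical placeholders, and the actual conversions and post-filter are the ones described in the steps themselves.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def post_process_spectral_model(
    preset_params: Sequence[Vector],
    to_amplitude: Callable[[Vector], Vector],
    adaptive_post_process: Callable[[Vector], Vector],
    from_amplitude: Callable[[Vector], Vector],
) -> List[Vector]:
    """Apply steps S302 to S304 to every preset-parameter vector from S301.

    All three callables are hypothetical stand-ins supplied by the caller.
    """
    processed = []
    for params in preset_params:                     # S301: one vector at a time
        amplitude = to_amplitude(list(params))       # S302: parameters -> amplitude spectrum
        enhanced = adaptive_post_process(amplitude)  # S303: adaptive post-processing
        processed.append(from_amplitude(enhanced))   # S304: amplitude spectrum -> parameters
    return processed
```

Passing the conversions in as callables reflects the point made in the summary that the post-processing itself operates on the amplitude spectrum and does not depend on the kind of spectral parameter.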

For example, this scheme can be applied to a server or to a client device; the embodiments of the present invention impose no limitation. The client device is, for example, a mobile phone, a notebook computer, a tablet computer, or a PC; the embodiments of the present invention impose no limitation.

In step S301, the speech model includes, for example, a spectral model, a fundamental frequency model, and a duration model. The spectral model generally includes a probability density part and a decision tree part. The probability density part includes a mean and a variance, each of which includes static parameters and dynamic parameters. The preset parameters of the spectral model are, for example, the static parameters; they may of course also include the dynamic parameters, and the embodiments of the present invention impose no limitation.
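To make the layout described for step S301 concrete, the following sketch models a spectral model as a list of Gaussian states plus an opaque decision tree, and extracts the static mean parameters as the preset parameters. All class and field names (`GaussianStream`, `SpectralState`, `SpectralModel`, `static_means`) are hypothetical and chosen only for illustration; the source does not specify a storage format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GaussianStream:
    """Static and dynamic parts of a mean or variance vector."""
    static: List[float]
    dynamic: List[float] = field(default_factory=list)


@dataclass
class SpectralState:
    """One entry of the probability density part: a mean and a variance."""
    mean: GaussianStream
    variance: GaussianStream


@dataclass
class SpectralModel:
    states: List[SpectralState]   # probability density part
    decision_tree: object = None  # decision tree part, left opaque in this sketch

    def static_means(self) -> List[List[float]]:
        """Return the static mean parameters, one vector per state."""
        return [list(state.mean.static) for state in self.states]
```

With M states and N-dimensional static means, `static_means()` returns an M-by-N list of lists.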

In step S302, the preset parameters of the spectral model can be converted into an amplitude spectrum as follows:

When the preset parameters of the spectral model are line spectrum pair parameters, assume their form is K, l(1), l(2), … l(V). When V is even, the amplitude spectrum S(ω) is:

When V is odd, the amplitude spectrum S(ω) is:

When the preset parameters of the spectral model are mel-cepstrum parameters, assume their form is c_a(0), c_a(1), … c_a(V), where a is known; a is typically set to 0.42 when the spectral model is derived from audio sampled at 16 kHz. First, the cepstrum is solved according to the following equation, where v denotes the dimension currently being processed:

A Fourier transform is then applied, followed by the exponential function with the natural constant e as its base, to obtain the amplitude spectrum.
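The last stage of the mel-cepstrum case (Fourier transform followed by the exponential) can be sketched as below. The sketch assumes the frequency-warping step has already produced an ordinary cepstrum, and it assumes the minimum-phase convention ln S(ω) = Σ c(v)·cos(ωv); both the convention and the function name `cepstrum_to_amplitude` are assumptions made for illustration, and the warping equation itself is not shown.

```python
import math
from typing import List, Sequence


def cepstrum_to_amplitude(cepstrum: Sequence[float], num_bins: int) -> List[float]:
    """Evaluate an amplitude spectrum on num_bins points from 0 to pi inclusive.

    Assumed convention: ln S(w) = sum over v of cepstrum[v] * cos(w * v).
    """
    if num_bins < 2:
        raise ValueError("num_bins must be at least 2")
    spectrum = []
    for k in range(num_bins):
        omega = math.pi * k / (num_bins - 1)
        # Cosine-series evaluation of the log amplitude (the Fourier transform step).
        log_amp = sum(c * math.cos(omega * v) for v, c in enumerate(cepstrum))
        # Exponential with base e turns the log amplitude into an amplitude.
        spectrum.append(math.exp(log_amp))
    return spectrum
```

A cepstrum whose only non-zero coefficient is c(0) produces a flat spectrum equal to e raised to c(0), which is a convenient sanity check.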

Suppose the spectral model contains M mean sequences and the static parameters of each mean sequence are N-dimensional; that is, the data to be processed for the preset parameters of the spectral model form an M*N matrix. After these are converted into Y-dimensional amplitude spectra by the above scheme, an M*Y matrix is obtained. In the subsequent steps S302 to S304, only one Y-dimensional amplitude spectrum is processed at a time, and the operation is performed M times in total.

In step S303, adaptive post-processing can be performed on the Y-dimensional amplitude spectrum by the following formula:

where S_new(z) denotes the amplitude spectrum after processing;

S_ori(z) denotes the amplitude spectrum before processing;

In general, α and β can be set empirically. The larger the value of β-α, the more pronounced the sound quality enhancement of the synthesized speech; however, an excessively large β-α may make the synthesis unstable, for example by distorting the synthesized speech.

In a specific implementation, after adaptive post-processing is performed as described above, the range over which the amplitude spectrum may change can be limited in order to keep the synthesis stable. Accordingly, performing adaptive post-processing on the amplitude spectrum to obtain the processed amplitude spectrum further includes: for each amplitude spectrum before processing, determining whether the processed amplitude spectrum obtained by calculation lies within the range between a preset maximum value and a preset minimum value; when the processed amplitude spectrum is less than the preset minimum value, using the preset minimum value as the processed amplitude spectrum; and when the processed amplitude spectrum is greater than the preset maximum value, using the preset maximum value as the processed amplitude spectrum.

For example, the preset maximum value can be a fixed value or a preset ratio of S_ori(z); likewise, the preset minimum value can be a fixed value or a preset ratio of S_ori(z). The embodiments of the present invention impose no limitation.

If the preset maximum value and the preset minimum value are both preset ratios of S_ori(z), the range of change of the amplitude spectrum can be limited by the following equation:

Assume that the value of the y-th dimension of S_ori(z) is s_ori and the value of the y-th dimension of S_new(z) is s_new, where 1≤y≤Y. Then:

Here mindata and maxdata can be set empirically. In general, the larger the value of maxdata-mindata, the more pronounced the sound quality enhancement of the synthesized speech, but an excessively large value may make the synthesis unstable. The value of maxdata-mindata can, for example, lie in the range 7 to 10, such as 8, 9, or 10; in this case the synthesis remains stable while the sound quality of the synthesized speech is still enhanced well.

If the preset maximum value and the preset minimum value are both fixed values, the range of change of the amplitude spectrum can be limited by the following formula:

Assume that the value of the y-th dimension of S_new(z) is s_new, where 1≤y≤Y. Then:

Likewise, mindata and maxdata can be set empirically. In general, the larger the value of maxdata-mindata, the more pronounced the sound quality enhancement of the synthesized speech, but an excessively large value may make the synthesis unstable. The value of maxdata-mindata can, for example, lie in the range 7 to 10, such as 8, 9, or 10; in this case the synthesis remains stable while the sound quality of the synthesized speech is still enhanced well.
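The range limiting described above amounts to a per-dimension clamp, which can be sketched as follows. The fixed-value case follows directly from the text. For the ratio case, the sketch assumes the bounds are mindata·s_ori and maxdata·s_ori for each dimension; that reading, the function name `limit_amplitude`, and its parameter names are assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def limit_amplitude(
    s_new: Sequence[float],
    mindata: float,
    maxdata: float,
    s_ori: Optional[Sequence[float]] = None,
) -> List[float]:
    """Clamp each processed amplitude to [lower, upper].

    Without s_ori the bounds are the fixed values mindata and maxdata.
    With s_ori the bounds are assumed to be ratios of the original
    amplitude in the same dimension: mindata * s_ori[y], maxdata * s_ori[y].
    """
    if mindata > maxdata:
        raise ValueError("mindata must not exceed maxdata")
    if s_ori is not None and len(s_ori) != len(s_new):
        raise ValueError("s_ori and s_new must have the same length")
    limited = []
    for y, value in enumerate(s_new):
        scale = 1.0 if s_ori is None else s_ori[y]
        lower, upper = mindata * scale, maxdata * scale
        # Below the minimum -> take the minimum; above the maximum -> take the maximum.
        limited.append(min(max(value, lower), upper))
    return limited
```

For instance, `limit_amplitude([0.5, 5.0, 20.0], 1.0, 10.0)` leaves the middle value untouched and pulls the outer two onto the bounds.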

As an optional embodiment, to keep the synthesis stable, the spectral energy before and after adaptive post-processing also needs to be kept consistent. That is, after performing adaptive post-processing on the amplitude spectrum to obtain the processed amplitude spectrum, the method further includes: performing spectral energy normalization on the processed amplitude spectrum.

The spectral energy before and after adaptive post-processing can be kept consistent by the following formula:

where S'_new(z) denotes the amplitude spectrum after spectral energy normalization;

S_new(z) denotes the amplitude spectrum before spectral energy normalization;

S_ori(z) denotes the amplitude spectrum before adaptive processing.
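One way to keep the spectral energy consistent is to rescale the processed amplitude spectrum by a single gain. The sketch below assumes that energy means the sum of squared amplitudes over the Y dimensions; that definition and the function name `match_spectral_energy` are assumptions for illustration and are not taken from the source formula.

```python
import math
from typing import List, Sequence


def match_spectral_energy(s_new: Sequence[float], s_ori: Sequence[float]) -> List[float]:
    """Scale s_new so that its energy equals the energy of s_ori.

    Energy is assumed here to be the sum of squared amplitudes.
    """
    energy_ori = sum(x * x for x in s_ori)
    energy_new = sum(x * x for x in s_new)
    if energy_new == 0.0:
        # Nothing to scale; return an unchanged copy.
        return list(s_new)
    gain = math.sqrt(energy_ori / energy_new)
    return [gain * x for x in s_new]
```

Because one gain is applied to every dimension, the shape produced by the post-processing is preserved and only the overall level changes.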

In step S304, the amplitude spectrum can be converted into the preset parameters of the spectral model as follows:

When the preset parameters of the spectral model are line spectrum pair parameters, first take the logarithm with base e of the amplitude spectrum, then obtain the cepstrum parameters c₀(v) by inverse Fourier transform, and then solve the generalized cepstrum parameters c_-1(v) according to the following recursion, where v denotes the dimension currently being processed:

LPC parameters are then obtained by gain normalization. A z-transform is applied to them and the zeros on the unit circle are solved; the angular frequencies corresponding to the zeros are the line spectrum pair parameters.

When the preset parameters of the spectral model are mel-cepstrum parameters, first take the logarithm with base e of the amplitude spectrum, then obtain the cepstrum parameters by inverse Fourier transform, and assume their form is c₀(0), c₀(1), … c₀(V). Finally, the mel-cepstrum is solved according to the following equation, where a is known (a is typically set to 0.42 when the spectral model is derived from audio sampled at 16 kHz) and v denotes the dimension currently being processed:

If spectral energy normalization was not performed on the amplitude spectrum beforehand, then in step S304 the adaptively post-processed amplitude spectrum is converted directly into the preset parameters of the spectral model. If spectral energy normalization was performed on the amplitude spectrum beforehand, then in step S304 the amplitude spectrum that has undergone spectral energy normalization is converted into the preset parameters of the spectral model.
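The first stage shared by both cases of step S304 (natural logarithm followed by an inverse Fourier transform) can be sketched as below. The sketch assumes the amplitude spectrum is sampled on evenly spaced points from 0 to π inclusive and that the cepstrum follows the convention ln S(ω) = Σ c(v)·cos(ωv); the convention and the function name `amplitude_to_cepstrum` are assumptions, and the later recursions to generalized cepstrum or mel-cepstrum are not shown.

```python
import math
from typing import List, Sequence


def amplitude_to_cepstrum(amplitude: Sequence[float], order: int) -> List[float]:
    """Recover cepstrum coefficients 0..order from a half-band amplitude spectrum.

    amplitude holds K + 1 positive samples at w_k = pi * k / K, k = 0..K.
    Assumed convention: ln S(w) = sum over v of c[v] * cos(w * v).
    """
    k_max = len(amplitude) - 1
    if k_max < 1 or order >= k_max:
        raise ValueError("need more spectrum samples than cepstrum coefficients")
    log_amp = [math.log(a) for a in amplitude]  # logarithm with base e
    n = 2 * k_max                               # size of the mirrored full-circle grid
    cepstrum = []
    for v in range(order + 1):
        # Inverse cosine transform over the full circle, using even symmetry
        # so that interior bins are counted twice.
        total = log_amp[0] + log_amp[k_max] * math.cos(math.pi * v)
        for k in range(1, k_max):
            total += 2.0 * log_amp[k] * math.cos(math.pi * k * v / k_max)
        cepstrum.append(total / n if v == 0 else 2.0 * total / n)
    return cepstrum
```

The factor of two for v ≥ 1 comes from the chosen one-sided convention; a two-sided (symmetric) cepstrum convention would drop it.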

In a specific implementation, once the processed spectral model has been obtained in step S304, speech data can be synthesized with the speech model that contains this spectral model. Referring to Fig. 4, speech data can be synthesized by the following steps:

Step S401: Obtain the text to be synthesized for speech synthesis;

Step S402: Determine the acoustic parameters of the text to be synthesized based on the speech model;

Step S403: Synthesize the speech data of the text to be synthesized from the acoustic parameters.

In step S401, the text to be synthesized is, for example, text input by a user, text corresponding to a prompt tone generated by the client device, or e-book text. It can of course be text in any other form; the embodiments of the present invention neither enumerate these in detail nor impose any limitation.

In step S402, the text to be synthesized can first be parsed to obtain its context information. Model prediction is then performed on the context information with the speech model to obtain the corresponding acoustic parameters, which include the spectrum, fundamental frequency information, duration, and so on.

In step S403, the acoustic parameters determined in step S402 can be synthesized by a vocoder to obtain the corresponding speech data. After the speech data is synthesized, it can be output in various ways, for example by outputting it through the voice output device of the client device, or by sending it to another client device so that the other client device outputs it.

In a second aspect, based on the same inventive concept, an embodiment of the present invention provides a speech synthesis method. Referring to Fig. 5, the method includes:

Step S501: Obtain the text to be synthesized for speech synthesis;

Step S502: Determine the spectral parameters of the text to be synthesized based on the spectral model in the speech model, where the spectral model is a spectral model that has undergone adaptive post-processing, and the adaptive post-processing comprises the following steps: converting the preset parameters of the spectral model into an amplitude spectrum, performing adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum, and converting the processed amplitude spectrum into the preset parameters of the spectral model;

Step S503: Synthesize the speech data of the text to be synthesized from the spectral parameters.

In step S501, the kinds of text that can serve as the text to be synthesized have been described above and are not repeated here.

In step S502, how the adaptively post-processed spectral model is obtained has been described in the first aspect of the present invention and is not repeated here. The adaptively post-processed spectral model can be obtained in several ways. Two of them are introduced below; a specific implementation is of course not limited to these two cases.

First, the adaptive post-processing of the spectral model is performed locally on the client device.

Second, the adaptively post-processed spectral model is received from a server.

In step S503, other parameters of the speech data can also be obtained from the other models contained in the speech model. For example, the fundamental frequency parameters of the text to be synthesized are obtained from the fundamental frequency model, and the duration parameters of the text to be synthesized are obtained from the duration model. The speech data of the text to be synthesized is then synthesized jointly from acoustic parameters such as the fundamental frequency parameters, the duration parameters, and the spectral parameters.

How the speech data of the text to be synthesized is synthesized from the acoustic parameters has been described above and is not repeated here.

As the above analysis shows, in the embodiments of the present invention the preset parameters of the spectral model (for example, the static parameters of the mean part) are first converted into an amplitude spectrum, and adaptive post-processing is then performed on the amplitude spectrum. To keep the synthesis stable, the range of change of the amplitude spectrum is limited and the amplitude spectrum energy is adjusted to match the amplitude spectrum energy before processing. Finally, the processed amplitude spectrum is converted into the preset parameters of the spectral model, while the other parts of the spectral model remain unchanged. Because adaptive post-processing is applied to the preset parameters in the spectral model, the desired signal in the spectral model is enhanced and the interfering signal is reduced, so the quality of the synthesized speech data improves when speech data is synthesized based on the spectral model.

In a third aspect, based on the same inventive concept, an embodiment of the present invention provides an acoustic model processing apparatus. Referring to Fig. 6, the apparatus includes:

An acquisition module 60, configured to obtain the preset parameters of the spectral model in a speech model;

A first conversion module 61, configured to convert the preset parameters of the spectral model into an amplitude spectrum;

A first obtaining module 62, configured to perform adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;

A second conversion module 63, configured to convert the processed amplitude spectrum into the preset parameters of the spectral model, thereby obtaining the processed spectral model.

Optionally, the first conversion module 61 is configured to:

Optionally, the first obtaining module 62 is configured to perform adaptive post-processing on the amplitude spectrum by the following formula:

where S_new(z) denotes the amplitude spectrum after processing;

S_ori(z) denotes the amplitude spectrum before processing;

Optionally, the first obtaining module 62 includes:

A judging unit, configured to determine, for each amplitude spectrum before processing, whether the processed amplitude spectrum obtained by calculation lies within the range between a preset maximum value and a preset minimum value;

A first determining unit, configured to use the preset minimum value as the processed amplitude spectrum when the processed amplitude spectrum is less than the preset minimum value;

A second determining unit, configured to use the preset maximum value as the processed amplitude spectrum when the processed amplitude spectrum is greater than the preset maximum value.

Optionally, the apparatus further includes:

A first processing module, configured to perform spectral energy normalization on the processed amplitude spectrum;

The second conversion module is configured to convert the amplitude spectrum that has undergone spectral energy normalization into the preset parameters of the spectral model.

Optionally, the apparatus further includes:

A second obtaining module, configured to obtain the text to be synthesized for speech synthesis;

A first determining module, configured to determine the acoustic parameters of the text to be synthesized based on the speech model;

A first synthesis module, configured to synthesize the speech data of the text to be synthesized from the acoustic parameters.

The acoustic model processing apparatus introduced in the third aspect of the present invention is the apparatus used to implement the acoustic model processing method introduced in the first aspect of the embodiments of the present invention. Based on that method, those skilled in the art can understand the specific structure and variations of the apparatus, so details are not repeated here. Any apparatus used to implement the acoustic model processing method introduced in the first aspect of the embodiments of the present invention falls within the scope of protection sought by the embodiments of the present invention.

In a fourth aspect, based on the same inventive concept, an embodiment of the present invention provides a speech synthesis device. Referring to Fig. 7, the device includes:

a third obtaining module 70, configured to obtain the text to be synthesized on which speech synthesis is to be performed;

a second determining module 71, configured to determine the spectral parameters of the text to be synthesized based on the spectral model in the speech model, where the spectral model is a spectral model that has undergone adaptive post-processing, and the adaptive post-processing includes the following steps: converting the preset parameters of the spectral model into an amplitude spectrum; performing adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum; and converting the processed amplitude spectrum into the preset parameters of the spectral model;

a second synthesis module 72, configured to synthesize the speech data of the text to be synthesized from the spectral parameters.

Optionally, the device further includes:

a second processing module, configured to perform the adaptive post-processing of the spectral model locally on the client device; and/or

a receiving module, configured to receive the adaptively post-processed spectral model from a server.

The speech synthesis device described in the fourth aspect of the present invention is the device used to implement the speech synthesis method described in the second aspect of the embodiments of the present invention. Based on that method, a person skilled in the art can understand the specific structure and variations of the device, so they are not described again here. Any device used to implement the speech synthesis method described in the second aspect of the embodiments of the present invention falls within the scope of protection sought by the embodiments of the present invention.

In a fifth aspect, based on the same inventive concept, an embodiment of the present invention provides a processing device, which may be an electronic device or a server. The processing device includes a memory and one or more programs, where the one or more programs are stored in the memory and are configured to be executed by one or more processors, and the one or more programs contain instructions for performing the following operations:

obtaining the preset parameters of the spectral model in the speech model;

The electronic device described in the fifth aspect of the present invention is the electronic device used to implement the acoustic model processing method described in the first aspect of the embodiments of the present invention. Based on that method, a person skilled in the art can understand the specific structure and variations of the electronic device, so they are not described again here. Any electronic device used to implement the acoustic model processing method described in the first aspect of the embodiments of the present invention falls within the scope of protection sought by the embodiments of the present invention.

In a sixth aspect, based on the same inventive concept, an embodiment of the present invention provides a processing device, which may be an electronic device or a server. The processing device includes a memory and one or more programs, where the one or more programs are stored in the memory and are configured to be executed by one or more processors, and the one or more programs contain instructions for performing the following operations:

obtaining the text to be synthesized on which speech synthesis is to be performed;

The electronic device described in the sixth aspect of the present invention is the electronic device used to implement the speech synthesis method described in the second aspect of the embodiments of the present invention. Based on that method, a person skilled in the art can understand the specific structure and variations of the electronic device, so they are not described again here. Any electronic device used to implement the speech synthesis method described in the second aspect of the embodiments of the present invention falls within the scope of protection sought by the embodiments of the present invention.

Fig. 8 is a block diagram of an electronic device 800 that implements the acoustic model processing method (or the speech synthesis method) according to an exemplary embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, or the like.

Referring to Fig. 8, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.

The processing component 802 generally controls the overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communication, camera operation and recording. The processing component 802 may include one or more processors 820 to execute instructions so as to complete all or some of the steps of the method described above. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operation of the device 800. Examples of such data include instructions for any application program or method operating on the electronic device 800, contact data, phone book data, messages, pictures, videos, and so on. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc.

The power component 806 supplies power to the various components of the electronic device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the electronic device 800.

The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or may have focusing and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 800 is in an operating mode, such as a call mode, a recording mode or a speech recognition mode. The received audio signal may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 also includes a loudspeaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons and the like. These buttons may include, but are not limited to, a home button, a volume button, a start button and a lock button.

The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor component 814 may detect the open/closed state of the device 800 and the relative positioning of components, for example the display and keypad of the electronic device 800. The sensor component 814 may also detect a change in position of the electronic device 800 or of one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.

In an exemplary embodiment, the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the above method.

In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 804 including instructions, and the instructions can be executed by the processor 820 of the electronic device 800 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

A non-transitory computer-readable storage medium, where, when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform an acoustic model processing method, the method including:

obtaining the preset parameters of the spectral model in the speech model;

A non-transitory computer-readable storage medium, where, when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform a speech synthesis method, the method including:

obtaining the text to be synthesized on which speech synthesis is to be performed;

Fig. 9 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1900 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 1922 (for example, one or more processors), a memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) that store application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transient or persistent storage. The programs stored in the storage medium 1930 may include one or more modules (not marked in the figure), and each module may include a series of instruction operations for the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.

The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ and so on.

A non-transitory computer-readable storage medium, where, when the instructions in the storage medium are executed by a central processing unit of a server, the server is enabled to perform an acoustic model processing method, the method including:

obtaining the preset parameters of the spectral model in the speech model;

A non-transitory computer-readable storage medium, where, when the instructions in the storage medium are executed by a central processing unit of a server, the server is enabled to perform a speech synthesis method, the method including:

obtaining the text to be synthesized on which speech synthesis is to be performed;

One or more embodiments of the present invention have at least the following beneficial effects:

In the embodiments of the present invention, the speech model is processed as follows: the preset parameters of the spectral model in the speech model are obtained; the preset parameters of the spectral model are converted into an amplitude spectrum; adaptive post-processing is performed on the amplitude spectrum to obtain a processed amplitude spectrum; and the processed amplitude spectrum is converted into the preset parameters of the spectral model, thereby obtaining the processed spectral model. Because adaptive post-processing has been applied to the preset parameters of the spectral model, that is, the desired signal in the spectral model has been enhanced and the interfering signal reduced, the quality of the synthesized speech can be improved when speech data is subsequently generated by the speech model.

Moreover, the object of the adaptive post-processing in this scheme is the amplitude spectrum. The amplitude spectrum is a general-purpose spectral representation to which the various spectral parameters can all be converted, so the scheme is applicable to any spectral parameter, and there is no need to use a different adaptive post-processing method for each kind of spectral parameter (for example, line spectrum pairs or mel-cepstrum). The adaptive post-processing of this scheme is therefore more broadly compatible across spectral parameters.
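
The processing flow summarized above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiments: it assumes, purely for illustration, that the preset parameters are plain cepstral coefficients c[0..M] of a model with log|S(ω)| = Σ c[m]·cos(mω), and the names `cepstrum_to_amplitude`, `amplitude_to_cepstrum`, `process_spectral_model` and the FFT size `N_FFT` are hypothetical. The post-processing step is passed in as a function, so the same round trip applies whatever post-processing is chosen.

```python
import numpy as np

N_FFT = 512  # assumed FFT size; the source does not specify one


def cepstrum_to_amplitude(cep, n_fft=N_FFT):
    """Convert cepstral coefficients c[0..M] into an amplitude spectrum.

    Evaluates log|S(w)| = sum_m c[m] * cos(m * w) on n_fft // 2 + 1 bins.
    """
    log_amp = np.fft.rfft(np.asarray(cep, dtype=float), n_fft).real
    return np.exp(log_amp)


def amplitude_to_cepstrum(amp, order, n_fft=N_FFT):
    """Convert an amplitude spectrum back into coefficients c[0..order]."""
    v = np.fft.irfft(np.log(amp), n_fft)  # symmetric real cepstrum
    cep = v[: order + 1].copy()
    cep[1:] *= 2.0  # each c[m], m >= 1, is split evenly between v[m] and v[-m]
    return cep


def process_spectral_model(mean_params, postprocess, n_fft=N_FFT):
    """Parameters -> amplitude spectrum -> post-processing -> parameters."""
    amp = cepstrum_to_amplitude(mean_params, n_fft)
    amp_new = postprocess(amp)
    return amplitude_to_cepstrum(amp_new, len(mean_params) - 1, n_fft)
```

With an identity post-processing function the parameters are recovered unchanged, which is a convenient check that the two conversions are mutually consistent before a real post-filter is plugged in.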

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Claims

1. An acoustic model processing method, characterized by comprising:

obtaining the preset parameters of the spectral model in a speech model;

converting the preset parameters of the spectral model into an amplitude spectrum;

performing adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;

converting the processed amplitude spectrum into the preset parameters of the spectral model, thereby obtaining the processed spectral model.
2. The method according to claim 1, characterized in that converting the preset parameters of the spectral model into an amplitude spectrum comprises:

converting the static parameters of the mean part of the spectral model into the amplitude spectrum.
3. The method according to claim 1, characterized in that performing adaptive post-processing on the amplitude spectrum to obtain the processed amplitude spectrum comprises: performing adaptive post-processing on the amplitude spectrum by the following formula:

$$S_{new}(z) = \frac{S_{ori}(z/\beta)}{S_{ori}(z/\alpha)} \cdot S_{ori}(z), \quad 0 < \alpha < \beta < 1$$

where $S_{new}(z)$ denotes the amplitude spectrum after processing;

$S_{ori}(z)$ denotes the amplitude spectrum before processing;

$S_{ori}(z/\beta)$ denotes the amplitude spectrum obtained by rescaling $S_{ori}(z)$ on the z-plane to $\beta$ times its previous scale;

$S_{ori}(z/\alpha)$ denotes the amplitude spectrum obtained by rescaling $S_{ori}(z)$ on the z-plane to $\alpha$ times its previous scale.
4. The method according to claim 3, characterized in that performing adaptive post-processing on the amplitude spectrum to obtain the processed amplitude spectrum further comprises:

for each amplitude spectrum before processing, determining whether the processed amplitude spectrum obtained by calculation lies within the range between a preset minimum value and a preset maximum value;

when the processed amplitude spectrum is less than the preset minimum value, using the preset minimum value as the processed amplitude spectrum;

when the processed amplitude spectrum is greater than the preset maximum value, using the preset maximum value as the processed amplitude spectrum.
5. The method according to claim 1, characterized in that, after performing adaptive post-processing on the amplitude spectrum to obtain the processed amplitude spectrum, the method further comprises:

performing spectrum energy unification processing on the processed amplitude spectrum;

and converting the processed amplitude spectrum into the preset parameters of the spectral model comprises:

converting the amplitude spectrum that has undergone spectrum energy unification processing into the preset parameters of the spectral model.
6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:

obtaining the text to be synthesized on which speech synthesis is to be performed;

determining the acoustic parameters of the text to be synthesized based on the speech model;

synthesizing the speech data of the text to be synthesized from the acoustic parameters.
7. A speech synthesis method, characterized by comprising:

obtaining the text to be synthesized on which speech synthesis is to be performed;

determining the spectral parameters of the text to be synthesized based on the spectral model in a speech model, where the spectral model is a spectral model that has undergone adaptive post-processing, and the adaptive post-processing comprises the following steps: converting the preset parameters of the spectral model into an amplitude spectrum; performing adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum; and converting the processed amplitude spectrum into the preset parameters of the spectral model;

synthesizing the speech data of the text to be synthesized from the spectral parameters.
8. The method according to claim 7, characterized in that, before determining the spectral parameters of the text to be synthesized based on the spectral model in the speech model, the method further comprises:

performing the adaptive post-processing of the spectral model locally on the client device; and/or

receiving the adaptively post-processed spectral model from a server.
9. An acoustic model processing device, characterized by comprising:

an acquisition module, configured to obtain the preset parameters of the spectral model in a speech model;

a first conversion module, configured to convert the preset parameters of the spectral model into an amplitude spectrum;

a first obtaining module, configured to perform adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;

a second conversion module, configured to convert the processed amplitude spectrum into the preset parameters of the spectral model, thereby obtaining the processed spectral model.
10. A speech synthesis device, characterized by comprising:

a third obtaining module, configured to obtain the text to be synthesized on which speech synthesis is to be performed;

a second determining module, configured to determine the spectral parameters of the text to be synthesized based on the spectral model in a speech model, where the spectral model is a spectral model that has undergone adaptive post-processing, and the adaptive post-processing comprises the following steps: converting the preset parameters of the spectral model into an amplitude spectrum; performing adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum; and converting the processed amplitude spectrum into the preset parameters of the spectral model;

a second synthesis module, configured to synthesize the speech data of the text to be synthesized from the spectral parameters.
11. A processing device, characterized by comprising a memory and one or more programs, where the one or more programs are stored in the memory and are configured to be executed by one or more processors, and the one or more programs contain instructions for performing the following operations:

obtaining the preset parameters of the spectral model in a speech model;

converting the preset parameters of the spectral model into an amplitude spectrum;

performing adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;

converting the processed amplitude spectrum into the preset parameters of the spectral model, thereby obtaining the processed spectral model.
12. A processing device, characterized by comprising a memory and one or more programs, where the one or more programs are stored in the memory and are configured to be executed by one or more processors, and the one or more programs contain instructions for performing the following operations:

obtaining the text to be synthesized on which speech synthesis is to be performed;

determining the spectral parameters of the text to be synthesized based on the spectral model in a speech model, where the spectral model is a spectral model that has undergone adaptive post-processing, and the adaptive post-processing comprises the following steps: converting the preset parameters of the spectral model into an amplitude spectrum; performing adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum; and converting the processed amplitude spectrum into the preset parameters of the spectral model;

synthesizing the speech data of the text to be synthesized from the spectral parameters.
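
By way of illustration only, the formula recited in claim 3, the limiting recited in claim 4 and the energy processing recited in claim 5 can be sketched in Python as follows. The sketch rests on assumptions that are not stated in the claims: the spectral model is taken to be all-pole, S(z) = 1/A(z) with A(z) = 1 + a1·z⁻¹ + ..., so that S(z/γ) is obtained by replacing each a_k with a_k·γ^k; "spectrum energy unification" is taken to mean rescaling the processed spectrum so that its total energy equals that of the spectrum before processing; and the two steps are chained although claim 5 does not depend on claim 4. The function names, default values of α and β, and the FFT size are hypothetical.

```python
import numpy as np


def all_pole_amplitude(a, gamma=1.0, n_fft=512):
    """Amplitude spectrum |1 / A(z / gamma)| on the unit circle.

    a holds the polynomial coefficients [1, a1, ..., aP] of A(z).
    """
    a = np.asarray(a, dtype=float)
    scaled = a * gamma ** np.arange(len(a))  # a_k -> a_k * gamma**k
    return 1.0 / np.abs(np.fft.rfft(scaled, n_fft))


def adaptive_postprocess(a, alpha=0.5, beta=0.8, n_fft=512):
    """Return (S_ori, S_new) with S_new = S_ori(z/beta) / S_ori(z/alpha) * S_ori(z)."""
    if not 0.0 < alpha < beta < 1.0:
        raise ValueError("expected 0 < alpha < beta < 1")
    s_ori = all_pole_amplitude(a, 1.0, n_fft)
    s_beta = all_pole_amplitude(a, beta, n_fft)
    s_alpha = all_pole_amplitude(a, alpha, n_fft)
    return s_ori, s_beta / s_alpha * s_ori


def limit_and_unify_energy(s_ori, s_new, s_min, s_max):
    """Clamp S_new to [s_min, s_max], then match the energy of S_ori.

    Note: the final rescaling may move values slightly outside the bounds.
    """
    s_lim = np.clip(s_new, s_min, s_max)
    scale = np.sqrt(np.sum(s_ori ** 2) / np.sum(s_lim ** 2))
    return s_lim * scale
```

Under these assumptions, spectral peaks (where the ratio S(z/β)/S(z/α) exceeds 1) are raised and valleys are lowered, which is the usual behaviour of a formant-enhancing post-filter; the limiting and energy steps then keep the result within a bounded range and at the original overall level.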